Yorick Peterse
08d412da7e
First shot at removing the AST layer.
...
The AST layer is being removed because it doesn't really serve a useful
purpose. In particular when creating a streaming parser the AST nodes would
only introduce extra overhead.
As a result, the parser will emit a DOM tree directly instead of first emitting
an AST.
2014-04-21 23:05:39 +02:00
Yorick Peterse
d9fa4b7c45
Lex input as a sequence of bytes.
...
Instead of lexing the input as a raw String or as a set of codepoints, it's
treated as a sequence of bytes. This removes the need for String#[] (replaced
by String#byteslice), which in turn reduces the amount of memory needed and
speeds up lexing.
Thanks to @headius and @apeiros for suggesting this and rubber ducking along!
2014-04-17 17:45:05 +02:00
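To illustrate the byte-oriented approach from d9fa4b7c45, here is a minimal sketch comparing String#[] with String#byteslice on the same input. The variable names `ts` and `te` are assumptions made for this example and are not taken from the lexer itself.

```ruby
# Sketch only: character indexing versus byte slicing on one input.
input = "<p>h\u00e9llo</p>"

# String#[] counts characters, so the slice below is five characters long.
char_slice = input[3, 5]

# String#byteslice counts bytes; the accented character takes two bytes in
# UTF-8, so the same text is six bytes long.
byte_slice = input.byteslice(3, 6)

# A byte-oriented lexer can track plain byte offsets (hypothetical names).
ts = 3 # start offset of the current token, in bytes
te = 9 # end offset (exclusive), in bytes

text = input.byteslice(ts, te - ts)
```

Both slices yield the same text here; the difference is that byte offsets can be used directly without translating them into character positions first.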
Yorick Peterse
b96f7c4852
Lex attributes with namespaces.
...
These are lexed as just the name instead of two separate tokens.
2014-04-10 11:01:49 +02:00
Yorick Peterse
8237d5791d
Stream tokens when lexing.
...
Instead of returning the tokens as a whole they are now streamed using
XML::Lexer#advance. This method returns the next token upon every call. It uses
a small buffer in case a particular block of text results in multiple tokens.
2014-04-09 22:08:13 +02:00
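The streaming behaviour described in 8237d5791d could look roughly like the sketch below. `SketchLexer`, its regular expressions and its token layout are all assumptions made for illustration; this is not the actual XML::Lexer implementation. It only shows how an `advance` method can hand out one token per call while buffering the extra tokens produced by a single block of input.

```ruby
require 'strscan'

# Hypothetical streaming lexer; not Oga's actual XML::Lexer.
class SketchLexer
  def initialize(input)
    @scanner = StringScanner.new(input)
    @buffer  = []
  end

  # Returns the next token, or nil once the input is exhausted.
  def advance
    fill_buffer while @buffer.empty? && !@scanner.eos?

    @buffer.shift
  end

  private

  # Scans one block of input; a single block may push several tokens.
  def fill_buffer
    if @scanner.scan(/<(\w+)>/)
      @buffer << [:T_ELEM_START, nil] << [:T_ELEM_NAME, @scanner[1]]
    elsif @scanner.scan(/<\/\w+>/)
      @buffer << [:T_ELEM_END, nil]
    elsif (text = @scanner.scan(/[^<]+/))
      @buffer << [:T_TEXT, text]
    else
      # Skip a character we do not understand so the loop makes progress.
      @scanner.getch
    end
  end
end

lexer  = SketchLexer.new('<p>hi</p>')
tokens = []

while (token = lexer.advance)
  tokens << token
end
```

The buffer matters for the opening tag: one scan produces two tokens, so the second one waits in the buffer until the next call to `advance`.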
Yorick Peterse
10d0ec1573
Specs for parsing various empty nodes.
2014-04-07 21:33:23 +02:00
Yorick Peterse
cb74c7edf9
Specs for XML parser errors.
2014-04-07 21:31:36 +02:00
Yorick Peterse
915d3ee505
Expanded tests for XML::Document#inspect.
2014-04-07 20:11:12 +02:00
Yorick Peterse
e9412c9c4e
Tests for various inspect methods.
2014-04-07 09:58:31 +02:00
Yorick Peterse
a2c525dd7c
Insert newlines after XML dec/doctypes.
2014-04-03 23:04:21 +02:00
Yorick Peterse
c077988dd6
Tree building of doctypes.
2014-04-03 22:44:00 +02:00
Yorick Peterse
81b1155af3
Lex/parse doctype names separately.
2014-04-03 21:59:57 +02:00
Yorick Peterse
6cf906e500
Lexer tests for single quoted attributes.
2014-04-03 18:50:07 +02:00
Yorick Peterse
30c01a5aee
Tests for XML::TreeBuilder#handler_missing.
2014-04-03 09:43:30 +02:00
Yorick Peterse
0f129ceac9
Tests for XML::TreeBuilder#on_comment.
2014-04-03 09:38:18 +02:00
Yorick Peterse
bdb76cefc5
Dedicated handling of XML declaration nodes.
2014-04-02 22:30:45 +02:00
Yorick Peterse
d6c0a1f3f3
Lex/parse XML declaration attributes.
2014-04-02 22:01:17 +02:00
Yorick Peterse
fa2e71c790
Tests for TreeBuilder#on_document.
2014-03-28 18:52:08 +01:00
Yorick Peterse
f99c13b516
Tests + docs for the TreeBuilder class.
2014-03-28 17:11:54 +01:00
Yorick Peterse
331726b2ca
Tests for the various XML node types.
2014-03-28 16:34:30 +01:00
Yorick Peterse
79818eb349
Added a convenience class for parsing HTML.
...
This removes the need for users to set the `:html` option themselves.
2014-03-25 09:40:24 +01:00
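One way a convenience class such as the one from 79818eb349 can be structured is a subclass that always passes the `:html` option. The class names `SketchParser` and `SketchHTMLParser` and the `html?` method below are hypothetical and are not claimed to match Oga's API.

```ruby
# Hypothetical base parser that accepts an :html option.
class SketchParser
  def initialize(options = {})
    @html = options.fetch(:html, false)
  end

  # True when the parser should apply HTML-specific rules.
  def html?
    @html
  end
end

# Hypothetical convenience class: enables HTML mode for the caller.
class SketchHTMLParser < SketchParser
  def initialize(options = {})
    super(options.merge(:html => true))
  end
end

xml_parser  = SketchParser.new
html_parser = SketchHTMLParser.new
```

With this layout a caller writes `SketchHTMLParser.new` rather than `SketchParser.new(:html => true)`.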
Yorick Peterse
58009614f6
Moved XML specs into spec/oga/xml.
2014-03-25 09:36:39 +01:00
Yorick Peterse
eae13d21ed
Namespaced the lexer/parser under Oga::XML.
...
With the upcoming XPath and CSS selector lexers/parsers it will be confusing to
keep these in the root namespace.
2014-03-25 09:34:38 +01:00
Yorick Peterse
641c54261e
Simplified lexer output for comments.
2014-03-24 21:34:30 +01:00
Yorick Peterse
eaf1669b07
Simplified lexer output for CDATA tags.
2014-03-24 21:33:05 +01:00
Yorick Peterse
470be5a839
Simplified the lexer output for doctypes.
2014-03-24 21:32:16 +01:00
Yorick Peterse
ac775918ee
Lexing/parsing of XML declaration tags.
...
This closes #12.
2014-03-24 21:30:19 +01:00
Yorick Peterse
b695ecf0df
Renamed element lexer tags.
...
T_ELEM_OPEN has been renamed to T_ELEM_START, T_ELEM_CLOSE has been renamed to
T_ELEM_END. This keeps the token names consistent with the other ones (e.g.
T_COMMENT_START).
2014-03-24 20:32:43 +01:00
Yorick Peterse
91fb7523fd
Lex open tags with newlines in them.
2014-03-20 23:39:29 +01:00
Yorick Peterse
74bc11a239
Rip out column counting.
...
This makes both the lexer and parser quite a bit easier to use. Counting column
numbers also isn't really needed when parsing XML/HTML.
2014-03-20 19:44:28 +01:00
Yorick Peterse
192ba9bb54
Expanded the lexer comment tests.
2014-03-19 21:44:57 +01:00
Yorick Peterse
7271e74396
Revert "Compacter parser AST."
...
Although this AST is more compact, it results in conflicts between (text),
(attributes) and (attribute) nodes in regular XML documents. This is because
XML, unlike HTML, allows elements with these names.
This reverts commit 8898d08831.
2014-03-18 18:55:16 +01:00
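The conflict behind the revert in 7271e74396 can be shown with a small sketch. The node shapes below are plain arrays invented for this example; the actual AST node classes are not shown in this log, so treat every helper name here as hypothetical.

```ruby
# Hypothetical node shapes, written as plain arrays.
def generic_element(name, children = [])
  [:element, name, children]
end

def compact_element(name, children = [])
  [name.to_sym, children]
end

def text_node(value)
  [:text, value]
end

# An XML document may legally contain an element named <text>.
generic = generic_element('text', [text_node('hello')])
compact = compact_element('text', [text_node('hello')])

# In the compact form the element and its text child share the :text type.
compact_clash = compact.first == compact.last.first.first
generic_clash = generic.first == generic.last.first.first
```

With the generic `element` type the two kinds of node stay distinguishable, which is the property the revert restores.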
Yorick Peterse
274ab359ba
Don't use separate tokens/nodes for newlines.
...
Newlines are now lexed together with regular text. The line numbers are
advanced based on the number of "\n" sequences in a text buffer.
2014-03-17 21:26:21 +01:00
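A minimal sketch of the line counting described in 274ab359ba, assuming a helper named `advance_line` (a hypothetical name, not part of the lexer):

```ruby
# Hypothetical helper: advances a line counter by the number of newlines
# found in a block of text.
def advance_line(line, text)
  line + text.count("\n")
end

line = 1
line = advance_line(line, "foo\nbar\n") # two newlines
line = advance_line(line, 'baz')        # no newlines
```

Because newlines are counted inside the text buffer, no separate newline token is required to keep line numbers accurate.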
Yorick Peterse
8898d08831
Compacter parser AST.
...
The AST no longer uses the generic `element` type for element nodes but instead
changes the type based on the element type. That is, a <p> element now results
in a (p) node, <link> in (link), etc.
2014-03-17 21:03:54 +01:00
Yorick Peterse
8d3f3f15d7
Renamed parse_html() to parse().
2014-03-16 23:46:20 +01:00
Yorick Peterse
cb75edc30d
Basic support for lexing/parsing HTML5.
...
This will need a bunch of extra tests before I'll consider closing #7.
2014-03-16 23:42:24 +01:00
Yorick Peterse
ce8bbdb64a
Parsing support for multiple nested nodes.
2014-03-15 20:19:54 +01:00
Yorick Peterse
05ee3c13c9
Parsing support for nested element/text nodes.
2014-03-14 00:44:11 +01:00
Yorick Peterse
6b2f682c5c
Tests for lexing a basic HTML document.
...
This also comes with some changes to the lexer so that it advances column/line
numbers correctly.
2014-03-13 23:55:18 +01:00
Yorick Peterse
edf2e4112b
Added a test for parsing bare text tokens.
2014-03-13 00:42:58 +01:00
Yorick Peterse
34f8779c94
Lexing of bare regular text.
...
This is currently a bit of a hack but at least we're slowly getting there.
2014-03-13 00:42:12 +01:00
Yorick Peterse
2fbca93ae8
Support for parsing nested elements.
2014-03-12 23:13:28 +01:00
Yorick Peterse
8cfa81aed9
Basic support for parsing elements.
...
This includes support for elements with namespaces and attributes. Nested
elements are not yet supported.
2014-03-12 23:02:54 +01:00
Yorick Peterse
98b3443e7f
Lexing of element attributes without values.
2014-03-12 22:41:17 +01:00
Yorick Peterse
ed9d8c05a2
Added support for parsing comments.
2014-03-12 22:20:12 +01:00
Yorick Peterse
0a396043f8
Support for parsing CDATA tags.
2014-03-11 22:22:02 +01:00
Yorick Peterse
c9592856f0
Updated parsing of doctypes.
...
The resulting nodes now separate the type, public and system IDs into separate
string values.
2014-03-11 22:08:21 +01:00
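As a rough illustration of splitting a doctype into separate string values (c9592856f0), the sketch below uses a regular expression. The pattern, the `split_doctype` helper and the hash keys are assumptions for this example only; the actual parser is grammar based and its node layout is not shown in this log.

```ruby
# Hypothetical pattern: name, optional PUBLIC/SYSTEM keyword, and up to two
# double-quoted identifiers.
DOCTYPE_PATTERN = /\A<!DOCTYPE\s+(\w+)(?:\s+(PUBLIC|SYSTEM))?(?:\s+"([^"]*)")?(?:\s+"([^"]*)")?\s*>\z/i

# Returns a Hash with separate string values, or nil if nothing matched.
def split_doctype(input)
  match = DOCTYPE_PATTERN.match(input)

  return nil unless match

  name, type, first_id, second_id = match.captures

  if type && type.upcase == 'SYSTEM'
    # A SYSTEM doctype carries only a system identifier.
    { :name => name, :type => type, :public_id => nil, :system_id => first_id }
  else
    { :name => name, :type => type, :public_id => first_id, :system_id => second_id }
  end
end

html4 = split_doctype(
  '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" ' \
  '"http://www.w3.org/TR/html4/strict.dtd">'
)

html5 = split_doctype('<!DOCTYPE html>')
```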
Yorick Peterse
4a41894e2c
Updated the doctype parser specs.
2014-03-11 22:02:26 +01:00
Yorick Peterse
8ce76be050
Moved the parser class to Oga::Parser.
...
Oga will use the same parser for XML and HTML so it doesn't make sense to
separate the two into different namespaces (at least for now).
2014-03-11 22:01:50 +01:00
Yorick Peterse
eacd9b88cf
Reworked token generation for elements.
...
This emits separate tokens for the start tag (T_ELEMENT_OPEN) and name
(T_ELEMENT_NAME). This makes it easier to include the namespace of an element
(T_ELEMENT_NS) in the output.
2014-03-10 23:50:39 +01:00
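The token layout described in eacd9b88cf might be sketched as follows. The token names come from the commit message, but the `element_open_tokens` helper, the token order and the `[type, value]` pair format are assumptions made for this example.

```ruby
# Hypothetical helper: turns an opening tag into separate tokens for the
# tag start, the optional namespace and the element name.
def element_open_tokens(tag)
  match = /\A<(?:(\w+):)?(\w+)>\z/.match(tag)

  return [] unless match

  ns, name = match.captures
  tokens   = [[:T_ELEMENT_OPEN, nil]]

  # The namespace token is only emitted when a prefix is present.
  tokens << [:T_ELEMENT_NS, ns] if ns
  tokens << [:T_ELEMENT_NAME, name]

  tokens
end

plain      = element_open_tokens('<p>')
namespaced = element_open_tokens('<foo:p>')
```

Keeping the name in its own token means the namespace can be added as one extra token without changing the shape of the others.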
Yorick Peterse
cd53d5e426
Fixed advancing column numbers.
...
In a bunch of cases the column number would not be increased correctly.
2014-03-07 23:54:56 +01:00