Yorick Peterse
eae13d21ed
Namespaced the lexer/parser under Oga::XML.
...
With the upcoming XPath and CSS selector lexers/parsers it will be confusing to
keep these in the root namespace.
2014-03-25 09:34:38 +01:00
Yorick Peterse
641c54261e
Simplified lexer output for comments.
2014-03-24 21:34:30 +01:00
Yorick Peterse
eaf1669b07
Simplified lexer output for CDATA tags.
2014-03-24 21:33:05 +01:00
Yorick Peterse
470be5a839
Simplified the lexer output for doctypes.
2014-03-24 21:32:16 +01:00
Yorick Peterse
ac775918ee
Lexing/parsing of XML declaration tags.
...
This closes #12.
2014-03-24 21:30:19 +01:00
Yorick Peterse
b695ecf0df
Renamed element lexer tags.
...
T_ELEM_OPEN has been renamed to T_ELEM_START, T_ELEM_CLOSE has been renamed to
T_ELEM_END. This keeps the token names consistent with the other ones (e.g.
T_COMMENT_START).
2014-03-24 20:32:43 +01:00
Yorick Peterse
91fb7523fd
Lex open tags with newlines in them.
2014-03-20 23:39:29 +01:00
Yorick Peterse
74bc11a239
Rip out column counting.
...
This makes both the lexer and parser quite a bit easier to use. Counting column
numbers also isn't really needed when parsing XML/HTML.
2014-03-20 19:44:28 +01:00
Yorick Peterse
192ba9bb54
Expanded the lexer comment tests.
2014-03-19 21:44:57 +01:00
Yorick Peterse
7271e74396
Revert "Compacter parser AST."
...
Although this AST is more compact, it will result in conflicts between (text),
(attributes) and (attribute) nodes in regular XML documents. This is due to XML
allowing elements with these names (unlike in HTML).
This reverts commit 8898d08831.
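The conflict can be illustrated with a small sketch. The node shapes below (nested arrays with the type as first entry) and both helper names are assumptions made for illustration, not the project's actual AST:

```ruby
# Hypothetical node shapes, written as nested arrays. These are illustrative
# and not taken from the project's actual AST.

# Compact form: the element name becomes the node type.
def compact_element(name, children)
  [name.to_sym, *children]
end

# Generic form: every element uses the :element type and stores its name.
def generic_element(name, children)
  [:element, name, *children]
end

text_node = [:text, "hello"]

# An XML element that happens to be called <text>.
compact = compact_element("text", [text_node])
generic = generic_element("text", [text_node])

# In the compact form the element and the text node share the type :text,
# so a consumer cannot tell them apart by type alone.
puts compact.first == text_node.first # true
puts generic.first == text_node.first # false
```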
2014-03-18 18:55:16 +01:00
Yorick Peterse
274ab359ba
Don't use separate tokens/nodes for newlines.
...
Newlines are now lexed together with regular text. The line numbers are
advanced based on the number of "\n" sequences in a text buffer.
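A minimal sketch of that line counting, assuming a hypothetical `advance_line` helper (the real lexer may track this differently):

```ruby
# Hypothetical helper: returns the new line number after consuming `text`.
# Assumes lines are separated by "\n" only.
def advance_line(line, text)
  line + text.count("\n")
end

line = 1

# Each buffer stands in for a chunk of text handed over by the lexer.
["foo\nbar", "baz", "\n\n"].each do |buffer|
  line = advance_line(line, buffer)
end

puts line # 4
```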
2014-03-17 21:26:21 +01:00
Yorick Peterse
8898d08831
Compacter parser AST.
...
The AST no longer uses the generic `element` type for element nodes but instead
changes the type based on the element type. That is, a <p> element now results
in a (p) node, <link> in (link), etc.
2014-03-17 21:03:54 +01:00
Yorick Peterse
8d3f3f15d7
Renamed parse_html() to parse().
2014-03-16 23:46:20 +01:00
Yorick Peterse
cb75edc30d
Basic support for lexing/parsing HTML5.
...
This will need a bunch of extra tests before I'll consider closing #7.
2014-03-16 23:42:24 +01:00
Yorick Peterse
ce8bbdb64a
Parsing support for multiple nested nodes.
2014-03-15 20:19:54 +01:00
Yorick Peterse
05ee3c13c9
Parsing support for nested element/text nodes.
2014-03-14 00:44:11 +01:00
Yorick Peterse
6b2f682c5c
Tests for lexing a basic HTML document.
...
This also comes with some changes to the lexer so that it advances column/line
numbers correctly.
2014-03-13 23:55:18 +01:00
Yorick Peterse
edf2e4112b
Added a test for parsing bare text tokens.
2014-03-13 00:42:58 +01:00
Yorick Peterse
34f8779c94
Lexing of bare regular text.
...
This is currently a bit of a hack but at least we're slowly getting there.
2014-03-13 00:42:12 +01:00
Yorick Peterse
2fbca93ae8
Support for parsing nested elements.
2014-03-12 23:13:28 +01:00
Yorick Peterse
8cfa81aed9
Basic support for parsing elements.
...
This includes support for elements with namespaces and attributes. Nested
elements are not yet supported.
2014-03-12 23:02:54 +01:00
Yorick Peterse
98b3443e7f
Lexing of element attributes without values.
2014-03-12 22:41:17 +01:00
Yorick Peterse
ed9d8c05a2
Added support for parsing comments.
2014-03-12 22:20:12 +01:00
Yorick Peterse
0a396043f8
Support for parsing CDATA tags.
2014-03-11 22:22:02 +01:00
Yorick Peterse
c9592856f0
Updated parsing of doctypes.
...
The resulting nodes now split the type, public ID and system ID into separate
string values.
2014-03-11 22:08:21 +01:00
Yorick Peterse
4a41894e2c
Updated the doctype parser specs.
2014-03-11 22:02:26 +01:00
Yorick Peterse
8ce76be050
Moved the parser class to Oga::Parser.
...
Oga will use the same parser for XML and HTML so it doesn't make sense to
separate the two into different namespaces (at least for now).
2014-03-11 22:01:50 +01:00
Yorick Peterse
eacd9b88cf
Reworked token generation for elements.
...
This emits separate tokens for the start tag (T_ELEMENT_OPEN) and name
(T_ELEMENT_NAME). This makes it easier to include the namespace of an element
(T_ELEMENT_NS) in the output.
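As a rough illustration of the resulting token stream, here is a sketch that uses a regular expression instead of the Ragel grammar. The `lex_element_start` helper and the `[type, value]` token layout are assumptions, only the token names come from this commit:

```ruby
# Hypothetical tokenizer for the start of a single tag such as "<svg:p".
# The [type, value] token layout is an assumption made for illustration.
def lex_element_start(input)
  match = input.match(/\A<(?:(\w+):)?(\w+)/)
  return [] unless match

  tokens = [[:T_ELEMENT_OPEN, "<"]]
  # The namespace token is only emitted when a prefix is present.
  tokens << [:T_ELEMENT_NS, match[1]] if match[1]
  tokens << [:T_ELEMENT_NAME, match[2]]
  tokens
end

p lex_element_start("<p>")
p lex_element_start("<svg:p>")
```

Keeping the name in its own token means the namespace can be slotted in as an optional extra token without changing how the name itself is emitted.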
2014-03-10 23:50:39 +01:00
Yorick Peterse
cd53d5e426
Fixed advancing column numbers.
...
In a bunch of cases the column number would not be increased correctly.
2014-03-07 23:54:56 +01:00
Yorick Peterse
1c9a6c8b76
Tests for nested tags/text nodes.
...
Well guess what, apparently that did work. That was slightly unexpected.
2014-03-03 22:13:29 +01:00
Yorick Peterse
a5a3b8db3f
Basic lexing of HTML tags.
...
The current implementation is a bit messy. In particular the counting of column
numbers is not entirely the way it should be. There are also some problems with
nested tags/text that I still have to resolve.
2014-03-03 22:08:46 +01:00
Yorick Peterse
d9ef33e1f8
Lexing of comments.
...
This fixes #4.
2014-02-28 23:27:23 +01:00
Yorick Peterse
ca6f422036
Lexing of doctypes.
...
This comes with various structural changes to the lexer as I'm slowly starting
to get the hang of Ragel. Ragel is a beast but damn it's an awesome piece of
software.
Note that the doctype public/system IDs are lexed as T_STRING. The parser will
figure out whether an ID is a public or system ID based on the order.
This fixes #1.
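The order-based lookup could look roughly like the sketch below. The `doctype_ids` helper, the `[type, value]` token layout and the `T_DOCTYPE_START` token are hypothetical, and the sketch assumes the first string is the public ID and the second the system ID:

```ruby
# Hypothetical: picks the public and system IDs out of a token list based
# purely on the order of the T_STRING tokens. Assumes the first string is
# the public ID and the second one the system ID.
def doctype_ids(tokens)
  strings = tokens.select { |type, _| type == :T_STRING }.map { |_, value| value }

  { public_id: strings[0], system_id: strings[1] }
end

tokens = [
  [:T_DOCTYPE_START, "<!DOCTYPE"], # illustrative token name
  [:T_STRING, "-//W3C//DTD HTML 4.01//EN"],
  [:T_STRING, "http://www.w3.org/TR/html4/strict.dtd"]
]

p doctype_ids(tokens)
```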
2014-02-28 23:08:55 +01:00
Yorick Peterse
2294bf19f4
Better lexing of CDATA tags.
...
This means the lexer is now capable of lexing CDATA tags that contain text such
as ]].
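To show the case this covers, here is a sketch that uses a plain regular expression rather than the Ragel grammar. The `cdata_body` helper is hypothetical:

```ruby
# Hypothetical helper, not the lexer's Ragel rule. The body is matched lazily
# so that it only stops at the first full "]]>" terminator, not at a bare "]]".
def cdata_body(input)
  match = input.match(/<!\[CDATA\[(.*?)\]\]>/m)
  match && match[1]
end

p cdata_body("<![CDATA[foo]]bar]]>") # "foo]]bar"
```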
2014-02-28 20:05:12 +01:00
Yorick Peterse
4883ac7384
Lexing of CDATA tags.
2014-02-28 00:03:37 +01:00
Yorick Peterse
c011e2faaa
Moved the lexer specs to spec/oga/lexer.
...
I accidentally moved these inside the parser specs.
2014-02-27 21:30:10 +01:00
Yorick Peterse
cdaa14a28e
Broke up lexer specs into separate files.
2014-02-27 20:55:29 +01:00
Yorick Peterse
2c82f88f6c
Basic lexing + parsing of doctypes.
...
We're doing these the lazy way. I can't be bothered writing patterns/rules for
4 different formats for something such as doctypes.
2014-02-27 01:27:51 +01:00
Yorick Peterse
c4e0406ed9
Lexing of CDATA tags.
2014-02-26 22:01:07 +01:00
Yorick Peterse
0a336e76d3
Renamed T_EXCLAMATION to T_BANG.
...
This is way easier to type.
2014-02-26 21:54:27 +01:00
Yorick Peterse
684eccd3e2
Lex dashes as T_DASH instead of T_TEXT.
2014-02-26 21:52:32 +01:00
Yorick Peterse
39bbe5afc4
Expanded lexer tag/attribute tests.
2014-02-26 21:48:46 +01:00
Yorick Peterse
d32888f803
Basic lexer setup/tests.
...
Too lazy to do this the right way. ᕕ(ᐛ)ᕗ
2014-02-26 21:36:30 +01:00
Yorick Peterse
5755c325bd
Imported a half-assed lexer.
2014-02-26 19:54:11 +01:00
Yorick Peterse
702477ca28
Basic project layout.
2014-02-26 19:50:16 +01:00