Yorick Peterse
74bc11a239
Rip out column counting.
...
This makes both the lexer and parser quite a bit easier to use. Counting column
numbers also isn't really needed when parsing XML/HTML.
2014-03-20 19:44:28 +01:00
Yorick Peterse
70a39042e7
Removed useless rules from the parser.
2014-03-20 18:58:32 +01:00
Yorick Peterse
03774f2788
Documented the lexer.
2014-03-19 22:05:57 +01:00
Yorick Peterse
192ba9bb54
Expanded the lexer comment tests.
2014-03-19 21:44:57 +01:00
Yorick Peterse
f1fcdfbacb
Cleaned up the Ragel bits of the lexer.
...
This removes some of the complexity that existed before (e.g. too many state
machines) and fixes a bunch of problems with nested data.
2014-03-19 21:44:10 +01:00
Yorick Peterse
7271e74396
Revert "Compacter parser AST."
...
Although this AST is more compact, it will result in conflicts between (text),
(attributes) and (attribute) nodes in regular XML documents. This is due to XML
allowing elements with these names (unlike in HTML).
This reverts commit 8898d08831.
2014-03-18 18:55:16 +01:00
Yorick Peterse
9687dd379f
Added a .ruby-version file.
2014-03-18 18:08:25 +01:00
Yorick Peterse
56f22c311e
Allow JRuby to fail for now.
2014-03-18 00:13:33 +01:00
Yorick Peterse
422832fd68
Lowered the required Ragel version to 6.7.
2014-03-18 00:12:21 +01:00
Yorick Peterse
091e32c17a
Install Ragel on Travis CI.
2014-03-18 00:09:16 +01:00
Yorick Peterse
8d4d3999b5
Configuration file for Travis CI.
2014-03-17 21:52:24 +01:00
Yorick Peterse
9975c9c430
Removed the emit_text_buffer Ragel action.
2014-03-17 21:49:49 +01:00
Yorick Peterse
274ab359ba
Don't use separate tokens/nodes for newlines.
...
Newlines are now lexed together with regular text. The line numbers are
advanced based on the number of "\n" sequences in a text buffer.
2014-03-17 21:26:21 +01:00
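The line counting described in this commit can be sketched roughly as follows. This is a minimal illustration only, not the lexer's actual code: the `advance_line` helper and its arguments are hypothetical names chosen for the example.

```ruby
# Hypothetical helper (not taken from the Oga source): returns the new line
# number after a text buffer has been lexed, by counting its "\n" characters.
def advance_line(current_line, text)
  current_line + text.count("\n")
end

line = 1
line = advance_line(line, "foo\nbar\n")
puts line
```

Counting newlines once per text buffer avoids emitting a separate token for every newline.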
Yorick Peterse
8898d08831
Compacter parser AST.
...
The AST no longer uses the generic `element` type for element nodes but instead
changes the type based on the element type. That is, a <p> element now results
in a (p) node, <link> in (link), etc.
2014-03-17 21:03:54 +01:00
Yorick Peterse
8d3f3f15d7
Renamed parse_html() to parse().
2014-03-16 23:46:20 +01:00
Yorick Peterse
cb75edc30d
Basic support for lexing/parsing HTML5.
...
This will need a bunch of extra tests before I'll consider closing #7.
2014-03-16 23:42:24 +01:00
Yorick Peterse
ce8bbdb64a
Parsing support for multiple nested nodes.
2014-03-15 20:19:54 +01:00
Yorick Peterse
05ee3c13c9
Parsing support for nested element/text nodes.
2014-03-14 00:44:11 +01:00
Yorick Peterse
6b2f682c5c
Tests for lexing a basic HTML document.
...
This also comes with some changes to the lexer so that it advances column/line
numbers correctly.
2014-03-13 23:55:18 +01:00
Yorick Peterse
edf2e4112b
Added a test for parsing bare text tokens.
2014-03-13 00:42:58 +01:00
Yorick Peterse
34f8779c94
Lexing of bare regular text.
...
This is currently a bit of a hack but at least we're slowly getting there.
2014-03-13 00:42:12 +01:00
Yorick Peterse
2fbca93ae8
Support for parsing nested elements.
2014-03-12 23:13:28 +01:00
Yorick Peterse
8cfa81aed9
Basic support for parsing elements.
...
This includes support for elements with namespaces and attributes. Nested
elements are not yet supported.
2014-03-12 23:02:54 +01:00
Yorick Peterse
5ce515d224
Small line wrapping change in the lexer.
2014-03-12 22:42:13 +01:00
Yorick Peterse
98b3443e7f
Lexing of element attributes without values.
2014-03-12 22:41:17 +01:00
Yorick Peterse
ed9d8c05a2
Added support for parsing comments.
2014-03-12 22:20:12 +01:00
Yorick Peterse
0a396043f8
Support for parsing CDATA tags.
2014-03-11 22:22:02 +01:00
Yorick Peterse
c9592856f0
Updated parsing of doctypes.
...
The resulting nodes now split the type, public and system IDs into separate
string values.
2014-03-11 22:08:21 +01:00
Yorick Peterse
c07edc767b
Updated the gitignore entry for the parser.
2014-03-11 22:03:02 +01:00
Yorick Peterse
4a41894e2c
Updated the doctype parser specs.
2014-03-11 22:02:26 +01:00
Yorick Peterse
8ce76be050
Moved the parser class to Oga::Parser.
...
Oga will use the same parser for XML and HTML so it doesn't make sense to
separate the two into different namespaces (at least for now).
2014-03-11 22:01:50 +01:00
Yorick Peterse
77b40d2e81
Use a separate machine for closing tags.
...
This makes it easier to advance column numbers for whitespace as well as
capturing and emitting tokens for the closing tag.
2014-03-11 21:55:36 +01:00
Yorick Peterse
eacd9b88cf
Reworked token generation for elements.
...
This emits separate tokens for the start tag (T_ELEMENT_OPEN) and name
(T_ELEMENT_NAME). This makes it easier to include the namespace of an element
(T_ELEMENT_NS) in the output.
2014-03-10 23:50:39 +01:00
Yorick Peterse
cd53d5e426
Fixed advancing column numbers.
...
In a bunch of cases the column number would not be increased correctly.
2014-03-07 23:54:56 +01:00
Yorick Peterse
1c9a6c8b76
Tests for nested tags/text nodes.
...
Well guess what, apparently that did work. That was slightly unexpected.
2014-03-03 22:13:29 +01:00
Yorick Peterse
a5a3b8db3f
Basic lexing of HTML tags.
...
The current implementation is a bit messy. In particular the counting of column
numbers is not entirely the way it should be. There are also some problems with
nested tags/text that I still have to resolve.
2014-03-03 22:08:46 +01:00
Yorick Peterse
d9ef33e1f8
Lexing of comments.
...
This fixes #4.
2014-02-28 23:27:23 +01:00
Yorick Peterse
92ae48f905
Use fcall + fret instead of fgoto.
...
This removes the hardcoded return to the main machine.
2014-02-28 23:19:31 +01:00
Yorick Peterse
30d3e455d1
Use squote/dquote everywhere in the lexer.
2014-02-28 23:18:23 +01:00
Yorick Peterse
970ce27283
Cleanup of buffering text/strings.
...
This removes the need to use ||= and such, which should speed things up a bit
and keeps the code cleaner.
2014-02-28 23:16:01 +01:00
Yorick Peterse
ca6f422036
Lexing of doctypes.
...
This comes with various structural changes to the lexer as I'm slowly starting
to get the hang of Ragel. Ragel is a beast but damn it's an awesome piece of
software.
Note that the doctype public/system IDs are lexed as T_STRING. The parser will
figure out whether an ID is a public or system ID based on the order.
This fixes #1.
2014-02-28 23:08:55 +01:00
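The order-based handling of doctype IDs mentioned in this commit might look roughly like the sketch below. It assumes that the first T_STRING is the public ID and the second is the system ID, which the commit message does not spell out. The `doctype_ids` helper, the `[type, value]` token shape and the `:T_OTHER` placeholder token are all hypothetical.

```ruby
# Hypothetical sketch (not the Oga parser): tokens are [type, value] pairs.
# Assumption: the first T_STRING is the public ID, the second the system ID.
def doctype_ids(tokens)
  strings = tokens.select { |type, _| type == :T_STRING }
                  .map { |_, value| value }

  { public_id: strings[0], system_id: strings[1] }
end

tokens = [
  [:T_OTHER, nil], # placeholder for whatever token opens the doctype
  [:T_STRING, '-//W3C//DTD HTML 4.01//EN'],
  [:T_STRING, 'http://www.w3.org/TR/html4/strict.dtd']
]

ids = doctype_ids(tokens)
puts ids[:public_id]
puts ids[:system_id]
```

Keeping both IDs as plain T_STRING tokens leaves the lexer simple and pushes the interpretation into the parser.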
Yorick Peterse
3c825afee0
Cleaned up lexer rules a bit.
...
There's no benefit to adding variables for angle brackets and such; it's much
easier to grok when they are used directly.
2014-02-28 20:09:13 +01:00
Yorick Peterse
2294bf19f4
Better lexing of CDATA tags.
...
This means the lexer is now capable of lexing CDATA tags that contain text such
as ]].
2014-02-28 20:05:12 +01:00
Yorick Peterse
6138945d53
Moved some of the CDATA docs around.
2014-02-28 00:04:44 +01:00
Yorick Peterse
4883ac7384
Lexing of CDATA tags.
2014-02-28 00:03:37 +01:00
Yorick Peterse
c011e2faaa
Moved the lexer specs to spec/oga/lexer.
...
I accidentally moved these inside the parser specs.
2014-02-27 21:30:10 +01:00
Yorick Peterse
cdaa14a28e
Broke up lexer specs into separate files.
2014-02-27 20:55:29 +01:00
Yorick Peterse
2c82f88f6c
Basic lexing + parsing of doctypes.
...
We're doing these the lazy way. I can't be bothered writing patterns/rules for
4 different formats for something such as doctypes.
2014-02-27 01:27:51 +01:00
Yorick Peterse
d7d20b4c23
Added a license.
2014-02-26 22:20:47 +01:00
Yorick Peterse
91f416f035
Moved ending tags into their own racc rule.
2014-02-26 22:20:11 +01:00