Yorick Peterse
55f116124c
Fix for showing lines in parser errors.
2014-03-21 00:16:20 +01:00
Yorick Peterse
7749f4abce
Corrected a comment in the parser.
2014-03-21 00:10:20 +01:00
Yorick Peterse
a20ec0000a
Show up to 5 surrounding lines in parser errors.
2014-03-20 23:40:25 +01:00
Yorick Peterse
91fb7523fd
Lex open tags with newlines in them.
2014-03-20 23:39:29 +01:00
Yorick Peterse
ba17996bfc
Fancier error messages for the parser.
...
The error messages of the parser now contain surrounding lines of code instead
of only the offending line of code. This should make debugging a bit easier.
Line numbers are also shown for each line.
2014-03-20 23:30:24 +01:00
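The error output described in ba17996bfc (surrounding lines, each with its line number) could be sketched roughly as below. This is an illustration only, not the parser's actual code: the method name `error_context`, the `=>` marker and the output layout are all assumptions, and "up to 5 surrounding lines" is read here as 5 lines on either side of the offending line.

```ruby
# Hypothetical helper, not taken from the parser itself.
#
# source - The full input String that was being parsed.
# line   - The 1-based number of the offending line.
# radius - Assumed meaning of "surrounding lines": lines shown per side.
def error_context(source, line, radius = 5)
  lines = source.lines.map(&:chomp)

  # Clamp the window so it never runs past the start or end of the input.
  first = [line - radius, 1].max
  last  = [line + radius, lines.length].min

  # Right-align line numbers to the width of the largest one shown.
  width = last.to_s.length

  (first..last).map do |number|
    marker = number == line ? '=>' : '  '

    format("%s %#{width}d: %s", marker, number, lines[number - 1])
  end.join("\n")
end

puts error_context("<a>\n<b>\n</a>\n", 2)
```

The window is clamped rather than padded, so an error near the top or bottom of a document simply shows fewer lines.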
Yorick Peterse
74bc11a239
Rip out column counting.
...
This makes both the lexer and parser quite a bit easier to use. Counting column
numbers also isn't really needed when parsing XML/HTML.
2014-03-20 19:44:28 +01:00
Yorick Peterse
70a39042e7
Removed useless rules from the parser.
2014-03-20 18:58:32 +01:00
Yorick Peterse
03774f2788
Documented the lexer.
2014-03-19 22:05:57 +01:00
Yorick Peterse
f1fcdfbacb
Cleaned up the Ragel bits of the lexer.
...
This removes some of the complexity that existed before (e.g. too many state
machines) and fixes a bunch of problems with nested data.
2014-03-19 21:44:10 +01:00
Yorick Peterse
7271e74396
Revert "Compacter parser AST."
...
Although this AST is more compact it will result in conflicts between (text),
(attributes) and (attribute) nodes in regular XML documents. This is due to XML
allowing elements with these names (unlike in HTML).
This reverts commit 8898d08831.
2014-03-18 18:55:16 +01:00
Yorick Peterse
9975c9c430
Removed the emit_text_buffer Ragel action.
2014-03-17 21:49:49 +01:00
Yorick Peterse
274ab359ba
Don't use separate tokens/nodes for newlines.
...
Newlines are now lexed together with regular text. The line numbers are
advanced based on the number of "\n" sequences in a text buffer.
2014-03-17 21:26:21 +01:00
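The line counting described in 274ab359ba can be illustrated with a minimal sketch. The method name `advance_line` is hypothetical and not taken from the lexer; the only real API used is Ruby's `String#count`.

```ruby
# Hypothetical helper, not the lexer's actual code.
#
# line - The current line number.
# text - A text buffer that may contain any number of newlines.
#
# Returns the line number after the buffer has been consumed.
def advance_line(line, text)
  line + text.count("\n")
end

line = 1
line = advance_line(line, "foo\nbar\n")

puts line
```

Counting newlines per buffer means the lexer does not have to emit a separate token every time it sees "\n".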
Yorick Peterse
8898d08831
Compacter parser AST.
...
The AST no longer uses the generic `element` type for element nodes but instead
changes the type based on the element type. That is, a <p> element now results
in a (p) node, <link> in (link), etc.
2014-03-17 21:03:54 +01:00
Yorick Peterse
cb75edc30d
Basic support for lexing/parsing HTML5.
...
This will need a bunch of extra tests before I'll consider closing #7.
2014-03-16 23:42:24 +01:00
Yorick Peterse
ce8bbdb64a
Parsing support for multiple nested nodes.
2014-03-15 20:19:54 +01:00
Yorick Peterse
05ee3c13c9
Parsing support for nested element/text nodes.
2014-03-14 00:44:11 +01:00
Yorick Peterse
6b2f682c5c
Tests for lexing a basic HTML document.
...
This also comes with some changes to the lexer so that it advances column/line
numbers correctly.
2014-03-13 23:55:18 +01:00
Yorick Peterse
34f8779c94
Lexing of bare regular text.
...
This is currently a bit of a hack but at least we're slowly getting there.
2014-03-13 00:42:12 +01:00
Yorick Peterse
2fbca93ae8
Support for parsing nested elements.
2014-03-12 23:13:28 +01:00
Yorick Peterse
8cfa81aed9
Basic support for parsing elements.
...
This includes support for elements with namespaces and attributes. Nested
elements are not yet supported.
2014-03-12 23:02:54 +01:00
Yorick Peterse
5ce515d224
Small line wrapping change in the lexer.
2014-03-12 22:42:13 +01:00
Yorick Peterse
98b3443e7f
Lexing of element attributes without values.
2014-03-12 22:41:17 +01:00
Yorick Peterse
ed9d8c05a2
Added support for parsing comments.
2014-03-12 22:20:12 +01:00
Yorick Peterse
0a396043f8
Support for parsing CDATA tags.
2014-03-11 22:22:02 +01:00
Yorick Peterse
c9592856f0
Updated parsing of doctypes.
...
The resulting nodes now separate the type, public and system IDs into separate
string values.
2014-03-11 22:08:21 +01:00
Yorick Peterse
c07edc767b
Updated the gitignore entry for the parser.
2014-03-11 22:03:02 +01:00
Yorick Peterse
8ce76be050
Moved the parser class to Oga::Parser.
...
Oga will use the same parser for XML and HTML so it doesn't make sense to
separate the two into different namespaces (at least for now).
2014-03-11 22:01:50 +01:00
Yorick Peterse
77b40d2e81
Use a separate machine for closing tags.
...
This makes it easier to advance column numbers for whitespace as well as
capturing and emitting tokens for the closing tag.
2014-03-11 21:55:36 +01:00
Yorick Peterse
eacd9b88cf
Reworked token generation for elements.
...
This emits separate tokens for the start tag (T_ELEMENT_OPEN) and name
(T_ELEMENT_NAME). This makes it easier to include the namespace of an element
(T_ELEMENT_NS) in the output.
2014-03-10 23:50:39 +01:00
Yorick Peterse
cd53d5e426
Fixed advancing column numbers.
...
In a bunch of cases the column number would not be increased correctly.
2014-03-07 23:54:56 +01:00
Yorick Peterse
a5a3b8db3f
Basic lexing of HTML tags.
...
The current implementation is a bit messy. In particular the counting of column
numbers is not entirely the way it should be. There are also some problems with
nested tags/text that I still have to resolve.
2014-03-03 22:08:46 +01:00
Yorick Peterse
d9ef33e1f8
Lexing of comments.
...
This fixes #4.
2014-02-28 23:27:23 +01:00
Yorick Peterse
92ae48f905
Use fcall + fret instead of fgoto.
...
This removes the hardcoded return to the main machine.
2014-02-28 23:19:31 +01:00
Yorick Peterse
30d3e455d1
Use squote/dquote everywhere in the lexer.
2014-02-28 23:18:23 +01:00
Yorick Peterse
970ce27283
Cleanup of buffering text/strings.
...
This removes the need to use ||= and such, which should speed things up a bit
and keeps the code cleaner.
2014-02-28 23:16:01 +01:00
Yorick Peterse
ca6f422036
Lexing of doctypes.
...
This comes with various structural changes to the lexer as I'm slowly starting
to get the hang of Ragel. Ragel is a beast but damn it's an awesome piece of
software.
Note that the doctype public/system IDs are lexed as T_STRING. The parser will
figure out whether an ID is a public or system ID based on the order.
This fixes #1
2014-02-28 23:08:55 +01:00
Yorick Peterse
3c825afee0
Cleaned up lexer rules a bit.
...
There's no benefit to adding variables for angle brackets and such, it's much
easier to grok to just use them directly.
2014-02-28 20:09:13 +01:00
Yorick Peterse
2294bf19f4
Better lexing of CDATA tags.
...
This means the lexer is now capable of lexing CDATA tags that contain text such
as ]].
2014-02-28 20:05:12 +01:00
Yorick Peterse
6138945d53
Moved some of the CDATA docs around.
2014-02-28 00:04:44 +01:00
Yorick Peterse
4883ac7384
Lexing of CDATA tags.
2014-02-28 00:03:37 +01:00
Yorick Peterse
2c82f88f6c
Basic lexing + parsing of doctypes.
...
We're doing these the lazy way. I can't be bothered writing patterns/rules for
4 different formats for something such as doctypes.
2014-02-27 01:27:51 +01:00
Yorick Peterse
91f416f035
Moved ending tags into their own racc rule.
2014-02-26 22:20:11 +01:00
Yorick Peterse
4f04fa0d30
Untrack Racc generated files.
...
Yorick, you can stop being bad now.
2014-02-26 22:18:33 +01:00
Yorick Peterse
e764ba640a
Basic parser setup without tests.
...
Who needs tests anyway!
2014-02-26 22:17:47 +01:00
Yorick Peterse
c4e0406ed9
Lexing of CDATA tags.
2014-02-26 22:01:07 +01:00
Yorick Peterse
0a336e76d3
Renamed T_EXCLAMATION to T_BANG.
...
This is way easier to type.
2014-02-26 21:54:27 +01:00
Yorick Peterse
684eccd3e2
Lex dashes as T_DASH instead of T_TEXT.
2014-02-26 21:52:32 +01:00
Yorick Peterse
39bbe5afc4
Expanded lexer tag/attribute tests.
2014-02-26 21:48:46 +01:00
Yorick Peterse
d32888f803
Basic lexer setup/tests.
...
Too lazy to do this the right way. ᕕ(ᐛ)ᕗ
2014-02-26 21:36:30 +01:00
Yorick Peterse
5755c325bd
Imported a half-assed lexer.
2014-02-26 19:54:11 +01:00