Commit Graph

638 Commits

Author SHA1 Message Date
Yorick Peterse cb75edc30d Basic support for lexing/parsing HTML5.
This will need a bunch of extra tests before I'll consider closing #7.
2014-03-16 23:42:24 +01:00
Yorick Peterse ce8bbdb64a Parsing support for multiple nested nodes. 2014-03-15 20:19:54 +01:00
Yorick Peterse 05ee3c13c9 Parsing support for nested element/text nodes. 2014-03-14 00:44:11 +01:00
Yorick Peterse 6b2f682c5c Tests for lexing a basic HTML document.
This also comes with some changes to the lexer so that it advances column/line
numbers correctly.
2014-03-13 23:55:18 +01:00
Yorick Peterse 34f8779c94 Lexing of bare regular text.
This is currently a bit of a hack but at least we're slowly getting there.
2014-03-13 00:42:12 +01:00
Yorick Peterse 2fbca93ae8 Support for parsing nested elements. 2014-03-12 23:13:28 +01:00
Yorick Peterse 8cfa81aed9 Basic support for parsing elements.
This includes support for elements with namespaces and attributes. Nested
elements are not yet supported.
2014-03-12 23:02:54 +01:00
Yorick Peterse 5ce515d224 Small line wrapping change in the lexer. 2014-03-12 22:42:13 +01:00
Yorick Peterse 98b3443e7f Lexing of element attributes without values. 2014-03-12 22:41:17 +01:00
Yorick Peterse ed9d8c05a2 Added support for parsing comments. 2014-03-12 22:20:12 +01:00
Yorick Peterse 0a396043f8 Support for parsing CDATA tags. 2014-03-11 22:22:02 +01:00
Yorick Peterse c9592856f0 Updated parsing of doctypes.
The resulting nodes now separate the type, public and system IDs into separate
string values.
2014-03-11 22:08:21 +01:00
Yorick Peterse c07edc767b Updated the gitignore entry for the parser. 2014-03-11 22:03:02 +01:00
Yorick Peterse 8ce76be050 Moved the parser class to Oga::Parser.
Oga will use the same parser for XML and HTML so it doesn't make sense to
separate the two into different namespaces (at least for now).
2014-03-11 22:01:50 +01:00
Yorick Peterse 77b40d2e81 Use a separate machine for closing tags.
This makes it easier to advance column numbers for whitespace as well as
capturing and emitting tokens for the closing tag.
2014-03-11 21:55:36 +01:00
Yorick Peterse eacd9b88cf Reworked token generation for elements.
This emits separate tokens for the start tag (T_ELEMENT_OPEN) and name
(T_ELEMENT_NAME). This makes it easier to include the namespace of an element
(T_ELEMENT_NS) in the output.
2014-03-10 23:50:39 +01:00
Yorick Peterse cd53d5e426 Fixed advancing column numbers.
In a bunch of cases the column number would not be increased correctly.
2014-03-07 23:54:56 +01:00
Yorick Peterse a5a3b8db3f Basic lexing of HTML tags.
The current implementation is a bit messy. In particular the counting of column
numbers is not entirely the way it should be. There are also some problems with
nested tags/text that I still have to resolve.
2014-03-03 22:08:46 +01:00
Yorick Peterse d9ef33e1f8 Lexing of comments.
This fixes #4.
2014-02-28 23:27:23 +01:00
Yorick Peterse 92ae48f905 Use fcall + fret instead of fgoto.
This removes the hardcoded return to the main machine.
2014-02-28 23:19:31 +01:00
Yorick Peterse 30d3e455d1 Use squote/dquote everywhere in the lexer. 2014-02-28 23:18:23 +01:00
Yorick Peterse 970ce27283 Cleanup of buffering text/strings.
This removes the need to use ||= and such, which should speed things up a bit
and keeps the code cleaner.
2014-02-28 23:16:01 +01:00
Yorick Peterse ca6f422036 Lexing of doctypes.
This comes with various structural changes to the lexer as I'm slowly starting
to get the hang of Ragel. Ragel is a beast but damn it's an awesome piece of
software.

Note that the doctype public/system IDs are lexed as T_STRING. The parser will
figure out whether an ID is a public or system ID based on the order.

This fixes #1
2014-02-28 23:08:55 +01:00
Yorick Peterse 3c825afee0 Cleaned up lexer rules a bit.
There's no benefit to adding variables for angle brackets and such, it's much
easier to grok to just use them directly.
2014-02-28 20:09:13 +01:00
Yorick Peterse 2294bf19f4 Better lexing of CDATA tags.
This means the lexer is now capable of lexing CDATA tags that contain text such
as ]].
2014-02-28 20:05:12 +01:00
Yorick Peterse 6138945d53 Moved some of the CDATA docs around. 2014-02-28 00:04:44 +01:00
Yorick Peterse 4883ac7384 Lexing of CDATA tags. 2014-02-28 00:03:37 +01:00
Yorick Peterse 2c82f88f6c Basic lexing + parsing of doctypes.
We're doing these the lazy way. I can't be bothered writing patterns/rules for
4 different formats for something such as doctypes.
2014-02-27 01:27:51 +01:00
Yorick Peterse 91f416f035 Moved ending tags into their own racc rule. 2014-02-26 22:20:11 +01:00
Yorick Peterse 4f04fa0d30 Untrack Racc generated files.
Yorick, you can stop being bad now.
2014-02-26 22:18:33 +01:00
Yorick Peterse e764ba640a Basic parser setup without tests.
Who needs tests anyway!
2014-02-26 22:17:47 +01:00
Yorick Peterse c4e0406ed9 Lexing of CDATA tags. 2014-02-26 22:01:07 +01:00
Yorick Peterse 0a336e76d3 Renamed T_EXCLAMATION to T_BANG.
This is way easier to type.
2014-02-26 21:54:27 +01:00
Yorick Peterse 684eccd3e2 Lex dashes as T_DASH instead of T_TEXT. 2014-02-26 21:52:32 +01:00
Yorick Peterse 39bbe5afc4 Expanded lexer tag/attribute tests. 2014-02-26 21:48:46 +01:00
Yorick Peterse d32888f803 Basic lexer setup/tests.
Too lazy to do this the right way. ᕕ(ᐛ)ᕗ
2014-02-26 21:36:30 +01:00
Yorick Peterse 5755c325bd Imported a half-assed lexer. 2014-02-26 19:54:11 +01:00
Yorick Peterse 702477ca28 Basic project layout. 2014-02-26 19:50:16 +01:00