971 Commits

Author SHA1 Message Date
Yorick Peterse d7a40ec470 Simple benchmark for lexing elements. 2014-03-21 17:45:23 +01:00
Yorick Peterse 9fa694ad4f Use index based buffers for text nodes.
Instead of appending single characters to a String buffer the lexer now uses a
start and end position to figure out what the buffer is. This is a lot faster
than constantly appending to a String.
2014-03-21 17:32:07 +01:00
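The idea behind index based buffers can be sketched roughly as follows. This is a toy illustration, not the lexer's actual code: the method name `lex_text_by_index` is hypothetical, and treating `<` as the only delimiter is an assumption made to keep the example short. Instead of appending each character to a String, only the start index of the current text run is tracked, and the run is sliced out of the input once it ends.

```ruby
# Hypothetical sketch, not Oga's real lexer. Assumes "<" is the only
# character that ends a run of text.
def lex_text_by_index(data)
  tokens = []
  start  = nil # index where the current text run began, nil outside of text

  data.each_char.with_index do |char, index|
    if char == '<'
      # The text run ends here: slice it out of the input in one operation
      # instead of having built it up character by character.
      tokens << data[start...index] if start
      start = nil
    elsif start.nil?
      start = index
    end
  end

  # Emit any text that runs up to the end of the input.
  tokens << data[start..-1] if start
  tokens
end
```

Each text token then costs a single slice, regardless of its length.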
Yorick Peterse 2852afce9b Benchmark for measuring CDATA lexing. 2014-03-21 16:59:44 +01:00
Yorick Peterse 55f116124c Fix for showing lines in parser errors. 2014-03-21 00:16:20 +01:00
Yorick Peterse 7749f4abce Corrected a comment in the parser. 2014-03-21 00:10:20 +01:00
Yorick Peterse a20ec0000a Show up to 5 surrounding lines in parser errors. 2014-03-20 23:40:25 +01:00
Yorick Peterse 91fb7523fd Lex open tags with newlines in them. 2014-03-20 23:39:29 +01:00
Yorick Peterse ba17996bfc Fancier error messages for the parser.
The error messages of the parser now contain surrounding lines of code instead
of only the offending line of code. This should make debugging a bit easier.
Line numbers are also shown for each line.
2014-03-20 23:30:24 +01:00
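A minimal sketch of what such an error excerpt could look like, assuming a hypothetical `error_excerpt` helper and a made-up `=>` marker for the offending line. The parser's real output format is not shown in this log, so the layout below is illustrative only.

```ruby
# Hypothetical helper, not the parser's actual implementation.
# source:  the full input as a String
# line:    1-based number of the offending line
# context: how many lines to show before and after it (assumed default)
def error_excerpt(source, line, context = 5)
  lines = source.lines
  first = [line - context, 1].max
  last  = [line + context, lines.length].min
  width = last.to_s.length # pad line numbers to the widest one shown

  (first..last).map do |number|
    marker = number == line ? '=>' : '  '
    format("%s %#{width}d| %s", marker, number, lines[number - 1].chomp)
  end.join("\n")
end
```

Calling `error_excerpt("a\nb\nc\n", 2, 1)` yields three numbered lines with the second one marked.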
Yorick Peterse 74bc11a239 Rip out column counting.
This makes both the lexer and parser quite a bit easier to use. Counting column
numbers also isn't really needed when parsing XML/HTML.
2014-03-20 19:44:28 +01:00
Yorick Peterse 70a39042e7 Removed useless rules from the parser. 2014-03-20 18:58:32 +01:00
Yorick Peterse 03774f2788 Documented the lexer. 2014-03-19 22:05:57 +01:00
Yorick Peterse 192ba9bb54 Expanded the lexer comment tests. 2014-03-19 21:44:57 +01:00
Yorick Peterse f1fcdfbacb Cleaned up the Ragel bits of the lexer.
This removes some of the complexity that existed before (e.g. too many state
machines) and fixes a bunch of problems with nested data.
2014-03-19 21:44:10 +01:00
Yorick Peterse 7271e74396 Revert "Compacter parser AST."
Although this AST is more compact, it will result in conflicts between (text),
(attributes) and (attribute) nodes in regular XML documents. This is due to XML
allowing elements with these names (unlike in HTML).

This reverts commit 8898d08831.
2014-03-18 18:55:16 +01:00
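The conflict can be illustrated with a small sketch. It assumes nodes are plain arrays whose first item is the node type; the helper names (`text_node`, `generic_element`, `compact_element`) are hypothetical and not part of Oga.

```ruby
# Hypothetical array-based nodes, for illustration only.

# A text node as the parser might represent it.
def text_node(value)
  [:text, value]
end

# Generic form: every element shares the :element type and keeps its name
# as data.
def generic_element(name, children = [])
  [:element, name, children]
end

# Compact form: the element name itself becomes the node type.
def compact_element(name, children = [])
  [name.to_sym, children]
end

# An XML element that happens to be called <text>:
compact = compact_element('text')
generic = generic_element('text')

# In the compact form the element has the same type as a real text node,
# so the two can no longer be told apart by type alone.
conflict = compact.first == text_node('hello').first
```

With the generic form, `generic.first` is always `:element`, so no such collision occurs.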
Yorick Peterse 9687dd379f Added a .ruby-version file. 2014-03-18 18:08:25 +01:00
Yorick Peterse 56f22c311e Allow JRuby to fail for now. 2014-03-18 00:13:33 +01:00
Yorick Peterse 422832fd68 Lowered the required Ragel version to 6.7. 2014-03-18 00:12:21 +01:00
Yorick Peterse 091e32c17a Install Ragel on Travis CI. 2014-03-18 00:09:16 +01:00
Yorick Peterse 8d4d3999b5 Configuration file for Travis CI. 2014-03-17 21:52:24 +01:00
Yorick Peterse 9975c9c430 Removed the emit_text_buffer Ragel action. 2014-03-17 21:49:49 +01:00
Yorick Peterse 274ab359ba Don't use separate tokens/nodes for newlines.
Newlines are now lexed together with regular text. The line numbers are
advanced based on the number of "\n" sequences in a text buffer.
2014-03-17 21:26:21 +01:00
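Advancing the line number from a text buffer can be sketched as below. The helper name `advance_line` is hypothetical; only the approach of counting "\n" sequences comes from the commit message.

```ruby
# Hypothetical helper, not the lexer's actual code.
# line: the current 1-based line number
# text: the text buffer that was just lexed
def advance_line(line, text)
  # String#count with "\n" returns the number of newline characters.
  line + text.count("\n")
end
```

For example, a buffer holding two newlines moves the lexer from line 1 to line 3.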
Yorick Peterse 8898d08831 Compacter parser AST.
The AST no longer uses the generic `element` type for element nodes but instead
changes the type based on the element type. That is, a <p> element now results
in a (p) node, <link> in (link), etc.
2014-03-17 21:03:54 +01:00
Yorick Peterse 8d3f3f15d7 Renamed parse_html() to parse(). 2014-03-16 23:46:20 +01:00
Yorick Peterse cb75edc30d Basic support for lexing/parsing HTML5.
This will need a bunch of extra tests before I'll consider closing #7.
2014-03-16 23:42:24 +01:00
Yorick Peterse ce8bbdb64a Parsing support for multiple nested nodes. 2014-03-15 20:19:54 +01:00
Yorick Peterse 05ee3c13c9 Parsing support for nested element/text nodes. 2014-03-14 00:44:11 +01:00
Yorick Peterse 6b2f682c5c Tests for lexing a basic HTML document.
This also comes with some changes to the lexer so that it advances column/line
numbers correctly.
2014-03-13 23:55:18 +01:00
Yorick Peterse edf2e4112b Added a test for parsing bare text tokens. 2014-03-13 00:42:58 +01:00
Yorick Peterse 34f8779c94 Lexing of bare regular text.
This is currently a bit of a hack but at least we're slowly getting there.
2014-03-13 00:42:12 +01:00
Yorick Peterse 2fbca93ae8 Support for parsing nested elements. 2014-03-12 23:13:28 +01:00
Yorick Peterse 8cfa81aed9 Basic support for parsing elements.
This includes support for elements with namespaces and attributes. Nested
elements are not yet supported.
2014-03-12 23:02:54 +01:00
Yorick Peterse 5ce515d224 Small line wrapping change in the lexer. 2014-03-12 22:42:13 +01:00
Yorick Peterse 98b3443e7f Lexing of element attributes without values. 2014-03-12 22:41:17 +01:00
Yorick Peterse ed9d8c05a2 Added support for parsing comments. 2014-03-12 22:20:12 +01:00
Yorick Peterse 0a396043f8 Support for parsing CDATA tags. 2014-03-11 22:22:02 +01:00
Yorick Peterse c9592856f0 Updated parsing of doctypes.
The resulting nodes now separate the type, public and system IDs into separate
string values.
2014-03-11 22:08:21 +01:00
Yorick Peterse c07edc767b Updated the gitignore entry for the parser. 2014-03-11 22:03:02 +01:00
Yorick Peterse 4a41894e2c Updated the doctype parser specs. 2014-03-11 22:02:26 +01:00
Yorick Peterse 8ce76be050 Moved the parser class to Oga::Parser.
Oga will use the same parser for XML and HTML so it doesn't make sense to
separate the two into different namespaces (at least for now).
2014-03-11 22:01:50 +01:00
Yorick Peterse 77b40d2e81 Use a separate machine for closing tags.
This makes it easier to advance column numbers for whitespace as well as
capturing and emitting tokens for the closing tag.
2014-03-11 21:55:36 +01:00
Yorick Peterse eacd9b88cf Reworked token generation for elements.
This emits separate tokens for the start tag (T_ELEMENT_OPEN) and name
(T_ELEMENT_NAME). This makes it easier to include the namespace of an element
(T_ELEMENT_NS) in the output.
2014-03-10 23:50:39 +01:00
Yorick Peterse cd53d5e426 Fixed advancing column numbers.
In a bunch of cases the column number would not be increased correctly.
2014-03-07 23:54:56 +01:00
Yorick Peterse 1c9a6c8b76 Tests for nested tags/text nodes.
Well guess what, apparently that did work. That was slightly unexpected.
2014-03-03 22:13:29 +01:00
Yorick Peterse a5a3b8db3f Basic lexing of HTML tags.
The current implementation is a bit messy. In particular the counting of column
numbers is not entirely the way it should be. There are also some problems with
nested tags/text that I still have to resolve.
2014-03-03 22:08:46 +01:00
Yorick Peterse d9ef33e1f8 Lexing of comments.
This fixes #4.
2014-02-28 23:27:23 +01:00
Yorick Peterse 92ae48f905 Use fcall + fret instead of fgoto.
This removes the hardcoded return to the main machine.
2014-02-28 23:19:31 +01:00
Yorick Peterse 30d3e455d1 Use squote/dquote everywhere in the lexer. 2014-02-28 23:18:23 +01:00
Yorick Peterse 970ce27283 Cleanup of buffering text/strings.
This removes the need to use ||= and such, which should speed things up a bit
and keeps the code cleaner.
2014-02-28 23:16:01 +01:00
Yorick Peterse ca6f422036 Lexing of doctypes.
This comes with various structural changes to the lexer as I'm slowly starting
to get the hang of Ragel. Ragel is a beast but damn it's an awesome piece of
software.

Note that the doctype public/system IDs are lexed as T_STRING. The parser will
figure out whether an ID is a public or system ID based on the order.

This fixes #1
2014-02-28 23:08:55 +01:00
Yorick Peterse 3c825afee0 Cleaned up lexer rules a bit.
There's no benefit to adding variables for angle brackets and such, it's much
easier to grok to just use them directly.
2014-02-28 20:09:13 +01:00