Commit Graph

61 Commits

Author SHA1 Message Date
Yorick Peterse 470be5a839 Simplified the lexer output for doctypes. 2014-03-24 21:32:16 +01:00
Yorick Peterse ac775918ee Lexing/parsing of XML declaration tags.
This closes #12.
2014-03-24 21:30:19 +01:00
Yorick Peterse b695ecf0df Renamed element lexer tags.
T_ELEM_OPEN has been renamed to T_ELEM_START, T_ELEM_CLOSE has been renamed to
T_ELEM_END. This keeps the token names consistent with the other ones (e.g.
T_COMMENT_START).
2014-03-24 20:32:43 +01:00
Yorick Peterse 52abc9d29e Basic documentation for Oga::Parser. 2014-03-23 21:29:57 +01:00
Yorick Peterse 19c1d66287 Use String#unpack instead of String#codepoints.
The latter returns an Enumerable which on Ruby 1.9.3 doesn't have #length
available. Besides this it's better to just return an Array since we'll iterate
over every character anyway.
2014-03-23 21:21:27 +01:00
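
A minimal sketch of the String#unpack approach described in commit 19c1d66287 above. The input string and variable names are assumptions made for illustration; they are not taken from the Oga source.

```ruby
# Hypothetical input; "\u00e9" is a multibyte character in UTF-8.
input = "<p>h\u00e9llo</p>"

# The 'U*' directive decodes the String as UTF-8 and returns a plain Array of
# Integer codepoints, one per character.
codepoints = input.unpack('U*')

# An Array always responds to #length.
length = codepoints.length

# Packing the codepoints with the same directive restores the String.
restored = codepoints.pack('U*')

puts length
puts restored == input
```

Because the result is already an Array, it can be indexed and measured directly without converting an enumerator first.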
Yorick Peterse a2452b6371 Use codepoints instead of chars in the lexer.
Grand wizard overlord @whitequark recommended this as it will bypass the need
for creating an individual String instance for every character (at least not
until needed). This becomes noticeable on large inputs (e.g. 100 MB of XML).
Previously these would result in the kernel OOM killing the process. Using
codepoints, memory increases by a "mere" 1-1.5 GB.
2014-03-23 20:20:07 +01:00
Yorick Peterse cdf5f1d541 Improve lexer performance by 20x or so.
This was a rather interesting turn of events. As it turned out the Ragel
generated lexer was extremely slow on large inputs. For example, lexing
benchmark/fixtures/hrs.html took around 10 seconds according to the benchmark
benchmark/lexer/bench_html_time.rb:

    Rehearsal --------------------------------------------------------
    lex HTML              10.870000   0.000000  10.870000 ( 10.877920)
    ---------------------------------------------- total: 10.870000sec

                               user     system      total        real
    lex HTML              10.440000   0.010000  10.450000 ( 10.449500)

The corresponding benchmark-ips benchmark (bench_html.rb) presented the
following results:

    Calculating -------------------------------------
                lex HTML         1 i/100ms
    -------------------------------------------------
                lex HTML        0.1 (±0.0%) i/s -          1 in  10.472534s

10 seconds for around 165 KB of HTML was not acceptable. I spent a good amount
of time profiling things, even submitting a patch to Ragel
(https://github.com/athurston/ragel/pull/1). At some point I decided to give a
pure C lexer + FFI bindings a try (so it would also work on JRuby). Trying to
write C reminded me why I didn't want to do it in C in the first place.

Around 2AM I gave up and went to brush my teeth and head to bed. Then, a
miracle happened. More precisely, I actually gave my brain some time to think
away from the computer. I said to myself:

    What if I feed Ragel an Array of characters instead of an entire String?
    That way I bypass String#[] being expensive without having to change all of
    Ragel or use a different language.

The results of this change are rather interesting. With these changes the
benchmark bench_html_time.rb now gives back the following:

    Rehearsal --------------------------------------------------------
    lex HTML               0.550000   0.000000   0.550000 (  0.550649)
    ----------------------------------------------- total: 0.550000sec

                               user     system      total        real
    lex HTML               0.520000   0.000000   0.520000 (  0.520713)

The benchmark bench_html.rb in turn gives back this:

    Calculating -------------------------------------
                lex HTML         1 i/100ms
    -------------------------------------------------
                lex HTML        2.0 (±0.0%) i/s -         10 in   5.120905s

According to both benchmarks we now have a speedup of about 20 times without
having to make any further changes to Ragel or the lexer itself.

I love it when a plan comes together.
2014-03-23 12:46:22 +01:00
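
The idea in commit cdf5f1d541 above (index into an Array instead of calling String#[] for every character) can be sketched roughly as follows. This is not the Ragel-generated lexer: the method names, the input and the "count the `<` characters" task are all assumptions chosen to keep the example small. Timings depend on the Ruby version and machine, so no particular speedup is implied.

```ruby
require 'benchmark'

# Hypothetical input: a small non-ASCII HTML snippet repeated 500 times.
input      = "<p>h\u00e9llo</p>" * 500
codepoints = input.unpack('U*')

# Counts "<" characters by indexing into the String. Every String#[] call
# returns a newly allocated one-character String.
def count_lt_string(string)
  count  = 0
  length = string.length
  index  = 0

  while index < length
    count += 1 if string[index] == '<'
    index += 1
  end

  count
end

# Counts "<" characters by indexing into an Array of Integer codepoints.
def count_lt_array(codepoints)
  count  = 0
  length = codepoints.length
  target = '<'.ord
  index  = 0

  while index < length
    count += 1 if codepoints[index] == target
    index += 1
  end

  count
end

Benchmark.bm(10) do |bench|
  bench.report('String#[]') { count_lt_string(input) }
  bench.report('Array#[]')  { count_lt_array(codepoints) }
end
```

Both methods return the same count; only the way each character is fetched differs.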
Yorick Peterse 0e9d9b844c Removed duplicate start_element rule. 2014-03-21 18:54:47 +01:00
Yorick Peterse 56ed9e949c Use index based buffering for strings.
This uses the same system as for T_TEXT nodes.
2014-03-21 17:45:40 +01:00
Yorick Peterse 9fa694ad4f Use index based buffers for text nodes.
Instead of appending single characters to a String buffer the lexer now uses a
start and end position to figure out what the buffer is. This is a lot faster
than constantly appending to a String.
2014-03-21 17:32:07 +01:00
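
A rough sketch of the index-based buffering described in commit 9fa694ad4f above, next to the appending approach it replaced. The helper names and the codepoint Array input are assumptions for illustration, not the lexer's actual code.

```ruby
# Hypothetical input, stored as an Array of codepoints.
data = '<p>Hello</p>'.unpack('U*')

# Appending approach: grow a String one character at a time.
def text_by_appending(data, start, stop)
  buffer = +''

  start.upto(stop) do |index|
    buffer << [data[index]].pack('U')
  end

  buffer
end

# Index-based approach: remember only the start and end positions and build
# the String once, when the text is actually emitted.
def text_by_index(data, start, stop)
  data[start..stop].pack('U*')
end

# "Hello" occupies positions 3 through 7 of the input.
puts text_by_appending(data, 3, 7)
puts text_by_index(data, 3, 7)
```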
Yorick Peterse 55f116124c Fix for showing lines in parser errors. 2014-03-21 00:16:20 +01:00
Yorick Peterse 7749f4abce Corrected a comment in the parser. 2014-03-21 00:10:20 +01:00
Yorick Peterse a20ec0000a Show up to 5 surrounding lines in parser errors. 2014-03-20 23:40:25 +01:00
Yorick Peterse 91fb7523fd Lex open tags with newlines in them. 2014-03-20 23:39:29 +01:00
Yorick Peterse ba17996bfc Fancier error messages for the parser.
The error messages of the parser now contain surrounding lines of code instead
of only the offending line of code. This should make debugging a bit easier.
Line numbers are also shown for each line.
2014-03-20 23:30:24 +01:00
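
One way to produce the kind of error context described in commit ba17996bfc above is sketched here. The `error_context` helper, its output format and the meaning of `surrounding` (lines shown on each side of the offending line) are assumptions, not the parser's actual implementation.

```ruby
# Returns the offending line plus up to `surrounding` lines before and after
# it, each prefixed with its line number. The offending line is marked "=>".
def error_context(source, line, surrounding = 5)
  lines = source.lines.to_a
  first = [line - surrounding, 1].max
  last  = [line + surrounding, lines.length].min

  (first..last).map do |number|
    marker = number == line ? '=>' : '  '

    format('%s %d: %s', marker, number, lines[number - 1].chomp)
  end.join("\n")
end

puts error_context("<a>\n<b>\n</a>\n</b>\n", 3, 1)
```

The range is clamped to the first and last line of the input, so an error near either edge simply shows fewer lines.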
Yorick Peterse 74bc11a239 Rip out column counting.
This makes both the lexer and parser quite a bit easier to use. Counting column
numbers also isn't really needed when parsing XML/HTML.
2014-03-20 19:44:28 +01:00
Yorick Peterse 70a39042e7 Removed useless rules from the parser. 2014-03-20 18:58:32 +01:00
Yorick Peterse 03774f2788 Documented the lexer. 2014-03-19 22:05:57 +01:00
Yorick Peterse f1fcdfbacb Cleaned up the Ragel bits of the lexer.
This removes some of the complexity that existed before (e.g. too many state
machines) and fixes a bunch of problems with nested data.
2014-03-19 21:44:10 +01:00
Yorick Peterse 7271e74396 Revert "Compacter parser AST."
Although this AST is more compact, it will result in conflicts between (text),
(attributes) and (attribute) nodes in regular XML documents. This is due to XML
allowing elements with these names (unlike in HTML).

This reverts commit 8898d08831.
2014-03-18 18:55:16 +01:00
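
The naming conflict behind the revert in commit 7271e74396 above can be illustrated with plain Arrays standing in for AST nodes. The builder methods and node layout are hypothetical; the real AST nodes may look different.

```ruby
# A text node, as produced for character data.
def text_node(value)
  [:text, value]
end

# Compact form: the node type is taken from the element name.
def compact_element(name, *children)
  [name.to_sym, *children]
end

# Generic form: every element uses the same :element type and keeps its name
# as data.
def generic_element(name, *children)
  [:element, name, *children]
end

text    = text_node('Hello')
compact = compact_element('text', text_node('Hello'))
generic = generic_element('text', text_node('Hello'))

# In the compact form an XML element named <text> has the same node type as a
# real text node, so the two can no longer be told apart by type alone.
puts text.first == compact.first

# The generic form keeps them distinct.
puts text.first == generic.first
```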
Yorick Peterse 9975c9c430 Removed the emit_text_buffer Ragel action. 2014-03-17 21:49:49 +01:00
Yorick Peterse 274ab359ba Don't use separate tokens/nodes for newlines.
Newlines are now lexed together with regular text. The line numbers are
advanced based on the number of "\n" sequences in a text buffer.
2014-03-17 21:26:21 +01:00
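
A small sketch of the line counting described in commit 274ab359ba above. The `advance_line` helper is a hypothetical name; only the use of String#count is standard Ruby.

```ruby
# Returns the new line number after lexing a text buffer that started on
# `line`: one line is added for every "\n" in the buffer.
def advance_line(line, text)
  line + text.count("\n")
end

line = 1
line = advance_line(line, "foo\nbar\nbaz")

puts line
```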
Yorick Peterse 8898d08831 Compacter parser AST.
The AST no longer uses the generic `element` type for element nodes but instead
changes the type based on the element type. That is, a <p> element now results
in a (p) node, <link> in (link), etc.
2014-03-17 21:03:54 +01:00
Yorick Peterse cb75edc30d Basic support for lexing/parsing HTML5.
This will need a bunch of extra tests before I'll consider closing #7.
2014-03-16 23:42:24 +01:00
Yorick Peterse ce8bbdb64a Parsing support for multiple nested nodes. 2014-03-15 20:19:54 +01:00
Yorick Peterse 05ee3c13c9 Parsing support for nested element/text nodes. 2014-03-14 00:44:11 +01:00
Yorick Peterse 6b2f682c5c Tests for lexing a basic HTML document.
This also comes with some changes to the lexer so that it advances column/line
numbers correctly.
2014-03-13 23:55:18 +01:00
Yorick Peterse 34f8779c94 Lexing of bare regular text.
This is currently a bit of a hack but at least we're slowly getting there.
2014-03-13 00:42:12 +01:00
Yorick Peterse 2fbca93ae8 Support for parsing nested elements. 2014-03-12 23:13:28 +01:00
Yorick Peterse 8cfa81aed9 Basic support for parsing elements.
This includes support for elements with namespaces and attributes. Nested
elements are not yet supported.
2014-03-12 23:02:54 +01:00
Yorick Peterse 5ce515d224 Small line wrapping change in the lexer. 2014-03-12 22:42:13 +01:00
Yorick Peterse 98b3443e7f Lexing of element attributes without values. 2014-03-12 22:41:17 +01:00
Yorick Peterse ed9d8c05a2 Added support for parsing comments. 2014-03-12 22:20:12 +01:00
Yorick Peterse 0a396043f8 Support for parsing CDATA tags. 2014-03-11 22:22:02 +01:00
Yorick Peterse c9592856f0 Updated parsing of doctypes.
The resulting nodes now separate the type, public and system IDs into separate
string values.
2014-03-11 22:08:21 +01:00
Yorick Peterse c07edc767b Updated the gitignore entry for the parser. 2014-03-11 22:03:02 +01:00
Yorick Peterse 8ce76be050 Moved the parser class to Oga::Parser.
Oga will use the same parser for XML and HTML so it doesn't make sense to
separate the two into different namespaces (at least for now).
2014-03-11 22:01:50 +01:00
Yorick Peterse 77b40d2e81 Use a separate machine for closing tags.
This makes it easier to advance column numbers for whitespace as well as
capturing and emitting tokens for the closing tag.
2014-03-11 21:55:36 +01:00
Yorick Peterse eacd9b88cf Reworked token generation for elements.
This emits separate tokens for the start tag (T_ELEMENT_OPEN) and name
(T_ELEMENT_NAME). This makes it easier to include the namespace of an element
(T_ELEMENT_NS) in the output.
2014-03-10 23:50:39 +01:00
Yorick Peterse cd53d5e426 Fixed advancing column numbers.
In a bunch of cases the column number would not be increased correctly.
2014-03-07 23:54:56 +01:00
Yorick Peterse a5a3b8db3f Basic lexing of HTML tags.
The current implementation is a bit messy. In particular the counting of column
numbers is not entirely the way it should be. There are also some problems with
nested tags/text that I still have to resolve.
2014-03-03 22:08:46 +01:00
Yorick Peterse d9ef33e1f8 Lexing of comments.
This fixes #4.
2014-02-28 23:27:23 +01:00
Yorick Peterse 92ae48f905 Use fcall + fret instead of fgoto.
This removes the hardcoded return to the main machine.
2014-02-28 23:19:31 +01:00
Yorick Peterse 30d3e455d1 Use squote/dquote everywhere in the lexer. 2014-02-28 23:18:23 +01:00
Yorick Peterse 970ce27283 Cleanup of buffering text/strings.
This removes the need to use ||= and such, which should speed things up a bit
and keeps the code cleaner.
2014-02-28 23:16:01 +01:00
Yorick Peterse ca6f422036 Lexing of doctypes.
This comes with various structural changes to the lexer as I'm slowly starting
to get the hang of Ragel. Ragel is a beast but damn it's an awesome piece of
software.

Note that the doctype public/system IDs are lexed as T_STRING. The parser will
figure out whether an ID is a public or system ID based on the order.

This fixes #1
2014-02-28 23:08:55 +01:00
Yorick Peterse 3c825afee0 Cleaned up lexer rules a bit.
There's no benefit to adding variables for angle brackets and such; it's much
easier to grok to just use them directly.
2014-02-28 20:09:13 +01:00
Yorick Peterse 2294bf19f4 Better lexing of CDATA tags.
This means the lexer is now capable of lexing CDATA tags that contain text such
as ]].
2014-02-28 20:05:12 +01:00
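
The case handled by commit 2294bf19f4 above (a CDATA body that itself contains "]]") can be illustrated with a plain Ruby regular expression. This swaps in a regex for the Ragel rules, which are not shown in this log, so the pattern and constant name are assumptions.

```ruby
# Matches a CDATA tag and captures its body. The lazy ".*?" only stops at the
# first "]]" that is directly followed by ">", so a bare "]]" inside the body
# is treated as regular text.
CDATA_PATTERN = /<!\[CDATA\[(.*?)\]\]>/m

input = '<![CDATA[a ]] b]]>'
body  = input[CDATA_PATTERN, 1]

puts body
```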
Yorick Peterse 6138945d53 Moved some of the CDATA docs around. 2014-02-28 00:04:44 +01:00
Yorick Peterse 4883ac7384 Lexing of CDATA tags. 2014-02-28 00:03:37 +01:00