1190 Commits

Author SHA1 Message Date
Yorick Peterse 79818eb349 Added a convenience class for parsing HTML.
This removes the need for users having to set the `:html` option themselves.
2014-03-25 09:40:24 +01:00
Yorick Peterse 58009614f6 Moved XML specs into spec/oga/xml. 2014-03-25 09:36:39 +01:00
Yorick Peterse 7c03de0e2f Renamed HTML_PARSER to PARSER_OUTPUT.
This keeps it consistent with the lexer.
2014-03-25 09:35:48 +01:00
Yorick Peterse eae13d21ed Namespaced the lexer/parser under Oga::XML.
With the upcoming XPath and CSS selector lexers/parsers it will be confusing to
keep these in the root namespace.
2014-03-25 09:34:38 +01:00
Yorick Peterse 2259061c89 Don't require the 2nd Lexer#add_token argument. 2014-03-24 21:35:47 +01:00
Yorick Peterse 641c54261e Simplified lexer output for comments. 2014-03-24 21:34:30 +01:00
Yorick Peterse eaf1669b07 Simplified lexer output for CDATA tags. 2014-03-24 21:33:05 +01:00
Yorick Peterse 470be5a839 Simplified the lexer output for doctypes. 2014-03-24 21:32:16 +01:00
Yorick Peterse ac775918ee Lexing/parsing of XML declaration tags.
This closes #12.
2014-03-24 21:30:19 +01:00
Yorick Peterse b695ecf0df Renamed element lexer tags.
T_ELEM_OPEN has been renamed to T_ELEM_START, T_ELEM_CLOSE has been renamed to
T_ELEM_END. This keeps the token names consistent with the other ones (e.g.
T_COMMENT_START).
2014-03-24 20:32:43 +01:00
Yorick Peterse 0b6ba6e6b5 Fixed typ. 2014-03-24 20:20:19 +01:00
Yorick Peterse ca66339a08 README entry on donations. 2014-03-24 20:13:16 +01:00
Yorick Peterse 52abc9d29e Basic documentation for Oga::Parser. 2014-03-23 21:29:57 +01:00
Yorick Peterse 19c1d66287 Use String#unpack instead of String#codepoints.
The latter returns an Enumerator, which on Ruby 1.9.3 doesn't have #length
available. Besides this it's better to just return an Array since we'll iterate
over every character anyway.
2014-03-23 21:21:27 +01:00
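
The switch described in the commit above can be illustrated with a minimal sketch. It assumes UTF-8 input; `input`, `codepoints` and `first_char` are illustrative names and are not taken from the Oga source.

```ruby
# Illustrative only: `input` is a hypothetical sample, not Oga's lexer code.
input = "<p>h\u00e9llo</p>"

# String#unpack with the 'U*' directive decodes the UTF-8 string into an
# Array of Integer codepoints in a single call, so #length is always available.
codepoints = input.unpack('U*')

puts codepoints.class
puts codepoints.length

# A codepoint is only turned back into a String when it is actually needed.
first_char = [codepoints[0]].pack('U')
```

Because the result is a plain Array, it can be indexed and measured the same way on every Ruby version.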
Yorick Peterse a2452b6371 Use codepoints instead of chars in the lexer.
Grand wizard overlord @whitequark recommended this as it avoids creating an
individual String instance for every character (at least until one is needed).
This becomes noticeable on large inputs (e.g. 100 MB of XML). Previously these
would result in the kernel OOM killing the process. With codepoints, memory
usage increases by a "mere" 1-1.5 GB.
2014-03-23 20:20:07 +01:00
Yorick Peterse cdf5f1d541 Improve lexer performance by 20x or so.
This was a rather interesting turn of events. As it turned out the Ragel
generated lexer was extremely slow on large inputs. For example, lexing
benchmark/fixtures/hrs.html took around 10 seconds according to the benchmark
benchmark/lexer/bench_html_time.rb:

    Rehearsal --------------------------------------------------------
    lex HTML              10.870000   0.000000  10.870000 ( 10.877920)
    ---------------------------------------------- total: 10.870000sec

                               user     system      total        real
    lex HTML              10.440000   0.010000  10.450000 ( 10.449500)

The corresponding benchmark-ips benchmark (bench_html.rb) presented the
following results:

    Calculating -------------------------------------
                lex HTML         1 i/100ms
    -------------------------------------------------
                lex HTML        0.1 (±0.0%) i/s -          1 in  10.472534s

10 seconds for around 165 KB of HTML was not acceptable. I spent a good deal of
time profiling things, even submitting a patch to Ragel
(https://github.com/athurston/ragel/pull/1). At some point I decided to give a
pure C lexer + FFI bindings a try (so it would also work on JRuby). Trying to
write C reminded me why I didn't want to do it in C in the first place.

Around 2AM I gave up and went to brush my teeth and head to bed. Then, a
miracle happened. More precisely, I actually gave my brain some time to think
away from the computer. I said to myself:

    What if I feed Ragel an Array of characters instead of an entire String?
    That way I bypass String#[] being expensive without having to change all of
    Ragel or use a different language.

The results of this change are rather interesting. With these changes the
benchmark bench_html_time.rb now gives back the following:

    Rehearsal --------------------------------------------------------
    lex HTML               0.550000   0.000000   0.550000 (  0.550649)
    ----------------------------------------------- total: 0.550000sec

                               user     system      total        real
    lex HTML               0.520000   0.000000   0.520000 (  0.520713)

The benchmark bench_html.rb in turn gives back this:

    Calculating -------------------------------------
                lex HTML         1 i/100ms
    -------------------------------------------------
                lex HTML        2.0 (±0.0%) i/s -         10 in   5.120905s

According to both benchmarks we now have a speedup of about 20 times without
having to make any further changes to Ragel or the lexer itself.

I love it when a plan comes together.
2014-03-23 12:46:22 +01:00
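
The idea behind the commit above, indexing into an Array instead of a String, can be sketched as follows. This is not the Ragel-generated lexer: `count_open_brackets` is a hypothetical stand-in for a state machine that reads `data[index]` on every step, and the input is made up.

```ruby
# Hypothetical stand-in for a lexer's main loop: walk the input one index at a
# time and count how often `needle` occurs.
def count_open_brackets(data, needle)
  count  = 0
  index  = 0
  length = data.length

  while index < length
    count += 1 if data[index] == needle
    index += 1
  end

  count
end

input = '<p>text</p>' * 10_000

# Indexing a String: every data[index] call returns a newly allocated
# one-character String.
from_string = count_open_brackets(input, '<')

# Indexing an Array of codepoints: every data[index] call is a plain lookup
# that returns an Integer.
from_array = count_open_brackets(input.unpack('U*'), '<'.ord)

puts from_string
puts from_array
```

Both calls produce the same count; only the cost of each `data[index]` differs. Actual timings depend on the Ruby version and the input, so they are not reproduced here.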
Yorick Peterse 4b914b3d6f Added extra benchmarks for lexing large inputs. 2014-03-23 12:46:04 +01:00
Yorick Peterse 0e9d9b844c Removed duplicate start_element rule. 2014-03-21 18:54:47 +01:00
Yorick Peterse 56ed9e949c Use index based buffering for strings.
This uses the same system as for T_TEXT nodes.
2014-03-21 17:45:40 +01:00
Yorick Peterse d7a40ec470 Simple benchmark for lexing elements. 2014-03-21 17:45:23 +01:00
Yorick Peterse 9fa694ad4f Use index based buffers for text nodes.
Instead of appending single characters to a String buffer the lexer now uses a
start and end position to figure out what the buffer is. This is a lot faster
than constantly appending to a String.
2014-03-21 17:32:07 +01:00
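
The difference between the two buffering strategies in the commit above can be sketched as follows. The variable names and the hard-coded positions are assumptions for illustration; in the real lexer the positions would come from the state machine.

```ruby
# Hypothetical sketch, not Oga's lexer: collect the text between two tags.
data = '<p>Hello world</p>'.unpack('U*')

# Appending approach: grow a String buffer one character at a time.
appended = +''
data[3...14].each { |codepoint| appended << [codepoint].pack('U') }

# Index based approach: only remember where the text starts and ends, then
# build the String once when the token is emitted.
text_start = 3
text_end   = 14
sliced     = data[text_start...text_end].pack('U*')

puts sliced
```

The index based version creates one String per token instead of one per character.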
Yorick Peterse 2852afce9b Benchmark for measuring CDATA lexing. 2014-03-21 16:59:44 +01:00
Yorick Peterse 55f116124c Fix for showing lines in parser errors. 2014-03-21 00:16:20 +01:00
Yorick Peterse 7749f4abce Corrected a comment in the parser. 2014-03-21 00:10:20 +01:00
Yorick Peterse a20ec0000a Show up to 5 surrounding lines in parser errors. 2014-03-20 23:40:25 +01:00
Yorick Peterse 91fb7523fd Lex open tags with newlines in them. 2014-03-20 23:39:29 +01:00
Yorick Peterse ba17996bfc Fancier error messages for the parser.
The error messages of the parser now contain surrounding lines of code instead
of only the offending line of code. This should make debugging a bit easier.
Line numbers are also shown for each line.
2014-03-20 23:30:24 +01:00
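
An error message with surrounding, numbered lines could be built roughly like this. `surrounding_lines`, its `context` parameter and the `=>` marker are hypothetical; the output format of Oga's parser is not shown in the commit message.

```ruby
# Hypothetical helper, not Oga's implementation: format the lines around a
# failing line, each prefixed with its line number.
def surrounding_lines(source, error_line, context = 2)
  lines = source.lines.to_a
  first = [error_line - context, 1].max
  last  = [error_line + context, lines.length].min

  (first..last).map do |number|
    # Mark the offending line so it stands out from its neighbours.
    marker = number == error_line ? '=>' : '  '

    "#{marker} #{number}: #{lines[number - 1].chomp}"
  end.join("\n")
end

source = "<a>\n<b>\n<c\n</b>\n</a>\n"

puts surrounding_lines(source, 3, 1)
```

The range is clamped at both ends so that errors on the first or last line still produce valid output.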
Yorick Peterse 74bc11a239 Rip out column counting.
This makes both the lexer and parser quite a bit easier to use. Counting column
numbers also isn't really needed when parsing XML/HTML.
2014-03-20 19:44:28 +01:00
Yorick Peterse 70a39042e7 Removed useless rules from the parser. 2014-03-20 18:58:32 +01:00
Yorick Peterse 03774f2788 Documented the lexer. 2014-03-19 22:05:57 +01:00
Yorick Peterse 192ba9bb54 Expanded the lexer comment tests. 2014-03-19 21:44:57 +01:00
Yorick Peterse f1fcdfbacb Cleaned up the Ragel bits of the lexer.
This removes some of the complexity that existed before (e.g. too many state
machines) and fixes a bunch of problems with nested data.
2014-03-19 21:44:10 +01:00
Yorick Peterse 7271e74396 Revert "Compacter parser AST."
Although this AST is more compact it will result in conflicts between (text),
(attributes) and (attribute) nodes in regular XML documents. This is due to XML
allowing elements with these names (unlike in HTML).

This reverts commit 8898d08831.
2014-03-18 18:55:16 +01:00
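
The conflict described in the revert above can be shown with s-expression nodes written as plain Arrays. The node shapes are assumptions for illustration and not Oga's actual AST.

```ruby
# Generic form: the element name is stored as data, so it can never collide
# with a node type.
generic_text_node    = [:text, 'hello']
generic_text_element = [:element, 'text', [[:text, 'hello']]]

# Compact form from commit 8898d08831: the element name becomes the node type.
# This is what <text>hello</text> would turn into.
compact_text_node    = [:text, 'hello']
compact_text_element = [:text, [[:text, 'hello']]]

# With the generic form the two kinds of node have different types.
puts generic_text_node.first == generic_text_element.first

# With the compact form both nodes are typed :text and can no longer be told
# apart by type alone.
puts compact_text_node.first == compact_text_element.first
```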
Yorick Peterse 9687dd379f Added a .ruby-version file. 2014-03-18 18:08:25 +01:00
Yorick Peterse 56f22c311e Allow JRuby to fail for now. 2014-03-18 00:13:33 +01:00
Yorick Peterse 422832fd68 Lowered the required Ragel version to 6.7. 2014-03-18 00:12:21 +01:00
Yorick Peterse 091e32c17a Install Ragel on Travis CI. 2014-03-18 00:09:16 +01:00
Yorick Peterse 8d4d3999b5 Configuration file for Travis CI. 2014-03-17 21:52:24 +01:00
Yorick Peterse 9975c9c430 Removed the emit_text_buffer Ragel action. 2014-03-17 21:49:49 +01:00
Yorick Peterse 274ab359ba Don't use separate tokens/nodes for newlines.
Newlines are now lexed together with regular text. The line numbers are
advanced based on the number of "\n" sequences in a text buffer.
2014-03-17 21:26:21 +01:00
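
Advancing the line number from the text buffer itself can be sketched as follows. The `[type, value, line]` token layout is an assumption made for this example.

```ruby
# Hypothetical sketch: advance a line counter from the text tokens themselves
# instead of emitting a separate token for every newline.
line   = 1
tokens = []

["first\nsecond\n", 'third', "\nfourth"].each do |text|
  # Each token records the line on which its text starts.
  tokens << [:T_TEXT, text, line]

  # String#count returns how many "\n" characters the buffer contains.
  line += text.count("\n")
end

tokens.each { |token| p token }
```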
Yorick Peterse 8898d08831 Compacter parser AST.
The AST no longer uses the generic `element` type for element nodes but instead
changes the type based on the element type. That is, a <p> element now results
in an (p) node, <link> in (link), etc.
2014-03-17 21:03:54 +01:00
Yorick Peterse 8d3f3f15d7 Renamed parse_html() to parse(). 2014-03-16 23:46:20 +01:00
Yorick Peterse cb75edc30d Basic support for lexing/parsing HTML5.
This will need a bunch of extra tests before I'll consider closing #7.
2014-03-16 23:42:24 +01:00
Yorick Peterse ce8bbdb64a Parsing support for multiple nested nodes. 2014-03-15 20:19:54 +01:00
Yorick Peterse 05ee3c13c9 Parsing support for nested element/text nodes. 2014-03-14 00:44:11 +01:00
Yorick Peterse 6b2f682c5c Tests for lexing a basic HTML document.
This also comes with some changes to the lexer so that it advances column/line
numbers correctly.
2014-03-13 23:55:18 +01:00
Yorick Peterse edf2e4112b Added a test for parsing bare text tokens. 2014-03-13 00:42:58 +01:00
Yorick Peterse 34f8779c94 Lexing of bare regular text.
This is currently a bit of a hack but at least we're slowly getting there.
2014-03-13 00:42:12 +01:00
Yorick Peterse 2fbca93ae8 Support for parsing nested elements. 2014-03-12 23:13:28 +01:00
Yorick Peterse 8cfa81aed9 Basic support for parsing elements.
This includes support for elements with namespaces and attributes. Nested
elements are not yet supported.
2014-03-12 23:02:54 +01:00