
88 Commits

Author SHA1 Message Date
Yorick Peterse 8237d5791d Stream tokens when lexing.
Instead of returning the tokens as a whole they are now streamed using
XML::Lexer#advance. This method returns the next token upon every call. It uses
a small buffer in case a particular block of text results in multiple tokens.
2014-04-09 22:08:13 +02:00
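
The streaming approach described in this commit can be sketched roughly as follows. This is a minimal illustration, not Oga's actual XML::Lexer: the class name `StreamingLexer`, the whitespace-based `tokenize` helper, and the chunked input are all assumptions made for the example. Only the idea of an `advance` method backed by a small token buffer comes from the commit message.

```ruby
# Hypothetical stand-in for a streaming lexer; not Oga's real XML::Lexer.
class StreamingLexer
  def initialize(chunks)
    @chunks = chunks.dup # blocks of input text that still have to be lexed
    @buffer = []         # tokens that were produced but not yet handed out
  end

  # Returns the next token, or nil once all input has been consumed.
  def advance
    # Refill the buffer only when it is empty. A single block of text may
    # produce several tokens (or none at all), hence the loop.
    while @buffer.empty?
      chunk = @chunks.shift
      return nil if chunk.nil?

      @buffer.concat(tokenize(chunk))
    end

    @buffer.shift
  end

  private

  # Toy tokenizer (assumption): every whitespace separated word is a token.
  def tokenize(chunk)
    chunk.split.map { |word| [:T_TEXT, word] }
  end
end

lexer  = StreamingLexer.new(['foo bar', '', 'baz'])
tokens = []

while (token = lexer.advance)
  tokens << token
end
```

A caller pulls tokens one at a time, so the full token list never has to exist in memory unless the caller collects it (as the loop above does for demonstration purposes).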
Yorick Peterse e9bb97d261 First steps towards making the lexer stream tokens 2014-04-09 19:32:06 +02:00
Yorick Peterse cb74c7edf9 Specs for XML parser errors. 2014-04-07 21:31:36 +02:00
Yorick Peterse 54ef125637 Basic docs for everything under Oga::XML. 2014-04-04 17:48:36 +02:00
Yorick Peterse 13a9228563 Properly indent doctype/XML decl inspect values. 2014-04-04 11:13:39 +02:00
Yorick Peterse 37a12722cb Rough setup for a custom #inspect format.
This format is a lot more readable than the default Ruby #inspect format
(mostly due to not including previous/next/parent nodes).
2014-04-04 00:41:29 +02:00
Yorick Peterse a2c525dd7c Insert newlines after XML dec/doctypes. 2014-04-03 23:04:21 +02:00
Yorick Peterse 230fafa2d3 Document should not inherit from Node.
A document is not an XML node in itself. If logic has to be shared between the
Document and the Node class I'll resort to using mixins for this.
2014-04-03 22:45:40 +02:00
Yorick Peterse c077988dd6 Tree building of doctypes. 2014-04-03 22:44:00 +02:00
Yorick Peterse 81b1155af3 Lex/parse doctype names separately. 2014-04-03 21:59:57 +02:00
Yorick Peterse 8185656c1e Fixed typ. 2014-04-03 21:41:31 +02:00
Yorick Peterse 30c01a5aee Tests for XML::TreeBuilder#handler_missing. 2014-04-03 09:43:30 +02:00
Yorick Peterse bdb76cefc5 Dedicated handling of XML declaration nodes. 2014-04-02 22:30:45 +02:00
Yorick Peterse d6c0a1f3f3 Lex/parse XML declaration attributes. 2014-04-02 22:01:17 +02:00
Yorick Peterse f99c13b516 Tests + docs for the TreeBuilder class. 2014-03-28 17:11:54 +01:00
Yorick Peterse 6d866523b8 Renamed XML::Builder to XML::TreeBuilder. 2014-03-28 16:37:37 +01:00
Yorick Peterse e141c084f9 Dedicated DOM builder class for CDATA tags. 2014-03-28 09:27:53 +01:00
Yorick Peterse 2b250bbf42 Rough DOM building setup. 2014-03-28 08:59:48 +01:00
Yorick Peterse 6ae52c1b12 Initial rough sketches for the DOM API. 2014-03-26 18:12:00 +01:00
Yorick Peterse 4a48647d1e Removed generated lexer/parser.
I am a dumbass.
2014-03-25 21:47:40 +01:00
Yorick Peterse fb626278a8 Re-wrapped comments in the XML lexer. 2014-03-25 10:12:39 +01:00
Yorick Peterse 8ebd72158c Renamed XML::Lexer#t to #emit(). 2014-03-25 09:42:52 +01:00
Yorick Peterse 79818eb349 Added a convenience class for parsing HTML.
This removes the need for users having to set the `:html` option themselves.
2014-03-25 09:40:24 +01:00
Yorick Peterse eae13d21ed Namespaced the lexer/parser under Oga::XML.
With the upcoming XPath and CSS selector lexers/parsers it will be confusing to
keep these in the root namespace.
2014-03-25 09:34:38 +01:00
Yorick Peterse 2259061c89 Don't require the 2nd Lexer#add_token argument. 2014-03-24 21:35:47 +01:00
Yorick Peterse 641c54261e Simplified lexer output for comments. 2014-03-24 21:34:30 +01:00
Yorick Peterse eaf1669b07 Simplified lexer output for CDATA tags. 2014-03-24 21:33:05 +01:00
Yorick Peterse 470be5a839 Simplified the lexer output for doctypes. 2014-03-24 21:32:16 +01:00
Yorick Peterse ac775918ee Lexing/parsing of XML declaration tags.
This closes #12.
2014-03-24 21:30:19 +01:00
Yorick Peterse b695ecf0df Renamed element lexer tags.
T_ELEM_OPEN has been renamed to T_ELEM_START, T_ELEM_CLOSE has been renamed to
T_ELEM_END. This keeps the token names consistent with the other ones (e.g.
T_COMMENT_START).
2014-03-24 20:32:43 +01:00
Yorick Peterse 52abc9d29e Basic documentation for Oga::Parser. 2014-03-23 21:29:57 +01:00
Yorick Peterse 19c1d66287 Use String#unpack instead of String#codepoints.
The latter returns an Enumerable which on Ruby 1.9.3 doesn't have #length
available. Besides this it's better to just return an Array since we'll iterate
over every character anyway.
2014-03-23 21:21:27 +01:00
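
As a small illustration of the change above: `String#unpack` with the `U*` directive decodes a UTF-8 string into a plain Array of codepoints, which always responds to `#length`. The sample string is an assumption made for the example.

```ruby
input = "h\u00e9llo"

# 'U*' decodes every UTF-8 character into its Unicode codepoint and returns
# a regular Array, regardless of the Ruby version in use.
codepoints = input.unpack('U*')

# The Array can be indexed and measured directly.
length = codepoints.length

# Packing the codepoints again restores the original String.
restored = codepoints.pack('U*')
```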
Yorick Peterse a2452b6371 Use codepoints instead of chars in the lexer.
Grand wizard overlord @whitequark recommended this as it will bypass the need
for creating individual String instances for every character (at least not until
needed). This becomes noticeable on large inputs (e.g. 100 MB of XML).
Previously these would result in the kernel OOM killing the process. Using
codepoints, memory increases by a "mere" 1-1.5 GB.
2014-03-23 20:20:07 +01:00
Yorick Peterse cdf5f1d541 Improve lexer performance by 20x or so.
This was a rather interesting turn of events. As it turned out the Ragel
generated lexer was extremely slow on large inputs. For example, lexing
benchmark/fixtures/hrs.html took around 10 seconds according to the benchmark
benchmark/lexer/bench_html_time.rb:

    Rehearsal --------------------------------------------------------
    lex HTML              10.870000   0.000000  10.870000 ( 10.877920)
    ---------------------------------------------- total: 10.870000sec

                               user     system      total        real
    lex HTML              10.440000   0.010000  10.450000 ( 10.449500)

The corresponding benchmark-ips benchmark (bench_html.rb) presented the
following results:

    Calculating -------------------------------------
                lex HTML         1 i/100ms
    -------------------------------------------------
                lex HTML        0.1 (±0.0%) i/s -          1 in  10.472534s

10 seconds for around 165 KB of HTML was not acceptable. I spent a good amount
of time profiling things, even submitting a patch to Ragel
(https://github.com/athurston/ragel/pull/1). At some point I decided to give a
pure C lexer + FFI bindings a try (so it would also work on JRuby). Trying to
write C reminded me why I didn't want to do it in C in the first place.

Around 2AM I gave up and went to brush my teeth and head to bed. Then, a
miracle happened. More precisely, I actually gave my brain some time to think
away from the computer. I said to myself:

    What if I feed Ragel an Array of characters instead of an entire String?
    That way I bypass String#[] being expensive without having to change all of
    Ragel or use a different language.

The results of this change are rather interesting. With these changes the
benchmark bench_html_time.rb now gives back the following:

    Rehearsal --------------------------------------------------------
    lex HTML               0.550000   0.000000   0.550000 (  0.550649)
    ----------------------------------------------- total: 0.550000sec

                               user     system      total        real
    lex HTML               0.520000   0.000000   0.520000 (  0.520713)

The benchmark bench_html.rb in turn gives back this:

    Calculating -------------------------------------
                lex HTML         1 i/100ms
    -------------------------------------------------
                lex HTML        2.0 (±0.0%) i/s -         10 in   5.120905s

According to both benchmarks we now have a speedup of about 20 times without
having to make any further changes to Ragel or the lexer itself.

I love it when a plan comes together.
2014-03-23 12:46:22 +01:00
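
The "real" timings quoted in this commit work out to roughly 10.45 / 0.52 ≈ 20, which matches the jump from 0.1 to 2.0 iterations per second. The sketch below shows the general technique of scanning an Array of characters instead of indexing into a String. It is not the Ragel generated lexer: the `count_open_tags` helper and the generated input are assumptions, and the timings it prints depend on the Ruby version and machine, so no particular speedup is implied.

```ruby
require 'benchmark'

# Hypothetical input; the commit measured benchmark/fixtures/hrs.html instead.
input = "<p>caf\u00e9</p>" * 500

# Walks the data by index, the way a generated state machine would. It accepts
# either a String or an Array of single character Strings.
def count_open_tags(data)
  count = 0
  index = 0

  while index < data.length
    count += 1 if data[index] == '<'
    index += 1
  end

  count
end

# Splitting the input once up front avoids calling String#[] per character.
chars = input.chars.to_a

string_time = Benchmark.realtime { count_open_tags(input) }
array_time  = Benchmark.realtime { count_open_tags(chars) }

puts format('String: %.6fs, Array: %.6fs', string_time, array_time)
```

Both calls return the same count; only the cost of each `data[index]` lookup differs.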
Yorick Peterse 0e9d9b844c Removed duplicate start_element rule. 2014-03-21 18:54:47 +01:00
Yorick Peterse 56ed9e949c Use index based buffering for strings.
This uses the same system as for T_TEXT nodes.
2014-03-21 17:45:40 +01:00
Yorick Peterse 9fa694ad4f Use index based buffers for text nodes.
Instead of appending single characters to a String buffer the lexer now uses a
start and end position to figure out what the buffer is. This is a lot faster
than constantly appending to a String.
2014-03-21 17:32:07 +01:00
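
A rough sketch of the difference between the two buffering strategies, assuming the input is available as an Array of characters and that a text node ends at the next `<`. The method names are made up for the example and do not appear in the lexer.

```ruby
# Old approach: append every character to a String buffer.
def text_by_appending(chars)
  buffer = String.new

  chars.each do |char|
    break if char == '<'

    buffer << char
  end

  buffer
end

# New approach: only track where the text starts and ends, then slice once.
def text_by_index(chars)
  start = 0
  stop  = start

  stop += 1 while stop < chars.length && chars[stop] != '<'

  chars[start...stop].join
end

chars = 'hello <b>world</b>'.chars.to_a

appended = text_by_appending(chars)
sliced   = text_by_index(chars)
```

Both methods produce the same text; the index based variant defers building the String until the end position is known.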
Yorick Peterse 55f116124c Fix for showing lines in parser errors. 2014-03-21 00:16:20 +01:00
Yorick Peterse 7749f4abce Corrected a comment in the parser. 2014-03-21 00:10:20 +01:00
Yorick Peterse a20ec0000a Show up to 5 surrounding lines in parser errors. 2014-03-20 23:40:25 +01:00
Yorick Peterse 91fb7523fd Lex open tags with newlines in them. 2014-03-20 23:39:29 +01:00
Yorick Peterse ba17996bfc Fancier error messages for the parser.
The error messages of the parser now contain surrounding lines of code instead
of only the offending line of code. This should make debugging a bit easier.
Line numbers are also shown for each line.
2014-03-20 23:30:24 +01:00
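
One possible way to build such an error message is sketched below. The method name `error_context`, the `=>` marker, and the output layout are assumptions for illustration; the parser's real message format is not shown in this log.

```ruby
# Returns the lines surrounding `line` (1-based), each prefixed with its line
# number. The offending line is marked with "=>".
def error_context(source, line, radius = 5)
  lines = source.lines.to_a
  first = [line - radius, 1].max
  last  = [line + radius, lines.length].min

  (first..last).map do |number|
    marker = number == line ? '=>' : '  '

    format('%s %3d: %s', marker, number, lines[number - 1].chomp)
  end.join("\n")
end

source  = "<a>\n<b>\n</c>\n</a>\n"
message = error_context(source, 3, 1)

puts message
```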
Yorick Peterse 74bc11a239 Rip out column counting.
This makes both the lexer and parser quite a bit easier to use. Counting column
numbers also isn't really needed when parsing XML/HTML.
2014-03-20 19:44:28 +01:00
Yorick Peterse 70a39042e7 Removed useless rules from the parser. 2014-03-20 18:58:32 +01:00
Yorick Peterse 03774f2788 Documented the lexer. 2014-03-19 22:05:57 +01:00
Yorick Peterse f1fcdfbacb Cleaned up the Ragel bits of the lexer.
This removes some of the complexity that existed before (e.g. too many state
machines) and fixes a bunch of problems with nested data.
2014-03-19 21:44:10 +01:00
Yorick Peterse 7271e74396 Revert "Compacter parser AST."
Although this AST is more compact, it will result in conflicts between (text),
(attributes) and (attribute) nodes in regular XML documents. This is due to XML
allowing elements with these names (unlike in HTML).

This reverts commit 8898d08831.
2014-03-18 18:55:16 +01:00
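
The conflict can be illustrated with plain Arrays standing in for AST nodes. The node shapes and helper names below are hypothetical and are not the parser's actual output; they only show why a node type derived from the element name can collide with a built-in node type.

```ruby
# Generic scheme: every element uses the same node type.
def generic_element(name, *children)
  [:element, name, *children]
end

# Compact scheme: the node type is derived from the element name.
def compact_element(name, *children)
  [name.to_sym, *children]
end

# Built-in node type for character data.
def text_node(value)
  [:text, value]
end

plain_text = text_node('hello')

# An XML document is free to contain an element called <text>.
generic = generic_element('text', text_node('hello'))
compact = compact_element('text', text_node('hello'))

generic_collides = generic.first == plain_text.first
compact_collides = compact.first == plain_text.first
```

With the generic scheme the element and the text node keep distinct types, while the compact scheme gives both the type `:text`.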
Yorick Peterse 9975c9c430 Removed the emit_text_buffer Ragel action. 2014-03-17 21:49:49 +01:00
Yorick Peterse 274ab359ba Don't use separate tokens/nodes for newlines.
Newlines are now lexed together with regular text. The line numbers are
advanced based on the amount of "\n" sequences in a text buffer.
2014-03-17 21:26:21 +01:00
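
A minimal sketch of advancing the line number from a text buffer, assuming the lexer keeps a 1-based line counter. The helper name `advance_line` and the sample buffers are made up for the example.

```ruby
# Returns the new line number after consuming a block of text.
def advance_line(line, text)
  line + text.count("\n")
end

line = 1

["first\nsecond\n", 'third', "\nfourth"].each do |text|
  line = advance_line(line, text)
end
```

The three buffers contain three newlines in total, so the counter ends up on line 4.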
Yorick Peterse 8898d08831 Compacter parser AST.
The AST no longer uses the generic `element` type for element nodes but instead
changes the type based on the element type. That is, a <p> element now results
in a (p) node, <link> in (link), etc.
2014-03-17 21:03:54 +01:00