311 Commits

Author SHA1 Message Date
Yorick Peterse 230fafa2d3 Document should not inherit from Node.
A document is not an XML node in itself. If logic has to be shared between the
Document and the Node class I'll resort to using mixins for this.
2014-04-03 22:45:40 +02:00
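
A minimal sketch of the mixin approach mentioned in the commit above. The module name `Traversal`, the method `each_node` and the simplified `Node`/`Document` classes are hypothetical and only illustrate sharing logic without inheritance; they are not taken from the project's code.

```ruby
# Hypothetical mixin holding logic that both classes need.
module Traversal
  # Yields every descendant, depth first. Relies on the including class
  # providing a #children method.
  def each_node(&block)
    children.each do |child|
      yield child
      child.each_node(&block) if child.respond_to?(:each_node)
    end
  end
end

# Simplified stand-in for an XML node.
class Node
  include Traversal

  attr_reader :name, :children

  def initialize(name, children = [])
    @name     = name
    @children = children
  end
end

# Simplified stand-in for a document. Note that it does not inherit from Node.
class Document
  include Traversal

  attr_reader :children

  def initialize(children = [])
    @children = children
  end
end

document = Document.new([Node.new('a', [Node.new('b')])])
names    = []

document.each_node { |node| names << node.name }
```

Both classes get `each_node` from the module, while `Document` stays outside the `Node` class hierarchy.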
Yorick Peterse c077988dd6 Tree building of doctypes. 2014-04-03 22:44:00 +02:00
Yorick Peterse 81b1155af3 Lex/parse doctype names separately. 2014-04-03 21:59:57 +02:00
Yorick Peterse 8185656c1e Fixed typ. 2014-04-03 21:41:31 +02:00
Yorick Peterse 6cf906e500 Lexer tests for single quoted attributes. 2014-04-03 18:50:07 +02:00
Yorick Peterse 30c01a5aee Tests for XML::TreeBuilder#handler_missing. 2014-04-03 09:43:30 +02:00
Yorick Peterse 0f129ceac9 Tests for XML::TreeBuilder#on_comment. 2014-04-03 09:38:18 +02:00
Yorick Peterse bdb76cefc5 Dedicated handling of XML declaration nodes. 2014-04-02 22:30:45 +02:00
Yorick Peterse d6c0a1f3f3 Lex/parse XML declaration attributes. 2014-04-02 22:01:17 +02:00
Yorick Peterse fa2e71c790 Tests for TreeBuilder#on_document. 2014-03-28 18:52:08 +01:00
Yorick Peterse f99c13b516 Tests + docs for the TreeBuilder class. 2014-03-28 17:11:54 +01:00
Yorick Peterse 6d866523b8 Renamed XML::Builder to XML::TreeBuilder. 2014-03-28 16:37:37 +01:00
Yorick Peterse 331726b2ca Tests for the various XML node types. 2014-03-28 16:34:30 +01:00
Yorick Peterse c366a96ce8 Rake task for generating code coverage. 2014-03-28 16:33:47 +01:00
Yorick Peterse e141c084f9 Dedicated DOM builder class for CDATA tags. 2014-03-28 09:27:53 +01:00
Yorick Peterse 2b250bbf42 Rough DOM building setup. 2014-03-28 08:59:48 +01:00
Yorick Peterse 6ae52c1b12 Initial rough sketches for the DOM API. 2014-03-26 18:12:00 +01:00
Yorick Peterse 6c661f3ee9 Removed the donations section.
I gave this some thought and I've removed it for two reasons:

1. My Dogecoin Wallet takes *forever* to sync with the network (13 weeks
   behind) so I uninstalled it. I can't be bothered waiting forever for a
   gimmick.

2. I don't like asking for donations/money. I'd much rather have people send me
   an Email thanking me for my work than for them to donate money. The latter
   means much more to me.
2014-03-25 23:55:10 +01:00
Yorick Peterse 4a48647d1e Removed generated lexer/parser.
I am a dumbass.
2014-03-25 21:47:40 +01:00
Yorick Peterse fb626278a8 Re-wrapped comments in the XML lexer. 2014-03-25 10:12:39 +01:00
Yorick Peterse 8ebd72158c Renamed XML::Lexer#t to #emit(). 2014-03-25 09:42:52 +01:00
Yorick Peterse 79818eb349 Added a convenience class for parsing HTML.
This removes the need for users having to set the `:html` option themselves.
2014-03-25 09:40:24 +01:00
Yorick Peterse 58009614f6 Moved XML specs into spec/oga/xml. 2014-03-25 09:36:39 +01:00
Yorick Peterse 7c03de0e2f Renamed HTML_PARSER to PARSER_OUTPUT.
This keeps it consistent with the lexer.
2014-03-25 09:35:48 +01:00
Yorick Peterse eae13d21ed Namespaced the lexer/parser under Oga::XML.
With the upcoming XPath and CSS selector lexers/parsers it will be confusing to
keep these in the root namespace.
2014-03-25 09:34:38 +01:00
Yorick Peterse 2259061c89 Don't require the 2nd Lexer#add_token argument. 2014-03-24 21:35:47 +01:00
Yorick Peterse 641c54261e Simplified lexer output for comments. 2014-03-24 21:34:30 +01:00
Yorick Peterse eaf1669b07 Simplified lexer output for CDATA tags. 2014-03-24 21:33:05 +01:00
Yorick Peterse 470be5a839 Simplified the lexer output for doctypes. 2014-03-24 21:32:16 +01:00
Yorick Peterse ac775918ee Lexing/parsing of XML declaration tags.
This closes #12.
2014-03-24 21:30:19 +01:00
Yorick Peterse b695ecf0df Renamed element lexer tags.
T_ELEM_OPEN has been renamed to T_ELEM_START, T_ELEM_CLOSE has been renamed to
T_ELEM_END. This keeps the token names consistent with the other ones (e.g.
T_COMMENT_START).
2014-03-24 20:32:43 +01:00
Yorick Peterse 0b6ba6e6b5 Fixed typ. 2014-03-24 20:20:19 +01:00
Yorick Peterse ca66339a08 README entry on donations. 2014-03-24 20:13:16 +01:00
Yorick Peterse 52abc9d29e Basic documentation for Oga::Parser. 2014-03-23 21:29:57 +01:00
Yorick Peterse 19c1d66287 Use String#unpack instead of String#codepoints.
The latter returns an Enumerable which on Ruby 1.9.3 doesn't have #length
available. Besides this it's better to just return an Array since we'll iterate
over every character anyway.
2014-03-23 21:21:27 +01:00
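
A small sketch of what the commit above describes: `String#unpack` with the `U*` directive decodes a String into a plain Array of Integer codepoints. The input string is a made-up example.

```ruby
# Made-up input; "\u00e9" is a non-ASCII character to show that codepoints,
# not bytes, are returned.
input = "<p>h\u00e9llo</p>"

# 'U*' decodes the entire String as UTF-8 codepoints and returns an Array.
codepoints = input.unpack('U*')

# Because the result is an Array, #length and index lookups are available.
puts codepoints.class
puts codepoints.length
puts codepoints[0] == '<'.ord
```

The Array can be turned back into a String with `codepoints.pack('U*')`.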
Yorick Peterse a2452b6371 Use codepoints instead of chars in the lexer.
Grand wizard overlord @whitequark recommended this as it will bypass the need
for creating an individual String instance for every character (at least not
until needed). This becomes noticeable on large inputs (e.g. 100 MB of XML).
Previously these would result in the kernel OOM killing the process. Using
codepoints, memory usage increases by a "mere" 1-1.5 GB.
2014-03-23 20:20:07 +01:00
Yorick Peterse cdf5f1d541 Improve lexer performance by 20x or so.
This was a rather interesting turn of events. As it turned out the Ragel
generated lexer was extremely slow on large inputs. For example, lexing
benchmark/fixtures/hrs.html took around 10 seconds according to the benchmark
benchmark/lexer/bench_html_time.rb:

    Rehearsal --------------------------------------------------------
    lex HTML              10.870000   0.000000  10.870000 ( 10.877920)
    ---------------------------------------------- total: 10.870000sec

                               user     system      total        real
    lex HTML              10.440000   0.010000  10.450000 ( 10.449500)

The corresponding benchmark-ips benchmark (bench_html.rb) presented the
following results:

    Calculating -------------------------------------
                lex HTML         1 i/100ms
    -------------------------------------------------
                lex HTML        0.1 (±0.0%) i/s -          1 in  10.472534s

10 seconds for around 165 KB of HTML was not acceptable. I spent a good amount
of time profiling things, even submitting a patch to Ragel
(https://github.com/athurston/ragel/pull/1). At some point I decided to give a
pure C lexer + FFI bindings a try (so it would also work on JRuby). Trying to
write C reminded me why I didn't want to do it in C in the first place.

Around 2AM I gave up and went to brush my teeth and head to bed. Then, a
miracle happened. More precisely, I actually gave my brain some time to think
away from the computer. I said to myself:

    What if I feed Ragel an Array of characters instead of an entire String?
    That way I bypass String#[] being expensive without having to change all of
    Ragel or use a different language.

The results of this change are rather interesting. With these changes the
benchmark bench_html_time.rb now gives back the following:

    Rehearsal --------------------------------------------------------
    lex HTML               0.550000   0.000000   0.550000 (  0.550649)
    ----------------------------------------------- total: 0.550000sec

                               user     system      total        real
    lex HTML               0.520000   0.000000   0.520000 (  0.520713)

The benchmark bench_html.rb in turn gives back this:

    Calculating -------------------------------------
                lex HTML         1 i/100ms
    -------------------------------------------------
                lex HTML        2.0 (±0.0%) i/s -         10 in   5.120905s

According to both benchmarks we now have a speedup of about 20 times without
having to make any further changes to Ragel or the lexer itself.

I love it when a plan comes together.
2014-03-23 12:46:22 +01:00
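
The idea from the commit above, indexing into an Array instead of calling `String#[]` for every character, can be tried out with a toy scanning loop. This is only a sketch: the two `count_open_*` methods are hypothetical stand-ins for a lexer loop, the input is generated rather than the hrs.html fixture, and the timings depend on the Ruby version and the input.

```ruby
require 'benchmark'

# Generated stand-in for a larger HTML document.
input = "<p>h\u00e9llo</p>" * 2_000

# Decode once up front so the loop below only performs Array lookups.
codepoints = input.unpack('U*')

# Walks the String by index, calling String#[] for every character.
def count_open_string(string)
  count  = 0
  index  = 0
  length = string.length

  while index < length
    count += 1 if string[index] == '<'
    index += 1
  end

  count
end

# Walks the Array of codepoints by index, comparing Integers.
def count_open_array(codepoints)
  count  = 0
  index  = 0
  length = codepoints.length
  lt     = '<'.ord

  while index < length
    count += 1 if codepoints[index] == lt
    index += 1
  end

  count
end

Benchmark.bm(12) do |bench|
  bench.report('String#[]') { count_open_string(input) }
  bench.report('Array#[]')  { count_open_array(codepoints) }
end
```

Both methods return the same count; only the cost of each lookup differs.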
Yorick Peterse 4b914b3d6f Added extra benchmarks for lexing large inputs. 2014-03-23 12:46:04 +01:00
Yorick Peterse 0e9d9b844c Removed duplicate start_element rule. 2014-03-21 18:54:47 +01:00
Yorick Peterse 56ed9e949c Use index based buffering for strings.
This uses the same system as for T_TEXT nodes.
2014-03-21 17:45:40 +01:00
Yorick Peterse d7a40ec470 Simple benchmark for lexing elements. 2014-03-21 17:45:23 +01:00
Yorick Peterse 9fa694ad4f Use index based buffers for text nodes.
Instead of appending single characters to a String buffer the lexer now uses a
start and end position to figure out what the buffer is. This is a lot faster
than constantly appending to a String.
2014-03-21 17:32:07 +01:00
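
A rough sketch contrasting the two buffering strategies described in the commit above. The variable names and the input are made up, and the real lexer is generated by Ragel, so this only illustrates the principle.

```ruby
# Made-up input: a text node followed by an element.
data = 'some text<p>'.unpack('U*')
lt   = '<'.ord

# Appending approach: one String append per character.
appended = ''

data.each do |codepoint|
  break if codepoint == lt

  appended << [codepoint].pack('U')
end

# Index based approach: remember where the text starts, advance an end
# position until the text stops, then build the String once.
text_start = 0
text_end   = 0

text_end += 1 while text_end < data.length && data[text_end] != lt

sliced = data[text_start...text_end].pack('U*')

puts sliced
```

The index based version creates a single String at the end instead of growing one on every character.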
Yorick Peterse 2852afce9b Benchmark for measuring CDATA lexing. 2014-03-21 16:59:44 +01:00
Yorick Peterse 55f116124c Fix for showing lines in parser errors. 2014-03-21 00:16:20 +01:00
Yorick Peterse 7749f4abce Corrected a comment in the parser. 2014-03-21 00:10:20 +01:00
Yorick Peterse a20ec0000a Show up to 5 surrounding lines in parser errors. 2014-03-20 23:40:25 +01:00
Yorick Peterse 91fb7523fd Lex open tags with newlines in them. 2014-03-20 23:39:29 +01:00
Yorick Peterse ba17996bfc Fancier error messages for the parser.
The error messages of the parser now contain surrounding lines of code instead
of only the offending line of code. This should make debugging a bit easier.
Line numbers are also shown for each line.
2014-03-20 23:30:24 +01:00
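
A minimal sketch of an error message that shows numbered surrounding lines, as described in the commit above. The helper name `error_context`, the `=>` marker and the exact layout are assumptions made for illustration, not the parser's actual output.

```ruby
# Hypothetical helper: returns the offending line plus up to `surrounding`
# lines before and after it, each prefixed with its line number.
def error_context(source, line_number, surrounding = 5)
  lines = source.lines.to_a
  first = [line_number - surrounding, 1].max
  last  = [line_number + surrounding, lines.length].min
  width = last.to_s.length

  (first..last).map do |number|
    # Mark the offending line so it stands out from its neighbours.
    marker = number == line_number ? '=>' : '  '

    "#{marker} #{number.to_s.rjust(width)}: #{lines[number - 1].chomp}"
  end.join("\n")
end

source = "<a>\n<b\n</a>\n"

puts error_context(source, 2, 1)
```

Clamping `first` and `last` keeps the window inside the input when the error is near the start or end of the document.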
Yorick Peterse 74bc11a239 Rip out column counting.
This makes both the lexer and parser quite a bit easier to use. Counting column
numbers also isn't really needed when parsing XML/HTML.
2014-03-20 19:44:28 +01:00
Yorick Peterse 70a39042e7 Removed useless rules from the parser. 2014-03-20 18:58:32 +01:00