T_ELEM_OPEN has been renamed to T_ELEM_START, T_ELEM_CLOSE has been renamed to
T_ELEM_END. This keeps the token names consistent with the other ones (e.g.
T_COMMENT_START).
The latter returns an Enumerable which on Ruby 1.9.3 doesn't have #length
available. Besides this it's better to just return an Array since we'll iterate
over every character anyway.
Grand wizard overlord @whitequark recommended this as it will bypass the need
for creating individual String instance for every character (at least not until
needed). This becomes noticable on large inputs (e.g. 100 MB of XML).
Previously these would result in the kernel OOM killing the process. Using
codepoints memory increase by a "mere" 1-1,5 GB.
This was a rather interesting turn of events. As it turned out the Ragel
generated lexer was extremely slow on large inputs. For example, lexing
benchmark/fixtures/hrs.html took around 10 seconds according to the benchmark
benchmark/lexer/bench_html_time.rb:
Rehearsal --------------------------------------------------------
lex HTML 10.870000 0.000000 10.870000 ( 10.877920)
---------------------------------------------- total: 10.870000sec
user system total real
lex HTML 10.440000 0.010000 10.450000 ( 10.449500)
The corresponding benchmark-ips benchmark (bench_html.rb) presented the
following results:
Calculating -------------------------------------
lex HTML 1 i/100ms
-------------------------------------------------
lex HTML 0.1 (±0.0%) i/s - 1 in 10.472534s
10 seconds for around 165 KB of HTML was not acceptable. I spent a good time
profiling things, even submitting a patch to Ragel
(https://github.com/athurston/ragel/pull/1). At some point I decided to give a
pure C lexer + FFI bindings a try (so it would also work on JRuby). Trying to
write C reminded me why I didn't want to do it in C in the first place.
Around 2AM I gave up and went to brush my teeth and head to bed. Then, a
miracle happened. More precisely, I actually gave my brain some time to think
away from the computer. I said to myself:
What if I feed Ragel an Array of characters instead of an entire String?
That way I bypass String#[] being expensive without having to change all of
Ragel or use a different language.
The results of this change are rather interesting. With these changes the
benchmark bench_html_time.rb now gives back the following:
Rehearsal --------------------------------------------------------
lex HTML 0.550000 0.000000 0.550000 ( 0.550649)
----------------------------------------------- total: 0.550000sec
user system total real
lex HTML 0.520000 0.000000 0.520000 ( 0.520713)
The benchmark bench_html.rb in turn gives back this:
Calculating -------------------------------------
lex HTML 1 i/100ms
-------------------------------------------------
lex HTML 2.0 (±0.0%) i/s - 10 in 5.120905s
According to both benchmarks we now have a speedup of about 20 times without
having to make any further changes to Ragel or the lexer itself.
I love it when a plan comes together.
Instead of appending single characters to a String buffer the lexer now uses a
start and end position to figure out what the buffer is. This is a lot faster
than constantly appending to a String.
The error messages of the parser now contain surrounding lines of code instead
of only the offending line of code. This should make debugging a bit easier.
Line numbers are also shown for each line.
Although this AST is compacter it will result in conflicts between (text),
(attributes) and (attribute) nodes in regular XML documents. This is due to XML
allowing elements with these names (unlike in HTML).
This reverts commit 8898d08831.
The AST no longer uses the generic `element` type for element nodes but instead
changes the type based on the element type. That is, a <p> element now results
in an (p) node, <link> in (link), etc.
This emits separate tokens for the start tag (T_ELEMENT_OPEN) and name
(T_ELEMENT_NAME). This makes it easier to include the namespace of an element
(T_ELEMENT_NS) in the output.
The current implementation is a bit messy. In particular the counting of column
numbers is not entirely the way it should be. There are also some problems with
nested tags/text that I still have to resolve.