I gave this some thought and I've removed it for two reasons:
1. My Dogecoin Wallet takes *forever* to sync with the network (13 weeks
behind) so I uninstalled it. I can't be bothered waiting forever for a
gimmick.
2. I don't like asking for donations/money. I'd much rather have people send me
an Email thanking me for my work than for them to donate money. The latter
means much more to me.
T_ELEM_OPEN has been renamed to T_ELEM_START, T_ELEM_CLOSE has been renamed to
T_ELEM_END. This keeps the token names consistent with the other ones (e.g.
T_COMMENT_START).
The latter returns an Enumerable which on Ruby 1.9.3 doesn't have #length
available. Besides this it's better to just return an Array since we'll iterate
over every character anyway.
Grand wizard overlord @whitequark recommended this as it will bypass the need
for creating individual String instance for every character (at least not until
needed). This becomes noticable on large inputs (e.g. 100 MB of XML).
Previously these would result in the kernel OOM killing the process. Using
codepoints memory increase by a "mere" 1-1,5 GB.
This was a rather interesting turn of events. As it turned out the Ragel
generated lexer was extremely slow on large inputs. For example, lexing
benchmark/fixtures/hrs.html took around 10 seconds according to the benchmark
benchmark/lexer/bench_html_time.rb:
Rehearsal --------------------------------------------------------
lex HTML 10.870000 0.000000 10.870000 ( 10.877920)
---------------------------------------------- total: 10.870000sec
user system total real
lex HTML 10.440000 0.010000 10.450000 ( 10.449500)
The corresponding benchmark-ips benchmark (bench_html.rb) presented the
following results:
Calculating -------------------------------------
lex HTML 1 i/100ms
-------------------------------------------------
lex HTML 0.1 (±0.0%) i/s - 1 in 10.472534s
10 seconds for around 165 KB of HTML was not acceptable. I spent a good time
profiling things, even submitting a patch to Ragel
(https://github.com/athurston/ragel/pull/1). At some point I decided to give a
pure C lexer + FFI bindings a try (so it would also work on JRuby). Trying to
write C reminded me why I didn't want to do it in C in the first place.
Around 2AM I gave up and went to brush my teeth and head to bed. Then, a
miracle happened. More precisely, I actually gave my brain some time to think
away from the computer. I said to myself:
What if I feed Ragel an Array of characters instead of an entire String?
That way I bypass String#[] being expensive without having to change all of
Ragel or use a different language.
The results of this change are rather interesting. With these changes the
benchmark bench_html_time.rb now gives back the following:
Rehearsal --------------------------------------------------------
lex HTML 0.550000 0.000000 0.550000 ( 0.550649)
----------------------------------------------- total: 0.550000sec
user system total real
lex HTML 0.520000 0.000000 0.520000 ( 0.520713)
The benchmark bench_html.rb in turn gives back this:
Calculating -------------------------------------
lex HTML 1 i/100ms
-------------------------------------------------
lex HTML 2.0 (±0.0%) i/s - 10 in 5.120905s
According to both benchmarks we now have a speedup of about 20 times without
having to make any further changes to Ragel or the lexer itself.
I love it when a plan comes together.
Instead of appending single characters to a String buffer the lexer now uses a
start and end position to figure out what the buffer is. This is a lot faster
than constantly appending to a String.
The error messages of the parser now contain surrounding lines of code instead
of only the offending line of code. This should make debugging a bit easier.
Line numbers are also shown for each line.
Although this AST is compacter it will result in conflicts between (text),
(attributes) and (attribute) nodes in regular XML documents. This is due to XML
allowing elements with these names (unlike in HTML).
This reverts commit 8898d08831.