Improve lexer performance by 20x or so.
This was a rather interesting turn of events. As it turned out, the Ragel
generated lexer was extremely slow on large inputs. For example, lexing
benchmark/fixtures/hrs.html took around 10 seconds according to the benchmark
benchmark/lexer/bench_html_time.rb:
Rehearsal --------------------------------------------------------
lex HTML  10.870000   0.000000  10.870000 ( 10.877920)
---------------------------------------------- total: 10.870000sec

               user     system      total        real
lex HTML  10.440000   0.010000  10.450000 ( 10.449500)
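For context, a timing benchmark of this kind can be written with Ruby's
built-in Benchmark.bmbm. The sketch below only approximates what
bench_html_time.rb does; the Oga::Lexer class name and the lex(data) call are
assumptions based on the diff further down:

    require 'benchmark'
    require 'oga'

    html = File.read('benchmark/fixtures/hrs.html')

    Benchmark.bmbm do |bench|
      bench.report 'lex HTML' do
        # Oga::Lexer and lex(html) are assumed; the important part is lexing
        # the entire fixture in a single report block.
        Oga::Lexer.new.lex(html)
      end
    end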
The corresponding benchmark-ips benchmark (bench_html.rb) presented the
following results:
Calculating -------------------------------------
            lex HTML         1 i/100ms
-------------------------------------------------
            lex HTML        0.1 (±0.0%) i/s -          1 in 10.472534s
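The benchmark-ips version follows the same shape (again a sketch rather than
the exact contents of bench_html.rb, with the same assumed lexer API):

    require 'benchmark/ips'
    require 'oga'

    html = File.read('benchmark/fixtures/hrs.html')

    Benchmark.ips do |bench|
      bench.report 'lex HTML' do
        # Same assumed API as above: lex the whole fixture once per iteration.
        Oga::Lexer.new.lex(html)
      end
    end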
10 seconds for around 165 KB of HTML was not acceptable. I spent a good
amount of time profiling things, even submitting a patch to Ragel
(https://github.com/athurston/ragel/pull/1). At some point I decided to give a
pure C lexer + FFI bindings a try (so it would also work on JRuby). Trying to
write C reminded me why I didn't want to do it in C in the first place.
Around 2AM I gave up and went to brush my teeth and head to bed. Then, a
miracle happened. More precisely, I actually gave my brain some time to think
away from the computer. I said to myself:
    "What if I feed Ragel an Array of characters instead of an entire String?
     That way I bypass String#[] being expensive without having to change all
     of Ragel or use a different language."
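The intuition here is that calling String#[] with a character index allocates
a new single-character String on every access, and on multibyte input it may
also have to scan for character boundaries, while Array#[] simply returns an
already existing object. A hypothetical micro-benchmark (not part of the
repository) along these lines makes the difference visible:

    require 'benchmark/ips'

    data  = File.read('benchmark/fixtures/hrs.html')
    chars = data.chars.to_a

    Benchmark.ips do |bench|
      bench.report 'String#[]' do
        # Allocates a fresh one-character String for every position.
        data.length.times { |index| data[index] }
      end

      bench.report 'Array#[]' do
        # Returns the character Strings that were allocated up front.
        chars.length.times { |index| chars[index] }
      end
    end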
The results of this change are rather interesting. With it in place, the
benchmark bench_html_time.rb now reports the following:
Rehearsal --------------------------------------------------------
lex HTML   0.550000   0.000000   0.550000 (  0.550649)
----------------------------------------------- total: 0.550000sec

               user     system      total        real
lex HTML   0.520000   0.000000   0.520000 (  0.520713)
The benchmark bench_html.rb in turn reports the following:
Calculating -------------------------------------
            lex HTML         1 i/100ms
-------------------------------------------------
            lex HTML        2.0 (±0.0%) i/s -         10 in 5.120905s
According to both benchmarks we now have a speedup of about 20 times, without
having to make any further changes to Ragel or the generated lexer itself.
I love it when a plan comes together.
parent 4b914b3d6f
commit cdf5f1d541
@@ -95,7 +95,7 @@ module Oga
   # @return [Array]
   #
   def lex(data)
-    @data = data
+    @data = data.chars.to_a
     lexer_start = self.class.lexer_start
     eof = data.length

@@ -152,7 +152,7 @@ module Oga
   # @return [String]
   #
   def text(start = @ts, stop = @te)
-    return @data[start...stop]
+    return @data[start...stop].join('')
   end

   ##