From cdf5f1d541611f31541b0d84a742a15aecdef093 Mon Sep 17 00:00:00 2001
From: Yorick Peterse
Date: Sun, 23 Mar 2014 12:46:22 +0100
Subject: [PATCH] Improve lexer performance by 20x or so.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This was a rather interesting turn of events. As it turned out, the
Ragel-generated lexer was extremely slow on large inputs. For example,
lexing benchmark/fixtures/hrs.html took around 10 seconds according to
the benchmark benchmark/lexer/bench_html_time.rb:

    Rehearsal --------------------------------------------------------
    lex HTML      10.870000   0.000000  10.870000 ( 10.877920)
    ---------------------------------------------- total: 10.870000sec

                   user     system      total        real
    lex HTML  10.440000   0.010000  10.450000 ( 10.449500)

The corresponding benchmark-ips benchmark (bench_html.rb) presented the
following results:

    Calculating -------------------------------------
                lex HTML         1 i/100ms
    -------------------------------------------------
                lex HTML      0.1 (±0.0%) i/s -          1 in  10.472534s

10 seconds for around 165 KB of HTML was not acceptable. I spent a good
while profiling things, even submitting a patch to Ragel
(https://github.com/athurston/ragel/pull/1). At some point I decided to
give a pure C lexer + FFI bindings a try (so it would also work on
JRuby). Trying to write C reminded me why I didn't want to do it in C in
the first place. Around 2AM I gave up and went to brush my teeth and
head to bed.

Then, a miracle happened. More precisely, I actually gave my brain some
time to think away from the computer. I said to myself:

    What if I feed Ragel an Array of characters instead of an entire
    String? That way I bypass String#[] being expensive without having
    to change all of Ragel or use a different language.

The results of this change are rather interesting.
With these changes the benchmark bench_html_time.rb now gives back the
following:

    Rehearsal --------------------------------------------------------
    lex HTML       0.550000   0.000000   0.550000 (  0.550649)
    ----------------------------------------------- total: 0.550000sec

                   user     system      total        real
    lex HTML   0.520000   0.000000   0.520000 (  0.520713)

The benchmark bench_html.rb in turn gives back this:

    Calculating -------------------------------------
                lex HTML         1 i/100ms
    -------------------------------------------------
                lex HTML      2.0 (±0.0%) i/s -         10 in   5.120905s

According to both benchmarks we now have a speedup of about 20 times
without having to make any further changes to Ragel or the lexer itself.

I love it when a plan comes together.
---
 lib/oga/lexer.rl | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/oga/lexer.rl b/lib/oga/lexer.rl
index 97fb46b..31d2d02 100644
--- a/lib/oga/lexer.rl
+++ b/lib/oga/lexer.rl
@@ -95,7 +95,7 @@ module Oga
     # @return [Array]
     #
     def lex(data)
-      @data = data
+      @data = data.chars.to_a
       lexer_start = self.class.lexer_start
       eof = data.length
 
@@ -152,7 +152,7 @@ module Oga
     # @return [String]
     #
     def text(start = @ts, stop = @te)
-      return @data[start...stop]
+      return @data[start...stop].join('')
     end
 
     ##
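
The two changed lines in the diff above can be tried in isolation. What follows is a minimal sketch, not code from Oga: the `data` string and the slice indices are made-up stand-ins for the lexer's input and its `@ts`/`@te` offsets. It only checks that slicing an Array of characters and joining the result gives the same text as slicing the String directly, and that the character count used for `eof` is unchanged. It makes no claim about speed.

```ruby
# Hypothetical input; a non-ASCII character is included on purpose so that
# character offsets and byte offsets differ.
data = "<p>h\u00e9llo</p>"

# Hypothetical token boundaries, standing in for @ts and @te.
start = 3
stop  = 8

# Before the patch: slice the String itself on every call.
from_string = data[start...stop]

# After the patch: split the input into an Array of characters once, then
# slice the Array and join the pieces back into a String.
chars      = data.chars.to_a
from_array = chars[start...stop].join('')

puts from_string
puts from_array
puts from_string == from_array

# The lexer still takes eof from data.length; the Array has the same length.
puts chars.length == data.length
```

The trade-off in this sketch is the same one the patch makes: one up-front pass over the input to build the Array, plus a `join` per extracted token, in exchange for not indexing into the String repeatedly.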