From cdf5f1d541611f31541b0d84a742a15aecdef093 Mon Sep 17 00:00:00 2001
From: Yorick Peterse <yorickpeterse@gmail.com>
Date: Sun, 23 Mar 2014 12:46:22 +0100
Subject: [PATCH] Improve lexer performance by 20x or so.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This was a rather interesting turn of events. As it turned out the Ragel
generated lexer was extremely slow on large inputs. For example, lexing
benchmark/fixtures/hrs.html took around 10 seconds according to the benchmark
benchmark/lexer/bench_html_time.rb:

    Rehearsal --------------------------------------------------------
    lex HTML              10.870000   0.000000  10.870000 ( 10.877920)
    ---------------------------------------------- total: 10.870000sec

                               user     system      total        real
    lex HTML              10.440000   0.010000  10.450000 ( 10.449500)

The corresponding benchmark-ips benchmark (bench_html.rb) presented the
following results:

    Calculating -------------------------------------
                lex HTML         1 i/100ms
    -------------------------------------------------
                lex HTML        0.1 (±0.0%) i/s -          1 in  10.472534s

10 seconds for around 165 KB of HTML was not acceptable. I spent a good time
profiling things, even submitting a patch to Ragel
(https://github.com/athurston/ragel/pull/1). At some point I decided to give a
pure C lexer + FFI bindings a try (so it would also work on JRuby). Trying to
write C reminded me why I didn't want to do it in C in the first place.

Around 2AM I gave up and went to brush my teeth and head to bed. Then, a
miracle happened. More precisely, I actually gave my brain some time to think
away from the computer. I said to myself:

    What if I feed Ragel an Array of characters instead of an entire String?
    That way I bypass String#[] being expensive without having to change all of
    Ragel or use a different language.

The results of this change are rather interesting. With these changes the
benchmark bench_html_time.rb now gives back the following:

    Rehearsal --------------------------------------------------------
    lex HTML               0.550000   0.000000   0.550000 (  0.550649)
    ----------------------------------------------- total: 0.550000sec

                               user     system      total        real
    lex HTML               0.520000   0.000000   0.520000 (  0.520713)

The benchmark bench_html.rb in turn gives back this:

    Calculating -------------------------------------
                lex HTML         1 i/100ms
    -------------------------------------------------
                lex HTML        2.0 (±0.0%) i/s -         10 in   5.120905s

According to both benchmarks we now have a speedup of about 20 times without
having to make any further changes to Ragel or the lexer itself.

I love it when a plan comes together.
---
 lib/oga/lexer.rl | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/oga/lexer.rl b/lib/oga/lexer.rl
index 97fb46b..31d2d02 100644
--- a/lib/oga/lexer.rl
+++ b/lib/oga/lexer.rl
@@ -95,7 +95,7 @@ module Oga
     # @return [Array]
     #
     def lex(data)
-      @data       = data
+      @data       = data.chars.to_a
       lexer_start = self.class.lexer_start
       eof         = data.length
 
@@ -152,7 +152,7 @@ module Oga
     # @return [String]
     #
     def text(start = @ts, stop = @te)
-      return @data[start...stop]
+      return @data[start...stop].join('')
     end
 
     ##