Yorick Peterse cdf5f1d541 Improve lexer performance by 20x or so.
This was a rather interesting turn of events. As it turned out the Ragel
generated lexer was extremely slow on large inputs. For example, lexing
benchmark/fixtures/hrs.html took around 10 seconds according to the benchmark
benchmark/lexer/bench_html_time.rb:

    Rehearsal --------------------------------------------------------
    lex HTML              10.870000   0.000000  10.870000 ( 10.877920)
    ---------------------------------------------- total: 10.870000sec

                               user     system      total        real
    lex HTML              10.440000   0.010000  10.450000 ( 10.449500)
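
The "Rehearsal" layout above is what `Benchmark.bmbm` from Ruby's standard library prints. A benchmark of this shape might look roughly like the sketch below. The contents of bench_html_time.rb are not shown here, so the `lex` method and the generated input are hypothetical stand-ins rather than the real lexer or fixture.

```ruby
require 'benchmark'

# Hypothetical stand-in for the real lexer: splits the input into tags
# and text chunks.
def lex(input)
  input.scan(/<[^>]*>|[^<]+/)
end

# Hypothetical stand-in for benchmark/fixtures/hrs.html.
html = '<p>Hello <b>world</b></p>' * 1_000

# bmbm runs every report twice: once as a rehearsal to warm things up,
# then once for the numbers that are actually reported.
results = Benchmark.bmbm(20) do |bench|
  bench.report('lex HTML') { lex(html) }
end
```

The returned `results` is an Array of `Benchmark::Tms` objects, one per report, which is handy when the timings need to be compared programmatically instead of read from the console.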

The corresponding benchmark-ips benchmark (bench_html.rb) presented the
following results:

    Calculating -------------------------------------
                lex HTML         1 i/100ms
    -------------------------------------------------
                lex HTML        0.1 (±0.0%) i/s -          1 in  10.472534s

10 seconds for around 165 KB of HTML was not acceptable. I spent a good deal of
time profiling things, even submitting a patch to Ragel
(https://github.com/athurston/ragel/pull/1). At some point I decided to give a
pure C lexer + FFI bindings a try (so it would also work on JRuby). Trying to
write C reminded me why I didn't want to do it in C in the first place.

Around 2AM I gave up and went to brush my teeth and head to bed. Then, a
miracle happened. More precisely, I actually gave my brain some time to think
away from the computer. I said to myself:

    What if I feed Ragel an Array of characters instead of an entire String?
    That way I bypass String#[] being expensive without having to change all of
    Ragel or use a different language.

The results of this change are rather interesting. With these changes the
benchmark bench_html_time.rb now gives back the following:

    Rehearsal --------------------------------------------------------
    lex HTML               0.550000   0.000000   0.550000 (  0.550649)
    ----------------------------------------------- total: 0.550000sec

                               user     system      total        real
    lex HTML               0.520000   0.000000   0.520000 (  0.520713)

The benchmark bench_html.rb in turn gives back this:

    Calculating -------------------------------------
                lex HTML         1 i/100ms
    -------------------------------------------------
                lex HTML        2.0 (±0.0%) i/s -         10 in   5.120905s

According to both benchmarks we now have a speedup of about 20 times without
having to make any further changes to Ragel or the lexer itself.
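
The idea can be illustrated with a minimal sketch. This is not Oga's lexer: both `count_open_brackets_*` methods are hypothetical helpers that only mimic the access pattern of walking the input one index at a time. The commit message does not say why `String#[]` was expensive; character indexing into a multibyte String having to do more work than a plain Array lookup is an assumption.

```ruby
# Hypothetical illustration, not Oga's actual lexer code.

# Walks the input by index using String#[], the access pattern the
# commit message describes as expensive.
def count_open_brackets_string(input)
  count  = 0
  index  = 0
  length = input.length

  while index < length
    count += 1 if input[index] == '<'
    index += 1
  end

  count
end

# Same loop, but the String is split into an Array of characters once
# up front, so every lookup inside the loop is an Array#[] call.
def count_open_brackets_array(input)
  chars  = input.chars.to_a
  count  = 0
  index  = 0
  length = chars.length

  while index < length
    count += 1 if chars[index] == '<'
    index += 1
  end

  count
end

html = '<p>Héllo <b>wörld</b></p>'

puts count_open_brackets_string(html) # prints 4
puts count_open_brackets_array(html)  # prints 4
```

Both methods return the same result. The trade-off is that the Array version pays once for splitting the String and holds one small String per character in memory, in exchange for cheaper lookups inside the loop.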

I love it when a plan comes together.
2014-03-23 12:46:22 +01:00
benchmark Added extra benchmarks for lexing large inputs. 2014-03-23 12:46:04 +01:00
checksum Basic project layout. 2014-02-26 19:50:16 +01:00
doc Basic project layout. 2014-02-26 19:50:16 +01:00
lib Improve lexer performance by 20x or so. 2014-03-23 12:46:22 +01:00
spec Lex open tags with newlines in them. 2014-03-20 23:39:29 +01:00
task Lowered the required Ragel version to 6.7. 2014-03-18 00:12:21 +01:00
.editorconfig Added a EditorConfig file. 2014-02-26 19:56:47 +01:00
.gitignore Updated the gitignore entry for the parser. 2014-03-11 22:03:02 +01:00
.ruby-version Added a .ruby-version file. 2014-03-18 18:08:25 +01:00
.travis.yml Allow JRuby to fail for now. 2014-03-18 00:13:33 +01:00
.yardopts Basic project layout. 2014-02-26 19:50:16 +01:00
Gemfile Basic project layout. 2014-02-26 19:50:16 +01:00
LICENSE Added a license. 2014-02-26 22:20:47 +01:00
MANIFEST Basic project layout. 2014-02-26 19:50:16 +01:00
README.md Added a license. 2014-02-26 22:20:47 +01:00
Rakefile Moved the parser class to Oga::Parser. 2014-03-11 22:01:50 +01:00
oga.gemspec Benchmark for measuring CDATA lexing. 2014-03-21 16:59:44 +01:00

README.md

Oga

Oga is (or will be) a pure Ruby, thread-safe HTML (and XML in the future) parser that doesn't trigger segmentation faults on Ruby implementations other than MRI. Oga will initially not focus on performance but instead will focus on proper handling of encodings, stability and a sane API. More importantly it will be pure Ruby only. No C extensions, no Java, no x86-64 assembly, just Ruby.

From Wikipedia:

Oga: A large two-person saw used for ripping large boards in the days before power saws. One person stood on a raised platform, with the board below him, and the other person stood underneath them.

Planned Features

  • Full support for HTML(5)
  • Full support for XML, DTDs will probably be ignored.
  • Support for xpath and CSS selector based queries
  • SAX/pull parsing APIs that don't make you want to cut yourself

Features

  • A README

Requirements

  • Ruby

Development requirements:

  • Ragel
  • Racc
  • Other stuff

Usage

Basic DOM parsing example:

require 'oga'

parser   = Oga::Parser::DOM.new
document = parser.parse('<p>Hello</p>')

puts document.css('p').first.text # => "Hello"

Pull parsing:

require 'oga'

parser = Oga::Parser::Pull.new('<p>Hello</p>')

parser.each do |node|
  puts node.text
end

These examples will probably change once I actually start writing some code.

Why Another HTML/XML parser?

Currently there are a few existing parsers out there, the most famous one being Nokogiri. Another parser that's becoming more popular these days is Ox. Ruby's standard library also comes with REXML.

The sad truth is that these existing libraries are problematic in their own ways. Nokogiri for example is extremely unstable on Rubinius. On MRI it works because of the non-concurrent nature of MRI, on JRuby it works because it's implemented in Java. Nokogiri also uses libxml2, which is a massive beast of a library, is not thread-safe and is problematic to install on certain platforms (apparently). I don't want to compile libxml2 every time I install Nokogiri either.

To give an example about the issues with Nokogiri on Rubinius (or any other Ruby implementation that is not MRI or JRuby), take a look at these issues:

Some of these have been fixed, some have not. The core problem remains: Nokogiri throws around void pointers and whatnot and expects things to magically work, which leaves a large number of places where it might break. Note that I have nothing against the people running these projects, I just heavily, heavily dislike the resulting codebase one has to deal with today.

Ox looks very promising but it lacks a rather crucial feature: parsing HTML (without using a SAX API). It is also, again, a C extension, which makes debugging more of a pain (at least for me).

I just want an HTML parser that I can rely on stability-wise and that is written in Ruby so I can actually debug it. In theory it should also make it easier for other Ruby developers to contribute.

Oga is an attempt at solving this problem. By writing it in pure Ruby the initial performance will probably not be as great. However, I feel this is a problem with individual Ruby implementations, not the language itself. Also, by writing it in Ruby we don't have to deal with all the crazy things of C/C++ or even Java.

In theory it should also allow it to run on every Ruby implementation, be it JRuby, Rubinius, Topaz or even mruby.

License

All source code in this repository is licensed under the MIT license unless specified otherwise. A copy of this license can be found in the file "LICENSE" in the root directory of this repository.