Commit Graph

830 Commits

Author SHA1 Message Date
Yorick Peterse 2689d3f65a Initial setup using a C extension.
While I've tried to keep Oga pure Ruby for as long as possible the performance
of Ragel's Ruby output was not worth the trouble. For example, lexing 10MB of
XML would take 5 to 6 seconds at least. Nokogiri on the other hand can parse
that same XML into a DOM document in about 300 milliseconds. Such a big
performance difference is not acceptable.

To work around this the XML/HTML lexer will be implemented in C for
MRI/Rubinius and Java for JRuby. For now there's only a C extension as I
haven't read up yet on the JRuby API. The end goal is to provide some sort of
Ragel "template" that can be used to generate the corresponding C/Java
extension code. This would remove the need to duplicate the grammar and
associated code.

The native extension setup is a hybrid between native and Ruby. The raw Ragel
stuff happens in C/Java while the actual logic of actions happens in Ruby. This
adds a small amount of overhead but makes it much easier to maintain the lexer.
Even with this extra overhead the performance is much better than pure Ruby.
The 10MB of XML mentioned above is lexed in about 600 milliseconds. In other
words, it's 10 times faster.
2014-05-05 00:31:28 +02:00
Yorick Peterse baaa24a760 Indentation fix in the lexer. 2014-05-04 18:06:43 +02:00
Yorick Peterse f18e8893de Removed the buffering crap from the lexer. 2014-05-04 17:39:08 +02:00
Yorick Peterse 57255012b7 Patch the Ragel lexer after generating it.
This further increases throughput of the lexer. On MRI this seems to save
around one second or so. It now sits at ~6,8 seconds in the big XML benchmark.

On JRuby, combined with some JIT options and invoke dynamic enabled, this can
reduce the average lexing time to around 3,5 seconds. Rubinius, also with a
few aggressive JIT options, seems to stick around 9 seconds.
2014-05-02 00:40:10 +02:00
Yorick Peterse 9dfdefee47 Removed XML::Lexer#buffering?
Instead of wrapping a predicate method around the ivar we'll just access it
directly. This reduces average lexing times in the big XML benchmark from 7,5
to ~7 seconds.
2014-05-01 22:59:56 +02:00
Yorick Peterse b854f737cd Run memory profiling for 60 seconds. 2014-05-01 21:47:51 +02:00
Yorick Peterse 676a5333c0 Use a default gnuplot script. 2014-05-01 21:27:08 +02:00
Yorick Peterse 3344f373bd Plot time offsets on X axes when profiling. 2014-05-01 21:26:05 +02:00
Yorick Peterse f4a71d7f63 Use wx as a gnuplot terminal.
This allows users to zoom in and such, which doesn't work on the qt terminal
for some reason.
2014-05-01 21:01:25 +02:00
Yorick Peterse e33bb6f901 Remove sample files when running rake clean. 2014-05-01 20:57:17 +02:00
Yorick Peterse 1c35317165 Revamped the profiling setup.
This removes the need for dozens of standalone gnuplot scripts, adds extra
profiling data and makes the actual profiling easier.
2014-05-01 20:54:25 +02:00
Yorick Peterse e54d77fc2f Cleaned up the average timing benchmark. 2014-05-01 13:43:33 +02:00
Yorick Peterse 203aea6b1a Cleaned up benchmarking code. 2014-05-01 13:08:44 +02:00
Yorick Peterse ebf9099f0e Dropped the benchmark_ prefixes.
These files reside in a benchmark/ directory. Gee, I wonder what they do.
2014-05-01 13:03:21 +02:00
Yorick Peterse 20f2f256f6 Benchmark for measuring average lexing times. 2014-05-01 13:01:52 +02:00
Yorick Peterse f607cf50dc Use local variables for Ragel.
Instead of using instance variables for ts, te, etc we'll use local variables.
Grand wizard overlord @whitequark suggested that this would be quite a bit
faster, which turns out to be true. For example, the big XML lexer benchmark
would, prior to this commit, complete in about 9 - 9,3 seconds. With this
commit that hovers around 8,5 seconds.
2014-05-01 13:00:29 +02:00
Yorick Peterse e26d5a8664 Removed unused variable in a lexer benchmark. 2014-05-01 12:25:49 +02:00
Yorick Peterse 2f36692abe Fixed the big XML lexer benchmark. 2014-04-30 09:28:28 +02:00
Yorick Peterse 83f6d5437e Contextual pull parsing.
This adds the ability to more easily act upon specific node types and nestings
when using the pull parsing API.

A basic example of this API looks like the following (only including relevant
code):

    parser.parse do |node|
      parser.on(:element, %w{people person}) do
        people << {:name => nil, :age => nil}
      end

      parser.on(:text, %w{people person name}) do
        people.last[:name] = node.text
      end

      parser.on(:text, %w{people person age}) do
        people.last[:age] = node.text.to_i
      end
    end

This fixes #6.
2014-04-29 23:05:49 +02:00
Yorick Peterse 1a413998a3 Track the current node in the pull parser.
The current node is tracked in the instance method `node`.
2014-04-29 21:21:05 +02:00
Yorick Peterse d0b3653785 Updated the manifest, again. 2014-04-29 20:42:17 +02:00
Yorick Peterse 5339664f33 Include .yardopts in the Gem. 2014-04-29 20:42:09 +02:00
Yorick Peterse 8522a82cf9 Updated the manifest. 2014-04-29 20:41:11 +02:00
Yorick Peterse 503b254216 Generate files before generating the manifest. 2014-04-29 20:41:02 +02:00
Yorick Peterse 586c8f1d46 Generated an initial manifest. 2014-04-29 20:40:34 +02:00
Yorick Peterse 59dae873e4 Don't rely on Git for generating the MANIFEST.
When using Git the resulting Gem will contain far too many useless files. For
example, the profile/ and spec/ directories are not needed when building Gems.
2014-04-29 20:39:20 +02:00
Yorick Peterse 579c0499ed Benchmark for the pull parser. 2014-04-29 14:48:43 +02:00
Yorick Peterse 5ed09236f9 Big XML benchmark for the lexer. 2014-04-29 14:48:36 +02:00
Yorick Peterse a1e9e74b9c Updated a benchmark description. 2014-04-29 14:24:33 +02:00
Yorick Peterse a42240bc2e Profiling setup for the pull parser. 2014-04-29 13:50:50 +02:00
Yorick Peterse d5e59c38ac Profiling setup for the DOM parser. 2014-04-29 13:47:55 +02:00
Yorick Peterse 2c4890533c Set initial size for the lexer graph. 2014-04-29 13:47:36 +02:00
Yorick Peterse a111e673cb Changed the big XML file size to 10 MB.
This makes various calculations a bit easier than when the file is 11MB in
size.
2014-04-29 13:42:02 +02:00
Yorick Peterse 53c45c621b Basic memory profiling setup.
This makes it a bit easier to profile memory usage of certain components and
plot them using Gnuplot. In the past I would write one-off scripts for this and
throw them away, only to figure out I needed them again later on.

Profiling samples are written to profile/samples and can be plotted using
corresponding Gnuplot scripts found in profile/plot. The latter requires
Gnuplot to be installed.
2014-04-29 13:38:56 +02:00
Yorick Peterse 70fcc8534c Benchmark for parsing big XML documents. 2014-04-29 13:05:45 +02:00
Yorick Peterse 45b0cdf811 Track element name nesting in the pull parser.
Tracking the names of nested elements makes it a lot easier to do contextual
pull parsing. Without this it's impossible to know what context the parser is
in at a given moment.

For memory reasons the parser currently only tracks the element names. In the
future it might perhaps also track extra information to make parsing easier.
2014-04-28 23:40:36 +02:00
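
The element name tracking described in commit 45b0cdf811 can be pictured as a
plain stack of names. The following is only a minimal sketch under that
assumption: the class `NestingTracker` and its methods `enter`, `leave` and
`inside?` are hypothetical names made up for illustration and are not taken
from Oga's source.

```ruby
# Hypothetical sketch, not Oga's actual implementation.
# Only element names are stored, mirroring the memory concern in the commit.
class NestingTracker
  attr_reader :nesting

  def initialize
    @nesting = []
  end

  # Called when an element is opened: push its name onto the stack.
  def enter(name)
    @nesting << name
  end

  # Called when an element is closed: pop the most recent name.
  def leave
    @nesting.pop
  end

  # Returns true when the current nesting matches the given names exactly.
  def inside?(names)
    @nesting == names
  end
end

tracker = NestingTracker.new

tracker.enter('people')
tracker.enter('person')
tracker.enter('name')

puts tracker.inside?(%w{people person name})

tracker.leave

puts tracker.nesting.inspect
```

Storing only names keeps the per-element cost at a single Array entry, at the
price of not being able to inspect attributes of parent elements.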
Yorick Peterse 030a0068bd Basic pull parsing setup.
This parser extends the regular DOM parser but delegates certain nodes to a
block instead of building a DOM tree.

The API is a bit raw in its current form but I'll extend it and make it a bit
more user friendly in the following commits. In particular I want to make it
easier to figure out if a certain node is nested inside another node.
2014-04-28 17:22:17 +02:00
Yorick Peterse fd5bbbc9a2 Move element recursion handling into a method.
This makes it easier to disable later on in the streaming parser.
2014-04-28 10:25:05 +02:00
Yorick Peterse 785ec26fe7 Create Element instances before recursing. 2014-04-28 10:21:34 +02:00
Yorick Peterse 9939cf49eb Move parser callback code into dedicated methods. 2014-04-28 10:18:55 +02:00
Yorick Peterse 5d05aed6ec Corrected docs for XML::Parser. 2014-04-26 12:57:35 +02:00
Yorick Peterse f53fe4ed7c Reset the lexer when resetting the parser.
Also removed the unused @lines instance variable.
2014-04-25 00:15:24 +02:00
Yorick Peterse 83ff0e6656 Various small parser cleanups. 2014-04-25 00:07:53 +02:00
Yorick Peterse ecf6851711 Revert "Move linking of child nodes to a dedicated mixin."
This doesn't actually make things any easier. It also introduces a weirdly
named mixin.

This reverts commit 0968465f0c.
2014-04-24 21:16:31 +02:00
Yorick Peterse 0968465f0c Move linking of child nodes to a dedicated mixin. 2014-04-24 09:43:50 +02:00
Yorick Peterse 08d412da7e First shot at removing the AST layer.
The AST layer is being removed because it doesn't really serve a useful
purpose. In particular when creating a streaming parser the AST nodes would
only introduce extra overhead.

As a result the parser will emit a DOM tree directly instead of first emitting
an AST.
2014-04-21 23:05:39 +02:00
Yorick Peterse 9ee9ec14cb Lexer: only pop elements when needed. 2014-04-19 01:10:32 +02:00
Yorick Peterse c8c9da2922 Track the XML fixture in Git.
To make running benchmarks easier we'll track the XML file in Git in its
compressed form. I also decreased the size of the XML file from ~50 MB to
~10MB.
2014-04-19 01:03:14 +02:00
Yorick Peterse 97d8450cba Removed the `regenerate` task. 2014-04-19 00:59:09 +02:00
Yorick Peterse 6f1ce17b31 Benchmark for lexer lines/second.
This benchmark uses a fixture file that is automatically downloaded.
2014-04-17 20:06:24 +02:00