Commit Graph

11 Commits

Author SHA1 Message Date
Yorick Peterse 4fa88fcbde Cache rb_intern/symbol lookups in the lexer.
For JRuby this has little to no benefits as it uses strings for method names.
However, both MRI and Rubinius will perform a Symbol lookup whenever rb_intern()
is called. By doing this once for all callback names and caching the resulting
VALUE objects the lexer timings can be reduced by about 25%. In case of the
benchmark benchmark/xml/lexer/string_average_bench.rb this means it runs in
around 500ms instead of 700ms.
2014-11-22 01:53:37 +01:00
Yorick Peterse fca88a69d1 Track Ragel call stacks in the Java lexer.
This will be needed for the upcoming string lexing changes.
2014-10-26 11:39:19 +01:00
Yorick Peterse 8db77c0a09 Count newlines of text nodes in native code.
Instead of relying on String#count for counting newlines in text nodes, Oga now
does this in C/Java. String#count isn't exactly the fastest way of counting
characters. Performance was measured using
benchmark/xml/lexer/string_average_bench.rb. Before this patch the results were
as following:

    MRI:   0.529s
    Rbx:   4.965s
    JRuby: 0.622s

After this patch:

    MRI:   0.424s
    Rbx:   1.942s
    JRuby: 0.665s => numbers vary a bit, seem roughly the same as before

The commands used for benchmarking:

    $ rake clean # to make sure that C exts aren't shared between MRI/Rbx
    $ rake generate
    $ rake fixtures
    $ ruby benchmark/xml/lexer/string_average_bench.rb

The big difference for Rbx is probably due to the implementation of String#count
not being super fast. Some changes were made
(https://github.com/rubinius/rubinius/pull/3133) to the method, but this hasn't
been released yet.

JRuby seems to perform in a similar way, so either it was already optimizing
things for me or I suck at writing well performing Java code.

This fixes #51.
2014-09-25 22:49:11 +02:00
Yorick Peterse 81edce2eb8 Fixed lexing of XML comments.
The previous setup would consume too much. For example the following HTML:

    <a><!--foo--><b><!--bar--></b></a>

would result in the following T_COMMENT token:

    "foo--><b><!--bar"

The new setup requires the marking of a start position. I'm not a huge fan of
this but there doesn't appear to be a way around this.
2014-08-15 20:42:32 +02:00
Yorick Peterse 629dcd3fe6 Support for IO inputs in the lexer.
Using IO/StringIO objects one can parse large XML files without first having to
read the entire file into memory. This can potentially save a lot of memory at
the cost of a slightly slower runtime.

For IO like instances the lexer will consume the input line by line. If a
String is given it's consumed as a whole instead. A small side effect of
reading the input line by line is that text such as "foo\nbar" will be lexed as
two tokens instead of one.

Fixes #19.
2014-05-26 00:30:39 +02:00
Yorick Peterse 6b9d65923a Use a method for getting input in the XML lexer.
Instead of directly accessing the `data` instance variable the C/Java code now
uses the method `read_data`. This is part of one of the various steps required
to allow Oga to read data from IO like instances. It also means I can freely
change the name of the instance variable without also having to change the
C/Java code.
2014-05-21 00:27:23 +02:00
Yorick Peterse 4542f06d0f Replaced fcall/fret with fnext in the XML lexer.
With the rules being cleaned up/moved around a bit we can drop the use of
fcall/fret. This saves the need of having to maintain a stack (position).
2014-05-21 00:08:48 +02:00
Yorick Peterse bbdc7966db Documentation for the JRuby extension. 2014-05-07 10:24:24 +02:00
Yorick Peterse 3afef5f7cc Lexer support for JRuby.
JRuby now passes all tests. Benchmark wise it completes the big XML benchmark
in about 500-600 milliseconds.
2014-05-07 09:40:22 +02:00
Yorick Peterse b9a4038e42 Callback boilerplate for the Java lexer. 2014-05-07 01:01:24 +02:00
Yorick Peterse 9abc5c1c92 Separated the Java and C ext codebases. 2014-05-07 00:29:10 +02:00