Using IO/StringIO objects one can parse large XML files without first having to
read the entire file into memory. This can potentially save a lot of memory at
the cost of a slightly slower runtime.
For IO like instances the lexer will consume the input line by line. If a
String is given it's consumed as a whole instead. A small side effect of
reading the input line by line is that text such as "foo\nbar" will be lexed as
two tokens instead of one.
Fixes#19.
Instead of directly accessing the `data` instance variable the C/Java code now
uses the method `read_data`. This is part of one of the various steps required
to allow Oga to read data from IO like instances. It also means I can freely
change the name of the instance variable without also having to change the
C/Java code.
After discussing this with @headius I've decided to do this the manual way
anyway. Apparently the basic load service stuff is deprecated and not very
reliable.
While I've tried to keep Oga pure Ruby for as long as possible the performance
of Ragel's Ruby output was not worth the trouble. For example, lexing 10MB of
XML would take 5 to 6 seconds at least. Nokogiri on the other hand can parse
that same XML into a DOM document in about 300 miliseconds. Such a big
performance difference is not acceptable.
To work around this the XML/HTML lexer will be implemented in C for
MRI/Rubinius and Java for JRuby. For now there's only a C extension as I
haven't read up yet on the JRuby API. The end goal is to provide some sort of
Ragel "template" that can be used to generate the corresponding C/Java
extension code. This would remove the need of duplicating the grammar and
associated code.
The native extension setup is a hybrid between native and Ruby. The raw Ragel
stuff happens in C/Java while the actual logic of actions happens in Ruby. This
adds a small amount of overhead but makes it much easier to maintain the lexer.
Even with this extra overhead the performance is much better than pure Ruby.
The 10MB of XML mentioned above is lexed in about 600 miliseconds. In other
words, it's 10 times faster.
Instead of wrapping a predicate method around the ivar we'll just access it
directly. This reduces average lexing times in the big XML benchmark from 7,5
to ~7 seconds.
Instead of using instance variables for ts, te, etc we'll use local variables.
Grand wizard overloard @whitequark suggested that this would be quite a bit
faster, which turns out to be true. For example, the big XML lexer benchmark
would, prior to this commit, complete in about 9 - 9,3 seconds. With this
commit that hovers around 8,5 seconds.
This adds the ability to more easily act upon specific node types and nestings
when using the pull parsing API.
A basic example of this API looks like the following (only including relevant
code):
parser.parse do |node|
parser.on(:element, %w{people person}) do
people << {:name => nil, :age => nil}
end
parser.on(:text, %w{people person name}) do
people.last[:name] = node.text
end
parser.on(:text, %w{people person age}) do
people.last[:age] = node.text.to_i
end
end
This fixes#6.
Tracking the names of nested elements makes it a lot easier to do contextual
pull parsing. Without this it's impossible to know what context the parser is
in at a given moment.
For memory reasons the parser currently only tracks the element names. In the
future it might perhaps also track extra information to make parsing easier.
This parser extends the regular DOM parser but instead delegates certain nodes
to a block instead of building a DOM tree.
The API is a bit raw in its current form but I'll extend it and make it a bit
more user friendly in the following commits. In particular I want to make it
easier to figure out if a certain node is nested inside another node.