This allows Oga to parse documents that contain an XML declaration
somewhere other than at the document root. Oga still only assigns the
XML declaration to the document when it is at the top level. As far as
I can tell this matches the behaviour of libxml and the XML specification.
While using recursion is an easy way of generating XML it can lead to
the call stack overflowing when serialising documents with lots of
nested nodes.
Generally there are two ways of working around this:
1. Use an explicit stack (e.g. an array or a queue of sorts) instead of
relying on the call stack.
2. Use an algorithm that doesn't use a stack at all (e.g. Morris
traversal).
This commit introduces the XML::Generator class which can serialize
documents back to XML without using a stack at all. This class takes
advantage of XML nodes having access to not only their child nodes, but
also their siblings and their parents.
All XML serialisation logic now resides in the XML::Generator class. In
turn the various "to_xml" methods just use this class and serialize
everything starting at "self".
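As a rough illustration of the stack-less approach described above, the
sketch below walks a tree using only child, sibling and parent links. The
Node class and the generate method are hypothetical and only emit element
tags; this is not the actual XML::Generator code.

```ruby
# Hypothetical minimal node type (an assumption for this sketch); Oga's real
# node classes are richer.
class Node
  attr_reader :name, :children
  attr_accessor :parent, :next_sibling

  def initialize(name, children = [])
    @name = name
    @children = children

    # Link every child to its parent and to the sibling that follows it.
    children.each_with_index do |child, index|
      child.parent = self
      child.next_sibling = children[index + 1]
    end
  end
end

# Serialises the tree rooted at `root` without recursion and without an
# explicit stack, by following child, sibling and parent links.
def generate(root)
  output = +''
  current = root

  while current
    output << "<#{current.name}>"

    if (first = current.children.first)
      # Descend into the first child before doing anything else.
      current = first
      next
    end

    # No children: close this node, then climb until a sibling is found or
    # the starting node has been closed.
    loop do
      output << "</#{current.name}>"

      if current.equal?(root)
        current = nil
        break
      end

      if current.next_sibling
        current = current.next_sibling
        break
      end

      current = current.parent
    end
  end

  output
end

tree = Node.new('people', [Node.new('person', [Node.new('name')]), Node.new('person')])
xml_output = generate(tree)
```

Because the starting node is checked before its siblings, the same loop can
serialise a subtree without running on into the rest of the document.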
Re-using the Binding of the XPath::Compiler#compile method would lead to
race conditions, and possibly a memory leak due to the Binding sticking
around for the compiled Proc's lifetime.
By using a dedicated class (and its corresponding Binding) we can work
around this. Access to this class is not synchronized as compiled Procs
don't mutate their enclosing environment.
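A minimal sketch of the idea, assuming a small class whose only job is to
evaluate generated source. The EvalContext name and its evaluate method are
assumptions made for illustration, not necessarily what Oga uses.

```ruby
# Hypothetical stand-in for a dedicated evaluation context.
class EvalContext
  # Evaluates Ruby source with this instance as "self". A Proc created by
  # the source captures only this small object's environment instead of the
  # local variables of whichever method triggered the compilation.
  def evaluate(source)
    instance_eval(source)
  end
end

# Source as a compiler might generate it (assumed for this sketch).
source = 'lambda { |input| input.upcase }'

# Each thread compiles its own Proc; no state is shared between them.
compiled = 4.times.map do
  Thread.new { EvalContext.new.evaluate(source) }
end.map(&:value)

results = compiled.map { |block| block.call('name') }
```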
The race condition can be demonstrated using code such as the
following:
    require 'oga'

    xml = <<-EOF
    <people>
      <person>
        <name>Alice</name>
      </person>
      <person>
        <name>Bob</name>
      </person>
      <person>
        <name>Eve</name>
      </person>
    </people>
    EOF

    # Parse and query the same XML from four threads at once.
    4.times.map do
      Thread.new do
        10_000.times do
          document = Oga.parse_xml(xml)

          document.at_xpath('people/person/name').text
        end
      end
    end.each(&:join)
Running this code would result in NoMethodErrors due to "at_xpath"
returning nil instead of an Oga::XML::Element.
Instead of decoding entities in the lexer we'll do this whenever XML::Text#text
is called. This removes the overhead from the parsing phase and ensures the
process is only triggered when actually needed. Note that calling #to_xml and/or
the #inspect methods on a Text (or parent) instance will also trigger the entity
conversion process.
The new entity decoding API supports both regular entities (e.g. &amp;) as well
as codepoint based entities (both decimal and hexadecimal codepoints).
To allow safe read-only access to Text instances from multiple threads a mutex
is used. This mutex ensures that only one thread at a time can trigger the
conversion process.
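The lazy, mutex-guarded decoding described above could look roughly like the
sketch below. LazyText, its entity table and its pattern are hypothetical and
far smaller than a real implementation; invalid codepoints are not handled.

```ruby
# Hypothetical text node that decodes entities on first access only.
class LazyText
  # Deliberately small table of named entities (an assumption).
  ENTITIES = {
    '&amp;'  => '&',
    '&lt;'   => '<',
    '&gt;'   => '>',
    '&quot;' => '"',
    '&apos;' => "'"
  }.freeze

  # Matches hexadecimal codepoints, decimal codepoints and named entities.
  PATTERN = /&(?:#x(\h+)|#(\d+)|\w+);/

  def initialize(raw)
    @raw = raw
    @decoded = nil
    @mutex = Mutex.new
  end

  # Decodes the raw text once. The mutex ensures that concurrent readers do
  # not run the conversion at the same time.
  def text
    @mutex.synchronize do
      @decoded ||= decode(@raw)
    end
  end

  private

  def decode(input)
    input.gsub(PATTERN) do |match|
      if $1
        [$1.to_i(16)].pack('U') # hexadecimal codepoint
      elsif $2
        [$2.to_i].pack('U')     # decimal codepoint
      else
        ENTITIES.fetch(match, match) # unknown entities are left as-is
      end
    end
  end
end

node = LazyText.new('a &amp; b &lt; c &#65; &#x42; &unknown;')
decoded = 4.times.map { Thread.new { node.text } }.map(&:value)
```

Decoding in a single gsub pass means that input such as "&amp;lt;" becomes
"&lt;" rather than being decoded twice.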
Fixes #68
The new setup will not involve a separate transformation stage; instead the CSS
parser will directly emit an XPath AST. This reduces the overhead needed for
parsing/evaluating CSS selectors while also simplifying the code. The downside
is that I basically have to rewrite 80% of the parser.
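To make the idea of emitting XPath nodes directly a bit more concrete, here
is a very small sketch. It only handles element names joined by descendant
(whitespace) and child (">") combinators, and both the css_to_xpath_ast
method and the node shapes are invented for illustration; they are not the
AST format used by Oga.

```ruby
# Turns a tiny subset of CSS into a nested-array "XPath AST" in one pass,
# without building an intermediate CSS tree first.
def css_to_xpath_ast(selector)
  steps = []
  axis = 'descendant' # whitespace between names means "descendant"

  selector.scan(/>|[\w-]+/).each do |token|
    if token == '>'
      # The next element name is a direct child of the previous one.
      axis = 'child'
    else
      steps << [:axis, axis, [:test, token]]
      axis = 'descendant'
    end
  end

  [:path, *steps]
end

ast = css_to_xpath_ast('people > person name')
```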
When lexing XML entities such as &amp; and &lt; these sequences are now
converted into their "actual" forms. In turn, Oga::XML::Text#to_xml ensures they
are encoded when the method is called.
Performance-wise this puts some strain on the lexer, as every T_TEXT/T_STRING
node now potentially has to have its content modified. In the benchmark
xml/lexer/string_average_bench.rb the average processing time is now about the
same as before the improvements made in 8db77c0a09. I was hoping that the lexer
would still be a bit faster, but alas this is not the case. Doing this in native
code would be a nightmare as C doesn't have a proper string replacement
function. I'm not old/sadistic enough to write one myself just yet.
This fixes #49.
This API is a little bit dodgy (similar to Nokogiri's API) due to the use of
separate parser and handler classes. This is done to ensure that the return
values of callback methods (e.g. on_element) aren't used by Racc for building
AST trees. This also ensures that whatever variables are set by the handler
don't conflict with any variables of the parser.
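The separation between parser and handler can be sketched as follows. The
SketchSaxParser and NameCollector classes, and the pre-tokenised event list,
are assumptions made to keep the example self-contained; they are not Oga's
SAX API.

```ruby
# Hypothetical parser that forwards events to a separate handler object.
class SketchSaxParser
  def initialize(handler, events)
    @handler = handler
    @events = events
  end

  # Dispatches each event to the handler when a matching callback exists.
  # The callback's return value is discarded, so it can never end up in
  # whatever structure the parser itself builds.
  def parse
    @events.each do |name, *arguments|
      callback = :"on_#{name}"

      @handler.public_send(callback, *arguments) if @handler.respond_to?(callback)
    end

    nil
  end
end

# Handler state lives in its own object, so its instance variables cannot
# clash with those of the parser.
class NameCollector
  attr_reader :names

  def initialize
    @names = []
  end

  def on_element(namespace, name, attributes = {})
    @names << name

    :ignored # deliberately returns something the parser must not use
  end
end

handler = NameCollector.new
events = [
  [:element, nil, 'people'],
  [:text, 'ignored because the handler has no on_text'],
  [:element, nil, 'person', { 'id' => '1' }]
]

result = SketchSaxParser.new(handler, events).parse
```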
This fixes #42.
When an XML element has no child nodes a self-closing tag is used. When parsing
documents/elements in HTML mode this is only done if the element is a so-called
"void element" (e.g. <link> tags).
This fixes #46.
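The self-closing rule for empty elements can be sketched like this. The
serialize_element helper and the exact output formatting are assumptions; the
list holds the void elements defined by the HTML specification.

```ruby
# Void elements as defined by the HTML specification.
VOID_ELEMENTS = %w[
  area base br col embed hr img input link meta param source track wbr
].freeze

# Serialises an element given its name and the already serialised children.
# Hypothetical helper: attributes and namespaces are left out on purpose.
def serialize_element(name, children_xml, html: false)
  self_closing = children_xml.empty?

  # In HTML mode only void elements may be written as self-closing tags.
  self_closing = false if html && !VOID_ELEMENTS.include?(name)

  if self_closing
    "<#{name} />"
  else
    "<#{name}>#{children_xml}</#{name}>"
  end
end

xml_empty  = serialize_element('person', '')
html_void  = serialize_element('link', '', html: true)
html_empty = serialize_element('p', '', html: true)
```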
When an attribute is prefixed with "xml" the default XML namespace should be
used automatically. This namespace is not registered on element level by default
because it is never declared manually; instead it's a "magic" namespace. This
also ensures we match the behaviour of libxml more closely, hopefully reducing
confusion.
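A minimal sketch of such a lookup, assuming namespaces registered on an
element are available as a Hash of prefix to URI. The resolve_namespace
helper is hypothetical; the URI is the one the Namespaces in XML
specification binds to the "xml" prefix.

```ruby
# URI bound to the reserved "xml" prefix by the Namespaces in XML spec.
XML_NAMESPACE_URI = 'http://www.w3.org/XML/1998/namespace'

# Resolves a prefix to a namespace URI. The "xml" prefix is handled before
# consulting the manually registered namespaces, as it never appears there.
def resolve_namespace(prefix, registered)
  return XML_NAMESPACE_URI if prefix == 'xml'

  registered[prefix]
end

registered = { 'foo' => 'http://example.com/foo' }

magic   = resolve_namespace('xml', registered)
manual  = resolve_namespace('foo', registered)
missing = resolve_namespace('bar', registered)
```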
After discussing this with @headius I've decided to do this the manual way
anyway. Apparently the basic load service stuff is deprecated and not very
reliable.
While I've tried to keep Oga pure Ruby for as long as possible the performance
of Ragel's Ruby output was not worth the trouble. For example, lexing 10MB of
XML would take 5 to 6 seconds at least. Nokogiri on the other hand can parse
that same XML into a DOM document in about 300 milliseconds. Such a big
performance difference is not acceptable.
To work around this the XML/HTML lexer will be implemented in C for
MRI/Rubinius and Java for JRuby. For now there's only a C extension as I
haven't read up yet on the JRuby API. The end goal is to provide some sort of
Ragel "template" that can be used to generate the corresponding C/Java
extension code. This would remove the need to duplicate the grammar and
associated code.
The native extension setup is a hybrid between native and Ruby. The raw Ragel
stuff happens in C/Java while the actual logic of actions happens in Ruby. This
adds a small amount of overhead but makes it much easier to maintain the lexer.
Even with this extra overhead the performance is much better than pure Ruby.
The 10MB of XML mentioned above is lexed in about 600 milliseconds. In other
words, it's roughly 10 times faster.
This parser extends the regular DOM parser, but delegates certain nodes to a
block instead of building a DOM tree.
The API is a bit raw in its current form but I'll extend it and make it a bit
more user-friendly in the following commits. In particular I want to make it
easier to figure out if a certain node is nested inside another node.