Haven't bumped into any problems just yet. However, in theory all sorts of evil
could happen here. Which is part of the problem of C: so much shit is undefined
behaviour that you can take a single step and fall in 15 holes at the same time.
In theory, because nobody bothered to actually specify it properly.
When lexing XML entities such as & and < these sequences are now
converted into their "actual" forms. In turn, Oga::XML::Text#to_xml ensures they
are encoded when the method is called.
Performance wise this puts some strain on the lexer, for every T_TEXT/T_STRING
node now potentially has to have its content modified. In the benchmark
xml/lexer/string_average_bench.rb the average processing time is now about the
same as before the improvements made in
8db77c0a09. I was hoping that the lexer would
still be a bit faster, but alas this is not the case. Doing this in native code
would be a nightmare as C doesn't have a proper string replacement function. I'm
not old/sadistic enough to write on myself just yet.
This fixes#49
This was a gimmick in the first place. It doesn't work well with IO instances
(= requires re-reading of the input), the code was too complex and it wasn't
that useful in the end. Lets just get rid of it.
This fixes#53.
Instead of relying on String#count for counting newlines in text nodes, Oga now
does this in C/Java. String#count isn't exactly the fastest way of counting
characters. Performance was measured using
benchmark/xml/lexer/string_average_bench.rb. Before this patch the results were
as following:
MRI: 0.529s
Rbx: 4.965s
JRuby: 0.622s
After this patch:
MRI: 0.424s
Rbx: 1.942s
JRuby: 0.665s => numbers vary a bit, seem roughly the same as before
The commands used for benchmarking:
$ rake clean # to make sure that C exts aren't shared between MRI/Rbx
$ rake generate
$ rake fixtures
$ ruby benchmark/xml/lexer/string_average_bench.rb
The big difference for Rbx is probably due to the implementation of String#count
not being super fast. Some changes were made
(https://github.com/rubinius/rubinius/pull/3133) to the method, but this hasn't
been released yet.
JRuby seems to perform in a similar way, so either it was already optimizing
things for me or I suck at writing well performing Java code.
This fixes#51.
This ensures we only call String#downcase if we can't find an all lowercased
*and* all uppercased version of the element name. This in turn can save as many
object allocations as there are HTML opening tags.
This fixes#52.
Instead of using `namespace.name` lets just use `namespace_name`. This fixes the
problem of serializing attributes where the namespace prefix is "xmlns" as the
namespace for this isn't registered by default.
This fixes#47.
This API is a little bit dodgy (similar to Nokogiri's API) due to the use of
separate parser and handler classes. This is done to ensure that the return
values of callback methods (e.g. on_element) aren't used by Racc for building
AST trees. This also ensures that whatever variables are set by the handler
don't conflict with any variables of the parser.
This fixes#42.
While still a bit cryptic this is probably as best as we can get it. An example:
Oga.parse_xml("<namefoo:bar=\"10\"")
parser.rb:116:in `on_error': Unexpected string on line 1: (Racc::ParseError)
=> 1: <namefoo:bar="10"
This fixes#43.
When an XML element has no child nodes a self-closing tag is used. When parsing
documents/elements in HTML mode this is only done if the element is a so called
"void element" (e.g. <link> tags).
This fixes#46.
When the default namespace is registered (using xmlns="...") Oga now properly
sets the namespace of the container and all child elements.
This fixes#44.