This uses stricter (and more correct) rules in both the lexer and the parser.
The resulting AST has also received a small rework to make it more compact and
less confusing.
This includes support for the crazy 2n+1 syntax you can use with selectors such
as :nth-child().
CSS selectors: doing what XPath already does using an even crazier syntax,
because screw you.
Haven't bumped into any problems just yet. However, in theory all sorts of evil
could happen here. Which is part of the problem of C: so much shit is undefined
behaviour that you can take a single step and fall in 15 holes at the same time.
In theory, because nobody bothered to actually specify it properly.
When lexing XML entities such as & and < these sequences are now
converted into their "actual" forms. In turn, Oga::XML::Text#to_xml ensures they
are encoded when the method is called.
Performance wise this puts some strain on the lexer, for every T_TEXT/T_STRING
node now potentially has to have its content modified. In the benchmark
xml/lexer/string_average_bench.rb the average processing time is now about the
same as before the improvements made in
8db77c0a09. I was hoping that the lexer would
still be a bit faster, but alas this is not the case. Doing this in native code
would be a nightmare as C doesn't have a proper string replacement function. I'm
not old/sadistic enough to write on myself just yet.
This fixes#49
This was a gimmick in the first place. It doesn't work well with IO instances
(= requires re-reading of the input), the code was too complex and it wasn't
that useful in the end. Lets just get rid of it.
This fixes#53.
Instead of relying on String#count for counting newlines in text nodes, Oga now
does this in C/Java. String#count isn't exactly the fastest way of counting
characters. Performance was measured using
benchmark/xml/lexer/string_average_bench.rb. Before this patch the results were
as following:
MRI: 0.529s
Rbx: 4.965s
JRuby: 0.622s
After this patch:
MRI: 0.424s
Rbx: 1.942s
JRuby: 0.665s => numbers vary a bit, seem roughly the same as before
The commands used for benchmarking:
$ rake clean # to make sure that C exts aren't shared between MRI/Rbx
$ rake generate
$ rake fixtures
$ ruby benchmark/xml/lexer/string_average_bench.rb
The big difference for Rbx is probably due to the implementation of String#count
not being super fast. Some changes were made
(https://github.com/rubinius/rubinius/pull/3133) to the method, but this hasn't
been released yet.
JRuby seems to perform in a similar way, so either it was already optimizing
things for me or I suck at writing well performing Java code.
This fixes#51.
This ensures we only call String#downcase if we can't find an all lowercased
*and* all uppercased version of the element name. This in turn can save as many
object allocations as there are HTML opening tags.
This fixes#52.
Instead of using `namespace.name` lets just use `namespace_name`. This fixes the
problem of serializing attributes where the namespace prefix is "xmlns" as the
namespace for this isn't registered by default.
This fixes#47.