Previous HTML such as this would be lexed incorrectly:
<div>
<ul>
<li>foo
</ul>
inside div
</div>
outside div
The lexer would see this as the following instead:
<div>
<ul>
<li>foo</li>
inside div
</ul>
outside div
</div>
This commit exposes the name of the closing tag to
XML::Lexer#on_element_end (omitted for self closing tags). This can be
used to automatically close nested tags that were left open, ensuring
the above HTML is lexer correctly.
The new setup ignores namespace prefixes as these are not used in HTML,
XML in turn won't even run the code to begin with since it doesn't allow
one to leave out closing tags.
By encoding single/double quotes we can potentially break input, so lets
stop doing this. This now ensures that this:
<foo>a"b</foo>
Is actually serialized back into the exact same instead of being
serialized into:
<foo>a"b</foo>
This allows for more fine grained control over when to close certain
elements. For example, an unclosed <tr> element should be closed first
when bumping into any element other than <td> or <th>. Using the old
NodeNameSet this would mean having to list every possible HTML element
out there. Using this new setup one can just create a whitelist of the
<td> and <th> elements.
While the MIT license is a fantastic license for those too lazy (or
unable) to understand more complex licenses it's too lax when it comes
to protecting authors (= me). For example, there are no clauses regarding
patents or ownership of source code. This means that patent trolls
could, in theory, drag me to court.
Of course one can still do that when using the MPL, but at least it has
an explicit clause regarding patents. The MPL also provides a nice
balance between the MIT license and the Apache license. I don't like the
Apache license as it requires listing any significant changes in every
changed file.
In short, I don't really care what people do with my software (they
could sell it for millions for all I care), as long as they don't drag
me to court or otherwise hold me accountable for something.
When closing certain HTML elements the lexer should also close whatever
parent elements remain. For example, consider the following HTML:
<table>
<thead>
<tr>
<th>Foo
<th>Bar
<tbody>
...
</tbody>
</table>
Here the "<tbody>" element shouldn't only close the "<th>Bar" element
but also the parent "<tr>" and "<thead>" elements. This ensures we'd end
up with the following HTML:
<table>
<thead>
<tr>
<th>Foo</th>
<th>Bar</th>
</tr>
</thead>
<tbody>
...
</tbody>
</table>
Instead of garbage along the lines of this:
<table>
<thead>
<tr>
<th>Foo</th>
<th>Bar</th>
<tbody>
...
</tbody>
</table></tr></thead>
Fixes#99 (hopefully for good this time)
By using AST::Node#children directly with a splat we save ourselves an
extra method call. This in turn speeds up both the
xpath/evaluator/big_xml_average_bench.rb and
xpath/evaluator/node_matches_bench.rb benchmarks a little bit.
Using the benchmark xpath/evaluator/node_matches_bench.rb the results
prior to this commit were as following for 3 cases:
name only: 737633 i/s
namespace wildcard: 612196 i/s
name wildcard: 516030 i/s
With this commit said numbers have changed to the following:
name only: 746086 i/s
namespace wildcard: 1097168 i/s
name wildcard: 1151255 i/s
This results in the following increase of performance for each case:
name only: 1,011x (insignificant)
namespace wildcard: 1,79x
name wildcard: 2,23x
In the benchmark xpath/evaluator/big_xml_average_bench.rb the difference
isn't really noticable as said benchmark only queries elements by names,
of which the performance hasn't really improved.