This allows the lexer to process input such as:
<a href=foo"></a>
For XML input the lexer still expects properly opened/closed attribute
values.
Fixes#109
It pains me that I have to write this in the first place (I was hoping
people would figure this out). Sadly history has shown it's required to
document this properly.
This ensures that entities such as "½" are decoded properly.
Previously this would be ignored as the regular expression used for this
only matched [a-zA-Z].
This was adapted from PR #111.
Previous HTML such as this would be lexed incorrectly:
<div>
<ul>
<li>foo
</ul>
inside div
</div>
outside div
The lexer would see this as the following instead:
<div>
<ul>
<li>foo</li>
inside div
</ul>
outside div
</div>
This commit exposes the name of the closing tag to
XML::Lexer#on_element_end (omitted for self closing tags). This can be
used to automatically close nested tags that were left open, ensuring
the above HTML is lexer correctly.
The new setup ignores namespace prefixes as these are not used in HTML,
XML in turn won't even run the code to begin with since it doesn't allow
one to leave out closing tags.
By encoding single/double quotes we can potentially break input, so lets
stop doing this. This now ensures that this:
<foo>a"b</foo>
Is actually serialized back into the exact same instead of being
serialized into:
<foo>a"b</foo>
This allows for more fine grained control over when to close certain
elements. For example, an unclosed <tr> element should be closed first
when bumping into any element other than <td> or <th>. Using the old
NodeNameSet this would mean having to list every possible HTML element
out there. Using this new setup one can just create a whitelist of the
<td> and <th> elements.
While the MIT license is a fantastic license for those too lazy (or
unable) to understand more complex licenses it's too lax when it comes
to protecting authors (= me). For example, there are no clauses regarding
patents or ownership of source code. This means that patent trolls
could, in theory, drag me to court.
Of course one can still do that when using the MPL, but at least it has
an explicit clause regarding patents. The MPL also provides a nice
balance between the MIT license and the Apache license. I don't like the
Apache license as it requires listing any significant changes in every
changed file.
In short, I don't really care what people do with my software (they
could sell it for millions for all I care), as long as they don't drag
me to court or otherwise hold me accountable for something.
When closing certain HTML elements the lexer should also close whatever
parent elements remain. For example, consider the following HTML:
<table>
<thead>
<tr>
<th>Foo
<th>Bar
<tbody>
...
</tbody>
</table>
Here the "<tbody>" element shouldn't only close the "<th>Bar" element
but also the parent "<tr>" and "<thead>" elements. This ensures we'd end
up with the following HTML:
<table>
<thead>
<tr>
<th>Foo</th>
<th>Bar</th>
</tr>
</thead>
<tbody>
...
</tbody>
</table>
Instead of garbage along the lines of this:
<table>
<thead>
<tr>
<th>Foo</th>
<th>Bar</th>
<tbody>
...
</tbody>
</table></tr></thead>
Fixes#99 (hopefully for good this time)