Prevents a superfluous end tag of a self-closing HTML tag from
closing its parent element prematurely, for example:
```html
<object><param></param><param></param></object>
```
(note <param> is self closing) being turned into:
```html
<object><param/></object><param/>
```
This is a Nokogiri extension (as far as I'm aware) but it's useful
enough to also include in Oga. Selectors such as "foo:nth(2)" are simply
compiled to XPath "descendant::foo[position() = 2]".
Fixes#123
This allows for parsing of HTML such as:
<a href=lol("javascript")></a>
Here the "href" attribute would have its value set to:
lol("javascript")
Fixes#119
```
element = Oga::XML::Element.new(:name => 'div')
some_node.replace(element)
```
You can also pass a `String` to `replace` and it will be replaced with
a `Oga::XML::Text` node
```
some_node.replace('this will replace the current node with a text node')
```
closes#115
Currently this only disabled the automatic insertion of closing tags, in
the future this may also disable other features if deemed worth the
effort.
Fixes#107
This allows the lexer to process input such as:
<a href=foo"></a>
For XML input the lexer still expects properly opened/closed attribute
values.
Fixes#109
This ensures that entities such as "½" are decoded properly.
Previously this would be ignored as the regular expression used for this
only matched [a-zA-Z].
This was adapted from PR #111.
Previous HTML such as this would be lexed incorrectly:
<div>
<ul>
<li>foo
</ul>
inside div
</div>
outside div
The lexer would see this as the following instead:
<div>
<ul>
<li>foo</li>
inside div
</ul>
outside div
</div>
This commit exposes the name of the closing tag to
XML::Lexer#on_element_end (omitted for self closing tags). This can be
used to automatically close nested tags that were left open, ensuring
the above HTML is lexer correctly.
The new setup ignores namespace prefixes as these are not used in HTML,
XML in turn won't even run the code to begin with since it doesn't allow
one to leave out closing tags.
By encoding single/double quotes we can potentially break input, so lets
stop doing this. This now ensures that this:
<foo>a"b</foo>
Is actually serialized back into the exact same instead of being
serialized into:
<foo>a"b</foo>
When closing certain HTML elements the lexer should also close whatever
parent elements remain. For example, consider the following HTML:
<table>
<thead>
<tr>
<th>Foo
<th>Bar
<tbody>
...
</tbody>
</table>
Here the "<tbody>" element shouldn't only close the "<th>Bar" element
but also the parent "<tr>" and "<thead>" elements. This ensures we'd end
up with the following HTML:
<table>
<thead>
<tr>
<th>Foo</th>
<th>Bar</th>
</tr>
</thead>
<tbody>
...
</tbody>
</table>
Instead of garbage along the lines of this:
<table>
<thead>
<tr>
<th>Foo</th>
<th>Bar</th>
<tbody>
...
</tbody>
</table></tr></thead>
Fixes#99 (hopefully for good this time)
This ensures that HTML such as this:
<li>foo
<li>bar
is parsed as this:
<li>foo</li>
<li>bar</li>
and not as this:
<li>
foo
<li>bar</li>
</li>
Fixes#97
This makes it easier to automatically insert preceding tokens when
starting a new element as we now have access to the name. Previously
on_element_start would be invoked first which doesn't receive an
argument.
The XML/HTML lexer is now capable of processing most invalid XML/HTML
(that I can think of at least). This is achieved by inserting missing
closing tags (where needed) and/or ignoring excessive closing tags. For
example, HTML such as this:
<a></a></p>
Results in the following tokens:
[:T_ELEM_START, nil, 1]
[:T_ELEM_NAME, 'a', 1]
[:T_ELEM_CLOSE, nil, 1]
In turn this HTML:
<a>
Results in these tokens:
[:T_ELEM_START, nil, 1]
[:T_ELEM_NAME, 'a', 1]
[:T_ELEM_CLOSE, nil, 1]
Fixes#84
Previously a single Ragel machine was used for processing HTML
script and style tags. This had the unfortunate side-effect that the
following was not parsed correctly (while being valid HTML):
<script>
var foo = "</style>";
</script>
The same applied to style tags:
<style>
/* </script> */
</style>
By using separate machines we can work around the above issue. The
downside is that this can produce multiple T_TEXT nodes, which have to
be stitched back together in the parser.
This adds lexing support for HTML/XML such as:
<foo bar="""></foo>
While technically invalid, some websites (e.g. yahoo.com) contain HTML
just like this.
The lexer handles this as following:
1. When we're in the "element_head" machine, do business as usual until
we bump into a "=".
2. Call (using Ragel's "fcall") the machine to use for processing the
attribute value (if any).
3. In this machine quoted strings are processed. The moment a string has
been processed the lexer jumps right back in to the "element_head"
machine. This ensures that any stray quotes are ignored instead of
being processed as extra attribute values (eventually leading to
parsing errors due to unbalanced quotes).