This changes the XPath AST so that every segment in a path (e.g.
foo/bar) is parsed as a child node of the node that precedes it. For
example, take the following expression:
foo/bar
This used to be parsed into the following AST:
(path
(axis "child" (test nil "foo"))
(axis "child" (test nil "bar")))
This is now parsed into the following AST:
(axis "child"
(test nil "foo")
(axis "child"
(test nil "bar")))
This new AST is much easier to deal with in the XPath::Compiler class,
especially when trying to ensure that each segment operates on the
correct input.
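To make the difference between the two shapes concrete, here is a minimal
sketch that folds the old flat form into the new nested form. It models AST
nodes as plain Ruby arrays instead of the parser's real node objects, and
`nest_path` is a hypothetical helper used only for illustration:

```ruby
# AST nodes are modelled as plain arrays: [type, *children].
# Illustration only; this is not how the parser builds its AST.
def nest_path(path)
  type, *steps = path
  raise ArgumentError, 'expected a (path ...) node' unless type == :path

  # Fold from the right so that every step ends up as the last child of
  # the step that precedes it.
  steps.reverse.reduce(nil) do |inner, step|
    inner ? step + [inner] : step
  end
end

old_ast = [:path,
           [:axis, 'child', [:test, nil, 'foo']],
           [:axis, 'child', [:test, nil, 'bar']]]

new_ast = nest_path(old_ast)
# new_ast is now:
# [:axis, 'child', [:test, nil, 'foo'], [:axis, 'child', [:test, nil, 'bar']]]
```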
This commit also fixes parsing of type tests with predicates, such as:
comment()[10]
This used to throw a parser error.
This is a shortcut for "!foo". Using this method one doesn't have to
worry about how the "!" operator binds. For example, this:
!foo.or(bar)
would be parsed/evaluated as this:
!(foo.or(bar))
when instead we want it to be this:
(!foo).or(bar)
Using explicit parentheses leads to ugly code, so now we can do this
instead:
foo.not.or(bar)
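The binding problem can be reproduced with a small stand-alone sketch. The
`Expr` class below is hypothetical (it just builds source strings) and is not
the class this commit touches; it only shows why a `not` method reads better
than the "!" operator in a method chain:

```ruby
# Hypothetical expression builder that records the code it represents.
class Expr
  attr_reader :source

  def initialize(source)
    @source = source
  end

  def or(other)
    Expr.new("(#{source} || #{other.source})")
  end

  # The unary "!" operator. Ruby applies it to the result of the whole
  # method chain that follows it.
  def !
    Expr.new("!#{source}")
  end

  # Shortcut for "!self" that can be placed anywhere in a chain.
  def not
    !self
  end
end

foo = Expr.new('foo')
bar = Expr.new('bar')

negated_whole = !foo.or(bar)     # parsed as !(foo.or(bar))
negated_left  = foo.not.or(bar)  # same as (!foo).or(bar)

puts negated_whole.source # prints "!(foo || bar)"
puts negated_left.source  # prints "(!foo || bar)"
```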
This also comes with some changes to the specs as the old behaviour of
the Evaluator was incorrect. The Evaluator would bail after matching a
single node but instead it's meant to continue until it runs out of
parent nodes.
Prevents a superfluous end tag of a self-closing HTML tag from
closing its parent element prematurely, for example:
```html
<object><param></param><param></param></object>
```
(note that <param> is self-closing) being turned into:
```html
<object><param/></object><param/>
```
This is a Nokogiri extension (as far as I'm aware) but it's useful
enough to also include in Oga. Selectors such as "foo:nth(2)" are simply
compiled to XPath "descendant::foo[position() = 2]".
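As a rough illustration of that mapping (not the actual CSS compiler code),
the translation for this one selector form could be sketched as follows.
`nth_to_xpath` is a hypothetical helper and only understands "name:nth(N)":

```ruby
# Hypothetical helper: translates "name:nth(N)" into the XPath expression
# described above. Anything else is rejected.
def nth_to_xpath(selector)
  match = selector.match(/\A([\w-]+):nth\((\d+)\)\z/)
  raise ArgumentError, "unsupported selector: #{selector}" unless match

  "descendant::#{match[1]}[position() = #{match[2]}]"
end

puts nth_to_xpath('foo:nth(2)') # prints "descendant::foo[position() = 2]"
```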
Fixes #123
This allows for parsing of HTML such as:
<a href=lol("javascript")></a>
Here the "href" attribute would have its value set to:
lol("javascript")
Fixes #119
```ruby
element = Oga::XML::Element.new(:name => 'div')
some_node.replace(element)
```
You can also pass a `String` to `replace`, in which case the node is
replaced with an `Oga::XML::Text` node:
```ruby
some_node.replace('this will replace the current node with a text node')
```
Closes #115
Currently this only disables the automatic insertion of closing tags; in
the future this may also disable other features if deemed worth the
effort.
Fixes #107
This allows the lexer to process input such as:
<a href=foo"></a>
For XML input the lexer still expects properly opened/closed attribute
values.
Fixes #109
This ensures that entities such as "&frac12;" are decoded properly.
Previously this would be ignored as the regular expression used for this
only matched [a-zA-Z].
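The effect of the character class can be shown with a small sketch. The
entity table, the two patterns and the `decode` helper below are all
hypothetical stand-ins (the real decoding code is not reproduced here), and
"&frac12;" is used as an assumed example of an entity name containing digits:

```ruby
# A tiny subset of named entities, for illustration only.
ENTITIES = { 'amp' => '&', 'frac12' => "\u00BD" }.freeze

LETTERS_ONLY = /&([a-zA-Z]+);/    # the kind of pattern used previously
WITH_DIGITS  = /&([a-zA-Z0-9]+);/ # also accepts names such as "frac12"

# Replaces every known entity matched by `pattern`; unknown ones are kept.
def decode(input, pattern)
  input.gsub(pattern) { |match| ENTITIES.fetch($1, match) }
end

puts decode('1&frac12; &amp; 2', LETTERS_ONLY) # "&frac12;" is left alone
puts decode('1&frac12; &amp; 2', WITH_DIGITS)  # "&frac12;" is decoded
```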
This was adapted from PR #111.
Previously, HTML such as this would be lexed incorrectly:
<div>
<ul>
<li>foo
</ul>
inside div
</div>
outside div
The lexer would see this as the following instead:
<div>
<ul>
<li>foo</li>
inside div
</ul>
outside div
</div>
This commit exposes the name of the closing tag to
XML::Lexer#on_element_end (omitted for self closing tags). This can be
used to automatically close nested tags that were left open, ensuring
the above HTML is lexed correctly.
The new setup ignores namespace prefixes as these are not used in HTML.
XML, in turn, won't even run the code to begin with since it doesn't allow
one to leave out closing tags.
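The idea of using the closing tag's name to close nested elements can be
sketched with a plain stack. This is a simplified model, not the lexer's
actual code: `close_nested` and the `[:open, name]` / `[:close, name]` event
format are assumptions made for the example, and text and namespaces are not
modelled at all:

```ruby
# Takes a list of [:open, name] and [:close, name] events and returns the
# same list with the missing [:close, name] events filled in.
def close_nested(events)
  stack  = []
  output = []

  events.each do |type, name|
    if type == :open
      stack << name
      output << [type, name]
    elsif stack.include?(name)
      # Close everything that was left open inside the element being
      # closed, then close the element itself.
      loop do
        current = stack.pop
        output << [:close, current]
        break if current == name
      end
    end
    # A closing tag without a matching open element is dropped.
  end

  output
end

events = [[:open, 'div'], [:open, 'ul'], [:open, 'li'],
          [:close, 'ul'], [:close, 'div']]

close_nested(events).each { |type, name| puts "#{type} #{name}" }
```

With the input above, the `</ul>` event first closes the dangling "li" and
then "ul", so "div" stays open until its own closing tag is seen.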
By encoding single/double quotes we can potentially break input, so let's
stop doing this. This now ensures that this:
<foo>a"b</foo>
is actually serialized back into the exact same output instead of being
serialized into:
<foo>a&quot;b</foo>
When closing certain HTML elements the lexer should also close whatever
parent elements remain. For example, consider the following HTML:
<table>
<thead>
<tr>
<th>Foo
<th>Bar
<tbody>
...
</tbody>
</table>
Here the "<tbody>" element should close not only the "<th>Bar" element
but also the parent "<tr>" and "<thead>" elements. This ensures we'd end
up with the following HTML:
<table>
<thead>
<tr>
<th>Foo</th>
<th>Bar</th>
</tr>
</thead>
<tbody>
...
</tbody>
</table>
Instead of garbage along the lines of this:
<table>
<thead>
<tr>
<th>Foo</th>
<th>Bar</th>
<tbody>
...
</tbody>
</table></tr></thead>
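A minimal sketch of this "close all remaining parents" behaviour is shown
below. The `CLOSED_BY_START` table and `open_element` are hypothetical and
cover only the elements from the example; the lexer's real lookup tables are
not reproduced here:

```ruby
# When an element named by a key is opened, any of the listed elements that
# are still open at the top of the stack are closed first.
CLOSED_BY_START = {
  'th'    => %w[th],
  'tbody' => %w[th td tr thead]
}.freeze

# Opens `name` on `stack` and returns the names that were closed for it.
def open_element(stack, name)
  closable = CLOSED_BY_START.fetch(name, [])
  closed   = []

  # Keep popping for as long as the innermost open element may be closed
  # by the element that is being opened.
  closed << stack.pop while closable.include?(stack.last)

  stack << name
  closed
end

stack = []
%w[table thead tr th th].each { |name| open_element(stack, name) }

closed = open_element(stack, 'tbody')
# closed => ["th", "tr", "thead"]
# stack  => ["table", "tbody"]
```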
Fixes #99 (hopefully for good this time)
This ensures that HTML such as this:
<li>foo
<li>bar
is parsed as this:
<li>foo</li>
<li>bar</li>
and not as this:
<li>
foo
<li>bar</li>
</li>
Fixes #97
This makes it easier to automatically insert preceding tokens when
starting a new element as we now have access to the name. Previously
on_element_start would be invoked first which doesn't receive an
argument.
The XML/HTML lexer is now capable of processing most invalid XML/HTML
(that I can think of at least). This is achieved by inserting missing
closing tags (where needed) and/or ignoring excessive closing tags. For
example, HTML such as this:
<a></a></p>
Results in the following tokens:
[:T_ELEM_START, nil, 1]
[:T_ELEM_NAME, 'a', 1]
[:T_ELEM_CLOSE, nil, 1]
In turn this HTML:
<a>
Results in these tokens:
[:T_ELEM_START, nil, 1]
[:T_ELEM_NAME, 'a', 1]
[:T_ELEM_CLOSE, nil, 1]
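Both cases can be reproduced with a small counter-based sketch. This is not
the Ragel lexer: `balance` is a hypothetical helper that takes already-split
tags, ignores attributes and text, and hard-codes the line number to 1. It
only shows the two rules (ignore excessive closing tags, insert missing
ones):

```ruby
# Turns a list of tags such as ["<a>", "</a>", "</p>"] into tokens, while
# ignoring excessive closing tags and inserting missing ones at the end.
def balance(tags)
  depth  = 0
  tokens = []

  tags.each do |tag|
    if tag.start_with?('</')
      next if depth.zero? # nothing is open: ignore the closing tag

      depth -= 1
      tokens << [:T_ELEM_CLOSE, nil, 1]
    else
      depth += 1
      tokens << [:T_ELEM_START, nil, 1]
      tokens << [:T_ELEM_NAME, tag[1..-2], 1]
    end
  end

  # Close whatever is still open once the input has been consumed.
  depth.times { tokens << [:T_ELEM_CLOSE, nil, 1] }

  tokens
end

p balance(['<a>', '</a>', '</p>'])
p balance(['<a>'])
```

Both calls produce the same three tokens as listed above.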
Fixes #84
Previously a single Ragel machine was used for processing HTML
script and style tags. This had the unfortunate side-effect that the
following was not parsed correctly (while being valid HTML):
<script>
var foo = "</style>";
</script>
The same applied to style tags:
<style>
/* </script> */
</style>
By using separate machines we can work around the above issue. The
downside is that this can produce multiple T_TEXT nodes, which have to
be stitched back together in the parser.
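The stitching step could look roughly like the sketch below. It works on
plain `[type, value, line]` token arrays; `merge_text` is a hypothetical
helper and the way the script body is split into chunks in the example is
an assumption, not the lexer's actual output:

```ruby
# Merges runs of adjacent T_TEXT tokens into a single token. The line
# number of the first token in a run is kept.
def merge_text(tokens)
  tokens.each_with_object([]) do |(type, value, line), merged|
    previous = merged.last

    if type == :T_TEXT && previous && previous[0] == :T_TEXT
      previous[1] += value
    else
      # Build a fresh array so the input tokens are never modified.
      merged << [type, value, line]
    end
  end
end

tokens = [
  [:T_TEXT, 'var foo = "', 1],
  [:T_TEXT, '</style>', 1],
  [:T_TEXT, '";', 1],
  [:T_ELEM_CLOSE, nil, 1]
]

p merge_text(tokens)
```

Here the three text chunks end up as one T_TEXT token with the value
`var foo = "</style>";`, followed by the untouched closing token.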
This adds lexing support for HTML/XML such as:
<foo bar="""></foo>
While technically invalid, some websites (e.g. yahoo.com) contain HTML
just like this.
The lexer handles this as follows:
1. When we're in the "element_head" machine, do business as usual until
we bump into a "=".
2. Call (using Ragel's "fcall") the machine to use for processing the
attribute value (if any).
3. In this machine quoted strings are processed. The moment a string has
been processed the lexer jumps right back into the "element_head"
machine. This ensures that any stray quotes are ignored instead of
being processed as extra attribute values (eventually leading to
parsing errors due to unbalanced quotes).
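The three steps above can be approximated in plain Ruby with StringScanner
from the standard library. This is only a sketch of the control flow, not the
Ragel grammar: `lex_attributes` and `lex_value` are hypothetical names, an
ordinary method call stands in for "fcall", and returning from the method
stands in for jumping back to "element_head":

```ruby
require 'strscan'

# Consumes exactly one attribute value (quoted or not) and returns it.
def lex_value(scanner)
  quote = scanner.scan(/["']/)
  return scanner.scan(/[^\s>]*/) unless quote # unquoted value

  value = scanner.scan(quote == '"' ? /[^"]*/ : /[^']*/)
  scanner.getch unless scanner.eos? # the closing quote
  value
end

# Returns [name, value] pairs for the attribute part of an element head,
# e.g. 'bar="" baz="x"'.
def lex_attributes(head)
  scanner = StringScanner.new(head)
  attrs   = []

  until scanner.eos?
    if (name = scanner.scan(/[a-zA-Z_][\w-]*/))
      # Steps 1 and 2: on "=", hand over to the value routine.
      value = scanner.skip(/=/) ? lex_value(scanner) : nil
      attrs << [name, value]
    else
      # Step 3: back in the element head, whitespace and stray quotes
      # are skipped instead of starting a new value.
      scanner.getch
    end
  end

  attrs
end

p lex_attributes('bar=""" baz="x"')
```

For the input above the third, stray quote is skipped, so "bar" gets an
empty value and "baz" is still lexed as a separate attribute.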
Similar to comments (ea8b4aa92f) and CDATA tags (8acc7fc743), processing
instructions are now lexed in separate chunks _with_ proper support for
streaming input.
Related issue: #93