This ensures that HTML such as this:
<li>foo
<li>bar
is parsed as this:
<li>foo</li>
<li>bar</li>
and not as this:
<li>
foo
<li>bar</li>
</li>
Fixes #97
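The implicit closing described above can be sketched in a few lines of
plain Ruby. This is not the actual Ragel lexer: lex_list_items and its
[type, value] token layout are made up for illustration, and only the
<li> rule is handled.

```ruby
# Hypothetical helper, not Oga's lexer: emits [type, value] pairs and
# closes a still open <li> before opening the next one.
def lex_list_items(html)
  tokens = []
  stack  = []

  html.scan(/<(\/)?(\w+)>|([^<]+)/) do |closing, name, text|
    if text
      stripped = text.strip
      tokens << [:text, stripped] unless stripped.empty?
    elsif closing
      # Only honour a closing tag that matches the innermost open element.
      if stack.last == name
        stack.pop
        tokens << [:close, name]
      end
    else
      # A new <li> implicitly closes the preceding, still open <li>.
      if name == 'li' && stack.last == 'li'
        stack.pop
        tokens << [:close, 'li']
      end

      stack << name
      tokens << [:open, name]
    end
  end

  # Close whatever is still open at the end of the input.
  tokens << [:close, stack.pop] until stack.empty?
  tokens
end

lex_list_items("<li>foo\n<li>bar")
# => [[:open, "li"], [:text, "foo"], [:close, "li"],
#     [:open, "li"], [:text, "bar"], [:close, "li"]]
```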
This makes it easier to automatically insert preceding tokens when
starting a new element, as we now have access to the name. Previously
on_element_start would be invoked first, and that callback doesn't
receive an argument.
The XML/HTML lexer is now capable of processing most invalid XML/HTML
(that I can think of at least). This is achieved by inserting missing
closing tags (where needed) and/or ignoring excessive closing tags. For
example, HTML such as this:
<a></a></p>
Results in the following tokens:
[:T_ELEM_START, nil, 1]
[:T_ELEM_NAME, 'a', 1]
[:T_ELEM_CLOSE, nil, 1]
In turn this HTML:
<a>
Results in these tokens:
[:T_ELEM_START, nil, 1]
[:T_ELEM_NAME, 'a', 1]
[:T_ELEM_CLOSE, nil, 1]
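A minimal sketch of the bookkeeping behind both examples, assuming a
simple stack of open element names. balance_elements is a hypothetical
helper and not part of the lexer; only the token names and layout are
taken from the examples above.

```ruby
# Hypothetical stand-in for the lexer's element bookkeeping.
def balance_elements(html, line = 1)
  tokens        = []
  open_elements = []

  html.scan(/<(\/)?(\w+)>/) do |closing, name|
    if closing
      # A closing tag without a matching open element is ignored.
      next unless open_elements.last == name

      open_elements.pop
      tokens << [:T_ELEM_CLOSE, nil, line]
    else
      open_elements << name
      tokens << [:T_ELEM_START, nil, line]
      tokens << [:T_ELEM_NAME, name, line]
    end
  end

  # Insert a closing token for every element still open at the end.
  open_elements.size.times { tokens << [:T_ELEM_CLOSE, nil, line] }
  tokens
end

balance_elements('<a></a></p>') == balance_elements('<a>') # => true
```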
Fixes #84
Previously a single Ragel machine was used for processing HTML
script and style tags. This had the unfortunate side effect that the
following was not parsed correctly (despite being valid HTML):
<script>
var foo = "</style>";
</script>
The same applied to style tags:
<style>
/* </script> */
</style>
By using separate machines we can work around the above issue. The
downside is that this can produce multiple T_TEXT nodes, which have to
be stitched back together in the parser.
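Stitching the text back together could look roughly like this. The
sketch assumes tokens are [type, value, line] arrays as in the other
lexer examples; merge_text_tokens is a hypothetical helper, the real
parser does this in its grammar.

```ruby
# Merges runs of adjacent [:T_TEXT, value, line] tokens into one token,
# keeping the line number of the first token in the run.
def merge_text_tokens(tokens)
  tokens.each_with_object([]) do |token, merged|
    previous = merged.last

    if token[0] == :T_TEXT && previous && previous[0] == :T_TEXT
      merged[-1] = [:T_TEXT, previous[1] + token[1], previous[2]]
    else
      merged << token
    end
  end
end

chunks = [
  [:T_TEXT, 'var foo = "', 1],
  [:T_TEXT, '</style>', 1],
  [:T_TEXT, '";', 1]
]

merge_text_tokens(chunks) # => [[:T_TEXT, "var foo = \"</style>\";", 1]]
```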
Similar to comments (ea8b4aa92f) and CDATA tags (8acc7fc743),
processing instructions are now lexed in separate chunks _with_ proper
support for streaming input.
Related issue: #93
Instead of using a single token (T_CDATA) for a CDATA tag the lexer now
uses 3 tokens:
1. T_CDATA_START
2. T_CDATA_BODY
3. T_CDATA_END
The T_CDATA_BODY token can occur multiple times and is turned into a
single value in the XML parser. This is similar to the way strings are
lexed.
By changing the way CDATA tags are lexed Oga can now lex CDATA tags
containing newlines when using an IO as input. For example, this would
previously fail:
Oga.parse_xml(StringIO.new("<![CDATA[\nfoo]]>"))
Because IO input is read line by line, the lexer would receive the
input as the following two chunks:
"<![CDATA[\n"
"foo]]>"
Related issues: #93
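To illustrate why separate start/body/end tokens help with line-based
input, here is a small sketch of a lexer that remembers across chunks
whether it is inside a CDATA tag. CdataChunkLexer is hypothetical and
ignores everything outside CDATA; only the three token names come from
the text above. It assumes "]]>" is never split across two chunks, which
holds for line-based input.

```ruby
require 'stringio'

# Hypothetical chunk-based lexer, not Oga's Ragel machine.
class CdataChunkLexer
  def initialize
    @in_cdata = false
  end

  # Lexes one chunk of input; state is kept between calls.
  def lex_chunk(chunk)
    tokens = []
    rest   = chunk

    until rest.empty?
      if @in_cdata
        body, terminator, rest = rest.partition(']]>')
        tokens << [:T_CDATA_BODY, body] unless body.empty?

        unless terminator.empty?
          tokens << [:T_CDATA_END, nil]
          @in_cdata = false
        end
      else
        # Anything before the CDATA start is ignored in this sketch.
        _before, start, rest = rest.partition('<![CDATA[')

        unless start.empty?
          tokens << [:T_CDATA_START, nil]
          @in_cdata = true
        end
      end
    end

    tokens
  end
end

lexer  = CdataChunkLexer.new
io     = StringIO.new("<![CDATA[\nfoo]]>")
tokens = io.each_line.flat_map { |line| lexer.lex_chunk(line) }

# The parser side joins all body tokens into a single value.
body = tokens.select { |type, _| type == :T_CDATA_BODY }.map(&:last).join
# body => "\nfoo"
```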
This cache is flushed whenever Element#register_namespace is called.
When this cache is flushed it's also recursively flushed for all child
elements. This makes calls to Element#register_namespace a bit more
expensive but in turn calls to Element#available_namespaces will be a
lot faster.
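The trade-off can be sketched as follows. SketchElement is a
hypothetical stand-in: the two public method names mirror the ones
mentioned above, while flush_namespace_cache and the plain Hash of
name => URI are assumptions made for the example.

```ruby
# Hypothetical element with a memoized namespace lookup.
class SketchElement
  attr_reader :parent, :children, :namespaces

  def initialize(parent = nil)
    @parent               = parent
    @children             = []
    @namespaces           = {}
    @available_namespaces = nil

    parent.children << self if parent
  end

  # Writing is more expensive: the cache of the whole subtree is flushed.
  def register_namespace(name, uri)
    @namespaces[name] = uri
    flush_namespace_cache
  end

  # Reading is cheap after the first call. Own namespaces take
  # precedence over inherited ones.
  def available_namespaces
    @available_namespaces ||= begin
      inherited = parent ? parent.available_namespaces : {}
      inherited.merge(namespaces)
    end
  end

  def flush_namespace_cache
    @available_namespaces = nil
    children.each { |child| child.flush_namespace_cache }
  end
end

root  = SketchElement.new
child = SketchElement.new(root)

root.register_namespace('x', 'http://example.com/x')
child.available_namespaces # => {"x"=>"http://example.com/x"}
```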
The results of these methods are now cached until a Node is moved into
another NodeSet. This reduces the time spent in the
xpath/evaluator/big_xml_average_bench.rb benchmark from roughly 10
seconds to roughly 5 seconds per iteration.
In HTML the text of a script/style tag should be left untouched: no
entities should be converted. Doing so would break JavaScript such as
the following:
foo&&bar;
Such code is often the result of minifiers doing their dirty business.
This was broken by the introduction of lazy decoding of XML/HTML
entities. The new setup works similarly to how XML::Text#text decodes
any entities that may be present.
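A rough sketch of the idea, assuming the text node knows the name of
its parent element and whether the document is HTML. SketchText, its
constructor arguments and the tiny entity table are all hypothetical;
the real decoding covers far more entities.

```ruby
# Hypothetical text node with lazy entity decoding.
class SketchText
  LITERAL_PARENTS = %w[script style].freeze
  ENTITIES        = { '&amp;' => '&', '&lt;' => '<', '&gt;' => '>' }.freeze
  ENTITY_PATTERN  = Regexp.union(ENTITIES.keys)

  def initialize(raw, parent_name: nil, html: false)
    @raw         = raw
    @parent_name = parent_name
    @html        = html
    @text        = nil
  end

  # Decodes entities on first access only, and never inside HTML
  # script/style elements.
  def text
    @text ||= literal? ? @raw : @raw.gsub(ENTITY_PATTERN, ENTITIES)
  end

  private

  def literal?
    @html && LITERAL_PARENTS.include?(@parent_name)
  end
end

SketchText.new('a &lt; b', parent_name: 'script', html: true).text
# => "a &lt; b"

SketchText.new('a &lt; b', parent_name: 'p', html: true).text
# => "a < b"
```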
Fixes #91
When querying an XML document that explicitly defines the default XML
namespace, the XPath evaluator now correctly matches all nodes within
that namespace if no namespace prefix is given in the query. Previously
such a query would always return an empty set.
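The matching rule can be sketched like this. Everything here is an
assumption made for illustration: SketchNode, matches_name_test? and
its keyword arguments are hypothetical and do not reflect the
evaluator's actual internals.

```ruby
SketchNode = Struct.new(:name, :namespace_uri)

# default_uri is the URI bound via xmlns="..." in the document, if any.
# namespaces maps query prefixes to URIs.
def matches_name_test?(node, name, prefix: nil, namespaces: {}, default_uri: nil)
  return false unless node.name == name

  if prefix
    node.namespace_uri == namespaces[prefix]
  else
    # No prefix in the query: accept nodes without a namespace as well
    # as nodes in the explicitly declared default namespace.
    node.namespace_uri.nil? || node.namespace_uri == default_uri
  end
end

node = SketchNode.new('item', 'http://example.com/ns')

matches_name_test?(node, 'item', default_uri: 'http://example.com/ns') # => true
matches_name_test?(node, 'item')                                       # => false
```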
Instead of trying to make this class thread-safe I'm going with the
option of simply declaring it unsafe to mutate instances of XML::Text
while reading it in parallel. This removes the need for Mutex
allocations and keeps the code simple.
Fixes #82
Currently all operators are left-associative with no particular
precedence. This causes a few specs to fail for now. Apart from that, the
new parser should be able to parse the same input as the Racc-based
parser.
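To show what "left-associative with no particular precedence" means in
practice, here is a tiny sketch that folds a flat list of operands and
operators strictly from left to right. parse_left_assoc and the
[operator, left, right] node layout are hypothetical and unrelated to
the actual parser's AST.

```ruby
# Folds a flat token list such as [1, :+, 2, :*, 3] into nested
# [operator, left, right] nodes, strictly left to right.
def parse_left_assoc(tokens)
  tokens = tokens.dup
  left   = tokens.shift

  until tokens.empty?
    operator = tokens.shift
    right    = tokens.shift
    left     = [operator, left, right]
  end

  left
end

parse_left_assoc([1, :+, 2, :*, 3])
# => [:*, [:+, 1, 2], 3]
```

With operator precedence the same input would instead produce
[:+, 1, [:*, 2, 3]], which is the kind of difference the failing specs
run into.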