Instead of using a single token (T_CDATA) for a CDATA tag the lexer now
uses 3 tokens:
1. T_CDATA_START
2. T_CDATA_BODY
3. T_CDATA_END
The T_CDATA_BODY token can occur multiple times and is turned into a
single value in the XML parser. This is similar to the way strings are
lexed.
By changing the way CDATA tags are lexed Oga can now lex CDATA tags
containing newlines when using an IO as input. For example, this would
previously fail:
Oga.parse_xml(StringIO.new("<![CDATA[\nfoo]]>"))
Because IO input reads input per line the input for the lexer would be
as following:
"<![CDATA[\n"
"foo]]>"
Related issues: #93
This cache is flushed whenever Element#register_namespace is called.
When this cache is flushed it's also recursively flushed for all child
elements. This makes calls to Element#register_namespace a bit more
expensive but in turn calls to Element#available_namespaces will be a
lot faster.
The results of these methods is now cached until a Node is moved into
another NodeSet. This reduces the time spent in the
xpath/evaluator/big_xml_average_bench.rb benchmark from roughly 10
seconds to roughly 5 seconds per iteration.
In HTML the text of a script/style tag should be left untouched, no
entities must be converted. Doing so would break Javascript such as the
following:
foo&&bar;
Such code is often the result of minifiers doing their dirty business.
This was broken by introducing the process of lazy decoding of XML/HTML
entities. The new setup works similar to how XML::Text#text decodes any
entities that may be present.
Fixes#91
When querying an XML document that explicitly defines the default XML
namespace the XPath evaluator now correctly matches all nodes within
that namespace if no namespace prefix is given in the query. Previously
this would always return an empty set.
This benchmark is simple enough that the overhead of evaluation is not
far greater than parsing. This makes it suitable for benchmarking the
performance increase of caching XPath ASTs.
Instead of trying to make this class thread-safe I'm going with the
option of simply declaring it unsafe to mutate instances of XML::Text
while reading it in parallel. This removes the need for Mutex
allocations and keeps the code simple.
Fixes#82