Instead of calculating the previous/next node on the fly this data is
now set automatically whenever a node is stored in a NodeSet with an
owner. While this introduces some overhead and complexity when adding or
removing nodes from a NodeSet, it greatly reduces the runtime overhead
of calling Node#previous or Node#next.
While using recursion is an easy way of generating XML it can lead to
the call stack overflowing when serialising documents with lots of
nested nodes.
Generally there are two ways of working around this:
1. Use an explicit stack (e.g. an array or a queue of sorts) instead of
relying on the call stack.
2. Use an algorithm that doesn't use a stack at all (e.g. Morris
traversal).
This commit introduces the XML::Generator class which can serialize
documents back to XML without using a stack at all. This class takes
advantage of XML nodes having access to not only their child nodes, but
also their siblings and their parents.
All XML serialisation logic now resides in the XML::Generator class. In
turn the various "to_xml" methods just use this class and serialize
everything starting at "self".
Basically this will process the left-hand side first, assign the result
to a variable and then append this set with the nodes from the
right-hand side.
Fixes#149
Certain entities when decoded will produce a String with an invalid
encoding. This commit ensures that instead of raising an EncodingError
further down the line (e.g. when calling "inspect" on a document) the
entities are preserved as-is.
Fixes#143
HTML identifiers containing colons should be treated in two ways:
* For element names the prefix (= the namespace prefix in case of XML)
should be ignored as HTML doesn't support/use namespaces.
* For attribute names a colon is a valid character, thus "foo:bar:baz"
should be treated as a single attribute name.
This fixes#142.
On JRuby 9.0.1.0 this is a bit faster than using "is_a?":
require 'benchmark/ips'
input = false
Benchmark.ips do |bench|
bench.report 'is_a?' do
input.is_a?(TrueClass) || input.is_a?(FalseClass)
end
bench.report '==' do
input == true || input == false
end
bench.compare!
end
This outputs:
Calculating -------------------------------------
is_a? 86.129k i/100ms
== 112.837k i/100ms
-------------------------------------------------
is_a? 7.375M (±15.3%) i/s - 35.227M
== 10.428M (±12.0%) i/s - 50.889M
Comparison:
==: 10427617.5 i/s
is_a?: 7374666.2 i/s - 1.41x slower
On both MRI 2.2 and Rubinius 2.5.8 there's little to no difference
between these two methods.
Re-using the Binding of the XPath::Compiler#compile method would lead to
race conditions, and possibly a memory leak due to the Binding sticking
around for compiled Proc's lifetime.
By using a dedicated class (and its corresponding Binding) we can work
around this. Access to this class is not synchronized as compiled Procs
don't mutate their enclosing environment.
The race condition can be demonstrated using code such as the
following:
xml = <<-EOF
<people>
<person>
<name>Alice</name>
</person>
<person>
<name>Bob</name>
</person>
<person>
<name>Eve</name>
</person>
</people>
EOF
4.times.map do
Thread.new do
10_000.times do
document = Oga.parse_xml(xml)
document.at_xpath('people/person/name').text
end
end
end.each(&:join)
Running this code would result in NoMethodErrors due to "at_xpath"
returning a NilClass opposed to an Oga::XML::Element.
Escaping hash characters and whitespace is _not_ supported as neither
are valid element/attribute names (e.g. <foo#bar /> is invalid
XML/HTML).
Escaping single/double quotes also won't be supported for the time
being. It's quite a pain to get this to work right in not just CSS but
also XPath and XML/HTML, for very little gain. Should there be enough
users with an actual use case (other than "But the spec says ...!") I'll
look into this again.
Fixes#124
This does _not_ support element states such as DISABLED, nor does it
support the special handling of namespaces (e.g. *|*:not(*)). Instead
this selector basically acts as a negation, some examples:
:not(foo) # All but any "foo" nodes
:not(#foo) # Skips nodes with id="foo"
:not(.foo) # Skips nodes with a class "foo"
Fixes#125
''.start_with?('') returns false on JRuby 1.7. While I'd love to drop
support for shit like this, JRuby 1.7 is still in common use today, so
lets just work around this for now.
This updates the CSS parser to make it compatible with the XPath AST
changes introduced in commit 365a9e9fa9.
This also, finally, means I can get rid of some of the hacks that were
used for "+ foo" selectors and building (path) nodes.
This changes the XPath AST so that every segment in a path (e.g.
foo/bar) is parsed as a child node of the node that precedes it. For
example, take the following expression:
foo/bar
This used to be parsed into the following AST:
(path
(axis "child" (test nil "foo"))
(axis "child" (test nil "bar")))
This is now parsed into the following AST:
(axis "child"
(test nil "foo")
(axis "child"
(test nil "bar")))
This new AST is much easier to deal with in the XPath::Compiler class,
especially when trying to ensure that each segment operates on the
correct input.
This commit also fixes parsing of type tests with predicates, such as:
comment()[10]
This used to throw a parser error.