Basically this will process the left-hand side first, assign the result
to a variable and then append this set with the nodes from the
right-hand side.
Fixes#149
Certain entities when decoded will produce a String with an invalid
encoding. This commit ensures that instead of raising an EncodingError
further down the line (e.g. when calling "inspect" on a document) the
entities are preserved as-is.
Fixes#143
HTML identifiers containing colons should be treated in two ways:
* For element names the prefix (= the namespace prefix in case of XML)
should be ignored as HTML doesn't support/use namespaces.
* For attribute names a colon is a valid character, thus "foo:bar:baz"
should be treated as a single attribute name.
This fixes#142.
On JRuby 9.0.1.0 this is a bit faster than using "is_a?":
require 'benchmark/ips'
input = false
Benchmark.ips do |bench|
bench.report 'is_a?' do
input.is_a?(TrueClass) || input.is_a?(FalseClass)
end
bench.report '==' do
input == true || input == false
end
bench.compare!
end
This outputs:
Calculating -------------------------------------
is_a? 86.129k i/100ms
== 112.837k i/100ms
-------------------------------------------------
is_a? 7.375M (±15.3%) i/s - 35.227M
== 10.428M (±12.0%) i/s - 50.889M
Comparison:
==: 10427617.5 i/s
is_a?: 7374666.2 i/s - 1.41x slower
On both MRI 2.2 and Rubinius 2.5.8 there's little to no difference
between these two methods.
Re-using the Binding of the XPath::Compiler#compile method would lead to
race conditions, and possibly a memory leak due to the Binding sticking
around for compiled Proc's lifetime.
By using a dedicated class (and its corresponding Binding) we can work
around this. Access to this class is not synchronized as compiled Procs
don't mutate their enclosing environment.
The race condition can be demonstrated using code such as the
following:
xml = <<-EOF
<people>
<person>
<name>Alice</name>
</person>
<person>
<name>Bob</name>
</person>
<person>
<name>Eve</name>
</person>
</people>
EOF
4.times.map do
Thread.new do
10_000.times do
document = Oga.parse_xml(xml)
document.at_xpath('people/person/name').text
end
end
end.each(&:join)
Running this code would result in NoMethodErrors due to "at_xpath"
returning a NilClass opposed to an Oga::XML::Element.
Escaping hash characters and whitespace is _not_ supported as neither
are valid element/attribute names (e.g. <foo#bar /> is invalid
XML/HTML).
Escaping single/double quotes also won't be supported for the time
being. It's quite a pain to get this to work right in not just CSS but
also XPath and XML/HTML, for very little gain. Should there be enough
users with an actual use case (other than "But the spec says ...!") I'll
look into this again.
Fixes#124
This does _not_ support element states such as DISABLED, nor does it
support the special handling of namespaces (e.g. *|*:not(*)). Instead
this selector basically acts as a negation, some examples:
:not(foo) # All but any "foo" nodes
:not(#foo) # Skips nodes with id="foo"
:not(.foo) # Skips nodes with a class "foo"
Fixes#125
''.start_with?('') returns false on JRuby 1.7. While I'd love to drop
support for shit like this, JRuby 1.7 is still in common use today, so
lets just work around this for now.
This updates the CSS parser to make it compatible with the XPath AST
changes introduced in commit 365a9e9fa9.
This also, finally, means I can get rid of some of the hacks that were
used for "+ foo" selectors and building (path) nodes.
This changes the XPath AST so that every segment in a path (e.g.
foo/bar) is parsed as a child node of the node that precedes it. For
example, take the following expression:
foo/bar
This used to be parsed into the following AST:
(path
(axis "child" (test nil "foo"))
(axis "child" (test nil "bar")))
This is now parsed into the following AST:
(axis "child"
(test nil "foo")
(axis "child"
(test nil "bar")))
This new AST is much easier to deal with in the XPath::Compiler class,
especially when trying to ensure that each segment operates on the
correct input.
This commit also fixes parsing of type tests with predicates, such as:
comment()[10]
This used to throw a parser error.
Handling of predicates is delegated to 3 different methods:
* on_predicate_direct: for predicates such as foo[bar] and foo[x < y].
* on_predicate_temporary: for predicates that use the last() function
somewhere.
* on_predicate_index: for predicates that only contain a literal index,
foo[10] being an example.
This enables the compiler to use more optimized code depending on the
type of predicate while still being able to support last() and
position().
The code is currently still quite rough on the edges, this will be taken
care of in following commits.
This was a hack that never should've made it into the code in the first
place. This breaks some of the CSS specs, but this should be taken care
of when I refactor predicate support.
I can't quite recall why this throw was present in the Evaluator in the
first place, but removing it doesn't seem to break anything (in the
Evaluator). In fact, removing it actually fixes some of the CSS specs
that were broken when using the Compiler.