Certain entities when decoded will produce a String with an invalid
encoding. This commit ensures that instead of raising an EncodingError
further down the line (e.g. when calling "inspect" on a document) the
entities are preserved as-is.
Fixes #143
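A minimal sketch of the "preserve as-is" idea, assuming a hypothetical helper named decode_hex_entity (this is not Oga's actual decoder, and it only handles hexadecimal character references). The reference is decoded, and the original text is returned whenever the decoded String is not valid UTF-8:

```ruby
# Hypothetical helper for illustration only; not part of Oga's API.
# Decodes a single hexadecimal character reference such as "&#x41;".
# If the decoded String would have an invalid encoding (e.g. a lone
# surrogate), the entity is returned unchanged.
def decode_hex_entity(entity)
  match = entity.match(/\A&#x(\h+);\z/)

  # Anything that is not a hexadecimal character reference is left alone.
  return entity unless match

  decoded = [match[1].to_i(16)].pack('U')

  decoded.valid_encoding? ? decoded : entity
rescue RangeError
  # Array#pack raises RangeError for values far outside the Unicode range.
  entity
end
```

Checking String#valid_encoding? at decode time keeps the failure local, rather than letting an invalid String surface later in unrelated code.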
This is largely based on the Contributor Covenant but with the list of
unacceptable behaviours updated according to the Rubinius CoC (as I feel
the latter is more explicit/accurate/better).
HTML identifiers containing colons should be treated in two ways:
* For element names the prefix (= the namespace prefix in case of XML)
  should be ignored as HTML doesn't support/use namespaces.
* For attribute names a colon is a valid character, thus "foo:bar:baz"
  should be treated as a single attribute name.
This fixes #142.
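The two rules above can be sketched roughly as follows. Both method names are hypothetical and only illustrate the intended results; they are not how the lexer/parser implements this. The sketch also assumes that only the text up to the first colon counts as the prefix of an element name:

```ruby
# Hypothetical helpers for illustration only; not part of Oga's API.

# Element names: drop the prefix (everything up to and including the
# first colon), since HTML has no namespaces.
def html_element_name(identifier)
  identifier.sub(/\A[^:]*:/, '')
end

# Attribute names: a colon is an ordinary character, so the identifier
# is kept whole.
def html_attribute_name(identifier)
  identifier
end

html_element_name('foo:bar')       # prefix "foo" is ignored
html_attribute_name('foo:bar:baz') # kept as a single name
```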
On JRuby 9.0.1.0 this is a bit faster than using "is_a?":
    require 'benchmark/ips'

    input = false

    Benchmark.ips do |bench|
      bench.report 'is_a?' do
        input.is_a?(TrueClass) || input.is_a?(FalseClass)
      end

      bench.report '==' do
        input == true || input == false
      end

      bench.compare!
    end

This outputs:
    Calculating -------------------------------------
                   is_a?    86.129k i/100ms
                      ==   112.837k i/100ms
    -------------------------------------------------
                   is_a?      7.375M (±15.3%) i/s -     35.227M
                      ==     10.428M (±12.0%) i/s -     50.889M

    Comparison:
                      ==: 10427617.5 i/s
                   is_a?:  7374666.2 i/s - 1.41x slower

On both MRI 2.2 and Rubinius 2.5.8 there's little to no difference
between these two methods.
Re-using the Binding of the XPath::Compiler#compile method would lead to
race conditions, and possibly a memory leak due to the Binding sticking
around for the compiled Proc's lifetime.
By using a dedicated class (and its corresponding Binding) we can work
around this. Access to this class is not synchronized as compiled Procs
don't mutate their enclosing environment.
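A rough sketch of the approach, assuming a hypothetical class called EvalContext and a simplified compile method (neither is the actual Oga code). The source is evaluated with the Binding of a method on a dedicated object, so the resulting Proc closes over that small scope instead of the compiler method's locals:

```ruby
# Hypothetical stand-in for the dedicated class; the name is an assumption.
class EvalContext
  # Evaluates the source using the Binding of this method call. That
  # Binding only holds `self` and `source`, not the caller's locals.
  def evaluate(source)
    eval(source, binding)
  end
end

# Simplified stand-in for a compile method that has its own locals.
def compile(source)
  large_temporary = 'x' * 1024 # a local the Proc should not capture

  EvalContext.new.evaluate(source)
end

add_one    = compile('lambda { |number| number + 1 }')
sees_local = compile('lambda { defined?(large_temporary) }')

add_one.call(1)  # the compiled Proc works as usual
sees_local.call  # nil: the compiler's local is not visible to the Proc
```

Each call to compile uses a fresh EvalContext here, so no state is shared between threads in this sketch.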
The race condition can be demonstrated using code such as the
following:
    xml = <<-EOF
    <people>
      <person>
        <name>Alice</name>
      </person>
      <person>
        <name>Bob</name>
      </person>
      <person>
        <name>Eve</name>
      </person>
    </people>
    EOF

    4.times.map do
      Thread.new do
        10_000.times do
          document = Oga.parse_xml(xml)

          document.at_xpath('people/person/name').text
        end
      end
    end.each(&:join)

Running this code would result in NoMethodErrors due to "at_xpath"
returning nil instead of an Oga::XML::Element.
Escaping hash characters and whitespace is _not_ supported as neither
are valid element/attribute names (e.g. <foo#bar /> is invalid
XML/HTML).
Escaping single/double quotes also won't be supported for the time
being. It's quite a pain to get this to work right in not just CSS but
also XPath and XML/HTML, for very little gain. Should there be enough
users with an actual use case (other than "But the spec says ...!") I'll
look into this again.
Fixes #124
This does _not_ support element states such as DISABLED, nor does it
support the special handling of namespaces (e.g. *|*:not(*)). Instead
this selector basically acts as a negation. Some examples:

    :not(foo)  # All but any "foo" nodes
    :not(#foo) # Skips nodes with id="foo"
    :not(.foo) # Skips nodes with a class "foo"

Fixes #125
''.start_with?('') returns false on JRuby 1.7. While I'd love to drop
support for shit like this, JRuby 1.7 is still in common use today, so
let's just work around this for now.
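One possible shape for such a workaround, shown as a hypothetical helper (the name prefixed? is an assumption and not necessarily what this commit uses): check for an empty prefix first, so String#start_with? is never asked about the empty String.

```ruby
# Hypothetical helper for illustration only. An empty prefix always
# matches, which sidesteps the JRuby 1.7 behaviour described above.
def prefixed?(string, prefix)
  prefix.empty? || string.start_with?(prefix)
end
```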