As the community progressively moves to a useful practice of enabling
ruby warnings on tests, assigning an instance variable before use becomes
a necessary practice. Here we set some variables at initialization that
were previously lazily or conditionally set:
- `decoded` is assigned false which seems to make more semantic sense
than than using nil
- `namespace` is assigned nil, its value being lazily computed later
- `available_namespaces` is assigned nil so as to respect the cache
invalidation mechanism
In Ruby 2.4, Fixnum is deprecated and the following produces a warning:
$ irb
irb(main):001:0> puts RUBY_VERSION
2.4.0
=> nil
irb(main):002:0> require 'oga'
=> true
irb(main):003:0> Oga.parse_xml('<people><person>Alice</person></people>').css(':nth-child(1)')
/Users/pcheah/.rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/oga-2.7/lib/oga/xpath/conversion.rb:79: warning: constant ::Fixnum is deprecated
=> NodeSet(Element(name: "people" children: NodeSet(Element(name: "person" children: NodeSet(Text("Alice"))))), Element(name: "person" children: NodeSet(Text("Alice"))))
irb(main):004:0>
$
So oga/xpath/conversion.rb needs to test for Integer instead of Fixnum.
Also, it appears that json 1.8.3 no longer builds under Ruby 2.4, so the gemfile has to be upgraded to json 2.0.
This commit fixes two problems:
1. Doctypes introducing too many newlines
2. Elements with siblings and a common parent not being closed properly
== Doctypes
When generating the XML for a doctype the XML::Generator class would
append a trailing newline. This however meant that if the next text node
was also a newline you'd now have two newlines. In previous versions of
Oga this worked because the old XML generation code would call
String#strip on the XML to add after the doctype.
To support this in the new version we perform a lookahead in
XML::Generator#on_doctype to remove any trailing newlines added by this
method in case the first child node is a newline text node.
== Closing Elements
When an element has a sibling following it _and_ does not have any child
nodes it would not be closed properly when generating XML. This is due
to the "until next_node = ..." expression evaluating to true, thus never
executing its body.
There's probably some way to work around this by using the "loop"
method, but considering it's 02:09 I think the current approach is good
enough. Future me will probably hate me for it.
This allows Oga to parse documents that contain an XML declaration at a
place other than at the document root. Oga still only assigns the XML
declaration to the document whenever it is at the top-level. This
matches libxml/XML specification behaviour as far as I can tell.
When generating XML we should not process the siblings of a root node.
Doing so results in invalid XML being returned (due to siblings not
being children of the root node).
Not processing the siblings in this case also prevents the siblings loop
from getting stuck. To explain what's happening, let's assume we're
using the following document tree:
Document
|_ Text
|_ Element
Now let's say we take the Text node and call "to_xml" on it. When we
start the loop we'll run into the following code:
if child_node = children && current.children[0]
current = child_node
else
Here the if statement will evaluate to false because a Text node doesn't
have any child nodes, as such we enter the else branch. We now reach the
following code:
until next_node = current.is_a?(Node) && current.next
A Text node is a descendant of Node and it happens to have another node
(the Element node) as the next sibling. As a result we enter the `until`
loop's body. We now run into this code:
if current.is_a?(Node) && current != @start
current = current.parent
end
Here `current` is still our Text node and it is the @start node. As a
result the `current` re-assignment won't be evaluated.
Next we run into the following:
after_element(current, output) if current.is_a?(Element)
break if current == @start
The first line will not evaluate because `current` is still the `Text`
node. The `break` *will* evaluate because `current` is the same as
@start.
This will then lead to the following code being executed:
current = next_node
Here `next_node` is the next sibling of the Text node, which in the
above example is the Element node.
Because all of the above runs in a `while` loop we'll at some point end
up again at the start of the `until` loop. At this point the `current`
variable contains an `Element`. Because this node does *not* have a node
following it we'll once again enter the `until` loop's body.
This loop will now get stuck because `current` is a Node, it's not the
same as @start, thus `current` is set to its parent (the Document),
which also isn't the same as @start.
On the next iteration this loop will break because `current` is no
longer a node. However, because a Document _does_ have child nodes the
whole process of traversing children/siblings will keep repeating itself
forever.
To work around this we now use the following statement:
if child_node = children && current.children[0]
...
elsif current == @start
after_element(current, output) if current.is_a?(Element)
break
else
until next_node = current.is_a?(Node) && current.next
...
end
This prevents processing of any siblings once we have reached the root
node, in turn preventing the loop getting stuck forever.
I'm willing to bet there are probably a few more edge cases, but I can't
think of any others at the moment.
Fixes#161
Instead of calculating the previous/next node on the fly this data is
now set automatically whenever a node is stored in a NodeSet with an
owner. While this introduces some overhead and complexity when adding or
removing nodes from a NodeSet, it greatly reduces the runtime overhead
of calling Node#previous or Node#next.
While using recursion is an easy way of generating XML it can lead to
the call stack overflowing when serialising documents with lots of
nested nodes.
Generally there are two ways of working around this:
1. Use an explicit stack (e.g. an array or a queue of sorts) instead of
relying on the call stack.
2. Use an algorithm that doesn't use a stack at all (e.g. Morris
traversal).
This commit introduces the XML::Generator class which can serialize
documents back to XML without using a stack at all. This class takes
advantage of XML nodes having access to not only their child nodes, but
also their siblings and their parents.
All XML serialisation logic now resides in the XML::Generator class. In
turn the various "to_xml" methods just use this class and serialize
everything starting at "self".
Basically this will process the left-hand side first, assign the result
to a variable and then append this set with the nodes from the
right-hand side.
Fixes#149
Certain entities when decoded will produce a String with an invalid
encoding. This commit ensures that instead of raising an EncodingError
further down the line (e.g. when calling "inspect" on a document) the
entities are preserved as-is.
Fixes#143