Go to file
Yorick Peterse 38284278d5
Don't process siblings when reaching a root node
When generating XML we should not process the siblings of a root node.
Doing so results in invalid XML being returned (due to siblings not
being children of the root node).

Not processing the siblings in this case also prevents the siblings loop
from getting stuck. To explain what's happening, let's assume we're
using the following document tree:

    Document
      |_ Text
      |_ Element

Now let's say we take the Text node and call "to_xml" on it. When we
start the loop we'll run into the following code:

    if child_node = children && current.children[0]
      current = child_node
    else

Here the if statement will evaluate to false because a Text node doesn't
have any child nodes, as such we enter the else branch. We now reach the
following code:

    until next_node = current.is_a?(Node) && current.next

A Text node is a descendant of Node and it happens to have another node
(the Element node) as the next sibling. As a result we enter the `until`
loop's body. We now run into this code:

    if current.is_a?(Node) && current != @start
      current = current.parent
    end

Here `current` is still our Text node and it is the @start node. As a
result the `current` re-assignment won't be evaluated.

Next we run into the following:

    after_element(current, output) if current.is_a?(Element)

    break if current == @start

The first line will not evaluate because `current` is still the `Text`
node.  The `break` *will* evaluate because `current` is the same as
@start.

This will then lead to the following code being executed:

    current = next_node

Here `next_node` is the next sibling of the Text node, which in the
above example is the Element node.

Because all of the above runs in a `while` loop we'll at some point end
up again at the start of the `until` loop. At this point the `current`
variable contains an `Element`. Because this node does *not* have a node
following it we'll once again enter the `until` loop's body.

This loop will now get stuck because `current` is a Node, it's not the
same as @start, thus `current` is set to its parent (the Document),
which also isn't the same as @start.

On the next iteration this loop will break because `current` is no
longer a node. However, because a Document _does_ have child nodes the
whole process of traversing children/siblings will keep repeating itself
forever.

To work around this we now use the following statement:

    if child_node = children && current.children[0]
      ...
    elsif current == @start
      after_element(current, output) if current.is_a?(Element)

      break
    else
      until next_node = current.is_a?(Node) && current.next
      ...
    end

This prevents processing of any siblings once we have reached the root
node, in turn preventing the loop getting stuck forever.

I'm willing to bet there are probably a few more edge cases, but I can't
think of any others at the moment.

Fixes #161
2016-09-10 02:49:05 +02:00
benchmark Moved comparing_gems_bench to xpath/compiler 2015-09-06 22:14:17 +02:00
checksum Release 2.5 2016-09-06 22:30:50 +02:00
doc Removed max width from the YARD output 2016-09-06 22:42:57 +02:00
ext Fixed parsing HTML identifiers containing colons 2015-12-26 20:28:35 +01:00
lib Don't process siblings when reaching a root node 2016-09-10 02:49:05 +02:00
profile Fixed oga requires for benchmarking/profiling 2015-08-19 01:27:01 +02:00
spec Don't process siblings when reaching a root node 2016-09-10 02:49:05 +02:00
task Added a KAF fixture file 2015-08-18 22:32:57 +02:00
.editorconfig Updated EditorConfig file for ruby-ll files. 2015-02-13 09:38:29 +01:00
.gitignore Added a KAF fixture file 2015-08-18 22:32:57 +02:00
.travis.yml Change build order to optimize build speed 2016-07-13 22:34:31 +02:00
.yardopts Define the public API using YARD/semver 2015-05-11 21:34:34 +02:00
CHANGELOG.md Release 2.5 2016-09-06 22:30:50 +02:00
COC.md Added code of conduct 2016-01-22 01:51:44 +01:00
CONTRIBUTING.md Fixed typo 2016-02-12 23:11:20 +11:00
Gemfile Lock json dependency to ~> 1.8 on Windows Ruby 1.9 2016-07-13 17:19:42 +02:00
LICENSE Change license from MIT to MPL 2.0 2015-05-15 23:48:18 +02:00
README.md Clarify README performance feature a bit 2016-04-21 15:37:15 +02:00
Rakefile Setup devkit for rake-compiler/Windows 2015-03-22 15:26:59 +01:00
appveyor.yml Download Bundler manually on AppVeyor 2015-03-23 13:34:38 +01:00
oga.gemspec Updated Gemspec license 2015-05-15 23:57:50 +02:00

README.md

Oga

Oga is an XML/HTML parser written in Ruby. It provides an easy to use API for parsing, modifying and querying documents (using XPath expressions). Oga does not require system libraries such as libxml, making it easier and faster to install on various platforms. To achieve better performance Oga uses a small, native extension (C for MRI/Rubinius, Java for JRuby).

Oga provides an API that allows you to safely parse and query documents in a multi-threaded environment, without having to worry about your applications blowing up.

From Wikipedia:

Oga: A large two-person saw used for ripping large boards in the days before power saws. One person stood on a raised platform, with the board below him, and the other person stood underneath them.

The name is a pun on Nokogiri.

Versioning Policy

Oga uses the version format MAJOR.MINOR (e.g. 2.1). An increase of the MAJOR version indicates backwards incompatible changes were introduced. The MINOR version is only increased when changes are backwards compatible, regardless of whether those changes are bugfixes or new features. Up until version 1.0 the code should be considered unstable meaning it can change (and break) at any given moment.

APIs explicitly tagged as private (e.g. using Ruby's private keyword or YARD's @api private tag) are not covered by these rules.

Examples

Parsing a simple string of XML:

Oga.parse_xml('<people><person>Alice</person></people>')

Parsing XML using strict mode (disables automatic tag insertion):

Oga.parse_xml('<people>foo</people>', :strict => true) # works fine
Oga.parse_xml('<people>foo', :strict => true)          # throws an error

Parsing a simple string of HTML:

Oga.parse_html('<link rel="stylesheet" href="foo.css">')

Parsing an IO handle pointing to XML (this also works when using Oga.parse_html):

handle = File.open('path/to/file.xml')

Oga.parse_xml(handle)

Parsing an IO handle using the pull parser:

handle = File.open('path/to/file.xml')
parser = Oga::XML::PullParser.new(handle)

parser.parse do |node|
  parser.on(:text) do
    puts node.text
  end
end

Using an Enumerator to download and parse an XML document on the fly:

enum = Enumerator.new do |yielder|
  HTTPClient.get('http://some-website.com/some-big-file.xml') do |chunk|
    yielder << chunk
  end
end

document = Oga.parse_xml(enum)

Parse a string of XML using the SAX parser:

class ElementNames
  attr_reader :names

  def initialize
    @names = []
  end

  def on_element(namespace, name, attrs = {})
    @names << name
  end
end

handler = ElementNames.new

Oga.sax_parse_xml(handler, '<foo><bar></bar></foo>')

handler.names # => ["foo", "bar"]

Querying a document using XPath:

document = Oga.parse_xml <<-EOF
<people>
  <person id="1">
    <name>Alice</name>
    <age>28</name>
  </person>
</people>
EOF

# The "xpath" method returns an enumerable (Oga::XML::NodeSet) that you can
# iterate over.
document.xpath('people/person').each do |person|
  puts person.get('id') # => "1"

  # The "at_xpath" method returns a single node from a set, it's the same as
  # person.xpath('name').first.
  puts person.at_xpath('name').text # => "Alice"
end

Querying the same document using CSS:

document = Oga.parse_xml <<-EOF
<people>
  <person id="1">
    <name>Alice</name>
    <age>28</name>
  </person>
</people>
EOF

# The "css" method returns an enumerable (Oga::XML::NodeSet) that you can
# iterate over.
document.css('people person').each do |person|
  puts person.get('id') # => "1"

  # The "at_css" method returns a single node from a set, it's the same as
  # person.css('name').first.
  puts person.at_css('name').text # => "Alice"
end

Modifying a document and serializing it back to XML:

document = Oga.parse_xml('<people><person>Alice</person></people>')
name     = document.at_xpath('people/person[1]/text()')

name.text = 'Bob'

document.to_xml # => "<people><person>Bob</person></people>"

Querying a document using a namespace:

document = Oga.parse_xml('<root xmlns:x="foo"><x:div></x:div></root>')
div      = document.xpath('root/x:div').first

div.namespace # => Namespace(name: "x" uri: "foo")

Features

  • Support for parsing XML and HTML(5)
    • DOM parsing
    • Stream/pull parsing
    • SAX parsing
  • Low memory footprint
  • High performance (taking into account most work happens in Ruby)
  • Support for XPath 1.0
  • CSS3 selector support
  • XML namespace support (registering, querying, etc)
  • Windows support

Requirements

Ruby Required Recommended
MRI >= 1.9.3 >= 2.1.2
Rubinius >= 2.2 >= 2.2.10
JRuby >= 1.7 >= 1.7.12
Maglev Not supported
Topaz Not supported
mruby Not supported

Maglev and Topaz are not supported due to the lack of a C API (that I know of) and the lack of active development of these Ruby implementations. mruby is not supported because it's a very different implementation all together.

To install Oga on MRI or Rubinius you'll need to have a working compiler such as gcc or clang. Oga's C extension can be compiled with both. JRuby does not require a compiler as the native extension is compiled during the Gem building process and bundled inside the Gem itself.

Thread Safety

Oga does not use a unsynchronized global mutable state. As a result of this you can parse/create documents concurrently without any problems. Modifying documents concurrently can lead to bugs as these operations are not synchronized.

Some querying operations will cache data in instance variables, without synchronization. An example is Oga::XML::Element#namespace which will cache an element's namespace after the first call.

In general it's recommended to not use the same document in multiple threads at the same time.

Namespace Support

Oga fully supports parsing/registering XML namespaces as well as querying them using XPath. For example, take the following XML:

<root xmlns="http://example.com">
    <bar>bar</bar>
</root>

If one were to try and query the bar element (e.g. using XPath root/bar) they'd end up with an empty node set. This is due to <root> defining an alternative default namespace. Instead you can query this element using the following XPath:

*[local-name() = "root"]/*[local-name() = "bar"]

Alternatively, if you don't really care where the <bar> element is located you can use the following:

descendant::*[local-name() = "bar"]

And if you want to specify an explicit namespace URI, you can use this:

descendant::*[local-name() = "bar" and namespace-uri() = "http://example.com"]

Unlike Nokogiri, Oga does not provide a way to create "dynamic" namespaces. That is, Nokogiri allows one to query the above document as following:

document = Nokogiri::XML('<root xmlns="http://example.com"><bar>bar</bar></root>')

document.xpath('x:root/x:bar', :x => 'http://example.com')

Oga does have a small trick you can use to cut down the size of your XPath queries. Because Oga assigns the name "xmlns" to default namespaces you can use this in your XPath queries:

document = Oga.parse_xml('<root xmlns="http://example.com"><bar>bar</bar></root>')

document.xpath('xmlns:root/xmlns:bar')

When using this you can still restrict the query to the correct namespace URI:

document.xpath('xmlns:root[namespace-uri() = "http://example.com"]/xmlns:bar')

In the future I might add an API to ease this process, although at this time I have little interest in providing an API similar to Nokogiri.

HTML5 Support

Oga fully supports HTML5 including the omission of certain tags. For example, the following is parsed just fine:

<li>Hello
<li>World

This is effectively parsed into:

<li>Hello</li>
<li>World</li>

One exception Oga makes is that it does not automatically insert html, head and body tags. Automatically inserting these tags requires a distinction between documents and fragments as a user might not always want these tags to be inserted if left out. This complicates the user facing API as well as complicating the parsing internals of Oga. As a result I have decided that Oga does not insert these tags when left out.

A more in depth explanation can be found here: https://github.com/YorickPeterse/oga/issues/98#issuecomment-96833066.

Documentation

The documentation is best viewed on the documentation website.

  • {file:CONTRIBUTING Contributing}
  • {file:changelog Changelog}
  • {file:migrating_from_nokogiri Migrating From Nokogiri}
  • {Oga::XML::Parser XML Parser}
  • {Oga::XML::SaxParser XML SAX Parser}
  • {file:xml_namespaces XML Namespaces}

Why Another HTML/XML parser?

Currently there are a few existing parser out there, the most famous one being Nokogiri. Another parser that's becoming more popular these days is Ox. Ruby's standard library also comes with REXML.

The sad truth is that these existing libraries are problematic in their own ways. Nokogiri for example is extremely unstable on Rubinius. On MRI it works because of the non conccurent nature of MRI, on JRuby it works because it's implemented as Java. Nokogiri also uses libxml2 which is a massive beast of a library, is not thread-safe and problematic to install on certain platforms (apparently). I don't want to compile libxml2 every time I install Nokogiri either.

To give an example about the issues with Nokogiri on Rubinius (or any other Ruby implementation that is not MRI or JRuby), take a look at these issues:

Some of these have been fixed, some have not. The core problem remains: Nokogiri acts in a way that there can be a large number of places where it might break due to throwing around void pointers and what not and expecting that things magically work. Note that I have nothing against the people running these projects, I just heavily, heavily dislike the resulting codebase one has to deal with today.

Ox looks very promising but it lacks a rather crucial feature: parsing HTML (without using a SAX API). It's also again a C extension making debugging more of a pain (at least for me).

I just want an XML/HTML parser that I can rely on stability wise and that is written in Ruby so I can actually debug it. In theory it should also make it easier for other Ruby developers to contribute.

License

All source code in this repository is subject to the terms of the Mozilla Public License, version 2.0 unless stated otherwise. A copy of this license can be found the file "LICENSE" or at https://www.mozilla.org/MPL/2.0/.