Yorick Peterse 5f7256eb8f Encode/decode XML entities.
When lexing XML entities such as &amp; and &lt;, these sequences are now
converted into their "actual" forms. In turn, Oga::XML::Text#to_xml ensures they
are encoded when the method is called.

Performance-wise this puts some strain on the lexer, since every T_TEXT/T_STRING
node now potentially has to have its content modified. In the benchmark
xml/lexer/string_average_bench.rb the average processing time is now about the
same as before the improvements made in
8db77c0a09. I was hoping that the lexer would
still be a bit faster, but alas this is not the case. Doing this in native code
would be a nightmare as C doesn't have a proper string replacement function. I'm
not old/sadistic enough to write one myself just yet.

This fixes #49
2014-09-28 21:53:25 +02:00
benchmark Benchmark for lexing HTML void elements. 2014-09-24 10:43:49 +02:00
checksum Release 0.1.3 2014-09-24 00:24:00 +02:00
doc Release 0.1.3 2014-09-24 00:24:00 +02:00
ext Count newlines of text nodes in native code. 2014-09-25 22:49:11 +02:00
lib Encode/decode XML entities. 2014-09-28 21:53:25 +02:00
profile Namespaced the profile directories. 2014-06-13 00:33:53 +02:00
spec Encode/decode XML entities. 2014-09-28 21:53:25 +02:00
task Namespace YARD Rake tasks under "doc". 2014-09-16 14:49:49 +02:00
.editorconfig Updated editor configuration. 2014-05-08 00:17:12 +02:00
.gitignore Namespaced the profile directories. 2014-06-13 00:33:53 +02:00
.ruby-version Added a .ruby-version file. 2014-03-18 18:08:25 +01:00
.travis.yml Updated the Rubies to run Travis on. 2014-09-10 22:17:04 +02:00
.yardopts Basic project layout. 2014-02-26 19:50:16 +01:00
CONTRIBUTING.md Replaced another mention of 79 characters. 2014-07-09 11:02:17 +02:00
Gemfile Basic project layout. 2014-02-26 19:50:16 +01:00
LICENSE Added a license. 2014-02-26 22:20:47 +01:00
README.md Added SAX parsing to the list of parsing features. 2014-09-16 14:50:48 +02:00
Rakefile Basic XPath parser setup. 2014-06-01 23:02:28 +02:00
oga.gemspec Updated the Gem description. 2014-09-12 14:40:01 +02:00

README.md

Oga

Oga is an XML/HTML parser written in Ruby. It provides an easy-to-use API for parsing, modifying and querying documents (using XPath expressions). Oga does not require system libraries such as libxml, making it easier and faster to install on various platforms. To achieve better performance, Oga uses a small native extension (C for MRI/Rubinius, Java for JRuby).

Oga provides an API that allows you to safely parse and query documents in a multi-threaded environment, without having to worry about your applications blowing up.

From Wikipedia:

Oga: A large two-person saw used for ripping large boards in the days before power saws. One person stood on a raised platform, with the board below him, and the other person stood underneath them.

Examples

Parsing a simple string of XML:

Oga.parse_xml('<people><person>Alice</person></people>')

Parsing a simple string of HTML:

Oga.parse_html('<link rel="stylesheet" href="foo.css">')

Parsing an IO handle pointing to XML (this also works when using Oga.parse_html):

handle = File.open('path/to/file.xml')

Oga.parse_xml(handle)

Parsing an IO handle using the pull parser:

handle = File.open('path/to/file.xml')
parser = Oga::XML::PullParser.new(handle)

parser.parse do |node|
  parser.on(:text) do
    puts node.text
  end
end

Parse a string of XML using the SAX parser:

class ElementNames
  attr_reader :names

  def initialize
    @names = []
  end

  def on_element(namespace, name, attrs = {})
    @names << name
  end
end

handler = ElementNames.new

Oga.sax_parse_xml(handler, '<foo><bar></bar></foo>')

handler.names # => ["foo", "bar"]

Querying a document using XPath:

document = Oga.parse_xml('<people><person>Alice</person></people>')

document.xpath('string(people/person)') # => "Alice"

Modifying a document and serializing it back to XML:

document = Oga.parse_xml('<people><person>Alice</person></people>')
name     = document.at_xpath('people/person[1]/text()')

name.text = 'Bob'

document.to_xml # => "<people><person>Bob</person></people>"

Querying a document using a namespace:

document = Oga.parse_xml('<root xmlns:x="foo"><x:div></x:div></root>')
div      = document.xpath('root/x:div').first

div.namespace # => Namespace(name: "x" uri: "foo")

Features

  • Support for parsing XML and HTML(5)
    • DOM parsing
    • Stream/pull parsing
    • SAX parsing
  • Low memory footprint
  • High performance; if something doesn't perform well enough, it's a bug
  • Support for XPath 1.0
  • XML namespace support (registering, querying, etc.)

Requirements

Ruby      Required       Recommended
MRI       >= 1.9.3       >= 2.1.2
Rubinius  >= 2.2         >= 2.2.10
JRuby     >= 1.7         >= 1.7.12
Maglev    Not supported
Topaz     Not supported
mruby     Not supported

Maglev and Topaz are not supported due to the lack of a C API (that I know of) and the lack of active development of these Ruby implementations. mruby is not supported because it's a very different implementation altogether.

To install Oga on MRI or Rubinius you'll need to have a working compiler such as gcc or clang. Oga's C extension can be compiled with both. JRuby does not require a compiler as the native extension is compiled during the Gem building process and bundled inside the Gem itself.
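As a rough illustration of the minimum versions listed above, the check below compares an engine/version pair against that table. The constant and method names are made up for this sketch and are not part of Oga; the engine keys assume that MRI reports `RUBY_ENGINE` as "ruby" and Rubinius as "rbx".

```ruby
# Minimum versions copied from the requirements table above.
# Hypothetical constant, not part of Oga.
MINIMUM_VERSIONS = {
  'ruby'  => '1.9.3', # MRI
  'rbx'   => '2.2',   # Rubinius
  'jruby' => '1.7'
}.freeze

# Returns true when the engine is listed and the version meets its minimum.
# Engines missing from the table (Maglev, Topaz, mruby, ...) return false.
def oga_supported?(engine, version)
  minimum = MINIMUM_VERSIONS[engine]
  return false unless minimum

  Gem::Version.new(version) >= Gem::Version.new(minimum)
end

# Version of the running engine itself, which on JRuby and Rubinius differs
# from the Ruby language level reported by RUBY_VERSION.
def current_engine_version
  case RUBY_ENGINE
  when 'jruby' then JRUBY_VERSION
  when 'rbx'   then Rubinius::VERSION
  else RUBY_VERSION
  end
end

puts oga_supported?(RUBY_ENGINE, current_engine_version)
```

This only checks version numbers; it says nothing about whether a working compiler is available on MRI or Rubinius.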

Thread Safety

Documents parsed using Oga are thread-safe as long as they are not modified by multiple threads at the same time. Querying documents using XPath can be done by multiple threads just fine. Write operations, such as removing attributes, are not thread-safe and should not be done by multiple threads at once.

It is advised that you do not share parsed documents between threads unless you really have to.
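If a document really does have to be shared and modified, one option is to funnel every access through a single lock. The wrapper below is a minimal sketch of that idea and is not something Oga provides: `GuardedDocument` is a hypothetical name, and a plain Array stands in for a parsed document so the example runs without Oga installed.

```ruby
# Hypothetical helper, not part of Oga: every access to the wrapped object
# goes through one Mutex, so only one thread touches it at a time.
class GuardedDocument
  def initialize(document)
    @document = document
    @lock     = Mutex.new
  end

  # Yields the wrapped document while holding the lock and returns the
  # block's result.
  def synchronize
    @lock.synchronize { yield @document }
  end
end

# Stand-in for a parsed document (assumption made for this sketch only).
shared = GuardedDocument.new([])

threads = 4.times.map do |i|
  Thread.new do
    100.times { |n| shared.synchronize { |doc| doc << [i, n] } }
  end
end

threads.each(&:join)

count = shared.synchronize { |doc| doc.size }
puts count
```

With Oga the wrapped object would be the result of `Oga.parse_xml`, and the block would contain the write operation, such as the `name.text = 'Bob'` assignment from the examples. Taking the lock for reads as well is a conservative choice of this sketch; it trades some concurrency for not having to reason about reads that overlap a write.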

Documentation

The documentation is best viewed on the documentation website.

  • {file:CONTRIBUTING Contributing}
  • {file:changelog Changelog}
  • {file:migrating_from_nokogiri Migrating From Nokogiri}

Native Extension Setup

The native extensions can be found in ext/ and are divided into a C and Java extension. These extensions are only used for the XML lexer built using Ragel. The grammar for this lexer is shared between C and Java and can be found in ext/ragel/base_lexer.rl.

The extensions delegate most of their work back to Ruby code. As a result, maintaining this codebase is much easier. Anyone who wants to change the grammar only has to do so in one place and doesn't have to worry about C and/or Java specific details.

For more details on calling Ruby methods from Ragel see the source documentation in ext/ragel/base_lexer.rl.
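The delegation idea can be illustrated in plain Ruby. In the sketch below a small scanning method plays the role of the native extension: it only finds boundaries in the input and hands each match to a callback on a Ruby object, which owns all token logic. This is an analogy under stated assumptions, not Oga's actual lexer: `ToyLexer`, `scan`, the callback names and the token names other than T_TEXT are illustrative, and the real scanner is generated by Ragel rather than written with StringScanner.

```ruby
require 'strscan'

# Ruby side: decides what each match means and builds the tokens.
# Illustrative class, not Oga's lexer.
class ToyLexer
  attr_reader :tokens

  def initialize
    @tokens = []
  end

  def on_element_start(name)
    @tokens << [:T_ELEM_START, name]
  end

  def on_element_end(name)
    @tokens << [:T_ELEM_END, name]
  end

  def on_text(value)
    @tokens << [:T_TEXT, value]
  end
end

# Stand-in for the native scanner: it recognises boundaries in the input and
# delegates everything else to the Ruby callbacks above. Only bare tags
# without attributes are handled.
def scan(input, lexer)
  scanner = StringScanner.new(input)

  until scanner.eos?
    if scanner.scan(%r{</(\w+)>})
      lexer.on_element_end(scanner[1])
    elsif scanner.scan(/<(\w+)>/)
      lexer.on_element_start(scanner[1])
    elsif (text = scanner.scan(/[^<]+/))
      lexer.on_text(text)
    else
      raise ArgumentError, "unexpected input at offset #{scanner.pos}"
    end
  end

  lexer.tokens
end

tokens = scan('<p>hi</p>', ToyLexer.new)
p tokens
```

Because the scanning side knows nothing about tokens, changing what a callback emits does not require touching the scanner, which mirrors the maintenance argument made above.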

Why Another HTML/XML Parser?

Currently there are a few existing parsers out there, the most famous one being Nokogiri. Another parser that's becoming more popular these days is Ox. Ruby's standard library also comes with REXML.

The sad truth is that these existing libraries are problematic in their own ways. Nokogiri, for example, is extremely unstable on Rubinius. On MRI it works because of the non-concurrent nature of MRI; on JRuby it works because it's implemented in Java. Nokogiri also uses libxml2, which is a massive beast of a library, is not thread-safe and is problematic to install on certain platforms (apparently). I don't want to compile libxml2 every time I install Nokogiri either.

To give an example about the issues with Nokogiri on Rubinius (or any other Ruby implementation that is not MRI or JRuby), take a look at these issues:

Some of these have been fixed, some have not. The core problem remains: Nokogiri is built in a way that leaves a large number of places where it might break, because it throws around void pointers and whatnot and expects things to magically work. Note that I have nothing against the people running these projects, I just heavily, heavily dislike the resulting codebase one has to deal with today.

Ox looks very promising but it lacks a rather crucial feature: parsing HTML (without using a SAX API). It's also, again, a C extension, which makes debugging more of a pain (at least for me).

I just want an XML/HTML parser that I can rely on stability-wise and that is written in Ruby so I can actually debug it. In theory it should also make it easier for other Ruby developers to contribute.

License

All source code in this repository is licensed under the MIT license unless specified otherwise. A copy of this license can be found in the file "LICENSE" in the root directory of this repository.