Leaked Oga on Github.
commit
6326bdd8c9
@@ -0,0 +1,93 @@
# Oga

Oga is (or will be) a pure Ruby, thread-safe HTML (and XML in the future)
parser that doesn't trigger segmentation faults on Ruby implementations other
than MRI. Oga will initially **not** focus on performance but instead will
focus on proper handling of encodings, stability and a sane API. More
importantly, it will be pure Ruby **only**: no C extensions, no Java, no x86-64
assembly, just Ruby.

From [Wikipedia][oga-wikipedia]:

> Oga: A large two-person saw used for ripping large boards in the days before
> power saws. One person stood on a raised platform, with the board below him,
> and the other person stood underneath them.

## Planned Features

* Full support for HTML(5)
* Full support for XML (DTDs will probably be ignored)
* Support for XPath and CSS selector based queries
* SAX/pull parsing APIs that don't make you want to cut yourself

## Features

* A README

## Requirements

* Ruby

Development requirements:

* Ragel
* Racc
* Other stuff

## Usage

Basic DOM parsing example:

    require 'oga'

    parser = Oga::Parser::DOM.new
    document = parser.parse('<p>Hello</p>')

    puts document.css('p').first.text # => "Hello"

Pull parsing:

    require 'oga'

    parser = Oga::Parser::Pull.new('<p>Hello</p>')

    parser.each do |node|
      puts node.text
    end

These examples will probably change once I actually start writing some code.
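
Until real code exists, the pull-style iteration shown in this section can be approximated with nothing but Ruby's standard library. The sketch below only illustrates the idea and is not Oga's implementation: `each_token` is a made-up helper, it uses `StringScanner` instead of the Ragel/Racc toolchain listed under the development requirements, and it only understands attribute-less tags and plain text.

```ruby
require 'strscan'

# Hypothetical helper, NOT part of Oga: walks over a tiny subset of HTML
# (attribute-less open/close tags and text) and yields [type, value] pairs.
def each_token(html)
  # Return an Enumerator when no block is given so callers can pull lazily.
  return enum_for(:each_token, html) unless block_given?

  scanner = StringScanner.new(html)

  until scanner.eos?
    if scanner.scan(%r{</([a-zA-Z][a-zA-Z0-9]*)>})
      yield [:close, scanner[1]]
    elsif scanner.scan(/<([a-zA-Z][a-zA-Z0-9]*)>/)
      yield [:open, scanner[1]]
    elsif (text = scanner.scan(/[^<]+/))
      yield [:text, text]
    else
      # Anything else (attributes, comments, stray "<") is out of scope here.
      raise ArgumentError, "unexpected input at offset #{scanner.pos}"
    end
  end
end

each_token('<p>Hello</p>') do |type, value|
  puts value if type == :text # prints "Hello"
end
```

Yielding one token at a time, rather than building a tree first, is what keeps a pull API cheap on memory; a real parser would of course need to handle attributes, entities, encodings and malformed markup.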

## Why Another HTML/XML Parser?

There are already a few parsers out there, the most famous being
[Nokogiri][nokogiri]. Another parser that's becoming more popular these days is
[Ox][ox]. Ruby's standard library also comes with REXML.

The sad truth is that these existing libraries are problematic in their own
ways. Nokogiri, for example, is extremely unstable on Rubinius. On MRI it works
because of MRI's non-concurrent nature; on JRuby it works because it's
implemented in Java. Nokogiri also uses libxml2, which is a massive beast of a
library, is not thread-safe and is (apparently) problematic to install on
certain platforms. I don't want to compile libxml2 every time I install
Nokogiri either.

Ox looks very promising but it lacks a rather crucial feature: parsing HTML
(without using a SAX API). It's also a C extension again, which makes debugging
more of a pain (at least for me).

I just want an HTML parser that I can rely on stability-wise and that is
written in Ruby so I can actually debug it. In theory that should also make it
easier for other Ruby developers to contribute.

Oga is an attempt at solving this problem. Because it is written in pure Ruby,
the initial performance will probably not be great. However, I feel this is a
problem with individual Ruby implementations, not the language itself. Also, by
writing it in Ruby we don't have to deal with all the crazy things of C/C++ or
even Java.

In theory it should also allow Oga to run on every Ruby implementation, be it
JRuby, Rubinius, Topaz or even mruby.
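
REXML, mentioned above, is itself written in pure Ruby, so it gives a rough impression of what DOM-style parsing without native extensions looks like today. The following is a small sketch using REXML rather than Oga. `first_paragraph_text` is a hypothetical helper, and because REXML is an XML parser the input has to be well-formed. On Ruby 3.0 and newer, REXML ships as a bundled gem, so it may need to be listed in a Gemfile.

```ruby
require 'rexml/document'

# Hypothetical helper for this comparison only: returns the text of the
# first <p> element, or nil when the document contains none.
def first_paragraph_text(markup)
  document = REXML::Document.new(markup)
  node = REXML::XPath.first(document, '//p')

  node && node.text
end

puts first_paragraph_text('<p>Hello</p>') # prints "Hello"
```

Real-world HTML is rarely well-formed XML, which is the gap an HTML-aware parser is meant to fill.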

[nokogiri]: https://github.com/sparklemotion/nokogiri
[oga-wikipedia]: https://en.wikipedia.org/wiki/Japanese_saw#Other_Japanese_saws
[ox]: https://github.com/ohler55/ox