commit 6326bdd8c943299e9adc4d2cb6de00934da3609b
Author: Yorick Peterse
Date:   Wed Feb 26 14:14:48 2014 +0100

    Leaked Oga on Github.

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..418ea44
--- /dev/null
+++ b/README.md
@@ -0,0 +1,93 @@
+# Oga
+
+Oga is (or will be) a pure Ruby, thread-safe HTML (and XML in the future)
+parser that doesn't trigger segmentation faults on Ruby implementations other
+than MRI. Oga will initially **not** focus on performance but instead will
+focus on proper handling of encodings, stability and a sane API. More
+importantly it will be pure Ruby **only**. No C extensions, no Java, no x86-64
+assembly, just Ruby.
+
+From [Wikipedia][oga-wikipedia]:
+
+> Oga: A large two-person saw used for ripping large boards in the days before
+> power saws. One person stood on a raised platform, with the board below him,
+> and the other person stood underneath them.
+
+## Planned Features
+
+* Full support for HTML(5)
+* Full support for XML; DTDs will probably be ignored.
+* Support for XPath and CSS selector based queries
+* SAX/pull parsing APIs that don't make you want to cut yourself
+
+## Features
+
+* A README
+
+## Requirements
+
+* Ruby
+
+Development requirements:
+
+* Ragel
+* Racc
+* Other stuff
+
+## Usage
+
+Basic DOM parsing example:
+
+    require 'oga'
+
+    parser   = Oga::Parser::DOM.new
+    document = parser.parse('<p>Hello</p>')
+
+    puts document.css('p').first.text # => "Hello"
+
+Pull parsing:
+
+    require 'oga'
+
+    parser = Oga::Parser::Pull.new('<p>Hello</p>')
+
+    parser.each do |node|
+      puts node.text
+    end
+
+These examples will probably change once I actually start writing some code.
+
+## Why Another HTML/XML parser?
+
+Currently there are a few existing parsers out there, the most famous one being
+[Nokogiri][nokogiri]. Another parser that's becoming more popular these days is
+[Ox][ox]. Ruby's standard library also comes with REXML.
+
+The sad truth is that these existing libraries are problematic in their own
+ways. Nokogiri for example is extremely unstable on Rubinius. On MRI it works
+because of the non-concurrent nature of MRI; on JRuby it works because it's
+implemented in Java. Nokogiri also uses libxml2, which is a massive beast of a
+library, is not thread-safe and is problematic to install on certain platforms
+(apparently). I don't want to compile libxml2 every time I install Nokogiri
+either.
+
+Ox looks very promising but it lacks a rather crucial feature: parsing HTML
+(without using a SAX API). It's also again a C extension, making debugging more
+of a pain (at least for me).
+
+I just want an HTML parser that I can rely on stability-wise and that is
+written in Ruby so I can actually debug it. In theory it should also make it
+easier for other Ruby developers to contribute.
+
+Oga is an attempt at solving this problem. By writing it in pure Ruby the
+initial performance will probably not be as great. However, I feel this is a
+problem with individual Ruby implementations, not the language itself. Also, by
+writing it in Ruby we don't have to deal with all the crazy things of C/C++ or
+even Java.
+
+In theory it should also allow it to run on every Ruby implementation, be it
+JRuby, Rubinius, Topaz or even mruby.
+
+[nokogiri]: https://github.com/sparklemotion/nokogiri
+[oga-wikipedia]: https://en.wikipedia.org/wiki/Japanese_saw#Other_Japanese_saws
+[ox]: https://github.com/ohler55/ox