Leaked Oga on Github.

2014-02-26 14:14:48 +01:00 · 2014-02-26 14:14:48 +01:00 · 6326bdd8c9
commit 6326bdd8c9
1 changed files with 93 additions and 0 deletions
--- a/README.md
+++ b/README.md
@ -0,0 +1,93 @@
+# Oga
+
+Oga is (or will be) a pure Ruby, thread-safe HTML (and XML in the future)
+parser that doesn't trigger segmentation faults on Ruby implementations other
+than MRI. Oga will initially **not** focus on performance but instead will
+focus on proper handling of encodings, stability and a sane API. More
+importantly it will be pure Ruby **only**. No C extensions, no Java, no x86 64
+assembly, just Ruby.
+
+From [Wikipedia][oga-wikipedia]:
+
+> Oga: A large two-person saw used for ripping large boards in the days before
+> power saws. One person stood on a raised platform, with the board below him,
+> and the other person stood underneath them. 
+
+## Planned Features
+
+* Full support for HTML(5)
+* Full support for XML, DTDs will probably be ignored.
+* Support for xpath and CSS selector based queries
+* SAX/pull parsing APIs that don't make you want to cut yourself
+
+## Features
+
+* A README
+
+## Requirements
+
+* Ruby
+
+Development requirements:
+
+* Ragel
+* Racc
+* Other stuff
+
+## Usage
+
+Basic DOM parsing example:
+
+    require 'oga'
+
+    parser   = Oga::Parser::DOM.new
+    document = parser.parse('<p>Hello</p>')
+
+    puts document.css('p').first.text # => "Hello"
+
+Pull parsing:
+
+    require 'oga'
+
+    parser = Oga::Parser::Pull.new('<p>Hello</p>')
+    
+    parser.each do |node|
+      puts node.text
+    end
+
+These examples will probably change once I actually start writing some code.
+
+## Why Another HTML/XML parser?
+
+Currently there are a few existing parser out there, the most famous one being
+[Nokogiri][nokogiri]. Another parser that's becoming more popular these days is
+[Ox][ox]. Ruby's standard library also comes with REXML.
+
+The sad truth is that these existing libraries are problematic in their own
+ways. Nokogiri for example is extremely unstable on Rubinius. On MRI it works
+because of the non conccurent nature of MRI, on Jruby it works because it's
+implemented as Java. Nokogiri also uses libxml2 which is a massive beast of a
+library, is not thread-safe and problematic to install on certain platforms
+(apparently). I don't want to compile libxml2 every time I install Nokogiri
+either.
+
+Ox looks very promising but it lacks a rather crucial feature: parsing HTML
+(without using a SAX API). It's also again a C extension making debugging more
+of a pain (at least for me).
+
+I just want an HTML parser that I can rely on stability wise and that is
+written in Ruby so I can actually debug it. In theory it should also make it
+easier for other Ruby developers to contribute.
+
+Oga is an attempt at solving this problem. By writing it in pure Ruby the
+initial performance will probably not be as great. However, I feel this is a
+problem with individual Ruby implementations, not the language itself. Also, by
+writing it in Ruby we don't have to deal with all the crazy things of C/C++ or
+even Java. 
+
+In theory it should also allow it to run on every Ruby implementation, be it
+Jruby, Rubinius, Topaz or even mruby.
+
+[nokogiri]: https://github.com/sparklemotion/nokogiri
+[oga-wikipedia]: https://en.wikipedia.org/wiki/Japanese_saw#Other_Japanese_saws
+[ox]: https://github.com/ohler55/ox