Commit Graph

119 Commits

Author SHA1 Message Date
Yorick Peterse fe74d60138 Manually bootstrap JRuby after all.
After discussing this with @headius I've decided to do this the manual way
anyway. Apparently the basic load service stuff is deprecated and not very
reliable.
2014-05-07 22:32:34 +02:00
Yorick Peterse b8efed5177 Renamed on_start_doctype to on_doctype_start. 2014-05-06 23:18:44 +02:00
Yorick Peterse 2053018d07 Slap JRuby so that it can load the .jar file. 2014-05-06 20:45:26 +02:00
Yorick Peterse 6e685378e0 Setup Ragel for JRuby and load things the hard way 2014-05-06 19:06:04 +02:00
Yorick Peterse ee756037e7 Removed unused YARD tag. 2014-05-05 09:45:10 +02:00
Yorick Peterse aeab885a7f Docs for the Ruby part of the XML lexer. 2014-05-05 09:44:35 +02:00
Yorick Peterse 2689d3f65a Initial setup using a C extension.
While I've tried to keep Oga pure Ruby for as long as possible the performance
of Ragel's Ruby output was not worth the trouble. For example, lexing 10MB of
XML would take 5 to 6 seconds at least. Nokogiri on the other hand can parse
that same XML into a DOM document in about 300 miliseconds. Such a big
performance difference is not acceptable.

To work around this the XML/HTML lexer will be implemented in C for
MRI/Rubinius and Java for JRuby. For now there's only a C extension as I
haven't read up yet on the JRuby API. The end goal is to provide some sort of
Ragel "template" that can be used to generate the corresponding C/Java
extension code. This would remove the need of duplicating the grammar and
associated code.

The native extension setup is a hybrid between native and Ruby. The raw Ragel
stuff happens in C/Java while the actual logic of actions happens in Ruby. This
adds a small amount of overhead but makes it much easier to maintain the lexer.
Even with this extra overhead the performance is much better than pure Ruby.
The 10MB of XML mentioned above is lexed in about 600 miliseconds. In other
words, it's 10 times faster.
2014-05-05 00:31:28 +02:00
Yorick Peterse baaa24a760 Indentation fix in the lexer. 2014-05-04 18:06:43 +02:00
Yorick Peterse f18e8893de Removed the buffering crap from the lexer. 2014-05-04 17:39:08 +02:00
Yorick Peterse 9dfdefee47 Removed XML::Lexer#buffering?
Instead of wrapping a predicate method around the ivar we'll just access it
directly. This reduces average lexing times in the big XML benchmark from 7,5
to ~7 seconds.
2014-05-01 22:59:56 +02:00
Yorick Peterse f607cf50dc Use local variables for Ragel.
Instead of using instance variables for ts, te, etc we'll use local variables.
Grand wizard overloard @whitequark suggested that this would be quite a bit
faster, which turns out to be true. For example, the big XML lexer benchmark
would, prior to this commit, complete in about 9 - 9,3 seconds. With this
commit that hovers around 8,5 seconds.
2014-05-01 13:00:29 +02:00
Yorick Peterse 83f6d5437e Contextual pull parsing.
This adds the ability to more easily act upon specific node types and nestings
when using the pull parsing API.

A basic example of this API looks like the following (only including relevant
code):

    parser.parse do |node|
      parser.on(:element, %w{people person}) do
        people << {:name => nil, :age => nil}
      end

      parser.on(:text, %w{people person name}) do
        people.last[:name] = node.text
      end

      parser.on(:text, %w{people person age}) do
        people.last[:age] = node.text.to_i
      end
    end

This fixes #6.
2014-04-29 23:05:49 +02:00
Yorick Peterse 1a413998a3 Track the current node in the pull parser.
The current node is tracked in the instance method `node`.
2014-04-29 21:21:05 +02:00
Yorick Peterse 45b0cdf811 Track element name nesting in the pull parser.
Tracking the names of nested elements makes it a lot easier to do contextual
pull parsing. Without this it's impossible to know what context the parser is
in at a given moment.

For memory reasons the parser currently only tracks the element names. In the
future it might perhaps also track extra information to make parsing easier.
2014-04-28 23:40:36 +02:00
Yorick Peterse 030a0068bd Basic pull parsing setup.
This parser extends the regular DOM parser but instead delegates certain nodes
to a block instead of building a DOM tree.

The API is a bit raw in its current form but I'll extend it and make it a bit
more user friendly in the following commits. In particular I want to make it
easier to figure out if a certain node is nested inside another node.
2014-04-28 17:22:17 +02:00
Yorick Peterse fd5bbbc9a2 Move element recursion handling into a method.
This makes it easier to disable later on in the streaming parser.
2014-04-28 10:25:05 +02:00
Yorick Peterse 785ec26fe7 Create Element instances before recursing. 2014-04-28 10:21:34 +02:00
Yorick Peterse 9939cf49eb Move parser callback code into dedicated methods. 2014-04-28 10:18:55 +02:00
Yorick Peterse 5d05aed6ec Corrected docs for XML::Parser. 2014-04-26 12:57:35 +02:00
Yorick Peterse f53fe4ed7c Reset the lexer when resetting the parser.
Also removed the unused @lines instance variable.
2014-04-25 00:15:24 +02:00
Yorick Peterse 83ff0e6656 Various small parser cleanups. 2014-04-25 00:07:53 +02:00
Yorick Peterse ecf6851711 Revert "Move linking of child nodes to a dedicated mixin."
This doesn't actually make things any easier. It also introduces a weirdly
named mixin.

This reverts commit 0968465f0c.
2014-04-24 21:16:31 +02:00
Yorick Peterse 0968465f0c Move linking of child nodes to a dedicated mixin. 2014-04-24 09:43:50 +02:00
Yorick Peterse 08d412da7e First shot at removing the AST layer.
The AST layer is being removed because it doesn't really serve a useful
purpose. In particular when creating a streaming parser the AST nodes would
only introduce extra overhead.

As a result of this the parser will instead emit a DOM tree directly instead of
first emitting an AST.
2014-04-21 23:05:39 +02:00
Yorick Peterse 9ee9ec14cb Lexer: only pop elements when needed. 2014-04-19 01:10:32 +02:00
Yorick Peterse 54e6650338 Don't use define_method in the lexer.
Profiling showed that calls to methods defined using `define_method` are
really, really slow. Before this commit the lexer would process 3000-4000
lines per second. With this commit that has been increased to around 10 000
lines per second.

Thanks to @headius for mentioning the (potential) overhead of define_method.
2014-04-17 19:08:26 +02:00
Yorick Peterse d9fa4b7c45 Lex input as a sequence of bytes.
Instead of lexing the input as a raw String or as a set of codepoints it's
treated as a sequence of bytes. This removes the need of String#[] (replaced by
String#byteslice) which in turn reduces the amount of memory needed and speeds
up the lexing time.

Thanks to @headius and @apeiros for suggesting this and rubber ducking along!
2014-04-17 17:45:05 +02:00
Yorick Peterse 70516b7447 Yield tokens in the lexer and parser.
After some digging I found out that Racc has a method called `yyparse`. Using
this method (and a custom callback method) you can `yield` tokens as a form of
input. This makes it a lot easier to feed tokens as a stream from the lexer.

Sadly the current performance of the lexer is still total garbage. Most of the
memory usage also comes from using String#unpack, especially on large XML
inputs (e.g. 100 MB of XML). It looks like the resulting memory usage is about
10x the input size.

One option might be some kind of wrapper around String. This wrapper would have
a sliding window of, say, 1024 bytes. When you create it the first 1024 bytes
of the input would be unpacked. When seeking through the input this window
would move forward.

In theory this means that you'd only end up with having only 1024 Fixnum
instances around at any given time instead of "a very big number". I have to
test how efficient this is in practise.
2014-04-17 00:39:41 +02:00
Yorick Peterse 25edd2de00 Use a Set for storing void element names. 2014-04-10 12:28:47 +02:00
Yorick Peterse b96f7c4852 Lex attributes with namespaces.
These are lexed as just the name instead of two separate tokens.
2014-04-10 11:01:49 +02:00
Yorick Peterse c974b96b88 Truncate lines in parser errors.
The offending lines of code displayed in the error message are truncated to 80
characters. This should make reading the error messages less of a pain when
dealing with very long lines of HTML/XML.
2014-04-10 10:08:51 +02:00
Yorick Peterse 8237d5791d Stream tokens when lexing.
Instead of returning the tokens as a whole they are now streamed using
XML::Lexer#advance. This method returns the next token upon every call. It uses
a small buffer in case a particular block of text results in multiple tokens.
2014-04-09 22:08:13 +02:00
Yorick Peterse e9bb97d261 First steps towards making the lexer stream tokens 2014-04-09 19:32:06 +02:00
Yorick Peterse cb74c7edf9 Specs for XML parser errors. 2014-04-07 21:31:36 +02:00
Yorick Peterse 54ef125637 Basic docs for everything under Oga::XML. 2014-04-04 17:48:36 +02:00
Yorick Peterse 13a9228563 Properly indent doctype/XML decl inspect values. 2014-04-04 11:13:39 +02:00
Yorick Peterse 37a12722cb Rough setup for a custom #inspect format.
This format is a lot more readable than the default Ruby #inspect format
(mostly due to not including previous/next/parent nodes).
2014-04-04 00:41:29 +02:00
Yorick Peterse a2c525dd7c Insert newlines after XML dec/doctypes. 2014-04-03 23:04:21 +02:00
Yorick Peterse 230fafa2d3 Document should not inherit from Node.
A document is not an XML node on itself. If logic has to be shared between the
Document and the Node class I'll resort to using mixins for this.
2014-04-03 22:45:40 +02:00
Yorick Peterse c077988dd6 Tree building of doctypes. 2014-04-03 22:44:00 +02:00
Yorick Peterse 81b1155af3 Lex/parse doctype names separately. 2014-04-03 21:59:57 +02:00
Yorick Peterse 8185656c1e Fixed typ. 2014-04-03 21:41:31 +02:00
Yorick Peterse 30c01a5aee Tests for XML::TreeBuilder#handler_missing. 2014-04-03 09:43:30 +02:00
Yorick Peterse bdb76cefc5 Dedicated handling of XML declaration nodes. 2014-04-02 22:30:45 +02:00
Yorick Peterse d6c0a1f3f3 Lex/parser XML declaration attributes. 2014-04-02 22:01:17 +02:00
Yorick Peterse f99c13b516 Tests + docs for the TreeBuilder class. 2014-03-28 17:11:54 +01:00
Yorick Peterse 6d866523b8 Renamed XML::Builder to XML::TreeBuilder. 2014-03-28 16:37:37 +01:00
Yorick Peterse e141c084f9 Dedicated DOM builder class for CDATA tags. 2014-03-28 09:27:53 +01:00
Yorick Peterse 2b250bbf42 Rough DOM building setup. 2014-03-28 08:59:48 +01:00
Yorick Peterse 6ae52c1b12 Initial rough sketches for the DOM API. 2014-03-26 18:12:00 +01:00