For this I've enabled both the old expectation and stubbing/mocking syntax. The
old syntax is much more compact and to me reads nicer. For example, consider
the following:
lex('<foo></foo>').should == [...]
To me this reads much nicer than this:
expect(lex('<foo></foo>')).to eq([...])
Using IO/StringIO objects one can parse large XML files without first having to
read the entire file into memory. This can potentially save a lot of memory at
the cost of a slightly slower runtime.
For IO like instances the lexer will consume the input line by line. If a
String is given it's consumed as a whole instead. A small side effect of
reading the input line by line is that text such as "foo\nbar" will be lexed as
two tokens instead of one.
Fixes#19.
This moves the element related rules to the element_head machine (where they
belong). This in turn makes it possible to lex ">" as a text node, previously
this was impossible.
This adds the ability to more easily act upon specific node types and nestings
when using the pull parsing API.
A basic example of this API looks like the following (only including relevant
code):
parser.parse do |node|
parser.on(:element, %w{people person}) do
people << {:name => nil, :age => nil}
end
parser.on(:text, %w{people person name}) do
people.last[:name] = node.text
end
parser.on(:text, %w{people person age}) do
people.last[:age] = node.text.to_i
end
end
This fixes#6.
Tracking the names of nested elements makes it a lot easier to do contextual
pull parsing. Without this it's impossible to know what context the parser is
in at a given moment.
For memory reasons the parser currently only tracks the element names. In the
future it might perhaps also track extra information to make parsing easier.
This parser extends the regular DOM parser but instead delegates certain nodes
to a block instead of building a DOM tree.
The API is a bit raw in its current form but I'll extend it and make it a bit
more user friendly in the following commits. In particular I want to make it
easier to figure out if a certain node is nested inside another node.
The AST layer is being removed because it doesn't really serve a useful
purpose. In particular when creating a streaming parser the AST nodes would
only introduce extra overhead.
As a result of this the parser will instead emit a DOM tree directly instead of
first emitting an AST.
Instead of lexing the input as a raw String or as a set of codepoints it's
treated as a sequence of bytes. This removes the need of String#[] (replaced by
String#byteslice) which in turn reduces the amount of memory needed and speeds
up the lexing time.
Thanks to @headius and @apeiros for suggesting this and rubber ducking along!
Instead of returning the tokens as a whole they are now streamed using
XML::Lexer#advance. This method returns the next token upon every call. It uses
a small buffer in case a particular block of text results in multiple tokens.
T_ELEM_OPEN has been renamed to T_ELEM_START, T_ELEM_CLOSE has been renamed to
T_ELEM_END. This keeps the token names consistent with the other ones (e.g.
T_COMMENT_START).