This method uses a loop to traverse upwards the DOM tree in order to find the
root document/element. While this might have an impact on performance I don't
expect Oga itself to call this method very often. The benefit is that Node
instances don't require users to manually pass the top level document as an
argument.
The combination of iterating over an array and removing values from it results
in not all elements being removed. For example:
numbers = [10, 20, 30]
numbers.each do |num|
numbers.delete(num)
end
numbers # => [20]
As a result of this the NodeSet#remove method uses two iterations:
1. One iteration to retrieve all NodeSet instances to remove nodes from.
2. One iteration to actually remove the nodes.
For the second iteration we iterate over the node sets and then over the nodes.
This ensures that we always remove all nodes instead of leaving some behind.
The documentation still leaves a lot to be desired and so does the API. There
also appears to be a problem where NodeSet#remove doesn't properly remove all
nodes from a set. Outside of that we're making slow progress towards a proper
DOM API.
The previous commit didn't fully change the operator precedence according to
the XPath 1.0 specification. Also thanks to @whitequark for clearing up a few
things about Racc's operator precedence system.
For this I've enabled both the old expectation and stubbing/mocking syntax. The
old syntax is much more compact and to me reads nicer. For example, consider
the following:
lex('<foo></foo>').should == [...]
To me this reads much nicer than this:
expect(lex('<foo></foo>')).to eq([...])
Using IO/StringIO objects one can parse large XML files without first having to
read the entire file into memory. This can potentially save a lot of memory at
the cost of a slightly slower runtime.
For IO like instances the lexer will consume the input line by line. If a
String is given it's consumed as a whole instead. A small side effect of
reading the input line by line is that text such as "foo\nbar" will be lexed as
two tokens instead of one.
Fixes#19.
This moves the element related rules to the element_head machine (where they
belong). This in turn makes it possible to lex ">" as a text node, previously
this was impossible.
This adds the ability to more easily act upon specific node types and nestings
when using the pull parsing API.
A basic example of this API looks like the following (only including relevant
code):
parser.parse do |node|
parser.on(:element, %w{people person}) do
people << {:name => nil, :age => nil}
end
parser.on(:text, %w{people person name}) do
people.last[:name] = node.text
end
parser.on(:text, %w{people person age}) do
people.last[:age] = node.text.to_i
end
end
This fixes#6.
Tracking the names of nested elements makes it a lot easier to do contextual
pull parsing. Without this it's impossible to know what context the parser is
in at a given moment.
For memory reasons the parser currently only tracks the element names. In the
future it might perhaps also track extra information to make parsing easier.
This parser extends the regular DOM parser but instead delegates certain nodes
to a block instead of building a DOM tree.
The API is a bit raw in its current form but I'll extend it and make it a bit
more user friendly in the following commits. In particular I want to make it
easier to figure out if a certain node is nested inside another node.
The AST layer is being removed because it doesn't really serve a useful
purpose. In particular when creating a streaming parser the AST nodes would
only introduce extra overhead.
As a result of this the parser will instead emit a DOM tree directly instead of
first emitting an AST.
Instead of lexing the input as a raw String or as a set of codepoints it's
treated as a sequence of bytes. This removes the need of String#[] (replaced by
String#byteslice) which in turn reduces the amount of memory needed and speeds
up the lexing time.
Thanks to @headius and @apeiros for suggesting this and rubber ducking along!
Instead of returning the tokens as a whole they are now streamed using
XML::Lexer#advance. This method returns the next token upon every call. It uses
a small buffer in case a particular block of text results in multiple tokens.
T_ELEM_OPEN has been renamed to T_ELEM_START, T_ELEM_CLOSE has been renamed to
T_ELEM_END. This keeps the token names consistent with the other ones (e.g.
T_COMMENT_START).
Although this AST is compacter it will result in conflicts between (text),
(attributes) and (attribute) nodes in regular XML documents. This is due to XML
allowing elements with these names (unlike in HTML).
This reverts commit 8898d08831.
The AST no longer uses the generic `element` type for element nodes but instead
changes the type based on the element type. That is, a <p> element now results
in an (p) node, <link> in (link), etc.
This emits separate tokens for the start tag (T_ELEMENT_OPEN) and name
(T_ELEMENT_NAME). This makes it easier to include the namespace of an element
(T_ELEMENT_NS) in the output.
The current implementation is a bit messy. In particular the counting of column
numbers is not entirely the way it should be. There are also some problems with
nested tags/text that I still have to resolve.
This comes with various structural changes to the lexer as I'm slowly starting
to get the hang of Ragel. Ragel is a beast but damn it's an awesome piece of
software.
Note that the doctype public/system IDs are lexed as T_STRING. The parser will
figure out whether a ID is a public or system ID based on the order.
This fixes#1