When lexing multi-line strings everything used to work fine as long as the input
was read as a whole. However, when using an IO instance all hell would break
loose. Because the lexer reads IO instances line by line, Ragel would sometimes
end up setting "ts" to NULL. For example, the following input would break the
lexer:
<foo class="\nbar" />
Due to the input being read per line, the following data would be sent to the
lexer:
<foo class="\n
bar" />
This would result in different (or NULL) pointers being used for building a
string, in turn resulting in memory allocation errors.
To work around this the string lexing setup has been broken into separate
machines for single and double quoted strings. The tokens used have also been
changed so that instead of just "T_STRING" there are now the following tokens:
* T_STRING_SQUOTE
* T_STRING_DQUOTE
* T_STRING_BODY
A string can have multiple T_STRING_BODY tokens (= multi-line strings, which
only occur for IO inputs). These bodies are stitched back together by the
parser.
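For the attribute value in the earlier example this means the lexer now
(roughly) emits the following tokens, with the body values shown for
illustration:

T_STRING_DQUOTE
T_STRING_BODY ("\n")
T_STRING_BODY ("bar")
T_STRING_DQUOTE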
This fixes #58.
Processing this axis together with a predicate didn't work correctly: even if
the predicate returned false the node would still be matched (which should not
be the case).
Previously input such as "x > y" would result in the following token sequence:
T_IDENT, T_CHILD, T_IDENT
This commit changes this to the following:
T_IDENT, T_SPACE, T_CHILD, T_IDENT
This allows the parser to use T_SPACE as a terminal token, which in turn
prevents around 16 shift/reduce conflicts from arising.
This does mean that input such as " > y" or " x > y" is now invalid. This
however can be solved by simply _not_ adding leading/trailing whitespace to CSS
queries.
The new setup will not involve a separate transformation stage, instead the CSS
parser will directly emit an XPath AST. This reduces the overhead needed for
parsing/evaluating CSS selectors while also simplifying the code. The downside
is that I basically have to rewrite 80% of the parser.
This uses stricter (and more correct) rules in both the lexer and the parser.
The resulting AST has also received a small rework to make it more compact and
less confusing.
This includes support for the crazy 2n+1 syntax you can use with selectors such
as :nth-child().
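For example, filling in n = 0, 1, 2, ... in 2n+1 yields the positions 1, 3,
5, ..., so the following selector matches the 1st, 3rd, 5th (and so on) li
nodes:

li:nth-child(2n+1)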
CSS selectors: doing what XPath already does using an even crazier syntax,
because screw you.
When lexing XML entities such as &amp; and &lt; these sequences are now
converted into their "actual" forms (& and <). In turn, Oga::XML::Text#to_xml
ensures they are encoded again when the method is called.
Performance-wise this puts some strain on the lexer, as every T_TEXT/T_STRING
node now potentially has to have its content modified. In the benchmark
xml/lexer/string_average_bench.rb the average processing time is now about the
same as before the improvements made in
8db77c0a09. I was hoping that the lexer would
still be a bit faster, but alas, this is not the case. Doing this in native
code would be a nightmare as C doesn't have a proper string replacement
function, and I'm not old/sadistic enough to write one myself just yet.
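A rough Ruby sketch of both directions (the entity table here is trimmed down
for illustration):

DECODE = {'&amp;' => '&', '&lt;' => '<', '&gt;' => '>'}
ENCODE = DECODE.invert

# Lexing: decode entities into their "actual" forms.
def decode(text)
  text.gsub(/&(amp|lt|gt);/) { |entity| DECODE[entity] }
end

# Oga::XML::Text#to_xml: encode the characters again on output.
def encode(text)
  text.gsub(/[&<>]/) { |char| ENCODE[char] }
end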
This fixes #49.
This was a gimmick in the first place. It doesn't work well with IO instances
(= requires re-reading of the input), the code was too complex and it wasn't
that useful in the end. Let's just get rid of it.
This fixes #53.
Instead of relying on String#count for counting newlines in text nodes, Oga now
does this in C/Java. String#count isn't exactly the fastest way of counting
characters. Performance was measured using
benchmark/xml/lexer/string_average_bench.rb. Before this patch the results were
as follows:
MRI: 0.529s
Rbx: 4.965s
JRuby: 0.622s
After this patch:
MRI: 0.424s
Rbx: 1.942s
JRuby: 0.665s => numbers vary a bit, seem roughly the same as before
The commands used for benchmarking:
$ rake clean # to make sure that C exts aren't shared between MRI/Rbx
$ rake generate
$ rake fixtures
$ ruby benchmark/xml/lexer/string_average_bench.rb
The big difference for Rbx is probably due to the implementation of String#count
not being super fast. Some changes were made
(https://github.com/rubinius/rubinius/pull/3133) to the method, but this hasn't
been released yet.
JRuby seems to perform in a similar way, so either it was already optimizing
things for me or I suck at writing well-performing Java code.
This fixes #51.
This ensures we only call String#downcase if we can find neither an
all-lowercase nor an all-uppercase version of the element name. This in turn
can save as many object allocations as there are HTML opening tags.
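A sketch of the idea, assuming the names are checked against a Set containing
both casings:

require 'set'

NAMES = Set.new(%w{br BR img IMG link LINK})

# Only mixed-case names (e.g. "Br") pay for a String#downcase call;
# all-lowercase and all-uppercase names hit the Set directly.
def known_element?(name)
  NAMES.include?(name) || NAMES.include?(name.downcase)
end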
This fixes #52.
Instead of using `namespace.name` let's just use `namespace_name`. This fixes
the problem of serializing attributes whose namespace prefix is "xmlns", as the
namespace for this prefix isn't registered by default.
This fixes #47.
This API is a little bit dodgy (similar to Nokogiri's API) due to the use of
separate parser and handler classes. This is done to ensure that the return
values of callback methods (e.g. on_element) aren't used by Racc for building
AST trees. This also ensures that whatever variables are set by the handler
don't conflict with any variables of the parser.
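A minimal usage sketch (the exact callback signature shown here is
illustrative):

class Handler
  def on_element(namespace, name, attrs = {})
    # The return value is discarded instead of ending up in a Racc AST,
    # and any instance variables set here live on the handler.
  end
end

Oga.sax_parse_xml(Handler.new, '<foo />')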
This fixes #42.
While still a bit cryptic this is probably as good as we can get it. An example:
Oga.parse_xml("<namefoo:bar=\"10\"")
parser.rb:116:in `on_error': Unexpected string on line 1: (Racc::ParseError)
=> 1: <namefoo:bar="10"
This fixes #43.
When an XML element has no child nodes a self-closing tag is used. When parsing
documents/elements in HTML mode this is only done if the element is a so-called
"void element" (e.g. <link> tags).
This fixes #46.
When the default namespace is registered (using xmlns="...") Oga now properly
sets the namespace of the container and all child elements.
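For example, in the following document both root and child now reside in the
http://example.com namespace:

<root xmlns="http://example.com">
  <child />
</root>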
This fixes #44.
This was originally reported by @jrochkind and partially patched by @billdueber.
My patches are built upon the latter, but without the need of using Array#map,
Array#join, etc. They also contain a few style changes.
This fixes #32 and #33.
The methods XML::Element#add_attribute and XML::Element#set can be used to more
easily add attributes to elements. The first method simply adds an Attribute
instance and links it to the element. This allows for fine grained control over
what data the attribute should contain. The second method ("set") simply sets an
attribute based on a name and value, optionally creating the attribute if it
doesn't already exist.
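A usage sketch of both methods:

element = Oga::XML::Element.new(:name => 'person')

# Fine-grained control using an explicit Attribute instance:
attribute = Oga::XML::Attribute.new(:name => 'status', :value => 'active')
element.add_attribute(attribute)

# Or simply set a name/value pair directly:
element.set('age', '25')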
By using NodeSet#concat we can further reduce the number of object allocations.
This in turn greatly reduces the time it takes to query large documents using
descendant-or-self.
Previously this wouldn't display anything due to the IO object being exhausted.
To fix this the input has to be wound back to the start, which means re-reading
it. Sadly I can't think of a way around this that doesn't require buffering
lines while parsing them (which massively increases memory usage).
This ensures the current context node is set correctly when using the "self"
axis inside a path that's inside a predicate, e.g.
foo/bar[baz/. = "something"]
Here the "self" axis should refer to foo/bar/baz, _not_ foo/bar.
The XPath number() function should also be capable of converting booleans to
numbers, something it previously was not able to do. In order to do this
reliably we can't rely on the string() function as this would make it impossible
to distinguish between literal string values and booleans. This is due to
true(), which returns a TrueClass, being converted to the string "true". This
string in turn can't be converted to a float.
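With this change booleans convert the way the XPath specification demands:

number(true())  # => 1
number(false()) # => 0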
When an attribute is prefixed with "xml" the default namespace should be used
automatically. This namespace is not registered on element level by default
since it's never registered manually; instead it's a "magic" namespace. This
also ensures we match the behaviour of libxml more closely, hopefully reducing
confusion.
When calling the string() XPath function floats with zero decimals (10.0, 5.0,
etc) should result in a string without any decimals. Ruby converts 10.0 to
"10.0" whereas XPath expects "10".
The particular case of string(10) having to return "10" instead of "10.0" will
be handled separately. Returning integers breaks behaviour/expectations
elsewhere.
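One way of implementing this rule (a sketch, not necessarily the exact code
used):

def xpath_string(float)
  # 10.0 % 1 => 0.0, 10.5 % 1 => 0.5
  float % 1 == 0 ? float.to_i.to_s : float.to_s
end

xpath_string(10.0) # => "10"
xpath_string(10.5) # => "10.5"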
This reverts commit 431a253000.
This allows filtering of nodes by indexes (e.g. using last() or a literal
number) and uses the correct index range (1 to N in XPath). The function
position() is not yet supported as this requires access to the current node,
which isn't passed down the call stack just yet.
Namespaces aren't scoped per document but instead per element, thus this method
doesn't make that much sense. This also fixes the remaining, failing XPath test.
The method XPath::Evaluator#node_matches? now has a special case to handle
"type-test" nodes. This in turn fixes a bunch of failing tests such as those for
the XPath query "parent::node()".
Unlike what I thought before, syntax such as "node()" is not a function call.
Instead this is a special node test that tests the *types* of nodes, not their
names.
This separates namespace handling into namespace names and namespace objects.
The namespace objects are retrieved from the element an attribute belongs to.
Once retrieved the namespace is cached, due to the overhead of retrieving
namespaces in large documents.
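A sketch of the caching, assuming the attribute holds a reference to its owning
element:

def namespace
  @namespace ||= element.available_namespaces[namespace_name]
end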
The old code used for generating Object#inspect values has been ripped out (for
the most part). The result is a non-indented but far more compact #inspect
output. The code for this is also simpler and doesn't break the signature of
Object#inspect.
Oga won't be handling URIs any time soon. The rationale is that they serve zero
purpose when it comes to just parsing XML. Another goal of Oga is to make it
easy to modify and re-serialize documents back to XML. If namespaces also
stored URIs this would make that process more difficult.
This also comes with some small cleanups regarding
XPath::Evaluator#node_matches?. This change removes the need to call
can_match_node?() every time just to prevent NoMethodError errors from popping
up.
This still uses a stack but at least no longer relies on the call stack. I
decided not to go with the Morris in-order algorithm [1] as it modifies the tree
during a search. This would not work well if a document were to be accessed from
multiple threads at once (which should be possible for read-only operations).
I might change this method to actually perform a search (opposed to just
returning everything). This will require some closer inspection of the
available XPath axes to determine if this is needed.
Tests will also be added once I've taken care of the above.
[1]: http://en.wikipedia.org/wiki/Tree_traversal#Morris_in-order_traversal_using_threading
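For reference, a sketch of such a stack-based traversal (assuming children
returns something Array-like):

def each_node(root)
  stack = root.children.to_a.reverse

  until stack.empty?
    node = stack.pop

    yield node

    # Reversing keeps the leftmost child on top of the stack, so nodes
    # are yielded in document order.
    stack.concat(node.children.to_a.reverse)
  end
end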
The previous commit was nonsense as I didn't understand XPath's "following" axis
properly. This commit introduces proper tests and a note for future me so that I
can implement it properly.
The evaluation of axes has been fixed by changing the initial context as well as
the behaviour of some of the handler methods.
The initial context has been changed so that it's simply a NodeSet of whatever
the root object is, either a Document or an Element instance. Previously this
would be set to the child nodes of a Document in case the root object was a
Document. This in turn would prevent "child" axes from operating correctly.
When parsing a bare node test such as "A" this is now parsed as following:
(axis "child" (test nil "A"))
Instead of this:
(test nil "A")
According to the XPath specification both are identical and this simplifies some
of the code in the XPath evaluator.
Upon further investigation this change turned out to be useless. Nokogiri/libxml
does not allow the use of long axes without tests, instead it ends up
lexing/parsing such a value as a simple node test.
This reverts commit f699b0d097.
An axis such as "." is the same as "self::node()". To simplify things on
parser/evaluator level we'll emit the corresponding tokens for a "node()"
function call for these axes.
Instead of using a raw Hash Oga now uses the XML::Attribute class for storing
information about element attributes.
Attributes are stored as an Array of XML::Attribute instances. This allows the
attributes to be more easily modified. If they were stored as a Hash you'd not
only have to update the attributes themselves but also the Hash that contains
them.
While using an Array has a slight runtime cost, in most cases the number of
attributes is small enough that this doesn't really pose a problem. If webscale
performance is desired at some point in the future Oga could most likely cache
the lookup of an attribute. This however is something for the future.
Instead of keeping track of an internal state in @stack and @context the various
processing methods now take the context as an extra argument and return the
nodes they produced. This makes it easier to recursively call certain methods, a
requirement for processing XPath axes (e.g. the "ancestor" axis).
The AST has been simplified by adjusting the way (path) nodes are nested.
Operators now also use `paths` instead of `expression` to allow for expressions
such as `/A or /B`. Sadly this introduces quite a bunch of conflicts in the
parser but we'll deal with those later if needed.
Come to think of it, it might actually be easier to implement the evaluator as
an actual VM. That is, instead of directly running on the AST it runs on some
flavour of bytecode. Alternatively it runs directly on the AST but behaves more
like a (stack based) VM. This would most likely be easier than passing a cursor
to every node processing method.
This method can be used to retrieve the text of the given node only. In other
words, unlike Element#text it does not contain the text of any child nodes.
This method uses a loop to traverse upwards through the DOM tree in order to
find the
root document/element. While this might have an impact on performance I don't
expect Oga itself to call this method very often. The benefit is that Node
instances don't require users to manually pass the top level document as an
argument.
The combination of iterating over an array and removing values from it results
in not all elements being removed. For example:
numbers = [10, 20, 30]
numbers.each do |num|
  numbers.delete(num)
end
numbers # => [20]
As a result of this the NodeSet#remove method uses two iterations:
1. One iteration to retrieve all NodeSet instances to remove nodes from.
2. One iteration to actually remove the nodes.
For the second iteration we iterate over the node sets and then over the nodes.
This ensures that we always remove all nodes instead of leaving some behind.
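In code this boils down to roughly the following (node_sets is an assumed
accessor for the sets a node belongs to):

def remove
  sets = []

  # Iteration 1: only collect the node sets, don't mutate anything yet.
  @nodes.each do |node|
    sets.concat(node.node_sets)
  end

  # Iteration 2: now it's safe to remove the nodes from every set.
  sets.each do |set|
    @nodes.each { |node| set.delete(node) }
  end
end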
The documentation still leaves a lot to be desired and so does the API. There
also appears to be a problem where NodeSet#remove doesn't properly remove all
nodes from a set. Outside of that we're making slow progress towards a proper
DOM API.
Instead the various nodes can use NodeSet#index (aka Array#index). This has a
slight performance overhead on very large sets (millions of nodes) but should
be fine in most other cases.
The previous commit didn't fully change the operator precedence according to
the XPath 1.0 specification. Also thanks to @whitequark for clearing up a few
things about Racc's operator precedence system.
Using IO/StringIO objects one can parse large XML files without first having to
read the entire file into memory. This can potentially save a lot of memory at
the cost of a slightly slower runtime.
For IO like instances the lexer will consume the input line by line. If a
String is given it's consumed as a whole instead. A small side effect of
reading the input line by line is that text such as "foo\nbar" will be lexed as
two tokens instead of one.
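For example:

File.open('large.xml', 'r') do |handle|
  document = Oga.parse_xml(handle)
end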
Fixes #19.
Instead of directly accessing the `data` instance variable the C/Java code now
uses the method `read_data`. This is part of one of the various steps required
to allow Oga to read data from IO like instances. It also means I can freely
change the name of the instance variable without also having to change the
C/Java code.
After discussing this with @headius I've decided to do this the manual way
anyway. Apparently the basic load service stuff is deprecated and not very
reliable.
While I've tried to keep Oga pure Ruby for as long as possible the performance
of Ragel's Ruby output was not worth the trouble. For example, lexing 10MB of
XML would take 5 to 6 seconds at least. Nokogiri on the other hand can parse
that same XML into a DOM document in about 300 milliseconds. Such a big
performance difference is not acceptable.
To work around this the XML/HTML lexer will be implemented in C for
MRI/Rubinius and Java for JRuby. For now there's only a C extension as I
haven't read up yet on the JRuby API. The end goal is to provide some sort of
Ragel "template" that can be used to generate the corresponding C/Java
extension code. This would remove the need to duplicate the grammar and
associated code.
The native extension setup is a hybrid between native and Ruby. The raw Ragel
stuff happens in C/Java while the actual logic of actions happens in Ruby. This
adds a small amount of overhead but makes it much easier to maintain the lexer.
Even with this extra overhead the performance is much better than pure Ruby.
The 10MB of XML mentioned above is lexed in about 600 milliseconds. In other
words, it's 10 times faster.
Instead of wrapping a predicate method around the ivar we'll just access it
directly. This reduces average lexing times in the big XML benchmark from 7.5
to ~7 seconds.
Instead of using instance variables for ts, te, etc we'll use local variables.
Grand wizard overlord @whitequark suggested that this would be quite a bit
faster, which turns out to be true. For example, the big XML lexer benchmark
would, prior to this commit, complete in about 9 - 9.3 seconds. With this
commit that hovers around 8.5 seconds.
This adds the ability to more easily act upon specific node types and nestings
when using the pull parsing API.
A basic example of this API looks like the following (only including relevant
code):
parser.parse do |node|
  parser.on(:element, %w{people person}) do
    people << {:name => nil, :age => nil}
  end

  parser.on(:text, %w{people person name}) do
    people.last[:name] = node.text
  end

  parser.on(:text, %w{people person age}) do
    people.last[:age] = node.text.to_i
  end
end
This fixes #6.
Tracking the names of nested elements makes it a lot easier to do contextual
pull parsing. Without this it's impossible to know what context the parser is
in at a given moment.
For memory reasons the parser currently only tracks the element names. In the
future it might perhaps also track extra information to make parsing easier.
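A sketch of the bookkeeping involved (the callback names are illustrative):

def on_element_start(name)
  @nesting.push(name) # e.g. ["people", "person"]
end

def on_element_end
  @nesting.pop
end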
This parser extends the regular DOM parser but delegates certain nodes to a
block instead of building a DOM tree.
The API is a bit raw in its current form but I'll extend it and make it a bit
more user friendly in the following commits. In particular I want to make it
easier to figure out if a certain node is nested inside another node.
The AST layer is being removed because it doesn't really serve a useful
purpose. In particular when creating a streaming parser the AST nodes would
only introduce extra overhead.
As a result the parser now emits a DOM tree directly instead of first emitting
an AST.
Profiling showed that calls to methods defined using `define_method` are
really, really slow. Before this commit the lexer would process 3000-4000
lines per second. With this commit that has been increased to around 10 000
lines per second.
Thanks to @headius for mentioning the (potential) overhead of define_method.
Instead of lexing the input as a raw String or as a set of codepoints it's
treated as a sequence of bytes. This removes the need of String#[] (replaced by
String#byteslice) which in turn reduces the amount of memory needed and speeds
up the lexing time.
Thanks to @headius and @apeiros for suggesting this and rubber ducking along!
After some digging I found out that Racc has a method called `yyparse`. Using
this method (and a custom callback method) you can `yield` tokens as a form of
input. This makes it a lot easier to feed tokens as a stream from the lexer.
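A sketch of how this looks, with each_token as a made-up callback name:

def parse
  # yyparse() consumes the tokens yielded by the given method, instead
  # of Racc repeatedly calling next_token.
  yyparse(self, :each_token)
end

def each_token
  while token = @lexer.advance
    yield token # e.g. [:T_TEXT, 'foo']
  end

  yield [false, false] # signals the end of the input to Racc
end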
Sadly the current performance of the lexer is still total garbage. Most of the
memory usage also comes from using String#unpack, especially on large XML
inputs (e.g. 100 MB of XML). It looks like the resulting memory usage is about
10x the input size.
One option might be some kind of wrapper around String. This wrapper would have
a sliding window of, say, 1024 bytes. When you create it the first 1024 bytes
of the input would be unpacked. When seeking through the input this window
would move forward.
In theory this means that you'd only have 1024 Fixnum instances around at any
given time instead of "a very big number". I have to test how efficient this is
in practice.
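Such a wrapper could look roughly like this (names and numbers illustrative):

class SlidingWindow
  SIZE = 1024

  def initialize(input)
    @input  = input
    @offset = 0
    @window = @input.byteslice(@offset, SIZE).unpack('C*')
  end

  # Returns the byte at the given absolute position, sliding the window
  # forward whenever the position falls outside of it.
  def [](position)
    unless position.between?(@offset, @offset + SIZE - 1)
      @offset = position
      @window = @input.byteslice(@offset, SIZE).unpack('C*')
    end

    @window[position - @offset]
  end
end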
The offending lines of code displayed in the error message are truncated to 80
characters. This should make reading the error messages less of a pain when
dealing with very long lines of HTML/XML.
Instead of returning the tokens as a whole they are now streamed using
XML::Lexer#advance. This method returns the next token upon every call. It uses
a small buffer in case a particular block of text results in multiple tokens.
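Usage then boils down to the following (assuming advance returns nil once the
input is exhausted):

lexer = Oga::XML::Lexer.new('<p>Hello</p>')

while token = lexer.advance
  type, value, line = token
  # process the token
end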
T_ELEM_OPEN has been renamed to T_ELEM_START, T_ELEM_CLOSE has been renamed to
T_ELEM_END. This keeps the token names consistent with the other ones (e.g.
T_COMMENT_START).
The latter returns an Enumerable which on Ruby 1.9.3 doesn't have #length
available. Besides this, it's better to just return an Array since we'll iterate
over every character anyway.
Grand wizard overlord @whitequark recommended this as it will bypass the need
for creating individual String instances for every character (at least not
until needed). This becomes noticeable on large inputs (e.g. 100 MB of XML).
Previously these would result in the kernel OOM-killing the process. Using
codepoints memory increases by a "mere" 1-1.5 GB.
This was a rather interesting turn of events. As it turned out the Ragel
generated lexer was extremely slow on large inputs. For example, lexing
benchmark/fixtures/hrs.html took around 10 seconds according to the benchmark
benchmark/lexer/bench_html_time.rb:
Rehearsal --------------------------------------------------------
lex HTML 10.870000 0.000000 10.870000 ( 10.877920)
---------------------------------------------- total: 10.870000sec
user system total real
lex HTML 10.440000 0.010000 10.450000 ( 10.449500)
The corresponding benchmark-ips benchmark (bench_html.rb) presented the
following results:
Calculating -------------------------------------
lex HTML 1 i/100ms
-------------------------------------------------
lex HTML 0.1 (±0.0%) i/s - 1 in 10.472534s
10 seconds for around 165 KB of HTML was not acceptable. I spent a good amount
of time profiling things, even submitting a patch to Ragel
(https://github.com/athurston/ragel/pull/1). At some point I decided to give a
pure C lexer + FFI bindings a try (so it would also work on JRuby). Trying to
write C reminded me why I didn't want to do it in C in the first place.
Around 2AM I gave up and went to brush my teeth and head to bed. Then, a
miracle happened. More precisely, I actually gave my brain some time to think
away from the computer. I said to myself:
What if I feed Ragel an Array of characters instead of an entire String?
That way I bypass String#[] being expensive without having to change all of
Ragel or use a different language.
The results of this change are rather interesting. With these changes the
benchmark bench_html_time.rb now gives back the following:
Rehearsal --------------------------------------------------------
lex HTML 0.550000 0.000000 0.550000 ( 0.550649)
----------------------------------------------- total: 0.550000sec
user system total real
lex HTML 0.520000 0.000000 0.520000 ( 0.520713)
The benchmark bench_html.rb in turn gives back this:
Calculating -------------------------------------
lex HTML 1 i/100ms
-------------------------------------------------
lex HTML 2.0 (±0.0%) i/s - 10 in 5.120905s
According to both benchmarks we now have a speedup of about 20 times without
having to make any further changes to Ragel or the lexer itself.
I love it when a plan comes together.
Instead of appending single characters to a String buffer the lexer now uses a
start and end position to figure out what the buffer is. This is a lot faster
than constantly appending to a String.
The error messages of the parser now contain surrounding lines of code instead
of only the offending line of code. This should make debugging a bit easier.
Line numbers are also shown for each line.
Although this AST is more compact it will result in conflicts between (text),
(attributes) and (attribute) nodes in regular XML documents. This is due to XML
allowing elements with these names (unlike HTML).
This reverts commit 8898d08831.
The AST no longer uses the generic `element` type for element nodes but instead
changes the type based on the element type. That is, a <p> element now results
in a (p) node, <link> in (link), etc.
This emits separate tokens for the start tag (T_ELEMENT_OPEN) and name
(T_ELEMENT_NAME). This makes it easier to include the namespace of an element
(T_ELEMENT_NS) in the output.
The current implementation is a bit messy. In particular the counting of column
numbers is not entirely the way it should be. There are also some problems with
nested tags/text that I still have to resolve.
This comes with various structural changes to the lexer as I'm slowly starting
to get the hang of Ragel. Ragel is a beast but damn it's an awesome piece of
software.
Note that the doctype public/system IDs are lexed as T_STRING. The parser will
figure out whether an ID is a public or system ID based on the order.
This fixes #1.