This ensures that Oga can lex the following properly:
<input value="" />
Previously Ragel would stop upon finding the empty string. This was caused by
the string rules being declared as follows:
string_dquote = (dquote ^dquote+ dquote);
string_squote = (squote ^squote+ squote);
These rules only match strings _with_ content, not without. Since Ragel stops
consuming input the moment it finds unhandled data, this resulted in incorrect
tokens being emitted.
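The difference between the two repetition operators can be illustrated with a
rough Ruby regex analogue. This is only a sketch: the constant and method names
are made up for illustration, and the zero-or-more variant is an assumption
about the shape of the fix, not Oga's actual Ragel grammar.

```ruby
# Regex analogue of the Ragel rule (dquote ^dquote+ dquote): at least one
# character is required between the quotes.
DQUOTE_WITH_CONTENT = /\A"[^"]+"\z/

# Assumed shape of a fix: zero or more characters between the quotes.
DQUOTE_ALLOW_EMPTY = /\A"[^"]*"\z/

# Hypothetical helper: true when the whole input is a double-quoted string
# according to the given pattern.
def quoted_string?(input, pattern)
  pattern.match?(input)
end

quoted_string?('""', DQUOTE_WITH_CONTENT)    # false, the empty string is rejected
quoted_string?('""', DQUOTE_ALLOW_EMPTY)     # true
quoted_string?('"foo"', DQUOTE_WITH_CONTENT) # true
```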
Thanks to some heavy rubberducking with @whitequark the lexer is now a little
bit better at lexing T_TEXT nodes. For example, previously the following could
not be lexed properly:
"foo < bar"
There might still be some tweaking to do but we're getting there.
Using create_makefile('liboga/liboga') compiles liboga.so into
path-to-gem/lib/liboga/, which causes the require_relative in oga.rb to fail.
The right parameter for create_makefile is therefore 'liboga', which produces
path-to-gem/lib/liboga.so.
The previous setup would consume too much. For example the following HTML:
<a><!--foo--><b><!--bar--></b></a>
would result in the following T_COMMENT token:
"foo--><b><!--bar"
The new setup requires the marking of a start position. I'm not a huge fan of
this but there doesn't appear to be a way around this.
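The over-consumption can be reproduced outside of Ragel with plain Ruby. The
sketch below is not the lexer's code: greedy_comments and marked_comments are
hypothetical helpers that contrast a greedy match with a scan that marks the
start of the comment body and stops at the first closing "-->".

```ruby
COMMENT_HTML = '<a><!--foo--><b><!--bar--></b></a>'

# Greedy matching runs up to the LAST "-->", swallowing everything in between.
def greedy_comments(html)
  html.scan(/<!--(.*)-->/m).flatten
end

# Marks the start position of each comment body and stops at the FIRST "-->"
# that follows it.
def marked_comments(html)
  comments = []
  offset   = 0

  while (open_at = html.index('<!--', offset))
    start    = open_at + 4 # marked start of the comment body
    close_at = html.index('-->', start)

    break unless close_at # unterminated comment, stop scanning

    comments << html[start...close_at]
    offset = close_at + 3
  end

  comments
end

greedy_comments(COMMENT_HTML) # ["foo--><b><!--bar"]
marked_comments(COMMENT_HTML) # ["foo", "bar"]
```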
Instead of using a raw Hash Oga now uses the XML::Attribute class for storing
information about element attributes.
Attributes are stored as an Array of XML::Attribute instances. This allows the
attributes to be more easily modified. If they were stored as a Hash you'd not
only have to update the attributes themselves but also the Hash that contains
them.
While using an Array has a slight runtime cost, in most cases the number of
attributes is small enough that this doesn't really pose a problem. If webscale
performance is desired at some point in the future, Oga could most likely cache
the lookup of an attribute. This, however, is something for the future.
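A minimal sketch of the trade-off, using hypothetical stand-ins (SketchAttribute
and SketchElement) rather than Oga's real XML::Attribute and element classes,
whose interfaces may differ:

```ruby
# Hypothetical stand-in for XML::Attribute.
SketchAttribute = Struct.new(:name, :value)

# Hypothetical stand-in for an element that stores its attributes in an Array.
class SketchElement
  attr_reader :attributes

  def initialize(attributes = [])
    @attributes = attributes
  end

  # Linear scan over the Array; cheap as long as the Array stays small.
  def attribute(name)
    attributes.find { |attr| attr.name == name }
  end
end

element = SketchElement.new([
  SketchAttribute.new('class', 'foo'),
  SketchAttribute.new('href', '/')
])

# Renaming only touches the attribute itself. With a Hash keyed by name, the
# container would need updating as well.
element.attribute('class').name = 'id'
element.attribute('id').value # "foo"
```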
Using IO/StringIO objects one can parse large XML files without first having to
read the entire file into memory. This can potentially save a lot of memory at
the cost of a slightly slower runtime.
For IO-like instances the lexer consumes the input line by line. If a String is
given, it is consumed as a whole instead. A small side effect of reading the
input line by line is that text such as "foo\nbar" will be lexed as two tokens
instead of one.
Fixes #19.
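The side effect described above can be seen by comparing how a String and a
StringIO would be chunked. This is a sketch only: input_chunks is a hypothetical
helper, and using each_line is an assumption about how line-by-line reading
might be done, not a description of the lexer's internals.

```ruby
require 'stringio'

# Hypothetical helper: returns the chunks a lexer would be fed. Strings are
# passed through as a whole, IO-like objects are read one line at a time.
def input_chunks(input)
  return [input] if input.is_a?(String)

  input.each_line.to_a
end

input_chunks("foo\nbar")               # ["foo\nbar"], one chunk
input_chunks(StringIO.new("foo\nbar")) # ["foo\n", "bar"], two chunks
```

Each chunk would be lexed on its own, which is why the text ends up as two
tokens when an IO-like object is used.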
Instead of directly accessing the `data` instance variable, the C/Java code now
uses the method `read_data`. This is one of the steps required to allow Oga to
read data from IO-like instances. It also means I can freely change the name of
the instance variable without also having to change the C/Java code.
This moves the element-related rules to the element_head machine (where they
belong). This in turn makes it possible to lex ">" as a text node, which was
previously impossible.
After discussing this with @headius I've decided to do this the manual way
anyway. Apparently the basic load service stuff is deprecated and not very
reliable.