Similar to comments (ea8b4aa92f) and CDATA
tags (8acc7fc743) processing instructions
are now lexed in separate chunks _with_ proper support for streaming
input.
Related issue: #93
Instead of using a single token (T_CDATA) for a CDATA tag the lexer now
uses 3 tokens:
1. T_CDATA_START
2. T_CDATA_BODY
3. T_CDATA_END
The T_CDATA_BODY token can occur multiple times and is turned into a
single value in the XML parser. This is similar to the way strings are
lexed.
By changing the way CDATA tags are lexed Oga can now lex CDATA tags
containing newlines when using an IO as input. For example, this would
previously fail:
Oga.parse_xml(StringIO.new("<![CDATA[\nfoo]]>"))
Because IO input reads input per line the input for the lexer would be
as following:
"<![CDATA[\n"
"foo]]>"
Related issues: #93
This basically re-applies the technique used for HTML <script> tags. With this
extra addition I decided to rename/normalize a few things so it's easier to add
any extra tags in the future. One downside of this setup is that the following
will not be parsed by Oga:
<style>
</script>
</style>
The same applies to script tags containing a literal </style> tag. Since this
particular case is rather unlikely to occur I'm OK with not supporting it as it
_does_ simplify the lexer quite a bit.
Fixes#80
When lexing input in HTML mode the lexer has to treat _all_ content of a
<script> tag as plain text. This ensures that the lexer can process input such
as "x <y" and "// <foo>" correctly.
Fixes#70.
For JRuby this has little to no benefits as it uses strings for method names.
However, both MRI and Rubinius will perform a Symbol lookup whenever rb_intern()
is called. By doing this once for all callback names and caching the resulting
VALUE objects the lexer timings can be reduced by about 25%. In case of the
benchmark benchmark/xml/lexer/string_average_bench.rb this means it runs in
around 500ms instead of 700ms.
Instead of storing "act" and "cs" as an instance variable they (along with some
other variables) are now stored in a struct. This struct is attached to a lexer
instance using the (crappy) Data_Get_Struct/Data_Wrap_Struct API.
Haven't bumped into any problems just yet. However, in theory all sorts of evil
could happen here. Which is part of the problem of C: so much shit is undefined
behaviour that you can take a single step and fall in 15 holes at the same time.
In theory, because nobody bothered to actually specify it properly.
Instead of relying on String#count for counting newlines in text nodes, Oga now
does this in C/Java. String#count isn't exactly the fastest way of counting
characters. Performance was measured using
benchmark/xml/lexer/string_average_bench.rb. Before this patch the results were
as following:
MRI: 0.529s
Rbx: 4.965s
JRuby: 0.622s
After this patch:
MRI: 0.424s
Rbx: 1.942s
JRuby: 0.665s => numbers vary a bit, seem roughly the same as before
The commands used for benchmarking:
$ rake clean # to make sure that C exts aren't shared between MRI/Rbx
$ rake generate
$ rake fixtures
$ ruby benchmark/xml/lexer/string_average_bench.rb
The big difference for Rbx is probably due to the implementation of String#count
not being super fast. Some changes were made
(https://github.com/rubinius/rubinius/pull/3133) to the method, but this hasn't
been released yet.
JRuby seems to perform in a similar way, so either it was already optimizing
things for me or I suck at writing well performing Java code.
This fixes#51.
The previous setup would consume too much. For example the following HTML:
<a><!--foo--><b><!--bar--></b></a>
would result in the following T_COMMENT token:
"foo--><b><!--bar"
The new setup requires the marking of a start position. I'm not a huge fan of
this but there doesn't appear to be a way around this.
Using IO/StringIO objects one can parse large XML files without first having to
read the entire file into memory. This can potentially save a lot of memory at
the cost of a slightly slower runtime.
For IO like instances the lexer will consume the input line by line. If a
String is given it's consumed as a whole instead. A small side effect of
reading the input line by line is that text such as "foo\nbar" will be lexed as
two tokens instead of one.
Fixes#19.
Instead of directly accessing the `data` instance variable the C/Java code now
uses the method `read_data`. This is part of one of the various steps required
to allow Oga to read data from IO like instances. It also means I can freely
change the name of the instance variable without also having to change the
C/Java code.