Commit Graph

39 Commits

Author SHA1 Message Date
Yorick Peterse b2ea20ba61 Lex processing instructions in chunks
Similar to comments (ea8b4aa92f) and CDATA
tags (8acc7fc743) processing instructions
are now lexed in separate chunks _with_ proper support for streaming
input.

Related issue: #93
2015-04-15 00:11:57 +02:00
Yorick Peterse ea8b4aa92f Lex comments in chunks
Similar to this being added for CDATA tags in
8acc7fc743 comments are now also lexed in
chunks.

Related issue: #93
2015-04-14 23:11:22 +02:00
Yorick Peterse 8acc7fc743 Lex CDATA tags in chunks
Instead of using a single token (T_CDATA) for a CDATA tag the lexer now
uses 3 tokens:

1. T_CDATA_START
2. T_CDATA_BODY
3. T_CDATA_END

The T_CDATA_BODY token can occur multiple times and is turned into a
single value in the XML parser. This is similar to the way strings are
lexed.

By changing the way CDATA tags are lexed Oga can now lex CDATA tags
containing newlines when using an IO as input. For example, this would
previously fail:

    Oga.parse_xml(StringIO.new("<![CDATA[\nfoo]]>"))

Because IO input reads input per line the input for the lexer would be
as following:

    "<![CDATA[\n"
    "foo]]>"

Related issues: #93
2015-04-14 22:45:55 +02:00
Yorick Peterse 0800654c96 Support lexing or carriage returns
Fixes #89.
2015-04-03 00:46:37 +02:00
Yorick Peterse 3b2055a30b Refactored handling of literal HTML elements.
This ensures newlines can appear in <style> / <script> tags when using IOs as
input.
2015-03-04 11:44:31 +01:00
Yorick Peterse 78e40b55c0 Handle parsing of HTML <style> tags.
This basically re-applies the technique used for HTML <script> tags. With this
extra addition I decided to rename/normalize a few things so it's easier to add
any extra tags in the future. One downside of this setup is that the following
will not be parsed by Oga:

    <style>
        </script>
    </style>

The same applies to script tags containing a literal </style> tag. Since this
particular case is rather unlikely to occur I'm OK with not supporting it as it
_does_ simplify the lexer quite a bit.

Fixes #80
2015-03-03 16:28:05 +01:00
Yorick Peterse ba2177e2cf Lex contents of <script> tags as plain text.
When lexing input in HTML mode the lexer has to treat _all_ content of a
<script> tag as plain text. This ensures that the lexer can process input such
as "x <y" and "// <foo>" correctly.

Fixes #70.
2015-03-02 16:22:09 +01:00
Yorick Peterse 739f885078 Use ID instead of VALUE for C Symbols.
Thanks to @cremno for bringing this up.
2014-11-29 12:53:55 +01:00
Yorick Peterse 5e24a3d1e5 Short docs on lexer callback names. 2014-11-23 20:20:14 +01:00
Yorick Peterse 4fa88fcbde Cache rb_intern/symbol lookups in the lexer.
For JRuby this has little to no benefits as it uses strings for method names.
However, both MRI and Rubinius will perform a Symbol lookup whenever rb_intern()
is called. By doing this once for all callback names and caching the resulting
VALUE objects the lexer timings can be reduced by about 25%. In case of the
benchmark benchmark/xml/lexer/string_average_bench.rb this means it runs in
around 500ms instead of 700ms.
2014-11-22 01:53:37 +01:00
Yorick Peterse cbb2815146 Support for inline doctype rules plus newlines.
This adds support for lexing/parsing XML documents that use an IO as input _and_
contain doctype rules with newlines in them.

This fixes #63.
2014-11-18 20:02:55 +01:00
Yorick Peterse 24ae791f00 Better support for lexing multi-line strings.
When lexing multi-line strings everything used to work fine as long as the input
were to be read as a whole. However, when using an IO instance all hell would
break loose. Due to the lexer reading IO instances on a per line basis,
sometimes Ragel would end up setting "ts" to NULL. For example, the following
input would break the lexer:

    <foo class="\nbar" />

Due to the input being read per line, the following data would be sent to the
lexer:

    <foo class="\n
    bar" />

This would result in different (or NULL) pointers being used for building a
string, in turn resulting in memory allocation errors.

To work around this the string lexing setup has been broken into separate
machines for single and double quoted strings. The tokens used have also been
changed so that instead of just "T_STRING" there are now the following tokens:

* T_STRING_SQUOTE
* T_STRING_DQUOTE
* T_STRING_BODY

A string can have multiple T_STRING_BODY tokens (= multi-line strings, only the
case for IO inputs). These strings are stitched back together by the parser.

This fixes #58.
2014-10-26 11:39:56 +01:00
Yorick Peterse 8db77c0a09 Count newlines of text nodes in native code.
Instead of relying on String#count for counting newlines in text nodes, Oga now
does this in C/Java. String#count isn't exactly the fastest way of counting
characters. Performance was measured using
benchmark/xml/lexer/string_average_bench.rb. Before this patch the results were
as following:

    MRI:   0.529s
    Rbx:   4.965s
    JRuby: 0.622s

After this patch:

    MRI:   0.424s
    Rbx:   1.942s
    JRuby: 0.665s => numbers vary a bit, seem roughly the same as before

The commands used for benchmarking:

    $ rake clean # to make sure that C exts aren't shared between MRI/Rbx
    $ rake generate
    $ rake fixtures
    $ ruby benchmark/xml/lexer/string_average_bench.rb

The big difference for Rbx is probably due to the implementation of String#count
not being super fast. Some changes were made
(https://github.com/rubinius/rubinius/pull/3133) to the method, but this hasn't
been released yet.

JRuby seems to perform in a similar way, so either it was already optimizing
things for me or I suck at writing well performing Java code.

This fixes #51.
2014-09-25 22:49:11 +02:00
Yorick Peterse 00579eaa8a Changed text action from @{} to %{}.
This ensures the action is only run at the end, opposed to any non final state.
2014-09-23 22:58:20 +02:00
Yorick Peterse ad2e040f05 Handle lexing of input such as just "</".
Previously this would cause the lexer to go in an infinite loop in the "text"
state machine.

This fixes #37.
2014-09-15 17:20:06 +02:00
Yorick Peterse 9b8e9f49c6 Support for lexing empty attribute values.
This ensures that Oga can lex the following properly:

    <input value="" />

Previously Ragel would stop upon finding the empty string. This was caused due
to the string rules being declared as following:

    string_dquote = (dquote ^dquote+ dquote);
    string_squote = (squote ^squote+ squote);

These rules only match strings _with_ content, not without. Since Ragel stops
consuming input the moment it finds unhandled data this resulted in incorrect
tokens being emitted.
2014-09-03 23:10:50 +02:00
Yorick Peterse 49ddebf358 Tighten lexing of T_TEXT nodes.
Thanks to some heavy rubberducking with @whitequark the lexer is now a little
bit better at lexing T_TEXT nodes. For example, previously the following could
not be lexed properly:

    "foo < bar"

There might still be some tweaking to do but we're getting there.
2014-09-03 00:51:13 +02:00
Yorick Peterse 96b7296910 Ragel variable of element closing tags. 2014-09-02 22:50:21 +02:00
Yorick Peterse 56341b5585 Cleaned up lexing of comments/cdata.
Thanks to @whitequark for suggesting the use of the "--" operator.
2014-08-16 16:03:55 +02:00
Yorick Peterse 2c488f92be Cleaned up marking of comments/cdata tags. 2014-08-15 22:05:09 +02:00
Yorick Peterse 8f4eaf3823 Lexing of XML processing instructions. 2014-08-15 22:04:45 +02:00
Yorick Peterse 4e8cca258c Fixed lexing of XML CDATA tags. 2014-08-15 20:47:58 +02:00
Yorick Peterse 81edce2eb8 Fixed lexing of XML comments.
The previous setup would consume too much. For example the following HTML:

    <a><!--foo--><b><!--bar--></b></a>

would result in the following T_COMMENT token:

    "foo--><b><!--bar"

The new setup requires the marking of a start position. I'm not a huge fan of
this but there doesn't appear to be a way around this.
2014-08-15 20:42:32 +02:00
Yorick Peterse d5569ead0b Use XML::Attribute for element attributes.
Instead of using a raw Hash Oga now uses the XML::Attribute class for storing
information about element attributes.

Attributes are stored as an Array of XML::Attribute instances. This allows the
attributes to be more easily modified. If they were stored as a Hash you'd not
only have to update the attributes themselves but also the Hash that contains
them.

While using an Array has a slight runtime cost in most cases the amount of
attributes is small enough that this doesn't really pose a problem. If webscale
performance is desired at some point in the future Oga could most likely cache
the lookup of an attribute. This however is something for the future.
2014-07-20 07:29:37 +02:00
Yorick Peterse f660b11e47 Parsing of closing XML nodes with namespaces. 2014-07-09 19:54:45 +02:00
Yorick Peterse be3f8fb494 Removed the on_newline XML lexer callback. 2014-05-29 14:21:48 +02:00
Yorick Peterse 418b4ef498 Cleaned up documentation of the XML lexer. 2014-05-21 00:21:21 +02:00
Yorick Peterse 3a8582030d Removed remaining fhold call in the XML lexer.
There's no particular need any more for this fhold call so we're getting rid of
it.
2014-05-21 00:11:39 +02:00
Yorick Peterse 4542f06d0f Replaced fcall/fret with fnext in the XML lexer.
With the rules being cleaned up/moved around a bit we can drop the use of
fcall/fret. This saves the need of having to maintain a stack (position).
2014-05-21 00:08:48 +02:00
Yorick Peterse c56b0395e4 Moved various rules around for the XML lexer.
This moves the element related rules to the element_head machine (where they
belong). This in turn makes it possible to lex ">" as a text node, previously
this was impossible.
2014-05-21 00:04:53 +02:00
Yorick Peterse feaf28d423 Remove dedicated string machine in the XML lexer.
This removes the need for another fcall/fret combination.
2014-05-19 20:26:07 +02:00
Yorick Peterse 93b9718406 Cleaned up the XML lexer documentation. 2014-05-19 09:39:35 +02:00
Yorick Peterse cd0f3380c4 Merge multiple CDATA tokens into a single token.
The tokens T_CDATA_START, T_TEXT and T_CDATA_END have been merged together into
T_CDATA.
2014-05-19 09:36:19 +02:00
Yorick Peterse a4fb5c1299 Merge multiple comment tokens into a single one.
The tokens T_COMMENT_START, T_TEXT and T_COMMENT_END have been merged into a
single token: T_COMMENT. This simplifies both the lexer and the parser.
2014-05-19 09:30:30 +02:00
Yorick Peterse 44bf1dd1ca Split up handling of element names/namespaces.
This is now split up on Ragel level, simplifying the corresponding Ruby code.
2014-05-15 10:22:05 +02:00
Yorick Peterse 19f04f98f7 Support for lexing/parsing inline doctypes. 2014-05-10 00:28:11 +02:00
Yorick Peterse c472ceac6f Docs for the shared Ragel grammar. 2014-05-08 00:21:23 +02:00
Yorick Peterse e271298984 Use macros in the C lexer. 2014-05-07 00:57:25 +02:00
Yorick Peterse f25f8a3d15 Break up the Ragel C grammar.
The grammar is now broken up in to a base lexer and a C lexer. This allows the
same grammar to also be used in the Java code.
2014-05-07 00:50:34 +02:00