61 Commits

Author SHA1 Message Date
Yorick Peterse b006289c5f Removed extra space in c/lexer.rl 2014-11-23 22:12:18 +01:00
Yorick Peterse 5e24a3d1e5 Short docs on lexer callback names. 2014-11-23 20:20:14 +01:00
Yorick Peterse 4fa88fcbde Cache rb_intern/symbol lookups in the lexer.
For JRuby this has little to no benefit as it uses strings for method names.
However, both MRI and Rubinius will perform a Symbol lookup whenever rb_intern()
is called. By doing this once for all callback names and caching the resulting
VALUE objects the lexer timings can be reduced by about 25%. In case of the
benchmark benchmark/xml/lexer/string_average_bench.rb this means it runs in
around 500ms instead of 700ms.
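
A loose Ruby-level analogy of the caching idea, not the actual C code: the callback names below are hypothetical, and `to_sym` merely stands in for rb_intern(). The point is only that each name is resolved once and the cached result is reused for every dispatch.

```ruby
# Hypothetical callback names, for illustration only.
CALLBACK_NAMES = %w[on_text on_comment].freeze

# Resolve every name to a Symbol once, at load time. This loosely mirrors
# calling rb_intern() once per name and keeping the result around.
CALLBACK_IDS = CALLBACK_NAMES.to_h { |name| [name, name.to_sym] }.freeze

# Minimal stand-in for a lexer that records the callbacks it receives.
class RecordingLexer
  attr_reader :events

  def initialize
    @events = []
  end

  def on_text(value)
    @events << [:text, value]
  end

  def on_comment(value)
    @events << [:comment, value]
  end
end

# Dispatch using the cached Symbol instead of converting the name each time.
def emit(lexer, name, value)
  lexer.public_send(CALLBACK_IDS.fetch(name), value)
end
```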
2014-11-22 01:53:37 +01:00
Yorick Peterse cbb2815146 Support for inline doctype rules plus newlines.
This adds support for lexing/parsing XML documents that use an IO as input _and_
contain doctype rules with newlines in them.

This fixes #63.
2014-11-18 20:02:55 +01:00
Yorick Peterse 24ae791f00 Better support for lexing multi-line strings.
When lexing multi-line strings everything used to work fine as long as the input
was read as a whole. However, when using an IO instance all hell would
break loose. Due to the lexer reading IO instances on a per line basis,
sometimes Ragel would end up setting "ts" to NULL. For example, the following
input would break the lexer:

    <foo class="\nbar" />

Due to the input being read per line, the following data would be sent to the
lexer:

    <foo class="\n
    bar" />

This would result in different (or NULL) pointers being used for building a
string, in turn resulting in memory allocation errors.

To work around this the string lexing setup has been broken into separate
machines for single and double quoted strings. The tokens used have also been
changed so that instead of just "T_STRING" there are now the following tokens:

* T_STRING_SQUOTE
* T_STRING_DQUOTE
* T_STRING_BODY

A string can have multiple T_STRING_BODY tokens (= multi-line strings, only the
case for IO inputs). These strings are stitched back together by the parser.
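
A minimal sketch of the stitching step, assuming tokens arrive as [type, value] pairs. That layout and the `stitch_string` helper are assumptions for illustration, not the parser's real code.

```ruby
# Joins the T_STRING_BODY tokens found between a pair of quote tokens.
# The [type, value] token layout is an assumption made for this sketch.
def stitch_string(tokens)
  body   = ''.dup
  inside = false

  tokens.each do |type, value|
    case type
    when :T_STRING_DQUOTE, :T_STRING_SQUOTE
      inside = !inside            # opening or closing quote
    when :T_STRING_BODY
      body << value if inside     # one body token per line of IO input
    end
  end

  body
end

# Tokens for <foo class="\nbar" /> when the input is read line by line.
EXAMPLE_TOKENS = [
  [:T_STRING_DQUOTE, nil],
  [:T_STRING_BODY, "\n"],
  [:T_STRING_BODY, 'bar'],
  [:T_STRING_DQUOTE, nil]
].freeze
```

Calling `stitch_string(EXAMPLE_TOKENS)` gives back the original attribute value, newline included.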

This fixes #58.
2014-10-26 11:39:56 +01:00
Yorick Peterse fca88a69d1 Track Ragel call stacks in the Java lexer.
This will be needed for the upcoming string lexing changes.
2014-10-26 11:39:19 +01:00
Yorick Peterse d951a8cc87 Track XML C lexer state in C only.
Instead of storing "act" and "cs" as an instance variable they (along with some
other variables) are now stored in a struct. This struct is attached to a lexer
instance using the (crappy) Data_Get_Struct/Data_Wrap_Struct API.
2014-10-26 11:38:06 +01:00
Yorick Peterse 1400a859ce Make sure C strings always end with a NULL.
Haven't bumped into any problems just yet. However, in theory all sorts of evil
could happen here. Which is part of the problem of C: so much shit is undefined
behaviour that you can take a single step and fall in 15 holes at the same time.
In theory, because nobody bothered to actually specify it properly.
2014-09-28 22:28:55 +02:00
Yorick Peterse 8db77c0a09 Count newlines of text nodes in native code.
Instead of relying on String#count for counting newlines in text nodes, Oga now
does this in C/Java. String#count isn't exactly the fastest way of counting
characters. Performance was measured using
benchmark/xml/lexer/string_average_bench.rb. Before this patch the results were
as follows:

    MRI:   0.529s
    Rbx:   4.965s
    JRuby: 0.622s

After this patch:

    MRI:   0.424s
    Rbx:   1.942s
    JRuby: 0.665s => numbers vary a bit, seem roughly the same as before

The commands used for benchmarking:

    $ rake clean # to make sure that C exts aren't shared between MRI/Rbx
    $ rake generate
    $ rake fixtures
    $ ruby benchmark/xml/lexer/string_average_bench.rb

The big difference for Rbx is probably due to the implementation of String#count
not being super fast. Some changes were made
(https://github.com/rubinius/rubinius/pull/3133) to the method, but this hasn't
been released yet.

JRuby seems to perform in a similar way, so either it was already optimizing
things for me or I suck at writing well performing Java code.
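
For reference, the counting itself amounts to a single pass over the bytes. The Ruby sketch below only shows the shape of that loop; the `count_newlines` name is made up here, and in Ruby itself this is not expected to beat String#count.

```ruby
# Count newlines by walking the bytes once, roughly what a native loop does.
def count_newlines(text)
  count = 0
  text.each_byte { |byte| count += 1 if byte == 10 } # 10 is "\n"
  count
end
```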

This fixes #51.
2014-09-25 22:49:11 +02:00
Yorick Peterse 00579eaa8a Changed text action from @{} to %{}.
This ensures the action is only run at the end, as opposed to in any non-final state.
2014-09-23 22:58:20 +02:00
Yorick Peterse ad2e040f05 Handle lexing of input such as just "</".
Previously this would cause the lexer to go in an infinite loop in the "text"
state machine.

This fixes #37.
2014-09-15 17:20:06 +02:00
Yorick Peterse 9b8e9f49c6 Support for lexing empty attribute values.
This ensures that Oga can lex the following properly:

    <input value="" />

Previously Ragel would stop upon finding the empty string. This was caused by
the string rules being declared as follows:

    string_dquote = (dquote ^dquote+ dquote);
    string_squote = (squote ^squote+ squote);

These rules only match strings _with_ content, not without. Since Ragel stops
consuming input the moment it finds unhandled data this resulted in incorrect
tokens being emitted.
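
The same pitfall can be reproduced with Ruby regular expressions standing in for the Ragel rules. This is an analogy only: using "*" in place of "+" is one plausible way to allow empty strings, and the actual change to the grammar may look different.

```ruby
# Regex stand-ins for the Ragel rules: "+" demands at least one character.
WITH_CONTENT_ONLY = /"[^"]+"/
ALLOWS_EMPTY      = /"[^"]*"/

INPUT = '<input value="" />'.freeze

# The "+" variant finds no string at all in this input.
content_only_match = WITH_CONTENT_ONLY.match(INPUT)

# The "*" variant matches the empty string literal.
allows_empty_match = ALLOWS_EMPTY.match(INPUT)
```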
2014-09-03 23:10:50 +02:00
Yorick Peterse 49ddebf358 Tighten lexing of T_TEXT nodes.
Thanks to some heavy rubberducking with @whitequark the lexer is now a little
bit better at lexing T_TEXT nodes. For example, previously the following could
not be lexed properly:

    "foo < bar"

There might still be some tweaking to do but we're getting there.
2014-09-03 00:51:13 +02:00
Yorick Peterse 96b7296910 Ragel variable of element closing tags. 2014-09-02 22:50:21 +02:00
Benjamin Klotz 0b096dfe25 Use proper create_makefile
Using create_makefile('liboga/liboga') will compile liboga.so into
path-to-gem/lib/liboga/ and therefore require_relative in oga.rb will fail.
Therefore the right parameter for create_makefile is 'liboga' ->
path-to-gem/lib/liboga.so
2014-09-02 20:27:20 +02:00
Yorick Peterse 56341b5585 Cleaned up lexing of comments/cdata.
Thanks to @whitequark for suggesting the use of the "--" operator.
2014-08-16 16:03:55 +02:00
Yorick Peterse 2c488f92be Cleaned up marking of comments/cdata tags. 2014-08-15 22:05:09 +02:00
Yorick Peterse 8f4eaf3823 Lexing of XML processing instructions. 2014-08-15 22:04:45 +02:00
Yorick Peterse 4e8cca258c Fixed lexing of XML CDATA tags. 2014-08-15 20:47:58 +02:00
Yorick Peterse 81edce2eb8 Fixed lexing of XML comments.
The previous setup would consume too much. For example the following HTML:

    <a><!--foo--><b><!--bar--></b></a>

would result in the following T_COMMENT token:

    "foo--><b><!--bar"

The new setup requires the marking of a start position. I'm not a huge fan of
this but there doesn't appear to be a way around this.
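
The over-consumption is easy to reproduce with a greedy regular expression. This is only an analogy in Ruby: the lexer itself is written in Ragel and the fix there relies on marking a start position, not on a lazy quantifier.

```ruby
HTML = '<a><!--foo--><b><!--bar--></b></a>'.freeze

# Greedy: runs to the last "-->" in the input, swallowing both comments.
GREEDY_COMMENT = /<!--(.*)-->/

# Lazy: stops at the first "-->" after each "<!--".
LAZY_COMMENT = /<!--(.*?)-->/

greedy_body = HTML[GREEDY_COMMENT, 1]
lazy_bodies = HTML.scan(LAZY_COMMENT).flatten
```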
2014-08-15 20:42:32 +02:00
Yorick Peterse d5569ead0b Use XML::Attribute for element attributes.
Instead of using a raw Hash Oga now uses the XML::Attribute class for storing
information about element attributes.

Attributes are stored as an Array of XML::Attribute instances. This allows the
attributes to be more easily modified. If they were stored as a Hash you'd not
only have to update the attributes themselves but also the Hash that contains
them.

While using an Array has a slight runtime cost, in most cases the number of
attributes is small enough that this doesn't really pose a problem. If webscale
performance is desired at some point in the future Oga could most likely cache
the lookup of an attribute. This however is something for the future.
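
A small sketch of the trade-off, using a Struct as a stand-in for XML::Attribute and a hypothetical `Element#attribute` lookup; neither is Oga's real implementation.

```ruby
# Stand-in for XML::Attribute, for illustration only.
Attribute = Struct.new(:name, :value)

# Hypothetical element that keeps its attributes in an Array.
class Element
  attr_reader :attributes

  def initialize(attributes = [])
    @attributes = attributes
  end

  # Linear scan: the slight runtime cost mentioned above.
  def attribute(name)
    attributes.find { |attr| attr.name == name }
  end
end

element = Element.new([Attribute.new('class', 'foo')])

# Renaming only touches the attribute itself; there is no Hash key to update.
element.attribute('class').name = 'id'
```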
2014-07-20 07:29:37 +02:00
Yorick Peterse f660b11e47 Parsing of closing XML nodes with namespaces. 2014-07-09 19:54:45 +02:00
Yorick Peterse be3f8fb494 Removed the on_newline XML lexer callback. 2014-05-29 14:21:48 +02:00
Yorick Peterse 629dcd3fe6 Support for IO inputs in the lexer.
Using IO/StringIO objects one can parse large XML files without first having to
read the entire file into memory. This can potentially save a lot of memory at
the cost of a slightly slower runtime.

For IO like instances the lexer will consume the input line by line. If a
String is given it's consumed as a whole instead. A small side effect of
reading the input line by line is that text such as "foo\nbar" will be lexed as
two tokens instead of one.
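
The difference between the two input types can be sketched as follows. The `each_chunk` helper is hypothetical and only mimics how the input is handed over; it does no lexing.

```ruby
require 'stringio'

# Yields a String as a whole, and anything IO-like one line at a time.
def each_chunk(input)
  return enum_for(:each_chunk, input) unless block_given?

  if input.is_a?(String)
    yield input
  else
    input.each_line { |line| yield line }
  end
end

string_chunks = each_chunk("foo\nbar").to_a
io_chunks     = each_chunk(StringIO.new("foo\nbar")).to_a
```

With a String there is a single chunk, while the StringIO produces two ("foo\n" and "bar"), which is why the text ends up as two tokens.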

Fixes #19.
2014-05-26 00:30:39 +02:00
Yorick Peterse 6b9d65923a Use a method for getting input in the XML lexer.
Instead of directly accessing the `data` instance variable the C/Java code now
uses the method `read_data`. This is part of one of the various steps required
to allow Oga to read data from IO like instances. It also means I can freely
change the name of the instance variable without also having to change the
C/Java code.
2014-05-21 00:27:23 +02:00
Yorick Peterse 418b4ef498 Cleaned up documentation of the XML lexer. 2014-05-21 00:21:21 +02:00
Yorick Peterse 3a8582030d Removed remaining fhold call in the XML lexer.
There's no particular need any more for this fhold call so we're getting rid of
it.
2014-05-21 00:11:39 +02:00
Yorick Peterse 4542f06d0f Replaced fcall/fret with fnext in the XML lexer.
With the rules being cleaned up/moved around a bit we can drop the use of
fcall/fret. This saves the need of having to maintain a stack (position).
2014-05-21 00:08:48 +02:00
Yorick Peterse c56b0395e4 Moved various rules around for the XML lexer.
This moves the element related rules to the element_head machine (where they
belong). This in turn makes it possible to lex ">" as a text node, previously
this was impossible.
2014-05-21 00:04:53 +02:00
Yorick Peterse feaf28d423 Remove dedicated string machine in the XML lexer.
This removes the need for another fcall/fret combination.
2014-05-19 20:26:07 +02:00
Yorick Peterse 93b9718406 Cleaned up the XML lexer documentation. 2014-05-19 09:39:35 +02:00
Yorick Peterse cd0f3380c4 Merge multiple CDATA tokens into a single token.
The tokens T_CDATA_START, T_TEXT and T_CDATA_END have been merged together into
T_CDATA.
2014-05-19 09:36:19 +02:00
Yorick Peterse a4fb5c1299 Merge multiple comment tokens into a single one.
The tokens T_COMMENT_START, T_TEXT and T_COMMENT_END have been merged into a
single token: T_COMMENT. This simplifies both the lexer and the parser.
2014-05-19 09:30:30 +02:00
Yorick Peterse 31ec76c90a Fixed guard in the lexer header. 2014-05-18 16:51:17 +02:00
Yorick Peterse ad67cd708f Only include debug info when DEBUG is set. 2014-05-15 20:43:48 +02:00
Yorick Peterse 44bf1dd1ca Split up handling of element names/namespaces.
This is now split up on Ragel level, simplifying the corresponding Ruby code.
2014-05-15 10:22:05 +02:00
Yorick Peterse 1b58723e7d Removed stdio.h #include.
This header is also not needed.
2014-05-11 21:06:55 +02:00
Yorick Peterse e2b9fc75ca Removed #include for malloc.h
Apparently some OSes move this to malloc/malloc.h. Since it's not needed, let's
just get rid of it.
2014-05-11 21:06:02 +02:00
Yorick Peterse 19f04f98f7 Support for lexing/parsing inline doctypes. 2014-05-10 00:28:11 +02:00
Yorick Peterse c472ceac6f Docs for the shared Ragel grammar. 2014-05-08 00:21:23 +02:00
Yorick Peterse fe74d60138 Manually bootstrap JRuby after all.
After discussing this with @headius I've decided to do this the manual way
anyway. Apparently the basic load service stuff is deprecated and not very
reliable.
2014-05-07 22:32:34 +02:00
Yorick Peterse ee78b2c382 Don't redefine namespaces in C.
The Oga::XML namespace should be set up by Ruby, not by C.
2014-05-07 10:52:06 +02:00
Yorick Peterse bbdc7966db Documentation for the JRuby extension. 2014-05-07 10:24:24 +02:00
Yorick Peterse 3afef5f7cc Lexer support for JRuby.
JRuby now passes all tests. Benchmark wise it completes the big XML benchmark
in about 500-600 milliseconds.
2014-05-07 09:40:22 +02:00
Yorick Peterse b9a4038e42 Callback boilerplate for the Java lexer. 2014-05-07 01:01:24 +02:00
Yorick Peterse e271298984 Use macros in the C lexer. 2014-05-07 00:57:25 +02:00
Yorick Peterse f25f8a3d15 Break up the Ragel C grammar.
The grammar is now broken up into a base lexer and a C lexer. This allows the
same grammar to also be used in the Java code.
2014-05-07 00:50:34 +02:00
Yorick Peterse 9abc5c1c92 Separated the Java and C ext codebases. 2014-05-07 00:29:10 +02:00
Yorick Peterse b8efed5177 Renamed on_start_doctype to on_doctype_start. 2014-05-06 23:18:44 +02:00
Yorick Peterse f39fe5d857 JRuby lexer boilerplate with actual input.
This doesn't actually lex anything just yet but at least the input from Ruby is
in place.
2014-05-06 22:43:55 +02:00