Yorick Peterse
feaf28d423
Remove dedicated string machine in the XML lexer.
...
This removes the need for another fcall/fret combination.
2014-05-19 20:26:07 +02:00
Yorick Peterse
93b9718406
Cleaned up the XML lexer documentation.
2014-05-19 09:39:35 +02:00
Yorick Peterse
cd0f3380c4
Merge multiple CDATA tokens into a single token.
...
The tokens T_CDATA_START, T_TEXT and T_CDATA_END have been merged together into
T_CDATA.
2014-05-19 09:36:19 +02:00
Yorick Peterse
a4fb5c1299
Merge multiple comment tokens into a single one.
...
The tokens T_COMMENT_START, T_TEXT and T_COMMENT_END have been merged into a
single token: T_COMMENT. This simplifies both the lexer and the parser.
2014-05-19 09:30:30 +02:00
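To illustrate the effect of the comment token merge above, here is a minimal Ruby sketch. It is not Oga's lexer code: the `[type, value]` token shape and the `merge_comment_tokens` helper are assumptions made for this example. The actual change has the lexer emit T_COMMENT directly rather than post-processing a token stream.

```ruby
# Hypothetical helper (not part of Oga): collapses the old
# T_COMMENT_START / T_TEXT / T_COMMENT_END sequence into one T_COMMENT token.
# Tokens are assumed to be [type, value] pairs for this sketch.
def merge_comment_tokens(tokens)
  merged = []
  buffer = nil # holds the comment body while inside a comment

  tokens.each do |type, value|
    case type
    when :T_COMMENT_START
      buffer = String.new
    when :T_COMMENT_END
      merged << [:T_COMMENT, buffer.to_s]
      buffer = nil
    when :T_TEXT
      if buffer
        buffer << value # text inside a comment becomes part of the comment
      else
        merged << [type, value]
      end
    else
      merged << [type, value]
    end
  end

  merged
end

old_stream = [[:T_COMMENT_START, nil], [:T_TEXT, 'foo'], [:T_COMMENT_END, nil]]
new_stream = merge_comment_tokens(old_stream)
```

With a single token per comment, a parser only needs one rule for comments instead of a three-token sequence.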
Yorick Peterse
31ec76c90a
Fixed guard in the lexer header.
2014-05-18 16:51:17 +02:00
Yorick Peterse
ad67cd708f
Only include debug info when DEBUG is set.
2014-05-15 20:43:48 +02:00
Yorick Peterse
44bf1dd1ca
Split up handling of element names/namespaces.
...
This is now split up on Ragel level, simplifying the corresponding Ruby code.
2014-05-15 10:22:05 +02:00
Yorick Peterse
1b58723e7d
Removed stdio.h #include.
...
This header is also not needed.
2014-05-11 21:06:55 +02:00
Yorick Peterse
e2b9fc75ca
Removed #include for malloc.h
...
Apparently some OSes move this to malloc/malloc.h. Since it's not needed, let's
just get rid of it.
2014-05-11 21:06:02 +02:00
Yorick Peterse
19f04f98f7
Support for lexing/parsing inline doctypes.
2014-05-10 00:28:11 +02:00
Yorick Peterse
c472ceac6f
Docs for the shared Ragel grammar.
2014-05-08 00:21:23 +02:00
Yorick Peterse
fe74d60138
Manually bootstrap JRuby after all.
...
After discussing this with @headius I've decided to do this the manual way
anyway. Apparently the basic load service stuff is deprecated and not very
reliable.
2014-05-07 22:32:34 +02:00
Yorick Peterse
ee78b2c382
Don't redefine namespaces in C.
...
The Oga::XML namespace should be set up by Ruby, not by C.
2014-05-07 10:52:06 +02:00
Yorick Peterse
bbdc7966db
Documentation for the JRuby extension.
2014-05-07 10:24:24 +02:00
Yorick Peterse
3afef5f7cc
Lexer support for JRuby.
...
JRuby now passes all tests. Benchmark-wise it completes the big XML benchmark
in about 500-600 milliseconds.
2014-05-07 09:40:22 +02:00
Yorick Peterse
b9a4038e42
Callback boilerplate for the Java lexer.
2014-05-07 01:01:24 +02:00
Yorick Peterse
e271298984
Use macros in the C lexer.
2014-05-07 00:57:25 +02:00
Yorick Peterse
f25f8a3d15
Break up the Ragel C grammar.
...
The grammar is now broken up into a base lexer and a C lexer. This allows the
same grammar to also be used in the Java code.
2014-05-07 00:50:34 +02:00
Yorick Peterse
9abc5c1c92
Separated the Java and C ext codebases.
2014-05-07 00:29:10 +02:00
Yorick Peterse
b8efed5177
Renamed on_start_doctype to on_doctype_start.
2014-05-06 23:18:44 +02:00
Yorick Peterse
f39fe5d857
JRuby lexer boilerplate with actual input.
...
This doesn't actually lex anything just yet but at least the input from Ruby is
in place.
2014-05-06 22:43:55 +02:00
Yorick Peterse
fea5ec7946
Removed the package line in LibogaService.java
2014-05-06 20:52:42 +02:00
Yorick Peterse
2053018d07
Slap JRuby so that it can load the .jar file.
2014-05-06 20:45:26 +02:00
Yorick Peterse
6e685378e0
Set up Ragel for JRuby and load things the hard way
2014-05-06 19:06:04 +02:00
Yorick Peterse
64c9e18651
Setup for Java and Ragel.
2014-05-06 10:24:07 +02:00
Yorick Peterse
eeeeb0efad
Don't track the generated Java lexer.
2014-05-06 10:11:19 +02:00
Yorick Peterse
d2742cfdde
Use 4 spaces for C/Java code.
2014-05-06 09:41:36 +02:00
Yorick Peterse
c30d3a7627
Half-assed JRuby boilerplate.
...
Blowing my brains out over getting this fat pig to do what I want but we're
getting there.
2014-05-06 00:23:07 +02:00
Yorick Peterse
2b3a6be24d
Use liboga as a prefix in the C code.
...
Namespaces? What are those?
2014-05-05 21:19:50 +02:00
Yorick Peterse
57fd4dff64
Docs for the C lexer.
2014-05-05 09:40:08 +02:00
Yorick Peterse
335f3cc6d6
Use rb_enc_str_new instead of rb_enc_str_new_cstr.
...
The latter in combination with strndup() would leak large amounts of memory.
2014-05-05 00:34:19 +02:00
Yorick Peterse
2689d3f65a
Initial setup using a C extension.
...
While I've tried to keep Oga pure Ruby for as long as possible, the performance
of Ragel's Ruby output was not worth the trouble. For example, lexing 10MB of
XML would take 5 to 6 seconds at least. Nokogiri on the other hand can parse
that same XML into a DOM document in about 300 milliseconds. Such a big
performance difference is not acceptable.

To work around this the XML/HTML lexer will be implemented in C for
MRI/Rubinius and Java for JRuby. For now there's only a C extension as I
haven't read up yet on the JRuby API. The end goal is to provide some sort of
Ragel "template" that can be used to generate the corresponding C/Java
extension code. This would remove the need to duplicate the grammar and
associated code.

The native extension setup is a hybrid between native and Ruby. The raw Ragel
stuff happens in C/Java while the actual logic of actions happens in Ruby. This
adds a small amount of overhead but makes it much easier to maintain the lexer.
Even with this extra overhead the performance is much better than pure Ruby.
The 10MB of XML mentioned above is lexed in about 600 milliseconds. In other
words, it's roughly 10 times faster.
2014-05-05 00:31:28 +02:00
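The hybrid split described in the commit above can be sketched roughly as follows. This is a pure-Ruby illustration under stated assumptions, not the actual extension: `SketchLexer`, `native_scan`, and the `on_text`/`on_comment` callback names are hypothetical, and a regular expression stands in for the Ragel-generated C/Java scanner.

```ruby
# Ruby side: holds the "logic of actions". The native scanner would only
# call these callbacks; what a token looks like is decided here.
class SketchLexer
  attr_reader :tokens

  def initialize
    @tokens = []
  end

  def on_text(value)
    @tokens << [:T_TEXT, value]
  end

  def on_comment(value)
    @tokens << [:T_COMMENT, value]
  end
end

# Stand-in for the native (C/Java) part: it walks the input and invokes a
# callback on the lexer for each match. In the real extension this loop
# would be Ragel-generated code rather than a regular expression.
def native_scan(input, lexer)
  input.scan(/<!--(.*?)-->|([^<]+)/m) do |comment, text|
    if comment
      lexer.on_comment(comment)
    else
      lexer.on_text(text)
    end
  end
end

lexer = SketchLexer.new
native_scan('foo<!--bar-->', lexer)
```

Each callback crosses the native/Ruby boundary once per token, which is where the small overhead mentioned above would come from. In exchange, token handling stays in one Ruby class that a C and a Java scanner could share.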