core/oga - oga

Commit Graph

Author	SHA1	Message	Date
Yorick Peterse	8f4eaf3823	Lexing of XML processing instructions.	2014-08-15 22:04:45 +02:00
Yorick Peterse	4e8cca258c	Fixed lexing of XML CDATA tags.	2014-08-15 20:47:58 +02:00
Yorick Peterse	81edce2eb8	Fixed lexing of XML comments. The previous setup would consume too much. For example, the following HTML: <a><!--foo--><b><!--bar--></b></a> would result in the following T_COMMENT token: "foo--><b><!--bar". The new setup requires the marking of a start position. I'm not a huge fan of this but there doesn't appear to be a way around this.	2014-08-15 20:42:32 +02:00
Yorick Peterse	d5569ead0b	Use XML::Attribute for element attributes. Instead of using a raw Hash, Oga now uses the XML::Attribute class for storing information about element attributes. Attributes are stored as an Array of XML::Attribute instances. This allows the attributes to be more easily modified. If they were stored as a Hash you'd not only have to update the attributes themselves but also the Hash that contains them. While using an Array has a slight runtime cost, in most cases the number of attributes is small enough that this doesn't really pose a problem. If webscale performance is desired at some point in the future, Oga could most likely cache the lookup of an attribute. This however is something for the future.	2014-07-20 07:29:37 +02:00
Yorick Peterse	f660b11e47	Parsing of closing XML nodes with namespaces.	2014-07-09 19:54:45 +02:00
Yorick Peterse	be3f8fb494	Removed the on_newline XML lexer callback.	2014-05-29 14:21:48 +02:00
Yorick Peterse	629dcd3fe6	Support for IO inputs in the lexer. Using IO/StringIO objects one can parse large XML files without first having to read the entire file into memory. This can potentially save a lot of memory at the cost of a slightly slower runtime. For IO like instances the lexer will consume the input line by line. If a String is given it's consumed as a whole instead. A small side effect of reading the input line by line is that text such as "foo\nbar" will be lexed as two tokens instead of one. Fixes #19.	2014-05-26 00:30:39 +02:00
Yorick Peterse	6b9d65923a	Use a method for getting input in the XML lexer. Instead of directly accessing the `data` instance variable the C/Java code now uses the method `read_data`. This is part of one of the various steps required to allow Oga to read data from IO like instances. It also means I can freely change the name of the instance variable without also having to change the C/Java code.	2014-05-21 00:27:23 +02:00
Yorick Peterse	418b4ef498	Cleaned up documentation of the XML lexer.	2014-05-21 00:21:21 +02:00
Yorick Peterse	3a8582030d	Removed remaining fhold call in the XML lexer. There's no particular need any more for this fhold call so we're getting rid of it.	2014-05-21 00:11:39 +02:00
Yorick Peterse	4542f06d0f	Replaced fcall/fret with fnext in the XML lexer. With the rules being cleaned up/moved around a bit we can drop the use of fcall/fret. This saves the need of having to maintain a stack (position).	2014-05-21 00:08:48 +02:00
Yorick Peterse	c56b0395e4	Moved various rules around for the XML lexer. This moves the element related rules to the element_head machine (where they belong). This in turn makes it possible to lex ">" as a text node, previously this was impossible.	2014-05-21 00:04:53 +02:00
Yorick Peterse	feaf28d423	Remove dedicated string machine in the XML lexer. This removes the need for another fcall/fret combination.	2014-05-19 20:26:07 +02:00
Yorick Peterse	93b9718406	Cleaned up the XML lexer documentation.	2014-05-19 09:39:35 +02:00
Yorick Peterse	cd0f3380c4	Merge multiple CDATA tokens into a single token. The tokens T_CDATA_START, T_TEXT and T_CDATA_END have been merged together into T_CDATA.	2014-05-19 09:36:19 +02:00
Yorick Peterse	a4fb5c1299	Merge multiple comment tokens into a single one. The tokens T_COMMENT_START, T_TEXT and T_COMMENT_END have been merged into a single token: T_COMMENT. This simplifies both the lexer and the parser.	2014-05-19 09:30:30 +02:00
Yorick Peterse	31ec76c90a	Fixed guard in the lexer header.	2014-05-18 16:51:17 +02:00
Yorick Peterse	ad67cd708f	Only include debug info when DEBUG is set.	2014-05-15 20:43:48 +02:00
Yorick Peterse	44bf1dd1ca	Split up handling of element names/namespaces. This is now split up on Ragel level, simplifying the corresponding Ruby code.	2014-05-15 10:22:05 +02:00
Yorick Peterse	1b58723e7d	Removed stdio.h #include. This header is also not needed.	2014-05-11 21:06:55 +02:00
Yorick Peterse	e2b9fc75ca	Removed #include for malloc.h. Apparently some OSes move this to malloc/malloc.h. Since it's not needed, let's just get rid of it.	2014-05-11 21:06:02 +02:00
Yorick Peterse	19f04f98f7	Support for lexing/parsing inline doctypes.	2014-05-10 00:28:11 +02:00
Yorick Peterse	c472ceac6f	Docs for the shared Ragel grammar.	2014-05-08 00:21:23 +02:00
Yorick Peterse	fe74d60138	Manually bootstrap JRuby after all. After discussing this with @headius I've decided to do this the manual way anyway. Apparently the basic load service stuff is deprecated and not very reliable.	2014-05-07 22:32:34 +02:00
Yorick Peterse	ee78b2c382	Don't redefine namespaces in C. The Oga::XML namespace should be set up by Ruby, not by C.	2014-05-07 10:52:06 +02:00
Yorick Peterse	bbdc7966db	Documentation for the JRuby extension.	2014-05-07 10:24:24 +02:00
Yorick Peterse	3afef5f7cc	Lexer support for JRuby. JRuby now passes all tests. Benchmark wise it completes the big XML benchmark in about 500-600 milliseconds.	2014-05-07 09:40:22 +02:00
Yorick Peterse	b9a4038e42	Callback boilerplate for the Java lexer.	2014-05-07 01:01:24 +02:00
Yorick Peterse	e271298984	Use macros in the C lexer.	2014-05-07 00:57:25 +02:00
Yorick Peterse	f25f8a3d15	Break up the Ragel C grammar. The grammar is now broken up into a base lexer and a C lexer. This allows the same grammar to also be used in the Java code.	2014-05-07 00:50:34 +02:00
Yorick Peterse	9abc5c1c92	Separated the Java and C ext codebases.	2014-05-07 00:29:10 +02:00
Yorick Peterse	b8efed5177	Renamed on_start_doctype to on_doctype_start.	2014-05-06 23:18:44 +02:00
Yorick Peterse	f39fe5d857	JRuby lexer boilerplate with actual input. This doesn't actually lex anything just yet but at least the input from Ruby is in place.	2014-05-06 22:43:55 +02:00
Yorick Peterse	fea5ec7946	Removed the package line in LibogaService.java	2014-05-06 20:52:42 +02:00
Yorick Peterse	2053018d07	Slap JRuby so that it can load the .jar file.	2014-05-06 20:45:26 +02:00
Yorick Peterse	6e685378e0	Setup Ragel for JRuby and load things the hard way	2014-05-06 19:06:04 +02:00
Yorick Peterse	64c9e18651	Setup for Java and Ragel.	2014-05-06 10:24:07 +02:00
Yorick Peterse	eeeeb0efad	Don't track the generated Java lexer.	2014-05-06 10:11:19 +02:00
Yorick Peterse	d2742cfdde	Use 4 spaces for C/Java code.	2014-05-06 09:41:36 +02:00
Yorick Peterse	c30d3a7627	Half-assed JRuby boilerplate. Blowing my brains out over getting this fat pig to do what I want but we're getting there.	2014-05-06 00:23:07 +02:00
Yorick Peterse	2b3a6be24d	Use liboga as a prefix in the C code. Namespaces? What are those?	2014-05-05 21:19:50 +02:00
Yorick Peterse	57fd4dff64	Docs for the C lexer.	2014-05-05 09:40:08 +02:00
Yorick Peterse	335f3cc6d6	Use rb_enc_str_new instead of rb_enc_str_new_cstr. The latter in combination with strndup() would leak large amounts of memory.	2014-05-05 00:34:19 +02:00
Yorick Peterse	2689d3f65a	Initial setup using a C extension. While I've tried to keep Oga pure Ruby for as long as possible the performance of Ragel's Ruby output was not worth the trouble. For example, lexing 10MB of XML would take 5 to 6 seconds at least. Nokogiri on the other hand can parse that same XML into a DOM document in about 300 milliseconds. Such a big performance difference is not acceptable. To work around this the XML/HTML lexer will be implemented in C for MRI/Rubinius and Java for JRuby. For now there's only a C extension as I haven't read up yet on the JRuby API. The end goal is to provide some sort of Ragel "template" that can be used to generate the corresponding C/Java extension code. This would remove the need of duplicating the grammar and associated code. The native extension setup is a hybrid between native and Ruby. The raw Ragel stuff happens in C/Java while the actual logic of actions happens in Ruby. This adds a small amount of overhead but makes it much easier to maintain the lexer. Even with this extra overhead the performance is much better than pure Ruby. The 10MB of XML mentioned above is lexed in about 600 milliseconds. In other words, it's 10 times faster.	2014-05-05 00:31:28 +02:00
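
Commit 629dcd3fe6 describes two input modes: IO-like objects are consumed line by line, while a String is consumed as a whole. The Ruby sketch below illustrates only that distinction. The method name `each_chunk` is hypothetical and is not taken from Oga's code (the log mentions only a `read_data` method, whose signature is not shown here), so treat this as an assumption-laden illustration rather than the lexer's actual implementation.

```ruby
require 'stringio'

# Hypothetical helper (not Oga's API): yields the pieces of input that a
# lexer would be fed, following the behaviour described in 629dcd3fe6.
def each_chunk(input)
  if input.is_a?(String)
    # A String is handed over as a single chunk.
    yield input
  else
    # IO-like objects (IO, StringIO, ...) are read one line at a time, so
    # the whole document never has to be held in memory at once.
    while (line = input.gets)
      yield line
    end
  end
end

string_chunks = []
each_chunk("foo\nbar") { |chunk| string_chunks << chunk }

io_chunks = []
each_chunk(StringIO.new("foo\nbar")) { |chunk| io_chunks << chunk }
```

Here `string_chunks` holds one element and `io_chunks` holds two, which mirrors the side effect noted in the commit message: text such as "foo\nbar" reaches the lexer in two pieces when read from an IO, and so ends up as two tokens instead of one.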


94 Commits