core/oga - oga

Commit Graph

Author	SHA1	Message	Date
Yorick Peterse	b2ea20ba61	Lex processing instructions in chunks Similar to comments (`ea8b4aa92f`) and CDATA tags (`8acc7fc743`) processing instructions are now lexed in separate chunks _with_ proper support for streaming input. Related issue: #93	2015-04-15 00:11:57 +02:00
Yorick Peterse	ea8b4aa92f	Lex comments in chunks Similar to this being added for CDATA tags in `8acc7fc743` comments are now also lexed in chunks. Related issue: #93	2015-04-14 23:11:22 +02:00
Yorick Peterse	8acc7fc743	Lex CDATA tags in chunks Instead of using a single token (T_CDATA) for a CDATA tag the lexer now uses 3 tokens: 1. T_CDATA_START 2. T_CDATA_BODY 3. T_CDATA_END The T_CDATA_BODY token can occur multiple times and is turned into a single value in the XML parser. This is similar to the way strings are lexed. By changing the way CDATA tags are lexed Oga can now lex CDATA tags containing newlines when using an IO as input. For example, this would previously fail: Oga.parse_xml(StringIO.new("<![CDATA[\nfoo]]>")) Because IO input reads input per line the input for the lexer would be as following: "<![CDATA[\n" "foo]]>" Related issues: #93	2015-04-14 22:45:55 +02:00
Yorick Peterse	0800654c96	Support lexing of carriage returns Fixes #89.	2015-04-03 00:46:37 +02:00
Yorick Peterse	5802d9d62c	Use RbConfig::CONFIG['CC'] vs 'cc'	2015-03-23 19:46:44 +01:00
Yorick Peterse	2bf5fe3061	Updated extconf.rb for Windows support	2015-03-22 14:03:37 +01:00
Yorick Peterse	7e847a0ae9	Make C90 happy.	2015-03-05 22:57:51 +01:00
Yorick Peterse	33c46a1841	Use ID instead of VALUE for callback names in C.	2015-03-05 22:57:51 +01:00
Yorick Peterse	3b2055a30b	Refactored handling of literal HTML elements. This ensures newlines can appear in <style> / <script> tags when using IOs as input.	2015-03-04 11:44:31 +01:00
Yorick Peterse	78e40b55c0	Handle parsing of HTML <style> tags. This basically re-applies the technique used for HTML <script> tags. With this extra addition I decided to rename/normalize a few things so it's easier to add any extra tags in the future. One downside of this setup is that the following will not be parsed by Oga: <style> </script> </style> The same applies to script tags containing a literal </style> tag. Since this particular case is rather unlikely to occur I'm OK with not supporting it as it _does_ simplify the lexer quite a bit. Fixes #80	2015-03-03 16:28:05 +01:00
Yorick Peterse	ba2177e2cf	Lex contents of <script> tags as plain text. When lexing input in HTML mode the lexer has to treat _all_ content of a <script> tag as plain text. This ensures that the lexer can process input such as "x <y" and "// <foo>" correctly. Fixes #70.	2015-03-02 16:22:09 +01:00
Yorick Peterse	8fdf27dcef	Removed unused C lexer macros.	2015-03-02 15:43:47 +01:00
Yorick Peterse	739f885078	Use ID instead of VALUE for C Symbols. Thanks to @cremno for bringing this up.	2014-11-29 12:53:55 +01:00
Yorick Peterse	b006289c5f	Removed extra space in c/lexer.rl	2014-11-23 22:12:18 +01:00
Yorick Peterse	5e24a3d1e5	Short docs on lexer callback names.	2014-11-23 20:20:14 +01:00
Yorick Peterse	4fa88fcbde	Cache rb_intern/symbol lookups in the lexer. For JRuby this has little to no benefits as it uses strings for method names. However, both MRI and Rubinius will perform a Symbol lookup whenever rb_intern() is called. By doing this once for all callback names and caching the resulting VALUE objects the lexer timings can be reduced by about 25%. In case of the benchmark benchmark/xml/lexer/string_average_bench.rb this means it runs in around 500ms instead of 700ms.	2014-11-22 01:53:37 +01:00
Yorick Peterse	cbb2815146	Support for inline doctype rules plus newlines. This adds support for lexing/parsing XML documents that use an IO as input _and_ contain doctype rules with newlines in them. This fixes #63.	2014-11-18 20:02:55 +01:00
Yorick Peterse	24ae791f00	Better support for lexing multi-line strings. When lexing multi-line strings everything used to work fine as long as the input were to be read as a whole. However, when using an IO instance all hell would break loose. Due to the lexer reading IO instances on a per line basis, sometimes Ragel would end up setting "ts" to NULL. For example, the following input would break the lexer: <foo class="\nbar" /> Due to the input being read per line, the following data would be sent to the lexer: <foo class="\n bar" /> This would result in different (or NULL) pointers being used for building a string, in turn resulting in memory allocation errors. To work around this the string lexing setup has been broken into separate machines for single and double quoted strings. The tokens used have also been changed so that instead of just "T_STRING" there are now the following tokens: * T_STRING_SQUOTE * T_STRING_DQUOTE * T_STRING_BODY A string can have multiple T_STRING_BODY tokens (= multi-line strings, only the case for IO inputs). These strings are stitched back together by the parser. This fixes #58.	2014-10-26 11:39:56 +01:00
Yorick Peterse	fca88a69d1	Track Ragel call stacks in the Java lexer. This will be needed for the upcoming string lexing changes.	2014-10-26 11:39:19 +01:00
Yorick Peterse	d951a8cc87	Track XML C lexer state in C only. Instead of storing "act" and "cs" as an instance variable they (along with some other variables) are now stored in a struct. This struct is attached to a lexer instance using the (crappy) Data_Get_Struct/Data_Wrap_Struct API.	2014-10-26 11:38:06 +01:00
Yorick Peterse	1400a859ce	Make sure C strings always end with a NULL. Haven't bumped into any problems just yet. However, in theory all sorts of evil could happen here. Which is part of the problem of C: so much shit is undefined behaviour that you can take a single step and fall in 15 holes at the same time. In theory, because nobody bothered to actually specify it properly.	2014-09-28 22:28:55 +02:00
Yorick Peterse	8db77c0a09	Count newlines of text nodes in native code. Instead of relying on String#count for counting newlines in text nodes, Oga now does this in C/Java. String#count isn't exactly the fastest way of counting characters. Performance was measured using benchmark/xml/lexer/string_average_bench.rb. Before this patch the results were as following: MRI: 0.529s Rbx: 4.965s JRuby: 0.622s After this patch: MRI: 0.424s Rbx: 1.942s JRuby: 0.665s => numbers vary a bit, seem roughly the same as before The commands used for benchmarking: $ rake clean # to make sure that C exts aren't shared between MRI/Rbx $ rake generate $ rake fixtures $ ruby benchmark/xml/lexer/string_average_bench.rb The big difference for Rbx is probably due to the implementation of String#count not being super fast. Some changes were made (https://github.com/rubinius/rubinius/pull/3133) to the method, but this hasn't been released yet. JRuby seems to perform in a similar way, so either it was already optimizing things for me or I suck at writing well performing Java code. This fixes #51.	2014-09-25 22:49:11 +02:00
Yorick Peterse	00579eaa8a	Changed text action from @{} to %{}. This ensures the action is only run at the end, as opposed to in any non-final state.	2014-09-23 22:58:20 +02:00
Yorick Peterse	ad2e040f05	Handle lexing of input such as just "</". Previously this would cause the lexer to go in an infinite loop in the "text" state machine. This fixes #37.	2014-09-15 17:20:06 +02:00
Yorick Peterse	9b8e9f49c6	Support for lexing empty attribute values. This ensures that Oga can lex the following properly: <input value="" /> Previously Ragel would stop upon finding the empty string. This was caused due to the string rules being declared as following: string_dquote = (dquote ^dquote+ dquote); string_squote = (squote ^squote+ squote); These rules only match strings _with_ content, not without. Since Ragel stops consuming input the moment it finds unhandled data this resulted in incorrect tokens being emitted.	2014-09-03 23:10:50 +02:00
Yorick Peterse	49ddebf358	Tighten lexing of T_TEXT nodes. Thanks to some heavy rubberducking with @whitequark the lexer is now a little bit better at lexing T_TEXT nodes. For example, previously the following could not be lexed properly: "foo < bar" There might still be some tweaking to do but we're getting there.	2014-09-03 00:51:13 +02:00
Yorick Peterse	96b7296910	Ragel variable of element closing tags.	2014-09-02 22:50:21 +02:00
Benjamin Klotz	0b096dfe25	Use proper create_makefile Using create_makefile('liboga/liboga') will compile liboga.so into path-to-gem/lib/liboga/ and therefore require_relative in oga.rb will fail. Therefore the right parameter for create_makefile is 'liboga' -> path-to-gem/lib/liboga.so	2014-09-02 20:27:20 +02:00
Yorick Peterse	56341b5585	Cleaned up lexing of comments/cdata. Thanks to @whitequark for suggesting the use of the "--" operator.	2014-08-16 16:03:55 +02:00
Yorick Peterse	2c488f92be	Cleaned up marking of comments/cdata tags.	2014-08-15 22:05:09 +02:00
Yorick Peterse	8f4eaf3823	Lexing of XML processing instructions.	2014-08-15 22:04:45 +02:00
Yorick Peterse	4e8cca258c	Fixed lexing of XML CDATA tags.	2014-08-15 20:47:58 +02:00
Yorick Peterse	81edce2eb8	Fixed lexing of XML comments. The previous setup would consume too much. For example the following HTML: <a><!--foo--><b><!--bar--></b></a> would result in the following T_COMMENT token: "foo--><b><!--bar" The new setup requires the marking of a start position. I'm not a huge fan of this but there doesn't appear to be a way around this.	2014-08-15 20:42:32 +02:00
Yorick Peterse	d5569ead0b	Use XML::Attribute for element attributes. Instead of using a raw Hash Oga now uses the XML::Attribute class for storing information about element attributes. Attributes are stored as an Array of XML::Attribute instances. This allows the attributes to be more easily modified. If they were stored as a Hash you'd not only have to update the attributes themselves but also the Hash that contains them. While using an Array has a slight runtime cost in most cases the amount of attributes is small enough that this doesn't really pose a problem. If webscale performance is desired at some point in the future Oga could most likely cache the lookup of an attribute. This however is something for the future.	2014-07-20 07:29:37 +02:00
Yorick Peterse	f660b11e47	Parsing of closing XML nodes with namespaces.	2014-07-09 19:54:45 +02:00
Yorick Peterse	be3f8fb494	Removed the on_newline XML lexer callback.	2014-05-29 14:21:48 +02:00
Yorick Peterse	629dcd3fe6	Support for IO inputs in the lexer. Using IO/StringIO objects one can parse large XML files without first having to read the entire file into memory. This can potentially save a lot of memory at the cost of a slightly slower runtime. For IO like instances the lexer will consume the input line by line. If a String is given it's consumed as a whole instead. A small side effect of reading the input line by line is that text such as "foo\nbar" will be lexed as two tokens instead of one. Fixes #19.	2014-05-26 00:30:39 +02:00
Yorick Peterse	6b9d65923a	Use a method for getting input in the XML lexer. Instead of directly accessing the `data` instance variable the C/Java code now uses the method `read_data`. This is part of one of the various steps required to allow Oga to read data from IO like instances. It also means I can freely change the name of the instance variable without also having to change the C/Java code.	2014-05-21 00:27:23 +02:00
Yorick Peterse	418b4ef498	Cleaned up documentation of the XML lexer.	2014-05-21 00:21:21 +02:00
Yorick Peterse	3a8582030d	Removed remaining fhold call in the XML lexer. There's no particular need any more for this fhold call so we're getting rid of it.	2014-05-21 00:11:39 +02:00
Yorick Peterse	4542f06d0f	Replaced fcall/fret with fnext in the XML lexer. With the rules being cleaned up/moved around a bit we can drop the use of fcall/fret. This saves the need of having to maintain a stack (position).	2014-05-21 00:08:48 +02:00
Yorick Peterse	c56b0395e4	Moved various rules around for the XML lexer. This moves the element related rules to the element_head machine (where they belong). This in turn makes it possible to lex ">" as a text node, previously this was impossible.	2014-05-21 00:04:53 +02:00
Yorick Peterse	feaf28d423	Remove dedicated string machine in the XML lexer. This removes the need for another fcall/fret combination.	2014-05-19 20:26:07 +02:00
Yorick Peterse	93b9718406	Cleaned up the XML lexer documentation.	2014-05-19 09:39:35 +02:00
Yorick Peterse	cd0f3380c4	Merge multiple CDATA tokens into a single token. The tokens T_CDATA_START, T_TEXT and T_CDATA_END have been merged together into T_CDATA.	2014-05-19 09:36:19 +02:00
Yorick Peterse	a4fb5c1299	Merge multiple comment tokens into a single one. The tokens T_COMMENT_START, T_TEXT and T_COMMENT_END have been merged into a single token: T_COMMENT. This simplifies both the lexer and the parser.	2014-05-19 09:30:30 +02:00
Yorick Peterse	31ec76c90a	Fixed guard in the lexer header.	2014-05-18 16:51:17 +02:00
Yorick Peterse	ad67cd708f	Only include debug info when DEBUG is set.	2014-05-15 20:43:48 +02:00
Yorick Peterse	44bf1dd1ca	Split up handling of element names/namespaces. This is now split up on Ragel level, simplifying the corresponding Ruby code.	2014-05-15 10:22:05 +02:00
Yorick Peterse	1b58723e7d	Removed stdio.h #include. This header is also not needed.	2014-05-11 21:06:55 +02:00
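
Several of the commits above (629dcd3fe6, 24ae791f00, 8acc7fc743) turn on the same point: IO input reaches the lexer one line at a time, so a single CDATA tag, comment or string can arrive in several chunks that are stitched back together later. The following is a minimal Ruby sketch of that idea using only the standard library. It does not call Oga; the helpers `line_chunks` and `stitch_bodies` and the token array are hypothetical and exist only for illustration.

```ruby
require 'stringio'

# Hypothetical helper: returns the chunks an IO yields when read per line,
# mirroring how 629dcd3fe6 describes IO-like input being consumed.
def line_chunks(io)
  io.each_line.to_a
end

# Hypothetical helper: joins the values of all body tokens into one string,
# in the spirit of the T_CDATA_BODY merging described in 8acc7fc743.
def stitch_bodies(tokens)
  tokens.select { |type, _value| type == :T_CDATA_BODY }
        .map { |_type, value| value }
        .join
end

# The example input from 8acc7fc743, split at the newline.
chunks = line_chunks(StringIO.new("<![CDATA[\nfoo]]>"))

# An assumed token stream for that input: one body token per chunk.
tokens = [
  [:T_CDATA_START, nil],
  [:T_CDATA_BODY, "\n"],
  [:T_CDATA_BODY, "foo"],
  [:T_CDATA_END, nil]
]

body = stitch_bodies(tokens)
```

Here `chunks` holds the two pieces quoted in the commit message, `"<![CDATA[\n"` and `"foo]]>"`, and `body` is the recombined value `"\nfoo"`. The real lexer and parser are written in Ragel, C and Java, so this only shows the shape of the data, not the implementation.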


74 Commits
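
Commit 9b8e9f49c6 notes that the rules `dquote ^dquote+ dquote` and `squote ^squote+ squote` only match strings with content. A rough way to see why is to compare regular expressions with the same shape. The sketch below is a Ruby Regexp analogue, not Oga's Ragel grammar, and the constant names are made up. The commit message does not quote the corrected rule, so the second pattern is only an assumption about one way an empty value could be accepted.

```ruby
# Analogue of `dquote ^dquote+ dquote`: at least one non-quote character
# is required between the quotes.
NON_EMPTY_ONLY = /\A"[^"]+"\z/

# Assumed alternative: zero or more non-quote characters, so "" also matches.
ALLOWS_EMPTY = /\A"[^"]*"\z/

empty_value  = '""'
filled_value = '"bar"'

results = {
  non_empty_only: [NON_EMPTY_ONLY.match?(empty_value), NON_EMPTY_ONLY.match?(filled_value)],
  allows_empty:   [ALLOWS_EMPTY.match?(empty_value), ALLOWS_EMPTY.match?(filled_value)]
}
```

With the first pattern the empty attribute value from `<input value="" />` is rejected while `"bar"` is accepted; the second pattern accepts both.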