core/oga - oga

Commit Graph

Author	SHA1	Message	Date
Yorick Peterse	7e847a0ae9	Make C90 happy.	2015-03-05 22:57:51 +01:00
Yorick Peterse	33c46a1841	Use ID instead of VALUE for callback names in C.	2015-03-05 22:57:51 +01:00
Yorick Peterse	78e40b55c0	Handle parsing of HTML <style> tags. This basically re-applies the technique used for HTML <script> tags. With this extra addition I decided to rename/normalize a few things so it's easier to add any extra tags in the future. One downside of this setup is that the following will not be parsed by Oga: `<style> </script> </style>`. The same applies to script tags containing a literal </style> tag. Since this particular case is rather unlikely to occur I'm OK with not supporting it as it _does_ simplify the lexer quite a bit. Fixes #80	2015-03-03 16:28:05 +01:00
Yorick Peterse	ba2177e2cf	Lex contents of <script> tags as plain text. When lexing input in HTML mode the lexer has to treat _all_ content of a <script> tag as plain text. This ensures that the lexer can process input such as "x <y" and "// <foo>" correctly. Fixes #70.	2015-03-02 16:22:09 +01:00
Yorick Peterse	8fdf27dcef	Removed unused C lexer macros.	2015-03-02 15:43:47 +01:00
Yorick Peterse	739f885078	Use ID instead of VALUE for C Symbols. Thanks to @cremno for bringing this up.	2014-11-29 12:53:55 +01:00
Yorick Peterse	b006289c5f	Removed extra space in c/lexer.rl	2014-11-23 22:12:18 +01:00
Yorick Peterse	4fa88fcbde	Cache rb_intern/symbol lookups in the lexer. For JRuby this has little to no benefit as it uses strings for method names. However, both MRI and Rubinius will perform a Symbol lookup whenever rb_intern() is called. By doing this once for all callback names and caching the resulting VALUE objects the lexer timings can be reduced by about 25%. In case of the benchmark benchmark/xml/lexer/string_average_bench.rb this means it runs in around 500ms instead of 700ms.	2014-11-22 01:53:37 +01:00
Yorick Peterse	d951a8cc87	Track XML C lexer state in C only. Instead of storing "act" and "cs" as an instance variable they (along with some other variables) are now stored in a struct. This struct is attached to a lexer instance using the (crappy) Data_Get_Struct/Data_Wrap_Struct API.	2014-10-26 11:38:06 +01:00
Yorick Peterse	1400a859ce	Make sure C strings always end with a NULL. Haven't bumped into any problems just yet. However, in theory all sorts of evil could happen here. Which is part of the problem of C: so much shit is undefined behaviour that you can take a single step and fall in 15 holes at the same time. In theory, because nobody bothered to actually specify it properly.	2014-09-28 22:28:55 +02:00
Yorick Peterse	8db77c0a09	Count newlines of text nodes in native code. Instead of relying on String#count for counting newlines in text nodes, Oga now does this in C/Java. String#count isn't exactly the fastest way of counting characters. Performance was measured using benchmark/xml/lexer/string_average_bench.rb. Before this patch the results were as follows: MRI: 0.529s, Rbx: 4.965s, JRuby: 0.622s. After this patch: MRI: 0.424s, Rbx: 1.942s, JRuby: 0.665s (numbers vary a bit, seem roughly the same as before). The commands used for benchmarking: `rake clean` (to make sure that C exts aren't shared between MRI/Rbx), `rake generate`, `rake fixtures`, `ruby benchmark/xml/lexer/string_average_bench.rb`. The big difference for Rbx is probably due to the implementation of String#count not being super fast. Some changes were made (https://github.com/rubinius/rubinius/pull/3133) to the method, but this hasn't been released yet. JRuby seems to perform in a similar way, so either it was already optimizing things for me or I suck at writing well performing Java code. This fixes #51.	2014-09-25 22:49:11 +02:00
Yorick Peterse	81edce2eb8	Fixed lexing of XML comments. The previous setup would consume too much. For example the following HTML: <a><!--foo--><b><!--bar--></b></a> would result in the following T_COMMENT token: "foo--><b><!--bar" The new setup requires the marking of a start position. I'm not a huge fan of this but there doesn't appear to be a way around this.	2014-08-15 20:42:32 +02:00
Yorick Peterse	629dcd3fe6	Support for IO inputs in the lexer. Using IO/StringIO objects one can parse large XML files without first having to read the entire file into memory. This can potentially save a lot of memory at the cost of a slightly slower runtime. For IO-like instances the lexer will consume the input line by line. If a String is given it's consumed as a whole instead. A small side effect of reading the input line by line is that text such as "foo\nbar" will be lexed as two tokens instead of one. Fixes #19.	2014-05-26 00:30:39 +02:00
Yorick Peterse	6b9d65923a	Use a method for getting input in the XML lexer. Instead of directly accessing the `data` instance variable the C/Java code now uses the method `read_data`. This is part of one of the various steps required to allow Oga to read data from IO-like instances. It also means I can freely change the name of the instance variable without also having to change the C/Java code.	2014-05-21 00:27:23 +02:00
Yorick Peterse	4542f06d0f	Replaced fcall/fret with fnext in the XML lexer. With the rules being cleaned up/moved around a bit we can drop the use of fcall/fret. This saves the need of having to maintain a stack (position).	2014-05-21 00:08:48 +02:00
Yorick Peterse	ee78b2c382	Don't redefine namespaces in C. The Oga::XML namespace should be set up by Ruby, not by C.	2014-05-07 10:52:06 +02:00
Yorick Peterse	e271298984	Use macros in the C lexer.	2014-05-07 00:57:25 +02:00
Yorick Peterse	f25f8a3d15	Break up the Ragel C grammar. The grammar is now broken up into a base lexer and a C lexer. This allows the same grammar to also be used in the Java code.	2014-05-07 00:50:34 +02:00
Yorick Peterse	9abc5c1c92	Separated the Java and C ext codebases.	2014-05-07 00:29:10 +02:00
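
The over-consumption described in commit 81edce2eb8 can be reproduced in miniature. The sketch below is an illustration only: Oga's lexer is generated by Ragel, and the regular expressions here are stand-ins chosen to show the same symptom, not the project's code. The helper name `comment_bodies` is hypothetical.

```ruby
# Illustration only: these regexes are NOT how Oga lexes comments. They are
# stand-ins that show the same "consumes too much" symptom.
input = '<a><!--foo--><b><!--bar--></b></a>'

# Greedy: the body runs up to the LAST "-->" in the input.
greedy_comment = /<!--(.*)-->/m

# Non-greedy: the body stops at the FIRST "-->" after each "<!--".
lazy_comment = /<!--(.*?)-->/m

# Hypothetical helper: returns the captured comment bodies for a pattern.
def comment_bodies(input, pattern)
  input.scan(pattern).flatten
end

greedy = comment_bodies(input, greedy_comment)
lazy   = comment_bodies(input, lazy_comment)

p greedy # a single body that swallows the markup between both comments
p lazy   # one body per comment
```

The greedy variant yields the same over-long body quoted in the commit message, while the non-greedy variant yields one body per comment.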

19 Commits
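
The side effect noted in commit 629dcd3fe6 (text such as "foo\nbar" being lexed as two tokens for IO input) follows from how the input is fed to the lexer. A minimal sketch, assuming a hypothetical `each_chunk` helper that is not part of Oga's API and only mimics the described behaviour (a String is consumed whole, an IO-like object line by line):

```ruby
require 'stringio'

# Hypothetical helper, not Oga's API: yields the pieces of input a lexer
# would be fed. Strings are yielded whole, IO-like objects line by line.
def each_chunk(input)
  return enum_for(:each_chunk, input) unless block_given?

  if input.is_a?(String)
    yield input
  else
    input.each_line { |line| yield line }
  end
end

from_string = each_chunk("foo\nbar").to_a
from_io     = each_chunk(StringIO.new("foo\nbar")).to_a

p from_string # one chunk containing the newline
p from_io     # two chunks, split after the newline
```

Because each chunk is lexed on its own, text that spans a newline arrives as two separate pieces when read from an IO-like object, which matches the two-token behaviour the commit describes.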