core/oga - oga

Commit Graph

Author	SHA1	Message	Date
Yorick Peterse	f00fa40e3a	Make PUBLIC/SYSTEM matching case-insensitive Some websites may use "public" or "system" in doctypes, or completely messed up casing such as PuBlIc (unlikely, but possible). This ensures we don't care about the exact casing used. This fixes https://gitlab.com/yorickpeterse/oga/issues/199	2020-01-08 03:23:46 +01:00
Yorick Peterse	f574197ea6	Ignore nested element start tags This ensures that Oga is able to tokenize input such as the following: <script<script>foo</script> Oga will now treat this as: <script>foo</script> This is based on libxml behaviour, which seems to differ a bit from Chromium which treats the node as a text node. This however would require complex look-ahead logic (as far as I can tell) that I really don't want to implement in Oga. Fixes #186	2017-12-28 16:12:20 +01:00
Yorick Peterse	8282325569	Don't warn for implicit fallthroughs This is the result of Ragel output which we can't control.	2017-06-17 13:46:59 +02:00
Yorick Peterse	01fa1513f4	Lexing of processing instructions with namespaces This adds lexing/parsing support for processing instructions that contain namespace prefixes such as `<?foo:bar ?>`.	2016-09-17 14:51:48 +02:00
Yorick Peterse	66fc4b1dfc	Fixed parsing HTML identifiers containing colons HTML identifiers containing colons should be treated in two ways: * For element names the prefix (= the namespace prefix in case of XML) should be ignored as HTML doesn't support/use namespaces. * For attribute names a colon is a valid character, thus "foo:bar:baz" should be treated as a single attribute name. This fixes #142.	2015-12-26 20:28:35 +01:00
Yorick Peterse	dde644cd79	Support for Unicode XML/HTML identifiers Technically HTML only allows for ASCII names but restricting that actually requires more work than just allowing it.	2015-06-29 21:08:01 +02:00
Laurence Lee	b7771ed5fe	Patch to Oga Upstream to fix "Oga Parser doesn't handle dots in XML Tags". Fixes rubyjedi/soap4r#7	2015-06-29 20:55:48 +02:00
Yorick Peterse	3b633ff41c	Relax support for HTML unquoted attribute values This allows for parsing of HTML such as: <a href=lol("javascript")></a> Here the "href" attribute would have its value set to: lol("javascript") Fixes #119	2015-06-29 16:35:48 +02:00
Yorick Peterse	4031c4f843	Nuked Oga::XML::Lexer#html This method was rather pointless since there's already a "html?" method.	2015-06-15 23:45:20 +02:00
Yorick Peterse	fd307a0fcc	Support HTML attributes without starting quotes This allows the lexer to process input such as: <a href=foo"></a> For XML input the lexer still expects properly opened/closed attribute values. Fixes #109	2015-06-08 06:46:08 +02:00
Yorick Peterse	a76286b973	Support for spaces around attribute equal signs This also takes care of making sure line numbers are incremented properly. Fixes #112	2015-06-08 06:34:49 +02:00
Yorick Peterse	d2523a1082	Support whitespace in element closing tags Fixes #108	2015-05-25 13:41:17 +02:00
Yorick Peterse	5182d0c488	Correct closing of unclosed, nested HTML elements Previously, HTML such as this would be lexed incorrectly: <div> <ul> <li>foo </ul> inside div </div> outside div The lexer would see this as the following instead: <div> <ul> <li>foo</li> inside div </ul> outside div </div> This commit exposes the name of the closing tag to XML::Lexer#on_element_end (omitted for self closing tags). This can be used to automatically close nested tags that were left open, ensuring the above HTML is lexed correctly. The new setup ignores namespace prefixes as these are not used in HTML; XML in turn won't even run the code to begin with since it doesn't allow one to leave out closing tags.	2015-05-23 09:59:50 +02:00
Yorick Peterse	8135074a62	Merged on_element_start with on_element_name This makes it easier to automatically insert preceding tokens when starting a new element as we now have access to the name. Previously on_element_start would be invoked first which doesn't receive an argument.	2015-04-21 23:38:06 +02:00
Yorick Peterse	73fbbfbdbd	Use separate Ragel machines for script/style tags Previously a single Ragel machine was used for processing HTML script and style tags. This had the unfortunate side-effect that the following was not parsed correctly (while being valid HTML): <script> var foo = "</style>"; </script> The same applied to style tags: <style> /* </script> */ </style> By using separate machines we can work around the above issue. The downside is that this can produce multiple T_TEXT nodes, which have to be stitched back together in the parser.	2015-04-16 01:45:39 +02:00
Andrei Botalov	2d43e459a1	Update links to discontinued W3C document with a spec http://www.w3.org/TR/html-markup is marked as discontinued	2015-04-15 23:56:25 +03:00
Yorick Peterse	6b779d7883	Handle lexing of stray quotes in element heads This adds lexing support for HTML/XML such as: <foo bar="""></foo> While technically invalid, some websites (e.g. yahoo.com) contain HTML just like this. The lexer handles this as following: 1. When we're in the "element_head" machine, do business as usual until we bump into a "=". 2. Call (using Ragel's "fcall") the machine to use for processing the attribute value (if any). 3. In this machine quoted strings are processed. The moment a string has been processed the lexer jumps right back in to the "element_head" machine. This ensures that any stray quotes are ignored instead of being processed as extra attribute values (eventually leading to parsing errors due to unbalanced quotes).	2015-04-15 22:33:53 +02:00
Yorick Peterse	9a0e31d0ae	Fix for lexing newlines in doctypes This also ensures that newlines are advanced properly. Fixes #95	2015-04-15 20:22:14 +02:00
Yorick Peterse	bc9b9bc953	Remove unneeded Ragel machine for HTML attrs We can just use the same machine here as used for XML attribute values.	2015-04-15 01:50:42 +02:00
Yorick Peterse	d892ce9787	Fix for lexing HTML quoted attrs followed by "/>" This ensures that when using input such as <a href="foo"/> the "/" is not part of the attribute value.	2015-04-15 01:47:08 +02:00
Yorick Peterse	afbb585812	Lexing support for unquoted HTML attribute values This adds support for HTML such as: <a href=foo>HTML is a child of Satan itself</a> Fixes #94	2015-04-15 01:23:46 +02:00
Yorick Peterse	e942086f2d	Fixed counting of newlines in XML declarations	2015-04-15 00:22:58 +02:00
Yorick Peterse	b2ea20ba61	Lex processing instructions in chunks Similar to comments (`ea8b4aa92f`) and CDATA tags (`8acc7fc743`) processing instructions are now lexed in separate chunks _with_ proper support for streaming input. Related issue: #93	2015-04-15 00:11:57 +02:00
Yorick Peterse	ea8b4aa92f	Lex comments in chunks Similar to this being added for CDATA tags in `8acc7fc743` comments are now also lexed in chunks. Related issue: #93	2015-04-14 23:11:22 +02:00
Yorick Peterse	8acc7fc743	Lex CDATA tags in chunks Instead of using a single token (T_CDATA) for a CDATA tag the lexer now uses 3 tokens: 1. T_CDATA_START 2. T_CDATA_BODY 3. T_CDATA_END The T_CDATA_BODY token can occur multiple times and is turned into a single value in the XML parser. This is similar to the way strings are lexed. By changing the way CDATA tags are lexed Oga can now lex CDATA tags containing newlines when using an IO as input. For example, this would previously fail: Oga.parse_xml(StringIO.new("<![CDATA[\nfoo]]>")) Because IO input reads input per line the input for the lexer would be as following: "<![CDATA[\n" "foo]]>" Related issues: #93	2015-04-14 22:45:55 +02:00
Yorick Peterse	0800654c96	Support lexing of carriage returns Fixes #89.	2015-04-03 00:46:37 +02:00
Yorick Peterse	5802d9d62c	Use RbConfig::CONFIG['CC'] vs 'cc'	2015-03-23 19:46:44 +01:00
Yorick Peterse	2bf5fe3061	Updated extconf.rb for Windows support	2015-03-22 14:03:37 +01:00
Yorick Peterse	7e847a0ae9	Make C90 happy.	2015-03-05 22:57:51 +01:00
Yorick Peterse	33c46a1841	Use ID instead of VALUE for callback names in C.	2015-03-05 22:57:51 +01:00
Yorick Peterse	3b2055a30b	Refactored handling of literal HTML elements. This ensures newlines can appear in <style> / <script> tags when using IOs as input.	2015-03-04 11:44:31 +01:00
Yorick Peterse	78e40b55c0	Handle parsing of HTML <style> tags. This basically re-applies the technique used for HTML <script> tags. With this extra addition I decided to rename/normalize a few things so it's easier to add any extra tags in the future. One downside of this setup is that the following will not be parsed by Oga: <style> </script> </style> The same applies to script tags containing a literal </style> tag. Since this particular case is rather unlikely to occur I'm OK with not supporting it as it _does_ simplify the lexer quite a bit. Fixes #80	2015-03-03 16:28:05 +01:00
Yorick Peterse	ba2177e2cf	Lex contents of <script> tags as plain text. When lexing input in HTML mode the lexer has to treat _all_ content of a <script> tag as plain text. This ensures that the lexer can process input such as "x <y" and "// <foo>" correctly. Fixes #70.	2015-03-02 16:22:09 +01:00
Yorick Peterse	8fdf27dcef	Removed unused C lexer macros.	2015-03-02 15:43:47 +01:00
Yorick Peterse	739f885078	Use ID instead of VALUE for C Symbols. Thanks to @cremno for bringing this up.	2014-11-29 12:53:55 +01:00
Yorick Peterse	b006289c5f	Removed extra space in c/lexer.rl	2014-11-23 22:12:18 +01:00
Yorick Peterse	5e24a3d1e5	Short docs on lexer callback names.	2014-11-23 20:20:14 +01:00
Yorick Peterse	4fa88fcbde	Cache rb_intern/symbol lookups in the lexer. For JRuby this has little to no benefits as it uses strings for method names. However, both MRI and Rubinius will perform a Symbol lookup whenever rb_intern() is called. By doing this once for all callback names and caching the resulting VALUE objects the lexer timings can be reduced by about 25%. In case of the benchmark benchmark/xml/lexer/string_average_bench.rb this means it runs in around 500ms instead of 700ms.	2014-11-22 01:53:37 +01:00
Yorick Peterse	cbb2815146	Support for inline doctype rules plus newlines. This adds support for lexing/parsing XML documents that use an IO as input _and_ contain doctype rules with newlines in them. This fixes #63.	2014-11-18 20:02:55 +01:00
Yorick Peterse	24ae791f00	Better support for lexing multi-line strings. When lexing multi-line strings everything used to work fine as long as the input were to be read as a whole. However, when using an IO instance all hell would break loose. Due to the lexer reading IO instances on a per line basis, sometimes Ragel would end up setting "ts" to NULL. For example, the following input would break the lexer: <foo class="\nbar" /> Due to the input being read per line, the following data would be sent to the lexer: <foo class="\n bar" /> This would result in different (or NULL) pointers being used for building a string, in turn resulting in memory allocation errors. To work around this the string lexing setup has been broken into separate machines for single and double quoted strings. The tokens used have also been changed so that instead of just "T_STRING" there are now the following tokens: * T_STRING_SQUOTE * T_STRING_DQUOTE * T_STRING_BODY A string can have multiple T_STRING_BODY tokens (= multi-line strings, only the case for IO inputs). These strings are stitched back together by the parser. This fixes #58.	2014-10-26 11:39:56 +01:00
Yorick Peterse	fca88a69d1	Track Ragel call stacks in the Java lexer. This will be needed for the upcoming string lexing changes.	2014-10-26 11:39:19 +01:00
Yorick Peterse	d951a8cc87	Track XML C lexer state in C only. Instead of storing "act" and "cs" as an instance variable they (along with some other variables) are now stored in a struct. This struct is attached to a lexer instance using the (crappy) Data_Get_Struct/Data_Wrap_Struct API.	2014-10-26 11:38:06 +01:00
Yorick Peterse	1400a859ce	Make sure C strings always end with a NULL. Haven't bumped into any problems just yet. However, in theory all sorts of evil could happen here. Which is part of the problem of C: so much shit is undefined behaviour that you can take a single step and fall in 15 holes at the same time. In theory, because nobody bothered to actually specify it properly.	2014-09-28 22:28:55 +02:00
Yorick Peterse	8db77c0a09	Count newlines of text nodes in native code. Instead of relying on String#count for counting newlines in text nodes, Oga now does this in C/Java. String#count isn't exactly the fastest way of counting characters. Performance was measured using benchmark/xml/lexer/string_average_bench.rb. Before this patch the results were as following: MRI: 0.529s Rbx: 4.965s JRuby: 0.622s After this patch: MRI: 0.424s Rbx: 1.942s JRuby: 0.665s => numbers vary a bit, seem roughly the same as before The commands used for benchmarking: $ rake clean # to make sure that C exts aren't shared between MRI/Rbx $ rake generate $ rake fixtures $ ruby benchmark/xml/lexer/string_average_bench.rb The big difference for Rbx is probably due to the implementation of String#count not being super fast. Some changes were made (https://github.com/rubinius/rubinius/pull/3133) to the method, but this hasn't been released yet. JRuby seems to perform in a similar way, so either it was already optimizing things for me or I suck at writing well performing Java code. This fixes #51.	2014-09-25 22:49:11 +02:00
Yorick Peterse	00579eaa8a	Changed text action from @{} to %{}. This ensures the action is only run at the end, opposed to any non final state.	2014-09-23 22:58:20 +02:00
Yorick Peterse	ad2e040f05	Handle lexing of input such as just "</". Previously this would cause the lexer to go in an infinite loop in the "text" state machine. This fixes #37.	2014-09-15 17:20:06 +02:00
Yorick Peterse	9b8e9f49c6	Support for lexing empty attribute values. This ensures that Oga can lex the following properly: <input value="" /> Previously Ragel would stop upon finding the empty string. This was caused due to the string rules being declared as following: string_dquote = (dquote ^dquote+ dquote); string_squote = (squote ^squote+ squote); These rules only match strings _with_ content, not without. Since Ragel stops consuming input the moment it finds unhandled data this resulted in incorrect tokens being emitted.	2014-09-03 23:10:50 +02:00
Yorick Peterse	49ddebf358	Tighten lexing of T_TEXT nodes. Thanks to some heavy rubberducking with @whitequark the lexer is now a little bit better at lexing T_TEXT nodes. For example, previously the following could not be lexed properly: "foo < bar" There might still be some tweaking to do but we're getting there.	2014-09-03 00:51:13 +02:00
Yorick Peterse	96b7296910	Ragel variable of element closing tags.	2014-09-02 22:50:21 +02:00
Benjamin Klotz	0b096dfe25	Use proper create_makefile Using create_makefile('liboga/liboga') will compile liboga.so into path-to-gem/lib/liboga/ and therefore require_relative in oga.rb will fail. Therefore the right parameter for create_makefile is 'liboga' -> path-to-gem/lib/liboga.so	2014-09-02 20:27:20 +02:00
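
Commit 9b8e9f49c6 quotes two Ragel rules that only match quoted strings with content. The log does not show the corrected rules, so the following is only a rough Ruby regexp analogue: `STRICT_DQUOTE` mirrors the quoted `dquote ^dquote+ dquote` rule, while `RELAXED_DQUOTE` is an assumed fix that allows zero characters between the quotes. Both constant names and the helper are hypothetical and are not part of Oga.

```ruby
# Ruby regexp analogues of the Ragel string rule quoted in 9b8e9f49c6.
# These are illustrations only, not Oga's actual lexer rules.

# Mirrors `dquote ^dquote+ dquote`: requires at least one character.
STRICT_DQUOTE = /\A"[^"]+"\z/

# Assumed fix: `+` becomes `*`, so an empty string is accepted too.
RELAXED_DQUOTE = /\A"[^"]*"\z/

# Hypothetical helper: does the whole input match the given rule?
def quoted_string?(pattern, input)
  pattern.match?(input)
end

empty     = '""'
non_empty = '"foo"'

puts quoted_string?(STRICT_DQUOTE, empty)      # false
puts quoted_string?(RELAXED_DQUOTE, empty)     # true
puts quoted_string?(STRICT_DQUOTE, non_empty)  # true
puts quoted_string?(RELAXED_DQUOTE, non_empty) # true
```

With the strict rule, `value=""` leaves the lexer with input it cannot match, which is the failure the commit message describes for `<input value="" />`.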


96 Commits
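
The chunked lexing described in 8acc7fc743 (and reused for comments and processing instructions in ea8b4aa92f and b2ea20ba61) can be sketched in plain Ruby. This is a toy model, not Oga's Ragel lexer: the methods `lex_cdata_chunks` and `join_cdata` are hypothetical, tokens are simplified to `[type, value]` pairs, and text outside CDATA sections is simply ignored. Only the token names and the example input come from the commit message.

```ruby
require 'stringio'

# Toy chunked lexer (NOT Oga's implementation). Reads the IO one line at a
# time, as the commit message says IO input is read, and emits one
# T_CDATA_BODY token per line fragment found inside a CDATA section.
def lex_cdata_chunks(io)
  tokens = []
  inside = false

  io.each_line do |line|
    rest = line

    until rest.empty?
      if inside
        # Everything up to "]]>" (or the end of this line) is body text.
        body, closer, rest = rest.partition(']]>')
        tokens << [:T_CDATA_BODY, body] unless body.empty?

        unless closer.empty?
          tokens << [:T_CDATA_END, nil]
          inside = false
        end
      else
        # Skip anything before the opener; this sketch ignores plain text.
        _, opener, rest = rest.partition('<![CDATA[')
        break if opener.empty?

        tokens << [:T_CDATA_START, nil]
        inside = true
      end
    end
  end

  tokens
end

# Stand-in for the parser step that turns the body chunks into one value.
def join_cdata(tokens)
  tokens.select { |type, _| type == :T_CDATA_BODY }.map(&:last).join
end

tokens = lex_cdata_chunks(StringIO.new("<![CDATA[\nfoo]]>"))

p tokens.map(&:first)
p join_cdata(tokens)
```

Because each line yields its own body token, the lexer never needs a pointer into a previous line's buffer; the pieces are only joined afterwards, which is the same idea the log describes for T_STRING_BODY in 24ae791f00.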