core/oga - oga

Commit Graph

Author	SHA1	Message	Date
KitaitiMakoto	977bd594c8	Add support for XPath namespace aliases This fixes https://gitlab.com/yorickpeterse/oga/issues/176	2019-11-29 14:21:45 +00:00
David Cornu	bc87711f9c	Return an Enumerator from each* methods when no block is given	2018-01-29 13:12:42 -05:00
Yorick Peterse	f574197ea6	Ignore nested element start tags This ensures that Oga is able to tokenize input such as the following: <script<script>foo</script> Oga will now treat this as: <script>foo</script> This is based on libxml behaviour, which seems to differ a bit from Chromium which treats the node as a text node. This however would require complex look-ahead logic (as far as I can tell) that I really don't want to implement in Oga. Fixes #186	2017-12-28 16:12:20 +01:00
Yorick Peterse	6f747656b6	Use RSpec 3 expect syntax for tests This should make it a little bit easier for others to contribute.	2017-06-17 13:52:43 +02:00
Yorick Peterse	84c4db3e9f	Clean up changes from PR #174	2017-04-18 12:55:11 +02:00
PikachuEXE	21b5eeec4b	Fix using symbol on Element#attribute always getting nil	2017-04-18 12:51:34 +02:00
Yorick Peterse	673f4a29db	Use HTML5 style closing tags for void elements This ensures that element tags such as <img> tags don't use a closing /> when documents are parsed as HTML documents. Fixes #170	2017-02-10 15:24:41 +01:00
Yorick Peterse	131fba7aed	Doctype inherits from Node This makes it possible to parse documents where a doctype resides in a node, instead of being located at the root. Fixes #169	2017-02-10 15:10:30 +01:00
Yorick Peterse	e0e0687dc2	Generating closing element & Doctype XML This commit fixes two problems: 1. Doctypes introducing too many newlines 2. Elements with siblings and a common parent not being closed properly == Doctypes When generating the XML for a doctype the XML::Generator class would append a trailing newline. This however meant that if the next text node was also a newline you'd now have two newlines. In previous versions of Oga this worked because the old XML generation code would call String#strip on the XML to add after the doctype. To support this in the new version we perform a lookahead in XML::Generator#on_doctype to remove any trailing newlines added by this method in case the first child node is a newline text node. == Closing Elements When an element has a sibling following it _and_ does not have any child nodes it would not be closed properly when generating XML. This is due to the "until next_node = ..." expression evaluating to true, thus never executing its body. There's probably some way to work around this by using the "loop" method, but considering it's 02:09 I think the current approach is good enough. Future me will probably hate me for it.	2016-09-27 02:10:16 +02:00
Yorick Peterse	01fa1513f4	Lexing of processing instructions with namespaces This adds lexing/parsing support for processing instructions that contain namespace prefixes such as `<?foo:bar ?>`.	2016-09-17 14:51:48 +02:00
Yorick Peterse	116b9b0ceb	Make XmlDeclaration a ProcessingInstruction This allows Oga to parse documents that contain an XML declaration at a place other than at the document root. Oga still only assigns the XML declaration to the document whenever it is at the top-level. This matches libxml/XML specification behaviour as far as I can tell.	2016-09-17 14:39:07 +02:00
Scott Wheeler	d40baf0c72	Add aliases for accessing attributes via [] and []= This also fixes accessing attributes via symbol name and tests to ensure that such does not break in the future.	2016-09-14 15:21:46 +03:00
Yorick Peterse	38284278d5	Don't process siblings when reaching a root node When generating XML we should not process the siblings of a root node. Doing so results in invalid XML being returned (due to siblings not being children of the root node). Not processing the siblings in this case also prevents the siblings loop from getting stuck. To explain what's happening, let's assume we're using the following document tree: Document |_ Text |_ Element Now let's say we take the Text node and call "to_xml" on it. When we start the loop we'll run into the following code: if child_node = children && current.children[0] current = child_node else Here the if statement will evaluate to false because a Text node doesn't have any child nodes, as such we enter the else branch. We now reach the following code: until next_node = current.is_a?(Node) && current.next A Text node is a descendant of Node and it happens to have another node (the Element node) as the next sibling. As a result we enter the `until` loop's body. We now run into this code: if current.is_a?(Node) && current != @start current = current.parent end Here `current` is still our Text node and it is the @start node. As a result the `current` re-assignment won't be evaluated. Next we run into the following: after_element(current, output) if current.is_a?(Element) break if current == @start The first line will not evaluate because `current` is still the `Text` node. The `break` will evaluate because `current` is the same as @start. This will then lead to the following code being executed: current = next_node Here `next_node` is the next sibling of the Text node, which in the above example is the Element node. Because all of the above runs in a `while` loop we'll at some point end up again at the start of the `until` loop. At this point the `current` variable contains an `Element`. Because this node does not have a node following it we'll once again enter the `until` loop's body. This loop will now get stuck because `current` is a Node, it's not the same as @start, thus `current` is set to its parent (the Document), which also isn't the same as @start. On the next iteration this loop will break because `current` is no longer a node. However, because a Document _does_ have child nodes the whole process of traversing children/siblings will keep repeating itself forever. To work around this we now use the following statement: if child_node = children && current.children[0] ... elsif current == @start after_element(current, output) if current.is_a?(Element) break else until next_node = current.is_a?(Node) && current.next ... end This prevents processing of any siblings once we have reached the root node, in turn preventing the loop getting stuck forever. I'm willing to bet there are probably a few more edge cases, but I can't think of any others at the moment. Fixes #161	2016-09-10 02:49:05 +02:00
Yorick Peterse	68f1f9f660	Relax parsing of XML doctypes This allows the parser to parse doctypes that contain a mixture of names, public IDs, inline rules, etc. Fixes #159	2016-09-06 22:25:22 +02:00
Yorick Peterse	5a58b14137	Use static variables for Node#previous/#next Instead of calculating the previous/next node on the fly this data is now set automatically whenever a node is stored in a NodeSet with an owner. While this introduces some overhead and complexity when adding or removing nodes from a NodeSet, it greatly reduces the runtime overhead of calling Node#previous or Node#next.	2016-09-04 21:07:35 +02:00
Yorick Peterse	dd138981f6	Generate XML without relying on recursion While using recursion is an easy way of generating XML it can lead to the call stack overflowing when serialising documents with lots of nested nodes. Generally there are two ways of working around this: 1. Use an explicit stack (e.g. an array or a queue of sorts) instead of relying on the call stack. 2. Use an algorithm that doesn't use a stack at all (e.g. Morris traversal). This commit introduces the XML::Generator class which can serialize documents back to XML without using a stack at all. This class takes advantage of XML nodes having access to not only their child nodes, but also their siblings and their parents. All XML serialisation logic now resides in the XML::Generator class. In turn the various "to_xml" methods just use this class and serialize everything starting at "self".	2016-09-04 19:19:00 +02:00
Yorick Peterse	9ac16e2e4f	Fixed index check in Node#next An index can/should never be equal to the length of a NodeSet, thus we should use "<" here instead of "<=".	2016-09-03 23:56:55 +02:00
Erik Michaels-Ober	3a89dcffab	Remove Parser#reset and PullParser#reset	2016-07-13 17:19:42 +02:00
Erik Michaels-Ober	dc30b8b6c1	Remove Lexer#reset method Resolves https://github.com/YorickPeterse/oga/issues/153.	2016-07-13 17:19:42 +02:00
Yorick Peterse	5bfc2d50f2	Preserve entities that can't be decoded Certain entities when decoded will produce a String with an invalid encoding. This commit ensures that instead of raising an EncodingError further down the line (e.g. when calling "inspect" on a document) the entities are preserved as-is. Fixes #143	2016-02-09 19:51:53 +01:00
Yorick Peterse	66fc4b1dfc	Fixed parsing HTML identifiers containing colons HTML identifiers containing colons should be treated in two ways: * For element names the prefix (= the namespace prefix in case of XML) should be ignored as HTML doesn't support/use namespaces. * For attribute names a colon is a valid character, thus "foo:bar:baz" should be treated as a single attribute name. This fixes #142.	2015-12-26 20:28:35 +01:00
Yorick Peterse	07658dadb1	Added Attribute#parent	2015-08-28 16:22:42 +02:00
Yorick Peterse	9899a419b7	Added Attribute#each_ancestor	2015-08-26 22:26:46 +02:00
Yorick Peterse	d408989499	Added expanded_name for Element and Attribute	2015-08-19 20:14:23 +02:00
Yorick Peterse	52741a3b78	Added XML::Node#each_ancestor This method can be used to walk through the ancestor tree of a Node.	2015-08-19 20:14:20 +02:00
Yorick Peterse	94f7f85dc3	Added XML::Document#root_node	2015-08-19 20:14:20 +02:00
Jakub Pawlowicz	6fc3ef425b	Fixes #118 - decoding invalid entities. Previous regular expression was too greedy in terms of matching letters from outside of A-F hex scope, and matching letters when not in hex mode.	2015-06-30 17:56:26 +02:00
Yorick Peterse	565e3da176	Added encoding comment in elements_spec.rb This ensures that older Ruby versions don't poop their pants when running these specs.	2015-06-29 21:09:33 +02:00
Yorick Peterse	dde644cd79	Support for Unicode XML/HTML identifiers Technically HTML only allows for ASCII names but restricting that actually requires more work than just allowing it.	2015-06-29 21:08:01 +02:00
Laurence Lee	139985612b	Lexer test for elements with inline dots.	2015-06-29 20:55:48 +02:00
Tero Tasanen	0b4791b277	Ability to replace a node with another node or string ``` element = Oga::XML::Element.new(:name => 'div') some_node.replace(element) ``` You can also pass a `String` to `replace` and it will be replaced with a `Oga::XML::Text` node ``` some_node.replace('this will replace the current node with a text node') ``` closes #115	2015-06-17 21:27:50 +03:00
Yorick Peterse	074b53c18c	Fix entity encoding of attribute values This ensures that single and double quotes are also encoded, previously they would be left as is. Fixes #113	2015-06-16 22:47:10 +02:00
Yorick Peterse	2c18a51ba9	Support for strict parsing of XML documents Currently this only disables the automatic insertion of closing tags, in the future this may also disable other features if deemed worth the effort. Fixes #107	2015-06-15 23:53:11 +02:00
Yorick Peterse	a76286b973	Support for spaces around attribute equal signs This also takes care of making sure line numbers are incremented properly. Fixes #112	2015-06-08 06:34:49 +02:00
Yorick Peterse	d2523a1082	Support whitespace in element closing tags Fixes #108	2015-05-25 13:41:17 +02:00
Yorick Peterse	f587b49406	Move HTML lexer specs into spec/oga/html/lexer	2015-05-23 09:47:49 +02:00
Yorick Peterse	c97c1b6899	Do not encode single/double quotes as entities By encoding single/double quotes we can potentially break input, so let's stop doing this. This now ensures that this: <foo>a"b</foo> Is actually serialized back into the exact same instead of being serialized into: <foo>a&quot;b</foo>	2015-05-21 11:23:44 +02:00
Yorick Peterse	dc2e31e35b	Added remaining HTML closing specs	2015-05-19 23:41:06 +02:00
Yorick Peterse	2f182a65fe	HTML closing specs for <dd>/<dd> elements	2015-05-19 00:22:04 +02:00
Yorick Peterse	1ba801370f	HTML closing specs for the <li> element	2015-05-18 21:49:36 +02:00
Yorick Peterse	efeb38699a	HTML closing specs for the "body" element	2015-05-18 21:44:00 +02:00
Yorick Peterse	5a74571536	Added HTML head closing specs	2015-05-18 00:32:19 +02:00
Yorick Peterse	2a1c5646f3	Reworked HTML colgroup closing specs	2015-05-18 00:32:09 +02:00
Yorick Peterse	81cf7ba9b6	Reworked HTML caption closing specs	2015-05-18 00:32:01 +02:00
Yorick Peterse	541fb2d5c3	Removed generated HTML closing specs	2015-05-18 00:31:48 +02:00
Yorick Peterse	1c095ddaff	Added more HTML closing rules for colgroup/caption	2015-05-12 23:14:48 +02:00
Yorick Peterse	1e0b7feb02	Recursively closing of parent HTML elements When closing certain HTML elements the lexer should also close whatever parent elements remain. For example, consider the following HTML: <table> <thead> <tr> <th>Foo <th>Bar <tbody> ... </tbody> </table> Here the "<tbody>" element shouldn't only close the "<th>Bar" element but also the parent "<tr>" and "<thead>" elements. This ensures we'd end up with the following HTML: <table> <thead> <tr> <th>Foo</th> <th>Bar</th> </tr> </thead> <tbody> ... </tbody> </table> Instead of garbage along the lines of this: <table> <thead> <tr> <th>Foo</th> <th>Bar</th> <tbody> ... </tbody> </table></tr></thead> Fixes #99 (hopefully for good this time)	2015-05-12 00:35:00 +02:00
Yorick Peterse	4b1c296936	Automatically closing of certain HTML tags This ensures that HTML such as this: <li>foo <li>bar is parsed as this: <li>foo</li> <li>bar</li> and not as this: <li> foo <li>bar</li> </li> Fixes #97	2015-04-27 18:43:26 +02:00
Yorick Peterse	8135074a62	Merged on_element_start with on_element_name This makes it easier to automatically insert preceding tokens when starting a new element as we now have access to the name. Previously on_element_start would be invoked first which doesn't receive an argument.	2015-04-21 23:38:06 +02:00
Yorick Peterse	853d804f34	Decoding of zero padded XML entities This would previously fail due to the lack of an explicit base to use for Integer().	2015-04-20 00:13:15 +02:00
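
The last entry above (853d804f34) attributes the failure on zero padded entities to Integer() being called without an explicit base. The following is a minimal sketch of that behaviour, not Oga's actual code: `decode_numeric_entity` is a hypothetical helper invented here for illustration, and it skips validation of the resulting codepoint. Without a base, Ruby's Integer() treats a leading zero as an octal prefix, so "010" becomes 8 and "038" raises ArgumentError.

```ruby
# Hypothetical helper for illustration only; this is not Oga's implementation.
# Decodes a single numeric entity such as "&#038;" or "&#x026;" into a String.
def decode_numeric_entity(entity)
  match = entity.match(/\A&#(?:x([0-9a-fA-F]+)|([0-9]+));\z/)

  # Anything that is not a numeric entity is returned unchanged.
  return entity unless match

  hex, decimal = match[1], match[2]

  # Passing the base explicitly stops Integer() from guessing it from a
  # leading "0", which is what breaks zero padded input.
  codepoint = hex ? Integer(hex, 16) : Integer(decimal, 10)

  [codepoint].pack('U')
end

decode_numeric_entity('&#038;')  # zero padded decimal entity
decode_numeric_entity('&#x026;') # zero padded hexadecimal entity
```

Both calls decode to an ampersand, whereas a bare `Integer('038')` would raise.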


205 Commits
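
Commits dd138981f6 and 38284278d5 in the list above describe serialising a tree without recursion or an explicit stack, and stopping at the start node so that its siblings are never visited. The sketch below illustrates that idea under stated assumptions: `SketchNode` and `serialize` are hypothetical names made up for this example, the node type only carries a name, and none of this is the code of XML::Generator.

```ruby
# Hypothetical minimal node type for illustration; these are not Oga's classes.
class SketchNode
  attr_reader :name, :children
  attr_accessor :parent, :next_sibling

  def initialize(name, children = [])
    @name = name
    @children = children

    # Link every child to its parent and to the sibling that follows it.
    children.each_with_index do |child, index|
      child.parent = self
      child.next_sibling = children[index + 1]
    end
  end
end

# Serialises `start` and its descendants using only parent/sibling/child
# links, so no call stack or explicit stack grows with the nesting depth.
def serialize(start)
  output = +''
  current = start

  while current
    output << "<#{current.name}>"

    # Descend first whenever there is a child node.
    if current.children.first
      current = current.children.first
      next
    end

    # No children: close nodes while climbing until a sibling is found.
    loop do
      output << "</#{current.name}>"

      if current.equal?(start)
        # Back at the start node: stop without touching its siblings.
        current = nil
        break
      elsif current.next_sibling
        current = current.next_sibling
        break
      else
        current = current.parent
      end
    end
  end

  output
end

inner = SketchNode.new('b', [SketchNode.new('c')])
root = SketchNode.new('a', [inner, SketchNode.new('d')])

serialize(root)  # => "<a><b><c></c></b><d></d></a>"
serialize(inner) # => "<b><c></c></b>"
```

The second call starts at a node that has a following sibling. The check against `start` comes before the sibling check, so the traversal ends there instead of continuing into `d`, which is the situation the 38284278d5 message walks through.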