Commit Graph

205 Commits

Author SHA1 Message Date
KitaitiMakoto 977bd594c8 Add support for XPath namespace aliases
This fixes https://gitlab.com/yorickpeterse/oga/issues/176
2019-11-29 14:21:45 +00:00
David Cornu bc87711f9c Return an Enumerator from each* methods when no block is given 2018-01-29 13:12:42 -05:00
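The "return an Enumerator when no block is given" change above follows a common Ruby idiom. A minimal sketch, assuming a hypothetical `SketchNode` class and `each_node` method rather than Oga's own code:

```ruby
# Hypothetical node class, not Oga's own; only the block_given? guard matters.
class SketchNode
  attr_reader :name, :children

  def initialize(name, children = [])
    @name = name
    @children = children
  end

  # Yields every descendant in document order, or returns an Enumerator
  # when called without a block.
  def each_node(&block)
    return to_enum(:each_node) unless block_given?

    children.each do |child|
      yield child
      child.each_node(&block)
    end
  end
end

tree = SketchNode.new('root', [
  SketchNode.new('a', [SketchNode.new('b')]),
  SketchNode.new('c')
])

# Without a block the caller gets an Enumerator and can chain on it.
names = tree.each_node.map(&:name)
```

Returning an Enumerator lets callers use `map`, `select` and friends without first collecting the nodes into an array.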
Yorick Peterse f574197ea6
Ignore nested element start tags
This ensures that Oga is able to tokenize input such as the following:

    <script<script>foo</script>

Oga will now treat this as:

    <script>foo</script>

This is based on libxml behaviour, which seems to differ a bit from
Chromium which treats the node as a text node. This however would
require complex look-ahead logic (as far as I can tell) that I really
don't want to implement in Oga.

Fixes #186
2017-12-28 16:12:20 +01:00
Yorick Peterse 6f747656b6
Use RSpec 3 expect syntax for tests
This should make it a little bit easier for others to contribute.
2017-06-17 13:52:43 +02:00
Yorick Peterse 84c4db3e9f
Clean up changes from PR #174 2017-04-18 12:55:11 +02:00
PikachuEXE 21b5eeec4b Fix using symbol on Element#attribute always getting nil 2017-04-18 12:51:34 +02:00
Yorick Peterse 673f4a29db
Use HTML5 style closing tags for void elements
This ensures that element tags such as <img> tags don't use a closing />
when documents are parsed as HTML documents.

Fixes #170
2017-02-10 15:24:41 +01:00
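The void element change above comes down to choosing a tag style based on the document type. A rough sketch, assuming a hypothetical `void_element_tag` helper and a deliberately partial element list (neither is taken from Oga):

```ruby
# Partial, hypothetical list; the HTML specification defines more void elements.
VOID_ELEMENTS = %w[br hr img input link meta].freeze

# Returns the tag for a void element: HTML5 style for HTML documents,
# self-closing style otherwise.
def void_element_tag(name, html:)
  raise ArgumentError, "#{name} is not a void element" unless VOID_ELEMENTS.include?(name)

  html ? "<#{name}>" : "<#{name} />"
end

void_element_tag('img', html: true)
void_element_tag('img', html: false)
```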
Yorick Peterse 131fba7aed
Doctype inherits from Node
This makes it possible to parse documents where a doctype resides in a
node, instead of being located at the root.

Fixes #169
2017-02-10 15:10:30 +01:00
Yorick Peterse e0e0687dc2
Generating closing element & Doctype XML
This commit fixes two problems:

1. Doctypes introducing too many newlines
2. Elements with siblings and a common parent not being closed properly

== Doctypes

When generating the XML for a doctype the XML::Generator class would
append a trailing newline. This however meant that if the next text node
was also a newline you'd now have two newlines. In previous versions of
Oga this worked because the old XML generation code would call
String#strip on the XML to add after the doctype.

To support this in the new version we perform a lookahead in
XML::Generator#on_doctype to remove any trailing newlines added by this
method in case the first child node is a newline text node.

== Closing Elements

When an element has a sibling following it _and_ does not have any child
nodes it would not be closed properly when generating XML. This is due
to the "until next_node = ..." expression evaluating to true, thus never
executing its body.

There's probably some way to work around this by using the "loop"
method, but considering it's 02:09 I think the current approach is good
enough. Future me will probably hate me for it.
2016-09-27 02:10:16 +02:00
Yorick Peterse 01fa1513f4
Lexing of processing instructions with namespaces
This adds lexing/parsing support for processing instructions that
contain namespace prefixes such as `<?foo:bar ?>`.
2016-09-17 14:51:48 +02:00
Yorick Peterse 116b9b0ceb
Make XmlDeclaration a ProcessingInstruction
This allows Oga to parse documents that contain an XML declaration at a
place other than at the document root. Oga still only assigns the XML
declaration to the document whenever it is at the top-level. This
matches libxml/XML specification behaviour as far as I can tell.
2016-09-17 14:39:07 +02:00
Scott Wheeler d40baf0c72 Add aliases for accessing attributes via [] and []=
This also fixes accessing attributes via symbol name and tests to ensure
that such does not break in the future.
2016-09-14 15:21:46 +03:00
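Aliases like the ones described in the commit above can be declared with `alias_method`. A minimal sketch, assuming hypothetical `get` and `set` accessors on a stand-in class; normalising the name with `to_s` is one way to make symbol and string lookups agree:

```ruby
# Hypothetical stand-in for an element; the accessor names are assumptions.
class SketchElement
  def initialize
    @attributes = {}
  end

  # Strings and symbols resolve to the same attribute.
  def get(name)
    @attributes[name.to_s]
  end

  def set(name, value)
    @attributes[name.to_s] = value
  end

  alias_method :[], :get
  alias_method :[]=, :set
end

element = SketchElement.new
element[:class] = 'header'
element['class']
```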
Yorick Peterse 38284278d5
Don't process siblings when reaching a root node
When generating XML we should not process the siblings of a root node.
Doing so results in invalid XML being returned (due to siblings not
being children of the root node).

Not processing the siblings in this case also prevents the siblings loop
from getting stuck. To explain what's happening, let's assume we're
using the following document tree:

    Document
      |_ Text
      |_ Element

Now let's say we take the Text node and call "to_xml" on it. When we
start the loop we'll run into the following code:

    if child_node = children && current.children[0]
      current = child_node
    else

Here the if statement will evaluate to false because a Text node doesn't
have any child nodes, as such we enter the else branch. We now reach the
following code:

    until next_node = current.is_a?(Node) && current.next

A Text node is a descendant of Node and it happens to have another node
(the Element node) as the next sibling. As a result we enter the `until`
loop's body. We now run into this code:

    if current.is_a?(Node) && current != @start
      current = current.parent
    end

Here `current` is still our Text node and it is the @start node. As a
result the `current` re-assignment won't be evaluated.

Next we run into the following:

    after_element(current, output) if current.is_a?(Element)

    break if current == @start

The first line will not evaluate because `current` is still the `Text`
node. The `break` *will* evaluate because `current` is the same as
@start.

This will then lead to the following code being executed:

    current = next_node

Here `next_node` is the next sibling of the Text node, which in the
above example is the Element node.

Because all of the above runs in a `while` loop we'll at some point end
up again at the start of the `until` loop. At this point the `current`
variable contains an `Element`. Because this node does *not* have a node
following it we'll once again enter the `until` loop's body.

This loop will now get stuck because `current` is a Node, it's not the
same as @start, thus `current` is set to its parent (the Document),
which also isn't the same as @start.

On the next iteration this loop will break because `current` is no
longer a node. However, because a Document _does_ have child nodes the
whole process of traversing children/siblings will keep repeating itself
forever.

To work around this we now use the following statement:

    if child_node = children && current.children[0]
      ...
    elsif current == @start
      after_element(current, output) if current.is_a?(Element)

      break
    else
      until next_node = current.is_a?(Node) && current.next
      ...
    end

This prevents processing of any siblings once we have reached the root
node, in turn preventing the loop getting stuck forever.

I'm willing to bet there are probably a few more edge cases, but I can't
think of any others at the moment.

Fixes #161
2016-09-10 02:49:05 +02:00
Yorick Peterse 68f1f9f660
Relax parsing of XML doctypes
This allows the parser to parse doctypes that contain a mixture of
names, public IDs, inline rules, etc.

Fixes #159
2016-09-06 22:25:22 +02:00
Yorick Peterse 5a58b14137
Use static variables for Node#previous/#next
Instead of calculating the previous/next node on the fly this data is
now set automatically whenever a node is stored in a NodeSet with an
owner. While this introduces some overhead and complexity when adding or
removing nodes from a NodeSet, it greatly reduces the runtime overhead
of calling Node#previous or Node#next.
2016-09-04 21:07:35 +02:00
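The trade-off described in the commit above (more work when adding or removing nodes, cheap lookups afterwards) can be sketched as follows. `LinkedNode` and `OwnedSet` are hypothetical stand-ins, not Oga's Node and NodeSet:

```ruby
# Hypothetical stand-ins for a node and an owned node set.
class LinkedNode
  attr_accessor :previous, :next
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

class OwnedSet
  def initialize
    @nodes = []
  end

  # Appending pays a small cost to wire up the links once.
  def push(node)
    last = @nodes.last
    node.previous = last
    node.next = nil
    last.next = node if last
    @nodes << node
    self
  end

  # Removal has to re-link the neighbours.
  def delete(node)
    return unless @nodes.delete(node)

    node.previous.next = node.next if node.previous
    node.next.previous = node.previous if node.next
    node.previous = node.next = nil
    node
  end
end

set = OwnedSet.new
first = LinkedNode.new('first')
second = LinkedNode.new('second')
set.push(first).push(second)
first.next # a plain attribute read, no index lookup
```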
Yorick Peterse dd138981f6
Generate XML without relying on recursion
While using recursion is an easy way of generating XML it can lead to
the call stack overflowing when serialising documents with lots of
nested nodes.

Generally there are two ways of working around this:

1. Use an explicit stack (e.g. an array or a queue of sorts) instead of
   relying on the call stack.
2. Use an algorithm that doesn't use a stack at all (e.g. Morris
   traversal).

This commit introduces the XML::Generator class which can serialize
documents back to XML without using a stack at all. This class takes
advantage of XML nodes having access to not only their child nodes, but
also their siblings and their parents.

All XML serialisation logic now resides in the XML::Generator class. In
turn the various "to_xml" methods just use this class and serialize
everything starting at "self".
2016-09-04 19:19:00 +02:00
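The stackless approach from the commit above can be illustrated with a small sketch. `TreeNode` and `serialize` are hypothetical and far simpler than XML::Generator; the point is only that child, sibling and parent links are enough to walk the tree without recursion or an explicit stack:

```ruby
# Hypothetical minimal node; Oga's real classes are richer than this.
class TreeNode
  attr_reader :name, :children
  attr_accessor :parent, :next_sibling

  def initialize(name, children = [])
    @name = name
    @children = children
    children.each_with_index do |child, index|
      child.parent = self
      child.next_sibling = children[index + 1]
    end
  end
end

# Serialises the subtree rooted at `start` using only child, sibling and
# parent links: no recursion and no explicit stack.
def serialize(start)
  output = +''
  current = start

  loop do
    output << "<#{current.name}>"

    if (child = current.children[0])
      current = child
      next
    end

    # Leaf: close it, then climb until a node with a next sibling is found.
    loop do
      output << "</#{current.name}>"
      return output if current.equal?(start)

      if current.next_sibling
        current = current.next_sibling
        break
      end

      current = current.parent
    end
  end
end

tree = TreeNode.new('root', [TreeNode.new('a', [TreeNode.new('b')]), TreeNode.new('c')])
puts serialize(tree) # <root><a><b></b></a><c></c></root>
```

Because the walk stops as soon as it closes the start node, serialising a node in the middle of a tree never touches that node's siblings.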
Yorick Peterse 9ac16e2e4f
Fixed index check in Node#next
An index can/should never be equal to the length of a NodeSet, thus we
should use "<" here instead of "<=".
2016-09-03 23:56:55 +02:00
Erik Michaels-Ober 3a89dcffab
Remove Parser#reset and PullParser#reset 2016-07-13 17:19:42 +02:00
Erik Michaels-Ober dc30b8b6c1
Remove Lexer#reset method
Resolves https://github.com/YorickPeterse/oga/issues/153.
2016-07-13 17:19:42 +02:00
Yorick Peterse 5bfc2d50f2 Preserve entities that can't be decoded
Certain entities when decoded will produce a String with an invalid
encoding. This commit ensures that instead of raising an EncodingError
further down the line (e.g. when calling "inspect" on a document) the
entities are preserved as-is.

Fixes #143
2016-02-09 19:51:53 +01:00
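One way to preserve undecodable entities, as the commit above describes, is to check the decoded string's encoding before using it. A sketch under that assumption; `decode_or_preserve` is a hypothetical helper that only handles decimal entities:

```ruby
# Decodes a single decimal entity, or returns it untouched when the
# result would not be a valid string (e.g. a lone UTF-16 surrogate).
def decode_or_preserve(entity)
  match = /\A&#(\d+);\z/.match(entity)
  return entity unless match

  decoded = [Integer(match[1], 10)].pack('U')
  decoded.valid_encoding? ? decoded : entity
end

decode_or_preserve('&#38;')    # decodable, so the character is returned
decode_or_preserve('&#55357;') # a surrogate code point, kept as-is
```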
Yorick Peterse 66fc4b1dfc Fixed parsing HTML identifiers containing colons
HTML identifiers containing colons should be treated in two ways:

* For element names the prefix (= the namespace prefix in case of XML)
  should be ignored as HTML doesn't support/use namespaces.
* For attribute names a colon is a valid character, thus "foo:bar:baz"
  should be treated as a single attribute name.

This fixes #142.
2015-12-26 20:28:35 +01:00
Yorick Peterse 07658dadb1 Added Attribute#parent 2015-08-28 16:22:42 +02:00
Yorick Peterse 9899a419b7 Added Attribute#each_ancestor 2015-08-26 22:26:46 +02:00
Yorick Peterse d408989499 Added expanded_name for Element and Attribute 2015-08-19 20:14:23 +02:00
Yorick Peterse 52741a3b78 Added XML::Node#each_ancestor
This method can be used to walk through the ancestor tree of a Node.
2015-08-19 20:14:20 +02:00
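Walking the ancestor tree amounts to following parent links until none is left. A minimal sketch with a hypothetical `AncestorNode` class; returning an Enumerator when no block is given is an extra convenience in this sketch, not something this commit states:

```ruby
# Hypothetical node with only a name and a parent link.
class AncestorNode
  attr_reader :name, :parent

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
  end

  # Yields the parent, then the parent's parent, and so on up to the root.
  def each_ancestor
    return to_enum(:each_ancestor) unless block_given?

    node = parent
    while node
      yield node
      node = node.parent
    end
  end
end

root = AncestorNode.new('root')
middle = AncestorNode.new('middle', root)
leaf = AncestorNode.new('leaf', middle)
leaf.each_ancestor { |ancestor| puts ancestor.name }
```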
Yorick Peterse 94f7f85dc3 Added XML::Document#root_node 2015-08-19 20:14:20 +02:00
Jakub Pawlowicz 6fc3ef425b Fixes #118 - decoding invalid entities.
The previous regular expression was too greedy: it matched letters
outside the A-F hex range, and it matched letters when not in hex mode.
2015-06-30 17:56:26 +02:00
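A pattern that avoids both problems keeps the decimal and hex forms separate and restricts letters to the hex branch. The following is a sketch, not the expression used in Oga; `NUMERIC_ENTITY` and `decode_numeric_entities` are hypothetical names:

```ruby
# Letters are only allowed after the "x" marker, and only A-F there.
NUMERIC_ENTITY = /&#(?:x([0-9a-fA-F]+)|([0-9]+));/

# Decodes numeric entities in a single pass; anything that does not match
# the pattern is left untouched.
def decode_numeric_entities(input)
  input.gsub(NUMERIC_ENTITY) do
    hex = Regexp.last_match(1)
    decimal = Regexp.last_match(2)
    code = hex ? Integer(hex, 16) : Integer(decimal, 10)
    [code].pack('U')
  end
end

decode_numeric_entities('&#x3C;p&#62;') # both entities decode
decode_numeric_entities('&#xZZ; &#3C;') # neither matches, so both stay
```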
Yorick Peterse 565e3da176 Added encoding comment in elements_spec.rb
This ensures that older Ruby versions don't poop their pants when
running these specs.
2015-06-29 21:09:33 +02:00
Yorick Peterse dde644cd79 Support for Unicode XML/HTML identifiers
Technically HTML only allows for ASCII names but restricting that
actually requires more work than just allowing it.
2015-06-29 21:08:01 +02:00
Laurence Lee 139985612b Lexer test for elements with inline dots. 2015-06-29 20:55:48 +02:00
Tero Tasanen 0b4791b277 Ability to replace a node with another node or string
```
element = Oga::XML::Element.new(:name => 'div')
some_node.replace(element)
```

You can also pass a `String` to `replace`; the current node is then
replaced with an `Oga::XML::Text` node containing that string.

```
some_node.replace('this will replace the current node with a text node')
```

closes #115
2015-06-17 21:27:50 +03:00
Yorick Peterse 074b53c18c Fix entity encoding of attribute values
This ensures that single and double quotes are also encoded, previously
they would be left as is.

Fixes #113
2015-06-16 22:47:10 +02:00
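Encoding quotes only inside attribute values can be sketched with two escape tables. The helper names and the exact entity names below are assumptions for illustration, not Oga's implementation:

```ruby
# Text content: quotes are left alone.
TEXT_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }.freeze

# Attribute values: quotes are encoded as well, since they delimit the value.
ATTRIBUTE_ESCAPES = TEXT_ESCAPES.merge('"' => '&quot;', "'" => '&apos;').freeze

def encode_text(value)
  value.gsub(/[&<>]/, TEXT_ESCAPES)
end

def encode_attribute(value)
  value.gsub(/[&<>"']/, ATTRIBUTE_ESCAPES)
end

encode_text('a"b')      # the quote survives
encode_attribute('a"b') # the quote is encoded
```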
Yorick Peterse 2c18a51ba9 Support for strict parsing of XML documents
Currently this only disables the automatic insertion of closing tags; in
the future this may also disable other features if deemed worth the
effort.

Fixes #107
2015-06-15 23:53:11 +02:00
Yorick Peterse a76286b973 Support for spaces around attribute equal signs
This also takes care of making sure line numbers are incremented
properly.

Fixes #112
2015-06-08 06:34:49 +02:00
Yorick Peterse d2523a1082 Support whitespace in element closing tags
Fixes #108
2015-05-25 13:41:17 +02:00
Yorick Peterse f587b49406 Move HTML lexer specs into spec/oga/html/lexer 2015-05-23 09:47:49 +02:00
Yorick Peterse c97c1b6899 Do not encode single/double quotes as entities
By encoding single/double quotes we can potentially break input, so let's
stop doing this. This now ensures that this:

    <foo>a"b</foo>

is actually serialized back into the exact same string instead of being
serialized into:

    <foo>a&quot;b</foo>
2015-05-21 11:23:44 +02:00
Yorick Peterse dc2e31e35b Added remaining HTML closing specs 2015-05-19 23:41:06 +02:00
Yorick Peterse 2f182a65fe HTML closing specs for <dt>/<dd> elements 2015-05-19 00:22:04 +02:00
Yorick Peterse 1ba801370f HTML closing specs for the <li> element 2015-05-18 21:49:36 +02:00
Yorick Peterse efeb38699a HTML closing specs for the "body" element 2015-05-18 21:44:00 +02:00
Yorick Peterse 5a74571536 Added HTML head closing specs 2015-05-18 00:32:19 +02:00
Yorick Peterse 2a1c5646f3 Reworked HTML colgroup closing specs 2015-05-18 00:32:09 +02:00
Yorick Peterse 81cf7ba9b6 Reworked HTML caption closing specs 2015-05-18 00:32:01 +02:00
Yorick Peterse 541fb2d5c3 Removed generated HTML closing specs 2015-05-18 00:31:48 +02:00
Yorick Peterse 1c095ddaff Added more HTML closing rules for colgroup/caption 2015-05-12 23:14:48 +02:00
Yorick Peterse 1e0b7feb02 Recursive closing of parent HTML elements
When closing certain HTML elements the lexer should also close whatever
parent elements remain. For example, consider the following HTML:

    <table>
        <thead>
            <tr>
                <th>Foo
                <th>Bar
        <tbody>
            ...
        </tbody>
    </table>

Here the "<tbody>" element shouldn't only close the "<th>Bar" element
but also the parent "<tr>" and "<thead>" elements. This ensures we'd end
up with the following HTML:

    <table>
        <thead>
            <tr>
                <th>Foo</th>
                <th>Bar</th>
            </tr>
        </thead>
        <tbody>
            ...
        </tbody>
    </table>

Instead of garbage along the lines of this:

    <table>
        <thead>
            <tr>
                <th>Foo</th>
                <th>Bar</th>
        <tbody>
            ...
        </tbody>
    </table></tr></thead>

Fixes #99 (hopefully for good this time)
2015-05-12 00:35:00 +02:00
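The recursive closing described in the commit above can be pictured as popping every closable element off a stack of open elements before pushing the new one. A sketch with a hypothetical rule table covering only the `<tbody>` case from the example:

```ruby
# Hypothetical table: opening the key closes every listed element that is
# currently open, innermost first.
CLOSES_OPEN = { 'tbody' => %w[th td tr thead] }.freeze

# Pushes `name` on the stack of open elements and returns the names of
# the elements that had to be closed first.
def open_element(stack, name)
  closable = CLOSES_OPEN.fetch(name, [])
  closed = []
  closed << stack.pop while closable.include?(stack.last)
  stack.push(name)
  closed
end

stack = %w[table thead tr th]
closed = open_element(stack, 'tbody')
# closed now lists th, tr and thead; stack holds table and tbody.
```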
Yorick Peterse 4b1c296936 Automatic closing of certain HTML tags
This ensures that HTML such as this:

    <li>foo
    <li>bar

is parsed as this:

    <li>foo</li>
    <li>bar</li>

and not as this:

    <li>
        foo
        <li>bar</li>
    </li>

Fixes #97
2015-04-27 18:43:26 +02:00
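The `<li>` behaviour from the commit above can be sketched as a pass over a token stream that inserts the missing closing tokens. The token format, rule table and `auto_close` helper are all hypothetical and much simpler than a real lexer:

```ruby
# Hypothetical rule table: opening the key implicitly closes the listed
# element when it is the innermost open one. Only <li> is covered.
IMPLICITLY_CLOSES = { 'li' => ['li'] }.freeze

# Accepts [:open, name], [:close, name] and [:text, value] tokens and
# returns a new list with the missing [:close, name] tokens inserted.
def auto_close(tokens)
  open_names = []
  out = []

  tokens.each do |type, value|
    if type == :open
      if IMPLICITLY_CLOSES.fetch(value, []).include?(open_names.last)
        out << [:close, open_names.pop]
      end
      open_names << value
    elsif type == :close
      open_names.pop
    end

    out << [type, value]
  end

  # Close whatever is still open at the end of the input.
  out.concat(open_names.reverse.map { |name| [:close, name] })
end

auto_close([[:open, 'li'], [:text, 'foo'], [:open, 'li'], [:text, 'bar']])
```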
Yorick Peterse 8135074a62 Merged on_element_start with on_element_name
This makes it easier to automatically insert preceding tokens when
starting a new element as we now have access to the name. Previously
on_element_start would be invoked first which doesn't receive an
argument.
2015-04-21 23:38:06 +02:00
Yorick Peterse 853d804f34 Decoding of zero padded XML entities
This would previously fail due to the lack of an explicit base to use
for Integer().
2015-04-20 00:13:15 +02:00
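The zero padding problem above stems from `Integer()` inferring the base from the string's prefix when none is given, so a leading zero selects octal. A short sketch; `decode_decimal_entity` is a hypothetical helper:

```ruby
# With an explicit base of 10 a leading zero is just padding.
def decode_decimal_entity(digits)
  [Integer(digits, 10)].pack('U')
end

decode_decimal_entity('038') # decimal 38, an ampersand

# Without the base, the same input is read as octal and rejected,
# because 8 is not an octal digit.
begin
  Integer('038')
rescue ArgumentError => error
  puts error.message
end
```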