1192 Commits

Author SHA1 Message Date
Yorick Peterse 01fa1513f4
Lexing of processing instructions with namespaces
This adds lexing/parsing support for processing instructions that
contain namespace prefixes such as `<?foo:bar ?>`.
2016-09-17 14:51:48 +02:00
Yorick Peterse 116b9b0ceb
Make XmlDeclaration a ProcessingInstruction
This allows Oga to parse documents that contain an XML declaration at a
place other than at the document root. Oga still only assigns the XML
declaration to the document whenever it is at the top-level. This
matches libxml/XML specification behaviour as far as I can tell.
2016-09-17 14:39:07 +02:00
Scott Wheeler d40baf0c72 Add aliases for accessing attributes via [] and []=
This also fixes accessing attributes via symbol name and adds tests to
ensure that this does not break in the future.
2016-09-14 15:21:46 +03:00
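A minimal sketch of the aliasing idea in the commit above, assuming a hypothetical `SketchElement` class that stores attributes in a Hash keyed by String. The class and its storage are illustrative only and are not Oga's implementation; the point is that `[]` and `[]=` can be plain aliases of a getter and setter that normalise Symbol names.

```ruby
# Hypothetical stand-in for an element class; not Oga's implementation.
class SketchElement
  def initialize
    @attributes = {}
  end

  # Returns the attribute value, accepting a String or a Symbol name.
  def get(name)
    @attributes[name.to_s]
  end

  # Stores the attribute value under the stringified name.
  def set(name, value)
    @attributes[name.to_s] = value
  end

  alias_method :[], :get
  alias_method :[]=, :set
end

element = SketchElement.new
element[:class] = 'header' # Symbol name, via the []= alias
element['id'] = 'top'      # String name, via the []= alias
```

Because both aliases go through the same normalising methods, `element[:class]` and `element['class']` read the same value.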
Yorick Peterse b8fd8670df
Release 2.6 2016-09-10 02:50:02 +02:00
Yorick Peterse 38284278d5
Don't process siblings when reaching a root node
When generating XML we should not process the siblings of a root node.
Doing so results in invalid XML being returned (due to siblings not
being children of the root node).

Not processing the siblings in this case also prevents the siblings loop
from getting stuck. To explain what's happening, let's assume we're
using the following document tree:

    Document
      |_ Text
      |_ Element

Now let's say we take the Text node and call "to_xml" on it. When we
start the loop we'll run into the following code:

    if child_node = children && current.children[0]
      current = child_node
    else

Here the if statement will evaluate to false because a Text node doesn't
have any child nodes, so we enter the else branch. We now reach the
following code:

    until next_node = current.is_a?(Node) && current.next

A Text node is a descendant of Node and it happens to have another node
(the Element node) as the next sibling. As a result we enter the `until`
loop's body. We now run into this code:

    if current.is_a?(Node) && current != @start
      current = current.parent
    end

Here `current` is still our Text node and it is the @start node. As a
result the `current` re-assignment won't be evaluated.

Next we run into the following:

    after_element(current, output) if current.is_a?(Element)

    break if current == @start

The first line will not evaluate because `current` is still the `Text`
node. The `break` *will* evaluate because `current` is the same as
@start.

This will then lead to the following code being executed:

    current = next_node

Here `next_node` is the next sibling of the Text node, which in the
above example is the Element node.

Because all of the above runs in a `while` loop we'll at some point end
up again at the start of the `until` loop. At this point the `current`
variable contains an `Element`. Because this node does *not* have a node
following it we'll once again enter the `until` loop's body.

This loop will now get stuck: `current` is a Node and is not the same as
@start, so `current` is set to its parent (the Document), which also
isn't the same as @start.

On the next iteration this loop will break because `current` is no
longer a node. However, because a Document _does_ have child nodes the
whole process of traversing children/siblings will keep repeating itself
forever.

To work around this we now use the following statement:

    if child_node = children && current.children[0]
      ...
    elsif current == @start
      after_element(current, output) if current.is_a?(Element)

      break
    else
      until next_node = current.is_a?(Node) && current.next
      ...
    end

This prevents processing of any siblings once we have reached the root
node, in turn preventing the loop getting stuck forever.

I'm willing to bet there are probably a few more edge cases, but I can't
think of any others at the moment.

Fixes #161
2016-09-10 02:49:05 +02:00
Yorick Peterse 7aa34fd192
Removed max width from the YARD output
YARD apparently switched layout to something that doesn't like this, so
let's just get rid of it.
2016-09-06 22:42:57 +02:00
Yorick Peterse 7a8220ae78
Remove unnecessary use of Object#send 2016-09-06 22:37:30 +02:00
Yorick Peterse a6cd19933d
Fixed some YARD markup in XML::Generator 2016-09-06 22:32:10 +02:00
Yorick Peterse dd554f31e7
Release 2.5 2016-09-06 22:30:50 +02:00
Yorick Peterse 68f1f9f660
Relax parsing of XML doctypes
This allows the parser to parse doctypes that contain a mixture of
names, public IDs, inline rules, etc.

Fixes #159
2016-09-06 22:25:22 +02:00
Yorick Peterse e58a41f711
Release 2.4 2016-09-04 21:16:11 +02:00
Yorick Peterse 5a58b14137
Use static variables for Node#previous/#next
Instead of calculating the previous/next node on the fly this data is
now set automatically whenever a node is stored in a NodeSet with an
owner. While this introduces some overhead and complexity when adding or
removing nodes from a NodeSet, it greatly reduces the runtime overhead
of calling Node#previous or Node#next.
2016-09-04 21:07:35 +02:00
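The trade-off described in the commit above can be sketched as follows, assuming hypothetical `SketchNode` and `SketchNodeSet` classes (Oga's own NodeSet and Node differ in detail). The set pays a small cost on every insert and removal to keep the links current, so reading `previous` or `next` later is a plain attribute lookup.

```ruby
# Hypothetical classes for illustration; not Oga's NodeSet or Node.
class SketchNode
  attr_accessor :previous, :next
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

class SketchNodeSet
  def initialize
    @nodes = []
  end

  # Appends a node and links it to the current last node.
  def push(node)
    last = @nodes.last
    node.previous = last
    node.next = nil
    last.next = node if last
    @nodes << node
    self
  end

  # Removes a node and re-links its neighbours around the gap.
  def delete(node)
    return nil unless @nodes.delete(node)

    node.previous.next = node.next if node.previous
    node.next.previous = node.previous if node.next
    node.previous = nil
    node.next = nil
    node
  end
end

set = SketchNodeSet.new
a, b, c = %w[a b c].map { |name| SketchNode.new(name) }
set.push(a).push(b).push(c)
set.delete(b) # a and c are now direct neighbours
```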
Yorick Peterse dd138981f6
Generate XML without relying on recursion
While using recursion is an easy way of generating XML it can lead to
the call stack overflowing when serialising documents with lots of
nested nodes.

Generally there are two ways of working around this:

1. Use an explicit stack (e.g. an array or a queue of sorts) instead of
   relying on the call stack.
2. Use an algorithm that doesn't use a stack at all (e.g. Morris
   traversal).

This commit introduces the XML::Generator class which can serialize
documents back to XML without using a stack at all. This class takes
advantage of XML nodes having access to not only their child nodes, but
also their siblings and their parents.

All XML serialisation logic now resides in the XML::Generator class. In
turn the various "to_xml" methods just use this class and serialize
everything starting at "self".
2016-09-04 19:19:00 +02:00
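A minimal sketch of stackless serialisation as described in the commit above, assuming a hypothetical `TreeNode` type that exposes its children, parent, and next sibling. It is not the XML::Generator code; it only shows how "down first, then right, then up" can replace both recursion and an explicit stack, and why the walk has to stop once it returns to the start node.

```ruby
# Hypothetical node type; only child, parent, and next-sibling links are used.
class TreeNode
  attr_reader :name, :children
  attr_accessor :parent, :next_sibling

  def initialize(name, children = [])
    @name = name
    @children = children
    children.each_with_index do |child, index|
      child.parent = self
      child.next_sibling = children[index + 1]
    end
  end
end

# Serialises the subtree rooted at `start` without recursion or a stack.
def serialize(start)
  output = String.new
  current = start

  loop do
    output << "<#{current.name}>"

    if (child = current.children[0])
      current = child # go down first
      next
    end

    # No children: close this node, then move right, or up if there is no
    # sibling left.
    loop do
      output << "</#{current.name}>"
      return output if current.equal?(start) # never visit the start node's siblings

      if current.next_sibling
        current = current.next_sibling
        break
      end

      current = current.parent
    end
  end
end

b = TreeNode.new('b')
a = TreeNode.new('a', [b])
c = TreeNode.new('c')
root = TreeNode.new('root', [a, c])

whole = serialize(root)
partial = serialize(a) # starts below the root; sibling `c` is not visited
```

The call depth stays constant regardless of how deeply the tree is nested, at the cost of every node having to know its parent and next sibling.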
Yorick Peterse 9ac16e2e4f
Fixed index check in Node#next
An index can/should never be equal to the length of a NodeSet, thus we
should use "<" here instead of "<=".
2016-09-03 23:56:55 +02:00
Yorick Peterse de85784097
Release 2.3 2016-07-13 22:44:42 +02:00
Erik Michaels-Ober dca9efb3b1
Change build order to optimize build speed 2016-07-13 22:34:31 +02:00
Erik Michaels-Ober e9073b88c5 Test against Ruby 2.3 2016-07-13 09:12:59 -07:00
Erik Michaels-Ober 3a89dcffab
Remove Parser#reset and PullParser#reset 2016-07-13 17:19:42 +02:00
Erik Michaels-Ober c431c2b004
Lock json dependency to ~> 1.8 on Windows Ruby 1.9 2016-07-13 17:19:42 +02:00
Erik Michaels-Ober 59d2b8c2bc
Remove call to reset_native in Lexer#lex 2016-07-13 17:19:42 +02:00
Erik Michaels-Ober dc30b8b6c1
Remove Lexer#reset method
Resolves https://github.com/YorickPeterse/oga/issues/153.
2016-07-13 17:19:42 +02:00
Erik Michaels-Ober 9a47c751e4
Lock json dependency to ~> 1.8 on Ruby 1.9 2016-07-13 17:19:42 +02:00
Erik Michaels-Ober cf3055123f Ignore nondeterministic test failure of Rubinius on macOS 2016-07-12 11:53:46 -07:00
Yorick Peterse 00ab8bbe73
Clarify README performance feature a bit 2016-04-21 15:37:15 +02:00
Yorick Peterse dead5b4f51 Release 2.2 2016-02-23 22:36:15 +01:00
Yorick Peterse 6d3c5c2ce9 XPath support for nested pipe operators
Basically this will process the left-hand side first, assign the result
to a variable and then append this set with the nodes from the
right-hand side.

Fixes #149
2016-02-23 22:24:07 +01:00
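The evaluation order described in the commit above can be sketched with a toy evaluator, assuming a hypothetical AST made of nested arrays and a Hash that maps path names to result lists. None of this is Oga's XPath code, and skipping duplicates is an assumption based on XPath union semantics.

```ruby
# Hypothetical evaluator over a toy AST; not Oga's XPath implementation.
def evaluate(ast, nodes_by_path)
  type, *args = ast

  case type
  when :path
    nodes_by_path.fetch(args[0], [])
  when :pipe
    left, right = args
    # Evaluate the left-hand side first and keep the result in a variable.
    result = evaluate(left, nodes_by_path).dup
    # Then append the right-hand side, skipping nodes already present.
    evaluate(right, nodes_by_path).each do |node|
      result << node unless result.include?(node)
    end
    result
  else
    raise ArgumentError, "unknown node type: #{type.inspect}"
  end
end

nodes = { 'a' => %w[a1 a2], 'b' => %w[b1], 'c' => %w[a2 c1] }

# Same shape as `a | b | c` when grouped as `(a | b) | c`.
ast = [:pipe, [:pipe, [:path, 'a'], [:path, 'b']], [:path, 'c']]
result = evaluate(ast, nodes)
```

Since the left-hand side may itself be a `:pipe` node, nesting works without any special casing.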
Andrew Murray 40501f9522 Fixed typo 2016-02-12 23:11:20 +11:00
Yorick Peterse ea47f99ce4 Added Windows support as a feature in the README 2016-02-10 19:07:08 +01:00
Yorick Peterse 83d0759998 Release 2.1 2016-02-09 20:17:54 +01:00
Yorick Peterse 5bfc2d50f2 Preserve entities that can't be decoded
Certain entities when decoded will produce a String with an invalid
encoding. This commit ensures that instead of raising an EncodingError
further down the line (e.g. when calling "inspect" on a document) the
entities are preserved as-is.

Fixes #143
2016-02-09 19:51:53 +01:00
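A minimal sketch of the behaviour described in the commit above, assuming a hypothetical `decode_numeric_entities` helper that handles numeric entities only. It is not Oga's decoder; it illustrates checking `String#valid_encoding?` after decoding and falling back to the original entity text when the check fails.

```ruby
# Hypothetical decoder covering numeric entities only; not Oga's implementation.
NUMERIC_ENTITY = /&#(x)?([0-9a-fA-F]+);/

def decode_numeric_entities(input)
  input.gsub(NUMERIC_ENTITY) do |entity|
    base = Regexp.last_match(1) ? 16 : 10

    begin
      decoded = [Integer(Regexp.last_match(2), base)].pack('U')
      # Keep the entity as-is when the decoded String is not valid UTF-8,
      # for example when the code point is a lone surrogate.
      decoded.valid_encoding? ? decoded : entity
    rescue ArgumentError, RangeError
      # Malformed digits or an out-of-range value: also preserve the entity.
      entity
    end
  end
end

valid = decode_numeric_entities('&#65;&#x42;')
preserved = decode_numeric_entities('&#xD800;')
```

Here `valid` holds the decoded characters, while `preserved` still holds the literal entity text because U+D800 cannot be represented as valid UTF-8.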
Yorick Peterse 76b183e7ab Simplify Oga's versioning policy 2016-01-22 02:14:51 +01:00
Yorick Peterse a04797b946 Added code of conduct
This is largely based on the Contributor Covenant but with the list of
unacceptable behaviours updated according to the Rubinius CoC (as I feel
the latter is more explicit/accurate/better).
2016-01-22 01:51:44 +01:00
Yorick Peterse ee906c9af4 Use "rbx" on Travis instead of "rbx-2" 2016-01-08 11:32:55 +01:00
Yorick Peterse fd1570870e Release 2.0.0 2015-12-26 20:46:24 +01:00
Yorick Peterse 66fc4b1dfc Fixed parsing HTML identifiers containing colons
HTML identifiers containing colons should be treated in two ways:

* For element names the prefix (= the namespace prefix in case of XML)
  should be ignored as HTML doesn't support/use namespaces.
* For attribute names a colon is a valid character, thus "foo:bar:baz"
  should be treated as a single attribute name.

This fixes #142.
2015-12-26 20:28:35 +01:00
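The two rules in the commit above can be sketched as a pair of helpers, assuming hypothetical method names and a single-colon element name as input. This is only an illustration of the rules, not of how Oga's lexer or parser applies them.

```ruby
# Hypothetical helpers; the names are illustrative and not part of Oga.

# Element names: drop the prefix before the first colon, since HTML has no
# namespaces.
def html_element_name(raw)
  raw.split(':', 2).last
end

# Attribute names: a colon is an ordinary character, so keep the name whole.
def html_attribute_name(raw)
  raw
end

element_name = html_element_name('foo:bar')
attribute_name = html_attribute_name('foo:bar:baz')
```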
Yorick Peterse a938f23a0e Added at_css to Nokogiri migration guide 2015-11-17 16:35:50 +01:00
Yorick Peterse 082af145e3 Updated Nokogiri migration guide for CSS support 2015-11-17 16:32:46 +01:00
Yorick Peterse 9bb908f8b1 Use #== in Conversion.boolean?
On JRuby 9.0.1.0 this is a bit faster than using "is_a?":

    require 'benchmark/ips'

    input = false

    Benchmark.ips do |bench|
      bench.report 'is_a?' do
        input.is_a?(TrueClass) || input.is_a?(FalseClass)
      end

      bench.report '==' do
        input == true || input == false
      end

      bench.compare!
    end

This outputs:

    Calculating -------------------------------------
                   is_a?    86.129k i/100ms
                      ==   112.837k i/100ms
    -------------------------------------------------
                   is_a?      7.375M (±15.3%) i/s -     35.227M
                      ==     10.428M (±12.0%) i/s -     50.889M

    Comparison:
                      ==: 10427617.5 i/s
                   is_a?:  7374666.2 i/s - 1.41x slower

On both MRI 2.2 and Rubinius 2.5.8 there's little to no difference
between these two methods.
2015-09-23 16:35:09 +02:00
Yorick Peterse d815437217 Exclude 2.2/jruby on OS X 2015-09-17 13:38:48 +02:00
Yorick Peterse cd2195ef1d Use before_install checks to install Ragel on OS X 2015-09-17 13:29:04 +02:00
Yorick Peterse 205feaf704 Let's try to install Ragel on OS X 2015-09-17 13:25:36 +02:00
Yorick Peterse e37bbcbce6 Also run Travis on OS X 2015-09-17 12:21:23 +02:00
Yorick Peterse 5b2dfdbb09 Fixed Markdown headings in the changelog 2015-09-17 01:14:00 +02:00
Yorick Peterse 0fd6fd8645 Release 1.3.1 2015-09-07 14:11:00 +02:00
Yorick Peterse bd48dc15cc Evaluate compiled blocks in an isolated Binding
Re-using the Binding of the XPath::Compiler#compile method would lead to
race conditions, and possibly a memory leak due to the Binding sticking
around for the compiled Proc's lifetime.

By using a dedicated class (and its corresponding Binding) we can work
around this. Access to this class is not synchronized as compiled Procs
don't mutate their enclosing environment.

The race condition can be demonstrated using code such as the
following:

    xml = <<-EOF
    <people>
      <person>
        <name>Alice</name>
      </person>

      <person>
        <name>Bob</name>
      </person>

      <person>
        <name>Eve</name>
      </person>
    </people>
    EOF

    4.times.map do
      Thread.new do
        10_000.times do
          document = Oga.parse_xml(xml)

          document.at_xpath('people/person/name').text
        end
      end
    end.each(&:join)

Running this code would result in NoMethodErrors due to "at_xpath"
returning nil as opposed to an Oga::XML::Element.
2015-09-07 14:02:31 +02:00
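A minimal sketch of the isolation idea in the commit above, assuming hypothetical `IsolatedContext` and `SketchCompiler` classes and a made-up compilation step. It is not Oga's compiler and does not reproduce the race condition; it only shows that a Proc evaluated in a dedicated object's Binding does not hold on to the locals of the method that built its source.

```ruby
# Hypothetical names throughout; this is not Oga's compiler.
class IsolatedContext
  # Evaluates Ruby source in this method's own Binding, which exposes
  # `source` as its only local variable.
  def evaluate(source)
    binding.eval(source)
  end
end

class SketchCompiler
  def compile(expression)
    ast = expression.split('/') # stand-in for a large intermediate value
    source = "lambda { |input| input.fetch(#{ast.last.inspect}) }"

    # Calling `eval(source)` here instead would tie `ast` and `expression`
    # to the Proc's Binding for as long as the Proc lives.
    IsolatedContext.new.evaluate(source)
  end
end

compiled = SketchCompiler.new.compile('people/person/name')
value = compiled.call({ 'name' => 'Alice' })
```

Each call to `compile` creates a fresh context object, so compiled Procs from different threads do not share a Binding either.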
Yorick Peterse b07c75e964 Moved comparing_gems_bench to xpath/compiler
This is a compiler benchmark, not a parser benchmark.
2015-09-06 22:14:17 +02:00
Yorick Peterse ac5cb3d24f Tweaked thread safety notice in the README
Querying the same document concurrently _could_ lead to problems, so
let's just recommend that users not even try this.
2015-09-06 19:30:40 +02:00
Yorick Peterse 4c79468091 Release 1.3.0 2015-09-06 19:20:45 +02:00
Yorick Peterse 791085302e Prepare changelog for 1.3.0 2015-09-04 16:46:33 +02:00
Yorick Peterse f753f08f18 Revamp CSS parser for better axis support
This makes it possible to parse expressions such as "foo>bar", "> .bar",
"> foo.bar", and similar expressions.

This fixes #126 and fixes #131.
2015-09-04 16:06:20 +02:00