Commit Graph

567 Commits

Author SHA1 Message Date
Yorick Peterse ea8b4aa92f Lex comments in chunks
Similar to this being added for CDATA tags in
8acc7fc743 comments are now also lexed in
chunks.

Related issue: #93
2015-04-14 23:11:22 +02:00
Yorick Peterse 8acc7fc743 Lex CDATA tags in chunks
Instead of using a single token (T_CDATA) for a CDATA tag the lexer now
uses 3 tokens:

1. T_CDATA_START
2. T_CDATA_BODY
3. T_CDATA_END

The T_CDATA_BODY token can occur multiple times and is turned into a
single value in the XML parser. This is similar to the way strings are
lexed.

By changing the way CDATA tags are lexed Oga can now lex CDATA tags
containing newlines when using an IO as input. For example, this would
previously fail:

    Oga.parse_xml(StringIO.new("<![CDATA[\nfoo]]>"))

Because IO input reads input per line the input for the lexer would be
as following:

    "<![CDATA[\n"
    "foo]]>"

Related issues: #93
2015-04-14 22:45:55 +02:00
Yorick Peterse 739e3b474c Optimize Traversal#each_node allocations
This removes the need for allocating an Array for every set of child
nodes.
2015-04-12 22:45:42 +02:00
Yorick Peterse b42f9aaf32 Cache output of Element#available_namespaces
This cache is flushed whenever Element#register_namespace is called.
When this cache is flushed it's also recursively flushed for all child
elements. This makes calls to Element#register_namespace a bit more
expensive but in turn calls to Element#available_namespaces will be a
lot faster.
2015-04-12 20:22:33 +02:00
Yorick Peterse fa838154fc Flush Element#namespace cache
When setting a new namespace name using Element#namespace_name= the
cache used by Element#namespace is flushed properly.
2015-04-11 19:20:50 +02:00
Yorick Peterse b0359b37e5 Cache Node#html? and Node#root_node
The results of these methods is now cached until a Node is moved into
another NodeSet. This reduces the time spent in the
xpath/evaluator/big_xml_average_bench.rb benchmark from roughly 10
seconds to roughly 5 seconds per iteration.
2015-04-11 19:12:26 +02:00
Yorick Peterse 421e6e910b Release 0.3.1 2015-04-08 14:58:53 +02:00
Yorick Peterse 4bdc8a3fdc Don't convert entities in script/style elements
In HTML the text of a script/style tag should be left untouched, no
entities must be converted. Doing so would break Javascript such as the
following:

    foo&&bar;

Such code is often the result of minifiers doing their dirty business.
2015-04-08 14:32:09 +02:00
Yorick Peterse 6a1010c287 Fixed decoding entities in attribute values
This was broken by introducing the process of lazy decoding of XML/HTML
entities. The new setup works similar to how XML::Text#text decodes any
entities that may be present.

Fixes #91
2015-04-07 21:18:22 +02:00
Yorick Peterse ef7f50137a Added Oga::EntityDecoder
This module removes some of the code duplication needed to determine
what entity decoder to use.
2015-04-07 21:18:15 +02:00
Yorick Peterse 3f6aa04e91 Release 0.3.0 2015-04-03 21:11:15 +02:00
Yorick Peterse 3176459307 Ignore declared namespaces in HTML documents
The HTML spec states that any declared namespaces, including the default
namespace are to be ignored.

This fixes #85
2015-03-26 22:38:39 +01:00
Yorick Peterse 5adeae18d0 XPath queries match nodes in the default namespace
When querying an XML document that explicitly defines the default XML
namespace the XPath evaluator now correctly matches all nodes within
that namespace if no namespace prefix is given in the query. Previously
this would always return an empty set.
2015-03-26 01:13:55 +01:00
Yorick Peterse f175414917 Added XML::Element#default_namespace? 2015-03-26 01:10:20 +01:00
Yorick Peterse b6fcd326ef Added XML::Node#html? and XML::Node#xml?
The former has been moved over from XML::Text, the latter just inverts
html?.
2015-03-26 01:02:32 +01:00
Yorick Peterse 4ad502958d Added XML::Attribute#==
Overwriting this method makes it easier to check if a given namespace
equals the default XML (and soon HTML) namespace.
2015-03-26 00:53:16 +01:00
Yorick Peterse f2d69af33b Distinguish default attribute/element namespaces
The previous commit messed this up because I wasn't fully awake.
2015-03-26 00:43:50 +01:00
Yorick Peterse 68ada997a8 Moved default namespace into Oga::XML
The default namespace is now located at Oga::XML::DEFAULT_NAMESPACE
instead of Oga::XML::Attribute::DEFAULT_NAMESPACE.
2015-03-26 00:35:28 +01:00
Yorick Peterse 3cdcdf6daa Corrected YARD formatting 2015-03-23 00:31:56 +01:00
Yorick Peterse 66fa9f62ef Added LRU#maximum=/maximum
This allows one to change the maximum amount of keys stored in the
XPath/CSS caches, for example:

    Oga::XPath::Parser::CACHE.maximum = 2056
2015-03-23 00:26:48 +01:00
Yorick Peterse 12aa21fb50 Use parse_with_cache when querying xpath/css 2015-03-23 00:23:46 +01:00
Yorick Peterse 2c4e490614 Added CSS/XPath Parser.parse_with_cache
This method parses and caches ASTs using Oga::LRU. Currently the default
of 1024 keys is used.

See #71 for more information.
2015-03-23 00:22:59 +01:00
Yorick Peterse 67d7d9af88 Added thread-safe LRU class
This class will be used for storing parser XPath/CSS ASTs.

See #71 for more information.
2015-03-23 00:21:52 +01:00
Yorick Peterse 31e93e54f9 Removed Mutex usage from XML::Text
Instead of trying to make this class thread-safe I'm going with the
option of simply declaring it unsafe to mutate instances of XML::Text
while reading it in parallel. This removes the need for Mutex
allocations and keeps the code simple.

Fixes #82
2015-03-21 01:27:00 +01:00
Yorick Peterse c647f064b5 Remove remaining Racc parsing bits 2015-03-21 01:23:00 +01:00
Yorick Peterse ed14981044 Ported the CSS parser to ruby-ll 2015-03-21 01:23:00 +01:00
Yorick Peterse 2714dbe419 Use the ? operator in the XPath parser 2015-03-21 01:23:00 +01:00
Yorick Peterse 3b74a55d73 Use the ? operator in the XML parser 2015-03-21 01:23:00 +01:00
Yorick Peterse 2bbb7d2b10 Use new operators in the XML parser
This allows the removal of quite a bit of recursion based code.
2015-03-21 01:23:00 +01:00
Yorick Peterse 02da47c1f0 Replaced some XPath parser recursion with * 2015-03-21 01:23:00 +01:00
Yorick Peterse 3b06780802 Removed Racc based XPath parser 2015-03-21 01:23:00 +01:00
Yorick Peterse 588c225c53 Proper XPath operator parsing precedence 2015-03-21 01:23:00 +01:00
Yorick Peterse 0fa9d4df88 Ported remaining XPath parsing bits to ruby-ll.
Currently all operators are left-associative with no particular precedence. This
causes a few specs to fail for now. Outside of that the new parser should be
able to parse the same input as the Racc based parser.
2015-03-21 01:22:59 +01:00
Yorick Peterse 4ebfc849a4 Start porting the XPath parser to ruby-ll.
There are still a few bits left to do such as supporting parenthesis and
assigning the correct precedence to the others.
2015-03-21 01:22:59 +01:00
Yorick Peterse cbdaeb21f4 Unwrap a few lines in the XML parser. 2015-03-21 01:22:59 +01:00
Yorick Peterse cfc6749556 Use splat instead of Array#unshift for attributes. 2015-03-21 01:22:59 +01:00
Yorick Peterse d210c9fb57 Compacted a few XML parser rules. 2015-03-21 01:22:59 +01:00
Yorick Peterse a5cd75cb7e Removed useless string allocs from the XML parser. 2015-03-21 01:22:59 +01:00
Yorick Peterse fdcd712ffe Don't use Array#uniq in NodeSet#initialize.
Removing this makes the process of parsing larger XML documents a bit faster.
The downside is that NodeSet#initialize will no longer filter out duplicate
nodes, though this is not something Oga itself relies upon.

Methods such as NodeSet#push still do ignore elements already present.
2015-03-21 01:22:59 +01:00
Yorick Peterse c36b35ac0f Skip ownership iteration when there's no owner.
There's no point in iterating over all the nodes and assigning ownership if
there's no owner to begin with.
2015-03-21 01:22:59 +01:00
Yorick Peterse 006ef4d51a Port over most of the old XML error handling.
Some messages are a bit different due to ruby-ll's error handling, other than
that it's largely the same stuff as before.
2015-03-21 01:22:59 +01:00
Yorick Peterse 1a326fc516 Remove Racc based XML parser. 2015-03-21 01:22:59 +01:00
Yorick Peterse d8b9725b82 Fixed SAX parsing of XML attributes.
This was utterly broken, mainly due to me overlooking it. There are now 2 new
callbacks to handle this properly:

* on_attribute: to handle a single attribute/value pair
* on_attributes: to handle a collection of attributes (as returned by
  on_attribute)

By default on_attribut returns a Hash, on_attributes in turn merges all
attribute hashes into a single one. This ensures that on_element _actually_
receives the attributes as a Hash, instead of an Array with random
nil/XML::Attribute values.
2015-03-21 01:22:59 +01:00
Yorick Peterse dd626c10d3 Use Array#unshift in the LL XML grammar.
Using Array#+ for large sets (e.g. in the benchmarks) is _really_ slow.
Interesting enough Array#unshift uses as much memory as the Racc parser and is
about as fast, even though it has to move memory around.
2015-03-21 01:22:59 +01:00
Yorick Peterse f94407ee9d Parser callback for XML attributes. 2015-03-21 01:22:59 +01:00
Yorick Peterse a023b35e78 Fixed the pull parser for the XML LL parser. 2015-03-21 01:22:59 +01:00
Yorick Peterse 5eed0d31d6 Ported over most of the XML parser to ruby-ll.
This is still missing the error handling previously present.
2015-03-21 01:22:59 +01:00
Yorick Peterse 15a3ab9ba5 ruby-ll: full support for parsing doctypes. 2015-03-21 01:22:59 +01:00
Yorick Peterse 71aefb53cc Started porting the XML parser to ruby-ll
This is far from done.
2015-03-21 01:22:59 +01:00
Yorick Peterse 2ec91f130f Lazy decoding of XML/HTML entities.
Instead of decoding entities in the lexer we'll do this whenever XML::Text#text
is called. This removes the overhead from the parsing phase and ensures the
process is only triggered when actually needed. Note that calling #to_xml and/or
the #inspect methods on a Text (or parent) instance will also trigger the entity
conversion process.

The new entity decoding API supports both regular entities (e.g. &amp;) as well
as codepoint based entities (both regular and hexadecimal codepoints).

To allow safe read-only access to Text instances from multiple threads a mutex
is used. This mutex ensures that only 1 thread can trigger the conversion
process.

Fixes #68
2015-03-05 23:00:43 +01:00
Yorick Peterse 3e05593536 Release 0.2.3 2015-03-04 11:56:23 +01:00
Yorick Peterse 78e40b55c0 Handle parsing of HTML <style> tags.
This basically re-applies the technique used for HTML <script> tags. With this
extra addition I decided to rename/normalize a few things so it's easier to add
any extra tags in the future. One downside of this setup is that the following
will not be parsed by Oga:

    <style>
        </script>
    </style>

The same applies to script tags containing a literal </style> tag. Since this
particular case is rather unlikely to occur I'm OK with not supporting it as it
_does_ simplify the lexer quite a bit.

Fixes #80
2015-03-03 16:28:05 +01:00
Yorick Peterse 73534375d5 Release 0.2.2 2015-03-03 13:36:32 +01:00
Yorick Peterse 142b467277 Set parent of nodes set using Element#inner_text=
This ensures that any text nodes created using Element#inner_text= have their
parent node set correctly.
2015-03-03 13:13:05 +01:00
Yorick Peterse 503efc32cd Release 0.2.1 2015-03-02 22:12:49 +01:00
Yorick Peterse 874d7124af Don't convert <script> text to XML entities.
Fixes #79.
2015-03-02 17:32:19 +01:00
Yorick Peterse 9a586363e9 Added XML::Document#html? 2015-03-02 16:39:40 +01:00
Yorick Peterse ba2177e2cf Lex contents of <script> tags as plain text.
When lexing input in HTML mode the lexer has to treat _all_ content of a
<script> tag as plain text. This ensures that the lexer can process input such
as "x <y" and "// <foo>" correctly.

Fixes #70.
2015-03-02 16:22:09 +01:00
Yorick Peterse e138aa15ac Removed stray comment in the XPath parser. 2014-12-28 23:55:33 +01:00
Yorick Peterse 746c8052dd Remove all nodes when calling Element#inner_text=
This fixes #64.
2014-12-14 23:32:43 +01:00
Dmitry Krasnoukhov 26baf89440 Add missing entities to the decode/encode lists 2014-11-21 01:53:11 +02:00
Yorick Peterse cbb2815146 Support for inline doctype rules plus newlines.
This adds support for lexing/parsing XML documents that use an IO as input _and_
contain doctype rules with newlines in them.

This fixes #63.
2014-11-18 20:02:55 +01:00
Yorick Peterse 922cee913d Release 0.2.0 2014-11-17 23:26:19 +01:00
Yorick Peterse ad4f650c5d Fixed XML entity encoding/decoding ordering.
Thanks to @krasnoukhov for providing the initial patch, which this commit is
largely based on.

This fixes #49.
2014-11-17 22:39:43 +01:00
Yorick Peterse cd86d5d294 Allow removal of element attributes. 2014-11-17 09:00:40 +01:00
Yorick Peterse 804646cc5e Don't modify raw namespaces.
When calling Element#available_namespaces the list of namespaces returned by
Element#namespaces must not be modified.
2014-11-17 00:01:16 +01:00
Yorick Peterse 6753d6a26d Slightly better docs for the XPath/CSS parsers. 2014-11-16 23:40:19 +01:00
Yorick Peterse 57adabc068 Ensure SAX after_element receives meaningful args
This changes the behaviour of after_element when parsing documents using the SAX
parsing API. Previously it would always receive a nil argument, which is kinda
pointless. This commit changes that by making sure it receives a namespace name
(if any) and the element name.

This fixes #54.
2014-11-16 23:32:32 +01:00
Yorick Peterse 23b408fe4f Cleaned up CSS parser code for counting siblings. 2014-11-15 18:31:08 +01:00
Yorick Peterse b464815577 Fixed AST generation for nth-(first|last)-of-type. 2014-11-15 18:27:15 +01:00
Yorick Peterse 9eead81a7c Fixed AST for :only-of-type 2014-11-15 18:08:26 +01:00
Yorick Peterse 1c301d40e2 Properly fixed AST for first-of-type/last-of-type
This requires keeping track of the current element being processed. This in turn
allows the usage of count() + preceding-sibling/following-sibling.
2014-11-15 17:58:56 +01:00
Yorick Peterse f1d574f342 Evaluate XPath predicates for every context node.
Instead of evaluating a predicate once for all context nodes, they should
instead be evaluated separately per context node.
2014-11-15 00:31:44 +01:00
Yorick Peterse 6daa3e7a00 Reverted AST changes for first-of-type
Functions can't be used in combination with axes, so I'll just need to fix the
position() function to work properly.
2014-11-14 23:51:46 +01:00
Yorick Peterse 2d6a2be2e8 Revert "Fixed XPath AST for :last-of-type"
Axes can't be used in combination with functions.

This reverts commit b0b572a584.
2014-11-14 23:49:49 +01:00
Yorick Peterse b0b572a584 Fixed XPath AST for :last-of-type
This should count following nodes, not merely the position.
2014-11-14 23:27:15 +01:00
Yorick Peterse 0128dc50ae Fixed CSS evaluation of :first-of-type
The old XPath "position() = 1" would work in Nokogiri due to the way they
retrieve descendants. In Oga however this would simply always return the first
node.

To fix this Oga now counts the amount of preceding siblings that match the same
full name.
2014-11-14 01:25:03 +01:00
Yorick Peterse e3a26c5d15 Allow querying of nodes using CSS. 2014-11-14 01:05:29 +01:00
Yorick Peterse 01b88d8c68 Use correct root for preceding/following(-sibling)
This ensures these axes work correctly when scoped to a node instead of a
document.
2014-11-13 01:11:29 +01:00
Yorick Peterse b0f6409d1e Wrap predicate nodes around others in CSS ASTs. 2014-11-13 00:47:26 +01:00
Yorick Peterse 817a5e075b Wrap predicate AST nodes _around_ other nodes.
This means that "foo[1]" uses this AST:

    (predicate (test nil "foo") (int 1))

Instead of this AST:

    (test nil "foo" (int 1))

This makes it easier for the XPath evaluator to process predicates correctly.
2014-11-12 22:59:38 +01:00
Yorick Peterse 857ac517d5 Added XML::NodeSet#==
This method can be used to compare two NodeSet instances. By using
XML::NodeSet#equal_nodes?() the need for exposing the "nodes" instance variable
is also removed.
2014-11-09 18:33:16 +01:00
Yorick Peterse a586a512b8 Removed support for CSS such as "|X"
This expression could be used to get all elements that _don't_ have any
namespace. The problem is that this can't be expressed as just a node test,
instead the resulting XPath would have to look something like the following:

    X[local-name() = name()]

However, since the XPath predicates are already created for pseudo classes and
such, also injecting the above into it would be a real big pain. As such I've
decided not to support it.
2014-11-05 00:45:40 +01:00
Yorick Peterse 1c20ef52ae XPath idents can't start with a star.
That is, names such as "*foo" are not valid. This was most likely a typo in the
first place.
2014-11-05 00:38:17 +01:00
Yorick Peterse 61801fe562 Allow CSS identifiers to start with an underscore. 2014-11-05 00:37:40 +01:00
Yorick Peterse f1316c50fb Allow XPath idents to start with an underscore.
This is valid in XML so XPath should support it as well.
2014-11-05 00:35:47 +01:00
Yorick Peterse eb3a3e7630 Use the descendant axis for CSS selectors.
Instead of using "descendant-or-self" Oga will use "descendant". This ensures
that expressions such as "foo *" don't return a set also including the "foo"
element.

Nokogiri solves this problem in a somewhat different way by using //foo//* for
the CSS expression "foo *". While this works in Nokogiri the expression
"descendant-or-self::node()" is slow as a snail in Oga (due to its nature of
retrieving _all_ nodes first). By using "descendant" we can work around this
problem.
2014-11-05 00:22:09 +01:00
Yorick Peterse 602f2fe8bb Apply node() type tests to the document too.
When running XPath queries such as "self::node()" the result should be a set
containing the document itself. This in turn fixes expressions such as
descendant-or-self::node()/a.
2014-11-04 23:41:45 +01:00
Yorick Peterse de04c61df9 Parsing support for the :empty pseudo. 2014-11-04 00:00:23 +01:00
Yorick Peterse 1b870406de Parsing support for the :only-of-type pseudo. 2014-11-03 23:47:25 +01:00
Yorick Peterse 2cf058457d Parsing support for the :only-child pseudo. 2014-11-03 23:27:56 +01:00
Yorick Peterse 759bb71d17 Parsing support for the :last-of-type pseudo. 2014-11-03 23:01:44 +01:00
Yorick Peterse 10632eafd4 Parsing support for the :first-of-type pseudo. 2014-11-03 23:00:15 +01:00
Yorick Peterse 7a606a11d3 Parsing support for the :last-child pseudo class. 2014-11-03 22:58:08 +01:00
Yorick Peterse b4fec3cc8c Parsing support for the :first-child pseudo class. 2014-11-03 22:45:00 +01:00
Yorick Peterse 020d979fba Parsing support for :nth-last-of-type(). 2014-11-03 22:15:21 +01:00
Yorick Peterse bc8be9f725 Fixed various incorrect YARD tags. 2014-11-02 21:23:29 +01:00
Yorick Peterse 2e1320c2dc Explicit return in step_modulo_value. 2014-11-02 19:32:02 +01:00
Yorick Peterse ab8b451dc3 Parsing support for :nth-of-type() 2014-11-02 19:29:39 +01:00
Yorick Peterse 0faceffacb Parsing support for nth-child(n+X) 2014-11-02 19:29:09 +01:00