498 Commits

Author SHA1 Message Date
Yorick Peterse 66fa9f62ef Added LRU#maximum=/maximum
This allows one to change the maximum number of keys stored in the
XPath/CSS caches, for example:

    Oga::XPath::Parser::CACHE.maximum = 2056
2015-03-23 00:26:48 +01:00
Yorick Peterse 12aa21fb50 Use parse_with_cache when querying xpath/css 2015-03-23 00:23:46 +01:00
Yorick Peterse 2c4e490614 Added CSS/XPath Parser.parse_with_cache
This method parses and caches ASTs using Oga::LRU. Currently the default
of 1024 keys is used.

See #71 for more information.
2015-03-23 00:22:59 +01:00
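
The caching idea in this commit can be sketched roughly as follows. This is only an illustration: `SketchParser`, its placeholder `parse` method, and the `MAXIMUM` constant are hypothetical names, and the sketch evicts the oldest entry first rather than using a real LRU policy.

```ruby
# Hypothetical stand-in for a query parser; not Oga's actual parser classes.
class SketchParser
  CACHE       = {}        # query String => parsed AST
  CACHE_MUTEX = Mutex.new # guards CACHE when called from several threads
  MAXIMUM     = 1024      # default key count mentioned in the commit message

  attr_reader :query

  def initialize(query)
    @query = query
  end

  # Placeholder "AST": one [:step, name] pair per path segment.
  def parse
    query.split('/').reject(&:empty?).map { |step| [:step, step] }
  end

  # Returns the cached AST for the query, parsing it only on a cache miss.
  def self.parse_with_cache(query)
    CACHE_MUTEX.synchronize do
      CACHE.fetch(query) do
        # Simplification: drop the oldest entry once the cache is full.
        CACHE.shift if CACHE.size >= MAXIMUM
        CACHE[query] = new(query).parse
      end
    end
  end
end

ast = SketchParser.parse_with_cache('root/item')
SketchParser.parse_with_cache('root/item').equal?(ast) # same object, no re-parse
```

Callers should treat the returned AST as read-only, since every caller of the same query shares one object.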
Yorick Peterse 67d7d9af88 Added thread-safe LRU class
This class will be used for storing parsed XPath/CSS ASTs.

See #71 for more information.
2015-03-23 00:21:52 +01:00
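
For readers unfamiliar with the technique, a thread-safe LRU cache can be sketched in a few lines of Ruby. The class below is a hypothetical illustration (`SketchLRU` is not Oga's `LRU` class); it assumes a single Mutex around an insertion-ordered Hash is acceptable, and it includes a `maximum=` setter similar in spirit to the one a later commit in this log mentions.

```ruby
# Hypothetical sketch of a thread-safe LRU cache; not Oga's implementation.
class SketchLRU
  def initialize(maximum = 1024)
    @maximum = maximum
    @cache   = {} # Ruby Hashes keep insertion order: first key = oldest
    @mutex   = Mutex.new
  end

  def maximum
    @mutex.synchronize { @maximum }
  end

  # Lowering the maximum evicts the least recently used keys right away.
  def maximum=(value)
    @mutex.synchronize do
      @maximum = value
      evict
    end
  end

  # Reading a key moves it to the "most recently used" end of the Hash.
  def [](key)
    @mutex.synchronize do
      @cache[key] = @cache.delete(key) if @cache.key?(key)
    end
  end

  def []=(key, value)
    @mutex.synchronize do
      @cache.delete(key)
      @cache[key] = value
      evict
    end
  end

  def size
    @mutex.synchronize { @cache.size }
  end

  private

  # Must be called while holding the mutex.
  def evict
    @cache.shift while @cache.size > @maximum
  end
end

cache = SketchLRU.new(2)
cache[:a] = 1
cache[:b] = 2
cache[:a]     # touch :a so that :b becomes the oldest key
cache[:c] = 3 # evicts :b
```

A single coarse lock keeps the sketch simple; it serialises all readers and writers, which is usually fine for a small cache of parsed queries.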
Yorick Peterse 31e93e54f9 Removed Mutex usage from XML::Text
Instead of trying to make this class thread-safe I'm going with the
option of simply declaring it unsafe to mutate instances of XML::Text
while reading it in parallel. This removes the need for Mutex
allocations and keeps the code simple.

Fixes #82
2015-03-21 01:27:00 +01:00
Yorick Peterse c647f064b5 Remove remaining Racc parsing bits 2015-03-21 01:23:00 +01:00
Yorick Peterse ed14981044 Ported the CSS parser to ruby-ll 2015-03-21 01:23:00 +01:00
Yorick Peterse 2714dbe419 Use the ? operator in the XPath parser 2015-03-21 01:23:00 +01:00
Yorick Peterse 3b74a55d73 Use the ? operator in the XML parser 2015-03-21 01:23:00 +01:00
Yorick Peterse 2bbb7d2b10 Use new operators in the XML parser
This allows the removal of quite a bit of recursion based code.
2015-03-21 01:23:00 +01:00
Yorick Peterse 02da47c1f0 Replaced some XPath parser recursion with * 2015-03-21 01:23:00 +01:00
Yorick Peterse 3b06780802 Removed Racc based XPath parser 2015-03-21 01:23:00 +01:00
Yorick Peterse 588c225c53 Proper XPath operator parsing precedence 2015-03-21 01:23:00 +01:00
Yorick Peterse 0fa9d4df88 Ported remaining XPath parsing bits to ruby-ll.
Currently all operators are left-associative with no particular precedence. This
causes a few specs to fail for now. Outside of that the new parser should be
able to parse the same input as the Racc based parser.
2015-03-21 01:22:59 +01:00
Yorick Peterse 4ebfc849a4 Start porting the XPath parser to ruby-ll.
There are still a few bits left to do such as supporting parentheses and
assigning the correct precedence to the others.
2015-03-21 01:22:59 +01:00
Yorick Peterse cbdaeb21f4 Unwrap a few lines in the XML parser. 2015-03-21 01:22:59 +01:00
Yorick Peterse cfc6749556 Use splat instead of Array#unshift for attributes. 2015-03-21 01:22:59 +01:00
Yorick Peterse d210c9fb57 Compacted a few XML parser rules. 2015-03-21 01:22:59 +01:00
Yorick Peterse a5cd75cb7e Removed useless string allocs from the XML parser. 2015-03-21 01:22:59 +01:00
Yorick Peterse fdcd712ffe Don't use Array#uniq in NodeSet#initialize.
Removing this makes the process of parsing larger XML documents a bit faster.
The downside is that NodeSet#initialize will no longer filter out duplicate
nodes, though this is not something Oga itself relies upon.

Methods such as NodeSet#push still do ignore elements already present.
2015-03-21 01:22:59 +01:00
Yorick Peterse c36b35ac0f Skip ownership iteration when there's no owner.
There's no point in iterating over all the nodes and assigning ownership if
there's no owner to begin with.
2015-03-21 01:22:59 +01:00
Yorick Peterse 006ef4d51a Port over most of the old XML error handling.
Some messages are a bit different due to ruby-ll's error handling; other than
that it's largely the same stuff as before.
2015-03-21 01:22:59 +01:00
Yorick Peterse 1a326fc516 Remove Racc based XML parser. 2015-03-21 01:22:59 +01:00
Yorick Peterse d8b9725b82 Fixed SAX parsing of XML attributes.
This was utterly broken, mainly due to me overlooking it. There are now 2 new
callbacks to handle this properly:

* on_attribute: to handle a single attribute/value pair
* on_attributes: to handle a collection of attributes (as returned by
  on_attribute)

By default on_attribute returns a Hash; on_attributes in turn merges all
attribute hashes into a single one. This ensures that on_element _actually_
receives the attributes as a Hash, instead of an Array with random
nil/XML::Attribute values.
2015-03-21 01:22:59 +01:00
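
The two callbacks described above could look roughly like this. It is a sketch under assumptions: the class name, the argument order of `on_attribute`, and the `"namespace:name"` key format are illustrative guesses, not taken from this commit.

```ruby
# Hypothetical callbacks; argument order and key format are assumptions.
class SketchSaxCallbacks
  # Handles a single attribute/value pair and returns it as a Hash.
  def on_attribute(name, ns_name = nil, value = nil)
    key = ns_name ? "#{ns_name}:#{name}" : name
    { key => value }
  end

  # Merges the Hashes returned by on_attribute into a single Hash.
  def on_attributes(attrs)
    attrs.inject({}) { |merged, pair| merged.merge(pair) }
  end
end

callbacks = SketchSaxCallbacks.new
pairs = [
  callbacks.on_attribute('href', nil, '/index'),
  callbacks.on_attribute('lang', 'xml', 'en')
]
callbacks.on_attributes(pairs)
```

With this shape, an element callback receives one Hash of attributes rather than an Array of individual values.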
Yorick Peterse dd626c10d3 Use Array#unshift in the LL XML grammar.
Using Array#+ for large sets (e.g. in the benchmarks) is _really_ slow.
Interestingly enough Array#unshift uses as much memory as the Racc parser and is
about as fast, even though it has to move memory around.
2015-03-21 01:22:59 +01:00
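
The difference between the two approaches can be explored with a small benchmark. The helper names and the input size below are arbitrary choices for illustration, and actual timings will vary by Ruby version and machine.

```ruby
require 'benchmark'

# Prepends by allocating a new Array on every step (Array#+).
def build_with_plus(items)
  items.reverse_each.inject([]) { |acc, item| [item] + acc }
end

# Prepends in place, reusing the same Array (Array#unshift).
def build_with_unshift(items)
  items.reverse_each.inject([]) { |acc, item| acc.unshift(item) }
end

items = (1..10_000).to_a

Benchmark.bm(10) do |bm|
  bm.report('Array#+') { build_with_plus(items) }
  bm.report('unshift') { build_with_unshift(items) }
end
```

Both helpers produce the same Array; only the allocation pattern differs, since `Array#+` copies the accumulated elements into a fresh Array each time.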
Yorick Peterse f94407ee9d Parser callback for XML attributes. 2015-03-21 01:22:59 +01:00
Yorick Peterse a023b35e78 Fixed the pull parser for the XML LL parser. 2015-03-21 01:22:59 +01:00
Yorick Peterse 5eed0d31d6 Ported over most of the XML parser to ruby-ll.
This is still missing the error handling previously present.
2015-03-21 01:22:59 +01:00
Yorick Peterse 15a3ab9ba5 ruby-ll: full support for parsing doctypes. 2015-03-21 01:22:59 +01:00
Yorick Peterse 71aefb53cc Started porting the XML parser to ruby-ll
This is far from done.
2015-03-21 01:22:59 +01:00
Yorick Peterse 2ec91f130f Lazy decoding of XML/HTML entities.
Instead of decoding entities in the lexer we'll do this whenever XML::Text#text
is called. This removes the overhead from the parsing phase and ensures the
process is only triggered when actually needed. Note that calling #to_xml and/or
the #inspect methods on a Text (or parent) instance will also trigger the entity
conversion process.

The new entity decoding API supports both regular entities (e.g. &amp;) as well
as codepoint based entities (both regular and hexadecimal codepoints).

To allow safe read-only access to Text instances from multiple threads a mutex
is used. This mutex ensures that only 1 thread can trigger the conversion
process.

Fixes #68
2015-03-05 23:00:43 +01:00
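
A minimal sketch of the lazy decoding approach described in this commit is shown below. `SketchText`, its entity table, and the regular expression are hypothetical and far smaller than a real entity list. Note that a later commit in this log removes the Mutex again.

```ruby
# Hypothetical sketch of lazy entity decoding; not Oga's XML::Text class.
class SketchText
  NAMED = {
    '&amp;' => '&', '&lt;' => '<', '&gt;' => '>',
    '&quot;' => '"', '&apos;' => "'"
  }.freeze

  # Hexadecimal codepoint, decimal codepoint, or a named entity.
  ENTITY = /&(?:#x([0-9a-fA-F]+)|#(\d+)|[a-zA-Z]+);/

  def initialize(raw)
    @raw     = raw
    @decoded = false
    @mutex   = Mutex.new
  end

  # Decodes entities on first access only; the mutex makes sure that a
  # single thread performs the conversion.
  def text
    @mutex.synchronize do
      unless @decoded
        @raw     = decode(@raw)
        @decoded = true
      end

      @raw
    end
  end

  private

  # A single gsub pass avoids decoding the output of an earlier
  # replacement a second time (e.g. "&amp;lt;" becomes "&lt;", not "<").
  def decode(input)
    input.gsub(ENTITY) do |match|
      if $1
        [$1.to_i(16)].pack('U')
      elsif $2
        [$2.to_i].pack('U')
      else
        NAMED.fetch(match, match) # unknown names are left untouched
      end
    end
  end
end

SketchText.new('a &amp; b &#60; c &#x3E; d').text
```

Parsing stays cheap because nothing is decoded until `text` is called, at the cost of a lock on every read.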
Yorick Peterse 3e05593536 Release 0.2.3 2015-03-04 11:56:23 +01:00
Yorick Peterse 78e40b55c0 Handle parsing of HTML <style> tags.
This basically re-applies the technique used for HTML <script> tags. With this
extra addition I decided to rename/normalize a few things so it's easier to add
any extra tags in the future. One downside of this setup is that the following
will not be parsed by Oga:

    <style>
        </script>
    </style>

The same applies to script tags containing a literal </style> tag. Since this
particular case is rather unlikely to occur I'm OK with not supporting it as it
_does_ simplify the lexer quite a bit.

Fixes #80
2015-03-03 16:28:05 +01:00
Yorick Peterse 73534375d5 Release 0.2.2 2015-03-03 13:36:32 +01:00
Yorick Peterse 142b467277 Set parent of nodes set using Element#inner_text=
This ensures that any text nodes created using Element#inner_text= have their
parent node set correctly.
2015-03-03 13:13:05 +01:00
Yorick Peterse 503efc32cd Release 0.2.1 2015-03-02 22:12:49 +01:00
Yorick Peterse 874d7124af Don't convert <script> text to XML entities.
Fixes #79.
2015-03-02 17:32:19 +01:00
Yorick Peterse 9a586363e9 Added XML::Document#html? 2015-03-02 16:39:40 +01:00
Yorick Peterse ba2177e2cf Lex contents of <script> tags as plain text.
When lexing input in HTML mode the lexer has to treat _all_ content of a
<script> tag as plain text. This ensures that the lexer can process input such
as "x <y" and "// <foo>" correctly.

Fixes #70.
2015-03-02 16:22:09 +01:00
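
The idea of treating everything inside a `<script>` tag as plain text can be sketched with Ruby's StringScanner. The function and token names are hypothetical, and this toy lexer only understands bare lowercase tags without attributes.

```ruby
require 'strscan'

# Hypothetical toy lexer; not Oga's Ragel based lexer.
def sketch_lex_html(input)
  scanner = StringScanner.new(input)
  tokens  = []

  until scanner.eos?
    if scanner.scan(/<script>/)
      tokens << [:open, 'script']

      # Everything up to the closing tag is plain text, including any "<".
      # Without a closing tag the rest of the input is taken as text.
      body = scanner.scan(/.*?(?=<\/script>)/m) || scanner.scan(/.*/m)
      tokens << [:text, body] unless body.empty?
    elsif scanner.scan(/<\/([a-z]+)>/)
      tokens << [:close, scanner[1]]
    elsif scanner.scan(/<([a-z]+)>/)
      tokens << [:open, scanner[1]]
    else
      # Regular text; a stray "<" is consumed one character at a time.
      tokens << [:text, scanner.scan(/[^<]+/) || scanner.getch]
    end
  end

  tokens
end

sketch_lex_html('<p>hi</p><script>x <y // <foo></script>')
```

Switching to a "raw text" mode after the opening tag is what lets input such as `x <y` pass through without being mistaken for markup.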
Yorick Peterse e138aa15ac Removed stray comment in the XPath parser. 2014-12-28 23:55:33 +01:00
Yorick Peterse 746c8052dd Remove all nodes when calling Element#inner_text=
This fixes #64.
2014-12-14 23:32:43 +01:00
Dmitry Krasnoukhov 26baf89440 Add missing entities to the decode/encode lists 2014-11-21 01:53:11 +02:00
Yorick Peterse cbb2815146 Support for inline doctype rules plus newlines.
This adds support for lexing/parsing XML documents that use an IO as input _and_
contain doctype rules with newlines in them.

This fixes #63.
2014-11-18 20:02:55 +01:00
Yorick Peterse 922cee913d Release 0.2.0 2014-11-17 23:26:19 +01:00
Yorick Peterse ad4f650c5d Fixed XML entity encoding/decoding ordering.
Thanks to @krasnoukhov for providing the initial patch, which this commit is
largely based on.

This fixes #49.
2014-11-17 22:39:43 +01:00
Yorick Peterse cd86d5d294 Allow removal of element attributes. 2014-11-17 09:00:40 +01:00
Yorick Peterse 804646cc5e Don't modify raw namespaces.
When calling Element#available_namespaces the list of namespaces returned by
Element#namespaces must not be modified.
2014-11-17 00:01:16 +01:00
Yorick Peterse 6753d6a26d Slightly better docs for the XPath/CSS parsers. 2014-11-16 23:40:19 +01:00
Yorick Peterse 57adabc068 Ensure SAX after_element receives meaningful args
This changes the behaviour of after_element when parsing documents using the SAX
parsing API. Previously it would always receive a nil argument, which is kinda
pointless. This commit changes that by making sure it receives a namespace name
(if any) and the element name.

This fixes #54.
2014-11-16 23:32:32 +01:00
Yorick Peterse 23b408fe4f Cleaned up CSS parser code for counting siblings. 2014-11-15 18:31:08 +01:00