Commit Graph

506 Commits

Author SHA1 Message Date
Yorick Peterse 3176459307 Ignore declared namespaces in HTML documents
The HTML spec states that any declared namespaces, including the default
namespace are to be ignored.

This fixes #85
2015-03-26 22:38:39 +01:00
Yorick Peterse 5adeae18d0 XPath queries match nodes in the default namespace
When querying an XML document that explicitly defines the default XML
namespace the XPath evaluator now correctly matches all nodes within
that namespace if no namespace prefix is given in the query. Previously
this would always return an empty set.
2015-03-26 01:13:55 +01:00
Yorick Peterse f175414917 Added XML::Element#default_namespace? 2015-03-26 01:10:20 +01:00
Yorick Peterse b6fcd326ef Added XML::Node#html? and XML::Node#xml?
The former has been moved over from XML::Text, the latter just inverts
html?.
2015-03-26 01:02:32 +01:00
Yorick Peterse 4ad502958d Added XML::Attribute#==
Overwriting this method makes it easier to check if a given namespace
equals the default XML (and soon HTML) namespace.
2015-03-26 00:53:16 +01:00
Yorick Peterse f2d69af33b Distinguish default attribute/element namespaces
The previous commit messed this up because I wasn't fully awake.
2015-03-26 00:43:50 +01:00
Yorick Peterse 68ada997a8 Moved default namespace into Oga::XML
The default namespace is now located at Oga::XML::DEFAULT_NAMESPACE
instead of Oga::XML::Attribute::DEFAULT_NAMESPACE.
2015-03-26 00:35:28 +01:00
Yorick Peterse 3cdcdf6daa Corrected YARD formatting 2015-03-23 00:31:56 +01:00
Yorick Peterse 66fa9f62ef Added LRU#maximum=/maximum
This allows one to change the maximum amount of keys stored in the
XPath/CSS caches, for example:

    Oga::XPath::Parser::CACHE.maximum = 2056
2015-03-23 00:26:48 +01:00
Yorick Peterse 12aa21fb50 Use parse_with_cache when querying xpath/css 2015-03-23 00:23:46 +01:00
Yorick Peterse 2c4e490614 Added CSS/XPath Parser.parse_with_cache
This method parses and caches ASTs using Oga::LRU. Currently the default
of 1024 keys is used.

See #71 for more information.
2015-03-23 00:22:59 +01:00
Yorick Peterse 67d7d9af88 Added thread-safe LRU class
This class will be used for storing parser XPath/CSS ASTs.

See #71 for more information.
2015-03-23 00:21:52 +01:00
Yorick Peterse 31e93e54f9 Removed Mutex usage from XML::Text
Instead of trying to make this class thread-safe I'm going with the
option of simply declaring it unsafe to mutate instances of XML::Text
while reading it in parallel. This removes the need for Mutex
allocations and keeps the code simple.

Fixes #82
2015-03-21 01:27:00 +01:00
Yorick Peterse c647f064b5 Remove remaining Racc parsing bits 2015-03-21 01:23:00 +01:00
Yorick Peterse ed14981044 Ported the CSS parser to ruby-ll 2015-03-21 01:23:00 +01:00
Yorick Peterse 2714dbe419 Use the ? operator in the XPath parser 2015-03-21 01:23:00 +01:00
Yorick Peterse 3b74a55d73 Use the ? operator in the XML parser 2015-03-21 01:23:00 +01:00
Yorick Peterse 2bbb7d2b10 Use new operators in the XML parser
This allows the removal of quite a bit of recursion based code.
2015-03-21 01:23:00 +01:00
Yorick Peterse 02da47c1f0 Replaced some XPath parser recursion with * 2015-03-21 01:23:00 +01:00
Yorick Peterse 3b06780802 Removed Racc based XPath parser 2015-03-21 01:23:00 +01:00
Yorick Peterse 588c225c53 Proper XPath operator parsing precedence 2015-03-21 01:23:00 +01:00
Yorick Peterse 0fa9d4df88 Ported remaining XPath parsing bits to ruby-ll.
Currently all operators are left-associative with no particular precedence. This
causes a few specs to fail for now. Outside of that the new parser should be
able to parse the same input as the Racc based parser.
2015-03-21 01:22:59 +01:00
Yorick Peterse 4ebfc849a4 Start porting the XPath parser to ruby-ll.
There are still a few bits left to do such as supporting parenthesis and
assigning the correct precedence to the others.
2015-03-21 01:22:59 +01:00
Yorick Peterse cbdaeb21f4 Unwrap a few lines in the XML parser. 2015-03-21 01:22:59 +01:00
Yorick Peterse cfc6749556 Use splat instead of Array#unshift for attributes. 2015-03-21 01:22:59 +01:00
Yorick Peterse d210c9fb57 Compacted a few XML parser rules. 2015-03-21 01:22:59 +01:00
Yorick Peterse a5cd75cb7e Removed useless string allocs from the XML parser. 2015-03-21 01:22:59 +01:00
Yorick Peterse fdcd712ffe Don't use Array#uniq in NodeSet#initialize.
Removing this makes the process of parsing larger XML documents a bit faster.
The downside is that NodeSet#initialize will no longer filter out duplicate
nodes, though this is not something Oga itself relies upon.

Methods such as NodeSet#push still do ignore elements already present.
2015-03-21 01:22:59 +01:00
Yorick Peterse c36b35ac0f Skip ownership iteration when there's no owner.
There's no point in iterating over all the nodes and assigning ownership if
there's no owner to begin with.
2015-03-21 01:22:59 +01:00
Yorick Peterse 006ef4d51a Port over most of the old XML error handling.
Some messages are a bit different due to ruby-ll's error handling, other than
that it's largely the same stuff as before.
2015-03-21 01:22:59 +01:00
Yorick Peterse 1a326fc516 Remove Racc based XML parser. 2015-03-21 01:22:59 +01:00
Yorick Peterse d8b9725b82 Fixed SAX parsing of XML attributes.
This was utterly broken, mainly due to me overlooking it. There are now 2 new
callbacks to handle this properly:

* on_attribute: to handle a single attribute/value pair
* on_attributes: to handle a collection of attributes (as returned by
  on_attribute)

By default on_attribut returns a Hash, on_attributes in turn merges all
attribute hashes into a single one. This ensures that on_element _actually_
receives the attributes as a Hash, instead of an Array with random
nil/XML::Attribute values.
2015-03-21 01:22:59 +01:00
Yorick Peterse dd626c10d3 Use Array#unshift in the LL XML grammar.
Using Array#+ for large sets (e.g. in the benchmarks) is _really_ slow.
Interesting enough Array#unshift uses as much memory as the Racc parser and is
about as fast, even though it has to move memory around.
2015-03-21 01:22:59 +01:00
Yorick Peterse f94407ee9d Parser callback for XML attributes. 2015-03-21 01:22:59 +01:00
Yorick Peterse a023b35e78 Fixed the pull parser for the XML LL parser. 2015-03-21 01:22:59 +01:00
Yorick Peterse 5eed0d31d6 Ported over most of the XML parser to ruby-ll.
This is still missing the error handling previously present.
2015-03-21 01:22:59 +01:00
Yorick Peterse 15a3ab9ba5 ruby-ll: full support for parsing doctypes. 2015-03-21 01:22:59 +01:00
Yorick Peterse 71aefb53cc Started porting the XML parser to ruby-ll
This is far from done.
2015-03-21 01:22:59 +01:00
Yorick Peterse 2ec91f130f Lazy decoding of XML/HTML entities.
Instead of decoding entities in the lexer we'll do this whenever XML::Text#text
is called. This removes the overhead from the parsing phase and ensures the
process is only triggered when actually needed. Note that calling #to_xml and/or
the #inspect methods on a Text (or parent) instance will also trigger the entity
conversion process.

The new entity decoding API supports both regular entities (e.g. &) as well
as codepoint based entities (both regular and hexadecimal codepoints).

To allow safe read-only access to Text instances from multiple threads a mutex
is used. This mutex ensures that only 1 thread can trigger the conversion
process.

Fixes #68
2015-03-05 23:00:43 +01:00
Yorick Peterse 3e05593536 Release 0.2.3 2015-03-04 11:56:23 +01:00
Yorick Peterse 78e40b55c0 Handle parsing of HTML <style> tags.
This basically re-applies the technique used for HTML <script> tags. With this
extra addition I decided to rename/normalize a few things so it's easier to add
any extra tags in the future. One downside of this setup is that the following
will not be parsed by Oga:

    <style>
        </script>
    </style>

The same applies to script tags containing a literal </style> tag. Since this
particular case is rather unlikely to occur I'm OK with not supporting it as it
_does_ simplify the lexer quite a bit.

Fixes #80
2015-03-03 16:28:05 +01:00
Yorick Peterse 73534375d5 Release 0.2.2 2015-03-03 13:36:32 +01:00
Yorick Peterse 142b467277 Set parent of nodes set using Element#inner_text=
This ensures that any text nodes created using Element#inner_text= have their
parent node set correctly.
2015-03-03 13:13:05 +01:00
Yorick Peterse 503efc32cd Release 0.2.1 2015-03-02 22:12:49 +01:00
Yorick Peterse 874d7124af Don't convert <script> text to XML entities.
Fixes #79.
2015-03-02 17:32:19 +01:00
Yorick Peterse 9a586363e9 Added XML::Document#html? 2015-03-02 16:39:40 +01:00
Yorick Peterse ba2177e2cf Lex contents of <script> tags as plain text.
When lexing input in HTML mode the lexer has to treat _all_ content of a
<script> tag as plain text. This ensures that the lexer can process input such
as "x <y" and "// <foo>" correctly.

Fixes #70.
2015-03-02 16:22:09 +01:00
Yorick Peterse e138aa15ac Removed stray comment in the XPath parser. 2014-12-28 23:55:33 +01:00
Yorick Peterse 746c8052dd Remove all nodes when calling Element#inner_text=
This fixes #64.
2014-12-14 23:32:43 +01:00
Dmitry Krasnoukhov 26baf89440 Add missing entities to the decode/encode lists 2014-11-21 01:53:11 +02:00