568 Commits

Author SHA1 Message Date
Yorick Peterse e3b45fddfc to_float support for non String values 2015-08-19 20:14:22 +02:00
Yorick Peterse 07b52fb48a Added Ruby::Node#not
This is a shortcut for "!foo". Using this method one doesn't have to
worry about how the "!" operator binds. For example, this:

    !foo.or(bar)

would be parsed/evaluated as this:

    !(foo.or(bar))

when instead we want it to be this:

    (!foo).or(bar)

Using explicit parentheses leads to ugly code, so now we can do this
instead:

    foo.not.or(bar)
2015-08-19 20:14:22 +02:00
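The binding problem described in 07b52fb48a can be reproduced with a small stand-in class. This is only a sketch: `NotSketchNode` and its S-expression `inspect` output are hypothetical and are not Oga's actual `Ruby::Node`.

```ruby
# Hypothetical stand-in for an AST node; this is not Oga's Ruby::Node.
class NotSketchNode
  attr_reader :type, :children

  def initialize(type, children = [])
    @type = type
    @children = children
  end

  # Wraps this node in a :not node, the equivalent of writing "!foo".
  def not
    NotSketchNode.new(:not, [self])
  end

  # Ruby dispatches the unary "!" operator to this method.
  def !
    self.not
  end

  # Combines this node and another node in an :or node.
  def or(other)
    NotSketchNode.new(:or, [self, other])
  end

  # Renders the tree as an S-expression for easy comparison.
  def inspect
    return type.to_s if children.empty?

    "(#{type} #{children.map(&:inspect).join(' ')})"
  end
end

foo = NotSketchNode.new(:foo)
bar = NotSketchNode.new(:bar)

prefix  = !foo.or(bar)    # "!" is applied to the result of foo.or(bar)
chained = foo.not.or(bar) # the negation only wraps foo

puts prefix.inspect  # (not (or foo bar))
puts chained.inspect # (or (not foo) bar)
```

Because a method call binds tighter than the unary `!`, only the chained form negates `foo` alone.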
Yorick Peterse 4da1c637bc Cleaned up descendant-or-self compiler specs 2015-08-19 20:14:22 +02:00
Yorick Peterse 2eb12eced6 XPath compiler support for all operators
Some specs still fail due to true()/false() not being implemented but
the operators themselves should work just fine.
2015-08-19 20:14:21 +02:00
Yorick Peterse 3a18d23792 to_boolean support for truthy Ruby values 2015-08-19 20:14:21 +02:00
Yorick Peterse 06ae1503d4 nodes/attributes support in to_compatible_types
This extends XPath::Conversion.to_compatible_types so that it can also
take XML::Node and XML::Attribute objects as input.
2015-08-19 20:14:21 +02:00
Yorick Peterse 376d016acd Expanded supported input for Conversion.to_float
This extends XPath::Conversion.to_float so it can also convert NodeSet
and Node instances.
2015-08-19 20:14:21 +02:00
Yorick Peterse 8a82cc3593 XPath compiler support for the "=" operator 2015-08-19 20:14:21 +02:00
Yorick Peterse 04aa8f6546 Ruby generator support for "begin" blocks 2015-08-19 20:14:21 +02:00
Yorick Peterse 92b43a7500 Renamed on_begin to on_followed_by 2015-08-19 20:14:21 +02:00
Yorick Peterse c98ba21a87 Ruby generator support for mass assignments 2015-08-19 20:14:21 +02:00
Yorick Peterse 4f03bf19c1 XPath compiler support for "ancestor" 2015-08-19 20:14:21 +02:00
Yorick Peterse 52741a3b78 Added XML::Node#each_ancestor
This method can be used to walk through the ancestor tree of a Node.
2015-08-19 20:14:20 +02:00
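Walking the ancestor tree as described in 52741a3b78 amounts to following parent pointers until none remain. A minimal sketch, assuming a hypothetical `AncestorSketch` class with a `parent` reader (not Oga's `XML::Node`):

```ruby
# Hypothetical minimal node with a parent pointer; not Oga's XML::Node.
class AncestorSketch
  attr_reader :name, :parent

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
  end

  # Yields every ancestor, starting with the direct parent and ending
  # with the top-most node. Returns an Enumerator without a block.
  def each_ancestor
    return to_enum(:each_ancestor) unless block_given?

    node = parent
    while node
      yield node
      node = node.parent
    end
  end
end

root = AncestorSketch.new('root')
body = AncestorSketch.new('body', root)
para = AncestorSketch.new('p', body)

names = para.each_ancestor.map(&:name)
puts names.inspect # ["body", "root"]
```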
Yorick Peterse db39b25546 XPath compiler support for ancestor-or-self
This also comes with some changes to the specs as the old behaviour of
the Evaluator was incorrect. The Evaluator would bail after matching a
single node but instead it's meant to continue until it runs out of
parent nodes.
2015-08-19 20:14:20 +02:00
Yorick Peterse d8fbaf75d8 Ruby generator support for while loops 2015-08-19 20:14:20 +02:00
Yorick Peterse 7fdf8d7460 Rewrote XPath compiler predicate specs 2015-08-19 20:14:20 +02:00
Yorick Peterse 6f6151fd52 Added Ruby generator support for Symbols 2015-08-19 20:14:20 +02:00
Yorick Peterse cf2405998b Ruby::Generator support for #[] methods 2015-08-19 20:14:20 +02:00
Yorick Peterse 2c1b4e7cbc Support for generating "else" statements 2015-08-19 20:14:20 +02:00
Yorick Peterse ac6c0d806e Updated XPath variables spec to use the compiler 2015-08-19 20:14:20 +02:00
Yorick Peterse 94f7f85dc3 Added XML::Document#root_node 2015-08-19 20:14:20 +02:00
Yorick Peterse a7744b7a5c Use the XPath compiler for XPath/CSS specs 2015-08-19 20:14:20 +02:00
Yorick Peterse 3300a6df49 Added XPath::Compiler.compile_with_cache 2015-08-19 20:14:20 +02:00
Yorick Peterse 6d01adafc7 XPath compiler now actually returns a Proc 2015-08-19 20:14:20 +02:00
Yorick Peterse 6daff674d9 Use "parse" instead of "parse_xml" 2015-08-19 20:14:20 +02:00
Yorick Peterse 337d126264 Added Ruby::Generator class
This class will be used to serialize a Ruby AST back to valid Ruby
source code (as a String).
2015-08-19 20:14:20 +02:00
Yorick Peterse 6673f176d8 Added Oga::Ruby::Node
This class will be used for building Ruby ASTs that will be generated
based on XPath expressions.
2015-08-19 20:14:20 +02:00
Yorick Peterse 08c965bfbc Basic specs for the XPath compiler 2015-08-19 20:14:19 +02:00
Daniel Fockler 496811a23f Fixes #127 2015-08-14 16:15:49 -07:00
Jakub Pawlowicz ed3cbe7975 Fixes #129 - lexing superfluous end tags.
Prevents a superfluous end tag of a self-closing HTML tag from
closing its parent element prematurely, for example:

```html
<object><param></param><param></param></object>
```

(note <param> is self closing) being turned into:

```html
<object><param/></object><param/>
```
2015-07-23 13:16:11 +01:00
Jakub Pawlowicz 6fc3ef425b Fixes #118 - decoding invalid entities.
Previous regular expression was too greedy in terms of matching
letters from outside of A-F hex scope, and matching letters when
not in hex mode.
2015-06-30 17:56:26 +02:00
Yorick Peterse 565e3da176 Added encoding comment in elements_spec.rb
This ensures that older Ruby versions don't poop their pants when
running these specs.
2015-06-29 21:09:33 +02:00
Yorick Peterse dde644cd79 Support for Unicode XML/HTML identifiers
Technically HTML only allows for ASCII names but restricting that
actually requires more work than just allowing it.
2015-06-29 21:08:01 +02:00
Laurence Lee 139985612b Lexer test for elements with inline dots. 2015-06-29 20:55:48 +02:00
Yorick Peterse 71960fff87 Added CSS :nth() pseudo class
This is a Nokogiri extension (as far as I'm aware) but it's useful
enough to also include in Oga. Selectors such as "foo:nth(2)" are simply
compiled to XPath "descendant::foo[position() = 2]".

Fixes #123
2015-06-29 20:51:38 +02:00
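The mapping named in 71960fff87 can be illustrated with a string rewrite. This is a sketch only: `nth_to_xpath` is a hypothetical helper that handles just the simple `name:nth(N)` form, whereas a real compiler would work on a parsed AST.

```ruby
# Hypothetical helper; supports only selectors of the form "name:nth(N)".
def nth_to_xpath(selector)
  match = /\A(?<name>[\w-]+):nth\((?<index>\d+)\)\z/.match(selector)
  raise ArgumentError, "unsupported selector: #{selector}" unless match

  "descendant::#{match[:name]}[position() = #{match[:index]}]"
end

puts nth_to_xpath('foo:nth(2)') # descendant::foo[position() = 2]
```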
Yorick Peterse d26b48feb4 CSS parsing support for commas
The lexer already had the basic plumbing in place but apparently I
completely forgot to also implement the required bits in the parser.

Fixes #121
2015-06-29 18:59:01 +02:00
Yorick Peterse 3b633ff41c Relax support for HTML unquoted attribute values
This allows for parsing of HTML such as:

    <a href=lol("javascript")></a>

Here the "href" attribute would have its value set to:

    lol("javascript")

Fixes #119
2015-06-29 16:35:48 +02:00
Tero Tasanen 0b4791b277 Ability to replace a node with another node or string
```
element = Oga::XML::Element.new(:name => 'div')
some_node.replace(element)
```

You can also pass a `String` to `replace` and the node will be replaced with
an `Oga::XML::Text` node

```
some_node.replace('this will replace the current node with a text node')
```

closes #115
2015-06-17 21:27:50 +03:00
Yorick Peterse 074b53c18c Fix entity encoding of attribute values
This ensures that single and double quotes are also encoded, previously
they would be left as is.

Fixes #113
2015-06-16 22:47:10 +02:00
Yorick Peterse 2c18a51ba9 Support for strict parsing of XML documents
Currently this only disables the automatic insertion of closing tags; in
the future this may also disable other features if deemed worth the
effort.

Fixes #107
2015-06-15 23:53:11 +02:00
Yorick Peterse fd307a0fcc Support HTML attributes without starting quotes
This allows the lexer to process input such as:

    <a href=foo"></a>

For XML input the lexer still expects properly opened/closed attribute
values.

Fixes #109
2015-06-08 06:46:08 +02:00
Yorick Peterse a76286b973 Support for spaces around attribute equal signs
This also takes care of making sure line numbers are incremented
properly.

Fixes #112
2015-06-08 06:34:49 +02:00
Yorick Peterse af7f2674af Decoding of entities with numbers
This ensures that entities such as "&frac12;" are decoded properly.
Previously this would be ignored as the regular expression used for this
only matched [a-zA-Z].

This was adapted from PR #111.
2015-06-07 17:42:24 +02:00
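The effect of a letters-only pattern, as described in af7f2674af, can be shown with two regular expressions. Both patterns, the `ENTITY_TABLE` hash, and `decode_entities` are assumptions made for illustration; they are not the expressions Oga uses.

```ruby
# Hypothetical patterns: the first mirrors a letters-only match, the
# second also accepts digits after the first letter.
LETTERS_ONLY = /&([a-zA-Z]+);/
WITH_DIGITS  = /&([a-zA-Z][a-zA-Z0-9]*);/

# Tiny illustrative table; a real decoder knows far more entities.
ENTITY_TABLE = { 'amp' => '&', 'frac12' => "\u00BD" }.freeze

# Replaces every known entity and leaves unknown ones untouched.
def decode_entities(text, pattern)
  text.gsub(pattern) do |match|
    ENTITY_TABLE.fetch(Regexp.last_match(1), match)
  end
end

before = decode_entities('&frac12; &amp;', LETTERS_ONLY)
after  = decode_entities('&frac12; &amp;', WITH_DIGITS)
```

With the letters-only pattern `&frac12;` is never matched, because the digit `1` appears where the pattern expects either another letter or the closing `;`.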
Yorick Peterse d2523a1082 Support whitespace in element closing tags
Fixes #108
2015-05-25 13:41:17 +02:00
Yorick Peterse d0d597e2d9 Allow script/template in various table elements
Fixes #105
2015-05-23 10:46:49 +02:00
Yorick Peterse 5182d0c488 Correct closing of unclosed, nested HTML elements
Previously HTML such as this would be lexed incorrectly:

    <div>
        <ul>
            <li>foo
        </ul>
        inside div
    </div>
    outside div

The lexer would see this as the following instead:

    <div>
        <ul>
            <li>foo</li>
            inside div
        </ul>
    outside div
    </div>

This commit exposes the name of the closing tag to
XML::Lexer#on_element_end (omitted for self closing tags). This can be
used to automatically close nested tags that were left open, ensuring
the above HTML is lexed correctly.

The new setup ignores namespace prefixes as these are not used in HTML,
XML in turn won't even run the code to begin with since it doesn't allow
one to leave out closing tags.
2015-05-23 09:59:50 +02:00
Yorick Peterse 8172de192c Dropped html_ prefix from HTML lexer specs 2015-05-23 09:48:45 +02:00
Yorick Peterse f587b49406 Move HTML lexer specs into spec/oga/html/lexer 2015-05-23 09:47:49 +02:00
Yorick Peterse c97c1b6899 Do not encode single/double quotes as entities
By encoding single/double quotes we can potentially break input, so let's
stop doing this. This now ensures that this:

    <foo>a"b</foo>

Is actually serialized back into the exact same markup instead of being
serialized into:

    <foo>a&quot;b</foo>
2015-05-21 11:23:44 +02:00
Yorick Peterse dc2e31e35b Added remaining HTML closing specs 2015-05-19 23:41:06 +02:00
Yorick Peterse 2f182a65fe HTML closing specs for <dt>/<dd> elements 2015-05-19 00:22:04 +02:00
Yorick Peterse 1ba801370f HTML closing specs for the <li> element 2015-05-18 21:49:36 +02:00
Yorick Peterse efeb38699a HTML closing specs for the "body" element 2015-05-18 21:44:00 +02:00
Yorick Peterse 5a74571536 Added HTML head closing specs 2015-05-18 00:32:19 +02:00
Yorick Peterse 2a1c5646f3 Reworked HTML colgroup closing specs 2015-05-18 00:32:09 +02:00
Yorick Peterse 81cf7ba9b6 Reworked HTML caption closing specs 2015-05-18 00:32:01 +02:00
Yorick Peterse 541fb2d5c3 Removed generated HTML closing specs 2015-05-18 00:31:48 +02:00
Yorick Peterse 132d112f5f Removed NodeNameSet class 2015-05-17 21:59:43 +02:00
Yorick Peterse ca16a2976e Added Blacklist/Whitelist classes
These will be used in favour of the NodeNameSet class.
2015-05-17 21:55:06 +02:00
Yorick Peterse 1c095ddaff Added more HTML closing rules for colgroup/caption 2015-05-12 23:14:48 +02:00
Yorick Peterse 1e0b7feb02 Recursive closing of parent HTML elements
When closing certain HTML elements the lexer should also close whatever
parent elements remain. For example, consider the following HTML:

    <table>
        <thead>
            <tr>
                <th>Foo
                <th>Bar
        <tbody>
            ...
        </tbody>
    </table>

Here the "<tbody>" element shouldn't only close the "<th>Bar" element
but also the parent "<tr>" and "<thead>" elements. This ensures we'd end
up with the following HTML:

    <table>
        <thead>
            <tr>
                <th>Foo</th>
                <th>Bar</th>
            </tr>
        </thead>
        <tbody>
            ...
        </tbody>
    </table>

Instead of garbage along the lines of this:

    <table>
        <thead>
            <tr>
                <th>Foo</th>
                <th>Bar</th>
        <tbody>
            ...
        </tbody>
    </table></tr></thead>

Fixes #99 (hopefully for good this time)
2015-05-12 00:35:00 +02:00
Yorick Peterse 4b1c296936 Automatic closing of certain HTML tags
This ensures that HTML such as this:

    <li>foo
    <li>bar

is parsed as this:

    <li>foo</li>
    <li>bar</li>

and not as this:

    <li>
        foo
        <li>bar</li>
    </li>

Fixes #97
2015-04-27 18:43:26 +02:00
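The behaviour described in 4b1c296936 can be sketched with a stack of open elements. Everything here is hypothetical: the `[:open, name]` token shape, the `AUTO_CLOSE` rule table, and `auto_close` are illustrations and do not reflect Oga's lexer, which is generated with Ragel.

```ruby
# Hypothetical rule table: opening the key closes an open element whose
# name is listed in the value.
AUTO_CLOSE = { 'li' => ['li'] }.freeze

# Takes [type, value] pairs and inserts the missing :close tokens.
def auto_close(tokens)
  stack = []
  output = []

  tokens.each do |type, value|
    if type == :open
      # Close the current element first when the rules demand it.
      if AUTO_CLOSE.fetch(value, []).include?(stack.last)
        output << [:close, stack.pop]
      end
      stack << value
    elsif type == :close
      stack.pop
    end

    output << [type, value]
  end

  # Close whatever is still open at the end of the input.
  output.concat(stack.reverse.map { |name| [:close, name] })
end

input = [[:open, 'li'], [:text, 'foo'], [:open, 'li'], [:text, 'bar']]
result = auto_close(input)
```

For the `<li>foo<li>bar` input above, `result` holds two sibling `li` elements rather than one nested inside the other.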
Yorick Peterse 4b21a2fadc Added NodeNameSet class
This class can be used to more easily create a Set containing both
lowercase and uppercase element names.
2015-04-22 00:54:29 +02:00
Yorick Peterse 8135074a62 Merged on_element_start with on_element_name
This makes it easier to automatically insert preceding tokens when
starting a new element as we now have access to the name. Previously
on_element_start would be invoked first which doesn't receive an
argument.
2015-04-21 23:38:06 +02:00
Yorick Peterse 853d804f34 Decoding of zero padded XML entities
This would previously fail due to the lack of an explicit base to use
for Integer().
2015-04-20 00:13:15 +02:00
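The pitfall behind 853d804f34 is that Ruby's `Integer()` treats a leading zero as an octal prefix unless a base is given. A minimal sketch; `decode_decimal_entity` is a hypothetical helper, not Oga's decoder:

```ruby
# Hypothetical helper: turns the digits of a decimal entity such as
# "&#038;" into the matching character.
def decode_decimal_entity(digits)
  [Integer(digits, 10)].pack('U')
end

# Without a base the leading zero selects octal, so "010" is 8 ...
octal_value = Integer('010')

# ... and "038" is rejected because 8 is not an octal digit.
implicit_base_failed = begin
  Integer('038')
  false
rescue ArgumentError
  true
end

# With an explicit base of 10 the padding is harmless.
decoded = decode_decimal_entity('038') # code point 38, an ampersand
```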
Yorick Peterse 13e2c3d82f Better handling of incorrect XML/HTML tags
The XML/HTML lexer is now capable of processing most invalid XML/HTML
(that I can think of at least). This is achieved by inserting missing
closing tags (where needed) and/or ignoring excessive closing tags. For
example, HTML such as this:

    <a></a></p>

Results in the following tokens:

    [:T_ELEM_START, nil, 1]
    [:T_ELEM_NAME, 'a', 1]
    [:T_ELEM_CLOSE, nil, 1]

In turn this HTML:

    <a>

Results in these tokens:

    [:T_ELEM_START, nil, 1]
    [:T_ELEM_NAME, 'a', 1]
    [:T_ELEM_CLOSE, nil, 1]

Fixes #84
2015-04-19 23:19:02 +02:00
Yorick Peterse da62fcd75d Decode XML/HTML entities in the SAX parser
This was broken when decoding was moved out of the Lexer class into
XML::Text and XML::Attribute.

Fixes #92
2015-04-18 22:03:44 +02:00
Yorick Peterse 73fbbfbdbd Use separate Ragel machines for script/style tags
Previously a single Ragel machine was used for processing HTML
script and style tags. This had the unfortunate side-effect that the
following was not parsed correctly (while being valid HTML):

    <script>
    var foo = "</style>";
    </script>

The same applied to style tags:

    <style>
    /* </script> */
    </style>

By using separate machines we can work around the above issue. The
downside is that this can produce multiple T_TEXT nodes, which have to
be stitched back together in the parser.
2015-04-16 01:45:39 +02:00
Yorick Peterse 6b779d7883 Handle lexing of stray quotes in element heads
This adds lexing support for HTML/XML such as:

    <foo bar="""></foo>

While technically invalid, some websites (e.g. yahoo.com) contain HTML
just like this.

The lexer handles this as follows:

1. When we're in the "element_head" machine, do business as usual until
   we bump into a "=".

2. Call (using Ragel's "fcall") the machine to use for processing the
   attribute value (if any).

3. In this machine quoted strings are processed. The moment a string has
   been processed the lexer jumps right back into the "element_head"
   machine. This ensures that any stray quotes are ignored instead of
   being processed as extra attribute values (eventually leading to
   parsing errors due to unbalanced quotes).
2015-04-15 22:33:53 +02:00
Yorick Peterse 9a0e31d0ae Fix for lexing newlines in doctypes
This also ensures that newlines are advanced properly.

Fixes #95
2015-04-15 20:22:14 +02:00
Yorick Peterse d892ce9787 Fix for lexing HTML quoted attrs followed by "/>"
This ensures that when using input such as <a href="foo"/> the "/" is
not part of the attribute value.
2015-04-15 01:47:08 +02:00
Yorick Peterse afbb585812 Lexing support for unquoted HTML attribute values
This adds support for HTML such as:

    <a href=foo>HTML is a child of Satan itself</a>

Fixes #94
2015-04-15 01:23:46 +02:00
Yorick Peterse e942086f2d Fixed counting of newlines in XML declarations 2015-04-15 00:22:58 +02:00
Yorick Peterse b2ea20ba61 Lex processing instructions in chunks
Similar to comments (ea8b4aa92f) and CDATA
tags (8acc7fc743) processing instructions
are now lexed in separate chunks _with_ proper support for streaming
input.

Related issue: #93
2015-04-15 00:11:57 +02:00
Yorick Peterse ea8b4aa92f Lex comments in chunks
Similar to this being added for CDATA tags in
8acc7fc743 comments are now also lexed in
chunks.

Related issue: #93
2015-04-14 23:11:22 +02:00
Yorick Peterse 8acc7fc743 Lex CDATA tags in chunks
Instead of using a single token (T_CDATA) for a CDATA tag the lexer now
uses 3 tokens:

1. T_CDATA_START
2. T_CDATA_BODY
3. T_CDATA_END

The T_CDATA_BODY token can occur multiple times and is turned into a
single value in the XML parser. This is similar to the way strings are
lexed.

By changing the way CDATA tags are lexed Oga can now lex CDATA tags
containing newlines when using an IO as input. For example, this would
previously fail:

    Oga.parse_xml(StringIO.new("<![CDATA[\nfoo]]>"))

Because IO input is read per line, the input for the lexer would be
as follows:

    "<![CDATA[\n"
    "foo]]>"

Related issues: #93
2015-04-14 22:45:55 +02:00
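The chunked approach from 8acc7fc743 can be sketched in plain Ruby. The token names come from the commit message; `lex_cdata_chunks` and its `[type, value]` token shape are assumptions, and the sketch relies on input being split at newlines only, so a `]]>` terminator is never cut in half.

```ruby
require 'stringio'

# Hypothetical line-based lexer that only understands CDATA tags.
def lex_cdata_chunks(io)
  tokens = []
  inside = false

  io.each_line do |line|
    rest = line

    until rest.empty?
      if inside
        # Everything up to "]]>" (or the end of the line) is body text.
        body, close, rest = rest.partition(']]>')
        tokens << [:T_CDATA_BODY, body] unless body.empty?

        unless close.empty?
          tokens << [:T_CDATA_END, nil]
          inside = false
        end
      else
        # Text outside CDATA tags is ignored in this sketch.
        _, open, rest = rest.partition('<![CDATA[')

        unless open.empty?
          tokens << [:T_CDATA_START, nil]
          inside = true
        end
      end
    end
  end

  tokens
end

tokens = lex_cdata_chunks(StringIO.new("<![CDATA[\nfoo]]>"))

# A parser would stitch the body chunks back into a single value.
body = tokens.select { |type, _| type == :T_CDATA_BODY }.map(&:last).join
```

The lexer state (`inside`) survives between lines, which is what allows a CDATA tag to span several reads.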
Yorick Peterse b42f9aaf32 Cache output of Element#available_namespaces
This cache is flushed whenever Element#register_namespace is called.
When this cache is flushed it's also recursively flushed for all child
elements. This makes calls to Element#register_namespace a bit more
expensive but in turn calls to Element#available_namespaces will be a
lot faster.
2015-04-12 20:22:33 +02:00
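The trade-off described in b42f9aaf32 (cheaper reads, more expensive writes) can be sketched as follows. `NamespaceCacheSketch` and its method bodies are hypothetical; only the method names `register_namespace` and `available_namespaces` come from the commit message.

```ruby
# Hypothetical element with a memoised namespace lookup.
class NamespaceCacheSketch
  attr_reader :parent, :children

  def initialize(parent = nil)
    @parent = parent
    @children = []
    @namespaces = {}
    @available = nil
    parent.children << self if parent
  end

  # Registers a namespace and flushes the cache of this element and of
  # all its descendants.
  def register_namespace(name, uri)
    @namespaces[name] = uri
    flush_namespaces_cache
  end

  # Returns the inherited and own namespaces, computed once per flush.
  def available_namespaces
    @available ||= begin
      inherited = parent ? parent.available_namespaces : {}
      inherited.merge(@namespaces)
    end
  end

  # Recursively clears the cached result.
  def flush_namespaces_cache
    @available = nil
    children.each(&:flush_namespaces_cache)
  end
end

root  = NamespaceCacheSketch.new
child = NamespaceCacheSketch.new(root)

before = child.available_namespaces # {} and now cached on the child
root.register_namespace('x', 'http://example.com/x')
after = child.available_namespaces  # recomputed after the flush
```

Without the recursive flush, `child` would keep returning its stale, empty result.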
Yorick Peterse fa838154fc Flush Element#namespace cache
When setting a new namespace name using Element#namespace_name= the
cache used by Element#namespace is flushed properly.
2015-04-11 19:20:50 +02:00
Yorick Peterse b0359b37e5 Cache Node#html? and Node#root_node
The results of these methods are now cached until a Node is moved into
another NodeSet. This reduces the time spent in the
xpath/evaluator/big_xml_average_bench.rb benchmark from roughly 10
seconds to roughly 5 seconds per iteration.
2015-04-11 19:12:26 +02:00
Yorick Peterse 4bdc8a3fdc Don't convert entities in script/style elements
In HTML the text of a script/style tag should be left untouched, no
entities must be converted. Doing so would break Javascript such as the
following:

    foo&&bar;

Such code is often the result of minifiers doing their dirty business.
2015-04-08 14:32:09 +02:00
Yorick Peterse 6a1010c287 Fixed decoding entities in attribute values
This was broken by introducing the process of lazy decoding of XML/HTML
entities. The new setup works similar to how XML::Text#text decodes any
entities that may be present.

Fixes #91
2015-04-07 21:18:22 +02:00
Yorick Peterse ef7f50137a Added Oga::EntityDecoder
This module removes some of the code duplication needed to determine
what entity decoder to use.
2015-04-07 21:18:15 +02:00
Yorick Peterse 0800654c96 Support lexing of carriage returns
Fixes #89.
2015-04-03 00:46:37 +02:00
Yorick Peterse 3176459307 Ignore declared namespaces in HTML documents
The HTML spec states that any declared namespaces, including the default
namespace are to be ignored.

This fixes #85
2015-03-26 22:38:39 +01:00
Yorick Peterse 5adeae18d0 XPath queries match nodes in the default namespace
When querying an XML document that explicitly defines the default XML
namespace the XPath evaluator now correctly matches all nodes within
that namespace if no namespace prefix is given in the query. Previously
this would always return an empty set.
2015-03-26 01:13:55 +01:00
Yorick Peterse f175414917 Added XML::Element#default_namespace? 2015-03-26 01:10:20 +01:00
Yorick Peterse b6fcd326ef Added XML::Node#html? and XML::Node#xml?
The former has been moved over from XML::Text, the latter just inverts
html?.
2015-03-26 01:02:32 +01:00
Yorick Peterse 4ad502958d Added XML::Attribute#==
Overwriting this method makes it easier to check if a given namespace
equals the default XML (and soon HTML) namespace.
2015-03-26 00:53:16 +01:00
Yorick Peterse f2d69af33b Distinguish default attribute/element namespaces
The previous commit messed this up because I wasn't fully awake.
2015-03-26 00:43:50 +01:00
Yorick Peterse 68ada997a8 Moved default namespace into Oga::XML
The default namespace is now located at Oga::XML::DEFAULT_NAMESPACE
instead of Oga::XML::Attribute::DEFAULT_NAMESPACE.
2015-03-26 00:35:28 +01:00
Yorick Peterse 66fa9f62ef Added LRU#maximum=/maximum
This allows one to change the maximum amount of keys stored in the
XPath/CSS caches, for example:

    Oga::XPath::Parser::CACHE.maximum = 2056
2015-03-23 00:26:48 +01:00
Yorick Peterse 2c4e490614 Added CSS/XPath Parser.parse_with_cache
This method parses and caches ASTs using Oga::LRU. Currently the default
of 1024 keys is used.

See #71 for more information.
2015-03-23 00:22:59 +01:00
Yorick Peterse 67d7d9af88 Added thread-safe LRU class
This class will be used for storing parsed XPath/CSS ASTs.

See #71 for more information.
2015-03-23 00:21:52 +01:00
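A thread-safe LRU of the kind mentioned in 67d7d9af88 can be sketched with a Hash and a Mutex. `LruSketch`, `get_or_set` and `key?` are hypothetical names; the default of 1024 keys and the `maximum` accessor are taken from the neighbouring commit messages, while the internals are assumptions.

```ruby
# Hypothetical LRU cache; relies on Ruby Hashes keeping insertion order.
class LruSketch
  attr_accessor :maximum

  def initialize(maximum = 1024)
    @maximum = maximum
    @cache = {}
    @mutex = Mutex.new
  end

  # Returns the cached value for the key, or stores the block's result.
  def get_or_set(key)
    @mutex.synchronize do
      if @cache.key?(key)
        # Re-insert the pair so it becomes the most recently used one.
        value = @cache.delete(key)
        @cache[key] = value
      else
        value = yield
        @cache[key] = value
        # Evict the least recently used pair once the limit is exceeded.
        @cache.shift if @cache.size > @maximum
      end

      value
    end
  end

  def key?(key)
    @mutex.synchronize { @cache.key?(key) }
  end
end

cache = LruSketch.new(2)
cache.get_or_set('a') { :ast_a }
cache.get_or_set('b') { :ast_b }
cache.get_or_set('a') { :unused } # hit: "a" becomes most recent
cache.get_or_set('c') { :ast_c }  # evicts "b", the least recent key
```

Holding the mutex while the block runs keeps the sketch simple, at the cost of serialising concurrent misses.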
Yorick Peterse 45d84d31da Renamed rspec helper files 2015-03-22 22:50:03 +01:00
Yorick Peterse 70e4942d3e CSS parser spec for "+ b" 2015-03-21 01:23:00 +01:00
Yorick Peterse 6039e1dbeb XPath parsing spec for axes with predicates 2015-03-21 01:23:00 +01:00
Yorick Peterse 62fa2a9cc5 Spec for XPath functions inside predicates. 2015-03-21 01:23:00 +01:00
Yorick Peterse 194d981996 XPath specs for paths with multiple members. 2015-03-21 01:22:59 +01:00
Yorick Peterse fdcd712ffe Don't use Array#uniq in NodeSet#initialize.
Removing this makes the process of parsing larger XML documents a bit faster.
The downside is that NodeSet#initialize will no longer filter out duplicate
nodes, though this is not something Oga itself relies upon.

Methods such as NodeSet#push still do ignore elements already present.
2015-03-21 01:22:59 +01:00
Yorick Peterse f83c03aaec Fixed typo in NodeSet spec. 2015-03-21 01:22:59 +01:00