1181 Commits

Author SHA1 Message Date
Yorick Peterse c97c1b6899 Do not encode single/double quotes as entities
By encoding single/double quotes we can potentially break input, so let's
stop doing this. This now ensures that this:

    <foo>a"b</foo>

Is actually serialized back into the exact same string instead of being
serialized into:

    <foo>a&quot;b</foo>
2015-05-21 11:23:44 +02:00
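The quote handling described in c97c1b6899 can be sketched roughly as follows. This is a minimal illustration, not Oga's actual code: the `TEXT_ESCAPES` table and the `escape_text` helper are hypothetical names, and the sketch assumes only `&`, `<` and `>` need escaping in text nodes.

```ruby
# Hypothetical helper (not Oga's API): escape only the characters that
# must be escaped inside XML text, leaving single/double quotes untouched.
TEXT_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }.freeze

def escape_text(text)
  # String#gsub accepts a Hash and replaces each match with its value.
  text.gsub(/[&<>]/, TEXT_ESCAPES)
end

puts "<foo>#{escape_text('a"b')}</foo>"
```

Because quotes are not in the table, the `a"b` input comes back unchanged.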
Yorick Peterse 098cd51ab7 Notes on how to run benchmarks 2015-05-20 23:32:52 +02:00
Yorick Peterse fc431ad5db Added dev dependencies to the contributing guide 2015-05-20 23:29:01 +02:00
Yorick Peterse 4cb6d7cdb6 Updated changelog date for 1.0 2015-05-20 22:46:05 +02:00
Yorick Peterse cab47f155c Release 1.0.0 2015-05-20 22:44:59 +02:00
Yorick Peterse c4f3d9e0fa Updated changelog regarding HTML5 support 2015-05-19 23:45:28 +02:00
Yorick Peterse dc2e31e35b Added remaining HTML closing specs 2015-05-19 23:41:06 +02:00
Yorick Peterse 2f182a65fe HTML closing specs for <dt>/<dd> elements 2015-05-19 00:22:04 +02:00
Yorick Peterse 1ba801370f HTML closing specs for the <li> element 2015-05-18 21:49:36 +02:00
Yorick Peterse efeb38699a HTML closing specs for the "body" element 2015-05-18 21:44:00 +02:00
Yorick Peterse 688a1fff0e Use blacklists/whitelists for HTML closing rules
This allows for more fine grained control over when to close certain
elements. For example, an unclosed <tr> element should be closed first
when bumping into any element other than <td> or <th>. Using the old
NodeNameSet this would mean having to list every possible HTML element
out there. Using this new setup one can just create a whitelist of the
<td> and <th> elements.
2015-05-18 00:32:29 +02:00
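The whitelist/blacklist idea from 688a1fff0e might look something like the sketch below. The class bodies, the `allow?` method, the `TR_CHILDREN` constant and the `close_tr_before?` helper are all assumptions made for illustration; only the class names Whitelist and Blacklist come from the commit log.

```ruby
require 'set'

# Hypothetical sketch: a whitelist allows only the listed element names.
# Both lowercase and uppercase names are stored, as HTML is case-insensitive.
class Whitelist
  def initialize(names)
    @names = Set.new(names + names.map(&:upcase))
  end

  def allow?(name)
    @names.include?(name)
  end
end

# A blacklist allows everything except the listed element names.
class Blacklist < Whitelist
  def allow?(name)
    !super
  end
end

# Elements that may start while a <tr> is still open (assumed rule).
TR_CHILDREN = Whitelist.new(%w[td th])

# An unclosed <tr> is closed first when any other element starts.
def close_tr_before?(name)
  !TR_CHILDREN.allow?(name)
end
```

With this setup the `<tr>` rule lists two names instead of every other HTML element.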
Yorick Peterse 5a74571536 Added HTML head closing specs 2015-05-18 00:32:19 +02:00
Yorick Peterse 2a1c5646f3 Reworked HTML colgroup closing specs 2015-05-18 00:32:09 +02:00
Yorick Peterse 81cf7ba9b6 Reworked HTML caption closing specs 2015-05-18 00:32:01 +02:00
Yorick Peterse 541fb2d5c3 Removed generated HTML closing specs 2015-05-18 00:31:48 +02:00
Yorick Peterse 132d112f5f Removed NodeNameSet class 2015-05-17 21:59:43 +02:00
Yorick Peterse cec8798694 Mark HTML_VOID_ELEMENTS as private 2015-05-17 21:59:14 +02:00
Yorick Peterse bcc101b819 Use Whitelist for HTML_VOID_ELEMENTS 2015-05-17 21:59:00 +02:00
Yorick Peterse 596a9b18d6 Use Whitelist for LITERAL_HTML_ELEMENTS 2015-05-17 21:56:38 +02:00
Yorick Peterse ca16a2976e Added Blacklist/Whitelist classes
These will be used in favour of the NodeNameSet class.
2015-05-17 21:55:06 +02:00
Yorick Peterse 3e337faa1d Added license change to the 1.0 changelog 2015-05-16 00:13:13 +02:00
Yorick Peterse 928c8c0232 Updated Gemspec license 2015-05-15 23:57:50 +02:00
Yorick Peterse 0a7242aed4 Change license from MIT to MPL 2.0
While the MIT license is a fantastic license for those too lazy (or
unable) to understand more complex licenses it's too lax when it comes
to protecting authors (= me). For example, there are no clauses regarding
patents or ownership of source code. This means that patent trolls
could, in theory, drag me to court.

Of course one can still do that when using the MPL, but at least it has
an explicit clause regarding patents. The MPL also provides a nice
balance between the MIT license and the Apache license. I don't like the
Apache license as it requires listing any significant changes in every
changed file.

In short, I don't really care what people do with my software (they
could sell it for millions for all I care), as long as they don't drag
me to court or otherwise hold me accountable for something.
2015-05-15 23:48:18 +02:00
Yorick Peterse 1c095ddaff Added more HTML closing rules for colgroup/caption 2015-05-12 23:14:48 +02:00
Yorick Peterse 7d9604fd93 Added 1e0b7f to the 1.0 changelog 2015-05-12 00:44:51 +02:00
Yorick Peterse 1e0b7feb02 Recursive closing of parent HTML elements
When closing certain HTML elements the lexer should also close whatever
parent elements remain. For example, consider the following HTML:

    <table>
        <thead>
            <tr>
                <th>Foo
                <th>Bar
        <tbody>
            ...
        </tbody>
    </table>

Here the "<tbody>" element shouldn't only close the "<th>Bar" element
but also the parent "<tr>" and "<thead>" elements. This ensures we'd end
up with the following HTML:

    <table>
        <thead>
            <tr>
                <th>Foo</th>
                <th>Bar</th>
            </tr>
        </thead>
        <tbody>
            ...
        </tbody>
    </table>

Instead of garbage along the lines of this:

    <table>
        <thead>
            <tr>
                <th>Foo</th>
                <th>Bar</th>
        <tbody>
            ...
        </tbody>
    </table></tr></thead>

Fixes #99 (hopefully for good this time)
2015-05-12 00:35:00 +02:00
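The recursive closing described in 1e0b7feb02 can be modelled with a stack of open element names. The sketch below is an illustration under assumptions: the `RULES` table and `open_element` method are hypothetical, and the rules cover only the table elements from the example rather than the full set of HTML closing rules.

```ruby
# Hypothetical rules: each lambda answers "may `child` start while this
# element is still open?". Elements without a rule never auto-close.
RULES = {
  'th'    => ->(child) { !%w[th td tr thead tbody tfoot].include?(child) },
  'tr'    => ->(child) { %w[td th].include?(child) },
  'thead' => ->(child) { child == 'tr' }
}.freeze

# Opens `name`, first closing every open parent that does not allow it.
# Returns the names of the elements that were closed, innermost first.
def open_element(stack, name)
  closed = []

  while (rule = RULES[stack.last]) && !rule.call(name)
    closed << stack.pop
  end

  stack << name
  closed
end

stack  = %w[table thead tr th]
closed = open_element(stack, 'tbody')

p closed # the <th>, <tr> and <thead> elements are closed, in that order
p stack  # only <table> and the new <tbody> remain open
```

The loop keeps walking up the stack until it reaches an element that accepts the new child, which is what closes `<tr>` and `<thead>` as well as `<th>`.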
Yorick Peterse 11c9b69847 Clarify the name a bit more in the README 2015-05-11 21:36:21 +02:00
Yorick Peterse df96b3d3bb Define the public API using YARD/semver 2015-05-11 21:34:34 +02:00
Yorick Peterse 039edee9ec Removed native extensions section from the README
This is already covered in CONTRIBUTING.md.
2015-05-11 21:19:53 +02:00
Yorick Peterse 99608ec159 Prepared changelog for 1.0 2015-05-11 11:28:05 +02:00
Yorick Peterse ecdeeacd76 Use Symbol#equal? instead of Symbol#==
At least on JRuby and Rubinius this can be quite a bit faster. On MRI
the difference is not really significant.
2015-05-07 23:03:03 +02:00
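As background for ecdeeacd76: Ruby symbols with the same name are the same object, so an identity check gives the same answer as `==`. The sketch below only illustrates that equivalence plus a rough timing harness; the variable names are made up, and the timings it prints depend on the Ruby implementation.

```ruby
require 'benchmark'

left  = :foo
right = 'foo'.to_sym

# Both comparisons agree because symbols with the same name are one object.
same_value    = left == right
same_identity = left.equal?(right)

iterations = 1_000_000

Benchmark.bm(8) do |bm|
  bm.report('==')     { iterations.times { left == right } }
  bm.report('equal?') { iterations.times { left.equal?(right) } }
end
```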
Yorick Peterse 5c7c4a6110 Don't use a splat with AST::Node#to_a
By using AST::Node#children directly with a splat we save ourselves an
extra method call. This in turn speeds up both the
xpath/evaluator/big_xml_average_bench.rb and
xpath/evaluator/node_matches_bench.rb benchmarks a little bit.
2015-05-07 01:23:27 +02:00
Yorick Peterse 0298e7068c Don't use Namespace#to_s when matching namespaces
This is a waste of time as it allocates a new String on every call.
2015-05-07 01:04:03 +02:00
Yorick Peterse b9145d83f8 Fewer html? calls in Element#available_namespaces
Previously it would always call the "html?" method, even if the
available namespaces were already set.
2015-05-07 01:02:59 +02:00
Yorick Peterse b5e63dc50e Improved perf of XPath::Evaluator#node_matches?
Using the benchmark xpath/evaluator/node_matches_bench.rb the results
prior to this commit were as follows for 3 cases:

    name only:          737633 i/s
    namespace wildcard: 612196 i/s
    name wildcard:      516030 i/s

With this commit said numbers have changed to the following:

    name only:          746086  i/s
    namespace wildcard: 1097168 i/s
    name wildcard:      1151255 i/s

This results in the following increase of performance for each case:

    name only:          1.011x (insignificant)
    namespace wildcard: 1.79x
    name wildcard:      2.23x

In the benchmark xpath/evaluator/big_xml_average_bench.rb the difference
isn't really noticeable as said benchmark only queries elements by names,
of which the performance hasn't really improved.
2015-05-07 00:05:25 +02:00
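The speedup factors quoted in b5e63dc50e follow from dividing the new iterations per second by the old ones. A short worked check, using only the numbers from the commit message (the variable names are illustrative):

```ruby
# Iterations per second before and after the commit, as reported above.
before = { 'name only' => 737_633, 'namespace wildcard' => 612_196, 'name wildcard' => 516_030 }
after  = { 'name only' => 746_086, 'namespace wildcard' => 1_097_168, 'name wildcard' => 1_151_255 }

# Speedup factor = new i/s divided by old i/s.
speedups = before.to_h { |name, ips| [name, after.fetch(name) / ips.to_f] }

speedups.each { |name, factor| puts format('%-20s %.3fx', "#{name}:", factor) }
```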
Yorick Peterse 361374c813 IPS benchmark for XPath::Evaluator#node_matches? 2015-05-07 00:04:40 +02:00
Yorick Peterse 69180ff686 Extra closing rules for caption/colgroup/head/body
Fixes #99
2015-05-03 01:09:07 +02:00
Yorick Peterse dc82953f1a Use "tags when left out" in the HTML5 section 2015-04-28 00:08:51 +02:00
Yorick Peterse b858ff75df Clarify lack of inserting html/head/body HTML tags 2015-04-28 00:08:12 +02:00
Yorick Peterse 4b1c296936 Automatic closing of certain HTML tags
This ensures that HTML such as this:

    <li>foo
    <li>bar

is parsed as this:

    <li>foo</li>
    <li>bar</li>

and not as this:

    <li>
        foo
        <li>bar</li>
    </li>

Fixes #97
2015-04-27 18:43:26 +02:00
Yorick Peterse 4b21a2fadc Added NodeNameSet class
This class can be used to more easily create a Set containing both
lowercase and uppercase element names.
2015-04-22 00:54:29 +02:00
Yorick Peterse 8135074a62 Merged on_element_start with on_element_name
This makes it easier to automatically insert preceding tokens when
starting a new element as we now have access to the name. Previously
on_element_start would be invoked first which doesn't receive an
argument.
2015-04-21 23:38:06 +02:00
Yorick Peterse 853d804f34 Decoding of zero padded XML entities
This would previously fail due to the lack of an explicit base to use
for Integer().
2015-04-20 00:13:15 +02:00
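The base problem mentioned in 853d804f34 is easy to reproduce: without an explicit base, `Integer()` treats a leading zero as an octal prefix. The `decode_numeric_entity` helper below is a hypothetical sketch for illustration and not Oga's actual decoder.

```ruby
# Hypothetical decoder for numeric entities such as "&#038;" or "&#x26;".
def decode_numeric_entity(entity)
  case entity
  when /\A&#x(\h+);\z/i
    [Integer(Regexp.last_match(1), 16)].pack('U')
  when /\A&#(\d+);\z/
    # The explicit base 10 keeps "038" from being parsed as octal.
    [Integer(Regexp.last_match(1), 10)].pack('U')
  else
    entity
  end
end

begin
  Integer('038') # leading zero means octal, and 8 is not an octal digit
rescue ArgumentError => error
  puts "without a base: #{error.message}"
end

puts decode_numeric_entity('&#038;')
```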
Yorick Peterse 13e2c3d82f Better handling of incorrect XML/HTML tags
The XML/HTML lexer is now capable of processing most invalid XML/HTML
(that I can think of at least). This is achieved by inserting missing
closing tags (where needed) and/or ignoring excessive closing tags. For
example, HTML such as this:

    <a></a></p>

Results in the following tokens:

    [:T_ELEM_START, nil, 1]
    [:T_ELEM_NAME, 'a', 1]
    [:T_ELEM_CLOSE, nil, 1]

In turn this HTML:

    <a>

Results in these tokens:

    [:T_ELEM_START, nil, 1]
    [:T_ELEM_NAME, 'a', 1]
    [:T_ELEM_CLOSE, nil, 1]

Fixes #84
2015-04-19 23:19:02 +02:00
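The balancing behaviour from 13e2c3d82f (insert missing closing tags, ignore excessive ones) can be approximated with a counter of open elements. This is a simplified sketch under assumptions: the `balance` method and its `[type, name]` input events are hypothetical, and every token uses line number 1 to match the examples above.

```ruby
# Hypothetical sketch: turns [:open, name] / [:close, name] events into
# balanced tokens in the format shown in the commit message.
def balance(events)
  open_count = 0
  tokens     = []

  events.each do |type, name|
    case type
    when :open
      open_count += 1
      tokens << [:T_ELEM_START, nil, 1] << [:T_ELEM_NAME, name, 1]
    when :close
      next if open_count.zero? # excessive closing tag: ignore it

      open_count -= 1
      tokens << [:T_ELEM_CLOSE, nil, 1]
    end
  end

  # Insert closing tokens for elements that were never closed.
  open_count.times { tokens << [:T_ELEM_CLOSE, nil, 1] }
  tokens
end

p balance([[:open, 'a'], [:close, 'a'], [:close, 'p']]) # <a></a></p>
p balance([[:open, 'a']])                               # <a>
```

Both inputs produce the same three tokens, matching the two examples in the commit message.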
Yorick Peterse 84e1bfc955 Release 0.3.4 2015-04-19 22:19:02 +02:00
Yorick Peterse da62fcd75d Decode XML/HTML entities in the SAX parser
This was broken when decoding was moved out of the Lexer class into
XML::Text and XML::Attribute.

Fixes #92
2015-04-18 22:03:44 +02:00
Yorick Peterse 611beb78c7 Release 0.3.3 2015-04-18 20:49:40 +02:00
Yorick Peterse 73fbbfbdbd Use separate Ragel machines for script/style tags
Previously a single Ragel machine was used for processing HTML
script and style tags. This had the unfortunate side-effect that the
following was not parsed correctly (while being valid HTML):

    <script>
    var foo = "</style>";
    </script>

The same applied to style tags:

    <style>
    /* </script> */
    </style>

By using separate machines we can work around the above issue. The
downside is that this can produce multiple T_TEXT nodes, which have to
be stitched back together in the parser.
2015-04-16 01:45:39 +02:00
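The stitching of multiple T_TEXT nodes mentioned in 73fbbfbdbd could be done along these lines. The sketch assumes tokens are `[type, value, line]` arrays, as in the examples elsewhere in this log; the `merge_text` method is a hypothetical name and works on a token list rather than inside the real parser.

```ruby
# Hypothetical sketch: merge runs of adjacent T_TEXT tokens into one token,
# keeping the line number of the first token in each run.
def merge_text(tokens)
  tokens.each_with_object([]) do |(type, value, line), merged|
    last = merged.last

    if type == :T_TEXT && last && last[0] == :T_TEXT
      last[1] += value
    else
      merged << [type, value, line]
    end
  end
end

tokens = [
  [:T_ELEM_START, nil, 1],
  [:T_ELEM_NAME, 'script', 1],
  [:T_TEXT, 'var foo = "', 2],
  [:T_TEXT, '</style>', 2],
  [:T_TEXT, '";', 2],
  [:T_ELEM_CLOSE, nil, 3]
]

p merge_text(tokens)
```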
Andrei Botalov 2d43e459a1 Update links to discontinued W3C document with a spec
http://www.w3.org/TR/html-markup is marked as discontinued
2015-04-15 23:56:25 +03:00
Yorick Peterse 6b779d7883 Handle lexing of stray quotes in element heads
This adds lexing support for HTML/XML such as:

    <foo bar="""></foo>

While technically invalid, some websites (e.g. yahoo.com) contain HTML
just like this.

The lexer handles this as follows:

1. When we're in the "element_head" machine, do business as usual until
   we bump into a "=".

2. Call (using Ragel's "fcall") the machine to use for processing the
   attribute value (if any).

3. In this machine quoted strings are processed. The moment a string has
   been processed the lexer jumps right back into the "element_head"
   machine. This ensures that any stray quotes are ignored instead of
   being processed as extra attribute values (eventually leading to
   parsing errors due to unbalanced quotes).
2015-04-15 22:33:53 +02:00
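The three steps from 6b779d7883 can be imitated in plain Ruby with StringScanner. This is only an approximation of the control flow: the real lexer is a set of Ragel machines, while the `lex_attributes` method below is a hypothetical stand-in that receives the text of an element head after the element name.

```ruby
require 'strscan'

# Hypothetical sketch of the element head handling: read a name, and on "="
# process exactly one quoted value before returning to the head loop.
def lex_attributes(head)
  scanner = StringScanner.new(head)
  attrs   = {}

  until scanner.eos?
    if (name = scanner.scan(/[\w:-]+/))
      value = nil

      # Steps 1 and 2: on "=", hand over to the attribute value handling.
      if scanner.skip(/\s*=\s*/) && scanner.scan(/"([^"]*)"|'([^']*)'/)
        # Step 3: one string has been processed, go back to the head.
        value = scanner[1] || scanner[2]
      end

      attrs[name] = value
    else
      scanner.getch # whitespace or a stray quote: skipped
    end
  end

  attrs
end

p lex_attributes('bar="""')
```

For the `bar="""` input the first two quotes form an empty value and the third quote is skipped in the head loop instead of opening a new string.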