1096 Commits

Author SHA1 Message Date
Yorick Peterse f814ec7170 Expanded CONTRIBUTING guide
Now includes more notes about specs, line wrapping, writing Git commit
messages and more.
2015-06-07 17:22:44 +02:00
Yorick Peterse 5951a6f187 Release 1.0.2 2015-06-03 06:53:31 +02:00
Yorick Peterse 4bfeea2590 Use require vs require_relative
See ruby-ll commit b27fe7cc109a39184ac984405a1e452868f3fac9 for a more
in-depth explanation of this.
2015-06-03 06:42:30 +02:00
Yorick Peterse d2523a1082 Support whitespace in element closing tags
Fixes #108
2015-05-25 13:41:17 +02:00
Yorick Peterse d0d597e2d9 Allow script/template in various table elements
Fixes #105
2015-05-23 10:46:49 +02:00
Yorick Peterse 5182d0c488 Correct closing of unclosed, nested HTML elements
Previous HTML such as this would be lexed incorrectly:

    <div>
        <ul>
            <li>foo
        </ul>
        inside div
    </div>
    outside div

The lexer would see this as the following instead:

    <div>
        <ul>
            <li>foo</li>
            inside div
        </ul>
    outside div
    </div>

This commit exposes the name of the closing tag to
XML::Lexer#on_element_end (omitted for self closing tags). This can be
used to automatically close nested tags that were left open, ensuring
the above HTML is lexed correctly.

The new setup ignores namespace prefixes as these are not used in HTML.
XML in turn won't even run the code to begin with since it doesn't allow
one to leave out closing tags.
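
A minimal sketch of the idea, not Oga's actual lexer code: keep a stack of
open element names and, when a closing tag arrives, close everything above
the matching element first. The helper `close_until` and the event arrays
are hypothetical names used only for illustration.

```ruby
# Hypothetical illustration; not Oga's real lexer.
# Closes every still-open element above `name`, then `name` itself.
def close_until(stack, name, events)
  # Ignore a stray closing tag that matches nothing on the stack.
  return unless stack.include?(name)

  until stack.empty?
    closed = stack.pop
    events << [:element_end, closed]
    break if closed == name
  end
end

stack  = []
events = []

# Input: <div><ul><li>foo</ul>inside div</div>
%w[div ul li].each do |name|
  stack << name
  events << [:element_start, name]
end

events << [:text, 'foo']
close_until(stack, 'ul', events)   # closes <li> first, then <ul>
events << [:text, 'inside div']
close_until(stack, 'div', events)  # closes <div>
```

With this approach the "inside div" text ends up inside the <div> and
outside the <ul>, matching the first HTML snippet above.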
2015-05-23 09:59:50 +02:00
Yorick Peterse 8172de192c Dropped html_ prefix from HTML lexer specs 2015-05-23 09:48:45 +02:00
Yorick Peterse f587b49406 Move HTML lexer specs into spec/oga/html/lexer 2015-05-23 09:47:49 +02:00
Yorick Peterse 04f431eee7 Add HTML lexer large input timing benchmark 2015-05-23 06:54:15 +02:00
Yorick Peterse ce10ac779d Move HTML lexer benchmarks to a separate directory 2015-05-23 06:53:11 +02:00
Yorick Peterse 3c6263d8de Updated list of elements that close <p> tags 2015-05-21 21:11:41 +02:00
Yorick Peterse 73855e6428 Compare with Nokogiri in the HTML parser bench 2015-05-21 20:59:48 +02:00
Yorick Peterse 1d88b063ac Set the ACL when uploading documentation
Apparently when you upload files using a different account (which can
access an S3 bucket based on the bucket policies) it uploads files as
private.
2015-05-21 15:47:29 +02:00
Yorick Peterse 04948eb211 Release 1.0.1 2015-05-21 11:42:06 +02:00
Yorick Peterse 2766d5f27f Pack HTML entities using "U*"
See https://github.com/YorickPeterse/oga/issues/90#issuecomment-89859273
for more details. Apparently I didn't fix this before.
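
For reference, a small sketch of how Array#pack behaves with the "U"
directive. The helper name `decode_codepoints` is hypothetical and is not
part of Oga; only Array#pack itself is standard Ruby.

```ruby
# "U" is the UTF-8 directive of Array#pack; the "*" count applies it to
# every element of the array instead of just the first one.
def decode_codepoints(codepoints)
  codepoints.pack('U*')
end

amp  = decode_codepoints([38])          # "&"
euro = decode_codepoints([0x20AC])      # the euro sign
pair = decode_codepoints([0x48, 0x69])  # "Hi"

# Without "*" only the first codepoint is packed.
first_only = [0x48, 0x69].pack('U')     # "H"
```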
2015-05-21 11:26:01 +02:00
Yorick Peterse c97c1b6899 Do not encode single/double quotes as entities
By encoding single/double quotes we can potentially break input, so let's
stop doing this. This now ensures that this:

    <foo>a"b</foo>

Is actually serialized back into the exact same output instead of being
serialized into:

    <foo>a&quot;b</foo>
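
A rough sketch of an escaping helper that leaves quotes alone. The name
`escape_text` and the exact set of escaped characters are assumptions made
for illustration; this is not Oga's actual implementation.

```ruby
# Hypothetical escaping table; quotes are intentionally absent.
ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }.freeze

def escape_text(text)
  # String#gsub with a Hash replaces each match with its mapped value.
  text.gsub(/[&<>]/, ESCAPES)
end

quoted  = escape_text('a"b')        # quotes pass through untouched
escaped = escape_text('1 < 2 & 3')  # the other characters are still encoded
```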
2015-05-21 11:23:44 +02:00
Yorick Peterse 098cd51ab7 Notes on how to run benchmarks 2015-05-20 23:32:52 +02:00
Yorick Peterse fc431ad5db Added dev dependencies to the contributing guide 2015-05-20 23:29:01 +02:00
Yorick Peterse 4cb6d7cdb6 Updated changelog date for 1.0 2015-05-20 22:46:05 +02:00
Yorick Peterse cab47f155c Release 1.0.0 2015-05-20 22:44:59 +02:00
Yorick Peterse c4f3d9e0fa Updated changelog regarding HTML5 support 2015-05-19 23:45:28 +02:00
Yorick Peterse dc2e31e35b Added remaining HTML closing specs 2015-05-19 23:41:06 +02:00
Yorick Peterse 2f182a65fe HTML closing specs for <dd>/<dt> elements 2015-05-19 00:22:04 +02:00
Yorick Peterse 1ba801370f HTML closing specs for the <li> element 2015-05-18 21:49:36 +02:00
Yorick Peterse efeb38699a HTML closing specs for the "body" element 2015-05-18 21:44:00 +02:00
Yorick Peterse 688a1fff0e Use blacklists/whitelists for HTML closing rules
This allows for more fine grained control over when to close certain
elements. For example, an unclosed <tr> element should be closed first
when bumping into any element other than <td> or <th>. Using the old
NodeNameSet this would mean having to list every possible HTML element
out there. Using this new setup one can just create a whitelist of the
<td> and <th> elements.
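
A minimal sketch of what such classes could look like. The method name
`allow?`, the constant `TR_CHILDREN` and the helper `close_tr_before?` are
assumptions for illustration; the real Oga classes may differ.

```ruby
require 'set'

# Hypothetical sketch: only the listed names are allowed.
class Whitelist
  def initialize(names)
    @names = Set.new(names)
  end

  def allow?(name)
    @names.include?(name)
  end
end

# Hypothetical sketch: everything except the listed names is allowed.
class Blacklist < Whitelist
  def allow?(name)
    !@names.include?(name)
  end
end

# Elements that may appear inside an open <tr> without closing it.
TR_CHILDREN = Whitelist.new(%w[td th])

# An unclosed <tr> is closed before any element that is not whitelisted.
def close_tr_before?(name)
  !TR_CHILDREN.allow?(name)
end
```

The whitelist keeps the rule for <tr> down to two names instead of a list
of every other HTML element.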
2015-05-18 00:32:29 +02:00
Yorick Peterse 5a74571536 Added HTML head closing specs 2015-05-18 00:32:19 +02:00
Yorick Peterse 2a1c5646f3 Reworked HTML colgroup closing specs 2015-05-18 00:32:09 +02:00
Yorick Peterse 81cf7ba9b6 Reworked HTML caption closing specs 2015-05-18 00:32:01 +02:00
Yorick Peterse 541fb2d5c3 Removed generated HTML closing specs 2015-05-18 00:31:48 +02:00
Yorick Peterse 132d112f5f Removed NodeNameSet class 2015-05-17 21:59:43 +02:00
Yorick Peterse cec8798694 Mark HTML_VOID_ELEMENTS as private 2015-05-17 21:59:14 +02:00
Yorick Peterse bcc101b819 Use Whitelist for HTML_VOID_ELEMENTS 2015-05-17 21:59:00 +02:00
Yorick Peterse 596a9b18d6 Use Whitelist for LITERAL_HTML_ELEMENTS 2015-05-17 21:56:38 +02:00
Yorick Peterse ca16a2976e Added Blacklist/Whitelist classes
These will be used in favour of the NodeNameSet class.
2015-05-17 21:55:06 +02:00
Yorick Peterse 3e337faa1d Added license change to the 1.0 changelog 2015-05-16 00:13:13 +02:00
Yorick Peterse 928c8c0232 Updated Gemspec license 2015-05-15 23:57:50 +02:00
Yorick Peterse 0a7242aed4 Change license from MIT to MPL 2.0
While the MIT license is a fantastic license for those too lazy (or
unable) to understand more complex licenses it's too lax when it comes
to protecting authors (= me). For example, there are no clauses regarding
patents or ownership of source code. This means that patent trolls
could, in theory, drag me to court.

Of course one can still do that when using the MPL, but at least it has
an explicit clause regarding patents. The MPL also provides a nice
balance between the MIT license and the Apache license. I don't like the
Apache license as it requires listing any significant changes in every
changed file.

In short, I don't really care what people do with my software (they
could sell it for millions for all I care), as long as they don't drag
me to court or otherwise hold me accountable for something.
2015-05-15 23:48:18 +02:00
Yorick Peterse 1c095ddaff Added more HTML closing rules for colgroup/caption 2015-05-12 23:14:48 +02:00
Yorick Peterse 7d9604fd93 Added 1e0b7f to the 1.0 changelog 2015-05-12 00:44:51 +02:00
Yorick Peterse 1e0b7feb02 Recursive closing of parent HTML elements
When closing certain HTML elements the lexer should also close whatever
parent elements remain. For example, consider the following HTML:

    <table>
        <thead>
            <tr>
                <th>Foo
                <th>Bar
        <tbody>
            ...
        </tbody>
    </table>

Here the "<tbody>" element shouldn't only close the "<th>Bar" element
but also the parent "<tr>" and "<thead>" elements. This ensures we'd end
up with the following HTML:

    <table>
        <thead>
            <tr>
                <th>Foo</th>
                <th>Bar</th>
            </tr>
        </thead>
        <tbody>
            ...
        </tbody>
    </table>

Instead of garbage along the lines of this:

    <table>
        <thead>
            <tr>
                <th>Foo</th>
                <th>Bar</th>
        <tbody>
            ...
        </tbody>
    </table></tr></thead>

Fixes #99 (hopefully for good this time)
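
The recursive closing can be sketched roughly as follows. The `ALLOWS`
table and `open_element` are hypothetical and only cover the table example
above; they are not taken from Oga's source.

```ruby
# Hypothetical rules: for an open element, which new elements may nest in it.
# Elements without a rule (such as <table> here) never get closed implicitly.
ALLOWS = {
  'thead' => ->(name) { name == 'tr' },
  'tr'    => ->(name) { %w[td th].include?(name) },
  'th'    => ->(name) { !%w[td th tr thead tbody tfoot].include?(name) }
}.freeze

# Pops every open element that cannot contain `name`, then pushes `name`.
# Returns the names that were closed, innermost first.
def open_element(stack, name)
  closed = []

  while (rule = ALLOWS[stack.last]) && !rule.call(name)
    closed << stack.pop
  end

  stack << name
  closed
end

stack = []
%w[table thead tr th].each { |name| open_element(stack, name) }

closed_by_th    = open_element(stack, 'th')     # closes only the first <th>
closed_by_tbody = open_element(stack, 'tbody')  # closes <th>, <tr> and <thead>
```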
2015-05-12 00:35:00 +02:00
Yorick Peterse 11c9b69847 Clarify the name a bit more in the README 2015-05-11 21:36:21 +02:00
Yorick Peterse df96b3d3bb Define the public API using YARD/semver 2015-05-11 21:34:34 +02:00
Yorick Peterse 039edee9ec Removed native extensions section from the README
This is already covered in CONTRIBUTING.md.
2015-05-11 21:19:53 +02:00
Yorick Peterse 99608ec159 Prepared changelog for 1.0 2015-05-11 11:28:05 +02:00
Yorick Peterse ecdeeacd76 Use Symbol#equal? instead of Symbol#==
At least on JRuby and Rubinius this can be quite a bit faster. On MRI
the difference is not really significant.
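
A quick way to compare the two, as a sketch: symbols with the same name are
the same object, so an identity check gives the same answer as an equality
check. The timings depend entirely on the Ruby implementation and machine,
so none are shown here.

```ruby
require 'benchmark'

name  = :foo
other = :foo

# Both comparisons agree because equal symbols are the same object.
same_value    = (name == other)
same_identity = name.equal?(other)

iterations = 1_000_000
eq_time    = Benchmark.realtime { iterations.times { name == other } }
equal_time = Benchmark.realtime { iterations.times { name.equal?(other) } }

puts format('==: %.3fs, equal?: %.3fs', eq_time, equal_time)
```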
2015-05-07 23:03:03 +02:00
Yorick Peterse 5c7c4a6110 Don't use a splat with AST::Node#to_a
By using AST::Node#children directly with a splat we save ourselves an
extra method call. This in turn speeds up both the
xpath/evaluator/big_xml_average_bench.rb and
xpath/evaluator/node_matches_bench.rb benchmarks a little bit.
2015-05-07 01:23:27 +02:00
Yorick Peterse 0298e7068c Don't use Namespace#to_s when matching namespaces
This is a waste of time as it allocates a new String on every call.
2015-05-07 01:04:03 +02:00
Yorick Peterse b9145d83f8 Less html? calls in Element#available_namespaces
Previously it would always call the "html?" method, even if the
available namespaces were already set.
2015-05-07 01:02:59 +02:00
Yorick Peterse b5e63dc50e Improved perf of XPath::Evaluator#node_matches?
Using the benchmark xpath/evaluator/node_matches_bench.rb the results
prior to this commit were as follows for 3 cases:

    name only:          737633 i/s
    namespace wildcard: 612196 i/s
    name wildcard:      516030 i/s

With this commit said numbers have changed to the following:

    name only:          746086  i/s
    namespace wildcard: 1097168 i/s
    name wildcard:      1151255 i/s

This results in the following increase of performance for each case:

    name only:          1.011x (insignificant)
    namespace wildcard: 1.79x
    name wildcard:      2.23x
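
These factors are simply the new iterations per second divided by the old
ones. A small sketch that recomputes them from the numbers above (the
variable names are illustrative only):

```ruby
before = {
  'name only'          => 737_633,
  'namespace wildcard' => 612_196,
  'name wildcard'      => 516_030
}

after = {
  'name only'          => 746_086,
  'namespace wildcard' => 1_097_168,
  'name wildcard'      => 1_151_255
}

# Speedup = new i/s divided by old i/s, rounded to two decimals.
speedups = before.map do |label, ips|
  [label, (after[label] / ips.to_f).round(2)]
end.to_h

speedups.each { |label, factor| puts "#{label}: #{factor}x" }
```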

In the benchmark xpath/evaluator/big_xml_average_bench.rb the difference
isn't really noticeable as said benchmark only queries elements by name,
of which the performance hasn't really improved.
2015-05-07 00:05:25 +02:00