961 Commits

Author SHA1 Message Date
Yorick Peterse c36b35ac0f Skip ownership iteration when there's no owner.
There's no point in iterating over all the nodes and assigning ownership if
there's no owner to begin with.
2015-03-21 01:22:59 +01:00
Yorick Peterse f83c03aaec Fixed typo in NodeSet spec. 2015-03-21 01:22:59 +01:00
Yorick Peterse 9621fe1fc8 Moved changelog to the root directory. 2015-03-21 01:22:59 +01:00
Yorick Peterse 006ef4d51a Port over most of the old XML error handling.
Some messages are a bit different due to ruby-ll's error handling; other than
that it's largely the same stuff as before.
2015-03-21 01:22:59 +01:00
Yorick Peterse 1a326fc516 Remove Racc based XML parser. 2015-03-21 01:22:59 +01:00
Yorick Peterse 1b9a4db268 Depend on ruby-ll 1.1 or newer. 2015-03-21 01:22:59 +01:00
Yorick Peterse d8b9725b82 Fixed SAX parsing of XML attributes.
This was utterly broken, mainly due to me overlooking it. There are now 2 new
callbacks to handle this properly:

* on_attribute: to handle a single attribute/value pair
* on_attributes: to handle a collection of attributes (as returned by
  on_attribute)

By default on_attribute returns a Hash; on_attributes in turn merges all
attribute hashes into a single one. This ensures that on_element _actually_
receives the attributes as a Hash, instead of an Array with random
nil/XML::Attribute values.
2015-03-21 01:22:59 +01:00
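
The two callbacks described in d8b9725b82 can be sketched roughly as follows. This is a simplified illustration, not Oga's actual code: the class name `AttributeCollector` and the two-argument `on_attribute` signature are assumptions made for the example.

```ruby
# Hypothetical stand-in for a SAX handler. The callback names follow the
# commit message; the class and the exact signatures are assumptions.
class AttributeCollector
  # Handles a single attribute/value pair and returns it as a Hash.
  def on_attribute(name, value)
    { name => value }
  end

  # Merges the Hashes returned by on_attribute into a single Hash.
  def on_attributes(attrs)
    attrs.each_with_object({}) { |pair, merged| merged.merge!(pair) }
  end
end

collector = AttributeCollector.new

pairs = [
  collector.on_attribute('class', 'note'),
  collector.on_attribute('id', 'n1')
]

# This merged Hash is what an on_element callback would receive.
merged = collector.on_attributes(pairs)
```

Returning a Hash per attribute keeps the merge step trivial, which is presumably why on_element can rely on getting a single Hash.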
Yorick Peterse 605d565104 Use sax_parse_html for HTML documents.
I suspect the only reason this test ever passed was due to Racc's error handling.
Either way this was using the wrong method.
2015-03-21 01:22:59 +01:00
Yorick Peterse dd626c10d3 Use Array#unshift in the LL XML grammar.
Using Array#+ for large sets (e.g. in the benchmarks) is _really_ slow.
Interestingly enough, Array#unshift uses as much memory as the Racc parser and is
about as fast, even though it has to move memory around.
2015-03-21 01:22:59 +01:00
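
The difference noted in dd626c10d3 can be reproduced with a small standalone script. This is only a sketch: the helper names and the element count are made up for the example, and it does not use the real grammar or benchmark fixtures.

```ruby
require 'benchmark'

# Prepends using Array#+, which allocates a new Array on every step and
# copies all existing elements into it.
def build_with_plus(count)
  acc = []
  count.downto(1) { |i| acc = [i] + acc }
  acc
end

# Prepends using Array#unshift, which modifies the receiver in place.
def build_with_unshift(count)
  acc = []
  count.downto(1) { |i| acc.unshift(i) }
  acc
end

count = 10_000 # arbitrary size chosen for illustration

plus_result = nil
unshift_result = nil

plus_time = Benchmark.realtime { plus_result = build_with_plus(count) }
unshift_time = Benchmark.realtime { unshift_result = build_with_unshift(count) }

puts format('Array#+:       %.4f sec', plus_time)
puts format('Array#unshift: %.4f sec', unshift_time)
```

Both helpers produce the same Array; only the cost of building it differs, and the gap is expected to grow with the number of elements.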
Yorick Peterse f94407ee9d Parser callback for XML attributes. 2015-03-21 01:22:59 +01:00
Yorick Peterse a023b35e78 Fixed the pull parser for the XML LL parser. 2015-03-21 01:22:59 +01:00
Yorick Peterse 5eed0d31d6 Ported over most of the XML parser to ruby-ll.
This is still missing the error handling previously present.
2015-03-21 01:22:59 +01:00
Yorick Peterse 15a3ab9ba5 ruby-ll: full support for parsing doctypes. 2015-03-21 01:22:59 +01:00
Yorick Peterse 71aefb53cc Started porting the XML parser to ruby-ll
This is far from done.
2015-03-21 01:22:59 +01:00
Yorick Peterse 2f67399784 Use 72 characters for Git instead of 80.
This follows the Linux/universal Git guidelines more closely.
2015-03-16 14:58:20 +01:00
Yorick Peterse 2ec91f130f Lazy decoding of XML/HTML entities.
Instead of decoding entities in the lexer we'll do this whenever XML::Text#text
is called. This removes the overhead from the parsing phase and ensures the
process is only triggered when actually needed. Note that calling #to_xml and/or
the #inspect methods on a Text (or parent) instance will also trigger the entity
conversion process.

The new entity decoding API supports both regular entities (e.g. &amp;) as well
as codepoint based entities (both regular and hexadecimal codepoints).

To allow safe read-only access to Text instances from multiple threads a mutex
is used. This mutex ensures that only 1 thread can trigger the conversion
process.

Fixes #68
2015-03-05 23:00:43 +01:00
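
The lazy decoding scheme from 2ec91f130f might look roughly like the sketch below. `LazyText`, its entity table, and the regular expression are all hypothetical and only illustrate the idea of decoding on first access behind a mutex; they are not Oga's XML::Text or its entity decoding API.

```ruby
# Hypothetical text node that decodes entities the first time #text is
# called, instead of decoding while lexing.
class LazyText
  # Small illustrative table of named entities (an assumption).
  NAMED = {
    'amp'  => '&',
    'lt'   => '<',
    'gt'   => '>',
    'quot' => '"',
    'apos' => "'"
  }.freeze

  # Matches hexadecimal codepoints, decimal codepoints and named entities.
  ENTITY = /&(?:#x([0-9a-fA-F]+)|#(\d+)|(\w+));/

  def initialize(raw)
    @raw     = raw
    @text    = nil
    @decoded = false
    @mutex   = Mutex.new
  end

  # Returns the decoded text. The mutex ensures only one thread runs the
  # conversion; later calls reuse the cached result.
  def text
    @mutex.synchronize do
      unless @decoded
        @text    = decode(@raw)
        @decoded = true
      end
    end

    @text
  end

  private

  def decode(input)
    input.gsub(ENTITY) do |match|
      hex, dec, name = $1, $2, $3

      if hex
        [hex.to_i(16)].pack('U')
      elsif dec
        [dec.to_i].pack('U')
      else
        # Unknown named entities are left untouched.
        NAMED.fetch(name, match)
      end
    end
  end
end

node = LazyText.new('a &amp; b &#60; c &#x3E; d')
decoded = node.text
```

Nothing is decoded when the node is created, so documents whose text is never read pay no decoding cost.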
Yorick Peterse 7409257702 Replaced HTML benchmark fixtures.
The new fixture is the HTML of a person article which contains a few HTML
entities.
2015-03-05 22:58:22 +01:00
Yorick Peterse 7e847a0ae9 Make C90 happy. 2015-03-05 22:57:51 +01:00
Yorick Peterse 33c46a1841 Use ID instead of VALUE for callback names in C. 2015-03-05 22:57:51 +01:00
Yorick Peterse 3e05593536 Release 0.2.3 2015-03-04 11:56:23 +01:00
Yorick Peterse aa42cc9ce7 Updated changelog for 0.2.3 2015-03-04 11:49:24 +01:00
Yorick Peterse 3b2055a30b Refactored handling of literal HTML elements.
This ensures newlines can appear in <style> / <script> tags when using IOs as
input.
2015-03-04 11:44:31 +01:00
Yorick Peterse 78e40b55c0 Handle parsing of HTML <style> tags.
This basically re-applies the technique used for HTML <script> tags. With this
extra addition I decided to rename/normalize a few things so it's easier to add
any extra tags in the future. One downside of this setup is that the following
will not be parsed by Oga:

    <style>
        </script>
    </style>

The same applies to script tags containing a literal </style> tag. Since this
particular case is rather unlikely to occur I'm OK with not supporting it as it
_does_ simplify the lexer quite a bit.

Fixes #80
2015-03-03 16:28:05 +01:00
Yorick Peterse 73534375d5 Release 0.2.2 2015-03-03 13:36:32 +01:00
Yorick Peterse 142b467277 Set parent of nodes set using Element#inner_text=
This ensures that any text nodes created using Element#inner_text= have their
parent node set correctly.
2015-03-03 13:13:05 +01:00
Yorick Peterse 503efc32cd Release 0.2.1 2015-03-02 22:12:49 +01:00
Yorick Peterse bc74d31bb5 Updated changelog for 0.2.1. 2015-03-02 17:44:08 +01:00
Yorick Peterse 874d7124af Don't convert <script> text to XML entities.
Fixes #79.
2015-03-02 17:32:19 +01:00
Yorick Peterse 9a586363e9 Added XML::Document#html? 2015-03-02 16:39:40 +01:00
Yorick Peterse ba2177e2cf Lex contents of <script> tags as plain text.
When lexing input in HTML mode the lexer has to treat _all_ content of a
<script> tag as plain text. This ensures that the lexer can process input such
as "x <y" and "// <foo>" correctly.

Fixes #70.
2015-03-02 16:22:09 +01:00
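
The behaviour described in ba2177e2cf can be illustrated with a tiny scanner. Oga's real lexer is generated with Ragel; the `lex_literal_element` helper and its token names below are hypothetical and only show the idea of treating everything up to the closing tag as plain text.

```ruby
require 'strscan'

# Hypothetical helper: emits tokens for the first <name>...</name> element,
# treating its entire body as a single text token.
def lex_literal_element(html, name)
  scanner   = StringScanner.new(html)
  tokens    = []
  open_tag  = /<#{name}>/
  close_tag = /<\/#{name}>/

  # Skip ahead to the opening tag; bail out if there is none.
  return tokens unless scanner.skip_until(open_tag)

  tokens << [:element_start, name]

  # Everything up to the closing tag is text, including "<" characters.
  body = scanner.scan_until(close_tag)

  if body
    text = body[0...-scanner.matched_size]

    tokens << [:text, text] unless text.empty?
    tokens << [:element_end, name]
  end

  tokens
end

tokens = lex_literal_element('<script>if (x <y) { // <foo> }</script>', 'script')
```

Because the body is never inspected for tags, input such as "x <y" or "// <foo>" does not confuse the scanner.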
Yorick Peterse 351b5ac004 Added spec for lexing inline HTML script tags.
Related issue: #70
2015-03-02 16:20:06 +01:00
Yorick Peterse 8fdf27dcef Removed unused C lexer macros. 2015-03-02 15:43:47 +01:00
Yorick Peterse 8b910c700d Updated EditorConfig file for ruby-ll files. 2015-02-13 09:38:29 +01:00
Yorick Peterse c68b038e53 Added benchmark for the CSS parser. 2015-02-13 09:36:24 +01:00
Yorick Peterse f94461a9ca Upload docs to S3. 2015-01-17 18:00:05 +01:00
Yorick Peterse 2d03ce8e51 Run tests on MRI 2.2. 2015-01-09 21:37:09 +01:00
Yorick Peterse 47a3c5e7f8 Use describe/it instead of context/example.
This keeps things consistent with the general testing guidelines in the Ruby
community. This in turn should hopefully make my life easier as I don't have to
tell people to use this rather odd style I was using before.
2015-01-08 23:01:53 +01:00
Yorick Peterse e138aa15ac Removed stray comment in the XPath parser. 2014-12-28 23:55:33 +01:00
Yorick Peterse 746c8052dd Remove all nodes when calling Element#inner_text=
This fixes #64.
2014-12-14 23:32:43 +01:00
Yorick Peterse 739f885078 Use ID instead of VALUE for C Symbols.
Thanks to @cremno for bringing this up.
2014-11-29 12:53:55 +01:00
Yorick Peterse b006289c5f Removed extra space in c/lexer.rl 2014-11-23 22:12:18 +01:00
Yorick Peterse 5e24a3d1e5 Short docs on lexer callback names. 2014-11-23 20:20:14 +01:00
Yorick Peterse 4fa88fcbde Cache rb_intern/symbol lookups in the lexer.
For JRuby this has little to no benefits as it uses strings for method names.
However, both MRI and Rubinius will perform a Symbol lookup whenever rb_intern()
is called. By doing this once for all callback names and caching the resulting
VALUE objects the lexer timings can be reduced by about 25%. In case of the
benchmark benchmark/xml/lexer/string_average_bench.rb this means it runs in
around 500ms instead of 700ms.
2014-11-22 01:53:37 +01:00
Yorick Peterse a10fe855d7 Merge pull request #67 from krasnoukhov/xml-entities
Add missing entities to the decode/encode lists
2014-11-21 01:12:24 +01:00
Dmitry Krasnoukhov 26baf89440 Add missing entities to the decode/encode lists 2014-11-21 01:53:11 +02:00
Yorick Peterse 81c49b5101 Contributing notes on thread-safety/require usage. 2014-11-20 20:09:41 +01:00
Yorick Peterse cbb2815146 Support for inline doctype rules plus newlines.
This adds support for lexing/parsing XML documents that use an IO as input _and_
contain doctype rules with newlines in them.

This fixes #63.
2014-11-18 20:02:55 +01:00
Yorick Peterse f88df486ba README example on using Enumerator for input. 2014-11-17 23:59:30 +01:00
Yorick Peterse b8f9d04b17 Added checksums for v0.2.0 2014-11-17 23:31:40 +01:00
Yorick Peterse ae17e7f137 Clean before building any Gem. 2014-11-17 23:28:57 +01:00