# Changelog This document contains details of the various releases and their release dates. Dates are in the format `yyyy-mm-dd`. ## 2.0.0 - 2015-12-26 ### Fixed parsing HTML identifiers HTML identifiers are now parsed correctly. This means that for the element `` the element name is now "bar" _without_ a namespace prefix ever being set. For the element `` the attribute name is now "bar:baz" instead of just "baz". This particular change may break existing applications, hence the version bump to 2.0. See commit 66fc4b1dfcc4b651302c7582f62287d5750dcbfe for more information. ### Slightly improved performance of checking XPath booleans Performance of checking if certain XPath values are booleans has been improved somewhat. See commit 9bb908f8b1f6c72582ae6070d30f8bd8316ec5ad for more information. ## 1.3.1 - 2015-09-07 ### Race condition in the XPath compiler This release fixes a race condition in the XPath compiler. The `XPath::Compiler#compile` method would compile Procs using its own Binding, this in turn would lead to race conditions when using the compiled Procs concurrently. See commit bd48dc15cc26f4eb556068afaafd2ab18271d8d3 for more information. ## 1.3.0 - 2015-09-06 ### XPath query evaluation rewritten The system used for evaluating XPath and CSS queries has been rewritten from the ground up, resulting in much better performance. Prior to 1.3.0 Oga would evaluate queries by iterating over the Abstract Syntax Tree (AST) produced by the XPath/CSS parser. This setup could lead to _lots_ of object allocations and method calls, even for small queries. Starting with 1.3.0 Oga instead generates Ruby code based on XPath expressions. The generated code relies on nesting of conditionals (instead of method calls) and allocates far fewer objects (partially as a result of this). The generated code is cached based on the input expression, removing the need for recompiling the same expression over and over. The result of all this greatly improved querying performance. As an example, lets look at the benchmark `benchmark/xpath/compiler/big_xml_average_bench.rb`. When using Oga 1.2.3 the output is as following: Iteration: 1: 3.292 Iteration: 2: 2.71 Iteration: 3: 2.747 Iteration: 4: 2.752 Iteration: 5: 2.776 Iteration: 6: 2.735 Iteration: 7: 2.761 Iteration: 8: 2.741 Iteration: 9: 2.791 Iteration: 10: 2.787 Iterations: 10 Average: 2.809 sec Using Oga 1.3.0 we instead get the following output: Iteration: 1: 0.639 Iteration: 2: 0.422 Iteration: 3: 0.428 Iteration: 4: 0.47 Iteration: 5: 0.443 Iteration: 6: 0.445 Iteration: 7: 0.51 Iteration: 8: 0.485 Iteration: 9: 0.506 Iteration: 10: 0.547 Iterations: 10 Average: 0.489 sec Here Oga 1.3.0 is about 5.7 times faster compared to version 1.2.3. In the coming days I'll work on writing a blog post that explains more about the new compiler setup, how it works, how it performs, etc. In the mean time, see the following issues/pull requests for more information: * * ### Escaping of characters in CSS expressions CSS expressions now allow querying of nodes having dots in the element name or namespace. This can be done by escaping the dot using a backslash. For example: Oga.parse_xml('').css('foo\.bar') # => NodeSet(Element(name: "foo.bar")) See issue for more information. ### Support for the CSS :not() pseudo class CSS expressions can now use the `:not()` pseudo class. See issue for more information. ### Improved parsing of CSS expressions CSS expressions such as `foo>bar` and `foo > .bar` are now supported, previously these would result in parser errors. See the following issues for more information: * * ### Unicode support for CSS/XPath CSS and XPath expressions can now contain Unicode characters, previously only ASCII characters were allowed for identifiers (node tests, attribute names, etc). See issue for more information. ## 1.2.3 - 2015-08-19 ### NodeSet performance improvements Performance of the NodeSet class has been improved, especially when used in concurrent environments. See commit 4f94d03a85f6acabe5cc57ba8c778928e42186be for more information. ### Comparing names in the XPath evaluator Performance of comparing names of nodes in the XPath evaluator has been improved thanks to Daniel Fockler. See the following commits for more information: * be7bc8f4234832ea16385ed92ba252850d7a890e * fc38b39aa36f49fc38afd44c7e23ac3bfc6159e7 * 496811a23fa7c3f0498ec5721575b1c8406a5351 ## 1.2.2 - 2015-08-14 ### Race condition in the LRU class A race condition in the LRU class has been resolved. This race condition would result in errors such as "ConcurrencyError: Detected invalid array contents due to unsynchronized modifications with concurrent users" on JRuby or "ArgumentError: negative array size" on Rubinius. See commit 32b75bf62c0c1770b68e7e1a9918718943d1c04c for more information. ### Lexing of void elements with explicit closing tags Void elements followed by an explicit closing tag (e.g. ``) are now lexed properly. Thanks to Jakub Pawlowicz for fixing this. See commit ed3cbe7975eeb9d142c4f649334038b6389abc0e for more information. ## 1.2.1 - 2015-07-01 ### Better support for decoding unrecognized XML/HTML entities Jakub Pawlowicz improved the process of decoding XML/HTML entities so that it handles unrecognized entities better. Previously Oga would raise an error when trying to decode entities such as `&#TAB;` instead of just leaving them as-is. See issue and pull request for more information. ## 1.2.0 - 2015-06-30 ### Support for Unicode in XML/HTML identifiers XML/HTML element and attribute names can now contain Unicode characters. While the HTML specification states only ASCII may be used Oga still supports Unicode identifiers for HTML. See commit dde644cd7991f5d24e662e0fc4094bd644274046 for more information. ### Support for dots in XML/HTML identifiers XML/HTML element and attribute names can now contain dots such as `baz`. Thanks to Laurence Lee for adding this. See commit b7771ed5fe4b82ad72d255444f87f5e51638af7d for more information. ### Support for the CSS :nth() pseudo class Oga now supports the `:nth()` CSS pseudo class. This pseudo class can be used to select the Nth element in a set regardless of any preceding/following siblings. See commit 71960fff87da633dcab863002a461fbf7d4c5738 for more information. ### Support for commas in CSS expressions CSS expressions such as `foo, bar` and `#hello, #world` are now supported. See commit d26b48feb454de0d5b2a3a9bcffaa7c5d9d604b5 for more information. ## 1.1.0 - 2015-06-29 ### Better support for unquoted HTML attribute values Oga can now parse HTML such as `` and basically any other kind of value as long as it does not contain a `>` or whitespace. See commit 3b633ff41c48c44893e42d3ba29ef7a5e3d70617 for more information. ### Support for replacing of DOM nodes The newly added method `Oga::XML::Node#replace` can be used to replace an existing node with another node or with a String (which will result in it being replaced with a Text node). For example: p = Oga::XML::Element.new(:name => 'p') div = Oga::XML::Element.new(:name => 'div', :children => [p]) puts div.to_xml # => "

" p.replace('Hello world!') puts div.to_xml # => "
Hello world!
" Thanks to Tero Tasanen for adding this. See commit 0b4791b277abf492ae0feb1c467dfc03aef4f2ec and for more information. ### Encoding quotes in attribute values When serializing elements back to XML Oga now properly encodes single/double quotes in attribute values. See commit 074b53c18c85eaeba09557f6b0c5a6792f522c3e for more information. ## 1.0.3 - 2015-06-16 ### Strict XML parsing support Oga can now parse XML documents in "strict mode". This mode currently only disables the automatic insertion of missing closing tags. This feature can be used as following: document = Oga.parse_xml('bar', :strict => true) This works for all 3 (DOM, SAX and pull) parsing APIs. See commit 2c18a51ba905d46469170af7f071b068abe965eb for more information. ### Support for HTML attribute values without starting quotes Oga can now parse HTML such as ``. This will be parsed as if the input were ``. See commit fd307a0fcc3616ded272432ba27f972a9113953a for more information. ### Support for spaces around attribute equal signs Previously XML/HTML such as `` would not be parsed correctly as Oga didn't support spaces around the `=` sign. Commit a76286b973ed6d6241a0280eb3d1d117428e9964 added support for input like this. ### Decoding entities with numbers Oga can now decode entities such as `½`. Due to an incorrect regular expression these entities would not be decoded. See commit af7f2674af65a2dd50f6f8a138ddd9429e8533d1 for more information. ## 1.0.2 - 2015-06-03 ### Fix for requiring extensions on certain platforms The loading of files has been changed to use `require` so that native extensions are loaded properly even when a platform decides not to store in in the lib directory. See commit 4bfeea2590682ce7bf721c1305cb7c7a5707faac for more information. ### Better closing of HTML tags Closing of HTML tags has been improved so Oga can parse HTML such as this:
  • foo
inside div
outside div See the following commits for more information: * d0d597e2d93035c35b6b653d181f550d9dd522fd * 5182d0c488759efb96d85a399de29550faea3efe * 3c6263d8de30b91aac7c3b16b65f00407b88fc13 ### Whitespace support in closing tags Oga can now lex HTML/XML such as the following:

hello

See commit d2523a1082b5ab601724e02fa4c613a9d9d9e3c6 for more information. ## 1.0.1 - 2015-05-21 ### Encoding quotes in XML Oga no longer encodes single/double quotes as XML entities when serializing a document back to XML. This ensures that input such as `a"b` doesn't get turned into `a"b`. ### HTML Entity Encoding HTML entities are now generated using `pack('U*')` instead of `pack('U')` ensuring the correct characters/codepoints are produced. ## 1.0.0 - 2015-05-20 This marks the first stable release (API wise) for Oga. It's been quite the ride since the very first commit from February 26, 2014. In the releases following 1.0 I plan to focus mainly on performance as both XMl/HTML parsing and XPath evaluation performance is not quite as fast as I'd like it to be. ### License Change Up until 1.0.0 Oga was licensed under the MIT license. Since this license does fairly little to protect authors (especially regarding patents) I've decided to change the license to the Mozilla Public License 2.0. More information on this can be found in commit 0a7242aed44fcd7ef18327cc5b10263fd9807a35. ### XPath Performance Improvements With 1.0 the evaluator received further performance improvements that should be especially noticable when querying large XML/HTML documents. Improving XPath performance is an ongoing task so expect similar improvements in upcoming releases. See the following commits for more information: * ecdeeacd767ec974e7cf2306f30d62bf7c3120b8 * 5c7c4a6110d9fc7142bccc367f8b77b98532eac4 * 0298e7068c79a46aef6dc8256ccc25348d2bdf1d * b9145d83f8ac97d813cabc3837488a0d732893fd * b5e63dc50eb8423a1839fbfb815521e8f3a1e378 ### Full HTML5 Support With 1.0 Oga finally supports parsing of HTML5 according to the official specification. This means that Oga is now capable of parsing HTML such as the following:

Hello, this is a list:

  • First item
  • Second item
This would be parsed as if the HTML were as following instead:

Hello, this is a list:

  • First item
  • Second item
See the following commits for more information: * 688a1fff0efb9e2405e0aab5b3a7164e78ec287e * 1c095ddaffd7e33cc89449d77a1cc25a781f8a92 * 1e0b7feb026d95f2b04706391a868d64b7e5de6e * 69180ff686553958eeedecf1d89a9e6a56d7571e * 4b1c296936c02854247fbc0814a005f05b7eec0e * 4b21a2fadc8684446663d92c7b73be46595323c1 * 8135074a62c38b039fbee2d916a196e1e43039f3 The following issues are also worth checking out: * https://github.com/YorickPeterse/oga/issues/101 * https://github.com/YorickPeterse/oga/issues/99 ### Handling of invalid XML/HTML Oga can now handle most forms of invalid XML/HTML by automatically inserting missing closing tags and ignoring stray opening tags where possible. This allows Oga to parse XML such as the following: Alice See commit 13e2c3d82ffb9f32b863cb47f6808cf061e07095 for more information. ### Decoding zero padded XML/HTML entities Oga can now decode zero padded XML/HTML entities such as `&`. See commit 853d804f3468c9f54c222568a7faedf736f8dc1a for more information. ## 0.3.4 - 2015-04-19 XML and HTML entities are decoded in the SAX parser before data is passed to a custom handler class. See commit da62fcd75d0889e4539e7390777a906a914a78c0 for more information. ## 0.3.3 - 2015-04-18 ### Improved lexer support for script/style tags Commit 73fbbfbdbdecafcf5f873b8a27e81c19a2e2ed0c improved support for lexing HTML script and style tags, ensuring that HTML such as the following is processed correctly: ### Lexing of extra quotes The XML lexer can now handle stray quotes that reside in the open tag of an element, for example: While technically invalid HTML certain websites such as contain HTML like this. See commit 6b779d788384b89ba30ef60c17a156216ba5b333 for more information. ### Lexing of doctypes containing newlines The XML lexer is now capable of lexing doctypes that contain newlines such as: See commit 9a0e31d0ae9fc8bbf9fdacb13100a7327d09157a for more information. ## 0.3.2 - 2015-04-15 ### Support for unquoted HTML attribute values Oga can now lex/parse HTML attribute values that don't use quotes. For example, the following is valid HTML: Foo And so is this: Foo/bar See Github issue and the following commits for more information: * bc9b9bc9537d9dc614b47323e0a6727a4ec2dd04 * d892ce97874ed0f1382df993c40a452530025f02 * afbb5858122d5aece252b957b3988787ed76168f * 23a441933ac659933646418ed62ba188bb20ff65 ### Counting newlines in XML declarations The XML lexer has been adjusted so that it counts newlines when processing XML declarations. While these newlines are not exposed to the resulting `Oga::XML::*` instances they are used when reporting errors. Previously the lexer wouldn't count newlines in XML declarations, leading to error messages referring to incorrect line numbers. This was fixed in commit e942086f2df0204fc7756c3df260297f5cadc7c2. ### Better lexer support for CDATA, comments and processing instructions The XML lexer has been tweaked so it can handle multi-line CDATA tags, comments and processing instructions, both when using a String and IO (or similar) as input. See Github issue and the following commits for more information: * b2ea20ba615953254554565e0c8b11587ac4f59c * ea8b4aa92fe746a9da19e94c3edf68b41495d992 * 8acc7fc743c9492eed2d9c885c22c1b5bec06d0f ### Performance Improvements To improve performance of the XPath evaluator (as well as generic code using Oga) the following methods now cache their return values: * `Oga::XML::Element#available_namespaces` * `Oga::XML::Element#namespace` * `Oga::XML::Node#html?` These cache of these methods is flushed automatically when needed. For example, registering a new namespace will flush the cache for `Element#available_namespaces` and `Element#namespace`. The performance of `Oga::XML::Traversal#each_node` has also been optimized, cutting down the amount of object allocations significantly. Combined these improvements should make XPath evaluation roughly 4 times faster. See the following commits for more information: * 739e3b474cb562f774a0e80f5f33b3b18ec7d8c5 * b42f9aaf322c6bb67a3ddfd2b350d72a45c1fd8f * fa838154fc19c938355e1d96c5e2dd4d8c299ba3 * b0359b37e536aef172b95b54dea91198b9512e15 ## 0.3.1 - 2015-04-08 Oga no longer decodes any HTML entities that appear inside the body of a ` Would effectively be turned into: See commit 4bdc8a3fdcc3111c1e2f7de983faaaf5bb6fffb1 for more information. ## 0.3.0 - 2015-04-03 ### Lexing of carriage returns Oga can now lex and parse XML documents using carriage returns for newlines. This was added in commit 0800654c962c20fb139a389245359bca9952dcd1. ### Improved handling of HTML namespaces Oga now ignores any declared namespaces when parsing HTML documents as HTML5 does not allow one to register custom namespaces. See commit 31764593070b29fcd16040a6a0bd553e464324cd for more information. ### Improved handling of explicitly declared default XML namespaces In the past explicitly defining the default XML namespace in a document would lead to Oga's XPath evaluator not being able to match any nodes. This has been fixed in commit 5adeae18d0e53fda3bcfb883b414dee8e3a9d87d. ### Caching of XPath/CSS expressions The CSS and XPath parsers now cache the ASTs of an expression used when querying a document using CSS or XPath. This can give a pretty noticable speed improvement, especially when running the same expression in a loop (or just many different times). Parsed expressions are stored in an LRU to prevent memory from growing forever. Currently the capacity is set to 1024 values but this can be changed as following: Oga::XPath::Parser::CACHE.maximum = 2048 Oga::CSS::Parser::CACHE.maximum = 2048 The LRU synchronizes method calls to allow safe usage from multiple threads. See the following commits for more info: * 66fa9f62ef1f5e2e447cdc724b42f2e1d58b0753 * 12aa21fb502a044d660cc53557d0a1208eb8e61d * 2c4e490614528dc873f8275fe10c34ae489cfee5 * 67d7d9af88787a8a810273e3451b194a6284b1ef ### Windows support While Oga for the most part already supported Windows a few changes for the extension compilation process were required to allow users to install Oga on Windows. Tests are run on AppVeyor (a continuous integration service for Windows platforms). Oga requires devkit () to be installed on non Cygwin/MinGW environments. Cygwin/MinGW environments probably already work, although I do not run any tests on these environments. ### SAX parsing of XML attributes Parsing of XML attributes using the SAX API was overhauled quite a bit. As these changes are not backwards compatible it's likely that existing SAX parsers will break. See commit d8b9725b82f93d92b10170612446fbbef6190fda for more information. ### Parser callbacks for XML attributes The XML parser has an extra callback method called `on_attribute` which is used to create a new attribute. This callback can be used in custom SAX parsers just like the other callbacks. ### Parser rewritten using ruby-ll The XML, CSS and XPath parsers have been re-written using ruby-ll (). While Racc served its purpose (until now) it has three main problems: 1. Performance is not as good as it should be. 2. The codebase is dated and generally hard to deal with, as such it's quite difficult to optimize in reasonable time. 3. LALR parser errors can be incredibly painful to debug. For this reason I wrote ruby-ll and replaced Oga's Racc based parsers with ruby-ll parsers. These parsers are LL(1) parsers which makes them a lot easier to debug. Performance is currently a tiny bit faster than the old Racc parsers, but this will be improved in the coming releases of both Oga and ruby-ll. See pull request for more information. ### Lazy decoding of XML/HTML entities In the past XML/HTML entities were decoded in the lexer, adding overhead even when not needed. This has been changed so that the decoding of entities only occurs when calling `XML::Text#text`. With this particular change also comes support for HTML entities and codepoint based XML/HTML entities. See commit 2ec91f130fcdfee918578d045b07367aec434260 for more information. ## 0.2.3 - 2015-03-04 This release adds support for lexing HTML `