Instead of using a single token (T_CDATA) for a CDATA tag the lexer now
uses 3 tokens:
1. T_CDATA_START
2. T_CDATA_BODY
3. T_CDATA_END
The T_CDATA_BODY token can occur multiple times and is turned into a
single value in the XML parser. This is similar to the way strings are
lexed.
By changing the way CDATA tags are lexed Oga can now lex CDATA tags
containing newlines when using an IO as input. For example, this would
previously fail:
Oga.parse_xml(StringIO.new("<![CDATA[\nfoo]]>"))
Because IO input reads input per line the input for the lexer would be
as following:
"<![CDATA[\n"
"foo]]>"
Related issues: #93
This is the first spec of many that will be re-written. Eventually this will
remove the need of the shared examples as well as removing lots of code
duplication and odd context blocks.
The new setup will not involve a separate transformation stage, instead the CSS
parser will directly emit an XPath AST. This reduces the overhead needed for
parsing/evaluating CSS selectors while also simplifying the code. The downside
is that I basically have to re-write 80% of the parser.
Instead of returning the tokens as a whole they are now streamed using
XML::Lexer#advance. This method returns the next token upon every call. It uses
a small buffer in case a particular block of text results in multiple tokens.