Commit Graph

87 Commits

Author SHA1 Message Date
Yorick Peterse 3ff9057858 Renamed the XPath axes_spec file. 2014-06-13 09:32:46 +02:00
Yorick Peterse f557aaef76 Lexer specs for the various XPath axes. 2014-06-13 00:16:40 +02:00
Yorick Peterse eba2d9954d Support for parsing basic XPath expressions. 2014-06-12 00:20:46 +02:00
Yorick Peterse 70f3b7fa92 Lex XPath operators using individual tokens.
Instead of lexing every operator as T_OP they now use individual tokens such as
T_EQ and T_LT.
2014-06-09 23:35:54 +02:00
Yorick Peterse 114bc0d6e8 Upgrade to RSpec 3.0.
For this I've enabled both the old expectation and stubbing/mocking syntax. The
old syntax is much more compact and to me reads nicer. For example, consider
the following:

    lex('<foo></foo>').should == [...]

To me this reads much nicer than this:

    expect(lex('<foo></foo>')).to eq([...])
2014-06-02 12:09:38 +02:00
Yorick Peterse e11b9ed32c Basic XPath parser setup. 2014-06-01 23:02:28 +02:00
Yorick Peterse 54de2df0c7 Support for lexing XPath wildcard expressions.
To support this we need to require whitespace around the "*" operator. This is
not ideal but it will do for now.
2014-06-01 23:01:24 +02:00
Yorick Peterse b7855c124f XPath lexer test for predicates using axes. 2014-06-01 19:49:37 +02:00
Yorick Peterse 88b06e247d Added a few more XPath lexer tests. 2014-06-01 19:43:36 +02:00
Yorick Peterse 8dd8d7a519 Basic working XPath lexer.
This doesn't lex everything of the XPath specification just yet and needs more
tests.
2014-06-01 19:24:35 +02:00
Yorick Peterse 28edc7726f Rewind IO input upon resetting the lexer. 2014-05-26 00:33:20 +02:00
Yorick Peterse 629dcd3fe6 Support for IO inputs in the lexer.
Using IO/StringIO objects one can parse large XML files without first having to
read the entire file into memory. This can potentially save a lot of memory at
the cost of a slightly slower runtime.

For IO like instances the lexer will consume the input line by line. If a
String is given it's consumed as a whole instead. A small side effect of
reading the input line by line is that text such as "foo\nbar" will be lexed as
two tokens instead of one.

Fixes #19.
2014-05-26 00:30:39 +02:00
Yorick Peterse c56b0395e4 Moved various rules around for the XML lexer.
This moves the element related rules to the element_head machine (where they
belong). This in turn makes it possible to lex ">" as a text node, previously
this was impossible.
2014-05-21 00:04:53 +02:00
Yorick Peterse cd0f3380c4 Merge multiple CDATA tokens into a single token.
The tokens T_CDATA_START, T_TEXT and T_CDATA_END have been merged together into
T_CDATA.
2014-05-19 09:36:19 +02:00
Yorick Peterse a4fb5c1299 Merge multiple comment tokens into a single one.
The tokens T_COMMENT_START, T_TEXT and T_COMMENT_END have been merged into a
single token: T_COMMENT. This simplifies both the lexer and the parser.
2014-05-19 09:30:30 +02:00
Yorick Peterse 723a273e4f Enforce symbols for element attributes.
This comes with a little bit of memory overhead but this should be minor in
most cases.
2014-05-15 01:04:26 +02:00
Yorick Peterse 19f04f98f7 Support for lexing/parsing inline doctypes. 2014-05-10 00:28:11 +02:00
Yorick Peterse 83f6d5437e Contextual pull parsing.
This adds the ability to more easily act upon specific node types and nestings
when using the pull parsing API.

A basic example of this API looks like the following (only including relevant
code):

    parser.parse do |node|
      parser.on(:element, %w{people person}) do
        people << {:name => nil, :age => nil}
      end

      parser.on(:text, %w{people person name}) do
        people.last[:name] = node.text
      end

      parser.on(:text, %w{people person age}) do
        people.last[:age] = node.text.to_i
      end
    end

This fixes #6.
2014-04-29 23:05:49 +02:00
Yorick Peterse 1a413998a3 Track the current node in the pull parser.
The current node is tracked in the instance method `node`.
2014-04-29 21:21:05 +02:00
Yorick Peterse 45b0cdf811 Track element name nesting in the pull parser.
Tracking the names of nested elements makes it a lot easier to do contextual
pull parsing. Without this it's impossible to know what context the parser is
in at a given moment.

For memory reasons the parser currently only tracks the element names. In the
future it might perhaps also track extra information to make parsing easier.
2014-04-28 23:40:36 +02:00
Yorick Peterse 030a0068bd Basic pull parsing setup.
This parser extends the regular DOM parser but instead delegates certain nodes
to a block instead of building a DOM tree.

The API is a bit raw in its current form but I'll extend it and make it a bit
more user friendly in the following commits. In particular I want to make it
easier to figure out if a certain node is nested inside another node.
2014-04-28 17:22:17 +02:00
Yorick Peterse 08d412da7e First shot at removing the AST layer.
The AST layer is being removed because it doesn't really serve a useful
purpose. In particular when creating a streaming parser the AST nodes would
only introduce extra overhead.

As a result of this the parser will instead emit a DOM tree directly instead of
first emitting an AST.
2014-04-21 23:05:39 +02:00
Yorick Peterse d9fa4b7c45 Lex input as a sequence of bytes.
Instead of lexing the input as a raw String or as a set of codepoints it's
treated as a sequence of bytes. This removes the need of String#[] (replaced by
String#byteslice) which in turn reduces the amount of memory needed and speeds
up the lexing time.

Thanks to @headius and @apeiros for suggesting this and rubber ducking along!
2014-04-17 17:45:05 +02:00
Yorick Peterse b96f7c4852 Lex attributes with namespaces.
These are lexed as just the name instead of two separate tokens.
2014-04-10 11:01:49 +02:00
Yorick Peterse 8237d5791d Stream tokens when lexing.
Instead of returning the tokens as a whole they are now streamed using
XML::Lexer#advance. This method returns the next token upon every call. It uses
a small buffer in case a particular block of text results in multiple tokens.
2014-04-09 22:08:13 +02:00
Yorick Peterse 10d0ec1573 Specs for parsing various empty nodes. 2014-04-07 21:33:23 +02:00
Yorick Peterse cb74c7edf9 Specs for XML parser errors. 2014-04-07 21:31:36 +02:00
Yorick Peterse 915d3ee505 Expanded tests for XML::Document#inspect. 2014-04-07 20:11:12 +02:00
Yorick Peterse e9412c9c4e Tests for various inspect methods. 2014-04-07 09:58:31 +02:00
Yorick Peterse a2c525dd7c Insert newlines after XML dec/doctypes. 2014-04-03 23:04:21 +02:00
Yorick Peterse c077988dd6 Tree building of doctypes. 2014-04-03 22:44:00 +02:00
Yorick Peterse 81b1155af3 Lex/parse doctype names separately. 2014-04-03 21:59:57 +02:00
Yorick Peterse 6cf906e500 Lexer tests for single quoted attributes. 2014-04-03 18:50:07 +02:00
Yorick Peterse 30c01a5aee Tests for XML::TreeBuilder#handler_missing. 2014-04-03 09:43:30 +02:00
Yorick Peterse 0f129ceac9 Tests for XML::TreeBuilder#on_comment. 2014-04-03 09:38:18 +02:00
Yorick Peterse bdb76cefc5 Dedicated handling of XML declaration nodes. 2014-04-02 22:30:45 +02:00
Yorick Peterse d6c0a1f3f3 Lex/parser XML declaration attributes. 2014-04-02 22:01:17 +02:00
Yorick Peterse fa2e71c790 Tests for TreeBuilder#on_document. 2014-03-28 18:52:08 +01:00
Yorick Peterse f99c13b516 Tests + docs for the TreeBuilder class. 2014-03-28 17:11:54 +01:00
Yorick Peterse 331726b2ca Tests for the various XML node types. 2014-03-28 16:34:30 +01:00
Yorick Peterse 79818eb349 Added a convenience class for parsing HTML.
This removes the need for users having to set the `:html` option themselves.
2014-03-25 09:40:24 +01:00
Yorick Peterse 58009614f6 Moved XML specs into spec/oga/xml. 2014-03-25 09:36:39 +01:00
Yorick Peterse eae13d21ed Namespaced the lexer/parser under Oga::XML.
With the upcoming XPath and CSS selector lexers/parsers it will be confusing to
keep these in the root namespace.
2014-03-25 09:34:38 +01:00
Yorick Peterse 641c54261e Simplified lexer output for comments. 2014-03-24 21:34:30 +01:00
Yorick Peterse eaf1669b07 Simplified lexer output for CDATA tags. 2014-03-24 21:33:05 +01:00
Yorick Peterse 470be5a839 Simplified the lexer output for doctypes. 2014-03-24 21:32:16 +01:00
Yorick Peterse ac775918ee Lexing/parsing of XML declaration tags.
This closes #12.
2014-03-24 21:30:19 +01:00
Yorick Peterse b695ecf0df Renamed element lexer tags.
T_ELEM_OPEN has been renamed to T_ELEM_START, T_ELEM_CLOSE has been renamed to
T_ELEM_END. This keeps the token names consistent with the other ones (e.g.
T_COMMENT_START).
2014-03-24 20:32:43 +01:00
Yorick Peterse 91fb7523fd Lex open tags with newlines in them. 2014-03-20 23:39:29 +01:00
Yorick Peterse 74bc11a239 Rip out column counting.
This makes both the lexer and parser quite a bit easier to use. Counting column
numbers isn't also really needed when parsing XML/HTML.
2014-03-20 19:44:28 +01:00