1075 Commits

Author SHA1 Message Date
Yorick Peterse b96f7c4852 Lex attributes with namespaces.
These are lexed as just the name instead of two separate tokens.
2014-04-10 11:01:49 +02:00
Yorick Peterse c974b96b88 Truncate lines in parser errors.
The offending lines of code displayed in the error message are truncated to 80
characters. This should make reading the error messages less of a pain when
dealing with very long lines of HTML/XML.
2014-04-10 10:08:51 +02:00
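
A minimal sketch of the truncation described in this commit, assuming a hypothetical helper named `truncate_line` and a hypothetical constant `MAX_LINE_LENGTH`. Neither name comes from Oga; the commit only states that offending lines are cut to 80 characters.

```ruby
# Hypothetical helper and constant, for illustration only; not Oga's API.
MAX_LINE_LENGTH = 80

# Returns the line unchanged when it fits, otherwise only its first
# `max` characters.
def truncate_line(line, max = MAX_LINE_LENGTH)
  line.length > max ? line[0, max] : line
end

long_line = '<p>' + ('a' * 200) + '</p>'

puts truncate_line(long_line).length
puts truncate_line('<p>short</p>')
```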
Yorick Peterse 292a98d7f6 Basic benchmarks for the Parser class. 2014-04-10 10:05:04 +02:00
Yorick Peterse 8ca7781842 Updated the lexer benchmarks.
These had to be updated for the API changes of Oga::XML::Lexer.
2014-04-10 10:01:11 +02:00
Yorick Peterse 8237d5791d Stream tokens when lexing.
Instead of returning the tokens as a whole they are now streamed using
XML::Lexer#advance. This method returns the next token upon every call. It uses
a small buffer in case a particular block of text results in multiple tokens.
2014-04-09 22:08:13 +02:00
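
The streaming approach in this commit can be sketched roughly as follows. `ToyLexer` is a hypothetical stand-in, not `Oga::XML::Lexer`: it splits whitespace-separated words instead of lexing XML, and only illustrates an `advance` method backed by a small buffer for input blocks that produce more than one token.

```ruby
# Hypothetical toy lexer, not Oga::XML::Lexer. It only demonstrates the
# "one token per call, with a small buffer" idea.
class ToyLexer
  def initialize(input)
    @lines  = input.lines
    @buffer = []
  end

  # Returns the next token, or nil once the input is exhausted.
  def advance
    # Refill the buffer only when it is empty. A single line may yield
    # several tokens, which are then handed out one call at a time.
    while @buffer.empty?
      line = @lines.shift
      return nil unless line

      line.split.each { |word| @buffer << [:T_TEXT, word] }
    end

    @buffer.shift
  end
end

lexer = ToyLexer.new("foo bar\nbaz")

while (token = lexer.advance)
  p token
end
```

The caller never holds the full token list, so memory use depends on the buffer size rather than on the size of the input.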
Yorick Peterse e9bb97d261 First steps towards making the lexer stream tokens 2014-04-09 19:32:06 +02:00
Yorick Peterse 10d0ec1573 Specs for parsing various empty nodes. 2014-04-07 21:33:23 +02:00
Yorick Peterse cb74c7edf9 Specs for XML parser errors. 2014-04-07 21:31:36 +02:00
Yorick Peterse 915d3ee505 Expanded tests for XML::Document#inspect. 2014-04-07 20:11:12 +02:00
Yorick Peterse e9412c9c4e Tests for various inspect methods. 2014-04-07 09:58:31 +02:00
Yorick Peterse 54ef125637 Basic docs for everything under Oga::XML. 2014-04-04 17:48:36 +02:00
Yorick Peterse 13a9228563 Properly indent doctype/XML decl inspect values. 2014-04-04 11:13:39 +02:00
Yorick Peterse 37a12722cb Rough setup for a custom #inspect format.
This format is a lot more readable than the default Ruby #inspect format
(mostly due to not including previous/next/parent nodes).
2014-04-04 00:41:29 +02:00
Yorick Peterse a2c525dd7c Insert newlines after XML decl/doctypes. 2014-04-03 23:04:21 +02:00
Yorick Peterse 230fafa2d3 Document should not inherit from Node.
A document is not an XML node in itself. If logic has to be shared between the
Document and the Node class I'll resort to using mixins for this.
2014-04-03 22:45:40 +02:00
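
A rough sketch of the mixin approach mentioned in this commit. All names here (`HasChildren`, and the bare `Node` and `Document` classes) are hypothetical and do not reflect Oga's actual class layout; the point is only that shared logic can live in a module without `Document` inheriting from `Node`.

```ruby
# Hypothetical names throughout; this is not Oga's actual class layout.
module HasChildren
  # Lazily created list of child nodes, shared by every including class.
  def children
    @children ||= []
  end
end

class Node
  include HasChildren
end

class Document
  include HasChildren
end

doc = Document.new
doc.children << Node.new

puts doc.is_a?(Node)
puts doc.children.length
```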
Yorick Peterse c077988dd6 Tree building of doctypes. 2014-04-03 22:44:00 +02:00
Yorick Peterse 81b1155af3 Lex/parse doctype names separately. 2014-04-03 21:59:57 +02:00
Yorick Peterse 8185656c1e Fixed typ. 2014-04-03 21:41:31 +02:00
Yorick Peterse 6cf906e500 Lexer tests for single quoted attributes. 2014-04-03 18:50:07 +02:00
Yorick Peterse 30c01a5aee Tests for XML::TreeBuilder#handler_missing. 2014-04-03 09:43:30 +02:00
Yorick Peterse 0f129ceac9 Tests for XML::TreeBuilder#on_comment. 2014-04-03 09:38:18 +02:00
Yorick Peterse bdb76cefc5 Dedicated handling of XML declaration nodes. 2014-04-02 22:30:45 +02:00
Yorick Peterse d6c0a1f3f3 Lex/parse XML declaration attributes. 2014-04-02 22:01:17 +02:00
Yorick Peterse fa2e71c790 Tests for TreeBuilder#on_document. 2014-03-28 18:52:08 +01:00
Yorick Peterse f99c13b516 Tests + docs for the TreeBuilder class. 2014-03-28 17:11:54 +01:00
Yorick Peterse 6d866523b8 Renamed XML::Builder to XML::TreeBuilder. 2014-03-28 16:37:37 +01:00
Yorick Peterse 331726b2ca Tests for the various XML node types. 2014-03-28 16:34:30 +01:00
Yorick Peterse c366a96ce8 Rake task for generating code coverage. 2014-03-28 16:33:47 +01:00
Yorick Peterse e141c084f9 Dedicated DOM builder class for CDATA tags. 2014-03-28 09:27:53 +01:00
Yorick Peterse 2b250bbf42 Rough DOM building setup. 2014-03-28 08:59:48 +01:00
Yorick Peterse 6ae52c1b12 Initial rough sketches for the DOM API. 2014-03-26 18:12:00 +01:00
Yorick Peterse 6c661f3ee9 Removed the donations section.
I gave this some thought and I've removed it for two reasons:

1. My Dogecoin Wallet takes *forever* to sync with the network (13 weeks
   behind) so I uninstalled it. I can't be bothered waiting forever for a
   gimmick.

2. I don't like asking for donations/money. I'd much rather have people send me
   an email thanking me for my work than for them to donate money. The former
   means much more to me.
2014-03-25 23:55:10 +01:00
Yorick Peterse 4a48647d1e Removed generated lexer/parser.
I am a dumbass.
2014-03-25 21:47:40 +01:00
Yorick Peterse fb626278a8 Re-wrapped comments in the XML lexer. 2014-03-25 10:12:39 +01:00
Yorick Peterse 8ebd72158c Renamed XML::Lexer#t to #emit(). 2014-03-25 09:42:52 +01:00
Yorick Peterse 79818eb349 Added a convenience class for parsing HTML.
This removes the need for users having to set the `:html` option themselves.
2014-03-25 09:40:24 +01:00
Yorick Peterse 58009614f6 Moved XML specs into spec/oga/xml. 2014-03-25 09:36:39 +01:00
Yorick Peterse 7c03de0e2f Renamed HTML_PARSER to PARSER_OUTPUT.
This keeps it consistent with the lexer.
2014-03-25 09:35:48 +01:00
Yorick Peterse eae13d21ed Namespaced the lexer/parser under Oga::XML.
With the upcoming XPath and CSS selector lexers/parsers it will be confusing to
keep these in the root namespace.
2014-03-25 09:34:38 +01:00
Yorick Peterse 2259061c89 Don't require the 2nd Lexer#add_token argument. 2014-03-24 21:35:47 +01:00
Yorick Peterse 641c54261e Simplified lexer output for comments. 2014-03-24 21:34:30 +01:00
Yorick Peterse eaf1669b07 Simplified lexer output for CDATA tags. 2014-03-24 21:33:05 +01:00
Yorick Peterse 470be5a839 Simplified the lexer output for doctypes. 2014-03-24 21:32:16 +01:00
Yorick Peterse ac775918ee Lexing/parsing of XML declaration tags.
This closes #12.
2014-03-24 21:30:19 +01:00
Yorick Peterse b695ecf0df Renamed element lexer tags.
T_ELEM_OPEN has been renamed to T_ELEM_START, T_ELEM_CLOSE has been renamed to
T_ELEM_END. This keeps the token names consistent with the other ones (e.g.
T_COMMENT_START).
2014-03-24 20:32:43 +01:00
Yorick Peterse 0b6ba6e6b5 Fixed typ. 2014-03-24 20:20:19 +01:00
Yorick Peterse ca66339a08 README entry on donations. 2014-03-24 20:13:16 +01:00
Yorick Peterse 52abc9d29e Basic documentation for Oga::Parser. 2014-03-23 21:29:57 +01:00
Yorick Peterse 19c1d66287 Use String#unpack instead of String#codepoints.
The latter returns an Enumerable which on Ruby 1.9.3 doesn't have #length
available. Besides this it's better to just return an Array since we'll iterate
over every character anyway.
2014-03-23 21:21:27 +01:00
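
For illustration, this is how `String#unpack` with the `U*` directive yields an Array of Unicode code points. The sample string is an assumption made for the example and is not taken from the lexer.

```ruby
# The sample text is illustrative only.
text   = "h\u00e9llo"
points = text.unpack('U*') # Array of Integer code points

p points
puts points.class
puts points.length

# The Array can be turned back into the original String when needed.
puts points.pack('U*') == text
```

Because the result is a plain Array, `#length` and index-based access work regardless of what `String#codepoints` returns on a given Ruby version.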
Yorick Peterse a2452b6371 Use codepoints instead of chars in the lexer.
Grand wizard overlord @whitequark recommended this as it will bypass the need
for creating an individual String instance for every character (at least not
until needed). This becomes noticeable on large inputs (e.g. 100 MB of XML).
Previously these would result in the kernel OOM killing the process. With
codepoints, memory usage increases by a "mere" 1-1.5 GB.
2014-03-23 20:20:07 +01:00