core/oga - oga

Commit Graph

Author	SHA1	Message	Date
Yorick Peterse	97d8450cba	Removed the `regenerate` task.	2014-04-19 00:59:09 +02:00
Yorick Peterse	6f1ce17b31	Benchmark for lexer lines/second. This benchmark uses a fixture file that is automatically downloaded.	2014-04-17 20:06:24 +02:00
Yorick Peterse	54e6650338	Don't use define_method in the lexer. Profiling showed that calls to methods defined using `define_method` are really, really slow. Before this commit the lexer would process 3000-4000 lines per second. With this commit that has been increased to around 10 000 lines per second. Thanks to @headius for mentioning the (potential) overhead of define_method.	2014-04-17 19:08:26 +02:00
Yorick Peterse	d9fa4b7c45	Lex input as a sequence of bytes. Instead of lexing the input as a raw String or as a set of codepoints it's treated as a sequence of bytes. This removes the need of String#[] (replaced by String#byteslice) which in turn reduces the amount of memory needed and speeds up the lexing time. Thanks to @headius and @apeiros for suggesting this and rubber ducking along!	2014-04-17 17:45:05 +02:00
Yorick Peterse	70516b7447	Yield tokens in the lexer and parser. After some digging I found out that Racc has a method called `yyparse`. Using this method (and a custom callback method) you can `yield` tokens as a form of input. This makes it a lot easier to feed tokens as a stream from the lexer. Sadly the current performance of the lexer is still total garbage. Most of the memory usage also comes from using String#unpack, especially on large XML inputs (e.g. 100 MB of XML). It looks like the resulting memory usage is about 10x the input size. One option might be some kind of wrapper around String. This wrapper would have a sliding window of, say, 1024 bytes. When you create it the first 1024 bytes of the input would be unpacked. When seeking through the input this window would move forward. In theory this means that you'd only end up with having only 1024 Fixnum instances around at any given time instead of "a very big number". I have to test how efficient this is in practise.	2014-04-17 00:39:41 +02:00
Yorick Peterse	144c95cbb4	Replaced the HRS fixture with one from Gist. The HRS output is invalid, which Oga can not handle at this time.	2014-04-10 21:31:01 +02:00
Yorick Peterse	25edd2de00	Use a Set for storing void element names.	2014-04-10 12:28:47 +02:00
Yorick Peterse	b96f7c4852	Lex attributes with namespaces. These are lexed as just the name instead of two separate tokens.	2014-04-10 11:01:49 +02:00
Yorick Peterse	c974b96b88	Truncate lines in parser errors. The offending lines of code displayed in the error message are truncated to 80 characters. This should make reading the error messages less of a pain when dealing with very long lines of HTML/XML.	2014-04-10 10:08:51 +02:00
Yorick Peterse	292a98d7f6	Basic benchmarks for the Parser class.	2014-04-10 10:05:04 +02:00
Yorick Peterse	8ca7781842	Updated the lexer benchmarks. These had to be updated for the API changes of Oga::XML::Lexer.	2014-04-10 10:01:11 +02:00
Yorick Peterse	8237d5791d	Stream tokens when lexing. Instead of returning the tokens as a whole they are now streamed using XML::Lexer#advance. This method returns the next token upon every call. It uses a small buffer in case a particular block of text results in multiple tokens.	2014-04-09 22:08:13 +02:00
Yorick Peterse	e9bb97d261	First steps towards making the lexer stream tokens	2014-04-09 19:32:06 +02:00
Yorick Peterse	10d0ec1573	Specs for parsing various empty nodes.	2014-04-07 21:33:23 +02:00
Yorick Peterse	cb74c7edf9	Specs for XML parser errors.	2014-04-07 21:31:36 +02:00
Yorick Peterse	915d3ee505	Expanded tests for XML::Document#inspect.	2014-04-07 20:11:12 +02:00
Yorick Peterse	e9412c9c4e	Tests for various inspect methods.	2014-04-07 09:58:31 +02:00
Yorick Peterse	54ef125637	Basic docs for everything under Oga::XML.	2014-04-04 17:48:36 +02:00
Yorick Peterse	13a9228563	Properly indent doctype/XML decl inspect values.	2014-04-04 11:13:39 +02:00
Yorick Peterse	37a12722cb	Rough setup for a custom #inspect format. This format is a lot more readable than the default Ruby #inspect format (mostly due to not including previous/next/parent nodes).	2014-04-04 00:41:29 +02:00
Yorick Peterse	a2c525dd7c	Insert newlines after XML dec/doctypes.	2014-04-03 23:04:21 +02:00
Yorick Peterse	230fafa2d3	Document should not inherit from Node. A document is not an XML node on itself. If logic has to be shared between the Document and the Node class I'll resort to using mixins for this.	2014-04-03 22:45:40 +02:00
Yorick Peterse	c077988dd6	Tree building of doctypes.	2014-04-03 22:44:00 +02:00
Yorick Peterse	81b1155af3	Lex/parse doctype names separately.	2014-04-03 21:59:57 +02:00
Yorick Peterse	8185656c1e	Fixed typ.	2014-04-03 21:41:31 +02:00
Yorick Peterse	6cf906e500	Lexer tests for single quoted attributes.	2014-04-03 18:50:07 +02:00
Yorick Peterse	30c01a5aee	Tests for XML::TreeBuilder#handler_missing.	2014-04-03 09:43:30 +02:00
Yorick Peterse	0f129ceac9	Tests for XML::TreeBuilder#on_comment.	2014-04-03 09:38:18 +02:00
Yorick Peterse	bdb76cefc5	Dedicated handling of XML declaration nodes.	2014-04-02 22:30:45 +02:00
Yorick Peterse	d6c0a1f3f3	Lex/parser XML declaration attributes.	2014-04-02 22:01:17 +02:00
Yorick Peterse	fa2e71c790	Tests for TreeBuilder#on_document.	2014-03-28 18:52:08 +01:00
Yorick Peterse	f99c13b516	Tests + docs for the TreeBuilder class.	2014-03-28 17:11:54 +01:00
Yorick Peterse	6d866523b8	Renamed XML::Builder to XML::TreeBuilder.	2014-03-28 16:37:37 +01:00
Yorick Peterse	331726b2ca	Tests for the various XML node types.	2014-03-28 16:34:30 +01:00
Yorick Peterse	c366a96ce8	Rake task for generating code coverage.	2014-03-28 16:33:47 +01:00
Yorick Peterse	e141c084f9	Dedicated DOM builder class for CDATA tags.	2014-03-28 09:27:53 +01:00
Yorick Peterse	2b250bbf42	Rough DOM building setup.	2014-03-28 08:59:48 +01:00
Yorick Peterse	6ae52c1b12	Initial rough sketches for the DOM API.	2014-03-26 18:12:00 +01:00
Yorick Peterse	6c661f3ee9	Removed the donations section. I gave this some thought and I've removed it for two reasons: 1. My Dogecoin Wallet takes forever to sync with the network (13 weeks behind) so I uninstalled it. I can't be bothered waiting forever for a gimmick. 2. I don't like asking for donations/money. I'd much rather have people send me an Email thanking me for my work than for them to donate money. The latter means much more to me.	2014-03-25 23:55:10 +01:00
Yorick Peterse	4a48647d1e	Removed generated lexer/parser. I am a dumbass.	2014-03-25 21:47:40 +01:00
Yorick Peterse	fb626278a8	Re-wrapped comments in the XML lexer.	2014-03-25 10:12:39 +01:00
Yorick Peterse	8ebd72158c	Renamed XML::Lexer#t to #emit().	2014-03-25 09:42:52 +01:00
Yorick Peterse	79818eb349	Added a convenience class for parsing HTML. This removes the need for users having to set the `:html` option themselves.	2014-03-25 09:40:24 +01:00
Yorick Peterse	58009614f6	Moved XML specs into spec/oga/xml.	2014-03-25 09:36:39 +01:00
Yorick Peterse	7c03de0e2f	Renamed HTML_PARSER to PARSER_OUTPUT. This keeps it consistent with the lexer.	2014-03-25 09:35:48 +01:00
Yorick Peterse	eae13d21ed	Namespaced the lexer/parser under Oga::XML. With the upcoming XPath and CSS selector lexers/parsers it will be confusing to keep these in the root namespace.	2014-03-25 09:34:38 +01:00
Yorick Peterse	2259061c89	Don't require the 2nd Lexer#add_token argument.	2014-03-24 21:35:47 +01:00
Yorick Peterse	641c54261e	Simplified lexer output for comments.	2014-03-24 21:34:30 +01:00
Yorick Peterse	eaf1669b07	Simplified lexer output for CDATA tags.	2014-03-24 21:33:05 +01:00
Yorick Peterse	470be5a839	Simplified the lexer output for doctypes.	2014-03-24 21:32:16 +01:00
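
The message of commit 70516b7447 floats a sliding-window wrapper around String so that only a small slice of the input is unpacked into integers at any time. Below is a minimal sketch of that idea, not code from Oga: the class name `ByteWindow`, its methods and the default window size are all hypothetical, and the commit itself notes that the efficiency of this approach was still untested.

```ruby
# Hypothetical sketch of the sliding window described in commit 70516b7447.
# Nothing here is part of Oga; names and defaults are assumptions.
class ByteWindow
  # input - The String to read bytes from.
  # size  - The maximum number of unpacked bytes kept in memory at once.
  def initialize(input, size = 1024)
    @input  = input
    @size   = size
    @offset = 0 # absolute byte position of the first byte in the window
    @window = unpack_at(0)
  end

  # Returns the byte (an Integer) at the absolute position `index`, or nil
  # when the position lies outside the input.
  def [](index)
    return nil if index < 0 || index >= @input.bytesize

    # Re-unpack a fresh window starting at `index` whenever the requested
    # position is not covered by the current one.
    unless index >= @offset && index < @offset + @window.length
      @offset = index
      @window = unpack_at(index)
    end

    @window[index - @offset]
  end

  # Number of unpacked bytes currently held in memory.
  def window_length
    @window.length
  end

  private

  # Unpacks at most @size bytes starting at the given byte offset.
  def unpack_at(offset)
    chunk = @input.byteslice(offset, @size)

    chunk ? chunk.unpack('C*') : []
  end
end

window = ByteWindow.new('<p>hello</p>', 4)

window[0] # byte value of "<"
window[5] # outside the first window, so the window moves to position 5
```

With a window of this kind the number of Integer instances alive at once is bounded by the window size rather than by the input size. The cost is an extra String#byteslice and String#unpack call each time the reader leaves the current window, so whether it wins overall would have to be measured.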

232 Commits

All Branches