core/oga - oga

Commit Graph

Author	SHA1	Message	Date
Yorick Peterse	d9fa4b7c45	Lex input as a sequence of bytes. Instead of lexing the input as a raw String or as a set of codepoints it's treated as a sequence of bytes. This removes the need for String#[] (replaced by String#byteslice) which in turn reduces the amount of memory needed and speeds up the lexing time. Thanks to @headius and @apeiros for suggesting this and rubber ducking along!	2014-04-17 17:45:05 +02:00
Yorick Peterse	70516b7447	Yield tokens in the lexer and parser. After some digging I found out that Racc has a method called `yyparse`. Using this method (and a custom callback method) you can `yield` tokens as a form of input. This makes it a lot easier to feed tokens as a stream from the lexer. Sadly the current performance of the lexer is still total garbage. Most of the memory usage also comes from using String#unpack, especially on large XML inputs (e.g. 100 MB of XML). It looks like the resulting memory usage is about 10x the input size. One option might be some kind of wrapper around String. This wrapper would have a sliding window of, say, 1024 bytes. When you create it the first 1024 bytes of the input would be unpacked. When seeking through the input this window would move forward. In theory this means that you'd end up with only 1024 Fixnum instances around at any given time instead of "a very big number". I have to test how efficient this is in practice.	2014-04-17 00:39:41 +02:00
Yorick Peterse	25edd2de00	Use a Set for storing void element names.	2014-04-10 12:28:47 +02:00
Yorick Peterse	b96f7c4852	Lex attributes with namespaces. These are lexed as just the name instead of two separate tokens.	2014-04-10 11:01:49 +02:00
Yorick Peterse	c974b96b88	Truncate lines in parser errors. The offending lines of code displayed in the error message are truncated to 80 characters. This should make reading the error messages less of a pain when dealing with very long lines of HTML/XML.	2014-04-10 10:08:51 +02:00
Yorick Peterse	8237d5791d	Stream tokens when lexing. Instead of returning the tokens as a whole they are now streamed using XML::Lexer#advance. This method returns the next token upon every call. It uses a small buffer in case a particular block of text results in multiple tokens.	2014-04-09 22:08:13 +02:00
Yorick Peterse	e9bb97d261	First steps towards making the lexer stream tokens	2014-04-09 19:32:06 +02:00
Yorick Peterse	cb74c7edf9	Specs for XML parser errors.	2014-04-07 21:31:36 +02:00
Yorick Peterse	54ef125637	Basic docs for everything under Oga::XML.	2014-04-04 17:48:36 +02:00
Yorick Peterse	13a9228563	Properly indent doctype/XML decl inspect values.	2014-04-04 11:13:39 +02:00
Yorick Peterse	37a12722cb	Rough setup for a custom #inspect format. This format is a lot more readable than the default Ruby #inspect format (mostly due to not including previous/next/parent nodes).	2014-04-04 00:41:29 +02:00
Yorick Peterse	a2c525dd7c	Insert newlines after XML dec/doctypes.	2014-04-03 23:04:21 +02:00
Yorick Peterse	230fafa2d3	Document should not inherit from Node. A document is not an XML node in itself. If logic has to be shared between the Document and the Node class I'll resort to using mixins for this.	2014-04-03 22:45:40 +02:00
Yorick Peterse	c077988dd6	Tree building of doctypes.	2014-04-03 22:44:00 +02:00
Yorick Peterse	81b1155af3	Lex/parse doctype names separately.	2014-04-03 21:59:57 +02:00
Yorick Peterse	8185656c1e	Fixed typ.	2014-04-03 21:41:31 +02:00
Yorick Peterse	30c01a5aee	Tests for XML::TreeBuilder#handler_missing.	2014-04-03 09:43:30 +02:00
Yorick Peterse	bdb76cefc5	Dedicated handling of XML declaration nodes.	2014-04-02 22:30:45 +02:00
Yorick Peterse	d6c0a1f3f3	Lex/parse XML declaration attributes.	2014-04-02 22:01:17 +02:00
Yorick Peterse	f99c13b516	Tests + docs for the TreeBuilder class.	2014-03-28 17:11:54 +01:00
Yorick Peterse	6d866523b8	Renamed XML::Builder to XML::TreeBuilder.	2014-03-28 16:37:37 +01:00
Yorick Peterse	e141c084f9	Dedicated DOM builder class for CDATA tags.	2014-03-28 09:27:53 +01:00
Yorick Peterse	2b250bbf42	Rough DOM building setup.	2014-03-28 08:59:48 +01:00
Yorick Peterse	6ae52c1b12	Initial rough sketches for the DOM API.	2014-03-26 18:12:00 +01:00
Yorick Peterse	4a48647d1e	Removed generated lexer/parser. I am a dumbass.	2014-03-25 21:47:40 +01:00
Yorick Peterse	fb626278a8	Re-wrapped comments in the XML lexer.	2014-03-25 10:12:39 +01:00
Yorick Peterse	8ebd72158c	Renamed XML::Lexer#t to #emit().	2014-03-25 09:42:52 +01:00
Yorick Peterse	79818eb349	Added a convenience class for parsing HTML. This removes the need for users having to set the `:html` option themselves.	2014-03-25 09:40:24 +01:00
Yorick Peterse	eae13d21ed	Namespaced the lexer/parser under Oga::XML. With the upcoming XPath and CSS selector lexers/parsers it will be confusing to keep these in the root namespace.	2014-03-25 09:34:38 +01:00
Yorick Peterse	2259061c89	Don't require the 2nd Lexer#add_token argument.	2014-03-24 21:35:47 +01:00
Yorick Peterse	641c54261e	Simplified lexer output for comments.	2014-03-24 21:34:30 +01:00
Yorick Peterse	eaf1669b07	Simplified lexer output for CDATA tags.	2014-03-24 21:33:05 +01:00
Yorick Peterse	470be5a839	Simplified the lexer output for doctypes.	2014-03-24 21:32:16 +01:00
Yorick Peterse	ac775918ee	Lexing/parsing of XML declaration tags. This closes #12.	2014-03-24 21:30:19 +01:00
Yorick Peterse	b695ecf0df	Renamed element lexer tags. T_ELEM_OPEN has been renamed to T_ELEM_START, T_ELEM_CLOSE has been renamed to T_ELEM_END. This keeps the token names consistent with the other ones (e.g. T_COMMENT_START).	2014-03-24 20:32:43 +01:00
Yorick Peterse	52abc9d29e	Basic documentation for Oga::Parser.	2014-03-23 21:29:57 +01:00
Yorick Peterse	19c1d66287	Use String#unpack instead of String#codepoints. The latter returns an Enumerable which on Ruby 1.9.3 doesn't have #length available. Besides this it's better to just return an Array since we'll iterate over every character anyway.	2014-03-23 21:21:27 +01:00
Yorick Peterse	a2452b6371	Use codepoints instead of chars in the lexer. Grand wizard overlord @whitequark recommended this as it will bypass the need for creating an individual String instance for every character (at least not until needed). This becomes noticeable on large inputs (e.g. 100 MB of XML). Previously these would result in the kernel OOM killing the process. Using codepoints, memory increases by a "mere" 1-1.5 GB.	2014-03-23 20:20:07 +01:00
Yorick Peterse	cdf5f1d541	Improve lexer performance by 20x or so. This was a rather interesting turn of events. As it turned out the Ragel generated lexer was extremely slow on large inputs. For example, lexing benchmark/fixtures/hrs.html took around 10 seconds according to the benchmark benchmark/lexer/bench_html_time.rb: Rehearsal -------------------------------------------------------- lex HTML 10.870000 0.000000 10.870000 ( 10.877920) ---------------------------------------------- total: 10.870000sec user system total real lex HTML 10.440000 0.010000 10.450000 ( 10.449500) The corresponding benchmark-ips benchmark (bench_html.rb) presented the following results: Calculating ------------------------------------- lex HTML 1 i/100ms ------------------------------------------------- lex HTML 0.1 (±0.0%) i/s - 1 in 10.472534s 10 seconds for around 165 KB of HTML was not acceptable. I spent a good time profiling things, even submitting a patch to Ragel (https://github.com/athurston/ragel/pull/1). At some point I decided to give a pure C lexer + FFI bindings a try (so it would also work on JRuby). Trying to write C reminded me why I didn't want to do it in C in the first place. Around 2AM I gave up and went to brush my teeth and head to bed. Then, a miracle happened. More precisely, I actually gave my brain some time to think away from the computer. I said to myself: What if I feed Ragel an Array of characters instead of an entire String? That way I bypass String#[] being expensive without having to change all of Ragel or use a different language. The results of this change are rather interesting. 
With these changes the benchmark bench_html_time.rb now gives back the following: Rehearsal -------------------------------------------------------- lex HTML 0.550000 0.000000 0.550000 ( 0.550649) ----------------------------------------------- total: 0.550000sec user system total real lex HTML 0.520000 0.000000 0.520000 ( 0.520713) The benchmark bench_html.rb in turn gives back this: Calculating ------------------------------------- lex HTML 1 i/100ms ------------------------------------------------- lex HTML 2.0 (±0.0%) i/s - 10 in 5.120905s According to both benchmarks we now have a speedup of about 20 times without having to make any further changes to Ragel or the lexer itself. I love it when a plan comes together.	2014-03-23 12:46:22 +01:00
Yorick Peterse	0e9d9b844c	Removed duplicate start_element rule.	2014-03-21 18:54:47 +01:00
Yorick Peterse	56ed9e949c	Use index based buffering for strings. This uses the same system as for T_TEXT nodes.	2014-03-21 17:45:40 +01:00
Yorick Peterse	9fa694ad4f	Use index based buffers for text nodes. Instead of appending single characters to a String buffer the lexer now uses a start and end position to figure out what the buffer is. This is a lot faster than constantly appending to a String.	2014-03-21 17:32:07 +01:00
Yorick Peterse	55f116124c	Fix for showing lines in parser errors.	2014-03-21 00:16:20 +01:00
Yorick Peterse	7749f4abce	Corrected a comment in the parser.	2014-03-21 00:10:20 +01:00
Yorick Peterse	a20ec0000a	Show up to 5 surrounding lines in parser errors.	2014-03-20 23:40:25 +01:00
Yorick Peterse	91fb7523fd	Lex open tags with newlines in them.	2014-03-20 23:39:29 +01:00
Yorick Peterse	ba17996bfc	Fancier error messages for the parser. The error messages of the parser now contain surrounding lines of code instead of only the offending line of code. This should make debugging a bit easier. Line numbers are also shown for each line.	2014-03-20 23:30:24 +01:00
Yorick Peterse	74bc11a239	Rip out column counting. This makes both the lexer and parser quite a bit easier to use. Counting column numbers also isn't really needed when parsing XML/HTML.	2014-03-20 19:44:28 +01:00
Yorick Peterse	70a39042e7	Removed useless rules from the parser.	2014-03-20 18:58:32 +01:00
Yorick Peterse	03774f2788	Documented the lexer.	2014-03-19 22:05:57 +01:00
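
Commit 70516b7447 above describes a possible String wrapper with a sliding window of unpacked bytes. A minimal sketch of that idea follows. The class name ByteWindow, its methods, and the window alignment strategy are assumptions made for illustration; none of this is taken from Oga's code, and the commit itself notes the approach was still untested.

```ruby
# Hypothetical sketch of the sliding-window wrapper described in 70516b7447.
# ByteWindow and its methods are illustrative names, not part of Oga.
class ByteWindow
  def initialize(input, size = 1024)
    @input  = input
    @size   = size
    @offset = 0
    @window = unpack_at(0) # unpack only the first `size` bytes up front
  end

  # Returns the byte (an Integer) at absolute position `index`, or nil when
  # the position lies outside the input.
  def [](index)
    return nil if index < 0

    unless index >= @offset && index < @offset + @window.length
      # Move the window to the size-aligned chunk that contains `index`.
      @offset = index - (index % @size)
      @window = unpack_at(@offset)
    end

    @window[index - @offset]
  end

  # Number of unpacked integers currently held in memory.
  def window_length
    @window.length
  end

  private

  def unpack_at(offset)
    chunk = @input.byteslice(offset, @size)

    chunk ? chunk.unpack('C*') : []
  end
end

window = ByteWindow.new('<a>hello</a>', 4)

puts window[0] # byte value of "<"
puts window[5] # byte value of "l", read after the window has moved
```

With this layout at most `size` integers are alive at once, at the cost of re-unpacking a chunk whenever a lookup falls outside the current window.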

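Commits 9fa694ad4f and 56ed9e949c above switch text and string buffering from appending characters to tracking a start and end position. The sketch below illustrates that general technique on a toy input, combined with String#byteslice as mentioned in d9fa4b7c45. The method text_segments and its simplified tag handling are assumptions for illustration only and do not reflect how Oga's Ragel-based lexer is written.

```ruby
# Hypothetical illustration of index based buffering; not Oga's lexer.
LT = '<'.ord
GT = '>'.ord

# Returns the text found outside of tags. Instead of appending each byte to a
# String buffer, only the start position of the current run is remembered and
# the run is cut out once with String#byteslice.
def text_segments(input)
  segments = []
  start    = nil
  in_tag   = false
  index    = 0

  input.each_byte do |byte|
    if byte == LT
      # A tag begins: flush the pending text run, if any.
      segments << input.byteslice(start, index - start) if start
      start  = nil
      in_tag = true
    elsif byte == GT
      in_tag = false
    elsif !in_tag && start.nil?
      start = index # first byte of a new text run
    end

    index += 1
  end

  # Flush trailing text after the last tag.
  segments << input.byteslice(start, index - start) if start

  segments
end

p text_segments('<p>hello</p> world')
```

Each text run costs a single slice regardless of its length, which is the saving the commit message attributes to no longer appending to a String one character at a time.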

93 Commits