core/oga - oga

Commit Graph

Author	SHA1	Message	Date
Yorick Peterse	79818eb349	Added a convenience class for parsing HTML. This removes the need for users to set the `:html` option themselves.	2014-03-25 09:40:24 +01:00
Yorick Peterse	58009614f6	Moved XML specs into spec/oga/xml.	2014-03-25 09:36:39 +01:00
Yorick Peterse	7c03de0e2f	Renamed HTML_PARSER to PARSER_OUTPUT. This keeps it consistent with the lexer.	2014-03-25 09:35:48 +01:00
Yorick Peterse	eae13d21ed	Namespaced the lexer/parser under Oga::XML. With the upcoming XPath and CSS selector lexers/parsers it will be confusing to keep these in the root namespace.	2014-03-25 09:34:38 +01:00
Yorick Peterse	2259061c89	Don't require the 2nd Lexer#add_token argument.	2014-03-24 21:35:47 +01:00
Yorick Peterse	641c54261e	Simplified lexer output for comments.	2014-03-24 21:34:30 +01:00
Yorick Peterse	eaf1669b07	Simplified lexer output for CDATA tags.	2014-03-24 21:33:05 +01:00
Yorick Peterse	470be5a839	Simplified the lexer output for doctypes.	2014-03-24 21:32:16 +01:00
Yorick Peterse	ac775918ee	Lexing/parsing of XML declaration tags. This closes #12.	2014-03-24 21:30:19 +01:00
Yorick Peterse	b695ecf0df	Renamed element lexer tags. T_ELEM_OPEN has been renamed to T_ELEM_START, T_ELEM_CLOSE has been renamed to T_ELEM_END. This keeps the token names consistent with the other ones (e.g. T_COMMENT_START).	2014-03-24 20:32:43 +01:00
Yorick Peterse	0b6ba6e6b5	Fixed typ.	2014-03-24 20:20:19 +01:00
Yorick Peterse	ca66339a08	README entry on donations.	2014-03-24 20:13:16 +01:00
Yorick Peterse	52abc9d29e	Basic documentation for Oga::Parser.	2014-03-23 21:29:57 +01:00
Yorick Peterse	19c1d66287	Use String#unpack instead of String#codepoints. The latter returns an Enumerable which on Ruby 1.9.3 doesn't have #length available. Besides this it's better to just return an Array since we'll iterate over every character anyway.	2014-03-23 21:21:27 +01:00
Yorick Peterse	a2452b6371	Use codepoints instead of chars in the lexer. Grand wizard overlord @whitequark recommended this as it will bypass the need for creating individual String instances for every character (at least not until needed). This becomes noticeable on large inputs (e.g. 100 MB of XML). Previously these would result in the kernel OOM killing the process. Using codepoints, memory increases by a "mere" 1-1.5 GB.	2014-03-23 20:20:07 +01:00
Yorick Peterse	cdf5f1d541	Improve lexer performance by 20x or so. This was a rather interesting turn of events. As it turned out the Ragel generated lexer was extremely slow on large inputs. For example, lexing benchmark/fixtures/hrs.html took around 10 seconds according to the benchmark benchmark/lexer/bench_html_time.rb (columns: user, system, total, real): rehearsal: lex HTML 10.870000 0.000000 10.870000 (10.877920), total: 10.870000sec; measured run: lex HTML 10.440000 0.010000 10.450000 (10.449500). The corresponding benchmark-ips benchmark (bench_html.rb) presented the following results: calculating: lex HTML 1 i/100ms; result: lex HTML 0.1 (±0.0%) i/s - 1 in 10.472534s. 10 seconds for around 165 KB of HTML was not acceptable. I spent a good time profiling things, even submitting a patch to Ragel (https://github.com/athurston/ragel/pull/1). At some point I decided to give a pure C lexer + FFI bindings a try (so it would also work on JRuby). Trying to write C reminded me why I didn't want to do it in C in the first place. Around 2AM I gave up and went to brush my teeth and head to bed. Then, a miracle happened. More precisely, I actually gave my brain some time to think away from the computer. I said to myself: What if I feed Ragel an Array of characters instead of an entire String? That way I bypass String#[] being expensive without having to change all of Ragel or use a different language. The results of this change are rather interesting. With these changes the benchmark bench_html_time.rb now gives back the following: rehearsal: lex HTML 0.550000 0.000000 0.550000 (0.550649), total: 0.550000sec; measured run: lex HTML 0.520000 0.000000 0.520000 (0.520713). The benchmark bench_html.rb in turn gives back this: calculating: lex HTML 1 i/100ms; result: lex HTML 2.0 (±0.0%) i/s - 10 in 5.120905s. According to both benchmarks we now have a speedup of about 20 times without having to make any further changes to Ragel or the lexer itself. I love it when a plan comes together.	2014-03-23 12:46:22 +01:00
Yorick Peterse	4b914b3d6f	Added extra benchmarks for lexing large inputs.	2014-03-23 12:46:04 +01:00
Yorick Peterse	0e9d9b844c	Removed duplicate start_element rule.	2014-03-21 18:54:47 +01:00
Yorick Peterse	56ed9e949c	Use index based buffering for strings. This uses the same system as for T_TEXT nodes.	2014-03-21 17:45:40 +01:00
Yorick Peterse	d7a40ec470	Simple benchmark for lexing elements.	2014-03-21 17:45:23 +01:00
Yorick Peterse	9fa694ad4f	Use index based buffers for text nodes. Instead of appending single characters to a String buffer the lexer now uses a start and end position to figure out what the buffer is. This is a lot faster than constantly appending to a String.	2014-03-21 17:32:07 +01:00
Yorick Peterse	2852afce9b	Benchmark for measuring CDATA lexing.	2014-03-21 16:59:44 +01:00
Yorick Peterse	55f116124c	Fix for showing lines in parser errors.	2014-03-21 00:16:20 +01:00
Yorick Peterse	7749f4abce	Corrected a comment in the parser.	2014-03-21 00:10:20 +01:00
Yorick Peterse	a20ec0000a	Show up to 5 surrounding lines in parser errors.	2014-03-20 23:40:25 +01:00
Yorick Peterse	91fb7523fd	Lex open tags with newlines in them.	2014-03-20 23:39:29 +01:00
Yorick Peterse	ba17996bfc	Fancier error messages for the parser. The error messages of the parser now contain surrounding lines of code instead of only the offending line of code. This should make debugging a bit easier. Line numbers are also shown for each line.	2014-03-20 23:30:24 +01:00
Yorick Peterse	74bc11a239	Rip out column counting. This makes both the lexer and parser quite a bit easier to use. Counting column numbers also isn't really needed when parsing XML/HTML.	2014-03-20 19:44:28 +01:00
Yorick Peterse	70a39042e7	Removed useless rules from the parser.	2014-03-20 18:58:32 +01:00
Yorick Peterse	03774f2788	Documented the lexer.	2014-03-19 22:05:57 +01:00
Yorick Peterse	192ba9bb54	Expanded the lexer comment tests.	2014-03-19 21:44:57 +01:00
Yorick Peterse	f1fcdfbacb	Cleaned up the Ragel bits of the lexer. This removes some of the complexity that existed before (e.g. too many state machines) and fixes a bunch of problems with nested data.	2014-03-19 21:44:10 +01:00
Yorick Peterse	7271e74396	Revert "Compacter parser AST." Although this AST is compacter it will result in conflicts between (text), (attributes) and (attribute) nodes in regular XML documents. This is due to XML allowing elements with these names (unlike in HTML). This reverts commit `8898d08831`.	2014-03-18 18:55:16 +01:00
Yorick Peterse	9687dd379f	Added a .ruby-version file.	2014-03-18 18:08:25 +01:00
Yorick Peterse	56f22c311e	Allow JRuby to fail for now.	2014-03-18 00:13:33 +01:00
Yorick Peterse	422832fd68	Lowered the required Ragel version to 6.7.	2014-03-18 00:12:21 +01:00
Yorick Peterse	091e32c17a	Install Ragel on Travis CI.	2014-03-18 00:09:16 +01:00
Yorick Peterse	8d4d3999b5	Configuration file for Travis CI.	2014-03-17 21:52:24 +01:00
Yorick Peterse	9975c9c430	Removed the emit_text_buffer Ragel action.	2014-03-17 21:49:49 +01:00
Yorick Peterse	274ab359ba	Don't use separate tokens/nodes for newlines. Newlines are now lexed together with regular text. The line numbers are advanced based on the number of "\n" sequences in a text buffer.	2014-03-17 21:26:21 +01:00
Yorick Peterse	8898d08831	Compacter parser AST. The AST no longer uses the generic `element` type for element nodes but instead changes the type based on the element type. That is, a <p> element now results in an (p) node, <link> in (link), etc.	2014-03-17 21:03:54 +01:00
Yorick Peterse	8d3f3f15d7	Renamed parse_html() to parse().	2014-03-16 23:46:20 +01:00
Yorick Peterse	cb75edc30d	Basic support for lexing/parsing HTML5. This will need a bunch of extra tests before I'll consider closing #7.	2014-03-16 23:42:24 +01:00
Yorick Peterse	ce8bbdb64a	Parsing support for multiple nested nodes.	2014-03-15 20:19:54 +01:00
Yorick Peterse	05ee3c13c9	Parsing support for nested element/text nodes.	2014-03-14 00:44:11 +01:00
Yorick Peterse	6b2f682c5c	Tests for lexing a basic HTML document. This also comes with some changes to the lexer so that it advances column/line numbers correctly.	2014-03-13 23:55:18 +01:00
Yorick Peterse	edf2e4112b	Added a test for parsing bare text tokens.	2014-03-13 00:42:58 +01:00
Yorick Peterse	34f8779c94	Lexing of bare regular text. This is currently a bit of a hack but at least we're slowly getting there.	2014-03-13 00:42:12 +01:00
Yorick Peterse	2fbca93ae8	Support for parsing nested elements.	2014-03-12 23:13:28 +01:00
Yorick Peterse	8cfa81aed9	Basic support for parsing elements. This includes support for elements with namespaces and attributes. Nested elements are not yet supported.	2014-03-12 23:02:54 +01:00
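
Several of the lexer commits above (19c1d66287, 9fa694ad4f, 274ab359ba) describe three related ideas: decoding the input into an Array of codepoints with String#unpack, tracking text by start/end index instead of appending to a String, and advancing line numbers by counting "\n" in each text buffer. The following is only a minimal sketch of how those ideas might fit together. `TinyTextLexer` and its methods are hypothetical names invented for illustration and are not Oga's actual lexer, which is generated by Ragel.

```ruby
# Hypothetical sketch for illustration only; this class is NOT part of Oga.
class TinyTextLexer
  LT = '<'.ord
  GT = '>'.ord
  NEWLINE = "\n".ord

  attr_reader :line

  def initialize(input)
    # "U*" decodes a UTF-8 String into an Array of Integer codepoints,
    # so no per-character String objects are created up front.
    @data = input.unpack('U*')
    @line = 1
  end

  # Returns [text, line] pairs for every run of text outside <...> tags.
  def text_runs
    runs = []
    start = nil # index at which the current text run began
    in_tag = false

    @data.each_with_index do |codepoint, index|
      if in_tag
        # Newlines inside a tag still advance the line counter.
        @line += 1 if codepoint == NEWLINE
        in_tag = false if codepoint == GT
      elsif codepoint == LT
        runs << emit(start, index) if start
        start = nil
        in_tag = true
      else
        # Only remember where the text began; nothing is appended.
        start ||= index
      end
    end

    runs << emit(start, @data.length) if start
    runs
  end

  private

  # Builds the text once from the index range, then advances the line
  # counter by the number of "\n" characters found in that text.
  def emit(start, stop)
    text = @data[start...stop].pack('U*')
    token = [text, @line]
    @line += text.count("\n")
    token
  end
end

p TinyTextLexer.new("<p>foo\nbar</p>baz").text_runs
```

In this sketch each token records the line on which its text starts, and the String for a text run is built only once, when the run ends.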

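Commit cdf5f1d541 attributes its speedup to indexing into an Array instead of calling String#[] on the whole input. The sketch below shows one way such a comparison could be measured with Ruby's standard Benchmark module. The helper names and the sample input are assumptions made up for this example, not code or fixtures from the repository, and the actual timings will vary with the Ruby version and the input's encoding.

```ruby
require 'benchmark'

# Hypothetical helpers for illustration; neither exists in Oga.

# Counts "<" characters by indexing into the String itself.
def count_lt_in_string(input)
  count = 0
  index = 0
  length = input.length
  while index < length
    count += 1 if input[index] == '<'
    index += 1
  end
  count
end

# Counts "<" characters by indexing into an Array of codepoints.
def count_lt_in_codepoints(codepoints)
  lt = '<'.ord
  count = 0
  index = 0
  length = codepoints.length
  while index < length
    count += 1 if codepoints[index] == lt
    index += 1
  end
  count
end

# Made-up sample input containing a multi-byte character.
input = '<p>héllo</p>' * 1_000
codepoints = input.unpack('U*')

Benchmark.bm(16) do |bench|
  bench.report('String#[]') { count_lt_in_string(input) }
  bench.report('Array#[]')  { count_lt_in_codepoints(codepoints) }
end
```

Both helpers return the same count; only the way each character is fetched differs, which is the part the commit message identifies as expensive.
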