core/oga - oga

Commit Graph

Author	SHA1	Message	Date
Yorick Peterse	08d412da7e	First shot at removing the AST layer. The AST layer is being removed because it doesn't really serve a useful purpose. In particular when creating a streaming parser the AST nodes would only introduce extra overhead. As a result, the parser will emit a DOM tree directly instead of first emitting an AST.	2014-04-21 23:05:39 +02:00
Yorick Peterse	d9fa4b7c45	Lex input as a sequence of bytes. Instead of lexing the input as a raw String or as a set of codepoints it's treated as a sequence of bytes. This removes the need for String#[] (replaced by String#byteslice) which in turn reduces the amount of memory needed and speeds up the lexing time. Thanks to @headius and @apeiros for suggesting this and rubber ducking along!	2014-04-17 17:45:05 +02:00
Yorick Peterse	b96f7c4852	Lex attributes with namespaces. These are lexed as just the name instead of two separate tokens.	2014-04-10 11:01:49 +02:00
Yorick Peterse	8237d5791d	Stream tokens when lexing. Instead of returning the tokens as a whole they are now streamed using XML::Lexer#advance. This method returns the next token upon every call. It uses a small buffer in case a particular block of text results in multiple tokens.	2014-04-09 22:08:13 +02:00
Yorick Peterse	10d0ec1573	Specs for parsing various empty nodes.	2014-04-07 21:33:23 +02:00
Yorick Peterse	cb74c7edf9	Specs for XML parser errors.	2014-04-07 21:31:36 +02:00
Yorick Peterse	915d3ee505	Expanded tests for XML::Document#inspect.	2014-04-07 20:11:12 +02:00
Yorick Peterse	e9412c9c4e	Tests for various inspect methods.	2014-04-07 09:58:31 +02:00
Yorick Peterse	a2c525dd7c	Insert newlines after XML dec/doctypes.	2014-04-03 23:04:21 +02:00
Yorick Peterse	c077988dd6	Tree building of doctypes.	2014-04-03 22:44:00 +02:00
Yorick Peterse	81b1155af3	Lex/parse doctype names separately.	2014-04-03 21:59:57 +02:00
Yorick Peterse	6cf906e500	Lexer tests for single quoted attributes.	2014-04-03 18:50:07 +02:00
Yorick Peterse	30c01a5aee	Tests for XML::TreeBuilder#handler_missing.	2014-04-03 09:43:30 +02:00
Yorick Peterse	0f129ceac9	Tests for XML::TreeBuilder#on_comment.	2014-04-03 09:38:18 +02:00
Yorick Peterse	bdb76cefc5	Dedicated handling of XML declaration nodes.	2014-04-02 22:30:45 +02:00
Yorick Peterse	d6c0a1f3f3	Lex/parse XML declaration attributes.	2014-04-02 22:01:17 +02:00
Yorick Peterse	fa2e71c790	Tests for TreeBuilder#on_document.	2014-03-28 18:52:08 +01:00
Yorick Peterse	f99c13b516	Tests + docs for the TreeBuilder class.	2014-03-28 17:11:54 +01:00
Yorick Peterse	331726b2ca	Tests for the various XML node types.	2014-03-28 16:34:30 +01:00
Yorick Peterse	79818eb349	Added a convenience class for parsing HTML. This removes the need for users having to set the `:html` option themselves.	2014-03-25 09:40:24 +01:00
Yorick Peterse	58009614f6	Moved XML specs into spec/oga/xml.	2014-03-25 09:36:39 +01:00
Yorick Peterse	eae13d21ed	Namespaced the lexer/parser under Oga::XML. With the upcoming XPath and CSS selector lexers/parsers it will be confusing to keep these in the root namespace.	2014-03-25 09:34:38 +01:00
Yorick Peterse	641c54261e	Simplified lexer output for comments.	2014-03-24 21:34:30 +01:00
Yorick Peterse	eaf1669b07	Simplified lexer output for CDATA tags.	2014-03-24 21:33:05 +01:00
Yorick Peterse	470be5a839	Simplified the lexer output for doctypes.	2014-03-24 21:32:16 +01:00
Yorick Peterse	ac775918ee	Lexing/parsing of XML declaration tags. This closes #12.	2014-03-24 21:30:19 +01:00
Yorick Peterse	b695ecf0df	Renamed element lexer tags. T_ELEM_OPEN has been renamed to T_ELEM_START, T_ELEM_CLOSE has been renamed to T_ELEM_END. This keeps the token names consistent with the other ones (e.g. T_COMMENT_START).	2014-03-24 20:32:43 +01:00
Yorick Peterse	91fb7523fd	Lex open tags with newlines in them.	2014-03-20 23:39:29 +01:00
Yorick Peterse	74bc11a239	Rip out column counting. This makes both the lexer and parser quite a bit easier to use. Counting column numbers isn't also really needed when parsing XML/HTML.	2014-03-20 19:44:28 +01:00
Yorick Peterse	192ba9bb54	Expanded the lexer comment tests.	2014-03-19 21:44:57 +01:00
Yorick Peterse	7271e74396	Revert "Compacter parser AST." Although this AST is compacter it will result in conflicts between (text), (attributes) and (attribute) nodes in regular XML documents. This is due to XML allowing elements with these names (unlike in HTML). This reverts commit `8898d08831`.	2014-03-18 18:55:16 +01:00
Yorick Peterse	274ab359ba	Don't use separate tokens/nodes for newlines. Newlines are now lexed together with regular text. The line numbers are advanced based on the number of "\n" sequences in a text buffer.	2014-03-17 21:26:21 +01:00
Yorick Peterse	8898d08831	Compacter parser AST. The AST no longer uses the generic `element` type for element nodes but instead changes the type based on the element type. That is, a <p> element now results in an (p) node, <link> in (link), etc.	2014-03-17 21:03:54 +01:00
Yorick Peterse	8d3f3f15d7	Renamed parse_html() to parse().	2014-03-16 23:46:20 +01:00
Yorick Peterse	cb75edc30d	Basic support for lexing/parsing HTML5. This will need a bunch of extra tests before I'll consider closing #7.	2014-03-16 23:42:24 +01:00
Yorick Peterse	ce8bbdb64a	Parsing support for multiple nested nodes.	2014-03-15 20:19:54 +01:00
Yorick Peterse	05ee3c13c9	Parsing support for nested element/text nodes.	2014-03-14 00:44:11 +01:00
Yorick Peterse	6b2f682c5c	Tests for lexing a basic HTML document. This also comes with some changes to the lexer so that it advances column/line numbers correctly.	2014-03-13 23:55:18 +01:00
Yorick Peterse	edf2e4112b	Added a test for parsing bare text tokens.	2014-03-13 00:42:58 +01:00
Yorick Peterse	34f8779c94	Lexing of bare regular text. This is currently a bit of a hack but at least we're slowly getting there.	2014-03-13 00:42:12 +01:00
Yorick Peterse	2fbca93ae8	Support for parsing nested elements.	2014-03-12 23:13:28 +01:00
Yorick Peterse	8cfa81aed9	Basic support for parsing elements. This includes support for elements with namespaces and attributes. Nested elements are not yet supported.	2014-03-12 23:02:54 +01:00
Yorick Peterse	98b3443e7f	Lexing of element attributes without values.	2014-03-12 22:41:17 +01:00
Yorick Peterse	ed9d8c05a2	Added support for parsing comments.	2014-03-12 22:20:12 +01:00
Yorick Peterse	0a396043f8	Support for parsing CDATA tags.	2014-03-11 22:22:02 +01:00
Yorick Peterse	c9592856f0	Updated parsing of doctypes. The resulting nodes now separate the type, public and system IDs into separate string values.	2014-03-11 22:08:21 +01:00
Yorick Peterse	4a41894e2c	Updated the doctype parser specs.	2014-03-11 22:02:26 +01:00
Yorick Peterse	8ce76be050	Moved the parser class to Oga::Parser. Oga will use the same parser for XML and HTML so it doesn't make sense to separate the two into different namespaces (at least for now).	2014-03-11 22:01:50 +01:00
Yorick Peterse	eacd9b88cf	Reworked token generation for elements. This emits separate tokens for the start tag (T_ELEMENT_OPEN) and name (T_ELEMENT_NAME). This makes it easier to include the namespace of an element (T_ELEMENT_NS) in the output.	2014-03-10 23:50:39 +01:00
Yorick Peterse	cd53d5e426	Fixed advancing column numbers. In a bunch of cases the column number would not be increased correctly.	2014-03-07 23:54:56 +01:00

66 Commits