core/oga - oga

Commit Graph

Author	SHA1	Message	Date
Yorick Peterse	a1e9e74b9c	Updated a benchmark description.	2014-04-29 14:24:33 +02:00
Yorick Peterse	a42240bc2e	Profiling setup for the pull parser.	2014-04-29 13:50:50 +02:00
Yorick Peterse	d5e59c38ac	Profiling setup for the DOM parser.	2014-04-29 13:47:55 +02:00
Yorick Peterse	2c4890533c	Set initial size for the lexer graph.	2014-04-29 13:47:36 +02:00
Yorick Peterse	a111e673cb	Changed the big XML file size to 10 MB. This makes various calculations a bit easier than when the file is 11 MB in size.	2014-04-29 13:42:02 +02:00
Yorick Peterse	53c45c621b	Basic memory profiling setup. This makes it a bit easier to profile memory usage of certain components and plot them using Gnuplot. In the past I would write one-off scripts for this and throw them away, only to figure out I needed them again later on. Profiling samples are written to profile/samples and can be plotted using corresponding Gnuplot scripts found in profile/plot. The latter requires Gnuplot to be installed.	2014-04-29 13:38:56 +02:00
Yorick Peterse	70fcc8534c	Benchmark for parsing big XML documents.	2014-04-29 13:05:45 +02:00
Yorick Peterse	45b0cdf811	Track element name nesting in the pull parser. Tracking the names of nested elements makes it a lot easier to do contextual pull parsing. Without this it's impossible to know what context the parser is in at a given moment. For memory reasons the parser currently only tracks the element names. In the future it might perhaps also track extra information to make parsing easier.	2014-04-28 23:40:36 +02:00
Yorick Peterse	030a0068bd	Basic pull parsing setup. This parser extends the regular DOM parser but delegates certain nodes to a block instead of building a DOM tree. The API is a bit raw in its current form but I'll extend it and make it a bit more user friendly in the following commits. In particular I want to make it easier to figure out if a certain node is nested inside another node.	2014-04-28 17:22:17 +02:00
Yorick Peterse	fd5bbbc9a2	Move element recursion handling into a method. This makes it easier to disable later on in the streaming parser.	2014-04-28 10:25:05 +02:00
Yorick Peterse	785ec26fe7	Create Element instances before recursing.	2014-04-28 10:21:34 +02:00
Yorick Peterse	9939cf49eb	Move parser callback code into dedicated methods.	2014-04-28 10:18:55 +02:00
Yorick Peterse	5d05aed6ec	Corrected docs for XML::Parser.	2014-04-26 12:57:35 +02:00
Yorick Peterse	f53fe4ed7c	Reset the lexer when resetting the parser. Also removed the unused @lines instance variable.	2014-04-25 00:15:24 +02:00
Yorick Peterse	83ff0e6656	Various small parser cleanups.	2014-04-25 00:07:53 +02:00
Yorick Peterse	ecf6851711	Revert "Move linking of child nodes to a dedicated mixin." This doesn't actually make things any easier. It also introduces a weirdly named mixin. This reverts commit `0968465f0c`.	2014-04-24 21:16:31 +02:00
Yorick Peterse	0968465f0c	Move linking of child nodes to a dedicated mixin.	2014-04-24 09:43:50 +02:00
Yorick Peterse	08d412da7e	First shot at removing the AST layer. The AST layer is being removed because it doesn't really serve a useful purpose. In particular when creating a streaming parser the AST nodes would only introduce extra overhead. As a result of this the parser will emit a DOM tree directly instead of first emitting an AST.	2014-04-21 23:05:39 +02:00
Yorick Peterse	9ee9ec14cb	Lexer: only pop elements when needed.	2014-04-19 01:10:32 +02:00
Yorick Peterse	c8c9da2922	Track the XML fixture in Git. To make running benchmarks easier we'll track the XML file in Git in its compressed form. I also decreased the size of the XML file from ~50 MB to ~10MB.	2014-04-19 01:03:14 +02:00
Yorick Peterse	97d8450cba	Removed the `regenerate` task.	2014-04-19 00:59:09 +02:00
Yorick Peterse	6f1ce17b31	Benchmark for lexer lines/second. This benchmark uses a fixture file that is automatically downloaded.	2014-04-17 20:06:24 +02:00
Yorick Peterse	54e6650338	Don't use define_method in the lexer. Profiling showed that calls to methods defined using `define_method` are really, really slow. Before this commit the lexer would process 3000-4000 lines per second. With this commit that has been increased to around 10 000 lines per second. Thanks to @headius for mentioning the (potential) overhead of define_method.	2014-04-17 19:08:26 +02:00
Yorick Peterse	d9fa4b7c45	Lex input as a sequence of bytes. Instead of lexing the input as a raw String or as a set of codepoints it's treated as a sequence of bytes. This removes the need for String#[] (replaced by String#byteslice) which in turn reduces the amount of memory needed and speeds up the lexing time. Thanks to @headius and @apeiros for suggesting this and rubber ducking along!	2014-04-17 17:45:05 +02:00
Yorick Peterse	70516b7447	Yield tokens in the lexer and parser. After some digging I found out that Racc has a method called `yyparse`. Using this method (and a custom callback method) you can `yield` tokens as a form of input. This makes it a lot easier to feed tokens as a stream from the lexer. Sadly the current performance of the lexer is still total garbage. Most of the memory usage also comes from using String#unpack, especially on large XML inputs (e.g. 100 MB of XML). It looks like the resulting memory usage is about 10x the input size. One option might be some kind of wrapper around String. This wrapper would have a sliding window of, say, 1024 bytes. When you create it the first 1024 bytes of the input would be unpacked. When seeking through the input this window would move forward. In theory this means that you'd end up with only 1024 Fixnum instances around at any given time instead of "a very big number". I have to test how efficient this is in practice.	2014-04-17 00:39:41 +02:00
Yorick Peterse	144c95cbb4	Replaced the HRS fixture with one from Gist. The HRS output is invalid, which Oga can not handle at this time.	2014-04-10 21:31:01 +02:00
Yorick Peterse	25edd2de00	Use a Set for storing void element names.	2014-04-10 12:28:47 +02:00
Yorick Peterse	b96f7c4852	Lex attributes with namespaces. These are lexed as just the name instead of two separate tokens.	2014-04-10 11:01:49 +02:00
Yorick Peterse	c974b96b88	Truncate lines in parser errors. The offending lines of code displayed in the error message are truncated to 80 characters. This should make reading the error messages less of a pain when dealing with very long lines of HTML/XML.	2014-04-10 10:08:51 +02:00
Yorick Peterse	292a98d7f6	Basic benchmarks for the Parser class.	2014-04-10 10:05:04 +02:00
Yorick Peterse	8ca7781842	Updated the lexer benchmarks. These had to be updated for the API changes of Oga::XML::Lexer.	2014-04-10 10:01:11 +02:00
Yorick Peterse	8237d5791d	Stream tokens when lexing. Instead of returning the tokens as a whole they are now streamed using XML::Lexer#advance. This method returns the next token upon every call. It uses a small buffer in case a particular block of text results in multiple tokens.	2014-04-09 22:08:13 +02:00
Yorick Peterse	e9bb97d261	First steps towards making the lexer stream tokens	2014-04-09 19:32:06 +02:00
Yorick Peterse	10d0ec1573	Specs for parsing various empty nodes.	2014-04-07 21:33:23 +02:00
Yorick Peterse	cb74c7edf9	Specs for XML parser errors.	2014-04-07 21:31:36 +02:00
Yorick Peterse	915d3ee505	Expanded tests for XML::Document#inspect.	2014-04-07 20:11:12 +02:00
Yorick Peterse	e9412c9c4e	Tests for various inspect methods.	2014-04-07 09:58:31 +02:00
Yorick Peterse	54ef125637	Basic docs for everything under Oga::XML.	2014-04-04 17:48:36 +02:00
Yorick Peterse	13a9228563	Properly indent doctype/XML decl inspect values.	2014-04-04 11:13:39 +02:00
Yorick Peterse	37a12722cb	Rough setup for a custom #inspect format. This format is a lot more readable than the default Ruby #inspect format (mostly due to not including previous/next/parent nodes).	2014-04-04 00:41:29 +02:00
Yorick Peterse	a2c525dd7c	Insert newlines after XML dec/doctypes.	2014-04-03 23:04:21 +02:00
Yorick Peterse	230fafa2d3	Document should not inherit from Node. A document is not an XML node on itself. If logic has to be shared between the Document and the Node class I'll resort to using mixins for this.	2014-04-03 22:45:40 +02:00
Yorick Peterse	c077988dd6	Tree building of doctypes.	2014-04-03 22:44:00 +02:00
Yorick Peterse	81b1155af3	Lex/parse doctype names separately.	2014-04-03 21:59:57 +02:00
Yorick Peterse	8185656c1e	Fixed typ.	2014-04-03 21:41:31 +02:00
Yorick Peterse	6cf906e500	Lexer tests for single quoted attributes.	2014-04-03 18:50:07 +02:00
Yorick Peterse	30c01a5aee	Tests for XML::TreeBuilder#handler_missing.	2014-04-03 09:43:30 +02:00
Yorick Peterse	0f129ceac9	Tests for XML::TreeBuilder#on_comment.	2014-04-03 09:38:18 +02:00
Yorick Peterse	bdb76cefc5	Dedicated handling of XML declaration nodes.	2014-04-02 22:30:45 +02:00
Yorick Peterse	d6c0a1f3f3	Lex/parse XML declaration attributes.	2014-04-02 22:01:17 +02:00
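
Commit `70516b7447` above floats the idea of a String wrapper with a sliding window of unpacked bytes. A minimal sketch of that idea follows. The class name `ByteWindow`, its methods, and the window-alignment strategy are all hypothetical and not taken from Oga; the later commit `d9fa4b7c45` went with `String#byteslice` instead.

```ruby
# Hypothetical sketch of the sliding-window idea from commit 70516b7447.
# None of these names come from Oga itself.
class ByteWindow
  # input - the String to read from
  # size  - how many bytes to keep unpacked at once (1024 in the commit text)
  def initialize(input, size = 1024)
    @input  = input
    @size   = size
    @offset = 0
    @bytes  = unpack_window(0)
  end

  # Returns the byte (as an Integer) at the absolute position `index`,
  # or nil when the position lies outside the input.
  def [](index)
    return nil if index < 0 || index >= @input.bytesize

    # Move the window when the requested position falls outside of it.
    unless index >= @offset && index < @offset + @bytes.length
      @offset = index - (index % @size)
      @bytes  = unpack_window(@offset)
    end

    @bytes[index - @offset]
  end

  # Number of unpacked integers currently held in memory.
  def window_length
    @bytes.length
  end

  private

  # Unpacks at most @size bytes starting at `offset`.
  def unpack_window(offset)
    chunk = @input.byteslice(offset, @size)
    chunk ? chunk.unpack('C*') : []
  end
end

window = ByteWindow.new('<a>b</a>', 4)
window[0] # byte value of "<"
window[5] # byte value of "/", read after the window has moved forward
```

Only `size` integers are alive at any time, at the cost of re-unpacking whenever the lexer seeks outside the current window. Whether that trade-off pays off would need measuring, as the commit message itself notes.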

1102 Commits
