core/oga - oga

Commit Graph

Author	SHA1	Message	Date
Yorick Peterse	203aea6b1a	Cleaned up benchmarking code.	2014-05-01 13:08:44 +02:00
Yorick Peterse	ebf9099f0e	Dropped the benchmark_ prefixes. These files reside in a benchmark/ directory. Gee, I wonder what they do.	2014-05-01 13:03:21 +02:00
Yorick Peterse	20f2f256f6	Benchmark for measuring average lexing times.	2014-05-01 13:01:52 +02:00
Yorick Peterse	f607cf50dc	Use local variables for Ragel. Instead of using instance variables for ts, te, etc. we'll use local variables. Grand wizard overlord @whitequark suggested that this would be quite a bit faster, which turns out to be true. For example, the big XML lexer benchmark would, prior to this commit, complete in about 9 to 9.3 seconds. With this commit that hovers around 8.5 seconds.	2014-05-01 13:00:29 +02:00
Yorick Peterse	e26d5a8664	Removed unused variable in a lexer benchmark.	2014-05-01 12:25:49 +02:00
Yorick Peterse	2f36692abe	Fixed the big XML lexer benchmark.	2014-04-30 09:28:28 +02:00
Yorick Peterse	83f6d5437e	Contextual pull parsing. This adds the ability to more easily act upon specific node types and nestings when using the pull parsing API. A basic example of this API looks like the following (only including relevant code): parser.parse do |node| parser.on(:element, %w{people person}) do people << {:name => nil, :age => nil} end parser.on(:text, %w{people person name}) do people.last[:name] = node.text end parser.on(:text, %w{people person age}) do people.last[:age] = node.text.to_i end end This fixes #6.	2014-04-29 23:05:49 +02:00
Yorick Peterse	1a413998a3	Track the current node in the pull parser. The current node is tracked in the instance method `node`.	2014-04-29 21:21:05 +02:00
Yorick Peterse	d0b3653785	Updated the manifest, again.	2014-04-29 20:42:17 +02:00
Yorick Peterse	5339664f33	Include .yardopts in the Gem.	2014-04-29 20:42:09 +02:00
Yorick Peterse	8522a82cf9	Updated the manifest.	2014-04-29 20:41:11 +02:00
Yorick Peterse	503b254216	Generate files before generating the manifest.	2014-04-29 20:41:02 +02:00
Yorick Peterse	586c8f1d46	Generated an initial manifest.	2014-04-29 20:40:34 +02:00
Yorick Peterse	59dae873e4	Don't rely on Git for generating the MANIFEST. When using Git the resulting Gem will contain far too many useless files. For example, the profile/ and spec/ directories are not needed when building Gems.	2014-04-29 20:39:20 +02:00
Yorick Peterse	579c0499ed	Benchmark for the pull parser.	2014-04-29 14:48:43 +02:00
Yorick Peterse	5ed09236f9	Big XML benchmark for the lexer.	2014-04-29 14:48:36 +02:00
Yorick Peterse	a1e9e74b9c	Updated a benchmark description.	2014-04-29 14:24:33 +02:00
Yorick Peterse	a42240bc2e	Profiling setup for the pull parser.	2014-04-29 13:50:50 +02:00
Yorick Peterse	d5e59c38ac	Profiling setup for the DOM parser.	2014-04-29 13:47:55 +02:00
Yorick Peterse	2c4890533c	Set initial size for the lexer graph.	2014-04-29 13:47:36 +02:00
Yorick Peterse	a111e673cb	Changed the big XML file size to 10 MB. This makes various calculations a bit easier than when the file is 11 MB in size.	2014-04-29 13:42:02 +02:00
Yorick Peterse	53c45c621b	Basic memory profiling setup. This makes it a bit easier to profile memory usage of certain components and plot them using Gnuplot. In the past I would write one-off scripts for this and throw them away, only to figure out I needed them again later on. Profiling samples are written to profile/samples and can be plotted using corresponding Gnuplot scripts found in profile/plot. The latter requires Gnuplot to be installed.	2014-04-29 13:38:56 +02:00
Yorick Peterse	70fcc8534c	Benchmark for parsing big XML documents.	2014-04-29 13:05:45 +02:00
Yorick Peterse	45b0cdf811	Track element name nesting in the pull parser. Tracking the names of nested elements makes it a lot easier to do contextual pull parsing. Without this it's impossible to know what context the parser is in at a given moment. For memory reasons the parser currently only tracks the element names. In the future it might perhaps also track extra information to make parsing easier.	2014-04-28 23:40:36 +02:00
Yorick Peterse	030a0068bd	Basic pull parsing setup. This parser extends the regular DOM parser but instead delegates certain nodes to a block instead of building a DOM tree. The API is a bit raw in its current form but I'll extend it and make it a bit more user friendly in the following commits. In particular I want to make it easier to figure out if a certain node is nested inside another node.	2014-04-28 17:22:17 +02:00
Yorick Peterse	fd5bbbc9a2	Move element recursion handling into a method. This makes it easier to disable later on in the streaming parser.	2014-04-28 10:25:05 +02:00
Yorick Peterse	785ec26fe7	Create Element instances before recursing.	2014-04-28 10:21:34 +02:00
Yorick Peterse	9939cf49eb	Move parser callback code into dedicated methods.	2014-04-28 10:18:55 +02:00
Yorick Peterse	5d05aed6ec	Corrected docs for XML::Parser.	2014-04-26 12:57:35 +02:00
Yorick Peterse	f53fe4ed7c	Reset the lexer when resetting the parser. Also removed the unused @lines instance variable.	2014-04-25 00:15:24 +02:00
Yorick Peterse	83ff0e6656	Various small parser cleanups.	2014-04-25 00:07:53 +02:00
Yorick Peterse	ecf6851711	Revert "Move linking of child nodes to a dedicated mixin." This doesn't actually make things any easier. It also introduces a weirdly named mixin. This reverts commit `0968465f0c`.	2014-04-24 21:16:31 +02:00
Yorick Peterse	0968465f0c	Move linking of child nodes to a dedicated mixin.	2014-04-24 09:43:50 +02:00
Yorick Peterse	08d412da7e	First shot at removing the AST layer. The AST layer is being removed because it doesn't really serve a useful purpose. In particular when creating a streaming parser the AST nodes would only introduce extra overhead. As a result of this the parser will instead emit a DOM tree directly instead of first emitting an AST.	2014-04-21 23:05:39 +02:00
Yorick Peterse	9ee9ec14cb	Lexer: only pop elements when needed.	2014-04-19 01:10:32 +02:00
Yorick Peterse	c8c9da2922	Track the XML fixture in Git. To make running benchmarks easier we'll track the XML file in Git in its compressed form. I also decreased the size of the XML file from ~50 MB to ~10MB.	2014-04-19 01:03:14 +02:00
Yorick Peterse	97d8450cba	Removed the `regenerate` task.	2014-04-19 00:59:09 +02:00
Yorick Peterse	6f1ce17b31	Benchmark for lexer lines/second. This benchmark uses a fixture file that is automatically downloaded.	2014-04-17 20:06:24 +02:00
Yorick Peterse	54e6650338	Don't use define_method in the lexer. Profiling showed that calls to methods defined using `define_method` are really, really slow. Before this commit the lexer would process 3000-4000 lines per second. With this commit that has been increased to around 10 000 lines per second. Thanks to @headius for mentioning the (potential) overhead of define_method.	2014-04-17 19:08:26 +02:00
Yorick Peterse	d9fa4b7c45	Lex input as a sequence of bytes. Instead of lexing the input as a raw String or as a set of codepoints it's treated as a sequence of bytes. This removes the need for String#[] (replaced by String#byteslice) which in turn reduces the amount of memory needed and speeds up the lexing time. Thanks to @headius and @apeiros for suggesting this and rubber ducking along!	2014-04-17 17:45:05 +02:00
Yorick Peterse	70516b7447	Yield tokens in the lexer and parser. After some digging I found out that Racc has a method called `yyparse`. Using this method (and a custom callback method) you can `yield` tokens as a form of input. This makes it a lot easier to feed tokens as a stream from the lexer. Sadly the current performance of the lexer is still total garbage. Most of the memory usage also comes from using String#unpack, especially on large XML inputs (e.g. 100 MB of XML). It looks like the resulting memory usage is about 10x the input size. One option might be some kind of wrapper around String. This wrapper would have a sliding window of, say, 1024 bytes. When you create it the first 1024 bytes of the input would be unpacked. When seeking through the input this window would move forward. In theory this means that you'd end up with only 1024 Fixnum instances around at any given time instead of "a very big number". I have to test how efficient this is in practise.	2014-04-17 00:39:41 +02:00
Yorick Peterse	144c95cbb4	Replaced the HRS fixture with one from Gist. The HRS output is invalid, which Oga cannot handle at this time.	2014-04-10 21:31:01 +02:00
Yorick Peterse	25edd2de00	Use a Set for storing void element names.	2014-04-10 12:28:47 +02:00
Yorick Peterse	b96f7c4852	Lex attributes with namespaces. These are lexed as just the name instead of two separate tokens.	2014-04-10 11:01:49 +02:00
Yorick Peterse	c974b96b88	Truncate lines in parser errors. The offending lines of code displayed in the error message are truncated to 80 characters. This should make reading the error messages less of a pain when dealing with very long lines of HTML/XML.	2014-04-10 10:08:51 +02:00
Yorick Peterse	292a98d7f6	Basic benchmarks for the Parser class.	2014-04-10 10:05:04 +02:00
Yorick Peterse	8ca7781842	Updated the lexer benchmarks. These had to be updated for the API changes of Oga::XML::Lexer.	2014-04-10 10:01:11 +02:00
Yorick Peterse	8237d5791d	Stream tokens when lexing. Instead of returning the tokens as a whole they are now streamed using XML::Lexer#advance. This method returns the next token upon every call. It uses a small buffer in case a particular block of text results in multiple tokens.	2014-04-09 22:08:13 +02:00
Yorick Peterse	e9bb97d261	First steps towards making the lexer stream tokens	2014-04-09 19:32:06 +02:00
Yorick Peterse	10d0ec1573	Specs for parsing various empty nodes.	2014-04-07 21:33:23 +02:00
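
The contextual pull parsing idea from commits 45b0cdf811 and 83f6d5437e (track the names of the open elements, then run a block only for a given node type and nesting) can be sketched roughly as follows. This is a minimal illustration, not Oga's implementation: `SketchPullParser`, the pre-tokenized `events` array, and the Hash-based nodes are all hypothetical stand-ins so that the example runs without any gems.

```ruby
# Hypothetical stand-in for a pull parser; this is NOT Oga's implementation.
# Input is a pre-tokenized list of [type, value] pairs to keep the sketch
# self-contained.
class SketchPullParser
  attr_reader :node, :nesting

  def initialize(events)
    @events  = events
    @nesting = [] # names of the currently open elements, outermost first
    @node    = nil # the node currently being visited
  end

  # Yields every element and text node while tracking element nesting.
  def parse
    @events.each do |type, value|
      case type
      when :open
        @nesting << value
        @node = {:type => :element, :name => value}
        yield @node
      when :text
        @node = {:type => :text, :text => value}
        yield @node
      when :close
        @nesting.pop
      end
    end
  end

  # Runs the block only when the current node has the given type and the
  # stack of open element names equals the given nesting.
  def on(type, names)
    yield if @node[:type] == type && @nesting == names
  end
end

events = [
  [:open, 'people'],
  [:open, 'person'],
  [:open, 'name'], [:text, 'Alice'], [:close, 'name'],
  [:open, 'age'],  [:text, '25'],    [:close, 'age'],
  [:close, 'person'],
  [:close, 'people']
]

people = []
parser = SketchPullParser.new(events)

parser.parse do |node|
  parser.on(:element, %w{people person}) do
    people << {:name => nil, :age => nil}
  end

  parser.on(:text, %w{people person name}) do
    people.last[:name] = node[:text]
  end

  parser.on(:text, %w{people person age}) do
    people.last[:age] = node[:text].to_i
  end
end
```

Only element names are kept on the stack here, which matches the memory-related reasoning given in commit 45b0cdf811; anything beyond that is an assumption of the sketch.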
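
Commit d9fa4b7c45 replaces String#[] with String#byteslice so that the lexer can work with byte offsets. The difference between the two can be illustrated with a short sketch; the sample input and the offsets below are made up for illustration and are not taken from Oga's lexer.

```ruby
# "\u00e9" (é) is one character but two bytes in UTF-8.
input = "<p>h\u00e9llo</p>"

char_count = input.length   # counts characters
byte_count = input.bytesize # counts bytes

# String#[] interprets the offset and length as characters.
by_chars = input[3, 5]

# String#byteslice interprets the offset and length as bytes, which is what
# a lexer tracking byte positions needs.
by_bytes = input.byteslice(3, 6)

# Passing a byte length to String#[] takes one character too many.
too_long = input[3, 6]
```

With byte offsets at hand, `by_bytes` yields the same text as `by_chars`, while `too_long` also picks up the `<` of the closing tag.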

518 Commits, All Branches