core/oga - oga

Commit Graph

Author	SHA1	Message	Date
Yorick Peterse	2689d3f65a	Initial setup using a C extension. While I've tried to keep Oga pure Ruby for as long as possible the performance of Ragel's Ruby output was not worth the trouble. For example, lexing 10MB of XML would take 5 to 6 seconds at least. Nokogiri on the other hand can parse that same XML into a DOM document in about 300 milliseconds. Such a big performance difference is not acceptable. To work around this the XML/HTML lexer will be implemented in C for MRI/Rubinius and Java for JRuby. For now there's only a C extension as I haven't read up yet on the JRuby API. The end goal is to provide some sort of Ragel "template" that can be used to generate the corresponding C/Java extension code. This would remove the need to duplicate the grammar and associated code. The native extension setup is a hybrid between native and Ruby. The raw Ragel stuff happens in C/Java while the actual logic of actions happens in Ruby. This adds a small amount of overhead but makes it much easier to maintain the lexer. Even with this extra overhead the performance is much better than pure Ruby. The 10MB of XML mentioned above is lexed in about 600 milliseconds. In other words, it's 10 times faster.	2014-05-05 00:31:28 +02:00
Yorick Peterse	baaa24a760	Indentation fix in the lexer.	2014-05-04 18:06:43 +02:00
Yorick Peterse	f18e8893de	Removed the buffering crap from the lexer.	2014-05-04 17:39:08 +02:00
Yorick Peterse	57255012b7	Patch the Ragel lexer after generating it. This further increases throughput of the lexer. On MRI this seems to save around one second or so. It now sits at ~6,8 seconds in the big XML benchmark. On JRuby, combined with some JIT options and invoke dynamic enabled, this can reduce the average lexing time to around 3,5 seconds. Rubinius, also with a few aggressive JIT options, seems to stick around 9 seconds.	2014-05-02 00:40:10 +02:00
Yorick Peterse	9dfdefee47	Removed XML::Lexer#buffering? Instead of wrapping a predicate method around the ivar we'll just access it directly. This reduces average lexing times in the big XML benchmark from 7,5 to ~7 seconds.	2014-05-01 22:59:56 +02:00
Yorick Peterse	b854f737cd	Run memory profiling for 60 seconds.	2014-05-01 21:47:51 +02:00
Yorick Peterse	676a5333c0	Use a default gnuplot script.	2014-05-01 21:27:08 +02:00
Yorick Peterse	3344f373bd	Plot time offsets on X axes when profiling.	2014-05-01 21:26:05 +02:00
Yorick Peterse	f4a71d7f63	Use wx as a gnuplot terminal. This allows users to zoom in and such, which doesn't work on the qt terminal for some reason.	2014-05-01 21:01:25 +02:00
Yorick Peterse	e33bb6f901	Remove sample files when running rake clean.	2014-05-01 20:57:17 +02:00
Yorick Peterse	1c35317165	Revamped the profiling setup. This removes the need for dozens of standalone gnuplot scripts, adds extra profiling data and makes the actual profiling easier.	2014-05-01 20:54:25 +02:00
Yorick Peterse	e54d77fc2f	Cleaned up the average timing benchmark.	2014-05-01 13:43:33 +02:00
Yorick Peterse	203aea6b1a	Cleaned up benchmarking code.	2014-05-01 13:08:44 +02:00
Yorick Peterse	ebf9099f0e	Dropped the benchmark_ prefixes. These files reside in a benchmark/ directory. Gee, I wonder what they do.	2014-05-01 13:03:21 +02:00
Yorick Peterse	20f2f256f6	Benchmark for measuring average lexing times.	2014-05-01 13:01:52 +02:00
Yorick Peterse	f607cf50dc	Use local variables for Ragel. Instead of using instance variables for ts, te, etc we'll use local variables. Grand wizard overlord @whitequark suggested that this would be quite a bit faster, which turns out to be true. For example, the big XML lexer benchmark would, prior to this commit, complete in about 9 - 9,3 seconds. With this commit that hovers around 8,5 seconds.	2014-05-01 13:00:29 +02:00
Yorick Peterse	e26d5a8664	Removed unused variable in a lexer benchmark.	2014-05-01 12:25:49 +02:00
Yorick Peterse	2f36692abe	Fixed the big XML lexer benchmark.	2014-04-30 09:28:28 +02:00
Yorick Peterse	83f6d5437e	Contextual pull parsing. This adds the ability to more easily act upon specific node types and nestings when using the pull parsing API. A basic example of this API looks like the following (only including relevant code): parser.parse do |node| parser.on(:element, %w{people person}) do people << {:name => nil, :age => nil} end parser.on(:text, %w{people person name}) do people.last[:name] = node.text end parser.on(:text, %w{people person age}) do people.last[:age] = node.text.to_i end end This fixes #6.	2014-04-29 23:05:49 +02:00
Yorick Peterse	1a413998a3	Track the current node in the pull parser. The current node is tracked in the instance method `node`.	2014-04-29 21:21:05 +02:00
Yorick Peterse	d0b3653785	Updated the manifest, again.	2014-04-29 20:42:17 +02:00
Yorick Peterse	5339664f33	Include .yardopts in the Gem.	2014-04-29 20:42:09 +02:00
Yorick Peterse	8522a82cf9	Updated the manifest.	2014-04-29 20:41:11 +02:00
Yorick Peterse	503b254216	Generate files before generating the manifest.	2014-04-29 20:41:02 +02:00
Yorick Peterse	586c8f1d46	Generated an initial manifest.	2014-04-29 20:40:34 +02:00
Yorick Peterse	59dae873e4	Don't rely on Git for generating the MANIFEST. When using Git the resulting Gem will contain far too many useless files. For example, the profile/ and spec/ directories are not needed when building Gems.	2014-04-29 20:39:20 +02:00
Yorick Peterse	579c0499ed	Benchmark for the pull parser.	2014-04-29 14:48:43 +02:00
Yorick Peterse	5ed09236f9	Big XML benchmark for the lexer.	2014-04-29 14:48:36 +02:00
Yorick Peterse	a1e9e74b9c	Updated a benchmark description.	2014-04-29 14:24:33 +02:00
Yorick Peterse	a42240bc2e	Profiling setup for the pull parser.	2014-04-29 13:50:50 +02:00
Yorick Peterse	d5e59c38ac	Profiling setup for the DOM parser.	2014-04-29 13:47:55 +02:00
Yorick Peterse	2c4890533c	Set initial size for the lexer graph.	2014-04-29 13:47:36 +02:00
Yorick Peterse	a111e673cb	Changed the big XML file size to 10 MB. This makes various calculations a bit easier than when the file is 11MB in size.	2014-04-29 13:42:02 +02:00
Yorick Peterse	53c45c621b	Basic memory profiling setup. This makes it a bit easier to profile memory usage of certain components and plot them using Gnuplot. In the past I would write one-off scripts for this and throw them away, only to figure out I needed them again later on. Profiling samples are written to profile/samples and can be plotted using corresponding Gnuplot scripts found in profile/plot. The latter requires Gnuplot to be installed.	2014-04-29 13:38:56 +02:00
Yorick Peterse	70fcc8534c	Benchmark for parsing big XML documents.	2014-04-29 13:05:45 +02:00
Yorick Peterse	45b0cdf811	Track element name nesting in the pull parser. Tracking the names of nested elements makes it a lot easier to do contextual pull parsing. Without this it's impossible to know what context the parser is in at a given moment. For memory reasons the parser currently only tracks the element names. In the future it might perhaps also track extra information to make parsing easier.	2014-04-28 23:40:36 +02:00
Yorick Peterse	030a0068bd	Basic pull parsing setup. This parser extends the regular DOM parser but delegates certain nodes to a block instead of building a DOM tree. The API is a bit raw in its current form but I'll extend it and make it a bit more user friendly in the following commits. In particular I want to make it easier to figure out if a certain node is nested inside another node.	2014-04-28 17:22:17 +02:00
Yorick Peterse	fd5bbbc9a2	Move element recursion handling into a method. This makes it easier to disable later on in the streaming parser.	2014-04-28 10:25:05 +02:00
Yorick Peterse	785ec26fe7	Create Element instances before recursing.	2014-04-28 10:21:34 +02:00
Yorick Peterse	9939cf49eb	Move parser callback code into dedicated methods.	2014-04-28 10:18:55 +02:00
Yorick Peterse	5d05aed6ec	Corrected docs for XML::Parser.	2014-04-26 12:57:35 +02:00
Yorick Peterse	f53fe4ed7c	Reset the lexer when resetting the parser. Also removed the unused @lines instance variable.	2014-04-25 00:15:24 +02:00
Yorick Peterse	83ff0e6656	Various small parser cleanups.	2014-04-25 00:07:53 +02:00
Yorick Peterse	ecf6851711	Revert "Move linking of child nodes to a dedicated mixin." This doesn't actually make things any easier. It also introduces a weirdly named mixin. This reverts commit `0968465f0c`.	2014-04-24 21:16:31 +02:00
Yorick Peterse	0968465f0c	Move linking of child nodes to a dedicated mixin.	2014-04-24 09:43:50 +02:00
Yorick Peterse	08d412da7e	First shot at removing the AST layer. The AST layer is being removed because it doesn't really serve a useful purpose. In particular when creating a streaming parser the AST nodes would only introduce extra overhead. As a result of this the parser will emit a DOM tree directly instead of first emitting an AST.	2014-04-21 23:05:39 +02:00
Yorick Peterse	9ee9ec14cb	Lexer: only pop elements when needed.	2014-04-19 01:10:32 +02:00
Yorick Peterse	c8c9da2922	Track the XML fixture in Git. To make running benchmarks easier we'll track the XML file in Git in its compressed form. I also decreased the size of the XML file from ~50 MB to ~10MB.	2014-04-19 01:03:14 +02:00
Yorick Peterse	97d8450cba	Removed the `regenerate` task.	2014-04-19 00:59:09 +02:00
Yorick Peterse	6f1ce17b31	Benchmark for lexer lines/second. This benchmark uses a fixture file that is automatically downloaded.	2014-04-17 20:06:24 +02:00
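
The contextual pull parsing described in commits 83f6d5437e and 45b0cdf811 can be illustrated with a small sketch. This is not Oga's code: `ToyPullParser`, `Node`, and the pre-lexed `events` array are hypothetical stand-ins, and the only parts taken from the commit message are the `parse`/`on` calls and the `node` method. The sketch assumes the parser keeps a stack of open element names and that `on` runs its block only when both the node type and that stack match.

```ruby
# Hypothetical stand-in for a node yielded by the pull parser.
Node = Struct.new(:type, :name, :text)

# Toy pull parser working on pre-lexed events instead of real XML.
# Assumption: nesting is tracked as a stack of element names only.
class ToyPullParser
  attr_reader :node, :nesting

  # events: [:open, name], [:text, value] or [:close, name] pairs.
  def initialize(events)
    @events  = events
    @nesting = []
    @node    = nil
  end

  # Yields every element and text node while updating the nesting stack.
  def parse
    @events.each do |kind, value|
      case kind
      when :open
        @nesting << value
        @node = Node.new(:element, value, nil)
        yield @node
      when :text
        @node = Node.new(:text, nil, value)
        yield @node
      when :close
        @nesting.pop
      end
    end
  end

  # Runs the block only if the current node has the given type and the
  # stack of open element names equals the given names.
  def on(type, names)
    yield if @node.type == type && @nesting == names
  end
end

events = [
  [:open, 'people'],
  [:open, 'person'],
  [:open, 'name'], [:text, 'Alice'], [:close, 'name'],
  [:open, 'age'], [:text, '25'], [:close, 'age'],
  [:close, 'person'],
  [:close, 'people']
]

people = []
parser = ToyPullParser.new(events)

parser.parse do |node|
  parser.on(:element, %w{people person}) do
    people << {:name => nil, :age => nil}
  end

  parser.on(:text, %w{people person name}) do
    people.last[:name] = node.text
  end

  parser.on(:text, %w{people person age}) do
    people.last[:age] = node.text.to_i
  end
end
```

Tracking only the element names keeps the per-node bookkeeping small, which matches the memory concern mentioned in 45b0cdf811.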

830 Commits