Yield tokens in the lexer and parser.

After some digging I found out that Racc has a method called `yyparse`. Using
this method (and a custom callback method) you can `yield` tokens as a form of
input. This makes it a lot easier to feed tokens as a stream from the lexer.
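
To make the callback shape concrete, here is a minimal sketch in plain Ruby. It does not use Racc: `ToyLexer`, `ToyParser` and `drive` are hypothetical names, and `drive` only imitates the way `yyparse(receiver, method_id)` calls the named method with a block. The `[false, false]` end marker is taken from the `yield_next_token` method in this commit.

```ruby
# Hypothetical toy lexer: yields [type, value] pairs instead of returning
# an Array, so tokens can be consumed as they are produced.
class ToyLexer
  def initialize(input)
    @input = input
  end

  def advance
    @input.split.each do |word|
      type = word.match?(/\A\d+\z/) ? :T_NUMBER : :T_WORD

      yield [type, word]
    end
  end
end

# Hypothetical parser stand-in. A real Racc parser would call
# yyparse(self, :yield_next_token) instead of drive.
class ToyParser
  def initialize(lexer)
    @lexer = lexer
    @seen  = []
  end

  # Same shape as the callback in this commit: forward every token from the
  # lexer, then yield [false, false] to signal the end of the input.
  def yield_next_token
    @lexer.advance { |token| yield token }

    yield [false, false]
  end

  # Assumption: imitates yyparse by calling the callback with a block and
  # stopping at the end marker. It only collects tokens, it parses nothing.
  def drive(receiver, method_id)
    receiver.__send__(method_id) do |type, value|
      break if type == false

      @seen << [type, value]
    end

    @seen
  end

  def parse
    drive(self, :yield_next_token)
  end
end

tokens = ToyParser.new(ToyLexer.new('foo 10 bar')).parse
```

The point of the design is that neither side needs a full token Array: the lexer hands over one token at a time and the consumer decides when to stop.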

Sadly the current performance of the lexer is still total garbage. Most of the
memory usage also comes from using String#unpack, especially on large XML
inputs (e.g. 100 MB of XML). It looks like the resulting memory usage is about
10x the input size.

One option might be some kind of wrapper around String. This wrapper would have
a sliding window of, say, 1024 bytes. When you create it the first 1024 bytes
of the input would be unpacked. When seeking through the input this window
would move forward.

In theory this means that you'd end up with only 1024 Fixnum instances around
at any given time instead of "a very big number". I have to test how efficient
this is in practice.
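
The wrapper idea could look roughly like the sketch below. Nothing here is measured or part of this commit: `WindowedBytes` is a hypothetical name, the `'C*'` directive is an assumption (the lexer may unpack differently), and the window jumps in aligned blocks of 1024 bytes instead of sliding one byte at a time.

```ruby
# Hypothetical wrapper that only keeps one window of unpacked bytes around.
class WindowedBytes
  def initialize(input, window_size = 1024)
    @input       = input
    @window_size = window_size
    @offset      = 0
    @bytes       = unpack_window(0)
  end

  # Total number of bytes in the wrapped String.
  def length
    @input.bytesize
  end

  # Returns the byte at the given index, or nil when out of range. The
  # window is replaced whenever the index falls outside of it.
  def [](index)
    return nil if index < 0 || index >= length

    unless index >= @offset && index < @offset + @bytes.length
      @offset = index - (index % @window_size)
      @bytes  = unpack_window(@offset)
    end

    @bytes[index - @offset]
  end

  private

  # Unpacks at most @window_size bytes starting at the given byte offset.
  def unpack_window(start)
    @input.byteslice(start, @window_size).unpack('C*')
  end
end

input  = 'a' * 1024 + 'b' * 10
window = WindowedBytes.new(input)
first  = window[0]    # served from the initial window
later  = window[1030] # forces the window to move forward
```

Seeking backwards also works in this sketch, but every jump outside the current window unpacks a new chunk, so a lexer that backtracks a lot would pay for it.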
Yorick Peterse 2014-04-17 00:39:41 +02:00
parent 144c95cbb4
commit 70516b7447
2 changed files with 40 additions and 23 deletions

View File

@@ -7,13 +7,18 @@ module Oga
 # To lex HTML input set the `:html` option to `true` when creating an
 # instance of the lexer:
 #
-# lexer = Oga::Lexer.new(:html => true)
+# lexer = Oga::XML::Lexer.new(:html => true)
 #
 # @!attribute [r] html
 # @return [TrueClass|FalseClass]
 #
+# @!attribute [r] tokens
+# @return [Array]
+#
 class Lexer
-  %% write data; # %
+  %% write data;
+  # % fix highlight
   attr_reader :html
@@ -80,7 +85,6 @@ module Oga
 @line = 1
 @ts = nil
 @te = nil
-@tokens = []
 @stack = []
 @top = 0
 @cs = self.class.lexer_start
@@ -94,12 +98,7 @@ module Oga
 end
 ##
-# Lexes the supplied String and returns an Array of tokens. Each token is
-# an Array in the following format:
-#
-# [TYPE, VALUE]
-#
-# The type is a symbol, the value is either nil or a String.
+# Gathers all the tokens for the input and returns them as an Array.
 #
 # This method resets the internal state of the lexer after consuming the
 # input.
@@ -111,7 +110,7 @@ module Oga
 def lex
   tokens = []
-  while token = advance
+  advance do |token|
     tokens << token
   end
@@ -121,17 +120,32 @@ module Oga
 end
 ##
-# Advances through the input and generates the corresponding tokens.
+# Advances through the input and generates the corresponding tokens. Each
+# token is yielded to the supplied block.
+#
+# Each token is an Array in the following format:
+#
+# [TYPE, VALUE]
+#
+# The type is a symbol, the value is either nil or a String.
+#
+# This method stores the supplied block in `@block` and resets it after
+# the lexer loop has finished.
 #
 # This method does *not* reset the internal state of the lexer.
 #
+#
 # @param [String] data The String to consume.
 # @return [Array]
 #
-def advance
-  %% write exec; # % fix highlight
-  return @tokens.shift
+def advance(&block)
+  @block = block
+  %% write exec;
+  # % fix highlight
+ensure
+  @block = nil
 end
 ##
@@ -189,7 +203,8 @@ module Oga
 def add_token(type, value = nil)
   token = [type, value, @line]
-  @tokens << token
+  @block.call(token)
+  #@tokens << token
 end
 ##
@@ -463,7 +478,7 @@ module Oga
   add_token(:T_ELEM_NS, ns)
 end
-@elements << name
+@elements << name if html
 add_token(:T_ELEM_NAME, name)

View File

@@ -168,16 +168,18 @@ end
 end
 ##
-# Returns the next token from the lexer.
+# Yields the next token from the lexer.
 #
-# @return [Array]
+# @yieldparam [Array]
 #
-def next_token
-  type, value, line = @lexer.advance
-  @line = line if line
-  return type ? [type, value] : [false, false]
+def yield_next_token
+  @lexer.advance do |(type, value, line)|
+    @line = line if line
+    yield [type, value]
+  end
+  yield [false, false]
 end
 ##
@@ -231,7 +233,7 @@ Unexpected #{name} with value #{value.inspect} on line #{@line}:
 # @return [Oga::AST::Node]
 #
 def parse
-  ast = do_parse
+  ast = yyparse(self, :yield_next_token)
   reset