Parsing is the process of generating expression trees from raw text input. A parser is a composition of two components: a lexical analyzer and a syntactic analyzer. First, the lexical analyzer partitions the input string into tokens, which are the minimal syntactic units of the language, such as names and symbols. Second, the syntactic analyzer constructs an expression tree from this sequence of tokens.

Lexical analysis. The component that interprets a string as a token sequence is called a tokenizer or lexical analyzer. In our implementation, the tokenizer is a function called tokenize_line in scheme_tokens. Scheme tokens are delimited by white space, parentheses, dots, or single quotation marks. Delimiters are tokens, as are symbols and numerals. The tokenizer analyzes a line character by character, validating the format of symbols and numerals.

Tokenizing a well-formed Calculator expression separates all symbols and delimiters, but keeps multi-character numbers (e.g., 2.3) intact and converts them into numeric types.

    >>> tokenize_line('(+ 1 (* 2.3 45))')
    ['(', '+', 1, '(', '*', 2.3, 45, ')', ')']

Lexical analysis is an iterative process, and it can be applied to each line of an input program in isolation.
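
As a sketch of how such a tokenizer might work, the version below handles just the tokens that Calculator needs. Unlike the character-by-character tokenize_line in scheme_tokens, it pads parentheses with spaces and splits on white space; the helper coerce_token is hypothetical.

    def tokenize_line(line):
        """Split LINE into Calculator tokens: parentheses, symbols, numbers."""
        spaced = line.replace('(', ' ( ').replace(')', ' ) ')
        return [coerce_token(text) for text in spaced.split()]

    def coerce_token(text):
        """Convert numeric token text to int or float; leave the rest as strings."""
        try:
            return int(text)
        except ValueError:
            try:
                return float(text)
            except ValueError:
                return text  # a symbol or delimiter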

Syntactic analysis. The component that interprets a token sequence as an expression tree is called a syntactic analyzer. Syntactic analysis is a tree-recursive process, and it must consider an entire expression that may span multiple lines.

Syntactic analysis is implemented by the scheme_read function in scheme_reader. It is tree-recursive because analyzing a sequence of tokens often involves analyzing a subsequence of those tokens into a subexpression, which itself serves as a branch (e.g., operand) of a larger expression tree. Recursion generates the hierarchical structures consumed by the evaluator.

The scheme_read function expects its input src to be a Buffer instance that gives access to a sequence of tokens. A Buffer, defined in the buffer module, collects tokens that span multiple lines into a single object that can be analyzed syntactically.
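
A minimal sketch of such a buffer appears below, assuming just the two operations used in what follows: current, which peeks at the next token, and pop_first, which removes and returns it. The actual buffer module is more elaborate.

    class Buffer:
        """Sequential access to tokens that may span multiple lines."""
        def __init__(self, lines):
            self.lines = iter(lines)  # an iterator over lists of tokens
            self.current_line = []
            self.index = 0

        def current(self):
            """Return the next token without removing it, or None at end of input."""
            while self.index >= len(self.current_line):
                self.index, self.current_line = 0, next(self.lines, None)
                if self.current_line is None:
                    self.current_line = []
                    return None
            return self.current_line[self.index]

        def pop_first(self):
            """Remove and return the next token, or None at end of input."""
            token = self.current()
            self.index += 1
            return token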

    >>> lines = ['(+ 1', '   (* 2.3 45))']
    >>> expression = scheme_read(Buffer(tokenize_lines(lines)))
    >>> expression
    Pair('+', Pair(1, Pair(Pair('*', Pair(2.3, Pair(45, nil))), nil)))
    >>> print(expression)
    (+ 1 (* 2.3 45))

The scheme_read function first checks for various base cases, including empty input (which raises an end-of-file exception, called EOFError in Python) and primitive expressions. A recursive call to read_tail is invoked whenever a ( token indicates the beginning of a list.

The read_tail function continues to read from the same input src, but expects to be called after a list has begun. Its base cases are an empty input (an error) or a closing parenthesis that terminates the list. Its recursive call reads the first element of the list with scheme_read, reads the rest of the list with read_tail, and then returns a list represented as a Pair.
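
The description above translates into a sketch like the following, assuming the Pair class and nil value shown earlier, along with a set DELIMITERS naming the delimiter tokens.

    DELIMITERS = {'(', ')', "'", '.'}

    def scheme_read(src):
        """Read the next expression from SRC, a Buffer of tokens."""
        if src.current() is None:
            raise EOFError
        val = src.pop_first()
        if val == '(':               # a ( begins a list
            return read_tail(src)
        elif val not in DELIMITERS:  # a primitive expression
            return val
        else:
            raise SyntaxError('unexpected token: {0}'.format(val))

    def read_tail(src):
        """Read the rest of a list from SRC, up to the matching )."""
        if src.current() is None:
            raise SyntaxError('unexpected end of file')
        elif src.current() == ')':   # a ) terminates the list
            src.pop_first()
            return nil
        else:
            first = scheme_read(src)
            rest = read_tail(src)
            return Pair(first, rest)

The mutual recursion between scheme_read and read_tail mirrors the nesting of lists: each ( suspends the current list to read a subexpression, and each ) closes one.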

This implementation of scheme_read can read well-formed Scheme lists, which are all we need for the Calculator language. Parsing dotted lists and quoted forms is left as an exercise.

Informative syntax errors improve the usability of an interpreter substantially. The SyntaxError exceptions that are raised include a description of the problem encountered.
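
For example, the sketch above reports a stray closing parenthesis by naming the offending token:

    >>> scheme_read(Buffer(tokenize_lines([')'])))
    Traceback (most recent call last):
      ...
    SyntaxError: unexpected token: )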