In order to create an interpreter for the subset of SQL we have introduced so far, we need to create a representation for tables, a parser for statements written as text, and an evaluator for parsed statements. The sql interpreter example includes all of these components, providing a simple but functional demonstration of a declarative language interpreter.
In this implementation, each table has its own a class, and each row in a table is represented by an instance of its table's class. A row has one attribute per column in the table, and a table is a sequence of rows.
The class for a table is created using the namedtuple function in the collections package of the Python standard library, which returns a new sub-class of tuple that gives names to each element in the tuple.
Consider the cities table from the previous section, repeated below.
sqlite> create table cities as
...> select 38 as latitude, 122 as longitude, "Berkeley" as name union
...> select 42, 71, "Cambridge" union
...> select 45, 93, "Minneapolis";
The following Python statements construct a representation for this table.
>>> from collections import namedtuple
>>> CitiesRow = namedtuple("Row", ["latitude", "longitude", "name"])
>>> cities = [CitiesRow(38, 122, "Berkeley"),
CitiesRow(42, 71, "Cambridge"),
CitiesRow(43, 93, "Minneapolis")]
The result of a select statement can be interpreted using sequence operations. Consider the distances table from the previous section, repeated below.
sqlite> create table distances as
...> select name, 60*abs(latitude-38) as distance from cities;
sqlite> select distance/5, name from distances;
0|Berkeley
48|Cambridge
84|Minneapolis
This table is generated from the name and latitude columns of the cities table. This resulting table can be generated by mapping a function over the rows of the input table, a function that returns a DistancesRow for each CitiesRow.
>>> DistancesRow = namedtuple("Row", ["name", "distance"])
>>> def select(cities_row):
latitude, longitude, name = cities_row
return DistancesRow(name, 60*abs(latitude-38))
>>> distances = list(map(select, cities))
>>> for row in distances:
print(row)
Row(name='Berkeley', distance=0)
Row(name='Cambridge', distance=240)
Row(name='Minneapolis', distance=300)
The design of our SQL interpreter generalizes this approach. A select statement is represented as an instance of a class Select that is constructed from the clauses of the select statement.
>>> class Select:
"""select [columns] from [tables] where [condition] order by [order]."""
def __init__(self, columns, tables, condition, order):
self.columns = columns
self.tables = tables
self.condition = condition
self.order = order
self.make_row = create_make_row(self.columns)
def execute(self, env):
"""Join, filter, sort, and map rows from tables to columns."""
from_rows = join(self.tables, env)
filtered_rows = filter(self.filter, from_rows)
ordered_rows = self.sort(filtered_rows)
return map(self.make_row, ordered_rows)
def filter(self, row):
if self.condition:
return eval(self.condition, row)
else:
return True
def sort(self, rows):
if self.order:
return sorted(rows, key=lambda r: eval(self.order, r))
else:
return rows
The execute method joins input tables, filters and orders the resulting rows, then maps a function called make_row over those resulting rows. The make_row function is created in the Select constructor by a call to create_make_row, a higher-order function that creates a new class for the resulting table and defines how to project an input row to an output row. (A version of this function with more error handling and special cases appears in sql.)
>>> def create_make_row(description):
"""Return a function from an input environment (dict) to an output row.
description -- a comma-separated list of [expression] as [column name]
"""
columns = description.split(", ")
expressions, names = [], []
for column in columns:
if " as " in column:
expression, name = column.split(" as ")
else:
expression, name = column, column
expressions.append(expression)
names.append(name)
row = namedtuple("Row", names)
return lambda env: row(*[eval(e, env) for e in expressions])
Finally, we need to define the join function that creates the input rows. Given an env dictionary contains existing tables (lists of rows) keyed by their name, the join function groups together all combinations of rows in the input tables using the product function in the itertools package. It maps a function called make_env over the joined rows, a function that converts each combination of rows into a dictionary so that it can be used to evaluate expressions. (A version of this function with more error handling and special cases appears in sql.)
>>> from itertools import product
>>> def join(tables, env):
"""Return an iterator over dictionaries from names to values in a row.
tables -- a comma-separate sequences of table names
env -- a dictionary from global names to tables
"""
names = tables.split(", ")
joined_rows = product(*[env[name] for name in names])
return map(lambda rows: make_env(rows, names), joined_rows)
>>> def make_env(rows, names):
"""Create an environment of names bound to values."""
env = dict(zip(names, rows))
for row in rows:
for name in row._fields:
env[name] = getattr(row, name)
return env
Above, row._fields evaluates to the column names of the table containing the row. The _fields attribute exists because the type of row is a namedtuple class.
Our interpreter is complete enough to execute select statements. For instance, we can compute the latitude distance from Berkeley for all other cities, ordered by their longitude.
>>> env = {"cities": cities}
>>> select = Select("name, 60*abs(latitude-38) as distance",
"cities", "name != 'Berkeley'", "-longitude")
>>> for row in select.execute(env):
print(row)
Row(name='Minneapolis', distance=300)
Row(name='Cambridge', distance=240)
The example above is equivalent to the following SQL statement.
sqlite> select name, 60*abs(latitude-38) as distance
...> from cities where name != "Berkeley" order by -longitude;
Minneapolis|420
Cambridge|240
We can also store this resulting table in the environment and join it with the cities table, retrieving the longitude for each city.
>>> env["distances"] = list(select.execute(env))
>>> joined = Select("cities.name as name, distance, longitude", "cities, distances",
"cities.name == distances.name", None)
>>> for row in joined.execute(env):
print(row)
Row(name='Cambridge', distance=240, longitude=71)
Row(name='Minneapolis', distance=300, longitude=93)
The example above is equivalent to the following SQL statement.
sqlite> select cities.name as name, distance, longitude
...> from cities, distances where cities.name = distances.name;
Cambridge|240|71
Minneapolis|420|93
The full sql example program also contains a simple parser for select statements, as well as execute methods for create table and union. The interpreter can correctly execute all SQL statements included within the text so far. While this simple interpreter only implements a small amount of the full Structured Query Language, its structure demonstrates the relationship between sequence processing operations and query languages.
Query Plans. Declarative languages describe the form of a result, but do not explicitly describe how that result should be computed. This interpreter always joins, filters, orders, and then projects the input rows in order to compute the result rows. However, more efficient ways to compute the same result may exist, and query interpreters are free to choose among them. Choosing efficient procedures for computing query results is a core feature of database systems.
For example, consider the final select statement above. Rather than computing the join of cities and distances and then filtering the result, the same result may be computed by first sorting both tables by the name column and then joining only the rows that have the same name in a linear pass through the sorted tables. When tables are large, efficiency gains from query plan selection can be substantial.