Chapter 6 - Query Optimization

Two queries are semantically equivalent if and only if they produce identical results for every dataset. In such cases, it does not matter which query we write if all we care about is getting the right answers. On the other hand, it is possible that one query is computationally better than another in that our evaluation algorithm gets those answers more quickly.

In this chapter, we look at a variety of techniques for optimizing queries, i.e. converting expensive queries into semantically equivalent queries that are computationally less complex. We begin with a discussion of subgoal ordering within query rules. We then look at eliminating useless subgoals. And we finish with a discussion of eliminating rules from queries with multiple rules.

6.2 Subgoal Ordering

One very common source of inefficiency in evaluation stems from non-optimal ordering of subgoals within queries. The good news is that it is often possible to find better orderings just by looking at the form of the queries involved even without looking at the data to which the queries are applied.

As an example of inefficiency due to poor subgoal ordering, consider the query shown below. Our goal is true of X and Y if p is true of X and r is true of X and Y and q is true of X.

Intuitively, this seems like a bad way to write this query for our usual evaluation procedure. It seems as though the q condition should come before the r condition, as in the following query.

In fact, there is good reason for this intuition. For our standard evaluation procedure, the worst-case cost of evaluating the first query is n⁴, where n is the size of the domain of objects. By contrast, the worst-case cost of evaluating the second query is just n³.

Let's look at these two cases in more detail. In the worst case, there are n² + 2n facts in the database, where n is the number of objects in the domain.

In evaluating the first query, our algorithm would first examine all n^2 + 2n facts to find those that match p(X). There would be at most n answers. For each of these, the algorithm would again look at n^2 + 2n facts to find those that match r(X,Y). There would be n for each value of X. Finally, for each of these pairs, the algorithm would again examine at n^2 + 2n facts to find those that match q(X). The grand total is shown below.

In evaluating the second query, the algorithm would again examine all n^2 + 2n facts to find those that match p(X). There would be at most n answers. For each of these, the algorithm would again look at n^2 + 2n facts to find those that match q(X). There would be at least one for each value of X. Finally, for each such X, the algorithm would examine n^2 + 2n facts to find those that match r(X,Y). The grand total is shown below.

Suppose, for example, there were 4 objects in the domain. In this case, there would be at most 4 p facts and 4 q facts and 16 r facts. Evaluating the first query would require 504 matches, while evaluating the second query would require only 208.

In the presence of indexing, the asymptotic complexity is the same for both orderings. However, the lower degree terms for the second ordering suggest that it is still the better ordering. Moreover, it is possible to show that, averaging over all possible databases, the second query is better than the first.

Fortunately, there is a simple method for reordering subgoals in situations like this. The basic idea is to assemble a new body for the query incrementally, picking a subgoal on each step and removing it from the list of remaining subgoals to be considered. In making its choice, the method examines the remaining subgoals in left-to-right order. If it encounters a subgoal all of whose variables are bound by subgoals already chosen, then that subgoal is added to the new query and removed from the list of remaining subgoals. If not, the method removes the first remaining subgoal from the list, adds it to the new query, updates its list of bound variables, and moves on to the next step.

As an example of this method in action, consider the first query shown above. At the start, the list of remaining subgoals consists of all three subgoals in the query. At this point, none of the three subgoals is ground, so the method chooses the first subgoal p(X), adds it to its new query, and puts X on the list of bound variables. On the second step, the method looks at the remaining two subgoals. The subgoal r(X,Y) contains the unbound variable Y and so it is not chosen. By contrast, all of the variables in the subgoal q(X) are bound, so the method outputs this subgoal next. On the third step, the final subgoal is added forming the second query shown above.

6.3 Subgoal Removal

Another common source of inefficiency in evaluation stems from the presence of redundant subgoals within queries. In many cases, it is possible to detect and eliminate such redundancies.

As a simple example of the problem, consider the query shown below. Relation goal is true of X and Y if p is true of X and Y and q is true of Y and q is also true of some Z.

It should be clear that the subgoal q(Z) is redundant here. If there is a value for Y such that q(Y) is true, then that value for Y also works as a value for Z. Consequently, we can drop the q(Z) subgoal (while retaining q(Y)), resulting in the query shown below.

Note that the opposite is not true. If we were to drop q(Y) and retain q(Z), we would lose the constraint on the second argument of p, which is also an argument to r.

Fortunately, it is easy to determine which subgoals can be removed and which need to be retained. The basic idea is to assemble a new body for the query incrementally, picking a subgoal on each step, checking for redundancy, and adding the subgoal once if it is not redundant.

As an illustration of this method in action, consider the example shown above. The method first computes the variables in the head of the query, viz. [X,Y] and initializes the variable newquery to a new query with the same head as the original query. It then iterates over the body of the query, adding subgoals to the new query once they are checked for redundancy.

On the first step of the iteration, the method focusses on p(X,Y). It creates a dataset consisting of instances of the remaining subgoals, viz. q(x1) and q(x2); it then tries to derive p(x0,x1); and, in this case, it fails. So p(X,Y) is added to the new query.

On the second step, the method focusses on q(Y). It creates a dataset consisting of instances of the remaining subgoals, viz. p(x0,x1) and q(x2); it then tries to derive q(x1); and, once again, it fails. So q(Y) is added to the new query.

Finally, the method focusses on q(Z). It creates a dataset consisting of instances of the other subgoals, viz. p(x0,x1) and q(x1); it then tries to derive an instance of q(Z). Note that Z is not bound here since it does not occur as a head variable or a variable in any of the other subgoals. In this case, the test succeeds; and so q(Z) is not added to the new query.

This method is sound in that it removes only redundant subgoals. As a result, any query produced by this method is equivalent to the query it is given as input.

Unfortunately, the method is not complete. There are redundant subgoals that it is does not detect. The problem arises when multiple redundant subgoals share variables that prevent the method from detecting the redundancy.

As an example, consider the query shown below. The relation goal is true of X if p is true of X and Y and q is true of X and Y and p is true of X and Z and q is true of X and Z.

Clearly the last two subgoals are redundant with the first two subgoals. Unfortunately, our method does not detect that either of the subgoals in either pair is redundant because of the variables shared with the other subgoal of the pair. Try it.

Detecting this sort of redundancy can be done mechanically by considering subsets of subgoals and not just individual subgoals. However, this is more expensive than the simple method outlined above.

6.4 Rule Removal

An analogous form of inefficiency in evaluation stems from the presence of redundant rules. As with redundant subgoals within rules, it is often easy to detect and eliminate such redundancies.

As an example of the problem, consider the rules shown below. Relation goal is true of X if p is true of X and Y and q is true of b and r is true of Z. Relation goal is true of X if p is true of X and Y and q is true of Y.

Any answer produced by the first rule here is also produced by the second rule, so the first rule is redundant and can be eliminated.

The trick to detecting such redundancies is to recognize that the second rule subsumes the first, i.e. all of the answers produced by the second rule are produced by the first rule. If we replace some or all of the variables in the body the second rule that do not occur in the head, then the heads remain the same and all of its subgoals are members of the body of the first rule. In this case, all we need to do is to replace Y by b, and we get the rule goal(X) :- p(X,b) & q(b), which is the same as the first rule except with fewer subgoals. Every output of the first rule is, therefore, an output of the second; and, consequently, the first rule can be dropped.

6.5 Example - Cryptarithmetic

A cryptarithmetic problem is a constraint satisfaction problem characterized by a finite set of letters and a finite set of numbers and an arithmetic constraint written in terms of the letters. A solution to the problem is a 1-1 mapping of letters to numbers such that, when the letters are replaced by their corresponding numbers, the arithmetic constraint is satisfied.

A classic cryptarithmetic is shown below. Here, the letters are {S, E, N, D, M, O, R, Y} and the numbers are the digits from 0 through 9 and we are looking for an assignment of the letters to digits such that the equation holds.

We can formulate this problem up as a query in much the same way that we formalized the map coloring problem presented in Chapter 3. First, we have a dataset listing the allowable digits.

Next, we write a query listing the eight letters as goal values, with subgoals to fix the ranges of these variables, disjointness constraints, and additional constraints to capture the arithmetic conditions. See below. In the interest of conciseness here, we have used ordinary arithmetic operators in place of the corresponding builtins, e.g. we have used E!=S for used distinct(E,S), and we have used used 1000*S for times(1000,S).

Having expressed our goal in this way, we can use our evaluation procedure to generate the answer to this problem. However, this is not without difficulty. Given the way this query is written, the evaluation process will take a long time. There are 10⁸ = 100,000,000 possible variable bindings. In the worst case (where there is no solution), it would check all of these. On average, it would have to look at a substantial fraction.

The good news is that we can use subgoal ordering to transform this query into one that is easier to evaluate. We simply move the inequalities up to the point where the variables are bound. See below.

Having done this, we eliminate many of the possibilities before they are even generated. The upshot is that this query takes orders of magnitude less time to evaluate, on most computers taking no more than a fraction of a second.

Exercises

Exercise 6.1: For each of the following groups of query rules, say which rule is best in terms of worst case evaluation complexity using our standard algorithm without indexing.

Exercise 6.2: For each of the following groups of query rules, select the alternative that is equivalent to the first rule in the group.

Exercise 6.3: For each of the following pairs of queries, say whether the first query subsumes the second, i.e. whether the set of answers to the first query contains the answers to the second query.

	(a)	`goal(X) :- p(X,Y)`
		`goal(X) :- p(X,a) & p(X,b)`

	(b)	`goal(X) :- p(X,a)`
		`goal(X) :- p(X,Y)`

	(c)	`goal(X) :- p(X,Y) & p(X,Z)`
		`goal(X) :- p(X,b) & q(b)`

6.1 Introduction

6.2 Subgoal Ordering

6.3 Subgoal Removal

6.4 Rule Removal

6.5 Example - Cryptarithmetic

Exercises