C H A P T E R  2
Datasets

2.1 Introduction

A dataset is a collection of simple facts that together characterize the state of an application area. Facts in the dataset are assumed to be true; facts that are not included in the dataset are assumed to be false. Different sets of facts characterize different states.

Datasets are the simplest form of logic programs. They can be used by themselves to encode information, e.g. in simple database systems. Or they can be used in combination with rules to form more complex logic programs, as we shall see in the coming chapters.

We begin this chapter by defining a formal language for encoding information in the form of datasets. We provide some examples of datasets encoded within this language. And we discuss the issues in involved in reconceptualizing an application area and encoding those different conceptualizations as datasets with different vocabularies.

2.2 Datasets

The vocabulary for a dataset is a collection of object constants, function constants, and relation constants. Object constants are used to represent objects in an application area. Function constants represent functions that map objects or combinations of objects into more complex objects. And relation constants represent properties of objects or relationships among objects.

Each function constant and relation constant in a well-formed dataset has an associated arity, i.e. the number of arguments allowed in any expression involving the corresponding function or relation. A unary function or relation is one that takes one argument; a binary function or relation is one that takes two arguments; a ternary function or relation is one that takes three arguments; and so forth. More generally, an n-ary function or relation is one that takes n arguments

A term is either an object constant or a functional term. A functional term is an expression consisting of an n-ary function constant and n terms. In what follows, we write functional terms in traditional mathematical notation - the function followed by its arguments enclosed in parentheses and separated by commas. For example, if f is a binary function constant and if a and b are object constants, then f(a,a) and f(a,b) and f(b,a) and f(b,b) are all functional terms.

Note that functional terms can be nested within other functional terms. For example, if f(a,b) is a functional term, then so is f(f(a,b),b) and f(f(a,b),f(a,b)).

A relational sentence, or datum or fact, is an expression formed from an n-ary relation constant and n terms. In what follows, we write data in mathematical notation analogous to that used in writing functional terms - the relation followed by its arguments enclosed in parentheses and separated by commas. For example, if r is a binary relation constant and a and b are object constants, then r(a,b) is a datum.

Remember that functional terms are terms and, as such, can be used within relational sentences. For example, if f(a,b) is a functional term, then r(f(a,b),b) is a relational sentences and so is r(f(a,b),f(a,b)). On the other hand, relational sentences are not allowed within other relational sentences. For example, if r is a relation constant it is not permissible to write r(r(a,b),b).

The Herbrand base for a vocabulary is the set of all relational sentences that can be formed from the constants in the vocabulary. For example, for a vocabulary with just two object constants a and b and a single binary relation constant r, the sentences below constitute the Herbrand base for this language.

r(a,a)
r(a,b)
r(b,a)
r(b,b)

Finally, we define a dataset to be any subset of the Herbrand base, i.e. an arbitrary set of facts that can be formed from the vocabulary of a database. Intuitively, we can think of the data in a dataset as the facts that we believe to be true; data that are not in the dataset are assumed to be false.

Note that, when a vocabulary contains function constants, there can be infinitely many functional terms, e.g. g(a), g(g(a)), g(g(g(a))), and so forth. The upshot is that, in such situations, the Herbrand base is infinite in size. Although the Herbrand base for a vocabulary is infinite in size, this does not mean that our datasets are necessarily infinite. In practical applications, we typically choose that vocabularies so that our datasets are finite, i.e. there are only finitely many facts that are true while infinitely many facts are false.

2.3 Example - Sorority World

Consider the interpersonal relations of a small sorority. There are just four members - Abby, Bess, Cody, and Dana. Some of the girls like each other, but some do not.

Figure 1 shows one set of possibilities. The checkmark in the first row here means that Abby likes Cody, while the absence of a checkmark means that Abby does not like the other girls (including herself). Bess likes Cody too. Cody likes everyone but herself. And Dana also likes the popular Cody.

  Abby Bess Cody Dana
Abby      
Bess      
Cody  
Dana      

Figure 1 - One state of Sorority World

In order to encode this information as a sentential dataset, we adopt a vocabulary with four object constants (abby, bess, cody, dana) and one binary relation constant (likes). Using this vocabulary, we can encode the information in Figure 1 by writing the sentential dataset shown below.

likes(abby,cody)
likes(bess,cody)
likes(cody,abby)
likes(cody,bess)
likes(cody,dana)
likes(dana,cody)

Note that the likes relation has no inherent restrictions. It is possible for one person to like a second without the second person liking the first. It is possible for a person to like just one other person or many people or nobody. It is possible that everyone likes everyone or no one likes anyone.

Even for a small world like this one, there are quite a few possible ways the world could be. Given four girls, there are sixteen possible instances of the likes relation - likes(abby,abby), likes(abby,bess), likes(abby,cody), likes(abby,dana), likes(bess,abby), and so forth. Each of these sixteen can be either true or false. There are 216 (i.e. 65,536) possible combinations of these true-false possibilities; and so there are 216 possible states of this world and, therefore, 216 possible datasets.

2.4 Example - Kinship

As another example, consider a small dataset about kinship. The terms in this case once again represent people. The relation constants name properties of these people and their relationships with each other.

In our example, we use the binary relation constant parent to specify that one person is a parent of another. The sentences below constitute a dataset describing six instances of the parent relation. The person named art is a parent of the person named bob and the person named bea; bob is the parent of cal and cam; and bea is the parent of coe and cory.

parent(art,bob)
parent(art,bea)
parent(bob,cal)
parent(bob,cam)
parent(bea,coe)
parent(bea,cory)

The adult relation is unary relation, i.e. a simple property of a person, not a relationship other people. Everyone in our dataset is an adult except for Art's grandchildren.

adult(art)
adult(bob)
adult(bea)

We can express gender with two unary relation constants male and female. The following data expresses the genders of all of the people in our dataset. Note that, in principle, we need only one relation here, since one gender is the complement of the other. However, representing both allows us to enumerate instances of both gender equally efficiently, which can be useful in certain applications.

male(art)female(bea)
male(bob)female(coe)
male(cal)female(cory)
male(cam)

As an example of a ternary relation, consider the data shown below. Here, we use prefers to represent the fact that the first person likes the second person more than the third person. For example, the first sentence says that Art prefers bea to bob; the second sentence says that bob prefers cal to cam.

prefers(art,bea,bob)
prefers(bob,cal,cam)

Note that the order of arguments in such sentences is arbitrary. Given the meaning of the prefers relation in our example, the first argument denotes the subject, the second argument is the person who is preferred, and the third argument denotes the person who is less preferred. We could equally well have interpreted the arguments in other orders. The important thing is consistency - once we choose to interpret the arguments in one way, we must stick to that interpretation everywhere.

One noteworthy difference difference between Sorority World and Kinship is that there is just one relation in the former (i.e. the likes relation), whereas there are multiple relations in the latter (one binary relation, three unary relations, and one ternary relation).

A more subtle and interesting difference is that the relations in Kinship are constrained in various ways while the likes relation in Sorority World is not. It is possible for any person in Sorority World to like any other person; all combinations of likes and dislikes are possible. By contrast, in Kinship there are constraints that limit the number of possible states. For example, it is not possible for a person to be his own parent, and it is not possible for a person to be both male and female.

2.5 Example - Blocks World

The Blocks World is a popular application area for illustrating ideas in the field of Artificial Intelligence. A typical Blocks World scene is shown in Figure 2.

A
B
D
C
E
 
Figure 2 - One state of Blocks World.

Most people looking at this figure interpret it as a configuration of five toy blocks. Some people conceptualize the table on which the blocks are resting as an object as well; but, for simplicity, we ignore it here.

In order to describe this scene, we adopt a vocabulary with five object constants (a, b, c, d, e), with one object constant for each of the five blocks in the scene. The intent here is for each of these object constants to represent the block marked with the corresponding capital letter in the scene.

In a spatial conceptualization of the Blocks World, there are numerous meaningful relations. For example, it makes sense to think about the relation that holds between two blocks if and only if one is resting on the other. In what follows, we use the relation constant on to refer to this relation. We might also think about the relation that holds between two blocks if and only if one is anywhere above the other, i.e. the first is resting on the second or is resting on a block that is resting on the second, and so forth. In what follows, we use the relation constant above to talk about this relation. There is the relation that holds of three blocks that are stacked one on top of the other. We use the relation constant stack as a name for this relation. We use the relation constant clear to denote the relation that holds of a block if and only if there is no block on top of it. We use the relation constant table to denote the relation that holds of a block if and only if that block is resting on the table.

The arities of these relation constants are determined by their intended use. Since on is intended to denote a relation between two blocks, it has arity 2. Similarly, above has arity 2. The stack relation constant has arity 3. Relation constants clear and table each have arity 1.

Given this vocabulary, we can describe the scene in Figure 2 by writing ground literals that state which relations hold of which objects or groups of objects. Let's start with on. The following sentences tell us directly for each ground relational sentence whether it is true or false.

on(a,b)
on(b,c)
on(d,e)

There are four above facts. The above relation holds of the same pairs of blocks as the on relation, but it includes one additional fact for block a and block c.

above(a,b)
above(b,c)
above(a,c)
above(d,e)

In similar fashion, we can encode the stack relation and the above relation. There is just one stack here - block a on block b and block b on block c.

stack(a,b,c)

Finally, we can write out the facts for clear and table. Blocks a and d are clear, while blocks c and e are on the table.

clear(a)table(c)
clear(d)table(e)

As with Kinship, the relations in Blocks World are constrained in various ways. For example, it is not possible for a block to be on itself. Moreover, some of these relations are entirely determined by others. For example, given the on relation, the facts about all of the other relations are entirely determined. In the next chapter, we see how to write out definitions for such concepts and thereby avoid having to write out individual facts for the defined concepts.

2.6 Example - Food World

As another example of these concepts, consider a small dataset about food and menus. The goal here is to create a dataset that lists meals that are available at a restaurant on different days of the week.

The object constants in this case come in two types - days of the week (monday, ... , friday) and different types of food (calamari, vichyssoise, beef, and so forth). There are three function constants - a 3-ary function for three course meals (three), a 4-ary function for four course meals (four), and a 5-ary function for five course meals (five). There is a single binary relation that relates days of the week and available meals.

The following is an example of a dataset using this vocabulary. On Monday, the restaurant offers a three course meal with calamari and beef and shortcake for dessert, and a different three course meal with puree and beef and ice cream for dessert. On Tuesday, the restaurant offer the one of the same three course meals and a four course meal as well. On Wednesday, the restaurant offers just one meal - the four course meal from the day before. On Thursday and Friday, the restaurant offers two different five course meals.

menu(monday,three(calamari,beef,shortcake))
menu(monday,three(puree,beef,icecream))
menu(tuesday,three(puree,beef,icecream))
menu(tuesday,four(consomme,greek,lamb,baklava))
menu(wednesday,four(consomme,greek,lamb,baklava))
menu(thursday,five(vichyssoise,caesar,trout,chicken,tiramisu))
menu(friday,five(vichyssoise,green,trout,beef,souffle))

Note that, although there are functions here, the dataset is finite in size. In fact, there are strong restrictions on what sentences make sense. For example, only object constants representing days of the week appear as the first argument of the menu relation. Only object constants representing foods appear as arguments in functional terms. And only whole meals appear as the second argument of the menurelation. Note also that functional terms are never nested. These kinds of restrictions are common in datasets. In Chapter 4, we show how we can formalize these constraints.

2.7 Reformulation

No matter how we choose to conceptualize the world, it is important to realize that there are other conceptualizations as well. Furthermore, there need not be any correspondence between the objects, functions, and relations in one conceptualization and the objects, functions, and relations in another.

In some cases, changing one's conceptualization of the world can make it impossible to express certain kinds of knowledge. A famous example of this is the controversy in the field of physics between the view of light as a wave phenomenon and the view of light in terms of particles. Each conceptualization allowed physicists to explain different aspects of the behavior of light, but neither alone sufficed. Not until the two views were merged in modern quantum physics were the discrepancies resolved.

In other cases, changing one's conceptualization can make it more difficult to express knowledge, without necessarily making it impossible. A good example of this, once again in the field of physics, is changing one's frame of reference. Given Aristotle's geocentric view of the universe, astronomers had great difficulty explaining the motions of the moon and other planets. The data were explained (with epicycles, etc.) in the Aristotelian conceptualization, although the explanation was extremely cumbersome. The switch to a heliocentric view quickly led to a more perspicuous theory.

This raises the question of what makes one conceptualization more appropriate than another. Currently, there is no comprehensive answer to this question. However, there are a few issues that are especially noteworthy.

One such issue is the grain size of the objects associated with a conceptualization. Choosing too small a grain can make knowledge formalization prohibitively tedious. Choosing too large a grain can make it impossible.

As an example of the former problem, consider a conceptualization of the scene in blocksworld in which the objects in the universe of discourse are the atoms composing the blocks in the picture. Each block is composed of enormously many atoms, so the universe of discourse is extremely large. Although it is in principle possible to describe the scene at this level of detail, it is senseless if we are interested in only the vertical relationship of the blocks made up of those atoms. Of course, for a chemist interested in the composition of blocks, the atomic view of the scene might be more appropriate. For this purpose, our conceptualization in terms of blocks has too large a grain.

Indistinguishability abstraction is a form of object reformulation that deals with grain size. If several objects mentioned in a set of rules satisfy all of the same conditions, under appropriate circumstances, it is possible to abstract the objects to a single object that does not distinguish the identities of the individuals. This can decrease the cost of processing queries by avoiding redundant computation in which the only difference is the identities of these objects.

Another way of reconceptualizing the world is the reification of functions and relations as objects in the universe of discourse. The advantage of this is that it allows us to consider properties of properties.

As an example, consider a Blocks World conceptualization in which there are five blocks, no functions, and three unary relations, each corresponding to a different color. This conceptualization allows us to consider the colors of blocks but not the properties of those colors.

We can remedy this deficiency by reifying various color relations as objects in their own right and by adding a relation to associate blocks with colors. Because the colors are objects in the universe of discourse, we can then add relations that characterize them; e.g. nice.

There is also the reverse of reification, viz. relationalization. Combining relationalization and reification is a common way to change from one ontology to another.

Note that, in this discussion, no attention has been paid to the question of whether the objects in one's conceptualization of the world really exist. We have adopted neither the standpoint of realism, which posits that the objects in one's conceptualization really exist, nor that of nominalism, which holds that one's concepts have no necessary external existence. Conceptualizations are our inventions, and their justification is based solely on their utility. This lack of commitment indicates the essential ontological promiscuity of AI: Any conceptualization of the world is accommodated, and we seek those that are useful for our purposes.

Exercises

Exercise 2.1: Consider the Sorority World introduced above. Write out a dataset describing a state in which every girl likes herself and no one else.

Exercise 2.2: Consider a variation of the Sorority World example in which we have a single binary relation, called friend. friend differs from likes in two ways. It is non-reflexive, i.e. a person cannot be friends with herself; and it is symmetric, i.e. if one person is a friend of a second person, then the second person is friends with the first. Write out a dataset describing a state that satisfies the non-reflexivity and symmetry of the friend relation and so that as few friend facts as possible are true. Note that there are multiple ways in which this can be done.

Exercise 2.3: Consider a variation of the Sorority World example in which we have a single binary relation, called younger. younger differs from likes in three ways. It is non-reflexive, i.e. a person cannot be younger than herself. It is antisymmetric, i.e. if one person is younger than a second person, then the second person is not younger than the first. It is transitive, i.e. if one person is younger than a second and the second is younger than a third, then the first is younger than the third. Write out a dataset describing a state that satisfies the reflexivity, antisymmetry, and transitivity of the younger relation and so that the maximum number of younger facts are true. Note that there are multiple ways in which this can be done.

Exercise 2.4: A person x is a sibling of a person y if and only if x is a brother or a sister of y. Write out the sibling facts corresponding to the parent facts shown below.

parent(art,bob)
parent(art,bea)
parent(bob,cal)
parent(bob,cam)
parent(bea,coe)
parent(bea,cory)

Exercise 2.5: Consider the state of the Blocks World pictured below. Write out all of the above facts that are true in this state.

A
B
C
D
E

Exercise 2.6: Consider a world with n object constants, no function constants, and a single binary relation constant. How many distinct facts can be written in this language?

n, 2n, n2, 2n, nn, 2n2, 22n

Exercise 2.7: Consider a world with n object constants, no function constants, and a single binary relation constant. How many distinct datasets are possible for this language?

n, 2n, n2, 2n, nn, 2n2, 22n

Exercise 2.8: Consider a world with n object constants, no function constants, and a single binary relation constant; and suppose that the binary relation is functional, i.e. every object constant in the first position is paired with exactly one object constant in the second position. How many distinct datasets satisfy this restriction?

n, 2n, n2, nn, 2n, 2n2, 22n

Exercise 2.9: In the Kinship example, we represent gender using the unary relations male and female. Consider an alternative formulation in which we reify these relations as object constants and instead use a single binary relation constant gender. Encode the gender information below using this new vocabulary. No tricks. Very easy. Just reinforcing the notion of reification.

male(art)female(bea)
male(bob)female(coe)
male(cal)female(cory)
male(cam)

Exercise 2.10: In the Foodworld example, we represent daily menus using the binary relation menu. Consider an alternative formulation in which we relationalize days of the week as unary relations that can be applied to menus that are offered on the corresponding days. Encode the menu information below using this new vocabulary. Again, no tricks. Just reinforcing the notion of relationalization.

menu(monday,three(calamari,beef,shortcake))
menu(monday,three(puree,beef,icecream))
menu(wednesday,four(consomme,greek,lamb,baklava))
menu(friday,five(vichyssoise,green,trout,beef,souffle))