The case for an ‘unknown’ value in software logic.
Relational databases, relational theory, relational calculus, and predicate logic all rely on two-valued truth: a given proposition or predicate is either true or false. Our ability to prove that the results of a query are correct rests on being able to establish that each part of the query, each piece of predicate, is either true or false.
The Customer ID in the order table either equals the ID in the Customer table or it doesn’t. The value for the customer location either equals the literal that was put in the query or it doesn’t. We don’t think about this much. It works and, as pragmatists, we use it. However, there are already many situations where two-value logic is a hindrance and we have to find subtle workarounds to its limitations. As our systems become more dynamic and the reach of our queries becomes more global, we will bump up against these limitations more and more often. In this paper, we will discuss three-value logic and just a few of its implications for systems building.
The three values are: true, false, and don’t know. Let’s use an example from one of our clients. This client wishes to find patients in their patient database who are eligible for a particular clinical drug trial. The drug trial has very specific criteria for eligibility. Let’s say you have to be between 21 and 50 years of age, not pregnant, and have not taken inhaled steroids for at least six months. Drug trial eligibility criteria are typically more complex than this but are structurally similar.
There are two problems with running this query against our patient database. The first is that we have likely expressed these concepts in different terms in our database; we can eventually resolve that. The second is harder: how do we deal with missing data in our database? The simple answer is to make all data required. In practice, however, this is completely unworkable. There are small amounts of data that you might make required for some purposes. You might be able to get a birth date for all your patients.
Probably you should. However, if you haven’t seen a patient for several months, you might not know whether they are pregnant, and unless you ask them or you prescribed the aerosol inhaler yourself, you likely won’t know whether they have taken inhaled steroids. This is the first place where two-value logic falls down. We are forced to assume that “absence of evidence” is equivalent to “evidence of absence”; that is, if we do not have a record of your steroid use, then you haven’t used steroids.
But it doesn’t take long for us to realize that that is just not so and that any database representation is a tiny subset of the total amount of data and the state of any given object at any point in time. The first thing that the three-value logic introduces is the concept of don’t know. We are dividing the result set of our query into three populations.
Firstly, there will be some people for whom we can knowingly answer that all three predicates are true. We know their date of birth and therefore we know their age to be in the range; we know, for instance, they are male and therefore we infer they are not pregnant (and we’ll have to discuss in more detail how an inference like this can be converted to a truth value); and if we happened to have recently asked if they have used an inhaled steroid in the last six months and they answered in the negative, we take those three pieces of data and assign that individual to the set of people that matched truthfully all three parts of the predicate.
It’s even easier to put someone in the category of ineligible, because a known failure on any one of the three predicates is sufficient to exclude them from the population. So being underage, overage, or pregnant, or having a record of being dispensed an inhaled steroid medication, would be sufficient to exclude a person from the population. However, this very likely leaves us with a large group of people who are in the don’t know category.
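The three populations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the client's system: the Patient fields, the use of None to mean "no data," and the helper names (Truth, all_of, eligible, classify) are all hypothetical. The key point is the three-valued AND: one known failure excludes, and otherwise any missing answer leaves the patient in the don’t know group.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Dict, List, Optional


class Truth(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "don't know"


def from_optional(value: Optional[bool]) -> Truth:
    """Map a possibly missing boolean onto the three truth values."""
    if value is None:
        return Truth.UNKNOWN
    return Truth.TRUE if value else Truth.FALSE


def negate(t: Truth) -> Truth:
    """Three-valued NOT: unknown stays unknown."""
    if t is Truth.UNKNOWN:
        return Truth.UNKNOWN
    return Truth.FALSE if t is Truth.TRUE else Truth.TRUE


def all_of(*values: Truth) -> Truth:
    """Three-valued AND: any FALSE excludes; otherwise any UNKNOWN leaves it open."""
    if Truth.FALSE in values:
        return Truth.FALSE
    if Truth.UNKNOWN in values:
        return Truth.UNKNOWN
    return Truth.TRUE


@dataclass
class Patient:
    # Hypothetical record layout. None means "we have no data", not "no".
    name: str
    birth_date: Optional[date] = None
    pregnant: Optional[bool] = None
    inhaled_steroid_last_6_months: Optional[bool] = None


def age_in_range(p: Patient, today: date) -> Truth:
    if p.birth_date is None:
        return Truth.UNKNOWN
    had_birthday = (today.month, today.day) >= (p.birth_date.month, p.birth_date.day)
    age = today.year - p.birth_date.year - (0 if had_birthday else 1)
    return from_optional(21 <= age <= 50)


def eligible(p: Patient, today: date) -> Truth:
    return all_of(
        age_in_range(p, today),
        negate(from_optional(p.pregnant)),
        negate(from_optional(p.inhaled_steroid_last_6_months)),
    )


def classify(patients: List[Patient], today: date) -> Dict[Truth, List[str]]:
    """Divide the result set into the three populations."""
    groups: Dict[Truth, List[str]] = {t: [] for t in Truth}
    for p in patients:
        groups[eligible(p, today)].append(p.name)
    return groups
```

Note that a patient with no birth date on file but a recorded pregnancy still lands in the ineligible group: the one known failure is enough, regardless of what else is missing.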
Don’t know and inference
The don’t know category is the most interesting and in some ways the most productive of the categories. If you’re looking for clinical trial members, certainly those that match on all the criteria are the “low-hanging fruit.” But the fact is that this set is generally far too small to satisfy the recruitment requirements, and we are forced to deal with the large and imperfectly known set of patients that might be eligible.
By the way, this thinking is not restricted to an exotic field like clinical trial recruiting. I’m using this example because I think it’s a clear and real one. But any application with a search involved, including a Web-based search, has an element of this problem. That is, there are many documents or many people or many topics that “may” be relevant, and we won’t know until we investigate further. The first line of attack is to determine what can be known categorically. We already know that we do not have complete information at the instance level; that is, for this particular patient we do not have the particular data value we need to make the evaluation.
But there are some things that we can know with a great deal of certainty at a categorical level. Earlier we alluded to the fact that we can pretty well assume that a male patient is not pregnant. There are many ways to express and evaluate this, and we will not go into them here except to say that this style of evaluation is different from pure predicate evaluation based on the value of the instance data. Perhaps the most important difference is that as we move away from implications that are near certainty, such as a male patient not being pregnant, we move into more probabilistic inferences. We would like to be able to subdivide our population into groups that are statistically likely to be included once we find their true values.
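One possible way to picture categorical inference is as a small set of rules that are consulted only when the instance data is missing. The sketch below is an assumption for illustration, not a method this paper prescribes: the record is a plain dictionary, the keys "sex" and "pregnant" are hypothetical, and None again stands for don’t know.

```python
from typing import Callable, List, Optional

# A rule inspects a (hypothetical) patient record and returns True, False,
# or None when it has nothing to say about the question.
Rule = Callable[[dict], Optional[bool]]


def male_implies_not_pregnant(record: dict) -> Optional[bool]:
    """Category-level knowledge: near certainty, not instance data."""
    return False if record.get("sex") == "male" else None


PREGNANCY_RULES: List[Rule] = [male_implies_not_pregnant]


def is_pregnant(record: dict, rules: List[Rule] = PREGNANCY_RULES) -> Optional[bool]:
    """Prefer recorded instance data; fall back to categorical rules."""
    recorded = record.get("pregnant")
    if recorded is not None:
        return recorded  # what we actually know about this patient wins
    for rule in rules:
        inferred = rule(record)
        if inferred is not None:
            return inferred
    return None  # still don't know
```

Keeping the rules separate from the instance data makes it visible which answers were observed and which were inferred, which matters once the inferences become less than certain.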
For instance, we may know statistically that at any point in time women between the ages of 21 and 30 have a 5% chance of being pregnant and women between the ages of 40 and 50 have a 2% chance of being pregnant. (Those are made-up numbers, but I think you can see the point.) We can extend this logic over all the pieces of data in our sample, such as the likelihood that an individual has taken an inhaled steroid, which may be adjusted based on medical condition (certainly asthma patients have a much higher chance of having done so than the population at large). Doing so could give us a population stratified by likelihood of a match.
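As a rough sketch of how such a stratification might be computed, the Python below multiplies the per-criterion chances that each unknown will turn out in the patient's favor. It rests on assumptions that are not in the discussion above: the criteria are treated as statistically independent, the band thresholds and labels are arbitrary, and the 40% figure for an asthma patient is invented purely for illustration, in the same spirit as the made-up pregnancy rates.

```python
from math import prod
from typing import Mapping, Sequence, Tuple


def match_likelihood(p_satisfied: Mapping[str, float]) -> float:
    """Chance that every criterion holds, ASSUMING the criteria are independent."""
    return prod(p_satisfied.values())


def stratum(
    likelihood: float,
    bands: Sequence[Tuple[float, str]] = ((0.75, "likely"), (0.25, "possible")),
) -> str:
    """Place a likelihood into a coarse band; the thresholds are arbitrary."""
    for threshold, label in bands:
        if likelihood >= threshold:
            return label
    return "unlikely"


# A woman aged 25 with asthma, age already known to be in range.
candidate = {
    "age in range": 1.0,        # known from the birth date
    "not pregnant": 0.95,       # from the made-up 5% figure for ages 21 to 30
    "no inhaled steroid": 0.40, # invented number for an asthma patient
}
band = stratum(match_likelihood(candidate))
```

A criterion that is already known simply contributes 1.0 (or 0.0, which excludes the patient outright), so the same calculation covers both the known and the don’t know populations.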
It’s a bit of a stretch but this is somewhat analogous to some of the techniques that Google employs to improve the percentage chance that the documents they return are the documents we’re interested in. They don’t use this technique – they use techniques based on references from authoritative sites and proximity of words to each other and the like – but the result is somewhat similar.
Uncertainty and the cost of knowing
The next thing that is inescapable if you pursue this is that eventually we will have to turn our uncertainty into an acceptable level of certainty. In virtually all cases, this involves effort. What we are looking for is what can give us the best result with the least effort. As it turns out, acquisition of information can very often be categorized based on the process that will be required to acquire the knowledge.
For instance, if a piece of data that we needed had to do with recent blood chemistry, our effort is great; we must schedule the patient to visit a facility where we can remove some of their blood and we have to send that blood to some sort of lab to acquire the information. If the information is about blood chemistry and we merely have to find and pull their paper chart and look it up, that’s considerably easier, less expensive, and faster. If we can look it up in an electronic medical record, better yet. In many cases, it means we have to go to the source. If we have to call and ask someone a question there is a cost of time.
If we can send an e-mail with some reasonable likelihood of response, there is a lower cost. What we would like to do is stratify our population based on the expected cost to acquire the information that is most likely to make a difference. In other words, if there are dozens of pieces of data that we need confirmed in order to get to a yes, then the fact that any one of those pieces of data is inexpensive to acquire is not necessarily helpful. We may need to stratify our search by the pieces of data that could either rule in or rule out a particular individual.
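To make the trade-off concrete, here is a minimal sketch of one heuristic for deciding which unknown to resolve first: ask the questions that are cheap relative to their chance of ruling the patient out, and stop as soon as one does. This is an assumed approach for illustration, not a technique this paper commits to. The costs, the probabilities, and the names Unknown, ask_order, and expected_cost are all hypothetical, and the unknowns are treated as independent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Unknown:
    name: str
    cost: float       # hypothetical cost units (time, money, effort)
    p_exclude: float  # estimated chance the answer rules the patient out


def ask_order(unknowns: List[Unknown]) -> List[Unknown]:
    """Cheapest cost per unit chance of exclusion first (a heuristic)."""
    def ratio(u: Unknown) -> float:
        return u.cost / u.p_exclude if u.p_exclude > 0 else float("inf")
    return sorted(unknowns, key=ratio)


def expected_cost(ordered: List[Unknown]) -> float:
    """Expected spend if we stop at the first answer that excludes the patient."""
    total, p_still_open = 0.0, 1.0
    for u in ordered:
        total += p_still_open * u.cost      # we only pay if not yet excluded
        p_still_open *= 1.0 - u.p_exclude
    return total


# Invented figures: an e-mail is cheap, a chart pull costs more, a blood draw most.
unknowns = [
    Unknown("blood chemistry (new draw)", cost=50.0, p_exclude=0.10),
    Unknown("prior lab result (paper chart)", cost=5.0, p_exclude=0.05),
    Unknown("steroid use (e-mail question)", cost=1.0, p_exclude=0.60),
]
plan = ask_order(unknowns)
```

With these invented figures the e-mail question comes first, because it is both cheap and likely to settle the matter, and the blood draw comes last, since it is only worth paying for when the patient has survived the cheaper checks.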
In this white paper we are not dealing with the technology approach of exactly how you would do this or with any one of the many, many implications. What I wanted to do was create some awareness of both the need for doing this and the potential benefits of incorporating this into an overall architecture.