The Flagging Art of Saying Nothing

Who doesn’t like a nice flag? Waving in the breeze, reminding us of who we are and what we stand for. Flags are a nice way of providing a rallying point around which to gather and show our colors to the world. They are a way of showing membership in a group, or providing a warning. Which is why it is so unfortunate when we find flags in a data management system, because they are reduced to saying nothing. Let me explain.

When we see Old Glory, we instantly know it is emblematic of the United States. We also instantly recognize the United Kingdom’s emblematic Union Jack and Canada’s Maple Leaf Flag. Another type of flag is a Warning flag alerting us to danger. In either case, we have a clear reference to what the flag represents. How about when you look at a data set and see ‘Yes’, or ‘7’? Sure, ‘Yes’ is a positive assertion and 7 is a number, but those are classifications, not meaning. Yes what? 7 what? There is no intrinsic meaning in these flags. Another step is required to understand the context of what is being asserted as ‘Yes’. Numeric values have even more ambiguity. Is it a count of something, perhaps 7 toasters? Is it a ranking, 7th place? Or perhaps it is just a label, Group 7?

In data systems the number of steps required to understand a value’s meaning is critical both for reducing ambiguity and, more importantly, for increasing efficiency. An additional step is required to understand that ‘Yes’ means ‘needs review’, so the processing steps needed to extract its meaning have doubled. In traditional systems, the two-step flag dance is required because two steps are required to capture the value. First a structure has to be created to hold the value, the ‘Needs Review’ column. Then a value must be placed into that structure. More often than not, an obfuscated name like ‘NdsRvw’ is used, which requires a third step to understand what that means. Only when the structure is understood can the value and meaning the system designer was hoping to capture be deciphered.

When the value that belongs in the structure isn’t known, a NULL value is inserted as a placeholder. That’s right, a value literally saying nothing. Traditional systems are built as structure first, content second. First the schema, the structure definition, gets built. Then it is populated with content. The meaning of the content may or may not survive the contortions required to stuff it into the structure, but it gets stuffed in anyway in the hope it can be deciphered later when extracted for a given purpose. For situations where there is a paucity of data, there is a special name for a structure that largely says nothing: sparse tables. These are tables known to likely contain only a very few of the possible values, but the structure still has to be defined before the rare case values actually show up. Sparse tables are like requiring you to have a shoe box for every type of shoe you could possibly ever own even though you actually only own a few pairs.
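To make the shoe box picture concrete, here is a minimal Python sketch. The column names, owners, and counts are all invented for illustration; it is not taken from any real system, just one way to see how much of a sparse table says nothing.

```python
# Hypothetical structure-first table: every possible column exists for every
# row, whether or not there is anything to say. None stands in for NULL.
COLUMNS = ["sneakers", "boots", "sandals", "loafers", "clogs"]

sparse_rows = [
    {"owner": "Ann", "sneakers": 2, "boots": None, "sandals": None, "loafers": None, "clogs": None},
    {"owner": "Bob", "sneakers": None, "boots": 1, "sandals": None, "loafers": None, "clogs": None},
]

def count_nulls(rows, columns):
    """Count the cells that hold None, i.e. the cells saying nothing."""
    return sum(1 for row in rows for col in columns if row[col] is None)

# Hypothetical meaning-first alternative: record only what is actually known.
assertions = [
    ("Ann", "sneakers", 2),
    ("Bob", "boots", 1),
]

total_cells = len(sparse_rows) * len(COLUMNS)
null_cells = count_nulls(sparse_rows, COLUMNS)
print(f"{null_cells} of {total_cells} cells are NULL")  # 8 of 10 cells are NULL
print(f"{len(assertions)} assertions carry the same information")
```

Two assertions say everything the ten-cell table says; the other eight cells are empty shoe boxes.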

Structure-first thinking is so embedded in our DNA that we find it inconceivable that we can manage data without first building the structure. As a result, flag structures are often put in to drive system functionality. Logic then gets built to execute the flag dance, and it gets executed every time interaction with the data occurs. The logic says something like this:
IF this flag DOESN’T say nothing
THEN do this next thing
OTHERWISE skip that next step
OR do something else completely.
Sadly, structure-first thinking requires this type of logic to be in place. The NULL placeholders are a default value to keep the empty space accounted for, and there has to be logic to deal with them.
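As a rough illustration, the flag dance might look like this in Python. This is only a sketch: the rows are made up, ‘NdsRvw’ is the obfuscated column name from the example above, the Adams and Madison contracts are hypothetical, and `None` stands in for NULL.

```python
# Hypothetical rows from a structure-first table; None stands in for NULL.
contracts = [
    {"name": "Jefferson", "NdsRvw": "Yes"},
    {"name": "Adams", "NdsRvw": "No"},
    {"name": "Madison", "NdsRvw": None},
]

def flag_dance(rows):
    """Return the names of contracts to review, checking the flag on every row."""
    to_review = []
    for row in rows:
        flag = row["NdsRvw"]
        if flag is None:
            continue  # the flag says nothing, so skip the next step
        if flag == "Yes":
            to_review.append(row["name"])
        # any other value: do something else completely (here, nothing)
    return to_review

print(flag_dance(contracts))  # ['Jefferson']
```

Every row pays for the NULL check and the value check, even though only one contract actually has something to say.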

Semantics, on the other hand, is meaning-first thinking. Since there is no meaning in NULL, there is no concept of storing NULL. Semantics captures meaning by making assertions. In semantics we write code that says “DO this with this data set.” No IF-THEN logic, just DO this and get on with it. Here is an example of how semantics maintains the fidelity of our information without having vacuous assertions.

The system can contain an assertion that the Jefferson contract is categorized as ‘Needs Review’, which puts it into the set of all contracts needing review. It is a subset of all the contracts. The rest of the contracts are in the set of all contracts NOT needing review. These are separate and distinct sets which together make up the set of all contracts, a third set. System functionality can be driven by simply selecting the set requiring action: the ‘Needs Review’ set, the set that excludes those that need review, or the set of all contracts. Because the contracts requiring review were placed in their own subset with a single step, the processing logic is cut in half. Where else can you get a 50% discount and do less work to get it?
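The same idea can be sketched with plain Python sets. This is an analogy rather than a semantic data store: only the Jefferson contract comes from the example above, and the other contract names are hypothetical.

```python
# Hypothetical contract names; only Jefferson comes from the example above.
all_contracts = {"Jefferson", "Adams", "Madison"}

# A single assertion places the Jefferson contract in the 'Needs Review' set.
needs_review = {"Jefferson"}

# The complement is derived from set membership; nothing is stored for it.
no_review_needed = all_contracts - needs_review

# DO this with this set: no per-row flag check, no NULL handling.
for contract in sorted(needs_review):
    print(f"Review {contract}")  # Review Jefferson
```

Nothing is ever asserted about Adams or Madison regarding review, so there is no placeholder to store and no logic needed to step around it.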

I love a good flag, but I don’t think they would have caught on if we needed to ask the flag-bearer what the label on the flagpole said to understand what it stood for.

Blog post by Mark Ouska

For more reading on the topic, check out this post by Dave McComb.