Shirky, Syllogism and the Semantic Web

Revisiting Clay Shirky’s piece on the Semantic Web

A friend recently sent me the link to Clay Shirky’s piece on the Semantic Web with “I assume you’ve seen this, what do you think?”

I had seen it, but I hadn’t looked at it for years. So I went back for another look.

As usual, Shirky’s writing is intelligent, insightful and even funny. Recommended reading. I had hoped the ensuing years would prove “us” (Semantic Technologists) right, and that the argument would look amusing in retrospect.

Alas, we still have a long way to go to staunch the critics. More on that in a future article.

For today, I have to point out the real irony of the article that I managed to miss the first time I read it.

At the risk of oversimplifying his article to the same degree he oversimplified the Semantic Web, the essence of the article went like this:

• The Semantic Web relies on syllogisms: “The semantic web is a machine for creating syllogisms.”

• Nobody uses syllogisms: “it will improve all the areas of your life where you currently use syllogisms. Which is to say, almost nowhere.”

• Therefore nobody will use the Semantic Web: “it requires too much coordination and too much energy to effect in the real world.”

The first two quotes are from the opening; the last is from the closing.

The irony being, of course, that this entire article is a syllogism. To make it a major premise of your argument that something will fail because nobody uses that style of argument reminds me of the admonition Yogi Berra gave to some teammates who had suggested a restaurant for the evening’s dinner: “Nah, nobody goes there anymore. It’s too crowded.”

The article points out some areas we need to pay more attention to, including controlling the hype machine. Reading between the lines, one of his major points appears to be: the web is complex, and only humans can really understand the nuances of meaning in our complex utterances.

But traffic is complex too, and we know that traffic lights will never be as good as police officers at managing an intersection; yet we have decided that an automated solution that gets us consistently pretty good results is good enough.

Back to the article: he relies on Lewis Carroll’s syllogisms as a critique of the medium and, by extension, the Semantic Web. The knockout punch was meant to be a five-line syllogism about soap-bubble poems. But even here there were two implications: one, that humans could follow this logic; and two, that formalized ontologies could not. I of course rose to the bait and tried to formalize this syllogism.

I was not successful. Not because of any poverty of expression in the Semantic Web, nor even of my own understanding: attempting to get formal about this doggerel shone a light on the fact that it doesn’t make any sense at all. Indeed, if he makes a point at all, it is that humans can often get fooled by things that sound like they make sense but actually don’t. Seems to me, defending that level of confusion and ambiguity isn’t an argument against the Semantic Web.

Quantities, Number Units and Counting in gist

We have a simple and effective way in gist to represent a wide range of physical quantities such as ‘82 kg’, ‘3 meters’ and ‘20 minutes’.  Each quantity has a number and a unit, such as ‘meter’ or ‘second’.  In addition to these simple units, we have unit multiplication and division to represent more complex units, e.g. for speed and acceleration. A standard speed unit is meters per second [m/s] and a standard acceleration unit is meters per second per second [(m/s)/s] or simply  [m/s^2].

Physicists as well as business people like to avoid the inconvenience of working with very large or very small numbers like 1000 meters, or .000000000001 meters (a trillionth of a meter). If you counted to see whether the number of zeros was correct, you understand the problem. So we create units like kilometer and picometer and give them conversion factors. This works for any kind of unit (time, electric current, mass). Note that the standard units have a conversion factor of 1 (which, in normal parlance, means no conversion is necessary). See Figure 1 for some examples.

Figure 1: Example Quantities

We have also found a need for counting units like dozen or gross. For example, a wine merchant stocks and sells cases of 12 bottles of wine, so counting in dozens is more convenient than counting single bottles. What is interesting is that we can use the exact same structure for representing ‘4 dozen’ or ‘7 gross’ as we do for representing things like ‘82 kg’ and ‘20 minutes’. Take ‘4 dozen’: the number is 4, the unit is ‘dozen’, and the conversion is 12.
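Here is a minimal Turtle sketch of that shared structure. The class names (gist:Extent, gist:Count, gist:CountingUnit) come from gist, but the properties in the ex: namespace are illustrative placeholders; the exact gist property IRIs vary by version.

```turtle
@prefix gist: <https://ontologies.semanticarts.com/gist/> .
@prefix ex:   <http://example.com/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# '3 meters' -- an ordinary physical quantity
ex:_3meters a gist:Extent ;
    ex:numericValue "3"^^xsd:decimal ;
    ex:hasUnit      gist:meter .

# '4 dozen' -- exactly the same shape, with a counting unit
ex:_4dozen a gist:Count ;
    ex:numericValue "4"^^xsd:decimal ;
    ex:hasUnit      ex:dozen .

ex:dozen a gist:CountingUnit ;
    ex:conversionFactor "12"^^xsd:decimal .   # 4 dozen = 4 x 12 = 48
```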

In gist there is also a way to represent percentages, which we have always treated as a ratio. After all, when speaking of a percentage, there is always an explicit or implicit ratio somewhere.  For example:

  1. “Shipment A has only 65% as much oil as shipment B” corresponds to the ratio:
    (No. of barrels in shipment A) / (No. of barrels in shipment B) = .65
  2. “There are 20% more grams of chocolate in the new package size” corresponds to the ratio:
    (NewQuantity – OldQuantity) / (OldQuantity) = .20

The units for the first example are barrels/barrels which cancel out leaving a pure number. Similarly, the units for the second example are grams/grams which again cancel out. In fact, every ratio unit that corresponds to a percentage will cancel out and leave a pure number. This means that although it may be useful to do so, we don’t need to represent gist:Percentage using a ratio unit.

Another thing we had not realized before is that, being a pure number, a percentage can be represented in the same way we represent dozen or gross. The only difference is the conversion factor (12 vs. .01). We can use this same structure (sketched in Turtle after the list) to represent:

  • parts per million (ppm), used by toxicologists, say, to measure amounts of mercury in tuna
  • basis points (used by the Fed for describing interest rates)
    Investopedia defines a basis point as “a unit that is equal to 1/100 of 1%”
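Each of these is just a counting unit with its own conversion factor. In Turtle (again with the illustrative ex:conversionFactor property):

```turtle
@prefix gist: <https://ontologies.semanticarts.com/gist/> .
@prefix ex:   <http://example.com/> .

ex:percent        a gist:CountingUnit ; ex:conversionFactor 0.01 .
ex:basisPoint     a gist:CountingUnit ; ex:conversionFactor 0.0001 .    # 1/100 of 1%
ex:partPerMillion a gist:CountingUnit ; ex:conversionFactor 0.000001 .
ex:dozen          a gist:CountingUnit ; ex:conversionFactor 12 .
```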

See figure 2 for the representational structures.

 

Figure 2: One structure for number units and ordinary units

 

Notice how ‘4 cm’ is very similar to ‘4 percent’:

  • to convert 4 cm to its standard unit, we multiply 4 by the conversion factor of .01 resulting in .04 meters
  • to convert 4 percent to its standard unit, we multiply 4 by the conversion factor of .01 resulting in .04 ??.

This means we can use the same computational mechanism to perform units conversion for pure numbers like 4 dozen and 4% as we do for ordinary physical quantities like 4 cm or 82 kg.
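A SPARQL sketch of that single mechanism, using the illustrative ex: properties from above: one query converts any quantity, physical or pure number, to its standard-unit value.

```sparql
PREFIX ex: <http://example.com/>

# '4 cm'      -> 4 * .01 = .04 (meters)
# '4 percent' -> 4 * .01 = .04
# '4 dozen'   -> 4 * 12  = 48
SELECT ?quantity ((?n * ?f) AS ?standardValue)
WHERE {
  ?quantity ex:numericValue     ?n ;
            ex:hasUnit          ?unit .
  ?unit     ex:conversionFactor ?f .
}
```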

One question remains. Whereas we can readily see that the conversion factor for kilometer is based on the standard unit of meter, and the conversion factor for hour is based on the standard unit of second, what are the conversion factors of 12, .01 and .0001 (for dozen, percent and basis point) based on? What does it mean to have a standard unit for these pure numbers with a conversion of 1?

Let’s look at how gist represents dozen and kilometer to see if that gives us any insight (a Turtle sketch follows the list).

  1. gist:kilometer is an instance of gist:DistanceUnit,
    ‘3 meters’ is an instance of gist:Extent, and
    the base unit is gist:meter.
    Analogously:
  2. gist:dozen is an instance of gist:CountingUnit,
    ‘4 dozen’ is an instance of gist:Count, and
    the base unit is gist:each.
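In triples (the ex:baseUnit property is an illustrative placeholder, like the other ex: properties above):

```turtle
@prefix gist: <https://ontologies.semanticarts.com/gist/> .
@prefix ex:   <http://example.com/> .

# 1. distance
gist:kilometer a gist:DistanceUnit ;
    ex:baseUnit         gist:meter ;
    ex:conversionFactor 1000 .

# 2. counting, analogously
gist:dozen a gist:CountingUnit ;
    ex:baseUnit         gist:each ;
    ex:conversionFactor 12 .
```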

Curiously, while ‘meter’ actually means something to us, and we know what it means to say ‘3 meters’, it is strange to think what ‘3 eaches’ could possibly mean. I invite you to stare at the following table for a while and see some analogies.

Figure 3: Standard Unit for Pure Number Quantities

Then notice that:

  1. 4 dozen = 48 eaches
  2. 4 dozen = 48 (just a simple number)
  3. Therefore, 48 must equal 48 eaches (because both are equal to 4 dozen).

But what is it, such that having 48 of them gives you the number 48? The answer is the number one: 48 x 1 = 48. So the meaning of gist:each is the number one acting as a unit. This is a mathematical abstraction. The ??’s in figure 2 stand for ‘each’, which is the standard number unit. So when you say ‘3 eaches’ it is just 3 of the number one, which is just the pure number 3. As an aside, we can also say that ‘each’ is the identity element for unit multiplication and division. This is analogous to the number 1 being the identity element for multiplication and division of numbers.

  • You can multiply or divide any number by 1 and you get that number back.
  • You can multiply or divide any unit by each (which means one) and get that unit back.

Note that while conceptually they mean the same thing, syntactically gist:each is very different from the number one as a number whose datatype is, say, integer or float.

Notice that for these pure numbers in conveniently sized units, we are usually counting things: how many dozens, how many basis points or percentage points, how many parts per million. We refer to ‘each’ thing as ‘one’ thing being counted. That links gist:each to the number one. Thus, despite the awkwardness of speaking of ‘3 eaches’, the names ‘Count’, ‘CountingUnit’ and ‘each’ are quite reasonable.

Finally, insofar as all instances of CountingUnit are based on the number one, and all instances of Count represent pure numbers, we can think of every CountingUnit as a degenerate unit, and gist:Count as a degenerate quantity. A ‘real’ quantity is not just a number; it has a number and a non-numeric unit.

So in conclusion:

  1. We have extended the notion of gist:Count and gist:CountingUnit to apply to pure numbers that are less than one as well as those that are greater than one.
  2. We can represent pure numbers expressed in dozens, percentages, basis points and ppm just like we express the more usual quantities: ‘82 kg’, ‘3 meters’ and ‘20 minutes’.
  3. We can use the same computational mechanism to do units conversions on pure numbers as we can for ordinary physical quantities.
  4. We can represent gist:Percentage using a new unit called gist:percent with a conversion of .01 instead of using a ratio unit, making a more uniform representation.
  5. It will often be helpful to represent a gist:Percentage using a ratio, but it is no longer required.
  6. gist:Count could meaningfully and accurately be called gist:PureNumber, since every instance of gist:Count (e.g. ‘4 dozen’, ‘65%’) is a pure number (e.g. 48, .65).
  7. gist:CountingUnit could meaningfully and accurately be called gist:PureNumberUnit, because every instance of gist:CountingUnit is used to express pure numbers.
  8. gist:each corresponds to the number one.
  9. We can think of Counts (pure numbers) and CountingUnits (number units) as degenerate cases of ordinary quantities and units like ‘82 kg’ and ‘kg’.

Written by Michael Uschold

Collections and Queries: Two Sides of the Same Coin?

This blog was wholly inspired by an observation made by Dave McComb that collections and queries have an interesting relationship.

Collections frequently arise when creating enterprise ontologies. In manufacturing, there are lists of approved suppliers and lists of ingredients. In finance, there are baskets of stocks making up a portfolio, and a 30-year mortgage corresponds to a collection of 360 monthly payments. In healthcare, there are lists of side effects for a given drug, and a patient bill is essentially a collection of line items, each with an associated cost. We will look at two ways to model collections and consider the pros and cons of each, using a list of approved suppliers for salt as our running example.

Represent the list as an explicit collection

The first is to create an explicit collection called, say, _ApprovedSaltSuppliers. The members of this collection are each suppliers of salt, say _Supplier1, _Supplier14 and _Supplier23. We can link each of the suppliers to the collection using the object property gist:memberOf. So far, we have four instances, one property, three triples and no classes.


Figure 1: Simple Collection

It is always good practice to say what class an instance belongs to. What kind of a thing is the instance _ApprovedSaltSuppliers? First, it is a collection. We have a class for that called gist:Collection, so we will want _ApprovedSaltSuppliers to be an instance of gist:Collection. However, it is more than just any collection: it is not a jury, it is not a deck of cards. More specifically, it is a list of approved suppliers. So we create a class called ListOfApprovedSuppliers and declare _ApprovedSaltSuppliers to be an instance of that class. We also make ListOfApprovedSuppliers a subclass of gist:Collection, ensuring that _ApprovedSaltSuppliers is an instance of gist:Collection.
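In Turtle, this first approach might look like the sketch below (the ex: namespace is illustrative; gist:Collection and gist:memberOf are from gist):

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix gist: <https://ontologies.semanticarts.com/gist/> .
@prefix ex:   <http://example.com/> .

ex:ListOfApprovedSuppliers rdfs:subClassOf gist:Collection .

# the explicit collection, typed to the more specific class
ex:_ApprovedSaltSuppliers a ex:ListOfApprovedSuppliers .

# three triples put the suppliers into the collection
ex:_Supplier1  gist:memberOf ex:_ApprovedSaltSuppliers .
ex:_Supplier14 gist:memberOf ex:_ApprovedSaltSuppliers .
ex:_Supplier23 gist:memberOf ex:_ApprovedSaltSuppliers .
```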

Using a SPARQL query to get the list

The second approach is inspired by the fact that a SPARQL query returns a report that is a collection of items from a triple store, based on some specific criteria. Instead of having an explicit collection in the triple store for approved suppliers of a given substance, you could simply link that substance to each of the approved suppliers and then write a SPARQL query to find them in a triple store of past, present and potential future suppliers. See Figure 2 for how this would be done.
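For instance, if each approval were recorded with an illustrative ex:approvedToSupply link from supplier to substance, the list would simply be the result of a query like this:

```sparql
PREFIX ex: <http://example.com/>

# All currently approved suppliers of salt
SELECT ?supplier
WHERE {
  ?supplier ex:approvedToSupply ex:_Salt .
}
```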

Note that we have added some contextual information. First, we indicated the kind of substance the suppliers are approved to supply. We also have a link to the person in charge of maintaining the list. Can you think of what other information you might want to associate with the list of approved salt suppliers if you were modeling this for a specific organization?


Figure 2: Comparing two approaches

How should we skin this cat?

Which approach is best under what circumstances and why?  Have another look at the examples above in finance, manufacturing and healthcare.  For each, consider the following questions:

  1. can you think of an easy, natural and obvious way to represent the information and write a query that can generate the list?
  2. are the items in the list likely to change with some degree of frequency?
  3. is the collection mainly ‘just a collection’ i.e. does it have little meaning on its own and few if any attributes or relationships linking it to other things?

For the list of approved suppliers, we already saw that the answer to the first question is yes. The answer to the second question is also yes, because suppliers come and go; there will likely be a moderate amount of change. The third question is less clear. It could be that an approved supplier list is connected only to a given substance and nothing else, in which case the answer to the third question would be yes. But the way we have modeled it includes a named individual responsible for creating and maintaining it. The list might also be part of a larger body of documentation. Or, in a larger organization where different divisions have their own approved supplier lists, you would need to indicate which organization is being supplied. With this extra information, the answer to the third question would be no.

Consider a patient bill. It is not at all obvious how to represent the information so that a simple query can give the answer. First, a given patient might have many bills over time, which would connect them to many different line items; it is a bit awkward. Second, these items will never change. Finally, while it makes sense to represent a patient bill as being, in essence, a list of line items, it is much more than that: it is connected not only to the patient, but also to the hospital, to the provider, and possibly to an insurance company.

To the extent that the answers to the above three questions are yes, you are probably better off just writing an on-demand query. Conversely, if the answers tend to be no, then you probably do want to represent and manage an explicit collection.

Degeneracy, Pot and Freebasing

In 2013 marijuana became legal in Colorado and Washington. It did not escape the notice of our clients that all Semantic Arts staff are from those two states. Suspicions grew deeper when we told them about the large collaborative knowledge base called Freebase. The looks on their faces told us they were thinking about freebase. We were but a whisker away from the last straw on the day we introduced them to the idea of degeneracy. “No, no,” we insisted, “we are talking about a situation that commonly arises in mathematics and computing where something is considered to be a degenerate case of something else.” That was a close call; fortunately, all that confusion was “just semantics”.

Today we explain the idea of degeneracy and why it is useful for computing and ontology.

Figure 1: Examples of Degeneracy

Examples of Degeneracy: A circle is defined to be the set of points that are equidistant from a given point. The radius of the circle is that distance. But what do you have if the radius is zero? You have the set of points at distance zero from a given point, which is to say, just that one point. In mathematics we would say a point is the degenerate case of a circle.

We all know what a rectangle is, but what happens if the width of an otherwise ordinary rectangle is zero? Then you just get a single line segment.  Again, we say that a line segment is a degenerate case of a rectangle.

A set normally has two or more members; otherwise, what is the point of calling it a set? Yet the need to speak of and represent sets that have 0 or 1 elements often arises. It happens so frequently that they have names: the empty set and the singleton set. They are degenerate cases of a set.

An example of a more complex structure than a set is a process, which consists of any number of tasks along with some ordering indicating which tasks must be done before which others. Sometimes, however, during computation or analysis, it can be convenient or even necessary to allow processes that have zero tasks, or just one task. We could refer to such processes as empty or singleton processes. These are degenerate cases of a process, which ordinarily has two or more tasks.

What do all these examples have in common? What can we say about every case of degeneracy?

Definition

I propose the following as a working definition of degeneracy.  We say that an X is a degenerate case of a Y when:

  1. Strictly speaking, an X can be seen to be an example of a Y
    1. a point is a circle with radius equal to zero
    2. a line segment is a rectangle with one of the dimensions having zero length.
  2. An X is substantially simpler than a Y; some of the essence of being a Y is missing.
  3. The much simpler nature of an X means the X has lost so much of the essence of being a Y that, in most circumstances, no one would even think to call the X a Y. Ordinarily:
    1. no one would think to refer to a point as a circle.
    2. no one would think to refer to nothing at all as a set.
    3. no one would think to refer to doing a single thing (much less nothing) as a process.

 


Figure 2: Number units as Degenerate Units of Measure

Why bother?

It might seem rather silly, or something only mathematicians would bother about, but it turns out that in computing, degeneracy is very important. Let’s say you want to compute the average of an arbitrary set of numbers. In everyday parlance, it makes no sense to speak of the average of a single number. However, you want your algorithm to work if it is handed a set that happens to have only one number in it. You therefore want to be able to pass the algorithm a set with one element in it.
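SPARQL’s AVG aggregate is a handy illustration of an algorithm that already handles the degenerate case; against a graph holding a single reading (the ex: names are invented for the example), the query below simply returns that one number.

```sparql
PREFIX ex: <http://example.com/>

# Works for three readings, two readings, or the degenerate
# case of exactly one reading.
SELECT (AVG(?n) AS ?average)
WHERE {
  ?reading ex:numericValue ?n .
}
```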

When processes are being executed, tasks are being done, so a set of tasks may dwindle down to one and then to zero. If you want the algorithm to work, it needs to understand what it means to have a process with one or zero tasks. When there are no tasks left, the task is to do nothing. Sometimes it is even helpful to consciously put a ‘do nothing’ task in a plan or process.

Generally speaking, degenerate cases are useful when you want computational infrastructure to still work on the edge cases.

The most interesting example that I have seen of this in the context of ontology work arises when doing unit conversions for physical quantities. For example, you convert 4 cm to meters using a conversion factor of .01. You convert 3 kg to grams using a conversion factor of 1000. We have an ontology for representing such physical quantities with units and conversion factors. Using this ontology, we have code to do units analysis and compute conversions.

It turns out to be convenient to give pure numbers ‘units’ just like we give physical quantities units. For example, a wine merchant might sell cases with 12 bottles each. A unit of ‘dozen’ would come in handy, with a conversion of 12.  Another convenient number unit is percent with a conversion of .01.  To convert 250% to the true number, you multiply 250 by .01 to get 2.5. To convert 4 dozen to the true number, you multiply 4 by 12 to get 48.

This is completely analogous to converting 250 cm to meters: you multiply 250 by .01 to get 2.5 meters. It turns out that you can represent pure numbers using units in a way that is exactly analogous to how you represent physical quantities like 4 cm and 5 watts. This means that the same code that does units conversions on physical quantities with ordinary units also works for number units. A pure number like 48, represented as a 4 with the unit ‘dozen’, is a degenerate case of a physical quantity such as ‘250 cm’. Pure number units like dozen and percent are degenerate cases of ordinary physical units like cm or watt.

Go back and look at the previous section and check the extent to which pure numbers and number units fit the definition of degeneracy.

  1. Strictly speaking, a pure number like ‘4 dozen’ can be seen as a physical quantity, in the sense that it can be represented exactly like one, with a unit and a number.
  2. Pure numbers are simpler than physical quantities because they are only numbers, very different from, say, 3 meters or 20 amperes.
  3. No one would normally think to refer to a pure number like 4 or 48 as something that has a unit attached to it. The essence that has been lost is a unit that changes the character of the quantity from being a pure number. Number units like dozen and percent leave the quantity still just a number: ‘4 dozen’, 4 and 4% are all just numbers, whereas the difference between 4, 4 cm and 4 amperes is huge.

For a detailed look at number units, see the blog: Quantities, Number Units and Counting in gist.

White Paper: The Distinctionary

Encyclopedias are generally not intended to help with definition. An encyclopedia is useful in that once you know what something means, you can find out what else is known about it.

Semantics is predicated on the idea of good definitions. However, most definitions are not very good. In this essay we’re going to explore why well-intentioned definitions miss the mark and propose an alternate way to construct definitions. We call this alternate way the “distinctionary.”

Dictionary Definitions

The dictionary creates definitions for words based on common and accepted usage. Generally, this usage is culled from reputable, published sources. Lexicographers comb through existing uses of a word and create definitions that describe what the word means in those contexts. Very often this will give you a reasonable understanding for many types of words. This is why dictionaries have become relatively popular and sometimes even bestsellers. However, it is not nearly enough. In the first place, there is not a great deal of visibility in attaching the definitions to their source; there is a very casual relationship between the source of the definition and the definition itself.

Perhaps the larger problem is that the definition describes but it does not discern. In other words, if there are other terms or concepts that are close in meaning, this type of definition would not necessarily help you distinguish between them.

Thesauri Definitions

Another way to get at meaning is through a thesaurus. The trouble with a thesaurus is that it is a connected graph of similar concepts. This is helpful if you are overusing a particular word and would like to find a synonym, or if you want to search for a similar word with a slightly different concept. But again, it does very little good actually describing the differences between the similar terms.

WordNet

WordNet is an online searchable lexicon that in some ways is similar to a thesaurus. The interesting and important difference is that in WordNet there are six or seven relationship links between terms and each has a specific meaning. So whereas in a thesaurus the two major links between terms are the synonym and antonym links, in other words, similar to and not similar to, in WordNet there are links that define whether one term is a proper subtype of another term, whether one term is a part of another term, etc. This is very helpful, and it takes us a good way toward definitions that make a difference.

Taxonomies

A rigorous taxonomy is a hierarchical arrangement of terms where each subterm is a proper subtype of the parent term. A really good taxonomy includes rule in and rule out tests to help with the placement of items in the taxonomy. Unfortunately, few good taxonomies are available but they do form a good starting point for rigorous definitions.

Ontologies

An ontology, as Tom Gruber pointed out, is a specification of a conceptualization. A good ontology will have not only the characteristics of a good taxonomy, with formal subtyping and rules for inclusion and exclusion, but will also include other, more complex inference relationships. The ontology, like the taxonomy, also has the powerful notion of “committing to” the ontology. With a dictionary definition there is no formal concept of the user committing to the meaning as defined by the source authority for the term. However, we do find this in taxonomies and ontologies.

The Distinctionary

The preceding lays out a landscape of gradually increasing rigor in the tools we use for defining and managing the terms and concepts we employ. We’re going to propose one more tool not nearly as comprehensive or rigorous as a formal taxonomy or ontology, but which we have found to be very useful in the day to day task of defining and using terms: the distinctionary.

The distinctionary is a glossary. It is distinct from other glossaries in that it is structured such that a term is first placed as a type of a broader term or concept, and then a definition is applied that distinguishes this particular term or concept from its peers.

Eventually, each of the terms or concepts referred to in a distinctionary definition, i.e., “this term is a subtype of another one,” would also have to have their own entry in the distinctionary. But in the short term and for practical purposes we have to agree that there is some common acceptance of some of the terms we use.

A Few Examples

I looked up several definitions of the word “badger.” In this case I was looking for the noun, the mammal. I remembered that a badger was an animal but I couldn’t remember what kind of animal, so I thought maybe the dictionary would help. Here is what I found:

Badger:

Merriam Webster:

1 a: any of various burrowing mammals (especially Taxidea taxus and Meles meles) that are related to the weasel and are widely distributed in the northern hemisphere

Encarta:

a medium-sized burrowing animal that is related to the weasel and has short legs, strong claws, and a thick coat. It usually has black and white stripes on the sides of its head.

Cambridge Advanced Learner’s Dictionary:

an animal with greyish brown fur, a black and white head and a pointed face, which lives underground and comes out to feed at night

American Heritage:

1. Any of several carnivorous burrowing mammals of the family Mustelidae, such as Meles meles of Eurasia or Taxidea taxus of North America, having short legs, long claws on the front feet, and a heavy grizzled coat.

Webster’s Dictionary (1828 Edition):

1. In law, a person who is licensed to buy corn in one place and sell it in another, without incurring the penalties of engrossing.

2. A quadruped of the genus Ursus, of a clumsy make, with short, thick legs, and long claws on the fore feet. It inhabits the north of Europe and Asia, burrows, is indolent and sleepy, feeds by night on vegetables, and is very fat. Its skin is used for pistol furniture; its flesh makes good bacon, and its hair is used for brushes to soften the shades in painting. The American badger is called the ground hog, and is sometimes white.

Encyclopedia Definitions

Columbia Encyclopedia

name for several related members of the weasel family. Most badgers are large, nocturnal, burrowing animals, with broad, heavy bodies, long snouts, large, sharp claws, and long, grizzled fur. The Old World badger, Meles meles, is found in Europe and in Asia N of the Himalayas; it is about 3 ft (90 cm) long, with a 4-in. (10-cm) tail, and weighs about 30 lb (13.6 kg). Its unusual coloring, light above and dark below, is unlike that of most mammals but is found in some other members of the family. The head is white, with a conspicuous black stripe on each side. European badgers live, often in groups, in large burrows called sets, which they usually dig in dry slopes in woods. They emerge at night to forage for food; their diet is mainly earthworms but also includes rodents, young rabbits, insects, and plant matter. The American badger, Taxidea taxus, is about 2 ft (60 cm) long, with a 5-in. (13-cm) tail and weighs 12 to 24 lb (5.4–10.8 kg); it is very short-legged, which gives its body a flattened appearance. The fur is yellowish gray and the face black, with a white stripe over the forehead and around each eye. It is found in open grasslands and deserts of W and central North America, from N Alberta to N Mexico. It feeds largely on rodents and carrion; an extremely swift burrower, it pursues ground squirrels and prairie dogs into their holes, and may construct its own living quarters 30 ft (9.1 m) below ground level. American badgers are solitary and mostly nocturnal; in the extreme north they sleep through the winter. Several kinds of badger are found in SE Asia; these are classified in a number of genera. Badgers are classified in the phylum Chordata, subphylum Vertebrata, class Mammalia, order Carnivora, family Mustelidae.

Wikipedia

is an animal of the typical genus Meles or of the Mustelidae, with a distinctive black and white striped face – see Badger (animal). Badger is the common name for any animal of three subfamilies, which belong to the family Mustelidae: the same mammal family as the ferrets, the weasels, the otters, and several other types of carnivore.

Firstly, I intentionally picked a very easy word. Specific nouns like this are among the easiest things to define. I could have picked “love” or “quantum mechanics” or a verb like “generate” if I wanted to make this hard. As a noun, the definition of this word would be greatly aided by (although not completed by) a picture.

Let’s look at what we got. First, all the definitions establish that a badger is an animal, or mammal. Anyone trying to find out what a badger was could reasonably be assumed to know what those two terms mean. Most rely on Latin genus/species definitions, which is not terribly helpful: if you already know the precise definition of these things, then you already know what a badger is. Worse, many of them are imprecise in their references: “especially Taxidea taxus and Meles meles.” What is that supposed to mean?

Some of the more useful parts of these definitions are “burrowing” and “carnivorous.” However, these don’t actually distinguish badgers from, say, skunks, foxes or anteaters. “Weasel-like” is interesting, but we don’t know in what way they are like weasels. Indeed, some of these definitions would have you think they were weasels.

Encyclopedias are generally not intended to help with definition. An encyclopedia is useful in that once you know what something means, you can find out what else is known about it. However, these encyclopedia entries are much better at defining “badger” than the dictionary definitions. (By the way, a lot of the encyclopedia information will make great “rule in/rule out” criteria.)

I had to include the 1828 definition, if only for its humor value. In the first place, its first definition is one that, less than 200 years later, is now virtually extinct. The rest of the definition seems to be in good form, but mostly wrong (“genus Ursus” [bears], “feeds by night on vegetables,” “ground hog”) or irrelevant (“pistol furniture” and “brushes to soften the shades in painting”).

So what would the distinctionary entry look like for badger? I’m sad to say, even after reading all this I still don’t know what a badger is. Structurally the definition would look something like this:

A badger is a mammal. It is a four legged, burrowing carnivore. It is distinct from other burrowing carnivores in that [this is the part I still don’t know, but this part should distinguish it from close relatives (weasels and otters) as well as more distant burrowing carnivores, such as fox and skunk]. Its most distinguishing feature is two white stripes on the sides of its head.

The point of the distinctionary is to help us keep from getting complacent about our definitions. In the everyday world of glossaries and dictionaries, most definitions sound good, but when you look more closely you realize that they hide as much ignorance as they reveal. As you can see from my above attempt at a distinctionary entry for badger, it’s pretty hard to cover up ignorance.

White Paper: Semantic Profiling

Semantic profiling is a technique using semantic-based tools and ontologies in order to gain a deeper understanding of the information being stored and manipulated in an existing system.

Semantic Profiling

In this paper we will describe an approach to understanding the data in an existing system through a process called semantic profiling.

What is semantic profiling?

Semantic profiling is a technique using semantic-based tools and ontologies in order to gain a deeper understanding of the information being stored and manipulated in an existing system. This approach leads to a more systematic and rigorous approach to the problem and creates a result that can be correlated with profiling efforts in other applications.

Why would you want to do semantic profiling?

The immediate motivation to do semantic profiling is typically either a system integration effort, a data conversion effort, a new data warehousing project, or, more recently, a desire to use some form of federated query in order to pull together enterprise-wide information. Each of these may be the initial motivator for doing semantic profiling but the question still remains: why do semantic profiling rather than any of the other techniques that we might do? To answer that let’s look at each of the typically employed techniques:

  • Analysis. By far, the most common strategy is some form of “analysis.” What this usually means is studying existing documentation and interviewing users and developers about how the current system works and what data is contained in it. From this the specification for the extraction or interface logic is designed. This approach, while popular, is fraught with many problems. The most significant is that very often what the documentation says and what the users and developers think or remember is not a very high fidelity representation of what will actually be found when one looks deeper.
  • Legacy understanding. The legacy understanding approach is to examine the source code of the system that maintains the current data and, from the source code, deduce the rules that are being applied to the data in the current system. This can be done by hand for relatively small applications. We have done it with custom analysis tools in some cases and there are commercial products from companies like Relativity and Merant that will automate this process. The strength of this approach is that it makes explicit some of what was implicit, and it’s far more authoritative than the documentation. The code is what’s being implemented; the documentation is someone’s interpretation of either what should have been done or their idea of what was done. While legacy understanding can be helpful, it’s generally expensive and time-consuming and still only gives a partial answer. The reason it only gives a partial answer is that there are many fields in most applications that have relatively little system enforcement of data values. Most fields with text data and many fields with dates and the like have very little system enforced validation. Over time users have adapted their usage and procedures have been refined to fill in missing semantics for the system. It should be noted though that the larger the user base the more useful legacy understanding is. In a larger user base, relying on informal convention becomes less and less likely, because the scale of the system means that users would have had to institutionalize their conventions, which usually means systems changes.
  • Data profiling. Data profiling is a technique that’s been popularized by vendors of data profiling software such as Evoke, Ascential and Firstlogic. This process relies on reviewing the existing data to determine and uncover anomalies in the databases. These tools can be incredibly useful in finding areas where the content of the existing system is not what we would have expected it to be. Indeed, the popularity of these tools stems largely from the almost universal surprise factor when people are shown the content of their existing databases that they were convinced were populated only with clean, scrubbed data of high integrity, only to find a gross number of irregularities. While we find data profiling very useful, we find that it doesn’t go far enough. In this paper we’ll outline a procedure that adds on and takes it further.

So how is semantic profiling different?

The first difference is that semantic profiling is more rigorous. We will get into exactly why this is in the section on how to do semantic profiling but the primary difference is that with data profiling you can search for and catalog as many anomalies as you like. After you’ve found and investigated five strange circumstances in a database you can stop. It is primarily an aid to doing other things and as such you can take it as far as you want. With semantic profiling, once you select a domain of study you are pretty much committed to take it “to ground.” The second main difference is that the results are reusable. Once you’ve done a semantic profile on one system, if you do a profile on another system the results of the first system will be available and can be combined with those from the second system. This is extremely useful in environments where you are attempting to draw information from multiple sources to pull into one definitive source; whether that is for data warehousing or EII (Enterprise Information Integration). And finally the semantic profiling approach sets up a series of testable hypotheses that can be used to monitor a system as it continues in production, to detect semantic drift.

What you’ll need

For this exercise you will need the following materials:

  • A database to be studied, with live or nearly live data. You can’t do this exercise with developer-created test data.
  • Data profiling software. Any of the major vendors’ products will be suitable for this. It is possible for you to roll your own, although this can be a pretty time consuming exercise.
  • A binding to your database available to the data profiling software. If your database is in a traditional relational form with an ODBC or JDBC access capability then that’s all you need. If your data is in some more exotic format you will need an adapter.
  • Meta-data. You will need access to as much as you can find about the official meta-data for the fields under study. This may be in a data dictionary, it may be in a repository, it may be in the copy books; you may have to search around a bit for it.
  • An ontology editor. You will be constructing an ontology based on what you find in the actual data. There are a number of good ontology editors; for our purposes, Protégé from Stanford, which is free, should be adequate.
  • An inferencing engine. While there are many proprietary inferencing engines, we strongly advocate adopting one based on the recent standards RDF and OWL. There are open-source and freeware versions, such as OpenRDF or Kowari.
  • A core ontology. The final ingredient is a starting point ontology that you will use to define concepts as you uncover them in your database. For some applications this may be an industry reference data model such as HL7 for health care. However, we are advocating the use of what we call the semantic primes as the initial starting point. We’ll cover the semantic primes in another white paper or perhaps in a book. However, they are a relatively small number of primitive concepts that are very useful in clarifying your thinking regarding other concepts.

How to proceed

Overall, this process is one of forming and testing hypotheses about the semantics of the information in the extant database.

The hypotheses being formed concern both the fidelity and the precision of the definition of the items as well as uncovering and defining the many hidden subtypes that lurk in any given system.

This business of “running to ground” means that we will continue the process until every data item is unambiguously defined and all variations and subtypes have been identified and also unambiguously defined.

The process begins with some fairly simple hypotheses about the data, hypotheses that can be gleaned directly from the meta-data. Let’s say we notice in the data dictionary that BB104 has a data type of date, or even that it has a mask of MMDDYYYY. We hypothesize that it is a date and, further, in our case, our semantic prime ontology forces us to select between a historical date and a planned date. We select historical. We add this assertion to our ontology: BB104 is of type historical date. We run the data profiling and find all kinds of stuff. We find that some of the “historical dates” are in the future. So, depending on the number of future dates and other contextual clues, we may decide either that our initial assignment was incorrect and these actually represent planned dates, some of which are in the past because the plans were made in the past, or, in fact, that most of these dates are historical dates but there are some records in this database of a different type. Additionally, we find some of these dates are not dates at all. This begins an investigation to determine if there is a systemic pattern to the dates that are not dates at all. In other words, is there a value in field BB101, BB102, or BB103 that correlates with the non-date values? And if so, does this create a different subtype of record where we don’t need a date?
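The hypothesis “BB104 is a historical date” is directly testable. Assuming the records have been lifted into RDF, with an illustrative ex:BB104 property carrying xsd:dateTime values, a query along these lines surfaces the counterexamples:

```sparql
PREFIX ex: <http://example.com/>

# Counterexamples to the hypothesis "BB104 is a historical date":
# records whose BB104 value lies in the future.
SELECT ?record ?date
WHERE {
  ?record ex:BB104 ?date .
  FILTER (?date > NOW())
}
```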

In some cases we will uncover errors that are just pure errors. We have found cases where data validation rules changed over time, so that older records have different, anomalous values. And in some cases people have, on an exception basis, used system-level utilities to “repair” data records, sometimes creating these strange circumstances. Where we uncover what is finally determined to be a genuine error, rather than semantically defining it we should add it to a punch list for correcting both the error and, if possible or necessary, its cause.

Meanwhile, back to the profiling exercise. As we discover subtypes with different constraints on their date values, we introduce them into the ontology we are building. In order to do this, as we document our date, we need to further qualify it. What is it the date of? For instance, if we determine that it is, in fact, a historical date, what event was recorded on that date? As we hypothesize and deduce this, we add to the ontology both the event we described and the information that this BB104 date is the ‘occurred on’ date for that event. As we find that the database has some records with legitimate historical dates and others with future dates, and we find some correlation with another value, we hypothesize that there are indeed two types of historical events, or perhaps some historical events mixed with some planned events or planned activities. What we then do is define these as separate concepts in the ontology, with a predicate defining eligibility for the class. To make it simple, if we found that BB101 had one of two values, either P or H, we might hypothesize that H means historical and P means planned, and we would say that the inclusion criterion for planned events is that the value of BB101 equals P. This is a testable hypothesis. At some point the ontology becomes rich enough to begin its own interpretation. We load the data, either directly from the database or, more likely, from the profiling tool, as instances in the RDF inferencer. The inferencing engine itself can then challenge class assignments, detect inconsistent property values, and so on. We proceed in this fashion until we have unambiguously defined all the semantics of all the data in the area under question.
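That inclusion criterion can itself be written into the ontology. One hedged way to do it, using owl:hasValue (the class and field names come from the running example; this is a sketch, not the paper’s prescribed encoding):

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex:  <http://example.com/> .

# "A planned event is any record whose BB101 value is 'P'."
ex:PlannedEvent owl:equivalentClass [
    a owl:Restriction ;
    owl:onProperty ex:BB101 ;
    owl:hasValue "P"
] .
```

With this in place, an OWL reasoner can classify records into ex:PlannedEvent automatically, and the class assignment becomes exactly the kind of testable hypothesis the process calls for.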

Conclusion

Having done this, what do you have? You have an unambiguous description of the data as it exists and a set of hypotheses against which you can test any new data to determine whether it agrees. More simply, you know exactly what you have in your database if you were to perform a conversion or a system integration. More interestingly, you also have the basis for a set of rules if you want to do a combined integration of data from many sources. You would know, for instance, that you need to apply a predicate to records from certain databases to exclude those that do not match the semantic criteria you want to use from the other system. Say you want a “single view of the customer.” Of all the many records in all your many systems that mention or allude to customers, you need to know which ones really are customers and which are channel partners, prospects or various other parties that might be included in some file. You need a way to define that unambiguously, or system-wide integration efforts are going to fall flat. We believe this to be the only rigorous and complete approach to the problem. While it is somewhat complex and time-consuming, it delivers reasonable value and contributes to the predictability of other efforts, which are often incredibly unpredictable.

Written by Dave McComb

Fractal Data Modeling

Fractal geometry creates beautiful patterns from simple recursive algorithms. One of the things we find so appealing is their “self-similarity” at different scales. That is, as you zoom in to look at a detail under more magnification you see many of the same patterns that were visible when zoomed out at the macro level.

After decades of working with clients at large organizations, I’ve concluded that they secretly would like to have a fractal approach to managing all their data. I say this desire is secret for several reasons:

  • No one has ever used that term in my presence
  • What they currently have is 180 degrees away from a fractal landscape
  • They probably haven’t been exposed to any technology or approach that would make this possible

And yet, we know this is what they want. It exists in part in a small subset of their data: the ability to “drill down” on predefined dimensions gives a taste of what is possible, but it is limited to that subset of the data. It exists in small measure in any corpus made zoomable by faceted categorization, but it is far from the universal organizing principle that it could be. Several of the projects we have worked on over the last few years have allowed us to triangulate in on exactly this capability. This fractal approach leads us to information-scapes that have these characteristics:

  • They are understandable
  • They are pre-conditioned for easy integration
  • They are less likely to be loaded with ambiguity

The Anti-fractal data landscape

The data landscape of most large enterprises looks much the same:

  • There are tens of thousands to hundreds of thousands of database tables,
  • spread across hundreds to thousands of applications.
  • In total there are hundreds of thousands to millions of attributes.

There is nothing fractal about this. It is a vast, detailed data landscape with no organizing principle. The only thing that might stand in for an organizing principle is the boundary of the application, which actually makes matters worse. The presence of applications allows us to take data that is similar and structure it differently, categorize it differently, and name it differently. Rather than providing an organizing principle, applications make understanding our data more difficult.

And this is just the internal, structured data. There is far more data that is unstructured (and we have nearly nothing to help us there), external (ditto) and “big data” (double ditto).

Download the White-paper to read more.

Groans, Giggles and Gales of Laughter

As the Season to be Jolly gets into full swing, I reflect on one of my favorite ways to be jolly: laughter. What makes us laugh (or not) in a given situation? Some people just laugh, like my mother, who regularly burst into gales of laughter for no apparent reason. The more puzzled our looks, the harder she laughed. I dedicate this blog to my mom, who passed away just before Christmas a year ago.

What happened?

There are many things that can trigger us to laugh. It is often someone making a quip or writing/telling a story or joke with the intention to make others laugh.  Of course, such attempts often fall flat.  Conversely, sometimes things just happen that are found to be funny, and no one was playing the role of comedian.  The trigger event might be something that is unfolding live – perhaps just a stray thought that comes into your head.  Or you might be watching videos, reading, or listening to a podcast.  Or worse, the person next to you on a boring commute, earbuds in place, is doing so and laughing their head off.

Then what did you do?

So an event happens that comes into our awareness, and we outwardly respond to it in some way. If we are not impressed, we may grimace, groan, guffaw, give a blank stare, or say “yark yark”. On the positive side, reactions include a smile, a chuckle, a giggle, laughter, gales of laughter and falling over laughing. In extreme cases, each time you recall the event in subsequent minutes, hours and days, you will again laugh uncontrollably. It could be weeks, months or even years before your response to merely remembering the original event fades back to a mere smile, or a warm feeling inside. See the diagram below. Note the similarity to how gratitude was characterized in the Thanksgiving blog several weeks ago.

Why did you do that?

What is interesting is not so much the trigger event itself, nor even what the reactions are: it is what happens in between. What do we find funny, and why? First, we see or understand the thing that is [or is supposed to be] funny. When the event is an overt attempt at humor, we call this ‘getting it’. The next step is very personal: how much, and why, do we appreciate what we just saw or understood? Many people who get a bad pun won’t enjoy or appreciate it in any way. Other people may readily acknowledge that the same pun is pretty bad, but they giggle nevertheless. Below are just a few thoughts that come to mind: patterns for things that contribute to being funny.

  • We watch someone make mistakes; e.g. the Darwin Awards are often uproariously funny. We will often laugh if someone falls into a puddle, or at ourselves for, say, putting the left shoe on the right foot (er, the wrong foot).
  • Plays on words, including the lowly pun as well as other more respectable forms of double entendre. For example: “The past, the present and the future walk into a bar. It was tense.”
  • Something is startling, non-obvious, or the exact opposite of what we are expecting. This is not enough on its own to be funny, but often contributes. For example, did you hear about the guy who walked onto a train and saw Albert Einstein?  He asked Albert: “do you know whether Boston comes to this train”?
  • Reference to an event that happened recently that is similar to what is going on at a given moment. Surprisingly, this can often be funny, all by itself, in the right situation, but it is hard to convey – you really do have to be there. Look out for it.

For the 2014 Season to be Jolly, you now have a simple conceptual model describing the main elements of laughter and amusement and how they relate to each other. In our business, we call this an ‘ontology’. Sometimes the hardest thing is coming up with a good name for an ontology. The ones that first come to mind are descriptive but boring, like Ontology of Laughter (OoL) or The Laughter Ontology (TLO). Or, if you want a cool acronym, you might try Crafted Ontology Of Laughter (yark yark). Alternatively, the name you think of calls attention to itself by trying so hard to be clever that it falls flat on its face, like “OnJollygy” (pronounced: on-jolly-jee).

Exercise for the reader:

  1. Notice what just happened: a bad attempt at humor
  2. Notice your own reaction to this proposed name
  3. See how your own reaction matches with the elements in OnJollygy

For example, for me, this name just came as a random thought (trigger event). Because I am a total sucker for plays on words, I immediately enjoyed it and giggled with glee (reaction) – but the first person I told gave me a blank stare, and the next one just grimaced.

Feel free to think I’m a bit weird. I attribute this to genetics. My mom never needed a reason to laugh, and my dad is a die-hard punster who wears a t-shirt that says: “A Mature Pun is Fully Groan”. His favorite was a quintuple pun involving lions, sea gulls and commerce – have you heard that one?

Whether you experience Groans,  Giggles or Gales of Laughter, may your Holidays be Jollily filled with Joy.

Is the Road to Euphoria Paved by Thanking with Reckless Abandon?

I wrote the following last year, and am inspired to share this more publicly, today, the Monday of Thanksgiving week, 2014.

It is Thanksgiving Day, 2013 and I just came across an article describing how science has tied gratitude to “the tendency to feel more hopeful and optimistic about one’s own future, better coping mechanisms for dealing with adversity and stress, [and] less instances of depression” among other things.

But what exactly is gratitude? And will all forms of gratitude give a similar boost to happiness? Examples range from a simple automatic thank you when someone opens a door for you, to a profound mystical experience where one may find oneself weeping in an alpine meadow of flowers surrounded by glacier-draped peaks.

An ontological analysis that I conducted in the last hour reveals the following essential elements of gratitude.

  1. Person: the thankful one
  2. Trigger Event: An event that triggers the person’s gratitude response
  3. Gratitude Response: The expression of gratitude by that person in response to the event.

Here is a picture. The rectangles represent the key kinds of things; the links indicate how they are related to each other. Note that the person has to be aware of the trigger event in order to express gratitude in response to it.

Main Elements of Gratitude

 

In addition, every  gratitude response will have a certain form and a certain character.  The character of the gratitude response might be unconscious or conscious; if the latter, it might be at a thinking level without any real feeling, or it might be deeply felt.  A gratitude response will also take a certain form, e.g. returning a favor, a verbal or written thank you, or just an inner feeling.

This is what is essential, but there are various optional things too. For example, the trigger event may have been triggered by another person, or others may have had nothing to do with it (e.g. a rainbow). If the former, the triggering person may or may not have explicitly intended to benefit the thankful one. They may have just made an offhand remark that someone found value in, or written something in an article that had a great benefit to a reader who emailed a thank you.  In this example, the gratitude response was targeted at the author who triggered the response, but that would not always be the case. The thankful one may have just felt deep gratitude that the author never knew about.

Then there is the question of whether the gratitude response actually affected anyone.  The author might be pleased that a reader expressed gratitude, or they might not care, or they might not even see the thank you email.   Even the thankful one might not be affected, if their gratitude response is fully on autopilot (e.g. thanking one for opening a door).

Below is a diagram summarizing all the points we have raised about gratitude. It is essentially an ontology of gratitude. The dotted lines indicate optional links, the solid ones are necessary.
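For the ontologically inclined, the skeleton of that diagram can be sketched in Turtle; all of the names below are illustrative, invented for this post:

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.com/gratitude/> .

ex:Person            a owl:Class .
ex:TriggerEvent      a owl:Class .
ex:GratitudeResponse a owl:Class .

# necessary links (solid lines)
ex:isAwareOf    a owl:ObjectProperty ;
    rdfs:domain ex:Person ;            rdfs:range ex:TriggerEvent .
ex:expressedBy  a owl:ObjectProperty ;
    rdfs:domain ex:GratitudeResponse ; rdfs:range ex:Person .
ex:inResponseTo a owl:ObjectProperty ;
    rdfs:domain ex:GratitudeResponse ; rdfs:range ex:TriggerEvent .

# an optional link (dotted line): the event may have been
# triggered by another person
ex:triggeredBy  a owl:ObjectProperty ;
    rdfs:domain ex:TriggerEvent ;      rdfs:range ex:Person .
```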

So how can we use this ontology of gratitude to pave the road to euphoria? I speculate that the science on gratitude will show that gratitude has to be felt to be most valuable. On Thanksgiving, we often go around the table and say what we are thankful for. But does it really mean anything? Are we just saying it, or do we really feel it?

This Thanksgiving, I am thankful for my creative mind and that I have a job that pays me to do what I love: distilling the essence from a complex web of ideas. It is deeply felt. There, I feel much better already!

Happy Thanksgiving

gist: Buckets, Buckets Everywhere, Who Knows What to Think?

We humans are categorizing machines, which is to say, we like to create metaphorical buckets, and put things inside. But there are different kinds of buckets, and different ways to model them in OWL and gist. The most common bucket represents a kind of thing, such as Person, or Building. Things that go into those buckets are individuals of those kinds, e.g. Albert Einstein, or the particular office building you work in. We represent this kind of bucket as an owl:Class and we use rdf:type to put something into the bucket.

Another kind of bucket is when you have a group of things, like a jury or a deck of cards, that are functionally connected in some way. Those related things go into the bucket (12 members of a jury, or 52 cards). We have a special class in gist called Collection for this kind of bucket. A specific bucket of this sort will be an instance of a subclass of gist:Collection. E.g. OJs_Jury is an instance of the class Jury, a subclass of gist:Collection. We use gist:memberOf to put things into the bucket. Convince yourself that these buckets do not represent a kind of thing: a jury is a kind of thing; a particular jury is not. We would use rdf:type to connect OJ’s jury to the owl:Class Jury, and use gist:memberOf to connect the specific jurors to OJ’s jury.
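In triples, the two different relationships sit side by side (the juror and the ex: namespace are invented for the example):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix gist: <https://ontologies.semanticarts.com/gist/> .
@prefix ex:   <http://example.com/> .

ex:Jury rdfs:subClassOf gist:Collection .      # the kind of thing

ex:_OJs_Jury rdf:type ex:Jury .                # a particular jury

ex:_SheilaWoods gist:memberOf ex:_OJs_Jury .   # a juror goes in with memberOf
```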

We humans are categorizing machines. But there are different kinds of buckets, and different ways to model them in OWL and gist.

 

A third kind of bucket is a tag, which represents a topic and is used to categorize individual items for the purpose of indexing a body of content. For example, the tag “Winter” might be used to index photographs, books and/or YouTube videos. Any content item that depicts or relates to winter in some way should be categorized using this tag. In gist, we represent this in a way that is structurally the same as how we represent buckets that are collections of functionally connected items. The differences are that 1) the bucket is an instance of a subclass of gist:Category rather than of gist:Collection, and 2) we put things into the bucket using gist:categorizedBy rather than gist:memberOf. The Winter tag is essentially a bucket containing all the things that have been indexed or categorized using that tag.
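The tag pattern is structurally parallel to the collection pattern (again, the ex: names are invented):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix gist: <https://ontologies.semanticarts.com/gist/> .
@prefix ex:   <http://example.com/> .

ex:Tag rdfs:subClassOf gist:Category .          # tags are categories

ex:_Winter rdf:type ex:Tag .                    # the Winter tag itself

# a book indexed with the tag
ex:_WinterOfOurDiscontent gist:categorizedBy ex:_Winter .
```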

Below is a summary table showing these different kinds of buckets, and how we represent them in OWL and gist.

| Kind of Bucket | Example | Representing the Bucket | Putting Something in the Bucket |
| --- | --- | --- | --- |
| Individual of a kind | John Doe is a Person | Instance of owl:Class | rdf:type |
| A bucket with functionally connected things inside | Sheila Woods is a member of OJ’s Jury | Instance of a subclass of gist:Collection | gist:memberOf |
| An index term for categorizing content | The book “Winter of our Discontent” has Winter as one of its tags | Instance of a subclass of gist:Category | gist:categorizedBy |