White Paper: What is Semantic Profiling?

Semantic Profiling

In this paper we will describe an approach to understanding the data in an existing system through a process called semantic profiling.

What is semantic profiling?

Semantic profiling is a technique that uses semantic-based tools and ontologies to gain a deeper understanding of the information being stored and manipulated in an existing system. It makes the analysis more systematic and rigorous, and it produces a result that can be correlated with profiling efforts in other applications.

Why would you want to do semantic profiling?

The immediate motivation for semantic profiling is typically a system integration effort, a data conversion effort, a new data warehousing project, or, more recently, a desire to use some form of federated query to pull together enterprise-wide information. Any of these may be the initial motivator, but the question remains: why do semantic profiling rather than one of the other available techniques? To answer that, let’s look at each of the commonly employed techniques:

Analysis. By far the most common strategy is some form of “analysis.” This usually means studying existing documentation and interviewing users and developers about how the current system works and what data it contains. From this, the specification for the extraction or interface logic is designed. This approach, while popular, is fraught with problems. The most significant is that what the documentation says, and what the users and developers think or remember, is very often not a high-fidelity representation of what is actually found when one looks deeper.
Legacy understanding. The legacy understanding approach is to examine the source code of the system that maintains the current data and, from that source code, deduce the rules being applied to the data. This can be done by hand for relatively small applications. We have done it with custom analysis tools in some cases, and there are commercial products from companies like Relativity and Merant that automate the process. The strength of this approach is that it makes explicit some of what was implicit, and it is far more authoritative than the documentation. The code is what is actually implemented; the documentation is someone’s interpretation of either what should have been done or what they believe was done. While legacy understanding can be helpful, it is generally expensive and time-consuming, and it still gives only a partial answer, because most applications have many fields with relatively little system enforcement of data values. Most text fields, and many date fields and the like, have very little system-enforced validation. Over time, users adapt their usage and refine their procedures to fill in the semantics the system is missing. It should be noted, though, that the larger the user base, the more useful legacy understanding is. With a large user base, reliance on informal convention becomes less and less likely, because at that scale users would have had to institutionalize their conventions, which usually means system changes.
Data profiling. Data profiling is a technique that has been popularized by vendors of data profiling software such as Evoke, Ascential and Firstlogic. The process relies on reviewing the existing data to uncover anomalies in the databases. These tools can be incredibly useful in finding areas where the content of the existing system is not what we would have expected it to be. Indeed, the popularity of these tools stems largely from the almost universal surprise when people are shown the content of their existing databases: they were convinced the databases held only clean, scrubbed data of high integrity, only to find a large number of irregularities. While we find data profiling very useful, we find that it doesn’t go far enough. In this paper we’ll outline a procedure that builds on data profiling and takes it further.

So how is semantic profiling different?

The first difference is that semantic profiling is more rigorous. We will get into exactly why in the section on how to do semantic profiling, but the primary difference is that with data profiling you can search for and catalog as many or as few anomalies as you like. After you’ve found and investigated five strange circumstances in a database, you can stop. Data profiling is primarily an aid to doing other things, and as such you can take it as far as you want. With semantic profiling, once you select a domain of study you are pretty much committed to taking it “to ground.”

The second main difference is that the results are reusable. Once you’ve done a semantic profile on one system, the results remain available when you profile another system and can be combined with the results from that second system. This is extremely useful in environments where you are attempting to draw information from multiple sources into one definitive source, whether for data warehousing or for EII (Enterprise Information Integration).

Finally, the semantic profiling approach sets up a series of testable hypotheses that can be used to monitor a system as it continues in production, in order to detect semantic drift.

What you’ll need

For this exercise you will need the following materials:

A database to be studied, with live or nearly live data. You can’t do this exercise with developer-created test data.
Data profiling software. Any of the major vendors’ products will be suitable for this. It is possible to roll your own, although this can be a fairly time-consuming exercise.
A binding to your database available to the data profiling software. If your database is in a traditional relational form with an ODBC or JDBC access capability then that’s all you need. If your data is in some more exotic format you will need an adapter.
Meta-data. You will need access to as much as you can find about the official meta-data for the fields under study. This may be in a data dictionary, it may be in a repository, it may be in the copy books; you may have to search around a bit for it.
An ontology editor. You will be constructing an ontology based on what you find in the actual data. There are a number of good ontology editors; for our purposes Protégé, a freely available editor from Stanford, should be adequate in most cases.
An inferencing engine. While there are many proprietary inferencing engines, we strongly advocate adopting one based on the recent standards RDF and OWL. There are open and freely available implementations, such as OpenRDF or Kowari.
A core ontology. The final ingredient is a starting-point ontology that you will use to define concepts as you uncover them in your database. For some applications this may be an industry reference data model, such as HL7 for health care. However, we advocate the use of what we call the semantic primes as the initial starting point. We’ll cover the semantic primes in another white paper or perhaps in a book. For now, it is enough to say that they are a relatively small number of primitive concepts that are very useful in clarifying your thinking about other concepts.
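
If you do decide to roll your own profiler rather than use a vendor product, the core of it can be small. The Python sketch below is purely illustrative: the function name profile_column, the MMDDYYYY format string, and the sample values are our assumptions for the example, not features of any particular profiling product. It summarizes a single column by counting blanks, distinct values, and how many values actually parse as dates.

```python
from collections import Counter
from datetime import datetime


def profile_column(values, date_format="%m%d%Y"):
    """Summarize one column: blanks, distinct values, and date-parseable values."""
    # Treat None and whitespace-only strings as blank.
    non_blank = [str(v).strip() for v in values
                 if v is not None and str(v).strip() != ""]
    parseable = 0
    for v in non_blank:
        try:
            datetime.strptime(v, date_format)
            parseable += 1
        except ValueError:
            pass  # not a valid date under the assumed mask
    return {
        "total": len(values),
        "blank": len(values) - len(non_blank),
        "distinct": len(set(non_blank)),
        "parse_as_date": parseable,
        "most_common": Counter(non_blank).most_common(3),
    }


# Hypothetical sample of raw values pulled from one date column.
sample = ["01152004", "12312003", "00000000", "", None, "N/A", "01152004"]
summary = profile_column(sample)
```

A real profiler would add many more measures (value ranges, patterns, cross-column dependencies), which is where the time goes if you build it yourself.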

How to proceed

Overall, this process is one of forming and testing hypotheses about the semantics of the information in the extant database.

The hypotheses concern both the fidelity and the precision of the definitions of the items, and they also serve to uncover and define the many hidden subtypes that lurk in any given system.

This business of “running to ground” means that we will continue the process until every data item is unambiguously defined and all variations and subtypes have been identified and also unambiguously defined.

The process begins with some fairly simple hypotheses about the data, hypotheses that can be gleaned directly from the meta-data. Let’s say we notice in the data dictionary that BB104 has a data type of date, or even that it has a mask of MMDDYYYY. We hypothesize that it is a date. Further, in our case, our semantic prime ontology forces us to choose between a historical date and a planned date. We select historical and add this assertion to our ontology: BB104 is of type historical date.

We then run the data profiling and find a variety of exceptions. We find that some of the “historical dates” are in the future. Depending on the number of future dates and other contextual clues, we may decide either that our initial assignment was incorrect and these are actually planned dates (some of which are in the past because the plans were made in the past), or that most of these dates are historical dates but some records in this database are of a different type.

Additionally, we find that some of these dates are not dates at all. This begins an investigation to determine whether there is a systemic pattern to the non-date values. In other words, is there a value in field BB101, BB102, or BB103 that correlates with the non-date values? And if so, does this indicate a different subtype of record for which we don’t need a date?
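
The BB104 hypothesis test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: records are plain dictionaries, the date mask is MMDDYYYY, and the helper names (parse_date, check_historical_date, correlate) and the sample records are hypothetical. In practice the profiling tool would do this work against the live database.

```python
from collections import Counter
from datetime import date, datetime


def parse_date(raw, fmt="%m%d%Y"):
    """Return a date if the raw value matches the mask, otherwise None."""
    try:
        return datetime.strptime(str(raw).strip(), fmt).date()
    except ValueError:
        return None


def check_historical_date(records, field, as_of):
    """Test the hypothesis that `field` holds a historical date."""
    result = {"ok": [], "future": [], "not_a_date": []}
    for rec in records:
        parsed = parse_date(rec.get(field, ""))
        if parsed is None:
            result["not_a_date"].append(rec)
        elif parsed > as_of:
            result["future"].append(rec)  # contradicts "historical"
        else:
            result["ok"].append(rec)
    return result


def correlate(records, exceptions, candidate_fields):
    """Per candidate field value: (exception count, total count)."""
    out = {}
    for name in candidate_fields:
        totals = Counter(rec.get(name) for rec in records)
        hits = Counter(rec.get(name) for rec in exceptions)
        out[name] = {value: (hits[value], totals[value]) for value in totals}
    return out


# Hypothetical sample records; field names follow the text.
records = [
    {"BB101": "H", "BB104": "03152001"},
    {"BB101": "H", "BB104": "11302003"},
    {"BB101": "P", "BB104": "06012030"},
    {"BB101": "X", "BB104": "00000000"},
    {"BB101": "X", "BB104": "        "},
]
outcome = check_historical_date(records, "BB104", as_of=date(2005, 1, 1))
pattern = correlate(records, outcome["not_a_date"], ["BB101"])
```

In this made-up sample, every non-date value falls on records where BB101 is X, which is exactly the kind of correlation that suggests a hidden subtype rather than random error.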

In some cases we will uncover errors that are just pure errors. We have found that in some cases data validation rules changed over time, leaving older records with different and anomalous values. In other cases people have, on an exception basis, used system-level utilities to “repair” data records, and these repairs sometimes create strange circumstances of their own. Where we uncover what are finally determined to be genuine errors, rather than defining them semantically we should create a punch list for correcting them and, if possible or necessary, for correcting their cause.

Meanwhile, back to the profiling exercise. As we discover subtypes with different constraints on their date values, we introduce these into the ontology we’re building. To do this, as we document our date, we need to qualify it further. What is it the date of? For instance, if we determine that it is, in fact, a historical date, what event was recorded on that date? As we hypothesize and deduce this, we add the event to the ontology, along with the information that this BB104 date is the “occurred on” date for the event we described.

If we find that the database has some records with legitimate historical dates and others with future dates, and we find some correlation with another value, we hypothesize that there are two types of historical events, or perhaps some historical events mixed with some planned events or planned activities. We then define these as separate concepts in the ontology, each with a predicate that defines eligibility for the class. To make it simple: if we found that BB101 had one of two values, either P or H, we might hypothesize that H meant historical and P meant planned, and we would say that the inclusion criterion for planned events is that the value of BB101 equals P. This is a testable hypothesis.

At some point the ontology becomes rich enough to begin its own interpretation. We load the data as instances into the RDF inferencer, either directly from the database or, more likely, from the profiling tool. The inferencing engine itself can then challenge class assignments, detect inconsistent property values, and so on. We proceed in this fashion until we have unambiguously defined all the semantics of all the data in the area under study.
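
To make the idea of inclusion criteria concrete, here is a small Python sketch. It is not an RDF/OWL inferencer; it is a plain-Python stand-in, under our own assumptions, for the kind of checks such an engine performs. The class names HistoricalEvent and PlannedEvent, the CLASSES table, the audit function, and the sample records are all hypothetical. Each class carries a membership predicate on BB101 and a constraint on the BB104 date.

```python
from datetime import date, datetime


def parse_date(raw, fmt="%m%d%Y"):
    """Return a date if the raw value matches the mask, otherwise None."""
    try:
        return datetime.strptime(str(raw).strip(), fmt).date()
    except ValueError:
        return None


# Hypothesized classes: an inclusion predicate plus a date constraint.
CLASSES = {
    "HistoricalEvent": {
        "includes": lambda rec: rec.get("BB101") == "H",
        # An event that occurred cannot be dated after the profiling date.
        "date_ok": lambda d, as_of: d is not None and d <= as_of,
    },
    "PlannedEvent": {
        "includes": lambda rec: rec.get("BB101") == "P",
        # A plan needs a date, but the date may now be in the past.
        "date_ok": lambda d, as_of: d is not None,
    },
}


def classify(rec):
    """Names of all classes whose inclusion predicate the record satisfies."""
    return [name for name, cls in CLASSES.items() if cls["includes"](rec)]


def audit(records, as_of):
    """Report records that the current hypotheses fail to explain."""
    findings = {"unclassified": [], "ambiguous": [], "inconsistent": []}
    for rec in records:
        classes = classify(rec)
        if not classes:
            findings["unclassified"].append(rec["id"])
            continue
        if len(classes) > 1:
            findings["ambiguous"].append(rec["id"])
        parsed = parse_date(rec["BB104"])
        for name in classes:
            if not CLASSES[name]["date_ok"](parsed, as_of):
                findings["inconsistent"].append((rec["id"], name))
    return findings


# Hypothetical sample records.
records = [
    {"id": 1, "BB101": "H", "BB104": "03152001"},
    {"id": 2, "BB101": "H", "BB104": "06012030"},
    {"id": 3, "BB101": "P", "BB104": "06012030"},
    {"id": 4, "BB101": "P", "BB104": "11302003"},
    {"id": 5, "BB101": "Z", "BB104": "00000000"},
]
findings = audit(records, as_of=date(2005, 1, 1))
```

Every entry in the findings is either a genuine error for the punch list or evidence that the hypotheses need another subtype. “Running to ground” means repeating the cycle until the findings are empty or fully explained.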

Conclusion

Having done this, what do you have? You have an unambiguous description of the data as it exists and a set of hypotheses against which you can test any new data to determine whether it agrees or not. More simply, you know exactly what you have in your database should you need to perform a conversion or a system integration.

More interestingly, you also have the basis for a set of rules for a combined integration of data from many sources. You would know, for instance, that you need to apply a predicate to records from certain databases to exclude those that do not match the semantic criteria used by the other systems. Say you want to get a “single view of the customer.” Of all the many records in all your many systems that state or allude to customer data, you will need to know which ones really are customers and which are channel partners, prospects, or various other parties that happen to be included in some file. You need a way to define that unambiguously, or system-wide integration efforts are going to fall flat.

We believe this to be the only rigorous and complete approach to this problem. While it is somewhat complex and time-consuming, it delivers reasonable value and contributes to the predictability of other efforts, which are often incredibly unpredictable.
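
As a closing illustration of the “single view of the customer” point, the sketch below applies a different inclusion predicate to each source system before combining records. Everything specific here is invented for the example: the source names (billing, crm), the field names (ACCT_TYPE, STAGE, ROLE), and the code values are assumptions, standing in for whatever predicates your own semantic profiles produce.

```python
# Hypothetical per-source predicates derived from each system's profile.
SOURCE_PREDICATES = {
    # In the assumed billing system, account type "C" marks a customer.
    "billing": lambda rec: rec.get("ACCT_TYPE") == "C",
    # In the assumed CRM, only won deals that are not partners count.
    "crm": lambda rec: rec.get("STAGE") == "WON" and rec.get("ROLE") != "PARTNER",
}


def select_customers(sources):
    """Keep only the records that satisfy their own source's predicate."""
    selected = []
    for source_name, records in sources.items():
        is_customer = SOURCE_PREDICATES[source_name]
        for rec in records:
            if is_customer(rec):
                selected.append((source_name, rec["NAME"]))
    return selected


# Hypothetical sample data from two systems.
sources = {
    "billing": [
        {"NAME": "Acme", "ACCT_TYPE": "C"},
        {"NAME": "Reseller Co", "ACCT_TYPE": "R"},
    ],
    "crm": [
        {"NAME": "Acme", "STAGE": "WON", "ROLE": "BUYER"},
        {"NAME": "Globex", "STAGE": "PROSPECT", "ROLE": "BUYER"},
        {"NAME": "Reseller Co", "STAGE": "WON", "ROLE": "PARTNER"},
    ],
}
customers = select_customers(sources)
```

The point of the sketch is only that the predicates differ by source; matching the surviving records to one another is a separate problem.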

Written by Dave McComb