A Semantic Enterprise Architecture

We do enterprise architectures, service-oriented architectures, and semantics. I suppose it was just a matter of time until we put them together. This essay is a first look at what a semantic enterprise architecture might look like.

What problem are we trying to solve?

There are several problems we would like to address with semantic architecture. Some of them are existing problems that are just not solved well enough yet. Even though many organizations intellectually understand what a service-oriented architecture is and what it might do for them, the vast majority remain unmotivated to invest in the journey to migrate in that direction.

Certainly we want the semantic enterprise architecture to address the same sorts of situations that the service-oriented architecture can handle, such as the ability to rapidly adopt new applications; to swap out technologies and applications at will; and to do so with commodity-priced services and technologies. But let's also focus on what additional problems a semantic enterprise architecture could address.

Dark matter and dark energy

As strange as it may seem, there is mounting evidence that the knowable and visible universe (that is, the earth, the sun, the planets, the stars, the galaxies, and all the interplanetary and intergalactic dust) represents some four to five percent of the total amount of "stuff" that there is in the universe. We live in the froth of a giant wave whose presence we can only infer.

What makes up the rest of the universe is what physicists now call dark matter and dark energy. Dark matter makes up some 25% of the mass and energy of the universe and supplies most of the gravity holding galaxies together, as the gravitational attraction of the masses of many stars and black holes is insufficient to do the job on its own.

Rather than the universe's expansion slowing down since the big bang, as the amount of mass and dark matter would suggest, the universe is apparently moving apart at an ever faster rate. The propelling force has been dubbed "dark energy" and it comprises the remaining 70% of all the "stuff" in the universe.

And so it is with our corporate information systems. We fret continuously over our SAP databases or the corporate data warehouse or the integrated database system that we run our company with, but this is very much like the five percent of the universe that we can perceive. It's comfortable to believe that's all there is.

But it just doesn't square with the facts. Rogue applications, such as those built in Microsoft Access, Excel, or FileMaker, are the Dark Matter of an information system. Like the cosmic Dark Matter, in some fashion they are holding an enterprise together, even though most of the time we can't see them. Messages and documents are our Dark Energy equivalent: they are the expansive force in the enterprise. And like Dark Energy in the Universe, they are undetected by the casual observer.

What Does Enterprise Architecture Have to do with Semantics?

We ignore our information dark energy and information dark matter largely because at a corporate level we literally do not understand them. In our corporate databases we've invested decades of effort and typically millions, often tens or hundreds of millions, of dollars in implementation, standardization, training, and documentation in an attempt to arrive at a shared meaning for the corporate information systems. As we'll discuss in other white papers, this has still left a great deal of room for improvement. Indeed, most of the meaning is shared only within an individual application.

Occasionally, corporations invest mightily in ad hoc semantic sharing between applications, under the guise of Systems Integration. But what we're going to talk about today is the information dark matter and information dark energy and bringing them into the light. With the rogue systems, what we need to know is what correspondence, if any, exists between the rogue systems and the approved systems. (By the way, it is usually considerable.) More often than not, a rogue system is populated from either extracts or manual creation of data from approved systems. The rogue system is often created in order to make some additional distinctions or provide additional behavior or extensions that were not possible in the approved system. But this does not mean they don't have a shared root and some shared meaning.

Occasionally, rogue systems are created to deal with some aspect of the corporation that, at least initially, appeared to be completely outside the scope of any existing application. You may have a videotape renting application or a phone number change request application or any of a number of small special-purpose systems. However, if they become successful and if they grow, inevitably they begin to touch aspects of the company that are covered by officially sanctioned systems.

What we want to do with semantics is to find and define where the commonalities lie in such a way that we may be able to take advantage of them in the future. For the unstructured data we have an even bigger challenge. With the rogue system, once we deduce what a column in the Access database means, we have a reasonable prediction of what we are going to find in each change record. This is because the Access database, while it may not be as rigorous as the corporate database, provides structure and validation.

Not so with the unstructured data. With the unstructured data, we need ways to find and organize meaning where every instance may be different. Every email, every memo, every reference document contains different information. The semantic challenge is to find, wherever possible, references in this unstructured data to information that is known at a corporate level. The approach here is almost exactly the opposite: in documents, people rarely refer to what we think of as meta-data or categories or classes or columns or entities -- or anything like that. In documents, people refer to specific instances. They may refer to their order number in an email; they may refer to a set of specific codes in a reference manual; they may refer to a particular procedure in a procedure manual. Our semantic challenge in this case is to find these items, index them, and associate them to the meta-data and even the instances that exist at a corporate level.

Semantic Enterprise Architecture

So what's in the architecture? The Semantic Enterprise Architecture is still primarily based on service-oriented architecture concepts. We want to be able to communicate between largely independent applications and services using well-defined, corporate-standard messages. These messages should be produced and consumed in such a way that allows at least some local change in their structure and syntax without breaking the rest of the architecture.
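One way to picture a consumer that tolerates local change in a message is the sketch below. It is only an illustration: the JSON encoding, the field names (order_id, customer_id, gift_wrap), and the function name are assumptions made for the example, not part of any corporate standard described here. The consumer reads only the fields it depends on and ignores anything else, so a producer can add fields or reorder them without breaking it.

```python
import json

# Hypothetical fields this particular consumer depends on (assumed for the example).
REQUIRED_FIELDS = ("order_id", "customer_id")


def read_order_message(raw: str) -> dict:
    """Pull out only the fields this consumer needs and ignore the rest."""
    message = json.loads(raw)
    missing = [field for field in REQUIRED_FIELDS if field not in message]
    if missing:
        raise ValueError(f"message lacks required fields: {missing}")
    return {field: message[field] for field in REQUIRED_FIELDS}


# The producer later adds a field and reorders the others; the consumer is unaffected.
old_message = '{"order_id": "A-1001", "customer_id": "C-7"}'
new_message = '{"customer_id": "C-7", "order_id": "A-1001", "gift_wrap": true}'
same_result = read_order_message(old_message) == read_order_message(new_message)
```

The design choice is that each consumer states its own minimum requirements rather than validating the whole message, which is what leaves room for local change.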

But we need to go considerably beyond that. We will need a meta-data repository that links the enterprise's shared schema with a more generalized and, at the same time, more precise description of what these things mean. This meta-data repository will be populated by a combination of machine and human inferences from the description of the meta-data that exists in the many dictionaries and documentation bits as well as from the product of data profiling. Data profiling in the corporate systems will tell us not what we intended the data to mean but what, from the way we have actually been using the system, it has come to mean in practice.
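As a rough illustration of what data profiling can reveal, the sketch below summarizes how a single column is actually populated. The column name, the sample values, and the function name are invented for the example; a real profiling tool would examine far more (formats, ranges, cross-column dependencies).

```python
from collections import Counter


def profile_column(values):
    """Summarize how a column is actually used, regardless of its documentation."""
    non_empty = [v for v in values if v not in (None, "")]
    counts = Counter(non_empty)
    return {
        "rows": len(values),
        "filled": len(non_empty),
        "distinct": len(counts),
        "top_values": counts.most_common(3),
    }


# Hypothetical sample: a column documented as "ship_region" that users have
# also been using to flag orders on hold.
sample = ["EAST", "WEST", "HOLD", "HOLD", "", None, "EAST", "HOLD"]
profile = profile_column(sample)
```

Here the most frequent value turns out to be HOLD, which is not a region at all. That gap between the documented meaning and the meaning in use is the kind of finding a person would then record in the meta-data repository.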

This expression of the enterprise meta-data in a rigorous format is just the beginning, the gateway for incorporating our dark energy and dark matter. The rogue systems need to have their meta-data catalogued in a fashion compatible with the enterprise meta-data repository. This will allow us at least to know when the corporate systems may want to refer to the rogue system for additional details. Conversely, it creates at least some hope that the rogue system may have a defined interface to the corporate system and may be informed if things change.
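A minimal sketch of such a catalogue follows, assuming the simplest possible design: every column, corporate or rogue, is tagged with a shared concept identifier. The system names, table and column names, and concept identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnEntry:
    system: str   # owning application, approved or rogue
    table: str
    column: str
    concept: str  # shared concept identifier (hypothetical naming scheme)


# Hypothetical catalogue: one corporate system and one rogue Access database.
CATALOG = [
    ColumnEntry("ERP", "orders", "order_no", "Order.identifier"),
    ColumnEntry("ERP", "orders", "cust_id", "Customer.identifier"),
    ColumnEntry("SalesTracker.accdb", "Deals", "OrderNum", "Order.identifier"),
    ColumnEntry("SalesTracker.accdb", "Deals", "DiscountReason", "Order.discountReason"),
]


def systems_sharing(concept):
    """List every system that holds data about the given concept."""
    return sorted({entry.system for entry in CATALOG if entry.concept == concept})


def affected_by_change(system, column):
    """Find columns in other systems that share a concept with a changed column."""
    concepts = {
        entry.concept
        for entry in CATALOG
        if entry.system == system and entry.column == column
    }
    return [
        entry
        for entry in CATALOG
        if entry.concept in concepts and entry.system != system
    ]
```

With this in place, systems_sharing("Order.identifier") tells the corporate side that the rogue database also carries order data, and affected_by_change("ERP", "order_no") tells the owner of the rogue database which of its columns to revisit when the corporate column changes.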

The unstructured data will be incorporated using technologies that already exist, including text interpretation, first to find any specific nouns or events that are called out in the unstructured data. Using this information, the unstructured data can then be cross-referenced to instances and, by extension, to the entire enterprise data network.
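The simplest form of this cross-referencing can be sketched with pattern matching alone. The sketch assumes a hypothetical order-number format (PO- followed by five digits) and a hypothetical set of known corporate orders; real text interpretation would use much richer techniques than a regular expression.

```python
import re

# Assumed order-number format, invented for this example.
ORDER_PATTERN = re.compile(r"\bPO-\d{5}\b")

# Hypothetical instances already known at the corporate level.
KNOWN_ORDERS = {
    "PO-10442": {"customer": "Acme"},
    "PO-10517": {"customer": "Globex"},
}


def cross_reference(doc_id, text, index):
    """Record which documents mention which known corporate orders."""
    for reference in sorted(set(ORDER_PATTERN.findall(text))):
        if reference in KNOWN_ORDERS:  # skip anything the enterprise does not recognize
            index.setdefault(reference, []).append(doc_id)
    return index


index = {}
cross_reference("email-1", "Where is PO-10442? I also asked about PO-99999.", index)
cross_reference("memo-7", "PO-10442 and PO-10517 ship together.", index)
```

After these two calls the index links each known order to the documents that mention it, so someone looking at an order in the corporate system can follow it out to the related emails and memos.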

How to Get Started

This all sounds a bit incredible. And the endgame is likely a ways off. But we don't have to go to the endgame for this to be useful. As Jim Hendler says, "A little bit of semantics goes a long way." Even if we only pick a few topics to index in our meta-data repository, and even if we choose a few well-known rogue applications to cross-reference, and even if we only grab the low hanging fruit from our unstructured data, as many companies are already doing, we will still see considerable benefit.

Many content management and some knowledge management projects are aimed at using humans to perform this style of indexing on the reference documents that we often use in our organizations. But with a little extension this can go considerably farther. As it is, such a project is generally an island unto itself. But as we're proposing here, the island can be extended and incorporated into the broader enterprise landscape.

Concluding Thoughts

We've barely scratched the surface here. However, many of the technologies needed to make this work already exist and have been proven in isolated settings. What is needed now is companies willing to invest in the research and infrastructure required to profitably bring the other 95% of their information into their enterprise architecture.