White Paper: Veracity

Encarta defines veracity as “the truth, accuracy or precision of something” and that seems like a pretty good place to start.

Our systems don’t model uncertainty very well, and yet uncertainty is exactly what we deal with on a day-to-day basis. This paper examines one aspect of modeling uncertainty, namely veracity, and begins a dialog on how to represent it.

Veracity

In our case we will primarily be dealing with whether a symbolic representation of something faithfully represents the corresponding item in the real world. We are primarily dealing with these main artifacts of systems:

  • Measurements – is the measurement recorded in the system an accurate reflection of what it was meant to measure in the real
    world?
  • Events – do the events recorded in the system accurately record what really happened?
  • Relationships – do the relationships as represented in the system accurately reflect the state of affairs in the world?
  • Categorization – are the categories that we have assigned things to useful and defensible?
  • Cause – do our implied notions of causality really bear out in the world? (This also includes predictions and hypotheses.)

Only the first of these has ever received systematic attention. Fuzzy numbers are one way of representing uncertainty in measurements, as are “interval math” and the uncertainty calculations used in chemistry (2.034 ± 0.005, for instance).
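To make the idea concrete, here is a minimal sketch (in Python, purely for illustration) of the interval-math style of representation, in which a measurement is carried as a range rather than a single point:

```python
class Interval:
    """A measured value carried as a [low, high] range rather than a point."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def __add__(self, other):
        # Adding two uncertain quantities widens the uncertainty of the result.
        return Interval(self.low + other.low, self.high + other.high)

    def __repr__(self):
        return f"[{self.low}, {self.high}]"


# The chemistry example: 2.034 +/- 0.005 becomes the range [2.029, 2.039].
mass = Interval(2.029, 2.039)
tare = Interval(0.100, 0.102)
total = mass + tare
print(total)
```

Fuzzy numbers generalize this further by attaching a membership function to the range instead of hard endpoints.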

But in business systems, all of these are recorded as if we were certain of them, and then as events unfold, we may eventually decide not only that we are not certain, but that we are certain of the opposite conclusion. We record an event as if it occurred, and until we have proof that it didn’t, we believe that it did.


White Paper: International Conference on Service Oriented Computing

In this write up I’ll try to capture the tone of the conference, what seemed to be important and what some of the more interesting presentations were.

This was the first ever Conference on Service Oriented Computing.  In some ways it was reminiscent of the first Object Oriented conference (OOPSLA in 1986): highly biased toward academic and research topics, while at the same time shining a light on the issues that are likely to face the industry over the next decade.

Why Trento?

Apparently, a year and a half ago several researchers in Service Oriented Computing began planning an Italian conference on the subject, and it grew into an international conference. Trento was an interesting, but logistically difficult, choice.  Trento is in the Dolomite region of the Italian Alps and is difficult even for Europeans to get to.  It is a charming university town, founded in Roman times, with a rich history through the Middle Ages.  The town is a beautiful blend of old and new and very pedestrian friendly; large cobblestone courtyards can be found every few blocks, usually adjoining a renaissance building or two.  We took a side trip one hour further up the Alps to Bolzano, and saw Ötzi, “the ice man.”

This conference had some of the best after-hours arrangements of any I’ve attended: one night we got a guided tour of “the castle,” followed by a brief speech from the vice mayor and wine and dinner-sized hors d’oeuvres.  The final night was a tour of Ferrari Spumante, the leading producer of Italian Champagne, with a five or six course sit-down dinner.

Attendees & Presenters

There were about 140 attendees, at least a third of whom were also presenters. All but eight were from academia, and we were among the six from North America.  Next year’s venue will be New York City in mid November, which should change the nature and size of the audience considerably.

The opening keynote was by Peter Diry, who is in charge of a large European government research fund that is sinking billions into advanced technology research topics.  There was a great deal of interest in this, as I suspect many of the attendees’ bread was buttered, directly or indirectly, by these funds.  Bertrand Meyer gave the pre-dinner keynote the night of the formal dinner.  He had a very provocative talk on the constructs that are needed to manage distributed concurrency (we’ve managed to avoid most of this in our designs, but you could certainly see how in some designs this could be a main issue).  Frank Heyman from IBM gave the final keynote, which was primarily about how all this fits into Grid computing and open standards.

The 37 major presenters, plus 10 who gave informal talks at a wine and cheese event, were chosen from 140 submissions.  Apparently many of these people are leading lights in the research side of this discipline, although I had never heard of any of them. In addition, there were two half-day tutorials on the first day. Presentations were in English, although often highly accented English.

General Topics

It was a bit curious that the conference was “Service Oriented Computing” and not “Service Oriented Architecture” as we usually hear it; the difference marked some subtle and interesting distinctions.  This was far more about Web services than EAI or Message Oriented Middleware, and the presenters were far more interested in Internet-scale problems than enterprise issues.

Some of the main themes that recurred throughout the conference were service discovery and composition, security, P2P and grid issues, and Quality of Service.  Everyone has pretty much accepted WSDL and BPEL4WS (which everyone just calls “bee pell”) as the de facto technologies that will be used.  There was some discussion and reference to the Semantic Web technologies (RDF, DAML-S and OWL).  The presenters seemed to be pretty consistent on the difference between Orchestration and Choreography (more later).

There was a lot of talk about dynamic composition, but when you probed a bit, there was not much agreement as to how far it was likely to go or when the dynamic part was likely to occur.

Things clarified for me

Several things weren’t necessarily presented in a single talk, but in combination and context became clearer to me.  Many people may have already arrived at these observations, but for the sake of those who haven’t:

Virtualization

In much the same way that SAN and NAS virtualized storage (that is, removed the user’s specific knowledge of where the data was being stored), SOC is meant to virtualize functionality.  This is really the grid angle of Service Oriented Computing.  There were a few people there who noted that, unlike application servers or web servers, it will not be as easy to virtualize “stateful” services.

Service Discovery

Most of the discussion about service discovery concerned design-time discovery, although there were some who felt that using the UDDI registry in an interactive mode constituted run-time discovery.  Many approaches to aid the discovery process were described.

Capabilities

There was pretty widespread agreement that WSDL’s matching of signatures was not enough.  Getting beyond that was called several different things, and there were several different approaches to it.  One of the terms used was “capabilities”: in other words, how can we structure a spec that describes the capability of the service?  This means finding a way to describe how the state of the caller and the called objects are changed, as well as noting side effects (intentional and otherwise).
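As a sketch of the idea (the field names below are my own invention, not any proposed standard), a capability description that goes beyond a bare signature might carry pre- and post-conditions, side effects, and delivery constraints:

```python
from dataclasses import dataclass, field


@dataclass
class Capability:
    """A service description richer than a bare signature (illustrative only)."""
    operation: str
    inputs: dict                                        # parameter name -> type
    outputs: dict
    preconditions: list = field(default_factory=list)   # rules that must hold before the call
    postconditions: list = field(default_factory=list)  # how caller/called state changes
    side_effects: list = field(default_factory=list)    # changes outside the signature
    constraints: dict = field(default_factory=dict)     # location, time, manner of delivery


# A hypothetical inventory service described this way:
reserve = Capability(
    operation="reserveInventory",
    inputs={"sku": "string", "qty": "int"},
    outputs={"reservationId": "string"},
    preconditions=["qty > 0", "sku exists in catalog"],
    postconditions=["available stock reduced by qty"],
    side_effects=["audit record written"],
    constraints={"delivery": "synchronous", "region": "EU"},
)
```

A discovery process could then match on these richer descriptions rather than on parameter types alone.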

Binding

Frank Heyman from IBM made the point that WSDL is really about describing the binding between “port types” (what the service is constructed to deal with) and specific ports (what it gets attached to).  While the default binding is of course SOAP, he had several examples and could show that the binding was no more complex for JMS, J2EE, or even CICS Comm Region bindings.

Orchestration and Choreography

The tutorial clarified, and subsequent presentations seemed to agree, that Orchestration is what you do in machine time.  It is BPEL.  It is a unit of composition.  It is message routing, primarily synchronous.  While the tools that are good for Orchestration could be used for Choreography, that’s not using each tool to its strength.

Choreography involves coordination, usually over time.  So when you have multiple organizations involved, you often have Choreography issues.  Same with having other people in the loop.  Most of what we currently think of as work flow will be subsumed into this choreography category.

Specific Talks of Note

Capabilities: Describing What Web Services Can Do – Phillipa Oaks, et al., Queensland University

This paper gets at the need to model what a service does if we are to have any hope of “discovering” them either at design time or run time.  They had a meta model that expanded the signature based description to include rules such as pre and post conditions as well as effects on items not in the signature.  It also allowed for location, time and manner of delivery constraints.

Service Based Distributed Querying on the Grid, Alpdemir, et al University of Manchester

I didn’t see this presentation, but after reading the paper wish I had.  They outline the issues involved with setting up distributed queries, and outline using the OGSA (Open Grid Service Architecture) and OGSI (Open Grid Services Infrastructure).  They got into how to set up an architecture for managing distributed queries, and then into issues such as setting up and optimizing query plans in a distributed environment.

Single Sign on for Service Based Computing, Kurt Geihs, et al., Berlin University of Technology

The presentation was given by Robert Kalchlosch (one of the co-authors).  One of the best values for me was a good overview of Microsoft Passport and the Liberty Alliance, especially in regard to what cooperating services need to do to work with these standards.  This paper took the position that it may be more economical to leave services as they are and wrap them with a service/broker that handles security, and especially the single sign on aspect.

Semantic Structure Matching for Assessing Web-Service Similarity, Yiqiao Wang, et al., University of Alberta

This talk covered issues and problems in using semantics (RDF) in service discovery.  They noted that a simple semantic match was not of much use, but that by coupling WordNet similarity with structural similarity they were able to get high-value matching in discovery.

“Everything Personal, not Just Business” Improving User Experience through Rule-Based Service Customization, Richard Hull, et al., Bell Labs

Richard Hull wrote one of the seminal works in Semantic Modeling, so I was hoping to meet him.  Unfortunately he didn’t make it and sent a tape of his presentation instead.  The context was: if people had devices that revealed their geographic location, what sort of rules would they like to set up about whom they would make this information available to?  One of the things of interest to us was their evaluation, and then dismissal, of general purpose constraint-solving rule engines (like ILOG) for performance reasons.  They had some statistics showing very impressive performance on their rule evaluation.

Conclusion

The first ever Conference on Service Oriented Computing was a good one; it provided a great deal of food for thought and ideas about where this industry is headed in the medium term.

Written by Dave McComb

White Paper: How Service Oriented Architecture is Changing the Balance of Power Between Information Systems Line and Staff

As service oriented architecture (SOA) begins to become widely adopted through organizations, there will be major dislocations in the balance of power and control within IS organizations.

In this paper, when we refer to information systems (IS) line functions, we mean those functions that are primarily aligned with the line-of-business systems, especially development and maintenance. When we refer to the IS staff functions, we’re referring to functions that maintain control over the shared aspects of the IS structure, such as database administration, technology implementation, networks, etc.

What is Service Oriented Architecture?

Service oriented architecture is primarily a different way to arrange the major components of an information system.  There are many technologies necessary to implement an SOA, and we will touch on them briefly here; but the important distinction for most enterprises will be that the exemplar implementations of SOA will involve major changes in boundaries between systems and in how systems communicate.

In the past, when companies wished to integrate their applications, they either attempted to put multiple applications on a single database or wrote individual interfacing programs to connect one application to another.  The SOA approach says that all communication between applications will be done through a shared message bus and it will be done in messages that are not application-specific.  This definition is a bit extreme for some people, especially those who are just beginning their foray into SOA, but this is the end result for the companies who wish to enjoy the benefit that this new approach promises.

A message is an XML document or transaction that has been defined at the enterprise level and represents a unit of business functionality that can be exchanged between systems.  For instance, a purchase order could be expressed as an XML document and sent between the system that originated it, such as a purchasing system, and a system that was interested in it, perhaps an inventory system.
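As a sketch, such a message might be built like this (the element names are invented for illustration; a real enterprise would define them in a shared schema):

```python
import xml.etree.ElementTree as ET

# Illustrative only: a purchase order expressed as an XML message that a
# purchasing system could place on the bus and an inventory system consume.
po = ET.Element("PurchaseOrder", id="PO-1001")
ET.SubElement(po, "Vendor").text = "Acme Supply"
line = ET.SubElement(po, "LineItem", sku="WID-42")
ET.SubElement(line, "Quantity").text = "12"

# Serialize to the text that actually travels over the message bus.
message = ET.tostring(po, encoding="unicode")
print(message)
```

The key point is that the structure is defined at the enterprise level, not by either of the two applications exchanging it.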

The message bus is implemented in a set of technologies that ensure that the producers and consumers of these messages are not talking directly to each other.  The message bus mediates the communication in much the same way as the bus within a personal computer mediates communication between the various subcomponents.
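A toy publish/subscribe sketch can illustrate the mediation: producers publish a message type to the bus and never learn who, if anyone, consumes it (this is of course a drastic simplification of real middleware):

```python
from collections import defaultdict


class MessageBus:
    """Toy bus: producers publish by message type and never see consumers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, message_type, handler):
        self._subscribers[message_type].append(handler)

    def publish(self, message_type, payload):
        # The bus, not the producer, decides which consumers are invoked.
        for handler in self._subscribers[message_type]:
            handler(payload)


bus = MessageBus()
received = []
bus.subscribe("PurchaseOrder", received.append)   # e.g. an inventory system
bus.publish("PurchaseOrder", {"id": "PO-1001"})   # e.g. a purchasing system
print(received)
```

Because neither side holds a reference to the other, either can be replaced without the other noticing, which is the point of the bus analogy.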

The net result of these changes is that functionality can be implemented once, put on the message bus, and subsequently used by other applications.  For instance, logic that was once replicated in every application (such as production of outbound correspondence, collection on receivables, workflow routing, management of security and entitlements), as well as functionality that has not existed because of a lack of a place to put it (such as enterprise-wide cross-referencing of customers and vendors), can now be implemented only once.  However, in order to achieve the benefits from this type of arrangement, we are going to have to make some very fundamental changes to the way responsibilities are coordinated in the building and maintaining of systems.

Web Services and SOA

Many people have confused SOA with Web services.  This is understandable, as both deal with communications between applications and services over a network using XML messages.  The difference is that Web services is a technology choice; it is a protocol for the API (application programming interface).  A service oriented architecture is not a technology but an overall way of dividing up the responsibilities between applications and having them communicate.  So, while it is possible to implement an SOA using Web services technology, this is not the only option.  Many people have used message oriented middleware, enterprise application integration technologies, and message brokers to achieve the same end.  More importantly, merely implementing Web services in a default mode will not result in a service oriented architecture.  It will result in a number of point-to-point connections between applications, just using the newest technology.

Now let’s look at the organizational dynamics that are involved in building and maintaining applications within an enterprise.

The Current Balance of Power

In most IS organizations, what has evolved over the last decade or so is a balance of power between the line organizations and the staff organizations that looks something like the following.

In the beginning, the line organizations had all the budget, all the power, and all the control.  They pretty much still do.  The reason they have the budget and the power is that it’s the line organization that has been employed to solve specific business problems.  Each business problem brings with it a return-on-investment analysis which specifies what functionality is needed to solve that particular problem.  Typically, each business owner or sponsor has not been very interested in, or motivated toward, spending any more money than needed in order to solve anyone else’s problem.

However, somewhere along the line some of the central IS staff noticed that solving similar problems over and over again, arbitrarily differently, was dis-economic to the enterprise as a whole.  Through a long process of cajoling and negotiating, they managed to wrest some control of some of the infrastructure components of the applications from the line personnel.  Typically, the conversations went something like, “I can’t believe this project went out and bought their own database management system and paid a whole bunch of money when we already have one which would’ve worked just fine!”  Through this process, the staff groups eventually gained at least some degree of control over such things as choice of operating systems, database management systems, middleware and, in some cases, programming languages.  They also very often had a great deal of influence on, or at least coordination of, data models, data naming standards, and the like.  So what has evolved is a sort of happy peace where the central groups can dictate the technical environment and some of the data considerations, while the application groups are free to do pretty much as they will with the scope of their application, its functionality, and its interfaces to other applications.

For much the same reason, the decentralization of these decisions leads to dis-economic behavior; however, it is not quite as obvious, because the corporation is not shelling out for another license for another database management system that isn’t necessary.

The New World Order

In the New World, the very things that the line function had most control of, namely the scope, functionality, and interfaces of its applications, will move into the province of the staff organization.  In order to get the economic benefit of the service oriented architecture, the main thing that has to be determined centrally for the enterprise as a whole is: what is the scope of each application and service, and what interfaces is it required to provide to others?

In most organizations, this will not go down easily.  There’s a great deal of inertia and control built up over many years with the current arrangement.  Senior IS management is going to have to realize that this change needs to take place and may well have to intervene at some fairly low levels.  As Clayton Christensen stated in his recent book The Innovator’s Solution, the strategic direction that an enterprise or department takes doesn’t matter nearly as much as whether they can get agreement from the day-to-day decision makers who are allocating resources and setting short-term goals.  For most organizations, this will require a two-pronged attack.  On one hand, the senior IS management, and especially the staff function management, will have to partner more closely with the business units that are sponsoring the individual projects.  Part of this partnering will be to educate the sponsors on the economic benefits that will accrue to the applications that adhere to the architectural guidelines.  While at first this sounds like a difficult thing to convince them of, the economic benefits in most cases are quite compelling.  Not only are there benefits to be had on the initial project, but the real benefit for the business owner is that this approach can be demonstrated to lead to much greater flexibility, which is ultimately what the business owner wants.  This is really a governance issue, but we need to be careful not to confuse the essence of governance with the bureaucracy that it often entails.

The second prong of the two-pronged approach is to put a great deal of thought into how project managers and team leads are rewarded for “doing the right thing.”  In most organizations, regardless of what is said, most rewards go to the project managers who deliver the promised functionality on time and on budget.  It is up to IS management to add to these worthwhile goals equivalent goals aimed at contributing to and complying with the newer, flexible architecture, such that a project that goes off and does its own thing will be seen as a renegade, and that, regardless of hitting its short-term budgets, its project managers will not be given accolades but will instead be asked to try harder next time.  Each culture, of course, has to find its own way in terms of its reward structure, but this is the essential issue to be dealt with.

Finally, and by a funny coincidence, the issues that were paramount to the central group, such as choice of operating system, database, programming language, and the like, are now very secondary considerations.  It’s quite conceivable that a given project or service will find that acquiring an appliance running on a completely different operating system and database management system can be far more cost-effective, even when you consider the overhead costs of managing the additional technologies.  This difference comes from two sources.  First, in many cases, the provider of the service will also provide all the administrative support for the service and its infrastructure, effectively negating any additional cost involved in managing the extra infrastructure.  Second, the service oriented architecture implementation technologies shield the rest of the enterprise from being aware of what technology, language, operating system, and DBMS are being used, so the decision does not have the secondary side effects that it does in pre-SOA architectures.

Conclusion

To wrap up, the move to service oriented architecture is not going to be a simple transition or one that can be accomplished by merely acquiring products and implementing a new architecture.  It is going to be accompanied by an inversion in the traditional control relationship between line and staff IS functions.

In the past, the business units and the application teams they funded determined the scope and functionality of the projects, while the central IS groups determined technology and, to some extent, common data standards.  In the service oriented future these responsibilities will move in opposite directions.  The scope and functionality of projects will be an enterprise-wide decision, whilst individual application teams will have more flexibility in the technologies they can economically use and the data designs they can employ.

The primary benefits of the architecture will accrue only to those who commit to a course of action in which the boundaries, functionality, and interface points of each system are no longer delegated to the individual projects implementing them but are determined at a corporate level ahead of time, with only the implementation delegated to the line organization.  This migration will be resisted by many of the incumbents, and the IS management that wishes to enjoy the benefits will need to prepare for the investment in cultural and organizational change that will be necessary to bring it about.

White Paper: Shedding Light on the “Shared Services” Conversation

Although there are at least seven levels of granularity to “shared services,” little time has been spent categorizing them.

My observation is that although there are at least seven levels of granularity to “shared services,” little time has been spent categorizing them. Please refer to the illustration below. The degree of sharing runs a gamut from the most sharing at the top to the least at the bottom. Mostly the higher levels of sharing imply the levels below, but that’s only most of the time, not all of the time.

The colors could come in handy later to help visualize sharing by function and by agency in a large matrix. An example might help: let’s say we were trying to sort out shared services in the area of the motor pool. Let’s go through each level:

The motor pool example doesn’t quite do justice to the distinction between the application front end and the application back end, which we think may end up being the significant difference.

A larger and more traditional application may showcase that difference better. Let’s take payroll. When most people talk about HR as a shared service, they are talking about sharing the application (there hasn’t been much discussion about the possibility of rebadging HR employees or relocating them). So, assuming we’re just talking about the HR application, there is still an extra degree of sharing to discuss: front end or back end. Traditionally, when you implement a package like SAP, most everyone affected has to learn the new application. It has new screens, new terminology, new workflow, new exceptions and new conventions. It requires new interfaces to existing systems in the field. This is why packaged implementations cost so much. The software isn’t very expensive. The literal installation and configuration doesn’t take all that much effort. It is the number and degree to which people, processes and other systems are impacted that runs the price tag up.

For most of the agencies we have been involved with, HRMS was a wrenching conversion. Many have still not recovered to their previous level of productivity. But at least one agency that we know of had a pretty easy go of it. This is because they had built an app they called HR Café. HR Café was the interface that everyone in the agency knew and used. HR Café implemented many of their local idiosyncrasies. Almost no one had direct access to the old payroll system. So when HRMS came up, the agency just changed the interface from HR Café so that it now interacted with HRMS, and there was very little collateral damage. The back end of HRMS was shared, not the front end. In this case, the good result was sort of an inadvertent consequence of some other good decisions that were taken. But we think this approach can be generalized with a tremendous amount of economic benefit.


What Will It Take to Build the Semantic Technology Industry?

I get asked this question a lot, and I’d like to get your help in answering it. As co-chairman of the Semantic Technology Conference, I see lots of customer organizations experimenting with and adopting semantic technologies – especially ontology-driven development projects and semantic search tools – and seemingly as many start-ups and new products emerging to address their requirements. It’s an exciting time to be in this space and I’m glad to have a part to play. But back to the question of “what will it take?” I don’t think anyone has all the answers, though it seems there’s a growing consensus about how semantics will eventually take hold:

1. A Little Semantics Goes a Long Way

I think it was Jim Hendler who first used the expression, and I find myself in stark agreement. Much of the criticism of the semantic web vision focuses on the folly of trying to boil the ocean, yet many of the successful early adopters are getting nice results by taking small incremental steps. There’s a good exchange at Dave Beckett’s blog on this point.

2. Realistic Expectations

I guess this relates to my first point, but I remain concerned about the hype and expectations that are being set around the semantic web, and now the term Web 3.0. I, as much as anyone, would love to see the semantics field explode with growth, but this market is going to be driven by customers, not vendors, and the corporate clients I see are taking a cautious approach. I think they’ll catch on eventually, but let’s not try to push them too far, too fast.

3. We Don’t Need a Killer App

Personally, I think we need to look at semantic capabilities as an increasing component of the web and computing infrastructure, as opposed to trying to identify a killer app that’s going to kickstart a buying frenzy. If a killer app emerges, then that’s great, but don’t hold your breath. There’s plenty of value to be gained in the meantime. More than anything, we need to demonstrate speedy, cheap ways to get started with semantics. This will be far more useful in the long run.

4. We Need to Get Business Mindshare

It’s so obvious that I’m almost embarrassed to say it, but the main point is that we need to improve how we’re currently demonstrating the business value of semantic technology. I see a few key ways we can improve, starting with a greater willingness to talk about the projects already taking place. Secondly, I think we can leverage existing technology trends – especially SOA and mashups – to show how semantic technology can add value to these efforts. Third, and I might risk offending some people with this, but in the short term we should be emphasizing cost savings and reduced time to deployment over and above the extra intelligence and functionality that semantics can provide, especially for corporate customers. Semantic SOA can save hugely over conventional approaches in data integration and interface projects, and this is where most businesses are really feeling the pain right now. This is a short and probably incomplete list of ideas. There’s more at the Semantic Technology Conference.

Shirky, Syllogism and the Semantic Web

Revisiting Clay Shirky’s piece on the Semantic Web

A friend recently sent me the link to Clay Shirky’s piece on the Semantic Web with “I assume you’ve seen this, what do you think?”

I had seen it, but I hadn’t looked at it for years. So I went back for another look.

As usual, Shirky’s writing is intelligent, insightful and even funny. Recommended reading. I had hoped the ensuing years would prove “us” (Semantic Technologists) right, and that the argument would look amusing in retrospect.

Alas we still have a long way to go to staunch the critics. More on that in a future article.

For today, I have to point out the real irony of the article that I managed to miss the first time I read it.

At the risk of oversimplifying his article to the same degree he oversimplified the Semantic Web, the essence of the article went like this:

• The Semantic Web relies on syllogisms: “The semantic web is a machine for creating syllogisms.”

• Nobody uses syllogisms: “it will improve all the areas of your life where you currently use syllogisms. Which is to say, almost nowhere.”

• Therefore nobody will use the Semantic Web: “it requires too much coordination and too much energy to effect in the real world.”

The first two quotes are from the opening, the last from the closing.
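For concreteness, the kind of syllogism Shirky has in mind really can be mechanized in a few lines. This toy Python sketch is not how an RDF reasoner works internally, but the logical shape is the same:

```python
# Major premise as a subclass link, minor premise as an instance link.
subclass_of = {"man": "mortal"}        # all men are mortal
instance_of = {"Socrates": "man"}      # Socrates is a man


def classes_of(individual):
    """Infer every class an individual belongs to by chaining subclass links."""
    classes = set()
    cls = instance_of.get(individual)
    while cls is not None:
        classes.add(cls)
        cls = subclass_of.get(cls)
    return classes


print(classes_of("Socrates"))  # the machine concludes that Socrates is mortal
```

The Semantic Web's bet is that chains of exactly this sort of inference, applied across many sources, add up to something useful.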

The irony being, of course, that this entire article is a syllogism. Making one of the major premises of an argument the claim that something will fail because nobody uses that style of argument reminds me of the admonition Yogi Berra gave to some teammates who had suggested a restaurant for the evening’s dinner: “Nah, nobody goes there anymore. It’s too crowded.”

The article points out some areas we need to pay more attention to, including controlling the hype machine. Reading between the lines, it appears that one of his major points is that the web is complex and only humans can really understand the nuances of our complex utterances.

But traffic is complex too. We know that traffic lights will never be as good as police at managing an intersection, yet we’ve decided that an automated solution that gets us consistently pretty good results is good enough.

Back to the article: he relies on Lewis Carroll’s syllogisms as a critique of the medium and, by extension, the Semantic Web. The knockout punch was meant to be a five-line syllogism about soap-bubble poems. But even here there were two implications: first, that humans could follow this logic, and second, that formalized ontologies could not. I of course rose to the bait and tried to formalize this syllogism.

I was not successful. Not because of any poverty of expression in the Semantic Web, nor even my own understanding; rather, attempting to get formal about this doggerel shone a light on the fact that it doesn’t make any sense at all. Indeed, if he makes a point at all, it is that humans can often be fooled by things that sound like they make sense but actually don’t. It seems to me that defending that level of confusion and ambiguity isn’t an argument against the Semantic Web.

White Paper: The Distinctionary

Encyclopedias are generally not intended to help with definition. An encyclopedia is useful in that once you know what something means, you can find out what else is known about it.

Semantics is predicated on the idea of good definitions. However, most definitions are not very good. In this essay we’re going to explore why well-intentioned definitions miss the mark and propose an alternate way to construct definitions. We call this alternate way the “distinctionary.”

Dictionary Definitions

The dictionary creates definitions for words based on common and accepted usage. Generally, this usage is culled from reputable published sources. Lexicographers comb through existing uses of a word and create definitions that describe what the word means in those contexts. Very often this will give you a reasonable understanding for many types of words. This is why dictionaries have become relatively popular and sometimes even bestsellers. However, it is not nearly enough. In the first place, there is not a great deal of visibility in attaching the definitions to their sources; there is only a very casual relationship between the source of a definition and the definition itself.

Perhaps the larger problem is that the definition describes but it does not discern. In other words, if there are other terms or concepts that are close in meaning, this type of definition would not necessarily help you distinguish between them.

Thesauri Definitions

Another way to get at meaning is through a thesaurus. The trouble with a thesaurus is that it is a connected graph of similar concepts. This is helpful if you are overusing a particular word and would like to find a synonym, or if you want to search for a similar word with a slightly different concept. But again, it does very little good actually describing the differences between the similar terms.

WordNet

WordNet is an online searchable lexicon that in some ways is similar to a thesaurus. The interesting and important difference is that in WordNet there are six or seven kinds of relationship links between terms, each with a specific meaning. Whereas in a thesaurus the two major links between terms are synonym and antonym (similar to, and not similar to), in WordNet there are links that define whether one term is a proper subtype of another term, whether one term is a part of another term, and so on. This is very helpful, and it takes us a good way toward definitions that make a difference.
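The difference between undifferentiated thesaurus links and WordNet-style typed links can be sketched in a few lines of Python. This is a toy graph with invented entries, not real WordNet data:

```python
# A toy illustration of typed lexical links. Unlike a thesaurus,
# each edge carries a specific relationship type.
links = {
    ("badger", "mammal"): "hypernym",     # badger IS-A mammal
    ("claw", "badger"): "part_meronym",   # a claw is PART-OF a badger
    ("sett", "burrow"): "synonym",        # similar-to: the thesaurus-style link
}

def related(term, relation):
    """Return terms linked to `term` by the given relation type."""
    return [a for (a, b), rel in links.items() if b == term and rel == relation]

print(related("badger", "part_meronym"))  # ['claw']
print(related("mammal", "hypernym"))      # ['badger']
```

Because the edges are typed, a query can ask specifically for parts or for subtypes, which a plain synonym/antonym graph cannot support.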

Taxonomies

A rigorous taxonomy is a hierarchical arrangement of terms in which each subterm is a proper subtype of the parent term. A really good taxonomy includes rule-in and rule-out tests to help with the placement of items in the taxonomy. Unfortunately, few good taxonomies are available, but those that exist form a good starting point for rigorous definitions.

Ontologies

An ontology, as Tom Gruber pointed out, is a specification of a conceptualization. A good ontology will have not only the characteristics of a good taxonomy, with formal subtyping and rules for inclusion and exclusion, but also other, more complex inference relationships. The ontology, like the taxonomy, has the powerful notion of “committing to” the ontology. With a dictionary definition there is no formal concept of the user committing to the meaning as defined by the source authority for the term; we do, however, find this in taxonomies and ontologies.

The Distinctionary

The preceding lays out a landscape of gradually increasing rigor in the tools we use for defining and managing the terms and concepts we employ. We’re going to propose one more tool not nearly as comprehensive or rigorous as a formal taxonomy or ontology, but which we have found to be very useful in the day to day task of defining and using terms: the distinctionary.

The distinctionary is a glossary. It is distinct from other glossaries in that it is structured such that a term is first placed as a type of a broader term or concept, and then a definition is applied that distinguishes this particular term or concept from its peers.

Eventually, each of the terms or concepts referred to in a distinctionary definition, i.e., “this term is a subtype of another one,” would also have to have their own entry in the distinctionary. But in the short term and for practical purposes we have to agree that there is some common acceptance of some of the terms we use.

A Few Examples

I looked up several definitions of the word “badger.” In this case I was looking for the noun, the mammal. I remembered that a badger was an animal but I couldn’t remember what kind of animal, so I thought maybe the dictionary would help. Here is what I found:

Badger:

Merriam Webster:

1 a: any of various burrowing mammals (especially Taxidea taxus and Meles meles) that are related to the weasel and are widely distributed in the northern hemisphere

Encarta:

a medium-sized burrowing animal that is related to the weasel and has short legs, strong claws, and a thick coat. It usually has black and white stripes on the sides of its head.

Cambridge Advanced Learner’s Dictionary:

an animal with greyish brown fur, a black and white head and a pointed face, which lives underground and comes out to feed at night

American Heritage:

1. Any of several carnivorous burrowing mammals of the family Mustelidae, such as Meles meles of Eurasia or Taxidea taxus of North America, having short legs, long claws on the front feet, and a heavy grizzled coat.

Webster’s Dictionary (1828 Edition)

1. In law, a person who is licensed to buy corn in one place and sell it in another, without incurring the penalties of engrossing.

2. A quadruped of the genus Ursus, of a clumsy make, with short, thick legs, and long claws on the fore feet. It inhabits the north of Europe and Asia, burrows, is indolent and sleepy, feeds by night on vegetables, and is very fat. Its skin is used for pistol furniture; its flesh makes good bacon, and its hair is used for brushes to soften the shades in painting. The American badger is called the ground hog, and is sometimes white.

Encyclopedia Definitions

Columbia Encyclopedia

name for several related members of the weasel family. Most badgers are large, nocturnal, burrowing animals, with broad, heavy bodies, long snouts, large, sharp claws, and long, grizzled fur. The Old World badger, Meles meles, is found in Europe and in Asia N of the Himalayas; it is about 3 ft (90 cm) long, with a 4-in. (10-cm) tail, and weighs about 30 lb (13.6 kg). Its unusual coloring, light above and dark below, is unlike that of most mammals but is found in some other members of the family. The head is white, with a conspicuous black stripe on each side. European badgers live, often in groups, in large burrows called sets, which they usually dig in dry slopes in woods. They emerge at night to forage for food; their diet is mainly earthworms but also includes rodents, young rabbits, insects, and plant matter. The American badger, Taxidea taxus, is about 2 ft (60 cm) long, with a 5-in. (13-cm) tail and weighs 12 to 24 lb (5.4–10.8 kg); it is very short-legged, which gives its body a flattened appearance. The fur is yellowish gray and the face black, with a white stripe over the forehead and around each eye. It is found in open grasslands and deserts of W and central North America, from N Alberta to N Mexico. It feeds largely on rodents and carrion; an extremely swift burrower, it pursues ground squirrels and prairie dogs into their holes, and may construct its own living quarters 30 ft (9.1 m) below ground level. American badgers are solitary and mostly nocturnal; in the extreme north they sleep through the winter. Several kinds of badger are found in SE Asia; these are classified in a number of genera. Badgers are classified in the phylum Chordata, subphylum Vertebrata, class Mammalia, order Carnivora, family Mustelidae.

Wikipedia

is an animal of the typical genus Meles or of the Mustelidae, with a distinctive black and white striped face – see Badger (animal). [Badger Animal] Badger is the common name for any animal of three subfamilies, which belong to the family Mustelidae: the same mammal family as the ferrets, the weasels, the otters, and several other types of carnivore.

Firstly, I intentionally picked a very easy word. Specific nouns like this are among the easiest things to define. I could have picked “love” or “quantum mechanics” or a verb like “generate” if I wanted to make this hard. As a noun, the definition of this word would be greatly aided by (although not completed by) a picture.

Let’s look at what we got. First, all the definitions establish that a badger is an animal, or mammal; anyone trying to find out what a badger was could safely be assumed to know those two terms. Most rely on Latin genus/species designations, which is not terribly helpful: if you already know the precise definitions of those, then you already know what a badger is. Worse, many of them are imprecise in their references: “especially Taxidea taxus and Meles meles.” What is that supposed to mean?

Some of the more useful parts of these definitions are “burrowing” and “carnivorous.” However, these don’t actually distinguish badgers from, say, skunks, foxes or anteaters. “Weasel-like” is interesting, but we don’t know in what way they are like weasels. Indeed, some of these definitions would have you think they were weasels.

Encyclopedias are generally not intended to help with definition. An encyclopedia is useful in that once you know what something means, you can find out what else is known about it. However, these encyclopedia entries are much better at defining “badger” than the dictionary definitions. (By the way, a lot of the encyclopedia information will make great “rule in/rule out” criteria.)

I had to include the 1828 definition, if only for its humor value. In the first place, the first definition is one that, less than 200 years later, is now virtually extinct. The rest of the definition seems to be in good form, but mostly wrong (“genus ursus” [bears] “feeds on vegetables,” “ground hog”) or irrelevant (“pistol furniture” and “brushes to soften the shades in painting”).

So what would the distinctionary entry look like for badger? I’m sad to say, even after reading all this I still don’t know what a badger is. Structurally the definition would look something like this:

A badger is a mammal. It is a four legged, burrowing carnivore. It is distinct from other burrowing carnivores in that [this is the part I still don’t know, but this part should distinguish it from close relatives (weasels and otters) as well as more distant burrowing carnivores, such as fox and skunk]. Its most distinguishing feature is two white stripes on the sides of its head.
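The distinctionary’s structure (broader term first, distinguishing criteria second) can be sketched in Python. The class and field names here are my own invention, not part of any formal specification:

```python
from dataclasses import dataclass, field

@dataclass
class DistinctionaryEntry:
    term: str
    broader_term: str  # first: place the term as a type of something
    # then: the criteria that set it apart from its peers
    distinguishing_criteria: list = field(default_factory=list)

    def is_complete(self) -> bool:
        # Without distinguishing criteria an entry merely categorizes;
        # it does not yet define -- exactly the gap in the badger entry above.
        return bool(self.distinguishing_criteria)

badger = DistinctionaryEntry(
    term="badger",
    broader_term="four-legged burrowing carnivorous mammal",
    distinguishing_criteria=[],  # still empty after six dictionaries!
)
print(badger.is_complete())  # False
```

The point of making completeness explicit is that the structure itself exposes the ignorance that prose definitions hide.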

The point of the distinctionary is to help us keep from getting complacent about our definitions. In the everyday world of glossaries and dictionaries, most definitions sound good, but when you look more closely you realize that they hide as much ignorance as they reveal. As you can see from my above attempt at a distinctionary entry for badger, it’s pretty hard to cover up ignorance.

White Paper: Semantic Profiling

Semantic profiling is a technique that uses semantic-based tools and ontologies to gain a deeper understanding of the information being stored and manipulated in an existing system.

Semantic Profiling

In this paper we will describe an approach to understanding the data in an existing system through a process called semantic profiling.

What is semantic profiling?

Semantic profiling is a technique that uses semantic-based tools and ontologies to gain a deeper understanding of the information being stored and manipulated in an existing system. It makes the work more systematic and rigorous, and it creates a result that can be correlated with profiling efforts in other applications.

Why would you want to do semantic profiling?

The immediate motivation for semantic profiling is typically a system integration effort, a data conversion effort, a new data warehousing project, or, more recently, a desire to use some form of federated query to pull together enterprise-wide information. Any of these may be the initial motivator, but the question remains: why do semantic profiling rather than one of the other available techniques? To answer that, let’s look at each of the typically employed techniques:

  • Analysis. By far, the most common strategy is some form of “analysis.” What this usually means is studying existing documentation and interviewing users and developers about how the current system works and what data is contained in it. From this the specification for the extraction or interface logic is designed. This approach, while popular, is fraught with many problems. The most significant is that very often what the documentation says and what the users and developers think or remember is not a very high fidelity representation of what will actually be found when one looks deeper.
  • Legacy understanding. The legacy understanding approach is to examine the source code of the system that maintains the current data and, from the source code, deduce the rules that are being applied to the data in the current system. This can be done by hand for relatively small applications. We have done it with custom analysis tools in some cases and there are commercial products from companies like Relativity and Merant that will automate this process. The strength of this approach is that it makes explicit some of what was implicit, and it’s far more authoritative than the documentation. The code is what’s being implemented; the documentation is someone’s interpretation of either what should have been done or their idea of what was done. While legacy understanding can be helpful, it’s generally expensive and time-consuming and still only gives a partial answer. The reason it only gives a partial answer is that there are many fields in most applications that have relatively little system enforcement of data values. Most fields with text data and many fields with dates and the like have very little system enforced validation. Over time users have adapted their usage and procedures have been refined to fill in missing semantics for the system. It should be noted though that the larger the user base the more useful legacy understanding is. In a larger user base, relying on informal convention becomes less and less likely, because the scale of the system means that users would have had to institutionalize their conventions, which usually means systems changes.
  • Data profiling. Data profiling is a technique that’s been popularized by vendors of data profiling software such as Evoke, Ascential and Firstlogic. The process relies on reviewing the existing data to uncover anomalies in the databases. These tools can be incredibly useful in finding areas where the content of the existing system is not what we would have expected. Indeed, the popularity of these tools stems largely from the almost universal surprise when people are shown the content of databases they were convinced were populated only with clean, scrubbed data of high integrity, only to find gross numbers of irregularities. While we find data profiling very useful, we find that it doesn’t go far enough. In this paper we’ll outline a procedure that builds on it and takes it further.
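A minimal sketch of what a data profiler does, in Python. The column name BB104 and the value patterns are illustrative, borrowed from the worked example later in this paper:

```python
import re
from collections import Counter

def profile_column(values):
    """Classify each raw value by crude shape -- the heart of data profiling
    is surprising you with how many shapes a 'clean' column contains."""
    def pattern(v):
        if v is None or v == "":
            return "empty"
        if re.fullmatch(r"\d{8}", v):
            return "8-digit (date-like MMDDYYYY?)"
        if re.fullmatch(r"\d+", v):
            return "other numeric"
        return "non-numeric"
    return Counter(pattern(v) for v in values)

# A column documented as MMDDYYYY dates, profiled:
bb104 = ["01152003", "12312001", "99999999", "N/A", ""]
counts = profile_column(bb104)
print(counts)
```

Note that "99999999" passes the crude shape test; deciding whether it is a plausible date is exactly where profiling hands off to semantic hypotheses.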

So how is semantic profiling different?

The first difference is that semantic profiling is more rigorous. We will get into exactly why in the section on how to do semantic profiling, but the primary difference is that with data profiling you can search for and catalog as many anomalies as you like: after you’ve found and investigated five strange circumstances in a database you can stop. It is primarily an aid to doing other things, and as such you can take it as far as you want. With semantic profiling, once you select a domain of study you are pretty much committed to taking it “to ground.” The second main difference is that the results are reusable. Once you’ve done a semantic profile on one system, if you profile another system the results from the first will be available and can be combined with those from the second. This is extremely useful in environments where you are attempting to draw information from multiple sources into one definitive source, whether for data warehousing or EII (Enterprise Information Integration). And finally, the semantic profiling approach sets up a series of testable hypotheses that can be used to monitor a system as it continues in production, to detect semantic drift.

What you’ll need

For this exercise you will need the following materials:

  • A database to be studied, with live or nearly live data. You can’t do this exercise with developer-created test data.
  • Data profiling software. Any of the major vendors’ products will be suitable for this. It is possible for you to roll your own, although this can be a pretty time consuming exercise.
  • A binding to your database available to the data profiling software. If your database is in a traditional relational form with an ODBC or JDBC access capability then that’s all you need. If your data is in some more exotic format you will need an adapter.
  • Meta-data. You will need as much of the official meta-data for the fields under study as you can find. This may be in a data dictionary, a repository, or the copy books; you may have to search around a bit for it.
  • An ontology editor. You will be constructing an ontology based on what you find in the actual data. There are a number of good ontology editors; for our purposes, Protégé from Stanford, which is free, should be adequate for most projects.
  • An inferencing engine. While there are many proprietary inferencing engines, we strongly advocate adopting one based on the recent RDF and OWL standards. There are open-source and freeware versions, such as OpenRDF or Kowari.
  • A core ontology. The final ingredient is a starting point ontology that you will use to define concepts as you uncover them in your database. For some applications this may be an industry reference data model such as HL7 for health care. However, we are advocating the use of what we call the semantic primes as the initial starting point. We’ll cover the semantic primes in another white paper or perhaps in a book. However, they are a relatively small number of primitive concepts that are very useful in clarifying your thinking regarding other concepts.

How to proceed

Overall, this process is one of forming and testing hypotheses about the semantics of the information in the extant database.

The hypotheses concern the fidelity and precision of the items’ definitions, as well as uncovering and defining the many hidden subtypes that lurk in any given system.

This business of “running to ground” means that we will continue the process until every data item is unambiguously defined and all variations and subtypes have been identified and also unambiguously defined.

The process begins with some fairly simple hypotheses about the data, hypotheses that can be gleaned directly from the meta-data. Let’s say we notice in the data dictionary that BB104 has a data type of date or even that it has a mask of MMDDYYYY. We hypothesize that it is a date and further, in our case, our semantic prime ontology forces us to select between a historical date or a planned date. We select historical. We add this assertion to our ontology. The assertion is that BB104 is of type historical date. We run the data profiling and find all kinds of stuff. We find that some of the “historical dates” are in the future. So, depending on the number of future dates and other contextual clues we may decide that either our initial assignment was incorrect and these actually represent planned dates, some of which are in the past because the plans were made in the past, or, in fact, that most of these dates are historical dates but there are some records in this database of a different type. Additionally, we find some of these dates are not dates at all. This begins an investigation to determine if there’s a systemic pattern to the dates that are not dates at all. In other words, is there a value in field BB101, BB102, or BB103 that correlates with the non-date values? And if so, does this create a different subtype of record where we don’t need a date?

In some cases we will uncover errors that are just pure errors. We have found cases where data validation rules had changed over time, so that older records held different, anomalous values; in other cases people had used system-level utilities, on an exception basis, to “repair” data records, creating these strange circumstances. Where something is finally determined to be a genuine error, rather than semantically defining it we should add it to a punch list for correcting the data and, where possible or necessary, correcting the cause.

Meanwhile, back to the profiling exercise. As we discover subtypes with different constraints on their date values, we introduce these into the ontology we’re building. To do this, as we document our date, we need to further qualify it: what is it the date of? For instance, if we determine that it is, in fact, a historical date, what event was recorded on that date? As we hypothesize and deduce this, we add it to the ontology, along with the information that this BB104 date is the “occurred on” date for the event we described. As we find that the database has some records with legitimate historical dates and others with future dates, and we find some correlation with another value, we hypothesize that, indeed, there are two types of historical events, or perhaps some historical events mixed with some planned events or planned activities. We then define these as separate concepts in the ontology, each with a predicate defining eligibility for the class. To make it simple, if we found that BB101 had one of two values, either P or H, we might hypothesize that H meant historical and P meant planned, and we would say that the inclusion criterion for planned events is that the value of BB101 equals P. This is a testable hypothesis. At some point the ontology becomes rich enough to begin its own interpretation. We load the data, either directly from the database or, more likely, from the profiling tool, as instances in the RDF inferencer. The inferencing engine itself can then challenge class assignments, detect inconsistent property values, and so on. We proceed in this fashion until we have unambiguously defined all the semantics of all the data in the area under question.
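The inclusion criteria and the consistency check can be sketched as ordinary predicates. This is a simplification: in practice these would be OWL restriction classes evaluated by the RDF inferencing engine, and the records and field names are hypothetical, following the BB101/BB104 example above:

```python
from datetime import date

# Hypothetical records following the BB101/BB104 example.
records = [
    {"BB101": "H", "BB104": date(2001, 3, 14)},
    {"BB101": "H", "BB104": date(2099, 1, 1)},   # claims to be historical, but...
    {"BB101": "P", "BB104": date(2050, 6, 30)},
]

# Testable inclusion criteria: BB101 = "H" means historical, "P" means planned.
inclusion_criteria = {
    "HistoricalEvent": lambda r: r["BB101"] == "H",
    "PlannedEvent":    lambda r: r["BB101"] == "P",
}

def classify(record):
    return [cls for cls, test in inclusion_criteria.items() if test(record)]

def inconsistencies(recs, today):
    """A HistoricalEvent whose date is in the future violates its class definition."""
    return [r for r in recs
            if "HistoricalEvent" in classify(r) and r["BB104"] > today]

bad = inconsistencies(records, today=date(2010, 1, 1))
print(len(bad))  # 1 -- the 2099 record needs reclassifying or correcting
```

This is the same challenge an inferencing engine performs when instance data is loaded against the ontology: class membership is asserted by criteria, then property values are checked against the class definition.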

Conclusion

Having done this, what do you have? You have an unambiguous description of the data as it exists and a set of hypotheses against which you can test any new data to determine whether it agrees. More simply, you know exactly what you have in your database, should you ever perform a conversion or a system integration. More interestingly, you also have the basis for a set of rules if you want to do a combined integration of data from many sources. You would know, for instance, that you need to apply a predicate to records from certain databases to exclude those that do not match the semantic criteria you want to use from the other system. Say you want a “single view of the customer.” You will need to know, of all the many records in all your many systems that say or allude to customer data, which ones really are customers and which are channel partners, prospects, or various other parties that might be included in some file. You need a way to define that unambiguously, or system-wide integration efforts are going to fall flat. We believe this to be the only rigorous and complete approach to the problem. While it is somewhat complex and time-consuming, it delivers reasonable value and contributes to the predictability of other efforts, which are often incredibly unpredictable.

Written by Dave McComb

Fractal Data Modeling

Fractal geometry creates beautiful patterns from simple recursive algorithms. One of the things we find so appealing is their “self-similarity” at different scales. That is, as you zoom in to look at a detail under more magnification you see many of the same patterns that were visible when zoomed out at the macro level.

After decades of working with clients at large organizations, I’ve concluded that they secretly would like to have a fractal approach to managing all their data. I say this desire is secret for several reasons:

  • No one has ever used that term in my presence
  • What they currently have is 180 degrees away from a fractal landscape
  • They probably haven’t been exposed to any technology or approach that would make this possible

And yet, we know this is what they want. It exists in part in a small subset of their data: the ability to “drill down” on predefined dimensions gives a taste of what is possible, but it is limited to that subset of the data. It exists in small measure in any corpus made zoomable by faceted categorization, but it is far from the universal organizing principle that it could be. Several of the projects we have worked on over the last few years have allowed us to triangulate on exactly this capability. This fractal approach leads us to information-scapes that have these characteristics:

  • They are understandable
  • They are pre-conditioned for easy integration
  • They are less likely to be loaded with ambiguity

The Anti-fractal data landscape

The data landscape of most large enterprises looks much the same:

  • Tens of thousands to hundreds of thousands of database tables
  • Spread across hundreds to thousands of applications
  • Containing, in total, hundreds of thousands to millions of attributes

There is nothing fractal about this. It is a vast, detailed data landscape with no organizing principle. The only thing that might stand in for an organizing principle is the boundary of the application, which actually makes matters worse. The presence of applications allows us to take data that is similar and structure it differently, categorize it differently, and name it differently. Rather than providing an organizing principle, applications make understanding our data more difficult.

And this is just the internal, structured data. There is far more data that is unstructured (and we have nearly nothing to help us with unstructured data), external (ditto), and “big data” (double ditto).

Download the White-paper to read more.

Why Not to Use Boolean Datatypes in Taxonomies

Many taxonomies, especially well-designed taxonomies with many facets, have dimensions that consist of very few categories, often just two. Tagging these with Boolean values may cause more harm than good.

Why Not to Use Boolean Datatypes in Taxonomies

It is tempting to give these Boolean-like tags, such as “Yes”/”No” or “Y”/”N” or “True”/”False,” or even near-Booleans like “H, M, L.” I’m going to suggest in this article not doing that, and instead using self-describing, meaningful names for the categories.

Before I do, let me provide a bit of color commentary on the types of situations where this shows up. Recently we were designing a Resolution Planning system in the financial industry. In the course of this design it became tempting to have categories for inter-affiliate services such as “resolution criticality,” “materiality,” or “impact on reputation”; not just tempting, in fact, since these were part of the requirements from the regulators. It was also tempting to have the specific terms within each category be something like “yes”/”no” or “high”/”medium”/”low,” partly because you may want the reports to have columns like “resolution critical” with “yes” or “no” in the rows.

That’s the backdrop. I can say from experience that it is very tempting to just create two taxonomic categories, “Yes” and “No.” There are actually two flavors of this temptation:

  • Just create two terms, “yes” and “no,” and use them in all the places they occur; that is, there is an instance with a URI like :_Yes and an instance with a URI like :_No, with labels “Yes” and “No”
  • Create different “yes” and “no” instances for each of the categories (that is, a URI with a name like :_resCrit_Yes which has a label “Yes,” and elsewhere a URI with a name like :_materiality_Yes)

I’m going to suggest that both are flawed. The first requires us to have a new property for every distinction we make. In other words, we can’t just say “categorizedBy” as we do with other categories, because you would need the name of the property to find out what “yes” means. While at first this seems reasonable, it leads to the type of design we find in legacy systems, with an excessive number of properties that have to be modeled, programmed against, and learned by consumers of the data. The second approach is closer to what we will advocate here, but doesn’t go far enough, as we’ll see.
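The two flavors, and the cost of the first, can be sketched with triples as Python tuples. The URIs follow the abbreviated style used above and the data is invented:

```python
# Flavor 1: a shared :_Yes instance forces a new property per distinction --
# the meaning of "yes" lives in the property name, not in the data.
flavor1 = [
    (":service42", ":resolutionCritical", ":_Yes"),
    (":service42", ":material",           ":_No"),
]

# Flavor 2: per-category Yes/No instances; one generic property works,
# but ":_resCrit_Yes" still only means something relative to a hidden question.
flavor2 = [
    (":service42", ":categorizedBy", ":_resCrit_Yes"),
    (":service42", ":categorizedBy", ":_materiality_No"),
]

# With flavor 1, consumers must know every bespoke property in advance:
properties_needed = {p for (_, p, _) in flavor1}
print(len(properties_needed))  # grows with every new distinction
```

Flavor 2 at least keeps one property, which is why it is closer to the recommendation here; its remaining flaw is that the object still fails to speak for itself.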

My perspective here is based on two things:

  • Years of forensic work, profiling and reverse engineering trying to deduce what existing data in legacy systems actually means, plus
  • My commitment to the “Data Centric Revolution,” wherein data becomes the permanent artifact and applications come and go. This is not the way things are now: in virtually all organizations, when people want new functionality they implement new applications and “convert” their data from the old to the new. Moving to truly data-centric enterprises will take some changes of perspective in this area.

I am reminded of a project we did with Sallie Mae, where we were using an ontology as the basis for their Service Oriented Architecture messages. Every day we’d tackle a few new messages and try to divine what the elements and attributes in the legacy systems meant. We would identify the obvious elements and send the analysts back to the developers to work out the more difficult ones. After several weeks of this I made an observation: the shorter the element, the longer it took us to figure out what it meant, with Booleans taking the longest.

I’ve been reflecting on this for years, and I think the confluence of our Resolution Planning application and the emergence of the Data Centric approach have led me to what the issue was and is.

“Yes” or even “True” doesn’t mean anything in isolation. It only means something in context. Yes is often the answer to a question, and if you don’t know what the question was, you don’t know what “yes” means. And in an application-centric world, the question is in the application. Often it appears in the user interface. Then the reporting subsystem reinterprets it; usually, due to space restrictions, the reporting interpretation is an abbreviated version of the user interface version. So the user interface might say “Would the unavailability of this service for more than 24 hours impair the ability of a resolution team to complete trades considered essential to continued operation of the financial system as a whole?” And the report might say “Resolution Critical.” Of course the question could just as well be expressed the other way around: “Could a team function through the resolution period without this service?” (where “Yes” would mean approximately the same as “No” to the previous question).

In either event, Boolean data like this does not speak for itself.  The data is inextricably linked to the application, which is what we’re trying to get beyond.

If we step back and reflect on what we’re trying to do, we can address the problem.  We are attempting to categorize things.  In this case we’re trying to categorize “Inter-affiliate Services.”  The categories we are trying to put things in are categories like “Would be Essential in the Event of a Resolution” and “Would not be Essential in the Event of a Resolution.”   I recognize that this sounds a lot like “Yes” and “No,” or perhaps the slightly improved “Essential” and “Non-Essential.”  Now if you ask the question “Would the unavailability of this service for more than 24 hours impair the ability of a resolution team to complete trades considered essential to the continued operation of the financial system as a whole?” the user’s answer “Yes” would correspond to “Would be Essential in the Event of a Resolution.”  If the question were changed to “Could a team function through the resolution period without this service?” we would map “No” to “Would be Essential in the Event of a Resolution.”
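This mapping can be sketched as a small piece of code.  A minimal illustration, assuming hypothetical question wordings and data structures (the category names are the ones discussed above; nothing here is a specific ontology API):

```python
# Hypothetical sketch: mapping answers to differently phrased
# application questions onto one fully qualified category, so the
# stored value no longer depends on which question was asked.

ESSENTIAL = "Would be Essential in the Event of a Resolution"
NON_ESSENTIAL = "Would not be Essential in the Event of a Resolution"

# Abbreviated, invented question texts for illustration.
Q_IMPAIR = ("Would the unavailability of this service impair "
            "a resolution team's ability to complete trades?")
Q_FUNCTION = ("Could a team function through the resolution "
              "period without this service?")

# Each question carries its own answer-to-category mapping;
# note the inverted phrasing maps the opposite way.
ANSWER_MAPPINGS = {
    Q_IMPAIR:   {"Yes": ESSENTIAL, "No": NON_ESSENTIAL},
    Q_FUNCTION: {"Yes": NON_ESSENTIAL, "No": ESSENTIAL},
}

def categorize(question: str, answer: str) -> str:
    """Translate a question-specific Boolean answer into a
    fully qualified category."""
    return ANSWER_MAPPINGS[question][answer]
```

Once the answer has been translated at capture time, the question itself can be thrown away; the category carries the whole meaning.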

Consider the implication.  With the fully qualified categories, you get several advantages:

  • The data does speak for itself.  You can review the data and know what it means without having to refer to application code, and without being forever dependent on the application code for interpretation.
  • You could write a query and interpret the results, without needing labels from the application or the report.
  • You could query for all the essential services.  Consider how hard this would be in the Boolean case: you can query for things in the Resolution Critical mini-taxonomy with the value “Yes,” but you don’t really know what “Yes” means.  With the fully qualified category you just query for the things that are categorized by “Would be Essential in the Event of a Resolution,” and you’ve got it.
  • You can confidently create derivative classes.  Let’s say you wanted the set of all departments that provide resolution-critical services.  You would just create a restriction class that relates the department to the service with that category.  You could do it with the Boolean, but you’d be continually dogged by the question “what did ‘yes’ mean in this context?”
  • You can use the data outside the context in which it was originally created.  In a world of linked data, it will be far easier to consume and use data that has more fully qualified categories.
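The querying advantage in particular can be shown with a toy record set.  A minimal sketch, assuming invented service names and a flat dictionary structure rather than a real triple store:

```python
# Hypothetical sketch: with fully qualified categories, the query for
# essential services reads directly off the data; there is no "Yes"
# whose meaning has to be recovered from application code.

ESSENTIAL = "Would be Essential in the Event of a Resolution"
NON_ESSENTIAL = "Would not be Essential in the Event of a Resolution"

services = [
    {"service": "Trade Settlement",      "category": ESSENTIAL},
    {"service": "Cafeteria Menu Feed",   "category": NON_ESSENTIAL},
    {"service": "Collateral Management", "category": ESSENTIAL},
]

# The filter condition is self-describing: no application context needed.
essential_services = [s["service"] for s in services
                      if s["category"] == ESSENTIAL]
```

The Boolean equivalent would filter on `s["resolution_critical"] == "Yes"`, which is only correct if you can still recover which question that flag answered.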

Finally, if you find you really need to put “Yes” on a report, you can always attach an alternate display label to the category; that way the data still knows what “Yes” means without having to refer to the application.
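One way to picture the alternate display label, as a minimal sketch with a hypothetical category record (not a specific ontology API):

```python
# Hypothetical sketch: the category keeps its self-describing label,
# and a short display label is attached for report layouts.

category = {
    "id": "WouldBeEssentialInResolution",
    "label": "Would be Essential in the Event of a Resolution",
    "display_label": "Yes",  # what the report prints
}

def report_cell(cat: dict) -> str:
    # Fall back to the full label when no short display label is set.
    return cat.get("display_label", cat["label"])
```

The report prints the terse “Yes,” but anyone following the label back to the category recovers the full, unambiguous meaning.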

In conclusion: it is often tempting to introduce Boolean values, or very small taxonomies that function as Booleans, into your ontology design.  This leads to long-term problems with coupling between the data and the application, and hampers maintenance and long-term use of the data.

Preparing and using these more qualified categories takes only a bit more up-front design work, and has no downside for implementation or subsequent use.