Data-Centric’s Role in the Reduction of Complexity

Complexity Drives Cost in Information Systems

A system with twice the number of lines of code will typically cost more than twice as much to build and maintain.

There is no economy of scale in enterprise applications; there is a diseconomy of scale. In manufacturing, every doubling of output results in a predictable reduction in the cost per unit. This is often called a learning curve or an experience curve.

Just the opposite happens with enterprise applications. Every doubling of code size means that additional code is added at ever lower productivity. This is because of complex dependencies. When you manufacture widgets, each widget has no relationship to, or dependency on, any of the other widgets. With code, it is just the opposite: each line must fit in with all those that preceded it. We can reduce the dependency with discipline, but we cannot eliminate it.

If you are interested in reducing the cost of building, maintaining, and integrating systems, you need to tackle the complexity issue head on.

The first stopping point on this journey is recognizing the role that schema plays in the proliferation of code. Study software estimating methodologies, such as function point analysis, and you will quickly see the central role that schema size plays in code bloat. Function point analysis estimates effort based on inputs such as the number of fields on a form, the elements in a transaction, or the columns in a report. Each of these is directly driven by the size of the schema. If you add attributes to your schema, they must show up in forms, transactions, and reports; otherwise, what was the point?

I recently did a bit of forensics on a popular, well-known, high-quality application: QuickBooks, which I think is representative. The QuickBooks code base is 10 million lines of code. The schema consists of 150 tables and 7,500 attributes (or 7,650 schema concepts in total). That means each schema concept, on average, contributed another 1,300 lines of code to the solution. Given that most studies have placed the cost to build and deploy software at between $10 and $100 per line of code (an admittedly large range, but you have to start somewhere), each attribute added to the schema is committing the enterprise to somewhere between $13K and $130K of expense just to deploy, and probably an equal amount over the life of the product for maintenance.
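A quick sanity check of that arithmetic (the inputs are the estimates quoted above, not independent measurements):

```python
# Back-of-the-envelope check of the QuickBooks figures cited above.
lines_of_code = 10_000_000
schema_concepts = 150 + 7_500          # tables + attributes = 7,650

loc_per_concept = lines_of_code / schema_concepts
print(round(loc_per_concept))          # ~1,300 lines of code per concept

# At $10-$100 per line of code, the cost committed per schema concept:
low, high = 10 * loc_per_concept, 100 * loc_per_concept
print(f"${low:,.0f} to ${high:,.0f}")  # roughly $13K to $130K
```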

I'm hoping this gives data modelers a bit of pause. It is so easy to add another column, let alone another table, to a design; it is sobering to consider the economic impact.

But that's not what this article is about. This article is about the insidious multiplier effect that not following the data-centric approach is having on enterprises these days.

Let us summarize what is happening in enterprise applications:

  • The size of each application’s schema is driving the cost of building, implementing, and maintaining it (even if the application is purchased).
  • The number of applications drives the cost of systems integration (which is now 30-60% of all IT costs).
  • The overlap, without alignment, is the main driver of integration costs (if the fields are identical from application to application, integration is easy; if the applications have no overlap, integration is unnecessary).

We now know that most applications can be reduced in complexity by a factor of 10-100.  That is pretty good.  But the systems of systems potential is even greater.  We now know that even very complex enterprises have a core model that has just a few hundred concepts.  Most of the rest of the distinctions can be made taxonomically and not involve programming changes.

When each sub domain directly extends the core model, instead of the complexity being multiplicative, it is only incrementally additive.

We worked with a manufacturing company whose core product management system had 700 tables and 7,000 attributes (7,700 concepts). Our replacement system had 46 classes and 36 attributes (82 concepts): almost a 100-fold reduction in complexity. They acquired another company that had its own systems, completely and arbitrarily different, though smaller and simpler at 60 tables and 1,000 attributes, or 1,060 concepts in total. To accommodate the differences in the acquired company, we had to add 2 concepts to the core model, or about 3%.

Normally, trying to integrate 7700 concepts with 1060 concepts would require a very complex systems integration project.  But once the problem is reduced to its essence, we realize that there is a 3% increment, which is easily managed.
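The arithmetic behind that "about 3%" figure, using the counts quoted above:

```python
# Concept counts from the manufacturing example above.
core_model = 46 + 36        # replacement core: 82 concepts
acquired   = 60 + 1_000     # acquired company's schema: 1,060 concepts
added      = 2              # concepts added to the core to absorb it

increment = added / core_model
print(f"{increment:.1%}")   # 2.4%, i.e. "about 3%" of the core model
```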

What does this have to do with data centricity?

Until you embrace data centricity, you think that the 7700 concepts and the 1060 concepts are valid and necessary.  You’d be willing to spend considerable money to integrate them (it is worth mentioning that in this case the client we were working with had acquired the other company ten years ago and had not integrated their systems, mostly due to the “complexity” of doing so).

Once you embrace data centricity, you begin to see the incredible opportunities.

You don't need data centricity to fix one application. You merely need elegance: a discipline that helps guide you to the simplest design that solves the problem. You may have thought you were doing that already. What is interesting is that real creativity comes with constraints. And when you constrain your design choices to be in alignment with a firm's "core model," it is surprising how rapidly the complexity drops. More importantly for the long-term economics, the divergence for the overlapped bits drops even faster.

When you step back and look at the economics though, there is a bigger story:

The total cost of enterprise applications is roughly proportional to:

[image: the formula for the total cost of enterprise applications, expressed as a product of several factors with a final divisor]

These items are multiplicative (except for the last, which is a divisor). This means that if you cut any one of them in half, the overall result drops by half. If you cut two of them in half, the result drops by a factor of four, and if you cut all of them in half, the result is an eight-fold reduction in cost.

Dropping any of these in half is not that hard. If you drop them all by a factor of ten (very doable), the result is a 1,000-fold reduction in cost. It sounds too incredible to believe, but let's take a closer look at what it would take to reduce each in half or by a factor of ten.
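The multiplier arithmetic can be sketched directly; the factors are left abstract here, since all we need is how much each one is reduced:

```python
from math import prod

def relative_cost(reductions):
    """Overall cost multiplier after each cost driver is reduced
    by the given factor (halving a driver means a reduction of 2)."""
    return 1 / prod(reductions)

print(relative_cost([2]))           # 0.5   - one driver halved
print(relative_cost([2, 2]))        # 0.25  - two halved: a 4-fold drop
print(relative_cost([2, 2, 2]))     # 0.125 - three halved: an 8-fold drop
print(relative_cost([10, 10, 10]))  # 0.001 - a 1000-fold reduction
```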

Click here to read more on TDAN.com

The Core Model at the Heart of Your Architecture

We have taken the position that a core model is an essential part of your data-centric architecture. In this article, we will review what a core model is, how to go about building one, and how to apply it both to analytics as well as new application development.

What is a Core Model?

A core model is an elegant, high fidelity, computable, conceptual, and physical data model for your enterprise.

Let’s break that down a bit.

Elegant

By elegant we mean appropriately simple, but not so simple as to impair usefulness. All enterprise applications have data models. Many of them are documented and up to date. Data models come with packaged software, and often these models are either intentionally or unintentionally hidden from the data consumer. Even hidden, their presence is felt through the myriad of screens and reports they create. These models are the antithesis of elegant. We routinely see data models meant to solve simple problems with thousands of tables and tens of thousands of columns. Most large enterprises have hundreds to thousands of these data models, and are therefore attempting to manage their datascape with over a million bits of metadata.

No one can understand or apply one million distinctions. There are limits to our cognitive functioning. Most of us have vocabularies in the range of 40,000-60,000 words, which should suggest the upper limit of a domain that people are willing to spend years to master.

Our experience tells us that at the heart of most large enterprises lies a core model that consists of fewer than 500 concepts, qualified by a few thousand taxonomic modifiers. When we use the term "concept" we mean a class (e.g., set, entity, table) or property (e.g., attribute, column, element). An elegant core model is typically 10 times simpler than the application it's modeling, 100 times simpler than a sub-domain of an enterprise, and at least 1,000 times simpler than the datascape of a firm.

Click here to continue reading on TDAN.com

Enterprise Ontology, Semantic Silos, and Cowpaths

Paving Cow Paths

Numerous modern-day streets in downtown Boston defy logic, until you realize that the city fathers literally paved over the transit system created and used by cows.* This gave the immediate benefit of getting places faster, while losing out on the longer-term gains that a purpose-built street plan could have yielded. This type of thing is pervasive in today's enterprise, ranging from computerizing paper forms to the plethora of information silos that call for an enterprise ontology, the subject of today's blog.

Figure 1: Paving the cowpaths in Boston**

Semantic Technology

Semantic Arts works with a wide variety of companies and, unlike just a few years ago, it is now common for our new clients to already have a number of efforts and groups exploring semantic technology in-house. Gone is the fear of the 'O word'. In its place are a range of projects and activities such as creating ontologies, working with triple stores, and creating proofs of concept. Unfortunately, what we see very little of is coordination of these efforts across the enterprise.

It would be mistaken to regard this as a misuse of the technology, because point solutions will often result in significant benefits locally – just like paving cow paths gave immediate gains. It’s more a missed opportunity in the form of a great irony. The very technology designed to break down silos gets used to build yet more silos – Semantic Silos.

Figure 2: Avoid Semantic Silos

Building semantic silos is an easy trap to fall into, because it takes a while to fully comprehend the power of semantic technology (or any other disruptive technology).  Information silos arise for many reasons, both technological and organizational.  Key technological factors include the inability of relational databases to (1) reuse schema and (2) uniquely identify data elements globally across databases.  That’s where the URI and RDF triples come in. It is hard to overstate the power of the URI in semantic technology. URIs uniquely identify not only data elements but also the schema elements. The former eliminates the need for joins, and the coordination of URIs makes the snapping together of disparate databases, well, a snap.  The latter enables something entirely foreign to relational technology: the ability to easily share and reuse all or parts of existing schema.

Enterprise Ontology

The key to avoiding semantic silos is to use an enterprise ontology: a small and elegant representation of the core concepts and relationships in your enterprise that are stable over time. It is at the same time both a conceptual model and a computable artifact that plays the role of a physical data schema. The enterprise ontology is a foundation for building more specialized ontologies, which are loaded into dozens, hundreds, or thousands of graph databases (called triple stores) that are populated with data. Data elements are also shared across multiple databases. This is depicted in Figure 3.

These stores can be used by many applications, not just one or two, as is common in today’s siloed, application-centric enterprise.  Collectively, these ontologies and their data form an enterprise knowledge graph. Such graphs are hugely important for modern companies such as Google, Facebook and LinkedIn.


Figure 3: The triple stores depicted in the top row are not silos. Globally unique URIs snap together to form a single enterprise knowledge graph that is accessible using federated SPARQL queries.  Letters denote ontology URIs and numbers denote data URIs.

Having built enterprise ontologies now in a variety of industries, we are confident in stating the surprising result that there are only a few hundred such concepts that form this core for any given enterprise.  This is what makes it possible to build an enterprise ontology, where building enterprise-wide data models has failed for decades. There is no need to have millions of attributes in the core model.

Summary and Conclusion

  1. It is entirely possible to use semantic technology to develop point solutions around your enterprise and unwittingly end up recreating the very silos that semantic technology aims to get rid of.
  2. We see this happening in organizations that are using semantic technology.
  3. You don't want to do that: you will miss out on some of the main benefits of the technology. The data will not snap together if there is no coordination.
  4. The answer is to use an enterprise ontology as a core data model that is shared among all the applications and data stores that collectively make up your enterprise knowledge graph.
  5. The URI is the hero: URIs are globally unique identifiers that allow seamless sharing of data and schema; joins are history.

Keep in mind that technology as enabler is only part of the story. To get real traction in breaking up silos also requires meeting plenty of social and organizational challenges and putting governance policies into place.  But that’s another topic for another day.

Don't fall into the trap of paving the cow paths to semantic silos. Use an enterprise ontology to create the beginning of an integrated enterprise.

Afterword

See also the delightful and well-known poem by S.W. Foss called, “The Calf Path”.***

* Change Management: Paving the Cowpaths
https://www.fastcompany.com/1769710/change-management-paving-cowpaths

** Picture credit:
http://bostonography.com/2011/cartographic-greetings-from-boston/bostontownoldrenown/

*** https://m.poets.org/poetsorg/poem/calf-path

The Data-Centric Revolution: Gaining Traction

There is a movement afoot. I’m seeing it all around me. Let me outline some of the early outposts.

Data-Centric Manifesto

We put out the data-centric manifesto on datacentricmanifesto.org over two years ago now. I continue to be impressed with the depth of thought that the signers have put into their comments. When you read the signatory page (and I encourage you to do so now) I think you'll be struck. A few, selected at random, give you the flavor:

This is the single most critical change that enterprise architects can advocate – it will dwarf the level of transformation seen from the creation of the Internet. – Susan Bright, Johnson & Johnson

Back in “the day” when I started my career we weren’t called IT, we were called Data Processing. The harsh reality is that the application isn’t the asset and never has been. What good is the application that your organization just spent north of 300K to license without the data?   Time to get real, time to get back to basics. Time for a reboot! –  Kevin Chandos

This seems a mundane item to most leaders, but if they knew its significance, they would ask why we are already not using a data-centric approach. I would perhaps even broaden the name to a knowledge-centric approach and leverage the modern knowledge management and representation technologies that we have and are currently emerging. But the principles stand either way. – David Chasteen, Enterprise Ecologist

Because I’ve encountered the decades of inertia and want to be an instrument of change and evolution. – Vince Marinelli, Medidata Solutions Worldwide

And I love this one for its simple frustration:

In my life I try to fight with silos – Enn Õunapuu, Tallinn University of Technology

Click here to continue reading on TDAN.com

A Semantic Bank

What does it mean to be a “Semantic Bank”?

 

In the last two months I’ve heard at least 6 financial institutions declare that they intended to become “A Semantic Bank.”  We still haven’t seen even the slightest glimmer as to what any of them mean by that.

Allow me to step into that breach.

What follows is our take on what it would mean to be a “Semantic Bank.”

The End Game

I'm reluctant to start with the end state, because pretty much anyone reading this, including those who aspire to be semantic banks, will find it a "bridge too far." Bear with me. I know this will take at least a decade, perhaps longer, to achieve. However, having the end in mind allows us to understand, with a clarity few currently have, exactly where we are wasting our money now.

If we had the benefit of time and could look back from 2026 and ask ourselves "which of our investments in 2016 were really investments, and which were wastes of money?" how would we handicap the projects we are now funding? To be clear, not all expenditures need to lead to the semantic future. There are tactical projects that are worth so much in the short term that we can overlook the fact that we are anti-investing in the future. But we should be aware of when we are doing this, and it should be the exception. The semantic bank of the future will be the organization that can intentionally divert the greatest percentage of its current IT capital spend toward its semantic future.

A Semantic Bank will be known by the extent to which its information systems are mediated by a single (potentially fractal, but with a single simple core) conceptual model.  Unlike conceptual models of the past, this one will be directly implemented.  That is, a query to the conceptual model will return production data, and a transaction expressed in conceptual model terms will be committed, subject to permissions and constraints which will also be semantically described.

Semantics?

For those who just wandered into this conversation: semantics is the study of meaning. Semantic technology allows us to implement systems, and to integrate systems, at the level of conceptual meaning, rather than at the level of structural description (which is what traditional technology relies on).

It may sound like a bit of hair splitting, but the hair splitting is very significant in this case. This technology allows practitioners to drop the costs of development, integration, and change by over an order of magnitude, and allows incorporation of data types (unstructured, semi-structured, and social media, for instance) that hitherto were difficult or impossible to integrate.

It accomplishes this through a couple of interesting departures from traditional development:

  • All data is represented in a single format (the triple). There aren't hundreds or thousands of different tables; there is just the triple.
  • Each triple is an assertion: a mini sentence composed of a subject, predicate, and object. All data can be reduced to a set of triples.
  • All the subjects, all the predicates, and most of the objects are identified with globally unique identifiers (URIs, which are analogous to URLs).
  • Because the identifiers are globally unique, the system can join records without an analyst or programmer having to write explicit joins.
  • A database that assembles triples like this is called a "triple store" and is in the family of "graph databases." A semantic triple store is different from a non-semantic database in that it is standards compliant and supports a very rich schema (even though it is not dependent on having a schema).
  • Every individually identifiable thing (whether a person, a bank account, or even the concept of "Bank Account") is given a URI. Wherever the URI is stored or used, it always means exactly the same thing. Meaning is not dependent on context or location.
  • New concepts can be formed by combining existing concepts.
  • The schema can evolve in place, even in the presence of a very large database dependent on it.

A set of concepts so defined is called an “Ontology” (loosely an organized body of knowledge). When the definitions are shared at the level of the firm, this is called an “Enterprise Ontology.”
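A toy illustration of those ideas in Python, with triples as plain tuples and strings standing in for URIs (the namespace and facts are invented):

```python
# Triples are (subject, predicate, object); URIs are plain strings here.
EX = "http://example.com/ont/"          # hypothetical namespace

# Two independently built "databases" that share the same URIs
accounts  = {(EX + "acct42", EX + "type",   EX + "BankAccount")}
customers = {(EX + "acct42", EX + "heldBy", EX + "person7"),
             (EX + "person7", EX + "type",  EX + "Person")}

# Because identifiers are globally unique, integration is just set union:
graph = accounts | customers

# "Joining" is simply following shared URIs; no join logic is written.
facts_about_acct42 = {t for t in graph if t[0] == EX + "acct42"}
assert len(facts_about_acct42) == 2     # both facts snapped together
```

Real triple stores add standards compliance, persistence, and inference on top, but the snap-together property comes from exactly this: globally unique identifiers.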

Our experience has been that, using these semantic approaches, an ontology can be radically simpler, and at the same time more precise and more complete, than traditional application databases. When the semantics are done at the firm level, the benefits are even greater, because each additional application benefits from the concepts shared with the others.

Business Value

What is the business value of rethinking information systems? The benefits come in two main varieties: generic and specific.

Generic Value

Dropping the cost of change by a factor of 10 has all sorts of positive value.  Systems that were too difficult to change become malleable.

The integration story is even better: once all the similar concepts are expressed in a way that makes their similarity obvious and baked into their identity, systems integration, currently one of the largest costs in IT, will become almost free.

Back to the End Game

In the end game, a semantic bank will have all its systems directly implemented on a shared semantic model. The scary thing is: who has a better shot at this, the established oligarchy (the "too big to fail") or FinTech? Each has about half the advantages. Cue Clayton Christensen's "The Innovator's Dilemma": in some situations a new upstart enters a market with superior technology and the incumbents crush the upstart. In other situations, the upstart creates a beachhead in an underserved market and continually walks its way up the value chain until the incumbents are on the ropes. What makes the difference, and how will it play out with the "Semantic Banks"? That is the ultimate question.

A Bit More on the Target

Most vendors have a tendency to see the future in terms of the next version of their offering.

In the future, a progressive firm will have an “enterprise ontology” that represents the key concepts that they deal with.  Currently they have thousands of application systems, each of which has thousands of tables, and tens of thousands of columns that they manage.  In aggregate they are managing millions of concepts.

But really, there are a few hundred concepts on which everything they deal with is based. When we figure out what those few hundred concepts are, we have started down the road of profound simplicity.

Once you have this model (the “core ontology”) you are armed with a weapon that delivers on three fronts:

  • All integration can be mediated through this model. By mapping into and out of the shared model, all integration becomes easier.
  • New development can be made incredibly simpler. Building an app on a model that is 10 times simpler than normal, and 100 times simpler than the collective model of the firm, economizes the application development and integration process.
  • The economics of change become manageable. Currently there is such a penalty for changing an information system that we spend an inordinate amount of energy staving off changes. In the semantic future, change is economical (not free, but far, far less costly than at present). Once we get to that point, the low cost of change translates into rapidly evolvable systems.

What Will Distinguish the Leaders in the Next Five Years?

Only the smallest start up will be completely semantic within the next five years.  If they develop a semantic core, their challenge will be growing out to overtake the incumbents.

This white paper is mostly written for the incumbents (by the way, we are happy to help FinTech startups, but our core market is established players dealing with legacy issues).

Most financial services companies right now are executing "proof of concept" projects. Those that do this may well be the losers. NASA has a concept called "TRL" (Technology Readiness Level): a scale of 1-9, where levels 1-3 are wacky ideas that no one yet knows can be implemented, and levels 7-9 are technology that has already been commercialized, with no risk left in implementation. Experiments are typically done at levels 1-3, to learn what else we need to know to make a technology real. Proofs of concept are typically done at levels 4-6, to narrow down some implementation parameters. The issue is that all the important semantic technology is at level 8 or 9. Everyone knows it works and knows how it works. The companies doing "proof of concept" projects in semantic technology at this point are vamping[1] and will ultimately be eclipsed by companies who can commit when appropriate.

What are the benefits of becoming semantic?

The benefits of adopting this approach are so favorable that many people would challenge our credibility for suggesting them (it sounds like hype), but these differences are really true, so we won't shrink from our responsibility for the sake of credibility.

Integration

When you map your existing (and especially future) systems to a simple, shared model, the cost of integration plummets.  Currently integration consumes 30-60% of the already bloated IT budget because analysts are essentially negotiating agreement between systems that each have tens of thousands of concepts.  That’s hard.

What's easy (well, relatively easy) is to map a complex system to a simple model. Once you've done this, it is integrated with all the other systems that have also been mapped to that model. It becomes the network effect of knowledge.

Application Development

A great deal of the cost of application development is the cost of programming to a complex data model. Semantic technology helps at two levels. The first is that by reducing the complexity of the model, any code dependent on the model is reduced proportionately. The second is that semantic technology is very compatible with RESTful development. The RESTful approach encourages a style of development that is less dependent on, and less coupled to, the existing schema. We have found that a semantics-based system using RESTful APIs is amazingly resilient to changes in the model (other than those that introduce major structural changes, but that is a commercial for getting your core ontology right to start with).

New Data Types

Many leading-edge projects are predicated on being able to incorporate data that was hitherto unrealistic to incorporate. This might be unstructured data, it might be open data, it might be social media. All of these are difficult for traditional technology, but semantic technology takes them in stride.

Observations from other industries

Here is our observation about what has worked in other industries (which, by the way, are also only minimally converted to semantic technology, but whose early adopters provide some important signposts for what works and what doesn't).

Vision and Constancy Trump Moon Shots

What we have seen from the firms that have implemented impressive architectures based on semantics is that a small team with continual funding vastly outperforms attempts to catch up with huge projects. The most impressive firms have had a core of 3-8 people who were at it continually for 2-4 years. Once you reach critical mass with these teams and the capability they create, putting 50-100 people on a catch-up project will never catch them. The lead that can be established now with a small, focused team will open up an insurmountable lead 3-5 years from now, when this movement becomes obvious.

The Semantic Bank Maturity Model

Eventually we will come to the point where we will want to know: "how semantic are you?" Click here to take an assessment to discover the answer to this question.

We will take this up in a separate white paper, with considerably more detail, but the central concept is: what percent of your data stores are semantically enabled and how semantic are they really?

Getting Started

Let’s assume you want to take this on and become a “Semantic Bank”.  How do you go about it?

What we know from other industries is that the winner is not the firm that spends the most, or even the one that starts first (although at some point failing to start is going to be starting to fail). The issue is who can sustain a modest but continual initiative. This means that the winner will be the firm that can finance a continual improvement project over several years. While you might make a bit of incremental progress through a series of tactical projects, the big wins will come from the companies that can set up an initiative and stick with it. We have seen this in healthcare, manufacturing, and publishing, and we expect it to be true in financial services as well.

Often this means that the sponsor must be at a position where they can dedicate a continual (but not very large) budget to achieve this goal.  If that is not you, you may want to start the conversation with the person who can make a difference.  If this is you, what are you waiting for?

[1] Vamping is a term professional jugglers use for what you do when you drop a juggling club: continuing the cadence with an imaginary club until you can find a moment to lift the dropped club back into the rotation.

The Data-Centric Revolution: Integration Debt

Integration Debt is a Form of Technical Debt

As with so many things, we owe the coining of the metaphor "technical debt" to Ward Cunningham and the agile community. It is the confluence of several interesting conclusions the community has come to. The first was that being agile means being able to make a simple change to a system in a limited amount of time, and being able to test it easily. That sounds like a goal anyone could get behind, and yet it is nearly impossible in a legacy environment. Agile proponents know that any well-intentioned agile system is only six months' worth of entropy away from devolving into that same sad state where small changes take big effort.

One of the tenets of agile is that patterns of code architecture exist that are conducive to making changes. While these patterns are known in general (there is a whole pattern languages movement to keep refining the knowledge and use of these patterns), how they will play out on any given project is emergent. Once you have a starting structure for a system, a given change often perturbs that structure. Usually not a lot. But changes add up and, over time, can greatly impede progress.

One school of thought is to be continually refactoring your code, such that, at all times, it is in its optimal structure to receive new changes. The more pragmatic approach favored by many is that for any given sprint or set of sprints, it is preferable to just accept the fact that the changes are making things architecturally worse; as a result, you set aside a specific sprint every 2-5 sprints to address the accumulated “technical debt” that these un-refactored changes have added to the system. Like financial debt, technical debt accrues compounding interest, and if you let it grow, it gets worse—eventually, exponentially worse, as debt accrues upon debt.

Integration Debt

I’d like to coin a new term: “integration debt.” In some ways it is a type of technical debt, but as we will see here, it is broader, more pervasive, and probably more costly.

Integration debt occurs when we take on a new project that, by its existence, is likely to lead someone at some later point to incur additional work to integrate it with the rest of the enterprise. While technical debt tends to occur within a project or application, integration debt takes place across projects or applications. While technical debt creeps in one change at a time, integration debt tends to come in large leaps.

Here’s how it works: let’s say you’ve been tasked with creating a system to track the effectiveness of direct mail campaigns. It’s pretty simple – you implement these campaigns as some form of project and their results as some form of outcomes. As the system becomes more successful, you add in more information on the total cost of the campaign, perhaps more granular success criteria. Maybe you want to know which prospects and clients were touched by each campaign.

Gradually, it dawns on you that getting this additional information (and especially getting it without incurring more research time and re-entry of data) will require integration with other systems within the firm: the accounting system for the true costs, the customer service systems for customer contact information, the marketing systems for the overlapping target groups, etc. At this point, you recognize that the firm is going to consume a great deal of resources to get a complete data picture. Yet this could have been known and dealt with at project launch time. It could even have been prevented.

Click here to read more on TDAN.com

The Inelegance of having Brothers and Sisters

This blog follows from a recent blog by Dan Carey called Screwdrivers and Properties. It points to a longer whitepaper on the topic of avoiding property proliferation.

One way we keep the number of primitives small is to avoid creating a subproperty if its meaning is essentially the same as the superproperty, but has a more restricted domain or range. We illustrate this with an example in the genealogy domain. Suppose we have the property myeo:hasSibling and we want to model brothers and sisters. One way would be to create two subproperties, myeo:hasBrother and myeo:hasSister, whose ranges are myeo:Male and myeo:Female respectively, and define the class myeo:Brother as a property restriction class that means “any individual that is the brother of some person”.  In Manchester syntax, this looks like: “myeo:brotherOf some myeo:Person” where myeo:brotherOf is the inverse of myeo:hasBrother. Similarly we can define myeo:Sister as “myeo:sisterOf some myeo:Person”. This introduces two new classes and two new properties.

However, we can easily capture the semantics of brother and sister without introducing any new properties. We define the class myeo:Brother as “myeo:Male and myeo:siblingOf some myeo:Person” and myeo:Sister is defined as “myeo:Female and myeo:siblingOf some myeo:Person”. This way we can define the brother and sister concepts entirely in terms of existing primitives with the same number of classes and without creating any new properties.
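To make the reuse concrete, here is a minimal sketch, in Python rather than OWL, of how a consumer of the ontology could classify brothers and sisters from just the gender classes and a single sibling property, without any new properties. The individuals and data are hypothetical; the membership tests mirror the class expressions above.

```python
# Hypothetical data: class membership plus one sibling property.
types = {"jack": {"myeo:Male"}, "jill": {"myeo:Female"}}
sibling_of = {"jack": {"jill"}, "jill": {"jack"}}

def is_brother(x):
    # myeo:Brother == myeo:Male and (myeo:siblingOf some myeo:Person)
    return "myeo:Male" in types.get(x, set()) and bool(sibling_of.get(x))

def is_sister(x):
    # myeo:Sister == myeo:Female and (myeo:siblingOf some myeo:Person)
    return "myeo:Female" in types.get(x, set()) and bool(sibling_of.get(x))
```

Note that nothing here needed a hasBrother or hasSister property; the existing sibling primitive carries all the relational semantics.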

The only thing that differs about myeo:hasBrother and myeo:hasSister compared to myeo:hasSibling is that the former two properties have more restricted ranges (myeo:Male and myeo:Female vs. myeo:Person). Otherwise the meaning is identical. We have essentially moved the semantics of brother from the ranges of two new properties into the class expressions that define the classes myeo:Brother and myeo:Sister (see figure below).

Keeping the number of primitives low is not only more elegant, but it has practical value.  The fewer things you have, the easier it is to find what you need. Not only does it help during ontology development, it also helps downstream when others evolve and apply the ontology.


Whitepaper: Avoiding Property Proliferation

Domain and range for ontological properties are not about data integrity, but logical necessity. Misusing them leads to an inelegant (and unnecessary) proliferation of properties.

Logical Necessity Meets Elegance

Screwdrivers generally have only a small set of head configurations (flat, Phillips, hex) because the intention is to make accessing contents or securing parts easy (or at least uniform). Now, imagine how frustrating it would be if every screw and bolt in your house or car required a unique screwdriver head. They might be grouped together (for example, a bunch of different sized hex heads), but each one was slightly different. Any maintenance task would take much longer and the amount of time spent just organizing the screwdrivers would be inordinate. Yet that is precisely the approach that most OWL modelers take when they over-specify their ontology’s properties.
On our blog, we once briefly discussed the concept of elegance in ontologies. A key criterion was, “An ontology is elegant if it has the fewest possible concepts to cover the required scope with minimal redundancy and complexity.” Let’s take a deeper look at object properties in that light. First, a quick review of some of the basics.

  1. An ontology describes some subject matter in terms of the meaning of the concepts and relationships within that ontology’s domain.
  2. Object properties are responsible for describing the relationships between things.
  3. In the RDFS and OWL modeling languages, a developer can declare a property’s domain and/or its range (the class to which the Subject and/or Object, respectively, must belong). Domain and range for ontological properties are not about data integrity, but logical necessity. Misusing them leads to an inelegant (and unnecessary) proliferation of properties.

Break the Habit

In our many years’ experience teaching our classes on designing and building ontologies, we find that most new ontology modelers have a background in relational databases or Object-Oriented modelling/development. Their prior experience habitually leads them to strongly tie properties to classes via specific domains and ranges. Usually, this pattern comes from a desire to curate the triplestore’s data by controlling what is getting into it. But specifying a property’s domain and range will not (necessarily) do that.
For example, let’s take the following assertions:

  • The domain of the property :hasManager is class :Organization.
  • The individual entity :_Jane is of type class :Employee.
  • :_Jane :hasManager :_George.

Many newcomers to semantic technology (especially those with a SQL background) expect that the ontology will prevent the third statement from being entered into the triplestore because :_Jane is not declared to be of the correct class. But that’s not what happens in OWL. The domain says that :_Jane must be an :Organization, which presumably is not the intended meaning. Because of OWL’s Open World paradigm, the only real constraints are those that prevent us from making statements that are logically inconsistent. Since in our example we have not declared the :Organization and :Employee classes to be disjoint, there is no logical reason that :_Jane cannot belong to both of those classes. A reasoning engine will simply infer that :_Jane is also a member of the :Organization class. No errors will be raised; the assertion will not be rejected. (That said, we almost certainly do want to declare those classes to be disjoint.)
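The open-world behavior described above can be sketched in a few lines of Python. This is a toy illustration, not a real OWL reasoner; the names mirror the example.

```python
# Hypothetical toy reasoner: a domain declaration licenses an inference
# about the subject's type; it never rejects the triple.
triples = [(":_Jane", ":hasManager", ":_George")]
domains = {":hasManager": ":Organization"}
types = {":_Jane": {":Employee"}}

for s, p, o in triples:
    if p in domains:
        types.setdefault(s, set()).add(domains[p])  # infer, don't reject

# :_Jane is now both an :Employee and an :Organization; no error was raised.
```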

Read More and Download the White-paper

White Paper by Dan Carey

Screwdrivers & Properties

Screwdrivers generally have only a small set of head configurations (flat, Phillips, hex) because the intention is to make accessing contents or securing parts easy (or at least uniform).

Now imagine how frustrating it would be if every screw and bolt in your house or car required a unique screwdriver head.  They might be grouped together (for example, a bunch of different sized hex heads), but each one was slightly different.  Any maintenance task would take much longer and the amount of time spent just organizing the screwdrivers would be inordinate.

Yet that is precisely the approach that most OWL modelers take when they over-specify their ontology’s properties.

“Avoiding Property Proliferations – Part 1” discusses the pitfalls of habitually applying domains and ranges to properties.

Click here to download the whitepaper.

Binary Instances

Sometimes when we’re designing ontologies we’re faced with design choices that would lead us to create what we call “binary instances”: situations where it takes the instantiation of two instances (often of different classes) to capture one concept. For instance, we may be considering creating a patient instance that is distinct from the corresponding person instance.

In an effort to move this design decision from the realm of arbitrary designer’s choice to something more principled, in this article we explore the factors that go into a decision that leads to binary instances.

Some Examples

This section outlines some examples that we have come across, as it is often easier to work from a large palette of examples than from abstractions. Some of these examples may seem odd, and for some you may be surprised that anyone would consider them either one way or the other (binary or unary), but we have seen all of these at various times.

My guess is your background and predisposition will cause you to look at each one of these and say either “obviously one instance” or “obviously two instances,” but we suggest that any of these could go either way (a few are a bit of a stretch, but bear with us; we’re trying to make a point). After the examples we introduce some principles that we think will lead to reasonably consistent decisions in this arena.

Statue v. Bronze

This is a classic philosophical argument. What is the difference between the statue and the clay, or bronze? The knee-jerk reaction is to think they are two things, but consider: if you have a 10-pound statue made out of 10 pounds of bronze, when you go to ship it, will you be charged for 20 pounds of freight or 10?

Person v. Employee

When you take on a job, are you two things (person and employee) or one thing (a person who is an employee)? Hint: your employer and the Unemployment Insurance Agency are likely to come up with different answers for this one.

The Restrictions of Law v. The Text of Statute

If a lawmaker writes a law that says “it is illegal to turn right on a red light” and we model this, what do we end up with? Semantically, the law is a restriction on behavior. There is a behavior (turning on the red) whose incidence the law intends to reduce, either through cooperation or through punishment. The question is: is the text of the law (the literal words) its own object, separate from the meaning of the words? If we are writing a text management system, or even a statute management system, there probably is only the text object (the system doesn’t care much about what the words mean). However, if we attempt to manage meaning, we need to consider that there are objects that represent the behavior we are interested in reducing, such that we could detect (via cameras, say) behavior in the world that was in violation. The question then becomes: is there one object that represents the restriction and a second that holds the text of the law, or is there just the restriction with a datatype property that is the text?

A Creative Work v A Document

We know that there are particular renditions of Moby Dick (the English original or the Portuguese translation). Certainly the English and Portuguese documents are different instances. The real question is: is the recognition of the “work” (Moby Dick in the abstract) a different instance, and do we need it dragging around with each rendition (i.e., the Portuguese Moby Dick is a derivative of the creative work)?

Government Organization v. Region Governed

When we speak of the Ukraine, are we referring to the governing body, which is an organization, or the region (recently diminished) that the government holds sway over? Should we have one instance that represents both the government and the region, or two that are linked?

Specification v Model

When companies design and build products they often create specifications (it has 8 GB of memory, is 8 inches wide and 2 inches tall, etc.) and they also create “models,” which they usually name (iPhone 6, for instance). Is the specification a separate object from the model, or is there just one object?

Position v. Incumbent

Barack Obama is the President of the United States.  Is that two instances or one?

Actor v. Role

When Val Kilmer played Doc Holliday in Tombstone, was there one instance (Val Kilmer), a Person who played a role, or are there two instances, the role and the person?

Event v. Time Interval

We say an event is something that happened over a particular time interval.  So a particular concert, your attendance at the staff meeting Tuesday morning or World War II would all be considered events.  Each of course has a beginning and ending date and time.  The question is: is the time interval (May 22 from 9:00 AM to 10:00 AM) a separate instance from the staff meeting that occurred over that interval?

Diagnosis v. Disease

Up until the moment we are diagnosed with cancer, or diabetes, or even toenail fungus, we are unaware of having the disease. The diagnosis and the disease seem to coexist in most cases. Are they two things or one?

Person v. Legal Person

We’ve seen systems that focus on the distinction between the flesh-and-blood person and the social artifact that is allowed to enter into contracts. Two instances or one?

Organization v. Organization in Role

In some systems we’ve seen recently there is a distinction between an Organization (say, Goldman Sachs) and an Organization in a Role (Goldman Sachs as an Underwriter v. Goldman Sachs as a Trader).

Contract Document v. Financial Agreement

Two parties agree to a complex financial transaction. They paper it up with a contract that they sign. If we model the essence of their agreement, is it a separate instance from the written contract? If not, how?

Person v. Patient

As a matter of history, your medical record is attached to your patient ID. If you’ve been to many medical institutions, you have many patient IDs. The question is: at any one of them, are there two instances (Person and Patient) or one instance that is both Person and Patient?

Person v. Address

This one is hilarious. Of course a person is separate from his or her address. Except in almost every system ever built, where a person’s address is merely a set of attributes attached to the Person record. When should we make these two distinct instances?

Planned Task v. Completed Task

If we plan a vacation, that is what we would call a planned event. We can book flights, hotels and the like, and continue to add to this instance. When we finally go on the vacation, we’ve created an actual or historical event. Is there one event that changed state from planned to actual, or two events?

Person v. Sole Proprietor

Many independent contractors file tax returns as “Sole Proprietors.” Should we consider the person a separate entity from the Sole Proprietor?

Part v. Catalog Item

Our definition of a Catalog Item is a description of parts at a sufficient level of detail that a buyer would accept any item offered that met the description. The Catalog Item typically has a part number, in retail a UPC. The physical part also has the same UPC. Is the part a different item from the Catalog Item?

Customer v. (Person or Organization)

Is your customer (the person or organization that purchased your product or received your services) your customer, or is there another instance that represents your relationship with that entity? Norms in your industry or limitations of your development environment probably color your answer here more than you think.

Relational technology makes it a relatively unnatural act to have, say, a Person table and an Organization table, and then an Order table with a foreign key to one or the other. It’s far more “natural” in relational to have another table that represents the role of the customer. Even if you have a “party” table (which both the Person and the Organization extend), you have created another instance. There is an id for each entry in the Party table, an id for each entry in the Organization table (with a foreign key to the party), and an id for each entry in the Person table (with a foreign key to the party). Even without the role concept, there is an extra instance there.
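The extra ids in the party pattern can be sketched with hypothetical in-memory tables; note that one real-world person ends up with two identifiers (a Party id and a Person id), and resolving an order’s customer requires hopping through the extra instance.

```python
# Hypothetical relational "party" pattern: Person and Organization both
# extend a Party table, so every entity carries two ids.
party = {1: {}, 2: {}}
person = {10: {"party_id": 1, "name": "Jane"}}
organization = {20: {"party_id": 2, "name": "Acme Corp"}}
orders = {100: {"customer_party_id": 1}}  # orders reference the party id

def order_customer_name(order_id):
    # Resolve an order's customer by joining through the party id.
    pid = orders[order_id]["customer_party_id"]
    for table in (person, organization):
        for row in table.values():
            if row["party_id"] == pid:
                return row["name"]
```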

Having a technology that allows a single id to represent either a Person or an Organization (Object Oriented or Semantic Technology) doesn’t get us completely out of the woods. We could now have the order refer directly to the Person or Organization. The question becomes: should we?

A data modeler from an Australian airline once told me that many of the people riding in an airplane are not customers. The only ones they consider to be customers are those that belong to their frequent flyer program. This makes some sense: they need to keep track of the miles and segments flown, and accumulate them, only for the frequent flyers. Additionally, they incur obligations (to redeem balances for flights), but again only for the frequent flyers.

Pictorially

What we’re talking about is: are there two different things, each with its own identity and properties, that occur as a pair:

[Figure: binary instances]

Or is there really just one thing, and it is the conventions of our speech that make us think there are two things, when really all the properties are on the one thing?

Historical Perspective

Very often design decisions are influenced by the tools that we use to implement solutions. We protest that our designs are independent of target architectures, but years of designing databases and then converting them to relational DBMSs lead to thinking in design terms that translate more easily.

One implication is that relational DBMSs (and most Object Oriented languages) tend to see a class as a template for instances. This suggests that instances with properties not shared by most of the other instances should be shuttled off to another table. This almost always ends up creating additional primary keys in other tables, and therefore binary instances for anything that is in both tables. Designers brought up on relational will be inclined to think of the Person and the Patient as two different instances. (This isn’t wrong so much as it is an indication of how our experience shapes our design choices.)

In an analogous fashion, Object Oriented developers often invoke the Decorator pattern (from the Gang of Four). In the decorator pattern, some functionality is shuffled off to a companion object that performs part of the work. People from this background will tend to see the decorator as a separate individual.

Principles

Our starting point is ten principles. The first principle is: if at all possible, have one instance. The next eight principles suggest circumstances where one instance is not appropriate. The last one, which we call the ambiguity trump, says that even if the principles suggest two instances are needed to model the concept in question, you have a final override: in this domain we don’t care enough about the distinction and are willing to live with the ambiguity.

Principle 1 – Ockham’s Razor – “Entities should not be multiplied needlessly.” The first principle says the benefit of the doubt goes to simplicity. If you can represent the concept adequately with one instance, then by all means do so. This should be the starting point: start by imagining one instance.

A second consideration for sticking with one, even if you are tempted by previous designs, habits, industry norms, etc., is that with a binary pair of objects, each property (predicate) to be attached to the concept must be attached to one or the other. If you find it difficult to decide which of the two a property belongs on, and you end up making arbitrary choices, you should seriously consider sticking with one.

Principle 2 – Cardinality – There are two aspects of the concept, and you’re considering whether to devote an instance to each. One of the trump questions is: can you have more than one of one aspect for each one of the other? This is trickier than it first sounds, because we have fooled ourselves a lot over time with the way we couch the question. One of the clearer cases is Person and Sole Proprietor. Normally “Joe Jones, the plumber” is just “Joe Jones,” and when he files his taxes as a Sole Proprietor, the proprietorship is Joe. Certainly he doesn’t have the firewall he would have had, had he incorporated: “Joe Jones, LLC” is recognized as a separate entity, can contract on its own behalf, and can, at least in theory, declare bankruptcy without bankrupting Joe. So the corporate case is clearly two or more instances. At first it would seem that the sole proprietor should fall back to principle 1. However, it turns out that Joe can have multiple Sole Proprietorships. It doesn’t happen often, but the existence of this case makes the case that there must be something different between Joe and his Sole Proprietorship.
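The cardinality test can be sketched in a few lines of Python. The data is hypothetical (the proprietorship names are invented for illustration); the point is that if one aspect can occur more than once per instance of the other, the two cannot be a single instance.

```python
# Hypothetical data: Joe turns out to have two sole proprietorships.
proprietorships = {"Joe Jones": ["Joe's Plumbing", "Joe's Snow Removal"]}

def cardinality_suggests_two(aspect_map):
    # If any instance of one aspect owns more than one of the other,
    # principle 2 argues for separate instances.
    return any(len(items) > 1 for items in aspect_map.values())
```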

Principle 3 – Potential Instance Separation – Is it possible to separate the two aspects that are potentially being represented by two instances? Can you have the statue without the bronze, or vice versa? (Probably not, and this argues for one.) Can you have a waterway without the river? (It seems a dry riverbed would satisfy the waterway without being a river, which argues for potential separation.) Can some properties logically apply only to one of the pair and not the other?

Principle 4 – Separate Properties – Are there properties that would apply only to one of the instances? For instance, a property like “annual rainfall” would apply to a country’s region but not to the country’s government. Often the different properties are shining a light on something deeper: there are really two different types of things yearning to be separated. In the case of the customer v. Person or Organization, when you start entertaining additional properties (number of segments flown, miles about to expire, etc.) you may realize that the entity with the balances is actually an agreement.

Principle 5 – Behavioral Impact – Do most (all?) real-world behaviors that apply to one also apply to the other? If we end an employee (the employment, really), have we ended (killed) the person? (No wonder so many people cringe at the thought of termination.)

Principle 6 – Inference from Definition – If we have formal definitions for the classes that make sense, and an inference engine infers one to be a subclass of the other, that makes a case for one instance. If the formal definitions put the two in disjoint classes, that is a strong argument for two instances.
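A minimal sketch of this test, with hypothetical class definitions standing in for what a reasoner would derive:

```python
# Hypothetical taxonomy facts derived from formal definitions.
subclass_of = {"Employee": {"Person"}}  # Employee inferred to be a Person
disjoint = {frozenset({"GeoRegion", "GovernmentOrganization"})}

def suggests_one_instance(a, b):
    # One class subsumes the other: a single instance can carry both types.
    return b in subclass_of.get(a, set()) or a in subclass_of.get(b, set())

def requires_two_instances(a, b):
    # Disjoint classes can never share an instance.
    return frozenset((a, b)) in disjoint
```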

Principle 7 – Identity Function – Is the way we establish whether we already have an instance different for one or the other of these? The identity function is the set of properties we use to figure out whether we already have a particular instance in our database. For instance, if the identity function for Person is SSN + Date of Birth, and so is the identity function for Employee, that argues for one instance. (It may be that the identity functions are wrong, but it should at least give us pause to reflect.)
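The identity-function test can be sketched as a simple find-or-create over hypothetical records; when Person and Employee share the same identity function, the same instance comes back both times.

```python
# Hypothetical identity function shared by Person and Employee.
IDENTITY = ("ssn", "date_of_birth")
db = {}

def find_or_create(record):
    # Use the identity function to decide whether the instance already exists.
    key = tuple(record[f] for f in IDENTITY)
    return db.setdefault(key, record)

person = find_or_create({"ssn": "123-45-6789", "date_of_birth": "1970-01-01"})
employee = find_or_create({"ssn": "123-45-6789", "date_of_birth": "1970-01-01",
                           "salary": 50000})
# Same identity function, same key values: one instance, not two.
```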

Principle 8 – Granularity – Sometimes the two instances are trying to represent different levels of specificity. For instance, the difference between a Product Model and a Catalog Item may be level of detail. If there are as many Product Models as Catalog Items (or so little variation offered), then the Product Model and Catalog Item are at the same granularity and could be considered one instance. If, however, they are at different levels of detail, that makes the case for two instances.

Principle 9 – Temporal Difference – If one instance can end independent of the other, that is, if they have different lifetimes, that suggests two instances.

Principle 10 – Tolerating Ambiguity – There are cases where the above analysis suggests that there should be (indeed, semantically there are) two instances, but in our domain we really don’t care. For instance, we may be convinced that the GeoRegion of a country is different from the organization that governs it, but for our application or domain, which will not exercise any of the properties that would highlight that difference, we may say we really don’t care. In this case we suggest creating a supertype of the two classes and instantiating the supertype. So, for instance, you may create the class GeoPoliticalEntity as the union of GeoRegion and GovernmentOrganization, and make your instances of the supertype. What this does is twofold:

  • If you later decide that you do need to make a distinction, very few things you’ve built to date will be adversely affected. Anything that didn’t care whether you were talking about a region or a government will still not care after you make that distinction.
  • If you have to interface with applications or domains that do make the distinction you will have what you need to incorporate their distinctions without upsetting your part of the system.
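Principle 10 can be sketched with hypothetical classes: instantiate the supertype now, and anything written against the supertype keeps working if the distinction is introduced later.

```python
# Hypothetical supertype covering both aspects we don't yet distinguish.
class GeoPoliticalEntity:
    def __init__(self, name):
        self.name = name

# Introduced later, if the distinction ever matters:
class GeoRegion(GeoPoliticalEntity): pass
class GovernmentOrganization(GeoPoliticalEntity): pass

def describe(e: GeoPoliticalEntity) -> str:
    # Code that doesn't care about the distinction works either way.
    return e.name

ukraine = GeoPoliticalEntity("Ukraine")    # commits to neither subtype
region = GeoRegion("Ukraine (territory)")  # still accepted by describe()
```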

Re-examining the examples in light of the principles

Let’s return to the examples we introduced in the beginning and see if the principles shine any light on them. Note: there will still be situations and domains that come to different conclusions, but we think these will be the conclusions informed by the above principles.

Design Example | Proposal (one instance or two) | Principled Evidence

Statue v. Bronze | 1 | Principle 1: if you steal the statue you’ve stolen the bronze; they’re really inseparable. Also principle 7: the way we establish the identity of the item (say, an RFID tag on the statue) also identifies the bronze.

Person v. Employee | 2 for employers, 1 for unemployment | Principle 2 (you can have two jobs at a time), principle 4 (your employment has a salary and seniority, you don’t; you have a birthday, your employee role doesn’t), and principle 9 (your job can end before you do) argue for 2. However, the Unemployment Division’s point of view argues for one: a formal definition of someone who is employed (has at least one job) argues for one by principle 6, and the cardinality argument works the other way (your second job doesn’t alter the unemployment rate).

The Restrictions of Law v. The Text of Statute | 2 | Principle 8, granularity, and principle 2, cardinality. When we start to interpret the law and get it to the point where systems can make at least some initial determination of the legality of an action, we find that a given law is many restrictions, at many levels of detail.

A Creative Work v. A Document | 2 | Principle 2 (many derivatives from a single work).

Government Organization v. Region Governed | 2 | Principle 3 (we can separate the government from the land, and the land area can change without changing the government; sorry, Ukraine) and principle 4 (there are properties, such as rainfall, that apply to one and not the other).

Specification v. Model | 2 | Principle 8: in most cases the specification is at a lower level of detail than the product model (color is typically not part of the product model but is typically in the specification, and in most product domains different colors of the same product are not equally interchangeable).

Position v. Incumbent | 2 | Principle 9 (the position usually outlives the incumbent) and occasionally principle 2 (you can have co-presidents: two people in one position).

Actor v. Role | 2 | Principle 2 (in Greater Tuna, two actors played all the roles).

Event v. Time Interval | 1 | Principle 6 (if a time interval is defined as having a start and an end, and so is an event, then the event is a time interval).

Diagnosis v. Disease | 2 | Even though they initially coexist, they soon develop their own timelines (principle 9) and their own properties (principle 4).

Person v. Legal Person | 1 | Principle 1: the person is the legal person; there isn’t another entity to hide behind, and none of the other principles argues for 2. Legal Person is a type of Person, except where it means Organization, in which case they are separate because of principle 6 (they are disjoint and can’t be the same).

Organization v. Organization in Role | 1, unless something formal is set up to establish the extra role | Even though there is a bit of temptation from principle 9, it isn’t convincing. If you participate as a buyer in one transaction and a seller in another, are you three entities (yourself, you the buyer, and you the seller)? No, not really; only if something formal is set up. In the airline industry the difference between a customer (has a role, and therefore 2 entities) and a passenger (doesn’t, so 1 entity) is the frequent flyer agreement, where miles are accumulated, status levels earned, etc.

Contract Document v. Financial Agreement | 1 | Principle 1: the document is a representation of the agreement. Where there are cardinality issues (the contract/agreement contains many obligations), the cardinality is true of both in the same way (if the contract has 6 obligations, so does the agreement).

Person v. Patient | 1 | Principle 1. Unlike the cat with nine lives, the person with 9 patient identities will die if any of them dies, and will have drug/drug interactions regardless of which patient you give the drugs to.

Person v. Address | 2 | Despite the pull of principle 1, addresses are not attributes of people. Addresses are attributes of the buildings that people live and work in, which are obviously separate entities.

Planned Task v. Completed Task | 1 for personal, 2 for hospital or project management | Principle 2 (cardinality) trumps for any organization that has to keep track of multiple appointments for one visit, or multiple reschedulings of the same task. Where that doesn’t apply (say, your vacation plan or personal to-dos) you can have one task that transitions from planned to actual merely by being done. In a way this is principle 10: there may be a difference in personal task management, but we just don’t care.

Person v. Sole Proprietor | 2 | Principle 2, cardinality: since one person can have multiple sole proprietorships, we need to allow for two.

Part v. Catalog Item | 2 | Principle 4: while they both appear to have some of the same characteristics (weight, for instance), they aren’t really the same. That is a structural similarity, not a semantic one: a catalog containing parts that weigh thousands of pounds can be picked up with a single hand.

Customer v. (Person or Organization) | 1 unless there is a separate agreement, then 2 | Principle 4: it is the existence of a separate agreement (separate from the individual order) that is the second instance. Really the second instance isn’t “customer” but “customer agreement.” In the absence of a separate agreement (master agreement, frequent shopper agreement, etc.) there is only need for one.