Dave McComb

Building an Ontology with LLMs

October 16, 2025June 30, 2025 by Dave McComb

Building an Ontology with LLMs

We implement Enterprise Knowledge Graphs for our clients. One of the key skills in doing so is ontology modeling. One might think that with the onslaught of ChatGPT and the resulting death knell of professional services, we’d be worried. We’re not. We are using LLMs in our practice, and we are finding ways to leverage them in what we do but using them to design ontologies is not one of the use cases we’re leaning on.

A Financial Reporting Ontology

Last week Charlie Hoffman, who is an accomplished accountant and CPA, showed me the financial reporting ontology he had built with the help of an LLM. As so many of us are these days, he was surprised at the credible job it had done in so little time. It loaded into Protégé, the reasoner ran successfully (there weren’t any real restrictions so that isn’t too hard to pull off). It created a companion SHACL file. In the prompt, he asked it to base it on gist, our upper ontology, and sure enough, there was a gist namespace (an old one, but still it was a correct one) with the requisite gist: prefix. It built a bunch of reasonable-sounding classes and properties in the gist namespace (technically, namespace squatting, but we haven’t gotten very far on ethical AI yet).

Now I look at this and think, while it is a clever trick, it would not have helped me build a financial reporting ontology at all (a task I have been working on in my spare time, so I would have welcomed the help if there was any). I would have tossed out every line. There wasn’t a single line in the file I would have kept.

One Click Ontology Building

But here’s where it gets interesting. A few months ago, at the KM World AI Conference, one of my fellow panelists, Dave Hannibal of Squirro, stated confidently that within a year there would be a one-click ontology builder. As I reflect on it, he was probably right. And I think there is a market for that. I overheard attendees saying, “even if the quality isn’t very good, it’s a starting point, and we need an ontology to get started.”

An old partner and mentor once told me, “Most people are better editors than authors.” What he meant was: give someone a blank sheet of paper and they struggle to get started, but give them a first draft and they tear into it.

The Zeitgeist

I think the emerging consensus out there is roughly as follows:

GraphRAG is vastly superior to prompt engineering or traditional RAG (it’s kind of hard for me to call something “traditional” that’s only a year old), in terms of reigning in LLM errors and hallucinations.
In order to do graphRAG you need a Knowledge Graph, preferably a curated Enterprise Knowledge Graph.
A proper Enterprise Knowledge Graph has an Ontology at its core.
Ontology modeling skills are in short supply and therefore are a bit of a bottleneck to this whole operation.
Therefore, getting an LLM to create even a lousy ontology is a good starting point.

This seems to me to be the zeitgeist as it now exists. But I think the reasoning is flawed and it will lead most of its followers down the wrong path.

The flawed implicit assumption

You see, lurking behind the above train of thought is an assumption. That assumption is that we need to build a lot of ontologies. Every project needs an ontology.

There are already tens of thousands of open-source ontologies “out there” and unknowable multiples of that on internal enterprise projects. The zeitgeist seems to suggest that with the explosion of LLM powered projects we are going to need orders of magnitude more ontologies. Hundreds of thousands, maybe millions. And our only hope is automation.

The Coming Ontology Implosion

What we need are orders of magnitude fewer ontologies. You really see the superpowers of ontologies when you have the simplest possible expression of complex concepts in an enterprise. Small is beautiful. Simpler is better. Fewer is liberating.

I have nearly 1000 ontologies on our shared drive that I’ve scavenged over the years (kind of a hobby of mine). Other than gist, I’d say there are barely a handful that I would rate as “good.” Most range from distracting to actively getting in the way of getting something done. And this is the training set that LLMs went to ontology school on.

Now I don’t think the world has all the ontologies it needs yet. However, when the dust settles, we’ll be in a much better place the fewer and simpler the remaining ontologies are. Because what we’re trying to do is negotiate the meaning of our information, between ourselves and between our systems. Automating the generation of ontologies is going to slow progress down.

How Many Ontologies Do We Need?

Our work with a number of very large as well as medium-sized firms has convinced me that, at least for the next five years, every enterprise will need an Enterprise Ontology. As in 1. This enterprise ontology that some of our clients call their “core ontology” is extended into their specific sub-domains.

But let’s look at some important numbers.

gist, our starter kit (which is free and freely available on our web site) has about 100 classes and almost that many properties, for a cognitive load of 200 concepts.
When we build enterprise ontologies, we often move many distinctions into taxonomies. What this does is shift a big part of the complexity of business information out of the structure (in the ontology and the shapes derived from the ontology) and into a much simpler structure that can be maintained by subject matter experts and has very little chance of disrupting anything that is based on the ontology. It is not unusual to have many thousands of distinctions in taxonomies, but this complexity does not leak into the structure or complexity of the model.
When we work with clients to build their core ontology, we often double or triple the number of concepts that we started with in gist, to 400-600 total concepts. This gets the breadth and depth needed to provide what we call the scaffolding to include all the key concepts in their various lines of businesses and functions.
Each department often extends this further, but it continues to astound us how little extension is often needed to cover the requisite variety. We have yet to find a firm that really needs more than about 1000 concepts (classes and properties) to express the variety of information they are managing.
A well-designed Enterprise Ontology (a core and a series of well-managed extensions) will have far fewer concepts to master than even an average-sized enterprise application database schema. Orders of magnitude fewer concepts than a large packaged application, and many, many orders of magnitude fewer than the sum total of all the schemas that have been implemented.

We’re already seeing signs of a potential further decrease. Most of the firms in the same industry share about 70-80% of their core concepts. Industry ontologies will emerge. I really mean useful ones; there are many industry ontologies out there, but we haven’t found any useful ones yet. As they emerge, and as firms move to specializing their shared industry ontology, they will need even fewer new unique concepts.

What we need are a few thousand well-crafted concepts that information providers and consumers can agree on and leverage. We currently have millions of concepts in the
many ontologies that are out there, and billions of concepts in the many database schemas that are out there.

We need a drastic reduction in quantity and a ramp up in quality if we are to have any hope of reigning in the complexity we have created. LLMs used for ontology building promise a major distraction to that goal. Let’s use LLMs instead for things they are good at, like extracting information from text, finding complex patterns in noise, and generating collateral content at wicked rates to improve the marketing department’s vanity metrics.

The year of the Knowledge Graph (2025)

October 16, 2025June 18, 2025 by Dave McComb

The year of the Knowledge Graph (2025)

There are a lot of signals converging on this being the year of the Knowledge Graph.

Before we get too carried away with this prognosis, let’s review some of the previous candidates for year of the Knowledge Graph, and see why they didn’t work out.

2001

Clearly the first year of the Knowledge Graph was 2001, marked by the unveiling of the Semantic Web by Tim Berners-Lee, James Hendler and Ora Lassila in Scientific American¹. This seemed like it was the year of the Knowledge Graph (even though the term “Knowledge Graph” wouldn’t come into widespread use for over a decade). They were talking about the same technology, even the exact same standards.

What made it especially seem like it was the year of the Knowledge graph was that it was only ten years earlier that Tim Berners-Lee had unleased the World Wide Web, and it seemed like lightning was going to strike twice. It didn’t. Not much happened publicly for the next decade. Many companies were toiling in stealth, but there were no real breakthroughs.

2010

Another breakthrough year was 2010, with the launching of DBPedia as the hub of the Linked Open Data movement. DBPedia came out of the Free University of Berlin, where they had discovered that the info boxes in Wikipedia could be scraped and turned into triples with very little extra work. By this point the infrastructure had caught up to the dream a bit, there were several commercial triple stores, including Virtuoso which hosted DBPedia.

The Linked Open Data movement grew to thousands of RDF linked datasets, many of them publicly available. But still it failed to reach escape velocity.

2012

Another good candidate is 2012 with the launch of the Google Knowledge Graph. Google purchased what was essentially a Linked Open Data reseller (MetaWeb) and morphed it into what they called the Google Knowledge Graph, inventing and cementing the name at the same time. Starting in 2012 Google began the shift from providing you with pages on the web where you might find the answers to your questions, to directly answering them from their graph.

Microsoft followed suit almost immediately picking up a Metaweb competitor, Powerset, and using it as the underpinning of Bing.

Around this same time, June of 2009 Siri was unveiled at our Semantic Technology Conference. This was about a year before Apple acquired Siri, Inc., the RDF based spin off from SRI international and morphed it into their digital assistant of the same name.

By the late 20teens, most of the digital native firms were graph based. Facebook is a graph, and in the early days had an API where you could download RDF. Cambridge Analytics abused that feature, and it got shut down, but Facebook remains fundamentally a graph. LinkedIn adopted an RDF graph and morphed it to their own specific needs (two hop and three hop optimizations) in what they call “Liquid.” AirBnB relaunched in 2019 on the back of a Knowledge Graph to become an end-to-end travel platform. Netflix calls their Knowledge Graph StudioEdge.

One would think about Google’s publicity and the fact that they were managing hundreds of billions of triples, and with virtually all the digital natives on board, the enterprises would soon follow. But they weren’t. A few did to be sure, but most did not.

2025

I’ve been around long enough to know that it’s easy to get worked up every year thinking that this might be the big year, but there are a lot of dominos lining up to suggest that we might finally be arriving. Let’s go through a few (and let me know if I’ve missed any).

It was tempting to think that enterprises might follow the FAANG lead (Facebook, Amazon, Apple, Netflix, and Google) as they have done with some other technologies, but in this case they have not yet followed. Nevertheless, some intermediaries, those that tend to influence Enterprises more directly seem to be on the bandwagon now.

Service Now

A few years ago, Service Now rebranded their annual event as “Knowledge 202x²” and this year acquired Moveworks and Data.World. Gaurav Rewari, an SVP and GM said at the time: “As I like to say, this path to agentic ‘AI heaven’ goes through some form of data hell, and that’s the grim reality.”

SAP

As SAP correctly pointed out in the October 2024 announcement³of the SAP Knowledge Graph As they said in the announcement, “The concept of a knowledge graph is not new…” Earlier versions of HANA supported openCyber as their query language, the 2025 version brings RDF and OWL to the forefront, and therefore top of mind for many enterprise customers.

Samsung

Samsung recently acquired the RDF triple store vendor RDFox⁴. Their new “Now Brief” (a personal assistant which integrates all the apps on your phone via the in-device knowledge

²https://www.servicenow.com/events/knowledge.html

³https://ignitesap.com/sap-knowledge-graph/

⁴https://news.samsung.com/global/samsung-electronics-announces-acquisition-of-oxford-semantic technologies-uk-based-knowledge-graph-startup

graph) is sure to turn some heads. In parallel this acquisition has launched Samsung’s Enterprise Knowledge Graph project to remake the parent company’s data landscape.

AWS and Amazon

Around 2018 Amazon “acqui-hired” Blazegraph, an open-source RDF graph database, and made it the basis of their Neptune AWS graph (offering the option of RDF graph or Labeled Property Graph, and working on a grand unification of the two graph types under the banner of “OneGraph”).

As significant as offering a graph database as a product, is their own internal “dogfooding.” Every movement of every package that Amazon (the eCommerce side) ships is tracked by the Amazon Inventory Graph.

graphRAG

Last year everyone was into “Prompt Engineering” (no, software developers did not become any more punctual, it was a job for a few months to learn how to set up the right prompts for LLMs). Prompt Engineering gave way to RAG (Retrieval-Augmented Generation) which extended prompting to include additional data that could be used to supplement and LLMs response.

A year in and RAG was still not very good at inhibiting LLMs hallucinatory inclinations. Enter graphRAG. The underlying limitation of RAG is that most of the data that could be queried to supplement a prompt, in the enterprise, is ambiguous. There are just too many sources, too many conflicting versions of the truth. Faced with ambiguity, LLMs hallucinate.

GraphRAG starts from the assumption (only valid in a handful of companies) that there is a grounded set of truth that has been harmonized and curated in the enterprise knowledge graph. If this exists it is the perfect place to supply vetted information to the LLM. If the enterprise knowledge graph doesn’t exist, this is an excellent reason to create one.

CIO Magazine

CIO.com magazine proclaims that Knowledge Graphs are the missing link in Enterprise AI⁵To quote from this article: “To gain competitive advantage from gen AI, enterprises need to be able to add their own expertise to off-the-shelf systems. Yet standard enterprise data stores aren’t a good fit to train large language models.”

CIO Magazine has a wide following and is likely to influence many decision makers.

Gartner

Gartner have nudged Knowledge Graph into the “Slope of Enlightenment”⁶

⁵https://www.cio.com/article/3808569/knowledge-graphs-the-missing-link-in-enterprise-ai.html ⁶https://www.gartner.com/en/articles/hype-cycle-for-artificial-intelligence

Summary

Those of you who know me know I’m mostly an anti-hype kind of guy. We, at Semantic Arts, don’t benefit from hype, as many software firms do. Indeed, hype generally attracts lower quality competitors and generates noise. These are generally more trouble than they are worth.

But sometimes the evidence is too great. The influencers are in their blocks, and the race is about to begin. And if I were a betting man, I’d say this is going to be the year that a lot of enterprises wake up and say, “we’ve got to have an Enterprise Knowledge Graph (whatever that means).”

The Case for Enterprise Ontology

October 17, 2025April 7, 2025 by Dave McComb

The Case for Enterprise Ontology

I was asked by one of our senior staff why someone might want an enterprise ontology. From my perspective, there are three main categories of value for integrating all your enterprise’s data into a single core:

Economy
Cross Domain Use Cases
Serendipity

Economy

For many of our clients there is an opportunity that stems from simple rationalization and elimination of duplication. Every replicated data set incurs costs. It incurs costs in the creation and maintenance of the processes that generate it. But the far bigger costs are associated with data reconciliation. Inevitably each extract and population create variation. These variations add up, triggering additional research to find out why there are slight differences between these datasets.

Even with ontological based systems, these difference creep in. We know that many of our clients ontological based domains contain an inventory (or a sub inventory). Employees are a good example. These sub-directories show up all over the place. There is a very good chance each domain has their own feed from HR. They may be fed from the same system, but as is often the case, each was directed to a warehouse or a different system for their source. Even if they came from the same source – the pipeline, IRI assignment and transformation are all likely different.

Here’s an illustration from a large bank associated with records retention within their legal department. One part of this project involved getting a full directory of all the employees into the graph. Later on we were working with another group on the technical infrastructure, and they wanted to get their own feed from HR to convert into triples. Fortunately we were able to divert them by pointing out that there was already a feed that provided curated employee triples.

They accepted our justification but asked … “can we have a copy of those triples to conform to our needs.” This gave us the opportunity to explain there is no conforming. Each triple is an individual asserted fact with its own provenance. You either accept it or ignore it. There really isn’t anything to conform. There is no need to restructure.

At first glance all their sub domains seemed to stand alone, but the truth is there is a surprising amount of overlap between them. There were many similar but not identical definitions of “business units.” There were several incompatible ways to describe geographic aggregation. Many different divisions dealt with the same counterparties or with the same products. And it is only when the domains are unified that most of these differences come to light.

Just unifying and integrating duplicate data sets provided economic justification for the project. We know of another company that justified their whole graph undertaking simply from the rationalization and reduction of subscriptions to the same or similar datasets from different parts of the business.

The good news is that harmonizing ontologically based systems is an order of magnitude cheaper than traditional systems.

Cross Domain Use Cases

Reuse of concepts is one of the most compelling reasons for an enterprise ontology. Some of the obvious cross-domain use cases from some of our pharmaceutical clients include:

Translation of manufacturing process from bench to trial to full scale • Integration of Real-World Evidence and Adverse events
Collapsing submission time for regulatory reporting
Clinical trial recruiting
Cross channel customer integration

Some of the best opportunities come from combining previously separate sub-domains. Sometimes you can know this going into a project. But sometimes you don’t discover the opportunity until you are well into the project. Those are the ones that fall into the serendipity category.

Serendipity

I’ve recently come to the realization that the most important use cases for unification might in fact be serendipity. That is, the power might be in unanticipated use cases. I’ll give some examples and then we’ll point you to a video from one of Amazon’s lead ontologists who came to the same conclusion.

Schneider-Electric

We did a project for Schneider-Electric (see case study). We constructed the scaffolding of their enterprise ontology and then drilled in on their product catalog and offering. Our initial goal was to get their 1 million parts into a knowledge graph and demonstrate that it was as complete and as detailed as their incumbent system. At the end of the project we had all their products in a knowledge graph, with all their physical, electrical, thermal and many other characteristics defined and classified.

Serendipity 1: Inherent Product Compatibility

We interviewed product designers to find out the nature of product compatibility. It was easy to write a different type of rule (using SPARQL) with our greatly simplified ontology that persisted the “inherent” compatibility of parts into the catalog. By doing this it reversed the sequence of events. Previously, because the compatibility process was difficult and time-consuming, they would wait until they were ready to sell a line of products in a new market before beginning the compatibility studies. Not knowing the compatibility added months into their time-to-market. In the new approach, the graph knew which products were compatible before the decision to offer them to new markets.

Serendipity 2: Standards Alignment

Schneider were interested in aligning their product offerings with the standard called eCl@ss which has over 15,000 classes and thousands of attributes. It is a complex mapping process, which had been attempted before but abandoned. By starting with the extreme simplification of the ontology (46 classes and 36 properties out of the several hundred in the enterprise ontology), working toward the standard was far easier and we had an initial map completed in about two months.

Serendipity 3: Integrating Acquisitions

Schneider had acquired another electrical part manufacturer, Clipsal. They asked if we could integrate the Clipsal catalogue with the new graph catalogue. Clipsal also had a complex product catalogue. It was not as complex as Schneider’s, but it was complex and structured quite differently.

Rather than reverse engineering the Clipsal catalogue we just asked for their data engineers to point us to where the 46 classes and 36 properties were in the catalogue. Once we’d extracted all that we asked if we were missing anything. Turns out there were a few items, which we added to the model.

The whole exercise took about six weeks. At the end of the project we were reviewing the Schneider-Electric page in Wikipedia and found that they had acquired Clipsal over ten years prior. When we asked why they hadn’t integrated their catalogue in all the time they responded that it was “too hard.”

All three of these use cases are of interest, because they weren’t the use cases we were hired to solve but only manifested when the data was integrated into a simple model.

—————————–

Amazon Story of Serendipity

This video of Ora Lassila is excellent and inspiring.

https://videolectures.net/videos/iswc2024_lassila_web_and_ai

If you don’t have time to watch to the whole thing, skip into minute 14:40 where he describes the “inventory graph” for tracking packages in the Amazon ecosystem. They have 1 Trillion triples in the graph and the query response is far better than it was in their previous systems. At minute 23:20 he makes the case for serendipity.

SIX AXES OF DECOUPLING

October 17, 2025April 7, 2025 by Dave McComb

SIX AXES OF DECOUPLING

Loose coupling has been a Holy Grail for systems developers for generations.

The virtues of loose coupling have been widely lauded, yet there has been little description about what is needed to achieve loose coupling. In this paper we describe our observations from projects we’ve been involved with.

Coupling

Two systems or two parts of a single system are considered coupled if a change to one of the systems unnecessarily affects the other system. So for instance, if we upgrade the version of our database and it requires that we upgrade the operating system for every client attached to that database, then we would say those two systems or those two parts of the system are tightly coupled.

Coupling is widely understood to be undesirable because of the spread of the side effects. As systems get larger and more complex, anything that causes a change in one part to affect a larger and larger footprint in the entire system is going to be expensive and destabilizing.

Loose Coupling/Decoupling

So, the converse of this is to design systems that are either “loosely coupled” or “decoupled.” Loosely coupled systems do not arise by accident. They are intentionally designed such that change can be introduced around predefined flex points.

For instance, one common strategy is to define an application programming interface (API) which external users of a module or class can use. This simple technique allows the interior of the class or module or method to change without necessarily exporting a change in behavior to the users.

The Role of the Intermediate

In virtually every system that we’ve investigated that has achieved any degree of decoupling, we’ve found an “intermediate form.” It is this intermediate form that allows the two systems or subsystems not to be directly connected to each other.

As shown in Figure (1), they are connected through an intermediary. In the example described above with an API, the signature of the interface is the intermediate.

What Makes a Good Intermediary?

An intermediary needs several characteristics to be useful:

It doesn’t change as rapidly as its clients. Introducing an intermediate that changes more frequently than either the producer or consumer of the service will not reduce change traffic in the system. Imagine a system built on an API which changes on a weekly basis. Every producer and consumer of the services that use the API would have to change along with the API and chaos would ensue.

It is nonproprietary. A proprietary intermediary is one that is effectively owned and controlled by a single group or small number of vendors. The reason proprietary intermediaries are undesirable is because the rate of change of the intermediary itself has been placed outside the control of the consumer. In many cases to use the service you must adopt the intermediary of the provider. It should also be noted that in many cases the controller of the proprietary standard has incentive to continue to change the standard if that can result in additional revenue for upgrades and the like.

It is evolvable. It’s highly unlikely that anyone will design an intermediate form that is correct for all time from the initial design. Because of this, it’s highly desirable to have intermediate forms that are evolvable. The best trait of an evolvable intermediate is that it can be added on to, without invalidating previous uses of it. We sometimes more accurately call this an accretive capability, meaning that things can be added on incrementally. The great advantage of an evolvable or accretive intermediary is that if there are many clients and many suppliers using the intermediary they do not have to all be changed in lockstep, which allows many more options for upgrade and change.

It is simple to use. An intermediate form that is complex or overly difficult to use will not be used and either other forms will be adopted which may be more various and different or the intermediate form will be skipped altogether and the benefit lost.

Shared Intermediates

In addition to the simple reduction in change traffic from having the intermediate be more stable than the components at either end, we also have an advantage in most cases where the intermediate allows re use of connections. This has been popularized in the Systems Integration business where people have pointed out time and time again that creating a hub will drastically reduce the number of interfaces needed to supply a system.

In Figure (2), we have an example of what we call the traditional interface math, where the introduction of a hub or intermediate form can drastically reduce the number of interconnections in a system.

People selling hubs very often refer to this as: (n * n – 1) / 2 or sometimes simply the n2 problem. While this makes for very compelling economics, our observation is that the true math for this style of system is much less generous but still positive. Just because two systems might be interconnected does not mean that they will be. Systems are not completely arbitrarily divided and therefore not every interconnection need be accounted for.

Figure (3) shows a more traditional scenario where, in the case on the left without a hub, there are many but not an exponential number of interfaces between systems. As the coloring shows, if you change one

of those systems, any of the systems it touches may be affected and should at least be reviewed with an impact analysis. In the figure on the right, when the one system is changed, the evaluation is whether the effect spreads beyond the intermediary hub in the center. If it does not, if the system continues to obey the dictates of the intermediary form, than the change effect is, in fact, drastically reduced.

The Axes of Decoupling

We found in our work that, in many cases, people desire to decouple their systems and even go through the effort of creating intermediate forms or hubs and then build their systems to connect to those intermediate forms. However, as the systems evolve, very often they realize that a change in one of the systems does, in fact, “leak through” the abstraction in the intermediate and affects other systems.

In examining cases such as this, we have determined that there are six major considerations that cause systems that otherwise appear to be decoupled to have a secret or hidden coupling. We call these the axes of decoupling. If a system is successfully decoupled on each of these axes, then the impact of a change in any one of the systems should be greatly minimized.

Technology Dependency

The first axis that needs to be decoupled, and in some ways the hardest, is what we call technology dependency. In the current state of the practice, people attempt to achieve integration, as well as economy of system operation, by standardizing on a small number of underlying technologies, such as operating systems and databases. The hidden trap in this is that it is very easy to rely on the fact that two systems or subsystems are operating on the same platform. As a result, developers find it easy to join a table from another database to one in their own database if they find that to be a convenient solution. They find it easy to make use of a system function on a remote system if they know that the remote system supports the same programming languages, the same API, etc.

However, this is one of the most pernicious traps because as a complex system is constructed with more and more of these subtle technology dependencies, it becomes very hard to separate out any portion and re – implement it.

The solution to this, as shown in Figure (4), is to introduce an intermediate form that ensures that a system does not talk directly to another platform. The end result is that each application or subsystem or service can run on its own hardware, in its own operating system, using its own database management system, and not be affected by changes in other systems. Of course, each system or subsystem does have a technological dependency on the technology of the intermediary in the middle. This is the trade-off; you introduce the dependence on one platform in exchange for being independent of n other platforms. In the current state-of-the-art, most people use what’s called an integration broker to achieve this. An integration broker is a product such as IBM’s WebSphere or TIBCO or BEA, which allows one application to communicate with another without being aware of, or care, what platform the second application runs on.

Destination Dependency

Even when you’ve successfully decoupled the platforms the two applications rely on, we’ve sometimes observed problems where one application “knows” of the existence and location of another application or service. By the way, this will become a very “normal problem” as Web services become more popular because the default method of implementing Web services has the requester knowing of the nature and destination of the service.

In Figure (5), we show a little more clearly through an example where two systems have an intermediary. In this case, the distribution and shipping application would like to send messages to a freight application, for instance to get a freight rating or to determine how long it would take to get a package somewhere. Imagine if you were to introduce a new service in the freight area that in some cases handled international shipping, but we continue to do domestic the old way. If we had not decoupled these services, it is highly likely that the calling program would now need to be aware of the difference and make a determination in terms of what message to send, what API to call, where to send its request, etc. The only other defense would be to have yet another service that accepted all requests and then dispatched them; but this is really an unnecessary artifact that would have to be added into a system where the destination intermediary had not been designed in.

Syntax Intermediary

Classically in an API, the application programming interface defines very specifically the syntax of any message sent between two systems. For instance, the API specifies the number of arguments, their order, and their type; and any change to any of those will affect any of the calling programs. Also EDI (electronic data interchange) relies very much on a strict syntactical definition of the message being passed between partners.

In Figure (6), we show a small snippet of XML, which has recently become the de facto syntactic intermediate form. Virtually all new initiatives now use XML as the syntactic lingua franca. As such, any two systems that communicate through XML at least do not have to mediate differences at that syntactic level. Also, fortunately, XML is a nonproprietary standard and, at least to date, has been evolving very slowly.

Semantic Intermediary

Where systems integration projects generally run into the greatest amount of trouble is from semantic differences or ambiguities in the meaning of the information being passed back and forth. Traditionally, we find that developers build interfaces and run them and test them against live data, and then find that the ways in which the systems have been used does not conform particularly well to the spec. Additionally, in each case the names and therefore the implied semantics of all the elements used in the interface are typically different from system to system and must be reconciled. The n2 way of resolving this is to reconcile every system to every other system, a very tedious process.

There have been a few products and some approaches, as we show very simply and schematically in Figure (7), that have attempted to provide a semantic intermediary. Two that we’re most familiar with are

Condivo and Unicorn. Over the long-term, the intent of the Semantic Web is to build shared ontologies in OWL, which is the Web Ontology Language and a derivative of RDF and DAML+OIL. In the long-term,

it’s expected that systems will be able to communicate shared meaning through mutually committed ontologies.

Identity Intermediary

A much subtler coupling that we’ve found in several systems is in the use of identifiers. Most systems have identifiers for all the key real world and invented entities that they deal with. For instance, most systems have identifiers for customers, patients, employees, sales orders, purchase orders, production lines, etc. All of these things must be given unique unambiguous names. That is not the problem; the problem is that each system has a tendency to create its own identifiers for items that are very often shared. In the real world, there is only one instance of many of these items. There is only one of each of us as individuals, one each for each building, one each for each corporation, etc. And yet each system tends to create its own numbering system and when it discovers a new customer it will give it the next available customer number. In order to communicate unambiguously with the system that’s done this, to date the two main approaches have been either to force universal identifiers onto a large number of systems or to store other people’s identifiers in your own system. Both of these approaches are flawed and do not scale well. In the case of the universal identifier, besides having all the problems of attempting to get coverage on the multiple domains, there is the converse problem of privacy. Once people, for instance, are given universal identifiers it’s very hard to keep information about individuals anonymous. The other approach of storing others’ identifiers in your systems does not scale well because as the number of systems you must communicate with grows, the number of other identifiers that you must store also grows. In addition, there is the problem of being notified when any changes to these identifiers occur.

In Figure (8), we outline a new intermediary, which is just beginning to be discussed as a general-purpose service, variously called the identity intermediary or the handle intermediary. The reason we’ve begun shifting from calling it an identity intermediary is because the security industry has been referring to identity systems and it does not mean exactly the same thing as what we mean here. Essentially, this is a service where each subscribing system recognizes that it may be dealing with an entity that any of the other systems may have previously dealt with. So this has a discovery piece that systems can discover if they’re dealing with, communicating with, or aware of any entity that has already been identified in the larger federation. It also acts as a cross reference so that each system need not keep track of all the synonyms of identifiers or handles to all the other systems. Figure (8) shows a very simple representation of this with two very similar individuals that need to be identified separately. To date, the only system that we know of that covers some of this territory is called ChoiceMaker, but it is not configured to be used in exactly the manner that we show here.

Nomenclature Intermediary

Very similar to the identity or handle intermediary is the nomenclature intermediary. We separate it because typically, with the identity intermediary, we’re dealing with discovered real world entities and the reason we have synonyms is because multiple different systems are “discovering” the same physical real-world item.

In the case of the nomenclature intermediary system, we’re dealing with an invented categorization system. Sometimes categorization systems are quite complex. In the medical industry we have SNOMED, HCPCS, and the CPT nomenclature. But also we have incredibly simple, and very often internally made up, classification systems, so in every case where we create a code file where we might have seven types of customer or orders or accidents or whatever that we tend to codify in order to get more uniformity, these are nomenclatures. What is helpful about having intermediary forms is that it enables multiple systems to either share or map to a common set of nomenclatures or codes.

Figure (9) shows a simple case of how the mapping could be centralized. Again, this is another example where over the long term, developments in Semantic Web may be a great help and may provide clearinghouses for the communication between disparate systems. In the meantime, the only example that we’re aware of where a company has internally devoted a lot of attention to this is the Allstate Insurance Co., which has built what they call a domain management system where they have found, catalogued, and cross-referenced over 6,000 different nomenclatures that are in use within Allstate.

Summary

Loose coupling has been a Holy Grail for systems developers for generations. There is no silver bullet that will slay these problems; however, as we have discussed in this paper, there are a number of specific disciplined things that we can look at as developers, and as we continue to pay attention to these, we will make our systems more and more decoupled, and therefore easier and easier to evolve and change.

Documents, Events and Actions

October 17, 2025March 21, 2025 by Dave McComb

Documents, Events and Actions

We have recently been reexamining the weird relationship of “documents” to “events” in enterprise information systems and have surfaced some new insights that are worth sharing.

Documents and Events

Just to make sure we are all seeing things clearly, the documents we’re referring to are those that give rise to financial change in an enterprise. This includes invoices, purchase orders, receiving reports and sales contracts. We’re not including other documents like memos, reports, news articles and emails – nor are we focusing on document structures such as JSON or XML.

In this context, the “events” represent the recording of something happening that has a high probability of affecting the finances of the firm. Many people call these “transactions” or “financial transactions.” The deeper we investigated, the more we found a need to distinguish the “event” (which is occurring in the real world) from the “transaction” (which is its reflection in the database). But I’m getting ahead of myself and will just stick with documents and events for this article.

Documents and Events, Historically

For most of recorded history, the document was the event, or at least it was the only tangibly recorded interpretation of the event. That piece of actual paper was both the document and the representation of the event. When you wrote up a purchase order (and had it signed by the other party) you had an event.

In the 1950’s we began computerizing these documents, turning them into a skeuomorph (a design that imitates a real-world object to make it more familiar). The user interfaces looked like paper forms. There were boxes on the top for “ship to” and “bill to” and small boxes in the middle for things like “payment terms,” and “free on board.” This was accompanied by a line item of the components that made up the bill, invoice, purchase order, timecard, etc.

For the longest time, the paper was also the “source document” which would be entered into the computer at the home office. Somewhere along the way some clever person realized you could start by entering the data into the computer for things you originated and then print out the paper. That paper was then sent to the other party for them to key it into their system.

Now, most of these “events” are not produced by humans, but by some other computer program. These ‘bills of materials’ processors can generate purchase orders much faster than a room full of procurement specialists. Many industries now consider these “events” to be primary. The documents (if they exist at all) are part of the audit trail. Industries like healthcare have long ago replaced the “superbill” (a document on a clipboard with 3 dozen check boxes to represent what the physician did to you on that visit) with 80 specific types of HL7 messages that ricochet back and forth from provider to payer.

And yet, even in the 21^stcentury, we still find ourselves often excerpting facts from unstructured documents and entering them into our computer systems. Here at Semantic Arts, we take the contracts we’ve signed with our clients and scan them for the tidbits that we need to put into our systems (such as the budgets, time frame, staffing and billing rates) and conveniently leave the other 95% of the document in a file somewhere.

Documents and Events, what is the difference?

So for hundreds of years, documents and events were more or less the same thing. Now they have drifted apart. In today’s environment, the real questions are not “what’s the difference” but rather “which one is the truth.” In other words, if there is a difference which one do we use? There is not a one-size-fits-all answer to that dilemma. It varies from industry to industry.

But I think it’s fairly safe to say the current difference is that an “event” is a structured data representation of the business activity, while a “document” is the unstructured data representation. Either one could have come first. Each is meant to be the reflection of the other.

The Event and the Transaction

The event has a very active sense to it because it occurs at a specific point in time. And therefore, we record it in our computer system and create a transaction, which updates our database at the posting date and as the effective accounting date.

The transaction and the event often appear to be the same thing, partly because so many events terminate in the accounting department. But, in reality, the transaction is adding information to the event that allows it to be posted. The main information that is being added is the valuation, the classification and the effective dates. Most people enter these at the same time they capture the event, but they are distinct. The distinction is more obvious when you consider events such as “issuing material” to a production order. The issuer doesn’t know what account number should be charged, nor do they know the valuation (this is buried in an accounting policy that determines whether to cost this widget based on the most recent cost, the oldest cost or the average cost of widgets on hand.) So the “transaction” is different from the “event” even if they occur at the same time.

Until fairly recently, administrators wouldn’t sit at their computer and enter invoices until they were prepared for them to be issued. Most people wait until they ship the widget or complete the milestone before they key in the invoice data and email it to their customer. In this circumstance, the event and the transaction are cotemporaneous – they happen at the same time. And the document being sent to the customer follows shortly thereafter.

One More Disconnect

We are implementing data-centric accounting at Semantic Arts and have disconnected the “event” that is the structured data representation of the event, from its classification as an event. We realized that as soon as we had signed a contract, we knew at least one of the two aspects of our future invoices, and in many cases, we knew both. For fixed price projects, we knew the amount of the future invoices. The only thing we didn’t know was when we could invoice them – because that was based on the date of some given milestone. For time and material contracts we know the dates of our future invoices (end of the month often) but don’t know the amount. And for our best efforts contracts we know the dates and the amounts and adjust the scope to fit.

But knowing these things and capturing them in our accounting system creates a problem. They weren’t actually real yet (or at least they weren’t real enough to be invoices). The sad thing was they looked just like invoices. They had all the data, and it was all valid. They could be rendered to pdfs, and even printed, but we knew we couldn’t send all the invoices to our client all at once. So we now had some invoices in our system that weren’t really invoices, and didn’t have a good way to make the distinction.

As we puzzled over this, we came across a university that was dealing with the same challenge. In their case they were implementing “commitment accounting,” which is trying to keep track of the commitments (purchase orders mostly) that are outstanding as a way to prevent overrunning budgets. As people entered their purchase orders (structured records as we’ve been describing them) the system captured them as events. These events were captured and tallied by the system. In order to get the system to work, people entered purchase orders long before they were approved. In fact, you have to enter them to get an event (or a document) that can be approved and agreed to by your vendor.

The problem was many of these purchase order events never were approved. The apparent commitments vastly exceeded the budgets, and the whole system was shut down.

Actions

We discovered that it isn’t the document, and it isn’t even the event (if we think of the event as the structured data record of the business event) that makes the financial effect real. It is something we are now calling the “action,” or really a special type of “action.”

There is a magic moment when an event, or perhaps more accurately a proto-event becomes real. On a website, it is the “buy” button. In the enterprise ,it is often the “approval” button.

As we worked on this, we discovered it is just one of the steps in a workflow. The workflow for a purchase order might start with sourcing, getting quotes, negotiating, etc. The special step that makes the purchase order “real” isn’t even the last step. After the purchase order is accepted by the vendor, we still need to exchange more documents to get shipping notifications, deal with warranties, etc. It is one of those steps that makes the commitment. We are now calling this the “green button.” There is one step, one button in the workflow progression that makes the event real. In our internal systems we’re going to make that one green, so that employees know when they are committing the firm.

Once you have this idea in your head, you’ll be surprised how often it is missed. I go on my bank’s website and work through the process of transferring money. I get a number of red buttons, and with each one, I wonder, “is this the green one.” Nope, one more step before we’re committed. Same with booking a flight. There are lots of purple buttons, but you have to pay a lot of attention before you notice which one of those purple buttons is really the green one.

Promotion

And what does the green button in our internal systems do? Well, it varies a bit, workflow to workflow, but in many cases it just “promotes” a draft item to a committed one.

In a traditional system you would likely have draft items in one table and then copy them over to the approved table. Or you might have a status and just be careful to exclude the unapproved ones from most queries.

But we’ve discovered that many of these events can be thought of as subtypes of their draft versions. When the green button gets pressed in an invoicing workflow, the draft invoice gains another triple, which makes it also an approved or a submitted invoice – in addition to its being a draft invoice.

Summary

We in the enterprise software industry have had a long history of conflating documents and events. Usually we get away with it, but occasionally it bites us.

What we’re discovering now with the looming advent of data-centric accounting is the need not only to distinguish the document from the event but also distinguish the event (as a structure) from the action that enlivens it. We see this as an important step in the further automation of direct financial reporting.

Data-Centric Credentialling

October 16, 2025February 11, 2025 by Dave McComb

Data-Centric Credentialling

In order to ensure that clients can get what they expect when they buy software or services that purport to be “data-centric” we are going to implement a credentialling program. The program will be available at three levels.

Implementation Awards

These are assessments and awards given to clients for projects or enterprises to recognize the milestones on their journey to becoming completely data-centric.

It is a long journey. There is great benefit along the way, and these awards are meant to recognize progress on the journey

Software Certification

The second area is in certify that software meets the goals of the data-centric approach. There will be two major categories:

Middleware – databases, messaging systems, and non-application-specific tools that might be used in a data-centric implementation will be evaluated on its consistency with the approach
Applications – as described in the book “Real Time Financial Accounting, the Data Centric Way” we expect that vertical industries will be far easier and more consistent with the data-centric approach. Horizontal applications will be evaluated based on their ease of being truly integrated with the rest of a data-centric enterprise. Adhering to open models and avoiding proprietary structures will also improve the rating in this area.

Professional Services

There will be two levels of professional services credentialling, one based on what you know and the other on what you’ve done.

The “what you know “will be based on studying and testing akin to the Project Management Institute of the Data Management DMBOK.

The “what you’ve done” recognizes that a great deal of the ability to deliver these types of projects is based on field experience.

The Data-Centric Graph Tech Stack

October 20, 2025February 11, 2025 by Dave McComb

The Data-Centric Graph Tech Stack

Virtually all technology projects these days start with a “tech stack.” The tech stack is a primarily a description of the languages, libraries and middleware that will be used to implement a project. Data-Centric projects too, have a stack, but the relative importance of some parts of the stack are different in data-centric than traditional applications.

This article started life as the appendix to the book “Real Time Financial Accounting, the Data-Centric Way” and as a result it may emphasize features of interest to accounting a bit more, than otherwise, but hopefully will still be helpful.

Typical Tech Stacks

Here is a classic example of a tech stack, or really this is more of a Chinese menu to select your tech stack from (I don’t think most architects would pick all of these)

A traditional Tech Stack

Most of defining a stack is choosing among these. The choices will influence the capabilities of the final product, and they will especially define what will be easy and what will be hard. There are also dependencies in the stack. It used to be that the hardware (platform / OS) was the first and most important choice to make and the others were options on top of that. For instance, if you picked the DEC VAX as your platform you had a limited number of databases and even a limited number of languages to choose from.

But these days many of the constraining choices have been abstracted away. When you select a cloud-based database, you might not even know what the operating system or database is. And the ubiquity of browser based front ends has abstracted away a lot of the differences there as well.

But that doesn’t mean there aren’t tradeoffs and constraints. One of the tradeoffs is longevity. If you pick a trendy stack, it may not have the same half-life as one that has been around a long while (although you might get lucky). And your choice of stack may influence the kind of developers you can attract.

Every decade or so there seem to be new camps that develop. For a while it was java stacks v C# and .Net stacks. Now a days two of the mainstream camps are react/ JavaScript v. python. Yes, there are many more but those two seem to get a lot of attention.

React/JavaScript seems to be the choice when UI development is the dominant activity and python when data wrangling, NLP and AI are top of mind.

Data-Centric Graph Stack

For those of us pursuing data-centric, the languages are important, but less so than with traditional development. A traditional development project with hundreds of interactive use cases is going to be concerned with tools that will help with the productivity and quality of the user experience.

In a mostly model-driven (we’ll get to that in a minute) data-centric environment, we’re trying to drastically reduce (close to zero) the amount of custom code that is written per each use case. In the extreme case, if there is no user interface code, it doesn’t really matter what language it wasn’t written it.

And on the other side if your data wrangling will involve hundreds of pipelines the ease at which each step is defined and combined will be a big factor. But when we focus on data at rest, rather than data flow, the tradeoffs change again.

Model Driven Development (short version)

In a traditional application user interface, the presentation and especially the behavior of the user interface is written in software code. If you have 100 user interfaces you will have 100 programs, typically each of them many thousand lines of code that access the data from a database, move it around in the DOM (the in memory data model of a web based app as an example) present it in the many fields in the user interface, manage the validation and constraint management and posting back to the database.

In a model driven environment rather than coding each user interface, you code one prototypical user interface and then adapt it parametrically. In a traditional environment you might have one form that has person, social security number and tax status and another form that has project name, sponsor, project management, start date and budget. Each would be a separate program. The model driven approach says we have one program, and you just send it a list of fields. The first example would get three fields and the second five. It’s obviously not that simple, and there are limits to what you can do this way, but we’ve found for many enterprise applications you can get good functionality for 90+% of your use cases this way.

If you only write one program, and the hundreds of use cases are “just parameters” (we’ll get back to them later) that’s why we say it doesn’t matter what language you don’t write your programs in.

One more quick thought on model driven (which by the way Gartner tends to call low code / no code) is there are two approaches. One approach is code generation. In that approach you write one program that writes the hundreds of application programs for you. This is likely what we’ll see from GenAI in the very near future, if not already. Some practitioners go into the generated code and tweak it to get exactly what they want. In that case it matters a great deal what language its written in.

But the other approach does not generate any code. The one bit of architectural code treats the definition of the form as if it were data and does what is appropriate.

Back to the Graph Stack

So, if we not overly focused on the programming languages what are we focused on, and why? In order for this discussion to make sense, we need to lay out a few concepts and distinctions, so that the priorities can make sense.

One of the big changes is the reliance on graph structures. We need to get on the same page about this before we proceed, which will require a bit of backtracking as to what the predominant alternatives are and how they differ.

Proto truths

We’re going to employ a pedagogical approach using “proto-truths.” Much of the technology we’re about to describe has deep and often arcane specifics. Technologists feel the need to drill down and explain all the variations of every new concept they introduce, which often gets in the way of the readers grasping the gestalt, such that they could, in due course appreciate the specifics.

The proto-truth approach says when we introduce a new concept, we’re going to describe it in a simplified way. This simplified way takes a subset of the concept, often the most exemplar subset, and describes the concept and how it fits with other concepts using the exemplars. Once we’ve conveyed how all the pieces fit together, we will cycle back and explain how the concepts work with less exemplar definitions. For technical readers we will mention that it is a proto-truth every time we introduce one, lest you say in your mind “no that isn’t the full definition of that concept.”

Structured Data

A graph is a different way of representing structured information. Two more common ways are tables and “documents.” Documents is in quotes here, as depending on your background you may read that and think Microsoft Word, or you may think json. Here we will mean the latter. But first let’s talk about tables as an organizing principle for structured data.

Tables

We use tables in relational databases as well as in spreadsheets, and we cut and paste them into financial reports.

In a table the data is in the cell. The meaning of the data is contextual. This context includes the row and column, but it also includes the database and the table. One allure of tables is their simplicity. But the downside is there is a lot of context for a human to know, especially when you consider that a large firm will have millions of tables. Most firms are currently trying to get a handle on their data explosion, including processes to document (sorry – different form of the word document) what all the databases, tables and columns mean. Collectively, these are the structured data’s “meta-data.” This is hard work, and most firms can only get a partial view, but even partial is quite helpful.

In “table-world” even if you know what all the databases, tables and columns mean, you are only part way home. As a famous writer once said:

“There is a lot more to being a good writer than knowing a lot of good words. You … have … to … put … them … in … the … right … order.”

In an analogous way to writers putting words in the right order, people who deal with tabular data spend much of their time reassembling tables into something useful. It is rare that all the information you needed is in a single table. If it is, it is likely that one of your predecessors assembled it from other tables and so happened to do so in a way that benefits you.

This process of assembling tables from other tables is called “joining.” It sounds simple in classroom descriptions. You “merely” declare the column of one table that is to be joined (via matching) to another table.

But think about this for a few minutes. The person “joining” the tables needs to have considerable external knowledge about which columns would be good candidates to join to which others. Most combinations make no sense at all and will get little or no result. You could join the zip code on a table of addresses with the salaries of physicians, but the only matches you’d get would be a few underpaid physicians on the West coast.

This only scratches the surface of the problem with tables. This “joining” approach only works for tables in the same database. Most tables are not in the same database. Large firms have thousands of databases. To solve this problem, people “extract” tables from several databases and send them somewhere else where they can be joined. This partially explains the incredible explosion of numbers of tables that can be found in most enterprises.

The big problem with table based systems is how rapidly the number of tables can explode, and as it does the difficulty of know which table to access, what the columns mean and how to join them back together becomes a bit barrier to productivity. In a relational database the meaning of the columns (if defined at all) is not in the table. It might be in something the database vendor calls a “directory” but more likely its in another application, a “data dictionary” or a “data catalog.”

This was a bit of a torturous explanation of just a small aspect of how traditional databases work. We did this to motivate the explanation of the alternative. We know from decades of explaining; the new technology sounds complex. If you really understand how difficult the status quo is, you are ready to appreciate something better. And by the way we should all appreciate the many people who toil behind the scenes to keep the existing systems functioning. It is yeoman’s work and should be applauded. At the same time, we can entertain a future that requires far less of this yeoman’s work.

Documents

Documents, in this sense, as a store of structured information, are not the same as “unstructured documents.” Unstructured documents are narrative, written and read by humans. Microsoft Word, PDFs and emails are mostly unstructured. They may have a bit of structured information cut and pasted in, and they often have a bit of “meta-data” associated with them. This meta-data is different than the meta-data in tables. In tables the meta-data is primarily about the tables and columns and what they mean. For an unstructured document, meta-data is typically the author, maybe the format, the date created and last modified date, and often some tags supplied by the author to help others search later.

Documents in the Document Database sense though are a bit different. The exemplars here are XML and json (JavaScript Object Notation).

XML JSON

semi structured “documents”

The difference here between tables and documents is that with documents the equivalent of their meta-data (part of it anyway) is co-located with the data itself. The json version is a bit more compact and easier to read, so we’ll use json for the rest of this section.

The key (if you pardon the pun) to understanding json lies in understanding the key/value concept. The json equivalent to a column in a table is the key. The json equivalent to a cell in a table is a value. In the above, “city” is a key, and “Fort Collins” is a value. Everything surrounding the key/value pair is structure or context. For instance, you can group all the key/value pairs that would have made up a row in a table, inside a matching pair of “{ }”s. The nesting that you see so often in json (where you have “{… }” inside another “{ …}” or “[…]” ) is mostly the equivalent of a join.

An individual file, with a collection of json in it, is often called a dataset. A dataset is a mostly self-contained data structure that serves some specific purpose. These files / datasets look and act like documents. They are documents, just not the type for casual reading. When people put a lot of them in a database for easier querying, this is called a “document database.” They are handy and powerful, but unless you know what the keys and the structure mean, you don’t know what the data means. The number and complexity of these datasets can be overwhelming. Again, kudos to the many men and women who manage these and keep everything running, but again, we can do better.

Graph view of tables or documents

Many readers already familiar with graph technology will be raising their hands right now and saying things like “what about R2RML or json-LD?” Yes, there are ways to make tables and documents look like graphs, and consume like graphs, but this never occurs to the people using tables and documents. This occurs to the people using graph who want to consume this legacy data. And we will get to this, but first we need to deal with graphs and what makes them different (and better).

Graph as a Structuring Mechanism

In graph technology, the primitive (and only) way of representing information is in a “triple.” A triple has three parts: two nodes and an edge connecting them.

graph fundamentals

At the proto-truth level, the nodes represent individual real-world instances, sometimes called individuals. In this context individual is not a synonym for person, for instance in this example we have an individual house on an individual lot.

things not strings

These parenthetic comments are just for the reader, in a few moments we’ll fill in how we know the node on the left represents a particular house and the node on the right an individual lot that the house is on.

The line between the individuals indicates that there is some sort of relationship between the individuals.

naming the edges

In this case, the relationship indicates that the house is located at (on) the specific lot. The lot is not located at or on the house, and so we introduce the idea of directionality.

edges are directional

The node/edge/node is the basic graph structure and that the edge is “directed” that is, has an arrow on the end, makes this a “directed graph.”

There are two major types of graph databases in current use: labeled property graphs and RDF Databases, which are also known as Triple Stores. Labeled property graphs, such as the very popular Neo4j, are essentially document stores with graph features. The above triple might look more like this in a labeled property graph:

Attributes on the edges

Each node has a small document with all the known attributes for that node, in this case we’re showing the address, and the lot for the two nodes. The edge also has a small document hanging off it. This is what some people call “attributes on the edges” and can be a very handy feature. Astute readers will notice that we left the “:” off the front of the node and edge names in this picture. We will fill in that detail in a bit.

Triple stores do not yet have this feature (attributes on the edges) universally implemented, it is working its way through the standards committees, but there are still several reasons to consider RDF Triple stores. If you choose not to implement an RDF Triple store for your Data-Centric system, these Labeled Property Graphs are probably your next best bet. Both types of graph databases are going to be far easier than relational or document databases for solving the many issues that will need to be dealt with going forward.

Triple stores have these additional features which we think make them especially suitable for building out this type of system:

• They are W3C open standards compatible – there are a dozen viable commercial options and many good open-source options available. They are very compatible. Converting from one triple store to another or combining two in production is very straightforward.

• They support unambiguous identifiers – all nodes and all edges are defined by globally unique and potentially resolvable identifiers (more later) • They support the definition of unambiguous meaning (semantics) – also more later.

We have a few proto-truths that we have skipped over that we can fill in before we proceed. They have to do with “where did these things that look like identifiers come from and what do they signify?”

Figure 1 — the basic “triple”

The leading “:” is a presentation shorthand for a much longer identifier. In most cases there is a specific shorthand for this contextualizing information, which is called a “namespace.” The namespace represents a coherent vocabulary, and any triplestore can mix and match information from multiple vocabularies/ namespaces.

Figure 2 — introducing namespaces

In this example we show these items coming from three different namespaces or vocabularies. The one on the left: “rel:” might be short for a realtor group that identified the house. The “gist:” refers to an open-source ontology provided by Semantic Arts and the “geoN:” is short for geoNames – another open-source vocabulary of geospatial information. The examples without any qualifiers (the ones with only “:”) still have a namespace but it is whatever the local environment has declared to be the default.

Let’s inflate the identifier:

Prefixes are Shorthand for Namespaces

The “rel:”is just a local prefix that will get expanded anytime this data is to be stored or compared. The local environment fills in the full name of the namespace as shown here (a hypothetical example). The namespace is concatenated with what is called the “fragment” (the “item6” in this example) to get the real identifier, the “URI” or “IRI.”

IRIs are globally unique. So are “guids” (Globally Unique IDentifiers).

guids as globally unique ids

Being globally unique has some incredible advantages that we will get to in a minute, but before we do, we want to spend a minute to distinguish guids from IRIs. This guid (which I just generated a few minutes ago) may indeed be globally unique, but I have no idea where to go to find out what it represents.

The namespace portion of the IRI gives us a route to meaning and identity.

Using Domain Names in Namespaces to Achieve Global Uniqueness

Best practice (followed by 99% of all triple store implementations) is to base the namespace on a domain name that you own or have control over. As the owner of the domain name, and by extension the name space, you have the exclusive right to “mint” IRIs in this namespace. “Minting” is the process of making up new IRIs. With that right comes the responsibility to not reuse the same IRI for two different things. This is how global uniqueness is maintained in Triple Stores. It also provides the mechanism to find out what something means. If you want to know what https://data.theRealtor.com/house/item6 refers to you can at least ask the owners of theRealtor.com. In many cases the domain name owner will go one step further, and not only guarantee that the identifier is globally unique, but they will tell you what it means in a process call “resolution.” An IRI, following this convention, looks a lot like a URL. The minter of this IRI can, and often does, make the IRI resolvable. To the general public the resolution may just say that it is a house and here is its address. If you are logged in and authenticated it may tell you a lot more, such as who the listing agent is and what the current asking price is.

The URI/IRI provides an identifier that is both resolvable and globally unique. Resolvable means you have the potential of finding out what an identifier refers to. Let’s return to the value of global identifiers.

In the tabular, document and even labeled property graph, the identifiers are hyper-local. That is an identifier such as “007” only means what you think it means in a given database, table and column.

Figure 3 — Traditional systems require the query-er to reassemble tables into small graphs to get anything done

That same “007” could refer to a secret agent in the secret agent database, and a ham sandwich in the deli database. More importantly, if we want to know who has the Aston Martin this week we need to know, as humans, that we “join” the “id” column in the “agent table” with the “assigned to” column in the “car” table. This doesn’t scale and it’s completely unnecessary.

When you have global ids, you don’t need to refer to any meta data to assemble data. The system does it all for you. https://data.hmss.org.uk/agent/007 refers to James Bond no matter what table or column you find it in or if you find it on a web site or in a document.

Say we found or harvested these factoids and converted them to “triples”. This is depicted in the figure below. For readability, we’ve temporarily dropped the namespace prefixes and added in parenthetical comments.

Triples sourced independently

The first triple says the house is on a particular lot. The second triple says what city that lot is in. The third adds that it is also in a flood plain. The fourth, which we might have gotten from county records, says there is a utility easement on this lot. And the last is an inspection that was about this house (note the arrow pointing in the other direction).

The database takes all the IRIs that are identical and joins them. This is what people normally think of when they think of a graph.

Triples Auto-snapped Together

Notice that no metadata, was involved in assembling this data into a useful set and note that no human wrote any joins. Hopefully this hints at the potential. Any data we find from any source can be united, flexibly. If a different lot had 100 facts about it, that would be ok. We are not bound by some rigid structure.

Triples

But we still have a few more distinctions we’ve introduced without elaborating.

We introduced the “triple,” but didn’t elaborate. A triple is an assertion, like a small statement or sentence. It has the form: subject, predicate, object. In this case:
:item6 (subject), :hasPhysicalLocation (predicate), :geo27 (object).

(subject) (predicate) (object)

Triples as Tiny Sentences

The one we showed backward should be read from the tail to the head of the arrow.

Read Triples in the Direction of the Arrow

The :insp44 (subject) :isAbout (predicate) :item6 (object).

A Schema Emerging

You may be willing to accept that behind the scenes we did something called “Entity Resolution.” This is very similar to what it is in traditional systems; it is the gathering up of clues about an entity (in this case the house, the lot, the city etc.) to determine whether the new fact is about the same entity we already have information about.

Assuming we have the software, and we are competent, we can figure out from clues (which we’ve skipped over so far) to determine that all facts about item6 are in fact about the same house. And that also behind the scenes we came up with some way to assign a unique fragment to the house (in this case the unusually short, but ok for the illustration “item6.”)

But you should wonder where did “:hasPhysicalLocation” come from. Truth is, we didn’t just make it up on the fly. This is the first part of the schema of this graph database. It must have existed before we could make this assertion using it.

We are going to draw this a bit differently, but trust us, everything is done in triples, it is just that some triples are a bit more primitive and special and well known than others. In this case we created a bit of terminology before we created that first triple. We declared that there was a “property” that we could reuse later. We did it something like this:

Schema are Triples too!

This is the beginning of an “ontology” which is a semantic data model of a domain. It is built with triples, exactly as everything else is, but it uses some primitive terms that came with the standards. In this case the RDF standard1 gives us the ability to declare a type for things, and in this case, we use the OWL standard 2to assert that this “property” is an object property. What that means is that it can be used to connect two nodes together, which is what we did in the prior example.

We’ve noticed that having everything be triples kind of messes with your mind when you first pick this up so we’re going to introduce a drawing convention, but keep in mind this is just a drawing convention to make it easier to understand, behind the scenes everything is triples, which is a very good thing as we’ll get to later.

There is something pretty cool going on here. The metadata, and therefore the meaning, of data is co-located with the data, using the same structural mechanics as the data itself. This is not what you find in traditional systems. In traditional systems the meta data is typically in a different syntax (DDL Data Definition Language is the metadata language for relational and DML Data Manipulation Language is its manipulation language), which is often in a different location (the directory as opposed to the data tables themselves) and is often augmented with more metadata, entirely elsewhere, initially in a data dictionary, and more recently in data catalogs, metadata management systems and enterprise glossary systems. With Graph Databases once you get used to the idea that the metadata is always right there, one triple away, you wonder how we lived so long without it.

In our drawing convention, this boxy arrow (which we call “defining a property”):

Shorthand for Defining a Property

Is shorthand for this declaration:

Defining a Property as Triples

Which makes it easier to see, when we want to use this property as a predicate in an assertion:

Defining a Property v. Using it as a Predicate in a Triple

This dotted line means that the predicate refers to the property, there isn’t really another triple there, in fact the two IRIs are the same. The one in the boxy arrow is defining what the property means. The one on the arrow is using that meaning to make a sensible assertion.

When we create a new property, we will add additional information to it (narrative to describe what it means, and additional semantic properties, but rather than clutter up the explanation, let’s just accept that there is more than just a label when you create a new property).

Classes

You may have noticed that we haven’t introduced any “classes” yet. This was intentional. Most design methodologies start with classes. But classes in traditional technology are very limiting. In relational “class” equals “table.” That is, the class tells you what attributes (columns) you can have, and in so doing limits you to what attributes you can have. If one row wants more attributes you must either grant them for all the rows in the table, or build a new table, for this new type of row.

In semantics the relationship between individuals and classes is quite different. A class is a set. We declare membership in a set by, (wait for it) a triple.

While this is all done with triples, once again they are pretty special triples, that are called out in the standards. In order for us to say that item6 is a House, we first had to create the class House.

Class Definition as Triples

Again, because we humans like to think of schema or metadata differently than instance data, we will draw classes differently — but keep in mind this is just a drawing convention and is a bit more economical on ink.

A shorthand for asserting an instance to be a member of a class

The incredible power comes when you free yourself from the idea that an instance (a row in a relational database) can only be in one class (table). When relational people want to say that something is a both a X (House) and a Y (Duplex) the copy the id into a different table, and export the complexity to the consumer of the data to know that they have to reassemble it..

Instances can be members of more than one class

In Object Oriented design, we might say that a Duplex is a sub type of a House. (all duplexes are houses not all houses are duplexes), but this is at the class level, which ends up being surprisingly limiting.

Now there might be a relationship between Duplex and House, but what if we also said

The classes themselves need not have any pre-existing relationship to each other

Maybe because you’re an insurance company or a fire department and you’re interested in which homes are made of brick. Note that many brick buildings are neither houses nor duplexes (they can be hospitals, warehouses or outhouses). In any event this is what we have

Venn diagram of instance to class membership

Our :item6 is in the set of Brick Buildings, Duplexes and Houses. Another item might be in any other combination of sets.

This is different from Object Oriented, which occasionally has “multiple inheritance,” where one class can have multiple parents. Here as you can see, one instance can belong to multiple unrelated classes.

This is where semantics comes in. We can define the set “Duplex,” and we would likely create a human readable definition for “Duplex.” But with Semantics (and OWL) we can create a formal, machine-readable definition. These machine-readable definitions allow the system to infer instances into a class, based on what we know about it. Let’s say that in our domain we decided that a Duplex was: a Building that was residential and had two public entrances. In the formal modeling language this looks like

Figure 4 — formal definition of a class

Which diagrammatically looks like this:

Defining a Class as the Intersection of Several Classes or Abstract Class Definitions

The two dashed circles represent sets that are defined by properties their individuals have. If an individual is categorized as being residential, it is in the upper dashed (unnamed) circle. If it has two public entrances, it is in the lower one. We are defining a duplex to be the intersection of all three sets, which we cross hatched here.

Don’t worry about understanding the syntax or how the modeling works, the important thing is this discipline is very useful in creating unambiguous definitions of things, and while it certainly doesn’t look like it here, this style of modeling contributes to much simpler overall conceptual models.

Inference

Semantic based systems have “inference engines” (also called “reasoners”) which can deduce new information from the information and definitions provided. We are doing two things with the above definition. One is if we know that something is a building and it is residential and has exactly two entrances, then we can infer that it is a Duplex.

Inferring a Triple is Functionally Equivalent to Declaring it

In this illustration we find :item6 has two public entrances and is a building and has been categorized as being residential. This is sufficient to infer it into the class of duplexes (the dotted line from :item6 to the Duplex class). Diagrammatically this is what causes it to be in the crosshatched part of the Venn diagram

On the other hand if all we know is that it is a Duplex, (that is if we assert that it is a member of the class :Duplex), then we can infer that it is residential and has two public entrances (and that it is a Building).

Triples can be Inferred to be true, even if we don’t know all their specifics

These additional inferred triples are shown dashed. This includes the case where we know that it has two public entrances even if we don’t know what or where they are.

Other Types of Instances

One of our proto-truths was that the individuals were real world things, like houses and lots and people. It turns out there are many other types of things that can be individuals and therefore can be members of classes and therefore can participate in assertions.

Any kind of electronic document that has an identity (a file name) can be an individual, so can any word document or a json file if it is saved to disk (and named). There are many real-world things that we represent as individuals even though they don’t have a physical embodiment. The obligation to pay your mortgage is real. It is not tangible. It may have been originally memorialized on a piece of paper but burning that paper doesn’t absolve you of the obligation.

Similarly, we identify “Events” — both those that will happen in the future (your upcoming vacation) and those that occurred in the past (the shipment of widgets last Tuesday). Categories (such as found in taxonomies) can also be individuals.

Other Types of Properties

We introduced a property that can connect two nodes (individuals). This is called an “Object Property.” There are two other types of properties:

• Datatype Properties

• Annotations

Datatype Properties allow you to attach a literal to a node. They provide an analog to the document that was attached to a node in the labeled property graph above.

Datatype Properties are for Attaching Literals to Instances

This is how we declare a datatype property in the ontology (model). Again, for diagraming we show it as a boxy arrow, and here we use it

Similar to Object Properties We Define Datatype Properties and then Assert them on Instances

Note the literal (“40.5853”) is not a node and therefore cannot be the subject (left hand side) of a triple. Literals are typically labels, descriptions, dates and amounts.

Annotation properties are properties that the inference engine ignores. They are handy for documentation to humans; they can be used in queries, and they can be used as parameters for other programs that are using the graph.

Triples Are Really Quads

Recall when we introduced the triple

(subject) (predicate) (object)

Recall the Classic Three-part Triple

Conceptually you can think of this being one line in a very narrow deep table:

Subject Predicate Object

:item6 :hasPhysicalLocation :geo27

:geo27 :hasEasment :oblig2

:insp44 :isAbout :item6

…

One Way of Thinking About Triples

Really triples have (at least) four parts. The fourth part is part of the spec, the other parts are implementation-specific.

Subject Predicate Object Named Graph

:item6 :hasPhysicalLocation :geo27 :tran1

:geo27 :hasEasment :oblig2 :tran1

:insp44 :isAbout :item6 File6

…

Really Triples Have Four Parts

Pictorially it is like this:

A Pictorial Way to Show the Named Graph

The named graph contains the whole statement, it is not directly connected to either node, or to the edge. Note from the table above many triples could be in the same named graph.

The named graph is a very powerful concept, and there are many uses for it. Unfortunately, you must pick one of the uses and use that consistently. We have found three of the most common uses for the named graph are:

• Partitioning, especially for speed and ease of querying – it is possible to put a tag in the named graph position that can greatly speed querying. • Security – some people tag triples to their security level, and use them in authorization

• Provenance – it is possible to identify exactly where each triple or group of triples came from, for instance from another system, a dataset or an online transaction.

Because of the importance of auditability in accounting systems we are going to use named graphs to manage provenance. We’ll dive in on how to do that when we get to the provenance section, but for now, there is a tag on every triple that can describe its source.

Querying Your Graph Database

Once you have your data expressed as triples and loaded in your graph database, you will want to query it. The query language, SPARQL, is the only part of the stack that isn’t expressed in triples. SPARQL is a syntactic language. We assume they did this in order to appeal to traditional database users who are used to SQL, which is the relational database query language. Despite the fact that SPARQL is simpler and

more powerful than SQL, it seems to have gathered few converts from the relational world. If they had known that making a syntactically similar language was not going to bring converts the standards group might have opted for a triples based query language (like WOQL) but they didn’t, so we’ll deal with SPARQL.

Syntactically, SPARQL looks a bit like SQL, or at least the SPARQL SELECT syntax looks like the SQL SELECT syntax. The big difference is the query writer does not need to “join” datasets, all the data is already joined. The query writer is traversing connections in the graph that are already connected.

SQL SPARQL

comparing SQL and SPARQL

At this simple level it isn’t obvious how much simpler a SPARQL query is. In practice SPARQL queries tend to be 3-20 times simpler than their SQL equivalents. Many have no SQL equivalent.

The SPARQL SELECT statement creates table-like structures, so when you need to export data from a graph database this is often the most convenient way to do so. SPARQL can also INSERT and DELETE data in a graph database, which is analogous to SQL, but SPARQL’s INSERTs and DELETEs must be shaped like triples.

The real power in SPARQL is its native ability to federate. You can easily write queries that interrogate multiple triple stores, even triple stores from different vendors. Because the triples are identical and there are very few and easy to avoid extensions to the query language it is feasible and often desirable to partition your data into multiple Graph Databases and assemble them at query time. This assembly is not the equivalent of “joins” you do not need to describe which strings to be matched to assemble a complete dataset, this assembly is just pointing at which databases you want to include in your scope.

Back to the Stack

That was a long segue. We now have all the requisite distinctions to begin to talk about the preferred stack.

Before we do, a quick disclaimer: we don’t sell any of the tech we describe in this stack (or any stack for that matter). We are trying to describe, from our experience, what the preferred components of the stack should be.

Center of the Stack

As we said earlier, once upon a time, stacks centered on hardware. Over time they centered on operating systems. We suggest the center of your universe should be a graph database conforming to the RDF spec (also usually called a “triple store”). Yes you can build your system on a proprietary database (and all the proprietary database vendors are silently muttering “no ours is better.”). Yes, yours is better. It might be easier, it might scale better, it might be easier for traditional developers to embrace. But those advantages pale, in our opinion, to the advantages we’re about to describe.

RDF Triple Stores are remarkably compatible. If you’ve ever ported a relational database application from one vendor to another (say IBM DB2 to Oracle or Oracle to Microsoft SQL Server) you know what I’m talking about. Depending on the size of the application that is a 6–18-month project. You will get T-Shirts at the end for your endurance.

The analogue in the triple store world is somewhere between a weekend and a few weeks. No T-Shirt for you. We’ve done this several times. Easy portability sounds nice, but you think: “I don’t port that often.” Yeah you don’t but that’s largely because it’s hard and this is the source of your vendors lock-in and therefore price pressure. Ease of porting brings the database closer to the commodity level, which is good for the economics.

But that’s not the only benefit. The huge advantage is the potential for easy heterogeneity. You might end up with several triple stores. They might be from the same vendor, but they might not. The fact that part of the SPARQL standard is how it federates, means that there is very little incremental effort to combine multiple databases (triple stores).

So, the first part of our stack is: RDF compliant Triple Stores.

The core of our recommended stack

The core of the core is the reliance on open standards based triple stores. The core of the UI are browsers. We’re sinking our pilings into two technologies that are not proprietary, have been stable for a long time, and we will not incur high switching costs as we move from vendor to vendor or open source product.

Before we move up a level in the stack we need to look at the role of models in a model driven environment in a triple store platform.

Configuration as Triples

In most software implementations, most configurations (the small datasets that turn on and off certain capabilities in the software) are expressed in json. This idea is super pervasive. It ends up meaning that configuration files are tantamount to code. They really say which code is going to be executed and which code will be ignored. This superpower of configuration files is what leads cyber security vendors to be hypervigilant about how the configuration files are set. A large percentage of the benchmarks from the Center for Internet Security deal with setting configuration files to insure the least compromised surface area for a given implementation.

But configuration files are out of band. We advocate most of the configuration that is possible in a given system be expressed in triples. The huge advantage is that the configuration triples are expressed using the same IRIs for the same concepts as the rest of the application. And the configuration can be interrogated by the query language (which a configuration file cannot)

Model Driven as Triples

Recall the earlier discussion about model driven development. Most model driven development also expresses their parameters in tables or some sort of json configuration file. But this requires a context shift to understand what’s going on. The parameters that define each use case can, and should be, triples. Many of the triples are referring to classes and properties in the triple store. If we use the same technology to store these parameters it becomes easy to query and find out for instance, which user interface refers to this class (because I’m contemplating changing it). This is a surprisingly hard thing to do in traditional technology. First you don’t know all the references in code to the concepts in the model. Second the queries are opaque text that aren’t offering up the secrets of their dependency.

Each new use case adds a few triples, that define the user interface, the fields on a form or the layout of a table or a card, and a few small snippets of sparql (for instance to populate a dropdown list).

The part of the stack where use cases are created

We also show a bespoke user interface. We’re finding that 2-5% of our user interfaces are bespoke. These green slivers are meant to suggest the incremental work to add a use case. Notice that the client architecture code doesn’t change and the server architecture code doesn’t change. (and of course the browser and triple store don’t change).

While sparql is in the stack, we should point out that the architecture does not allow sparql to be executed directly against the triplestore. This is a very hard security problem to control if you allow it. In this architecture, the sparql is stored in the triplestore along with all the other triples, at run time the client indicates the named sparql procedure to be executed and supplies the parameters.

Middleware

There are two middleware considerations: one much of what we described above can be purchased or obtained via open source. Depending on your needs, Metaphactory, Ontopic or AtomGraph may handle many of the requirements you have.

The second consideration is that you maybe want to add additional capability to your stack. Some of the more common are ingress and egress pipelines, constraint managers, entity resolution add ons, and unstructured text search.

Architecture showing some middleware add-ons

There you have a fairly complete data-centric graph stack.

The Data-Centric Graph Stack in Summary

This data-centric stack has some superficial resemblance to a more traditional development stack. While there are programming languages in both, in the traditional stack they are more important, as most of the application code will be built in code, and the choice is pretty key.

In the data-centric stack there is very little application code, and the language matters very little. The architecture is built is code, but again, it doesn’t matter much what language it is.

We think one we think some of the key distinctions of this architecture is in the red lines. There are very few, well controlled and well tested APIs that ensure there is only one pathway in for access to the database, and that all processes pass through the same gateways.

A BFO-ready version of gist

October 20, 2025January 15, 2025 by Dave McComb

A BFO-ready version of gist

The Federal Government and Life Sciences Life Sciences companies are moving toward adoption of the Basic Formal Ontology (BFO).

We have aligned the Semantic Arts foundational ontology (gist) with BFO to help these communities.

This paper describes how and why we did this.

Background

An upper ontology is a high-level data model that can be specialized to create a domain-specific data model. A good upper ontology is a force multiplier that can speed the development of your domain model. It promotes interoperability and can be used as the basis for an information system. Two domain models derived from the same upper ontology are far easier to harmonize.

gist (not an acronym, but the word meaning “get the essence of”) is an upper ontology, focused on the enterprise information systems domain. It was initially developed by Semantic Arts in 2007 and has been refined in over 100 commercial implementation projects. It is available for free under a creative commons’ attribution license at https://www.semanticarts.com/gist/

BFO was developed at the University at Buffalo in 2002 and has been used in hundreds of ontology projects and cited in as many papers. The focus has been on philosophical correctness and has been adopted primarily in life sciences and more recently the federal government. It is available https://basic-formal-ontology.org

BFO and gist share a great deal in common:

Simple – the current version of gist has 211 concepts (98 classes and 113 properties). The current version of BFO has 76 concepts (36 classes and 40 properties). We share the belief that the upper ontology should have the fewest concepts that provide the greatest coverage.
Formality – most of the concepts within both ontologies have very rigorous formal definitions. The axioms within BFO are primarily defined in first order logic, which are not available to the owl-based editors and reasoners – but they have developed an owl version. Half of their definitions are simple subsumption. The other half have subclass restrictions that don’t have as much inferential value as equivalent class axioms. BFO is one of the few other ontologies we have come across that makes extensive use of high-level disjoints. It is the combination of formal definitions with high level disjoints that is the best way to detect logical inconsistencies. gist is also highly axiomized. Half of all gist classes have full formal definitions of the equivalent class variety.
Breadth – both BFO and gist were designed to provide covering concepts for the maximum number of domain concepts. A well-designed domain ontology, derived from either starting point, should have few or no classes that are not derived from the upper ontology classes. In the early days of gist, we created some domain classes without derivation. But as we evolved gist we now find “orphans” (classes not descended from gist classes) to be rare. BFO with its high-level abstract classes certainly has the potential to cover virtually all possible domain classes, but in practice we find many BFO compliant ontologies with large numbers of orphan classes.
Active Evolution – both ontologies are in continual use and have active user communities. Both are well organized with major and minor releases including the ability to accept suggestions from users. Both are being used in production systems throughout the world.

Why Now?

In the early days of semantic adoption there were many options for an upper ontology. BFO, Dolce, Sumo and OpenCyc were considered the main contenders.

At Semantic Arts, we didn’t see a need to adopt BFO or any of the other upper ontologies. They didn’t contain the key concepts that we needed to implement enterprise systems, and they were very hard to explain to subject matter experts and project sponsors. We invest significant effort making sure our ontologies are understood by those that both implement and consume them.

Recently we have considered committing to both schema.org and ISO 15926. Neither of these purports to be an upper ontology. However, when we look at them in detail, we find they are pretty close to being upper ontologies by scope and positioning. In many ways these ontologies are more pragmatic and closer to what we are trying to achieve.

Schema.org is promoted by a consortium led by Google. Its primary use case is to make internet search more accurate by standardizing on many of the terms used for business descriptions. The pragmatic value for companies that tag their content with schema.org is major improvements in web searching. We also know that schema.org can be easily aligned with gist. This is how Schema App (https://www.schemaapp.com ) built their offering. While schema.org is a good solution for finding and describing a company’s offerings, it wasn’t designed for our primary purpose, which is to help a firm run their business.

ISO 15926 emerged from the Oil & Gas industry and is widely used in process manufacturing industries. The architecture is abstract and, in theory, could be applied in a much broader way.

Up until now we didn’t see much advantage in reducing flexibility in the pursuit of our core mission by committing to these candidate upper ontology and upper ontology-like models.

Motivation

We were driven to create an alignment with BFO based on input from some of our clients.

The first motivator is the huge volume of life science ontologies that (at least) purport to be based on BFO. The reason we say “purport” is that we have sampled many life science ontologies for their degree of commitment to BFO. Our measure of commitment is what percentage of their named classes are subclasses of BFO classes. Or to use the terminology earlier, the number of orphan classes they contain. We find many where fewer than half of the classes are proper descendants of BFO primitives.

The OBO (Open Biological and Biomedical Ontologies) Foundry is a great resource for ontologies in the life sciences space. That said, there are over 8 million classes in OBO alone that purport to conform to BFO, which gives other life science ontologies a reason to seek alignment.

The other development was the DoD’s publication of “Principles of The DoD-IC Ontology Foundry” (which is still in draft status). In this document the DoD have declared that all ontology related work within the defense community shall conform to BFO (and the Common Core Ontology, which we will pick up in a subsequent white paper).

For people who must conform to BFO (the defense community) this provides them with a more pragmatic way to build domain models while still complying with the directive. For life science practitioners this also provides assurance that their work will align with life science orthodoxy.

How to Get Started

This illustration shows how the key pieces fit together.

This file will bring in compatible versions of both gist and BFO. These arrows represent the import statements that bring in these ontologies. As we suggest in the tips section you may want to add the redundant import to directly import the same version of gist to your ontology.
This is what you will find when you look at the merged ontology in Protege. It is much easier to see which concepts came from BFO and which came from gist when you view using the “Render by prefixed name” option. The BFO class names are in the obo namespace, start with BFO and are numbers.

The capitalized terms starting with gist are from gist.

The alternative display “Render by label (rdfs:label)” it is still pretty easy to tell how they blended together. The BFO labels are lower case. (the order is slightly different because the labels sort differently from the class names, but the hierarchy is the same)

As you will see, almost all the gist classes are proper subclasses of BFO with three exceptions.

Artifact – things that were intentionally built.
Place – locatable in the world
Unit of Measure – a standard amount used to measure or specify things.

The first two of these are convenience classes that group dissimilar items underneath. The “Artifact” class groups physical and non-physical things that were intentionally built. “Place” groups geospatial regions on the earth with physical items that we often refer to as places, such as landmarks and buildings. Because they subsume items that are disjoint, they could not be subsumed under a single BFO class. But each of their subclasses is aligned with BFO so there is no ambiguity there.

We were not sure where “Units of Measure” fit in BFO, so rather than create inconsistencies we opted to leave UoM out of the BFO alignment. CCO went with our first inclination, which was that it was a “generically dependent continuant” (in gist-speak “content”). In fact, CCO went further and said that it was “descriptive information content entity” which I suppose it could be. But these focus on the content-ness of the unit. A case can be made that a unit of measure (say “inch”) is a special case, a reference

case or a magnitude, which in BFO is a “quality,” and more specifically a “relational quality.” For the time being we’ll leave gist:UnitOfMeasure an orphan, but for any specific purpose if people knew that it would be safe, they could declare it a “generically dependent continuant.”

If any of our alignments are inappropriate, we’d be happy to change.

We have done some alignment on the properties. There are some structural differences in the use of properties that will probably cause users of gistBFO to either use gist properties or BFO properties and not mix and match, however where there is some equivalence we’ve recorded.

BFO makes extensive use of inverse properties. There are only 6 properties in BFO that do not have inverses. After years of discouraging the use of inverses, we finally eliminated them altogether in gist. When using an ontology for definitional purposes inverses can be handy, but there are reasons to avoid them in production systems including ambiguity, inconsistency and performance issues. There is a brief white paper here: https://www.semanticarts.com/named-property-inverses-yay-or-nay/ and a longer discussion on here

https://www.youtube.com/watch?v=uz2GVWadBjg&list=PLk2kJrehubb4dc3e5Db5Lvv9W MaOhV3V7&index=4

BFO uses domain and range extensively. Gist uses them sparingly. We have observed in other ontologies being over specific on domain and range has made properties less reusable and contributes to unnecessary concept bloat. Because of the abstractness of BFOs classes this isn’t as much a problem, but it is a stylistic difference.

In BFO there are only four property pairs that participate in any class definitions: location of/located in, continuant part of/ has continuant part, has occurrent part/occurrent part of and has temporal part / temporal part of. We have aligned with these.

Some tips

We recommend anyone using gistBFO, and especially those that are contemplating building artifacts that may be used by BFO and non BFO communities, to primarily rely on the gist concepts when defining your domain specific classes. Doing so will make it far easier to explain to your stakeholders. And it will not sacrifice any of the BFO philosophical grounding as all the gist concepts (except unit of measure) algin with BFO.

The other advantage, suggested by the dotted line in the “how to get started” section, is that if you have defined all your concepts in gist terms, and you need to implement in a non BFO environment you can just drop the import of gistBFO and the BFO references will disappear, and nothing else needs to change.

If you are using BFO and gist in the Life Sciences arena, you might want to consider what we are doing with our Life Sciences clients: treating most of the classes in OBO as instances in an implementation ontology. Depending on your use case this might involve

punning (treating and class and instance interchangeably) or just use the rdfs:seeAlso annotation to resolve the instance to its class.

Coming soon: CCO ready gist

The Common Core Ontology is a DoD initiated project. It is more similar to gist in that the classes are more concrete and more easily understood by domain experts. It is different from gist in that it consists of 1417 classes and 275 properties and is growing. As such it is almost ten times as complex as gist.

We are working on alignment. Stay tuned, coming soon.

The Data-Centric Revolution: Best Practices and Schools of Ontology Design

October 20, 2025January 2, 2024 by Dave McComb

This article originally appeared at The Data-Centric Revolution: Best Practices and Schools of Ontology Design – TDAN.com. Subscribe to TDAN directly for this and other great content!

I was recently asked to present “Enterprise Ontology Design and Implementation Best Practices” to a group of motivated ontologists and wanna-be ontologists. I was flattered to be asked, but I really had to pause for a bit. First, I’m kind of jaded by the term “best practices.” Usually, it’s just a summary of what everyone already does. It’s often sort of a “corporate common sense.” Occasionally, there is some real insight in the observations, and even rarer, there are best practices without being mainstream practices. I wanted to shoot for that latter category.

As I reflected on a handful of best practices to present, it occurred to me that intelligent people may differ. We know this because on many of our projects, there are intelligent people and they often do differ. That got me to thinking: “Why do they differ?” What I came to was that there are really several different “schools of ontology design” within our profession. They are much like “schools of architectural design” or “schools of magic.” Each of those has their own tacit agreement as to what constitutes “best practice.”

Armed with that insight, I set out to identify the major schools of ontological design, and outline some of their main characteristics and consensus around “best practices.” The schools are (these are my made-up names, to the best of my knowledge none of them have planted a flag and named themselves — other than the last one):

Philosophy School
Vocabulary and Taxonomy School
Relational School
Object-Oriented School
Standards School
Linked Data School
NLP/LLM School
Data-Centric School

There are a few well known ontologies that are a hybrid of more than one of these schools. For instance, most of the OBO Life Sciences ontologies are a hybrid of the Philosophy and Taxonomy School, I think this will make more sense after we describe each school individually.

Philosophy School

The philosophy school aims to ensure that all modeled concepts adhere to strict rules of logic and conform to a small number of well vetted primitive concepts.

Exemplars

The Basic Formal Ontology (BFO), DOLCE and Cyc are the best-known exemplars of this school. Each has a set of philosophical primitives that all derived classes are meant to descend from.

How to Recognize

It’s pretty easy to spot an ontology that was developed by someone from the philosophy school. The top-level classes will be abstract philosophical terms such as “occurrent” and “continuant.”

Best Practices

All new classes should be based on the philosophical primitives. You can pretty much measure the adherence to the school by counting the number of classes that are not direct descendants of the 30-40 base classes.

Vocabulary and Taxonomy School

The vocabulary and taxonomy school tends to start with a glossary of terms from the domain and establish what they mean (vocabulary school) and how these terms are hierarchically related to each other (taxonomy school). The two schools are more alike than different.

The taxonomy school especially tends to be based on standards that were created before the Web Ontology Language (OWL). These taxonomies often model a domain as hierarchical structures without defining what a link in the hierarchy actually means. As a result, they often mix sub-component and sub-class hierarchies.

Exemplars

Many life sciences ontologies, such as SNOMED are primarily taxonomy ontologies, and only secondarily philosophy school ontologies. Also, the Suggested Upper Merged Ontology is primarily a vocabulary ontology, it was mostly derived from WordNet and one of its biggest strengths is its cross reference to 250,000 words and their many word senses.

How to Recognize

Vast numbers of classes. There are often tens of thousands or hundreds of thousands of classes in these ontologies.

Best Practices

For the vocabulary and taxonomy schools, completeness is the holy grail. A good ontology is one that contains as many of the terms from the domain as possible. The Simple Knowledge Organization System (SKOS) was designed for taxonomies. Thus, even though it is implemented in OWL, it is designed to add semantics to taxonomies that often are less rigorous, using generic predicates such as broaderThan and narrowerThan rather than more precise subclass or object properties such as “part of.” SKOS is a good tool for integrating taxonomies with ontologies.

Relational School

Most data modelers grew up with relational design, and when they design ontologies, they rely on ways of thinking that served them well in relational.

Exemplars

These are mostly internally created ontologies.

How to Recognize

Relational ontologists tend to be very rigorous about putting specific domains and ranges on all their properties. Properties are almost never reused. All properties will have inverses. Restrictions will be subclass axioms, and you will often see restrictions with “min 0” cardinality, which doesn’t mean anything to an inference engine, but to a relational ontologist it means “optional cardinality.” You will also see “max 1” and “exactly 1” restrictions which almost never imply what the modeler thought, and as a result, it is rare for relational modelers to run a reasoner (they don’t like the implications).

Best Practices

For relational ontologist best practices are to make ontologies that are as similar to existing relational structures as possible. Often, the model is a direct map from an existing relational system.

Modelers in the relational school (as well as the object-oriented school coming up next) tend to bring the “Closed World Assumption” (CWA) with them from their previous experience. CWA takes a mostly implicit attitude that the information in the system is a complete representation of the world. The “Open World Assumption” (OWA) takes the opposite starting point: that the data in the system is a subset of all knowable information on the subject.

CWA was and is more appropriate in narrow scope, bounded applications. When we query your employee master file looking for “Dave McComb” and don’t get a hit, we reasonably assume that he is not an employee of your enterprise. When TSA queries their system and doesn’t get a hit, they don’t assume that he is not a terrorist. They still use the X-ray and metal detectors. This is because they believe that their information is incomplete. They are open worlders. More and more of our systems combine internal and external data in ways that are more likely to be incomplete.

There are techniques for closing the open world, but the relational school tends not to use them because they assume their world is already closed.

Object-Oriented School

Like the relational school, the object-oriented school comes from designers who grew up with object-oriented modeling.

Exemplars

Again, a lot of object-oriented (OO) ontologies are internal client projects, but a few public ones of note include eCl@ss and Schema.org. eCl@ss is a standard for describing electrical products. It has been converted into an ontology. The ontology version has 60,000 classes, which combine taxonomic and OO style modeling. Schema.org is an ontology for tagging web sites that Google promotes to normalize SEO. It started life fairly elegant. It now has 1300 classes, many of which are taxonomic distinctions, rather than real classes.

How to Recognize

One giveaway for the object-oriented school is designing in SHACL. SHACL is a semantic constraint language, which is quite useful as a guard for updates to a triple store. Because SHACL is less concerned with meaning and more concerned with structure, many object-oriented ontologists prefer it to OWL for defining their classes.

Even those who design in OWL have some characteristic tells. OO ontologists tend to use subclassing far more than relational ontologists. They tend to declare which class is a subclass of another, rather than allowing the inference engine to infer subsumption. There is also a tendency to believe that the superclass will constrain subclass membership.

Best Practices

OO ontologies tend to co-exist with Graph QL and focus on json output. This is because the consuming applications are object oriented, and this style ontology and architecture have less impedance mismatch with the consuming applications. The level of detail tends to mirror the kind of detail you find in an application system. Best practices for an OO ontology would never consider the tens of thousands or hundreds of thousands of classes in a taxonomy ontology, nor would they go for the minimalist view of the philosophy or data-centric schools. They tend to make all distinctions at the class level.

Standards School

This is a Janus school, with two faces, one facing up and one facing down. The one facing down is concerned with building ontologies that others can (indeed should) reuse. The one facing up is the enterprise ontologies that import the standard ontologies in order to conform.

Exemplars

Many of the most popular ontology standards are produced and promoted by the W3C. These include DCAT (Data Catalog Vocabulary), the Ontology for Media Resources, Prov-O (an ontology of provenance), Time Ontology, and Dublin Core (an ontology for metadata, particular around library science).

How to Recognize

For the down facing standards ontology, it’s pretty easy. They are endorsed by some standards body. Most common are W3C, OMG and Oasis. ISO has been a bit late to this party, but we expect to see some soon. (Everyone uses the ISO country and currency codes, and yet there is no ISO ontology of countries or currencies.) There are also many domain-specific standard ontologies that are remakes of their previous message model standards, such as FHIR from HL7 in healthcare and ACORD in insurance.

The upward facing standards ontologies can be spotted by their importing a number of standard ontologies each meant to address an aspect of the problem at hand.

Best Practices

Best practice for downward facing standards ontologies is to be modular, fairly small, complete and standalone. Unfortunately, this best practice tends to result in modular ontologies that redefine (often inconsistently) shared concepts.

Best practice for upward facing standards ontologies is to rely as much as possible on ontologies defined elsewhere. This usually starts off by importing many ontologies and ends up with a number of bridges to the standards when it’s discovered that they are incompatible.

Linked Open Data School

The linked open data school promotes the idea of sharing identifiers across enterprises. Linked data is very focused on instance (individual or ABox) data, and only secondarily on classes.

Exemplars

The poster child for LOD is DBPedia, the LOD knowledge graph derived from the Wikipedia information boxes. It also includes the direct derivatives such as WikiData and the entire Linked Open Data Cloud.

I would put the Global Legal Entity Identifier Foundation (GLEIF) in this school as their primary focus is sharing between enterprises and there are more focused on the ABox (the instances).

How to Recognize

Linked open data ontologies are recognizable by their instances, often millions and in many cases billions of instances. The ontologies (TBox) is often very naïve, as they are often derived directly from informal classifications made by text editors in Wikipedia and its kin.

You will see many adhoc classes raised to the status of a formal class in LOD ontologies. I just notice the classes dbo:YearInSpaceFlight and yago:PsychologicalFeature100231001.

Best Practices

The first best practice (recognized more in the breach) is to rely on other organizations IRIs. This is often clumsy because historically, each organization invented identifiers for things in the world (their employees and vendors for instance) and they tend to build their IRIs around these well-known (at least locally) identifiers.

A second best practice is entity resolution and “owl:sameAs.” Entity resolution can determine if two IRIs represent the same real-world object. Once recognized, one of the organizations can choose to adopt the others IRI (previous paragraph best practice) or continue to use their own, but recognize the identity through owl:sameAs (which is mostly motivated by the following best practice).

LOD creates the opportunity for IRI resolution at the instance level. Put the DBPedia IRI for a famous person in your browser address bar and you will be redirected to DBPedia resolution page for that individual, showing all that DBPedia knows about them. For security reasons, most enterprises don’t yet do this. Because of this, another best practice is to only create triples with subjects whose domain name you control. Anything you state about a IRI in someone else’s name space will not be available for resolution by the organization that minted the subject URI.

NLP/LLM School

There is a school of ontology design that says turn ontology design over to the machines. It’s too hard anyway.

Exemplars

Most of these are also internal projects. About every two to three years, we see another startup with the premise that ontologies can be built by machines. For most of history, these were cleverly tailored NLP systems. The original works in this area took large teams of computational linguists to master.

This year (2023), they are all LLMs. You can ask ChatGPT to build an ontology for [fill in the blank] industry, and it will come up with something surprisingly credible looking.

How to Recognize

For LLMs, the first giveaway are hallucinations. These are hard to spot and require deep domain and ontology experience to pick out. The second clue is humans with six fingers (just kidding). There aren’t many publicly available LLM generated ontologies (or if there are they are so good we haven’t detected that they were machine generated).

Best Practices

Get a controlled set of documents that represent the domain you wish to model. This is better than relying on what ChatGPT learned by reading the internet.

And have a human in the loop. This is an approach that shows significant promise and several researchers have already created prototypes that utilize this approach. Consider that the NLP / LLM created artifacts are primarily speed reading or intelligent assistants for the ontologist.

In the broader adoption of LLMs, there is a lot of energy going into ways to use knowledge graphs as “guard rails” against some of LLMs excesses, and the value of keeping a human in the loop. Our immediate concern there are advocates of letting generative AI design ontologies, and as such it becomes a school of its own.

Data-Centric School

The data-centric school of ontology design, as promoted by Semantic Arts, focuses on ontologies that can be populated and implemented. In building architecture, they often say “It’s not architecture until it’s built.” The data-centric school says, “It’s not an ontology until it has been populated (with instance level, real world data, not just taxonomic tags).” The feedback loop of loading and querying the data is what validates the model.

Exemplars

Gist, an open-source owl ontology, is the exemplar data-centric ontology. SchemaApp, Morgan Stanley’s compliance graph, Broadridge’s Data Fabric, Procter & Gamble’s Material Safety graph, Schneider-Electric’s product catalog graph, Standard & Poor’s commodity graph, Sallie Mae’s Service Oriented Architecture and dozens of small firms’ enterprise ontologies are based on gist.

How to Recognize

Importing gist is a dead giveaway. Other telltale signs include a modest number of classes (less than 500 for almost all enterprises) and eschewing inverse and transitive properties (the overhead for these features in a large knowledge graph far outweigh their expressive power). Another giveaway is delegating taxonomic distinctions to be instances of subclasses of gist:Category rather than being classes in their own right.

Best Practices

One best practice is to have non primitive classes have “equivalent class” restrictions that define class membership and are used to infer the class hierarchy. Another best practice is to have domains and ranges at very high levels of abstraction (and often missing completely) in order to promote property reuse and reduce future refactoring.

Another best practice is to load a knowledge graph with data from the domain of discourse to prove that the model is appropriate and at the requisite level of detail.

Summary

One of the difficulties in getting wider spread adoption of ontologies and knowledge graphs is that if you recruit and/or assemble a group of ontologists, there is a very good chance you will have members from multiple of the above-described schools. There is a good chance they will have conflicting goals, and even a different definition of what “good” is. Often, they will not even realize that their difference of opinion is due to their being members of a different tribe.

There isn’t one of these schools that is better than any of the others for all purposes. They each grew up solving different problems and emphasizing different aspects of the problem.

When you look at existing ontologies, especially those that were created by communities, you’ll often find that many are an accidental hybrid of the above schools. This is caused by different members coming to the project from different schools and applying their own best practices to the design project.

Rather than try to pick which school is “best,” you should consider what the objectives of your ontology project are and use that to determine which school is better matched. Select ontologists and other team members who are willing to work to the style of that school. Only then is it appropriate to consider “best practices.”

Acknowledgement

I want to acknowledge Michael Debellis for several pages of input on an early draft of this paper. The bits that didn’t make it into this paper may surface in a subsequent paper.

The Data-Centric Revolution: “RDF is Too Hard”

October 20, 2025September 8, 2023 by Dave McComb

This article originally appeared at The Data-Centric Revolution: “RDF is Too Hard” – TDAN.com. Subscribe to The Data Administration Newsletter for this and other great content!

The Data-Centric Revolution: “RDF is Too Hard”

By Dave McComb

We hear this a lot. We hear it from very smart people. Just the other day we heard someone say they had tried RDF twice at previous companies and it failed both times. (RDF stands for Resource Description Framework,^[1] which is an open standard underlying many graph databases). It’s hard to convince someone like that that they should try again.

That particular refrain was from someone who was a Neo4j user (the leading contender in the LPG (Labeled Property Graph) camp). We hear the same thing from any of three camps: the relational camp, the JSON camp, and the aforementioned LPG camp.

Each has a different reason for believing this RDF stuff is just too hard. Convincing those who’ve encountered setbacks to give RDF another shot is also challenging. In this article, I’ll explore the nuances of RDF, shedding light on challenges and strengths in the context of enterprise integration and application development.

For a lot of problems, the two-dimensional world of relational tables is appealing. Once you know the column headers, you pretty much know how to get to everything. It’s not quite one form per table, but it isn’t wildly off from that. You don’t have to worry about some of the rows having additional columns, you don’t have to worry about some cells being arrays or having additional depth. Everything is just flat, two-dimensional tables. Most reporting is just a few joins away.

JSON is a bit more interesting. At some point you discover, or decree if you’re building it, that your dataset has a structure. Not a two-dimensional structure as in relational, but more of a tree-like structure. More specifically, it’s all about determining if this is an array of dictionaries or a dictionary of arrays. Or a dictionary of dictionaries. Or an array of arrays. Or any deeply nested combination of these simple structures. Are the keys static — that is, can they be known specifically at coding time, or are they derived dynamically from the data itself? Frankly, this can get complex, but at least it’s only locally complex. A lot of JSON programming is about turning someone else’s structure into a structure that suits the problem at hand.

One way to think of LPG is JSON on top of a graph database. It has a lot of the flexibility of JSON coupled with the flexibility of graph traversals and graph analytics. It solves problems difficult to solve with relational or just JSON and has beautiful graphics out of the box. Read more here

Each of these approaches can solve a wide range of problems. Indeed, almost all applications use one of those three approaches to structure the data they consume.

And I have to admit, I’ve seen a lot of very impressive Neo4j applications lately. Every once in a while, I question myself and wonder aloud if we should be using Neo4j. Not because RDF is too hard but because we’ve mastered it and have many successful implementations running at client sites and internally. But maybe, if it is really easier, we should switch. And maybe it just isn’t worth disagreeing with our prospects.

Enterprise Integration is Hard

Then it struck me. The core question isn’t really “RDF v LPG (or JSON or relational),” it’s “application development v. enterprise integration.”

I’ve heard Jans Aasman, CEO of Franz, the creators of AllegroGraph, make this observation more than once: “Most application developers have dedicated approximately 0 of their neurons contemplating how what they are working on is going to fit in with the rest of their enterprise, whereas people who are deeply into RDF may spend upwards of half their mental cycles thinking of how the task and data at hand fits into the overall enterprise model.”

That, I think, is the nub of the issue. If you are not concerned with enterprise integration, then maybe those features that scratch the itches that enterprise integration creates are not worth the added hassle.

Let’s take a look at the aspects of enterprise integration that are inherently hard, why RDF might be the right tool for the job, and why it might be overkill for traditional application development.

Complexity Reduction

One of the biggest issues dealing with enterprise integration is complexity. Most mid to large enterprises harbor thousands of applications. Each application has thousands of concepts (tables and columns or classes and attributes or forms and fields) that must be learned to become competent either in using the application and/or in debugging and extending it. No two application data models are alike. Even two applications in the same domain (e.g., two inventory systems) will have comically different terms, structures, and even levels of abstraction.

Each application is at about the complexity horizon that most mere mortals can handle. The combination of all those models is far beyond the ability of individuals to grasp.

Enterprise Resource Planning applications and Enterprise Data Modeling projects have shone a light on how complex it can get to attempt to model all an enterprise’s data. ERP systems now have tens of thousands of tables, and hundreds of thousands of columns. Enterprise Data Modeling fell into the same trap. Most efforts attempted to describe the union of all the application models that were in use. The complexity made them unusable.

What few who are focused on solving a point solution are aware of, is that there is a single simple model at the heart of every enterprise. It is simple enough that motivated analysts and developers can get their heads around it in a finite amount of time. And it can be mapped to the existing complex schemas in a lossless fashion.

The ability to posit these simple models is enabled by RDF (and its bigger brothers OWL and SHACL). RDF doesn’t guarantee you’ll create a simple or understandable model (there are plenty of counterexamples out there) but it at least makes the problem tractable.

Concept Sharing

An RDF based system is mostly structure-free, so we don’t have to be concerned with structural disparities between systems, but we do need a way to share concepts. We need to have a way to know that “employee,” “worker,” “user,” and “operator” are all referring to the same concept. Or if they aren’t, in what ways they overlap.

In an RDF-based system we spend a great deal of time understanding the concepts that are being used in all the application systems, and then creating a way that both the meaning and the identity of the concept can be easily shared across the enterprise. And that the map between the existing application schema elements and the shared concepts are also well known and findable.

One mechanism that helps with this is the idea that concepts have global identifiers (URIs /IRIs) that can be resolved. You don’t need to know which application defined a concept; the domain name (and therefore the source authority) is right there in the identifier and can be used much like a URL to surface everything known about the concept. This is an important feature of enterprise integration.

Instance Level Integration

It’s not just the concepts. All the instances referred to in application systems have identifiers. But often the identifiers are local. That is, “007” refers to James Bond in the Secret Agent table, but it refers to “Ham Sandwich” in the company cafeteria system.

The fact that systems have been creating identity aliases for decades is another problem that needs to be addressed at the enterprise level. The solution is not to attempt, as many have in the past, to jam a “universal identifier” into the thousands of affected systems. It is too much work, and they can’t handle it anyway. Plus, there are many identity problems that were unpredicted at the time their systems were built (who imagined that some of our vendors would also become customers?) and are even harder to resolve.

The solution involves a bit of entity resolution, coupled with a flexible data structure that can accommodate multiple identifiers without getting confused.

Data Warehouse, Data Lake, and Data Catalog all in One

Three solutions have been mooted over the last three decades to partially solve the enterprise integration problem: data warehouses, lakes, and catalogs. Data warehouses acknowledged that data has become balkanized. By conforming it to a shared dimensional model and co-locating the data, we could get combined reporting. But the data warehouse was lacking on many fronts: it only had a fraction of the enterprise’s data, it was structured in a way that wouldn’t allow transactional updates, and it was completely dependent on the legacy systems that fed it. Plus, it was a lot of work.

The data lake approach said co-location is good, let’s just put everything in one place and let the consumers sort it out. They’re still trying to sort it out.

Finally, the data catalog approach said: don’t try to co-locate the data, just create a catalog of it so consumers can find it when they need it.

The RDF model allows us to mix and match the best of all three approaches. We can conform some of the enterprise data (we usually recommend all the entity data such as MDM and the like, as well as some of the key transactional data). An RDF catalog, coupled with an R2RML or RML style map, will not only allow a consumer to find data sets of interest, in many cases they can be accessed using the same query language as the core graph. This ends up being a great solution for things like IoT, where there are great volumes of data that only need to be accessed on an exception basis.

Query Federation

We hinted at query federation in the above paragraph. The fact that query federation is built into the spec (of SPARQL, which is the query language of choice for RDF, and also doubles as a protocol for federation) allows data to be merged at query time, across different database instances, different vendors and even different types of databases (with real time mapping, relational and document databases can be federated into SPARQL queries).

Where RDF Might Be Overkill

The ability to aid enterprise integration comes at a cost. Making sure you have valid, resolvable identifiers is a lot of work. Harmonizing your data model with someone else’s is also a lot of work. Thinking primarily in graphs is a paradigm shift. Anticipating and dealing with the flexibility of schema-later modeling adds a lot of overhead. Dealing with the oddities of open world reasoning is a major brain breaker.

If you don’t have to deal with the complexities of enterprise integration, and you are consumed by solving the problem at hand, then maybe the added complexity of RDF is not for you.

But before you believe I’ve just given you a free pass consider this: half of all the work in most IT shops is putting back together data that was implemented by people who believed they were solving a standalone problem.

Summary

There are many aspects of the enterprise integration problem that lend themselves to RDF-based solutions. The very features that help at the enterprise integration level may indeed get in the way at the point solution level.

And yes, it would in theory be possible to graft solutions to each of the above problems (and more, including provenance and fine-grained authorization) onto relational, JSON or LPG. But it’s a lot of work and would just be reimplementing the very features that developers in these camps find so difficult.

If you are attempting to tackle enterprise integration issues, we strongly encourage you to consider RDF. There is a bit of a step function to learn it and apply it well, but we think it’s the right tool for the job.