Semantic Arts exists to shepherd organizations on their Data-Centric journey.

Our core capabilities include:

  • Semantic Knowledge Graph Development and Implementation

  • Legacy Avoidance, Erosion, and Replacement

We can help your organization to fix the tangled mess of information in your enterprise systems, while discovering ways to dissolve data silos and reduce integration debt.

What is Data-Centric?

Data-Centric is about reversing the priority of data and applications.  Right now, applications rule.  Applications own “their” data (it’s really your data, but good luck with that).  When you have 1,000 applications (which most large firms do), you have 1,000 incompatible data silos. This serves to further the entrenchment of legacy systems, with no real motivation for change.

Data-Centric says data and their models come first.  Applications conform to the data, not the other way around.  Almost everyone is surprised at the fundamental simplicity, once it’s been articulated.

It sounds simple, but fifty years of “application-centricity” is a hard habit to break.  We specialize in helping firms make this transition.  We recognize that in addition to new technology and design skills, a major part of most projects is helping shepherd the social change that this involves.

If you’re fed up with application-centricity and the IT-fad-of-the-month club, click on the contact us button.

Read More: What is Data-Centric?

What about those legacy systems?

The move to a more data-centric architecture requires thoughtful planning. Early phases look more like a surgical process of dealing with legacy applications in a way that realizes quick wins and begins to reduce costs, helping to fund future phases. Usually, it looks something like this:

  1. Legacy avoidance: The firm slows down or stops launching new legacy application projects and instead relies on the data already in the shared knowledge graph.

  2. Legacy erosion: Occurs when firms take use cases that were being performed in a legacy system and implement them directly on the graph. Rather than wholesale legacy elimination (which is hard), this approach allows the functionality of the legacy system to be gradually decommissioned.

  3. Legacy replacement: Once enough of the data, functionality and especially integration points have been shifted to the graph, legacy systems can be replaced. Not with other legacy systems as in “legacy modernization” but with lightweight standalone use cases on the graph.

Read more: Incremental Stealth Legacy Modernization

ABOUT US

Learn more about our mission, our history, and our team.

THOUGHT LEADERSHIP

See how we are leading the way towards a data-centric future, and those who have taken note.

PROBLEMS WE SOLVE

How we can help you along the journey.

Taking a different path STARTS NOW. Become Data-Centric to simplify and enhance your enterprise information landscape:

5 Business Reasons for Implementing a Knowledge Graph Solution

1. Comprehensive data integration

2. Contextualized knowledge discovery

3. Agile knowledge sharing and collaboration

4. Intelligent search and recommendation

5. Future-proof data strategy

Integrating semantic capabilities into enterprise business processes has been the foundational shift that organizations such as Google, Amazon, and countless others have leveraged. The results are tangible: increased market share and revenue, lower costs, better customer experiences, reduced risks and the promotion of innovation.

Semantic Arts' professional services deliver true solutions (not gimmicks) for current and future information management challenges.

FROM OUR BLOG

Knowledge Graph Modeling: Time series micro-pattern using GIST

Michael Sullivan is a Principal Cloud Solutions Architect at Oracle. Article reprinted with permission (original is here). For any enterprise, being able to model time series is more than just important; in many cases it is critical. There are many examples but some trivial ones include “Person is employed By Employer” (Employment date-range), “Business has... Continue reading

Alan Morrison: Zero-Copy Integration and Radical Simplification

Dave McComb’s book Software Wasteland underscored a fundamental problem: Enterprise software sometimes costs 1,000 times more than it ought to. The poster child for cost overruns highlighted in the book was Healthcare.gov, a public registration system for the US Affordable Care Act, enacted in 2010. By 2018, the US Federal government had spent $2.1... Continue reading

gist: 12.x


gist: is our minimalist upper ontology. It is designed to have the maximum coverage of typical business ontology concepts with the fewest number of primitives and the least amount of ambiguity. Our gist: ontology is free (as in free speech and free beer; it is covered under the Creative Commons 3.0 attribution share-alike license). You can use it as you see fit for any purpose; just give us attribution.

Semantic Arts Plays Well With Others

Click on an icon for a synopsis of our work with these firms.

HR Tech and The Kitchen Junk Drawer

I often joke that when I started with Semantic Arts nearly two years ago, I had no idea a solution existed to a certain problem that I well understood. I had experienced many of the challenges and frustrations of an application-centric world but had always assumed it was just a reality of doing business. As an HR professional, I’ve heard over the years about companies having to pick the “best of the worst” technologies. Discussion boards are full of people dissatisfied with current solutions – and when they try new ones, they are usually dissatisfied with those too!

The more I have come to understand the data-centric paradigm, the more I have discovered its potential value in all areas of business, but especially in human resources. It came as no surprise to me when a recent podcast by Josh Bersin revealed that the average large company is using 80 to 100 different HR Technology systems (link). Depending on who you ask, HR comprises twelve to fifteen key functions – meaning that we have an average of six applications for each key function. Even more ridiculously, many HR leaders would admit that there are probably even more applications in use that they don’t know about.  Looking beyond HR at all core business processes, larger companies are using more than two hundred applications, and the number is growing by 10% per year, according to research by Okta from earlier this year (link). From what we at Semantic Arts have seen, the problem is actually much greater than this research indicates.

Why Is This a Problem?

Most everyone has experienced the headaches of such application sprawl. Employees often have to crawl through multiple systems, wasting time and resources, either to find data they need or to recreate the analytics required for reporting. As more systems come online to try to address gaps, employees are growing weary of learning yet another system that carries big promises but usually fails to deliver (link). Let’s not forget the enormous amount of time spent by HR Tech and other IT resources to ensure everything is updated, patched and working properly. Then, there is the near daily barrage of emails and calls from yet another vendor promising some incremental improvement or ROI that you can’t afford to miss (“Can I have just 15 minutes of your time?”).

Bersin’s podcast used a great analogy for this: the kitchen drawer problem. We go out and procure some solution, but it gets thrown into the drawer with all the other legacy junk. When it comes time to look in the drawer, either it’s so disorganized or we are in such a hurry that it seems more worthwhile to just buy another app than to actually take the time to sort through the mess.

Traditional Solutions

When it comes to legacy applications, companies don’t even know where to start. No one knows who is even using which system, so no one dares to shut off or replace anything. We end up with a mess of piecemeal integrations that may solve the immediate issue but just kick the technical debt down the road. Sure, there are a few ETL and other integration tools out there that can be helpful, but without a unified data model and a broad plan, these initiatives usually end up in the drawer with all the other “flavor of the month” solutions.

Another route is to simply put a nice interface over the top of everything, such as ServiceNow or other similar solutions. This can enhance the employee experience by providing a “one stop shop” for information, but it does nothing to address the underlying issues. These systems have gotten quite expensive, and can run $50,000-$100,000 per year (link). The systems begin to look like ERPs in terms of price and upkeep, and eventually they become legacy systems themselves.

Others go out and acquire a “core” solution such as SAP, Oracle, or another ERP system. They hope that these solutions, together with the available extensions, will provide the same interface benefits. A company can then buy or build apps that integrate. Ultimately, these solutions are also expensive and become “black boxes” where data and its related insights are not visible to the user due to the complexity of the system. (Intentional? You decide…). So now you go out and either pay experts in the system to help you manipulate it or settle for whatever off-the-shelf capabilities and reporting you can find. (For one example of how this can go, see link).

A Better Path Forward

Many of the purveyors of these “solutions” would have you believe there is no better way forward; but those familiar with data-centricity know better. To be clear, I’m not a practitioner or technologist. I joined Semantic Arts in an HR role, and the ensuing two years have reshaped the way I see HR and especially HR information systems. I’ll give you a decent snapshot as I understand it, along with an offer: if you’re interested in the ins and outs of these things, I’d be happy to introduce you to someone who can answer your questions in greater detail.

Fundamentally, a true solution requires a mindset shift away from application silos and integration, towards a single, simple model that defines the core elements of the business, together with a few key applications that are bound to that core and speak the same language. This can be built incrementally, starting with specific use cases and expanding as it makes sense. This approach means you don’t need to have it “all figured out” from the start. With the adoption of an existing ontology, this is made even easier … but more on that later.

Once a core model is established, an organization can begin to deal methodically with legacy applications. You will find that over time many organizations go from legacy avoidance to legacy erosion, and eventually to legacy replacement. (See post on Incremental Stealth Legacy Modernization). This allows a business to slowly clean out that junk drawer and avoid filling it back up in the future (and what’s more satisfying than a clean junk drawer?).

Is this harder in the short term than traditional solutions? It may appear so on the surface, but really it isn’t. When a decision is made to start slowly, companies discover that the flexibility of semantic knowledge graphs allows for quick gains. Application development is less expensive and applications more easily modified as requirements change. Early steps help pay for future steps, and company buy-in becomes easier as stakeholders see their data come to life and find key business insights with ease.

For those who may be unfamiliar with semantic knowledge graphs, let me try to give a brief introduction. A graph database is a fundamental shift away from the traditional relational structure. When combined with formal semantics, a knowledge graph provides a method of storing and querying information that is more flexible and functional (more detail at link or link). Starting from scratch would be rather difficult, but luckily there are starter models (ontologies) available, including one we’ve developed in-house called gist, which is both free and freely available. By building on an established structure, you can avoid re-inventing the wheel.
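To make that concrete, here is a minimal sketch (not a definitive model) of what a couple of HR facts look like as triples in Turtle. gist:Person and gist:name are real gist terms mentioned elsewhere on this site; the ex: names, the membership property, and the gist namespace IRI shown are illustrative assumptions:

@prefix gist: <https://ontologies.semanticarts.com/gist/> .   # indicative; use the namespace your gist release declares
@prefix ex:   <https://www.example.com/> .                    # illustrative namespace

ex:_JaneDoe
    a gist:Person ;                    # gist:Person comes from the gist ontology
    gist:name "Jane Doe" ;             # so does gist:name
    ex:memberOf ex:_HRDepartment .     # ex:memberOf is a made-up property for this sketch

Even a handful of triples like these can be queried alongside data from any other system that uses the same core terms, which is the whole point of building on a shared model.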

HR departments looking to leverage AI and large language models in the future will find this data-centric transformation even more essential, but that’s a topic for another time.

Conclusion

HR departments face unique challenges. They deal with large amounts of information and must justify their spending as non-revenue-producing departments. The proliferation of systems and applications is a drain on employee morale and productivity and a major source of budget waste.

By adopting data-centric principles and applying them intentionally in future purchasing and application development, HR departments can realize greater strategic insights while saving money and providing a richer employee experience.

Taken all the way to completion, adoption of these technologies and principles would mean business data stored in a single, secured location. Small apps or dashboards can be rapidly built and deployed as the business evolves. No more legacy systems, no more hidden data, no more frustration with systems that simply don’t work.

Maybe, just maybe, this model will provide a success story that leads the rest of the organization to adopt similar principles.

 

JT Metcalf is the Chief Administrative Officer at Semantic Arts, managing HR functions along with many other hats.

The Data-Centric Revolution: “RDF is Too Hard”

This article originally appeared at The Data-Centric Revolution: “RDF is Too Hard” – TDAN.com. Subscribe to The Data Administration Newsletter for this and other great content!


By Dave McComb

We hear this a lot. We hear it from very smart people. Just the other day we heard someone say they had tried RDF twice at previous companies and it failed both times. (RDF stands for Resource Description Framework,[1] which is an open standard underlying many graph databases). It’s hard to convince someone like that that they should try again.

That particular refrain was from someone who was a Neo4j user (the leading contender in the LPG (Labeled Property Graph) camp). We hear the same thing from any of three camps: the relational camp, the JSON camp, and the aforementioned LPG camp.

Each has a different reason for believing this RDF stuff is just too hard. In this article, I’ll explore the nuances of RDF, shedding light on its challenges and strengths in the context of enterprise integration and application development.

For a lot of problems, the two-dimensional world of relational tables is appealing. Once you know the column headers, you pretty much know how to get to everything. It’s not quite one form per table, but it isn’t wildly off from that. You don’t have to worry about some of the rows having additional columns, you don’t have to worry about some cells being arrays or having additional depth. Everything is just flat, two-dimensional tables. Most reporting is just a few joins away.

JSON is a bit more interesting. At some point you discover, or decree if you’re building it, that your dataset has a structure. Not a two-dimensional structure as in relational, but more of a tree-like structure. More specifically, it’s all about determining if this is an array of dictionaries or a dictionary of arrays. Or a dictionary of dictionaries. Or an array of arrays. Or any deeply nested combination of these simple structures. Are the keys static — that is, can they be known specifically at coding time, or are they derived dynamically from the data itself? Frankly, this can get complex, but at least it’s only locally complex. A lot of JSON programming is about turning someone else’s structure into a structure that suits the problem at hand.

One way to think of LPG is JSON on top of a graph database. It has a lot of the flexibility of JSON coupled with the flexibility of graph traversals and graph analytics. It solves problems that are difficult to solve with relational or plain JSON, and it has beautiful graphics out of the box.

Each of these approaches can solve a wide range of problems. Indeed, almost all applications use one of those three approaches to structure the data they consume.

And I have to admit, I’ve seen a lot of very impressive Neo4j applications lately. Every once in a while, I question myself and wonder aloud if we should be using Neo4j. Not because RDF is too hard for us; we’ve mastered it and have many successful implementations running at client sites and internally. But maybe, if it really is easier, we should switch. And maybe it just isn’t worth disagreeing with our prospects.

Enterprise Integration is Hard

Then it struck me. The core question isn’t really “RDF vs. LPG (or JSON or relational)”; it’s “application development vs. enterprise integration.”

I’ve heard Jans Aasman, CEO of Franz, the creators of AllegroGraph, make this observation more than once: “Most application developers have dedicated approximately 0 of their neurons contemplating how what they are working on is going to fit in with the rest of their enterprise,  whereas people who are deeply into RDF may spend upwards of half their mental cycles thinking of how the task and data at hand fits into the overall enterprise model.”

That, I think, is the nub of the issue. If you are not concerned with enterprise integration, then maybe those features that scratch the itches that enterprise integration creates are not worth the added hassle.

Let’s take a look at the aspects of enterprise integration that are inherently hard, why RDF might be the right tool for the job, and why it might be overkill for traditional application development.

Complexity Reduction

One of the biggest issues dealing with enterprise integration is complexity. Most mid to large enterprises harbor thousands of applications. Each application has thousands of concepts (tables and columns or classes and attributes or forms and fields) that must be learned to become competent either in using the application and/or in debugging and extending it. No two application data models are alike. Even two applications in the same domain (e.g., two inventory systems) will have comically different terms, structures, and even levels of abstraction.

Each application is at about the complexity horizon that most mere mortals can handle. The combination of all those models is far beyond the ability of individuals to grasp.

Enterprise Resource Planning applications and Enterprise Data Modeling projects have shone a light on how complex it can get to attempt to model all an enterprise’s data. ERP systems now have tens of thousands of tables, and hundreds of thousands of columns. Enterprise Data Modeling fell into the same trap. Most efforts attempted to describe the union of all the application models that were in use. The complexity made them unusable.

What few of those focused on point solutions realize is that there is a single, simple model at the heart of every enterprise. It is simple enough that motivated analysts and developers can get their heads around it in a finite amount of time. And it can be mapped to the existing complex schemas in a lossless fashion.

The ability to posit these simple models is enabled by RDF (and its bigger brothers OWL and SHACL). RDF doesn’t guarantee you’ll create a simple or understandable model (there are plenty of counterexamples out there) but it at least makes the problem tractable.

Concept Sharing

An RDF-based system is mostly structure-free, so we don’t have to be concerned with structural disparities between systems, but we do need a way to share concepts. We need to have a way to know that “employee,” “worker,” “user,” and “operator” are all referring to the same concept, or, if they aren’t, in what ways they overlap.

In an RDF-based system we spend a great deal of time understanding the concepts that are being used in all the application systems, and then creating a way for both the meaning and the identity of each concept to be easily shared across the enterprise, and for the mapping between the existing application schema elements and the shared concepts to be well known and findable.

One mechanism that helps with this is the idea that concepts have global identifiers (URIs/IRIs) that can be resolved. You don’t need to know which application defined a concept; the domain name (and therefore the source authority) is right there in the identifier and can be used much like a URL to surface everything known about the concept. This is an important feature of enterprise integration.
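As a hedged sketch of what that sharing can look like (every name below is hypothetical), the shared concept gets one resolvable IRI and the application-local terms are mapped to it:

@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
@prefix ent:    <https://data.example.com/concepts/> .    # shared enterprise concepts
@prefix hrapp:  <https://data.example.com/apps/hr/> .     # terms from one legacy application
@prefix crmapp: <https://data.example.com/apps/crm/> .    # terms from another

ent:Employee a owl:Class ;
    skos:definition "A person who has an employment agreement with the company." .

# map the application-local terms to the shared, resolvable concept
hrapp:Worker owl:equivalentClass ent:Employee .
crmapp:User  rdfs:subClassOf     ent:Employee .   # CRM users overlap with, but are narrower than, employees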

Instance Level Integration

It’s not just the concepts. All the instances referred to in application systems have identifiers.  But often the identifiers are local. That is, “007” refers to James Bond in the Secret Agent table, but it refers to “Ham Sandwich” in the company cafeteria system.

The fact that systems have been creating identity aliases for decades is another problem that needs to be addressed at the enterprise level. The solution is not to attempt, as many have in the past, to jam a “universal identifier” into the thousands of affected systems. It is too much work, and they can’t handle it anyway. Plus, there are many identity problems that were unpredicted at the time their systems were built (who imagined that some of our vendors would also become customers?) and are even harder to resolve.

The solution involves a bit of entity resolution, coupled with a flexible data structure that can accommodate multiple identifiers without getting confused.
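A minimal sketch of such a flexible structure (all names are hypothetical): the real-world thing gets one enterprise IRI, and each system-local identifier hangs off it as data rather than being forced into every source system as a universal key.

@prefix ex: <https://data.example.com/> .

ex:_Party_42 a ex:Organization ;
    ex:hasIdentifier
        [ ex:identifierValue "007" ;    ex:assignedBy ex:_SecretAgentSystem ] ,
        [ ex:identifierValue "C-1138" ; ex:assignedBy ex:_CustomerMasterSystem ] .
# Entity resolution links both local IDs to the one enterprise IRI,
# so neither source system has to change its keys.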

Data Warehouse, Data Lake, and Data Catalog all in One

Three solutions have been mooted over the last three decades to partially solve the enterprise integration problem: data warehouses, lakes, and catalogs.  Data warehouses acknowledged that data has become balkanized.  By conforming it to a shared dimensional model and co-locating the data, we could get combined reporting.  But the data warehouse was lacking on many fronts: it only had a fraction of the enterprise’s data, it was structured in a way that wouldn’t allow transactional updates, and it was completely dependent on the legacy systems that fed it. Plus, it was a lot of work.

The data lake approach said co-location is good, let’s just put everything in one place and let the consumers sort it out. They’re still trying to sort it out.

Finally, the data catalog approach said: don’t try to co-locate the data, just create a catalog of it so consumers can find it when they need it.

The RDF model allows us to mix and match the best of all three approaches. We can conform some of the enterprise data (we usually recommend all the entity data such as MDM and the like, as well as some of the key transactional data). An RDF catalog, coupled with an R2RML or RML style map, will not only allow a consumer to find data sets of interest, in many cases they can be accessed using the same query language as the core graph. This ends up being a great solution for things like IoT, where there are great volumes of data that only need to be accessed on an exception basis.
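For readers who have not seen R2RML, a mapping for one hypothetical relational table of sensor readings might look roughly like this (the table, column, and class names are invented for illustration):

@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <https://data.example.com/> .

<#SensorReadingMap> a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "IOT_READINGS" ] ;
    rr:subjectMap   [ rr:template "https://data.example.com/reading/{READING_ID}" ;
                      rr:class ex:SensorReading ] ;
    rr:predicateObjectMap [ rr:predicate ex:measuredValue ;
                            rr:objectMap [ rr:column "VALUE" ] ] .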

Query Federation

We hinted at query federation in the above paragraph. The fact that query federation is built into the spec (of SPARQL, which is the query language of choice for RDF, and also doubles as a protocol for federation) allows data to be merged at query time, across different database instances, different vendors and even different types of databases (with real time mapping, relational and document databases can be federated into SPARQL queries).
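As a rough sketch of what that looks like in practice, a SERVICE clause sends part of the query to another endpoint, and the results are joined with the core graph at query time. The endpoint URL and the ex: terms below are placeholders:

prefix ex: <https://data.example.com/>

select ?sensor ?reading ?value
where {
  ?sensor a ex:Sensor ;                 # entity data held in the core graph
          ex:deviceId ?deviceId .
  service <https://iot.example.com/sparql> {    # hypothetical remote endpoint, possibly a mapped relational store
    ?reading ex:fromDevice ?deviceId ;
             ex:measuredValue ?value .
  }
}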

Where RDF Might Be Overkill

The ability to aid enterprise integration comes at a cost. Making sure you have valid, resolvable identifiers is a lot of work. Harmonizing your data model with someone else’s is also a lot of work. Thinking primarily in graphs is a paradigm shift. Anticipating and dealing with the flexibility of schema-later modeling adds a lot of overhead. Dealing with the oddities of open world reasoning is a major brain breaker.

If you don’t have to deal with the complexities of enterprise integration, and you are consumed by solving the problem at hand, then maybe the added complexity of RDF is not for you.

But before you believe I’ve just given you a free pass consider this: half of all the work in most IT shops is putting back together data that was implemented by people who believed they were solving a standalone problem.

Summary

There are many aspects of the enterprise integration problem that lend themselves to RDF-based solutions. The very features that help at the enterprise integration level may indeed get in the way at the point solution level.

And yes, it would in theory be possible to graft solutions to each of the above problems (and more, including provenance and fine-grained authorization) onto relational, JSON or LPG. But it’s a lot of work and would just be reimplementing the very features that developers in these camps find so difficult.

If you are attempting to tackle enterprise integration issues, we strongly encourage you to consider RDF. There is a bit of a step function to learn it and apply it well, but we think it’s the right tool for the job.

gist Jumpstart

This blog post is for anyone responsible for Enterprise data management who would like to save time and costs by re-using a great piece of modeling work. It updates an earlier blog post, “A brief introduction to the gist semantic model”.

A core semantic model, also called an upper ontology, is a common model across the Enterprise that includes major concepts such as Event, Agreement, and Organization. Using an upper ontology greatly simplifies data integration across the Enterprise. Imagine, for example, being able to see all financial Events across your Enterprise; that kind of visibility would be a powerful enabler for accurate financial tracking, planning, and reporting.

If you are ready to incorporate semantics into your data environment, consider using the gist upper ontology. gist is available for free from Semantic Arts under a creative commons license. It is based on more than a hundred data-centric projects done with major corporations in a variety of lines of business.  gist “is designed to have the maximum coverage of typical business ontology concepts with the fewest number of primitives and the least amount of ambiguity.”  The Wikipedia entry for upper ontologies compares gist to other ontologies and gives a good sense of why gist is a match for Enterprise data management: it is comprehensive, unambiguous, and easy to understand.

 

So, what exactly is in gist?

First, gist includes types of things (classes) involved in running an Enterprise. Some of the more frequently used gist classes, grouped for ease of understanding, are:

Some of these classes have subclasses that are not shown. For example, an Intention could be a Goal or a Requirement.

Gist also includes properties that are used to describe things and to describe relationships between things. Many of the gist properties can be approximately grouped as above:

Other commonly used gist properties include:

Next, let’s look at a few typical graph patterns that illustrate how classes and properties work together to model the Enterprise world.

An Account might look like:

An Event might look like:

An ID such as a driver’s license might look like:

To explore gist in more detail, you can view it in an ontology editor such as Protégé. Try looking up the Classes and Properties in each group above (who, what, where, why, etc.). Join the gist Forum (select and scroll to the bottom) for regular discussion and updates.

Take a look at gist.  It’s worth your time, because adopting gist as your upper ontology can be a significant step toward reversing the proliferation of data silos within your Enterprise.

Further reading and videos:

3-part video introduction to gist:

  1. https://www.youtube.com/watch?v=YbaDZSuhm54&t=123s
  2. https://www.youtube.com/watch?v=UzNVIErpGpQ&t=206s
  3. https://www.youtube.com/watch?v=2g0E6cFro18&t=14s

Software Wasteland, by Dave McComb

The Data-Centric Revolution, by Dave McComb

Demystifying OWL for the Enterprise, by Michael Uschold

 

Diagrams in this blog post were generated using a visualization tool.

Data-Centric Revolution: Is Knowledge Ontology the Missing Link?

“You would think that after knocking around in semantics and knowledge graphs for over two decades I’d have had a pretty good idea about Knowledge Management, but it turns out I didn’t.

I think in the rare event the term came up I internally conflated it with Knowledge Graphs and moved on. The first tap on the shoulder that I can remember was when we were promoting work on a mega project in Saudi Arabia (we didn’t get it, but this isn’t likely why). We were trying to pitch semantics and knowledge graphs as the unifying fiber for the smart city the Neom Line was to become.

In the process, we came across a short list of Certified Knowledge Management platforms they were considering. Consider my chagrin when I’d never heard of any of them. I can no longer find that list, but I’ve found several more since…”

Read the rest: Data Centric Revolution: Is Knowledge Ontology the Missing Link? – TDAN.com

Interested in joining the discussion? Join the gist Forum (Link to register here)

A Knowledge Graph for Mathematics

This blog post is for anyone interested in mathematics and knowledge representation as they relate to career progression in today’s changing information ecosystem. Mathematics and knowledge representation have a strong common thread: they both require finding good abstractions and simple, elegant solutions, and they both have a foundation in set theory. The topic could be used as the starting point for an accessible academic research project that deals with the foundations of mathematics while also developing commercially marketable knowledge representation skills.

Hypothesis: Could the vast body of mathematical knowledge be put into a knowledge graph? Let’s explore, because doing so could provide a searchable database of mathematical concepts and help identify previously unrecognized connections between concepts.

Every piece of data in a knowledge graph is a semantic triple of the form:

subject – predicate – object.

A brief look through mathematical documentation reveals the frequent appearance of semantic triples of the form:

A implies B, where A and B are statements.

“A implies B” is itself a statement, equivalent to “If A then B”. Definitions, axioms, and theorems can be stated using these if/then statements. The if/then statements build on each other, starting with a foundation of definitions and axioms (statements so fundamental they are made without proof). Furthermore, the predicate “implies” is transitive, meaning an “implies” relationship can be inferred from a chain of other “implies” relationships.

… hence the possibility of programmatically discovering relationships between statements.
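A hedged sketch of how this could be captured, treating statements as individuals and “implies” as a transitive property (the math: namespace and names are invented for illustration):

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix math: <https://math.example.com/> .    # hypothetical namespace

math:implies a owl:ObjectProperty , owl:TransitiveProperty .

math:_StatementA math:implies math:_StatementB .   # A implies B
math:_StatementB math:implies math:_StatementC .   # B implies C
# a reasoner can now infer:  math:_StatementA math:implies math:_StatementC .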

Before speculating further, let’s examine two examples from the field of point set topology, which deals abstractly with concepts like continuity, connectedness, and compactness.

Definition: a collection of sets T is a topology if and only if the following are true:

• the union of sets in any subcollection of T is a member of T
• the intersection of sets in any finite subcollection of T is a member of T.

Problem: Suppose there is a topology T and a set X that satisfies the following condition:

• for every member x of X there is a set Tx in T with x in Tx and Tx a subset of X.

Show that X is a member of T.

Here’s a diagram showing the condition stated in the problem, which holds for every x in X:

Perhaps you can already see what happens if we take the union of all of the Tx’s, one for each x in X.

In English, the solution to the problem is:

The union of all sets Tx is a subset of X because every Tx is a subset of X.

The union of all sets Tx contains X because there is a Tx containing x, for every x in X.

Based on the two statements above, the union of all sets Tx equals X because it is both a subset and a superset of X.

Finally, since every Tx belongs to T, the union of all sets Tx (which is X) is a member of T.

Let’s see how some of this might look in a knowledge graph. According to the definition of topology:

Applying this pattern to the problem above, we find:

While it may seem simple to recognize the sameness of the patterns on the left side of the two diagrams above, what precisely is it that makes the pattern in the problem match the pattern in the definition of topology? The definition applies because both left-hand statements conform to the same graph pattern:

This graph pattern consists of two triple patterns, each of which has the form:

[class of the subject] – predicate – [class or datatype of the object].

We now have the beginnings of a formal ontology based on triple patterns that we have encountered so far. Statements, including complex ones, can be represented using triples.

Note: in the Web Ontology Language, the properties hasSubject, hasPredicate, and hasObject will need to be annotation properties (they can be used in queries but will not be part of automated inference).
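In Turtle, a statement captured with those annotation properties might look something like the following sketch; the math: namespace and the individual names are invented, and only the general pattern follows the note above:

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix math: <https://math.example.com/> .    # hypothetical namespace

math:hasSubject   a owl:AnnotationProperty .
math:hasPredicate a owl:AnnotationProperty .
math:hasObject    a owl:AnnotationProperty .

# "The union of all sets Tx is a member of T", represented as a statement individual
math:_Statement1 a math:Statement ;
    math:hasSubject   math:_UnionOfAllTx ;
    math:hasPredicate math:isMemberOf ;
    math:hasObject    math:_TopologyT .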

Major concepts can be represented as classes:

It’s generally good practice to use classes for major concepts, while using other methods such as categories to model other distinctions needed.

Other triple patterns we have seen describe a variety of relationships between sets and collections of sets, summarized as:

Could the vast body of mathematical knowledge be put into a knowledge graph? Certainly, a substantial amount of it, that which can be expressed as “A implies B”.

However, much remains to be done. For example, we have not looked at how to distinguish between a statement that is asserted to be true versus, for example, a statement that is part of an “if” clause.

Or imagine a math teacher on Monday saying “x + 3 = 7” and on Tuesday saying “x – 8 = 4”. In a knowledge graph, every thing has a unique permanent ID, so if x is 4 on Monday, it is still 4 on Tuesday. Perhaps there is a simple way to bridge the typical mathematical re-use of non-specific names like “x” and the knowledge graph requirement of unique IDs; finding it is left to the reader.

For a good challenge, try stating the Urysohn Lemma using triples, and see how much of its proof can be represented as triples and triple patterns.

To understand modeling options within the Web Ontology Language (OWL), I refer the reader to the book Demystifying OWL for the Enterprise by Michael Uschold. The serious investigator might also want to explore the semantics of rdf* since it explicitly deals with the semantics of statements.

Special thanks to Irina Filitovich for her insights and comments.

The ABCs of QUDT

This blog post is for anyone interested in understanding units of measure for the physical world.

The dominant standard for units of measure is the International System of Units, part of a collaborative effort that describes itself as:

Working together to promote and advance the global comparability of measurements.

While the International System of Units is defined in a document, QUDT has taken the next step and defined an ontology and a set of reference data that can be queried via a public SPARQL endpoint. QUDT provides a wonderful resource for data-centric efforts that involve quantitative data.

QUDT is an acronym for Quantities, Units, Dimensions, and Types. With 72 classes and 178 properties in its ontology, QUDT may at first appear daunting. In this note, we will use a few simple SPARQL queries to explore the QUDT graph. The main questions we will answer are:

  1. What units are applicable for a given measurable characteristic?
  2. How do I convert a value from one unit to another?
  3. How does QUDT support dimensional analysis?
  4. How can units be defined in terms of the International System of Units?

Let’s jump right in. Please follow along as a hands-on exercise. Pull up the QUDT web site at:

https://qudt.org/

On the right side of the QUDT home page select the link to the QUDT SPARQL Endpoint where we can run queries:

From the SPARQL endpoint, select the query option.

Question 1: What units are applicable for a given measurable characteristic?

First, let’s look at the measurable characteristics defined in QUDT. Copy-paste this query into the SPARQL endpoint:

prefix qudt: <http://qudt.org/schema/qudt/>
prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl:  <http://www.w3.org/2002/07/owl#>
prefix xsd:  <http://www.w3.org/2001/XMLSchema#>

select ?qk
where {?qk rdf:type qudt:QuantityKind . }
order by ?qk

 

QUDT calls the measurable characteristics QuantityKinds.

Note that there is a Filter box that lets us search the output.

Type “acceleration” into the Filter box and then select the first value, Acceleration, to get a new tab showing the properties of Acceleration. Voila, we get a list of units for measuring acceleration:

Now to get a complete answer to our first question, just add a line to the query:

prefix qudt: <http://qudt.org/schema/qudt/>
prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl:  <http://www.w3.org/2002/07/owl#>
prefix xsd:  <http://www.w3.org/2001/XMLSchema#>

select ?qk ?unit
where {
  ?qk rdf:type qudt:QuantityKind ;
      qudt:applicableUnit ?unit ;  # new line
      .
}
order by ?qk ?unit

The output shows the units of measure for each QuantityKind.

Question 2: How do I convert a value from one unit to another?

Next, let’s look at how to do a unit conversion from feet to yards, with meter as an intermediary:

To convert from feet to meters, multiply by 0.3048. Then to convert from meters to yards, divide by 0.9144. Therefore, to convert from feet to yards, first multiply by 0.3048 and then divide by 0.9144. For example:

27 feet = 27 x (0.3048/0.9144) yards

= 9 yards

The 0.3048 and 0.9144 are in QUDT as the conversionMultipliers of foot and yard, respectively. You can see them with this query:

prefix qudt: <http://qudt.org/schema/qudt/>
prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl:  <http://www.w3.org/2002/07/owl#>
prefix xsd:  <http://www.w3.org/2001/XMLSchema#>

select ?unit ?multiplier
where {
  values ?unit {
    <http://qudt.org/vocab/unit/FT>
    <http://qudt.org/vocab/unit/YD> }
  ?unit qudt:conversionMultiplier ?multiplier .
}

This example of conversionMultipliers answers our second question: to convert a value from one unit of measure to another, first multiply by the conversionMultiplier of the “from” unit and then divide by the conversionMultiplier of the “to” unit. [Note: for temperatures, offsets are also needed.]
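As a quick sanity check, you can let the endpoint do the feet-to-yards arithmetic itself with a query along these lines (same FT and YD unit IRIs as above; expect 9):

prefix qudt: <http://qudt.org/schema/qudt/>

select ?yards
where {
  <http://qudt.org/vocab/unit/FT> qudt:conversionMultiplier ?ftMultiplier .
  <http://qudt.org/vocab/unit/YD> qudt:conversionMultiplier ?ydMultiplier .
  bind(27 * ?ftMultiplier / ?ydMultiplier as ?yards)   # 27 feet expressed in yards
}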

Question 3: How does QUDT support dimensional analysis?

To answer our third question we will start with a simple example:

Force = mass x acceleration

In the following query, we retrieve the exponents of Mass, Acceleration, and Force to validate that Force does indeed equal Mass x Acceleration:

prefix qudt: <http://qudt.org/schema/qudt/>
prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl:  <http://www.w3.org/2002/07/owl#>
prefix xsd:  <http://www.w3.org/2001/XMLSchema#>

select ?qk ?dv ?exponentForMass ?exponentForLength ?exponentForTime
where {
  values ?qk {
    <http://qudt.org/vocab/quantitykind/Mass>
    <http://qudt.org/vocab/quantitykind/Acceleration>
    <http://qudt.org/vocab/quantitykind/Force> }
  ?qk qudt:hasDimensionVector ?dv .
  ?dv qudt:dimensionExponentForMass   ?exponentForMass ;
      qudt:dimensionExponentForLength ?exponentForLength ;
      qudt:dimensionExponentForTime   ?exponentForTime ;
      .
}

Recall that to multiply “like terms” with exponents, add the exponents, e.g.

length¹ × length² = length³

In the QUDT output, look at the columns for Mass, Length, and Time. Note that in each column the exponents associated with Mass and Acceleration add up to the exponent associated with Force, as expected.
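For example, Mass has dimensions M¹, Acceleration has L¹T⁻², and Force has M¹L¹T⁻². Adding the exponents column by column gives 1 + 0 = 1 for Mass, 0 + 1 = 1 for Length, and 0 + (−2) = −2 for Time, which are exactly the exponents of Force.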

Question 4: How can units be defined in terms of the International System of Units?

Finally, we want to see how QUDT can be used to define units in terms of the base units of the International System of Units as defined in the SI Brochure. We want to end up with equations like:

1 inch = 0.0254 meters

1 foot per second squared = 0.3048 meters per second squared

1 pound per cubic yard = 0.5932764212577829 kilograms per cubic meter

Delving deeper into QUDT, we see the concept of QuantityKindDimensionVector. Every unit and every quantity kind is related to one of these QuantityKindDimensionVectors.

Let’s unpack what that means by way of an example, showing that the dimension vector A0E0L1I0M0H0T-2D0 means Length × Time⁻² (linear acceleration):

Start with dimension vector: A0E0L1I0M0H0T-2D0

Each letter stands for a base dimension, and the vector can also be written as:

Amount⁰ × ElectricCurrent⁰ × Length¹ × Intensity⁰ × Mass⁰ × Heat⁰ × Time⁻² × Other⁰

Every term with an exponent of zero equals 1, so this expression can be reduced to:

Length × Time⁻² (also known as Linear Acceleration)

The corresponding expression in terms of base units of the International System of Units is:

Meter × Second⁻² (the standard unit for acceleration)

… which can also be written as:

meter per second squared

Using this example as a pattern, we can proceed to query QUDT to get an equation for each QUDT unit in terms of base units. To reduce the size of the query we will focus on mechanics, where the base dimensions are Mass, Length, and Time and the corresponding base units are kilogram, meter, and second.

Here is the query to create the equations we want; run it on the QUDT SPARQL Endpoint and see what you get:

prefix qudt: <http://qudt.org/schema/qudt/>
prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl:  <http://www.w3.org/2002/07/owl#>
prefix xsd:  <http://www.w3.org/2001/XMLSchema#>

select distinct ?equation
where {
  ?unit rdf:type qudt:Unit ;
        qudt:conversionMultiplier ?multiplier ;
        qudt:hasDimensionVector ?dv ;
        rdfs:label ?unitLabel ;
        .
  ?dv qudt:dimensionExponentForMass   ?expKilogram ;  # translate to units
      qudt:dimensionExponentForLength ?expMeter ;
      qudt:dimensionExponentForTime   ?expSecond ;
      rdfs:label ?dvLabel ;
      .
  filter(regex(str(?dv), "A0E0L.*I0M.*H0T.*D0")) # mechanics
  filter(!regex(str(?dv), "A0E0L0I0M0H0T0D0"))
  filter(?multiplier > 0)
  bind(str(?unitLabel) as ?unitString)
  # to form a label for the unit:
  #    put positive terms first
  #    omit zero-exponent terms
  #    change exponents to words
  bind(if(?expKilogram > 0, concat("_kilogram_", str(?expKilogram)), "") as ?SiUnitTerm4)
  bind(if(?expMeter    > 0, concat("_meter_",    str(?expMeter)),    "") as ?SiUnitTerm5)
  bind(if(?expSecond   > 0, concat("_second_",   str(?expSecond)),   "") as ?SiUnitTerm7)
  bind(if(?expKilogram < 0, concat("_kilogram_", str(-1 * ?expKilogram)), "") as ?SiUnitTerm104)
  bind(if(?expMeter    < 0, concat("_meter_",    str(-1 * ?expMeter)),    "") as ?SiUnitTerm105)
  bind(if(?expSecond   < 0, concat("_second_",   str(-1 * ?expSecond)),   "") as ?SiUnitTerm107)
  bind(concat(?SiUnitTerm4,   ?SiUnitTerm5,   ?SiUnitTerm7)   as ?part1)
  bind(concat(?SiUnitTerm104, ?SiUnitTerm105, ?SiUnitTerm107) as ?part2)
  bind(if(?part2 = "", ?part1,
       if(?part1 = "", concat("per", ?part2),
       concat(?part1, "_per", ?part2))) as ?SiUnitString1)
  bind(replace(?SiUnitString1,  "_1_|_1$",   "_")             as ?SiUnitString2)
  bind(replace(?SiUnitString2,  "_2_|_2$",   "Squared_")      as ?SiUnitString3)
  bind(replace(?SiUnitString3,  "_3_|_3$",   "Cubed_")        as ?SiUnitString4)
  bind(replace(?SiUnitString4,  "_4_|_4$",   "ToTheFourth_")  as ?SiUnitString5)
  bind(replace(?SiUnitString5,  "_5_|_5$",   "ToTheFifth_")   as ?SiUnitString6)
  bind(replace(?SiUnitString6,  "_6_|_6$",   "ToTheSixth_")   as ?SiUnitString7)
  bind(replace(?SiUnitString7,  "_7_|_7$",   "ToTheSeventh_") as ?SiUnitString8)
  bind(replace(?SiUnitString8,  "_8_|_8$",   "ToTheEighth_")  as ?SiUnitString9)
  bind(replace(?SiUnitString9,  "_9_|_9$",   "ToTheNinth_")   as ?SiUnitString10)
  bind(replace(?SiUnitString10, "_10_|_10$", "ToTheTenth_")   as ?SiUnitString11)
  bind(replace(?SiUnitString11, "^_", "") as ?SiUnitString12) # tidy up
  bind(replace(?SiUnitString12, "_$", "") as ?SiUnitString13)
  bind(?SiUnitString13 as ?SiUnitLabel)
  bind(concat("1 ", str(?unitLabel), " = ", str(?multiplier), "  ", ?SiUnitLabel) as ?equation)
}
order by ?equation

The result of this query is a set of equations that tie more than 1200 units back to the base units of the International System of Units, which in turn are defined in terms of seven fundamental physical constants.

And that’s a wrap. We answered all four questions with only 3 QUDT classes and 6 QUDT properties:

  1. What units are applicable for a given measurable characteristic?
  2. How do I convert a value from one unit to another?
  3. How does QUDT support dimensional analysis?
  4. How can units be defined in terms of the International System of Units?

For future reference, here’s a map of the territory we explored:

One final note: kudos to everyone who contributed to QUDT; it has a lot of great information in one place. Thank you!

Extending an upper-level ontology (like GIST)

Michael Sullivan is a Principal Cloud Solutions Architect at Oracle. Article reprinted with permission (original is here)

If you have been following my blogs over the past year or so, then you will know I am a big fan of adopting an upper-level ontology to help bootstrap your own bespoke ontology project. Of the available upper-level ontologies, I happen to like gist as it embraces a “less is more” philosophy.

Given that this is 3rd party software with its own lifecycle, how does one “merge” such an upper ontology with your own? Like most things in life, there are two primary ways.

CLONE MODEL

This approach is straightforward: simply clone the upper ontology and then modify/extend it directly as if it were your own (being sure to retain any copyright notice). The assumption here is that you will change the “gist” domain into something else like “mydomain”. The benefit is that you don’t have to risk any 3rd party updates affecting your project down the road. The downside is that you lose out on the latest enhancements and improvements over time; if you wish to adopt them, you have to manually refactor them into your own ontology.

As the inventors of gist have many dozens of man-years of hands-on experience with developing and implementing ontologies for dozens of enterprise customers, this is not an approach I would recommend for most projects.

EXTEND MODEL

Just as when you extend any 3rd party software library you do so in your own namespace, you should also extend an upper-level ontology in your own namespace. This involves just a couple of simple steps:

First, declare your own namespace as an owl ontology, then import the 3rd party upper-level ontology (e.g. gist) into that ontology. Something along the lines of this:

<https://ont.mydomain.com/core> 
    a owl:Ontology ;
    owl:imports <https://ontologies.semanticarts.com/o/gistCore11.0.0> ;
    .

Second, define your “extended” classes and properties, referencing appropriate gist subclasses, subproperties, domains, and/or range assertions as needed. A few samples shown below (where “my” is the prefix for your ontology domain):

my:isFriendOf 
     a owl:ObjectProperty ;
     rdfs:domain gist:Person ;
     .
my:Parent 
    a owl:Class ;
    rdfs:subClassOf gist:Person ;
    .
my:firstName 
    a owl:DatatypeProperty ;
    rdfs:subPropertyOf gist:name ;
    .

The above definitions would allow you to update to new versions of the upper-level ontology* without losing any of your extensions. Simple right?

*When a 3rd party upgrades the upper-level ontology to a new major version — defined as non-backward compatible — you may find changes that need to be made to your extension ontology; as a hypothetical example, if Semantic Arts decided to remove the class gist:Person, the assertions made above would no longer be compatible. Fortunately, when it comes to major updates Semantic Arts has consistently provided a set of migration scripts which assist with updating your extended ontology as well as your instance data. Other 3rd parties may or may not follow suit.

Thanks to Rebecca Younes of Semantic Arts for providing insight and clarity into this.

The Data-Centric Revolution: Zero Copy Integration

I love the term “Zero Copy Integration.” I didn’t come up with it; the Data Collaboration Alliance did. The Data Collaboration Alliance is a Canada-based advocacy group promoting localized control of data along with federated access.

What I like about the term is how evocative it is. Everyone knows that all integration consists of copying and transforming data, whether you do it through an API, through ETL (Extract, Transform and Load), or through data-lake-style ELT (Extract, Load, and leave it to someone else to maybe eventually Transform). Either way, we know from decades of experience that integration is at its core copying data from a source to a destination.

This is why “copy-less copying” is so evocative.  It forces you to rethink your baseline assumptions.

We like it because it describes what we’ve been doing for years, and never had a name for. In this article, I’m going to drill a bit deeper into the enabling technology (i.e., what do you need to have in place to get Zero Copy Integration to work), then do a case study, and finally wrap up with “do you literally mean zero copy?”

 

Read more at: The Data-Centric Revolution: Zero Copy Integration – TDAN.com

Knowledge Graph Modeling: Time series micro-pattern using GIST

Michael Sullivan is a Principal Cloud Solutions Architect at Oracle. Article reprinted with permission (original is here)

For any enterprise, being able to model time series is more than just important, in many cases it is critical. There are many examples but some trivial ones include “Person is employed By Employer” (Employment date-range), “Business has Business Address” (Established Location date-range), “Manager supervises Member Of Staff” (Supervision date-range), and so on. But many developers who dabble in RDF graph modeling end up scratching their heads — how can one pull that off if one can’t add attributes to an edge? While it is true that one can always model things using either reification or leveraging RDF Quads (see my previous blog semantic rdf properties) now might be a good time to take a step back and explore how the semantic gurus at Semantic Arts have neatly solved how to model time series starting with version 11 of GIST, their free upper-level ontology (link below).

First a little history. The core concept of RDF is to “connect” entities via predicates (a.k.a. “triples”) as shown below. Note that either predicate could be inferred from the other, bearing in mind that you need to maintain at least one explicit predicate between the two, as there is no such thing in RDF as a subject without a predicate/object. Querying such data is also super simple.

Typical entity to entity relationships in RDF
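In Turtle, that direct relationship (with one predicate declared as the inverse of the other, so either triple can be inferred) might look like this sketch; the ex: names are illustrative:

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex:  <https://www.example.com/> .

ex:supervises owl:inverseOf ex:supervisedBy .   # keep at least one explicit triple; the other can be inferred

ex:_Mark ex:supervises ex:_Emma .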

So far so good. In fact, this is about as simple as it gets. But what if we wanted to later enrich the above simple semantic relationship with time-series? After all, it is common to want to know WHEN Mark supervised Emma. With out-of-the-box RDF you can’t just hang attributes on the predicates (I’d argue that this simplistic way of thinking is why property graphs tend to be much more comforting to developers). Further, we don’t want to throw out our existing model and go through the onerous task of re-modeling everything in the knowledge graph. Instead, what if we elevated the specific “supervises” relationship between Mark and Emma to become a first-class citizen? What would that look like? I would suggest that a “relation” entity that becomes a placeholder for the “Mark Supervises Emma” relationship would fit the bill. This entity would in turn reference Mark via a “supervision by” predicate while referencing Emma via a “supervision of” predicate.

Ok, now that we have a first-class relation entity, we are ready to add additional time attributes (i.e. triples), right? Well, not so fast! The key insight is that in GIST, the “actual start date” and “actual end date” predicates as used here specify the precision of the data property (rather than letting the data value specify the precision), which in our particular use case we want to be the overall date, not any specific time. Hence our use of gist:actualStartDate and gist:actualEndDate here instead of something more time-precise.

The rest is straightforward as depicted in the micro-pattern diagram shown immediately below. Note that in this case, BOTH the previous “supervised by” and “supervises” predicates connecting Mark to Emma directly can be — and probably should be — inferred! This will allow time-series to evolve and change over time while enabling queryable (inferred) predicates to always be up-to-date and in-sync. It also means that previous queries using the old model will continue to work. A win-win.

Time series micro-pattern using GIST
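Reduced to Turtle, the micro-pattern in the diagram comes out roughly as follows. The relation and role names (and the dates) are illustrative, the gist namespace shown is indicative of whatever your gist release declares, and only gist:actualStartDate and gist:actualEndDate are taken directly from GIST:

@prefix gist: <https://ontologies.semanticarts.com/gist/> .   # indicative namespace
@prefix ex:   <https://www.example.com/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# the "Mark supervises Emma" relationship, elevated to a first-class entity
ex:_MarkSupervisesEmma
    ex:supervisionBy     ex:_Mark ;                   # who is doing the supervising
    ex:supervisionOf     ex:_Emma ;                   # who is being supervised
    gist:actualStartDate "2021-03-01"^^xsd:date ;     # date-level precision, per the discussion above
    gist:actualEndDate   "2023-06-30"^^xsd:date .

# ex:_Mark ex:supervises ex:_Emma can now be inferred from the relation entity.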

A clever ontological detail not shown here: a temporal relation such as “Mark supervises Emma” must be connected via gist:isConnectedTo to a minimum of two objects; this cardinality is defined in the GIST ontology itself and is thus inherited. The result is data integrity managed by the semantic database itself! Additionally, you can see the richness of the GIST “at date time” data properties most clearly in the hierarchical model in the latest v11 ontology (see Protégé screenshot below). This allows the modeler to specify the precision of the start and end date times as well as to distinguish something that is “planned” vs. “actual”. Overall, a very flexible and extensible upper ontology that will meet most enterprises’ requirements.

"at date time" data property hierarchy as defined in GIST v11

Further, this overall micro-pattern, wherein we elevate relationships to first-class status, is infinitely re-purposable in a whole host of other governance and provenance modeling use-cases that enterprises typically require. I urge you to explore and expand upon this simple yet powerful pattern and leverage it for things other than time-series!

One more thing…

Given that with this micro-pattern we’ve essentially elevated relations to be first class citizens — just like in classic Object Role Modeling (ORM) — we might want to consider also updating the namespaces of the subject/predicate/object domains to better reflect the objects and roles. After all, this type of notation is much more familiar to developers. For example, the common notation object.instance is much more intuitive than owner.instance. As such, I propose that the traditional/generic use of “ex:” as used previously should be replaced with self-descriptive prefixes that can represent both the owner as well as the object type. This is good for readability and is self-documenting. And ultimately doing so may help developers become more comfortable with RDF/SPARQL over time. For example:

  • ex:_MarkSupervisesEmma becomes rel:_MarkSupervisesEmma
  • ex:supervisionBy becomes role:supervisionBy
  • ex:_Mark becomes pers:_Mark

Where:

@prefix rel:  <https://www.example.com/relation/> .
@prefix role: <https://www.example.com/role/> .
@prefix pers: <https://www.example.com/person/> .

Links

Alan Morrison: Zero-Copy Integration and Radical Simplification

Dave McComb’s book Software Wasteland underscored a fundamental problem: Enterprise software sometimes costs 1,000 times more than it ought to. The poster child for cost overruns highlighted in the book was Healthcare.gov, a public registration system for the US Affordable Care Act, enacted in 2010. By 2018, the US Federal government had spent $2.1 billion to build and implement the system. Most of that money was wasted. The government ended up adopting many of the design principles embodied in an equivalent system called HealthSherpa, which cost $1 million to build and implement.

In an era where the data-centric architecture Semantic Arts advocates should be the norm, application-centric architecture still predominates. But data-centric architecture doesn’t just reduce the cost of applications. It also attacks the data duplication problem attributable to poor software design. This article explores how expensive data duplication has become, and how data-centric, zero-copy integration can put enterprises on a course to simplification.

Data sprawl and storage volumes

In 2021, Seagate became the first company to ship three zettabytes’ worth of hard disks. It took the company 36 years to ship the first zettabyte, six more years to ship the second, and only one additional year to ship the third.

The company’s first product, the ST-506, was released in 1980. The ST-506 hard disk stored five megabytes when formatted (a megabyte being 1000² bytes). By comparison, the IBM RAMAC 305, introduced in 1956, stored five to ten megabytes. The RAMAC 305 weighed 10 US tons (the equivalent of nine metric tonnes). By contrast, the Seagate ST-506, 24 years later, weighed five US pounds (2.27 kilograms).

A zettabyte is the equivalent of 7.3 trillion MP3 files or 30 billion 4K movies, according to Seagate. When considering zettabytes:

  • 1 zettabyte equals 1,000 exabytes.
  • 1 exabyte equals 1,000 petabytes.
  • 1 petabyte equals 1,000 terabytes.

IDC predicts that the world will generate 178 zettabytes of data by 2025. At that pace, “The Yottabyte Era” would succeed The Zettabyte Era by 2030, if not earlier.

The cost of copying

The question becomes, how much of the data generated will be “disposable” or unnecessary data? In other words, how much data do we actually need to generate, and how much do we really need to store? Aren’t we wasting energy and other resources by storing more than we need to?

Let’s put it this way: If we didn’t have to duplicate any data whatsoever, the world would only have to generate 11 percent of the data it currently does. In 2021 terms, we’d only need to generate 8.7 zettabytes of data, compared with the 78 zettabytes we actually generated worldwide over the course of that year.

Moreover, Statista estimates that the ratio of unique to replicated data stored worldwide will decline to 1:10 from 1:9 by 2024. In other words, the trend is toward more duplication, rather than less.

The cost of storing oodles of data is substantial. Computer hardware guru Nick Evanson, quoted by Gerry McGovern in CMSwire, estimated in 2020 that storing two yottabytes would cost $58 trillion. If the cost per byte stored stayed constant, 40 percent of the world’s economic output would be consumed in 2035 by just storing data.

Clearly, we should be incentivizing what graph platform Cinchy calls “zero-copy integration”: a way of radically reducing unnecessary data duplication. The one thing we don’t have is “zero-cost” storage. More on the solution side and zero-copy integration later; first, let’s finish the cost story.

The cost of training and inferencing large language models

Model development and usage expenses are just as concerning. The cost of training machines to learn with the help of curated datasets is one thing, but the cost of inferencing–the use of the resulting model to make predictions using live data–is another. 

“Machine learning is on track to consume all the energy being supplied, a model that is costly, inefficient, and unsustainable,” Brian Bailey in Semiconductor Engineering pointed out in 2022. AI model training expense has increased with the size of the datasets used, but more importantly, as the number of parameters grows by a factor of four, the amount of energy consumed in training grows by a factor of 18,000. Some AI models included as many as 150 billion parameters in 2022; the LLM behind the more recent ChatGPT was trained with 180 billion parameters. Training can often be a continuous activity to keep models up to date.

But inferencing, the applied side of the model lifecycle, can be enormously costly. Consider the AI functions in self-driving cars, for example. Major carmakers sell millions of cars a year, and each car sold uses the same carmaker’s model in its own unique way. As much as 70 percent of the energy consumed in self-driving car applications could be due to inference, says Godwin Maben, a scientist at electronic design automation (EDA) provider Synopsys.

Data Quality by Design

Transfer learning is the machine learning term for teaching machines to generalize better by reusing what they have already learned; it is a form of knowledge transfer. Semantic knowledge graphs can be a valuable means of knowledge transfer because they describe contexts and causality well with the help of relationships.

Well-described knowledge graphs provide the context in contextual computing. Contextual computing, according to the US Defense Advanced Research Projects Agency (DARPA), is essential to artificial general intelligence.

A substantial percentage of the training data used in large language models is more or less duplicate data, precisely because poorly described context leads to a lack of generalization ability. That is a large part of why the only AI we have is narrow AI, and why large language models are so inefficient.

But what about the storage cost problem associated with data duplication? Knowledge graphs can help with that problem also, by serving as a means for logic sharing. As Dave has pointed out, knowledge graphs facilitate model-driven development when applications are written to use the description or relationship logic the graph describes. Ontologies provide the logical connections that allow reuse and thereby reduce the need for duplication.
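
As a toy illustration of that logic and data sharing, consider a single canonical resource that several applications reference in place rather than copy. The prefixes and the billing/support predicates here are hypothetical; the point is simply that nothing is duplicated:

@prefix ex:   <https://www.example.com/> .
@prefix gist: <https://w3id.org/semanticarts/ns/ontology/gist/> .

# One canonical person record, described once against the shared ontology.
ex:_Customer42 a gist:Person ;
    gist:name "Jane Doe" .

# A billing app and a support app do not make their own copies; each simply
# points at the same IRI, so there is nothing to re-integrate or keep in sync.
ex:_Invoice9  ex:billedTo   ex:_Customer42 .
ex:_Ticket17  ex:reportedBy ex:_Customer42 .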

FAIR data and Zero-Copy Integration

How do you get others who are concerned about data duplication on board with semantics and knowledge graphs? By encouraging data and coding discipline that’s guided by FAIR principles. As Dave pointed out in a December 2022 blog post (https://www.semanticarts.com/the-data-centric-revolution-detour-shortcut-to-fair/), semantic graphs and FAIR principles go hand in hand.

Adhering to the FAIR principles, formulated by a group of scientists in 2016, promotes reusability by “enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals.” When it comes to data, FAIR stands for Findable, Accessible, Interoperable, and Reusable. FAIR data, in other words, is quality data that is easily found, easily shared, and easily reused.

FAIR data implies the data quality needed to do zero-copy integration.

Bottom line: When companies move to contextual computing by using knowledge graphs to create FAIR data and do model-driven development, it’s a win-win. More reusable data and logic means less duplication, less energy, less labor waste, and lower cost. The term “zero-copy integration” underscores those benefits.

Alan Morrison is an independent consultant and freelance writer on data tech and enterprise transformation. He is a contributor to Data Science Central and TechTarget sites with over 35 years of experience as an analyst, researcher, writer, editor and technology trends forecaster, including 20 years in emerging tech R&D at PwC.