From Labels to Verbs – Child’s Play!

Watching a child acquire their first language is nothing like watching an adult learn a new one. The 2-year-old is learning to vocalise while also making sure they attach the correct labels to the things around them: “Mummy”, “Daddy”, “cat”, “car”, “sea” and so on. Often this is done with gestures and pointing. The whole person is involved in enunciating and conveying meaning as part of a dialogue. Others in the child’s environment will be encouraging them, but the child, too, will be observing and understanding at a level well beyond their ability to join in – at least for a month or two.

Over time, labels get replaced by phrases. Verbs and adjectives come into the mix. The degree of sophistication improves both the communication, and the ability to show that something communicated to the child has been understood as intended. 

Are there any lessons from the child acquiring language from which enterprises can learn when it comes to their development in acquiring semantic skills, assets and technologies? I think there are. I see that enterprises follow a somewhat similar path to the child in acquiring  semantic skills – tending to start with simple collections of labels (controlled vocabularies and  simple taxonomies) before venturing into using more complex information structures with verbs  and adjectives. (These come from ontologies that provide more scope for knowledge representation than controlled vocabularies and taxonomies do). 

Historically, this has been the case. It took more than two thousand years to get from Plato’s Socrates “carving nature at its joints”, helping us understand what ‘things’ there are in our world, to William A. Woods asking “What’s in a link?” in 1975.

Semantic Arts has developed a core, upper ontology that carves enterprise information at its seams. This has been described in many ways, most recently in a form like the Mendeleev periodic table of elements. But where are the links? Are we still persuading enterprises to adopt semantics the way a child acquires language, rather than the way someone who already has language skills learns a new one? Do people in business and industry see their information assets as telling a story? Do they understand that information ‘bricks’ can be organised to build a variety of information stories, in the same way that a set of Lego/Duplo bricks can be organised into the shape of a house, or a boat, or a space rocket? It is the links, the predicates in the RDF model, that bring instances of classes together into a phrasal structure as simple as a 3-year-old’s language constructs.

Let’s use the remainder of this post just to examine the vocabulary of Semantic Arts’ ‘gist’  ontology from the perspective of the links – those predicates on which predicate logic is based. 

‘gist’ version 13 has 63 object properties. These are the property type that relates one ‘thing’ to another ‘thing’. There are also 50 data properties, the type that relates a ‘thing’ to a ‘string’ of some sort. We know that the ‘string’ could be assigned to a ‘thing’ type by using xsd:anyURI, but let’s leave that for now. 
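
If you would like to check these counts against your own copy of gist, here is a minimal sketch using Python and rdflib (the file name gistCore.ttl is illustrative; point it at whichever release file you have):

    from rdflib import Graph
    from rdflib.namespace import OWL, RDF

    g = Graph()
    g.parse("gistCore.ttl")  # local copy of the gist v13 release; path assumed

    object_props = set(g.subjects(RDF.type, OWL.ObjectProperty))
    data_props = set(g.subjects(RDF.type, OWL.DatatypeProperty))
    print(len(object_props), "object properties;", len(data_props), "data properties")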

There are 3 object properties where only the domain (the origin of the relationship ‘arrow’) is  specified in the ontology: 

gist:owns
gist:providesOrderFor
gist:isAbout

and 14 where the range alone (the class at the pointy end of the relationship ‘arrow’) is specified: 

gist:hasAccuracy 
gist:hasParty 
gist:hasPhysicalLocation 
gist:comesFromAgent 
gist:hasAspect 
gist:isIdentifiedBy 
gist:isAllocatedBy
gist:isMadeUpOf 
gist:comesFromPlace 
gist:goesToPlace 
gist:hasMagnitude 
gist:isRecognizedBy 
gist:hasAddress 
gist:goesToAgent

There are only 6 object properties where the ontology ‘constrains’ us to a specific set of both  domain and range classes: 

gist:isGeoContainedIn 
gist:hasUnitOfMeasure 
gist:prevents
gist:hasUnitGroup 
gist:hasBiologicalParent 
gist:isFirstMemberOf

This leaves 40 object properties that are a little more flexible in their intended use. 

gist:isExpressedIn 
gist:isCategorizedBy 
gist:hasDirectBroader
gist:hasBroader 
gist:isRenderedOn 
gist:isGovernedBy 
gist:hasParticipant 
gist:hasGiver 
gist:precedesDirectly 
gist:requires 
gist:isPartOf 
gist:precedes 
gist:allows 
gist:isDirectPartOf
gist:contributesTo 
gist:hasMultiplier 
gist:isMemberOf 
gist:hasGoal 
gist:accepts 
gist:hasUniqueNavigationalParent 
gist:hasNavigationalParent 
gist:isTriggeredBy 
gist:links 
gist:isBasedOn 
gist:isConnectedTo 
gist:hasRecipient 
gist:occursIn 
gist:isAffectedBy
gist:refersTo 
gist:hasSubtrahend 
gist:hasDivisor 
gist:conformsTo 
gist:hasAddend 
gist:hasUniqueBroader
gist:produces 
gist:isUnderJurisdictionOf 
gist:linksFrom 
gist:offers 
gist:hasIncumbent 
gist:linksTo

One of our observations is that the more one focuses on classes, the tighter one sticks to domains within the enterprise. These then follow the verticals of the periodic table. We are reinforcing the siloed view, a little: we keep interoperability, but we look at the enterprise from the perspective of business areas. If we move to modelling with relationships, however, we can think more openly about how similar patterns occur between the ‘things’ of the enterprise across different areas. It leads us to a much more abstract way of thinking about the enterprise, because these relationships crop up all over the place. (This is one reason the ‘gist’ ontology does not specify domains or ranges for the majority of its properties. It increases flexibility and opens up more possibilities for patterns.) 

For a bit of semantic fun, look in the real world for opportunities to use the gist properties as verbs. Go into work and think about how many types of ‘thing’ you can see in the enterprise where ‘thing  A’ gist:produces ‘thing B’. See where you can find ‘thing X’ gist:hasGoal ‘thing Y’. Unlike the child we started this article with, you as the reader already know language. So you can use more than  just labels and start making statements of a phrasal nature that use these ‘gist’ relationships. 

gist:produces
    a owl:ObjectProperty ;
    rdfs:isDefinedBy <https://w3id.org/semanticarts/ontology/gistCore> ;
    skos:definition "The subject creates the object."^^xsd:string ;
    skos:example "A task produces a deliverable."^^xsd:string ;
    skos:prefLabel "produces"^^xsd:string .

gist:hasGoal
    a owl:ObjectProperty ;
    rdfs:isDefinedBy <https://w3id.org/semanticarts/ontology/gistCore> ;
    skos:definition "The reason for doing something"^^xsd:string ;
    skos:prefLabel "has goal"^^xsd:string .
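
Here, as a minimal sketch in Python with rdflib, is the kind of phrasal statement the exercise above invites. The gist namespace IRI and the ex: individuals are illustrative assumptions, not part of the ontology:

    from rdflib import Graph, Namespace

    GIST = Namespace("https://w3id.org/semanticarts/ns/gist/")  # assumed prefix IRI; use the one from your gist release
    EX = Namespace("https://example.com/enterprise/")

    g = Graph()
    g.bind("gist", GIST)
    g.bind("ex", EX)

    # "Thing A produces thing B": a production run produces a batch of widgets.
    g.add((EX.ProductionRun_42, GIST.produces, EX.WidgetBatch_2024_07))

    # "Thing X has goal thing Y": a marketing campaign has a revenue target as its goal.
    g.add((EX.SpringCampaign, GIST.hasGoal, EX.Q2RevenueTarget))

    print(g.serialize(format="turtle"))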

The Case for Enterprise Ontology

I was asked by one of our senior staff why someone might want an enterprise ontology.  From my perspective, there are three main categories of value for integrating all your  enterprise’s data into a single core: 

  • Economy 
  • Cross Domain Use Cases 
  • Serendipity 

Economy 

For many of our clients there is an opportunity that stems from simple rationalization and elimination of duplication. Every replicated data set incurs costs: costs to create and maintain the processes that generate it, but far bigger costs for data reconciliation. Inevitably, each extract and population creates variation. These variations add up, triggering additional research to find out why there are slight differences between these datasets.  

Even with ontology-based systems, these differences creep in. We know that many of our clients’ ontology-based domains contain an inventory (or a sub-inventory). Employees are a good example. These sub-directories show up all over the place. There is a very good chance each domain has its own feed from HR. They may be fed from the same system, but as is often the case, each was pointed at a warehouse or a different system as its source. And even if they came from the same source, the pipeline, IRI assignment and transformation are all likely different.  

Here’s an illustration from a large bank associated with records retention within their  legal department. One part of this project involved getting a full directory of all the  employees into the graph. Later on we were working with another group on the technical infrastructure, and they wanted to get their own feed from HR to convert into triples. Fortunately we were able to divert them by pointing out that there was already a feed that provided curated employee triples.  

They accepted our justification but asked, “Can we have a copy of those triples to conform to our needs?” This gave us the opportunity to explain that there is no conforming. Each triple is an individually asserted fact with its own provenance. You either accept it or ignore it. There really isn’t anything to conform, and no need to restructure. 
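
One way to picture “accept it or ignore it” is to keep each source’s assertions in their own named graph, so the provenance travels with the triples and nothing needs to be restructured for a new consumer. A minimal sketch with Python and rdflib, with all IRIs invented for illustration:

    from rdflib import Dataset, Namespace, Literal

    EX = Namespace("https://example.com/bank/")

    ds = Dataset()
    hr_feed = ds.graph(EX.hrFeed2024)            # named graph identifies the source
    hr_feed.add((EX.emp_1001, EX.hasJobTitle, Literal("Records Analyst")))
    hr_feed.add((EX.emp_1001, EX.memberOf, EX.LegalDepartment))

    # A consumer simply reads the graphs it accepts; nothing is copied or "conformed".
    for s, p, o in ds.graph(EX.hrFeed2024):
        print(s, p, o)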

At first glance all their subdomains seemed to stand alone, but the truth was that there was a surprising amount of overlap between them. There were many similar but not identical definitions of “business units.” There were several incompatible ways to describe geographic aggregation. Many different divisions dealt with the same counterparties or the same products. And it is only when the domains are unified that most of these differences come to light.  

Just unifying and integrating duplicate data sets provided economic justification for the project. We know of another company that justified their whole graph undertaking  simply from the rationalization and reduction of subscriptions to the same or similar  datasets from different parts of the business.  

The good news is that harmonizing ontologically based systems is an order of magnitude cheaper than traditional systems.  

Cross Domain Use Cases 

Reuse of concepts is one of the most compelling reasons for an enterprise ontology.  Some of the obvious cross-domain use cases from some of our pharmaceutical clients  include:  

  • Translation of manufacturing process from bench to trial to full scale 
  • Integration of Real-World Evidence and Adverse events 
  • Collapsing submission time for regulatory reporting 
  • Clinical trial recruiting  
  • Cross channel customer integration 

Some of the best opportunities come from combining previously separate sub-domains. Sometimes you can know this going into a project. But sometimes you don’t discover the opportunity until you are well into the project. Those are the ones that fall into the serendipity category.  

Serendipity 

I’ve recently come to the realization that the most important use cases for unification  might in fact be serendipity. That is, the power might be in unanticipated use cases.  I’ll give some examples and then we’ll point you to a video from one of Amazon’s lead  ontologists who came to the same conclusion.  

Schneider-Electric 

We did a project for Schneider-Electric (see case study). We constructed the scaffolding of their enterprise ontology and then drilled in on their product catalog and  offering. Our initial goal was to get their 1 million parts into a knowledge graph and  demonstrate that it was as complete and as detailed as their incumbent system. At the end of the project we had all their products in a knowledge graph, with all their physical, electrical, thermal and many other characteristics defined and classified.  

Serendipity 1: Inherent Product Compatibility 

We interviewed product designers to capture the nature of product compatibility. With our greatly simplified ontology, it was easy to write a different type of rule (using SPARQL) that persisted the “inherent” compatibility of parts into the catalog. Doing this reversed the sequence of events. Previously, because the compatibility process was difficult and time-consuming, they would wait until they were ready to sell a line of products in a new market before beginning the compatibility studies. Not knowing the compatibility added months to their time-to-market. In the new approach, the graph knew which products were compatible before the decision to offer them in new markets.  
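
The rule itself was, of course, specific to Schneider’s catalog. Purely as an illustration of the pattern, here is a sketch of a SPARQL INSERT run from Python with rdflib; every class and property name (ex:CircuitBreaker, ex:ratedVoltage, ex:compatibleWith and so on) is hypothetical:

    from rdflib import Graph

    g = Graph()
    g.parse("catalog.ttl")  # illustrative file name

    # Materialize ("persist") inherent compatibility so it can be queried later
    # without re-running the logic each time.
    g.update("""
    PREFIX ex: <https://example.com/catalog/>
    INSERT { ?breaker ex:compatibleWith ?contactor }
    WHERE {
      ?breaker   a ex:CircuitBreaker ; ex:ratedVoltage ?v ; ex:mountingRail ?rail .
      ?contactor a ex:Contactor      ; ex:ratedVoltage ?v ; ex:mountingRail ?rail .
    }
    """)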

Serendipity 2: Standards Alignment 

Schneider were interested in aligning their product offerings with the standard called  eCl@ss which has over 15,000 classes and thousands of attributes. It is a complex mapping process, which had been attempted before but abandoned. By starting with the extreme simplification of the ontology (46 classes and 36 properties out of the several hundred in the enterprise ontology), working toward the standard was far easier and we had an initial map completed in about two months.  

Serendipity 3: Integrating Acquisitions 

Schneider had acquired another electrical part manufacturer, Clipsal. They asked if we could integrate the Clipsal catalogue with the new graph catalogue. Clipsal also had a complex product catalogue. It was not as complex as Schneider’s, but it was complex and structured quite differently.  

Rather than reverse engineering the Clipsal catalogue, we just asked their data engineers to point us to where the 46 classes and 36 properties were in the catalogue. Once we’d extracted all that, we asked if we were missing anything. It turned out there were a few items, which we added to the model.  

The whole exercise took about six weeks. At the end of the project we were reviewing the Schneider-Electric page on Wikipedia and found that they had acquired Clipsal over ten years prior. When we asked why they hadn’t integrated the catalogue in all that time, they responded that it was “too hard.”

All three of these use cases are of interest, because they weren’t the use cases we were hired to solve but only manifested when the data was integrated into a simple model.  

—————————– 

Amazon Story of Serendipity 

This video of Ora Lassila is excellent and inspiring. 

https://videolectures.net/videos/iswc2024_lassila_web_and_ai

If you don’t have time to watch the whole thing, skip to minute 14:40, where he describes the “inventory graph” for tracking packages in the Amazon ecosystem. They have 1 trillion triples in the graph, and query response is far better than it was in their previous systems. At minute 23:20 he makes the case for serendipity.

How a “User” Knowledge Graph Can Help Change Data Culture

Identity and Access Management (IAM) has had the same problem since  Fernando Corbató of MIT first dreamed up the idea of digital passwords in  1960: opacity. Identity in the physical world is rich and well-articulated, with a wealth of different ways to verify information on individual humans and devices. By contrast, the digital realm has been identity data impoverished, cryptic and inflexible for over 60  years now. 

Jans Aasman, CEO of Franz, provider of the entity-event knowledge graph solution AllegroGraph, envisions a “user” knowledge graph as a flexible and more manageable data-centric solution to the IAM challenge. He presented on the topic at this past summer’s Data-Centric Architecture Forum, which Semantic Arts hosted near its headquarters in Fort Collins, Colorado. 

Consider the specificity of a semantic graph and how it could facilitate secure access control. Knowledge graphs constructed of subject-predicate-object triples make it possible to set rules and filters in an articulated and yet straightforward manner. Information about individuals that’s been collected for other HR purposes  could enable this more precise filtering. 

For example, Jans could disallow others’ access to a triple that connects “Jans”  and “salary”. Or he could disallow access to certain predicates. 
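
As a minimal sketch of that idea (not AllegroGraph’s actual security machinery; the namespace and the deny list are invented for illustration), predicate-level filtering can be expressed in a few lines with Python and rdflib:

    from rdflib import Graph, Namespace

    EX = Namespace("https://example.com/hr/")
    DENIED_PREDICATES = {EX.salary, EX.homeAddress}   # attributes to withhold

    def visible_triples(g: Graph, requester: str):
        """Yield only the triples the requester is allowed to see."""
        for s, p, o in g:
            if p in DENIED_PREDICATES and requester != "hr-admin":
                continue  # withhold triples that use sensitive predicates
            yield s, p, o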

Identity and access management vendors call this method Attribute-Based  Access Control (ABAC). Attributes include many different characteristics of users and  what they interact with, which is inherently more flexible than role-based access control  (RBAC). 

Cell-level control is also possible, but as Forrest Hare of Summit Knowledge Solutions points out, such security doesn’t make a lot of sense, given how much meaning is absent from cells controlled in isolation. “What’s the classification of the number 7?” he asked. Without more context, it seems silly to control cells that are just storing numbers or individual letters, for example. 

Simplifying identity management with a knowledge graph approach  

Graph databases can simplify various aspects of the process of identity  management. Let’s take Lightweight Directory Access Protocol, or LDAP, for example. 

This vendor-agnostic protocol has been around for 30 years, but it’s still popular  with enterprises. It’s a pre-web, post-internet hierarchical directory service and authentication protocol. 

“Think of LDAP as a gigantic, virtual telephone book,” suggests access control management vendor Foxpass. Foxpass offers a dashboard-based LDAP management product which it claims is much easier to manage than OpenLDAP. 

Companies that don’t use LDAP often use Microsoft’s Active Directory instead, a broader, database-oriented identity and access management product that covers more of the same bases. Microsoft bundles AD with its Server and Exchange products, a means of lock-in that has been quite effective. Lock-in, obviously, inhibits innovation in general. 

Consider the whole of identity management as it exists today and how limiting it has been. How could enterprises embark on the journey of using a graph database-oriented approach as an alternative to application-centric IAM software? The first step  involves the creation of a “user” knowledge graph. 

Access control data duplication and fragmentation  

Semantic Arts CEO Dave McComb, in his book Software Wasteland, estimated that 90 percent of data is duplicated. Application-centric architectures in use since the days of mainframes have led to user data sprawl. Part of the reason there is so much duplication of user data is that authentication, authorization, and access control (AAA) methods require that more bits of personally identifiable information (PII) be shared with central repositories for AAA purposes. 

B2C companies are particularly prone to hoovering up these additional bits of  PII lately and storing that sensitive info in centralized repositories. Those repositories become one-stop shops for identity thieves. Customers who want to pay online have to  enter bank routing numbers and personal account numbers. As a result, there’s even more duplicate PII sprawl.

One of the reasons a “user” knowledge graph (and a knowledge graph enterprise foundation) could be innovative is that enterprises who adopt such an approach can move closer to zero-copy integration architectures. Model-driven development of the type that knowledge graphs enable assumes and encourages shared data and logic. 

A “user” graph coupled with project management data could reuse the same  enabling entities and relationships repeatedly for different purposes. The model-driven development approach thus incentivizes organic data management. 

The challenge of harnessing relationship-rich data  

Jans points out that enterprises, for example, run massive email systems that could be tapped to analyze project data for optimization purposes. And  disambiguation by unique email address across the enterprise can be a starting point  for all sorts of useful applications. 

Most enterprises don’t apply unique email address disambiguation, but Franz has a pharma company client that does, an exception that proves the rule. Email continues to be an untapped resource in many organizations precisely because it’s a treasure trove of relationship data. 

Problematic data farming realities: A social media example  

Relationship data involving humans is sensitive by definition, but the reuse potential of sensitive data is too important to ignore. Organizations do need to interact with individuals online, and vice versa. 

Former US Federal Bureau of Investigation (FBI) counterintelligence agent Peter  Strzok quoted from Deadline: White House, an MSNBC program in the US aired on  August 16: 

“I’ve served I don’t know how many search warrants on Twitter (now known as X) over the years in investigations. We need to put our investigator’s hat on and talk about tradecraft a little bit. Twitter gathers a lot of information. They don’t just have your tweets. They have your draft tweets. In some cases, they have deleted tweets. They have DMs that people have sent you, which are not encrypted. They have your draft DMs, the IP address from which you logged on to the account at the time, sometimes the location at which you accessed the account and other applications that are associated with your Twitter account, amongst other data.”

X and most other social media platforms, not to mention law enforcement  agencies such as the FBI, obviously care a whole lot about data. Collecting, saving, and  allowing access to data from hundreds of millions of users in such a broad,  comprehensive fashion is essential for X. At least from a data utilization perspective,  what they’ve done makes sense. 

Contrast these social media platforms with the way enterprises collect and  handle their own data. That collection and management effort is function- rather than human-centric. With social media, the human is the product. 

So why is a social media platform’s culture different? Because with public social media, broad, relationship-rich data sharing had to come first. Users learned first-hand  what the privacy tradeoffs were, and that kind of sharing capability was designed into  the architecture. The ability to share and reuse social media data for many purposes  implies the need to manage the data and its accessibility in an elaborate way. Email, by contrast, is a much older technology that was not originally intended for multi-purpose reuse. 

Why can organizations like the FBI successfully serve search warrants on data from data farming companies? Because social media started with a broad data sharing assumption and forced a change in the data sharing culture. Then came adoption.  Then law enforcement stepped in and argued effectively for its own access. 

Broadly reused and shared, web data about users is clearly more useful than siloed data. Shared data is why X can have the advertising-driven business model it does. One-way social media contracts with users require agreement with provider terms. The users have one choice: Use the platform, or don’t. 

The key enterprise opportunity: A zero-copy user PII graph that respects users  

It’s clear that enterprises should do more to tap the value of the kinds of user data that email, for example, generates. One way to sidestep the sensitivity issues associated with reusing that sort of data would be to treat the most sensitive user data separately. 

Self-sovereign identity (SSI) advocate Phil Windley has pointed out that agent-managed, hashed messaging and decentralized identifiers could make it unnecessary to duplicate identifiers that correlate. If a bartender just needs to confirm that a patron  at the bar is old enough to drink, the bartender could just ping the DMV to confirm the  fact. The DMV could then ping the user’s phone to verify the patron’s claimed adult status.
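
A minimal sketch of that information flow, in Python: a real system would use public-key signatures and DID documents rather than the shared demo key below, but the point is that the verifier receives only a yes/no answer, never a birthdate or a correlatable identifier:

    from dataclasses import dataclass
    import hmac, hashlib

    ISSUER_SECRET = b"dmv-demo-key"     # stands in for the DMV's signing key

    @dataclass
    class AgeClaim:
        subject_did: str                # opaque decentralized identifier
        over_21: bool
        signature: bytes

    def issue(subject_did: str, over_21: bool) -> AgeClaim:
        msg = f"{subject_did}:{over_21}".encode()
        return AgeClaim(subject_did, over_21, hmac.new(ISSUER_SECRET, msg, hashlib.sha256).digest())

    def bartender_check(claim: AgeClaim) -> bool:
        msg = f"{claim.subject_did}:{claim.over_21}".encode()
        valid = hmac.compare_digest(claim.signature, hmac.new(ISSUER_SECRET, msg, hashlib.sha256).digest())
        return valid and claim.over_21  # the verifier learns only yes/no

    print(bartender_check(issue("did:example:123", over_21=True)))   # True, with no PII disclosed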

Given such a scheme, each user could manage and control their access to their  own most sensitive PII. In this scenario, the PII could stay in place, stored, and encrypted on a user’s phone. 

Knowledge graphs lend themselves to this less centralized, yet more fine-grained and transparent, approach to data management. By supporting self-sovereign identity and a data-centric architecture, a Chief Data Officer could help the Chief Risk Officer mitigate the enterprise risk associated with the duplication of personally identifiable information, a true win-win.

Zero Copy Integration and Radical Simplification

Dave McComb’s book Software Wasteland underscored a fundamental problem: enterprise software sometimes costs 1,000 times more than it ought to. The poster child for cost overruns highlighted in the book was Healthcare.gov, the public registration system for the US Affordable Care Act, enacted in 2010. By 2018, the US Federal government had spent $2.1 billion to build and implement the system. Most of that money was wasted. The government ended up adopting many of the design principles embodied in an equivalent system called HealthSherpa, which cost $1 million to build and implement. 

In an era where the data-centric architecture Semantic Arts advocates should be the  norm, application-centric architecture still predominates. But data-centric architecture doesn’t just reduce the cost of applications. It also attacks the data duplication problem attributable to  poor software design. This article explores how expensive data duplication has become, and  how data-centric, zero-copy integration can put enterprises on a course to simplification. 

Data sprawl and storage volumes  

In 2021, Seagate became the first company to ship three zettabytes’ worth of hard disks. It took them 36 years to ship the first zettabyte, six years to ship the second, and only one additional year to ship the third. 

The company’s first product, the ST-506, was released in 1980. The ST-506 hard disk, when formatted, stored five megabytes (where a megabyte is 1,000² bytes). By comparison, an IBM RAMAC 305, introduced in 1956, stored five to ten megabytes. The RAMAC 305 weighed 10 US tons (the equivalent of nine metric tonnes). By contrast, the Seagate ST-506, 24 years later, weighed five US pounds (or 2.27 kilograms). 

A zettabyte is the equivalent of 7.3 trillion MP3 files or 30 billion 4K movies, according to  Seagate. When considering zettabytes: 

  • 1 zettabyte equals 1,000 exabytes. 
  • 1 exabyte equals 1,000 petabytes. 
  • 1 petabyte equals 1,000 terabytes. 

IDC predicts that the world will generate 178 zettabytes of data by 2025. At that pace, “The  Yottabyte Era” would succeed The Zettabyte Era by 2030, if not earlier. 

The cost of copying  

The question becomes, how much of the data generated will be “disposable” or unnecessary data? In other words, how much data do we actually need to generate, and how much do we really need to store? Aren’t we wasting energy and other resources by storing more than we need to? 

Let’s put it this way: If we didn’t have to duplicate any data whatsoever, the world would only have to generate 11 percent of the data it currently does. In 2021 terms, we’d only need  to generate 8.7 zettabytes of data, compared with the 78 zettabytes we actually generated worldwide over the course of that year. 

Moreover, Statista estimates that the ratio of unique to replicated data stored worldwide will decline to 1:10 from 1:9 by 2024. In other words, the trend is  toward more duplication, rather than less. 

The cost of storing oodles of data is substantial. Computer hardware guru Nick  Evanson, quoted by Gerry McGovern in CMSwire, estimated in 2020 that storing two  yottabytes would cost $58 trillion. If the cost per byte stored stayed constant, 40 percent of the world’s economic output would be consumed in 2035 by just storing data. 
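
A quick back-of-the-envelope check of those figures (assuming decimal units, where one yottabyte is 10^24 bytes):

    # Implied unit cost from the $58 trillion / 2 yottabyte estimate quoted above.
    cost_for_two_yb = 58e12            # dollars
    bytes_in_two_yb = 2 * 10**24
    cost_per_tb = cost_for_two_yb / bytes_in_two_yb * 10**12
    print(f"Implied storage cost: ~${cost_per_tb:.0f} per terabyte")   # about $29/TB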

Clearly, we should be incentivizing what graph platform Cinchy calls “zero-copy  integration”–a way of radically reducing unnecessary data duplication. The one thing we don’t  have is “zero-cost” storage. But first, let’s finish the cost story. More on the solution side and zero-copy integration later. 

The cost of training and inferencing large language models  

Model development and usage expenses are just as concerning. The cost of training  machines to learn with the help of curated datasets is one thing, but the cost of inferencing–the  use of the resulting model to make predictions using live data–is another. 

“Machine learning is on track to consume all the energy being supplied, a model that is costly, inefficient, and unsustainable,” Brian Bailey pointed out in Semiconductor Engineering in 2022. AI model training expense has increased with the size of the datasets used but, more importantly, as the number of parameters increases by four, the energy consumed in the process increases by 18,000 times. Some AI models included as many as 150 billion parameters in 2022. The more recent ChatGPT LLM training involved 180 billion parameters. Training can often be a continuous activity to keep models up to date. 

But the applied model aspect of inferencing can be enormously costly. Consider the AI  functions in self-driving cars, for example. Major car makers sell millions of cars a year, and each  one they sell is utilizing the same carmaker’s model in a unique way. 70 percent of the energy  consumed in self-driving car applications could be due to inference, says Godwin Maben, a  scientist at electronic design automation (EDA) provider Synopsys. 

Data Quality by Design  

Transfer learning is a machine learning term that refers to how machines can be taught  to generalize better. It’s a form of knowledge transfer. Semantic knowledge graphs can be a  valuable means of knowledge transfer because they describe contexts and causality well with  the help of relationships.

Well-described knowledge graphs provide the context in contextual computing.  Contextual computing, according to the US Defense Advanced Research Projects Agency  (DARPA), is essential to artificial general intelligence. 

A substantial percentage of training set data used in large language models is more or less duplicate data, precisely because of poorly described context that leads to a lack of generalization ability. Thus the reason why the only AI we have is narrow AI. And thus the reason large language models are so inefficient. 

But what about the storage cost problem associated with data duplication? Knowledge graphs can help with that problem also, by serving as a means for logic sharing. As Dave has  pointed out, knowledge graphs facilitate model-driven development when applications are  written to use the description or relationship logic the graph describes. Ontologies provide the logical connections that allow reuse and thereby reduce the need for duplication. 

FAIR data and Zero-Copy Integration  

How do you get others who are concerned about data duplication on board with semantics and knowledge graphs? By encouraging data and coding discipline that’s guided by FAIR principles. As Dave pointed out in a December 2022 blog post, semantic graphs and FAIR principles go hand in hand: https://www.semanticarts.com/the-data-centric-revolution-detour-shortcut-to-fair/ 

Adhering to the FAIR principles, formulated by a group of scientists in 2016, promotes  reusability by “enhancing the ability of machines to automatically find and use the data, in  addition to supporting its reuse by individuals.” When it comes to data, FAIR stands for Findable, Accessible, Interoperable, and Reusable. FAIR data is easily found, easily shared,  easily reused quality data, in other words. 

FAIR data implies the data quality needed to do zero-copy integration. 

Bottom line: When companies move to contextual computing by using knowledge  graphs to create FAIR data and do model-driven development, it’s a win-win. More reusable  data and logic means less duplication, less energy, less labor waste, and lower cost. The term  “zero-copy integration” underscores those benefits.

A Knowledge Model for Explainable Military AI

Forrest Hare, Founder of Summit Knowledge Solutions, is a retired US Air Force targeting and information operations officer who now works with the Defense Intelligence Agency (DIA). His  experience includes integrating intelligence from different types of communications, signals,  imagery, open source, telemetry, and other sources into a cohesive and actionable whole. 

Hare became aware of semantic technology while at SAIC and is currently focused on building a space + time ontology called the DIA Knowledge Model, so that Defense Department intelligence can use it to contextualize these multi-source inputs. 

The question becomes, how do you bring objects that don’t move and objects that do move into the same information frame with a unified context? The information is currently organized by collectors and producers. 

The object-based intelligence that does exist involves things that don’t move at all. Facilities,  for example, or humans using phones that are present on a communications network are more or less static. But what about the things in between such as trucks that are only intermittently present? 

Only sparse information is available about these. How do you know the truck that was there  yesterday in an image is the same truck that is there today? Not to mention the potential hostile forces who own the truck that have a strong incentive to hide it. 

Objects in object-based intelligence not only include these kinds of assets, but also events and locations that you want to collect information about. In an entity-relationship sense, objects are entities. 

Hare’s DIA Knowledge Model uses the ISO-standard Basic Formal Ontology (BFO) to unify domains so that the information from different sources is logically connected and therefore makes sense as part of a larger whole. BFO’s maintainers (Director Barry Smith and his team at the National Center for Ontological Research (NCOR) at the University at Buffalo) keep the ontology strictly limited to 30 or so classes. 

The spatial-temporal regions of the Knowledge Model are what’s essential to do the kinds of  dynamic, unfolding object tracking that’s been missing from object-based intelligence. Hare gave the example of a “site” (an immaterial entity) from a BFO perspective. A strict geolocational definition of “site” makes it possible for both humans and machines to make sense of the data about sites. Otherwise, Hare says, “The computer has no idea how to  understand what’s in our databases, and that’s why it’s a dumpster fire.”

This kind of mutual human and machine understanding is a major rationale behind explainable  AI. A commander briefed by an intelligence team must know why the team came to the  conclusions it did. The stakes are obviously high. “From a national security perspective, it’s extremely important for AI to be explainable,” Hare reminded the audience. Black boxes such as ChatGPT as currently designed can’t effectively answer the commander’s question on how the intel team arrived at the conclusions it did. 

Finally, the explainability of knowledge models like the DIA’s becomes even more critical as information flows into the Joint Intelligence Operations Center (JIOC). Furthermore, the various branches of the US Armed Forces must supply and continually update a Common Intelligence Picture that’s actionable by the US President, who is the Commander in Chief of the military as a whole. 

Without this conceptual and spatial-temporal alignment across all service branches, joint operations can’t proceed as efficiently and effectively as they should. Certainly, the risk of failure looms much larger as a result.

How US Homeland Security Plans to Use Knowledge Graph

During this summer’s Data-Centric Architecture Forum, Ryan Riccucci, Division Chief for U.S. Border Patrol – Tucson (AZ) Sector, and his colleague Eugene Yockey gave a glimpse of the data environment within the US Department of Homeland Security (DHS), as well as how the effort to transform that environment has been evolving. 

The DHS celebrated its 20-year anniversary recently. The Federal department’s data challenges are substantial, considering the need to collect, store, retrieve and manage information associated with 500,000 daily border crossings, 160,000 vehicles, and $8 billion in imported goods processed daily by 65,000 personnel. 

Riccucci is leading an ontology development effort within the Customs and Border  Patrol (CBP) agency and the Department of Homeland Security more generally to support  scalable, enterprise-wide data integration and knowledge sharing. It’s significant to note that a  Division Chief has tackled the organization’s data integration challenge. Riccucci doesn’t let leading-edge, transformational technology and fundamental data architecture change intimidate him. 

Riccucci described a typical use case for the transformed, integrated data sharing  environment that DHS and its predecessor organizations have envisioned for decades. 

The CBP has various sensor nets that monitor air traffic close to or crossing the borders  between Mexico and the US, and Canada and the US. One such challenge on the Mexican border is Fentanyl smuggling into the US via drones. Fentanyl can be 50 times as powerful as morphine. Fentanyl overdoses caused 110,000 deaths in the US in 2022. 

On the border with Canada, a major concern is gun smuggling via drone from the US to Canada. Though legal in the US, Glock pistols, for instance, are illegal and in high demand in Canada. 

The challenge in either case is to intercept the smugglers retrieving the drug or weapon drops while they are in the act. Drones may only be active for seven to 15 minutes at a time, so  the opportunity window to detect and respond effectively is a narrow one. 

Field agents ideally need to see enough real-time, mapped airspace information when a sensor is activated to move quickly and directly to the location. Specifics are important; verbally relayed information, by contrast, can often be less specific, causing confusion or misunderstanding.

The CBP’s successful proof of concept involved basic Resource Description Framework (RDF) triples, demonstrating semantic capabilities with just this kind of information: 

Sensor → Act of sensing → drone (SUAS, SUAV, vehicle, etc.) 
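
A minimal sketch of that pattern in Python with rdflib; the IRIs below are invented for illustration and are not CBP’s actual model:

    from rdflib import Graph, Namespace, Literal
    from rdflib.namespace import XSD

    EX = Namespace("https://example.com/cbp/")
    g = Graph()

    # A sensor, an act of sensing, and the object sensed, with the time and
    # location needed to qualify a drone interdiction.
    g.add((EX.sensing_0423, EX.performedBy, EX.towerSensor_17))
    g.add((EX.sensing_0423, EX.detected, EX.suas_track_88))        # small unmanned aircraft system
    g.add((EX.sensing_0423, EX.atTime, Literal("2023-06-14T02:17:00Z", datatype=XSD.dateTime)))
    g.add((EX.sensing_0423, EX.nearLocation, EX.sector_tucson_grid_B7))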

In a recent test scenario, CBP collected 17,000 records that met specified time/space requirements for a qualified drone interdiction over a 30-day period. 

The overall impression that Riccucci and Yockey conveyed was that DHS has both the budget and the commitment to tackle this and many other use cases using a transformed, data-centric architecture. By capturing information in an interoperable format, the DHS has been apprehending the bad guys with greater frequency and precision.

SIX AXES OF DECOUPLING

Loose coupling has been a Holy Grail for systems developers for generations.

The virtues of loose coupling have been widely lauded, yet there has been little description about what is needed to achieve loose coupling.  In this paper we describe our observations from projects we’ve been involved with. 

Coupling  

Two systems or two parts of a single system are considered coupled if a change to one of the systems unnecessarily affects the other system. So for instance, if we upgrade the version of our database and it requires  that we upgrade the operating system for every client attached to that database, then we would say those two systems or those two parts of  the system are tightly coupled. 

Coupling is widely understood to be undesirable because of the spread  of the side effects. As systems get larger and more complex, anything that causes a change in one part to affect a larger and larger footprint in the entire system is going to be expensive and destabilizing. 

Loose Coupling/Decoupling  

So, the converse of this is to design systems that are either “loosely  coupled” or “decoupled.” Loosely coupled systems do not arise by accident. They are intentionally designed such that change can be introduced around predefined flex points. 

For instance, one common strategy is to define an application programming interface (API) which external users of a module or class can use. This simple technique allows the interior of the class or module or method to change without necessarily exporting a change in behavior to the users. 
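
A minimal sketch of that idea in Python (the names are illustrative): callers depend only on the interface, so the implementation behind it can change without exporting a change in behavior:

    from abc import ABC, abstractmethod

    class FreightRater(ABC):                      # the intermediate form: a stable interface
        @abstractmethod
        def rate(self, weight_kg: float, destination: str) -> float: ...

    class DomesticRater(FreightRater):            # one implementation behind the interface
        def rate(self, weight_kg: float, destination: str) -> float:
            return 5.00 + 0.80 * weight_kg

    def quote_shipping(rater: FreightRater, weight_kg: float, destination: str) -> float:
        # The caller is coupled only to FreightRater, not to any concrete rater.
        return rater.rate(weight_kg, destination)

    print(quote_shipping(DomesticRater(), 12.0, "Denver"))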

The Role of the Intermediate  

In virtually every system that we’ve investigated that has achieved any degree of decoupling, we’ve found an “intermediate form.” It is this intermediate form that allows the two systems or subsystems not to be directly connected to each other. 

As shown in Figure (1), they are connected through an intermediary. In the example described above with an API, the signature of the interface is the intermediate. 

What Makes a Good Intermediary?  

An intermediary needs several characteristics to be useful: 

It doesn’t change as rapidly as its clients. Introducing an intermediate that changes more frequently than either the producer or consumer of the service will not reduce change traffic in the system. Imagine a system built on an API which changes on a weekly basis. Every producer and consumer of the services that use the API would have to change along with the API and chaos would ensue. 

It is nonproprietary. A proprietary intermediary is one that is effectively owned and controlled by a single group or small number of vendors. The reason proprietary intermediaries are undesirable is because the rate of change of the intermediary itself has been placed outside the control of the consumer. In many cases to use the service you must adopt the intermediary of the provider. It should also be noted that in many cases the controller of the proprietary standard has  incentive to continue to change the standard if that can result in additional revenue for upgrades and the like. 

It is evolvable. It’s highly unlikely that anyone will design an intermediate form that is correct for all time from the initial design.  Because of this, it’s highly desirable to have intermediate forms that are evolvable. The best trait of an evolvable intermediate is that it can be added on to, without invalidating previous uses of it. We sometimes more accurately call this an accretive capability, meaning that things can be added on incrementally. The great advantage of an evolvable or  accretive intermediary is that if there are many clients and many suppliers using the intermediary they do not have to all be changed in  lockstep, which allows many more options for upgrade and change. 

It is simple to use. An intermediate form that is complex or overly difficult to use will not be used: either other, more varied forms will be adopted, or the intermediate form will be skipped altogether and the benefit lost. 

Shared Intermediates  

In addition to the simple reduction in change traffic from having the intermediate be more stable than the components at either end, in most cases the intermediate also allows reuse of connections. This has been popularized in the systems integration business, where people have pointed out time and time again that creating a hub will drastically reduce the number of interfaces needed to supply a system.

In Figure (2), we have an example of what we call the traditional interface math, where the introduction of a hub or intermediate form can drastically reduce the number of interconnections in a system.  

People selling hubs very often refer to this as n × (n − 1) / 2, or sometimes simply the n² problem (with ten systems, that is 45 potential point-to-point interfaces versus ten connections to a hub). While this makes for very compelling economics, our observation is that the true math for this style of system is much less generous, though still positive. Just because two systems might be interconnected does not mean that they will be. Systems are not divided completely arbitrarily, and therefore not every interconnection need be accounted for. 

Figure (3) shows a more traditional scenario where, in the case on the left without a hub, there are many but not an exponential number of interfaces between systems. As the coloring shows, if you change one of those systems, any of the systems it touches may be affected and should at least be reviewed with an impact analysis. In the figure on the right, when the one system is changed, the evaluation is whether the effect spreads beyond the intermediary hub in the center. If it does not, if the system continues to obey the dictates of the intermediary form, then the change effect is, in fact, drastically reduced. 

The Axes of Decoupling  

We found in our work that, in many cases, people desire to decouple their systems and even go through the effort of creating intermediate forms or hubs and then build their systems to connect to those intermediate forms. However, as the systems evolve, very often they realize that a change in one of the systems does, in fact, “leak through”  the abstraction in the intermediate and affects other systems. 

In examining cases such as this, we have determined that there are six major considerations that cause systems that otherwise appear to be decoupled to have a secret or hidden coupling. We call these the axes of  decoupling. If a system is successfully decoupled on each of these axes,  then the impact of a change in any one of the systems should be greatly minimized. 

Technology Dependency  

The first axis that needs to be decoupled, and in some ways the hardest,  is what we call technology dependency. In the current state of the practice, people attempt to achieve integration, as well as economy of system operation, by standardizing on a small number of underlying technologies, such as operating systems and databases. The hidden trap in this is that it is very easy to rely on the fact that two systems or subsystems are operating on the same platform. As a result, developers find it easy to join a table from another database to one in their own database if they find that to be a convenient solution. They find it easy  to make use of a system function on a remote system if they know that  the remote system supports the same programming languages, the same  API, etc. 

However, this is one of the most pernicious traps because as a complex system is constructed with more and more of these subtle technology dependencies, it becomes very hard to separate out any portion and re-implement it.

The solution to this, as shown in Figure (4), is to introduce an intermediate form that ensures that a system does not talk directly to another platform. The end result is that each application or subsystem or service can run on its own hardware, in its own operating system, using its own database management system, and not be affected by changes in other systems. Of course, each system or subsystem does have a technological dependency on the technology of the intermediary in the middle. This is the trade-off; you introduce a dependence on one platform in exchange for being independent of n other platforms. In the current state of the art, most people use what’s called an integration broker to achieve this. An integration broker is a product such as IBM’s WebSphere, TIBCO or BEA, which allows one application to communicate with another without being aware of, or caring about, what platform the second application runs on. 

Destination Dependency  

Even when you’ve successfully decoupled the platforms the two applications rely on, we’ve sometimes observed problems where one  application “knows” of the existence and location of another application or service. By the way, this will become a very “normal problem” as Web services become more popular because the default method of implementing Web services has the requester knowing of the nature and destination of the service. 


In Figure (5), we show a little more clearly, through an example, two systems with an intermediary. In this case, the distribution and shipping application would like to send messages to a freight application, for instance to get a freight rating or to determine how long it would take to get a package somewhere. Imagine introducing a new service in the freight area that handled some international shipping while domestic shipping continued to be done the old way. If we had not decoupled these services, it is highly likely that the calling program would now need to be aware of the difference and make a determination in terms of what message to send, what API to call, where to send its request, etc. The only other defense would be to have yet another service that accepted all requests and then dispatched them; but this is really an unnecessary artifact that would have to be added into a system where the destination intermediary had not been designed in. 

Syntax Intermediary  

Classically, an application programming interface defines very specifically the syntax of any message sent between two systems. For instance, the API specifies the number of arguments, their order, and their type; and any change to any of those will affect all of the calling programs. EDI (electronic data interchange) likewise relies very much on a strict syntactical definition of the message being passed between partners. 



In Figure (6), we show a small snippet of XML, which has recently become the de facto syntactic intermediate form. Virtually all new initiatives now use XML as the syntactic lingua franca. As such, any two systems that communicate through XML at least do not have to mediate differences at that syntactic level. Also, fortunately, XML is a  nonproprietary standard and, at least to date, has been evolving very slowly. 

Semantic Intermediary  

Where systems integration projects generally run into the greatest amount of trouble is with semantic differences, or ambiguities in the meaning of the information being passed back and forth. Traditionally, we find that developers build interfaces, run them, and test them against live data, and then find that the ways in which the systems have been used do not conform particularly well to the spec. Additionally, the names, and therefore the implied semantics, of all the elements used in the interface are typically different from system to system and must be reconciled. The n² way of resolving this is to reconcile every system to every other system, a very tedious process.

There have been a few products and some approaches, as we show very simply and schematically in Figure (7), that have attempted to provide a semantic intermediary. Two that we’re most familiar with are Contivo and Unicorn. Over the long term, the intent of the Semantic Web is to build shared ontologies in OWL, the Web Ontology Language, a derivative of RDF and DAML+OIL. In the long term, it’s expected that systems will be able to communicate shared meaning through mutually committed ontologies. 

Identity Intermediary

  
A much subtler coupling that we’ve found in several systems is in the use of identifiers. Most systems have identifiers for all the key real-world and invented entities that they deal with. For instance, most systems have identifiers for customers, patients, employees, sales orders, purchase orders, production lines, etc. All of these things must be given unique, unambiguous names. That is not the problem; the problem is that each system has a tendency to create its own identifiers for items that are very often shared. In the real world, there is only one instance of many of these items. There is only one of each of us as individuals, one of each building, one of each corporation, etc. And yet each system tends to create its own numbering system, and when it discovers a new customer it will give it the next available customer number.

In order to communicate unambiguously with a system that has done this, the two main approaches to date have been either to force universal identifiers onto a large number of systems or to store other people’s identifiers in your own system. Both of these approaches are flawed and do not scale well. In the case of the universal identifier, besides all the problems of attempting to get coverage on the multiple domains, there is the converse problem of privacy. Once people are given universal identifiers, for instance, it’s very hard to keep information about individuals anonymous. The other approach, storing others’ identifiers in your systems, does not scale well because as the number of systems you must communicate with grows, the number of other identifiers that you must store also grows. In addition, there is the problem of being notified when any changes to these identifiers occur.

 

In Figure (8), we outline a new intermediary, which is just beginning to be discussed as a general-purpose service, variously called the identity intermediary or the handle intermediary. (We’ve begun shifting away from calling it an identity intermediary because the security industry has been referring to identity systems, and that does not mean exactly the same thing as what we mean here.) Essentially, this is a service through which each subscribing system recognizes that it may be dealing with an entity that any of the other systems may have previously dealt with.

So this has a discovery piece: systems can discover whether they are dealing with, communicating with, or aware of any entity that has already been identified in the larger federation. It also acts as a cross-reference, so that each system need not keep track of all the synonyms of identifiers or handles in all the other systems. Figure (8) shows a very simple representation of this with two very similar individuals that need to be identified separately. To date, the only system that we know of that covers some of this territory is called ChoiceMaker, but it is not configured to be used in exactly the manner that we show here. 
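
As a minimal sketch of what such a handle intermediary does (invented names, no product implied): each system keeps its own local identifier, and the intermediary maintains the cross-reference so that no system has to store the others’ identifiers:

    import itertools
    from typing import Optional

    class HandleRegistry:
        def __init__(self):
            self._next = itertools.count(1)
            self._by_local = {}          # (system, local_id) -> handle
            self._by_handle = {}         # handle -> set of (system, local_id)

        def register(self, system: str, local_id: str, same_as_handle: Optional[int] = None) -> int:
            """Register a local identifier; link it to an existing handle if this
            entity has already been discovered by another system."""
            handle = same_as_handle if same_as_handle is not None else next(self._next)
            self._by_local[(system, local_id)] = handle
            self._by_handle.setdefault(handle, set()).add((system, local_id))
            return handle

        def synonyms(self, system: str, local_id: str):
            """All other systems' identifiers for the same real-world entity."""
            handle = self._by_local[(system, local_id)]
            return self._by_handle[handle] - {(system, local_id)}

    reg = HandleRegistry()
    h = reg.register("CRM", "CUST-0098")
    reg.register("Billing", "4471", same_as_handle=h)
    print(reg.synonyms("CRM", "CUST-0098"))   # {('Billing', '4471')}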

Nomenclature Intermediary  

Very similar to the identity or handle intermediary is the nomenclature intermediary. We separate it because typically, with the identity intermediary, we’re dealing with discovered real world entities and the reason we have synonyms is because multiple different systems are “discovering” the same physical real-world item. 

In the case of the nomenclature intermediary, we’re dealing with an invented categorization system. Sometimes categorization systems are quite complex; in the medical industry we have SNOMED, HCPCS, and the CPT nomenclature. But we also have incredibly simple, and very often internally invented, classification systems: every code file where we might have seven types of customer, or order, or accident, or whatever, that we tend to codify in order to get more uniformity, is a nomenclature. What is helpful about having intermediary forms is that they enable multiple systems to either share or map to a common set of nomenclatures or codes. 

Figure (9) shows a simple case of how the mapping could be centralized. Again, this is another example where, over the long term, developments in the Semantic Web may be a great help and may provide clearinghouses for the communication between disparate systems. In the meantime, the only example that we’re aware of where a company has internally devoted a lot of attention to this is Allstate Insurance Co., which has built what they call a domain management system, in which they have found, catalogued, and cross-referenced over 6,000 different nomenclatures in use within Allstate.

Summary  

Loose coupling has been a Holy Grail for systems developers for generations. There is no silver bullet that will slay these problems; however, as we have discussed in this paper, there are a number of specific disciplined things that we can look at as developers, and as we continue to pay attention to these, we will make our systems more and more decoupled, and therefore easier and easier to evolve and change.

Documents, Events and Actions

We have recently been reexamining the weird relationship of “documents” to “events” in enterprise information systems and have surfaced some new insights that are worth  sharing.  

Documents and Events 

Just to make sure we are all seeing things clearly, the documents we’re referring to are  those that give rise to financial change in an enterprise. This includes invoices,  purchase orders, receiving reports and sales contracts. We’re not including other documents like memos, reports, news articles and emails – nor are we focusing on document structures such as JSON or XML.  

In this context, the “events” represent the recording of something happening that has  a high probability of affecting the finances of the firm. Many people call these  “transactions” or “financial transactions.” The deeper we investigated, the more we  found a need to distinguish the “event” (which is occurring in the real world) from the  “transaction” (which is its reflection in the database). But I’m getting ahead of myself  and will just stick with documents and events for this article. 

Documents and Events, Historically 

For most of recorded history, the document was the event, or at least it was the only tangibly recorded interpretation of the event. That piece of actual paper was both the document and the representation of the event. When you wrote up a purchase order  (and had it signed by the other party) you had an event.  

In the 1950s we began computerizing these documents, turning them into a skeuomorph (a design that imitates a real-world object to make it more familiar). The user interfaces looked like paper forms. There were boxes at the top for “ship to” and “bill to” and small boxes in the middle for things like “payment terms” and “free on board.” These were accompanied by line items for the components that made up the bill, invoice, purchase order, timecard, etc.

For the longest time, the paper was also the “source document” which would be entered into the computer at the home office. Somewhere along the way some clever person realized you could start by entering the data into the computer for things you originated and then print out the paper. That paper was then sent to the other party for them to key it into their system.  

Now, most of these “events” are not produced by humans but by some other computer program. Programs such as bill-of-materials processors can generate purchase orders much faster than a room full of procurement specialists. Many industries now consider these “events” to be primary; the documents (if they exist at all) are part of the audit trail. Industries like healthcare long ago replaced the “superbill” (a document on a clipboard with three dozen check boxes representing what the physician did to you on that visit) with 80 specific types of HL7 messages that ricochet back and forth between provider and payer.

And yet, even in the 21st century, we still find ourselves often excerpting facts from  unstructured documents and entering them into our computer systems. Here at  Semantic Arts, we take the contracts we’ve signed with our clients and scan them for the tidbits that we need to put into our systems (such as the budgets, time frame,  staffing and billing rates) and conveniently leave the other 95% of the document in a file somewhere.  

Documents and Events, what is the difference?

So for hundreds of years, documents and events were more or less the same thing. Now they have drifted apart. In today’s environment, the real question is not “what’s the difference?” but rather “which one is the truth?” In other words, if there is a difference, which one do we use? There is no one-size-fits-all answer to that dilemma; it varies from industry to industry.

But I think it’s fairly safe to say the current difference is that an “event” is a structured  data representation of the business activity, while a “document” is the unstructured  data representation. Either one could have come first. Each is meant to be the reflection of the other.  
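
To make that distinction concrete, here is a minimal sketch (the class and property names are hypothetical, not taken from gist or from any system described here) in which the structured event carries the facts and simply points at the unstructured document that mirrors it:

@prefix ex:  <http://example.com/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# The "event": a structured representation of the business activity
ex:invoice-2024-031  a  ex:InvoiceEvent ;
    ex:hasCustomer    ex:AcmeCorp ;
    ex:invoiceAmount  "12500.00"^^xsd:decimal ;
    ex:invoiceDate    "2024-03-31"^^xsd:date ;
    # The "document": the unstructured rendition, stored as a file
    ex:renderedAs     <http://example.com/files/invoice-2024-031.pdf> .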

The Event and the Transaction 

The event has a very active sense to it because it occurs at a specific point in time. We therefore record it in our computer system by creating a transaction, which updates our database with a posting date and an effective accounting date.

The transaction and the event often appear to be the same thing, partly because so many events terminate in the accounting department. But, in reality, the transaction adds information to the event that allows it to be posted. The main information being added is the valuation, the classification and the effective dates. Most people enter these at the same time they capture the event, but they are distinct. The distinction is more obvious when you consider events such as “issuing material” to a production order. The issuer doesn’t know what account number should be charged, nor do they know the valuation (that is buried in an accounting policy that determines whether to cost this widget at the most recent cost, the oldest cost or the average cost of widgets on hand). So the “transaction” is different from the “event” even if they occur at the same time.
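
A minimal sketch of that separation (again with hypothetical class and property names) might look like this: the event records only what the issuer knows, and the transaction adds the classification, valuation and effective date needed for posting.

@prefix ex:  <http://example.com/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# The event: what the person at the stockroom actually knows
ex:issue-7741  a  ex:MaterialIssueEvent ;
    ex:item               ex:Widget ;
    ex:quantity           "40"^^xsd:integer ;
    ex:toProductionOrder  ex:prodOrder-1182 .

# The transaction: the same happening, enriched so it can be posted
ex:txn-7741  a  ex:Transaction ;
    ex:recordsEvent       ex:issue-7741 ;
    ex:chargedToAccount   ex:workInProcessInventory ;   # the classification
    ex:valuation          "312.40"^^xsd:decimal ;       # set by the costing policy
    ex:effectiveDate      "2024-03-31"^^xsd:date .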

Until fairly recently, administrators wouldn’t sit at their computer and enter invoices until they were prepared for them to be issued. Most people wait until they ship the widget or complete the milestone before they key in the invoice data and email it to their customer. In this circumstance, the event and the transaction are contemporaneous: they happen at the same time. And the document being sent to the customer follows shortly thereafter.

One More Disconnect  

We are implementing data-centric accounting at Semantic Arts and have disconnected the “event” (the structured data representation) from its classification as an event. We realized that as soon as we had signed a contract, we knew at least one of the two aspects of our future invoices, and in many cases we knew both. For fixed-price projects, we knew the amount of the future invoices; the only thing we didn’t know was when we could invoice them, because that was based on the date of some given milestone. For time-and-materials contracts we know the dates of our future invoices (often the end of the month) but don’t know the amount. And for our best-efforts contracts we know the dates and the amounts and adjust the scope to fit.

But knowing these things and capturing them in our accounting system created a problem: they weren’t actually real yet (or at least they weren’t real enough to be invoices). The sad thing was they looked just like invoices. They had all the data, and it was all valid. They could be rendered to PDFs, and even printed, but we knew we couldn’t send all the invoices to our client all at once. So we now had some invoices in our system that weren’t really invoices, and we didn’t have a good way to make the distinction.

As we puzzled over this, we came across a university that was dealing with the same  challenge. In their case they were implementing “commitment accounting,” which is trying to keep track of the commitments (purchase orders mostly) that are outstanding as a way to prevent overrunning budgets. As people entered their purchase orders  (structured records as we’ve been describing them) the system captured them as events. These events were captured and tallied by the system. In order to get the  system to work, people entered purchase orders long before they were approved. In fact, you have to enter them to get an event (or a document) that can be approved and  agreed to by your vendor.  

The problem was many of these purchase order events never were approved. The  apparent commitments vastly exceeded the budgets, and the whole system was shut  down.  

Actions 

We discovered that it isn’t the document, and it isn’t even the event (if we think of the event as the structured data record of the business event) that makes the financial effect real. It is something we are now calling the “action,” or really a special type of  “action.” 

There is a magic moment when an event, or perhaps more accurately a proto-event, becomes real. On a website, it is the “buy” button. In the enterprise, it is often the “approval” button.

As we worked on this, we discovered it is just one of the steps in a workflow. The  workflow for a purchase order might start with sourcing, getting quotes, negotiating, etc. The special step that makes the purchase order “real” isn’t even the last step.  After the purchase order is accepted by the vendor, we still need to exchange more  documents to get shipping notifications, deal with warranties, etc. It is one of those steps that makes the commitment. We are now calling this the “green button.” There is one step, one button in the workflow progression that makes the event real. In our internal systems we’re going to make that one green, so that employees know when they are committing the firm.  
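
Representing that in data is straightforward. As a sketch (the workflow, the step names and the CommittingStep class are all hypothetical, not part of any published ontology), the “green button” is simply the one step in the workflow singled out as the step that commits the firm:

@prefix ex: <http://example.com/> .

# A purchase-order workflow and its steps
ex:poWorkflow  a  ex:Workflow ;
    ex:hasStep  ex:sourcing , ex:getQuotes , ex:negotiate ,
                ex:approvePO , ex:sendToVendor , ex:trackShipping .

# The one step that commits the firm: the "green button"
ex:approvePO  a  ex:CommittingStep .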

Once you have this idea in your head, you’ll be surprised how often it is missed. I go on my bank’s website and work through the process of transferring money. I get a number of red buttons, and with each one I wonder, “is this the green one?” Nope, one more step before we’re committed. Same with booking a flight. There are lots of purple buttons, but you have to pay close attention before you notice which one of those purple buttons is really the green one.

Promotion 

And what does the green button in our internal systems do? Well, it varies a bit,  workflow to workflow, but in many cases it just “promotes” a draft item to a committed one. 

In a traditional system you would likely have draft items in one table and then copy them over to the approved table. Or you might have a status and just be careful to  exclude the unapproved ones from most queries.  

But we’ve discovered that many of these events can be thought of as subtypes of their  draft versions. When the green button gets pressed in an invoicing workflow, the draft invoice gains another triple, which makes it also an approved or a submitted invoice – in addition to its being a draft invoice.  
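
In triples, the promotion is about as small an update as you can make. A minimal sketch (the class names here are hypothetical, not drawn from gist):

@prefix ex: <http://example.com/> .

# Before the green button: the item exists only as a draft
ex:invoice-2024-031  a  ex:DraftInvoice .

# Pressing the green button adds a single triple; the same resource
# is now also an approved invoice, without being copied anywhere
ex:invoice-2024-031  a  ex:ApprovedInvoice .

Nothing is moved between tables; queries that want only committed invoices simply ask for the approved type.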

Summary 

We in the enterprise software industry have had a long history of conflating documents  and events. Usually we get away with it, but occasionally it bites us.  

What we’re discovering now, with the looming advent of data-centric accounting, is the need not only to distinguish the document from the event but also to distinguish the event (as a structure) from the action that enlivens it. We see this as an important step in the further automation of direct financial reporting.

gist: Buckets, Buckets Everywhere: Who Knows What to Think

We humans are categorizing machines, which is to say, we like to create metaphorical buckets and put things inside. But there are different kinds of buckets, and different ways to model them in  OWL and gist. The most common bucket represents a kind of thing, such as Person or Building.  Things that go into those buckets are individuals of those kinds, e.g. Albert Einstein, or the particular office building you work in. We represent this kind of bucket as an owl:Class and we use rdf:type to put something into the bucket. 
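
In Turtle, with a hypothetical ex: namespace, that first kind of bucket looks like this:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex:  <http://example.com/> .

ex:Person  a  owl:Class .                  # the bucket: a kind of thing
ex:AlbertEinstein  rdf:type  ex:Person .   # putting an individual into the bucket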

Another kind of bucket is when you have a group of things, like a jury or a deck of cards, that are functionally connected in some way. Those related things go into the bucket (12 members of a jury, or 52 cards). We have a special class in gist called Collection for this kind of bucket. A specific bucket of this sort will be an instance of a subclass of gist:Collection. E.g. OJs_Jury is an instance of the class Jury, a subclass of gist:Collection. We use gist:memberOf to put things into the bucket. Convince yourself that these buckets do not represent a kind of thing. A jury is a kind of thing; a particular jury is not. We would use rdf:type to connect OJ’s jury to the owl:Class Jury, and use gist:memberOf to connect the specific jurors to OJ’s jury.
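
The same example in Turtle (ex: is a hypothetical namespace; the gist namespace URI shown is the w3id one used by recent gist releases, so substitute whichever version you resolve):

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix gist: <https://w3id.org/semanticarts/ns/ontology/gist/> .
@prefix ex:   <http://example.com/> .

ex:Jury  rdfs:subClassOf  gist:Collection .   # the kind of bucket
ex:OJs_Jury  rdf:type  ex:Jury .              # one particular bucket
ex:SheilaWoods  gist:memberOf  ex:OJs_Jury .  # putting a juror into it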

A third kind of bucket is a tag, which represents a topic and is used to categorize individual items for the purpose of indexing a body of content. For example, the tag “Winter” might be used to index photographs, books and/or YouTube videos. Any content item that depicts or relates to winter in some way should be categorized using this tag. In gist, we represent this in a way that is structurally the same as how we represent buckets that are collections of functionally connected items. The differences are 1) the bucket is an instance of a subclass of gist:Category, rather than of gist:Collection and 2) we put things into the bucket using gist:categorizedBy rather than gist:memberOf. The Winter tag is essentially a bucket containing all the things that have been indexed or categorized using that tag.
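
And the tag example in Turtle (again with hypothetical ex: names):

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix gist: <https://w3id.org/semanticarts/ns/ontology/gist/> .
@prefix ex:   <http://example.com/> .

ex:Tag  rdfs:subClassOf  gist:Category .                   # the kind of bucket
ex:Winter  rdf:type  ex:Tag .                              # the tag itself
ex:WinterOfOurDiscontent  gist:categorizedBy  ex:Winter .  # indexing a book with the tag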

Below is a summary table showing these different kinds of buckets, and how we represent them in  OWL and gist.

Kind of Bucket | Example | Representing the Bucket | Putting something in the Bucket
Individual of a Kind | John Doe is a Person | Instance of owl:Class | rdf:type
A bucket with functionally connected things inside | Sheila Woods is a member of OJ’s Jury | Instance of a subclass of gist:Collection | gist:memberOf
An index term for categorizing content | The book “Winter of our Discontent” has Winter as one of its tags | Instance of a subclass of gist:Category | gist:categorizedBy


Morgan Stanley: Data-Centric Journey 

Morgan Stanley has been on the semantic/data-centric journey with us for about six years. Their approach is the adoption of an RDF graph and the development of a semantic knowledge base to help answer domain-specific questions, formulate classification recommendations and deliver quality search to their internal users. Their primary objective is to enable the firm to retrieve, retain and protect information (i.e., to know where the information resides, how long it must be maintained and what controls apply to it).

The knowledge graph is being developed by the Information Management team under the direction of Nic Seyot (Managing Director and Head of Data & Analytics for Non-Financial  Risk). Nic is responsible for the development of the firm-wide ontology for trading surveillance, compliance, global financial crime and operational risk. Nic’s team is also helping other departments across the firm discover and embrace semantic data modeling for their own use cases.  

Morgan Stanley has tens of thousands of discrete repositories of information. There are many different groups with specialized knowledge about the primary objectives as well as many technical environments to deal with. Their motivating principle is to understand the  conceptual meaning of the information across these various departments and  environments so that they can answer compliance and risk questions.  

A good example is a query from a user about the location of sensitive information (with many conflicting classifications) and whether they are allowed to share it outside of the firm. The answer to this type of question involves knowledge of business continuity, disaster recovery, emergency planning and many other areas of control. Their approach is to leverage semantic modeling, ontologies and a knowledge graph to answer that question comprehensively.

To build the knowledge graph around these information repositories, they hired Semantic Arts to create a core ontology around issues that are relevant to the entire firm, including personnel, geography, legal entities, records management, organization and a number of firm-wide taxonomies. Morgan Stanley is committed to open standards and W3C principles, which they have combined with their internal standards for quality governance. They created a Semantic Modeling and Ontology Consortium to help govern and maintain that core ontology. Many divisions within the firm have joined the consortium’s advisory board, and it is viewed as an excellent way of facilitating cooperation between divisions.

The adoption-based principle has been a success. They have standardized ETL and virtualization to get information structured and into their knowledge graph. The key use case is enterprise search: giving departments the ability to search for their content by leveraging the tags, lists, categories and taxonomies they already use as facets. One of the key benefits is an understanding of the network of concepts and terms, and how they relate to one another, within the organization.

Semantic Arts ontologists helped engineer the network of concepts that went into their semantic thesaurus and how those concepts interconnect within the firm. They started out with over 6,500 policies and procedures as a curated corpus of the firm’s knowledge. They used natural language processing to extract the complexity of relationships out of their combined taxonomies (over half a million concepts). We worked with them to demonstrate the power of conceptual simplification, helping them transform these complex relationships into broader, narrower and related properties that enable users to ask business questions in their own context (and acronyms) and enhance the quality of search without manual curation. Our efforts helped reduce the noise, merge concepts with similar meaning and identify critical topics to support complex queries from users.
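
As a hypothetical sketch of that simplification (the concept names are invented, and the use of SKOS broader/narrower/related properties is an assumption; the post does not name the specific vocabulary):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ms:   <http://example.com/thesaurus/> .   # hypothetical thesaurus namespace

# Complex, taxonomy-specific relationships reduced to three simple ones
ms:RecordsRetention  skos:broader  ms:InformationManagement ;
                     skos:related  ms:LegalHold .
ms:EmailArchiving    skos:broader  ms:RecordsRetention .   # narrower is implied in the other direction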

Contact Us: 

Overcome integration debt with proven semantic solutions. 

Contact Semantic Arts, the experts in data-centric transformation, today! 

CONTACT US HERE 

Address: Semantic Arts, Inc. 

123 N College Avenue Suite 218 

Fort Collins, CO 80524 

Email: [email protected] 

Phone: (970) 490-2224