How a “User” Knowledge Graph Can Help Change Data Culture

Identity and Access Management (IAM) has had the same problem since Fernando Corbató of MIT first dreamed up the idea of digital passwords in 1960: opacity. Identity in the physical world is rich and well-articulated, with a wealth of different ways to verify information on individual humans and devices. By contrast, the digital realm has been identity-data impoverished, cryptic, and inflexible for over 60 years now.

Jans Aasman, CEO of Franz, provider of the entity-event knowledge graph solution AllegroGraph, envisions a “user” knowledge graph as a flexible and more manageable data-centric solution to the IAM challenge. He presented on the topic at this past summer’s Data-Centric Architecture Forum, which Semantic Arts hosted near its headquarters in Fort Collins, Colorado.

Consider the specificity of a semantic graph and how it could facilitate secure access control. Knowledge graphs constructed of subject-predicate-object triples make it possible to set rules and filters in an articulated and yet straightforward manner. Information about individuals that’s been collected for other HR purposes  could enable this more precise filtering. 

For example, Jans could disallow others’ access to a triple that connects “Jans”  and “salary”. Or he could disallow access to certain predicates. 
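To make the idea concrete, here is a minimal sketch (not from Jans’s presentation) of triple-level filtering in Python with the rdflib library; the namespace, data, and policy are invented for illustration:

```python
# A minimal sketch of triple-level access control over an RDF graph.
# The namespace, data, and blocked-predicate policy are invented examples.
from rdflib import Graph, Literal, Namespace

EX = Namespace("https://example.org/hr/")

g = Graph()
g.bind("ex", EX)
g.add((EX.Jans, EX.worksFor, EX.Franz))
g.add((EX.Jans, EX.salary, Literal(200000)))            # the sensitive triple
g.add((EX.Jans, EX.officeLocation, Literal("Oakland")))

BLOCKED_PREDICATES = {EX.salary}   # policy: hide any triple that uses these predicates

def visible_triples(graph, requester_is_owner=False):
    """Yield only the triples the requester is allowed to see."""
    for s, p, o in graph:
        if p in BLOCKED_PREDICATES and not requester_is_owner:
            continue
        yield s, p, o

for triple in visible_triples(g):
    print(triple)   # the (Jans, salary, ...) triple is filtered out
```

The same policy could just as easily be expressed as a SPARQL filter or enforced by the triple store itself; the point is that the unit of control is the triple or the predicate, not a whole table or application.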

Identity and access management vendors call this method Attribute-Based  Access Control (ABAC). Attributes include many different characteristics of users and  what they interact with, which is inherently more flexible than role-based access control  (RBAC). 

Cell-level control is also possible, but as Forrest Hare of Summit Knowledge Solutions points out, such security doesn’t make a lot of sense, given how much meaning is absent in cells controlled in isolation. “What’s the classification of the number 7?” he asked. Without more context, it seems silly to control cells that are just storing numbers or individual letters, for example.

Simplifying identity management with a knowledge graph approach  

Graph databases can simplify various aspects of the process of identity  management. Let’s take Lightweight Directory Access Protocol, or LDAP, for example. 

This vendor-agnostic protocol has been around for 30 years, but it’s still popular  with enterprises. It’s a pre-web, post-internet hierarchical directory service and authentication protocol. 

“Think of LDAP as a gigantic, virtual telephone book,” suggests access control management vendor Foxpass. Foxpass offers a dashboard-based LDAP management product which it claims is much easier to manage than OpenLDAP. 

If companies don’t use LDAP, they often use Microsoft’s Active Directory (AD) instead, a broader, database-oriented identity and access management product that covers more of the same bases. Microsoft bundles AD with its Server and Exchange products, a means of lock-in that has been quite effective. Lock-in, obviously, inhibits innovation in general.

Consider the whole of identity management as it exists today and how limiting it has been. How could enterprises embark on the journey of using a graph database-oriented approach as an alternative to application-centric IAM software? The first step  involves the creation of a “user” knowledge graph. 
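What might that first step look like in practice? The sketch below is purely illustrative (invented identifiers, not a recommendation of any particular schema): a handful of triples consolidating attributes that would otherwise be scattered across LDAP, Active Directory, and HR systems, plus an attribute-based query over them.

```python
# A hypothetical first cut at a "user" knowledge graph: a few triples
# consolidating attributes that usually live in LDAP/AD and HR systems.
# All identifiers here are invented for illustration.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

ORG = Namespace("https://example.org/org/")

g = Graph()
g.bind("org", ORG)

g.add((ORG.jdoe, RDF.type, ORG.Employee))
g.add((ORG.jdoe, ORG.hasEmailAddress, Literal("jdoe@example.org")))
g.add((ORG.jdoe, ORG.memberOfDepartment, ORG.Finance))
g.add((ORG.jdoe, ORG.hasRole, ORG.AccountsPayableClerk))
g.add((ORG.jdoe, ORG.clearedFor, ORG.InternalOnlyData))

# Attribute-based access decisions then become simple graph queries:
q = """
SELECT ?user WHERE {
  ?user org:memberOfDepartment org:Finance ;
        org:clearedFor org:InternalOnlyData .
}
"""
for row in g.query(q, initNs={"org": ORG}):
    print(row.user)
```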

Access control data duplication and fragmentation  

Semantic Arts CEO Dave McComb, in his book Software Wasteland, estimated that 90 percent of data is duplicated. Application-centric architectures in use since the days of mainframes have led to user data sprawl. Part of the reason user data is so heavily duplicated is that authentication, authorization, and access control (AAA) methods require that more bits of personally identifiable information (PII) be shared with central repositories.

B2C companies have lately been particularly prone to hoovering up these additional bits of PII and storing that sensitive information in centralized repositories. Those repositories become one-stop shops for identity thieves. Customers who want to pay online have to enter bank routing numbers and personal account numbers. As a result, there’s even more duplicate PII sprawl.

One of the reasons a “user” knowledge graph (and a knowledge graph enterprise foundation) could be innovative is that enterprises who adopt such an approach can move closer to zero-copy integration architectures. Model-driven development of the type that knowledge graphs enable assumes and encourages shared data and logic. 

A “user” graph coupled with project management data could reuse the same  enabling entities and relationships repeatedly for different purposes. The model-driven development approach thus incentivizes organic data management. 

The challenge of harnessing relationship-rich data  

Jans points out that enterprises, for example, run massive email systems that could be tapped to analyze project data for optimization purposes. And  disambiguation by unique email address across the enterprise can be a starting point  for all sorts of useful applications. 

Most enterprises don’t apply unique email address disambiguation, but Franz has a pharma company client that does, an exception that proves the rule. Email remains an untapped resource in many organizations precisely because it is a treasure trove of sensitive relationship data.
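A hedged sketch of what disambiguation by unique email address can look like: records from different systems are clustered on the normalized address and linked to a single person identifier. The systems and records below are invented.

```python
# Cluster records from different systems on a normalized email address and
# link each cluster to one person identifier. All data here is invented.
from collections import defaultdict

records = [
    {"system": "CRM",     "name": "J. Aasman",   "email": "Jans@Example.com"},
    {"system": "HR",      "name": "Jans Aasman", "email": "jans@example.com"},
    {"system": "Email",   "name": "jans",        "email": "jans@example.com"},
    {"system": "Tickets", "name": "J Smith",     "email": "jsmith@example.com"},
]

clusters = defaultdict(list)
for rec in records:
    clusters[rec["email"].strip().lower()].append(rec)

for email, recs in clusters.items():
    person_iri = f"https://example.org/person/{email}"   # one node per person
    print(person_iri)
    for rec in recs:
        print("   linked:", rec["system"], rec["name"])
```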

Problematic data farming realities: A social media example  

Relationship data involving humans is sensitive by definition, but the reuse potential of sensitive data is too important to ignore. Organizations do need to interact with individuals online, and vice versa. 

Former US Federal Bureau of Investigation (FBI) counterintelligence agent Peter Strzok said the following on Deadline: White House, an MSNBC program aired in the US on August 16:

“I’ve served I don’t know how many search warrants on Twitter (now known as X) over the years in investigations. We need to put our investigator’s hat on and talk about tradecraft a little bit. Twitter gathers a lot of information. They don’t just have your tweets. They have your draft tweets. In some cases, they have deleted tweets. They have DMs that people have sent you, which are not encrypted. They have your draft DMs, the IP address from which you logged on to the account at the time, sometimes the location at which you accessed the account and other applications that are associated with your Twitter account, amongst other data.”

X and most other social media platforms, not to mention law enforcement  agencies such as the FBI, obviously care a whole lot about data. Collecting, saving, and  allowing access to data from hundreds of millions of users in such a broad,  comprehensive fashion is essential for X. At least from a data utilization perspective,  what they’ve done makes sense. 

Contrast these social media platforms with the way enterprises collect and  handle their own data. That collection and management effort is function- rather than human-centric. With social media, the human is the product. 

So why is a social media platform’s culture different? Because with public social media, broad, relationship-rich data sharing had to come first. Users learned first-hand  what the privacy tradeoffs were, and that kind of sharing capability was designed into  the architecture. The ability to share and reuse social media data for many purposes  implies the need to manage the data and its accessibility in an elaborate way. Email, by contrast, is a much older technology that was not originally intended for multi-purpose reuse. 

Why can organizations like the FBI successfully serve search warrants on data from data farming companies? Because social media started with a broad data sharing assumption and forced a change in the data sharing culture. Then came adoption.  Then law enforcement stepped in and argued effectively for its own access. 

Broadly reused and shared, web data about users is clearly more useful than siloed data. Shared data is why X can have the advertising-driven business model it does. One-way social media contracts with users require agreement with provider terms. The users have one choice: Use the platform, or don’t. 

The key enterprise opportunity: A zero-copy user PII graph that respects users  

It’s clear that enterprises should do more to tap the value of the kinds of user data that email, for example, generates. One way to sidestep the sensitivity issues associated with reusing that sort of data would be to treat the most sensitive user data separately. 

Self-sovereign identity (SSI) advocate Phil Windley has pointed out that agent-managed, hashed messaging and decentralized identifiers could make it unnecessary to duplicate identifiers that correlate. If a bartender just needs to confirm that a patron at the bar is old enough to drink, the bartender could ping the DMV to confirm the fact. The DMV could then ping the user’s phone to verify the patron’s claimed adult status.

Given such a scheme, each user could manage and control access to their own most sensitive PII. In this scenario, the PII could stay in place, stored and encrypted on the user’s phone.

Knowledge graphs lend themselves to this less centralized, yet more fine-grained and transparent, approach to data management. By supporting self-sovereign identity and a data-centric architecture, a Chief Data Officer could help the Chief Risk Officer mitigate the enterprise risk associated with the duplication of personally identifiable information: a true win-win.

A Knowledge Model for Explainable Military AI

Forrest Hare, Founder of Summit Knowledge Solutions, is a retired US Air Force targeting and information operations officer who now works with the Defense Intelligence Agency (DIA). His experience includes integrating intelligence from communications, signals, imagery, open-source, telemetry, and other sources into a cohesive and actionable whole.

Hare became aware of semantic technology while at SAIC and is currently focused on building a space-plus-time ontology called the DIA Knowledge Model, so that Defense Department intelligence can use it to contextualize these multi-source inputs.

The question becomes, how do you bring objects that don’t move and objects that do move into the same information frame with a unified context? The information is currently organized by collectors and producers. 

The object-based intelligence that does exist involves things that don’t move at all. Facilities, for example, or humans using phones that are present on a communications network, are more or less static. But what about the things in between, such as trucks, that are only intermittently present?

Only sparse information is available about these. How do you know the truck that appeared in yesterday’s image is the same truck that is there today? Not to mention that the potentially hostile forces who own the truck have a strong incentive to hide it.

Objects in object-based intelligence not only include these kinds of assets, but also events and locations that you want to collect information about. In an entity-relationship sense, objects are entities. 

Hare’s DIA Knowledge Model uses the ISO-standard Basic Formal Ontology (BFO) to unify domains so that the information from different sources is logically connected and therefore makes sense as part of a larger whole. BFO’s maintainers (Director Barry Smith and his team at the National Center for Ontological Research (NCOR) at the University at Buffalo) keep the ontology strictly limited to 30 or so classes.

The spatial-temporal regions of the Knowledge Model are what’s essential to do the kinds of  dynamic, unfolding object tracking that’s been missing from object-based intelligence. Hare gave the example of a “site” (an immaterial entity) from a BFO perspective. A strict geolocational definition of “site” makes it possible for both humans and machines to make sense of the data about sites. Otherwise, Hare says, “The computer has no idea how to  understand what’s in our databases, and that’s why it’s a dumpster fire.”
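As a purely illustrative sketch (these are not the DIA Knowledge Model’s or BFO’s actual IRIs), a strictly geolocated “site” might be captured as a typed instance with explicit coordinates and an observation interval that a machine can interpret without guesswork:

```python
# Hypothetical illustration only: a "site" as a typed instance with explicit
# coordinates and a validity interval. Namespace and values are invented.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

KM = Namespace("https://example.org/knowledge-model/")

g = Graph()
g.bind("km", KM)

g.add((KM.Site_0042, RDF.type, KM.Site))
g.add((KM.Site_0042, KM.latitude,  Literal("36.1716", datatype=XSD.decimal)))
g.add((KM.Site_0042, KM.longitude, Literal("44.0092", datatype=XSD.decimal)))
g.add((KM.Site_0042, KM.observedFrom, Literal("2023-08-01T00:00:00Z", datatype=XSD.dateTime)))
g.add((KM.Site_0042, KM.observedTo,   Literal("2023-08-02T00:00:00Z", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```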

This kind of mutual human and machine understanding is a major rationale behind explainable  AI. A commander briefed by an intelligence team must know why the team came to the  conclusions it did. The stakes are obviously high. “From a national security perspective, it’s extremely important for AI to be explainable,” Hare reminded the audience. Black boxes such as ChatGPT as currently designed can’t effectively answer the commander’s question on how the intel team arrived at the conclusions it did. 

Finally, the level of explainability that knowledge models like the DIA’s provide becomes even more critical as information flows into the Joint Intelligence Operations Center (JIOC). Furthermore, the various branches of the US Armed Forces must supply and continually update a Common Intelligence Picture that’s actionable by the US President, who is the Commander in Chief of the military as a whole.

Without this conceptual and spatial-temporal alignment across all service branches, joint operations can’t proceed as efficiently and effectively as they should. Certainly, the risk of failure looms much larger as a result.

SIX AXES OF DECOUPLING

Loose coupling has been a Holy Grail for systems developers for generations.

The virtues of loose coupling have been widely lauded, yet there has been little description of what is needed to achieve it. In this paper we describe our observations from projects we’ve been involved with.

Coupling  

Two systems or two parts of a single system are considered coupled if a change to one of the systems unnecessarily affects the other system. So for instance, if we upgrade the version of our database and it requires  that we upgrade the operating system for every client attached to that database, then we would say those two systems or those two parts of  the system are tightly coupled. 

Coupling is widely understood to be undesirable because of the spread  of the side effects. As systems get larger and more complex, anything that causes a change in one part to affect a larger and larger footprint in the entire system is going to be expensive and destabilizing. 

Loose Coupling/Decoupling  

So, the converse of this is to design systems that are either “loosely  coupled” or “decoupled.” Loosely coupled systems do not arise by accident. They are intentionally designed such that change can be introduced around predefined flex points. 

For instance, one common strategy is to define an application programming interface (API) which external users of a module or class can use. This simple technique allows the interior of the class or module or method to change without necessarily exporting a change in behavior to the users. 
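A trivial illustration of the point (hypothetical function names, not any particular system): callers depend only on the published signature, so the interior can change freely without exporting a change in behavior.

```python
# Illustrative only: callers depend on the signature of get_customer(),
# not on how it is implemented. The interior can switch from an in-memory
# dict to a database or service call without exporting any change.
def get_customer(customer_id: str) -> dict:
    """The published interface: stable signature, stable return shape."""
    return _lookup(customer_id)

def _lookup(customer_id: str) -> dict:
    # Today an in-memory table; tomorrow a SQL query or a remote call.
    customers = {"C-001": {"id": "C-001", "name": "Acme Corp"}}
    return customers.get(customer_id, {})

print(get_customer("C-001"))
```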

The Role of the Intermediate  

In virtually every system that we’ve investigated that has achieved any degree of decoupling, we’ve found an “intermediate form.” It is this intermediate form that allows the two systems or subsystems not to be directly connected to each other. 

As shown in Figure (1), they are connected through an intermediary. In the example described above with an API, the signature of the interface is the intermediate. 

What Makes a Good Intermediary?  

An intermediary needs several characteristics to be useful: 

It doesn’t change as rapidly as its clients. Introducing an intermediate that changes more frequently than either the producer or consumer of the service will not reduce change traffic in the system. Imagine a system built on an API which changes on a weekly basis. Every producer and consumer of the services that use the API would have to change along with the API and chaos would ensue. 

It is nonproprietary. A proprietary intermediary is one that is effectively owned and controlled by a single group or small number of vendors. The reason proprietary intermediaries are undesirable is that the rate of change of the intermediary itself has been placed outside the control of the consumer. In many cases, to use the service you must adopt the intermediary of the provider. It should also be noted that in many cases the controller of the proprietary standard has an incentive to continue to change the standard if that can result in additional revenue for upgrades and the like.

It is evolvable. It’s highly unlikely that anyone will design an intermediate form that is correct for all time from the initial design.  Because of this, it’s highly desirable to have intermediate forms that are evolvable. The best trait of an evolvable intermediate is that it can be added on to, without invalidating previous uses of it. We sometimes more accurately call this an accretive capability, meaning that things can be added on incrementally. The great advantage of an evolvable or  accretive intermediary is that if there are many clients and many suppliers using the intermediary they do not have to all be changed in  lockstep, which allows many more options for upgrade and change. 

It is simple to use. An intermediate form that is complex or overly difficult to use will not be used: either other, more varied forms will be adopted, or the intermediate form will be skipped altogether and the benefit lost.

Shared Intermediates  

In addition to the simple reduction in change traffic from having the intermediate be more stable than the components at either end, we also have an advantage in most cases where the intermediate allows reuse of connections. This has been popularized in the Systems Integration business, where people have pointed out time and time again that creating a hub will drastically reduce the number of interfaces needed to supply a system.

In Figure (2), we have an example of what we call the traditional interface math, where the introduction of a hub or intermediate form can drastically reduce the number of interconnections in a system.  

People selling hubs very often refer to this as n(n − 1)/2, or sometimes simply the “n-squared” problem. While this makes for very compelling economics, our observation is that the true math for this style of system is much less generous, but still positive. Just because two systems might be interconnected does not mean that they will be. Systems are not completely arbitrarily divided, and therefore not every interconnection need be accounted for.
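A quick worked comparison of the two growth rates (simple arithmetic, making no claim about any particular system):

```python
# Worst-case point-to-point interfaces, n(n - 1)/2, versus one interface
# per system when everything goes through a hub.
def point_to_point(n: int) -> int:
    return n * (n - 1) // 2

def via_hub(n: int) -> int:
    return n

for n in (5, 10, 20):
    print(f"{n} systems: {point_to_point(n)} point-to-point vs {via_hub(n)} via a hub")
# 10 systems: 45 point-to-point vs 10 via a hub
```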

Figure (3) shows a more traditional scenario where, in the case on the left without a hub, there are many, though nowhere near the maximum possible, interfaces between systems. As the coloring shows, if you change one of those systems, any of the systems it touches may be affected and should at least be reviewed with an impact analysis. In the figure on the right, when the one system is changed, the evaluation is whether the effect spreads beyond the intermediary hub in the center. If it does not, if the system continues to obey the dictates of the intermediary form, then the change effect is, in fact, drastically reduced.

The Axes of Decoupling  

We found in our work that, in many cases, people desire to decouple their systems and even go through the effort of creating intermediate forms or hubs and then build their systems to connect to those intermediate forms. However, as the systems evolve, very often they realize that a change in one of the systems does, in fact, “leak through”  the abstraction in the intermediate and affects other systems. 

In examining cases such as this, we have determined that there are six major considerations that cause systems that otherwise appear to be decoupled to have a secret or hidden coupling. We call these the axes of  decoupling. If a system is successfully decoupled on each of these axes,  then the impact of a change in any one of the systems should be greatly minimized. 

Technology Dependency  

The first axis that needs to be decoupled, and in some ways the hardest,  is what we call technology dependency. In the current state of the practice, people attempt to achieve integration, as well as economy of system operation, by standardizing on a small number of underlying technologies, such as operating systems and databases. The hidden trap in this is that it is very easy to rely on the fact that two systems or subsystems are operating on the same platform. As a result, developers find it easy to join a table from another database to one in their own database if they find that to be a convenient solution. They find it easy  to make use of a system function on a remote system if they know that  the remote system supports the same programming languages, the same  API, etc. 

However, this is one of the most pernicious traps because as a complex system is constructed with more and more of these subtle technology dependencies, it becomes very hard to separate out any portion and re-implement it.

The solution to this, as shown in Figure (4), is to introduce an intermediate form that ensures that a system does not talk directly to another platform. The end result is that each application, subsystem, or service can run on its own hardware, in its own operating system, using its own database management system, and not be affected by changes in other systems. Of course, each system or subsystem does have a technological dependency on the technology of the intermediary in the middle. This is the trade-off: you introduce a dependence on one platform in exchange for being independent of n other platforms. In the current state of the art, most people use what’s called an integration broker to achieve this. An integration broker is a product such as IBM’s WebSphere, TIBCO, or BEA, which allows one application to communicate with another without being aware of, or caring about, what platform the second application runs on.

Destination Dependency  

Even when you’ve successfully decoupled the platforms the two applications rely on, we’ve sometimes observed problems where one  application “knows” of the existence and location of another application or service. By the way, this will become a very “normal problem” as Web services become more popular because the default method of implementing Web services has the requester knowing of the nature and destination of the service. 


In Figure (5), we show this a little more clearly through an example in which two systems have an intermediary. In this case, the distribution and shipping application would like to send messages to a freight application, for instance to get a freight rating or to determine how long it would take to get a package somewhere. Imagine if you were to introduce a new service in the freight area that in some cases handled international shipping, while domestic shipping continued to be done the old way. If we had not decoupled these services, it is highly likely that the calling program would now need to be aware of the difference and make a determination in terms of what message to send, what API to call, where to send its request, etc. The only other defense would be to have yet another service that accepted all requests and then dispatched them; but this is really an unnecessary artifact that would have to be added into a system where the destination intermediary had not been designed in.
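A hedged sketch of the idea in code (all service names invented): the calling application knows only the intermediary, which decides whether the legacy domestic service or the new international service handles the request.

```python
# A destination intermediary: the distribution-and-shipping application only
# knows freight_intermediary(); the intermediary decides which freight service
# handles the request. All service names and rates are invented.
def domestic_freight_rating(request: dict) -> float:
    return 12.50                      # stand-in for the legacy freight system

def international_freight_rating(request: dict) -> float:
    return 48.00                      # stand-in for the new international service

def freight_intermediary(request: dict) -> float:
    """The only endpoint the calling application knows about."""
    if request.get("destination_country", "US") != "US":
        return international_freight_rating(request)
    return domestic_freight_rating(request)

print(freight_intermediary({"destination_country": "US"}))
print(freight_intermediary({"destination_country": "DE"}))
```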

Syntax Intermediary  

Classically, an application programming interface (API) defines very specifically the syntax of any message sent between two systems. For instance, the API specifies the number of arguments, their order, and their type; and any change to any of those will affect any of the calling programs. EDI (electronic data interchange) likewise relies very much on a strict syntactical definition of the message being passed between partners.



In Figure (6), we show a small snippet of XML, which has recently become the de facto syntactic intermediate form. Virtually all new initiatives now use XML as the syntactic lingua franca. As such, any two systems that communicate through XML at least do not have to mediate differences at that syntactic level. Also, fortunately, XML is a  nonproprietary standard and, at least to date, has been evolving very slowly. 

Semantic Intermediary  

Where systems integration projects generally run into the greatest amount of trouble is with semantic differences or ambiguities in the meaning of the information being passed back and forth. Traditionally, we find that developers build interfaces, run them, and test them against live data, and then find that the ways in which the systems have been used do not conform particularly well to the spec. Additionally, in each case the names, and therefore the implied semantics, of all the elements used in the interface are typically different from system to system and must be reconciled. The n-squared way of resolving this is to reconcile every system to every other system, a very tedious process.

There have been a few products and some approaches, as we show very simply and schematically in Figure (7), that have attempted to provide a semantic intermediary. Two that we’re most familiar with are Condivo and Unicorn. Over the long term, the intent of the Semantic Web is to build shared ontologies in OWL, the Web Ontology Language, a derivative of RDF and DAML+OIL. It’s expected that, in time, systems will be able to communicate shared meaning through mutually committed ontologies.
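A hedged sketch of the pattern (not of how Condivo or Unicorn actually work): each system’s local element names are mapped once to a shared concept, so systems reconcile to the hub rather than pairwise to each other. All names are invented.

```python
# Each system's local field names map once to a shared ontology concept.
# Concepts and field names are invented for illustration.
SHARED_ONTOLOGY_MAP = {
    ("OrderEntry", "cust_no"):  "https://example.org/ontology/Customer",
    ("Billing",    "acct_id"):  "https://example.org/ontology/Customer",
    ("OrderEntry", "ship_dt"):  "https://example.org/ontology/ShipmentDate",
    ("Billing",    "inv_date"): "https://example.org/ontology/InvoiceDate",
}

def shared_meaning(system: str, local_field: str):
    """Resolve a system-specific field name to the shared ontology concept."""
    return SHARED_ONTOLOGY_MAP.get((system, local_field))

print(shared_meaning("OrderEntry", "cust_no"))  # both resolve to the same concept
print(shared_meaning("Billing", "acct_id"))
```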

Identity Intermediary

  
A much subtler coupling that we’ve found in several systems is in the use of identifiers. Most systems have identifiers for all the key real-world and invented entities that they deal with. For instance, most systems have identifiers for customers, patients, employees, sales orders, purchase orders, production lines, etc. All of these things must be given unique, unambiguous names. That is not the problem; the problem is that each system has a tendency to create its own identifiers for items that are very often shared. In the real world, there is only one instance of many of these items. There is only one of each of us as individuals, one of each building, one of each corporation, etc. And yet each system tends to create its own numbering system, and when it discovers a new customer it will give it the next available customer number.

In order to communicate unambiguously with a system that has done this, the two main approaches to date have been either to force universal identifiers onto a large number of systems or to store other people’s identifiers in your own system. Both of these approaches are flawed and do not scale well. In the case of the universal identifier, besides having all the problems of attempting to get coverage of the multiple domains, there is the converse problem of privacy. Once people, for instance, are given universal identifiers, it’s very hard to keep information about individuals anonymous. The other approach of storing others’ identifiers in your systems does not scale well because as the number of systems you must communicate with grows, the number of other identifiers that you must store also grows. In addition, there is the problem of being notified when any changes to these identifiers occur.

 

In Figure (8), we outline a new intermediary, which is just beginning to be discussed as a general-purpose service, variously called the identity intermediary or the handle intermediary. The reason we’ve begun shifting away from calling it an identity intermediary is that the security industry has been referring to identity systems, and it does not mean exactly the same thing as what we mean here. Essentially, this is a service where each subscribing system recognizes that it may be dealing with an entity that any of the other systems may have previously dealt with. So there is a discovery piece: systems can discover whether they’re dealing with, communicating with, or aware of any entity that has already been identified in the larger federation. It also acts as a cross-reference, so that each system need not keep track of all the synonymous identifiers or handles in all the other systems. Figure (8) shows a very simple representation of this with two very similar individuals that need to be identified separately. To date, the only system that we know of that covers some of this territory is called ChoiceMaker, but it is not configured to be used in exactly the manner that we show here.
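A hedged sketch of the cross-reference piece of such a service (not ChoiceMaker, and with invented data): each system keeps its own local identifier, while the intermediary maps local (system, id) pairs to one shared handle and offers a discovery lookup.

```python
# A handle intermediary sketch: local (system, id) pairs map to one shared
# handle, with a discovery lookup. All identifiers are invented.
class HandleIntermediary:
    def __init__(self):
        self._to_handle = {}     # (system, local_id) -> handle
        self._next = 1

    def register(self, system: str, local_id: str, handle: str = None) -> str:
        """Attach a local identifier to an existing handle, or mint a new one."""
        if handle is None:
            handle = f"H-{self._next:06d}"
            self._next += 1
        self._to_handle[(system, local_id)] = handle
        return handle

    def discover(self, system: str, local_id: str):
        """Has any federated system already identified this entity?"""
        return self._to_handle.get((system, local_id))

hub = HandleIntermediary()
h = hub.register("CRM", "CUST-8812")            # first system discovers the entity
hub.register("Billing", "AC-4401", handle=h)    # second system links to the same handle
print(hub.discover("Billing", "AC-4401") == h)  # True: same real-world entity
```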

Nomenclature Intermediary  

Very similar to the identity or handle intermediary is the nomenclature intermediary. We separate it because, with the identity intermediary, we’re typically dealing with discovered real-world entities, and the reason we have synonyms is that multiple different systems are “discovering” the same physical, real-world item.

In the case of the nomenclature intermediary, we’re dealing with an invented categorization system. Sometimes categorization systems are quite complex: in the medical industry we have SNOMED, HCPCS, and the CPT nomenclature. But we also have incredibly simple, and very often internally invented, classification systems. Every code file in which we might have, say, seven types of customer, order, or accident that we codify in order to get more uniformity is a nomenclature. What is helpful about having intermediary forms is that they enable multiple systems to either share or map to a common set of nomenclatures or codes.

Figure (9) shows a simple case of how the mapping could be centralized. Again, this is another example where, over the long term, developments in the Semantic Web may be a great help and may provide clearinghouses for the communication between disparate systems. In the meantime, the only example that we’re aware of where a company has internally devoted a lot of attention to this is the Allstate Insurance Co., which has built what it calls a domain management system, in which it has found, catalogued, and cross-referenced over 6,000 different nomenclatures that are in use within Allstate.
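A hedged sketch of centralizing such a mapping (all codes invented): local code values from two systems resolve to one shared nomenclature entry, and translation between systems goes through the hub rather than pairwise.

```python
# A nomenclature intermediary sketch: local codes from two systems map to one
# shared nomenclature, and translation goes through the hub. Codes are invented.
SHARED_NOMENCLATURE = {
    ("ClaimsSystem", "ACC-07"): "accident/vehicle-collision",
    ("PolicySystem", "VC"):     "accident/vehicle-collision",
    ("ClaimsSystem", "ACC-02"): "accident/slip-and-fall",
}

def to_shared_code(system: str, local_code: str):
    return SHARED_NOMENCLATURE.get((system, local_code))

def translate(code: str, from_system: str, to_system: str):
    """Translate a local code from one system to another via the shared concept."""
    shared = SHARED_NOMENCLATURE.get((from_system, code))
    for (system, local), concept in SHARED_NOMENCLATURE.items():
        if system == to_system and concept == shared:
            return local
    return None

print(to_shared_code("PolicySystem", "VC"))            # accident/vehicle-collision
print(translate("ACC-07", "ClaimsSystem", "PolicySystem"))  # VC
```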

Summary  

Loose coupling has been a Holy Grail for systems developers for generations. There is no silver bullet that will slay these problems; however, as we have discussed in this paper, there are a number of specific disciplined things that we can look at as developers, and as we continue to pay attention to these, we will make our systems more and more decoupled, and therefore easier and easier to evolve and change.

Documents, Events and Actions

We have recently been reexamining the weird relationship of “documents” to “events” in enterprise information systems and have surfaced some new insights that are worth  sharing.  

Documents and Events 

Just to make sure we are all seeing things clearly, the documents we’re referring to are  those that give rise to financial change in an enterprise. This includes invoices,  purchase orders, receiving reports and sales contracts. We’re not including other documents like memos, reports, news articles and emails – nor are we focusing on document structures such as JSON or XML.  

In this context, the “events” represent the recording of something happening that has  a high probability of affecting the finances of the firm. Many people call these  “transactions” or “financial transactions.” The deeper we investigated, the more we  found a need to distinguish the “event” (which is occurring in the real world) from the  “transaction” (which is its reflection in the database). But I’m getting ahead of myself  and will just stick with documents and events for this article. 

Documents and Events, Historically 

For most of recorded history, the document was the event, or at least it was the only tangibly recorded interpretation of the event. That piece of actual paper was both the document and the representation of the event. When you wrote up a purchase order  (and had it signed by the other party) you had an event.  

In the 1950s we began computerizing these documents, turning them into a skeuomorph (a design that imitates a real-world object to make it more familiar). The user interfaces looked like paper forms. There were boxes on the top for “ship to” and “bill to” and small boxes in the middle for things like “payment terms” and “free on board.” This was accompanied by line items for the components that made up the bill, invoice, purchase order, timecard, etc.

For the longest time, the paper was also the “source document” which would be entered into the computer at the home office. Somewhere along the way some clever person realized you could start by entering the data into the computer for things you originated and then print out the paper. That paper was then sent to the other party for them to key it into their system.  

Now, most of these “events” are not produced by humans, but by some other computer program. These bill-of-materials processors can generate purchase orders much faster than a room full of procurement specialists. Many industries now consider these “events” to be primary. The documents (if they exist at all) are part of the audit trail. Industries like healthcare long ago replaced the “superbill” (a document on a clipboard with three dozen check boxes to represent what the physician did to you on that visit) with 80 specific types of HL7 messages that ricochet back and forth from provider to payer.

And yet, even in the 21st century, we still find ourselves often excerpting facts from  unstructured documents and entering them into our computer systems. Here at  Semantic Arts, we take the contracts we’ve signed with our clients and scan them for the tidbits that we need to put into our systems (such as the budgets, time frame,  staffing and billing rates) and conveniently leave the other 95% of the document in a file somewhere.  

Documents and Events, what is the difference?

So for hundreds of years, documents and events were more or less the same thing. Now they have drifted apart. In today’s environment, the real question is not “what’s the difference” but rather “which one is the truth.” In other words, if there is a difference, which one do we use? There is not a one-size-fits-all answer to that dilemma. It varies from industry to industry.

But I think it’s fairly safe to say the current difference is that an “event” is a structured  data representation of the business activity, while a “document” is the unstructured  data representation. Either one could have come first. Each is meant to be the reflection of the other.  

The Event and the Transaction 

The event has a very active sense to it because it occurs at a specific point in time. We therefore record it in our computer system and create a transaction, which updates our database as of the posting date and the effective accounting date.

The transaction and the event often appear to be the same thing, partly because so  many events terminate in the accounting department. But, in reality, the transaction is adding information to the event that allows it to be posted. The main information that is being added is the valuation, the classification and the effective dates. Most people enter these at the same time they capture the event, but they are distinct. The distinction is more obvious when you consider events such as “issuing material” to a production order. The issuer doesn’t know what account number should be charged,  nor do they know the valuation (this is buried in an accounting policy that determines  whether to cost this widget based on the most recent cost, the oldest cost or the average cost of widgets on hand.) So the “transaction” is different from the “event” even if they occur at the same time.  
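A hedged sketch of that distinction as data structures (names and numbers invented): the event records what happened, and the transaction wraps the event with the valuation, classification, and effective date that accounting adds, possibly later and from policy.

```python
# The event records what happened; the transaction adds what accounting
# contributes: valuation, classification, and effective date. Names invented.
from dataclasses import dataclass
from datetime import date

@dataclass
class MaterialIssueEvent:            # what happened in the real world
    item: str
    quantity: int
    occurred_on: date

@dataclass
class Transaction:                   # the event plus what accounting adds
    event: MaterialIssueEvent
    valuation: float                 # from costing policy (FIFO, LIFO, average cost)
    account: str                     # classification, e.g. a GL account number
    effective_date: date             # accounting effective date, may differ

issue = MaterialIssueEvent(item="widget", quantity=10, occurred_on=date(2024, 3, 4))
txn = Transaction(event=issue, valuation=42.50, account="5010-WIP",
                  effective_date=date(2024, 3, 31))
print(txn)
```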

Until fairly recently, administrators wouldn’t sit at their computer and enter invoices until they were prepared for them to be issued. Most people wait until they ship the  widget or complete the milestone before they key in the invoice data and email it to  their customer. In this circumstance, the event and the transaction are cotemporaneous – they happen at the same time. And the document being sent to the  customer follows shortly thereafter.  

One More Disconnect  

We are implementing data-centric accounting at Semantic Arts and have disconnected the “event” (that is, the structured data representation) from its classification as an event. We realized that as soon as we had signed a contract, we knew at least one of the two aspects of our future invoices, and in many cases we knew both. For fixed-price projects, we knew the amount of the future invoices. The only thing we didn’t know was when we could invoice them, because that was based on the date of some given milestone. For time-and-materials contracts we know the dates of our future invoices (often the end of the month) but don’t know the amount. And for our best-efforts contracts we know the dates and the amounts and adjust the scope to fit.

But knowing these things and capturing them in our accounting system creates a problem. They weren’t actually real yet (or at least they weren’t real enough to be  invoices). The sad thing was they looked just like invoices. They had all the data, and it was all valid. They could be rendered to pdfs, and even printed, but we knew we couldn’t send all the invoices to our client all at once. So we now had some invoices in  our system that weren’t really invoices, and didn’t have a good way to make the  distinction.  

As we puzzled over this, we came across a university that was dealing with the same challenge. In their case they were implementing “commitment accounting,” which tries to keep track of the commitments (purchase orders mostly) that are outstanding as a way to prevent overrunning budgets. As people entered their purchase orders (structured records, as we’ve been describing them), the system captured and tallied them as events. In order to get the system to work, people entered purchase orders long before they were approved. In fact, you have to enter them to get an event (or a document) that can be approved and agreed to by your vendor.

The problem was many of these purchase order events never were approved. The  apparent commitments vastly exceeded the budgets, and the whole system was shut  down.  

Actions 

We discovered that it isn’t the document, and it isn’t even the event (if we think of the event as the structured data record of the business event) that makes the financial effect real. It is something we are now calling the “action,” or really a special type of  “action.” 

There is a magic moment when an event, or perhaps more accurately a proto-event, becomes real. On a website, it is the “buy” button. In the enterprise, it is often the “approval” button.

As we worked on this, we discovered it is just one of the steps in a workflow. The  workflow for a purchase order might start with sourcing, getting quotes, negotiating, etc. The special step that makes the purchase order “real” isn’t even the last step.  After the purchase order is accepted by the vendor, we still need to exchange more  documents to get shipping notifications, deal with warranties, etc. It is one of those steps that makes the commitment. We are now calling this the “green button.” There is one step, one button in the workflow progression that makes the event real. In our internal systems we’re going to make that one green, so that employees know when they are committing the firm.  

Once you have this idea in your head, you’ll be surprised how often it is missed. I go on my bank’s website and work through the process of transferring money. I get a number of red buttons, and with each one, I wonder, “Is this the green one?” Nope, one more step before we’re committed. Same with booking a flight. There are lots of purple buttons, but you have to pay a lot of attention before you notice which one of those purple buttons is really the green one.

Promotion 

And what does the green button in our internal systems do? Well, it varies a bit,  workflow to workflow, but in many cases it just “promotes” a draft item to a committed one. 

In a traditional system you would likely have draft items in one table and then copy them over to the approved table. Or you might have a status and just be careful to  exclude the unapproved ones from most queries.  

But we’ve discovered that many of these events can be thought of as subtypes of their  draft versions. When the green button gets pressed in an invoicing workflow, the draft invoice gains another triple, which makes it also an approved or a submitted invoice – in addition to its being a draft invoice.  
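A hedged sketch of that promotion (the namespace is invented, not our actual schema): pressing the green button adds a single rdf:type triple, and the same resource is then both a draft and an approved invoice.

```python
# Promotion by adding one triple: approving an invoice adds another rdf:type,
# so the resource is both a DraftInvoice and an ApprovedInvoice.
# The namespace is invented for illustration.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

ACCT = Namespace("https://example.org/accounting/")

g = Graph()
g.bind("acct", ACCT)
g.add((ACCT.Invoice_2024_017, RDF.type, ACCT.DraftInvoice))

def press_green_button(graph, invoice):
    """Promote a draft: the commitment is recorded as a single added triple."""
    graph.add((invoice, RDF.type, ACCT.ApprovedInvoice))

press_green_button(g, ACCT.Invoice_2024_017)

print((ACCT.Invoice_2024_017, RDF.type, ACCT.DraftInvoice) in g)     # True
print((ACCT.Invoice_2024_017, RDF.type, ACCT.ApprovedInvoice) in g)  # True
```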

Summary 

We in the enterprise software industry have had a long history of conflating documents  and events. Usually we get away with it, but occasionally it bites us.  

What we’re discovering now, with the looming advent of data-centric accounting, is the need not only to distinguish the document from the event but also to distinguish the event (as a structure) from the action that enlivens it. We see this as an important step in the further automation of direct financial reporting.

gist: Buckets, Buckets Everywhere:  Who Knows What to Think

We humans are categorizing machines, which is to say, we like to create metaphorical buckets and put things inside. But there are different kinds of buckets, and different ways to model them in  OWL and gist. The most common bucket represents a kind of thing, such as Person or Building.  Things that go into those buckets are individuals of those kinds, e.g. Albert Einstein, or the particular office building you work in. We represent this kind of bucket as an owl:Class and we use rdf:type to put something into the bucket. 

Another kind of bucket is when you have a group of things, like a jury or a deck of cards, that are functionally connected in some way. Those related things go into the bucket (12 members of a jury, or 52 cards). We have a special class in gist called Collection for this kind of bucket. A specific bucket of this sort will be an instance of a subclass of gist:Collection. E.g. OJs_Jury is an instance of the class Jury, a subclass of gist:Collection. We use gist:memberOf to put things into the bucket. Convince yourself that these buckets do not represent a kind of thing. A jury is a kind of thing; a particular jury is not. We would use rdf:type to connect OJ’s jury to the owl:Class Jury, and use gist:memberOf to connect the specific jurors to OJ’s jury.

A third kind of bucket is a tag which represents a topic and is used to categorize individual items for the purpose of indexing a body of content. For example, the tag “Winter” might be used to index photographs, books and/or YouTube videos. Any content item that depicts or relates to winter in some way should be categorized using this tag. In gist, we represent this in a way that is structurally the same as how we represent buckets that are collections of functionally connected items. The differences are 1) the bucket is an instance of a subclass of gist:Category, rather than of gist:Collection, and 2) we put things into the bucket using gist:categorizedBy rather than gist:memberOf. The Winter tag is essentially a bucket containing all the things that have been indexed or categorized using that tag.

Below is a summary table showing these different kinds of buckets, and how we represent them in  OWL and gist.

Kind of Bucket | Example | Representing the Bucket | Putting Something in the Bucket
Individual of a kind | John Doe is a Person | Instance of owl:Class | rdf:type
A bucket of functionally connected things | Sheila Woods is a member of OJ's Jury | Instance of a subclass of gist:Collection | gist:memberOf
An index term for categorizing content | The book "Winter of Our Discontent" has Winter as one of its tags | Instance of a subclass of gist:Category | gist:categorizedBy
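To see the three patterns side by side in data, here is a small sketch in Python with rdflib. The instance IRIs are invented, and the gist namespace IRI shown may differ from the gist version you use.

```python
# The three bucket patterns as triples. Instance IRIs are invented; the gist
# namespace IRI is an assumption and may differ by gist version.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX   = Namespace("https://example.org/")
GIST = Namespace("https://w3id.org/semanticarts/ns/ontology/gist/")

g = Graph()
g.bind("ex", EX)
g.bind("gist", GIST)

# 1. Individual of a kind: owl:Class + rdf:type
g.add((EX.JohnDoe, RDF.type, EX.Person))

# 2. Functionally connected group: a subclass of gist:Collection + gist:memberOf
g.add((EX.Jury, RDFS.subClassOf, GIST.Collection))
g.add((EX.OJs_Jury, RDF.type, EX.Jury))
g.add((EX.SheilaWoods, GIST.memberOf, EX.OJs_Jury))

# 3. Index term: a subclass of gist:Category + gist:categorizedBy
g.add((EX.Tag, RDFS.subClassOf, GIST.Category))
g.add((EX.Winter, RDF.type, EX.Tag))
g.add((EX.WinterOfOurDiscontent, GIST.categorizedBy, EX.Winter))

print(g.serialize(format="turtle"))
```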


Morgan Stanley: Data-Centric Journey 

Morgan Stanley has been on the semantic/data-centric journey with us for about six years. Their approach is the adoption of an RDF graph and the development of a semantic knowledge base to help answer domain-specific questions, formulate classification recommendations and deliver quality search to their internal users. Their primary objective is to enable the firm to retrieve, retain and protect information (i.e., where the information resides, how long it must be maintained and what controls apply to it).

The knowledge graph is being developed by the Information Management team under the direction of Nic Seyot (Managing Director and Head of Data & Analytics for Non-Financial  Risk). Nic is responsible for the development of the firm-wide ontology for trading surveillance, compliance, global financial crime and operational risk. Nic’s team is also helping other departments across the firm discover and embrace semantic data modeling for their own use cases.  

Morgan Stanley has tens of thousands of discrete repositories of information. There are many different groups with specialized knowledge about the primary objectives as well as many technical environments to deal with. Their motivating principle is to understand the  conceptual meaning of the information across these various departments and  environments so that they can answer compliance and risk questions.  

A good example is a query from a user about the location of sensitive information (with many conflicting classifications) and whether they are allowed to share it outside of the firm. The answer to this type of question involves knowledge of business continuity, disaster recovery, emergency planning and many other areas of control. Their approach is to leverage semantic modeling, ontologies and a knowledge graph to be able to comprehensively answer that question.

To build the knowledge graph around these information repositories, they hired Semantic  Arts to create a core ontology around issues that are relevant to the entire firm – including personnel, geography, legal entities, records management, organization and a number of firm-wide taxonomies. Morgan Stanley is committed to open standards and W3C principles which they have combined with their internal standards around quality governance. They created a Semantic Modeling and Ontology Consortium to help govern and maintain that core ontology. Many divisions within the firm have joined the advisory board for the consortium and it is viewed as an excellent way of facilitating cooperation between divisions.

The adoption-based principle has been a success. They have standardized ETL and virtualization to get information structured and into their knowledge graph. The key use case is enterprise search: giving departments the ability to search for their content by leveraging the tags, lists, categories and taxonomies they use as facets. One of the key benefits is an understanding of the network of concepts and terms, as well as how they relate to one another within their organization.

Semantic Arts ontologists helped engineer the network of concepts that are included in their semantic thesaurus, as well as how they interconnect within the firm. They started out with over 6,500 policies and procedures as a curated corpus of knowledge of the firm. They used natural language processing to extract the complexity of relationships out of their combined taxonomies (over half a million concepts). We worked with them to demonstrate the power of conceptual simplification. We helped them transform these complex relationships into broader, narrower and related properties, which enable users to ask business questions in their own context (and acronyms) to enhance the quality of search without manual curation. Our efforts helped reduce the noise, merge concepts with similar meaning and identify critical topics to support complex queries from users.
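As a hedged illustration of that simplification (not Morgan Stanley's actual model or data), the broader/narrower/related pattern can be expressed with SKOS and queried to expand a user's search term to its neighbors:

```python
# Recasting taxonomy relationships as broader / narrower / related properties
# using SKOS, then expanding a search term to its neighbors. Concepts invented.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

TX = Namespace("https://example.org/taxonomy/")

g = Graph()
g.bind("tx", TX)
g.bind("skos", SKOS)

for concept, label in [(TX.Records, "Records Management"),
                       (TX.RetentionSchedule, "Retention Schedule"),
                       (TX.LegalHold, "Legal Hold")]:
    g.add((concept, RDF.type, SKOS.Concept))
    g.add((concept, SKOS.prefLabel, Literal(label, lang="en")))

g.add((TX.RetentionSchedule, SKOS.broader, TX.Records))
g.add((TX.LegalHold, SKOS.related, TX.RetentionSchedule))

# Expand a query term to its broader and related concepts for faceted search:
q = """
SELECT ?neighbor WHERE {
  { tx:RetentionSchedule skos:broader ?neighbor }
  UNION
  { ?neighbor skos:related tx:RetentionSchedule }
}
"""
for row in g.query(q, initNs={"tx": TX, "skos": SKOS}):
    print(row.neighbor)
```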

Chemical Manufacturer: Faceted Taxonomies 

Capturing the interrelations of information for relevance can be difficult, even with NLP. More often, companies will seek to work in taxonomy space on their journey toward richer implementations of knowledge graphs for automation adoption. Our consulting services leveraged this approach to provide a foundational stepping stone as the company sought to bring knowledge graph capabilities into its business.

This global manufacturer had a sluggish system in place to comb through internet publications and look for key terms that might mark articles of interest to its divisions, both for competitive intelligence and as a spawning point for innovative ideas. However, the process remained heavily manual and cumbersome. They realized that strong text matching and analysis was a missing component and decided to turn to taxonomies to improve the process.

Semantic Arts quickly discovered that the key to success was faceted taxonomies. We worked with SMEs to determine which areas contained specific controlled vocabularies and specialized terminology. As a starting point, Semantic Arts created a series of taxonomies for each area for improved automation. Areas included: Products, Industries, Customers, Capabilities, Manufacturers, Materials and Processes.

The tight focus of each facet allowed SMEs and division experts to create very specific lists of terms. By using preferred labels and alternate labels (synonyms) for each, SA defined what could be recognized and matched in a desired internet corpus. Initial implementation of the facets showed a higher level of matching to recognized terms of interest than the NLP algorithm had achieved, created higher confidence in the significance of the matches, and left out many common or “stop” terms that the original method still picked up. The first efficiency gains were realized.
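A hedged sketch of how preferred and alternate labels can drive that matching (facets and terms invented; a production version would use tokenization and word boundaries rather than raw substring tests):

```python
# Match article text against a faceted taxonomy's preferred and alternate
# labels (synonyms). Facets and terms are invented for illustration.
FACETS = {
    "Materials": {
        "polyethylene": ["PE", "polythene"],
        "epoxy resin": ["epoxy"],
    },
    "Processes": {
        "injection molding": ["injection moulding"],
    },
}

def match_terms(text: str):
    """Return (facet, preferred label) pairs whose labels appear in the text."""
    hits, lowered = [], text.lower()
    for facet, terms in FACETS.items():
        for pref, alts in terms.items():
            if any(label.lower() in lowered for label in [pref, *alts]):
                hits.append((facet, pref))
    return hits

article = "The study covers injection moulding of PE components."
print(match_terms(article))  # [('Materials', 'polyethylene'), ('Processes', 'injection molding')]
```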

Semantic Arts developed a more extended road map with the manufacturer to first refine and bulk up the taxonomy lists based on continued implementation and analysis. From there, the client intends to apply a simple semantic layer to relate and interconnect the taxonomy facets. This ontology model will allow even richer inferencing and matching of results based on relationships between terms (i.e., an article about a specific product will imply the involvement of certain manufacturers even if they are not explicitly mentioned).

In this case, a small, focused step into taxonomy re-classification is helping to open up more understanding of the broader benefit, while allowing for faster delivery of more pinpointed research answers. In addition, building pipelines of connected unstructured information is consistent with organizational goals of harmonizing data for greater strategic value. Divisions in other parts of the enterprise have taken notice, and there is expressed interest in leveraging the unique reusability and interoperability that semantic capabilities enable after this initial pilot.

Investment Bank Case Study: Operational Risk 

At this major investment bank, managing all the flavors of operational risk had become very balkanized. There are separate systems for process management, risk identification, controls, vendor risks, cyber risks, outsourced risks, fraud, internal incidents, external incidents, business continuity, disaster recovery, inter-affiliate risk and many more.

To address this, we created an elegant ontology that captured all these aspects of risk. We then, one by one, extracted and conformed their existing information into this shared model.

We managed to catch the rewrite of a control library in mid-stream and get them to persist the key information directly to a triple store. The mappings have been ported into production, and we built (in TARQL) the capability to create a unified view of the information systems that feed risk evaluation metrics. Additionally, an interactive graphics capability has been built directly on the triplestore for visualization across the risk portfolio.

Investment Bank: Resolution Planning 

This is one of the “too big to fail” banks, which are required by regulators to implement “resolution planning” or, as it’s known on the street, a “living will.” The first few generations of the resolution plan were long on lengthy textual descriptions of the nature of the interactions between the various legal entities within the bank.

Our sponsor recognized that the key to making a resolution plan workable is to make it data-driven rather than document-driven. Document-driven resolution plans are out of date as soon as they are written and require humans to read and interpret them. While the firm, as with most large financial services firms, consists of thousands of legal entities, there are “only” a few dozen that are significant from a resolution standpoint. However, this is made more complex because hundreds of departments may, and do, have service relationships with their peers in other countries and time zones. Often these arrangements are tacit rather than spelled out, and even those that are written fall far short of the regulators’ desire to see specific mechanisms for controlling the work and assuring it gets completed.

We based this project on the concept of Inter-affiliate Service Level Agreements. We designed an ontology of Service Level Agreements and in the course of four months iterated it through eight versions as we learned more and more about the specifics of getting a new system designed and built. 

In addition to (and in parallel with) the ontology development, we built an operational system using our model-driven development environment. We populated a triple store with data sourced from many of their existing systems (HR for personnel and departments, finance for legal entities and jurisdictions, IT for applications, hosting and data centers, and the activity taxonomy from the project we had performed the previous year). On top of this we built user interfaces that allowed managers to document the agreements that were in place between themselves and other departments in other legal entities.

We completed the project in time to demo it to the regulators, and it is now being used as the basis for their go-forward Resolution Plan.

LexisNexis: Enterprise Ontology

We worked with this leading provider of legal and medical knowledge to build an enterprise  ontology for their wide-ranging content. In addition to building an ontology for their case law and statutory product lines, we worked with their Master Data Management Initiatives.  

They have over 30 MDMs in various stages of development with logical data models. These models (and therefore the MDMs themselves) were integrated manually, in a somewhat ad-hoc fashion. We built tooling to convert their existing logical models into a single integrated ontology, where the integration points were far more obvious. From there, we built tooling to convert the ontology back to a set of similar, but now conformed, logical data models.

Contact Us: 

Overcome integration debt with proven semantic solutions. 

Contact Semantic Arts, the experts in data-centric transformation, today! 

CONTACT US HERE 

Address: Semantic Arts, Inc. 

123 N College Avenue Suite 218 

Fort Collins, CO 80524 

Email: [email protected] 

Phone: (970) 490-2224