Six Enterprise Knowledge Graph Anti-Patterns

Anti-pattern #1 — Agreeing with the Status Quo

Anti-pattern #2 — Fad Surfing

Anti-pattern #3 — Too Small

Anti-pattern #4 — Too Big

Anti-pattern #5 — Data Governance

Anti-pattern #6 — Data Hoarding


Originally posted at Medium.com

Semantic Messages

Interfaces and Interactions

Too Much Specificity And Not Enough Play

I recently saw this tweet and it reminded me about something I’ve wanted to think and talk about.


Satnam continues

configuration management has not had the attention enjoyed by academic research for languages and networking, as well as language and networking innovations in industry.

I don’t think a “configuration language” is the solution, nor is a domain specific language / library (DSL).

I tend to agree. I think perhaps we should explore more loosey-goosey, declarative approaches. That is, I’d like to explore systems with more play (as in “scope or freedom to act or operate”).

I’d like to see more semantic messages that convey the spirit rather than the letter. When you can’t foresee all the consequences of the letter then that’s when the spirit can help.

That’s what I’d like to think about in this post.

Let’s see an example of such a loosey-goosey semantic message.

Semantic Messages

I’m writing another blog post on what “semantic” means in the semantic web. I’ll put a link here once I am done, but in the meantime think of “semantic” as getting different things (different people, different machines, people and machines, etc.) to see eye to eye. Yes, a tall order, but I’m optimistic about it.

The hypothetical situation is that I have an instance of Apache Jena Fuseki (a database for RDF) running on my local machine. There is a software agent (semantic web style) running on my local machine that knows how to interact with Apache Jena Fuseki. I am also running my own software agent (semantic web style), to which I make requests.

I have a file on my machine that I want to load into a dataset on the Apache Jena Fuseki instance. I type this request to my agent “load /mnt/toys/gifts.ttl into Apache Jena Fuseki listening on port 3030 at dataset ‘gifts’ on 25 Dec early in the morning.”

My agent produces the following RDF (or I do by some other means) in TriG serialization:

@prefix : <https://example.com/> .
@prefix gist: <https://ontologies.semanticarts.com/gist/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix schema: <http://schema.org/> .

:message0 a gist:Message ;
  gist:comesFromAgent [ gist:name "Justin Dowdy" ;
                        gist:hasAddress [ gist:containedText "[email protected]" ] ] ;
  gist:isAbout :message0content .

:message0content a gist:Content, :NamedGraph ;
  rdfs:comment "this named graph is the content of the message" .

:message0content {
    :message0content gist:hasGoal :goal0 .
  :goal0 a gist:Goal ;
    rdfs:comment "this is the goal specified in the content of the message" ;
    gist:isAbout :goal0content .
}

:goal0content a gist:Content , :NamedGraph ;
  rdfs:comment "this named graph is the content of the goal" .

:goal0content {
  [ a gist:Event ;
    gist:produces [ a gist:Content ;
                    gist:isBasedOn [ a gist:FormattedContent ;
                                     gist:hasAddress [ gist:containedText "file:///mnt/toys/gifts.ttl" ] ;
                                     gist:isExpressedIn [ a gist:MediaType ;
                                                          schema:encodingFormat "application/turtle" ] ] ;
                    gist:isPartOf [ a gist:Content ;
                                    gist:name "gifts" ;
                                    rdfs:comment 'the dataset called "gifts"' ;
                                    gist:isPartOf [ a gist:System ;
                                                    gist:hasAddress [ gist:containedText "http://127.0.0.1:3030" ] ;
                                                    gist:name "Apache Jena Fuseki" ] ] ] ;
   gist:plannedStartDateTime "2022-12-25T01:00:00Z"^^xsd:dateTime ]
}

Side Note

You might notice that I’ve used the URI of an RDF named graph in the place where a resource would typically be expected. With this blog post I am also thinking about using named graphs to represent the content of goals (gist:Goal). Really a named graph could represent the content of many different types of things.

Back to the semantic message example

My agent then puts that RDF onto the semantic message bus (the bus where agents listen for and send RDF) on my local machine. The agent that governs Apache Jena Fuseki sees the RDF and recognizes that it knows how to handle the request.

The Fuseki agent that interprets that RDF needs to know things like:

  • that it is capable of and allowed to handle requests to load data into the Apache Jena Fuseki running on localhost at port 3030
  • how to use GSP or some other programmatic method to load data into Fuseki
    • how to reference a dataset, or optionally create one if the desired one does not exist (see the sketch just after this list)
  • how to delay the execution of this (since the gist:plannedStartDateTime is in the future)
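
For the dataset sub-bullet, here is one possible sketch: Fuseki exposes an HTTP administration protocol, so (assuming the admin endpoint is enabled on the same port) an agent could check for the dataset and create it roughly like this:

# does the 'gifts' dataset exist? (200 if yes, 404 if not)
curl -s -o /dev/null -w '%{http_code}' 'http://127.0.0.1:3030/$/datasets/gifts'
# create it if needed (TDB2 chosen arbitrarily for this sketch)
curl -X POST 'http://127.0.0.1:3030/$/datasets' --data 'dbName=gifts&dbType=tdb2'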

My agent needs to know things like:

  • it is allowed to make assumptions
    • e.g. if I reference a date without a year when talking about a goal, I probably mean the next occurrence of that date
  • it can look in my existing graphs (perhaps my “personal knowledge graphs”) to gather information

Fuseki’s agent can’t be too finicky about interpreting the RDF. The RDF isn’t really a request conforming to a contract; it is more of a spirit of a request.

If you are familiar with RDF and gist, the spirit of the RDF is pretty clear: “early in the morning on December 25th find the file /mnt/toys/gifts.ttl and load it into the dataset ‘gifts’ on the Apache Jena Fuseki server running on localhost at port 3030.”

If the agent saw this message or a similar message but it knew the message content wasn’t sufficient for it to do anything then it would reply, by putting RDF onto the semantic message bus, with the content of another goal as if to say “did you mean this?” There could be a back and forth between my agent and the agent governing Apache Jena Fuseki as my agent figures out how to schedule the ingestion of that data.

But this time Fuseki’s agent knew what to do. It runs the following command:

at 1am Dec 25 <<~
curl -X POST 'http://127.0.0.1:3030/gifts/data' -H 'Content-type: text/turtle' --data-binary @/mnt/toys/gifts.ttl
~

My agent receives some confirmation RDF and the gifts should be available, via SPARQL, before the kids wake up on Christmas morning.

The Article

In this post I’m mostly sketching out some of the consequences of the ideas presented in this 2001 Scientific American article.

Standardization can only go so far, because we can’t anticipate all possible future needs.

Right on.

The Semantic Web, in contrast, is more flexible. The consumer and producer agents can reach a shared understanding by exchanging ontologies, which provide the vocabulary needed for discussion.

I’m less optimistic that we’ll sort out useful ontology exchange anytime soon. In the meantime I think picking a single upper ontology that is squishy in the right ways is a path forward.

Semantics also makes it easier to take advantage of a service that only partially matches a request.

I think for semantics to work in this way we have to accept a trade: our systems become more adaptive at the cost of losing the guarantees that brittleness provides.

Brittle:

  • by design, shouldn’t ever be wrong
  • when it sees something unexpected it stops or breaks

Adaptive:

  • by design, could be wrong
  • when it sees something unexpected it tries to figure it out

That might be hard for people to accept. Perhaps it is why we haven’t progressed much on this kind of agent since the 2001 article.

Closing

I haven’t sketched everything out. For example, what if the command fails on 25 Dec because the file is missing? I’d expect the Fuseki agent to tell my agent. Also maybe my agent could periodically check that the file is accessible and report back to me if it isn’t.

Anyway, I imagine you get the idea.

I do think a requirement of semantic message buses is that all agents must have the same world view and speak the same language. Ontologies set the world view and language. I used the gist upper ontology for my example.

Maybe make an agent! Or let me know what you think about this stuff.

The Data-Centric Revolution: OWL as a Discipline

Many developers pooh-pooh OWL (the dyslexic acronym for the Web Ontology Language). Many decry it as “too hard,” which seems bizarre, given that most developers I know pride themselves on their cleverness (and, as anyone who takes the time to learn OWL knows, it isn’t very hard at all). It does require you to think slightly differently about the problem domain and your design. And I think that’s what developers don’t like. If they continue to think in the way they’ve always thought, and try to express themselves in OWL, yes, they will and do get frustrated.

That frustration might manifest itself in a breakthrough, but more often it manifests itself in a retreat. A retreat perhaps to SHACL, but more often the retreat is more complete than that: to not doing data modeling at all. By the way, this isn’t an “OWL versus SHACL” discussion; we use SHACL almost every day. This is an “OWL plus SHACL” conversation.

The point I want to make in this article is that it might be more productive to think of OWL not as a programming language, not even as a modeling language, but as a discipline. A discipline akin to normalization.

Keep reading: The Data-Centric Revolution: OWL as a Discipline – TDAN.com

Read more of Dave’s articles: mccomb – TDAN.com

The Greatest Sin of Tabular Data

We recently came across this great article titled “The greatest sin of tabular data”. It is an excellent summary of the kind of work we do for our clients and how they benefit.

You can read it at The greatest sin of tabular data · A blog @ nonodename.com

The journey of capturing the meaning of data is an elusive process. If 80% of data science is simply data wrangling, how can we do better at actually providing value by making sense of that data?

With a disciplined approach and by leveraging RDF capabilities, Semantic Arts can help create clear, well-defined data, saving time and money and driving true value instead of getting bogged down in simply trying to understand the data.

As stated by the author, “We can do better!”

Reach out to Semantic Arts today to see how we can help.

Original article at nonodename.com, shared by Dan Bennett via a LinkedIn post.

Software Development process expressed in a Knowledge Graph

More often than not we receive pushback that RDF (the standard model for data exchange on the web) is difficult to use. However, the simplicity of triples (Subject – Predicate – Object) and of composing queries that read close to written language make this form ideal for technical business users, capability owners, and data scientists alike. Because they possess deep business knowledge, business users’ inquisitive nature results in an endless list of questions to answer. Why wait weeks for an engineer to get to the request when greater accessibility can be achieved with semantic web capabilities?

We bring this idea to bear by making some thoughtful RDF from a git repository, capturing those answers in a form that allows for richer interrogation.

Motivation
A while back cURL’s creator, Daniel Stenberg, tweeted some stats on cURL’s git repository.

I have a pretty good idea about how he answered those questions. I bet he used tools like sed, awk, and grep. If I had to answer those questions I too might use those CLI utilities with a throwaway shell pipeline. But I wondered what it would be like to answer those questions semantic web style.

What
In order to answer questions semantic web style you first have to find or make some thoughtful RDF. I say “thoughtful” because it is possible, though mostly not desirable, to use the semantic web stack (RDF/SPARQL/OWL/SHACL, etc.) without doing much domain modeling.

In our case we can easily get some structured data to start with. Here is a git commit I just made:

Notice how compact that representation is. The meat of that text is the unified output format of the diff tool. If you work with git much you probably recognize what most of that is. But the semantic web isn’t about just allowing you to work with data you already know how to decipher. To participate in the semantic web we need to unpack this compact application-centric representation into a data-centric representation so that others don’t need to do the deciphering. In the semantic web we want data to wear its meaning on its sleeve.

That compact representation is fine for the diff and patch tools but doesn’t really check any of these boxes:

Ok, I ran my conversion tool on that commit and it transformed that representation into a thoughtful RDF graph.

Let’s take a look (using RDFox’s graph viz):

You’ll notice that commits have parts: hunks.

Those hunks, when applied, produce contiguous lines.

Those contiguous lines:

  • occur in a text file with a name
  • are identified by a line number
  • have a magnitude with a unit of measure (line count)
  • and have the literal contained text

Note that I’ve used Wikidata entities because Wikidata is a nice hub in the semantic web. Here is a Wikidata subgraph with labels that are relevant for the RDF I’ve produced:

By the way, don’t let those Q numbers scare you. I don’t memorize them (well I do have wd:Q2 memorized since it is pretty special). I use auto completion in my text editor and Wikidata has it here too. You just type wd: then press control-enter then type what you want. Also, Wikidata has some good reasons for using opaque IRIs.

Here is the RDF graph (the same one as in the image above) in turtle serialization:

Here is another commit:

And the RDF:

You’ll notice that this commit does a little more. The hunk produces contiguous lines as before. The hunk also affects contiguous lines. That is because this commit does not add a new file; it changes an existing file by replacing some contiguous lines with some other contiguous lines.

Why
At this point maybe you’re wondering why the data isn’t more “direct.” The RDF seems to spread things out and use generic predicates (produces, occurs in, etc.). That is intentional.

My conversion utility does use some intermediate “direct” data:


But that data does not snap together with other data like RDF does. It does not have formal semantics. It has not unpacked the meaning of the data. It is more like an ad hoc projection of data. It is not something I would want to pass around between applications.

There are some nice things about using RDF to express the content of a git repository. This is not a comprehensive list but rather just stuff that I thought of while doing this project:

(1)

You can start anywhere with queries.

If you want to find all things with names you just:

select * where {
?s gist:name ?name .
}

If you want to find all files with names:

select * where {
?s a wd:Q86920 .
?s gist:name ?name .
}

You don’t need to know structurally where these “fields” live.

(2)

You define things in terms of more primitive things.

For example, if you look on Wikidata you’ll see that commit is defined in terms of changeset and version control. Hunk is defined in terms of diff unified format and line.

Eventually definitions bottom out in really primitive things that aren’t defined in terms of anything else.

One of the reasons this is helpful is that you can query against the more primitive things and get back results containing more composite things (built up from the more primitive things).

(3)

You are encouraged (if you use a thoughtful upper ontology such as Gist) to unpack meaning.

I think of the semantic web as something like the exploded part diagram for the web’s data.

Yes, it takes up more space than a render of the fully assembled thing, but all the components you might want to talk about are addressable and their relationship to other components is evident.

One example of how not unpacking makes question answering harder is how Wikidata packs postal code ranges into a single string with an en dash (–).

If you query Wikidata to see which region has postal code “10498” allocated to it, you won’t find any results. Instead you’ll have to write a query that unpacks the range yourself (some “postal codes” are really a range of postal codes designated with an en dash): get the start and stop values, then check whether your code falls within the range, or enumerate the range and do a where-in, or something similar.
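
For the curious, here is a rough sketch of the kind of workaround query meant here (run against the Wikidata endpoint; wdt:P281 is Wikidata’s postal code property, and the en dash parsing is illustrative rather than a tested recipe):

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?region WHERE {
  ?region wdt:P281 ?pc .
  FILTER(CONTAINS(?pc, "–"))                       # packed range like "10400–10499"
  BIND(xsd:integer(STRBEFORE(?pc, "–")) AS ?low)
  BIND(xsd:integer(STRAFTER(?pc, "–")) AS ?high)
  FILTER(?low <= 10498 && 10498 <= ?high)
}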

If you require users to unpack all your representations before they use them then maybe they’ll lose interest and move on to something else.

A thoughtful ontology will help you carve the world at its joints, putting points of articulation between things, by having a thoughtful set of generic predicates. You might not be using a thoughtful ontology if you can connect any two arbitrary things with a single edge.

The unified output format for diff works well for the git and patch programs but not for humans asking questions.

Sure, unpacked representations mean more data (triples) but the alternatives (application-centric data, LPGs/RDF-Star, etc.) are like bodge wires:

They are acceptable for your final act, maybe, but not something you’d want to build upon.

(4)

RDF allows for incremental enrichment.

As a follow up to this project I think it would be interesting to transform CWEs (Common Weakness Enumeration) and CVEs (Common Vulnerabilities and Exposures) into RDF and connect them to the git repositories where the vulnerability producing code is.

(5)

More people can ask questions of the data.

SPARQL is a declarative query language. The ease of using SPARQL has a bit to do with the thoughtfulness of the domain modeling.

Below I pose several questions to the data and I obtain answers with SPARQL.

Answering Questions About cURL
The cURL git repo has about 29k commits, going back to 1999.

My conversion tool turned it into just under 8 million triples in 70 minutes. I haven’t focused on execution efficiency yet. I wanted to run queries against the data to get a feel for the utility of this approach before I refine the tool.

Let’s answer some questions about the development of cURL.

How many deleted lines per person?

Result:

How many deleted files per person?

What files did a particular person delete and when?

Which commits affected lib/http.c in 2019 only?

Which persons have authored commits with the same email but different names?
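
The original post showed the query itself as a screenshot. A rough sketch of it, assuming author names and email addresses hang off a person node via gist:name and gist:hasAddress/gist:containedText (as in the gist examples earlier) and eliding the person-to-commit link, might be:

PREFIX gist: <https://ontologies.semanticarts.com/gist/>
SELECT ?email (GROUP_CONCAT(DISTINCT ?name; separator=", ") AS ?names)
WHERE {
  ?person gist:name ?name ;
          gist:hasAddress/gist:containedText ?email .
}
GROUP BY ?email
HAVING (COUNT(DISTINCT ?name) > 1)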

Result:

Which persons have authored commits with the same name and different email?

Result:

Which persons have authored commits in libcurl’s lib/ directory (this includes deleting something in there)?

Depending on how you count the people, the query finds between 678 and 727 people who authored commits in libcurl’s lib/ directory. That was Daniel’s first question. He got 629 with his method, but that was a few months ago and I don’t know exactly what his method of counting was. He may not have included the act of deleting a file in that directory like I did.

To answer his next three questions I’d need to record each commit’s parent commit (I don’t yet — one of my many TODOs) and simulate the application of hunks in SPARQL or add the output of git blame to the RDF. Daniel likely used the output of git blame. I’ll think about adding it to the RDF.

How
In another blog post I might describe how the conversion utility works. It is written in Clojure and it uses SPARQL Anything (which is built upon Apache Jena). I expect to push it to Github soon.

Closing Thoughts
It is fun to imagine having all the git repos in Github as RDF graphs in a massive triplestore and asking questions with SPARQL.

In my example queries I didn’t make use of the fact that each source code line is in the RDF. Most triplestores have full text search capabilities, so I’ll write some queries that make use of that too. In general I haven’t been overly impressed with the search built into GitLab and Bitbucket (I haven’t used GitHub’s search much), so I wonder if keeping an RDF representation with full text search would be a useful approach. I’d love to see SPARQL endpoints for searching hosted git platforms!

I think this technique could be applied to other application-centric file formats. SPARQL Anything gets you part of the way there for several file formats but I’d like to hear if you have other ideas.

Join the discussion on twitter!

Resisting the Temptation of Fused Edges

Fused Edges

If you are doing domain modeling and using a graph database you might be tempted to use fused edges. You see them around the semantic web. But you should resist the temptation.

What

In a graph database a fused edge occurs when a domain modeler uses a single edge where a node and two edges would be more thoughtful. To me a fused edge feels like running an interstate through an area of interest and not putting an exit nearby. It also feels like putting a cast on a joint that normally articulates.

Here is an example of a fused edge:

fused edges

And here is what that fused edge looks like in turtle (a popular RDF graph serialization):

:event01 :venueName "Olive Garden" .

You can usually see the fusion in the name of the edge: there is a “venue” and there is a “name.”

Here is a more thoughtful representation:

 articulating edges

with an additional point of articulation: the venue.

:event01 :occursIn :venue01 .
:venue01 :name "Olive Garden" .

Here is another common fused edge:

:person02 :mothersMaidenName "Smith" .

vs.

:person02 :hasMother :person01 .
:person01 :maidenName "Smith" .

Why

I can think of three reasons (two of which I have heard other people give) why fused edges might be used. Let’s use the event and venue example.

  1. Your source data may not have details about the venue other than its name.

  2. “you get better #findability with dedicated properties”

  3. Fewer nodes in a graph likely means fewer hardware resources are required.

Let me attempt to persuade you that you should mostly ignore those reasons to use fused edges.

(1)

One of the ideas of the semantic web is AAA: Anyone can say Anything about Any topic.

It is hard for someone to say something about the venue (perhaps its address, current owner, hours of operation, other events that occur there, etc.) if no node exists in the graph for it. With the fused edge, if someone does come along later and wants to express the venue’s address, it is not a straightforward update. You’d have to make a new venue node, find the event node in the graph, find all the edges expressing facts about the venue and move them to the new venue node, then connect the event to the new venue node. Finding all the edges hanging off of the event that express facts about the venue will likely be a manual effort — there probably won’t be clever data for the machine to use that says :venueName is not a direct attribute of the event but rather a direct attribute of a venue not yet represented in the graph.
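
To make that refactoring concrete, here is a hypothetical SPARQL UPDATE for just the :venueName edge (prefix declarations omitted as in the other snippets; the IRI-minting scheme is invented for this sketch):

DELETE { ?event :venueName ?name }
INSERT { ?event :occursIn ?venue .
         ?venue :name ?name }
WHERE  {
  ?event :venueName ?name .
  BIND(IRI(CONCAT(STR(?event), "/venue")) AS ?venue)   # mint a venue node per event
}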

Also, fused edges encourage the use of additional fused edges. If you don’t have a node to reference then a modeler might make more fused edges in order to express additional information.

(2)

Giving a shortcut a name can be valuable, yes.

But I think if you use a shortcut the details that the shortcut hides should also be available. If you use fused edges those details are not available; there is only the shortcut.

There are ways to have dedicated properties without sacrificing the details.

In SPARQL you can use shortcuts: property paths. In OWL you can define those shortcuts: property chains.

In a SPARQL query you could just do

?event :occursIn/:name ?venue_name .

Or you could define that in OWL

:venueName  owl:propertyChainAxiom  ( :occursIn  :name ) .

And if you have an OWL 2 reasoner active you can just query using the shortcut you just defined

?event :venueName ?venue_name .

(3)

Ok, using fused edges does reduce the number of triples in your graph. I can put a billion triples in a triplestore on my laptop and query durations will probably be acceptable. If I put 100 billion triples on my laptop query durations might not be acceptable. Still I think I would rather consider partitioning the data and using SPARQL query federation rather than fusing edges together to reduce resource requirements. I say that because I reach for semantic web technologies when I think radical data interoperability and serendipity would be valuable.
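
For reference, federation here just means a SERVICE clause. A minimal sketch using the running example (the remote endpoint URL is hypothetical, prefixes omitted as above):

SELECT ?event ?venue_name WHERE {
  ?event :occursIn ?venue .                        # local partition
  SERVICE <https://venues.example.com/sparql> {    # remote partition holding venue details
    ?venue :name ?venue_name .
  }
}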

Fused edges and radical data interoperability don’t go together. Fused edges are about the use cases you currently know about and the data you currently have. Graphs with thoughtful points of articulation are about the use cases you know about, those you discover tomorrow, and about potential data. Points of articulation in a graph suggest enrichment opportunities and new questions.

Schema.org

Schema.org is a well known ontology that unfortunately has lots of fused edges.

If you run this SPARQL query against schema.ttl you’ll see some examples.

PREFIX  schema: <https://schema.org/>
PREFIX  rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?com
WHERE
  { graph ?g {
      ?s rdfs:comment ?com .
      {
          GRAPH ?g
          { ?s  schema:rangeIncludes  schema:URL
            MINUS
              { ?s  schema:rangeIncludes  ?o
                FILTER ( ?o != schema:URL )
              }
          }
      }
  }
}

That query finds properties that are intended to have only instances of schema:URL in the object position.

You get these bindings:

s | com
https://schema.org/sameAs | URL of a reference Web page that unambiguously indicates the item’s identity. E.g. the URL of the item’s Wikipedia page, Wikidata entry, or official website.
https://schema.org/additionalType | An additional type for the item, typically used for adding more specific types from external vocabularies in microdata syntax. This is a relationship between something and a class that the thing is in. In RDFa syntax, it is better to use the native RDFa syntax – the ‘typeof’ attribute – for multiple types. Schema.org tools may have only weaker understanding of extra types, in particular those defined externally.
https://schema.org/codeRepository | Link to the repository where the un-compiled, human readable code and related code is located (SVN, github, CodePlex).
https://schema.org/contentUrl | Actual bytes of the media object, for example the image file or video file.
https://schema.org/discussionUrl | A link to the page containing the comments of the CreativeWork.
https://schema.org/downloadUrl | If the file can be downloaded, URL to download the binary.
https://schema.org/embedUrl | A URL pointing to a player for a specific video. In general, this is the information in the ‘src’ element of an ‘embed’ tag and should not be the same as the content of the ‘loc’ tag.
https://schema.org/installUrl | URL at which the app may be installed, if different from the URL of the item.
https://schema.org/map | A URL to a map of the place.
https://schema.org/maps | A URL to a map of the place.
https://schema.org/paymentUrl | The URL for sending a payment.
https://schema.org/relatedLink | A link related to this web page, for example to other related web pages.
https://schema.org/replyToUrl | The URL at which a reply may be posted to the specified UserComment.
https://schema.org/serviceUrl | The website to access the service.
https://schema.org/significantLinks | The most significant URLs on the page. Typically, these are the non-navigation links that are clicked on the most.
https://schema.org/significantLink | One of the more significant URLs on the page. Typically, these are the non-navigation links that are clicked on the most.
https://schema.org/targetUrl | The URL of a node in an established educational framework.
https://schema.org/thumbnailUrl | A thumbnail image relevant to the Thing.
https://schema.org/trackingUrl | Tracking url for the parcel delivery.
https://schema.org/url | URL of the item.

You can see that most of those object properties are fused edges.

e.g.

schema:paymentUrl fuses together hasPayment and url

schema:trackingUrl fuses together hasTracking and url

schema:codeRepository fuses together hasCodeRepository and url

etc.

I think each of those named shortcuts would be fine if they were built up from primitives like

:codeRepositoryURL  owl:propertyChainAxiom  ( :hasCodeRepository  :url ) .

but I might not put them in core Schema.org because then what stops people from thinking all their favorite named shortcuts belong in core Schema.org?

Also if you run that same query with schema:Place (instead of schema:URL) you can see many more fused properties. Maybe I’ll do another post where I catalog all the fused properties in Schema.org.

Wrap it up

If you find yourself in the position of building an ontology (the T-box) then remember that the object properties you create will shape the way domain modelers think about decomposing their data. An ontology with composable object/data properties, such as Gist, encourages domain modelers to use points of articulation in their graphs. You can always later define object properties that build upon the more primitive and composable object properties but once you start fusing edges it could be hard to reel it in.

Please consider not using fused edges and instead use an ontology that encourages the thoughtful use of points (nodes) of articulation. I don’t see how the semantic web can turn down any stereo’s volume when you get a phone call without thoughtful points of articulation.

Final Appeal

If you believe you must use an edge like :venueName then please put something like this in your Tbox: :venueName owl:propertyChainAxiom ( :occursIn :name ) .

Appendix

schema.org way (fused edges)

[ a schema:CreativeWork ;
  a wd:Q1886349 ; # Logo 
  schema:url  "https://i.imgur.com/46JjPLl.jpg" ;
  rdfs:label "Shipwreck Cafe Logo" ;
  schema:discussionUrl  "https://gist.github.com/justin2004/183add3d617105cc9cc7cee013d44198" ] .

points of articulation way

[ a schema:UserComments ;
  schema:url "https://gist.github.com/justin2004/183add3d617105cc9cc7cee013d44198" ; 
  schema:discusses [ a schema:CreativeWork ;
                     a wd:Q1886349 ; # Logo 
                     rdfs:label "Shipwreck Cafe Logo" ;
                     schema:url  "https://i.imgur.com/46JjPLl.jpg"
                   ]
] .
wd:Q113149564 schema:logo "https://i.imgur.com/46JjPLl.jpg" .

schema:discussionUrl is really a shorthand for the property path: (^schema:discusses)/schema:url. So it is 2 edges fused together in such a way that you can’t reference the node in the middle: the discussion itself. If you can’t reference the node in the middle (the discussion itself) you can’t say when it started, when it ended, who the participants were, etc.
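
If you wanted to keep schema:discussionUrl around as a defined shortcut rather than a fused edge, the property chain device from earlier also handles that inverse step. A sketch in OWL 2:

schema:discussionUrl owl:propertyChainAxiom ( [ owl:inverseOf schema:discusses ] schema:url ) .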

Oh, I think the reason Schema.org has so many fused edges is that it is designed as a way to add semantics to webpages. A webpage is a document… which is often a bag of information. So a fused edge leaving a bag of information doesn’t seem like such a sin. But, personally, that makes me want to do more than attempt to hang semantics off of a bag of information.

Get the gist: start building simplicity now

While organizing data has always been important, a noticeably profound interest in optimizing information models with semantic knowledge graphs has arisen. LinkedIn and Airbnb, in addition to giants Google and Amazon, use graphs, but without a model for connecting concepts with rules for membership, buyer recommendations and enhanced searchability (“follow your nose”) capabilities would lack accuracy.
Drum roll please … introducing the ontology.
It is a model that supports semantic knowledge graph reasoning, inference, and provenance. Think of an ontology as the brain giving messages to the nervous system (the knowledge graph). An ontology organizes data into well-defined categories with clearly defined relationships. This model represents a foundational starting point that allows humans and machines to read, understand, and infer knowledge based on its classification. In short, it automatically figures out what is similar and what is different.
We’re often asked: where do I start?
Enter ‘gist’, a minimalist business ontology (model) to springboard the transition of information into knowledge. With more than a decade of refinement grounded in simplicity, ‘gist’ is designed to have the maximum coverage of typical business ontology concepts with the fewest primitives and the least ambiguity. ‘gist’ is available for free under a Creative Commons license and is being applied and extended in a number of business use cases across countless industries.
Recently, senior ontologist Michael Uschold has been sharing an introductory overview of ‘gist’, which is maintained by Semantic Arts.
One compelling difference from most publicly available ontologies is that ‘gist’ has an active governance and best-practices community, called the gist Council. The council meets virtually on the first Thursday of every month to discuss how to use ‘gist’ and make suggestions on its evolution.
See Part I of Michael’s introduction here:

See Part II of Michael’s introduction here:

Stay tuned for the final installment!

Interested in gist? Visit Semantic Arts – gist

See more informative videos on Semantic Arts – YouTube

The Data-Centric Revolution: Headless BI and the Metrics Layer

Read more from Dave McComb in his recent article on The Data Administration Newsletter.

“The data-centric approach to metrics puts the definition of the metrics in the shared data. Not in the BI tool, not in code in an API. It’s in the data, right along with the measurement itself.”

Link: The Data-Centric Revolution: Headless BI and the Metrics Layer – TDAN.com

Read more of Dave’s articles: mccomb – TDAN.com

How to SPARQL with tarql

To load existing data into a knowledge graph without writing code, try using the tarql program. Tarql takes comma-separated values (csv) as input, so if you have a way to put your existing data in csv format, you can then use tarql to convert the data to semantic triples ready to load into a knowledge graph. Often, the data starts off as a tab in an Excel spreadsheet, which can be saved as a file of comma-separated values.

This blog post is for anyone familiar with SPARQL who wants to get started using tarql by learning a simple three-step process and seeing enough examples to feel confident about applying it.

Why SPARQL? Because tarql gets its instructions for how to convert csv data to triples via SPARQL statements you write. Tarql reads one row of data at a time and converts it to triples; by default the first row of the comma-separated values is interpreted to be variables, and subsequent rows are interpreted to be data.

Here are three steps to writing the SPARQL:

1. Understand your csv data and write down what one row should be converted to.
2. Use a SPARQL CONSTRUCT clause to define the triples you want as output.
3. Use a SPARQL WHERE clause to convert csv values to output values.

That’s how to SPARQL with tarql.

Example:

1. Review the data from your source; identify what each row represents and how the values in a row are related to the subject of the row.

In the example, each row includes information about one employee, identified by the employee ID in the first column. Find the properties in your ontology that will let you relate values in the other columns to the subject.

Then pick one row and write down what you want the tarql output to look like for the row. For example:

exd:_Employee_802776 rdf:type ex:Employee ;
ex:name "George L. Taylor" ;
ex:hasSupervisor exd:_Employee_960274 ;
ex:hasOffice "4B17" ;
ex:hasWorkPhone "906-555-5344" ;
ex:hasWorkEmail "[email protected]" .

The “ex:” in the example is an abbreviation for the namespace of the ontology, also known as a prefix for the ontology. The “exd:” is a prefix for data that is represented by the ontology.

2. Now we can start writing the SPARQL that will produce the output we want. Start by listing the prefixes needed and then write a CONSTRUCT statement that will create the triples. For example:

prefix ex: <https://ontologies.company.com/examples/>
prefix exd: <https://data.company.com/examples/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

construct {
?employee_uri rdf:type ex:Employee ;
ex:name ?name_string ;
ex:hasSupervisor ?supervisor_uri ;
ex:hasOffice ?office_string ;
ex:hasWorkPhone ?phone_string ;
ex:hasWorkEmail ?email_string .
}

Note that the variables in the CONSTRUCT statement do not have to match variable names in the spreadsheet. We included the type (uri or string) in the variable names to help make sure the next step is complete and accurate.

3. Finish the SPARQL by adding a WHERE clause that defines how each variable in the CONSTRUCT statement is assigned its value when a row of the csv is read. Values get assigned to these variables with SPARQL BIND statements.

If you read tarql documentation, you will notice that tarql has some conventions for converting the column headers to variable names. We will override those to simplify the SPARQL by inserting our own variable names into a new row 1, and then skipping the original values in row 2 as the data is processed.
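
For illustration, the top of the csv might then look like this (the column layout is assumed here; the data row echoes the example above), with our variable names inserted as a new row 1 and the original headers pushed down to row 2:

employee,name,supervisor,office,phone,email
Employee ID,Name,Supervisor ID,Office,Phone,Email
802776,George L. Taylor,960274,4B17,906-555-5344,[email protected]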

Here’s the complete SPARQL script:

prefix ex: <https://ontologies.company.com/examples/>
prefix exd: <https://data.company.com/examples/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

construct {
?employee_uri rdf:type ex:Employee ;
ex:name ?name_string ;
ex:hasSupervisor ?supervisor_uri ;
ex:hasOffice ?office_string ;
ex:hasWorkPhone ?phone_string ;
ex:hasWorkEmail ?email_string .
}

where {
bind (xsd:string(?name) as ?name_string) .
bind (xsd:string(?office) as ?office_string) .
bind (xsd:string(?phone) as ?phone_string) .
bind (xsd:string(?email) as ?email_string) .

bind(str(tarql:expandPrefix("ex")) as ?exNamespace) .
bind(str(tarql:expandPrefix("exd")) as ?exdNamespace) .

bind(concat("_Employee_", str(?employee)) as ?employee_string) .
bind(concat("_Employee_", str(?supervisor)) as ?supervisor_string) .

bind(uri(concat(?exdNamespace, ?employee_string)) as ?employee_uri) .
bind(uri(concat(?exdNamespace, ?supervisor_string)) as ?supervisor_uri) .

# skip the row you are not using (original variable names)
filter (?ROWNUM != 1) # ROWNUM must be in capital letters
}

And here are the triples created by tarql:

exd:_Employee_802776 rdf:type ex:Employee ;
ex:name "George L. Taylor" ;
ex:hasOffice "4B17" ;
ex:hasWorkPhone "906-555-5344" ;
ex:hasWorkEmail "[email protected]" .

exd:_Employee_914053 rdf:type ex:Employee ;
ex:name "Amy Green" ;
ex:hasOffice "3B42" ;
ex:hasWorkPhone "906-555-8253" ;
ex:hasWorkEmail "[email protected]" .

exd:_Employee_426679 rdf:type ex:Employee ;
ex:name "Constance Hogan" ;
ex:hasOffice "9C12" ;
ex:hasWorkPhone "906-555-8423" .

If you want a diagram of the output, try this tool for viewing triples.

Now that we have one example worked out, let’s review some common situations and SPARQL statements to deal with them.

To remove special characters from csv values:

replace(?variable, '[^a-zA-Z0-9]', '_')

To cast a date as a dateTime value:

bind(xsd:dateTime(concat(?date, 'T00:00:00')) as ?dateTime)

To convert yes/no values to meaningful categories (or similar conversions):

bind(if … )
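
For instance, a hypothetical yes/no column ?fulltime could be mapped along these lines (column and output names are illustrative):

bind(if(?fulltime = 'yes', 'Full Time', 'Part Time') as ?employmentType) .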

To split multi-value fields:

apf:strSplit(?variable ',')

Another really important point is that data extracts in csv format typically do not contain URIs (the unique permanent IDs that allow triples to “snap together” in the graph). When working with multiple csv files, make sure to keep track of how you are creating the URI for each type of instance and always use exactly the same method across all of the sources.

Practical tip: name files to make them easy to find, for example:

employee.csv
employee.tq SPARQL script containing instructions for tarql
employee.sh shell script with the line “tarql employee.tq employee.csv”

Excel tip: to save an Excel sheet as csv use Save As / Comma Separated Values (csv).

So there it is, a simple three-step method for writing the SPARQL needed to convert comma-separated values to semantic triples. The beauty of it is that you don’t need to write code, and since you need to use SPARQL for querying triple stores anyway, there’s only a small additional learning curve to use it for tarql.

Special thanks to Michael Uschold and Dalia Dahleh for their excellent input.

For more examples and more options, see the nice writeup by Bob DuCharme or refer to the tarql site.

Incremental Stealth Legacy Modernization

I’m reading the book Kill it with Fire by Marianne Bellotti. It is a delightful book. Plenty of pragmatic advice, both on the architectural side (how to think through whether and when to break up that monolith) and the organizational side (how to get and maintain momentum for what are often long, drawn-out projects). So far in my reading she seems to advocate incremental improvement over rip and replace, which is sensible, given the terrible track record with rip and replace. Recommended reading for anyone who deals with legacy systems (which is to say anyone who deals with enterprise systems, because a majority are or will be legacy systems).

But there is a better way to modernize legacy systems. Let me spoil the suspense: it is Data-Centric. We are calling it Incremental Stealth Legacy Modernization because no one is going to get the green light to take this on directly. This article is for those playing the long game.

Legacy Systems

Legacy is the covering concept for a wide range of activities involving aging enterprise systems. I had the misfortune of working in Enterprise IT just as the term “Legacy” became pejorative. It was the early 1990’s, we were just completing a long-term strategic plan for John’s Manville. We decided to call it the “Legacy Plan” as we thought those involved with it would leave a legacy to those who came after. The ink had barely dried when “legacy” acquired a negative connotation. (While writing this I just looked it up, and Wikipedia thinks the term had already acquired its negative connotation in the 1980’s. Seems to me if it were in widespread use someone would have mentioned it before we published that report).

There are multiple definitions of what makes something a legacy system. Generally, it refers to older technology that is still in place and operating. What tends to keep legacy systems in place are networks of complex dependencies. A simple stand-alone program does not become a legacy system, because when the time comes, it can easily be rewritten and replaced. Legacy systems have hundreds or thousands of external dependencies, that often are not documented. Removing, replacing, or even updating legacy systems runs the risk of violating some of those dependencies. It is the fear of this disruption that keeps most legacy systems in place. And the longer it stays in place the more dependencies it accretes.

If these were the only forces affecting legacy systems, they would stay in place forever. The countervailing forces are obsolescence, dis-economy, and risk. While many parts of the enterprise depend on the legacy system, the legacy system itself has dependencies. The system is dependent on operating systems, programming languages, middleware, and computer hardware. Any of these dependencies can and do become obsolescent and eventually obsolete. Obsolete components are no longer supported and therefore represent a high degree of risk of total failure of the system. The two main dimensions of dis-economy are operations and change. A modern system can typically run at a small fraction of the operating costs of a legacy system, especially when you tally up all the licenses for application systems, operating systems, and middleware and add in salary costs for the operators and administrators who support them. The dis-economy of change is well known, coming in the form of integration debt. Legacy systems are complex and brittle, which makes change hard. The cost to make even the smallest change to a legacy system is orders of magnitude more than the cost to make a similar change to a modern, well-designed system. Legacy systems are often written in obscure languages. One of my first legacy modernization projects involved replacing a payroll system written in assembler language with one that was to be written in “ADPAC.” You can be forgiven for thinking it is insane to have written a payroll system in assembler language, and even more so for replacing it with a system written in a language that no one in the 21st century has heard of, but this was a long time ago, and it is indicative of where legacy systems come from.

Legacy Modernization

Eventually the pressure to change overwhelms the inertia to leave things as they are. This usually does not end well, for several reasons. Legacy modernization is usually long delayed. There is not a compelling need to change, and as a result, for most of the life of a legacy system, resources have been assigned to other projects that get short-term net positive returns. Upgrading the legacy system represents low upside. The new legacy system will do the same thing the old legacy system did, perhaps a bit cheaper or a bit better, but not fundamentally differently. Your old payroll system is paying everyone, and so will a new one.

As a result, the legacy modernization project is delayed as long as possible. When the inevitable precipitating event occurs, the replacement becomes urgent. People are frustrated with the old system. Replacing the legacy system with some more modern system seems like a desirable thing to do. Usually this involves replacing an application system with a package, as this is the easiest project to get approved. These projects were called “Rip and Replace” until the success rate of this approach plummeted. It is remarkable how expensive these projects are and how frequently they fail. Each failure further entrenches the legacy system and raises the stakes for the next project.

Ms. Bellotti points out in Kill it with Fire that many times the way to go is incremental improvement. By skillfully understanding the dependencies and engineering decoupling techniques, such as APIs and intermediary data sets, it is possible to stave off some of the highest-risk aspects of the legacy system. This is preferable to massive modernization projects that fail but, interestingly, it has its own downsides: major portions of the legacy system continue to persist, and as she points out, few developers want to sign on to this type of work.

We want to outline a third way.

The Lost Opportunity

After a presentation on Data-Centricity, someone in the audience pointed out that data warehousing represents a form of Data-Centricity. Yes, in a way it does. With Data Warehousing, and more recently Data Lakes and Data Lakehouses, you have taken a subset of the data from numerous data silos and put it in one place for easier reporting. Yes, this captures a few of the data-centric tenets.

But what a lost opportunity. Think about it, we have spent the last 30 years setting up ETL pipelines and gone through several generations of data warehouses (from Kimball / Inmon roll your own to Teradata, Netezza to Snowflake and dozens more along the way) but have not gotten one inch closer to replacing any legacy systems. Indeed, the data warehouse is entrenching the legacy systems deeper by being dependent on them for their source of data. The industry has easily spent hundreds of billions of dollars, maybe even trillions of dollars over the last several decades, on warehouses and their ecosystems, but rather than getting us closer to legacy modernization we have gotten further from it.

Why no one will take you seriously

If you propose replacing a legacy system with a Knowledge Graph you will get laughed out of the room. Believe me, I’ve tried. They will point out that the legacy systems are vastly complex (which they are), have unknowable numbers of dependent systems (they do), the enterprise depends on their continued operation for its very existence (it does) and there are few if any reference sites of firms that have done this (also true). Yet, this is exactly what needs to be done, and at this point, it is the only real viable approach to legacy modernization.

So, if no one will take you seriously, and therefore no one will fund you for this route to legacy modernization, what are you to do? Go into stealth mode.

Think about it: if you did manage to get funded for a $100 million legacy replacement project, and it failed, what do you have? The company is out $100 million, and your reputation sinks with the $100 million. If instead you get approval for a $1 Million Knowledge Graph based project that delivers $2 million in value, they will encourage you to keep going. Nobody cares what the end game is, but you.

The answer then, is incremental stealth.

Tacking

At its core, it is much like sailing into the wind. You cannot sail directly into the wind. You must tack, and sail as close into the wind as you can, even though you are not headed directly towards your target. At some point, you will have gone far to the left of the direct line to your target, and you need to tack to starboard (boat speak for “right”). After a long starboard tack, it is time to tack to port.

In our analogy, taking on legacy modernization directly is sailing directly into the wind. It does not work. Incremental stealth is tacking. Keep in mind though, just incremental improvement without a strategy is like sailing with the wind (downwind): it’s fun and easy, but it takes you further from your goal, not closer.

The rest of this article lays out what we think the important tacking strategy should be for a firm that wants to take the Data-Centric route to legacy modernization. We have several clients that are on the second and third tack in this series.

I’m going to use a hypothetical HR / Payroll legacy domain for my examples here, but they apply to any domain.

Leg 1 – ETL to a Graph

The first tack is the simplest. Just extract some data from legacy systems and load it into a Graph Database. You will not get a lot of resistance to this, as it looks familiar. It looks like yet another data warehouse project. The only trick is getting sponsors to go this route instead of the tried-and-true data warehouse route. The key enablers here are to find problems well suited to graph structures, such as those that rely on graph analytics or shortest path problems. Find data that is hard to integrate in a data warehouse, a classic example is integrating structured data with unstructured data, which is nearly impossible in traditional warehouses, and merely tricky in graph environments.

The only difficulty is deciding how long to stay on this tack. As long as each project is adding benefit, it is tempting to stay on this tack for a long, long time. We recommend staying this course at least until you have a large subset of the data in at least one domain in the graph, refreshed frequently.

Let’s say that after being on this tack for a long while you have all the key data on all your employees in the graph, and it is being updated frequently.

Leg 2 – Architecture MVP

On the first leg of the journey there are no updates being made directly to the graph. Just as in a data warehouse: no one makes updates in place in the data warehouse. It is not designed to handle that, and it would mess with everyone’s audit trails.

But a graph database does not have the limitations of a warehouse. It is possible to have ACID transactions directly in the graph. But you need a bit of architecture to do so. The challenge here is creating just enough architecture to get through your next tack. Where you start depends a lot on what you think your next tack will be. You’ll need constraint management to make sure your early projects are not loading invalid data back into your graph. Depending on the next tack you may need to implement fine-grained security.
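
As a small illustration of the kind of constraint management meant here, a SHACL shape along these lines (class and property names hypothetical) could reject employee records that arrive without a name:

@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix ex:  <https://example.com/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# hypothetical shape: every employee loaded into the graph must carry a name
ex:EmployeeShape a sh:NodeShape ;
  sh:targetClass ex:Employee ;
  sh:property [
    sh:path ex:name ;
    sh:datatype xsd:string ;
    sh:minCount 1 ;
  ] .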

Whatever you choose, you will need to build or buy enough architecture to get your first update in place functionality going.

Leg 3 – Simple New Functionality in the Graph

In this leg we begin building update-in-place business use cases. We recommend not trying to replace anything yet. Concentrate on net new functionality. Some of the current best places to start are maintaining reference data (common shared data such as country codes, currencies, and taxonomies) and/or some metadata management. Everyone seems to be doing data cataloging projects these days; they could just as well be done in the graph, giving you experience working through this new paradigm.

The objective here is to spend enough time on this tack that developers become comfortable with the new development paradigm. Coding directly to graph involves new libraries and new patterns.

Optionally, you may want to stay on this tack long enough to build “model driven development” (low code / no code in Gartner speak) capability into the architecture. The objective of this effort is to drastically reduce the cost of implementing new functionality in future tacks. The before-and-after metrics on reduced code development, code testing, and code defects will make a striking case for the approach. Or you could leave model driven development to a future tack.

Using the payroll / HR example: add new functionality that depends on HR data but that nothing else depends on yet. Maybe you build a skills database, or a learning management system. It depends on what is not yet in place that can be purely additive. These are good places to start demonstrating business value.

Leg 4 – Understand the Legacy System and its Environment

Eventually you will get good at this and want to replace some legacy functionality. Before you do it will behoove you to do a bunch of deep research. Many legacy modernization attempts have run aground from not knowing what they did not know.

There are three things that you don’t fully know at this point:

• What data is the legacy system managing
• What business logic is the legacy system delivering
• What systems are dependent on the legacy system, and what is the nature of those dependencies.
If you have done the first three tacks well, you will have all the important data from the domain in the graph. But you will not have all the data. In fact, at the metadata level, it will appear that you have the tiniest fraction of the data. In your Knowledge Graph you may have populated a few hundred classes and used a few hundred properties, but your legacy system has tens of thousands of columns. By appearances you are missing a lot. What we have discovered anecdotally, but have not yet proven, is that legacy systems are full of redundancy and emptiness. You will find that you do have most of the data you need, but before you proceed you need to prove this.

We recommend data profiling using software from a company such as GlobalIDs, IoTahoe, or BigID. This software reads all the data in the legacy system and profiles it. It discovers patterns and creates histograms, which reveal where the redundancy is. More importantly, you can find data that is not in the graph and have a conversation about whether it is needed. A lot of the data in legacy systems is accumulators (YTD, MTD, etc.) that can easily be replaced by aggregation functions, processing flags that are no longer needed, and a vast number of fields that are no longer used but that both business and IT are afraid to let go of. Profiling will provide that certainty.

Another source of fear is “business logic” hidden in the legacy system. People fear that we do not know all of what the legacy system is doing and that turning it off will break something. There are millions of lines of code in that legacy system; surely it is doing something useful. Actually, it is not. There is remarkably little essential business logic in most legacy systems. I know because I’ve built complex ERP systems and implemented many packages. Most of this code is just moving data from the database to an API, or from a transaction to another API, or into a conversational control record, or to the DOM if it is a more modern legacy system, onto the screen and back again. There is a bit of validation sprinkled throughout, which some people call “business logic,” but that is a stretch; it’s just validation. There is some mapping (when the user selects “Male” in the drop-down, put “1” in the gender field). And occasionally there is a bit of bona fide business logic. Calculating economic order quantities, critical paths, or gross-to-net payroll is genuine business logic. But it represents far less than 1% of the code base. The value here is being sure you have found that logic and inserted it into the graph.

This is where reverse engineering or legacy understanding software plays a vital role. Ms. Bellotti is 100% correct on this point as well. If you think these reverse engineering tools are going to automate your legacy conversion, you are in for a world of hurt. But what they can do is help you find the genuine business logic and provide some comfort to the sponsors that there isn’t something important the legacy system is doing that no one knows about.

The final bit of understanding is the dependencies. This is the hardest one to complete. The profiling software can help. Some of it can detect that when the histogram of social security numbers in system A changes and the next day the same change is seen in system B, there must be an interface. But beyond this, the best you can do is catalog all the known data feeds and APIs. These are the major mechanisms that other systems use to become dependent on the legacy system. You will need strategies to mimic these dependencies to begin the migration.

This tack is purely research, and therefore does not deliver any perceived immediate gain. You may need to bundle it with some other project that is providing incremental value to get it funded, or you may fund it via a contingency budget.

Leg 5 – Become the System of Record for some subset

Up to this point, data has been flowing into the graph from the legacy system or originating directly in the graph.

Now it is time to begin the reverse flow. We need to find an area where we can begin the flow going in the other direction. We now have enough architecture to build and answer use cases in the graph; it is time to start publishing rather than subscribing.

It is tempting to want to feed all the data back to the legacy system, but the legacy system has lots of data we do not want to source. Furthermore, doing so entrenches the legacy system even deeper. We need to pick off small areas that can decommission part of the legacy system.

Let’s say there was a certificate management system in the legacy system. We replace this with a better one in the graph and quit using the legacy one. But from our investigation above, we realize that the legacy certificate management system was feeding some points to the compensation management system. We just make sure the new system can feed the compensation system those points.

Leg 6 – Replace the dependencies incrementally

Now the flywheel is starting to turn. Encouraged by the early success of the reverse flow, the next step is to work out the data dependencies in the legacy system and work out a sequence to replace them.

The legacy payroll system is dependent on the benefit elections system. You now have two choices. You could replace the benefits system in the Graph. Now you will need to feed the results of the benefit elections (how much to deduct for the health care options etc.) to the legacy system. This might be the easier of the two options.

But the one that has the most impact is the other: replace the payroll system. You have the benefits data feeding into the legacy system. If you replace the payroll system, there is nothing else (in HR) you need to feed. A feed to the financial system and the government reporting system will be necessary, but you will have taken a much bigger leap in the legacy modernization effort.

Leg 7 – Lather, Rinse, Repeat

Once you have worked through a few of those, you can safely decommission the legacy system a bit at a time. Each time, pick off an area that can be isolated. Replace the functionality and feed the remaining bits of the legacy infrastructure if necessary. Just stop using that portion of the legacy system. The system will gradually atrophy. No need for any big bang replacement. The risk is incremental and can be rolled back and retried at any point.

Conclusion

We do not go into our clients claiming to be doing legacy modernization, but it is our intent to put them in a position where they could realize it over time by applying knowledge graph capabilities.

We all know that at some point all legacy systems will have to be retired. At the moment the state of the art seems to be either “rip and replace” usually putting a packaged application in to replace the incumbent legacy system, or incrementally improve the legacy system in place.

We think there is a risk-averse, predictable, and self-funding route to legacy modernization, and it is done through Data-Centric implementation.