The Data-Centric Revolution: The Sky is Falling (Let’s Make Lemonade)

Recently IDC predicted that IT spending will drop by 5% due to the COVID-19 pandemic.[1] Last week, Gartner went further by predicting that IT spending would drop by 8%, or $300 billion.[2] (Expect a prediction bidding war.) Both were consistent: the hardest-hit areas would be devices, followed by IT services and enterprise software.

The predicted $100 billion drop in those last two categories should send chills through those of us who make our living there. And keep in mind, this drop will occur in the latter half of this year. To date, there have been very few cuts.

But I’m seeing the glass half full here. Half full of lemonade.[3]

Here is my thought process:

  • For at least five years, we have been advocating to abandon the senseless implementation of application after application. (You know: the silo making industry.) We have made a strong case for avoiding the application centric quagmire in Software Wasteland.[4]
  • And yet spending on implementing application systems has continued unabated since 2015.
  • With the need to slash budgets in the latter half of 2020, the large application implementation projects will be the easiest targets.
  • Indeed, the IDC article says that “IT services spending will also decline, mostly due to delays in large projects.”
  • Furthermore, “some firms will cut capital spending and others will either delay new projects or seek to cut costs in other ways.”
  • Gartner reported that “some companies are cutting big IT projects altogether; others are ploughing ahead but delaying some elements of their plans to save money.”
  • Hershey has halted sections of a new ERP system and will drop IT capital spending from the budgeted $500 million to between $400 million and $450 million.
  • Gartner also stated that “health care systems [are] pushing out projects to create digital health records by six months or more.”

This would be a terrible time to be an application software vendor or a systems integrator. The yearly 7% reductions in both categories are still in front of us. Any contract not yet signed will be put on hold. Even contracts in progress may get cancelled.

Click here to read more on TDAN.com

Structure-First Data Modeling: The Losing Battle of Perfect Descriptions

In my last article I described Meaning-First data modeling. It’s time to dig into its predecessor and antithesis, which I call Structure-First data modeling, specifically looking at how two assumptions drive our actions. Assumptions are quite useful since they leverage experience without having to re-learn what is already known. It is a real time-saver.

Until it isn’t.

For nearly the last half century, the eventual implementation for data management systems has consisted of various incarnations of tables-with-columns and the supporting infrastructure which weaves them into a solution. The brilliant works of Steve Hoberman, Len Silverston, David Hay, and many others, in developing data modeling strategies and patterns are notable and admirable. They pushed data modeling art and science forward. As strong as those contributions are, they are still description-focused and assume a Structure-First implementation.

Structure-First data modeling is based on two assumptions. The first assumption is that the solution will always be physically articulated in a tables-with-columns structure. The second is that proceeding requires developing complete descriptions of subject matter. This second assumption is also on the path of either/or thinking; either the description is complete, or it is not. If it is not, then tables-with-columns (and a great deal of complexity) are added until it is complete. Our analysis, building on these assumptions, is focused on the table structures and how they are joined to create a complete attribute inventory.

The focus on structure is required because no data can be captured until the descriptive attribute structure exists. This inflexibility makes the system both brittle and complex.
All the descriptive attribution being stuffed into tables-with-columns is a parts list for the concept, but there is no succinct definition of the whole. These first steps taken on a data management journey are on the path to complexity, and since they are based on rarely articulated assumptions, the path is never questioned. The complete Structure-First model must accommodate every possible descriptive attribute that could be useful. We have studied E. F. Codd’s normal forms and drive toward structural normalization. Therefore, our analysis is focused on avoiding repeating columns, multiple values in a single column, and so on, rather than on what the data means.

Yet with all the attention paid to capturing all the descriptive attributes, new ones constantly appear. We know this is inevitable for any system having even a modest lifespan. For example, thanks to COVID-19, educational institutions that have never offered online courses are suddenly faced with moving exclusively to online offerings, at least temporarily. Buildings and rooms are not relevant for those offerings, but web addresses and enabling software are. Experience demonstrates how costly it is in both time and resources to add a new descriptive attribute after the system has been put into production. Inevitably something needs to be added. This happens either because something was missed or a new requirement was added. It also happens because buried in the long parts list of descriptive attributes, the same thing has been described several times in different ways. The brittle nature of tables-with-columns results in every change requiring very expensive modeling, refactoring, and regression testing to get the change into production.

Neither the tables-with-columns assumption nor the complete-description (parts list) assumption applies when developing semantic knowledge graph solutions using a Meaning-First data modeling approach. Why am I convinced Meaning-First will advance the data management discipline? Because Meaning-First is definitional, the path of both/and thinking, and it rests on a single structure, the triple, for virtually everything. The World Wide Web Consortium (W3C) defined the standard RDF (Resource Description Framework) triple to enable linking data on the open web and in private organizations. The definition, articulated in RDF triples, captures the essence to which new facts are linked. Semantic technologies provide a solid, machine-interpretable definition and the standard RDF triple as the structure. Since there is no need to build new structures, new information can be added instantly: simply dropping new information into the database links it to existing data right away.
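As a rough sketch of what this means in practice (not taken from the article), here is a minimal example using the Python rdflib library; the namespace, class, and property names are invented for illustration:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.com/ont#")  # invented namespace for illustration

g = Graph()
# Existing facts: a course and the building it normally meets in.
g.add((EX.Course101, RDF.type, EX.Course))
g.add((EX.Course101, EX.meetsIn, EX.BuildingA))

# A new requirement arrives (say, online delivery). There is no table to alter
# and no migration to run: the new fact is just another triple dropped in.
g.add((EX.Course101, EX.deliveredVia, Literal("https://example.com/zoom/101")))

# The new information is immediately linked to the existing node and queryable.
for predicate, obj in g.predicate_objects(EX.Course101):
    print(predicate, obj)
```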

While meaning and structure are separate concepts, we have been conflating them for decades, resulting in unnecessary complexity. Humankind has been formalizing the study of meaning since Aristotle and has been making significant progress along the way. Philosophy’s formal logics are semantics’ Meaning-First cornerstone. Formal logics define the nature of whatever is being studied such that when something matches the formal definition, it can be proved that it is necessarily in the defined set. Semantic technology has enabled machine-readable assembly using formal logics. An example might make it easier to understand.

Consider a requirement to know which teams have won the Super Bowl. How would each approach solve this requirement? The required data is:
• Super Bowls played
• Teams that played in each Super Bowl
• Final scores
Data will need to be acquired in both cases and is virtually the same, so this example skips over those mechanics to focus on differences.

A Structure-First approach might look something like this. First, create a conceptual model with the table structures and their columns to contain all the relevant team, Super Bowl, and score data. Second, create a logical model from the conceptual model that identifies the logical designs that will allow the data to be connected and used. This requires primary and foreign key designs, logical data types and sizes, as well as join structures for assembling data from multiple tables. Third, create a physical model from the logical model to capture the storage strategy and incorporate vendor-specific implementation details.

Only at this point can the data be entered into the Structure-First system. This is because until the structure has been built, there is no place for the data to land. Then, unless you (the human user) know the structure, there is no way to get data back out. However, this isn’t true when using Meaning-First semantic technology.

A Meaning-First approach can start either by acquiring well-formed triples or building the model as the first step. The model can then define the meaning of “Super Bowl winner” as the team with the highest score for each Super Bowl occurrence. Semantic technology captures the meaning using formal logics, and the data that match that meaning self-assemble into the result set. Formal logics can also be used to infer which teams might have won the Super Bowl using the logic “in order to win, the team must have played in the Super Bowl,” and not all NFL teams have.
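Here is a hedged sketch of that "winner as meaning" idea using the Python rdflib library and SPARQL; the namespace, property names, and scores are illustrative inventions, not the author's model:

```python
from rdflib import Graph, Literal, Namespace

SB = Namespace("http://example.com/superbowl#")  # invented namespace
g = Graph()

# Illustrative results only: each game links to two performances, and each
# performance carries a team and its final score.
results = [
    ("SB53", "NewEngland", 13), ("SB53", "LosAngelesRams", 3),
    ("SB54", "KansasCity", 31), ("SB54", "SanFrancisco", 20),
]
for game, team, score in results:
    performance = SB[f"{game}_{team}"]
    g.add((SB[game], SB.hasPerformance, performance))
    g.add((performance, SB.team, SB[team]))
    g.add((performance, SB.finalScore, Literal(score)))

# "Winner" is expressed as meaning, not structure: the team whose score is not
# exceeded by any other score in the same game.
winners = g.query("""
    PREFIX sb: <http://example.com/superbowl#>
    SELECT ?game ?team WHERE {
        ?game sb:hasPerformance ?p .
        ?p sb:team ?team ; sb:finalScore ?score .
        FILTER NOT EXISTS {
            ?game sb:hasPerformance ?p2 .
            ?p2 sb:finalScore ?other .
            FILTER (?other > ?score)
        }
    }
""")
for game, team in winners:
    print(game, team)
```

Nothing in the query names a table or a join; the request is phrased entirely in terms of what the data means.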

The key is that in the Meaning-First example, members of the set called Super Bowl winners can be returned without identifying the structure in the request. The Structure-First example required understanding and navigating the structure before even starting to formulate the question. It’s not so hard in this simple example, but in enterprise data systems with hundreds, or more likely thousands, of tables, understanding the structure is extremely challenging.

Semantic Meaning-First databases, known as triple stores, are not a collection of tables-with-columns. They are composed of RDF triples that are used for both the definitions (schema in the form of an ontology) and the content (data). As a result, you can write queries against an RDF data set that you have never seen and get meaningful answers. Queries can return what sets have been defined. Queries can then find where a set is used as the subject or the object of a statement. Semantic queries simply walk across the formal logic that defines the graph, letting the graph itself inform you about possible next steps. This isn’t an option in Structure-First environments because they are not based in formal logic and the schema is encapsulated in a different language from the data.
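As an illustration of querying a data set you have never seen, here is a small sketch assuming the Python rdflib library; the file name is a placeholder:

```python
from rdflib import Graph

g = Graph()
g.parse("some-dataset-you-have-never-seen.ttl", format="turtle")  # placeholder file

# What sets (classes) does this data declare or use?
for (cls,) in g.query("SELECT DISTINCT ?class WHERE { ?s a ?class }"):
    print("class:", cls)

# What kinds of statements (predicates) connect things in this data?
for (pred,) in g.query("SELECT DISTINCT ?p WHERE { ?s ?p ?o }"):
    print("predicate:", pred)
```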

Traditional Structure-First databases are made up of tens to hundreds, often thousands, of tables. Each table is invented and named by the modeler with the goal of containing all attributes of a specific concept. Within each table are columns that are also invented, hopefully with plenty of rigor, but invented all the same. You can prove this to yourself by looking at the lack of standard definitions around simple concepts like address. Some modelers will leverage modeling patterns, some will leverage standards like USPS, but the variability between systems is great and arbitrary.

Semantic technology has enabled the Meaning-First approach with machine-readable definitions to which new attribution can be added in production. At the same time this clarity is added to the data management toolkit, semantic technology sweeps away the nearly infinite collection of complex table-with-column structures with the single, standards-based RDF triple structure. Changing from descriptive to definitional is orders of magnitude clearer. Replacing tables and columns with triples is orders of magnitude simpler. Combining them into a single Meaning-First semantic solution is truly a game changer.

The Data-Centric Revolution: The Role of SemOps (Part 2)

In our previous installment of this two-part series we introduced a couple of ideas.

First, data governance may be more similar to DevOps than first meets the eye.

Second, the rise of Knowledge Graphs, Semantics and Data-Centric development will bring with it the need for something similar, which we are calling “SemOps” (Semantic Operations).

Third, when we peel back what people are doing in DevOps and Data Governance, we get down to five key activities that will be very instructive in our SemOps journey:

  1. Quality
  2. Allowing/ “Permission-ing”
  3. Predicting Side Effects
  4. Constructive
  5. Traceability

We’ll take up each in turn and compare and contrast how each activity is performed in DevOps and Data Governance to inform our choices in SemOps.

But before we do, I want to cover one more difference: how the artifacts scale under management.

Code

There isn’t any obvious hierarchy to code, from abstract to concrete or general to specific, as there is in data and semantics.  It’s pretty much just a bunch of code, partitioned by silos. Some of it you bought, some you built, and some you rent through SaaS (Software as a Service).

Each of these silos represents, often, a lot of code.  Something as simple as QuickBooks is 10 million lines of code.  SAP is hundreds of millions.  Most in-house software is not as bloated as most packages or software services; still, it isn’t unusual to have millions of lines of code in an in-house developed project (much of it is in libraries that were copied in, but it still represents complexity to be managed).  The typical large enterprise is managing billions of lines of code.

The only thing that makes this remotely manageable is, paradoxically, the thing that makes it so problematic: isolating each codebase in its own silo.  Within a silo, the developer’s job is to not introduce something that will break the silo and to not introduce something that will break the often fragile “integration” with the other silos.

Data and Metadata

There is a hierarchy to data that we can leverage for its governance.  The main distinction is between data and metadata.

Figure: the data and metadata pyramid

There is almost always more data than metadata.  More rows than columns.  But in many large enterprises there is far, far more metadata than anyone could possibly guess.  We were privy to a project to inventory the metadata for a large company, which shall go nameless.  At the end of the profiling, it was discovered that there were 200 million columns under management in the sum total of the firm. That is columns, not rows.  No doubt there were billions of rows in all their data.

There are also other levels that people often introduce to help with the management of this pyramid.  People often separate Reference data (e.g., codes and geographies) and Master data (slower changing data about customers, vendors, employees and products).

These distinctions help, but even as the data governance people are trying to get their arms around this, the data scientists show up with “Big Data.”  Think of big data as sitting below the bottom of this pyramid.  Typically, it is even more voluminous, and usually has only the most ad hoc metadata (the “keys” in the “key/value pairs” in the deeply nested JSON data structures are metadata, sort of, but you are left guessing what these short cryptic labels actually mean).

Click here to read more on TDAN.com

DCAF 2020: Second Annual Data-Centric Architecture Forum Re-Cap

Last year, we decided to call the first annual Data Centric Architecture Conference a Forum which resulted in DCAF. It didn’t take long for attendees to start calling the event “decaf” but they were equally quick to point out that the forum was anything but decaf. We had a great blend of presentations ranging from discussions about emerging best practices in the applied semantics profession to mind-blowing vendor demos. Our stretch goals from last year included growing the number of attendees, seeing more data-centric vendors, and exploring security and privacy. These were met and exceeded, and we’re on track to set even loftier stretch-goals for next year.

Throughout the Data-Centric Architecture Forum presentations, we were particularly impressed by the blockchain data security presentation by Brian Platz at https://flur.ee/. Semantic tech is an obvious choice for organizations wishing to become data centric, but we often have to rely on security frameworks that work for legacy platforms. It was exciting to see a platform that addresses security in a way that is highly compatible with semantics. They also provide a solid architecture that is consistent with the goals of the DCA, regardless of whether their clients choose to go with more traditional relational configurations, or semantic configurations.

We welcomed returning attendees from Lymba, showcasing some of the project work they’ve done while partnering with Semantic Arts. Mark Van Berkel from Schema App built an architecture based on outcomes from last year’s Data Centric Architecture Conference. It’s amazing what a small team can do in a short amount of time when they’re operating free from corporate constraints.

One of our concerns with growing the number of participants was that we would lose the energy of the room, the level of comfort in sharing ideas and networking across unspoken professional barriers (devs vs product? Not here!). Everyone was set up to learn from these presentations. The group was intimate enough that presenters could engage directly with the audience, which included developers, other vendors, and practitioners in the field of semantics. We made every effort to keep presentations on target and to keep audience participation smoothly moderated, so coffee breaks were fertile ground for discussions and networking. So much of this conversation grew organically that we at Semantic Arts decided to open virtual forums to continue the discussions.

You can join us on these channels at:
LinkedIn group
Estes Park Group

While we’re on the topic of goals, here’s what we envision for next year’s Data-Centric Architecture Forum:
• Continuing with our mindset of growth – we want to see vendors bring the clients who can showcase the best their tools and products have to offer. Success stories and challenges welcome.
• Academic interests – not that this is going to be a job fair, but Fort Collins IS a college town, just sayin’. Also, to that point, how do we recruit? What does it take to be a DCAF professional? What are you (vendors and clients) looking for when you want to build teams that can work on transformative tech?
• Continuing with our mindset of transparency, learning, and vulnerability. We still have to really solve the issue of security and privacy; how do we do that when we’re all about sharing data? What are our blind-spots as a profession?

Decades, Planets and Marriage

Google ontologist Denny Vrandečić started a vigorous thread on the question of what constitutes a decade. See, for example, the article “People Can’t Even Agree On When The Decade Ends”. This is a re-emergence of the question from 20 years ago about whether the new millennium would start on January 1 of 2000 or 2001. This is often posed as a mathematical conundrum, and math certainly plays a role here, but I think it’s more about terminology than it is about math. It reminds me of the question of whether Pluto is a planet. It is also relevant to ontologists.

The decade question is whether the 2020s did start on January 1, 2020 or will start on January 1, 2021. Denny noted that: “The job of an ontologist is to define concepts”. This is true, but ontologists often have to perform careful analysis to identify what the concepts are that really matter. Denny continued: “There are two ways to count calendar decades…”. I would put it differently and say: “The term ‘calendar decade’ is used to refer to at least two different concepts.”

At last count, there were 72 comments arguing exactly why one way or the other is correct. The useful and interesting part of that discussion centers on identifying the nuanced differences between those two different concepts. The much less interesting part is arguing over which of these concepts deserves to be blessed with the term ‘calendar decade’. The latter is a social question, not an ontological question.

This brings us to Pluto. The interesting thing from an ontology perspective is to identify the various characteristics of bodies revolving around the sun, and then to identify which sets of characteristics correspond to important concepts that are worthy of having names. Finally, names have to be assigned: e.g. asteroid, moon, planet. The problem is that the term ‘planet’ was initially used fairly informally to refer to one set of characteristics, and it was later assigned to a different, more precisely defined set of characteristics that scientists deemed more useful than the first. And so the term ‘planet’ now refers to a slightly different concept than it did before. The social uproar happened because the new concept no longer included Pluto.

A more massive social as well as political uproar arose in the past couple of decades around the term, ‘marriage’. The underlying ontological issues are similar. What are the key characteristics that constitute a useful concept that deserves a name? It used to be generally understood that a marriage was between a man and a woman, just like it used to be generally understood what a planet was. But our understanding and recognition of what is, should or could be, changes over time and so do the sets of characteristics that we think are deserving of a name.

The term planet was given a more restricted meaning, which excluded Pluto. The opposite was being argued in the case of marriage. People wanted a gender-neutral concept for a committed relationship; it was less restrictive. The term ‘marriage’ began to be used to include same-gender relationships.

I am aware that there are important differences between the decades, planets and marriages – but in all three cases, there are arguments about what the term should mean. Ironically and misnomeristically (if that’s a word), we refer to the worrying about what to call things as “just semantics”. Use of this phrase implies a terms-first perspective, i.e. you have a term, and you want to decide what it should mean. As an ontologist, I find it much more useful to identify the concepts first, and think of good terms afterwards. I wrote a series of blogs about this a few years ago.

What is my position on the decade question? If I were king, I would use the term ‘decade’ to refer to the set of years that start with the same three digits. Why? Maybe for the same reason that watching my car odometer change from 199999 to 200000 is more fun than watching it change from 200000 to 200001. The other proposed meaning for ‘calendar decade’ is not very interesting to me, so I would not bother to give it any name. But your mileage may vary.

Meaning-First Data Modeling, A Radical Return to Simplicity

Person uses language. Person speaks language. Person learns language. We spend the early years of life learning vocabulary and grammar in order to generate and consume meaning. As a result of constantly engaging in semantic generation and consumption, most of us are semantic savants. This Meaning-First approach is our default until we are faced with capturing meaning in databases. We then revert to the Structure-First approach that has been beaten into our heads since Codd invented the relational model in 1970. This blog post presents Meaning-First data modeling for semantic knowledge graphs as a replacement for Structure-First modeling. The relational model was a great start for data management, but it is time to embrace a radical return to simplicity: Meaning-First data modeling.

This is a semantic exchange, me as a writer and you as a reader. The semantic mechanism by which it all works is comprised of a subject-predicate-object construct. The subject is a noun to which the statement’s meaning is applied. The predicate is the verb, the action part of the statement. The object is also generally a noun, the focus of the action. These three parts are the semantic building blocks of language and the focus of this post, semantic knowledge graphs.

In Meaning-First semantic data models the subject-predicate-object construct  is called a triple, the foundational structure upon which semantic technology is built. Simple facts are stated with these three elements, each of which is commonly surrounded by angle brackets. The first sentence in this post is an example triple. <Person> <uses> <language>. People will generally get the same meaning from it. Through life experience, people have assembled a working knowledge that allows us to both understand the subject-predicate-object pattern as well as what people and language are. Since computers don’t have life experience, we must fill in some details to allow this same understanding to be reached. Fortunately, a great deal of this work has been done by the World Wide Web Consortium (W3C) and we can simply leverage those standards.

Modeling the triple “Person uses Language” as in Figure 1, a triple diagram using arrows and ovals, is a good start. Tightening the model by adding formal definitions makes it more robust and less ambiguous. These definitions come from gist, Semantic Arts’ minimalist upper-level ontology. The subject, <Person>, is defined as “A Living Thing that is the offspring of some Person and that has a name.” The object, <Language>, is defined as “A recognized, organized set of symbols and grammar”. The predicate, <uses>, isn’t defined in gist, but could be defined as something like “Engages with purpose”. It is the action linking <Person> to <Language> to create the assertion about Person. Formal definitions for subjects and objects are useful because they are mathematically precise. They can be used by semantic technologies to reach the same conclusions as can a person with working knowledge of these terms.

Figure 1, Triple diagram

Surprise! This single triple is (almost) an ontology. This is almost an ontology because it contains formal definitions and is in the form of a triple. Almost certainly, it is the world’s smallest ontology, and it is missing a few technical components, but it is a good start on an ontology all the same. The missing components come from standards published by the W3C which won’t be covered in detail here. To make certain the progression is clear, a quick checkpoint is in order. These are the assertions so far:

  • A triple is made up of a <Subject>, a <Predicate>, and an <Object>.
  • <Subjects> are always Things, e.g. something with independent existence including ideas.
  • <Predicates> create assertions that
    • Connect things when both the Subject and Object are things, or
    • Make assertions about things when the Object is a literal
  • <Objects> can be either
    • Things or
    • Literals, e.g. a number or a string

These assertions summarize the Resource Description Framework (RDF) model. RDF is a language for representing information about resources in the World Wide Web. Resource refers to anything that can be returned in a browser. More generally, RDF enables Linked Data (LD) that can operate on the public internet or privately within an organization. It is the simple elegance embodied in RDF that enables Meaning-First Data Modeling’s radically powerful capabilities. It is also virtually identical to the linguistic building blocks that enabled cultural evolution: subject, predicate, object.
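To make the checkpoint concrete, here is a tiny sketch of both kinds of triples using the Python rdflib library; the namespace and terms are made up:

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.com/ex#")  # invented namespace
g = Graph()

# Object is a Thing: the triple connects two nodes.
g.add((EX.Person, EX.uses, EX.Language))

# Object is a literal: the triple makes an assertion about a thing.
g.add((EX.Mark, EX.hasName, Literal("Mark")))

print(g.serialize(format="turtle"))
```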

Where RDF defines the framework for the triple, RDF Schema (RDFS) provides a data-modeling vocabulary for building with RDF triples. RDFS is an extension of the basic RDF vocabulary and is leveraged by higher-level languages such as the Web Ontology Language (OWL) and the Dublin Core Metadata Initiative terms (DCTERMS). RDFS supports constructs for declaring that resources, such as Living Thing and Person, are classes. It also enables establishing subclass relationships between classes so the computer can make sense of the formal Person definition: “A Living Thing that is the offspring of some Person and that has a name.”

Here is a portion of the schema supporting the opening statement in this post, “Person uses Language”. For simplicity, the ‘has name’ portion of the definition has been omitted from this diagram, but it will show up later.

Figure 2, RDFS subclass property

Figure 2 shows the RDFS subClassOf property as a named arrow connecting two ovals. This model is correct in that it shows the subClassOf property, yet it isn’t quite satisfying. Perhaps it is even a bit ambiguous because, through the lens of traditional, Structure-First data modeling, it appears to show two tables with a connecting relationship.

Nothing could be further from the truth.

There are two meanings here, and they are not connected structures. The Venn diagram in Figure 3 more clearly shows that the Person set is wholly contained within the set of all Living Things, so a Person is also a Living Thing.

Figure 3, RDFS subClassOf Venn diagram

There is no structure separating them. They are in fact both in one single structure: a triple store. They are differentiated only by the meaning found in their formal definitions, which create membership criteria for two different sets. The first set is all Living Things. The second set, wholly embedded within the set of all Living Things, is the set of all Living Things that are also the offspring of some Person and that have a name. Person is a more specific set whose criteria cause a Living Thing to be a member of the Person set while remaining a member of the Living Things set.

Rather than Structure-First modeling, this is Meaning-First modeling built upon the triple defined by RDF with the schema articulated in RDFS. There is virtually no structure beyond the triple. All the triples, content and schema, commingle in one space called a triple store.

Figure 4, Complete schema

Here is some informal data along with the simple ontology’s model:

Schema:

  • <Person> <uses> <Language>

Content:

  • <Mark> <uses> <English>
  • <Boris> <uses> <Russian>
  • <Rebecca> <uses> <Java>
  • <Andrea> <uses> <OWL>

Contained within this sample data lies a demonstration of the radical simplicity of Meaning-First data modeling. There are two subclasses in the data content not currently modeled in the schema, yet they don’t violate the schema.

Figure 5, Updated Language Venn diagram

Figure 5 shows the subclasses added to the schema after they have been discovered in the data. This can be done in a live, production setting without breaking anything! In a Structure-First system, new tables and joins would need to be added to accommodate this type of change, at great expense and over a long period of time. This example just scratches the surface of the radical simplicity of Meaning-First data modeling.
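As a rough sketch of that in-production change, here is one way it could look using the Python rdflib library; the class names NaturalLanguage and ComputerLanguage are hypothetical and this is not the article’s actual schema:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

EX = Namespace("http://example.com/lang#")  # invented namespace
g = Graph()

# Schema and content commingle as triples in one graph (a triple store).
g.add((EX.Person, EX.uses, EX.Language))   # schema-level assertion
g.add((EX.Mark, EX.uses, EX.English))      # content
g.add((EX.Rebecca, EX.uses, EX.Java))      # content

# Later, in production, the newly discovered distinctions are added as a few
# more triples. Nothing is rebuilt and nothing breaks.
g.add((EX.NaturalLanguage, RDFS.subClassOf, EX.Language))
g.add((EX.ComputerLanguage, RDFS.subClassOf, EX.Language))
g.add((EX.English, RDFS.subClassOf, EX.NaturalLanguage))
g.add((EX.Java, RDFS.subClassOf, EX.ComputerLanguage))
```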

Stay tuned for the next installment and a deeper dive into Meaning-First vs Structure-First data modeling!

Facet Math: Trim Ontology Fat with Occam’s Razor

At Semantic Arts we often come across ontologies whose developers seem to take pride in the number of classes they have created, giving the impression that more classes equate to a better ontology. We disagree with this perspective and as evidence, point to Occam’s Razor, a problem-solving principle that states, “Entities should not be multiplied without necessity.” More is not always better. This post introduces Facet Math and demonstrates how to contain runaway class creation during ontology design.

Semantic technology is suited to making complex information intellectually manageable and huge class counts are counterproductive. Enterprise data management is complex enough without making the problem worse. Adding unnecessary classes can render enterprise data management intellectually unmanageable. Fortunately, the solution comes in the form of a simple modeling change.

Facet Math leverages core concepts and pushes fine-grained distinction to the edges of the data model. This reduces class counts and complexity without losing any informational fidelity. Here is a scenario that demonstrates spurious class creation in the literature domain. Since literature can be sliced many ways, it is easy to justify building in complexity as data structures are designed. This example demonstrates a typical approach and then pivots to a more elegant Facet Math solution.

A taxonomy is a natural choice for the literature domain. To get to each leaf, the whole path must be modeled adding a multiplier with each additional level in the taxonomy. This case shows the multiplicative effect and would result in a tree with 1000 leaves (10*10*10) assuming it had:
10 languages
10 genres
10 time periods

Taxonomies typically are not that regular though they do chart a path from the topmost concept down to each leaf. Modelers tend to model the whole path which multiplies the result set. Having to navigate taxonomy paths makes working with the information more difficult. The path must be disassembled to work with the components it has aggregated.

This temptation to model taxonomy paths into classes and/or class hierarchies creates a great deal of complexity. The languages, genres, and time periods in the example are really literature categories. This is where Facet Math kicks in, taking an additive approach by designing them as distinct categories. Using those categories for faceted search and dataset assembly returns all the required data. Here is how it works.


To apply Facet Math, remove the category duplication from the original taxonomy by refactoring them as category facets. The facets enable exactly the same data representation:
10 languages
10 genres
10 time periods

By applying Facet Math principles, the concept count drops dramatically. Where the paths multiplied to produce 1,000 concepts, the facets simply add up to 30, a more than thirtyfold reduction.
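A tiny worked example of the arithmetic, in Python, using the counts from the scenario above:

```python
# Path-based (multiplicative) modeling: one class per full taxonomy path.
languages, genres, periods = 10, 10, 10
path_classes = languages * genres * periods    # 1,000 leaf classes

# Facet-based (additive) modeling: one small category list per facet.
facet_concepts = languages + genres + periods  # 30 concepts

# A single work is then described by tagging it with one value per facet,
# instead of filing it under a pre-built leaf class.
work = {
    "title": "Don Quixote",
    "language": "Spanish",
    "genre": "Novel",
    "period": "17th century",
}

print(path_classes, facet_concepts)  # 1000 30
```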

Sure, this is a simple example. Looking at a published ontology might be more enlightening.

SNOMED (Systematized Nomenclature of Medicine—Clinical Terms) ontology is a real-world example.

Since the thesis here is looking at fat reduction, here is the class hierarchy in SNOMED to get from the topmost class to Gastric Bypass.

Notice that Procedure appears in four levels, while Anastomosis and Stomach each appear in two. This hierarchy is a path containing paths.

SNOMED’s maximum class hierarchy depth is twenty-seven. Given the multiplicative effect shown above in the first example, SNOMED having 357,533 classes, while disappointing, is not surprising. The medical domain is highly complex but applying Facet Math to SNOMED would surely generate some serious weight reduction. We know this is possible because we have done it with clients. In one case Semantic Arts produced a reduction from over one hundred fifty thousand concepts to several hundred without any loss in data fidelity.

Bloated ontologies contain far more complexity than is necessary. Humans cannot possibly memorize a hundred thousand concepts, but several hundred are intellectually manageable. Computers also benefit from reduced class counts. Machine Learning and Artificial Intelligence applications have fewer, more focused concepts to work with so they can move through large datasets more quickly and effectively.

It is time to apply Occam’s Razor and avoid creating unnecessary classes. It is time to design ontologies using Facet Math.

Property Graphs: Training Wheels on the way to Knowledge Graphs

I’m at a graph conference. The general sense is that property graphs are much easier to get started with than Knowledge Graphs. I wanted to explore why that is, and whether it is a good thing.

It’s a bit of a puzzle to us; we’ve been using RDF and the Semantic Web stack for almost two decades, and it seems intuitive, but talking to people new to graph databases there is a strong preference for property graphs (at this point primarily Neo4J and TigerGraph, but there are others). – Dave McComb

Property Graphs

A knowledge graph is a database that stores information as a directed graph (a digraph), in which each link simply connects two nodes.


The nodes self-assemble (if they have the same value) into a more complete and more interesting graph.


What makes a graph a “property graph” (also called a “labeled property graph”) is the ability to have values on the edges.

Either type of graph can have values on the nodes; in a Knowledge Graph these are handled with a special kind of edge called a “datatype property.”


Here is an example of one of the typical uses for values on the edges (the date the edge was established).  As it turns out, this canonical example isn’t a very good one: in most databases, graph or otherwise, a purchase would be a node with many other complex relationships.

A better use of dates on the edges in property graphs is for what we call a “durable temporal relation.” There are some relationships that exist for a long time, but not forever, and depending on the domain they are often modeled as edges with effective start and end dates (ownership, residence, and membership are examples of durable temporal relations that map well to dates on the edges).

The other big use case for values on the edges is network analytics, which we’ll cover below.

The Appeal of Property Graphs

Talking to people and reading white papers, it seems the appeal of Property Graph databases lies in these areas:

  • Closer to what programmers are used to
  • Easy to get started
  • Cool Graphics out of the box
  • Attributes on the edges
  • Network Analytics

Property Graphs are Closer to What Programmers are Used to

The primary interfaces to Property Graphs are JSON-style APIs, which developers are comfortable with and find easy to adapt to.


Easy to Get Started

Neo4J in particular has done a very good job of getting people set up, running, and productive in short order.  There are free versions to get started with, and well-exercised data sets to get up and going rapidly. This is very satisfying for people getting started.


Cool Graphics Out of the Box

One of the striking things about Neo4J is its beautiful graphics.


You can rapidly get graphics that often have never been seen in traditional systems, and this draws in the attention of sponsors.

Property Graphs have Attributes on the Edges

Perhaps the main distinction between Property Graphs and RDF Graphs is the ability to add attributes to the edges in the network.  In this case the attribute is a rating (this isn’t a great example, but it was the best one I could find easily).


One of the primary use cases for attributes on the edges is weights that are used in the evaluation of network analytics.  For instance, a network representation of how to get from one town to another might include a number of alternate sub-routes through different towns or intersections.  Each edge would represent a segment of a possible journey.  By putting weights on each edge that represent distance, a network algorithm could calculate the shortest path between two towns.  By putting weights on the edges that represent average travel time, a network algorithm could calculate the route that would take the least time.
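Here is a small, hedged sketch of that weighted-path idea, independent of any particular property graph product, using the Python networkx library; the towns and weights are invented:

```python
import networkx as nx

# Each edge is a road segment; the weights are the attributes on the edges.
G = nx.Graph()
G.add_edge("TownA", "TownB", distance=30, minutes=25)
G.add_edge("TownB", "TownC", distance=20, minutes=40)
G.add_edge("TownA", "TownD", distance=45, minutes=35)
G.add_edge("TownD", "TownC", distance=15, minutes=15)

# Shortest route by distance versus fastest route by average travel time.
print(nx.shortest_path(G, "TownA", "TownC", weight="distance"))
print(nx.shortest_path(G, "TownA", "TownC", weight="minutes"))
```

The same edges, consulted with different weights, yield different routes.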

Other use cases for attributes on the edges include temporal information (when did this edge become true, and when was it no longer true), certainty (you can rate the degree of confidence you have in a given link and in some cases only consider links that exceed some certainty value), and popularity (you could implement the PageRank algorithm with weights on the edges, but I think it might be more appropriate to put the weights on the nodes).

Network Analytics

There is a wide range of network analytics that comes out of the box with property graphs.  Many do not require attributes on the edges; for instance, “clustering” and “strength of weak ties” analyses can be done without attributes on the edges.


However, many of the network analytics algorithms can take advantage of and gain from weights on the edges.

Property Graphs: What’s Not to Like

That is a lot of pluses on the Property Graph side, and it explains their meteoric rise in popularity.

Our contention is that when you get beyond the initial analytic use case, you will find yourself needing to reinvent a great body of work that already exists and has long been standardized.  At that point, if you have over-committed to Property Graphs you will find yourself in a quandary, whereas if you positioned Property Graphs as a stepping stone on the way to Knowledge Graphs you will save yourself a lot of unnecessary work.

Property Graphs, What’s the Alternative?

The primary alternative is an RDF Knowledge Graph.  This is a graph database using the W3C’s standards stack including RDF (resource description framework) as well as many other standards that will be described below as they are introduced.

The singular difference is that the RDF Knowledge Graph standards were designed for interoperability at web scale.  As such, all identifiers are globally unique, and potentially discoverable and resolvable.  This is a gigantic advantage when using knowledge graphs as an integration platform, as we will cover below.

Where You’ll Hit the Wall with Property Graphs

There are a number of capabilities, we assume you’ll eventually want to add on to your Property Graph stack, such as:

  • Schema
  • Globally Unique Identifiers
  • Resolvable identifiers
  • Federation
  • Constraint Management
  • Inference
  • Provenance

Our contention is that you could in principle add all this to a property graph, and over time you will indeed be tempted to do so.  However, doing so is a tremendous amount of work and high risk, and even if you succeed you will have a proprietary, home-grown version of capabilities that already exist, are standardized, and have been used in large-scale production systems.

As we introduce each of these capabilities that you will likely want to add to your Property Graph stack, we will describe the open standards approach that already covers it.

Schema

Property Graphs do not have a schema.  While big data lauded the idea of “schema-less” computing, the truth is that completely removing schema means a number of functions previously performed by the schema have moved somewhere else, usually code. In the case of Property Graphs, the nearest equivalent to a schema is the “label” in “Labeled Property Graph.” But as the name suggests, this is just a label, essentially like putting a tag on something.  So you can label a node as “Person,” but that tells you nothing more about the node.  It’s easier to see how limited this is when you label a node a “Vanilla Swap” or “Miniature Circuit Breaker.”

Knowledge Graphs have very rich and standardized schema.  One of the ways they give you the best of both worlds is that, unlike relational databases, they do not require all schema to be present before any data can be persisted. At the same time, when you are ready to add schema to your graph, you can do so with a high degree of rigor and go into as much or as little detail as necessary.

Globally Unique Identifiers

The identifiers in Property Graphs are strictly local.  They don’t mean anything outside the context of the immediate database.  This is a huge limitation when looking to integrate information across many systems and especially when looking to combine third party data.

Knowledge Graphs are based on URIs (really IRIs).  Uniform Resource Identifiers (and their Unicode equivalent and superset, Internationalized Resource Identifiers) are a lot like URLs, but instead of identifying a web location or page, they identify a “thing.” In best practice (which is to say for 99% of all the extant URIs and IRIs out there) the URI/IRI is based on a domain name.  This delegation of identifier assignment to the organizations that own the domain names allows relatively simple identifiers that are not in danger of being mistakenly duplicated.

Every node in a knowledge graph is assigned a URI/IRI, including the schema or metadata.  This makes discovering what something means as simple as “following your nose” (see the next section).

Resolvable Identifiers

Because URI/IRIs are so similar to URLs, and indeed in many situations are URLs, it is easy to resolve any item.  Clicking on a URI/IRI can redirect to a server in the domain name of the URI/IRI, which can then render a page that represents the resource.  In the case of a schema/metadata URI/IRI the page might describe what the metadata means.  This typically includes both the “informal” definition (comments and other annotations) as well as the “formal” definition (described below).

For a data URI/IRI the resolution might display what is known about the item (typically the outgoing links), subject to security restrictions implemented by the owner of the domain.  This style of exploring a body of data by clicking on links is called “following your nose” and is a very effective way of learning a complex body of knowledge, because unlike traditional systems you do not need to know the whole schema in order to get started.

Property Graphs have no standard way of doing this.  Anything that is implemented is custom for the application at hand.

Federation

Federation refers to the ability to query across multiple databases to get a single comprehensive result set.  This is almost impossible to do with relational databases.  No major relational database vendor will execute queries across multiple databases and combine the results (and the result generally wouldn’t make any sense anyway, as the schemas are never the same).  The closest thing in traditional systems is the Virtual Data P***, which allows some limited aggregation of harmonized databases.

Property Graphs also have no mechanism for federation over more than a single in-memory graph.

Federation is built into SPARQL (the W3C standard for querying “triple stores” or RDF based Graph Databases).  You can point a SPARQL query at a number of databases (including relational databases that have been mapped to RDF through another W3C standard, R2RML).
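As an illustrative sketch of federation with SPARQL’s SERVICE keyword (the endpoint URL, file name, and terms are placeholders), assuming an engine that implements SPARQL 1.1 federated query, as most triple stores and rdflib’s engine do:

```python
from rdflib import Graph

g = Graph()
g.parse("local-data.ttl", format="turtle")  # placeholder local file

# One query spanning the local graph and a remote SPARQL endpoint.
results = g.query("""
    PREFIX ex: <http://example.com/ont#>
    SELECT ?item ?externalFact WHERE {
        ?item a ex:Product .                      # answered from the local graph
        SERVICE <https://example.org/sparql> {    # answered by the remote endpoint
            ?item ex:externalRating ?externalFact .
        }
    }
""")
for row in results:
    print(row)
```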

Constraint Management

One of the things needed in a system that is hosting transactional updates is the ability to enforce constraints on incoming transactions.  Suffice it to say, Property Graphs have no transaction mechanism and no constraint management capability.

Knowledge Graphs have the W3C standard SHACL (SHapes Constraint Language) for specifying constraints in a model-driven fashion.
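Here is a minimal sketch of a SHACL shape and a validation run, assuming the Python rdflib and pyshacl libraries; the namespace and the shape itself are invented for illustration:

```python
from rdflib import Graph
from pyshacl import validate

# A shape: every ex:Person must have at least one string-valued ex:hasName.
shapes = Graph().parse(data="""
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix ex:  <http://example.com/ont#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [
        sh:path ex:hasName ;
        sh:minCount 1 ;
        sh:datatype xsd:string ;
    ] .
""", format="turtle")

# Incoming data that violates the shape (no name).
data = Graph().parse(data="""
@prefix ex: <http://example.com/ont#> .
ex:Mark a ex:Person .
""", format="turtle")

conforms, report_graph, report_text = validate(data, shacl_graph=shapes)
print(conforms)      # False
print(report_text)   # human-readable constraint violation report
```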

Inference

Inference is the creation of new information from existing information.  A Property Graph analysis produces a number of “insights,” which are a form of inference, but that inference exists only in the heads of the people running the analytics and interpreting what the insight means.

Knowledge Graphs have several inference capabilities.  What they all share is that the result of the inference is rendered as another triple (the inferred information is another fact which can be expressed as a triple).  In principle almost any fact that can be asserted in a Knowledge Graph can also be inferred, given the right contextual information.  For instance, we can infer that a class is a subclass of another class.  We can infer that a node has a given property, or that two nodes represent the same real-world item, and each of these inferences can be “materialized” (written) back to the database.  This makes any inferred fact available to any human reviewing the graph and any process that acts on the graph, including queries.

Two of the prime creators of inferred knowledge are RDFS and OWL, the W3C standards for schema.  RDFS provides the simple sort of inference that people familiar with Object Oriented programming will recognize, primarily the ability to infer that a node that is a member of a class is also a member of any of its superclasses.  A bit newer to many people is the idea that properties can have superproperties, and that leads to inference at the instance level.  If you make the assertion that you have a mother (property :hasMother) Beth, and then declare :hasParent to be a superproperty of :hasMother, the system will infer that you :hasParent Beth, and this process can be repeated by making :hasAncestor a superproperty of :hasParent. The system can infer and persist this information.
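A small sketch of that superproperty inference, assuming the Python rdflib library plus the owlrl package to materialize RDFS entailments; the namespace and individuals are made up:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS
import owlrl

EX = Namespace("http://example.com/family#")  # invented namespace
g = Graph()

# Schema: hasMother is a subproperty of hasParent, which is a subproperty of hasAncestor.
g.add((EX.hasMother, RDFS.subPropertyOf, EX.hasParent))
g.add((EX.hasParent, RDFS.subPropertyOf, EX.hasAncestor))

# One asserted fact.
g.add((EX.You, EX.hasMother, EX.Beth))

# Materialize the RDFS entailments back into the same graph as ordinary triples.
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

print((EX.You, EX.hasParent, EX.Beth) in g)    # True, inferred
print((EX.You, EX.hasAncestor, EX.Beth) in g)  # True, inferred
```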

OWL (the Web Ontology Language, for dyslexics) allows for much more complex schema definitions.  OWL allows you to create class definitions from Boolean combinations of other classes, and allows the formal definition of classes by creating membership criteria based on what properties are attached to nodes.

If RDFS and OWL don’t provide sufficient rigor and/or flexibility, there are two other options, both rule languages, and both will render their inferences as triples that can be returned to the triple store.  RIF (the Rule Interchange Format) allows inference rules defined in terms of “if / then” logic.  SPARQL, the above-mentioned query language, can also be used to create new triples that can be written back to the triple store.
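And a sketch of SPARQL creating new triples that are written back to the store, using rdflib’s SPARQL Update support; the rule and names are invented:

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.com/family#")
g = Graph()
g.add((EX.You, EX.hasParent, EX.Beth))
g.add((EX.Beth, EX.hasParent, EX.Rose))

# A rule written as SPARQL Update: a parent's parent is an ancestor.
g.update("""
    PREFIX ex: <http://example.com/family#>
    INSERT { ?x ex:hasAncestor ?z }
    WHERE  { ?x ex:hasParent ?y . ?y ex:hasParent ?z }
""")

print((EX.You, EX.hasAncestor, EX.Rose) in g)  # True, now stored as a triple
```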

Provenance

Provenance is the ability to know where any atom of data came from.  There are two provenance mechanisms in Knowledge Graphs.  For inferences generated from RDFS or OWL definitions, there is an “explain” mechanism, which is described in the standards as “proof.” In the same spirit as a mathematical proof, the system can reel out the assertions, including schema-based definitions as well as data-level assertions, that led to the provable conclusion of the inference.

For data that did not come from inference (data that was input by a user, purchased, or created through some batch process), there is a W3C standard called PROV-O (the provenance ontology) that outlines a standard way to describe where a data set, or even an individual atom of data, came from.

Property Graphs have nothing similar.
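For a rough illustration of PROV-O (the prov: namespace is the real W3C one; the dataset and activity names are invented), using the Python rdflib library:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")  # the W3C PROV-O namespace
EX = Namespace("http://example.com/data#")      # invented namespace

g = Graph()
dataset = EX.Q3SalesFigures
load = EX.NightlyBatchLoad_2020_06_30

# Where did this data set come from, and what produced it?
g.add((dataset, RDF.type, PROV.Entity))
g.add((load, RDF.type, PROV.Activity))
g.add((dataset, PROV.wasGeneratedBy, load))
g.add((dataset, PROV.wasDerivedFrom, EX.SourceSystemExtract))
g.add((load, PROV.endedAtTime,
       Literal("2020-06-30T02:00:00", datatype=XSD.dateTime)))
```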

Convergence

The W3C held a conference to bring together the labeled property graph camp with the RDF knowledge graph camp in Berlin in March of 2019.

One of our consultants attended and has been tracking the aftermath.  One promising path is RDF* which is being mooted as a potential candidate to unify the two camps.  There are already several commercial implementations supporting RDF*, even though the standard hasn’t even begun its journey through the approval process. We will cover RDF* in a subsequent white paper.

Summary

Property Graphs are easy to get started with.  People think RDF based Knowledge Graphs are hard to understand, complex and hard to get started with. There is some truth to that characterization.

The reason we made the analogy to “training wheels” (or “stepping stones” in the middle of the article) is to acknowledge that riding a bike is difficult.  You may want to start with training wheels.  However, as you become proficient with the training wheels, you may consider discarding them rather than enhancing them.

Most of our clients start directly with Knowledge Graphs, but we recognize that that isn’t the only path.  Our contention is that a bit of strategic planning up front, outlining where this is likely to lead, gives you a lot more runway.  You may choose to do your first graph project using a property graph, but we suspect that sooner or later you will want to get beyond the first few projects and will want to adopt an RDF / Semantic Knowledge Graph based system.

Toss Out Metadata That Does Not Bring Joy

As I write this, I can almost hear you wail “No, no, we don’t have too much metadata, we don’t have nearly enough!  We have several projects in flight to expand our use of metadata.”

Sorry, I’m going to have to disagree with you there.  You are on a fool’s errand that will just provide busy work and will have no real impact on your firm’s ability to make use of the data it has.

Let me tell you what I have seen in the last half dozen or so very large firms I’ve been involved with, and you can tell me if this rings true for you.  If you are in a mid-sized or even small firm you may want to divide these numbers by an appropriate denominator, but I think the end result will remain the same.

Most large firms have thousands of application systems.  Each of these systems has a data model that consists of hundreds of tables and many thousands of columns.  Complex applications, such as SAP, explode these numbers (a typical SAP install has populated 90,000 tables and a half million columns).

Even as we speak, every vice president with a credit card is unknowingly expanding their firm’s data footprint by implementing suites of SaaS (Software as a Service) applications.  And let’s not even get started on your Data Scientists.  They are rabidly vacuuming up every dataset they can get their hands on, in the pursuit of “insights.”

Naturally you are running out of space, and especially system admin bandwidth in your data centers, so you turn to the cloud.  “Storage is cheap.”

This is where the Marie Kondo analogy kicks in.  As you start your migration to the cloud (or to your Data Lake, which may or may not be in the cloud), you decide “this would be a good time to catalog all this stuff.”  You launch into a project with the zeal of a Property and Evidence Technician at a crime scene: “Let’s carefully identify and tag every piece of evidence.”  The advantage they have, and you don’t, is that their world is finite.  You are faced with cataloging billions of pieces of metadata.  You know you can’t do it alone, so you implore the people who are putting the data in the Data Swamp (er, Lake).  You mandate that anything that goes into the lake must have a complete catalog.  Pretty soon you notice that the people putting the data in don’t know what it is either.  And they know most of it is crap, but there are a few good nuggets in there.  If you require them to have descriptions of each data element, they will copy the column heading and call it a description.

Let’s just say, hypothetically, you succeeded in getting a complete and decent catalog for all the datasets in use in your enterprise.  Now what?

Click here to read more on TDAN.com

The Flagging Art of Saying Nothing

Who doesn’t like a nice flag? Waving in the breeze, reminding us of who we are and what we stand for. Flags are a nice way of providing a rallying point around which to gather and show our colors to the world. They are a way of showing membership in a group, or providing a warning. Which is why it is so unfortunate when we find flags in a data management system, because there they are reduced to saying nothing. Let me explain.

When we see Old Glory, we instantly know it is emblematic of the United States. We also instantly recognize the United Kingdom’s emblematic Union Jack and Canada’s Maple Leaf Flag. Another type of flag is a Warning flag alerting us to danger. In either case, we have a clear reference to what the flag represents. How about when you look at a data set and see ‘Yes’, or ‘7’? Sure, ‘Yes’ is a positive assertion and 7 is a number, but those are classifications, not meaning. Yes what? 7 what? There is no intrinsic meaning in these flags. Another step is required to understand the context of what is being asserted as ‘Yes’. Numeric values have even more ambiguity. Is it a count of something, perhaps 7 toasters? Is it a ranking, 7th place? Or perhaps it is just a label, Group 7?

In data systems the number of steps required to understand a value’s meaning is critical, both for reducing ambiguity and, more importantly, for increasing efficiency. An additional step is required to understand that ‘Yes’ means ‘needs review’, so the processing steps have doubled just to extract its meaning. In traditional systems, the two-step flag dance is required because two steps were required to capture the value. First a structure has to be created to hold the value, the ‘Needs Review’ column. Then a value must be placed into that structure. More often than not, an obfuscated name like ‘NdsRvw’ is used, which requires a third step to understand what that means. Only when the structure is understood can the value and meaning the system designer was hoping to capture be deciphered.

In cases where the value that should be contained in the structure isn’t known, a NULL value is inserted as a placeholder. That’s right, a value literally saying nothing. Traditional systems are built structure first, content second. First the schema, the structure definition, gets built. Then it is populated with content. The meaning of the content may or may not survive the contortions required to stuff it into the structure, but it gets stuffed in anyway in the hope it can be deciphered later when extracted for a given purpose. In situations where there is a paucity of data, there is a special name for a structure that largely says nothing: sparse tables. These are tables known to likely contain only a very few of the possible values, but the structure still has to be defined before the rare-case values actually show up. Sparse tables are like requiring you to have a shoe box for every type of shoe you could possibly ever own even though you actually only own a few pairs.

Structure-first thinking is so embedded in our DNA that we find it inconceivable that we can manage data without first building the structure. As a result, flag structures are often put in to drive system functionality. Logic then gets built to execute the flag dance, and it gets executed every time an interaction with the data occurs. The logic says something like this:
IF this flag DOESN’T say nothing
THEN do this next thing
OTHERWISE skip that next step
OR do something else completely.
Sadly, structure-first thinking requires this type of logic to be in place. The NULL placeholders are a default value to keep the empty space accounted for, and there has to be logic to deal with them.

Semantics, on the other hand, is meaning-first thinking. Since there is no meaning in NULL, there is no concept of storing NULL. Semantics captures meaning by making assertions. In semantics we write code that says “DO this with this data set.” No IF-THEN logic, just DO this and get on with it. Here is an example of how semantics maintains the fidelity of our information without having vacuous assertions.

The system can contain an assertion that the Jefferson contract is categorized as ‘Needs Review’ which puts it into the set of all contracts needing review. It is a subset of all the contracts. The rest of the contracts are in the set of all contracts NOT needing review. These are separate and distinct sets which are collectively the set of all contracts, a third set. System functionality can be driven by simply selecting the set requiring action, the “Needs Review” set, the set that excludes those that need review, or the set of all contracts. Because the contracts requiring review are in a different set, a sub-set, and it was done with a single step, the processing logic is cut in half. Where else can you get a 50% discount and do less work to get it?
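Here is a minimal sketch of that set-based selection, assuming the Python rdflib library; the namespace and contract names are made up:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.com/contracts#")  # invented namespace
g = Graph()

# Assertions, not flags: the Jefferson contract is categorized as needing review.
g.add((EX.JeffersonContract, RDF.type, EX.Contract))
g.add((EX.SmithContract, RDF.type, EX.Contract))
g.add((EX.JeffersonContract, EX.categorizedBy, EX.NeedsReview))

# "DO this with this data set": select the set needing review in one step,
# with no IF/NULL logic anywhere.
needs_review = g.query("""
    PREFIX ex: <http://example.com/contracts#>
    SELECT ?contract WHERE { ?contract ex:categorizedBy ex:NeedsReview }
""")
for (contract,) in needs_review:
    print(contract)
```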

I love a good flag, but I don’t think they would have caught on if we needed to ask the flag-bearer what the label on the flagpole said to understand what it stood for.

Blog post by Mark Ouska 

For more reading on the topic, check out this post by Dave McComb.