Time to Rethink Master and Reference Data

Every company contends with data quality, and in its pursuit many commit substantial resources to managing their master and reference data. Remarkably, quite a bit of confusion exists around exactly what these are and how they differ. And since master and reference data provide context to business activity, this confusion can undermine any data quality initiative.

Here are amalgams of the prevailing definitions, which seem meaningful at first glance:

Sound familiar? In this article, I will use some tools and techniques for naming and defining terms to explain how these definitions actually create confusion. Although there is no perfect solution, I will share the terms and definitions that have helped me guide data initiatives, processes, technologies, and governance over the course of my career.

What’s in a Name?

Unique and self-explanatory names save time and promote common understanding. Naming, however, is nuanced in that words are often overloaded with multiple meanings. The word “customer,” for instance, often means very different things to people in the finance, sales, or product departments. There are also conventions that, while not exactly precise, have accumulated common understanding over time. The term “men’s room,” for example, is understood to mean something more specific than a room (it has toilets); yet something less specific than men’s (it’s also available to boys).

They’re both “master”

The term “master” data derives from the notion that each individually identifiable thing has a corresponding, comprehensive and authoritative record in the system. The verb to master means to gain control of something. The word causes confusion, however, when used to distinguish master data from reference data. If anything, reference data is the master of master data, as it categorizes and supplies context to master data. The dependency graph below demonstrates that master data may refer to and thus depend on reference data (red arrow), but not the other way around:

They’re both “reference”

The name “reference data” also makes sense in isolation. It evokes reference works like dictionaries, which are highly curated by experts and typically used to look up individual terms rather than being read from beginning to end. But reference can also mean the act of referring, and in practice, master data has just as many references to it as reference data.  

So without some additional context, these terms are problematic in relation to each other.

It is what it is

Although we could probably conjure better terms, “Master Data” and “Reference Data” have become universal standards with innumerable citations. Any clarification provided by new names would be offset by their incompatibility with the consensus.

Pluralizations R Us

Whenever possible, it’s best to express terms in the singular rather than the plural since the singular form refers to the thing itself, while the plural form denotes a set. That’s why dictionaries always define the singular form and provide the plural forms as an annotation.  Consider the following singular and plural terms and definitions:

* Note that entity is used in the entity-relationship sense, where it denotes a type of thing rather than an identifiable instance of a thing.

The singular term “entity” works better for our purposes since the job at hand is to classify each entity as reference or master, rather than some amorphous concept of data. In our case, classifying each individual entity informs its materialized design in a database, its quality controls, and its integration process. The singular also makes it more natural to articulate relationships between things, as demonstrated by these awkward counterexamples:

“One bushels contains many apples.”

“Each data contains one or more entities.”

Good Things Come in Threes

Trying to describe the subject area with just two terms, master and reference, falls short because the relationship between the two cannot be fully understood without also defining the class that includes them both.  For example, some existing definitions specify a “disjoint” relationship in which an entity can belong to either reference or master data, but not both. This can be represented as a diagram or tree:

The conception is incomplete because the class that contains both reference and master data is missing.  Are master data and reference data equal siblings among other data categories, as demonstrated below?

That’s not pragmatic, since it falsely implies that master and reference data have no more potential for common governance and technology than, say, weblogs and image metadata. We can remedy that by subsuming master and reference data within an intermediate class, which must still be named, defined, and assigned the common characteristics shared by master and reference data.

Some definitions posit an inclusion or containment relationship in which reference data is a subset of master data, rather than a disjoint peer. This approach, however, omits the complement: the master data that is not reference data.

Any vocabulary that doesn’t specify the combination of master and reference data will be incomplete and potentially confusing.

It’s Just Semantics

Generally speaking, there are two broad categories of definitions: extensional and intensional.  

Extensional Definitions

An extensional definition simply defines an entity by listing all of its instances, as in the following example:

This is out of the question for defining reference or master data, as each has too many entities and regularly occurring additions. Imagine how unhelpful and immediately obsolete the following definition would be:

A variation of this approach, ostensive definition, uses partial lists as examples.  These are often used for “type” entities that nominally classify other things:

Ostensive definitions, unlike extensional definitions, can withstand the addition of new instances. They do not, however, explain why their examples satisfy the term. In fact, ostensive definitions are used primarily for situations in which it’s hard to formulate a definition that can stand on its own. Therefore both extensional and ostensive definitions are inadequate, since they fail to provide a rationale to distinguish reference from master data.

Intensional Definitions 

Intensional definitions, on the other hand, define things by their intrinsic properties and do not require lists of instances.  The following definition of mineral, for example, does not list any actual minerals:

With that definition, we can examine the properties of quartz, for example, and determine that it meets the necessary and sufficient conditions to be deemed a mineral.  Now we’re getting somewhere, and existing definitions have naturally used this approach.  

Unfortunately, the conditions put forth in the existing definitions of master and reference data can describe either, rather than one or the other. The following table shows that every condition in the intensional definitions of master and reference data applies to both terms:

How can you categorize the product entity, for example, when it adheres to both definitions? It definitely conforms to the definition of master data: a core thing shared across an enterprise. But it also conforms to the definition of reference data, as it’s often reasonably stable and simply structured, used to categorize other things (sales), provides a list of permissible values (order forms), and corresponds to external databases (vendor part lists). I could make the same case for almost any entity categorized as master or reference, and this is where the definitions fail.

Master data and reference data: use intensional definitions

Celebrate Diversity

Although they share the same intrinsic qualities, master and reference data truly are different and require separate terms and definitions. Their flow through a system and their respective quality control processes, for instance, are quite distinct.  

Reference data is centrally administered and stored. It is curated by an authoritative party before becoming available in its system of record, and only then is it copied to application databases or the edge. An organization, for instance, would never let a user casually add a new unit of measure or a new country.

Master data, on the other hand, is often regularly added and modified in various distributed systems. New users register online, sales systems acquire new customers, organizations hire and fire employees, etc. The data comes in from the edge during the normal course of business, and quality is enforced as it is merged into the systems of record.

Master data and reference data change and merge

Companies must distinguish between master and reference data to ensure their quality and proper integration.

Turn The Beat Around

It’s entirely reasonable and common to define things by their intrinsic qualities and then use those definitions to inform their use and handling. Intuition tells us that once we understand the characteristics of a class of data, we can assess how best to manage it. But since the characteristics of master and reference data overlap, we need to approach their definitions differently.

 

In software architecture and design, there’s a technique called Inversion of Control that reverses the relationship between a master module and the process it controls. It essentially makes the module subservient to the process. We can apply a similar concept here by basing our definitions on the processes required by the data, rather than trying to base the processes on insufficiently differentiated definitions. This allows us to pragmatically define terms that abide by the conclusions described above:

  1. Continue to use the industry-standard terms “master data” and “reference data.”
  2. Define terms in the singular form.
  3. Define a third concept that encompasses both categories.
  4. Eschew extensional and ostensive definitions, and use intensional definitions that truly distinguish the concepts.

With all that out of the way, here are the definitions that have brought clarity and utility to my work with master and reference data. I’ve promoted the term “core” from an adjective of master data to a first-class concept that expresses the superclass encompassing both master and reference entities.

With core defined, we can use a form of intensional definition called genus-differentia for reference and master data. Genus-differentia definitions have two parts. The first, the genus, refers to a previously defined class to which the concept belongs (core entity, in our case). The rest of the definition, the differentia, describes what sets the concept apart from others in its class. We can now use our definition of core entity as the genus and let the data flow provide the differentia. This truly distinguishes reference from master.
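As a rough sketch of the structure of these definitions, the genus-differentia approach can be mocked up in Python. The class names and flow descriptions below are my own illustration of the idea, not the author's exact wording.

class CoreEntity:
    """Genus: an individually identifiable business entity shared across the enterprise."""

class ReferenceEntity(CoreEntity):
    """Differentia: curated by an authoritative party before it appears in its
    system of record, then copied outward to application databases and the edge."""
    flow = "center -> edge"

class MasterEntity(CoreEntity):
    """Differentia: created and changed at the edge during normal business,
    then merged inward, with quality enforced at the systems of record."""
    flow = "edge -> center"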

We can base the plural terms on the singular ones:

Conclusion

This article has revealed several factors that have handicapped our understanding of master and reference data:

  • The names and prevailing definitions insufficiently distinguish the concepts because they apply to both.
  • The plural form of a given concept obscures its definition.
  • Master data and reference data are incompletely described without a third class that contains both. 

Although convention dictates retention of the terms “master” and “reference,” we achieve clarity by using genus differentia to demonstrate that while they are both classified as core entities, they are truly distinguished by their flow and quality requirements rather than any intrinsic qualities or purpose.

By Alan Freedman

Connect with the Author

Want to learn more about what we do at Semantic Arts? Contact us!

Facet Math: Trim Ontology Fat with Occam’s Razor

At Semantic Arts we often come across ontologies whose developers seem to take pride in the number of classes they have created, giving the impression that more classes equate to a better ontology. We disagree with this perspective and, as evidence, point to Occam’s Razor, a problem-solving principle that states, “Entities should not be multiplied without necessity.” More is not always better. This post introduces Facet Math and demonstrates how to contain runaway class creation during ontology design.

Semantic technology is suited to making complex information intellectually manageable, and huge class counts are counterproductive. Enterprise data management is complex enough without making the problem worse; adding unnecessary classes can render it intellectually unmanageable. Fortunately, the solution comes in the form of a simple modeling change.

Facet Math leverages core concepts and pushes fine-grained distinctions to the edges of the data model. This reduces class counts and complexity without losing any informational fidelity. Here is a scenario that demonstrates spurious class creation in the literature domain. Since literature can be sliced many ways, it is easy to justify building in complexity as data structures are designed. This example demonstrates a typical approach and then pivots to a more elegant Facet Math solution.

A taxonomy is a natural choice for the literature domain. To get to each leaf, the whole path must be modeled, adding a multiplier with each additional level in the taxonomy. This case shows the multiplicative effect and would result in a tree with 1000 leaves (10*10*10), assuming it had:
10 languages
10 genres
10 time periods

Taxonomies typically are not that regular, though they do chart a path from the topmost concept down to each leaf. Modelers tend to model the whole path, which multiplies the result set. Having to navigate taxonomy paths makes working with the information more difficult: the path must be disassembled to work with the components it has aggregated.

This temptation to model taxonomy paths into classes and/or class hierarchies creates a great deal of complexity. The languages, genres, and time periods in the example are really literature categories. This is where Facet Math kicks in, taking an additive approach and designing them as distinct categories. Using those categories for faceted search and dataset assembly returns all the required data. Here is how it works.

To apply Facet Math, remove the category duplication from the original taxonomy by refactoring the categories as facets. The facets enable exactly the same data representation:
10 languages
10 genres
10 time periods

By applying Facet Math principles, the concept count drops from 1000 to 30: where the paths multiplied, the facets simply add. That is a reduction of more than thirtyfold.
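The arithmetic is easy to verify with a quick Python sketch (the facet values below are placeholders):

from itertools import product

languages = [f"language_{i}" for i in range(10)]
genres    = [f"genre_{i}"    for i in range(10)]
periods   = [f"period_{i}"   for i in range(10)]

# Path-based taxonomy: one leaf class per language/genre/period combination.
print(len(list(product(languages, genres, periods))))   # 1000

# Facet-based model: three independent categories that are simply added.
print(len(languages) + len(genres) + len(periods))       # 30

# A work is then described by pointing at one value per facet; faceted search
# intersects categories at query time instead of navigating a pre-built path.
work = {"language": "language_3", "genre": "genre_7", "period": "period_2"}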

Sure, this is a simple example. Looking at a published ontology might be more enlightening.

The SNOMED (Systematized Nomenclature of Medicine—Clinical Terms) ontology is a real-world example.

Since the thesis here is looking at fat reduction, here is the class hierarchy in SNOMED to get from the topmost class to Gastric Bypass.

Notice that Procedure appears at four levels, and Anastomosis and Stomach each appear at two. This hierarchy is a path containing paths.

SNOMED’s maximum class hierarchy depth is twenty-seven. Given the multiplicative effect shown above in the first example, SNOMED having 357,533 classes, while disappointing, is not surprising. The medical domain is highly complex but applying Facet Math to SNOMED would surely generate some serious weight reduction. We know this is possible because we have done it with clients. In one case Semantic Arts produced a reduction from over one hundred fifty thousand concepts to several hundred without any loss in data fidelity.

Bloated ontologies contain far more complexity than is necessary. Humans cannot possibly memorize a hundred thousand concepts, but several hundred are intellectually manageable. Computers also benefit from reduced class counts. Machine Learning and Artificial Intelligence applications have fewer, more focused concepts to work with so they can move through large datasets more quickly and effectively.

It is time to apply Occam’s Razor and avoid creating unnecessary classes. It is time to design ontologies using Facet Math.

Property Graphs: Training Wheels on the way to Knowledge Graphs

I’m at a graph conference. The general sense is that property graphs are much easier to get started with than Knowledge Graphs. I wanted to explore why that is, and whether it is a good thing.

It’s a bit of a puzzle to us. We’ve been using RDF and the Semantic Web stack for almost two decades, and it seems intuitive, but among people new to graph databases there is a strong preference for property graphs (at this point primarily Neo4J and TigerGraph, but there are others). – Dave McComb

Property Graphs

A knowledge graph is a database that stores information as a directed graph (a digraph), where each edge is simply a directed link between two nodes.

The nodes self-assemble (if they have the same value) into a more complete and more interesting graph.

What makes a graph a “property graph” (also called a “labeled property graph”) is the ability to have values on the edges.

Either type of graph can have values on the nodes; in a Knowledge Graph these are expressed with a special kind of edge called a “datatype property.”
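A minimal sketch in Python with the rdflib library makes the distinction concrete (the example.com namespace and terms are invented for illustration): a value on a node is just an edge whose object is a literal, i.e. a datatype property, while an edge between two nodes uses an object property.

from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.com/")
g = Graph()

# An object property: an edge between two nodes.
g.add((EX.order42, EX.purchasedBy, EX.alice))

# Datatype properties: "values on the node" are edges whose object is a literal.
g.add((EX.alice, EX.name, Literal("Alice")))
g.add((EX.order42, EX.orderDate, Literal("2019-05-01")))

print(g.serialize(format="turtle"))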

Here is an example of one of the typical uses for values on the edges (the date the edge was established). As it turns out, this canonical example isn’t a very good one: in most databases, graph or otherwise, a purchase would be a node with many other complex relationships.

The better use of dates on the edges in property graphs is for what we call a “durable temporal relation.” Some relationships exist for a long time, but not forever, and depending on the domain they are often modeled as edges with effective start and end dates (ownership, residence, and membership are examples of durable temporal relations that map well to dates on the edges).

There is another big use case for values on the edges, which we’ll cover below.

The Appeal of Property Graphs

Talking to people and reading white papers, it seems the appeal of Property Graph databases lies in these areas:

  • Closer to what programmers are used to
  • Easy to get started
  • Cool Graphics out of the box
  • Attributes on the edges
  • Network Analytics

Property Graphs are Closer to What Programmers are Used to

The primary interfaces to Property Graphs are JSON-style APIs, which developers are comfortable with and find easy to adapt to.

Easy to Get Started

Neo4J in particular has done a very good job of getting people set up, running, and productive in short order. There are free versions to get started with and well-exercised datasets to get up and going rapidly. This is very satisfying for people getting started.

Cool Graphics Out of the Box

One of the striking things about Neo4J is its beautiful graphics.

You can rapidly produce graphics of a kind rarely seen in traditional systems, and this draws the attention of sponsors.

Property Graphs have Attributes on the Edges

Perhaps the main distinction between Property Graphs and RDF Graphs is the ability to add attributes to the edges in the network.  In this case the attribute is a rating (this isn’t a great example, but it was the best one I could find easily).

One of the primary use cases for attributes on the edges would be weights that are used in the evaluation of network analytics. For instance, a network representation of how to get from one town to another might include a number of alternate subroutes through different towns or intersections. Each edge would represent a segment of a possible journey. By putting weights on each edge that represent distance, a network algorithm could calculate the shortest path between two towns. By putting weights on the edges that represent average travel time, a network algorithm could calculate the route that would take the least time.
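Here is a minimal sketch of that routing idea using Python and the networkx library (the towns and weights are made up): the same edges carry both a distance and a travel-time attribute, and the algorithm minimizes whichever weight you ask for.

import networkx as nx

G = nx.Graph()
# Each edge is a road segment with two weights: distance (km) and travel time (minutes).
G.add_edge("Aville", "Btown", distance=30, minutes=25)
G.add_edge("Btown", "Ctown", distance=20, minutes=40)
G.add_edge("Aville", "Dtown", distance=45, minutes=30)
G.add_edge("Dtown", "Ctown", distance=25, minutes=20)

print(nx.shortest_path(G, "Aville", "Ctown", weight="distance"))  # shortest route, via Btown
print(nx.shortest_path(G, "Aville", "Ctown", weight="minutes"))   # fastest route, via Dtown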

Other use cases for attributes on the edges include temporal information (when did this edge become true, and when was it no longer true), certainty (you can rate the degree of confidence you have in a given link and in some cases only consider links that exceed some certainty value), and popularity (you could implement the PageRank algorithm with weights on the edges, but I think it might be more appropriate to put the weights on the nodes).

Network Analytics

A wide range of network analytics come out of the box and are enabled in the property graph. Many do not require attributes on the edges; for instance, the “clustering” and “strength of weak ties” suggested in this graphic can be computed without them.

However, many of the network analytics algorithms can take advantage of and gain from weights on the edges.

Property Graphs: What’s Not to Like

That is a lot of pluses on the Property Graph side, and it explains their meteoric rise in popularity.

Our contention is that when you get beyond the initial analytic use case you will find yourself needing to reinvent a great body of work that already exists and has long been standardized. At that point, if you have overcommitted to Property Graphs you will find yourself in a quandary, whereas if you positioned Property Graphs as a stepping stone on the way to Knowledge Graphs you will save yourself a lot of unnecessary work.

Property Graphs, What’s the Alternative?

The primary alternative is an RDF Knowledge Graph.  This is a graph database using the W3C’s standards stack including RDF (resource description framework) as well as many other standards that will be described below as they are introduced.

The singular difference is that the RDF Knowledge Graph standards were designed for interoperability at web scale. As such, all identifiers are globally unique, and potentially discoverable and resolvable. This is a gigantic advantage when using knowledge graphs as an integration platform, as we will cover below.

Where You’ll Hit the Wall with Property Graphs

There are a number of capabilities we assume you’ll eventually want to add to your Property Graph stack, such as:

  • Schema
  • Globally Unique Identifiers
  • Resolvable identifiers
  • Federation
  • Constraint Management
  • Inference
  • Provenance

Our contention is that you could in principle add all this to a property graph, and over time you will indeed be tempted to do so. However, doing so is a tremendous amount of work and high risk, and even if you succeed you will have a proprietary, home-grown version of things that already exist, are standardized, and have been proven in large-scale production systems.

As we introduce each of these capabilities that you will likely want to add to your Property Graph stack, we will describe the open standards approach that already covers it.

Schema

Property Graphs do not have a schema.  While big data lauded the idea of “schema-less” computing, the truth is, completely removing schema means that a number of functions previously performed by schema have now moved somewhere else, usually code. In the case of Property Graphs, the nearest equivalent to a schema is the “label” in “Labeled Property Graph.” But as the name suggests, this is just a label, essentially like putting a tag on something.  So you can label a node as “Person” but that tells you nothing more about the node.  It’s easier to see how limited this is when you label a node a “Vanilla Swap” or “Miniature Circuit Breaker.”

Knowledge Graphs have very rich and standardized schema. One way they give you the best of both worlds is that, unlike relational databases, they do not require all of the schema to be present before any data can be persisted. At the same time, when you are ready to add schema to your graph, you can do so with a high degree of rigor and in as much or as little detail as necessary.

Globally Unique Identifiers

The identifiers in Property Graphs are strictly local.  They don’t mean anything outside the context of the immediate database.  This is a huge limitation when looking to integrate information across many systems and especially when looking to combine third party data.

Knowledge Graphs are based on URIs (really IRIs). Uniform Resource Identifiers (and their Unicode superset, Internationalized Resource Identifiers) are a lot like URLs, but instead of identifying a web location or page, they identify a “thing.” In best practice (which is to say for 99% of all the extant URIs and IRIs out there) the URI/IRI is based on a domain name. This delegation of ID assignment to the organizations that own the domain names allows relatively simple identifiers that are not in danger of being mistakenly duplicated.

Every node in a knowledge graph is assigned a URI/IRI, including the schema or metadata. This makes discovering what something means as simple as “following your nose” (see the next section).

Resolvable Identifiers

Because URI/IRIs are so similar to URLs, and indeed in many situations are URLs, it is easy to resolve any item. Clicking on a URI/IRI can redirect to a server in the domain of the URI/IRI, which can then render a page that represents the resource. In the case of a schema/metadata URI/IRI, the page might describe what the metadata means. This typically includes both the “informal” definition (comments and other annotations) and the “formal” definition (described below).

For a data URI/IRI, the resolution might display what is known about the item (typically the outgoing links), subject to security restrictions implemented by the owner of the domain. This style of exploring a body of data by clicking on links is called “following your nose,” and it is a very effective way of learning a complex body of knowledge, because unlike traditional systems you do not need to know the whole schema in order to get started.

Property Graphs have no standard way of doing this.  Anything that is implemented is custom for the application at hand.

Federation

Federation refers to the ability to query across multiple databases to get a single comprehensive result set. This is almost impossible to do with relational databases. No major relational database vendor will execute queries across multiple databases and combine the results (the result generally wouldn’t make any sense anyway, as the schemas are never the same). The closest thing in traditional systems is the Virtual Data P***, which allows some limited aggregation of harmonized databases.

Property Graphs also have no mechanism for federation over more than a single in-memory graph.

Federation is built into SPARQL (the W3C standard for querying “triple stores,” or RDF-based graph databases). You can point a SPARQL query at a number of databases (including relational databases that have been mapped to RDF through another W3C standard, R2RML).
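As a sketch, a SPARQL 1.1 federated query looks like the following (the endpoint URL and vocabulary are placeholders, not real services): the SERVICE clause sends part of the graph pattern to a remote endpoint and joins the results with local data. The query is shown here as a Python string constant that you would submit to your own endpoint or query library.

FEDERATED_QUERY = """
PREFIX ex:   <http://example.com/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?product ?label
WHERE {
  ?product a ex:Product .                        # matched in the local store
  SERVICE <https://partner.example.com/sparql> { # evaluated by the remote endpoint
    ?product rdfs:label ?label .
  }
}
"""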

Constraint Management

One of the things needed in a system that is hosting transactional updates is the ability to enforce constraints on incoming transactions. Suffice it to say, Property Graphs have no standardized transaction mechanism and no constraint management capability.

Knowledge Graphs have a W3C standard, SHACL (Shapes Constraint Language), to specify constraints in a model-driven fashion.
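Here is a minimal sketch of a SHACL shape, written in Turtle and checked with Python's pyshacl library (the ex: namespace and the name constraint are invented for illustration):

from rdflib import Graph
from pyshacl import validate

shapes = Graph().parse(data="""
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.com/> .

ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [
        sh:path ex:name ;
        sh:minCount 1 ;
        sh:datatype xsd:string ;
    ] .
""", format="turtle")

data = Graph().parse(data="""
@prefix ex: <http://example.com/> .
ex:alice a ex:Person .    # no ex:name, so this node violates the shape
""", format="turtle")

conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)    # False
print(report)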

Inference

Inference is the creation of new information from existing information. A Property Graph creates a number of “insights,” which are a form of inference, but that inference really exists only in the heads of the people running the analytics and interpreting the insight.

Knowledge Graphs have several inference capabilities. What they all share is that the result of the inference is rendered as another triple (the inferred information is another fact, which can be expressed as a triple). In principle, almost any fact that can be asserted in a Knowledge Graph can also be inferred, given the right contextual information. For instance, we can infer that a class is a subclass of another class, that a node has a given property, or that two nodes represent the same real-world item, and each of these inferences can be “materialized” (written) back to the database. This makes any inferred fact available to any human reviewing the graph and to any process that acts on the graph, including queries.

Two of the prime creators of inferred knowledge are RDFS and OWL, the W3C standards for schema. RDFS provides the simple sort of inference that people familiar with object-oriented programming will recognize, primarily the ability to infer that a node that is a member of a class is also a member of any of its superclasses. A bit newer to many people is the idea that properties can have superproperties, and that leads to inference at the instance level. If you assert that you have a mother (property :hasMother) Beth, and then declare :hasParent to be a superproperty of :hasMother, the system will infer that you :hasParent Beth; this process can be repeated by making :hasAncestor a superproperty of :hasParent. The system can infer and persist this information.
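The superproperty chain just described can be sketched in Python with rdflib and the owlrl reasoner (the example.com prefix is a placeholder); the inferred triples are materialized back into the same graph.

from rdflib import Graph, Namespace, RDFS
import owlrl

EX = Namespace("http://example.com/")
g = Graph()

# Schema: hasMother is a subproperty of hasParent, which is a subproperty of hasAncestor.
g.add((EX.hasMother, RDFS.subPropertyOf, EX.hasParent))
g.add((EX.hasParent, RDFS.subPropertyOf, EX.hasAncestor))

# Data: one asserted fact.
g.add((EX.you, EX.hasMother, EX.Beth))

# Materialize the RDFS entailments back into the graph.
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

print((EX.you, EX.hasParent, EX.Beth) in g)    # True (inferred)
print((EX.you, EX.hasAncestor, EX.Beth) in g)  # True (inferred)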

OWL (the Web Ontology Language, for dyslexics) allows for much more complex schema definitions. OWL allows you to build class definitions from Boolean combinations of other classes, and allows the formal definition of classes by creating membership definitions based on what properties are attached to nodes.

If RDFS and OWL don’t provide sufficient rigor and/or flexibility, there are two other options, both rule languages, and both will render their inferences as triples that can be returned to the triple store. RIF (the Rule Interchange Format) allows inference rules defined in terms of “if/then” logic. SPARQL, the above-mentioned query language, can also be used to create new triples that can be written back to the triple store.
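As a sketch of SPARQL used as a rule language (names again illustrative), a CONSTRUCT query produces new triples that can then be added back to the store:

from rdflib import Graph, Namespace

EX = Namespace("http://example.com/")
g = Graph()
g.add((EX.you, EX.hasParent, EX.Beth))
g.add((EX.Beth, EX.hasParent, EX.Carol))

# The rule: a parent of a parent is a grandparent.
rule = """
PREFIX ex: <http://example.com/>
CONSTRUCT { ?x ex:hasGrandparent ?z }
WHERE     { ?x ex:hasParent ?y . ?y ex:hasParent ?z }
"""

# Materialize the inferred triples back into the same graph.
for triple in g.query(rule):
    g.add(triple)

print((EX.you, EX.hasGrandparent, EX.Carol) in g)   # True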

Provenance

Provenance is the ability to know where any atom of data came from. There are two provenance mechanisms in Knowledge Graphs. For inferences generated from RDFS or OWL definitions, there is an “explain” mechanism, described in the standards as “proof.” In the same spirit as a mathematical proof, the system can reel out the assertions, including schema-based definitions and data-level assertions, that led to the provable conclusion of the inference.

For data that did not come from inference (data that was input by a user, purchased, or created through some batch process), there is a W3C standard called PROV-O (the Provenance Ontology) that outlines a standard way to describe where a dataset, or even an individual atom of data, came from.

Property Graphs have nothing similar.

Convergence

The W3C held a conference to bring together the labeled property graph camp with the RDF knowledge graph camp in Berlin in March of 2019.

One of our consultants attended and has been tracking the aftermath. One promising path is RDF*, which is being mooted as a potential candidate to unify the two camps. There are already several commercial implementations supporting RDF*, even though the standard hasn’t even begun its journey through the approval process. We will cover RDF* in a subsequent white paper.

Summary

Property Graphs are easy to get started with.  People think RDF based Knowledge Graphs are hard to understand, complex and hard to get started with. There is some truth to that characterization.

The reason we made the analogy to “training wheels” (or “stepping stones” in the middle of the article) is to acknowledge that riding a bike is difficult.  You may want to start with training wheels.  However, as you become proficient with the training wheels, you may consider discarding them rather than enhancing them.

Most of our clients start directly with Knowledge Graphs, but we recognize that that isn’t the only path. Our contention is that a bit of strategic planning up front, outlining where this is likely to lead, gives you a lot more runway. You may choose to do your first graph project using a property graph, but we suspect that sooner or later you will want to get beyond the first few projects and will want to adopt an RDF/Semantic Knowledge Graph based system.

When is a Brick not a Brick?

They say good things come in threes and my journey to data-centricity started with three revelations.

The first was connected to a project I was working on for a university college with a problem that might sound familiar to some of you. The department I worked in was taking four months to clean, consolidate and reconcile our quarterly reports to the college executive. We simply did not have the resources to integrate incoming data from multiple applications into a coherent set of reports in a timely way.

The second came in the form of a lateral thinking challenge worthy of Edward de Bono: ‘How many different uses for a brick can you think of?’

The third revelation happened when I was on a consulting assignment at a multinational software company in Houston, Texas. As part of a content management initiative we were hired to work with their technical documentation team to install a large ECM application. What intrigued me the most, though, were the challenges the company experienced at the interface between the technology and the ‘multiple of multiples’ with respect to business language.

Revelation #1: Application Data Without the Application is Easy to Work With

The college where I had my first taste of data-centricity had the usual array of applications supporting its day-to-day operations. There were Student systems, HR systems, Finance systems, Facility systems, Faculty systems and even a separate Continuing Education System that replicated all those disciplines (with their own twists, of course) under one umbrella.

The department I worked in was responsible for generating executive quarterly reports for all activities on the academic side plus semi-annual faculty workload and annual graduation and financial performance reports. In the beginning we did this piece-meal and as IT resources became available. One day, we decided to write a set of specifications about what kind of data we needed; to what level of granularity; in what sequence; and, how frequently it should be extracted from various sources.

We called the process ‘data liquefication’ because once the data landed on our shared drive the only way we could tell what application it came from was by the file name. Of course, the contents and structure of the individual extracts were different, but they were completely pliable. Detached from the source application, we had complete freedom to do almost anything we wanted with it. And we did. The only data model we had to build (actually, we only ever thought about it once) was which ‘unit of production’ to use as the ‘center’ of our new reporting universe. To those of you working with education systems today, the answer will come as no surprise. We used ‘seat’.

Figure 1: A Global Candidate for Academic Analytics

Once that decision was taken, and we put feedback loops in to correct data quality at source, several interesting patterns emerged:

  • The collections named Student, Faculty, Administrator and Support Staff were not as mutually exclusive as we originally thought. Several individuals occupied multiple roles in one semester.
  • The Finance categories were set up to reflect the fact that some expenses applied to all Departments; some were unique to individual Departments; and, some were unique to Programs.
  • Each application seemed to use a different code or name or structure to identify the same Person, Program or Facility.

From these patterns we were able to produce quarterly reports in half the time. We also introduced ‘what-if’ reporting for the first time, and since we used the granular concept of ‘seat’ as our unit of production we added Cost per Seat; Revenue per Seat; Overhead per Seat; Cross-Faculty Registration per Seat; and, Longitudinal Program Costs, Revenues, Graduation Rates and Employment Patterns to our mix of offerings as well.

Revelation #2: A Brick is Always a Brick. How it is Used is a Separate Question

When we separate what a thing “is” from how it is used, some interesting data patterns show up. I won’t take up much space in this article to enumerate them, but the same principle that can take ‘one thing’ like an individual brick and use it in multiple ways (paper weight, door stop, wheel chock, pendulum weight, etc.) puts the whole data classification thing in a new light.

The string “John Smith” can appear, for example, as the name of a doctor, a patient, a student, an administrator and/or an instructor. This is a similar pattern to the one that popped up at the university college. As it turns out, that same string can be used as an entity name, an attribute, metadata, reference data, and a few other popular ‘sub-classes’ of data. They are not separate collections of ‘things’ as much as they are separate functions of the same thing.

Figure 2: What some ‘thing’ is and how it is used are two separate things

The implication for me was to classify ‘things’ first and foremost as what they refer to or in fact what they are. So, “John Smith” refers to an individual, and in my model surrounding data-centricity “is-a” (member of the set named) Person. On the other side of the equation, words like ‘Student’, ‘Patient’, and ‘Administrator’ are Roles. In my declarations, Student “is-a” (member of the set named) Role.

One of the things this allowed me to do was to create a very small (n = 19) number of mutually exclusive and exhaustive sets in any collection. This development also supported the creation of semantically interoperable interfaces and views into broadly related data stores.
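That separation can be sketched in RDF terms with Python's rdflib (the namespace, names, and hasRole property are my own illustration, not the author's model): the individual is identified once as a Person, and the roles are separate things it is linked to.

from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.com/")
g = Graph()

# What the thing IS: one individual, identified once.
g.add((EX.johnSmith1, RDF.type, EX.Person))
g.add((EX.johnSmith1, EX.name, Literal("John Smith")))

# How the thing is USED: roles are modeled separately and linked to the person.
g.add((EX.Student, RDF.type, EX.Role))
g.add((EX.Instructor, RDF.type, EX.Role))
g.add((EX.johnSmith1, EX.hasRole, EX.Student))
g.add((EX.johnSmith1, EX.hasRole, EX.Instructor))

# The same individual can hold several roles in one semester without being
# duplicated as a separate "student record" and "instructor record".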

Revelation #3: Shape and Semantics Must be Managed Separately and on Purpose

The theme of separation came up again while working on a technical publications project in Houston, Texas. Briefly, the objective was to render application user support topics into their smallest reusable chunks and make it possible for technical writers to create document maps ranging from individual Help files in four different formats to full-blown, multi-chapter user guides and technical references. What really made the project challenging was what we came to call the ‘multiple of multiples’ problem. This turned out to be the exact opposite of the reuse challenge in Revelation #1:

  • Multiple customer platforms
  • Multiple versions of customer platforms
  • Multiple product families (Mainframe, Distributed and Hybrid)
  • Multiple product platforms
  • Multiple versions of product platforms
  • Multiple versions of products (three prior, one current, and one work-in-progress)
  • Multiple versions of content topics
  • Multiple versions of content assemblies (guides, references, specification sheets, for example)
  • Multiple customer locales (United States, Japan, France, Germany, China, etc.)
  • Multiple customer languages (English (two ‘flavours’), Japanese, German, Chinese, etc.)

The solution to this ‘factorial mess’ was not found in an existing technology (including the ECM software we were installing) but in fact came about by not only removing all architectural or technical considerations (as we did in Revelation #1), but also asking what it means to say “The content is the same” or “The content is different.”

In the process of comparing two components found in the ‘multiple of multiples’ list, we discovered three factors for consideration:

  1. The visual ‘shape’ of the components. ‘Stop’ and ‘stop’ look the same.
  2. The digital signatures of the components. We used an MD5 hash to do this (see the sketch after this list).
  3. The semantics of the components. We used translators and/or a dictionary.
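Here is a minimal sketch of the second factor in Python (the sample strings are invented): identical bytes produce identical digests, while a change of encoding or format yields a different digest even when the meaning is unchanged.

import hashlib

def signature(content: bytes) -> str:
    # Digital signature of a content component, used for comparison.
    return hashlib.md5(content).hexdigest()

a = "Press Stop to halt the process.".encode("utf-8")
b = "Press Stop to halt the process.".encode("utf-8")
c = "Press Stop to halt the process.".encode("utf-16")   # same text, different bytes

print(signature(a) == signature(b))   # True: same shape, same signature
print(signature(a) == signature(c))   # False: same semantics, different signature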

Figure 3 shows the matrix we used to demonstrate the tendency of each topic to be reused (or not) in one of the multiples.

Figure 3: Shape, Signal and Semantics for Content Component Comparison

It turns out that content can vary as a result of time (a version), place (a locale with different requirements for the same feature, for example), people (different languages), and/or format (saving a .docx file as a PDF). In addition to changes in individual components, assemblies of components can have their own identities.

This last point is especially important. Some content was common to all products the company sold. Other content varied along product lines, client platform, target market and audience. Finally, the last group of content elements was unique to a particular combination of parameters.

Take-Aways

Separating data from its controlling applications presents an opportunity to look at it in a new way. Removed from its physical and logical constraints, data-centricity begins to look a lot like the language of business. While the prospect of liberating data this way might horrify many application developers and data modelers out there, those of us trying to get the business closer to the information they need to accomplish their goals see the beginning of a more naturally integrated way of doing that.

The Way Forward with Data-Centricity

Data-centricity in architecture is going to take a while to get used to. I hope this post has given readers a sense of what the levers to making it work might look like and how they could be put to good use.

Click here to read a free chapter of Dave McComb’s book, “The Data-Centric Revolution”

Article by John O’Gorman

Connect with the Author

 

 

 

 

Toss Out Metadata That Does Not Bring Joy

As I write this, I can almost hear you wail “No, no, we don’t have too much metadata, we don’t have nearly enough!  We have several projects in flight to expand our use of metadata.”

Sorry, I’m going to have to disagree with you there. You are on a fool’s errand that will just provide busy work and will have no real impact on your firm’s ability to make use of the data it has.

Let me tell you what I have seen in the last half dozen or so very large firms I’ve been involved with, and you can tell me if this rings true for you.  If you are in a mid-sized or even small firm you may want to divide these numbers by an appropriate denominator, but I think the end result will remain the same.

Most large firms have thousands of application systems. Each of these systems has a data model that consists of hundreds of tables and many thousands of columns. Complex applications, such as SAP, explode these numbers (a typical SAP install has populated 90,000 tables and a half million columns).

Even as we speak, every vice president with a credit card is unknowingly expanding their firm’s data footprint by implementing suites of SaaS (Software as a Service) applications.  And let’s not even get started on your Data Scientists.  They are rabidly vacuuming up every dataset they can get their hands on, in the pursuit of “insights.”

Naturally you are running out of space, and especially system admin bandwidth in your data centers, so you turn to the cloud.  “Storage is cheap.”

This is where the Marie Kondo analogy kicks in. As you start your migration to the cloud (or to your Data Lake, which may or may not be in the cloud), you decide “this would be a good time to catalog all this stuff.” You launch into a project with the zeal of a Property and Evidence Technician at a crime scene: “Let’s carefully identify and tag every piece of evidence.” The advantage that they have, and you don’t, is that their world is finite. You are faced with cataloging billions of pieces of metadata. You know you can’t do it alone, so you implore the people who are putting the data in the Data Swamp (er, Lake). You mandate that anything that goes into the lake must have a complete catalog. Pretty soon you notice that the people putting the data in don’t know what it is either. And they know most of it is crap, but there are a few good nuggets in there. If you require them to have descriptions of each data element, they will copy the column heading and call it a description.

Let’s just say, hypothetically, you succeeded in getting a complete and decent catalog for all the datasets in use in your enterprise.  Now what?

Click here to read more on TDAN.com

My Path Towards Becoming A Data-Centric Revolution Practitioner

In 1986 I started down a path that, in 2019, has made me a fledgling Data-Centric revolution practitioner. It began with my wife and me founding two micro-businesses in the music and micro-manufacturing industries. In 1998 I put the music business, EARTHTUNES, on hold and sold the other; then I started my Information Technology career. For the last 21 years I’ve covered hardware, software, networks, administration, data architecture and development. I’ve mastered relational and dimensional design, working in small and large environments. But my EARTHTUNES work in 1994 powerfully steered me toward the Data-Centric revolution.

In early 1994 I was working on my eighth, ninth and tenth nature sound albums for my record label EARTHTUNES. (See album cover photos below.) The year before, I had done 7 months’ camping and recording in the Great Smoky Mountains National Park to capture the raw materials for my three albums. (To hear six minutes of my recording from October 24, 1993 at 11:34am, right-click here and select open link in new tab, to download the MP3 and PDF files—my gift to you for your personal use. You may listen while you finish reading below, or anytime you like.)

In my 1993 field work I generated 268 hours of field recordings with 134 field logs. (See below for my hand-written notes from the field log.)

Now, in 1994, I was trying to organize the audio recordings’ metadata so that I could select the best recordings and sequence them according to a story-line across the three albums. So, I made album part subtake forms for each take, each few-minutes’ recording, that I thought worthy of going on one of the albums. (See the image of my Album Part Subtake Form, below.)

I organized all the album part subtake forms—all my database metadata entries—and, after months of work, had my mix-down plan for the three albums. In early summer I completed the mix and Macaulay Library of Nature Sound prepared to publish the “Great Smoky Mountains National Park” series: “Winter & Spring;” “Summer & Fall;” and “Storms in the Smokies.”

The act of creating those album part subtake forms was a tipping point towards my becoming a Data-Centric revolution practitioner. In 1994 I started to understand many of the principles defined here and in chapter 2 of Dave McComb’s “The Data-Centric Revolution: Restoring Sanity to Enterprise Information Systems” . Since then I have internalized and started walking them out. The words below are my understandings of the principles, adapted from the Manifesto and McComb’s book.

  • All the many different types of data needed to be included: structured, semi-structured, network-structured and unstructured. Audio recordings and their artifacts; business and reference data; and other associated data, altogether, was my invaluable, curated inter-generational asset. These were the only foundation for future work.
  • I knew that I needed to organize my data in an industry-standard, archival, human-readable and machine-readable format so that I could use it across all my future projects, integrate it with external data, and export it into many different formats. Each new project and whatever applications I made or used would depend completely upon this first class-citizen, this curated data store. In contrast, apps, computing devices and networks would be, relative to the curated data, ephemeral second-class citizens.
  • Any information system I built or acquired had to be evolve-able and specialize-able: it had to have a reasonable cost of change as my business evolved, and the integration of my data needed to be nearly free.
  • My data was an open resource that must be shareable, that needed to far outlive the initial database application I made. (I knew that a hundred or so years in the future, climate change would alter the flora and fauna of the habitats I had recorded in; this would change the way those habitats sounded. I was convicted that my field observation data, with recordings, needed to be perpetually accessible as a benchmark of how the world had changed.) Whatever systems I used, the data must have its integrity and quality preserved.
  • This meant that my data needed to have its meaning precisely defined in the context of long-living semantic disciplines and technologies. This would enable successive generations (using different applications and systems) to understand and use my lifework, enshrined in the data legacy I left behind.
  • I needed to use low-code/no-code as much as possible; to enable this I wanted the semantic model to be the genesis of the data structures, constraints and presentation layer, being used to generate all or most data structures and app components/apps (model-driven everything). I needed to use established, well-fitting-with-my-domain ontologies, adding only what wasn’t available and allowing local variety in the context of standardization (specialize-able and single but federated). (Same with the apps.)

From 1994 to the present I’ve been seeking the discipline and technology stacks that a handful of architects and developers could use to create this legacy. I think that I have finally found them in the Data-Centric revolution. My remaining path is to develop full competence in the appropriate semantic disciplines and technology stacks, build my business and community and complete my information system artifacts: passing my work to my heirs over the next few decades.

Article By Jonathon R. Storm

Jonathon works as a data architect helping to maintain and improve a Data-Centric information system that is used to build enterprise databases and application code in a Data-Centric company. Jonathon continues to, on weekends, record the music of the wilderness; in the next year he plans to get his first EARTHTUNES website online to sell his nature sound recordings: you can email him at [email protected] to order now.

The Flagging Art of Saying Nothing

Who doesn’t like a nice flag? Waving in the breeze, reminding us of who we are and what we stand for. Flags are a nice way of providing a rallying point around which to gather and show our colors to the world. They are a way of showing membership in a group, or providing a warning. Which is why it is so unfortunate when we find flags in a data management system, because there they are reduced to saying nothing. Let me explain.

When we see Old Glory, we instantly know it is emblematic of the United States. We also instantly recognize the United Kingdom’s emblematic Union Jack and Canada’s Maple Leaf Flag. Another type of flag is a Warning flag alerting us to danger. In either case, we have a clear reference to what the flag represents. How about when you look at a data set and see ‘Yes’, or ‘7’? Sure, ‘Yes’ is a positive assertion and 7 is a number, but those are classifications, not meaning. Yes what? 7 what? There is no intrinsic meaning in these flags. Another step is required to understand the context of what is being asserted as ‘Yes’. Numeric values have even more ambiguity. Is it a count of something, perhaps 7 toasters? Is it a ranking, 7th place? Or perhaps it is just a label, Group 7?

In data systems the number of steps required to understand a value’s meaning is critical both for reducing ambiguity and, more importantly, for increasing efficiency. An additional step is required to understand that ‘Yes’ means ‘needs review’, so the processing steps have doubled to extract its meaning. In traditional systems, the two-step flag dance is required because two steps are required to capture the value. First a structure has to be created to hold the value, the ‘Needs Review’ column. Then a value must be placed into that structure. More often than not, an obfuscated name like ‘NdsRvw’ is used, which requires a third step to understand what that means. Only when the structure is understood can the value and meaning the system designer was hoping to capture be deciphered.

In cases where what value should be contained in the structure isn’t known, a NULL value is inserted as a placeholder. That’s right, a value literally saying nothing. Traditional systems are built as structure first, content second. First the schema, the structure definition, gets built. Then it is populated with content. The meaning of the content may or may not survive the contortions required to stuff it into the structure, but it gets stuffed in anyway in the hope it can deciphered later when extracted for a given purpose. Given situations where there is a paucity of data, there is a special name for a structure that largely says nothing – sparse tables. These are tables known to likely contain only a very few of the possible values, but the structure still has to be defined before the rare case values actually show up. Sparse tables are like requiring you to have a shoe box for every type of shoe you could possibly ever own even though you actually only own a few pair.

Structure-first thinking is so embedded in our DNA that we find it inconceivable that we can manage data without first building the structure. As a result, flag structures are often put in to drive system functionality. Logic then gets built to execute the flag dance, and it gets executed every time an interaction with the data occurs. The logic says something like this:
IF this flag DOESN’T say nothing
THEN do this next thing
OTHERWISE skip that next step
OR do something else completely.
Sadly, structure-first thinking requires this type of logic to be in place. The NULL placeholders are a default value to keep the empty space accounted for, and there has to be logic to deal with them.

Semantics, on the other hand, is meaning-first thinking. Since there is no meaning in NULL, there is no concept of storing NULL. Semantics captures meaning by making assertions. In semantics we write code that says “DO this with this data set.” No IF-THEN logic, just DO this and get on with it. Here is an example of how semantics maintains the fidelity of our information without having vacuous assertions.

The system can contain an assertion that the Jefferson contract is categorized as ‘Needs Review’ which puts it into the set of all contracts needing review. It is a subset of all the contracts. The rest of the contracts are in the set of all contracts NOT needing review. These are separate and distinct sets which are collectively the set of all contracts, a third set. System functionality can be driven by simply selecting the set requiring action, the “Needs Review” set, the set that excludes those that need review, or the set of all contracts. Because the contracts requiring review are in a different set, a sub-set, and it was done with a single step, the processing logic is cut in half. Where else can you get a 50% discount and do less work to get it?
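Here is a minimal sketch of that set-based approach with Python's rdflib (ex:NeedsReview and the other names are placeholders): membership in the set needing review is a positive assertion, and selecting the set requires no flag-checking or NULL-handling logic.

from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.com/")
g = Graph()

g.add((EX.jeffersonContract, RDF.type, EX.Contract))
g.add((EX.madisonContract, RDF.type, EX.Contract))

# A positive assertion: the Jefferson contract is in the set needing review.
# Nothing at all is asserted (no flag column, no NULL) about the Madison contract.
g.add((EX.jeffersonContract, EX.categorizedAs, EX.NeedsReview))

needs_review = """
PREFIX ex: <http://example.com/>
SELECT ?contract WHERE { ?contract ex:categorizedAs ex:NeedsReview }
"""
for row in g.query(needs_review):
    print(row.contract)    # only the Jefferson contract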

I love a good flag, but I don’t think they would have caught on if we needed to ask the flag-bearer what the label on the flagpole said to understand what it stood for.

Blog post by Mark Ouska 

For more reading on the topic, check out this post by Dave McComb.

The Data-Centric Revolution: Lawyers, Guns and Money

My book “The Data-Centric Revolution” will be out this summer. I will also be presenting at Dataversity’s Data Architecture Summit coming up in a few months. Both exercises reminded me that Data-Centric is not a simple technology upgrade. It’s going to take a great deal more to shift the status quo.

Let’s start with Lawyers, Guns and Money, and then see what else we need.

A quick recap for those who just dropped in: The Data-Centric Revolution is the recognition that maintaining the status quo on enterprise information system implementation is a tragic downward spiral.  Almost every ERP, legacy modernization, MDM, or you-name-it project is coming in at ever-higher cost and making the overall situation worse.

We call the status quo the “application-centric quagmire.”  The application-centric aspect stems from the observation that many business problems turn into IT projects, most of which end up with building, buying, or renting (Software as a Service) a new application system.  Each new application system comes with its own, arbitrarily different data model, which adds to the pile of existing application data models, further compounding the complexity, upping the integration tax, and inadvertently entrenching the legacy systems.

The alternative we call “data-centric.”  It is not a technology fix.  It is not something you can buy.  We hope for this reason that it will avoid the fate of the Gartner hype cycle.  It is a discipline and culture issue.  We call it a revolution because it is not something you add to your existing environment; it is something you do with the intention of gradually replacing your existing environment (recognizing that this will take time.)

Seems like most good revolutions would benefit from the Warren Zevon refrain: “Send lawyers, guns, and money.”  Let’s look at how this will play out in the data-centric revolution.

Click here to read more on TDAN.com

The 1st Annual Data-Centric Architecture Forum: Re-Cap

In the past few weeks, Semantic Arts hosted a new Data-Centric Architecture Forum.  One of the conclusions the participants reached was that it wasn’t like a traditional conference.  This wasn’t marching from room to room to sit through another talking-head, PowerPoint-led presentation. There were a few PowerPoint slides that served as anchors, but it was much more a continual co-creation of a shared artifact.

The consensus was:

  • Yes, let’s do it again next year.
  • Let’s call it a forum, rather than a conference.
  • Let’s focus on implementation next year.
  • Let’s make it a bit more vendor-friendly next year.

So retrospectively, last week was the first annual Data-Centric Architecture Forum.

What follows are my notes and conclusions from the forum.

Shared DCA Vision

I think we came away with a great deal of commonality and more specifics on what a DCA needs to look like and what it needs to consist of. The straw-man (see Appendix A) came through with just a few revisions (coming soon).  More importantly, it grounded everyone in what was needed and gave us a common vocabulary for the pieces.

Uniqueness

With all the brain power in the room, and given that people have been looking for this for a while, I think that if anyone had known of a platform or set of tools that provided all of this out of the box, they would have said so once we had described what such a solution entailed.

I think we have outlined a platform that does not yet exist and needs to.  With a bit of perseverance, next year we may have a few partial (maybe even more than partial) implementations.

Completeness

After working through this for 2½ days, I think that if anything major were missing, we would have caught it.  Therefore, this seems to be a pretty complete stack. All the components, and at least a first cut at how they relate to one another, seem to be in place.

Doable-ness

While there are a lot of parts in the architecture, most of the people in the room thought that most of the parts were well-known and doable.

This isn’t a DARPA challenge to design some state-of-the-art thing, this is more a matter of putting pieces together that we already understand.

Vision v. Reference Architecture

As noted right at the end, this is a vision for an architecture, not a specific architecture or a reference architecture.

Notes From Specific Sessions

DCA Strawman

Most of this was already covered above.  I think we eventually suggested that “Analytics” might deserve its own layer.  You could say that analytics is a “behavior,” but that seems to be burying the lead.

I also thought it might be helpful to call out some of the specific key APIs suggested by the architecture. It also looks like we need to split the MDM style of identity management from user identity management, both for clarity and for positioning in the stack.

State of the Industry

There is a strong case to be made that knowledge-graph-driven enterprises are eating the economy.  Part of this may be because network-effect companies are naturally sympathetic to network data structures.  But we think the case can be made that the flexibility inherent in KGs applies to companies in any industry.

According to research that Alan provided, the average enterprise now runs 1,100 different SaaS services.  This is fragmenting the data landscape even faster than legacy systems did.

Business Case

A lot of the resistance isn’t technical, but instead tribal.

Even within the AI community there are tribes with little cross-fertilization:

  • Symbolists
  • Bayesians
  • Statisticians
  • Connectionists
  • Evolutionaries
  • Analogizers

On the integration front, the tribes are:

  • Relational DB Linkers
  • Application-Centric ESB Advocates
  • Application-Centric RESTful Developers
  • Data-centric Knowledge Graphers

Click here to read more on TDAN.com

The Data-Centric Revolution: Chapter 2


Below is an excerpt from, and a downloadable copy of, Chapter 2: “What is Data-Centric?”

CHAPTER 2

What is Data-Centric?

Our position is:

A data-centric enterprise is one where all application functionality is based on a single, simple, extensible data model.

First, let’s make sure we distinguish this from the status quo, which we can describe as an application-centric mindset. Very few large enterprises have a single data model. They have one data model per application, and they have thousands of applications (including those they bought and those they built). These models are not simple. In every case we examined, application data models are at least 10 times more complex than they need to be, and the sum total of all application data models is at least 100-1000 times more complex than necessary.

Our measure of complexity is the sum total of all the items in the schema that developers and users must learn in order to master a system.  In relational technology this would be the number of tables plus the number of attributes (columns).  In object-oriented systems, it is the number of classes plus the number of attributes.  In an XML- or JSON-based system, it is the number of unique elements and/or keys.

The number of items in the schema directly drives the number of lines of application code that must be written and tested.  It also drives the complexity for the end user, as each item eventually surfaces in forms or reports, and the user must master what these items mean and how they relate to each other in order to use the system.
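As a toy illustration of that measure (the schema below is invented, not taken from the book), counting the items someone must learn is a simple sum:

# Hypothetical relational schema: 3 tables and 14 columns = 17 items to learn.
toy_schema = {
    "Customer":  ["id", "name", "region", "credit_limit"],
    "Order":     ["id", "customer_id", "order_date", "status", "total"],
    "OrderLine": ["id", "order_id", "product_id", "quantity", "price"],
}

complexity = len(toy_schema) + sum(len(cols) for cols in toy_schema.values())
print(complexity)  # 17

Real application schemas run to hundreds or thousands of such items, and the measure scales the same way.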

Very few organizations have applications based on an extensible model. Most data models are very rigid.  This is why we call them “structured data.”  We define the structure, typically in a conceptual model, and then convert that structure to a logical model and finally a physical (database specific) model.  All code is written to the model.  As a result, extending the model is a big deal.  You go back to the conceptual model, make the change, then do a bunch of impact analysis to figure out how much code must change.

An extensible model, by contrast, is one that is designed and implemented such that changes can be made to the model even while the application is in use. Later in this book, and especially in the two companion books, we get into a lot more detail on the techniques that need to be in place to make this possible.
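As a rough sketch of what “changes while the application is in use” can look like, here is a Python example using the rdflib library; the property names are invented for illustration, and this is just one implementation style, not the specific techniques described later in the book:

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import XSD

EX = Namespace("http://example.com/ontology/")
g = Graph()
g.add((EX.JeffersonContract, EX.hasStatus, EX.NeedsReview))

# A new requirement arrives after go-live: track each contract's renewal date.
# There is no ALTER TABLE step; the new property is simply asserted, and
# existing data, queries, and code that ignore it are unaffected.
g.add((EX.JeffersonContract, EX.renewalDate,
       Literal("2026-06-30", datatype=XSD.date)))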

In the data-centric world we are talking about a data model that is primarily about what the data means (that is, the semantics). It is only secondarily, and sometimes locally, about the structure, constraints, and validation to be performed on the data.

Many people think that a model of meaning is “merely” a conceptual model that must be translated into a “logical” model, and finally into a “physical” model, before it can be implemented. Many also think a conceptual model lacks the requisite detail and/or fidelity to support implementation. What we have found over the last decade of implementing these systems is that, done well, the semantic (conceptual) data model can be put directly into production, and that it contains all the requisite detail to support the business requirements.

And let’s be clear, being data-centric is a matter of degree. It is not binary. A firm is data-centric to the extent (or to the percentage) its application landscape adheres to this goal.

Data-Centric vs. Data-Driven

Many firms claim to be, and many firms are, “data-driven.” This is not quite the same thing as data-centric. “Data-driven” refers more to the place of data in decision processes. A non-data-driven company relies on human judgement as the justification for decisions. A data-driven company relies on evidence from data.

Data-driven is not the opposite of data-centric. In fact, they are quite compatible, but merely being data-driven does not ensure that you are data-centric. You could drive all your decisions from data sets and still have thousands of non-integrated data sets.

Our position is that data-driven is a valid aspiration, though data-driven does not imply data-centric. Data-driven would benefit greatly from being data-centric as the simplicity and ease of integration make being data-driven easier and more effective.

We Need our Applications to be Ephemeral

The first corollary to the data-centric position is that applications are ephemeral, and data is the important and enduring asset. Again, this is the opposite of the current status quo. In traditional development, every time you implement a new application, you convert the data to the new application’s representation. These application systems are very large capital projects. This causes people to think of them like more traditional capital projects (factories, office buildings, and the like). When you invest $100 million in a new ERP or CRM system, you are not inclined to think of it as throwaway. But you should. Well, really you shouldn’t be spending that kind of money on application systems, but given that you already have, it is time to reframe this as sunk cost.

One of the ways application systems have become entrenched is through the application’s relation to the data it manages. The application becomes the gatekeeper to the data. The data is a second-class citizen, and the application is the main thing. In data-centric, the data is permanent and enduring, and applications can come and go.

Data-Centric is Designed with Data Sharing in Mind

The second corollary to the data-centric position is default sharing. The default position for application-centric systems is to assume local self-sufficiency. Most relational database systems base their integrity management on having required foreign key constraints. That is, an ordering system requires that all orders be from valid customers. The way they manage this is to have a local table of valid customers. This is not sharing information. This is local hoarding, made possible by copying customer data from somewhere else. And this copying process is an ongoing systems integration tax. If they were really sharing information, they would just refer to the customers as they existed in another system. Some API-based systems get part of the way there, but there is still tight coupling between the ordering system and the customer system that is hosting the API. This is an improvement but hardly the end game.

As we will see later in this book, it is now possible to have a single instantiation of each of your key data types—not a “golden source” that is copied and restructured to the various application consumers, but a single copy that can be used in place.
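A minimal sketch of referring to data in place rather than copying it, again in Python with rdflib; the namespaces and property names are invented for illustration:

from rdflib import Graph, Namespace, Literal

EX   = Namespace("http://example.com/ontology/")
CUST = Namespace("http://data.example.com/customer/")
ORD  = Namespace("http://data.example.com/order/")

orders = Graph()
# The order refers to the customer by the same IRI the shared customer graph
# uses; no local copy of the customer record is created or synchronized.
orders.add((ORD["O-1001"], EX.placedBy, CUST["C-42"]))
orders.add((ORD["O-1001"], EX.totalAmount, Literal(199.00)))

Any query that needs customer attributes resolves that same IRI against the shared customer data instead of joining to a locally hoarded copy.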

Is Data-Centric Even Possible?

Most experienced developers, after reading the above, will explain to you why this is impossible. Based on their experience, it is impossible. Most of them have grown up with traditional development approaches. They have learned how to build traditional standalone applications. They know how applications based on relational systems work. They will use this experience to explain to you why this is impossible. They will tell you they tried this before, and it didn’t work.

Further, they have no idea how a much simpler model could recreate all the distinctions needed in a complex business application. There is no such thing as an extensible data model in traditional practice.

You need to be sympathetic and recognize that based on their experience, extensive though it might be, they are right. As far as they are concerned, it is impossible.

But someone’s opinion that something is impossible is not the same as it not being possible. In the late 1400s, most Europeans thought that the world was flat and sailing west to get to the far east was futile. In a similar vein, in 1900 most people were convinced that heavier than air flight was impossible.

The advantage we have relative to the pre-Columbians and the pre-Wrights is that we are already post-Columbus and post-Wrights. These ideas are both theoretically correct and have already been proved.

The Data-Centric Vision

To hitch your wagon to something like this, we need to make a few aspects of the end game much clearer. We said earlier that the core of this is the idea of a single, simple, extensible data model. Let’s drill into this a bit deeper.

Click here to download the entire chapter.

Use the code SemanticArts for a 20% discount at Technicspub.com.