Structure-First Data Modeling: The Losing Battle of Perfect Descriptions

In my last article I described Meaning-First data modeling. It’s time to dig into its predecessor and antithesis, which I call Structure-First data modeling, looking specifically at how two assumptions drive our actions. Assumptions are quite useful: they leverage experience without our having to re-learn what is already known. A real time-saver.

Until it isn’t.

For nearly half a century, the eventual implementation of data management systems has consisted of various incarnations of tables-with-columns and the supporting infrastructure that weaves them into a solution. The brilliant work of Steve Hoberman, Len Silverston, David Hay, and many others in developing data modeling strategies and patterns is notable and admirable; they pushed the art and science of data modeling forward. As strong as those contributions are, they are still description-focused and assume a Structure-First implementation.

Structure-First data modeling is based on two assumptions. The first assumption is that the solution will always be physically articulated in a tables-with-columns structure. The second is that proceeding requires developing complete descriptions of subject matter. This second assumption is also on the path of either/or thinking; either the description is complete, or it is not. If it is not, then tables-with-columns (and a great deal of complexity) are added until it is complete. Our analysis, building on these assumptions, is focused on the table structures and how they are joined to create a complete attribute inventory.

The focus on structure is required because no data can be captured until the descriptive attribute structure exists. This inflexibility makes the system both brittle and complex.
All the descriptive attribution stuffed into tables-with-columns is a parts list for the concept, but there is no succinct definition of the whole. These first steps on a data management journey set us on the path to complexity, and since they rest on rarely articulated assumptions, the path is never questioned. The complete Structure-First model must accommodate every possible descriptive attribute that could be useful. We have studied the normal forms E. F. Codd introduced, and others later extended, and drive toward structural normalization. Our analysis is therefore focused on avoiding repeating columns, multiple values in a single column, and the like, rather than on what the data means.
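To make the normalization reflex concrete, here is a minimal sketch using Python's built-in sqlite3. The table and column names are my own illustrative choices, not a canonical design; the point is only that the analysis revolves around where values live, not what they mean.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Unnormalized: repeating columns (phone1, phone2) cap how many phones
# fit and bury the meaning of "phone" in column names.
con.execute("CREATE TABLE person_flat (name TEXT, phone1 TEXT, phone2 TEXT)")

# Normalized: the repeating group moves to its own table. The fix is
# purely structural -- the meaning of a phone number is still implicit.
con.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE phone (person_id INTEGER REFERENCES person, number TEXT);
""")
con.execute("INSERT INTO person VALUES (1, 'Ada')")
con.executemany("INSERT INTO phone VALUES (?, ?)",
                [(1, "555-0100"), (1, "555-0199")])

# Getting the data back out now requires knowing the join path.
rows = con.execute("""
    SELECT p.name, ph.number
    FROM person p JOIN phone ph ON ph.person_id = p.person_id
""").fetchall()
print(sorted(rows))
```

Note that the normalized form is structurally cleaner, yet nothing in it states what a person or a phone number *is*; that knowledge stays in the modeler's head.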

Yet with all the attention paid to capturing every descriptive attribute, new ones constantly appear. We know this is inevitable for any system with even a modest lifespan. For example, thanks to COVID-19, educational institutions that have never offered online courses are suddenly faced with moving exclusively to online offerings, at least temporarily. Buildings and rooms are not relevant for those offerings, but web addresses and enabling software are. Experience demonstrates how costly it is, in both time and resources, to add a new descriptive attribute after the system has been put into production. Something inevitably needs to be added: because it was missed, because a new requirement arrived, or because, buried in the long parts list of descriptive attributes, the same thing has been described several times in different ways. The brittle nature of tables-with-columns means every change requires very expensive modeling, refactoring, and regression testing to get into production.

Neither the tables-with-columns assumption nor the complete-parts-list assumption applies when developing semantic knowledge graph solutions using a Meaning-First data modeling approach. Why am I convinced Meaning-First will advance the data management discipline? Because Meaning-First is definitional, follows the path of both/and thinking, and rests on a single structure, the triple, for virtually everything. The World Wide Web Consortium (W3C) defined the standard RDF (Resource Description Framework) triple to enable linking data on the open web and within private organizations. The definition, articulated in RDF triples, captures the essence to which new facts are linked. Semantic technologies provide a solid, machine-interpretable definition and the standard RDF triple as the structure. Since there is no need to build new structures, new information can be added instantly: drop it into the database and it links to existing data right away.
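The single-structure idea can be sketched in a few lines of Python, with plain (subject, predicate, object) 3-tuples standing in for real RDF terms. The prefixes and property names below are illustrative, not a real vocabulary, and a production triplestore would of course do far more; the sketch only shows that a brand-new kind of fact needs no new structure.

```python
# Everything is one structure: a set of (subject, predicate, object) triples.
graph = set()

# Initial facts about a course offering.
graph.add(("course:101", "rdf:type", "ex:Course"))
graph.add(("course:101", "ex:title", "Intro to Data Modeling"))

# Later, a brand-new kind of fact arrives (say, a web address for an
# online offering). No schema change: it is just another triple.
graph.add(("course:101", "ex:webAddress", "https://example.org/101"))

def match(g, s=None, p=None, o=None):
    """Return triples matching a pattern; None is a wildcard."""
    return [t for t in g
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match(graph, p="ex:webAddress"))
```

Because the new fact shares the subject `course:101`, it is linked to the existing data the moment it is added, with no refactoring or regression testing of structures.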

While meaning and structure are separate concepts, we have been conflating them for decades, resulting in unnecessary complexity. Humankind has been formalizing the study of meaning since Aristotle and has made significant progress along the way. Philosophy’s formal logics are semantics’ Meaning-First cornerstone. Formal logics define the nature of whatever is being studied such that when something matches the formal definition, it can be proved to be necessarily in the defined set. Semantic technology has made that machinery machine-readable. An example might make this easier to understand.

Consider a requirement to know which teams have won the Super Bowl. How would each approach solve this requirement? The required data is:
• Super Bowls played
• Teams that played in each Super Bowl
• Final scores
The data to be acquired is virtually the same in both cases, so this example skips those mechanics to focus on the differences.

A Structure-First approach might look something like this. First, create a conceptual model with the table structures and columns to contain all the relevant team, Super Bowl, and score data. Second, derive a logical model from the conceptual model, identifying the designs that will allow the data to be connected and used: primary and foreign keys, logical data types and sizes, and join structures for assembling data from multiple tables. Third, derive a physical model from the logical model, specifying the storage strategy and vendor-specific implementation details.
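Compressed to its essentials, the Structure-First result might look like the following sketch, using Python's built-in sqlite3. The table and column names (and the sample Super Bowl LIV scores) are my own illustrative choices, not a canonical design.

```python
import sqlite3

# The structure must exist, with every column decided up front,
# before any data can land.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE team (team_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE super_bowl (sb_id INTEGER PRIMARY KEY, number TEXT);
CREATE TABLE appearance (
    sb_id   INTEGER REFERENCES super_bowl,
    team_id INTEGER REFERENCES team,
    score   INTEGER
);
""")
con.executemany("INSERT INTO team VALUES (?, ?)",
                [(1, "Chiefs"), (2, "49ers")])
con.execute("INSERT INTO super_bowl VALUES (1, 'LIV')")
con.executemany("INSERT INTO appearance VALUES (?, ?, ?)",
                [(1, 1, 31), (1, 2, 20)])

# To ask "who won?", the query author must already know the join paths.
winners = con.execute("""
    SELECT sb.number, t.name
    FROM appearance a
    JOIN team t ON t.team_id = a.team_id
    JOIN super_bowl sb ON sb.sb_id = a.sb_id
    WHERE a.score = (SELECT MAX(score) FROM appearance a2
                     WHERE a2.sb_id = a.sb_id)
""").fetchall()
print(winners)  # → [('LIV', 'Chiefs')]
```

Notice that "winner" appears nowhere in the schema; it lives only in the head of whoever writes the correlated subquery.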

Only at this point can the data be entered into the Structure-First system. This is because until the structure has been built, there is no place for the data to land. Then, unless you (the human user) know the structure, there is no way to get data back out. However, this isn’t true when using Meaning-First semantic technology.

A Meaning-First approach can start either by acquiring well-formed triples or building the model as the first step. The model can then define the meaning of “Super Bowl winner” as the team with the highest score for each Super Bowl occurrence. Semantic technology captures the meaning using formal logics, and the data that match that meaning self-assemble into the result set. Formal logics can also be used to infer which teams might have won the Super Bowl using the logic “in order to win, the team must have played in the Super Bowl,” and not all NFL teams have.
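A rough sketch of the Meaning-First side, again with plain Python tuples standing in for RDF triples and `max`-by-score standing in for a formal-logic definition. All prefixes and property names are illustrative; a real system would express the definition in an ontology language rather than a Python function.

```python
# The same facts as (subject, predicate, object) triples.
triples = {
    ("sb:LIV", "ex:hasAppearance", "app:1"),
    ("sb:LIV", "ex:hasAppearance", "app:2"),
    ("app:1", "ex:team", "team:Chiefs"),
    ("app:1", "ex:score", 31),
    ("app:2", "ex:team", "team:49ers"),
    ("app:2", "ex:score", 20),
}

def objects(s, p):
    """All objects o such that (s, p, o) is asserted."""
    return [o for (s2, p2, o) in triples if s2 == s and p2 == p]

def winners():
    """The set 'Super Bowl winner', per its definition:
    the team with the highest score in each Super Bowl."""
    result = {}
    games = {s for (s, p, o) in triples if p == "ex:hasAppearance"}
    for game in games:
        best = max(objects(game, "ex:hasAppearance"),
                   key=lambda app: objects(app, "ex:score")[0])
        result[game] = objects(best, "ex:team")[0]
    return result

print(winners())  # → {'sb:LIV': 'team:Chiefs'}
```

The membership of the winner set is computed from the definition; no "winner" column is ever stored, and adding Super Bowl LV is just more triples.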

The key is that in the Meaning-First example, members of the set called Super Bowl winners can be returned without identifying the structure in the request. The Structure-First example required understanding and navigating the structure before even starting to formulate the question. It’s not so hard in this simple example, but in enterprise data systems with hundreds, or more likely thousands, of tables, understanding the structure is extremely challenging.

Semantic Meaning-First databases, known as triplestores, are not collections of tables-with-columns. They are composed of RDF triples, used both for the definitions (the schema, in the form of an ontology) and for the content (the data). As a result, you can write queries against an RDF data set you have never seen and get meaningful answers. Queries can return which sets have been defined, then find where a set is used as the subject or the object of a statement. Semantic queries simply walk across the formal logic that defines the graph, letting the graph itself inform you about possible next steps. This isn’t an option in Structure-First environments, because they are not based in formal logic and the schema is encapsulated in a different language from the data.
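That self-describing quality can be shown even in the toy tuple model: because schema and data share one structure, a data set you have never seen can tell you what it contains. The vocabulary below is again illustrative.

```python
# A triple set received from elsewhere, never seen before.
triples = {
    ("team:Chiefs", "rdf:type", "ex:Team"),
    ("sb:LIV", "rdf:type", "ex:SuperBowl"),
    ("sb:LIV", "ex:winner", "team:Chiefs"),
}

# What kinds of statements exist here? Just ask the data.
predicates = {p for (_, p, _) in triples}

# What sets (classes) have members?
classes = {o for (_, p, o) in triples if p == "rdf:type"}

print(sorted(predicates))
print(sorted(classes))
```

The same two pattern queries work on any triple set whatsoever, which is exactly what a SPARQL user does when exploring an unfamiliar graph; no data dictionary or DBA walkthrough is required first.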

Traditional Structure-First databases are made up of tens, hundreds, often thousands of tables. Each table is arbitrarily invented and named by the modeler, with the goal of containing all the attributes of a specific concept. Within each table are columns that are also invented, hopefully with plenty of rigor, but invented nonetheless. You can prove this to yourself by looking at the lack of standard definitions around a concept as simple as address: some modelers leverage modeling patterns, some leverage standards like USPS, but the variability between systems is great and arbitrary.

Semantic technology has enabled the Meaning-First approach with machine-readable definitions to which new attribution can be added in production. At the same time this clarity is added to the data management toolkit, semantic technology sweeps away the nearly infinite collection of complex table-with-column structures with the one single, standards-based RDF triple structure. Changing from descriptive to definitional is orders of magnitude clearer. Replacing tables and columns with triples is orders of magnitude simpler. Combining them into a single Meaning-First semantic solution is truly a game changer.
