In the previous installment of this two-part series, we introduced three ideas.
First, data governance may be more similar to DevOps than first meets the eye.
Second, the rise of Knowledge Graphs, Semantics and Data-Centric development will bring with it the need for something similar, which we are calling “SemOps” (Semantic Operations).
Third, when you peel back what people are doing in DevOps and Data Governance, you get down to five key activities that will be very instructive in our SemOps journey:
- Allowing / “Permission-ing”
- Predicting Side Effects
We’ll take up each in turn and compare and contrast how each activity is performed in DevOps and Data Governance to inform our choices in SemOps.
But before we do, I want to cover one more difference: how the artifacts scale under management.
There isn’t any obvious hierarchy to code, from abstract to concrete or general to specific, as there is in data and semantics. It’s pretty much just a bunch of code, partitioned by silos. Some of it you bought, some you built, and some you rent through SaaS (Software as a Service).
Each of these silos often represents a lot of code. Something as simple as QuickBooks is 10 million lines of code. SAP is hundreds of millions. Most in-house software is not as bloated as most packages or software services; still, it isn’t unusual to have millions of lines of code in an in-house developed project (much of it is in libraries that were copied in, but it still represents complexity to be managed). The typical large enterprise is managing billions of lines of code.
The only thing that makes this remotely manageable is, paradoxically, the thing that makes it so problematic: isolating each codebase in its own silo. Within a silo, the developer’s job is to not introduce something that will break the silo and to not introduce something that will break the often fragile “integration” with the other silos.
Data and Metadata
There is a hierarchy to data that we can leverage for its governance. The main distinction is between data and metadata.
There is almost always more data than metadata: more rows than columns. But in many large enterprises there is far more metadata than anyone would guess. We were privy to a project to inventory the metadata for a large company, which shall go nameless. At the end of the profiling, it turned out that the firm had 200 million columns under management. That is columns, not rows. No doubt there were billions of rows in all their data.
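To make the columns-versus-rows distinction concrete, here is a minimal sketch of what a metadata inventory does at the smallest possible scale. It assumes a single in-memory SQLite database with two made-up tables (`customer` and `invoice` are hypothetical, as is the `inventory` helper); the project described above would have had to repeat something like this across every database in the firm.

```python
import sqlite3

# Hypothetical toy database standing in for one of many enterprise systems.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER, name TEXT, region TEXT);
CREATE TABLE invoice (id INTEGER, customer_id INTEGER, amount REAL, issued TEXT);
INSERT INTO customer VALUES (1, 'Acme', 'West'), (2, 'Globex', 'East');
INSERT INTO invoice VALUES (10, 1, 99.5, '2020-01-01'),
                           (11, 1, 12.0, '2020-01-02'),
                           (12, 2, 7.25, '2020-01-03');
""")

def inventory(conn):
    """Return (table count, column count, row count) for one database."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    columns = 0
    rows = 0
    for table in tables:
        # Metadata: one result row per column in the table definition.
        columns += len(conn.execute(f'PRAGMA table_info("{table}")').fetchall())
        # Data: the number of rows actually stored.
        rows += conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    return len(tables), columns, rows

table_count, column_count, row_count = inventory(conn)
print(f"{table_count} tables, {column_count} columns, {row_count} rows")
```

The column count is the metadata figure and the row count is the data figure; in a real enterprise both numbers grow, but the column count is the one that tends to surprise people.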
There are also other levels that people often introduce to help with the management of this pyramid. People often separate Reference data (e.g., codes and geographies) and Master data (slower changing data about customers, vendors, employees and products).
These distinctions help, but even as the data governance people are trying to get their arms around this, the data scientists show up with “Big Data.” Think of big data as sitting below the bottom of this pyramid. Typically, it is even more voluminous, and usually has only the most ad hoc metadata (the “keys” in the “key/value pairs” in the deeply nested JSON data structures are metadata, sort of, but you are left guessing what these short cryptic labels actually mean).
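As a rough illustration of how thin that ad hoc metadata is, the sketch below walks a nested JSON record and lists every key path it finds. The record, its cryptic key names and the `key_paths` helper are all invented for this example; the point is only that the key paths are the sole "schema" such data carries.

```python
import json

def key_paths(value, prefix=""):
    """Yield every key path found in a nested JSON-like structure."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from key_paths(child, path)
    elif isinstance(value, list):
        # Mark list membership with "[]" rather than a numeric index.
        for child in value:
            yield from key_paths(child, prefix + "[]")

# Hypothetical record with the kind of short labels found in practice.
record = json.loads(
    '{"cust": {"nm": "Acme", "addr": {"st": "CO", "zp": "80521"}},'
    ' "ords": [{"amt": 12.5, "cd": "A7"}]}'
)

paths = sorted(set(key_paths(record)))
for path in paths:
    print(path)
```

The output is a list of paths such as `cust.addr.zp` and `ords[].cd`. That is all the metadata there is: nothing says whether `cd` is a product code, a currency code or a credit/debit flag.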