When is a Brick not a Brick?

They say good things come in threes and my journey to data-centricity started with three revelations.

The first was connected to a project I was working on for a university college with a problem that might sound familiar to some of you. The department I worked in was taking four months to clean, consolidate and reconcile our quarterly reports to the college executive. We simply did not have the resources to integrate incoming data from multiple applications into a coherent set of reports in a timely way.

The second came in the form of a lateral thinking challenge worthy of Edward de Bono: ‘How many different uses for a brick can you think of?’

The third revelation happened when I was on a consulting assignment at a multinational software company in Houston, Texas. As part of a content management initiative we were hired to work with their technical documentation team to install a large ECM application. What intrigued me the most, though, were the challenges the company experienced at the interface between the technology and the ‘multiple of multiples’ with respect to business language.

Revelation #1: Application Data Without the Application is Easy to Work With

The college where I had my first taste of data-centricity had the usual array of applications supporting its day-to-day operations. There were Student systems, HR systems, Finance systems, Facility systems, Faculty systems and even a separate Continuing Education System that replicated all those disciplines (with their own twists, of course) under one umbrella.

The department I worked in was responsible for generating executive quarterly reports for all activities on the academic side plus semi-annual faculty workload and annual graduation and financial performance reports. In the beginning we did this piecemeal and as IT resources became available. One day, we decided to write a set of specifications about what kind of data we needed; to what level of granularity; in what sequence; and, how frequently it should be extracted from various sources.

We called the process ‘data liquefication’ because once the data landed on our shared drive the only way we could tell what application it came from was by the file name. Of course, the contents and structure of the individual extracts were different, but they were completely pliable. Detached from the source application, we had complete freedom to do almost anything we wanted with it. And we did. The only data model we had to build (actually, we only ever thought about it once) was which ‘unit of production’ to use as the ‘center’ of our new reporting universe. To those of you working with education systems today, the answer will come as no surprise. We used ‘seat’.

Figure 1: A Global Candidate for Academic Analytics

Once that decision was taken, and we put feedback loops in to correct data quality at source, several interesting patterns emerged:

  • The collections named Student, Faculty, Administrator and Support Staff were not as mutually exclusive as we originally thought. Several individuals occupied multiple roles in one semester.
  • The Finance categories were set up to reflect the fact that some expenses applied to all Departments; some were unique to individual Departments; and, some were unique to Programs.
  • Each application seemed to use a different code or name or structure to identify the same Person, Program or Facility.

From these patterns we were able to produce quarterly reports in half the time. We also introduced ‘what-if’ reporting for the first time, and since we used the granular concept of ‘seat’ as our unit of production we added Cost per Seat; Revenue per Seat; Overhead per Seat; Cross-Faculty Registration per Seat; and, Longitudinal Program Costs, Revenues, Graduation Rates and Employment Patterns to our mix of offerings as well.

Revelation #2: A Brick is Always a Brick. How it is Used is a Separate Question

When we separate what a thing “is” from how it is used, some interesting data patterns show up. I won’t take up much space in this article to enumerate them, but the same principle that can take ‘one thing’ like an individual brick and use it in multiple ways (paper weight, door stop, wheel chock, pendulum weight, etc.) puts the whole data classification thing in a new light.

The string “John Smith” can appear, for example, as the name of a doctor, a patient, a student, an administrator and/or an instructor. This is a similar pattern to the one that popped up at the university college. As it turns out, that same string can be used as an entity name, an attribute, metadata, reference data and a few other popular ‘sub-classes’ of data. They are not separate collections of ‘things’ as much as they are separate functions of the same thing.

Figure 2: What some ‘thing’ is and how it is used are two separate things

The implication for me was to classify ‘things’ first and foremost as what they refer to or in fact what they are. So, “John Smith” refers to an individual, and in my model surrounding data-centricity “is-a” (member of the set named) Person. On the other side of the equation, words like ‘Student’, ‘Patient’, and ‘Administrator’ for example are Roles. In my declarations, Student “is-a” (member of the set named) Role.

One of the things this allowed me to do was to create a very small (n = 19) number of mutually exclusive and exhaustive sets in any collection. This development also supported the creation of semantically interoperable interfaces and views into broadly related data stores.
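To make that separation concrete, here is a minimal sketch in Python using the rdflib library (the namespace, class names and the hasRole property are my own illustrative choices, not the author's actual model): the individual is typed once as a Person, and the Roles are attached as relationships.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/campus#")
g = Graph()
g.bind("ex", EX)

# What the thing *is*: "John Smith" refers to one individual, a Person.
g.add((EX.JohnSmith, RDF.type, EX.Person))
g.add((EX.JohnSmith, RDFS.label, Literal("John Smith")))

# Roles form their own mutually exclusive set of categories.
for role in (EX.Student, EX.Instructor, EX.Administrator):
    g.add((role, RDF.type, EX.Role))

# How the thing is *used*: the same individual plays several roles in one semester.
g.add((EX.JohnSmith, EX.hasRole, EX.Student))
g.add((EX.JohnSmith, EX.hasRole, EX.Instructor))

# One person record, queried by role; no duplicate "John Smith" rows per system.
for row in g.query("SELECT ?role WHERE { ex:JohnSmith ex:hasRole ?role }",
                   initNs={"ex": EX}):
    print(row.role)
```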

Revelation #3: Shape and Semantics Must be Managed Separately and on Purpose

The theme of separation came up again while working on a technical publications project in Houston, Texas. Briefly, the objective was to render application user support topics into their smallest, reusable chunks and make it possible for technical writers to create document maps ranging from individual Help files in four different formats to full-blown, multi-chapter user guides and technical references. What really made the project challenging was what we came to call the ‘multiple of multiples’ problem. This turned out to be the exact opposite challenge of reuse in Revelation #1:

  • Multiple customer platforms
  • Multiple versions of customer platforms
  • Multiple product families (Mainframe, Distributed and Hybrid)
  • Multiple product platforms
  • Multiple versions of product platforms
  • Multiple versions of products (three prior, one current, and one work-in-progress)
  • Multiple versions of content topics
  • Multiple versions of content assemblies (guides, references, specification sheets, for example)
  • Multiple customer locales (United States, Japan, France, Germany, China, etc.)
  • Multiple customer languages (English (two ‘flavours’), Japanese, German, Chinese, etc.)

The solution to this ‘factorial mess’ was not found in an existing technology (including the ECM software we were installing) but in fact came about not only by removing all architectural or technical considerations (as we did in Revelation #1), but also by asking what it means to say: “The content is the same” or “The content is different.”

In the process of comparing two components found in the ‘multiple of multiples’ list, we discovered three factors for consideration:

  1. The visual ‘shape’ of the components. ‘Stop’ and ‘stop’ look the same.
  2. The digital signatures of the components. We used MD5 Hash to do this.
  3. The semantics of the components. We used translators and/or a dictionary.
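The sketch below shows one way that three-factor comparison could be coded in Python; the hashing step uses MD5 as described above, while the tiny glossary is only a stand-in for the translators and dictionary the team actually used (the sample strings and function are illustrative, not from the project).

```python
import hashlib

# Toy stand-in for a translation dictionary / glossary lookup.
GLOSSARY = {"Stop": "halt", "stop": "halt", "Arrêt": "halt"}

def compare(a: str, b: str) -> dict:
    return {
        # 1. Shape: do the components look the same on the page?
        #    (approximated here with a case-insensitive comparison)
        "same_shape": a.casefold() == b.casefold(),
        # 2. Signal: do the bytes produce the same MD5 digest?
        "same_signature": hashlib.md5(a.encode("utf-8")).hexdigest()
                          == hashlib.md5(b.encode("utf-8")).hexdigest(),
        # 3. Semantics: do the components resolve to the same meaning?
        "same_semantics": GLOSSARY.get(a) == GLOSSARY.get(b),
    }

print(compare("Stop", "stop"))   # same shape and meaning, different signature
print(compare("Stop", "Arrêt"))  # same meaning only
```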

Figure 3 shows the matrix we used to demonstrate the tendency of each topic to be reused (or not) in one of the multiples.

Figure 3: Shape, Signal and Semantics for Content Component Comparison

It turns out that content can vary as a result of time (a version), place (a locale with different requirements for the same feature, for example), people (different languages) and/or format (saving a .docx file as a PDF). In addition to changes in individual components, assemblies of components can have their own identities.

This last point is especially important. Some content was common to all products the company sold. Other content varied along product lines, client platform, target market and audience. Finally, the last group of content elements was unique to a specific combination of parameters.

Take-Aways

Separating data from its controlling applications presents an opportunity to look at it in a new way. With the data removed from its physical and logical constraints, data-centricity begins to look a lot like the language of business. While the prospect of liberating data this way might horrify many application developers and data modelers out there, those of us trying to get the business closer to the information it needs to accomplish its goals see the beginning of a more naturally integrated way of doing that.

The Way Forward with Data-Centricity

Data-centricity in architecture is going to take a while to get used to. I hope this post has given readers a sense of what the levers to making it work might look like and how they could be put to good use.

Click here to read a free chapter of Dave McComb’s book, “The Data-Centric Revolution”

Article by John O’Gorman

Connect with the Author

Toss Out Metadata That Does Not Bring Joy

As I write this, I can almost hear you wail “No, no, we don’t have too much metadata, we don’t have nearly enough!  We have several projects in flight to expand our use of metadata.”

Sorry, I’m going to have to disagree with you there.  You are on a fool’s errand that will just provide busy work and will have no real impact on your firm’s ability to make use of the data it has.

Let me tell you what I have seen in the last half dozen or so very large firms I’ve been involved with, and you can tell me if this rings true for you.  If you are in a mid-sized or even small firm you may want to divide these numbers by an appropriate denominator, but I think the end result will remain the same.

Most large firms have thousands of application systems.  Each of these systems has a data model that consists of hundreds of tables and many thousands of columns.  Complex applications, such as SAP, explode these numbers (a typical SAP install has populated 90,000 tables and a half million columns).

Even as we speak, every vice president with a credit card is unknowingly expanding their firm’s data footprint by implementing suites of SaaS (Software as a Service) applications.  And let’s not even get started on your Data Scientists.  They are rabidly vacuuming up every dataset they can get their hands on, in the pursuit of “insights.”

Naturally you are running out of space, and especially system admin bandwidth in your data centers, so you turn to the cloud.  “Storage is cheap.”

This is where the Marie Kondo analogy kicks in.  As you start your migration to the cloud (or to your Data Lake, which may or may not be in the cloud), you decide “this would be a good time to catalog all this stuff.”  You launch into a project with the zeal of a Property and Evidence Technician at a crime scene. “Let’s carefully identify and tag every piece of evidence.”  The advantage that they have, and you don’t, is that their world is finite.  You are faced with cataloging billions of pieces of metadata.  You know you can’t do it alone, so you implore the people who are putting the data in the Data Swamp (er, Lake).  You mandate that anything that goes into the lake must have a complete catalog.  Pretty soon you notice that the people putting the data in don’t know what it is either.  And they know most of it is crap, but there are a few good nuggets in there.  If you require them to have descriptions of each data element, they will copy the column heading and call it a description.

Let’s just say, hypothetically, you succeeded in getting a complete and decent catalog for all the datasets in use in your enterprise.  Now what?

Click here to read more on TDAN.com

My Path Towards Becoming A Data-Centric Revolution Practitioner

In 1986 I started down a path that, in 2019, has made me a fledgling Data-Centric revolution practitioner. My path towards the Data-Centric revolution started in 1986 with my wife and me founding two micro-businesses in the music and micro-manufacturing industries. In 1998 I put the music business, EARTHTUNES, on hold and sold the other; then I started my Information Technology career. For the last 21 years I’ve covered hardware, software, networks, administration, data architecture and development. I’ve mastered relational and dimensional design, working in small and large environments. But my EARTHTUNES work in 1994 powerfully steered me toward the Data-Centric revolution.

In early 1994 I was working on my eighth, ninth and tenth nature sound albums for my record label EARTHTUNES. (See album cover photos below.) The year before, I had done 7 months’ camping and recording in the Great Smoky Mountains National Park to capture the raw materials for my three albums. (To hear six minutes of my recording from October 24, 1993 at 11:34am, right-click here and select open link in new tab, to download the MP3 and PDF files—my gift to you for your personal use. You may listen while you finish reading below, or anytime you like.)

In my 1993 field work I generated 268 hours of field recordings with 134 field logs. (See below for my hand-written notes from the field log.)

Now, in 1994, I was trying to organize the audio recordings’ metadata so that I could select the best recordings and sequence them according to a story-line across the three albums. So, I made album part subtake forms for each take, each few-minutes’ recording, that I thought worthy of going on one of the albums. (See the image of my Album Part Subtake Form, below.)

I organized all the album part subtake forms—all my database metadata entries—and, after months of work, had my mix-down plan for the three albums. In early summer I completed the mix and Macaulay Library of Nature Sound prepared to publish the “Great Smoky Mountains National Park” series: “Winter & Spring;” “Summer & Fall;” and “Storms in the Smokies.”

The act of creating those album part subtake forms was a tipping point towards my becoming a Data-Centric revolution practitioner. In 1994 I started to understand many of the principles defined here and in chapter 2 of Dave McComb’s “The Data-Centric Revolution: Restoring Sanity to Enterprise Information Systems”. Since then I have internalized and started walking them out. The words below are my understandings of the principles, adapted from the Manifesto and McComb’s book.

  • All the many different types of data needed to be included: structured, semi-structured, network-structured and unstructured. Audio recordings and their artifacts; business and reference data; and other associated data altogether formed my invaluable, curated inter-generational asset. They were the only foundation for future work.
  • I knew that I needed to organize my data in an industry-standard, archival, human-readable and machine-readable format so that I could use it across all my future projects, integrate it with external data, and export it into many different formats. Each new project and whatever applications I made or used would depend completely upon this first class-citizen, this curated data store. In contrast, apps, computing devices and networks would be, relative to the curated data, ephemeral second-class citizens.
  • Any information system I built or acquired had to be evolve-able and specialize-able: it had to have a reasonable cost of change as my business evolved, and the integration of my data needed to be nearly free.
  • My data was an open resource that must be shareable, that needed to far outlive the initial database application I made. (I knew that a hundred or so years in the future, climate change would alter the flora and fauna of the habitats I had recorded in; this would change the way those habitats sounded. I was convicted that my field observation data, with recordings, needed to be perpetually accessible as a benchmark of how the world had changed.) Whatever systems I used, the data must have its integrity and quality preserved.
  • This meant that my data needed to have its meaning precisely defined in the context of long-living semantic disciplines and technologies. This would enable successive generations (using different applications and systems) to understand and use my lifework, enshrined in the data legacy I left behind.
  • I needed to use low-code/no-code as much as possible; to enable this I wanted the semantic model to be the genesis of the data structures, constraints and presentation layer, being used to generate all or most data structures and app components/apps (model-driven everything). I needed to use established, well-fitting-with-my-domain ontologies, adding only what wasn’t available and allowing local variety in the context of standardization (specialize-able and single but federated). (Same with the apps.)

From 1994 to the present I’ve been seeking the discipline and technology stacks that a handful of architects and developers could use to create this legacy. I think that I have finally found them in the Data-Centric revolution. My remaining path is to develop full competence in the appropriate semantic disciplines and technology stacks, build my business and community and complete my information system artifacts: passing my work to my heirs over the next few decades.

Article By Jonathon R. Storm

Jonathon works as a data architect helping to maintain and improve a Data-Centric information system that is used to build enterprise databases and application code in a Data-Centric company. Jonathon continues to record the music of the wilderness on weekends; in the next year he plans to get his first EARTHTUNES website online to sell his nature sound recordings. You can email him at [email protected] to order now.

The Flagging Art of Saying Nothing

Who doesn’t like a nice flag? Waving in the breeze, reminding us of who we are and what we stand for. Flags are a nice way of providing a rallying point around which to gather and show our colors to the world. They are a way of showing membership in a group, or providing a warning. Which is why it is so unfortunate when we find flags in a data management system, because they are reduced to saying nothing. Let me explain.

When we see Old Glory, we instantly know it is emblematic of the United States. We also instantly recognize the United Kingdom’s emblematic Union Jack and Canada’s Maple Leaf Flag. Another type of flag is a warning flag alerting us to danger. In each case, we have a clear reference to what the flag represents. How about when you look at a data set and see ‘Yes’, or ‘7’? Sure, ‘Yes’ is a positive assertion and 7 is a number, but those are classifications, not meaning. Yes what? 7 what? There is no intrinsic meaning in these flags. Another step is required to understand the context of what is being asserted as ‘Yes’. Numeric values have even more ambiguity. Is it a count of something, perhaps 7 toasters? Is it a ranking, 7th place? Or perhaps it is just a label, Group 7?

In data systems the number of steps required to understand a value’s meaning is critical, both for reducing ambiguity and, more importantly, for increasing efficiency. An additional step is needed to understand that ‘Yes’ means ‘needs review’, so the processing required to extract its meaning has doubled. In traditional systems, the two-step flag dance is required because two steps are required to capture the value. First a structure has to be created to hold the value, the ‘Needs Review’ column. Then a value must be placed into that structure. More often than not, an obfuscated name like ‘NdsRvw’ is used, which requires a third step to understand what that means. Only when the structure is understood can the value and meaning the system designer was hoping to capture be deciphered.

In cases where what value should be contained in the structure isn’t known, a NULL value is inserted as a placeholder. That’s right, a value literally saying nothing. Traditional systems are built as structure first, content second. First the schema, the structure definition, gets built. Then it is populated with content. The meaning of the content may or may not survive the contortions required to stuff it into the structure, but it gets stuffed in anyway in the hope it can be deciphered later when extracted for a given purpose. Given situations where there is a paucity of data, there is a special name for a structure that largely says nothing – sparse tables. These are tables known to likely contain only a very few of the possible values, but the structure still has to be defined before the rare case values actually show up. Sparse tables are like requiring you to have a shoe box for every type of shoe you could possibly ever own even though you actually only own a few pairs.

Structure-first thinking is so embedded in our DNA that we find it inconceivable that we can manage data without first building the structure. As a result, flag structures are often put in to drive system functionality. Logic then gets built to execute the flag dance, and it gets executed every time interaction with the data occurs. The logic says something like this:
IF this flag DOESN’T say nothing
THEN do this next thing
OTHERWISE skip that next step
OR do something else completely.
Sadly, structure-first thinking requires this type of logic to be in place. The NULL placeholders are a default value to keep the empty space accounted for, and there has to be logic to deal with them.

Semantics, on the other hand, is meaning-first thinking. Since there is no meaning in NULL, there is no concept of storing NULL. Semantics captures meaning by making assertions. In semantics we write code that says “DO this with this data set.” No IF-THEN logic, just DO this and get on with it. Here is an example of how semantics maintains the fidelity of our information without having vacuous assertions.

The system can contain an assertion that the Jefferson contract is categorized as ‘Needs Review’, which puts it into the set of all contracts needing review. It is a subset of all the contracts. The rest of the contracts are in the set of all contracts NOT needing review. These are separate and distinct sets which are collectively the set of all contracts, a third set. System functionality can be driven by simply selecting the set requiring action, the “Needs Review” set, the set that excludes those that need review, or the set of all contracts. Because the contracts requiring review are in a different set, a sub-set, and it was done with a single step, the processing logic is cut in half. Where else can you get a 50% discount and do less work to get it?
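Here is a minimal sketch of that set-based approach in Python with rdflib (the namespace and contract names are invented): membership in the ‘Needs Review’ set is asserted directly, so there is no flag column full of NULLs to test before acting.

```python
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/contracts#")
g = Graph()
g.bind("ex", EX)

# Every contract is asserted to be a Contract.
for c in (EX.Jefferson, EX.Adams, EX.Madison):
    g.add((c, RDF.type, EX.Contract))

# Only the Jefferson contract is asserted into the NeedsReview set.
# Nothing at all is said about the others: no flag column, no NULLs.
g.add((EX.Jefferson, RDF.type, EX.NeedsReview))

# "DO this with this data set": select the set requiring action...
needs_review = set(g.subjects(RDF.type, EX.NeedsReview))

# ...or its complement, the contracts that do not need review.
all_contracts = set(g.subjects(RDF.type, EX.Contract))
no_review_needed = all_contracts - needs_review

print(sorted(needs_review))
print(sorted(no_review_needed))
```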

I love a good flag, but I don’t think they would have caught on if we needed to ask the flag-bearer what the label on the flagpole said to understand what it stood for.

Blog post by Mark Ouska 

For more reading on the topic, check out this post by Dave McComb.

The Data-Centric Revolution: Lawyers, Guns and Money

My book “The Data-Centric Revolution” will be out this summer.  I will also be presenting at Dataversity’s Data Architecture Summit coming up in a few months.  Both exercises reminded me that Data-Centric is not a simple technology upgrade.  It’s going to take a great deal more to shift the status quo.

Let’s start with Lawyers, Guns and Money, and then see what else we need.

A quick recap for those who just dropped in: The Data-Centric Revolution is the recognition that maintaining the status quo on enterprise information system implementation is a tragic downward spiral.  Almost every ERP, Legacy Modernization, MDM, or you name it project is coming in at ever higher costs and making the overall situation worse.

We call the status quo the “application-centric quagmire.”  The application-centric aspect stems from the observation that many business problems turn into IT projects, most of which end up with building, buying, or renting (Software as a Service) a new application system.  Each new application system comes with its own, arbitrarily different data model, which adds to the pile of existing application data models, further compounding the complexity, upping the integration tax, and inadvertently entrenching the legacy systems.

The alternative we call “data-centric.”  It is not a technology fix.  It is not something you can buy.  We hope for this reason that it will avoid the fate of the Gartner hype cycle.  It is a discipline and culture issue.  We call it a revolution because it is not something you add to your existing environment; it is something you do with the intention of gradually replacing your existing environment (recognizing that this will take time.)

Seems like most good revolutions would benefit from the Warren Zevon refrain: “Send lawyers, guns, and money.”  Let’s look at how this will play out in the data-centric revolution.

Click here to read more on TDAN.com

The 1st Annual Data-Centric Architecture Forum: Re-Cap

In the past few weeks, Semantic Arts hosted a new Data-Centric Architecture Forum.  One of the conclusions made by the participants was that it wasn’t like a traditional conference.  This wasn’t marching from room to room to sit through another talking-head, PowerPoint-led presentation. There were a few PowerPoint slides that served to anchor the discussion, but it was much more a continual co-creation of a shared artifact.

The consensus was:

  • Yes, let’s do it again next year.
  • Let’s call it a forum, rather than a conference.
  • Let’s focus on implementation next year.
  • Let’s make it a bit more vendor-friendly next year.

So retrospectively, last week was the first annual Data-Centric Architecture Forum.

What follows are my notes and conclusions from the forum.

Shared DCA Vision

I think we came away with a great deal of commonality and more specifics on what a DCA needs to look like and what it needs to consist of. The straw-man (see appendix A) came through with just a few revisions (coming soon).  More importantly, it grounded everyone on what was needed and gave a common vocabulary about the pieces.

Uniqueness

With all the brain power in the room, and given that people have been looking for this for a while, I think that if anyone had known of a platform or set of tools that provided all of this out of the box, they would have said so once we had described what such a solution entailed.

I think we have outlined a platform that does not yet exist and needs to.  With a bit of perseverance, next year we may have a few partial (maybe even more than partial) implementations.

Completeness

After working through this for 2 ½ days, I think if there were anything major missing, we would have caught it.  Therefore, this seems to be a pretty complete stack. All the components, and at least a first cut as to how they are related, seem to be in place.

Doable-ness

While there are a lot of parts in the architecture, most of the people in the room thought that most of the parts were well-known and doable.

This isn’t a DARPA challenge to design some state-of-the-art thing, this is more a matter of putting pieces together that we already understand.

Vision v. Reference Architecture

As noted right at the end, this is a vision for an architecture, not a specific architecture or a reference architecture.

Notes From Specific Sessions

DCA Strawman

Most of this was already covered above.  I think we eventually suggested that “Analytics” might deserve its own layer.  You could say that analytics is a “behavior,” but that seems to be burying the lead.

I also thought it might be helpful to identify some of the specific key APIs suggested by the architecture. It also looks like we need to split the MDM style of identity management from user identity management, for clarity and for positioning in the stack.

State of the Industry

There is a strong case to be made that knowledge graph-driven enterprises are eating the economy.  Part of this may be because network-effect companies are sympathetic with network data structures.  But we think the case can be made that the flexibility inherent in KGs applies to companies in any industry.

According to research that Alan provided, the average enterprise now runs 1,100 different SaaS services.  This is fragmenting the data landscape even faster than legacy systems did.

Business Case

A lot of the resistance isn’t technical, but instead tribal.

Even within the AI community there are tribes with little cross-fertilization:

  • Symbolists
  • Bayesians
  • Statisticians
  • Connectionists
  • Evolutionaries
  • Analogizers

On the integration front, the tribes are:

  • Relational DB Linkers
  • Application-Centric ESB Advocates
  • Application-Centric RESTful developers
  • Data-centric Knowledge Graphers

Click here to read more on TDAN.com

The Data-Centric Revolution: Chapter 2

The Data-Centric Revolution

Below is an excerpt from, and a downloadable copy of, Chapter 2: “What is Data-Centric?”

CHAPTER 2

What is Data-Centric?

Our position is:

A data-centric enterprise is one where all application functionality is based on a single, simple, extensible data model.

First, let’s make sure we distinguish this from the status quo, which we can describe as an application-centric mindset. Very few large enterprises have a single data model. They have one data model per application, and they have thousands of applications (including those they bought and those they built). These models are not simple. In every case we examined, application data models are at least 10 times more complex than they need to be, and the sum total of all application data models is at least 100-1000 times more complex than necessary.

Our measure of complexity is the sum total of all the items in the schema that developers and users must learn in order to master a system.  In relational technology, this would be the number of tables plus the number of all columns.  In object-oriented systems, it is the number of classes plus the number of attributes.  In an XML- or JSON-based system, it is the number of unique elements and/or keys.

The number of items in the schema directly drives the number of lines of application code that must be written and tested.  It also drives the complexity for the end user, as each item eventually surfaces in forms or reports, and the user must master what these items mean and how they relate to each other in order to use the system.
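As a back-of-the-envelope illustration of that measure, here is a small sketch that scores a toy relational schema (the schema is invented; on a live database the same tallies could come from the information_schema views):

```python
# Toy schema: table name -> list of column names.
schema = {
    "customer":   ["id", "name", "email", "region"],
    "order":      ["id", "customer_id", "placed_on", "total"],
    "order_line": ["order_id", "sku", "qty", "price"],
}

# Complexity = everything a developer or user must learn:
# the tables themselves plus every column in them.
n_tables = len(schema)
n_columns = sum(len(cols) for cols in schema.values())
print(f"{n_tables} tables + {n_columns} columns = complexity {n_tables + n_columns}")
```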

Very few organizations have applications based on an extensible model. Most data models are very rigid.  This is why we call them “structured data.”  We define the structure, typically in a conceptual model, and then convert that structure to a logical model and finally a physical (database specific) model.  All code is written to the model.  As a result, extending the model is a big deal.  You go back to the conceptual model, make the change, then do a bunch of impact analysis to figure out how much code must change.

An extensible model, by contrast, is one that is designed and implemented such that changes can be added to the model even while the application is in use. Later in this book, and especially in the two companion books, we get into a lot more detail on the techniques that need to be in place to make this possible.
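A minimal sketch of what “extensible while in use” can look like with a graph-based model in Python/rdflib (the Supplier extension and all names are invented): new classes and properties are just additional statements, so the existing data and the existing query keep working with no conversion.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/erp#")
g = Graph()
g.bind("ex", EX)

# Day 1: the model knows about Customers.
g.add((EX.Customer, RDF.type, RDFS.Class))
g.add((EX.acme, RDF.type, EX.Customer))
g.add((EX.acme, RDFS.label, Literal("Acme Corp")))

# Day 200, while the application is in use: extend the model.
# No table rebuild, no data migration; just more statements.
g.add((EX.Supplier, RDF.type, RDFS.Class))
g.add((EX.preferredSupplier, RDF.type, RDF.Property))
g.add((EX.bolts_r_us, RDF.type, EX.Supplier))
g.add((EX.acme, EX.preferredSupplier, EX.bolts_r_us))

# The original customer query is untouched by the extension.
for row in g.query("SELECT ?c WHERE { ?c a ex:Customer }", initNs={"ex": EX}):
    print(row.c)
```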

In the data-centric world we are talking about a data model that is primarily about what the data means (that is, the semantics). It is only secondarily, and sometimes locally, about the structure, constraints, and validation to be performed on the data.

Many people think that a model of meaning is “merely” a conceptual model that must be translated into a “logical” model, and finally into a “physical” model, before it can be implemented. Many people think a conceptual model lacks the requisite detail and/or fidelity to support implementation. What we have found over the last decade of implementing these systems is that done well, the semantic (conceptual) data model can be put directly into production. And that it contains all the requisite detail to support the business requirements.

And let’s be clear, being data-centric is a matter of degree. It is not binary. A firm is data-centric to the extent (or to the percentage) its application landscape adheres to this goal.

Data-Centric vs. Data-Driven

Many firms claim to be, and many firms are, “data-driven.” This is not quite the same thing as data-centric. “Data-driven” refers more to the place of data in decision processes. A non-data-driven company relies on human judgement as the justification for decisions. A data-driven company relies on evidence from data.

Data-driven is not the opposite of data-centric. In fact, they are quite compatible, but merely being data-driven does not ensure that you are data-centric. You could drive all your decisions from data sets and still have thousands of non-integrated data sets.

Our position is that data-driven is a valid aspiration, though data-driven does not imply data-centric. Data-driven would benefit greatly from being data-centric as the simplicity and ease of integration make being data-driven easier and more effective.

We Need our Applications to be Ephemeral

The first corollary to the data-centric position is that applications are ephemeral, and data is the important and enduring asset. Again, this is the opposite of the current status quo. In traditional development, every time you implement a new application, you convert the data to the new application’s representation. These application systems are very large capital projects. This causes people to think of them like more traditional capital projects (factories, office buildings, and the like). When you invest $100 Million in a new ERP or CRM system, you are not inclined to think of it as throwaway. But you should. Well, really you shouldn’t be spending that kind of money on application systems, but given that you already have, it is time to reframe this as sunk cost.

One of the ways application systems have become entrenched is through the application’s relation to the data it manages. The application becomes the gatekeeper to the data. The data is a second-class citizen, and the application is the main thing. In data-centric, the data is permanent and enduring, and applications can come and go.

Data-Centric is Designed with Data Sharing in Mind

The second corollary to the data-centric position is default sharing. The default position for application-centric systems is to assume local self-sufficiency. Most relational database systems base their integrity management on having required foreign key constraints. That is, an ordering system requires that all orders be from valid customers. The way they manage this is to have a local table of valid customers. This is not sharing information. This is local hoarding, made possible by copying customer data from somewhere else. And this copying process is an ongoing systems integration tax. If they were really sharing information, they would just refer to the customers as they existed in another system. Some API-based systems get part of the way there, but there is still tight coupling between the ordering system and the customer system that is hosting the API. This is an improvement but hardly the end game.

As we will see later in this book, it is now possible to have a single instantiation of each of your key data types—not a “golden source” that is copied and restructured to the various application consumers, but a single copy that can be used in place.
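A minimal sketch of “used in place” with rdflib (the URIs and properties are invented): the ordering data refers to the customer by its one global identifier instead of keeping a local copy of the customer record.

```python
from rdflib import Graph, Literal, Namespace, RDF

CUST = Namespace("http://example.org/customer/")
ORD = Namespace("http://example.org/order/")
EX = Namespace("http://example.org/model#")

# Customer data is mastered once, in one place.
customers = Graph()
customers.add((CUST.c42, RDF.type, EX.Customer))
customers.add((CUST.c42, EX.legalName, Literal("Acme Corp")))

# The ordering data simply points at that same URI.
# There is no local "valid customers" table to copy and re-sync.
orders = Graph()
orders.add((ORD.o9001, RDF.type, EX.Order))
orders.add((ORD.o9001, EX.orderedBy, CUST.c42))

# Querying across both is just querying the merged graph.
merged = customers + orders
q = """SELECT ?order ?name WHERE {
         ?order ex:orderedBy ?cust .
         ?cust  ex:legalName ?name . }"""
for row in merged.query(q, initNs={"ex": EX}):
    print(row.order, row.name)
```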

Is Data-Centric Even Possible?

Most experienced developers, after reading the above, will explain to you why this is impossible. Based on their experience, it is impossible. Most of them have grown up with traditional development approaches. They have learned how to build traditional standalone applications. They know how applications based on relational systems work. They will use this experience to explain to you why this is impossible. They will tell you they tried this before, and it didn’t work.

Further, they have no idea how a much simpler model could recreate all the distinctions needed in a complex business application. There is no such thing as an extensible data model in traditional practice.

You need to be sympathetic and recognize that based on their experience, extensive though it might be, they are right. As far as they are concerned, it is impossible.

But someone’s opinion that something is impossible is not the same as it not being possible. In the late 1400s, most Europeans thought that the world was flat and sailing west to get to the far east was futile. In a similar vein, in 1900 most people were convinced that heavier than air flight was impossible.

The advantage we have relative to the pre-Columbians, and the pre-Wrights is that we are already post-Columbus and post-Wrights. These ideas are both theoretically correct and have already been proved.

The Data-Centric Vision

To hitch your wagon to something like this, we need to make a few aspects of the end game much clearer. We earlier said the core of this was the idea of a single, simple, extensible data model. Let’s drill in on this a bit deeper.

Click here to download the entire chapter.

Use the code: SemanticArts for a 20% discount at Technicspub.com

Semantic Ontology: The Basics

What is Semantics?

Semantics is the study of meaning. By creating a common understanding of the meaning of things, semantics helps us better understand each other. Common meaning helps people understand each other despite different experiences or points of view. Common meaning in semantic technology helps computer systems more accurately interpret what people mean. Common meaning enables disparate IT systems – data sources and applications – to interface more efficiently and productively.

What is an Ontology?

An ontology defines all of the elements involved in a business ecosystem and organizes them by their relationship to each other. The benefits of building an ontology are:

  • Everyone agrees on a common set of terms used to describe things
  • Different systems – databases and applications – can communicate with each other without having to directly connect to each other.

Enterprise Ontology

An Ontology is a set of formal concept definitions.

An Enterprise Ontology is an Ontology of the key concepts that organize and structure an Organization’s information systems. Having an Enterprise Ontology provides a unifying whole that makes system integration bearable.

An Enterprise Ontology is like a data dictionary or a controlled vocabulary, however it is different in a couple of key regards. A data dictionary, or a controlled vocabulary, or even a taxonomy, relies on humans to read the definitions and place items into the right categories. An ontology is a series of rules about class (concept) membership that uses relationships to set up the inclusion criteria. This has several benefits, one of the main ones being that a system (an inference engine) can assign individuals to classes consistently and automatically.
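A minimal sketch of that idea in Python, using rdflib together with the owlrl reasoner (the class and property names are invented): the inclusion rule “whatever is enrolled in something is a Student” is stated once as an axiom, and the inference engine assigns the individual to the class.

```python
from rdflib import Graph, Namespace, RDF, RDFS
import owlrl  # the owlrl package provides an RDFS/OWL-RL inference engine

EX = Namespace("http://example.org/school#")
g = Graph()
g.bind("ex", EX)

# The inclusion rule: anything that appears as the subject of
# ex:enrolledIn is, by definition, a Student.
g.add((EX.enrolledIn, RDFS.domain, EX.Student))

# A plain fact about an individual; note that john is never typed directly.
g.add((EX.john, EX.enrolledIn, EX.Math101))

# Let the inference engine compute the closure.
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

print((EX.john, RDF.type, EX.Student) in g)  # True: membership was inferred
```

A full OWL ontology would express richer inclusion criteria (property restrictions, equivalent classes and so on), but the pattern is the same: the rule lives in the model, not in application code.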

By building the ontology in application-neutral terminology, it can fill the role of “common denominator” between the many existing and potential data sources you have within your enterprise. Best practice in ontology building favors building an Enterprise Ontology with the fewest concepts needed to promote interoperability, and this in turn allows it to fill the role of “least common denominator.”

Building an Enterprise Ontology is the jumping off point for a number of Semantic Technology initiatives. We’ll only mention in passing here the variety of those initiatives (we invite you to poke around our web site to find out more). We believe that Semantic Technology will change the way we implement systems in three major areas:

  • Harvest – Most of the information used to run most large organizations comes from their “applications” (their ERP or EHR or Case Management or whatever internal application). Getting new information is a matter of building screens in these applications and (usually) paying your employees to enter data, such that you can later extract it for other purposes. Semantic Technology introduces approaches to harvest data not only from internal apps, but from Social Media, unstructured data and the vast and growing sets of publicly available data waiting to be integrated.
  • Organize – Relational, and even Object Oriented, technologies impose a rigid, pre-defined structure and set of constraints on what data can be stored and how it is organized. Semantic Technology replaces this with a flexible data structure that can be changed without converting the underlying data. It is so flexible that not all the users of a data set need to share the same schema (they need to share some part of the schema, otherwise there is no basis for sharing, but they don’t need to be in lockstep; each can extend the model independently). Further, the semantic approach promotes the idea that the information is at least partially “self-organizing.” Using URIs (Web-based Uniform Resource Identifiers) and graph-based databases allows these systems to infer new information from existing information and then use that new information in the dynamic assembly of data structures.
  • Consume – Finally, we think semantic technology is going to change the way we consume information. It is already changing the nature of workflow-oriented systems (ask us about BeInformed). It is changing data analytics. It is the third “V” in Big Data (“Variety”). Semantic-based mashups are changing the nature of presentation. Semantic-based Search Engine Optimization (SEO) is changing internal and external search.

Given all that, how does one get started?

Well, you can do it yourself. We’ve been working in this space for more than twenty years and have been observing clients take on a DIY approach, and while there have been some successes, in general we see people recapitulating many of the twists and turns that we have worked through over the last decade.

You can engage some of our competitors (contact us and we’d be happy to give you a list). But, let us warn you ahead of time: most of our competitors are selling products, and as such their “solutions” are going to favor the scope of the problem that their tools address. Nothing wrong with that, but you should know going in that this is a likely bias. And, in our opinion, our competitors are just not as good at this as we are. Now it may come to pass that you need to go with one of our competitors (we are a relatively small shop and we can’t always handle all the requests we get) and if so, we wish you all the best…

If you do decide that you’d like to engage us, we’d suggest a good place to get started would be with an Enterprise Ontology. If you’d like to get an idea, for your budgeting purposes, of what this might entail, click here to get in touch, and you’ll go through a process where we help you clarify a scope such that we can estimate from it. Don’t worry about being descended on by some overeager sales types; we recognize that these things have their own timetables and we will be answering questions and helping you decide what to do next. We recognize that these days “selling” is far less effective than helping clients do their own research and supporting your buying process.

That said, there are three pretty predictable next steps:

  • Ask us to outline what it would cost to build an Enterprise Ontology for your organization (you’d be surprised; it is far less than the effort to build an Enterprise Data Model or equivalent)
  • gist – as a byproduct of our work with many Enterprise Ontologies over the last decade, we have built “gist,” an upper ontology for business systems. We use it in all our work and have made it publicly available via a Creative Commons Share Alike license (you can use it for any purpose provided you acknowledge where you got it)
  • Training – if you’d like to learn more about the language and technology behind this (either through public courses or in house), check out our training offerings.

How is Semantic Technology different from Artificial Intelligence?

Artificial Intelligence (AI) is a 50+ year old academic discipline that provided many technologies that are now in commercial use. Two things comprise the core of semantic technology. The first stems from AI research in knowledge representation and reasoning done in the 70s and 80s and includes ontology representation languages such as OWL and inference engines like Fact++. The second relates to data representation and querying using triple stores, RDF and SPARQL, which are largely unrelated to AI. A broad definition of semantic technology includes a variety of other technologies that emerged from AI. These include machine learning, natural language processing, intelligent agents and to a lesser extent speech recognition and planning. Areas of AI not usually associated with semantic technology include creativity, vision and robotics.

How Does Semantics Use Inference to Build Knowledge?

Semantics organizes data into well-defined categories with clearly defined relationships. Classifying information in this way enables humans and machines to read, understand and infer knowledge based on its classification. For example, if we see a red breasted bird outside our window in April, our general knowledge leads us to identify it as a robin. Once it is properly categorized, we can infer a lot more information about the robin than just its name.

We know, for example, that it is a bird; it flies; it sings a song; it spends its winter somewhere else; and the fact that it has shown up means that good weather is on its way.

We know this other information because the robin has been correctly identified within the schematic of our general knowledge about birds, a higher classification; seasons, a related classification, etc.

This is a simple example of how, by correctly classifying information into a predefined structure, we can infer new knowledge. In a semantic model, once the relationships are set up, a computer can classify data appropriately, analyze it based on the predetermined relationships and then infer new knowledge based on this analysis.
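A minimal sketch of the robin example with rdflib (the class hierarchy and facts are invented): once the individual is classified as a Robin, a query that walks up rdfs:subClassOf collects everything asserted at the more general levels.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/birds#")
g = Graph()
g.bind("ex", EX)

# A small class hierarchy with knowledge attached at each level.
g.add((EX.Robin, RDFS.subClassOf, EX.MigratoryBird))
g.add((EX.MigratoryBird, RDFS.subClassOf, EX.Bird))
g.add((EX.Bird, EX.canFly, Literal(True)))
g.add((EX.MigratoryBird, EX.wintersElsewhere, Literal(True)))
g.add((EX.Robin, EX.singsSong, Literal(True)))

# The observation: the red-breasted bird at the window is a Robin.
g.add((EX.bird_at_window, RDF.type, EX.Robin))

# Everything we now "know" about it, gathered by walking the hierarchy.
q = """SELECT ?cls ?prop ?val WHERE {
         ex:bird_at_window a/rdfs:subClassOf* ?cls .
         ?cls ?prop ?val .
         FILTER(?prop != rdfs:subClassOf) }"""
for row in g.query(q, initNs={"ex": EX}):
    print(row.cls, row.prop, row.val)
```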

What is Semantic Agreement?

The primary challenge in building an ontology is getting people to agree about what they really mean when they describe the concepts that define their business. Gaining semantic agreement is the process of helping people understand exactly what they mean when they express themselves.

Semantic technologists accomplish this because they define terms and relationships independent from the context of how they are applied or the IT systems that store the information, so they can build pure and consistent definitions across disciplines.

Why is Semantic Agreement Important?

Semantic agreement is important because it enables disparate computer systems to communicate directly with each other. If one application defines a customer as someone who has placed an order and another application defines the customer as someone who might place an order, then the two applications cannot pass information back and forth because they are talking about two different people. In a traditional IT approach, the only way the two applications will be able to pass information back and forth is through a systems integration patch. Building these patches costs time and money because it requires the owners of the two systems to negotiate a common meaning and write incremental code to ensure that the information is passed back and forth correctly. In a semantic-enabled IT environment, all the concepts that mean the same thing are defined by a common meaning, so the different applications are able to communicate with each other without having to write systems integration code.

What is the Difference Between a Taxonomy and Ontology?

A taxonomy is a set of definitions that are organized by a hierarchy that starts at the most general description of something and gets more defined and specific as you go down the hierarchy of terms. For example, a red-tailed hawk could be represented in a common language taxonomy as follows:

  • Bird
    • Raptors
      • Hawks
        • Red Tailed Hawk

An ontology describes a concept both by its position in a hierarchy of common factors like the above description of the red-tailed hawk but also by its relationships to other concepts. For example, the red-tailed hawk would also be associated with the concept of predators or animals that live in trees.

The richness of the relationships described in an ontology is what makes it a powerful tool for modeling complex business ecosystems.
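A minimal sketch of the difference in rdflib (all names invented): the subclass chain by itself is the taxonomy, and the cross-cutting relationships are what turn it into an ontology.

```python
from rdflib import Graph, Namespace, RDFS

EX = Namespace("http://example.org/wildlife#")
g = Graph()
g.bind("ex", EX)

# The taxonomy: a pure is-a hierarchy.
g.add((EX.RedTailedHawk, RDFS.subClassOf, EX.Hawk))
g.add((EX.Hawk, RDFS.subClassOf, EX.Raptor))
g.add((EX.Raptor, RDFS.subClassOf, EX.Bird))

# The ontology adds relationships to other concepts.
g.add((EX.RedTailedHawk, RDFS.subClassOf, EX.Predator))
g.add((EX.RedTailedHawk, EX.livesIn, EX.Trees))
g.add((EX.RedTailedHawk, EX.preysOn, EX.Rodent))

# A question the bare hierarchy could not answer:
# which tree-dwelling things are also predators?
q = """SELECT ?x WHERE {
         ?x ex:livesIn ex:Trees .
         ?x rdfs:subClassOf* ex:Predator . }"""
for row in g.query(q, initNs={"ex": EX}):
    print(row.x)
```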

What is the Difference Between a Logical Data Model and Ontology?

The purpose of an ontology is to model the business. It is independent from the computer systems, e.g. legacy or future applications and databases. Its purpose is to use formal logic and common terms to describe the business, in a way that both humans and machines can understand. Ontologies use OWL axioms to describe classes and properties that are shared across multiple lines of business, so concepts can be defined by their relationships, making them extensible to increasing levels of detail as required. Good ontologies are ‘fractal’ in nature, meaning that the common abstractions create an organizing structure that easily expands to accommodate the complex information management requirements of the business.

The purpose of a logical model is to describe the structure of the data required for a particular application or service. Typically, a logical model shows all the entities, relationships and attributes required for a proposed application. It only includes data relevant to the particular application in question. Ideally, logical models are derived from the ontology, which ensures consistent meaning and naming across future information systems.

How can an Ontology Link Computer Systems Together?

Since an ontology is separate from any IT structure, it is not limited by the constraints required by specific software or hardware. The ontology exists as a common reference point for any IT system to access. Thanks to this independence, it can serve as a common ground for different:

  • database structures, such as relational and hierarchical,
  • applications, such as an SAP ERP system and a cloud-hosted e-market,
  • devices, such as an iPad or cell phone.

The benefit of the semantic approach is that you can link the legacy IT systems that are the backbone of most businesses to exciting new IT solutions, like cloud computing and mobile delivery.

What are 5 Business Benefits of Semantic Technology Solutions?

Semantic technology helps us:

  1. Find more relevant and useful information
    • Because it enables us to search information from disparate sources (federated search) and automatically refine our searches (faceted search).
  2. Better understand what is happening
    • Because it enables us to use the relationships between concepts to predict and interpret change.
  3. Build more transparent systems and communications
    • Because it is based on common meanings and mutual understanding of the key concepts and relationships that govern our business ecosystems.
  4. Increase our effectiveness, efficiency and strategic advantage
    • Because it enables us to make changes to our information systems more quickly and easily.
  5. Become more perceptive, intelligent and collaborative
    • Because it enables us to ask questions we couldn’t ask before.

How Can Semantic Technology Enable Dynamic Workflow?

Semantic-driven dynamic workflow systems are a new way to organize, document and support knowledge management. They include two key things:

  1. A consistent, comprehensive and rigorous definition of an ecosystem that defines all its elements and the relationships between elements. It is like a map.
  2. A set of tools that use this model to:
    • Gather and deliver ad hoc, relevant data.
    • Generate a list of actions – tasks, decisions, communications, etc. – based on the current situation.
    • Facilitate and document interactions in the ecosystem.

These tools work like a GPS system that uses the map to adjust its recommendations based on human interactions. This new approach to workflow management enables organizations to respond faster, make better decisions and increase productivity.
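One way to picture that “GPS” behavior is a rule that watches the graph and emits the next actions whenever a condition appears. The sketch below is invented for illustration (the claim, its status and the rule are not from any real system) and uses a SPARQL update in rdflib as a stand-in for the rule engine.

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/workflow#")
g = Graph()
g.bind("ex", EX)

# The current situation, as asserted in the knowledge graph.
g.add((EX.claim17, RDF.type, EX.InsuranceClaim))
g.add((EX.claim17, EX.status, Literal("missing-documents")))

# A declarative rule: whenever a claim is missing documents,
# recommend a follow-up action. The rule reads the model; the
# workflow is not hard-coded into application logic.
g.update("""
    INSERT { ?c ex:nextAction ex:RequestDocuments }
    WHERE  { ?c a ex:InsuranceClaim ; ex:status "missing-documents" . }
""", initNs={"ex": EX})

# The recommended actions, recomputed as the situation changes.
for row in g.query("SELECT ?c ?a WHERE { ?c ex:nextAction ?a }",
                   initNs={"ex": EX}):
    print(row.c, row.a)
```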

Why Do Organizations Need Semantic-Driven, Dynamic Workflow Systems?

A business ecosystem is a series of interconnected systems that is constantly changing. People need flexible, accurate and timely information and tools to positively impact their ecosystems. Then they need to see how their actions impact the systems’ energy and flow. Semantic-driven, dynamic workflow systems enable users to access information from non-integrated sources, set up rules to monitor this information and initiate workflow procedures when the dynamics of the relationship between two concepts change. They also support the definition of roles and responsibilities to ensure that this automated process is managed appropriately and securely. Organizational benefits to implementing semantic-driven, dynamic workflow systems include:

  • Improved management of complexity
  • Better access to accurate and timely information
  • Improved insight and decision making
  • Proactive management of risk and opportunity
  • Increased organizational responsiveness to change
  • Better understanding of the interlocking systems that influence the health of the business ecosystem

Blog post by Dave McComb

Click here to read a free chapter of Dave McComb’s book, “The Data-Centric Revolution”

 

White Paper: The Value of Using Knowledge Graphs in Some Common Use Cases

We’ve been asked to comment on the applicability of Knowledge Graphs and Semantic Technology in service of a couple of common use cases.  We will draw on our own experience with client projects as well as some examples we have come to from networking with our peers.

The two use cases are:

  • Customer 360 View
  • Compliance

We’ll organize this with a brief review of why these two use cases are difficult for traditional technologies, then a very brief summary of some of the capabilities that these new technologies bring to bear, and finally a discussion of some case studies that have successfully used graph and semantic technology to address these areas.

Why is This Hard?

In general, traditional technologies encourage complexity, and they encourage it through ad-hoc introduction of new data structures.  When you are solving an immediate problem at hand, introducing a new data structure (a new set of tables, a new json data structure, a new message, a new API, whatever) seems to be an expedient.  What is rarely noticed is the accumulated effect of many, many small decisions taken this way.  We were at a healthcare client who admitted (they were almost bragging about it) that they had patient data in 4,000 tables in their various systems.  This pretty much guarantees you have no hope of getting a complete picture of a patient’s health and circumstances. There is no human who could write a 4,000-table join and no system that could process it even if it could be written.

This shows up everywhere we look.  Every enterprise application we have looked at in detail is 10-100 times more complex than it needs to be to solve the problem at hand.  Systems of systems (that is, the sum total of the thousands of application systems managed by a firm) are 100-10,000 times more complex than they need to be.  This complexity shows up for users who have to consume information (so many systems to interrogate, each arbitrarily different) and for developers and integrators who fight a rearguard action to keep the whole at least partially integrated.

Two other factors contribute to the problem:

  • Acquisition – acquiring new companies inevitably brings another ecosystem of applications that must be dealt with.
  • Unstructured information – a vast amount of important information is still represented in unstructured (text) or semi-structured forms (XML, Json, HTML). Up until now it has been virtually impossible to meaningfully combine this knowledge with the structured information businesses run on.

Let’s look at how these play out in the customer 360 view and compliance.

Customer 360

Eventually, most firms decide that it would be of great strategic value to provide a view of everything that is known about their customers. There are several reasons this is harder than it looks.  We summarize a few here:

  • Customer data is all over the place. Every system that places an order, or provides service, has its own, often locally persisted set of data about “customers.”
  • Customer data is multi-formatted. Email and customer support calls represent some of the richest interactions most companies have with their clients; however, these companies find data from such calls difficult to combine with the transactional data about customers.
  • Customers are identified differently in different systems. Every system that deals with customers assigns them some sort of customer ID. Some of the systems share these identifiers.  Many do not.  Eventually someone proposes a “universal identifier” so that each customer has exactly one ID.  This almost never works.  In 40 years of consulting I’ve never seen one of these projects succeed.  It is too easy to underestimate how hard it will be to change all the legacy systems that are maintaining customer data.  And as the next bullet suggests, it may not be logically possible.
  • The very concept of “customer” varies widely from system to system. In some systems the customer is an individual contact; in others, a firm; in another, a role; in yet another, a household. For some it is a bank account (I know how weird that sounds, but we've seen it).
  • Each system needs to keep different data about customers in order to perform its specific function. Centralizing this imposes the burden of gathering, at customer on-boarding time, a great deal of data that may never be used by anyone.

Compliance

The primary reason that compliance-related systems are complex is that what you are complying with is a vast network of laws and regulations, written exclusively in text and spanning a wide array of overlapping jurisdictions. These laws and regulations change constantly and are always being re-interpreted through findings, audits, and court cases.

The general approach is to carve off some small scope, read up as much as you can, and build bespoke systems to support it. The first difficulty is that there are humans in the loop throughout the process. All documents need to be interpreted, and for that interpretation to be operationalized it generally has to be embedded in a hand-crafted system.

A Brief Word on Knowledge Graphs and Semantic Technology

Knowledge Graphs and Graph Databases have gained a lot of mindshare recently, as it has become known that most of the most valuable digital-native firms have a knowledge graph at their core:

  • Google – the Google Knowledge Graph is what has made their answering capability so much better than the keyword search that launched their first offering. It also powers their targeted ad placement.
  • LinkedIn, Facebook, Twitter – all are able to scale and flex because they are built on graph databases.
  • Most Large Financial Institutions – almost all major financial institutions have some form of Knowledge Graph or Graph Database initiative in the works.

Graph Databases

A graph database expresses all its information in a single, simple relationship structure: two “nodes” are connected by an “edge.”

A node is some identifiable thing.  It could be a person or a place or an email or a transaction.  An “edge” is the relationship between two nodes.  It could represent where someone lives, that they sent or received an email, or that they were a party to a transaction.

A graph database does not need to have the equivalent of a relational table structure set up before any data can be stored, and you don’t need to know the whole structure of the database and all its metadata to use a graph database.  You can just add new edges and nodes to existing nodes as soon as you discover them.  The network (the graph) grows organically.

The most common use cases for graph databases are analytic. There is a whole class of analytics that makes use of network properties (e.g., how closely x is connected to y, or what the shortest route is from a to b).
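As a concrete illustration of the node/edge model and of a simple network analytic, here is a minimal sketch in plain Python. The people, emails, and transactions are invented, and a real graph database provides this (and far more) natively; the point is only that nodes and edges can be added as they are discovered and then queried for network properties such as a shortest path.

    # A minimal sketch of the node/edge model behind a graph database, using a
    # plain Python adjacency list (hypothetical data, not a real product's API).
    from collections import deque

    # Nodes are identifiable things; edges are named relationships between them.
    edges = [
        ("alice", "livesIn",       "denver"),
        ("alice", "sentEmail",     "email42"),
        ("bob",   "receivedEmail", "email42"),
        ("bob",   "partyTo",       "txn-9001"),
    ]

    # Build an undirected adjacency list so we can ask network questions.
    adjacent = {}
    for source, _label, target in edges:
        adjacent.setdefault(source, set()).add(target)
        adjacent.setdefault(target, set()).add(source)

    def shortest_path(start, goal):
        """Breadth-first search: the kind of network analytic graphs excel at."""
        frontier = deque([[start]])
        visited = {start}
        while frontier:
            path = frontier.popleft()
            if path[-1] == goal:
                return path
            for neighbor in adjacent.get(path[-1], ()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    frontier.append(path + [neighbor])
        return None

    # How closely is alice connected to bob?  Two hops, via the shared email.
    print(shortest_path("alice", "bob"))  # ['alice', 'email42', 'bob']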

Knowledge Graphs

Most graph databases focus on low-level data: transactions, communications, and the like. If you add a knowledge layer on top of this, most people refer to the result as a knowledge graph. The domain of medical knowledge (diseases, symptoms, drug/drug interactions, and even the entire human genome) has been converted to knowledge graphs to better understand and explore the interconnected nature of health and disease.

Often the knowledge in a knowledge graph has been harvested from documents and converted to the graph structure.  When you combine a knowledge graph with specific data in a graph database the combination is very powerful.
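A small, hypothetical sketch of that combination: a handful of knowledge-layer statements (the kind that might be harvested from documents) sitting alongside instance-level data, and a question that only the two together can answer. The drugs, patients, and predicates are invented for illustration.

    # A minimal sketch of combining a "knowledge layer" with specific instance data.
    # All names (drugs, patients, predicates) are hypothetical.

    # Knowledge layer: general statements, e.g. harvested from documents.
    knowledge = {
        ("DrugA", "interactsWith", "DrugB"),
        ("DrugB", "interactsWith", "DrugC"),
    }

    # Instance data: the low-level facts a graph database typically holds.
    prescriptions = {
        ("patient-17", "takes", "DrugA"),
        ("patient-17", "takes", "DrugB"),
        ("patient-42", "takes", "DrugC"),
    }

    # The combination answers a question neither layer could answer alone:
    # which patients are taking two drugs that are known to interact?
    def flag_interactions():
        taking = {}
        for patient, _p, drug in prescriptions:
            taking.setdefault(patient, set()).add(drug)
        for drug_a, _p, drug_b in knowledge:
            for patient, drugs in taking.items():
                if drug_a in drugs and drug_b in drugs:
                    yield (patient, drug_a, drug_b)

    print(list(flag_interactions()))  # [('patient-17', 'DrugA', 'DrugB')]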

Semantic Technology

Semantic Technology is the open standards approach to knowledge graphs and graph databases.  (Google, Facebook, LinkedIn and Twitter all started with open source approaches, but have built their own proprietary versions of these technologies.)  For most firms we recommend going with open standards.  There are many open source and vendor supported products at every level of the stack, and a great deal of accumulated knowledge as to how to solve problems with these technologies.

Semantic technologies implement an alphabet soup of standards, including RDF, RDFS, OWL, SPARQL, SHACL, R2RML, JSON-LD, and PROV-O. If you're unfamiliar with these, it sounds like a bunch of techno-babble. The rap against semantic technology has been that it is complicated. It is, especially if you have to embrace and understand it all at once. But we have been using this technology for almost 20 years and have figured out how to help people adapt, by using carefully curated subsets of each of the standards and by teaching through example, which drastically reduces the learning curve.
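As an example of the kind of curated subset we mean, the sketch below touches just three of the standards: RDF for the data, Turtle for the syntax, and SPARQL for the query, using the open-source rdflib library for Python. The namespace and the data are illustrative assumptions, not anyone's production model.

    # A minimal sketch of RDF, Turtle and SPARQL via rdflib (pip install rdflib).
    # The ex: namespace and the data are hypothetical.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, RDFS

    EX = Namespace("https://data.example.com/")  # assumed, illustrative domain

    g = Graph()
    g.bind("ex", EX)

    # RDF: everything is a triple of (subject, predicate, object).
    g.add((EX.Customer, RDF.type, RDFS.Class))
    g.add((EX.acme, RDF.type, EX.Customer))
    g.add((EX.acme, RDFS.label, Literal("Acme Corporation")))

    # Turtle: one of the standard serialization syntaxes.
    print(g.serialize(format="turtle"))

    # SPARQL: the standard query language for RDF graphs.
    results = g.query(
        """
        SELECT ?customer ?label WHERE {
            ?customer a ex:Customer ;
                      rdfs:label ?label .
        }
        """,
        initNs={"ex": EX, "rdfs": RDFS},
    )
    for row in results:
        print(row.customer, row.label)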

While there is still some residual complexity, we think it is well worth the investment in time. The semantic technology stack has solved a large number of problems that graph databases and knowledge graphs otherwise have to solve on their own, on a piecemeal basis. Some of these capabilities are:

  • Schema – graph databases and even knowledge graphs have no standard schema, and if you wish to introduce one you have to implement the capability yourself. The semantic technologies have a very rich schema language that allows you to define classes based on what they mean in the real world.  We have found that disciplined use of this formal schema language creates enterprise models that are understandable, simple, and yet cover all the requisite detail.
  • Global Identifiers – semantic technology uses URIs (the Unicode version of which is called an IRI) to identify all nodes and arcs. A URI looks a lot like a URL, and best practice is to build them based on a domain name you own.  It is these global identifiers that allow the graphs to “self-assemble” (there is no writing of joins in semantic technology, the data is already joined by the system).
  • Identity Management – semantic technology has several approaches that make it manageable to live with the fact that you have assigned multiple identifiers to the same person, product, or place. One of the main ones is called "sameAs"; it allows the system to know that 'n' different URIs (which were produced from data in 'n' different systems, with 'n' different local IDs) all represent the same real-world item, so that all information attached to any of those URIs is available to all consumers of the data (subject to security, of course). A minimal sketch follows this list.
  • Resource Resolution – some systems have globally unique identifiers (you’ve seen those 48-character strings of numbers and letters that come with software licenses, and the like), but these are not very useful, unless you have a special means for finding out what any of them are or mean. Because semantic technology best practice says to base your URIs on a domain name that you own, you have the option for providing a means for people to find out what the URI “means” and what it is connected to.
  • Inference – with semantic technology you do not have to express everything explicitly, as you do in traditional systems. A great deal of information can be inferred from the formal definitions in the semantic schema, combined with the detailed data assertions in the knowledge graph.
  • Constraint Management – most graph databases and knowledge graphs were not built for online, interactive, end-user update access. Because of their flexibility, it is hard to enforce integrity. Semantic technology has a model-driven constraint manager that can ensure the integrity of a database is maintained.
  • Provenance – one key use case for semantic technology is combining data from many different sources. This creates a new requirement: when looking at data that has come from many sources, you often need to know where a particular bit of data came from. Semantic technologies have solved this in a general way that can go down to individual data assertions.
  • Relational and Big Data Integration – you won’t be storing all of your data in a graph database (semantic, or otherwise). Often you will want to combine data in your graph with data in your existing systems.  Semantic technology has provided standards, and there are vendors that have implemented these standards, such that you can write a query that combines information in the graph with that in a relational database or a big data store.
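Here is the identity-management sketch referenced in the list above: two systems have minted different identifiers for the same person, a single owl:sameAs assertion links them, and a query with a property path sees everything attached to either identifier. The URIs and local IDs are hypothetical, and a triple store with OWL reasoning enabled would make even the property path unnecessary.

    # A minimal, hypothetical sketch of "sameAs" identity management with rdflib.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import OWL, RDFS

    CRM  = Namespace("https://crm.example.com/id/")      # assumed system A
    BILL = Namespace("https://billing.example.com/id/")  # assumed system B

    g = Graph()

    # Each legacy system minted its own identifier for the same person.
    g.add((CRM.cust_001, RDFS.label, Literal("Jane Doe")))
    g.add((BILL.acct_77, RDFS.comment, Literal("Billing contact since 2018")))

    # One assertion links the two identifiers.
    g.add((CRM.cust_001, OWL.sameAs, BILL.acct_77))

    # A property path follows sameAs in either direction, so the query sees all
    # information attached to any of the equivalent URIs.
    results = g.query(
        """
        SELECT ?p ?o WHERE {
            <https://crm.example.com/id/cust_001> (owl:sameAs|^owl:sameAs)* ?id .
            ?id ?p ?o .
        }
        """,
        initNs={"owl": OWL},
    )
    for p, o in results:
        print(p, o)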

It is hard to cover a topic as broad as this in a page, but hopefully this establishes some of what the approach provides.

Applying Graph Technology

So how do these technologies deliver capability to some more common business problems?

Customer 360

We worked with a bank that was migrating to the cloud. As part of the migration they wanted to unify their view of their customers. They brought together a task force from all the divisions to create a single definition of a customer. This was essentially an impossible task. For some divisions (Investment Banking) a customer was a company; for others (Credit Card Processing) it was usually a person. Not only were there differences in type; the data they wanted, and were required to have, in these different contexts was also different. Further, one group (Corporate) espoused a very broad definition of customer that included anyone they could potentially contact. Needless to say, the "Know Your Customer" group couldn't abide this definition, as every new customer obligates them to perform a prescribed set of activities.

What we have discovered time and again is that if you start with a term (say, "Customer") and try to define it, you will be deeply disappointed. On the other hand, if you start with formal definitions (one of which, for "Customer," might be "a Person who is an owner or beneficiary on a financial account," where "financial account" must of course also be formally defined), it is not hard to get agreement on what the concept means and what the corresponding set of people would be. From there it is not hard to get to an agreed name for each concept.

In this case we ended up creating a set of formal, semantic definitions for all the customer-related concepts. At first blush it might sound like we had simply capitulated to letting everyone have their own definition of what a "Customer" was. But while there are multiple definitions of "Customer" in the model, they are completely integrated, in a way that allows any individual to be automatically categorized under multiple definitions of "Customer" simultaneously (which is usually the case).
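To give a flavor of what letting formal definitions do the categorizing can look like, here is a hypothetical sketch in which two different definitions of "Customer" are expressed as membership conditions over the graph, and the same person falls under both. The ontology namespace, properties, and account types are invented; in the engagement itself the definitions lived in the formal semantic model, not in ad hoc queries.

    # A minimal, hypothetical sketch: each definition of "Customer" is a formal
    # membership condition over the graph, and one person can satisfy several.
    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    EX = Namespace("https://bank.example.com/ont/")  # assumed ontology namespace

    g = Graph()
    # Jane owns a brokerage account (a financial account) and also signed up for
    # the free credit-rating service (an account without financial obligation).
    g.add((EX.jane, EX.ownerOf, EX.brokerage_123))
    g.add((EX.brokerage_123, RDF.type, EX.FinancialAccount))
    g.add((EX.jane, EX.ownerOf, EX.creditwatch_9))
    g.add((EX.creditwatch_9, RDF.type, EX.CreditReportingAccount))

    definitions = {
        # "a Person who is an owner or beneficiary on a financial account"
        "Financial Customer": """
            SELECT DISTINCT ?p WHERE {
                ?p (ex:ownerOf|ex:beneficiaryOf) ?a .
                ?a a ex:FinancialAccount .
            }""",
        # "anyone holding any kind of account": the broader corporate view
        "Account Holder": """
            SELECT DISTINCT ?p WHERE { ?p ex:ownerOf ?a . }""",
    }

    # Jane is categorized under both definitions at once, which is the point.
    for name, query in definitions.items():
        members = {row.p for row in g.query(query, initNs={"ex": EX})}
        print(name, "->", members)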

The picture shown below, which mercifully omits a lot of the implementation detail, captures the essence of the idea. Each oval represents a definition of “Customer.”

Figure: Overlapping definitions of "Customer" in the knowledge graph

In the lower right is the set of people who have signed up for a free credit-rating service. These are people who have an "Account" (the credit-reporting account), but it is an account without financial obligation (there is no balance, you cannot draw against it, etc.). The Know Your Customer (KYC) requirements only kick in for people with Financial Accounts. The overlap indicates that some people have both financial and non-financial accounts. The blue star represents a financial customer who also falls under the guidelines of KYC. Finally, the tall oval at the top represents the set of people and organizations that are not to be customers, the so-called "sanctions lists." You might think that these two ovals should not overlap, but with the sanctions lists continually changing and our knowledge of customer relationships constantly evolving, it is quite possible to discover after the fact that a current customer is on a sanctions list. We've represented this as a brown star that is simultaneously a financial customer and someone who should not be a customer.

We think this approach uniquely deals with the complexity inherent in large companies’ relationships with their customers.

In another engagement we used a similar approach to find customers who were also vendors, which is often of interest, and typically hard to detect consistently.

Compliance

Compliance is also a natural fit for Knowledge Graphs.

Next Angles

Mphasis’ project “Next Angles” converts regulatory text into triples conforming to an ontology, which they can then use to evaluate particular situations (we’ve worked with them in the past on a semantic project).  In this white paper they outline how it has been used to streamline the process of detecting money laundering: http://ceur-ws.org/Vol-1963/paper498.pdf.

Legal and Regulatory Information Provider

Another similar project we worked on was with a major provider of legal and regulatory information. The firm ingests several million documents a day, mostly court proceedings but also all changes to laws and regulations. For many years these documents were tagged by a combination of scripts and offshore human taggers. Gradually the relevance and accuracy of their tagging began to fall behind that of their rivals.

They employed us to help them develop an ontology and knowledge graph; they employed the firm netOWL to perform the computational linguistics to extract data from documents and conform it to the ontology.  We have heard from third parties that the relevance of their concept-based search is now considerably ahead of their competitors.

They recently contacted us as they are beginning work on a next generation system, one that takes this base question to the next level: Is it possible to infer new information in search by leveraging the knowledge graph they have plus a deeper modeling of meaning?

Investment Bank

We are working in the Legal and Compliance Division of a major investment bank. Our initial remit was to help with compliance with records-retention laws. There is complexity at both ends of this domain. On one end there are hundreds of jurisdictions promulgating and changing laws and regulations continually. On the other end are the billions of documents and databases that must be classified consistently before they can be managed properly.

We built a knowledge graph that captured all the contextual information surrounding a document or repository: who authored it, who put it there, what department they were in, what cost code they charged, and so on. Each bit of this contextual data had text associated with it. We were able to add some simple natural language processing that allowed them to accurately classify about 25% of the data under management. While 25% is hardly a complete solution, it compares with the half of one percent that had been classified correctly up to that point. Starting from this, they have launched a project with more sophisticated NLP and machine learning to create an end-user "classification wizard" that can be used by all repository managers.
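For a sense of what simple natural language processing over this contextual data can look like, here is a hypothetical sketch: a few keyword rules applied to the text attached to a repository's context (author, department, cost code, description). The retention classes and keywords are invented for illustration and are not from the actual project.

    # A hypothetical sketch of keyword-rule classification over contextual text.
    RETENTION_RULES = {
        "TRADE-RECORDS": {"trade", "settlement", "order blotter"},
        "HR-RECORDS":    {"payroll", "benefits", "performance review"},
        "LEGAL-HOLD":    {"litigation", "subpoena", "legal hold"},
    }

    def classify(context_text: str) -> list[str]:
        """Return every retention class whose keywords appear in the context."""
        text = context_text.lower()
        return [label for label, keywords in RETENTION_RULES.items()
                if any(keyword in text for keyword in keywords)]

    # Contextual data pulled from the knowledge graph for one repository:
    # author, department, cost code, and a free-text description.
    context = ("Authored by J. Smith, Equities Settlement Operations, "
               "cost code 4411, daily order blotter archive")

    print(classify(context))  # ['TRADE-RECORDS']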

We have moved on to other related compliance issues, including managing legal holds, operational risk, and a more comprehensive approach to compliance overall.

Summary: Knowledge Graphs & Semantic Technology

Knowledge Graphs and Semantic Technology are the preferred approach to complex business problems, especially those that require the deep integration of information that was previously hard to align, such as customer-related and compliance-related data.


Field Report from the First Annual Data-Centric Architecture Conference

Our Data-Centric Architecture conference a couple of weeks ago was pretty incredible. I don't think I've ever participated in a single intense, productive conversation with 20 people that lasted two and a half days with hardly a letup. Great energy, very balanced participation.

And I echo Mark Wallace’s succinct summary on LinkedIn.

I think one thing all the participants agreed on was that it wasn’t a conference, or at least not a conference in the usual sense. I think going forward we will call it the Data-centric Architecture Forum. Seems more fitting.

My summary takeaways were:

  1. This is an essential pursuit.
  2. There is nothing that anyone in the group (and this is a group with a lot of coverage) knows of that does what a Data-Centric Architecture has to do, out of the box.
  3. We think we have identified the key components. Some of them are difficult and have many design options that are still open, but no aspect of this is beyond the reach of competent developers, and none of the components are even that big or difficult.
  4. The straw-man held up well and worked as a communication device. We have a few proposed changes.
  5. We all learned a great deal in the process.

A couple of immediate next steps:

  1. Hold the date, and save some money: we're doing this again next year, Feb 3-5. It's $225 if you register by April 15th: http://dcc.semanticarts.com.
  2. The theme of next year’s forum will be experience reports on attempting to implement portions of the architecture.
  3. We are going to pull together a summary of points made and changes to the straw-man.
  4. I am going to begin in earnest on a book covering the material covered.

Field Report by Dave McComb

Join us next year!