Dispose, Delete, and Discard: Keep your Enterprise Data Tidy Part 3

Those who are familiar with Marie Kondo know that she is a ruthless disposer. If you’ve read parts one and two of this series, you know that the process is more nuanced than just “throw it all away,” but we’ve come to the point in the process where it’s important to focus on discarding. If you haven’t read parts one and two, please do so; they provide context for this post.  Armed with categories that work for your organization and a solid set of values that the data you keep must uphold to be useful to your business, this part of the process is primarily about dedicating time to pruning your files, records, and documentation.

Data Lifecycle Policies

“The fact that you possess a surplus of things that you can’t bring yourself to discard doesn’t mean you are taking good care of them.  In fact, it is quite the opposite.” It’s interesting to note that, while there are many book collectors who lament Kondo’s popularity and cry, “You can pry my books out of my cold, dead hands,” there aren’t many librarians who hold this sentiment.  Professionals know that collections must be pruned and managed. In fact, your organization may have one or more policies about managing data and documents.  At a minimum, data lifecycle policies cover three points of a document’s existence within an organization: creation or acquisition, use and storage, and disposition.  These policies may be driven by the systems used to manage your documents (Microsoft SharePoint comes to mind) or they may be driven by government mandates. These should be your guide on what to discard and when.  If your organization has these policies outlined clearly, the hard work is already done, and you can begin using parts one and two as your guide to systematically deleting unneeded data and documentation. It may also be that some of this lifecycle management functionality is encoded in your systems, but it’s important to understand the policies if you’re making the decisions about data disposition. If your organization does not have a data lifecycle policy, you can explore creating one while you work on becoming data centric.
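
To make the three lifecycle points concrete, here is a minimal sketch of how such a policy might be encoded. The field names and the seven-year invoice retention rule are invented for illustration, not taken from any standard or mandate:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class LifecyclePolicy:
    """A minimal policy covering creation/acquisition, use and storage,
    and disposition.  Retention periods here are illustrative only."""
    category: str
    retention: timedelta       # how long after creation a record is kept at all
    archive_after: timedelta   # when it moves from active storage to archive

def disposition(created: date, policy: LifecyclePolicy, today: date) -> str:
    """Decide which lifecycle stage a record is in."""
    age = today - created
    if age >= policy.retention:
        return "discard"
    if age >= policy.archive_after:
        return "archive"
    return "active"

# Hypothetical policy: keep invoices seven years, archive after one.
invoices = LifecyclePolicy("invoice", timedelta(days=7 * 365), timedelta(days=365))
print(disposition(date(2013, 1, 1), invoices, date(2021, 1, 1)))  # -> discard
```

The point of even a toy encoding like this is that the disposition decision becomes explicit and auditable rather than ad hoc.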

Data Configuration Management

Outside of an overarching strategy or policy for managing your organization’s data and information, your organization may have various configuration management tools in place (e.g., Git or Subversion) to manage drafts and backups. Many large organizations use file sharing systems to govern who has privileges to directories and files.  If you’re attempting to KonMari your files when such systems are in place, it will be necessary to work collaboratively to get access to the files in your control.

When do you actually discard?

One of the key ideas in Marie Kondo’s method is that when you discard, you only discard your own belongings.  If you are the owner and CTO of a company, then you have the freedom to discard what no longer sparks joy.  In a large company, that question of ownership is far more complex and possibly beyond the reader’s paygrade. It might be beyond the CEO’s paygrade. It is certainly beyond the paygrade of this writer, except for a select few files on a laptop and on a removable storage device used for backups.  But the question of ownership can often be established by completing the work recommended in this series of blog posts.  And once you’ve established ownership, even complex ownership, you can use metadata to describe ownership and provenance, making it easier to manage that data’s future state, discarded or otherwise.
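
As a sketch of that last point, ownership and provenance metadata can be as simple as a record attached to each dataset. Every field name below is an illustrative assumption, not a standard schema:

```python
# Illustrative provenance record; field names are invented for the example.
dataset_metadata = {
    "path": "reports/q3_forecast.xlsx",
    "owner": "finance-team",        # who may decide on disposition
    "steward": "jdoe",              # day-to-day custodian
    "source": "exported from ERP",  # provenance: where the data came from
    "created": "2020-09-30",
    "status": "active",             # e.g. active | archived | discarded
}

def may_discard(record: dict, requester: str) -> bool:
    """Only the recorded owner may mark a dataset for disposition."""
    return requester == record["owner"]

print(may_discard(dataset_metadata, "finance-team"))  # True
```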

Futureproofing your Data

Now that we’ve considered the end of the data lifecycle, let’s take a look at the start: data acquisition and creation.  If you’ve done the work so far of identifying your business processes, assessing how well your data supports your goals, and aligning to your data lifecycle management policy (formal or otherwise), you know how important it is to also consider the introduction of new data.  We touched on this in the first two parts, but there’s a subtle difference between considering how data came to be in your collection and considering data that you will include in your collection from this point forward.

This is something you can specify with policy, and it’s something you can anticipate with a robust ontology. However, it’s not as simple as building robust metadata.  An ontology that is carefully anchored to your organization’s processes, has sufficient input from the right subject matter experts, and is developed within a hospitable IT infrastructure, is far more likely to be a sound gatekeeper for your incoming data.

In the IT industry, this is referred to as futureproofing, and it is designed to minimize the need for downstream development to correct work you’re doing now. It’s often a judgment call as to whether an application or system is introducing too much technical debt, but there is no argument that being able to understand each piece of data that goes into your system is critical to avoiding such debt. The way to ensure your data will be understandable downstream is to have adequate metadata.  If you want your data to be sophisticated and able to support complex information needs, you need to use semantics.
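
A minimal sketch of such a metadata gatekeeper for incoming data, with the required fields chosen purely for illustration:

```python
# Illustrative required fields; a real policy would define its own set.
REQUIRED_METADATA = {"source", "owner", "created", "schema_version"}

def gatekeeper(record: dict) -> bool:
    """Admit a record only if every required metadata field is present
    and non-empty, so the data stays understandable downstream."""
    meta = record.get("metadata", {})
    return all(meta.get(field) for field in REQUIRED_METADATA)

good = {"payload": {"amount": 120},
        "metadata": {"source": "CRM export", "owner": "sales-ops",
                     "created": "2021-02-01", "schema_version": "1.0"}}
bad = {"payload": {"amount": 99},
       "metadata": {"source": "unknown"}}

print(gatekeeper(good), gatekeeper(bad))  # True False
```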

“The secret to maintaining an uncluttered room is to pursue ultimate simplicity in storage so that you can tell at a glance how much you have.” -Marie Kondo

Read Part 1: Does your Data Spark Joy?

Read Part 2: Setting the Stage for Success

Written by Meika Ungricht

The Data-Centric Revolution: The Role of SemOps (Part 1)

We’ve been working on something we call “SemOps” (like DevOps, but for Semantic Technology + IT Operations).  The basic idea is to create a pipeline that takes proposed enterprise ontology or taxonomy enhancements to “in-production” as frictionlessly as possible.

As so often happens, when we shine the Semantic Light on a topic area, we see things anew.  In this very circuitous way, we’ve come to some observations and benefits that we think will be of interest even to those who aren’t on the Semantic path.

DevOps for Data People

If you’re completely on the data side, you may not be aware of what developers are doing these days.  Most mature development teams have deployed some version of DevOps (Software Development + IT Operations) along with CI/CD (Continuous Integration / Continuous Deployment).

To understand what they are doing, it helps to hark back to what preceded DevOps and CI/CD.  Once upon a time, software was delivered via the waterfall methodology.  Months, or occasionally years, would be spent getting the requirements for a project “just right.” The belief was that if you didn’t get a requirement right up front, adding even a single new feature later would cost 40 times what it would have cost had it been identified up front.  It turns out there was some good data on this cost factor, and it still casts its shadow: any time you try to make a modification to a packaged enterprise application, 40x is a reasonable benchmark compared to what it would cost to implement that feature outside the package.  This, as a side note, is the economics that creates the vast number of “satellite systems” that seem to spring up alongside large packaged applications.

Once the requirements were signed off on, the design began (more months or years), then coding (more months or years), and finally systems testing (more months or years).  Then came the big conversion weekend: the system went into production, tee shirts were handed out to the survivors, and the system became IT Operations’ problem.

There really was only ever one “move to production,” and few thought it worthwhile to invest the energy in making it more efficient.  Most sane people, once they’d stayed up all night on a conversion weekend, were loath to sign up for another, and it certainly didn’t occur to them to figure out a way to make it better.

Then agile came along.  One of the tenets of agile was that you always had a working version that you could, in theory, push to production.  In the early days people weren’t pushing to production on any frequent schedule, but the fact that you always could was a good discipline that helped teams avoid technical debt and straying off into building hypothetical components.

Over time, the idea that you could push to production became the idea that you should.  As people invested more and more in their unit testing and regression testing, and pipelines to move from dev to QA to production, people became used to the idea of pushing small incremental changes into production systems.  That was the birth of DevOps and CI/CD.  In mature organizations like Google and Amazon, new versions of their software are being pushed to production many times per day (some reports say many times per second, but this may be hyperbole).

The reason I bring it up is that there are some things in there that we expect to duplicate with SemOps, and some that we already have with data.  (As I was writing this sentence, I was tempted to write “DataOps” and thought, “Is there such a thing?”)  A nanosecond of googling later, I found this extremely well-written article on the topic from our friends at DataKitchen. They are focusing more on the data analytics part of the enterprise, which is a hugely important area. The points I was going to make are more focused on the data-source end of the pipeline, but the two ideas tie together nicely.

Click here to read more on TDAN.com

A Data Engineer’s Guide to Semantic Modelling

While on her semantic modelling journey as a Data Engineer, Ilaria Maresi encountered a range of challenges. There was no single definitive source where she could quickly look things up; many of the resources were extremely technical and geared towards a more experienced audience, while others were too wishy-washy. So she decided to compose this 50-page document in which she explains semantic modelling and her most important lessons learned, all in an engaging and down-to-earth writing style.

She starts off with the basics: what is a semantic model and why should you consider building one? Obviously, this is best explained by using a famous rock band as an example. In this way, you learn to draw the basic elements of a semantic model and some fun facts about Led Zeppelin at the same time!

For your model to actually work, it is essential that machines can also understand these fun facts. This might sound challenging if you are not a computer scientist, but this guide will walk you through it step-by-step – it even has pictures of baby animals! You will learn how to structure your model in the Resource Description Framework (RDF) and give it meaning with the vocabulary extension that wins the prize for cutest acronym: the Web Ontology Language (OWL).

All other important aspects of semantic modelling are covered as well. For example, how to make sure we all talk about the same Led Zeppelin by using Uniform Resource Identifiers (URIs). Moreover, you are not the first one thinking and learning about knowledge representation: many domain experts have put serious time and effort into defining the major concepts of their field in what are called ontologies. To keep you from re-inventing the wheel, the guide lists the most important resources and explains their origins.
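
To give a flavor of the triple-plus-URI idea (this is a sketch of the general technique, not an excerpt from the guide), here is a tiny Python illustration. The example.org URIs are placeholders; only the rdf:type URI is a real W3C identifier:

```python
# Statements as (subject, predicate, object) triples.  The URIs pin down
# exactly *which* Led Zeppelin we mean; example.org names are placeholders.
triples = [
    ("http://example.org/LedZeppelin",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://example.org/RockBand"),
    ("http://example.org/LedZeppelin",
     "http://example.org/hasMember",
     "http://example.org/JimmyPage"),
]

def objects(subject: str, predicate: str) -> list:
    """All objects linked from `subject` via `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("http://example.org/LedZeppelin", "http://example.org/hasMember"))
```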

Are you a Data Engineer who has just started with semantic modelling? Want to refresh your memory? Maybe you have no experience with semantic modelling yet but feel it might come in handy? Well, this guide is for you!

Click here to access a data engineer’s guide to semantic modelling

Written by Tess Korthout

A Brief Introduction to the gist Semantic Model

Phil Blackwood, Ph.D.

It’s no secret that most companies have silos of data and continue to create new silos.  Data that has the same meaning is often represented hundreds or thousands of different ways as new data models are introduced with every new software application, resulting in a high cost of integration.

By contrast, the data-centric approach starts with the common meaning of the data to address the root cause of data silos:

An enterprise is data-centric to the extent that all application functionality is based on a single, simple, extensible, federate-able data model.

An early step along the way to becoming data-centric is to establish a semantic model of the common concepts used across your business.  This might sound like a huge undertaking, and perhaps it will be if you start from scratch.  A better option is to adopt an existing core semantic model that has been designed for businesses and has a track record of success, such as gist.

Gist is an open source semantic model created by Semantic Arts.  It is the result of more than a decade of refinement based on data-centric projects done with major corporations in a variety of lines of business.  Semantic Arts describes gist as “… designed to have the maximum coverage of typical business ontology concepts with the fewest number of primitives and the least amount of ambiguity.”  The Wikipedia entry for upper ontologies compares gist to other ontologies, and gives a sense of why it is a match for corporate data management.

This blog post introduces gist by examining how some of the major Classes and Properties can be used.  We will not go into much detail; just enough to convey the general idea.

Everyone in your company would probably agree that running the business involves products, services, agreements, and events like payments and deliveries.  In turn, agreements and events involve “who, what, where, when, and why”, all of which are included in the gist model.  Gist includes about 150 Classes (types of things), and different parts of the business can often be modeled by adding sub-classes.  Here are a few of the major Classes in gist:

Gist also includes about 100 standard ways things can be related to each other (Object Properties), such as:

  • owns
  • produces
  • governs
  • requires, prevents, or allows
  • based on
  • categorized by
  • part of
  • triggered by
  • occurs at (some place)
  • start time, end time
  • has physical location
  • has party (e.g. party to an agreement)

For example, the data representing a contract between a person and your company might include things like:

In gist, a Contract is a legally binding Agreement, and an Agreement is a Commitment involving two or more parties.  It’s clear and simple.  It’s also expressed in a way that is machine-readable to support automated inferences, Machine Learning, and Artificial Intelligence.

The items and relationships of the contract can be loaded into a knowledge graph, where each “thing” is a node and each relationship is an edge.  Existing data can be mapped to this standard representation to make it possible to view all of your contracts through a single lens of terminology.  The knowledge graph for an individual contract as sketched out above would look like:

Note that this example is just a starting point.  In practice, every node in the diagram would have additional properties (arrows out) providing more detail.  For example, the ID would link to a text string and to the party that allocated the ID (e.g. the state government that allocated a driver’s license ID).  The CatalogItem would be a detailed Product or Service Specification.

In the knowledge graph, there would be a single Person entry representing a given individual, and if two entries were later discovered to represent the same person, they could be linked with a sameAs relationship.
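
Pulling the contract example together, the nodes and edges might be sketched as triples like the following. The instance names are invented for illustration, and the property names merely echo the gist Object Properties listed above:

```python
# A contract sketched as (subject, predicate, object) edges in a graph.
# Instance names are invented; property names echo gist-style properties.
graph = [
    ("contract42", "rdf:type", "gist:Contract"),
    ("contract42", "gist:hasParty", "person7"),
    ("contract42", "gist:hasParty", "yourCompany"),
    ("contract42", "gist:basedOn", "catalogItem9"),
    ("person7", "gist:identifiedBy", "driversLicenseID123"),
]

def neighbors(node: str) -> list:
    """All nodes one edge away from `node`, i.e. its arrows out."""
    return [o for s, p, o in graph if s == node]

print(neighbors("contract42"))
```

Note how every relationship is just another edge; adding the ID's allocating party or the detailed product specification mentioned above would mean appending more triples, not changing a schema.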

Relationships in gist (Properties) are first class citizens that have a meaning independent of the things they link, making them highly re-usable.  For example, identifiedBy is not limited to contracts, but can be used anywhere something has an ID.  Note that the Properties in gist are used to define relationships between instances rather than Classes; there are also a few standard relationships between Classes such as subClassOf and equivalentTo.

The categorizedBy relationship is a powerful one, because it allows the meaning of an item to be specified by linking to a taxonomy rather than by creating new Classes.  This pattern contributes to extensibility; adding new characteristics becomes comparable to adding valid values in a relational model instead of adding new attributes.
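
A small sketch of the pattern, with invented item and category names: extending the model means adding a category value, not minting a new Class.

```python
# Taxonomy nodes act as valid values; items link to them via categorizedBy.
categories = {"cat:New", "cat:Refurbished"}
items = [("item1", "rdf:type", "gist:CatalogItem"),
         ("item1", "gist:categorizedBy", "cat:Refurbished")]

# Adding a new characteristic is just a new value plus a new link --
# no new Class, no schema change:
categories.add("cat:OpenBox")
items.append(("item2", "gist:categorizedBy", "cat:OpenBox"))

print(sorted(categories))
```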

Unlike traditional data models, the gist semantic model can be loaded into a knowledge graph, and then the data is loaded into the same knowledge graph as an extension of the model.  There is no separation between the conceptual, logical, and physical models.  Similar queries can be used to discover the model or to view the data.
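
A sketch of that property: with model triples and data triples in one store, the same tiny query mechanism serves both. All names here are illustrative.

```python
# Model and data live side by side in the same triple store.
store = [
    ("gist:Contract", "rdfs:subClassOf", "gist:Agreement"),  # model triple
    ("contract42", "rdf:type", "gist:Contract"),             # data triple
]

def match(s=None, p=None, o=None):
    """Tiny triple-pattern query; None acts as a wildcard."""
    return [(S, P, O) for S, P, O in store
            if s in (None, S) and p in (None, P) and o in (None, O)]

print(match(p="rdfs:subClassOf"))  # discover the model
print(match(p="rdf:type"))         # view the data
```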

Gist uses the W3C OWL standard (Web Ontology Language), and you will need to understand OWL to get the most value out of gist.  To get started with OWL for corporate data management, check out the book Demystifying OWL for the Enterprise, by Michael Uschold.  There’s also a brief introduction to OWL and the way it uses set theory here.

The technology stack that supports OWL is well-established and has minimal vendor lock-in because of the simple standard data representation.  A semantic model created in one knowledge graph (triple store) can generally be ported to another tool without too much trouble.

To explore gist in more detail, you can download an ontology editor such as Protégé and then select File > Open From URL and enter: https://ontologies.semanticarts.com/o/gistCore9.4.0  Once you have the gist model loaded, select Entities and then review the descriptions of Classes, Object Properties (relationships between things), and Data Properties (which point to string or numeric values with no additional properties).  If you want to investigate gist in an orderly sequence, I’d suggest viewing items in groups of “who, what, when, where, and how.”

Take a look at gist.  It’s worth your time, because having a standard set of common terms like gist is a significant step toward reversing the trend toward more and more expensive data silos.

Click here to learn more about gist.

Sharing Ontologies Globally To Speed Science And Healthcare Solutions

The COVID-19 pandemic is a clear example of how healthcare practitioners require swift access to enormous amounts of diverse information to efficaciously treat patients. They must synthesize individual data (vital signs, clinical history, demographics, and more) with rapidly evolving knowledge about COVID-19 and make decisions relevant to the conditions from which specific patients suffer. Practitioners rely on point-of-care decision support systems to accelerate patient-care analysis and to scale treatments for the intake quantities of global pandemics. These systems analyze a plethora of inputs to produce tailored treatment recommendations, in near real-time, which significantly enhance the quality of treatment.

Ontologies Create The Foundation For Complex Data Analysis

The underlying utility of these systems is widely based on the vast quantities of healthcare knowledge analyzed. Such knowledge must be uniformly represented (at scale) with rich, contextualized descriptions of the full scope of clinical trials, pharmaceutical information, and research germane to the biomedical field that expands daily with each published paper and new findings. This knowledge should be rapidly accessible, reusable, and a sturdy foundation on which to base present and future research in this field, encompassing everything from long-standing maladies like peanut allergies to emergent ones like COVID-19.

Ontologies, evolving conceptual data models built on standardized concepts, uniquely fulfill each of these requirements to fuel healthcare research and point-of-care decision support systems, helping save lives when they need saving most.

International Ontology Sharing Is Becoming A Reality

A consortium of researchers recently formed an organization dedicated to standardizing how scientists define their ontologies, which are essential for retrieving datasets as well as for understanding and reproducing research. The group, called the OntoPortal Alliance, is creating a public repository of internationally shared domain-specific ontologies. All the repositories will be managed with a common OntoPortal appliance that has been tested with AllegroGraph Semantic Knowledge Graph software. This enables any OntoPortal adopter to get all the power, features, maintainability, and support benefits that come from using a widely adopted, state-of-the-art semantic knowledge graph database.

The first set of ontology repositories making up the OntoPortal Alliance includes BioPortal (biomedical and other ontologies used internationally), SIFR (biomedical ontologies in the French language), BMICC MedPortal (biomedical ontologies focused on Chinese users), AgroPortal (ontologies focused on agronomy and related sciences), and EcoPortal (ontologies focused on environmental science). The OntoPortal Alliance will be adding more ontology repositories and is open to working with researchers in other domains who want to offer ontologies publicly.

Click here to read the full article at HealthITOutcomes.com

Setting the Stage for Success Part 2

Envisioning Your Dream System with the Marie Kondo Method

Before you begin gathering your belongings, discarding, or reorganizing, Marie Kondo asks you to envision your dream lifestyle.  She insists that this is the critical first step to ensuring success with her method, and she provides some guidance on how to do so, along with examples from her clients.  The example Marie Kondo uses in her book is a young woman who lives in a tiny apartment, typical of Japanese cities.  Her floor is covered with things, and her bed is a storage space when she isn’t sleeping on it.  She comes home from work tired, and her living space compounds that exhaustion.  Her dream is simple: to have the space be free from clutter, like a hotel suite, where she can come home and relax with tea and a bath before bed.

While the situation may be different for someone who has responsibility for stores of corporate data and systems, the process of envisioning your ideal environment is not.  As you begin to examine your systems, information architecture, data—an information landscape, in general—it’s absolutely critical to have in mind what you want.  Having in mind “better” or “new technology” leads you towards trends and vendors with cool product features that may meet your needs, but more likely will end up contributing to the data and system clutter in the long run.  It may seem like a simplistic question, “What do you want?” but your efforts in defining that will help you navigate the marketplace of emerging technology.  At this point, it is important not to focus on the process or the items in front of you that you may or may not want to keep; rather, envisioning your ideal end-state, be it a living space filled with only things you love or a database filled only with data that supports your business, is what empowers you to move forward.

If you’re a savvy tech professional, you’re already thinking, “This is the requirements gathering process,” and you would be right.  There is no shortage of requirements gathering methodologies out there and most of them are pretty good.  If it gets you to envision an ideal that is vendor and tool agnostic and is based on the needs and desires of your key stakeholders and end-users, your method is fine.  If your requirements include things like, “better search functionality,” or, “more insight into what data we have,” it’s very likely that you’re also in need of some data decluttering.

Get Started by Defining your Categories

The Marie Kondo method requires you to see your belongings in two overarching categories: things that spark joy and everything else.  Everything else should be discarded.  For our purposes, data that sparks joy is data that serves your business.  It is helpful to look at the antithesis of joy to get an idea of what should be kept or discarded.  For example, if you are facing an audit, the antithesis of joy is not being able to produce the documentation that the auditor needs to conduct the audit.  That could be because you can’t access it, because what you have isn’t what they need, because you don’t have what they need, or because what they need is too difficult to find amidst the data and information that you do have.  In this example, the information that allows you to have peace of mind during an audit is what you should keep. The bigger pattern here is that it’s important to know what business processes, data flows, decision points, and dependencies are impacting your business, and what the inputs and outputs are to those process steps.

Before you can begin to discard by category, you must know what categories drive your business.  Marie Kondo starts by outlining a series of categories that guide her clients through the process of discarding.  She starts with clothing, then books, then papers, then everything else.  She breaks down these categories even further, allowing people with astonishingly large and complex collections of things to take a systematic approach to decluttering. With organizational data, this approach will work, but the way you define the categories depends on the kind of organization you are.

The categories you need should emerge out of your efforts at process improvement. From Investopedia: “Kaizen is a Japanese term meaning ‘change for the better’ or ‘continuous improvement.’ It is a Japanese business philosophy regarding the processes that continuously improve operations and involve all employees. Kaizen sees improvement in productivity as a gradual and methodical process.”(1) Often, semantic work is done alongside large-scale business process improvement efforts.  Businesses want to know what the information inputs and outputs are, and they want to know how that information influences decisions and actions.  These efforts are often iterative, and it’s not uncommon to uncover conflicts in how people understand the data, or what they use it for.  I remember working with a team of medical experts who all used “normal” as a data point in their diagnostic processes.  It took our team years to come up with a good way to encode “normal” because each expert meant something different by the term.  There were heated debates about whether “normal” was relative to the individual patient (someone legally blind, for example, for whom a low visual acuity score might be considered normal) or relative to a cohort or population average, in which case that patient’s low score was not normal.  These conflicts and pain points are like mismatched socks and poorly-fitting jeans: they’re your clue about where you need to look at your data. This is also the starting point for determining which categories you need to use to evaluate your data. Do not strong-arm your conflicts into silence; use them to light the way ahead.

Building the Categories that Matter to You with the Marie Kondo Method

The Marie Kondo method categories are presented in an order that begins by teaching us what it means to feel that spark of joy (clothing), works through household items that might be useful but not particularly exciting, and ends with items of sentimental value (photos and heirlooms).  One of the big challenges of applying the Marie Kondo method to organizational data is that this rubric of clothing, papers, and photos doesn’t map easily onto data.  However, the underlying idea of what is essential to our survival and our comfort does easily translate.  Don’t get bogged down in the details too soon. Marie Kondo advises that you create subcategories according to your need.

When I was organizing my miscellaneous items, I uncovered some camping gear I had purchased a couple years ago with the intention of going on a long bike ride that involved camping at night.  I was unable to go, so I packed the gear away for another time.  As I went through the process of evaluating my belongings using the Marie Kondo method, I decided I’ve always enjoyed camping and I was going to make space in my life for it.  I booked a camping trip for a few days, loaded my gear into a rental car, and put my gear to the test.

This camping trip was rich with lessons, pleasant and painful both. I took the gear I had bought for the bike trip, but since I had a car, I also supplemented it with larger and heavier items I knew would be useful now that I had the space.  Things I thought would be overkill turned out to be very useful: extra flashlight, large water container, spare book of matches, extra pillow, folding chair, extra plastic tub, etc.  Things I was certain I would use ended up coming home unused: pancake mix, spare sleeping bag, two changes of clothes, packets of sample skin and hair products, etc. And I found there were things I needed in the moment that I didn’t have: a lighter, fire starters, strong bug spray, an umbrella, and 4WD.  The underlying lesson here is that your gear should enable the activities you want to do.  And different types of gear serve different types of experiences, even if they’re categorically similar. If you look at the gear belonging to someone who likes glamping and compare it to someone who likes to through-hike the Appalachian Trail, there may not be a whole lot of overlap in the specifics, even though the categories are the same.  This is because your process determines your needs.

Camping gear is often designed to meet basic human needs and provide basic creature comforts.  Complex business processes can draw from this analog example, in that your categories are going to appear around the essential tasks of your business. In many of the projects I’ve done in the past, some effort has been made to identify key information areas that need development using Continuous Improvement or Kaizen principles.  Information artifacts, key concepts, subject headings, however you choose to refer to them, are the overarching conceptual subjects that drive your business.  Using the camping example, this might look like the following: Sleep, Food, Hygiene, Recreation. If you break down sleep, the process could be as simple as laying out a tarp and a blanket and wrapping yourself up in it and going to sleep.  Or it might be as complex as building a platform, building a tent, constructing a bed frame, unfolding sheets, pillows, and blankets, securing the tent, and finally going to sleep. In both scenarios, there are categories for sleep surface, shelter, and bedding.

Another key comparison comes up when considering duplication and re-use.  Chances are, you aren’t going to need a different sleeping bag for each camping scenario.  It’s interesting to note that if you go into an outdoor supply outfitter looking for sleeping bags, you will find a range of options based on very specific situations.  If your business is camping, you just might need several different bags!  But for most people, this just adds complexity and expense.  You do want to make sure the zipper works so you can control the amount of body heat you’re trapping in the bag, and if you’re camping in the cold you might add a blanket. But otherwise, a multi-season sleeping bag that’s comfortable and easy to care for is going to be re-used over and over in many camping scenarios.

For a business, the examples might range from a child’s lemonade stand to Starbucks. The information objects are going to be similar: menu, supplies, metrics. Once you’ve established these categories, you can look at your data systematically.  Coming up with these key concepts allows you to define the scope of your work and your priorities for development.

What’s Next?

Now that you’ve got a sense of how to create a list of categories based on your business processes, you can begin the process of discarding.  As with the process so far, it’s not as simple as it is for your possessions at home.  Disposition of data within an enterprise, large or small, comes with politics and legal requirements.  In part three, you will see some ideas about where to start with data disposition and how to use your company’s data disposition strategies to your advantage.

Click Here to Read Part 1 of this Series

Footnotes:
(1) https://www.investopedia.com/terms/k/kaizen.asp

A Mathematician and an Ontologist walk into a bar…

The Ontologist and Mathematician should be able to find common ground because Cantor introduced set theory into the foundation of mathematics, and W3C OWL uses set theory as a foundation for ontology language.  Let’s listen in as they mash up Cantor and OWL …

Ontologist: What would you like to talk about?

Mathematician: Anything.

Ontologist: Pick a thing. Any. Thing. You. Like.

Mathematician: [looks across the street]


Ontologist: Sure, why not?  Wells Fargo it is.  If we wanted to create an ontology for banking, we might need to have a concept of a company being a bank to differentiate it from other types of companies.  We would also want to generalize a bit and include the concept of Organization.

Mathematician: That’s simple in the world of sets.


Ontologist: In my world, every item in your diagram is related to every other item.  For example, Wells Fargo is not only a Bank, but it is also an Organization.  Relationships to “Thing” are applied automatically by my ontology editor.  When we build our ontology, we would first enter the relationships in the diagram below (read it from the bottom to the top):


Then we would run a reasoner to infer other relationships.  The result would look like this:

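The inference the Ontologist describes can be sketched in plain Python (not OWL syntax). The hierarchy below is assumed from the dialogue: Bank is a kind of Company, Company a kind of Organization, Organization a kind of Thing, and Wells Fargo is asserted only as a Bank; the toy "reasoner" follows the subClassOf links upward:

```python
# Asserted subClassOf links: child class -> direct parent class.
sub_class_of = {
    "Bank": "Company",
    "Company": "Organization",
    "Organization": "Thing",
}

# Only one type is asserted directly; the rest are inferred.
asserted_types = {"Wells Fargo": {"Bank"}}

def infer_types(instance):
    """Follow subClassOf links upward, collecting every inferred class."""
    inferred = set(asserted_types.get(instance, set()))
    frontier = list(inferred)
    while frontier:
        cls = frontier.pop()
        parent = sub_class_of.get(cls)
        if parent and parent not in inferred:
            inferred.add(parent)
            frontier.append(parent)
    return inferred

# Wells Fargo is inferred to also be a Company, an Organization, and a Thing.
print(infer_types("Wells Fargo"))
```

A real reasoner does far more than this, but the mechanism for class membership is essentially this transitive walk up the hierarchy.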

Mathematician: My picture has “Banks” and yours has “Bank”.  You took off the “s”.

Ontologist: Well, yes, I changed all the set names to make them singular because that’s the convention for Class names.  Sorry.  But now that you mention it … whenever I create a new Class I use a singular name just like everyone else does, but I also check to see if the plural is a good name for the set of things in the Class.  If the plural doesn’t sound like a set, I rethink it.  Try that with “Tom’s Stamp Collection” and see what you get.

Mathematician: I’d say you would have to rethink that Class name if you wanted the members of the Class to be stamps.  Otherwise, people using your model might not understand your intent.  Is a Class more like a set, or more like a template?

Ontologist: Definitely not a template, unlike object-oriented programming.  More like a set where the membership can change over time.

Mathematician: OK.  S or no S, I think we are mostly talking about the same thing.  In fact, your picture showing the Classes separated out instead of nested reminds me of what Georg Cantor said: “A set is a Many that allows itself to be thought of as a One.”

Ontologist: Yes.  You can think of a Class as a set of real world instances of a concept that is used to describe a subject like Banking.  Typically, we can re-use more general Classes and only need to create a subclass to differentiate its members from the other members of the existing Class (like Bank is a special kind of Company).  We create or re-use a Class when we want to give the Things in it meaning and context by relating them to other things.

Mathematician: Like this?


Ontologist: Exactly.  Now we know more about Joan, and we know more about Wells Fargo.  We call that a triple.

Mathematician: A triple.  How clever.

Ontologist: Actually, that’s the way we store all our data.  The triples form a knowledge graph.

Mathematician: Oh, now that’s interesting …  nice idea. Simple and elegant.  I think I like it.
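The triple store the Ontologist mentions can be illustrated with nothing more than Python tuples; this is a simplified sketch, not how a production triple store is implemented:

```python
# A knowledge graph as a plain set of (subject, predicate, object) triples.
triples = {
    ("Joan", "worksFor", "Wells Fargo"),
    ("Wells Fargo", "rdf:type", "Bank"),
}

def match(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

# Who works for whom?
print(match(p="worksFor"))
```

Pattern matching with wildcards is the seed of graph query languages like SPARQL: a query is just a set of triple patterns to satisfy.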

Ontologist: Good.  Now back to your triple with Joan and Wells Fargo.  How would you generalize it in the world of sets?

Mathematician: Simple.  I call this next diagram a mapping, with Domain defined as the things I’m mapping from and Range defined as the things I’m mapping to.


Ontologist: I call worksFor an Object Property.  For today only, I’m going to shorten that to just “Property”.  But.  Wait, wait, wait.  Domain and Range?


In my world, I need to be careful about what I include in the Domain and Range, because any time I use worksFor, my reasoner will conclude that the thing on the left is in the Domain and the thing on the right is in the Range.

Ontologist continues: Imagine if I set the Domain to Person and the Range to Company, and then assert that Sparkplug the horse worksFor Tom the farmer.  The reasoner will tell me Sparkplug is a Person and Tom is a Company.  That’s why Domain and Range always raise a big CAUTION sign for me.  I always ask myself if there is anything else that might possibly be in the Domain or Range, ever, especially if the Property gets re-used by someone else.  I need to define the Domain and Range broadly enough for future uses so I won’t end up trying to find the Social Security number of a horse.

Mathematician: Bummer.  Good luck with that.
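The Sparkplug pitfall is easy to demonstrate. This sketch (in Python, with the domain and range declarations from the dialogue) shows exactly which type assertions a reasoner would add from one use of the property:

```python
# Declared domain and range for the worksFor Property.
domain_of = {"worksFor": "Person"}
range_of = {"worksFor": "Company"}

def infer_from_property(subject, prop, obj):
    """Return the type assertions a reasoner would add for one triple."""
    inferred = []
    if prop in domain_of:
        inferred.append((subject, "rdf:type", domain_of[prop]))
    if prop in range_of:
        inferred.append((obj, "rdf:type", range_of[prop]))
    return inferred

# Assert that Sparkplug the horse worksFor Tom the farmer, and the
# reasoner dutifully concludes Sparkplug is a Person and Tom is a Company.
print(infer_from_property("Sparkplug", "worksFor", "Tom"))
```

Note that the reasoner does not flag the triple as wrong; it infers new facts from it. That is why overly narrow domains and ranges produce nonsense rather than errors.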

Ontologist: Oh, thank you.  Now back to your “mapping”.  I suppose you think of it as a set of arrows and you can have subsets of them.

Mathematician: Yes, pretty much.  If I wanted to be more precise, I would say a mapping is a set of ordered pairs.  I’m going to use an arrow to show what order the things are in; and voila, here is your set diagram for the concept:


You will notice that there are two different relationships:


The pair (Joan, Wells Fargo) is in both sets, so it is in both mappings.  Does that make sense to you?

Ontologist: Yes, I think it makes sense.  In my world, if I cared about both of these types of relationships, I would make isAManagerAt a subProperty of worksFor, and enter the assertion that Joan is a manager at Wells Fargo.  My reasoner would add the inferred relationship that Joan worksFor Wells Fargo.
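The subProperty inference described here is a one-line rule: every pair in the sub-mapping is also in the parent mapping. A minimal Python sketch, using the property names from the dialogue:

```python
# subPropertyOf declarations: child property -> parent property.
sub_property_of = {"isAManagerAt": "worksFor"}

# The only asserted triple.
asserted = {("Joan", "isAManagerAt", "Wells Fargo")}

def close_sub_properties(triples):
    """Add a parent-property triple for every triple using a subproperty."""
    inferred = set(triples)
    for s, p, o in triples:
        parent = sub_property_of.get(p)
        if parent:
            inferred.add((s, parent, o))
    return inferred

# Asserting "Joan isAManagerAt Wells Fargo" yields the inferred
# triple "Joan worksFor Wells Fargo" as well.
print(close_sub_properties(asserted))
```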

Mathematician: Of course!  I think I’ve got the basic idea now.  Let me show you what else I can do with sets.  I’ll even throw in some of your terminology.

Ontologist: Oh, by all means. [O is silently thinking, “I bet this is all in OWL, but hey, the OWL specs don’t have pictures of sets.”]

Mathematician: [takes a deep breath so he can go on and on … ]

Let’s start with two sets:


The intersection is a subset of each set, and each of the sets is a subset of the union.  If we want to use the intersection as a Class, we should be able to infer:

And if we want to use the union as a Class, then each original Class is a subclass of the union:


If two Classes A and B have no members in common (disjoint), then every subclass of A is disjoint from every subclass of B:
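These set-algebra facts can be checked directly with Python’s built-in sets; the member names here are made up for illustration:

```python
# Two small sets with one shared member.
a = {"Joan", "Tom", "Ana"}
b = {"Ana", "Raj"}

both = a & b     # intersection
either = a | b   # union

# The intersection is a subset of each set ...
assert both <= a and both <= b
# ... and each set is a subset of the union.
assert a <= either and b <= either

# Disjointness means no members in common; a and b share Ana, so
# they are NOT disjoint, while {"x"} and {"y"} are.
assert not a.isdisjoint(b)
assert {"x"}.isdisjoint({"y"})
```

The subset relation (`<=`) playing the role of subClassOf is exactly the correspondence the two characters are circling around.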

A mapping where there is at most one arrow out from each starting point is called a function.

A mapping where there is at most one arrow into each ending point is called inverse-functional.


You get the inverse of a mapping by reversing the direction of all the arrows in it.  As the name implies, if a mapping is inverse-functional, it means the inverse is a function.

Sometimes the inverse mapping ends up looking just like the original (called symmetric), and sometimes it is “totally different” (disjoint or asymmetric).

Sometimes a mapping is transitive, like our diagram of inferences with subClassOf, where a subclass of a subclass is a subclass.  I don’t have a nice simple set diagram for that, but our Class diagram is an easy way to visualize it.  Take two hops using the same relationship and you get another instance of the relationship:

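Treating a mapping as a set of ordered pairs makes these characteristics easy to check mechanically. A sketch in Python, with an example mapping invented for illustration:

```python
def is_function(pairs):
    """At most one arrow out of each starting point."""
    starts = [s for s, _ in pairs]
    return len(starts) == len(set(starts))

def is_inverse_functional(pairs):
    """At most one arrow into each ending point."""
    ends = [e for _, e in pairs]
    return len(ends) == len(set(ends))

def inverse(pairs):
    """Reverse the direction of every arrow."""
    return {(e, s) for s, e in pairs}

# Two people, one employer: each person has exactly one arrow out,
# but Wells Fargo has two arrows in.
works_for = {("Joan", "Wells Fargo"), ("Tom", "Wells Fargo")}

print(is_function(works_for))            # one employer per person here
print(is_inverse_functional(works_for))  # two arrows into Wells Fargo
```

Inverse-functional properties are especially useful in practice: if a property like hasEmployeeID is inverse-functional, two records sharing an ID can be inferred to describe the same individual.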

Sets can be defined by combining other sets and mappings, such as the set of all people who work for some bank (any bank).

Ontologist: Not bad.  Here’s what I would add:

Sometimes I define a set by a phrase like you mentioned (worksFor some Bank), and in OWL I can plug that phrase into any expression where a Class name would make sense.  If I want to turn the set into a named Class, I can say the Class is equivalent to the phrase that defines it.  Like this:

BankEmployee is equivalentTo (worksFor some Bank).

The reasoner can often use the phrase to infer things into the Class BankEmployee, or use membership in the Class to infer the conditions in the phrase are true.  A lot of meaning can be added to data this way.  Just as in a dictionary, we define things in terms of other things.
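Inferring membership from a defining phrase like (worksFor some Bank) reduces to a set comprehension. A minimal sketch with made-up data:

```python
# Type assertions and worksFor pairs; the names are illustrative.
types = {"Wells Fargo": {"Bank"}, "Acme": {"Company"}}
works_for = {("Joan", "Wells Fargo"), ("Tom", "Acme")}

def bank_employees():
    """BankEmployee is equivalentTo (worksFor some Bank): everyone who
    works for some thing typed as a Bank."""
    return {person for person, org in works_for
            if "Bank" in types.get(org, set())}

# Joan works for a Bank, so she is inferred into BankEmployee;
# Tom works for a plain Company, so he is not.
print(bank_employees())
```

The "equivalent" in equivalentTo cuts both ways: a real reasoner would also conclude, from an assertion that someone is a BankEmployee, that they work for some (possibly unknown) Bank.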

When two Classes are disjoint, it means they have very distinct and separate meanings.  It’s a really good thing, especially at more general levels.  When we record disjointness in the ontology, the reasoner can use it to detect errors.

Whenever I create a Property, I always check to see if it is a function.  If so, I record the fact that it is a function in the ontology because it sharpens the meaning.

We never really talked about Data Properties.  Maybe next time.  They’re for simple attributes like “the building is 5 stories tall”.

A lot of times, a high level Property can be used instead of creating a new subProperty.  Whenever I consider creating a new subProperty, I ask myself if my triples will be just as meaningful if I use the original Property.  A lot of times, the answer is yes and I can keep my model simple by not creating a new Property.

An ontology is defined in terms of sets of things in the real world, but our database usually does not have a complete set of records for everything defined in the ontology.  So, we should not try to infer too much from the data that is present.  That kind of logic, the open-world assumption, is built into reasoners.

On the flip side, the data can include multiple instances for the same thing, especially when we are linking multiple data sets together.  We can use the sameAs Property to link records that refer to the same real-world thing, or even to link together independently-created graphs.
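Linking records with sameAs amounts to computing connected groups of identifiers. A tiny union-find sketch (the record IDs are invented for illustration):

```python
# sameAs links between records that describe the same real-world thing,
# e.g. the same bank appearing in an HR system, a CRM, and a public graph.
same_as = [("WF-hr", "WF-crm"), ("WF-crm", "wellsfargo-dbpedia")]

parent = {}  # union-find forest: each id points toward its representative

def find(x):
    """Return the canonical representative for x, compressing paths."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    """Merge the groups containing a and b."""
    parent[find(a)] = find(b)

for a, b in same_as:
    union(a, b)

# All three identifiers now resolve to one canonical representative.
print(find("WF-hr") == find("wellsfargo-dbpedia"))
```

OWL’s sameAs is stronger than this (a reasoner also copies every property value across the merged identifiers), but grouping identifiers is the first step.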

The OWL ontology language is explained well at: https://www.w3.org/TR/owl-primer/

However, even if we understand the theory, there are many choices to be made when creating an ontology.  If you are creating an ontology for a business, a great book that covers the practical aspects is Demystifying OWL for the Enterprise by Michael Uschold.

Mathematician: I want the last word.

Ontologist: OK.

Mathematician:

Ontologist: I agree, but that wasn’t a word.  🙂

Mathematician: OK.  I think I’m starting to see what you are doing with ontologies.  Here’s what it looks like to me: since it is based on set logic and triples, the OWL ontology language has a rock-solid foundation.

Written By: Phil Blackwood, Ph.D.

The Data-Centric Revolution: Data-Centric vs. Centralization

We just finished a conversation with a client who was justifiably proud of having centralized what had previously been a very decentralized business function (in this case, it was HR, but it could have been any of a number of functions). They had seemingly achieved many of the benefits of becoming data-centric through centralization: all their data in one place, a single schema (data model) to describe the data, and dozens of decommissioned legacy systems.

We decided to explore whether this was data-centric and the desirable endgame for all their business functions.

A quick review. This is what a typical application looks like:

The metadata is the key. The application, the business logic and the UI are coded to the metadata (Schema), and the data is accessed through and understood by the metadata. What happens in every large enterprise (and most small ones) is that different departments or divisions implement their own applications.


Many of the applications were purchased, and today, some are SaaS (Software as a Service) or built in-house. What they all fail to share is a common schema. The metadata is arbitrarily different and, as such, the code base on top of the metadata is different, so there is no possibility of sharing between departments. Systems integrators try to work out what the data means and piece it together behind the scenes. This is where silos come from. Most large firms don’t have just four silos, they have thousands of them.

One response to this is “centralization.” If you discover that you have implemented, let’s say, dozens of HR systems, you may think it’s time to replace them with one single centralized HR system. And you might think this will make you Data-Centric. And you would be, at least, partially right.

Recall one of the litmus tests for Data-Centricity:

Let’s take a deeper look at the centralization example.


Centralization replaces a lot of siloed systems with one centralized one. This achieves several things. It gets all the data in one place, which makes querying easier. All the data conforms to the same schema (a single shared model). Typically, if this is done with traditional technology, this is not a simple model, nor is it extensible or federate-able, though there is some progress.

The downside is that everyone now must use the same UI and conform to the same model, and that’s the tradeoff.


The tradeoff works pretty well for business domains where the functional variety from division to division is slight, or where the benefit to integration exceeds the loss due to local variation.  For many companies, centralization will work for back office functions like HR, Legal, and some aspects of Accounting.

However, in areas where the local differences are what drives effectiveness and efficiency (sales, production, customer service, or supply chain management) centralization may be too high a price to pay for lack of flexibility.

Let’s look at how Data-Centricity changes the tradeoffs.

Click here to read more on TDAN.com

Does your Data Spark Joy? Part 1

Why is Marie Kondo so popular for home organization?

Marie Kondo released her book, “The Life-Changing Magic of Tidying Up,” almost ten years ago and has since gained renown for motivating millions of people to de-clutter their homes, offices, and lives. Some people are literally buried in their possessions with no clear way to get from room to room.  Others simply struggle to get out the door in the morning because their keys, wallet, and phone play a daily game of hide-and-seek. Whatever the underlying cause of this overwhelm, Marie Kondo offers a simple, clear method for getting stuff under control. Not only that, but she promises that tidying up will clear the spaces in our lives, leaving room for peace and joy.

Why does this method apply to Data-Centric Architecture?

You might be wondering what this has to do with data-centric architecture.  In many ways the Marie Kondo method is easily extrapolated out of the realm of physical possessions and applied to virtual things: bits of data, documents, data storage containers, etc. In the world of information and data, it’s not surprising that people have seen parallels between belongings and data.  That said, it’s not enough to just say that new applications, storage methods, or business processes will solve the problems of information overload, data silos, or dirty data.  Instead, it’s important to examine your company’s data and the business that data serves.

Overarching Data-Centric Principles

For most businesses and agencies, data is essential to function and is ensconced in legal requirements and data lifecycle policy.  It simply isn’t realistic to say, “Throw it all out!”  Instead, the principles behind acquiring, using, storing, and eventually discarding things must be understood.  And in the virtual space, we can understand “things” to be data, metadata, and systems.

Her Method Starts with “Why?”

In her book, Marie Kondo says, “Before you start, visualize your destination.”  And she expands on this, asking readers to think deeply about the question and visualize the outcome of having a tidy space: “Think in concrete terms so that you can vividly picture what it would be like to live in a clutter-free space.” Our clients will often engage us with some ideal data situation in mind.  It might be expressed in terms of requirements or use cases, but it often has to do with being able to harmonize and align data, do large-scale systems integration, or add more sophisticated querying capabilities to existing databases or tools.  In fact, the first steps of our client engagements have to do with developing these questions into statements of work.

Also, we encourage clients to envision their data and what it can tell them independently of applications, systems, and capabilities precisely to avoid the pitfall of thinking in terms of using new tools to solve undefined problems.  It’s uncanny that this method of interrogation into underlying motivations is common between data-centric development and spark-joy tidying up.

Her Method is About the Psychology of Belongings.

It is important to understand how organizations come to have their data.  In the US Government, entire programs are devoted to managing acquisition. In finance, manufacturing, and other industries, the process of acquiring systems and data is often a business unto itself. It’s not uncommon to hear people working with data to refer to “data silos” when talking about partitioned and disconnected collections of data.  Sometimes this data is shuffled into classified folders and proprietary systems unnecessarily, simply because someone wants to retain control of it. In my work at the Federal Government, I found that the process of determining the system of record to be intensely political and time-consuming.  It’s not a trivial process and not simple, but it is essential to the effort of tidying your data-centric environment.

Sort your Data by Category.

Marie Kondo recommends going categorically for a reason.  In her book, she talks about her process of evaluating her belongings by location, drawer by drawer, room by room, and discovering that she found herself organizing multiple drawers with the same things repeatedly.  She tells us, “The root of the problem lies in the fact that people often store the same type of item in more than one place.  When we tidy each place separately, we fail to see that we’re repeating the same work in many locations and become locked into a vicious circle of tidying.” If this doesn’t sound familiar, you aren’t even working with data.

For me, this principle became clear when I gathered all my office supplies in one place. I was astounded by the small mountain of binder clips (and Sharpies) that seemed to materialize out of nowhere. I always seem to be looking for binder clips and Sharpies, so I was shocked by how many I had.

I can think of no closer parallel than the proliferation of siloed systems that appear in each department within an agency.  When I worked for a government agency, I was part of a team whose job it was to survey the offices to find out who was using flight data.  There were several billion-dollar systems in development and in maintenance that held flight data. Over the course of a few years, I would hear quotes about the agency-wide number of flight data systems go from 15 systems, to 20, to 30, and beyond.  It literally became an inside-joke with leadership. And at times, we would hear rumors about some small branch office that had their own Microsoft Access database to keep track of their own data, because they couldn’t get what they needed from the systems of record.  Systems are like the binder clips of enterprise data, except that this kind of proliferation is as easy as making a copy. You don’t even need to make a trip to the office supply store to end up with a pile of duplicates.  If you want to understand how much data redundancy you have, search for specific categories of data across all systems.

Does it spark Joy? What does joy mean in the context of systems and data?

How do you know what sparks joy?  First, look at how the principle of looking for joy is applied.  Presumably, you are in your line of business because on some level it brings you joy – joy that derives from fulfilling a purpose.  Remember the first step of understanding why you are embarking on a transformative process and go back to what you envisioned.  Another way that you can look at joy is whether your space and the things in it allow for that spark to happen.  Ideally, you remove the items from your space that hinder that spark, after acknowledging the lessons they’ve taught you.  Do you feel that spark of joy when you grab your keys in the morning on the way out the door? If you’ve ever tried to find misplaced keys while you’re in a rush, you know the antithesis of joy. Having done the work of creating a space where your keys are easy to find is a way of facilitating joy in your morning routine.

One of the supposed failures of the Marie Kondo method as it applies to data clutter is that it is impossible to physically hold, or even look at, every single piece of data in your system.  Again, rely on the principle behind her method, which is that it is important to be thorough and aim for an environment that facilitates ease and joy.  Don’t say, “We can’t delete any personnel data!” and quit.  Commit to taking an inventory of your personnel systems and the systems that use personnel data. If that process reveals that you have ten different personnel systems and personnel data scattered in several other systems, you must take a closer look at your data environment.  At one point in my physical de-cluttering, I found a tin full of paper clips.  I didn’t handle each shiny paper clip individually; rather, I acknowledged the paper clips served me when I printed more documents onto paper, and since I no longer had a printer, I decided to toss them into the recycling bin.

Remember why you’re considering a solution to data problems in the first place and make a commitment to doing the work of determining your real data needs. Purpose is key, because the way data sparks joy is by enabling you to fulfill that purpose.  This can be difficult where the work you do is abstract and somewhat removed from business that is easy to understand.  However, the critical point to knowing whether or not the data in front of you serves its purpose in your business is to fully understand your business.

Discard and Delete your Data

Take a wardrobe full of clothing, for example.  Many of Marie Kondo’s clients are surprised when they start organizing their wardrobes. It’s surprising when you can see the amount of clothing that is unserviceable, the number of items that still have tags on them, the number of hand-me-downs or gifts that don’t suit you, etc. These items are sometimes difficult to discard for several reasons:

  • It’s kept out of obligation to the giver.
  • It cost a lot of money to buy it.
  • It’s still in good repair.
  • It might be the perfect thing to wear at an unspecified event in the future.
  • It reminds you of the lovely event at which you wore it.
  • It reminds you of the person who left it with you.

It may seem far-fetched to apply these reasons to data storage, but a quick glance through failed data projects will show you otherwise.  Consider the proprietary data locked in a system owned by a vendor for which your license has lapsed, or the system that’s coded in an outdated language whose expert programmers have to be called out of retirement to access, or that directory of data that doesn’t really match the fields in your database, but you requested through a complex data-sharing agreement with another agency.  If you can’t think of an example of a system that has been paid for but hasn’t been used, just consider that the terms shelfware and vaporware exist. It’s easy to be cynical about data precisely because of the overlaps between why we keep things in our closets and garages, and why we keep systems and data in our repositories. When you consider these parallels and understand the principles behind evaluating the items you keep with the hope that they will make your life better, sparking joy becomes easier.

Storage experts are hoarders.

Marie Kondo says you don’t need more storage.  That new Cloud service that can take all the old databases you have and make them accessible is not going to solve your problem. Data storage is expensive, and you do not need a new data storage solution.  What you need is to understand your business process, the business need for the data you believe you have, and a disposition plan for everything else.

How do you start?

In summary, if you’re looking for smart data-centric solutions to help you manage an overwhelming amount of data, or you’re looking for ways to access your vast stores of data in a way that enables smarter business solutions, your bigger issue might be data hoarding.  Looking at your business needs, closely examining the data you have, and coming up with strategies for aligning your data to a manageable data lifecycle can seem overwhelming.  Using a data-centric approach will bring that dream into focus. Keep an eye out for part two of this series to learn how to get your data to spark joy for you.

Click Here to Read Part 2 of this Series

The Data-Centric Revolution: The Sky is Falling (Let’s Make Lemonade)

Recently IDC predicted that IT spending will drop by 5% due to the COVID-19 pandemic.[1] Last week, Gartner went further by predicting that IT spending would drop by 8% or $300 Billion.[2] (Expect a prediction bidding war.) Both were consistent: highest hit areas would be devices, followed by IT service and enterprise software.

The predicted $100 billion drop in those last two categories should send chills through those of us who make our living there. And keep in mind, this drop will occur in the latter half of this year. To date, there have been very few cuts.

But I’m seeing the glass half full here. Half full of lemonade.[3]

Here is my thought process:

  • For at least five years, we have been advocating abandoning the senseless implementation of application after application. (You know: the silo-making industry.) We have made a strong case for avoiding the application-centric quagmire in Software Wasteland.[4]
  • And yet spending on implementing application systems had continued unabated since 2015.
  • With the need to slash budgets in the latter half of 2020, the large application implementation projects will be the easiest section to target.
  • Indeed, the IDC article says that “IT services spending will also decline, mostly due to delays in large projects.”
  • Furthermore, “some firms will cut capital spending and others will either delay new projects or seek to cut costs in other ways.”
  • Gartner reported that “some companies are cutting big IT projects altogether; others are ploughing ahead but delaying some elements of their plans to save money.”
  • Hershey has halted sections of a new ERP system and will drop IT capital spending from the budgeted $500 million to between $400-450 million.
  • Gartner also stated that “health care systems [are] pushing out projects to create digital health records by six months or more.”

This would be a terrible time to be an application software vendor or a systems integrator. The yearly 7% reductions in both categories are still in front of us. Any contract not yet signed will be put on hold. Even contracts in progress may get cancelled.

Click here to read more on TDAN.com