gistBFO: An Open-Source, BFO Compatible Version of gist

Dylan ABNEY a,1, Katherine STUDZINSKI a, Giacomo DE COLLE b,c, Finn WILSON b,c, Federico DONATO b,c, and John BEVERLEY b,c

a Semantic Arts, Inc.

b University at Buffalo

c National Center for Ontological Research

ORCiD ID: Dylan Abney https://orcid.org/0009-0005-4832-2900, Katherine Studzinski https://orcid.org/0009-0001-3933-0643, Giacomo De Colle https://orcid.org/0000-0002-3600-6506, Finn Wilson https://orcid.org/0009-0002-7282-0836, Federico Donato https://orcid.org/0009-0001-6600-240X, John Beverley https://orcid.org/0000-0002-1118-1738

Abstract. gist is an open-source, business-focused ontology actively developed by Semantic Arts. Its lightweight design and use of everyday terminology have made it a useful tool for kickstarting domain ontology development in a range of areas, including finance, government, and pharmaceuticals. The Basic Formal Ontology (BFO) is an ISO/IEC standard upper ontology that has similarly found practical application across a variety of domains, especially biomedicine and defense. Given its demonstrated utility, BFO was recently adopted as a baseline standard in the U.S. Department of Defense and Intelligence Community.

 Because BFO sits at a higher level of abstraction than gist, we see an opportunity  to align gist with BFO and get the benefits of both: one can kickstart domain  ontology development with gist, all the while maintaining an alignment with the  BFO standard. This paper presents such an alignment, which consists primarily of subclass relations from gist classes to BFO classes and includes some subproperty axioms. The union of gist, BFO, and this alignment is what we call “gistBFO.” The upshot is that one can model instance data using gist and then instances of gist classes can be mapped to BFO. This not only achieves compliance with the BFO  standard; it also enables interoperability with other domains already modelled using  BFO. We describe a methodology for aligning gist and BFO, provide rationale for decisions we made about mappings, and detail a vision for future development. 

Keywords. Ontology, upper ontology, ontology alignment, gist, BFO 

1. Introduction 

In this paper, we present an alignment between two upper ontologies: gist and the Basic  Formal Ontology (BFO). While both are upper ontologies, gist and BFO exhibit rather different formal structures. An alignment between these ontologies allows users to get  the benefits of both. 

An ontology is a representational artifact which includes a hierarchy of classes of entities and logical relations between them [1, p.1]. Ontologies are increasingly being  used to integrate diverse sorts of data owing to their emphasis on representing implicit  semantics buried within and across data sets, in the form of classes and logical relations  among them [2]. Such formal representations facilitate semantic interoperability, where diverse data is connected by a common semantic layer. Ontologies have additionally  proven valuable for clarifying the meanings of terms [3] and supporting advanced  reasoning when combined with data, in the form of knowledge graphs [4]. 

The Basic Formal Ontology (BFO) is an upper-level ontology that is used by over  700 open-source ontologies [5]. It is designed to be very small, currently consisting only  of 36 classes, 40 object properties, and 602 axioms [6]. BFO satisfies the conditions for counting as a top-level ontology, described in ISO/IEC 21838-1:2021: it is “…created to represent the categories…shared across a maximally broad range of domains” [7].  ISO/IEC 21838-2:2021 establishes BFO as a top-level ontology standard [8]. The BFO  ecosystem adopts a hub-and-spokes strategy for ontology extensions, where classes in  BFO form a hub, and new subclasses of BFO classes are made as spokes branching out  from it. Interoperability between different ontologies can be preserved by linking up to  BFO as a common hub. All classes in BFO are subclasses of bfo:Entity2 [9], which  includes everything that has, does, or will exist. Within this scope, BFO draws a fundamental distinction with two classes: bfo:Continuant and bfo:Occurrent.  Roughly, a continuant is a thing that persists over some amount of time, whereas an  occurrent is something that happens over time [1]. A chef is an example of a continuant,  and an act of cooking is an example of an occurrent. 

gist is a business-focused upper-level ontology that has been developed over the last 15+ years and used in over 100 commercial implementations [10]. Ontology elements found in gist leverage everyday labels in the interest of facilitating stakeholder understanding to support rapid modeling. Much like BFO, gist contains a relatively small number of terms, relations, and formally specified axioms: it has 98 classes, 63 object properties, 50 datatype properties, and approximately 1400 axioms at the time of this writing. Approximately 20 classes are at the highest level of the gist class hierarchy. Subclasses are defined using a distinctionary pattern,3 which includes using a subclass axiom along with disjointness axioms and property restrictions to distinguish a class from its parents and siblings. gist favors property restrictions over domain and range axioms to maintain generality and avoid a proliferation of properties [12]. Commonly used top-level classes include gist:Commitment, gist:Event, gist:Organization, gist:PhysicalIdentifiableItem, and gist:Place.

Ontology alignments in general are useful because they allow interoperability between ontologies and consequently help prevent what has been called the ontology silo  problem, which arises when ontologies covering the same domain are constructed independently from one another, using differing syntax and semantics [13]. Ontologists  typically leverage the Resource Description Framework (RDF) and vocabularies extended from it, to maintain flexibility when storing data into graphs, which goes some  way to address silo problems. If, however, data is represented in RDF using different ontologies, enriched with different semantics, then ontology silo problems emerge.  Alignment between potential ontology silos can address this problem by allowing the data to be interpreted by each aligned ontology. 

Needless to say, given the respective scopes of gist and BFO, as well as their overlapping users and domains, we have identified them as ontology silos worth aligning.  For current users of gist, alignment provides a way to leverage BFO without requiring  

2 We adopt the convention of displaying class names in bold, prepended with a namespace identifier indicating provenance.
3 The distinctionary pattern outlined in [11] is like the Aristotelian approach described in [1].

any additional implementation. For new users of gist, it provides a pragmatic base for building domain ontologies. This is of particular importance as BFO was recently adopted as a baseline standard in the U.S. Department of Defense and Intelligence  Community [14]. For stakeholders in both the open and closed space, the alignment proposed here will allow users to model a domain in gist and align with other ontologies in the BFO ecosystem, satisfying the requirements of leveraging an ISO standard. In the other direction, users of BFO will be able to leverage domain representations in gist,  gaining insights into novel modeling patterns, potential gaps in the ecosystem, and avenues for future modeling research. 

2. Methodology 

In this section we discuss the process we used to build what we call “gistBFO,” an ontology containing a semantic alignment between gist and BFO. We started by creating an RDF Turtle file that would eventually contain all the mappings, and then manually worked out the connections between gist and BFO, starting from the upper-level classes of both ontologies. We specified that our new ontology imports both the gist and BFO ontologies, complete with their respective terms and axioms. To make use of gistBFO, it can be imported into a domain ontology that currently uses gist.
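A minimal sketch of the resulting ontology header follows (the ontology and import IRIs shown here are illustrative; see the gistBFO repository [16] for the canonical ones):

@prefix owl: <http://www.w3.org/2002/07/owl#> .

# gistBFO imports gist and BFO so that all three sets of terms and axioms
# become available to any domain ontology that imports gistBFO.
<https://w3id.org/semanticarts/ontology/gistBFO>
    a owl:Ontology ;
    owl:imports <https://w3id.org/semanticarts/ontology/gistCore> ,   # gist (IRI illustrative)
                <http://purl.obolibrary.org/obo/bfo.owl> .            # BFO (IRI illustrative)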

Figure 1. gistBFO import hierarchy4 

2.1. Design principles 

To describe our methodology, it is helpful to distinguish between alignments and mappings [15]. By alignment we mean a set of assertions (or “triples”) of the form <s, p,  o> that relate the terms of one ontology to another. gistBFO contains such an alignment.  The individual assertions contained within an alignment are mappings.  

4 This diagram is adapted from a similar diagram in [10].

gist:Specification subClassOf bfo:GenericallyDependentContinuant5 (bfo:GDC, hereafter) is an example of one mapping in gistBFO’s alignment [16].

By way of evaluation, we have designed gistBFO to exhibit a number of important properties: consistency, coherence, conservativity, specificity, and faithfulness [17]. An ontology containing an alignment is consistent just in case its mappings and the component ontologies do not logically entail a contradiction. For example, if a set of assertions entails both that Bird is equivalent to NonBird and that NonBird is equivalent to the complement of Bird, then it is inconsistent. Relatedly, such an ontology is coherent just in case all of its classes are satisfiable. In designing an ontology, a common mistake is creating an unsatisfiable class—a class that cannot have members on pain of a contradiction.6 Suppose a class A is defined as a subclass of both B and the complement of B. Anything asserted as a member of A would be inferred to be a member of B and its complement, resulting in a contradiction. Note that the definition of A itself does not result in a logical inconsistency; it is only when an instance is asserted to be a member of A that a contradiction is generated.

Consistency and coherence apply to gistBFO as a whole (i.e., the union of gist, BFO,  and the alignment between them). The next several apply more specifically to the alignment. 

An alignment is conservative just in case it does not add any logical entailments within the aligned ontologies.7 Trivially, gistBFO allows more to be inferred than either gist or BFO alone, since it combines the two ontologies and adds mapping assertions between them. However, it should not imply anything new within gist or BFO, which would effectively change the meanings of terms within the ontologies. For example, gist never claims that gist:Content subClassOf gist:Event. If gistBFO were to imply this, it would not only be ontologically suspect, but it would extend gist in a non-conservative manner, effectively changing the meaning of gist:Content. Similarly, BFO never claims that bfo:GDC subClassOf bfo:Process (again, for good reason); so if gistBFO were to imply this, this too would make it a non-conservative extension, changing the content of BFO itself. It is desirable for the alignment to be a conservative extension of gist and BFO so that it does not change the meaning of terms within gist or BFO. By the same token, if gistBFO were to remove axioms from gist or BFO, this would need to be handled carefully so that it too preserves the spirit of the two ontologies. (More on this in Section 4.1.1.) Additionally, if gistBFO does not remove any axioms from gist or BFO, there is no need to maintain separate artifacts with modified axioms.

An alignment is specific to the extent that terms from the ontologies are related to the most specific terms possible. For example, one possible alignment between gist and BFO would contain mappings from each top-level gist class to bfo:Entity. While this would constitute a bona fide alignment by our characterization above, it is not an interesting or useful one. If it achieves BFO-compliance, it is only in a trivial sense. For this reason, we aimed to be specific with our alignment and mapped gist classes to the lowest BFO classes that were appropriate.

5 Strictly speaking, the IRI for generically dependent continuant in BFO is obo:BFO_0000031, but we use bfo:GenericallyDependentContinuant (and bfo:GDC for short). The actual subclass relation used in the alignment is rdfs:subClassOf, but the namespace prefix is dropped for brevity.
6 In OWL, unsatisfiable classes are subclasses of owl:Nothing, the class containing no members. It is analogous to the set-theoretic notion of “the empty set.”
7 See [17, p.3] for a more formal explanation of conservativity in the context of an alignment.

An alignment is faithful to the extent that it respects the intended meanings of the terms in each ontology. Intent is not always obvious, but it can often be gleaned from  formal definitions, informal definitions/annotations, and external sources. 

We aim in this work for gistBFO to exhibit the above properties. Note also that two ontologies are said to be synonymous just in case anything expressed in one ontology can be expressed in terms of the other (and vice versa) [18]. We do not attempt to establish synonymy with this alignment. First, for present purposes, our strategy is to model in gist and then move to BFO, not the other way around. Second, the alignment in its current form consists primarily of subclass assertions from gist classes to BFO classes. With an accurate subclassing bridge, instances modeled in gist would then achieve an initial level of BFO-compliance, as instances can be inferred into BFO classes. A richer mapping  might be able to take an instance modeled in gist and then translate that entirely into  BFO, preserving as much meaning as possible. For example, something modeled as a  gist:Event with gist:startDateTime and gist:endDateTime might be modeled as a  bfo:Process related to a bfo:TemporalRegion. We gesture at some more of these richer  mappings in the Conclusion section, noting that our ultimate plan is to investigate these  richer mappings in the future. So, while we do not attempt to establish synonymy  between gist and BFO at present, we do have a goal of preserving as much meaning as  possible in the alignment here, and plan to expand this work in the near future. In that  respect, our work here provides a firm foundation for a richer, more complex, semantic  alignment between gist and BFO. 

Given our aim of creating a BFO-compliant version of gist, we have created a consistent, coherent, conservative, specific, and faithful ontology. Since both gist and  BFO are represented in the OWL DL profile, consistency and coherence were established using HermiT, a DL reasoner [19, 20]. By running the reasoner, we were able to establish that no logical inconsistencies or unsatisfiable classes were generated. While it is  undecidable in OWL 2 DL whether an alignment is a conservative extension, one can  evaluate the approximate deductive difference by looking more specifically at the  subsumption relations that hold between named classes in gist or BFO.8 We checked, for example, that no new entailments between gist classes were introduced. Specificity and  faithfulness are not as easily measured, but we detail relevant design choices in the  Discussion section as justification for believing our alignment exhibits these properties  as well. 

2.2. Identifying the mappings 

The properties detailed in Section 2.1 give a sense of our methodological aims for gistBFO. Now we turn to our methods for creating the mappings within the alignment. In our initial development of the alignment, we leveraged the BFO Classifier [22].  Included in the BFO Classifier was a decision diagram that allowed us to take instances of gist classes, answer simple questions, and arrive at a highly plausible candidate superclass in BFO. For example, consider a blueprint for a home. In gist, a blueprint  would fall under gist:Specification. To see where a blueprint might fall in BFO, we  answered the following questions: 

8 The set of changed subsumption entailments from combining ontologies with mappings has been called the approximate deductive difference [17, p.3; 21].

Q: Does this entity persist in time or unfold in time? A: It persists. So, a  blueprint is a bfo:Continuant

Q: Is this entity a property of another entity or depends on at least one other  entity? A: Yes, a blueprint depends on another entity (e.g., a sheet of paper) to be represented. 

Q: May the entity be copied between a number of bearers? A: Yes, a blueprint can be copied across multiple sheets of paper. So, a blueprint is a bfo:GDC

Given that blueprints are members of gist:Specification and bfo:GDC (at least  according to our answer above), bfo:GDC was considered a plausible candidate  superclass for gist:Specification. And indeed, as we think about all the possible  instances of gist:Specification, they all seem like they would fall under bfo:GDC

Our alignment was not produced solely with the BFO Classifier. Our teams include lead developers, stakeholders, and users of both gist and BFO. Classification was refined through consensus-driven meetings, where the meanings of ontology elements in the respective structures were discussed, debated, and clarified. Thus, while the BFO Classifier tool provided a very helpful starting point for discussions of alignment, considerable effort went into identifying and verifying the gist and BFO mappings so that they would be as accurate as possible.

Tables 1 and 2 contain a non-exhaustive list of important classes, properties, and definitions from gist and BFO that we refer to throughout the paper.

Continuant: An entity that persists, endures, or continues to exist through time while maintaining its identity.
Independent Continuant: A continuant which is such that there is no x such that it specifically depends on x and no y such that it generically depends on y.
Specifically Dependent Continuant: A continuant which is such that (i) there is some independent continuant x that is not a spatial region, and which (ii) specifically depends on x.
Generically Dependent Continuant: An entity that exists in virtue of the fact that there is at least one of what may be multiple copies.
Material Entity: An independent continuant that at all times at which it exists has some portion of matter as continuant part.
Immaterial Entity: An independent continuant which is such that there is no time t when it has a material entity as continuant part.
Object: A material entity which manifests causal unity and is of a type instances of which are maximal relative to the sort of causal unity manifested.
Occurrent: An entity that unfolds itself in time or is the start or end of such an entity or is a temporal or spatiotemporal region.
Process: An occurrent that has some temporal proper part and for some time has a material entity as participant.
Table 1. Selected BFO classes and definitions [6]

Event: Something that occurs over a period of time, often characterized as an activity being carried out by some person, organization, or software application or brought about by natural forces.
Organization: A generic organization that can be formal or informal, legal or non-legal. It can have members, or not.
Building: A relatively permanent man-made structure situated on a plot of land, having a roof and walls, commonly used for dwelling, entertaining, or working.
Unit of Measure: A standard amount used to measure or specify things.
Physical Identifiable Item: A discrete physical object which, if subdivided, will result in parts that are distinguishable in nature from the whole and in general also from the other parts.
Specification: One or more characteristics that specify what it means to be a particular type of thing, such as a material, product, service or event. A specification is sufficiently precise to allow evaluating conformance to the specification.
Intention: Goal, desire, aspiration. This is the “teleological” aspect of the system that indicates things are done with a purpose.
Temporal Relation: A relationship existing for a period of time.
Category: A concept or label used to categorize other instances without specifying any formal semantics. Things that can be thought of as types are often categories.
Collection: A grouping of things.
Is Categorized By: Points to a taxonomy item or other less formally defined class.
Is Member Of: Relates a member individual to the thing, such as a collection or organization, that it is a member of.
Table 2. Selected gist classes and definitions [23]

3. Results 

The gistBFO alignment contains 43 logical axioms. 35 of these axioms are subclass assertions relating gist classes to more general classes in BFO. All gist classes have a superclass in BFO.9 The remaining eight axioms are subproperty assertions. We focused  on mapping key properties in gist (e.g., gist:isCategorizedBy and gist:isMemberOf) to  BFO properties. While mapping gist properties to more specific properties in BFO does  not serve the use case of starting with gist and inferring into BFO, it nevertheless  provides a richer connection between the ontologies, which we view as a worthy goal. 

In addition to these 43 logical axioms, gistBFO also contains annotations expressing  the rationale behind some of the mapping choices. We created an annotation property  gist:bfoMappingNote for this purpose. 
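For instance, a mapping note can be attached to a mapping axiom using the standard OWL 2 axiom-annotation pattern. The following is a sketch only (prefixes omitted for brevity; the note text is ours, and the exact placement of gist:bfoMappingNote in gistBFO may differ):

# The asserted mapping
gist:Specification  rdfs:subClassOf  obo:BFO_0000031 .   # bfo:GDC

# A rationale note attached to that axiom via owl:Axiom reification
[] a owl:Axiom ;
   owl:annotatedSource    gist:Specification ;
   owl:annotatedProperty  rdfs:subClassOf ;
   owl:annotatedTarget    obo:BFO_0000031 ;
   gist:bfoMappingNote    "Specifications are copyable patterns that depend on some bearer, so they are classified as generically dependent continuants." .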

At the highest level, almost all classes in gist fall under bfo:Continuant, since their  instances are things that persist through time rather than unfold over time. Exceptions to  this are instances falling under gist:Event and its subclasses, which (generally) fall under  bfo:Occurrent

Some of the gist subclasses of bfo:Continuant include gist:Collection, gist:PhysicalIdentifiableItem, and gist:Content. Within BFO, continuants break down into bfo:IndependentContinuant (entities that bear properties), bfo:GDC (copyable patterns that are often about other entities), and bfo:SDC (properties borne by independent continuants). With respect to our alignment, introduced subclasses of bfo:IndependentContinuant include gist:Building, gist:Component, and other material entities such as gist:PhysicalSubstance.10 Subclasses of bfo:GDC include gist:Content, gist:Language, gist:Specification, gist:UnitOfMeasure, and

9 An exception is gist:Artifact, which, in addition to being difficult to place in BFO, is slated for removal from gist.
10 Best practice in BFO is to avoid mass terms [1], whereas gist:PhysicalSubstance is intentionally designed to represent them—e.g., a particular amount of sand. Regardless, this class of mass terms would map into a subclass of bfo:IndependentContinuant.

gist:Template—all things that can be copied across multiple bearers.11 gist:TemporalRelation, a relational quality holding between multiple entities, is a subclass of bfo:SDC.

In most cases, the subclass assertions are simple in construction, relating a named  class in gist to a named class in BFO, for example, gist:Specification subClassOf  bfo:GDC. A more complex pattern involves the use of OWL property restrictions. For  example, gist:ControlledVocabulary was asserted to be a subclass of bfo:GDCs that  have some bfo:GDC as a continuant part. 

gist:ControlledVocabulary
  rdfs:subClassOf [
    a owl:Class ;
    owl:intersectionOf (
      obo:BFO_0000031                        # class = bfo:GDC
      [
        a owl:Restriction ;
        owl:onProperty obo:BFO_0000178 ;     # property = bfo:hasContinuantPart
        owl:someValuesFrom obo:BFO_0000031   # class = bfo:GDC
      ]
    )
  ] .

In other cases, we employed a union pattern—e.g., gist:Intention is a subclass of the  union of bfo:SDC and bfo:GDC. Had we chosen a single named superclass in BFO for  gist:Intention, it might have been bfo:Continuant. The union pattern, however, allows  our mapping to exhibit greater specificity, as discussed above.  

Figures 2 through 4 illustrate important subclass relationships between gist and BFO  classes: 

Figure 2. Continuants in gist 

11 Many of these can be understood as various sorts of ‘information’, which should be classified under bfo:GDC. For example, units of measurement are standardized information that describes some magnitude of quantity.

Figure 3. Independent and dependent continuants in gist 

Figure 4. gist:Event 

4. Discussion 

In this section we discuss in depth some specific mappings we made, focusing most  closely on some challenging cases. 

4.1.1. gist:Intention and gist:Specification 

One challenging case was gist:Intention and its subclass gist:Specification. The textual definition of gist:Intention suggests it is a mental state that is plausibly placed under bfo:SDC. That said, the textual definition of gist:Specification (think of a blueprint) suggests this class plausibly falls under bfo:GDC. Given that bfo:SDC and bfo:GDC are disjoint in BFO, mapping the two classes this way would make gist:Specification unsatisfiable, yielding a contradiction as soon as it is instantiated. We thus appear to have encountered a genuine logical challenge to our mapping.

Exploring strategies for continuing our effort, we considered importing a “relaxed” version of BFO that drops the disjointness axiom between bfo:SDC and bfo:GDC. Arguably this option would respect the spirit of gist (by placing gist:Intention and gist:Specification in their true homes in BFO) while losing a bit of the spirit of BFO. While this may appear to be an unsatisfactory mapping strategy, we maintain that, if such relaxing of constraints is properly documented and tracked, there is considerable benefit in adopting such a strategy. Given two ontologies developed independently of one another, there are likely genuine semantic differences between them, differences that cannot be adequately addressed by simply adopting different labels. Clarifying, as much as possible, what those differences are can be incredibly valuable when representing data using each ontology structure. Putting this another way, if, say, gist and BFO exhibited some 1-1 semantic mapping so that everything in gist corresponds to something in BFO and vice versa, it would follow that the languages of gist and BFO were simply two

different ways to talk about the same domain. We find this less interesting, to be candid,  than formalizing the semantic overlap between these structures, and noting precisely  where they semantically differ. One way in which such differences might be recorded is  by observing and documenting—as suggested in this option—where logical constraints  such as disjointness might need to be relaxed in alignment. 

That stated, relaxing constraints should be the last option pursued, not the first, since for the benefits highlighted above to manifest, it is incumbent on us to identify where exactly there is semantic alignment, and to formalize this as clearly as possible. With that in mind, we pursue another option here, namely, to use a disjunctive definition for gist:Intention, asserting it to be a subclass of the union of bfo:GDC and bfo:SDC. While this disjunctive definition perhaps does not square perfectly with the text definition of gist:Intention, it does seem to be in the spirit of how gist:Intention is actually used—sometimes like a bfo:SDC (in the case of a gist:Function), sometimes like a bfo:GDC (in the case of a gist:Specification). This option does not require a modified version of BFO. It also aligns with our goal of exhibiting specificity in our mapping, since otherwise we would have been inclined to assert gist:Intention to simply be a subclass of bfo:Continuant.

gist:Intention
  rdfs:subClassOf [
    a owl:Class ;
    owl:unionOf (
      obo:BFO_0000020   # bfo:SDC
      obo:BFO_0000031   # bfo:GDC
    )
  ] .

This mapping arguably captures the spirit of both gist and BFO while remaining  conservative—i.e., it does not change any of the logical entailments within gist or BFO. 

4.1.2. gist:Organization 

gist:Organization was another interesting case. During the mapping we consulted the Common Core Ontologies (CCO), a suite of mid-level ontologies extended from BFO, for guidance, since it includes an organization class [24]. cco:Organization falls under bfo:ObjectAggregate. Arguably, however, organizations can be understood as something over and above the aggregate of their members, perhaps even persisting when there are no members. For this reason, we considered bfo:ImmaterialEntity and bfo:GDC as superclasses of gist:Organization. On the one hand, the challenge with asserting that gist:Organization is a subclass of bfo:ImmaterialEntity is that instances of the latter cannot have material parts, and yet organizations often do, namely their members. On the other hand, there is plausibly a sense in which organizations can be understood as, say, prescriptions or directions (bfo:GDC) for how members occupying positions in that organization should behave, whether or not there ever are actual members. The CCO characterization of organization does not seem to reflect this sense, given that it is defined in terms of members. It was thus important for our team to clarify which sense, if either or both, was best reflected in gist:Organization.

Ultimately, we opted for asserting bfo:ObjectAggregate as the superclass for  gist:Organization, as the predominant sense in which the latter is to be understood  concerns members of such entities. This is, importantly, not to say there are not genuine 

alternative senses of organization worth modeling in both gist and within the BFO ecosystem; rather, it is to say that, after reflection, the sense most clearly at play for gist:Organization involves membership. For some gist classes, annotations and examples made it clear that they belonged under a certain BFO class. In the case of gist:Organization, gist is arguably neutral with respect to a few candidate superclasses. Typically what is most important in an enterprise context is modeling organizational structure (with sub-organizations) and organization membership. Perhaps this alone does not require that gist:Organization be understood as a bfo:ObjectAggregate; nevertheless, practical considerations pointed in favor of it. Adopting this subclassing has the benefit of consistency with CCO (and a fortiori BFO) and allows for easy modeling of organization membership in terms of BFO.

4.1.3. gist:Event 

At a first pass, a natural superclass (or even equivalent class) for gist:Event is bfo:Process. After all, ‘event’ is an alternative label for bfo:Process in BFO. Upon further evaluation, it became clear that some instances of gist:Event would not be instances of bfo:Process—namely, future events. In BFO, with its realist interpretation, processes must have occurred (or be occurring) in order to be represented. It is in this way that BFO differentiates how the world could be, e.g., this portion of sodium chloride could dissolve, from how the world is, e.g., this portion of sodium chloride dissolves. Future events can instead be modeled as specifications, ultimately falling under bfo:GDC. In contrast, a subclass of gist:Event, namely gist:ScheduledEvent, includes within its scope events that have not yet started. There is thus not a straightforward mapping between bfo:Process and gist:Event. Following our more conservative strategy, however, the identified discrepancy can be accommodated by asserting that gist:Event is a subclass of the union of bfo:GDC and bfo:Process.12 In this respect, we are able to represent instances of gist:Event that have started (as instances of bfo:Process) and those that have not (as instances of bfo:GDC).

gist:Event
  rdfs:subClassOf [
    a owl:Class ;
    owl:unionOf (
      obo:BFO_0000031   # bfo:GDC
      obo:BFO_0000015   # bfo:Process
    )
  ] .
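As noted in footnote 12, the presence of gist:actualStartDateTime could be used to automate the flip from bfo:GDC to bfo:Process. One way to express that idea in OWL is a general class axiom along the following lines (a sketch only; gistBFO does not necessarily include such an axiom, and we assume here that gist:actualStartDateTime takes xsd:dateTime values):

# Any gist:Event with an actual start date-time is classified as a bfo:Process
[ a owl:Class ;
  owl:intersectionOf (
    gist:Event
    [ a owl:Restriction ;
      owl:onProperty gist:actualStartDateTime ;
      owl:someValuesFrom xsd:dateTime ]
  )
] rdfs:subClassOf obo:BFO_0000015 .   # bfo:Process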

4.1.4. gist:Category 

gist:Category is a commonly used class in gist. It allows one to categorize an entity  without introducing a new class into an ontology. It guards against the proliferation of  classes with little or no semantics; instead, such categories are treated as instances, which  

12 It is common in gist to model planned-event-turned-actual-events as single entities that persist through both stages. When a plan goes from being merely planned to actually starting, it can flip from a bfo:GDC to a bfo:Process. Events that have a gist:actualStartDateTime will be instances of bfo:Process, and the presence of this property could be used to automate the flip. Different subclasses of gist:Event will be handled differently—e.g., gist:HistoricalEvent is a subclass of bfo:Process that would not require the transition from bfo:GDC.

are related to entities by the predicate gist:isCategorizedBy. So, for example, one might have an assertion like ex:_Car_1 gist:isCategorizedBy ex:_TransmissionType_manual, where the object of this triple is an instance of ex:TransmissionType, which would be a subclass of gist:Category.

If one thinks of BFO as an ontology of particulars, and if instances of gist:Category are not particulars but instead types of things, then arguably gist:Category does not have  a home in BFO. 

Nevertheless, as a commonly-used class in gist, it is helpful to find a place for it in  BFO if possible. One option is bfo:SDC: indeed, there are some classes in CCO (e.g.,  cco:EyeColor) that seem like they could be subclasses of gist:Category. However,  instances of bfo:SDC (e.g., qualities and dispositions) are individuated by the things that  bear them (e.g., the eye color of a particular person), which does not seem to capture the  spirit of gist:Category. Ultimately, we opted for bfo:GDC as the superclass in part  because of the similarity of instances of gist:Category to information content entities in  CCO, which are bfo:GDCs. 
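Put as a minimal Turtle sketch (prefixes omitted; the ex: names are the illustrative ones used above), this choice lets category instances modeled in gist be classified under BFO by inference:

# Instance data modeled with gist
ex:_Car_1  gist:isCategorizedBy  ex:_TransmissionType_manual .
ex:_TransmissionType_manual  a  ex:TransmissionType .
ex:TransmissionType  rdfs:subClassOf  gist:Category .

# Alignment axiom in gistBFO
gist:Category  rdfs:subClassOf  obo:BFO_0000031 .   # bfo:GDC

# A DL reasoner then infers:
#   ex:_TransmissionType_manual  a  gist:Category , obo:BFO_0000031 .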

5. Conclusion 

5.1. Future work 

We have established a foundational mapping between gist and BFO. From this  foundation going forward we aim to improve gistBFO along multiple dimensions. The  first set of improvements relate to faithfulness. While we are confident in many of the  mappings we have made, we expect the alignment to become more and more accurate as  we continue development. In some cases, the intended meanings of concepts are obvious  from formal definitions and annotations. In other cases, intended meaning is best  understood by discussions about how the concepts are used in practice. As we continue  discussions with practitioners of gist and BFO, the alignment will continue to improve. 

Another aim related to faithfulness is to identify richer mappings. In its current form gistBFO allows instance data modeled under gist to be inferred into BFO superclasses. While this achieves an initial connection with BFO, a deeper mapping could take something modeled in gist and translate it to BFO. Revisiting the previous example, something modeled as a gist:Event with gist:startDateTime and gist:endDateTime might be modeled as a bfo:Process related to a bfo:TemporalRegion. Many of these types of modeling patterns can be gleaned from formal definitions and annotations, but they do not always tell the whole story. Again, this is a place where continued discussions with practitioners of both ontologies can help. From a practical perspective, more complex mappings like these could be developed using a rule language (e.g., Datalog or SWRL) or SPARQL INSERT queries; a sketch of the intended translation follows.
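As a rough illustration of the kind of translation we have in mind (a sketch only: the ex: individuals and literal values are invented, and the relation IRI shown for ‘occupies temporal region’ is an assumption to be checked against the BFO release in use):

# Instance data as modeled in gist
ex:_Kickoff_1  a  gist:Event ;
    gist:startDateTime  "2025-01-15T09:00:00Z"^^xsd:dateTime ;
    gist:endDateTime    "2025-01-15T10:00:00Z"^^xsd:dateTime .

# A richer BFO rendering that a rule or SPARQL INSERT query could generate
ex:_Kickoff_1  a  obo:BFO_0000015 ;                 # bfo:Process
    obo:BFO_0000199  ex:_KickoffInterval_1 .        # 'occupies temporal region' (IRI assumed)
ex:_KickoffInterval_1  a  obo:BFO_0000008 .         # bfo:TemporalRegion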

We have also considered alignment with the Common Core Ontologies (CCO). One of the challenges with this alignment is that gist and CCO sit at similar levels of abstraction. Indeed, gist and CCO even appear to share classes that exhibit overlapping semantics, e.g., language and organization. The similar level of abstraction creates a challenge because it is not always easy to determine which classes are more general than which. For example, are gist:Organization and cco:Organization equivalent, or is one a superclass of the other? Furthermore, because there are considerably more classes in CCO than in BFO, preserving consistency with a growing set of alignment axioms becomes

more of a concern. Despite the challenges, a mapping between gist and CCO would help  with interoperability, and it is a topic we intend to pursue in the future to that end. 

5.2. Final remarks 

We have presented an open-source alignment between gist and BFO. We described a  methodology for identifying mappings, provided rationale for the mappings we made,  and outlined a vision for future development. Our hope is that gistBFO can serve as a  practical tool, promoting easier domain ontology development and enabling  interoperability. 

Acknowledgements 

Thank you to Dave McComb for support at various stages of the gistBFO design process,  from big-picture discussions to input on specific mappings. Thanks also to Michael  Uschold and Ryan Hohimer for helpful discussions about gistBFO. 

References 

[1] Arp R, Smith B, Spear AD. Building Ontologies with Basic Formal Ontology. Cambridge, Massachusetts: The MIT Press; 2015. p. 220.
[2] Hoehndorf R, Schofield PN, Gkoutos GV. The role of ontologies in biological and biomedical research: a functional perspective. Brief Bioinform. 2015 Nov;16(6):1069–80, doi:10.1093/bib/bbv011
[3] Neuhaus F, Hastings J. Ontology development is consensus creation, not (merely) representation. Applied Ontology. 2022;17(4):495–513, doi:10.3233/AO-220273
[4] Chen X, Jia S, Xiang Y. A review: Knowledge reasoning over knowledge graph. Expert Systems with Applications. 2020 Mar;141:112948, doi:10.1016/j.eswa.2019.112948
[5] Basic Formal Ontology Users [Internet]. Available from: https://basic-formal-ontology.org/users.html
[6] GitHub [Internet]. Basic Formal Ontology (BFO) Wiki – Home. Available from: https://github.com/BFO-ontology/BFO/wiki/Home
[7] ISO/IEC 21838-1:2021: Information technology — Top-level ontologies (TLO) Part 1: Requirements [Internet]. Available from: https://www.iso.org/standard/71954.html
[8] ISO/IEC 21838-2:2021: Information technology — Top-level ontologies (TLO) Part 2: Basic Formal Ontology (BFO) [Internet]. Available from: https://www.iso.org/standard/74572.html
[9] Otte J, Beverley J, Ruttenberg A. Basic Formal Ontology: Case Studies. Applied Ontology. 2021 Aug;17(1), doi:10.3233/AO-220262
[10] McComb D. A BFO-ready Version of gist [Internet]. Semantic Arts. Available from: https://www.semanticarts.com/wp-content/uploads/2025/01/20241024-BFO-and-gist-Article.pdf
[11] McComb D. The Distinctionary [Internet]. Semantic Arts; 2015 Feb. Available from: https://www.semanticarts.com/white-paper-the-distinctionary/
[12] Carey D. Avoiding Property Proliferation [Internet]. Semantic Arts. Available from: https://www.semanticarts.com/wp-content/uploads/2018/10/AvoidingPropertyProliferation012717.pdf
[13] Trojahn C, Vieira R, Schmidt D, Pease A, Guizzardi G. Foundational ontologies meet ontology matching: A survey. Semantic Web. 2022;13(4):685–704, doi:10.3233/SW-210447
[14] Gambini B. Department of Defense, Intelligence Community adopt resource developed by UB ontologists [Internet]. News Center. 2024 [cited 2025 Mar 30]. Available from: https://www.buffalo.edu/news/releases/2024/02/department-of-defense-ontology.html
[15] Euzenat J, Shvaiko P. Ontology Matching. 2nd ed. Heidelberg: Springer; 2013. doi:10.1007/978-3-642-38721-0
[16] GitHub [Internet]. gistBFO. Available from: https://github.com/semanticarts/gistBFO
[17] Prudhomme T, De Colle G, Liebers A, Sculley A, Xie P “Karl”, Cohen S, Beverley J. A semantic approach to mapping the Provenance Ontology to Basic Formal Ontology. Sci Data. 2025 Feb 17;12(1):282, doi:10.1038/s41597-025-04580-1
[18] Aameri B, Grüninger M. A New Look at Ontology Correctness. Logical Formalizations of Commonsense Reasoning: Papers from the 2015 AAAI Spring Symposium; 2015. doi:10.1613/jair.5339
[19] Shearer R, Motik B, Horrocks I. HermiT: A highly-efficient OWL reasoner. OWLED; 2008. Available from: https://ceur-ws.org/Vol-432/owled2008eu_submission_12.pdf
[20] Glimm B, Horrocks I, Motik B, Stoilos G, Wang Z. HermiT: an OWL 2 reasoner. Journal of Automated Reasoning. 2014;53:245–269, doi:10.1007/s10817-014-9305-1
[21] Solimando A, Jiménez-Ruiz E, Guerrini G. Minimizing conservativity violations in ontology alignments: algorithms and evaluation. Knowl Inf Syst. 2017;51:775–819, doi:10.1007/s10115-016-0983-3
[22] Emeruem C, Keet CM, Khan ZC, Wang S. BFO Classifier: Aligning Domain Ontologies to BFO. 8th Joint Ontology Workshops; 2022.
[23] GitHub [Internet]. gist. Available from: https://github.com/semanticarts/gist
[24] Jensen M, De Colle G, Kindya S, More C, Cox AP, Beverley J. The Common Core Ontologies. 14th International Conference on Formal Ontology in Information Systems; 2024. doi:10.48550/arXiv.2404.17758

How a “User” Knowledge Graph Can Help Change Data Culture

Identity and Access Management (IAM) has had the same problem since Fernando Corbató of MIT first dreamed up the idea of digital passwords in 1960: opacity. Identity in the physical world is rich and well-articulated, with a wealth of different ways to verify information on individual humans and devices. By contrast, the digital realm has been identity-data impoverished, cryptic, and inflexible for over 60 years now.

Jans Aasman, CEO of Franz, provider of the entity-event knowledge graph solution AllegroGraph, envisions a “user” knowledge graph as a flexible and more manageable data-centric solution to the IAM challenge. He presented on the topic at this past summer’s Data-Centric Architecture Forum, which Semantic Arts hosted near its headquarters in Fort Collins, Colorado.

Consider the specificity of a semantic graph and how it could facilitate secure access control. Knowledge graphs constructed of subject-predicate-object triples make it possible to set rules and filters in an articulated and yet straightforward manner. Information about individuals that’s been collected for other HR purposes  could enable this more precise filtering. 

For example, Jans could disallow others’ access to a triple that connects “Jans”  and “salary”. Or he could disallow access to certain predicates. 

Identity and access management vendors call this method Attribute-Based  Access Control (ABAC). Attributes include many different characteristics of users and  what they interact with, which is inherently more flexible than role-based access control  (RBAC). 
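As a small illustration (all names here are hypothetical, and the filter mechanism itself is product-specific rather than part of RDF), the attributes ABAC needs are just more triples in the same graph:

# User attributes and a sensitive fact, all in one graph (Turtle, prefixes omitted)
ex:Jans       ex:memberOf  ex:ExecutiveTeam ;
              ex:salary    "250000"^^xsd:integer .   # hypothetical figure
ex:Reviewer7  ex:memberOf  ex:FinanceTeam .

# An attribute-based rule can then be stated over the graph itself, e.g.:
#   deny read access to any triple whose predicate is ex:salary
#   unless the requesting user is a member of ex:FinanceTeam.
# How such a rule is expressed depends on the triple store's security layer.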

Cell-level control is also possible, but as Forrest Hare of Summit Knowledge Solutions points out, such security doesn’t make a lot of sense, given how much meaning is absent in cells controlled in isolation. “What’s the classification of the number 7?” he asked. Without more context, it seems silly to control cells that are just storing numbers or individual letters, for example.

Simplifying identity management with a knowledge graph approach  

Graph databases can simplify various aspects of the process of identity  management. Let’s take Lightweight Directory Access Protocol, or LDAP, for example. 

This vendor-agnostic protocol has been around for 30 years, but it’s still popular  with enterprises. It’s a pre-web, post-internet hierarchical directory service and authentication protocol. 

“Think of LDAP as a gigantic, virtual telephone book,” suggests access control management vendor Foxpass. Foxpass offers a dashboard-based LDAP management product which it claims is much easier to manage than OpenLDAP. 

If companies don’t use LDAP, they may well use Microsoft’s Active Directory, which is a broader, database-oriented identity and access management product that covers more of the same bases. Microsoft bundles AD with its Server and Exchange products, a means of lock-in that has been quite effective. Lock-in, obviously, inhibits innovation in general.

Consider the whole of identity management as it exists today and how limiting it has been. How could enterprises embark on the journey of using a graph database-oriented approach as an alternative to application-centric IAM software? The first step  involves the creation of a “user” knowledge graph. 

Access control data duplication and fragmentation  

Semantic Arts CEO Dave McComb, in his book Software Wasteland, estimated that 90 percent of data is duplicated. Application-centric architectures in use since the days of mainframes have led to user data sprawl. Part of the reason there is such duplication of user data is that authentication, authorization, and access control (AAA) methods require that more bits of personally identifiable information (PII) be shared with central repositories for AAA purposes.

B2C companies are particularly prone to hoovering up these additional bits of  PII lately and storing that sensitive info in centralized repositories. Those repositories become one-stop shops for identity thieves. Customers who want to pay online have to  enter bank routing numbers and personal account numbers. As a result, there’s even more duplicate PII sprawl.

One of the reasons a “user” knowledge graph (and a knowledge graph enterprise foundation) could be innovative is that enterprises who adopt such an approach can move closer to zero-copy integration architectures. Model-driven development of the type that knowledge graphs enable assumes and encourages shared data and logic. 

A “user” graph coupled with project management data could reuse the same  enabling entities and relationships repeatedly for different purposes. The model-driven development approach thus incentivizes organic data management. 

The challenge of harnessing relationship-rich data  

Jans points out that enterprises, for example, run massive email systems that could be tapped to analyze project data for optimization purposes. And  disambiguation by unique email address across the enterprise can be a starting point  for all sorts of useful applications. 

Most enterprises don’t apply unique email address disambiguation, but Franz has a pharma company client that does, an exception that proves the rule. Email continues to be an untapped resource in many organizations, even though it’s a treasure trove of relationship data.
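One well-worn way to express that starting point in RDF is sketched below (hypothetical names throughout, using mailbox IRIs rather than string literals so the property can be declared inverse-functional in OWL DL):

# Declaring the mailbox property inverse-functional means that two
# individuals sharing a mailbox must be the same individual.
ex:hasMailbox  a  owl:ObjectProperty , owl:InverseFunctionalProperty .

ex:_Employee_42    ex:hasMailbox  <mailto:jans@example.com> .
ex:_ProjectLead_7  ex:hasMailbox  <mailto:jans@example.com> .

# A reasoner will infer:
#   ex:_Employee_42  owl:sameAs  ex:_ProjectLead_7 .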

Problematic data farming realities: A social media example  

Relationship data involving humans is sensitive by definition, but the reuse potential of sensitive data is too important to ignore. Organizations do need to interact with individuals online, and vice versa. 

Former US Federal Bureau of Investigation (FBI) counterintelligence agent Peter Strzok, appearing on Deadline: White House, an MSNBC program aired in the US on August 16, said:

“I’ve served I don’t know how many search warrants on Twitter (now known as X) over the years in investigations. We need to put our investigator’s hat on and talk about tradecraft a little bit. Twitter gathers a lot of information. They don’t just have your tweets. They have your draft tweets. In some cases, they have deleted tweets. They have DMs that people have sent you, which are not encrypted. They have your draft DMs, the IP address from which you logged on to the account at the time, sometimes the location at which you accessed the account and other applications that are associated with your Twitter account, amongst other data.”

X and most other social media platforms, not to mention law enforcement  agencies such as the FBI, obviously care a whole lot about data. Collecting, saving, and  allowing access to data from hundreds of millions of users in such a broad,  comprehensive fashion is essential for X. At least from a data utilization perspective,  what they’ve done makes sense. 

Contrast these social media platforms with the way enterprises collect and  handle their own data. That collection and management effort is function- rather than human-centric. With social media, the human is the product. 

So why is a social media platform’s culture different? Because with public social media, broad, relationship-rich data sharing had to come first. Users learned first-hand  what the privacy tradeoffs were, and that kind of sharing capability was designed into  the architecture. The ability to share and reuse social media data for many purposes  implies the need to manage the data and its accessibility in an elaborate way. Email, by contrast, is a much older technology that was not originally intended for multi-purpose reuse. 

Why can organizations like the FBI successfully serve search warrants on data from data farming companies? Because social media started with a broad data sharing assumption and forced a change in the data sharing culture. Then came adoption.  Then law enforcement stepped in and argued effectively for its own access. 

Broadly reused and shared, web data about users is clearly more useful than siloed data. Shared data is why X can have the advertising-driven business model it does. One-way social media contracts with users require agreement with provider terms. The users have one choice: Use the platform, or don’t. 

The key enterprise opportunity: A zero-copy user PII graph that respects users  

It’s clear that enterprises should do more to tap the value of the kinds of user data that email, for example, generates. One way to sidestep the sensitivity issues associated with reusing that sort of data would be to treat the most sensitive user data separately. 

Self-sovereign identity (SSI) advocate Phil Windley has pointed out that agent-managed, hashed messaging and decentralized identifiers could make it unnecessary to duplicate identifiers that correlate. If a bartender just needs to confirm that a patron  at the bar is old enough to drink, the bartender could just ping the DMV to confirm the  fact. The DMV could then ping the user’s phone to verify the patron’s claimed adult status.

Given such a scheme, each user could manage and control their access to their  own most sensitive PII. In this scenario, the PII could stay in place, stored, and encrypted on a user’s phone. 

Knowledge graphs lend themselves to this less centralized, yet more fine-grained and transparent, approach to data management. By supporting self-sovereign identity and a data-centric architecture, a Chief Data Officer could help the Chief Risk Officer mitigate the enterprise risk associated with the duplication of personally identifiable information—a true win-win.

Zero Copy Integration and Radical Simplification

Dave McComb’s book Software Wasteland underscored a fundamental problem: enterprise software sometimes costs 1,000 times more than it ought to. The poster child for cost overruns highlighted in the book was Healthcare.gov, a public registration system for the US Affordable Care Act, enacted in 2010. By 2018, the US Federal government had spent $2.1 billion to build and implement the system. Most of that money was wasted. The government ended up adopting many of the design principles embodied in an equivalent system called HealthSherpa, which cost $1 million to build and implement.

In an era where the data-centric architecture Semantic Arts advocates should be the  norm, application-centric architecture still predominates. But data-centric architecture doesn’t just reduce the cost of applications. It also attacks the data duplication problem attributable to  poor software design. This article explores how expensive data duplication has become, and  how data-centric, zero-copy integration can put enterprises on a course to simplification. 

Data sprawl and storage volumes  

In 2021, Seagate became the first company to ship three zettabytes’ worth of hard disks. It took the company 36 years to ship the first zettabyte, six more years to ship the second, and only one additional year to ship the third.

The company’s first product, the ST-506, was released in 1980. The ST-506 hard disk, when formatted, stored five megabytes (1 MB = 1000² bytes). By comparison, an IBM RAMAC 305, introduced in 1956, stored five to ten megabytes. The RAMAC 305 weighed 10 US tons (the equivalent of nine metric tonnes). By contrast, the Seagate ST-506, 24 years later, weighed five US pounds (or 2.27 kilograms).

A zettabyte is the equivalent of 7.3 trillion MP3 files or 30 billion 4K movies, according to  Seagate. When considering zettabytes: 

  • 1 zettabyte equals 1,000 exabytes. 
  • 1 exabyte equals 1,000 petabytes. 
  • 1 petabyte equals 1,000 terabytes. 

IDC predicts that the world will generate 178 zettabytes of data by 2025. At that pace, “The  Yottabyte Era” would succeed The Zettabyte Era by 2030, if not earlier. 

The cost of copying  

The question becomes, how much of the data generated will be “disposable” or  unnecessary data? In other words, how much data do we actually need to generate, and how 

much do we really need to store? Aren’t we wasting energy and other resources by storing  more than we need to? 

Let’s put it this way: If we didn’t have to duplicate any data whatsoever, the world would only have to generate 11 percent of the data it currently does. In 2021 terms, we’d only need  to generate 8.7 zettabytes of data, compared with the 78 zettabytes we actually generated worldwide over the course of that year. 

Moreover, Statista estimates that the ratio of unique to replicated data stored worldwide will decline to 1:10 from 1:9 by 2024. In other words, the trend is  toward more duplication, rather than less. 

The cost of storing oodles of data is substantial. Computer hardware guru Nick  Evanson, quoted by Gerry McGovern in CMSwire, estimated in 2020 that storing two  yottabytes would cost $58 trillion. If the cost per byte stored stayed constant, 40 percent of the world’s economic output would be consumed in 2035 by just storing data. 

Clearly, we should be incentivizing what graph platform Cinchy calls “zero-copy  integration”–a way of radically reducing unnecessary data duplication. The one thing we don’t  have is “zero-cost” storage. But first, let’s finish the cost story. More on the solution side and zero-copy integration later. 

The cost of training and inferencing large language models  

Model development and usage expenses are just as concerning. The cost of training  machines to learn with the help of curated datasets is one thing, but the cost of inferencing–the  use of the resulting model to make predictions using live data–is another. 

“Machine learning is on track to consume all the energy being supplied, a model that is costly, inefficient, and unsustainable,” Brian Bailey pointed out in Semiconductor Engineering in 2022. AI model training expense has increased with the size of the datasets used, but more importantly, as the number of parameters increases fourfold, the amount of energy consumed in the process increases by 18,000 times. Some AI models included as many as 150 billion parameters in 2022. The more recent ChatGPT LLM training involved 180 billion parameters. Training can often be a continuous activity to keep models up to date.

But the applied-model aspect of inferencing can be enormously costly. Consider the AI functions in self-driving cars, for example. Major car makers sell millions of cars a year, and each one they sell uses the same carmaker’s model in a unique way. Seventy percent of the energy consumed in self-driving car applications could be due to inference, says Godwin Maben, a scientist at electronic design automation (EDA) provider Synopsys.

Data Quality by Design  

Transfer learning is a machine learning term that refers to how machines can be taught  to generalize better. It’s a form of knowledge transfer. Semantic knowledge graphs can be a  valuable means of knowledge transfer because they describe contexts and causality well with  the help of relationships.

Well-described knowledge graphs provide the context in contextual computing.  Contextual computing, according to the US Defense Advanced Research Projects Agency  (DARPA), is essential to artificial general intelligence. 

A substantial percentage of the training set data used in large language models is more or less duplicate data, precisely because poorly described context leads to a lack of generalization ability. That is why the only AI we have is narrow AI, and why large language models are so inefficient.

But what about the storage cost problem associated with data duplication? Knowledge graphs can help with that problem also, by serving as a means for logic sharing. As Dave has  pointed out, knowledge graphs facilitate model-driven development when applications are  written to use the description or relationship logic the graph describes. Ontologies provide the logical connections that allow reuse and thereby reduce the need for duplication. 

FAIR data and Zero-Copy Integration  

How do you get others who are concerned about data duplication on board with semantics and knowledge graphs? By encouraging data and coding discipline that’s guided by FAIR principles. As Dave pointed out in a December 2022 blog post (https://www.semanticarts.com/the-data-centric-revolution-detour-shortcut-to-fair/), semantic graphs and FAIR principles go hand in hand.

Adhering to the FAIR principles, formulated by a group of scientists in 2016, promotes  reusability by “enhancing the ability of machines to automatically find and use the data, in  addition to supporting its reuse by individuals.” When it comes to data, FAIR stands for Findable, Accessible, Interoperable, and Reusable. FAIR data is easily found, easily shared,  easily reused quality data, in other words. 

FAIR data implies the data quality needed to do zero-copy integration. 

Bottom line: When companies move to contextual computing by using knowledge  graphs to create FAIR data and do model-driven development, it’s a win-win. More reusable  data and logic means less duplication, less energy, less labor waste, and lower cost. The term  “zero-copy integration” underscores those benefits.

How US Homeland Security Plans to Use Knowledge Graph


During this summer’s Data Centric Architecture Forum, Ryan Riccucci, Division Chief for  U.S. Border Patrol – Tucson (AZ) Sector, and his colleague Eugene Yockey gave a glimpse of what the data environment is like within the US Department of Homeland Security (DHS), as well as how transforming that data environment has been evolving. 

The DHS celebrated its 20-year anniversary recently. The Federal department’s data challenges are substantial, considering the need to collect, store, retrieve and manage information associated with 500,000 daily border crossings, 160,000 vehicles, and $8 billion in imported goods processed daily by 65,000 personnel. 

Riccucci is leading an ontology development effort within the Customs and Border Protection (CBP) agency and the Department of Homeland Security more generally to support scalable, enterprise-wide data integration and knowledge sharing. It’s significant to note that a Division Chief has tackled the organization’s data integration challenge. Riccucci doesn’t let leading-edge, transformational technology and fundamental data architecture change intimidate him.

Riccucci described a typical use case for the transformed, integrated data sharing  environment that DHS and its predecessor organizations have envisioned for decades. 

The CBP has various sensor nets that monitor air traffic close to or crossing the borders between Mexico and the US, and between Canada and the US. One challenge on the Mexican border is fentanyl smuggling into the US via drones. Fentanyl can be 50 times as powerful as morphine, and it was involved in most of the roughly 110,000 US overdose deaths in 2022.

On the border with Canada, a major concern is gun smuggling via drone from the US to Canada. Though legal in the US, Glock pistols, for instance, are illegal and in high demand in Canada. 

The challenge in either case is to intercept the smugglers retrieving the drug or weapon drops while they are in the act. Drones may only be active for seven to 15 minutes at a time, so  the opportunity window to detect and respond effectively is a narrow one. 

Field agents ideally need to see real-time, mapped airspace information the moment a sensor is activated, allowing them to move quickly and directly to the location. Specifics are important; verbally relayed information, by contrast, is often less specific, causing confusion or misunderstanding.

The CBP’s successful proof of concept used a basic Resource Description Framework (RDF) triple pattern to capture just this kind of information:

Sensor → Act of sensing → drone (SUAS, SUAV, vehicle, etc.) 
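Rendered in RDF’s Turtle syntax, that pattern might look something like the sketch below; the ex: terms, identifiers, timestamp, and coordinates are illustrative placeholders rather than CBP’s actual vocabulary or data:

  @prefix ex:  <https://example.com/airspace/> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

  ex:sensor-tower-42 a ex:Sensor .
  ex:track-0093      a ex:SmallUnmannedAircraftSystem .

  ex:sensing-event-0214
      a              ex:ActOfSensing ;
      ex:performedBy ex:sensor-tower-42 ;                  # the sensor
      ex:detected    ex:track-0093 ;                       # the drone (SUAS, SUAV, vehicle, etc.)
      ex:occurredAt  "2023-07-31T02:14:00Z"^^xsd:dateTime ;
      ex:nearLatLong "31.88 -112.86" .                     # where agents need to go

Because each detection is just another set of triples, the time and space filters mentioned below become straightforward graph queries.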

In a recent test scenario, CBP collected 17,000 records that met specified time/space requirements for a qualified drone interdiction over a 30-day period. 

The overall impression that Riccucci and Yockey conveyed was that DHS has both the budget and the commitment to tackle this and many other use cases using a transformed data-centric architecture. By capturing information in an interoperable format, the DHS has been apprehending the bad guys with greater frequency and precision.

Ontology Consultant Job Description


As a Semantic Arts ontologist, you will be essential in fixing the tangled mess of information in  client enterprise systems and promoting a world where enterprise information is widely  understood and easily accessed by all who have permission. Come work with the best in the  business on interesting projects with global leaders! 

Working together with other ontology consultants, you will take existing design artifacts and  work with subject matter experts to convert models to formal semantic expressions for clients.  We work with a diverse set of clients of all sizes and across industries. Therefore, you can expect a variety of work across many domains of knowledge. We have a strong sense of team, with no rigid hierarchy and place a high value on individual input. 

Requirements: 

  • A passion for information and knowledge modeling 
  • Trained in ontology development, whether through formal training or on-the-job experience. 
  • Experience in data modeling or related analytical work. 
  • Strong interpersonal communication skills, experience managing client, stakeholder, and internal interactions. 
  • Experience in OWL, RDF, SPARQL and the ability to program against triplestores
  • A desire to learn new domains of knowledge in a fast-paced environment.
  • Bachelor’s degree in Computer Science, Information Systems, Knowledge Management,  Engineering, Philosophy, Business, or similar. 

Nice to Have: 

  • Prior use or understanding of W3C semantic web standards 
  • Advanced academic degree preferred. 

About Us: 

We have been promoting a vision of Data-Centric Architecture for more than 20 years, and people are catching on! We have been recognized with the 2022 “Colorado Companies to Watch”, 2022 “Top 30 Innovators of the Year”, 2021 “30 Innovators to Watch”, and 2020 “30 Best Small Companies to Watch” awards. Semantic Arts is growing quickly and expanding our domains, projects, and roles. We have assembled what might be the largest team of individuals passionately dedicated to this task, making Semantic Arts a great place to develop skills and grow professionally in this exciting field.

What We Offer: 

  • Remote Position, with travel for onsite work with clients required (up to 3 days every 3  weeks) 
  • Professional development fund to develop skills, attend conferences, and advance your career. 
  • Medical, Dental, and Vision Benefits 
  • SIMPLE IRA with company match 
  • Student Loan Reimbursement
  • Annual Bonus Potential 
  • Equipment Purchase Assistance 
  • Employee Assistance Program 

Employment Type: 

Full-time 

Authorization: 

Candidates must be authorized to work for any employer within the US, UK, or Canada. We are not currently able to sponsor visas or hire outside of those countries. 

Compensation: 

Compensation for this position varies based on experience, billable utilization, and other factors.  Entry-level ontologists start around $70,000 USD annually and generally rise quickly, with the overall average being approximately $150,000 USD, and about 1/3 of consultants averaging more than $175,000 USD. More details shared during the interview process. 

Semantic Arts is committed to the full inclusion of all qualified individuals. In keeping with our commitment, we will take steps to assure that people with disabilities are provided reasonable accommodations. Accordingly, if a reasonable accommodation is required to fully participate in  the job application or interview process, to perform the essential duties of the position, and/or to  receive all other benefits and privileges of employment, please contact our HR representative at  [email protected]

Semantic Arts is an Equal Opportunity Employer. We respect and seek to empower each  individual and support the diverse cultures, perspectives, skills, and experiences within our  workforce. We support an inclusive workplace where employees excel based on merit,  qualifications, experience, ability, and job performance.

Extending an Upper-Level Ontology


If you have been following my blogs over the past year or so, then you will know I am a big  fan of adopting an upper-level ontology to help bootstrap your own bespoke ontology  project. Of the available upper-level ontologies I happen to like gist as it embraces a “less is more” philosophy. 

Given that this is 3rd party software with its own lifecycle, how does one “merge” such an upper ontology with your own? Like most things in life, there are two primary ways. 

CLONE MODEL 

This approach is straightforward: simply clone the upper ontology and then modify/extend it directly as if it were your own (being sure to retain any copyright notice). The assumption  here is that you will change the “gist” domain into something else like “mydomain”. The  benefit is that you don’t have to risk any 3rd party updates affecting your project down the  road. The downside is that you lose out on the latest enhancements/improvements over time, which if you wish to adopt, would require you to manually re-factor into your own  ontology. 

Because the inventors of gist have many dozens of person-years of hands-on experience developing and implementing ontologies for dozens of enterprise customers, and they keep folding that experience back into the ontology, cutting yourself off from their updates is a real loss. This is not an approach I would recommend for most projects.

EXTEND MODEL 

Just as when you extend any 3rd party software library you do so in your own namespace,  you should also extend an upper-level ontology in your own namespace. This involves just a  couple of simple steps: 

First, declare your own namespace as an owl ontology, then import the 3rd party upper-level  ontology (e.g. gist) into that ontology. Something along the lines of this: 

<https://ont.mydomain.com/core>
  a owl:Ontology ;
  owl:imports <https://ontologies.semanticarts.com/o/gistCore11.0.0> .

Second, define your “extended” classes and properties, referencing appropriate gist  subclasses, subproperties, domains, and/or range assertions as needed. A few samples  shown below (where “my” is the prefix for your ontology domain): 

my:isFriendOf
  a owl:ObjectProperty ;
  rdfs:domain gist:Person .

my:Parent
  a owl:Class ;
  rdfs:subClassOf gist:Person .

my:firstName
  a owl:DatatypeProperty ;
  rdfs:subPropertyOf gist:name .

The above definitions allow you to update to new versions of the upper-level ontology* without losing any of your extensions. Simple, right?

*When a 3rd party upgrades the upper-level ontology to a new major version — defined as non-backward-compatible — you may find changes that need to be made to your extension ontology. As a hypothetical example, if Semantic Arts decided to remove the class gist:Person, the assertions made above would no longer be compatible. Fortunately, when it comes to major updates, Semantic Arts has consistently provided a set of migration scripts that assist with updating your extended ontology as well as your instance data. Other 3rd parties may or may not follow suit.
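To see the extensions in use, here is a minimal, hypothetical instance-data sketch (the my: prefix is assumed to resolve to your extension namespace, ex: to an instance namespace). Under RDFS entailment, ex:jane is also a gist:Person and her first name is also reachable through gist:name, so your data keeps working against the gist terms even though everything you added lives in your own namespace:

  @prefix my: <https://ont.mydomain.com/core#> .   # assumed prefix for the extension above
  @prefix ex: <https://data.mydomain.com/> .       # hypothetical instance namespace

  ex:jane
      a             my:Parent ;      # entails: ex:jane a gist:Person
      my:firstName  "Jane" ;         # entails: ex:jane gist:name "Jane"
      my:isFriendOf ex:bob .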

DCA Forum Recap: Forrest Hare, Summit Knowledge Solutions

A knowledge model for explainable military AI

Forrest Hare, Founder of Summit Knowledge Solutions, is a retired US Air Force targeting and information operations officer who now works with the Defense Intelligence Agency (DIA). His experience includes integrating intelligence from different types of communications, signals, imagery, open source, telemetry, and other sources into a cohesive and actionable whole.

Hare became aware of semantics technology while at SAIC and is currently focused on building a space + time ontology called the DIA Knowledge Model so that Defense Department intelligence could use it to contextualize these multi-source inputs.

The question becomes, how do you bring objects that don’t move and objects that do move into the same information frame with a unified context? The information is currently organized by collectors and producers.

The object-based intelligence that does exist involves things that don’t move at all.  Facilities, for example, or humans using phones that are present on a communications network are more or less static. But what about the things in between such as trucks that are only intermittently present?

Only sparse information is available about these. How do you know the truck that was there yesterday in an image is the same truck that is there today? Not to mention the potential hostile forces who own the truck that have a strong incentive to hide it.

Objects in object-based intelligence not only include these kinds of assets, but also events and locations that you want to collect information about. In an entity-relationship sense, objects are entities.

Hare’s DIA Knowledge Model uses the ISO-standard Basic Formal Ontology (BFO) to unify domains so that the information from different sources is logically connected and therefore makes sense as part of a larger whole. BFO’s maintainers (Director Barry Smith and his team at the National Center for Ontological Research (NCOR) at the University at Buffalo) keep the ontology strictly limited to roughly three dozen classes.

The spatial-temporal regions of the Knowledge Model are what’s essential to do the kinds of dynamic, unfolding object tracking that’s been missing from object-based intelligence. Hare gave the example of a “site” (an immaterial entity) from a BFO perspective. A strict geolocational definition of “site” makes it possible for both humans and machines to make sense of the data about sites. Otherwise, Hare says, “The computer has no idea how to understand what’s in our databases, and that’s why it’s a dumpster fire.”
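As a rough illustration of the point (our sketch, not Hare’s actual model), a geolocated site can be published as data whose meaning is anchored in BFO; everything below except the BFO class IRI is a hypothetical ex: term:

  @prefix obo:  <http://purl.obolibrary.org/obo/> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix ex:   <https://example.com/intel/> .

  ex:site-depot-07
      a             obo:BFO_0000029 ;      # BFO 'site', an immaterial entity
      rdfs:label    "Vehicle depot 07" ;
      ex:centroid   "40.00 -105.00" ;      # hypothetical geolocation property (lat long)
      ex:observedIn ex:image-2023-08-14 .  # link to the imagery that reported it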

This kind of mutual human and machine understanding is a major rationale behind explainable AI. A commander briefed by an intelligence team must know why the team came to the conclusions it did. The stakes are obviously high. “From a national security perspective, it’s extremely important for AI to be explainable,” Hare reminded the audience. Black boxes such as ChatGPT as currently designed can’t effectively answer the commander’s question on how the intel team arrived at the conclusions it did.

Finally, the explainability that knowledge models like the DIA’s provide becomes even more critical as information flows into the Joint Intelligence Operations Center (JIOC). Furthermore, the various branches of the US Armed Forces must supply and continually update a Common Intelligence Picture that’s actionable by the US President, who is the Commander in Chief of the military as a whole.

Without this conceptual and spatial-temporal alignment across all service branches, joint operations can’t proceed as efficiently and effectively as they should.  Certainly, the risk of failure looms much larger as a result.

Contributed by Alan Morrison

Financial Data Transparency Act “PitchFest”

The Data Foundation hosted a PitchFest on “Unlocking the vision of the Financial Data Transparency Act” a few days ago. Selected speakers were given 10 minutes to bring their best ideas on how to use the improved financial regulatory information and data.

The Financial Data Transparency Act is a new piece of legislation directly affecting the financial services industry. In short, it directs financial regulators to harmonize data collections and move to machine-readable (and people-readable) forms. The goal is to reduce the burdens of compliance on regulated industries, to increase the ability to analyze data, and to enhance overall transparency.

Two members of our team, Michael Atkin and Dalia Dahleh, were given the opportunity to present. Below is the text from Michael Atkin’s pitch:

  1. Background – Just to set the stage. I’ve been fortunate to have been in the position as scribe, analyst, advocate and organizer for data management since 1985.  I’ve always been a neutral facilitator – allowing me to sit on all sides of the data management issue all over the world – from data provider to data consumer to market authority to regulator.  I’ve helped create maturity models outlining best practice – performed benchmarking to measure progress – documented the business case – and created and taught the Principles of Data Management at Columbia University.  I’ve also served on the SEC’s Market Data Advisory Committee, the CFTC’s Technical Advisory Committee and as the Chair of the Data Subcommittee of the OFR’s Financial Research Advisory activity during the financial crisis of 2008.  So, I have some perspective on the challenges the regulators face and the value of the FDTA.
  2. Conclusion (slide 2) – My conclusions after all that exposure are simple. There is a real data dilemma for many entities. The dilemma is caused by fragmentation of technology. It’s nobody’s fault. We have business and operational silos. They are created using proprietary software. The same things are modeled differently based on the whim of the architects, the focus of the applications and the nuances of the technical solution. This fragmentation creates “data incongruence” – where the meaning of data from one repository doesn’t match other repositories. We have the same words, with different meanings. We have the same meaning using different words. And we have nuances that get lost in translation. As a result, we spend countless effort and money moving data around, reconciling meaning and doing mapping. As one of my banking clients said … “My projects end up as expensive death marches of data cleansing and manipulation just to make the software work.” And we do this over and over ad infinitum. Not only do we suffer from data incongruence – we suffer from the limitations of relational technology that still dominates our way of processing data. For the record, relational technology is over 50 years old. It was (and is) great for computation and structured data. It’s not good for ad hoc inquiry and scenario-based analysis. The truth is that data has become isolated and mismatched across repositories due to technology fragmentation and the rigidity of the relational paradigm. Enterprises (including government enterprises) often have thousands of business and data silos – each based on proprietary data models that are hard to identify and even harder to change. I refer to this as the bad data tax. It costs most organizations somewhere around 40-60% of their IT budget to address. So, let’s recognize that this is a real liability. One that diverts resources from business goals, extends time-to-value for analysts, and leads to knowledge worker frustration. The new task before FSOC leadership and the FDTA is now about fixing the data itself.
  3. Solution (slide 3) – The good news is that the solution to this data dilemma is actually quite simple and twofold in nature. First – adopt the principles of good data hygiene. And on that front, there appears to be good progress thanks to efforts around the Federal Data Strategy and things related to BCBS 239 and the Open Government Data Act. But governance alone will not solve the data dilemma. The second thing that is required is to adopt data standards that were specifically designed to address the problems of technology fragmentation. And these open data web-based standards are quite mature. They include the Internationalized Resource Identifier (or IRI) for identity resolution. The use of ontologies – that enable us to model simple facts and relationship facts. And the expression of these things in standards like RDF for ontologies, OWL for inferencing and SHACL for business rules. [A minimal SHACL sketch appears after the pitch text below.] From these standards you get a bunch of capabilities. You get quality by math (because the ontology ensures precision of meaning). You get reusability (which eliminates the problem of hard coded assumptions and the problem of doing the same thing in slightly different ways). You get access control (because the rules are embedded into the data and not constrained by systems or administrative complexity). You get lineage traceability (because everything is linked to a single identifier so that data can be traced as it flows across systems). And you get good governance (since these standards use resolvable identity, precise meaning and lineage traceability to shift governance from people-intensive data reconciliation to more automated data applications).
  4. FDTA (slide 4) – Another important component is that this is happening at the right time. I see the FDTA as the next step in a line of initiatives seeking to modernize regulatory reporting and reduce risk. I’ve witnessed the efforts to move to T+1 (to address the clearing and settlement challenge). I’ve seen the recognition of global interdependencies (with the fallout from Long Term Capital, Enron and the problems of derivatives in Orange County). We’ve seen the problems of identity resolution that led to KYC and AML requirements. And I was actively involved in understanding the data challenges of systemic risk with the credit crisis of 2008. The problem with all these regulatory activities is that most of them are not about fixing the data. Yes, we did get LEI and data governance. Those are great things, but far from what is required to address the data dilemma. I also applaud the adoption of XBRL (and the concept of data tagging). I like the XBRL taxonomies (as well as the Eurofiling regulatory taxonomies) – but they are designed vertically report-by-report with a limited capability for linking things together. Not only that, most entities are just extracting XBRL into their relational environments, which does little to address the problem of structural rigidity. The good news is that all the work that has gone into the adoption of XBRL is able to be leveraged. XML is good for data transfer. Taxonomies are good for unraveling concepts and tagging. And the shift from XML to RDF is straightforward and would not affect those who are currently reporting using XBRL. One final note before I make our pitch. Let’s recognize that XBRL is not the way the banks are managing their internal data infrastructures. They suffer from the same dilemmas as the regulators, and almost every G-SIB and D-SIB I know is moving toward semantic standards. Because even though FDTA is about the FSOC agencies – it will ultimately affect the financial institutions. I see this as an opportunity for collaboration between regulators and the regulated, in building the infrastructure for the digital world.
  5. Proposal (slide 5) – Semantic Arts is proposing a pilot project to implement the foundational infrastructure of precise data about financial instruments (including identification, classification, descriptive elements and corporate actions), legal entities (including entity types as well as information about ownership and control), obligations (associated with issuance, trading, clearing and settlement), and holdings about the portfolios of the regulated entities. These are the building blocks of linked risk analysis. To implement this initiative, we are proposing you start with a single simple model of the information from one of the covered agencies. The initial project would focus on defining the enterprise model and conforming two to three key data sets to the model. The resulting model would be hosted on a graph database. Subsequent projects would involve expanding the footprint of data domains to be added to the graph, and gradually building functionality to begin to reverse the legacy creation process. We would initiate things by leveraging the open standard upper ontology (GIST) from Semantic Arts as well as the work of the Financial Industry Business Ontology (from the EDM Council) and any other vetted ontology like the one OFR is building for CFI. Semantic Arts has a philosophy of “think big” (like cross-agency interoperability) but “start small” (like a business domain of one of the agencies). The value of adopting semantic standards is threefold – and can be measured using the “three C’s” of metrics. The first C is cost containment, starting with data integration and including areas focused on business process automation and consolidation of redundant systems (best known as technical agility). The second C is capability enhancement for analysis of the degrees of interconnectedness, the nature of transitive relationships, state contingent cash flow, collateral flow, guarantee and transmission of risk. The final C is implementation of the control environment focused on tracking data flow, protecting sensitive information, preventing unwanted outcomes, managing access and ensuring privacy.
  6. Final Word (contact) – Just a final word to leave you with. Adopting these semantic standards can be accomplished at a fraction of the cost of what you spend each year supporting the vast cottage industry of data integration workarounds.  The pathway forward doesn’t require ripping everything out but instead building a semantic “graph” layer across data to connect the dots and restore context.  This is what we do.  Thank you.
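The pitch’s third point names RDF, OWL, and SHACL. As a concrete illustration of what a SHACL “business rule” looks like (our sketch with hypothetical ex: terms, not part of the pitch or of any OFR or EDM Council ontology), the shape below says that every financial instrument must carry at least one string-valued identifier:

  @prefix sh:  <http://www.w3.org/ns/shacl#> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
  @prefix ex:  <https://example.com/fin/> .

  ex:FinancialInstrumentShape
      a              sh:NodeShape ;
      sh:targetClass ex:FinancialInstrument ;
      sh:property [
          sh:path     ex:hasIdentifier ;   # e.g., an ISIN held as a string
          sh:minCount 1 ;
          sh:datatype xsd:string ;
      ] .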

Link to Slide Deck

DCA Forum Recap: Jans Aasman, Franz

How a “user” knowledge graph can help change data culture

Identity and Access Management (IAM) has had the same problem since Fernando Corbató of MIT first dreamed up the idea of digital passwords in 1960: opacity. Identity in the physical world is rich and well-articulated, with a wealth of different ways to verify information on individual humans and devices. By contrast, the digital realm has been identity data impoverished, cryptic and inflexible for over 60 years now.

Jans Aasman, CEO of Franz, provider of the entity-event knowledge graph solution Allegrograph, envisions a “user” knowledge graph as a flexible and more manageable data-centric solution to the IAM challenge. He presented on the topic at this past summer’s Data-Centric Architecture Forum, which Semantic Arts hosted near its headquarters in Fort Collins, Colorado.

Consider the specificity of a semantic graph and how it could facilitate secure access control. Knowledge graphs constructed of subject-predicate-object triples make it possible to set rules and filters in an articulated and yet straightforward manner.  Information about individuals that’s been collected for other HR purposes could enable this more precise filtering.

For example, Jans could disallow others’ access to a triple that connects “Jans” and “salary”. Or he could disallow access to certain predicates.

Identity and access management vendors call this method Attribute-Based Access Control (ABAC). Attributes include many different characteristics of users and what they interact with, which is inherently more flexible than role-based access control (RBAC).
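As a sketch of the idea (hypothetical ex: and acl: vocabularies, not AllegroGraph’s actual security-filter syntax), both the sensitive fact and the rule restricting it can live in the graph as triples:

  @prefix ex:  <https://example.com/hr/> .
  @prefix acl: <https://example.com/access/> .

  # The sensitive fact (illustrative value)
  ex:jans ex:salary 250000 .

  # An attribute-based rule about that fact: only holders of the HR role
  # may read triples whose predicate is ex:salary.
  acl:salary-filter
      a                      acl:PredicateFilter ;
      acl:restrictsPredicate ex:salary ;
      acl:readableByRole     ex:HumanResourcesRole .

Because the rule targets a predicate rather than an isolated cell, it carries its meaning with it, which is the point Hare makes below about cell-level control.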

Cell-level control is also possible, but as Forrest Hare of Summit Knowledge Solutions points out, such security doesn’t make a lot of sense, given how much meaning is absent in cells controlled in isolation. “What’s the classification of the number 7?” he asked. Without more context, it seems silly to control cells that are just storing numbers or individual letters, for example.

Simplifying identity management with a knowledge graph approach

Graph databases can simplify various aspects of the process of identity management. Let’s take Lightweight Directory Access Protocol, or LDAP, for example.

This vendor-agnostic protocol has been around for 30 years, but it’s still popular with enterprises. It’s a pre-web, post-internet hierarchical directory service and authentication protocol.

“Think of LDAP as a gigantic, virtual telephone book,” suggests access control management vendor Foxpass. Foxpass offers a dashboard-based LDAP management product which it claims is much easier to manage than OpenLDAP.

Companies that don’t use LDAP often use Microsoft’s Active Directory instead, a broader, database-oriented identity and access management product that covers the same bases and more. Microsoft bundles AD with its Server and Exchange products, a means of lock-in that has been quite effective. Lock-in, obviously, inhibits innovation in general.

Consider the whole of identity management as it exists today and how limiting it has been. How could enterprises embark on the journey of using a graph database-oriented approach as an alternative to application-centric IAM software? The first step involves the creation of a “user” knowledge graph.
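A “user” knowledge graph can begin with little more than the attributes an LDAP directory or HR system already holds, restated as triples so they can be linked to groups, roles, and projects; the ex: terms below are illustrative placeholders:

  @prefix ex:   <https://example.com/people/> .
  @prefix foaf: <http://xmlns.com/foaf/0.1/> .

  ex:jdoe
      a           foaf:Person ;
      foaf:name   "Jane Doe" ;
      foaf:mbox   <mailto:[email protected]> ;
      ex:memberOf ex:group-data-engineering ;   # directory-style group membership
      ex:hasRole  ex:role-analyst ;             # attribute usable in ABAC decisions
      ex:worksOn  ex:project-atlas .            # link into project data for reuse

The same node can then serve authentication, authorization, and project analytics without copying the user record into each application.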

Access control data duplication and fragmentation

Semantic Arts CEO Dave McComb in his book Software Wasteland estimated that 90 percent of data is duplicated. Application-centric architectures in use since the days of mainframes have led to user data sprawl. Part of the reason there is such a duplication of user data is that authentication, authorization, and access control (AAA) methods require more bits of personally identifiable information (PII) be shared with central repositories for AAA purposes.

B2C companies are particularly prone to hoovering up these additional bits of PII lately and storing that sensitive info in centralized repositories. Those repositories become one-stop shops for identity thieves. Customers who want to pay online have to enter bank routing numbers and personal account numbers. As a result, there’s even more duplicate PII sprawl.

One of the reasons a “user” knowledge graph (and a knowledge graph enterprise foundation) could be innovative is that enterprises who adopt such an approach can move closer to zero-copy integration architectures. Model-driven development of the type that knowledge graphs enable assumes and encourages shared data and logic.

A “user” graph coupled with project management data could reuse the same enabling entities and relationships repeatedly for different purposes. The model-driven development approach thus incentivizes organic data management.

The challenge of harnessing relationship-rich data

Jans points out that enterprises, for example, run massive email systems that could be tapped to analyze project data for optimization purposes. And disambiguation by unique email address across the enterprise can be a starting point for all sorts of useful applications.

Most enterprises don’t apply unique email address disambiguation, but Franz has a pharma company client that does, an exception that proves the rule. Email remains an untapped resource in most organizations, even though it’s a treasure trove of relationship data.
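As a sketch of what email-derived relationship data could look like once addresses are treated as canonical keys (hypothetical ex: terms and addresses):

  @prefix ex: <https://example.com/mail/> .

  # One canonical mailbox per person removes the ambiguity.
  ex:person-a ex:canonicalMailbox <mailto:[email protected]> .
  ex:person-b ex:canonicalMailbox <mailto:[email protected]> .

  # Each message then becomes unambiguous relationship data.
  ex:msg-98231
      a        ex:EmailMessage ;
      ex:from  <mailto:[email protected]> ;
      ex:to    <mailto:[email protected]> ;
      ex:about ex:project-atlas .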

Problematic data farming realities: A social media example

Relationship data involving humans is sensitive by definition, but the reuse potential of sensitive data is too important to ignore. Organizations do need to interact with individuals online, and vice versa.

Former US Federal Bureau of Investigation (FBI) counterintelligence agent Peter Strzok, speaking on Deadline: White House, an MSNBC program aired in the US, said the following on August 16:

“I’ve served I don’t know how many search warrants on Twitter (now known as X) over the years in investigations. We need to put our investigator’s hat on and talk about tradecraft a little bit. Twitter gathers a lot of information. They don’t just have your tweets. They have your draft tweets. In some cases, they have deleted tweets. They have DMs that people have sent you, which are not encrypted. They have your draft DMs, the IP address from which you logged on to the account at the time, sometimes the location at which you accessed the account and other applications that are associated with your Twitter account, amongst other data.”

X and most other social media platforms, not to mention law enforcement agencies such as the FBI, obviously care a whole lot about data. Collecting, saving, and allowing access to data from hundreds of millions of users in such a broad, comprehensive fashion is essential for X. At least from a data utilization perspective, what they’ve done makes sense.

Contrast these social media platforms with the way enterprises collect and handle their own data. That collection and management effort is function- rather than human-centric. With social media, the human is the product.

So why is a social media platform’s culture different? Because with public social media, broad, relationship-rich data sharing had to come first. Users learned first-hand what the privacy tradeoffs were, and that kind of sharing capability was designed into the architecture. The ability to share and reuse social media data for many purposes implies the need to manage the data and its accessibility in an elaborate way. Email, by contrast, is a much older technology that was not originally intended for multi-purpose reuse.

Why can organizations like the FBI successfully serve search warrants on data from data farming companies? Because social media started with a broad data sharing assumption and forced a change in the data sharing culture. Then came adoption. Then law enforcement stepped in and argued effectively for its own access.

Broadly reused and shared, web data about users is clearly more useful than siloed data. Shared data is why X can have the advertising-driven business model it does. One-way social media contracts with users require agreement with provider terms. The users have one choice: Use the platform, or don’t.

The key enterprise opportunity: A zero-copy user PII graph that respects users

It’s clear that enterprises should do more to tap the value of the kinds of user data that email, for example, generates. One way to sidestep the sensitivity issues associated with reusing that sort of data would be to treat the most sensitive user data separately.

Self-sovereign identity (SSI) advocate Phil Windley has pointed out that agent-managed, hashed messaging and decentralized identifiers could make it unnecessary to duplicate identifiers that correlate. If a bartender just needs to confirm that a patron at the bar is old enough to drink, the bartender could just ping the DMV to confirm the fact. The DMV could then ping the user’s phone to verify the patron’s claimed adult status.

Given such a scheme, each user could manage and control their access to their own most sensitive PII. In this scenario, the PII could stay in place, stored, and encrypted on a user’s phone.

Knowledge graphs lend themselves to such a less centralized, and yet more fine-grained and transparent approach to data management. By supporting self-sovereign identity and a data-centric architecture, a Chief Data Officer could help the Chief Risk Officer mitigate the enterprise risk associated with the duplication of personally identifiable information—a true, win-win.

 

Contributed by Alan Morrison