DCA Forum Recap: US Homeland Security

How US Homeland Security plans to use knowledge graphs in its border patrol efforts

During this summer’s Data Centric Architecture Forum, Ryan Riccucci, Division Chief for U.S. Border Patrol – Tucson (AZ) Sector, and his colleague Eugene Yockey gave a glimpse of the data environment within the US Department of Homeland Security (DHS), as well as how the transformation of that environment has been evolving.

The DHS recently celebrated its 20th anniversary. The Federal department’s data challenges are substantial: its 65,000 personnel collect, store, retrieve and manage information associated with 500,000 border crossings, 160,000 vehicles, and $8 billion in imported goods every day.

Riccucci is leading an ontology development effort within U.S. Customs and Border Protection (CBP) and the Department of Homeland Security more generally to support scalable, enterprise-wide data integration and knowledge sharing. It’s significant to note that a Division Chief has tackled the organization’s data integration challenge. Riccucci doesn’t let leading-edge, transformational technology and fundamental data architecture change intimidate him.

Riccucci described a typical use case for the transformed, integrated data sharing environment that DHS and its predecessor organizations have envisioned for decades.

The CBP has various sensor nets that monitor air traffic close to or crossing the borders between Mexico and the US, and Canada and the US. One major challenge on the Mexican border is fentanyl smuggling into the US via drone. Fentanyl can be 50 times as powerful as morphine, and fentanyl overdoses caused 110,000 deaths in the US in 2022.

On the border with Canada, a major concern is gun smuggling via drone from the US to Canada. Though legal in the US, Glock pistols, for instance, are illegal and in high demand in Canada.

The challenge in either case is to intercept the smugglers retrieving the drug or weapon drops while they are in the act. Drones may only be active for seven to 15 minutes at a time, so the opportunity window to detect and respond effectively is a narrow one.

Field agents ideally need real-time, mapped airspace information for the activated sensor, with enough visual detail to move quickly and directly to the location. Specifics are important; verbally relayed information, by contrast, is often less precise, causing confusion or misunderstanding.

The CBP’s successful proof of concept used basic Resource Description Framework (RDF) triples, applying semantic capabilities to just this kind of information:

Sensor → Act of sensing → drone (SUAS, SUAV, vehicle, etc.)
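Rendered as Turtle, such a sensing event might look like the sketch below. All of the names here are illustrative placeholders, not CBP’s actual ontology:

```turtle
@prefix ex:  <https://example.com/cbp/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# A single act of sensing, linking the sensor that fired to what it detected
ex:_SensingEvent42
    a ex:ActOfSensing ;
    ex:sensedBy   ex:_Sensor17 ;                       # the triggered sensor
    ex:detected   ex:_Drone9 ;                         # an SUAS/SUAV, vehicle, etc.
    ex:occurredAt "2023-06-01T02:14:00Z"^^xsd:dateTime .
```

Because each act of sensing is a first-class resource with its own timestamp, events can be filtered by time and space, which is exactly the kind of qualification the test scenario required.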

In a recent test scenario, CBP collected 17,000 records that met specified time/space requirements for a qualified drone interdiction over a 30-day period.

The overall impression that Riccucci and Yockey conveyed was that DHS has both the budget and the commitment to tackle this and many other use cases using a transformed data-centric architecture. By capturing information in an interoperable format, the DHS has been apprehending the bad guys with greater frequency and precision.

Contributed by Alan Morrison

HR Tech and The Kitchen Junk Drawer

I often joke that when I started with Semantic Arts nearly two years ago, I had no idea a solution existed to a certain problem that I well understood. I had experienced many of the challenges and frustrations of an application-centric world but had always assumed it was just a reality of doing business. As an HR professional, I’ve heard over the years about companies having to pick the “best of the worst” technologies. Discussion boards are full of people dissatisfied with current solutions – and when they try new ones, they are usually dissatisfied with those too!

The more I have come to understand the data-centric paradigm, the more I have discovered its potential value in all areas of business, but especially in human resources. It came as no surprise to me when a recent podcast by Josh Bersin revealed that the average large company is using 80 to 100 different HR technology systems (link). Depending on who you ask, HR comprises twelve to fifteen key functions, meaning that we have an average of six applications for each key function. Even more ridiculously, many HR leaders would admit that there are probably even more applications in use that they don’t know about. Looking beyond HR at all core business processes, larger companies are using more than two hundred applications, and the number is growing by 10% per year, according to research by Okta from earlier this year (link). From what we at Semantic Arts have seen, the problem is actually much greater than this research indicates.

Why Is This a Problem?

Most everyone has experienced the headaches of such application sprawl. Employees often have to crawl through multiple systems, wasting time and resources, either to find data they need or to recreate the analytics required for reporting. As more systems come online to try to address gaps, employees are growing weary of learning yet another system that carries big promises but usually fails to deliver (link). Let’s not forget the enormous amount of time spent by HR Tech and other IT resources to ensure everything is updated, patched and working properly. Then, there is the near daily barrage of emails and calls from yet another vendor promising some incremental improvement or ROI that you can’t afford to miss (“Can I have just 15 minutes of your time?”).

Bersin’s podcast used a great analogy for this: the kitchen drawer problem. We go out and procure some solution, but it gets thrown into the drawer with all the other legacy junk. When it comes time to look in the drawer, either it’s so disorganized or we are in such a hurry that it seems more worthwhile to just buy another app than to actually take the time to sort through the mess.

Traditional Solutions

When it comes to legacy applications, companies don’t even know where to start. We don’t even know who is using which system, so we don’t dare shut off or replace anything. So we end up with a mess of piecemeal integrations that may solve the immediate issue but just kick the technical debt down the road. Sure, there are a few ETL and other integration tools out there that can be helpful, but without a unified data model and a broad plan, these initiatives usually end up in the drawer with all the other “flavor of the month” solutions.

Another route is to simply put a nice interface over the top of everything, such as ServiceNow or other similar solutions. This can enhance the employee experience by providing a “one stop shop” for information, but it does nothing to address the underlying issues. These systems have gotten quite expensive, and can run $50,000-$100,000 per year (link). The systems begin to look like ERPs in terms of price and upkeep, and eventually they become legacy systems themselves.

Others go out and acquire a “core” solution such as SAP, Oracle, or another ERP system. They hope that these solutions, together with the available extensions, will provide the same interface benefits. A company can then buy or build apps that integrate. Ultimately, these solutions are also expensive and become “black boxes” where data and its related insights are not visible to the user due to the complexity of the system. (Intentional? You decide…). So now you go out and either pay experts in the system to help you manipulate it or settle for whatever off-the-shelf capabilities and reporting you can find. (For one example of how this can go, see link).

A Better Path Forward

Many of the purveyors of these “solutions” would have you believe there is no better way forward; but those familiar with data-centricity know better. To be clear, I’m not a practitioner or technologist. I joined Semantic Arts in an HR role, and the ensuing two years have reshaped the way I see HR and especially HR information systems. I’ll give you a decent snapshot as I understand it, along with an offer: if you’re interested in the ins and outs of these things, I’d be happy to introduce you to someone who can answer your questions in greater detail.

Fundamentally, a true solution requires a mindset shift away from application silos and integration, towards a single, simple model that defines the core elements of the business, together with a few key applications that are bound to that core and speak the same language. This can be built incrementally, starting with specific use cases and expanding as it makes sense. This approach means you don’t need to have it “all figured out” from the start. With the adoption of an existing ontology, this is made even easier … but more on that later.

Once a core model is established, an organization can begin to deal methodically with legacy applications. You will find that over time many organizations go from legacy avoidance to legacy erosion, and eventually to legacy replacement. (See post on Incremental Stealth Legacy Modernization). This allows a business to slowly clean out that junk drawer and avoid filling it back up in the future (and what’s more satisfying than a clean junk drawer?).

Is this harder in the short term than traditional solutions? It may appear so on the surface, but really it isn’t. When a decision is made to start slowly, companies discover that the flexibility of semantic knowledge graphs allows for quick gains. Application development is less expensive and applications more easily modified as requirements change. Early steps help pay for future steps, and company buy-in becomes easier as stakeholders see their data come to life and find key business insights with ease.

For those who may be unfamiliar with semantic knowledge graphs, let me try to give a brief introduction. A graph database is a fundamental shift away from the traditional relational structure. When combined with formal semantics, a knowledge graph provides a method of storing and querying information that is more flexible and functional (more detail at link or link). Starting from scratch would be rather difficult, but luckily there are starter models (ontologies) available, including one we’ve developed in-house called gist, which is both free and freely available. By building on an established structure, you can avoid re-inventing the wheel.
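As a small taste of what building on gist looks like, here is a minimal Turtle sketch for an HR record. The `my:` terms are hypothetical; `gist:Person` and `gist:name` are actual gist terms, though the namespace IRI shown is that of recent gist releases and should be checked against the version you adopt:

```turtle
@prefix gist: <https://w3id.org/semanticarts/ns/ontology/gist/> .
@prefix my:   <https://example.com/hr/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# An HR-specific class grounded in the upper ontology
my:Employee
    a owl:Class ;
    rdfs:subClassOf gist:Person .

# A single employee record, queryable alongside any other gist-based data
my:_JaneDoe
    a my:Employee ;
    gist:name "Jane Doe" .
```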

HR departments looking to leverage AI and large language models in the future will find this data-centric transformation even more essential, but that’s a topic for another time.

Conclusion

HR departments face unique challenges. They deal with large amounts of information and must justify their spending as non-revenue-producing departments. The proliferation of systems and applications drains employee morale and productivity and represents a major source of budget waste.

By adopting data-centric principles and applying them intentionally in future purchasing and application development, HR departments can realize greater strategic insights while saving money and providing a richer employee experience.

Taken all the way to completion, adoption of these technologies and principles would mean business data stored in a single, secure location, with small apps or dashboards rapidly built and deployed as the business evolves. No more legacy systems, no more hidden data, no more frustration with systems that simply don’t work.

Maybe, just maybe, this model will provide a success story that leads the rest of the organization to adopt similar principles.

 

JT Metcalf is the Chief Administrative Officer at Semantic Arts, managing HR functions along with many other hats.

CASE STUDY: Morgan Stanley, Global Fortune 100 Financial Institution, Transforms Information & Knowledge Management

Morgan Stanley is one of the largest investment banking and wealth management firms, with offices in more than 42 countries and more than 60,000 employees, ranking 67th on the 2018 Fortune 500 list of the largest US corporations by total revenue. Headquartered in New York City, the organization faced challenges in information retrieval, records retention, and legal hold capabilities, with steep compliance fines as the potential cost of inaction. Securing data from outside threats is critical, but the difficulty of finding information inside the friendly firewall hamstrings the business’s ability to operate, even without regulatory pressures. With worldwide data expected to swell 10-fold by 2025, a better solution was needed. Leadership at Morgan Stanley solicited several consulting experts and chose Semantic Arts to guide the strategic resolution of this massive information sprawl while enabling greater information retrieval and easier user consumption.

“Information management,” part of the legal department, took the lead, as it was chartered with knowing about all data sets within the firm: structured, unstructured, and everything in between. A major undertaking for any group, let alone a global giant with divisions all over the world.

PROBLEM STATEMENT: Information management determined that existing traditional architectures and relational data structures were failing to keep pace with data growth and the management of information assets. The primary objective was a solution offering scale, extensibility, and an enhanced user search experience. Like other organizations entrenched in data silos and single ownership, information resided in many data sources (SQL, Oracle, SAP, SharePoint, Excel, PDF, videos, and shared files, to name a few), making accurate data aggregation difficult. Decades of integration had resulted in highly dependent systems and applications. In fact, changes to any data schema were laborious coding and testing exercises that yielded little business benefit. In short, it was problematic to access the right data and costly to make even simple changes.

STRATEGY: By collaborating with Semantic Arts, experts in data/digital transformation, a data strategy was established for better information management. After lengthy evaluation, the firm decided to implement semantic knowledge graphs and a flexible ontology built for future information growth. The approach offered strategic value in supporting numerous domain areas simultaneously, including risk management, regulatory compliance, asset management, and adviser information retrieval, while linking data across each domain. Additionally, an important advantage of a semantic knowledge graph approach is its architectural capacity for limitless extension and reuse across the enterprise. This factored into the long-term reasoning and vision of becoming data-centric.

APPROACH: Starting strategic initiatives like this can be particularly tricky, in that achieving a balance between building a foundation for future success and delivering immediate results can be a high-wire act in organizational politics. With the advice of Semantic Arts, a “Think Big and Start Small” initial phase of work was proposed and accepted. This involved building a core ontology in parallel with a domain model, with the two to be connected to build data relationships in future phases. This strategy addresses the mission of contextually enriching the organization’s data, which in turn can be leveraged for greater insight in business decisions and improved data governance.

Semantic Arts represents professional management consulting services for untangling the ad hoc patchwork of systems integration; turbo-charging new Knowledge and Information initiatives. We call it the “Data-Centric Revolution” that inverts the dependency between data models and application code. In short order, the code will become dependent on the shared information model. Join the Revolution!  

RESULTS: A small team of consultants and Morgan Stanley SMEs assembled for a 6-month assignment. During the engagement, results came quickly. Within the initial weeks, after loading the data into a triple store and applying some very simple natural language processing routines, the team took the firm from 0.5% tagging of information to 25%, a 50-fold increase in information classification with relatively nominal effort.

By incorporating Semantic Arts’ strategy of instituting a flexible ontology and knowledge graphs, the improved visibility and harmonization of information across multiple data sets quickly captured the attention of business capability owners; remarkably, existing technologies had achieved only 1% accuracy. Collaboratively, the team captured hundreds of regulatory jurisdictions used for promulgating rules. Linking this data with billions of internal documents from disparate databases gave contextual information surrounding a document or repository, enabling a self-assembling capability. Previously, aggregation was manually driven, inaccurate, clumsy and time-consuming.

OTHER DOMAINS JOIN IN: Follow-up engagements with Equity Research and Operations Resiliency soon followed as the changes made a tangible impact. Those domain teams have taken on smaller use cases to answer difficult questions while leveraging the core ontology foundation developed by the Semantic Arts consultants during the first initiative. The inherent ability of knowledge graphs to link data relationships can deliver a Siri-like experience, offering answers and recommendations and learning when tied to AI capabilities. Furthermore, because the information within the graph is connected in a single model, its contextual value is enriched. The business value of capturing knowledge multiplies as the connections between domains become realized. The beginnings are taking form: the removal of data silos, of replicated data, and of costly integration of application functionality.

CHANGING WALL STREET: Combining a strategic data plan with knowledge graphs as a companion solution is making a difference. Wall Street reports are now being unlocked with AskResearch chatbot capabilities, extracting value by surfacing hard-to-find information from hundreds of data sources. With coaching in best-practice ontology development, the Equities Research team has successfully continued to expand this graph’s initial use case.

“You have this historical archive sitting in a library and there is so much value embedded in it, but traditionally it has been hard to unlock that value because insights and data are fixed in monolithic PDFs.” – D’Arcy Carr, global head of research, editorial, and publishing

Claims of future time savings (in billions of dollars per year) are hard to quantify, but usage of the chatbot is clearly and steadily increasing. Leveraging knowledge graphs as the backbone for information retrieval was critical to intuitive search functionality, making self-service a reality for users.

FINANCIAL INDUSTRY INFORMATION FUTURE: According to Forbes, the ability to leverage AI and machine learning in tandem with knowledge graphs is the future of the financial industry. Their use will soon shift from a competitive edge to a must-have. Further discussions between Semantic Arts and marketing and HR innovators at Morgan Stanley are in flight, with more dynamic results pending.


Extending an upper-level ontology (like GIST)

Michael Sullivan is a Principal Cloud Solutions Architect at Oracle. Article reprinted with permission (original is here)

If you have been following my blogs over the past year or so, then you will know I am a big fan of adopting an upper-level ontology to help bootstrap your own bespoke ontology project. Of the available upper-level ontologies I happen to like gist as it embraces a “less is more” philosophy.

Given that this is 3rd party software with its own lifecycle, how does one “merge” such an upper ontology with your own? Like most things in life, there are two primary ways.

CLONE MODEL

This approach is straightforward: simply clone the upper ontology and then modify/extend it directly as if it were your own (being sure to retain any copyright notice). The assumption here is that you will change the “gist” domain into something else like “mydomain”. The benefit is that you don’t have to risk any 3rd party updates affecting your project down the road. The downside is that you lose out on the latest enhancements/improvements over time; any you wish to adopt would have to be manually refactored into your own ontology.

As the inventors of gist have many dozens of man-years of hands-on experience with developing and implementing ontologies for dozens of enterprise customers, this is not an approach I would recommend for most projects.

EXTEND MODEL

Just as when you extend any 3rd party software library you do so in your own namespace, you should also extend an upper-level ontology in your own namespace. This involves just a couple of simple steps:

First, declare your own namespace as an owl ontology, then import the 3rd party upper-level ontology (e.g. gist) into that ontology. Something along the lines of this:

<https://ont.mydomain.com/core> 
    a owl:Ontology ;
    owl:imports <https://ontologies.semanticarts.com/o/gistCore11.0.0> ;
    .

Second, define your “extended” classes and properties, referencing appropriate gist subclasses, subproperties, domains, and/or range assertions as needed. A few samples are shown below (where “my” is the prefix for your ontology domain):

my:isFriendOf 
     a owl:ObjectProperty ;
     rdfs:domain gist:Person ;
     .
my:Parent 
    a owl:Class ;
    rdfs:subClassOf gist:Person ;
    .
my:firstName 
    a owl:DatatypeProperty ;
    rdfs:subPropertyOf gist:name ;
    .

The above definitions would allow you to update to new versions of the upper-level ontology* without losing any of your extensions. Simple, right?
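To see the extension at work, here is a hypothetical scrap of instance data using the terms defined above (the individuals and the `my:` prefix IRI are made up for illustration, and the gist namespace IRI should be checked against the version you import):

```turtle
@prefix my:   <https://ont.mydomain.com/core/> .
@prefix gist: <https://w3id.org/semanticarts/ns/ontology/gist/> .

my:_Alice
    a my:Parent ;               # by subclassing, Alice is also a gist:Person
    my:firstName "Alice" ;      # implies gist:name "Alice" via the subproperty
    my:isFriendOf my:_Bob .     # the rdfs:domain lets a reasoner infer gist:Person

my:_Bob a gist:Person .
```

None of these assertions touch the gist namespace itself, which is why upgrading the imported ontology leaves them intact.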

*When a 3rd party upgrades the upper-level ontology to a new major version — defined as non-backward compatible — you may find changes that need to be made to your extension ontology; as a hypothetical example, if Semantic Arts decided to remove the class gist:Person, the assertions made above would no longer be compatible. Fortunately, when it comes to major updates Semantic Arts has consistently provided a set of migration scripts which assist with updating your extended ontology as well as your instance data. Other 3rd parties may or may not follow suit.

Thanks to Rebecca Younes of Semantic Arts for providing insight and clarity into this.

Knowledge Graph Modeling: Time series micro-pattern using GIST

Michael Sullivan is a Principal Cloud Solutions Architect at Oracle. Article reprinted with permission (original is here)

For any enterprise, being able to model time series is more than just important; in many cases it is critical. There are many examples, but some trivial ones include “Person is employed by Employer” (employment date range), “Business has Business Address” (established-location date range), “Manager supervises Member of Staff” (supervision date range), and so on. But many developers who dabble in RDF graph modeling end up scratching their heads: how can one pull that off if one can’t add attributes to an edge? While it is true that one can always model things using either reification or RDF quads (see my previous blog, semantic rdf properties), now might be a good time to take a step back and explore how the semantic gurus at Semantic Arts have neatly solved time-series modeling starting with version 11 of GIST, their free upper-level ontology (link below).

First, a little history. The core concept of RDF is to “connect” entities via predicates (a.k.a. “triples”) as shown below. Note that either predicate could be inferred from the other, bearing in mind that you need to maintain at least one explicit predicate between the two, as there is no such thing in RDF as a subject without a predicate/object. Querying such data is also super simple.

Typical entity to entity relationships in RDF

So far so good. In fact, this is about as simple as it gets. But what if we wanted to later enrich the above simple semantic relationship with time-series? After all, it is common to want to know WHEN Mark supervised Emma. With out-of-the-box RDF you can’t just hang attributes on the predicates (I’d argue that this simplistic way of thinking is why property graphs tend to be much more comforting to developers). Further, we don’t want to throw out our existing model and go through the onerous task of re-modeling everything in the knowledge graph. Instead, what if we elevated the specific “supervises” relationship between Mark and Emma to become a first-class citizen? What would that look like? I would suggest that a “relation” entity that becomes a placeholder for the “Mark Supervises Emma” relationship would fit the bill. This entity would in turn reference Mark via a “supervision by” predicate while referencing Emma via a “supervision of” predicate.

Ok, now that we have a first-class relation entity, we are ready to add additional time attributes (i.e. triples), right? Well, not so fast! The key insight is that in GIST, the “actual end date” and “actual start date” predicates specify the precision of the data property (rather than letting the data value specify the precision), and in our particular use case we want that precision to be the overall date, not any specific time. Hence our use of gist:actualStartDate and gist:actualEndDate here instead of something more time-precise.

The rest is straightforward as depicted in the micro-pattern diagram shown immediately below. Note that in this case, BOTH the previous “supervised by” and “supervises” predicates connecting Mark to Emma directly can be — and probably should be — inferred! This will allow time-series to evolve and change over time while enabling queryable (inferred) predicates to always be up-to-date and in-sync. It also means that previous queries using the old model will continue to work. A win-win.

Time series micro-pattern using GIST
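In Turtle, the micro-pattern in the diagram might be sketched as follows. The relation class and role predicates are illustrative placeholders; gist:actualStartDate and gist:actualEndDate are the GIST properties discussed above, and the gist namespace IRI shown should be checked against your GIST version:

```turtle
@prefix ex:   <https://www.example.com/> .
@prefix gist: <https://w3id.org/semanticarts/ns/ontology/gist/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# The supervision relationship, elevated to a first-class entity
ex:_MarkSupervisesEmma
    a ex:Supervision ;
    ex:supervisionBy ex:_Mark ;                   # who supervises
    ex:supervisionOf ex:_Emma ;                   # who is supervised
    gist:actualStartDate "2020-03-01"^^xsd:date ;
    gist:actualEndDate   "2022-08-31"^^xsd:date .

# The direct edge between Mark and Emma is now inferred, not asserted:
# ex:_Mark ex:supervises ex:_Emma .
```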

A clever ontological detail not shown here: a temporal relation such as “Mark supervises Emma” must be connected, via gist:isConnectedTo, to a minimum of two objects; this cardinality is defined in the GIST ontology itself and is thus inherited. The result is data integrity managed by the semantic database itself! Additionally, you can see the richness of the GIST “at date time” data properties most clearly in the hierarchical model expressed in the latest v11 ontology (see Protégé screenshot below). This allows the modeler to specify the precision of the start and end date times, as well as to distinguish something that is “planned” from something that is “actual”. Overall, a very flexible and extensible upper ontology that will meet most enterprises’ requirements.

"at date time" data property hierarchy as defined in GIST v11

Further, this overall micro-pattern, wherein we elevate relationships to first-class status, is infinitely re-purposable in a whole host of other governance and provenance modeling use-cases that enterprises typically require. I urge you to explore and expand upon this simple yet powerful pattern and leverage it for things other than time-series!

One more thing…

Given that with this micro-pattern we’ve essentially elevated relations to be first class citizens — just like in classic Object Role Modeling (ORM) — we might want to consider also updating the namespaces of the subject/predicate/object domains to better reflect the objects and roles. After all, this type of notation is much more familiar to developers. For example, the common notation object.instance is much more intuitive than owner.instance. As such, I propose that the traditional/generic use of “ex:” as used previously should be replaced with self-descriptive prefixes that can represent both the owner as well as the object type. This is good for readability and is self-documenting. And ultimately doing so may help developers become more comfortable with RDF/SPARQL over time. For example:

  • ex:_MarkSupervisesEmma becomes rel:_MarkSupervisesEmma
  • ex:supervisionBy becomes role:supervisionBy
  • ex:_Mark becomes pers:_Mark

Where:

@prefix rel: <https://www.example.com/relation/> .
@prefix role: <https://www.example.com/role/> .
@prefix pers: <https://www.example.com/person/> .
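Putting those prefixes to work, the relation entity from the micro-pattern would then read as follows (the role predicates remain the illustrative ones from earlier):

```turtle
@prefix rel:  <https://www.example.com/relation/> .
@prefix role: <https://www.example.com/role/> .
@prefix pers: <https://www.example.com/person/> .

rel:_MarkSupervisesEmma
    role:supervisionBy pers:_Mark ;
    role:supervisionOf pers:_Emma .
```

At a glance, a developer can now tell that the subject is a relation, the predicates are roles, and the objects are people.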


Alan Morrison: Zero-Copy Integration and Radical Simplification

Dave McComb’s book Software Wasteland underscored a fundamental problem: Enterprise software sometimes costs 1,000 times more than it ought to. The poster child for cost overruns highlighted in the book was Healthcare.gov, a public registration system for the US Affordable Care Act, enacted in 2010. By 2018, the US Federal government had spent $2.1 billion to build and implement the system. Most of that money was wasted. The government ended up adopting many of the design principles embodied in an equivalent system called HealthSherpa, which cost $1 million to build and implement.

In an era where the data-centric architecture Semantic Arts advocates should be the norm, application-centric architecture still predominates. But data-centric architecture doesn’t just reduce the cost of applications. It also attacks the data duplication problem attributable to poor software design. This article explores how expensive data duplication has become, and how data-centric, zero-copy integration can put enterprises on a course to simplification.

Data sprawl and storage volumes

In 2021, Seagate became the first company to ship three zettabytes worth of hard disks. It took the company 36 years to ship the first zettabyte, six years to ship the second, and only one additional year to ship the third.

The company’s first product, the ST-506, was released in 1980. The ST-506 hard disk, when formatted, stored five megabytes (a megabyte being 1000² bytes). By comparison, an IBM RAMAC 305, introduced in 1956, stored five to ten megabytes. The RAMAC 305 weighed 10 US tons (the equivalent of nine metric tonnes). By contrast, the Seagate ST-506, 24 years later, weighed five US pounds (or 2.27 kilograms).

A zettabyte is the equivalent of 7.3 trillion MP3 files or 30 billion 4K movies, according to Seagate. When considering zettabytes:

  • 1 zettabyte equals 1,000 exabytes.
  • 1 exabyte equals 1,000 petabytes.
  • 1 petabyte equals 1,000 terabytes.

IDC predicts that the world will generate 178 zettabytes of data by 2025. At that pace, “The Yottabyte Era” would succeed The Zettabyte Era by 2030, if not earlier.

The cost of copying

The question becomes, how much of the data generated will be “disposable” or unnecessary data? In other words, how much data do we actually need to generate, and how much do we really need to store? Aren’t we wasting energy and other resources by storing more than we need to?

Let’s put it this way: If we didn’t have to duplicate any data whatsoever, the world would only have to generate 11 percent of the data it currently does. In 2021 terms, we’d only need to generate 8.7 zettabytes of data, compared with the 78 zettabytes we actually generated worldwide over the course of that year.

Moreover, Statista estimates that the ratio of unique to replicated data stored worldwide will decline to 1:10 from 1:9 by 2024. In other words, the trend is toward more duplication, rather than less.

The cost of storing oodles of data is substantial. Computer hardware guru Nick Evanson, quoted by Gerry McGovern in CMSwire, estimated in 2020 that storing two yottabytes would cost $58 trillion. If the cost per byte stored stayed constant, 40 percent of the world’s economic output would be consumed in 2035 by just storing data.

Clearly, we should be incentivizing what graph platform Cinchy calls “zero-copy integration”: a way of radically reducing unnecessary data duplication. After all, the one thing we don’t have is “zero-cost” storage. But first, let’s finish the cost story; more on the solution side and zero-copy integration later.

The cost of training and inferencing large language models

Model development and usage expenses are just as concerning. The cost of training machines to learn with the help of curated datasets is one thing, but the cost of inferencing (the use of the resulting model to make predictions using live data) is another.

“Machine learning is on track to consume all the energy being supplied, a model that is costly, inefficient, and unsustainable,” Brian Bailey in Semiconductor Engineering pointed out in 2022. AI model training expense has increased with the size of the datasets used, but more importantly, as the number of parameters increases by a factor of four, the amount of energy consumed in the process increases by a factor of 18,000. Some AI models included as many as 150 billion parameters in 2022; the training of the LLM behind the more recent ChatGPT involves 180 billion parameters. Training can often be a continuous activity to keep models up to date.

But the applied model aspect of inferencing can be enormously costly. Consider the AI functions in self-driving cars, for example. Major car makers sell millions of cars a year, and each car sold uses the maker’s model in its own unique way. As much as 70 percent of the energy consumed in self-driving car applications could be due to inference, says Godwin Maben, a scientist at electronic design automation (EDA) provider Synopsys.

Data Quality by Design

Transfer learning is a machine learning term that refers to how machines can be taught to generalize better. It’s a form of knowledge transfer. Semantic knowledge graphs can be a valuable means of knowledge transfer because they describe contexts and causality well with the help of relationships. 

Well-described knowledge graphs provide the context in contextual computing. Contextual computing, according to the US Defense Advanced Research Projects Agency (DARPA), is essential to artificial general intelligence.

A substantial percentage of the training set data used in large language models is more or less duplicate data, precisely because poorly described context leads to a lack of generalization ability. That is one reason the only AI we have is narrow AI, and one reason large language models are so inefficient.

But what about the storage cost problem associated with data duplication? Knowledge graphs can help with that problem also, by serving as a means for logic sharing. As Dave has pointed out, knowledge graphs facilitate model-driven development when applications are written to use the description or relationship logic the graph describes. Ontologies provide the logical connections that allow reuse and thereby reduce the need for duplication.
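To make the logic-sharing idea concrete, here is a minimal sketch using plain Python triples; the class and property names are hypothetical, not drawn from any particular published ontology. Any application that loads the same ontology derives the same conclusions, so the logic is written once and reused rather than duplicated:

```python
# Shared ontology: relationship logic written once, reused everywhere.
ontology = {
    ("Customer", "subClassOf", "Party"),
    ("Supplier", "subClassOf", "Party"),
}

# Instance data, as it might come from one application's dataset.
data = {
    ("acme", "type", "Supplier"),
    ("jane", "type", "Customer"),
}

def types_of(entity: str) -> set[str]:
    """Infer all classes of an entity by following subClassOf links."""
    found = {o for s, p, o in data if s == entity and p == "type"}
    changed = True
    while changed:
        changed = False
        for s, p, o in ontology:
            if p == "subClassOf" and s in found and o not in found:
                found.add(o)
                changed = True
    return found

print(types_of("acme"))  # {'Supplier', 'Party'}
```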

FAIR data and Zero-Copy Integration

How do you get others who are concerned about data duplication on board with semantics and knowledge graphs? By encouraging data and coding discipline that’s guided by FAIR principles. As Dave pointed out in a December 2022 blog post, semantic graphs and FAIR principles go hand in hand: https://www.semanticarts.com/the-data-centric-revolution-detour-shortcut-to-fair/

Adhering to the FAIR principles, formulated by a group of scientists in 2016, promotes reusability by “enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals.”  When it comes to data, FAIR stands for Findable, Accessible, Interoperable, and Reusable. FAIR data is easily found, easily shared, easily reused quality data, in other words. 

FAIR data implies the data quality needed to do zero-copy integration.
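What a FAIR-ready record looks like in practice can be sketched as metadata fields mapping to each of the four principles; the field names here are illustrative, not a formal standard:

```python
# An illustrative FAIR-style metadata record (field names are hypothetical).
record = {
    "id": "doi:10.0000/example",                       # Findable: persistent, unique ID
    "access_url": "https://example.org/data.csv",      # Accessible: standard protocol
    "schema": "https://example.org/ontology#Dataset",  # Interoperable: shared vocabulary
    "license": "CC-BY-4.0",                            # Reusable: clear usage terms
}

def is_fair(rec: dict) -> bool:
    """Check that the record carries the minimum FAIR-supporting fields."""
    return all(rec.get(k) for k in ("id", "access_url", "schema", "license"))

print(is_fair(record))  # True
```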

Bottom line: When companies move to contextual computing by using knowledge graphs to create FAIR data and do model-driven development, it’s a win-win. More reusable data and logic means less duplication, less energy, less labor waste, and lower cost. The term “zero-copy integration” underscores those benefits.

Alan Morrison is an independent consultant and freelance writer on data tech and enterprise transformation. He is a contributor to Data Science Central and TechTarget sites with over 35 years of experience as an analyst, researcher, writer, editor and technology trends forecaster, including 20 years in emerging tech R&D at PwC.

The Data-Centric Revolution: An Interview with Dave McComb

Are today’s economics of software projects and support inevitable? No.

They are a product of the fact that the industry has collectively chosen the application-centric route to implementing new functionality. When every business problem calls for a new application and every new application comes with its own database, what you really get is runaway complexity. Many clients have thousands of applications. But it isn’t inevitable. A few firms have shown the way out: data-centric development.

In this ground-breaking interview with Business Rules Community, Dave McComb explains what being ‘data-centric’ is about and how it can be made to work.

Read more at: The Data-Centric Revolution: An Interview with Dave McComb (Features) (brcommunity.com)

Six Enterprise Knowledge Graph Anti-Patterns

Anti-pattern #1 — Agreeing with the Status Quo

Anti-pattern #2 — Fad Surfing

Anti-pattern #3 — Too Small

Anti-pattern #4 — Too Big

Anti-pattern #5 — Data Governance

Anti-pattern #6 — Data Hoarding

Need a sherpa to get up the mountain?

CONTACT US

Originally posted at Medium.com

The Greatest Sin of Tabular Data

We recently came across this great article titled “The greatest sin of tabular data”. It is an excellent summary of the kind of work we do for our clients and how they benefit.

You can read it at The greatest sin of tabular data · A blog @ nonodename.com

Capturing the meaning of data is an elusive process. If 80% of data science is simply data wrangling, how can we do better at actually providing value by making sense of that data?

With a disciplined approach and by leveraging RDF capabilities, Semantic Arts can help create clear, well-defined data, saving time and money and driving true value instead of getting bogged down in simply trying to understand the data.

As stated by the author, “We can do better!”

Reach out to Semantic Arts today to see how we can help.

Original article at nonodename.com, by Dan Bennett (via LinkedIn post).

Get the gist: start building simplicity now

While organizing data has always been important, a noticeably profound interest in optimizing information models with semantic knowledge graphs has arisen. LinkedIn and Airbnb, along with giants Google and Amazon, use graphs; but without a model connecting concepts with rules for membership, buyer recommendations and enhanced searchability (“follow your nose”) capabilities would lack accuracy.
Drum roll, please … introducing the ontology.
It is a model that supports semantic knowledge graph reasoning, inference, and provenance enablement. Think of an ontology as the brain giving messages to the nervous system (the knowledge graph). An ontology organizes data into well-defined categories with clearly defined relationships. This model represents a foundational starting point that allows humans and machines to read, understand, and infer knowledge based on its classification. In short, it automatically figures out what is similar and what is different.
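A toy sketch of that last point: once an ontology assigns entities to well-defined categories, software can judge what is similar by how many categories two entities share. The category names below are hypothetical:

```python
# Entities mapped to ontology categories (names are hypothetical).
categories = {
    "drone": {"Vehicle", "Aircraft", "Unmanned"},
    "helicopter": {"Vehicle", "Aircraft", "Manned"},
    "truck": {"Vehicle", "GroundVehicle"},
}

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two entities' ontology categories."""
    ca, cb = categories[a], categories[b]
    return len(ca & cb) / len(ca | cb)

print(similarity("drone", "helicopter"))  # 0.5: share Vehicle and Aircraft
print(similarity("drone", "truck"))       # 0.25: share only Vehicle
```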
We’re asked often, where do I start?
Enter ‘gist’, a minimalist business ontology (model) to springboard the transition of information into knowledge. With more than a decade of refinement grounded in simplicity, ‘gist’ is designed to have the maximum coverage of typical business ontology concepts with the fewest primitives and the least ambiguity. ‘gist’ is available for free under a Creative Commons license and is being applied and extended in a number of business use cases across countless industries.
Recently, senior ontologist Michael Uschold has been sharing an introductory overview of ‘gist’, which is maintained by Semantic Arts.
One compelling difference from most publicly available ontologies is that ‘gist’ has an active governance and best-practices community, called the gist Council. The council meets virtually on the first Thursday of every month to discuss how to use ‘gist’ and to make suggestions on its evolution.
See Part I of Michael’s introduction here:

See Part II of Michael’s introduction here:

Stay tuned for the final installment!

Interested in gist? Visit Semantic Arts – gist

See more informative videos on Semantic Arts – YouTube