Dave McComb’s book Software Wasteland underscored a fundamental problem: Enterprise software sometimes costs 1,000 times more than it ought to. The poster child for cost overruns was highlighted in the book was Healthcare.gov, a public registration system for the US Affordable Care Act, enacted in 2010. By 2018, the US Federal government had spent $2.1 billion to build and implement the system. Most of that money was wasted. The government ended up adopting many of the design principles embodied in an equivalent system called HealthSherpa, which cost $1 million to build and implement.
In an era where the data-centric architecture Semantic Arts advocates should be the norm, application-centric architecture still predominates. But data-centric architecture doesn’t just reduce the cost of applications. It also attacks the data duplication problem attributable to poor software design. This article explores how expensive data duplication has become, and how data-centric, zero-copy integration can put enterprises on a course to simplification.
Data sprawl and storage volumes
In 2021, Seagate became the first company to ship three zettabytes worth of hard disks. It took them 36 years to ship the first zettabyte. six years to ship the second zettabyte, and only one additional year to ship the third zettabyte.
The company’s first product, the ST-506, was released in 1980. The ST-506 hard disk, when formatted, stored five megabytes (10002). By comparison, an IBM RAMAC 305, introduced in 1956, stored five to ten megabytes. The RAMAC 305 weighed 10 US tons (the equivalent of nine metric tonnes). By contrast, the Seagate ST-506, 24 years later, weighed five US pounds (or 2.27 kilograms).
A zettabyte is the equivalent of 7.3 trillion MP3 files or 30 billion 4K movies, according to Seagate. When considering zettabytes:
- 1 zettabyte equals 1,000 exabytes.
- 1 exabyte equals 1,000 petabytes.
- 1 petabyte equals 1,000 terabytes.
IDC predicts that the world will generate 178 zettabytes of data by 2025. At that pace, “The Yottabyte Era” would succeed The Zettabyte Era by 2030, if not earlier.
The cost of copying
The question becomes, how much of the data generated will be “disposable” or unnecessary data? In other words, how much data do we actually need to generate, and how much do we really need to store? Aren’t we wasting energy and other resources by storing more than we need to?
Let’s put it this way: If we didn’t have to duplicate any data whatsoever, the world would only have to generate 11 percent of the data it currently does. In 2021 terms, we’d only need to generate 8.7 zettabytes of data, compared with the 78 zettabytes we actually generated worldwide over the course of that year.
Moreover, Statista estimates that the ratio of unique to replicated data stored worldwide will decline to 1:10 from 1:9 by 2024. In other words, the trend is toward more duplication, rather than less.
The cost of storing oodles of data is substantial. Computer hardware guru Nick Evanson, quoted by Gerry McGovern in CMSwire, estimated in 2020 that storing two yottabytes would cost $58 trillion. If the cost per byte stored stayed constant, 40 percent of the world’s economic output would be consumed in 2035 by just storing data.
Clearly, we should be incentivizing what graph platform Cinchy calls “zero-copy integration”--a way of radically reducing unnecessary data duplication. The one thing we don’t have is “zero-cost” storage. But first, let’s finish the cost story. More on the solution side and zero-copy integration later.
The cost of training and inferencing large language models
Model development and usage expenses are just as concerning. The cost of training machines to learn with the help of curated datasets is one thing, but the cost of inferencing–the use of the resulting model to make predictions using live data–is another.
“Machine learning is on track to consume all the energy being supplied, a model that is costly, inefficient, and unsustainable,” Brian Bailey in Semiconductor Engineering pointed out in 2022. AI model training expense has increased with the size of the datasets used, but more importantly, as the amount of parameters increases by four, the amount of energy consumed in the process increases by 18,000 times. Some AI models included as many as 150 billion parameters in 2022. The more recent ChatGPT LLM Training includes 180 billion parameters. Training can often be a continuous activity to keep models up to date.
But the applied model aspect of inferencing can be enormously costly. Consider the AI functions in self-driving cars, for example. Major car makers sell millions of cars a year, and each one they sell is utilizing the same carmaker’s model in a unique way. 70 percent of the energy consumed in self-driving car applications could be due to inference, says Godwin Maben, a scientist at electronic design automation (EDA) provider Synopsys.
Data Quality by Design
Transfer learning is a machine learning term that refers to how machines can be taught to generalize better. It’s a form of knowledge transfer. Semantic knowledge graphs can be a valuable means of knowledge transfer because they describe contexts and causality well with the help of relationships.
Well-described knowledge graphs provide the context in contextual computing. Contextual computing, according to the US Defense Advanced Research Projects Agency (DARPA), is essential to artificial general intelligence.
A substantial percentage of training set data used in large language models is more or less duplicate data, precisely because of poorly described context that leads to a lack of generalization ability. Thus the reason why the only AI we have is narrow AI. And thus the reason large language models are so inefficient.
But what about the storage cost problem associated with data duplication? Knowledge graphs can help with that problem also, by serving as a means for logic sharing. As Dave has pointed out, knowledge graphs facilitate model-driven development when applications are written to use the description or relationship logic the graph describes. Ontologies provide the logical connections that allow reuse and thereby reduce the need for duplication.
FAIR data and Zero-Copy Integration
How do you get others who are concerned about data duplication on board with semantics and knowledge graphs? By encouraging data and coding discipline that’s guided by FAIR principles. As Dave pointed out in a December 2022 blogpost, semantic graphs and FAIR principles go hand in hand. https://www.semanticarts.com/the-data-centric-revolution-detour-shortcut-to-fair/
Adhering to the FAIR principles, formulated by a group of scientists in 2016, promotes reusability by “enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals.” When it comes to data, FAIR stands for Findable, Accessible, Interoperable, and Reusable. FAIR data is easily found, easily shared, easily reused quality data, in other words.
FAIR data implies the data quality needed to do zero-copy integration.
Bottom line: When companies move to contextual computing by using knowledge graphs to create FAIR data and do model-driven development, it’s a win-win. More reusable data and logic means less duplication, less energy, less labor waste, and lower cost. The term “zero-copy integration” underscores those benefits.
Alan Morrison is an independent consultant and freelance writer on data tech and enterprise transformation. He is a contributor to Data Science Central and TechTarget sites with over 35 years of experience as an analyst, researcher, writer, editor and technology trends forecaster, including 20 years in emerging tech R&D at PwC.