At some point, full-stack data-centric architectures will be available to buy, to use as a service, or to adopt as an open source project. At the moment, as far as we know, no full-stack data-centric architecture is available for direct implementation. This means that early adopters will have to roll their own.
This is what the early adopters I’m covering in my next book have done and, I expect, what the current crop of early adopters will need to do for the next year or two at least.
I am writing a book that will describe in much greater detail the considerations that will go into each layer in the architecture.
This paper will outline what needs to be considered to give people an idea of the scope of such an undertaking. You might have some of these layers already covered.
Simplicity
There are many layers to this architecture, and at first glance it may appear complex. I think the layers are a pretty good separation of concerns, and rather than adding to the complexity, I believe they may reduce it.
As you review the layers, do so through the prism of the two driving APIs. There will be more than just these two APIs and we will get into the additional ones, as appropriate, but this is not going to be the usual Swiss army knife of a whole lot of APIs, with each one doing just a little bit. The APIs are of course RESTful.
The core is composed of two APIs (with our working titles):
- ExecuteNamedQuery: This API assumes a SPARQL query has been stored in the triple store and given a name. The query is also associated with a set of substitutable parameters. At run time, the client sends the name of the query to the server along with the parameter names and values. The back end fetches the query, rewrites it with the parameter values in place, executes it, and returns the results to the client. Note that if the front end did not know the names of the available queries, it could issue another named query that returns all the available named queries (with their parameters). This also implies the existence of an API that gets the queries into the database, but we’ll cover that in the appropriate layer when we get to it.
- DeltaTriples: This API accepts two arrays of triples as its payload. One is the “adds” array, which lists the new triples that the server needs to create, and the other is the “deletes” array, which lists the triples to be removed. This puts a burden on the client. The client will be constructing a UI from the triples it receives in response to a request, allowing a user to change data interactively, and then evaluating what changed. This part isn’t as hard as it sounds when you consider that order is unimportant with triples, so the before and after states can be compared as sets. There will be quite a lot going on with this API as we descend the stack, but the essential idea is that this API is the single route through which all updates pass, and it will ultimately result in an ACID-compliant transaction being committed to the triple store.
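To make the two APIs a little more concrete, here is a minimal sketch of the work each one implies: parameter substitution on the server side for ExecuteNamedQuery, and set differencing on the client side for DeltaTriples. Everything named here is an assumption for illustration only: the in-memory NAMED_QUERIES dictionary stands in for queries stored in the triple store, and the query name, the example.org predicate, the "$name" placeholder syntax, and the function names are all hypothetical.

```python
from string import Template

Triple = tuple[str, str, str]

# Hypothetical stand-in for named queries stored in the triple store.
NAMED_QUERIES = {
    "personByName": Template(
        'SELECT ?person WHERE { ?person <http://example.org/name> "$name" }'
    ),
}


def render_named_query(name: str, params: dict[str, str]) -> str:
    """Fetch the stored query and put the supplied parameter values in place."""
    # A real back end would also validate or escape the values
    # to guard against query injection; this sketch does not.
    return NAMED_QUERIES[name].substitute(params)


def compute_delta(before: set[Triple], after: set[Triple]) -> dict[str, list[Triple]]:
    """Compare what the client received with what the user left behind."""
    # Order is unimportant with triples, so two set differences are enough.
    return {
        "adds": sorted(after - before),
        "deletes": sorted(before - after),
    }
```

Note that changing a single value shows up as one delete plus one add, since a triple has no notion of being modified in place.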
I’m going to proceed from the bottom (center) of the architecture up, with consideration for how these two key APIs will be influenced by each of the layers.
A graphic that ties this all together appears at the end of this article.
Data Layer
At the center of this architecture is the data. It would be embarrassing if something else were at the center of the data-centric architecture. The grapefruit wedges here are each meant to represent a different repository. There will be more than one repository in the architecture.
The darker yellow ones on the right are meant to represent repositories that are more highly curated. The lighter ones on the left represent those less curated (perhaps data sets retrieved from the web). The white wedge is a virtual repository. The architecture knows where the data is but resolves it at query time. Finally, the crosshatching represents provenance data. In most cases, the provenance data will be in each repository, so this is just a visual cue.
The two primary APIs bottom out here, and become queries and updates.
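As one illustration of what "becoming an update" might look like, the sketch below renders an adds/deletes payload as a single SPARQL 1.1 Update request using DELETE DATA and INSERT DATA. The function name is hypothetical, and the sketch assumes the triple terms arrive already serialized as SPARQL terms (IRIs in angle brackets, literals in quotes). Whether the two operations commit as one transaction depends on the triple store, so treat the atomicity as an assumption to verify.

```python
Triple = tuple[str, str, str]


def to_sparql_update(adds: list[Triple], deletes: list[Triple]) -> str:
    """Render a delta as one SPARQL Update request (hypothetical helper)."""

    def block(keyword: str, triples: list[Triple]) -> str:
        # Terms are assumed to be valid, already-serialized SPARQL terms.
        body = " ".join(f"{s} {p} {o} ." for s, p, o in triples)
        return f"{keyword} DATA {{ {body} }}"

    operations = []
    if deletes:
        operations.append(block("DELETE", deletes))
    if adds:
        operations.append(block("INSERT", adds))
    # SPARQL 1.1 Update separates operations in one request with ';'.
    return " ;\n".join(operations)
```

Deletes are emitted before adds so that a changed value is removed before its replacement is written.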
Federation Layer
One layer up is the ability to federate a query over multiple repositories. At this time, we do not believe it will be feasible or desirable to spread an update over more than one repository (this would require the semantic equivalent of a two-phase commit). In most implementations this will be a combination of the native abilities of a triple store, reliance on support for standards-based federation, and bespoke capability. The federation layer will be interpreting the ExecuteNamedQuery requests.
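The bespoke end of that spectrum could be as simple as the sketch below, which sends the same query to every repository and merges the rows. This is only one possible reading of the federation layer, not a description of any particular product: the Repository callable type and the function name are hypothetical, and a naive union like this cannot answer queries whose joins span repositories. Standards-based federation presumably refers to the SERVICE keyword of SPARQL 1.1 Federated Query, which lets the triple store handle such joins itself.

```python
from typing import Callable

Row = dict[str, str]
# Hypothetical interface: a repository takes SPARQL text and returns result rows.
Repository = Callable[[str], list[Row]]


def federated_select(query: str, repositories: dict[str, Repository]) -> list[Row]:
    """Run one query against every repository and merge the rows."""
    merged: list[Row] = []
    seen: set[tuple] = set()
    for run in repositories.values():
        for row in run(query):
            # Rows are dictionaries, so build a hashable key to spot duplicates.
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged
```

Updates are deliberately left out of this sketch, in line with the point above that an update should land in exactly one repository.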