Bob DuCharme, who has recently been exploring a variety of triple stores, has an insightful post up asking questions about the idea of named graphs in RDF stores. Since the Open Anzo repository is based around named graphs (as are all of the Cambridge Semantics products built on Open Anzo, such as Anzo for Excel), I thought I’d take a stab at giving our answers to Bob’s questions:
1. If graph membership is implemented by using the fourth part of a quad to name the graph that the triple belongs to, then a triple can only belong directly to one graph, right?
This is correct. In Open Anzo, triples are really quads, in that every subject-predicate-object triple has a fourth component, a URI that designates the named graph of the triple. The named graph with URI u comprises all of the triples (quads) that have u as their fourth component.
Of course, this means that the same subject-predicate-object triple can exist in multiple named graphs. In that case, each occurrence is a distinct quad: it can be removed from one named graph without affecting its presence in the others.
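To make the quad model concrete, here is a minimal in-memory sketch in Python. It is not Open Anzo's API or storage layout; the `QuadStore` class, its method names, and the example.org URIs are all made up for illustration. It shows only the two behaviors described above: a named graph is the set of quads sharing a fourth component, and the same triple can be removed from one graph while remaining in another.

```python
from collections import namedtuple

# A quad is a subject-predicate-object triple plus the URI of its named graph.
Quad = namedtuple("Quad", ["subject", "predicate", "object", "graph"])


class QuadStore:
    """Toy quad store (hypothetical; not the Open Anzo API)."""

    def __init__(self):
        self._quads = set()

    def add(self, s, p, o, g):
        self._quads.add(Quad(s, p, o, g))

    def remove(self, s, p, o, g):
        # Removes the triple from graph g only; other graphs are untouched.
        self._quads.discard(Quad(s, p, o, g))

    def graph(self, g):
        """Return all triples whose fourth component is g."""
        return {(q.subject, q.predicate, q.object)
                for q in self._quads if q.graph == g}

    def graphs_containing(self, s, p, o):
        """Return the URIs of every named graph that holds this triple."""
        return {q.graph for q in self._quads
                if (q.subject, q.predicate, q.object) == (s, p, o)}


# Illustrative URIs only.
GRAPH_A = "http://example.org/graph/a"
GRAPH_B = "http://example.org/graph/b"
triple = ("http://example.org/bob", "http://xmlns.com/foaf/0.1/name", "Bob")

store = QuadStore()
store.add(*triple, GRAPH_A)
store.add(*triple, GRAPH_B)      # same triple, different named graph
store.remove(*triple, GRAPH_A)   # leaves the copy in GRAPH_B in place
```

After the removal, `store.graph(GRAPH_A)` is empty while `store.graph(GRAPH_B)` still contains the triple.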
2. I say "belong directly" because I'm thinking that a graph can belong to another graph. If so, how would this be indicated? Is there some specific predicate to indicate that graph x belongs to graph y?
Open Anzo has no concept of nesting graphs or graph hierarchies. The URI of a named graph can be used as the subject or object of a triple just like any other URI, with a meaning specific to whatever predicate is being used. So two graphs can be related by means of ordinary triples, but there is no special support for any such constructs.
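As a rough illustration of relating two graphs with ordinary triples, the Python sketch below uses a graph URI as the subject of a quad. The `partOf` predicate and all URIs are hypothetical; any meaning attached to the predicate lives in the application, not in the store.

```python
# Quads are plain (subject, predicate, object, graph) tuples in this sketch.
GRAPH_X = "http://example.org/graph/x"
GRAPH_Y = "http://example.org/graph/y"
PART_OF = "http://example.org/vocab/partOf"  # hypothetical predicate

quads = {
    ("http://example.org/bob", "http://xmlns.com/foaf/0.1/name", "Bob", GRAPH_X),
    # The URI of graph x is used as an ordinary subject; this statement
    # happens to be stored in graph y, but it could live in any graph.
    (GRAPH_X, PART_OF, GRAPH_Y, GRAPH_Y),
}


def related_graphs(quads, predicate, target):
    """Return every subject linked to `target` by `predicate`."""
    return {s for (s, p, o, g) in quads if p == predicate and o == target}


members = related_graphs(quads, PART_OF, GRAPH_Y)
```

Here `members` contains only the URI of graph x. The store treats that quad like any other; it is the application that reads it as "x belongs to y".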
3. If we're going to use named graphs to track provenance, then it would make sense to assign each batch of data added to my triplestore to its own graph. Let's say that after a while I have thousands of graphs, and I want to write a SPARQL query whose scope is 432 of those graphs. Do I need 432 "FROM NAMED" clauses in my query? (Let's assume that I plan to query those same 432 multiple times.)
There are a few points here.
- First, for Open Anzo at least, it's up to the application developer how to group triples into named graphs. I don't think we've ever used the scheme you suggest ourselves (each batch of data added at once gets its own named graph), but you could if you wanted. Instead, our named graphs tend to collect the triples that represent a reasonably core object in the application's domain of discourse.
- Open Anzo does use named graphs for provenance. Named graphs are the basic unit for:
  - Versioning. When one or more triples in a named graph are updated, the entire graph is versioned. Open Anzo tracks the modification time and the user who instigated the change, and also provides an API for getting at previous revisions of a graph. (Graphs can also be explicitly created so that they do not keep track of revisions. Those still record when they were last updated and by whom.)
  - Access control. Control of who can read, write, remove, or change permissions on RDF data in Open Anzo is attached strictly at the named-graph level. This tends to work nicely with the general modeling approach that lets a named graph represent a conceptual entity.
  - Replication. Client applications can maintain local replicas of data from an Open Anzo server. Replication occurs at the level of a named graph.
- Second, it's worth noting that Open Anzo adds a bit of infrastructure for handling this sort of provenance. Each named graph in an Open Anzo repository has an associated metadata graph. The system manages the triples in the metadata graph, which can include access control data, provenance data, version histories, associated ontological elements, and more. This lets all of the provenance information be treated as RDF without conflating it with user/application-created triples.
- Third, regarding the challenge of handling queries that need to span hundreds or thousands of named graphs: As Bob observed, this is a common situation when you are basing a store around named graphs. The Open Anzo approach to this problem is to introduce the idea of a named dataset. A named dataset is a URI-identified collection of graphs. (Technically, it's two collections of graphs, representing both the default and named graph elements of a SPARQL query.) Glitter, the Open Anzo SPARQL engine, extends SPARQL with a FROM DATASET <u> clause that scopes the query to the graphs contained in the referenced named dataset, u. Currently, named datasets explicitly enumerate their constituent graphs. There's no reason, however, that the same approach could not be used along with other methods of identifying the dataset's graph contents, such as URI patterns or a query.
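To show what a named dataset saves you from writing, here is a Python sketch that expands one into the equivalent standard SPARQL `FROM` and `FROM NAMED` clauses. This is not how Glitter implements `FROM DATASET`; the `NamedDataset` class, the helper functions, and the example.org URIs are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NamedDataset:
    """A URI-identified pair of graph collections (hypothetical model)."""
    uri: str
    default_graphs: list = field(default_factory=list)
    named_graphs: list = field(default_factory=list)


def dataset_clauses(ds):
    """Render the dataset as standard SPARQL FROM / FROM NAMED clauses."""
    lines = [f"FROM <{g}>" for g in ds.default_graphs]
    lines += [f"FROM NAMED <{g}>" for g in ds.named_graphs]
    return "\n".join(lines)


def build_query(select_clause, where_clause, ds):
    """Assemble a query whose scope is exactly the graphs in ds."""
    return "\n".join([select_clause, dataset_clauses(ds), where_clause])


# The 432 batch graphs from the question, defined once and reused.
batches = NamedDataset(
    uri="http://example.org/dataset/batches",
    named_graphs=[f"http://example.org/graph/batch{i}" for i in range(1, 433)],
)

query = build_query(
    "SELECT ?g ?s ?p ?o",
    "WHERE { GRAPH ?g { ?s ?p ?o } }",
    batches,
)
```

The generated query carries 432 `FROM NAMED` lines. With Glitter's extension, the same scope would be written as a single `FROM DATASET <http://example.org/dataset/batches>` clause, and the graph list would be maintained in one place rather than in every query.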
All in all, we find the named graph model to be extremely empowering when building applications based on RDF. It gives a certain degree of scaffolding that allows all sorts of engineering and user experience flexibility. At a high level, we approach named graphs in a similar fashion to how we approach ontologies. We find both constructs useful for dealing with large amounts of RDF in practical enterprise environments, for engineering various ways of partitioning and understanding the data throughout the software stack. In the end, the named graph model goes to the heart of a few of RDF's core value propositions: agility and expressivity of the data model and adaptability of software built upon it.