SPARQL Calendar Demo: Using SPARQL to Find, Identify, and Name People

This is the fifth in a series of entries about the SPARQL calendar demo. If you haven't already, you can read the previous entry.

This entry is the first of a few entries that will examine the specific SPARQL queries used in the calendar demo. While SPARQL bears surface resemblances to SQL, querying an RDF graph is a distinct approach from querying a relational data store, and there are several idioms and subtleties that are unique to the SPARQL language. (None of these ideas are new, of course! But as SPARQL has just moved to Candidate Recommendation status, I thought it might be useful to throw some real SPARQL queries out into the wild.)

This query is issued against the current dataset every time a new URI is added to the dataset (either manually or via the "discover more people" link):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ical: <http://www.w3.org/2002/12/cal/icaltzd#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?who ?name ?id ?cal 
WHERE {
  ?who rdf:type foaf:Person .
  OPTIONAL { ?who foaf:name ?name }
  OPTIONAL { ?who rdfs:label ?name }
  OPTIONAL {
    { ?who foaf:mbox ?id } 
      UNION 
    { ?who foaf:mbox_sha1sum ?id } 
  }
  OPTIONAL {
    ?who rdfs:seeAlso ?cal .
    ?cal rdf:type ical:Vcalendar 
  }
} ORDER BY ?name

In English, this query asks:

Show me all people along with their names (if found), unique IDs (if found), and calendar URLs (if found) in my current RDF dataset.
There are a few interesting observations that we can take away from this query:

  • Why all the OPTIONALs? We want to build as exhaustive a list of people as we can from our current dataset. When people reference their friends in their FOAF files, the amount of information they include about them ranges from only a URI, to only an IFP (inverse-functional property), to a full suite of URI, name, and IFP information. Because we do not know the shape of the data we are querying, we take advantage of the SPARQL OPTIONAL keyword, which lets us include triple patterns that need not match the data being queried. That is, OPTIONAL ensures that if a person has a name but not an id (an IFP), we'll receive the name, and vice versa; the query will return all the information it can find without failing due to shaggy data.
  • Why are there two different OPTIONAL blocks that can bind the ?name variable? This idiom takes advantage of the fact that the OPTIONAL keyword is left-associative to express an ordered preference between predicates within our SPARQL query [1]. That is:
      OPTIONAL { ?who foaf:name ?name }
      OPTIONAL { ?who rdfs:label ?name }
    
    can be read as (given that ?who is already bound by the first (non-optional) triple pattern in the query):
    Bind ?name to the object of either the foaf:name or rdfs:label predicates; but if both such bindings exist, we prefer the object of foaf:name.
    It's a very useful idiom, especially without a rules-enabled datastore that could map one predicate to another when no triple with the more desirable predicate exists.
  • Why don't we use the same trick for finding bindings to ?id? This SPARQL query uses the ?id variable to bind to the values of inverse-functional properties (?ifp would likely have been a better name for the variable). Each such property uniquely identifies a person, and the calendar demo uses them to smush together seemingly distinct foaf:Person URIs or bnodes that actually refer to the same person. Because of this, we want to learn about as many IDs as we can and therefore we use the SPARQL UNION keyword to disjunctively include all possible bindings for ?id. (Of course, we wrap the UNION in an OPTIONAL because we want the query pattern to match a person even if no IFPs are found for that person.)
  • What's that oddness with the calendar gunk in the query? And why is that in this query? OK, you got me there. This bit of functionality doesn't belong here, and in fact is duplicated in the SPARQL query which mines the current RDF dataset to discover new default and named graphs to add to the dataset. I'll discuss that query next time, and explain what this bit of SPARQL is saying. Until then, happy SPARQLing...

[1] The nitty gritty: SPARQL defines A OPTIONAL B OPTIONAL C as (A OPTIONAL B) OPTIONAL C. In the case in question, A is our required triple pattern, which binds ?who to the resource or bnode representing a foaf:Person. By the definition of OPTIONAL, then, the parenthesized portion (A OPTIONAL B) will match successfully no matter what (since we're assuming A has already matched a foaf:Person), and will include bindings for B (that is, bindings of ?name to the object of foaf:name) if they exist. In either case, we then examine C. If B matched, then C can only match if it shares the same binding for ?name, so any other value as the object of rdfs:label gets ignored. If B failed to match, then ?name remains unbound, and any object of rdfs:label will be bound to ?name. Voilà: we have the desired behavior of expressing an ordered preference.
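
For readers who would rather trace the footnote's walkthrough by hand, here is a minimal Python sketch of the same evaluation order. It is not how any real SPARQL engine is implemented: it handles only single triple patterns, treats UNION as "try each alternative in turn", and leaves out the calendar block and the ORDER BY. The sample triples, the ex: names, and the helpers match, optional, and find_people are all assumptions made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Hypothetical sample data; prefixed names are kept as plain strings.
DATA: List[Triple] = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "rdfs:label", "alice (label)"),
    ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:alice", "foaf:mbox_sha1sum", "hypothetical-sha1"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "rdfs:label", "Bob"),
    ("_:b1", "rdf:type", "foaf:Person"),
]


def match(pattern: Triple, triples: Iterable[Triple], binding: Binding) -> List[Binding]:
    """Return every extension of `binding` under which `pattern` matches a triple."""
    results = []
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable that is already bound must agree with the data.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(candidate)
    return results


def optional(solutions: List[Binding], alternatives: List[Triple],
             triples: List[Triple]) -> List[Binding]:
    """Left join: extend each solution where possible, otherwise keep it as-is.

    More than one pattern in `alternatives` stands in for a UNION inside
    the OPTIONAL block: matches from every alternative are kept.
    """
    out = []
    for sol in solutions:
        extended = [b for p in alternatives for b in match(p, triples, sol)]
        out.extend(extended or [sol])
    return out


def find_people(triples: List[Triple]) -> List[Binding]:
    # A: the required pattern binds ?who.
    sols = match(("?who", "rdf:type", "foaf:Person"), triples, {})
    # (A OPTIONAL B): prefer foaf:name ...
    sols = optional(sols, [("?who", "foaf:name", "?name")], triples)
    # ... OPTIONAL C: rdfs:label only fills in ?name if it is still unbound.
    sols = optional(sols, [("?who", "rdfs:label", "?name")], triples)
    # OPTIONAL { {mbox} UNION {mbox_sha1sum} }: collect every IFP value.
    sols = optional(sols, [("?who", "foaf:mbox", "?id"),
                           ("?who", "foaf:mbox_sha1sum", "?id")], triples)
    return sols
```

Calling find_people(DATA) on this sample data gives four rows: two for ex:alice (one per IFP value, both with ?name bound to "Alice" rather than the rdfs:label value), one for ex:bob with ?name taken from rdfs:label, and one for the bnode _:b1 with only ?who bound.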