" /> TechnicaLee Speaking: March 2008 Archives


March 26, 2008

Now available online - Scientific American: "The Semantic Web in Action"

I blogged previously about my experience co-authoring an article on the Semantic Web for Scientific American. Since then, Scientific American has granted me permission to publish the text of the article on my Web site. So please feel free to enjoy the article and share it with others: "The Semantic Web In Action"

A few notes:

  • The default view of the article breaks it into multiple pages to make it more easily digestible and bookmarkable. There is a link at the top and bottom of each page to a single-page version suitable for printing and reading offline, or for anyone who simply prefers reading it that way.
  • The article text is followed by the text of the article's sidebars. There are links back and forth between the main text and the relevant sidebars. Most of the sidebars in the article included artwork which I do not have permission to reproduce online at this time.
  • At the end of the article I've gathered links to the various companies, projects, and technologies referenced in the article. (The terms of the reproduction rights from Scientific American prohibit adding links within the main content of the article.)

Please let me know what you think. Also, if you have any trouble reading or printing the article, let me know as well. (I whipped together some JavaScript to do the pagination while maintaining the browser's back button and internal anchors and things like that, so there may be some bugs. I'll write more about the JavaScript some other time.)

March 18, 2008

Gathering SPARQL Extensions

I realized that I hadn't blogged a pointer to the compilation of SPARQL extensions that I've created on the ESW wiki. Quoting myself:

Over the DAWG's lifetime (and since publication of the SPARQL Recommendations in January), there have been many important features that have been discussed but did not get included in the SPARQL specifications. I -- and many others -- hope that many of these topics will be addressed by a future working group, though there are no concrete plans for such a group at this time.

In the interest of cataloging these extensions and encouraging SPARQL developers to seek interoperable implementations of SPARQL extensions, I've created:


   http://esw.w3.org/topic/SPARQL/Extensions


That page links to individual pages for (currently) 13 categories of SPARQL extensions. Each of those pages, in turn, discusses the relevant type of SPARQL extension and attempts to provide links to research, discussion, and implementations of the extension.


I also plan to use this list to help encourage user- and implementor-driven discussion of these extensions over the coming months. Again, the goal is to allow SPARQL users to make known what features are most important to them and also to allow implementations to seek common syntaxes and semantics for SPARQL extensions. (All of this, in the end, should help a future working group charter a new version of SPARQL and produce a specification that allows for interoperable SPARQL v2 implementations.)

It's a wiki. Please add references that are not there, new topics, or discussions of existing topics. (I've tried to reuse existing ESW Wiki pages for some topics that already had discussion.)

Where I say "this list" above, I mean public-sparql-dev@w3.org. Please subscribe if you're interested in discussing any or all of these potential SPARQL extensions.

March 12, 2008

Semantic Web tutorial

Last week, Eric Prud'hommeaux and I presented a tutorial on Semantic Web technologies at the Conference on Semantics in Healthcare & Life Sciences (C-SHALS). It was a four-hour session covering an intro to RDF, SPARQL, GRDDL, RDFa, RDFS, and OWL, mostly in the context of health care (patients' clinical examination records) and life sciences (pyramidal neurons in Alzheimer's Disease, as per the W3C HCLS interest group's knowledgebase use case). We reprised the GRDDL and RDFa sections in a whirlwind 15-20 minute talk at yesterday's Cambridge Semantic Web gathering.

Enjoy the slides. I'd welcome any suggestions so that the slides can be enhanced and reused (by myself and others) in the future.

March 8, 2008

Modeling Statistics in RDF - A Survey and Discussion

At the Semantic Technologies Conference in San Jose in May, Brand Niemann of the U.S. EPA and I are presenting Getting to Web Semantics for Spreadsheets in the U.S. Government. In particular, Brand and I are working to exploit the semantics implicit in the nearly 1,500 spreadsheets that are in the U.S. Census Bureau's annual Statistical Abstract of the United States. The rest of this post discusses various strategies for modeling this sort of statistical data in RDF; for more information on the background of this work, please see my presentation from the February 5, 2008, SICoP Special Conference.

The data for the Statistical Abstract is effectively time-based statistics. There are a variety of ways that this information can be modeled as semantic data. The approaches differ in simplicity/complexity, semantic expressivity, and verbosity. At least as interestingly, they vary in precisely what they are modeling: statistical data or a particular domain of discourse. The goal of this effort is to examine the potential approaches to modeling this information in terms of ease of reuse, ease of query, ability to integrate with information from all 1,500 spreadsheets (and other sources), and the ability to enhance the model incrementally with richer semantics. There are surely other approaches to modeling this information as well: I'd love to hear any ideas or suggestions for other approaches to consider.


D2R Server for Eurostat

The D2R server guys host an RDF copy of the Eurostat collection of European economic, demographic, political, and geographic data. From the start, they make the simplifying assumption that:

Most statistical data are time series, therefore only the latest available value is provided here.

In other words, they do not try to capture historic statistics at all. The disclaimer also notes that what is modeled in RDF is a small subset of the available data tables.

Executing SELECT DISTINCT ?p WHERE { ?s ?p ?o } to learn more about this dataset tells us:

   db:eurostat/population_total
   db:eurostat/electricity_consumption_GWh
   db:eurostat/killed_in_road_accidents
   db:eurostat/RnD_exp_mio_euro
   db:eurostat/parentcountry
   db:eurostat/population_male
   rdfs:label
   db:eurostat/RnD_personel_percent_of_act_pop
   db:eurostat/total_average_population
   db:eurostat/population_female
   db:eurostat/unemployment_rate_total
   db:eurostat/avg_annual_population_growth
   db:eurostat/total_area_km2
   db:eurostat/name_encoded
   db:eurostat/disposable_income
   db:eurostat/injured_in_road_accidents
   db:eurostat/electricity_production_capacity_MWh
   db:eurostat/hospital_beds_per100000hab
   db:eurostat/name
   db:eurostat/landuse_total
   db:eurostat/GDP
   db:eurostat/geocode
   owl:sameAs
   rdf:type
   db:eurostat/level_of_internetaccess_households
   db:eurostat/death_rate
   db:eurostat/fertility_rate_total
   db:eurostat/level_of_internet_access
   db:eurostat/marriages
   db:eurostat/ecommerce_via_internet
   db:eurostat/pupils_and_students
   db:eurostat/inflation_rate
   db:eurostat/employment_rate_total
   db:eurostat/average_exit_age_from_laborforce
   db:eurostat/comparative_price_levels
   db:eurostat/GDP_current_prices
   db:eurostat/GDP_per_capita_PPP
   db:eurostat/monthly_labour_costs

I make a few observations from this:

  • Most of these are predicates that correspond to a statistical category. I'm curious what the types of the subjects are. The query here is (the filter is added to limit the question to resources that use the Eurostat predicates; this and the other exploratory queries in this list are gathered into a single sketch after the list):
     SELECT DISTINCT ?t WHERE {
       ?s rdf:type ?t .
       ?s ?p ?o .
       FILTER(regex(str(?p), 'eurostat'))
     }
    
    The result is two types: regions and countries. Simple enough.
  • I'm also curious as to the types of the objects. Let's see if there are any resources (URIs) as objects. We do the ?s ?p ?o query from before but add in FILTER(isURI(?o)). The result shows that, aside from rdf:type and owl:sameAs (which we expected), only the predicate db:eurostat/parentcountry points to other resources. Doing a query on this predicate, we see that it relates regions (e.g. db:regions/Lorraine) to countries (e.g. db:countries/France).
  • I'd expect that, especially in the absence of time-based data, they don't have object structures with blank nodes. Changing the previous filter to use isBlank confirms that this is true.
  • So what are the types of the other data? Strings? Numbers? Let's find out. Poking around with various values for XXX in the filter FILTER(isLiteral(?o) && datatype(?o) = XXX), we see that some data uses xsd:string while other data uses xsd:double. Poking around at the remaining predicates, we discover that they use xsd:long for non-decimal numbers.
  • What are they using owl:sameAs for? Executing SELECT ?s ?o { ?s owl:sameAs ?o } shows what I suspected: they're equating URIs that they've minted under a Eurostat namespace (http://www4.wiwiss.fu-berlin.de/eurostat/resource/) to DBPedia URIs (to broaden the linked data Web). Let's see if they use owl:sameAs for anything else. We add FILTER(!regex(str(?o), 'dbpedia')) and the query now returns no results.
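
For reference, here is a consolidated sketch of the exploratory queries described in the list above. The prefix declarations are mine, and the comments record the results reported above; consider it a sketch rather than a tested script:

 PREFIX owl: <http://www.w3.org/2002/07/owl#>
 PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

 # 1. Which predicates have resource (URI) objects? This turns up
 #    rdf:type, owl:sameAs, and db:eurostat/parentcountry.
 SELECT DISTINCT ?p WHERE { ?s ?p ?o . FILTER(isURI(?o)) }

 # 2. Any blank-node objects? This returns no results.
 SELECT DISTINCT ?p WHERE { ?s ?p ?o . FILTER(isBlank(?o)) }

 # 3. Which predicates carry xsd:double values? Substitute other
 #    datatypes (xsd:string, xsd:long) for xsd:double to survey the rest.
 SELECT DISTINCT ?p WHERE {
   ?s ?p ?o .
   FILTER(isLiteral(?o) && datatype(?o) = xsd:double)
 }

 # 4. Is owl:sameAs used for anything besides DBpedia links? No:
 #    this returns no results.
 SELECT ?s ?o WHERE {
   ?s owl:sameAs ?o .
   FILTER(!regex(str(?o), 'dbpedia'))
 }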

The 2000 U.S. Census

Joshua Tauberer converted the 2000 U.S. Census Data into 1 billion RDF triples. He provides a well-documented Perl script that can convert various subsets of the census data into N3. One mode that this script can be run in is to output the schema from SAS table layout files. Joshua's about page provides an overview of the data. In particular, I note that he is working with tables that are multiple levels deep (e.g. population by sex and then by age).

The most useful part of the writeup, though, is the section specifically about modeling the census data in RDF. In general, Joshua models nested levels of statistical tables (representing multiple facets of the data) as a chain of predicates (with the interim nodes as blank nodes). If a particular criterion is further subdivided, then the aggregate total at that level is linked with rdf:value. Otherwise, the value is given as the object itself. Note that the subjects are not real-world entities ("the U.S.") but instead are data tables ("the U.S. census tables"). The entities themselves are related to the data tables via a details predicate. The excerpt below combines both types of information (the entity itself followed by the data tables above the entity):

 @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
 @prefix dc: <http://purl.org/dc/elements/1.1/> .
 @prefix dcterms: <http://purl.org/dc/terms/> .
 @prefix : <tag:govshare.info,2005:rdf/census/details/100pct> .
 @prefix politico: <http://www.rdfabout.com/rdf/schema/politico/> .
 @prefix census: <http://www.rdfabout.com/rdf/schema/census/> .

 <http://www.rdfabout.com/rdf/usgov/geo/us>
   a politico:country ;
   dc:title "United States" ;
   census:households 115904641 ;
   census:waterarea "664706489036 m^2" ;
   census:population 281421906 ;
   census:details <http://www.rdfabout.com/rdf/usgov/geo/us/censustables> ;
   dcterms:hasPart <http://www.rdfabout.com/rdf/usgov/geo/us/al>, <http://www.rdfabout.com/rdf/usgov/geo/us/az>, ...
 .

 <http://www.rdfabout.com/rdf/usgov/geo/us/censustables>  :totalPopulation 281421906 ;     # P001001
   :totalPopulation [
      dc:title "URBAN AND RURAL (P002001)";
      rdf:value 281421906 ;   # P002001
      :urban [
         rdf:value 222360539 ;  # P002002
         :insideUrbanizedAreas 192323824 ;   # P002003
         :insideUrbanClusters 30036715 ;     # P002004
      ] ;
      :rural 59061367 ;   # P002005
   ] ;
   :totalPopulation [
     dc:title "RACE (P003001)";
     rdf:value 281421906 ;   # P003001
   :populationOfOneRace [
       rdf:value 274595678 ;    # P003002
       :whiteAlone 211460626 ;     # P003003
       :blackOrAfricanAmericanAlone 34658190 ;     # P003004
       :americanIndianAndAlaskaNativeAlone 2475956 ;   # P003005
   ]
 ...

This modeling is inconsistent (as Joshua himself admits in the description). Note, for instance, how :totalPopulation > :urban has an rdf:value link to the aggregate US urban population, yet one level deeper, :totalPopulation > :urban > :insideUrbanizedAreas has an object which is itself the value of that statistic.

As I see it, this inconsistency could be avoided in two ways:

  1. Always insist that a statistic hangs off of a resource (URI or blank node) via the rdf:value predicate.
  2. Allow a criterion/classification predicate to point both to a literal (aggregate) value and also to further subdivisions. This would allow the above example to have the triple :totalPopulation > :urban > 222360539 in addition to the further nested :totalPopulation > :urban > :insideUrbanizedAreas > 192323824.

The second approach seems simpler to me (fewer triples). It can be queried with an isLiteral filter restriction. The first approach might make for a slightly simpler query, as it would always just query for rdf:value. (The queries would be about the same size, but the rdf:value approach is a bit clearer to read than the isLiteral filter approach; both styles are sketched below.)
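
To make the comparison concrete, here is a rough sketch of the two query styles, both asking for the aggregate U.S. urban population under the corresponding remodeling. The prefixes follow the excerpt above; I haven't run these, so treat the exact shapes as illustrative:

 PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
 PREFIX : <tag:govshare.info,2005:rdf/census/details/100pct>

 # Approach 1: every statistic hangs off rdf:value.
 SELECT ?urbanPop WHERE {
   <http://www.rdfabout.com/rdf/usgov/geo/us/censustables>
       :totalPopulation ?t .
   ?t :urban ?u .
   ?u rdf:value ?urbanPop .
 }

 # Approach 2: a criterion predicate may point directly at the
 # aggregate literal; filter out the subdivision nodes.
 SELECT ?urbanPop WHERE {
   <http://www.rdfabout.com/rdf/usgov/geo/us/censustables>
       :totalPopulation ?t .
   ?t :urban ?urbanPop .
   FILTER(isLiteral(?urbanPop))
 }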

As an aside, this statement from Joshua speaks to the value of what we are doing with the U.S. Statistical Abstract data:

(If you followed Region > households > nonFamilyHouseholds you would get the number of households, not people, that are nonFamilyHouseHolds. To know what a "non-family household" is, you would have to consult the PDFs published by the Census.)

Riese: RDFizing and Interlinking the EuroStat Data Set Effort

Riese is another effort to convert the EuroStat data to RDF. It seeks to expand on the coverage of the D2R effort. Project discussion is available on an ESW wiki page, but the main details of the effort are on the project's about page. Currently, riese only provides five million out of the three billion triples that it seeks to provide.

The under the hood section of the about page links to the riese schema. (Note: this is a simple RDF schema; no OWL in sight.) The schema models statistics as items that link to times, datasets, dimensions, geo information, and a value (using rdf:value).

Every statistical data item is a riese:item. riese:items are qualified with riese:dimensions, one of which is, in particular, dimension:Time.

The "ask" page gives two sample queries over the EuroStat RDF data, but those only deal in the datasets. RDF can be retrieved for the various Riese tables and data items by appending /content.rdf to the items' URIs and doing an HTTP GET. Here's an example of some of the RDF for a particular data item (this is not strictly legal Turtle, but you'll get the point):

@prefix : <http://riese.joanneum.at/data/> .
@prefix riese: <http://riese.joanneum.at/schema/core#> .
@prefix dim: <http://riese.joanneum.at/dimension/> .
@prefix dim-schema: <http://riese.joanneum.at/schema/dimension/> .

:bp010 a riese:dataset ;
  # all dc:title's repeated as rdfs:label
  dc:title "Current account - monthly: Total" ;
  riese:data_start "2002m10" ; # proprietary format?
  riese:data_end   "2007m09" ;
  riese:structure  "geo\time" ; # not sure of this format
  riese:datasetOf :bp010/2007m03_ea .

:bp010/2007m03_ea a riese:Item ;
  dc:title "Table: bp010, dimensions: ea, time: 2007m03" ;
  rdf:value "7093" ; # not typed
  riese:dimension dim:geo/ea ;
  riese:dimension dim:time/2007m03 ;
  riese:dataset :bp010 .

dim:geo/ea a dim-schema:Geo ;
  dc:title "Euro area (EA11-2000, EA12-2006, EA13-2007, EA15)" .

dim:time/2007m03 a dim-schema:Time ;
  dc:title "" . # oops

dim-schema:Geo rdfs:subClassOf riese:Dimension ; dc:title "Geo" .
dim-schema:Time rdfs:subClassOf riese:Dimension ; dc:title "Time" .

(A lot of this is available in dic.nt (39 MB).)
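
Here is a sketch of how an application might query this structure for all of the values in a given table, along with each item's geo and time dimensions. I haven't run this against the riese data, so take the query shape as illustrative of the model rather than as a tested query:

 PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
 PREFIX dc:    <http://purl.org/dc/elements/1.1/>
 PREFIX riese: <http://riese.joanneum.at/schema/core#>
 PREFIX dim-schema: <http://riese.joanneum.at/schema/dimension/>

 SELECT ?item ?value ?geoLabel ?timeLabel WHERE {
   <http://riese.joanneum.at/data/bp010> riese:datasetOf ?item .
   ?item rdf:value ?value ;
         riese:dimension ?geo ;
         riese:dimension ?time .
   ?geo  a dim-schema:Geo ;  dc:title ?geoLabel .
   ?time a dim-schema:Time ; dc:title ?timeLabel .
 }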

Summary

In summary, these three examples show three distinct approaches for modeling statistics:

  1. Simple, point-in-time statistics. Predicates that fully describe each statistic relate a (geographic, in this case) entity to the statistic's value. There's no way to represent time (or other dimensions) in this model other than to create a new predicate for every combination of dimensions (e.g. country:bolivia stat:1990population18-30male 123456). Queries are flat and rely on knowledge of, or metadata (e.g. rdfs:label) about, the predicates. There's no way to generate tables of related values easily. Observation: this approach effectively builds a model of the real world, ignoring statistical artifacts such as time, tables, and subtables.
  2. Complex, point-in-time statistics. An initial predicate relates a (geographic, in this case) entity to both an aggregate value for the statistic, as well as to (via blank nodes) other predicates that represent dimensions. Aggregate values are available off of any point in the predicate chain. Applications need to be aware of the hierarchical predicate structure of the statistics for queries, but can reuse (and therefore link) some predicates amongst different statistics. Nested tables can easily be constructed from this model. Observation: this approach effectively builds a model of the statistical domain in question (demographics, geography, economics, etc. as broken into statistical tables).
  3. Complex statistics over time. Each statistic (each number) is represented as an item with a value. Dimensions (including time) are also described as resources with values, titles, etc. In this approach, the entire model is described by a small number of predicates. Applications can flexibly query for different combinations of time and other dimensions, though they still must know the identifying information for the dimensions in which they are interested. Applications can fairly easily construct nested tables from this model. Observation: this approach effectively uses a model of statistics (in general) which in turn is used to express statistics about the domains in question.

Statistical Abstract data

Simple with time

One of the simplest data tables in the Statistical Abstract gives statistics for airline on-time arrivals and departures. A sample of how this table is laid out is:

 Airport                       On-time Arrivals      On-time Departures
                               2006 Q1    2006 Q2    2006 Q1    2006 Q2
 Total major airports          77.0       76.7       79.0       78.5
 Atlanta, Hartsfield           73.9       75.5       76.0       74.3
 Boston, Logan International   75.6       66.8       80.5       74.8

Overall, this is fairly simple. Every airport, for each time period, has an on-time arrival percentage and an on-time departure percentage. If we simplified it even further by removing the use of multiple times, it would just be a simple grid spreadsheet (relating airports to arrival % and departure %). This does have the interesting (?) twist that the aggregate data (total major airports) is not simply a sum of the constituent data items (since we're dealing in percentages).

Simple point-in-time approach

If we ignore time (and choose 2006 Q1 as our point in time), then this data models as:

 ex:ATL ex:ontime-arrivals 73.9 ; ex:ontime-departures 76.0 .
 ex:BOS ex:ontime-arrivals 75.6 ; ex:ontime-departures 80.5 .
 ex:us-major-airports ex:ontime-arrivals 77.0 ; ex:ontime-departures 79.0 .

This is simple, but ignores time. It also doesn't give any hint that ex:us-major-airports is a total/aggregate of the other data. We could encode time in the predicates themselves (ex:ontime-arrivals-2006-q1), but I think everyone would agree that that's a bad idea. We could also let each time range be a blank node off the subjects, but that assumes all subjects have data conforming to the same time increments. Any such approach starts to get close to the complex point-in-time approach, so let's look at that.

Complex point-in-time approach

If we ignore time and view the "total major airports" as unrelated to the individual airports, then we have no "nested tables" and this approach degenerates to the simple point-in-time approach, effectively:

 ex:ATL a ex:Airport ;
   dcterms:isPartOf ex:us-major-airports ;
   stat:details [
     ex:on-time-arrivals 73.9 ;
     ex:on-time-departures 76.0
   ] .
 ex:BOS a ex:Airport ;
   dcterms:isPartOf ex:us-major-airports ;
   stat:details [
     ex:on-time-arrivals 75.6 ;
     ex:on-time-departures 80.5
   ] .
 ex:us-major-airports
   dcterms:hasPart ex:ATL, ex:BOS ;
   stat:details [
     ex:on-time-arrivals 77.0 ;
     ex:on-time-departures 79.0 ;
   ] .    

We could treat time as a special-case that conditionalizes the statistics (stat:details) for any particular subject, such as:

 ex:ATL a ex:Airport ;
   dcterms:isPartOf ex:us-major-airports ;
   stat:details [
     stat:start "2006-01-01"^^xsd:date ;
     stat:end   "2006-03-31"^^xsd:date ;
     stat:details [
       ex:on-time-arrivals 73.9 ;
       ex:on-time-departures 76.0
     ]
   ] .
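
A quick sketch of how a query might pull ATL's 2006 Q1 on-time arrival rate out of that time-conditionalized structure (stat: and ex: are the same hypothetical namespaces used in the examples; I haven't run this):

 PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
 PREFIX stat: <http://example.org/stat#>  # hypothetical namespace
 PREFIX ex:   <http://example.org/ex#>    # hypothetical namespace

 SELECT ?arrivals WHERE {
   ex:ATL stat:details ?period .
   ?period stat:start "2006-01-01"^^xsd:date ;
           stat:end   "2006-03-31"^^xsd:date ;
           stat:details ?d .
   ?d ex:on-time-arrivals ?arrivals .
 }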

If we ignore time but view the "total major airports" statistics as an aggregate of the individual airports (which then act as subtables), we get this RDF structure:

 ex:us-major-airports
   ex:on-time-arrivals 77.0 ;
   ex:on-time-departures 79.0 ;
   ex:ATL [
     ex:on-time-arrivals 73.9 ;
     ex:on-time-departures 76.0
   ] ;
   ex:BOS [
     ex:on-time-arrivals 75.6 ;
     ex:on-time-departures 80.5
   ] .

This is interesting because it treats the individual airports as subtables of the dataset. I don't think it's really a great way to model the data, however.

Complex Statistics Over Time

 ex:ontime-flights a stat:Dataset ;
   dc:title "On-time Flight Arrivals and Departures at Major U.S. Airports: 2006" ;
   stat:date_start "2006-01-01"^^xsd:date ;
   stat:date_end "2006-12-31"^^xsd:date ;
   stat:structure "... something that explains how to display the stats ? ..." ;
   stat:datasetOf ex:atl-arr-2006q1, ex:atl-dep-2006q1, ... .
 
 ex:atl-arr-2006q1 a stat:Item ;
   rdf:value 73.9 ;
   stat:dataset ex:ontime-flights ;
   stat:dimension ex:Q12006 ;
   stat:dimension ex:arrivals ;
   stat:dimension ex:ATL .
 
 ex:atl-dep-2006q1 a stat:Item ;
   rdf:value 76.0 ;
   stat:dataset ex:ontime-flights ;
   stat:dimension ex:Q12006 ;
   stat:dimension ex:departures ;
   stat:dimension ex:ATL .
 
 ... more data items ...
 
 ex:Q12006 a stat:TimePeriod ;
   dc:title "2006 Q1" ;
   stat:date_start "2006-01-01"^^xsd:date ;
   stat:date_end "2006-03-31"^^xsd:date .
 
 ex:arrivals a stat:ScheduledFlightTime ;
   dc:title "Arrival time" .
 
 ex:departures a stat:ScheduledFlightTime ;
   dc:title "Departure time" .
 
 ex:ATL a stat:Airport ;
   dc:title "Atlanta, Hartsfield" .
 
 ... more dimension values ...
 
 stat:TimePeriod rdfs:subClassOf stat:Dimension ; dc:title "time period" .
 stat:ScheduledFlightTime rdfs:subClassOf stat:Dimension ; dc:title "arrival or departure" .
 stat:Airport rdfs:subClassOf stat:Dimension ; dc:title "airport" .
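
As a sanity check on how this model queries, here is a sketch that reassembles one row of the original table (ATL's 2006 Q1 arrival and departure percentages). The stat: and ex: names are the same hypothetical vocabulary used in the example above, and I haven't run this:

 PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
 PREFIX stat: <http://example.org/stat#>  # hypothetical namespace
 PREFIX ex:   <http://example.org/ex#>    # hypothetical namespace

 SELECT ?arr ?dep WHERE {
   ?arrItem a stat:Item ;
            rdf:value ?arr ;
            stat:dimension ex:ATL ;
            stat:dimension ex:Q12006 ;
            stat:dimension ex:arrivals .
   ?depItem a stat:Item ;
            rdf:value ?dep ;
            stat:dimension ex:ATL ;
            stat:dimension ex:Q12006 ;
            stat:dimension ex:departures .
 }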

First, this modeling approach seems to be the most verbose. It also seems to give the greatest flexibility in terms of modeling time and querying the resulting data. One related alternative to this approach would replace dimension objects with dimension predicates, as in:

 ex:atl-arr-2006q1 a stat:Item ;
   rdf:value 73.9 ;
   stat:dataset ex:ontime-flights ;
   stat:date_start "2006-01-01"^^xsd:date ;
   stat:date_end "2006-03-31"^^xsd:date ;
   stat:airport ex:ATL ;
   stat:scheduled-flight-time ex:arrivals .
 
 stat:airport rdfs:subPropertyOf stat:dimension ; dc:title "airport" .

This may be a bit less verbose, but it loses the ability to have multivalued dimensions such as stat:TimePeriod in the first example. (An application can still discover an item's dimensions generically under this variant; see the sketch below.)
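
Under the dimension-predicate variant, an application could still discover an item's dimensions generically, without RDFS inference, by joining through the rdfs:subPropertyOf declarations. A sketch, using the same hypothetical namespaces:

 PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
 PREFIX stat: <http://example.org/stat#>  # hypothetical namespace
 PREFIX ex:   <http://example.org/ex#>    # hypothetical namespace

 # Assumes each dimension predicate is declared
 # rdfs:subPropertyOf stat:dimension, as stat:airport is above.
 SELECT ?dimensionProperty ?dimensionValue WHERE {
   ex:atl-arr-2006q1 ?dimensionProperty ?dimensionValue .
   ?dimensionProperty rdfs:subPropertyOf stat:dimension .
 }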

Conclusion

The riese approach seems the best combination of flexibility and usability. It should allow us to recreate the data-table structures with a reasonable degree of fidelity in another environment (e.g. on the Web), as well as to construct a basic semantic repository by attaching definitions to the various statistical entities, facets, and properties. All that said, the proof is in the pudding, and until then I'm quite open to other suggestions.