At the Semantic Technologies Conference in San Jose in May, Brand Niemann of the U.S. EPA and I are presenting Getting to Web Semantics for Spreadsheets in the U.S. Government. In particular, Brand and I are working to exploit the semantics implicit in the nearly 1,500 spreadsheets that are in the U.S. Census Bureau's annual Statistical Abstract of the United States. The rest of this post discusses various strategies for modeling this sort of statistical data in RDF. (For more information on the background of this work, please see my presentation from the February 5, 2008, SICoP Special Conference.)
The data for the Statistical Abstract is effectively time-based statistics. There are a variety of ways that this information can be modeled as semantic data. The approaches differ in simplicity/complexity, semantic expressivity, and verbosity. At least as interestingly, they vary in precisely what they are modeling: statistical data or a particular domain of discourse. The goal of this effort is to examine the potential approaches to modeling this information in terms of ease of reuse, ease of query, ability to integrate with information from all 1,500 spreadsheets (and other sources), and the ability to enhance the model incrementally with richer semantics. There are surely other approaches to modeling this information as well: I'd love to hear any ideas or suggestions for other approaches to consider.
D2R Server for Eurostat
The D2R server guys host an RDF copy of the Eurostat collection of European economic, demographic, political, and geographic data. From the start, they make the simplifying assumption that:
Most statistical data are time series, therefore only the latest available value is provided here.
In other words, they do not try to capture historic statistics at all. The disclaimer also notes that what is modeled in RDF is a small subset of the available data tables.
Executing a SELECT DISTINCT ?p { ?s ?p ?o } to learn more about this dataset tells us:
db:eurostat/population_total
db:eurostat/electricity_consumption_GWh
db:eurostat/killed_in_road_accidents
db:eurostat/RnD_exp_mio_euro
db:eurostat/parentcountry
db:eurostat/population_male
rdfs:label
db:eurostat/RnD_personel_percent_of_act_pop
db:eurostat/total_average_population
db:eurostat/population_female
db:eurostat/unemployment_rate_total
db:eurostat/avg_annual_population_growth
db:eurostat/total_area_km2
db:eurostat/name_encoded
db:eurostat/disposable_income
db:eurostat/injured_in_road_accidents
db:eurostat/electricity_production_capacity_MWh
db:eurostat/hospital_beds_per100000hab
db:eurostat/name
db:eurostat/landuse_total
db:eurostat/GDP
db:eurostat/geocode
owl:sameAs
rdf:type
db:eurostat/level_of_internetaccess_households
db:eurostat/death_rate
db:eurostat/fertility_rate_total
db:eurostat/level_of_internet_access
db:eurostat/marriages
db:eurostat/ecommerce_via_internet
db:eurostat/pupils_and_students
db:eurostat/inflation_rate
db:eurostat/employment_rate_total
db:eurostat/average_exit_age_from_laborforce
db:eurostat/comparative_price_levels
db:eurostat/GDP_current_prices
db:eurostat/GDP_per_capita_PPP
db:eurostat/monthly_labour_costs
I make a few observations from this:
- Most of these are predicates that correspond to a statistical category. I'm curious what the types of the subjects are. The query here is (the filter is added to limit the question to resources that use the Eurostat predicates):
SELECT DISTINCT ?t WHERE {
  ?s rdf:type ?t .
  ?s ?p ?o .
  FILTER(regex(str(?p), 'eurostat'))
}
The result is two types: regions and countries. Simple enough.
- I'm also curious as to the types of the objects. Let's see if there are any resources (URIs) as objects. We do the ?s ?p ?o query from before but add in FILTER(isURI(?o)). The result shows that, aside from rdf:type and owl:sameAs (which we expected), only the predicate db:eurostat/parentcountry points to other resources. Doing a query on this predicate, we see that it relates regions (e.g. db:regions/Lorraine) to countries (e.g. db:countries/France). (These probe queries are written out in full just after this list.)
- I'd expect that, especially in the absence of time-based data, they don't have object structures with blank nodes. Changing the previous filter to use isBlank confirms that this is true.
- So what are the types of the other data? Strings? Numbers? Let's find out. Poking around with various values for XXX in the filter FILTER(isLiteral(?o) && datatype(?o) = XXX), we see that some data uses xsd:string while other data uses xsd:double. Poking around at the remaining predicates, we discover that they use xsd:long for non-decimal numbers.
- What are they using owl:sameAs for? Executing SELECT ?s ?o { ?s owl:sameAs ?o } shows what I suspected: they're equating URIs that they've minted under a Eurostat namespace (http://www4.wiwiss.fu-berlin.de/eurostat/resource/) to DBPedia URIs (to broaden the linked data Web). Let's see if they use owl:sameAs for anything else. We add FILTER(!regex(str(?o), 'dbpedia')) and the query now returns no results.
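For reference, here's what those probe queries look like written out in full. Each is a standalone query; the PREFIX declarations at the top would accompany each one:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Which predicates have resource (URI) objects?
SELECT DISTINCT ?p WHERE { ?s ?p ?o . FILTER(isURI(?o)) }

# Which predicates have blank-node objects? (None, it turns out.)
SELECT DISTINCT ?p WHERE { ?s ?p ?o . FILTER(isBlank(?o)) }

# Which predicates have xsd:double literals? Swap in xsd:string or
# xsd:long to probe the other datatypes.
SELECT DISTINCT ?p WHERE { ?s ?p ?o . FILTER(isLiteral(?o) && datatype(?o) = xsd:double) }

# Does owl:sameAs point anywhere other than DBpedia? (No results.)
SELECT ?s ?o WHERE { ?s owl:sameAs ?o . FILTER(!regex(str(?o), 'dbpedia')) }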
The 2000 U.S. Census
Joshua Tauberer converted the 2000 U.S. Census Data into 1 billion RDF triples. He provides a well-documented Perl script that can convert various subsets of the census data into N3. One mode that this script can be run in is to output the schema from SAS table layout files. Joshua's about page provides an overview of the data. In particular, I note that he is working with tables that are multiple levels deep (e.g. population by sex and then by age).
The most useful part, though, is the section specifically about modeling the census data in RDF. In general, Joshua models nested levels of statistical tables (representing multiple facets of the data) as a chain of predicates (with the interim nodes as blank nodes). If a particular criterion is further subdivided, then the aggregate total at that level is linked with rdf:value. Otherwise, the value is given as the object itself. Note that the subjects are not real-world entities ("the U.S.") but instead are data tables ("the U.S. census tables"). The entities themselves are related to the data tables via a details predicate. The excerpt below combines both types of information (the entity itself followed by the entity's data tables):
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix : <tag:govshare.info,2005:rdf/census/details/100pct> .
@prefix politico: <http://www.rdfabout.com/rdf/schema/politico/> .
@prefix census: <http://www.rdfabout.com/rdf/schema/census/> .

<http://www.rdfabout.com/rdf/usgov/geo/us> a politico:country ;
    dc:title "United States" ;
    census:households 115904641 ;
    census:waterarea "664706489036 m^2" ;
    census:population 281421906 ;
    census:details <http://www.rdfabout.com/rdf/usgov/geo/us/censustables> ;
    dcterms:hasPart <http://www.rdfabout.com/rdf/usgov/geo/us/al>,
                    <http://www.rdfabout.com/rdf/usgov/geo/us/az>,
                    ... .

<http://www.rdfabout.com/rdf/usgov/geo/us/censustables>
    :totalPopulation 281421906 ;                        # P001001
    :totalPopulation [
        dc:title "URBAN AND RURAL (P002001)" ;
        rdf:value 281421906 ;                           # P002001
        :urban [
            rdf:value 222360539 ;                       # P002002
            :insideUrbanizedAreas 192323824 ;           # P002003
            :insideUrbanClusters 30036715 ;             # P002004
        ] ;
        :rural 59061367 ;                               # P002005
    ] ;
    :totalPopulation [
        dc:title "RACE (P003001)" ;
        rdf:value 281421906 ;                           # P003001
        :populationOfOneRace [
            rdf:value 274595678 ;                       # P003002
            :whiteAlone 211460626 ;                     # P003003
            :blackOrAfricanAmericanAlone 34658190 ;     # P003004
            :americanIndianAndAlaskaNativeAlone 2475956 ; # P003005
        ]
    ...
This is an inconsistent modeling (which Joshua admits himself in the description). Note, for instance, how :totalPopulation > :urban has an rdf:value link to the aggregate U.S. urban population. When you go one level deeper, though, :totalPopulation > :urban > :insideUrbanizedAreas has an object which is itself the value of that statistic.
As I see it, this inconsistency could be avoided in two ways:
- Always insist that a statistic hangs off of a resource (URI or blank node) via the rdf:value predicate.
- Allow a criterion/classification predicate to point both to a literal (aggregate) value and also to further subdivisions. This would allow the above example to have a triple :totalPopulation > :urban > 222360539 in addition to the further nested :totalPopulation > :urban > :insideUrbanizedAreas > 192323824.
The second approach seems simpler to me (fewer triples). It can be queried with an isLiteral filter restriction. The first approach might make for a slightly simpler query, as it would always just query for rdf:value. (The queries would be about the same size, but the rdf:value approach is a bit clearer to read than the isLiteral filter approach.)
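Concretely, here's roughly what the two query styles would look like for the urban-population example. Neither runs against Joshua's data as-is; these are sketches of the two proposed (consistent) remodelings, using the prefixes from the census excerpt above:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX : <tag:govshare.info,2005:rdf/census/details/100pct>

# Approach 1: the value always hangs off the node via rdf:value.
SELECT ?urbanPop WHERE {
  <http://www.rdfabout.com/rdf/usgov/geo/us/censustables> :totalPopulation ?table .
  ?table :urban ?urban .
  ?urban rdf:value ?urbanPop .
}

# Approach 2: the aggregate is a literal object of :urban itself,
# alongside the blank node carrying the subdivisions.
SELECT ?urbanPop WHERE {
  <http://www.rdfabout.com/rdf/usgov/geo/us/censustables> :totalPopulation ?table .
  ?table :urban ?urbanPop .
  FILTER(isLiteral(?urbanPop))
}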
As an aside, this statement from Joshua speaks to the value of what we are doing with the U.S. Statistical Abstract data:
(If you followed Region > households > nonFamilyHouseholds you would get the number of households, not people, that are nonFamilyHouseholds. To know what a "non-family household" is, you would have to consult the PDFs published by the Census.)
Riese: RDFizing and Interlinking the EuroStat Data Set Effort
Riese is another effort to convert the EuroStat data to RDF. It seeks to expand on the coverage of the D2R effort. Project discussion is available on an ESW wiki page, but the main details of the effort are on the project's about page. Currently, riese only provides five million out of the three billion triples that it seeks to provide.
The under the hood section of the about page links to the riese schema. (Note: this is a simple RDF schema; no OWL in sight.) The schema models statistics as items that link to times, datasets, dimensions, geo information, and a value (using rdf:value).
Every statistical data item is a riese:item. riese:items are qualified with riese:dimensions, one of which is, in particular, dimension:Time.
The "ask" page gives two sample queries over the EuroStat RDF data, but those only deal with the datasets. RDF can be retrieved for the various riese tables and data items by appending /content.rdf to the items' URIs and doing an HTTP GET. Here's an example of some of the RDF for a particular data item (this is not strictly legal Turtle, but you'll get the point):
@prefix : <http://riese.joanneum.at/data/> .
@prefix riese: <http://riese.joanneum.at/schema/core#> .
@prefix dim: <http://riese.joanneum.at/dimension/> .
@prefix dim-schema: <http://riese.joanneum.at/schema/dimension/> .

:bp010 a riese:dataset ;    # all dc:title's repeated as rdfs:label
    dc:title "Current account - monthly: Total" ;
    riese:data_start "2002m10" ;    # proprietary format?
    riese:data_end "2007m09" ;
    riese:structure "geo\time" ;    # not sure of this format
    riese:datasetOf :bp010/2007m03_ea .

:bp010/2007m03_ea a riese:Item ;
    dc:title "Table: bp010, dimensions: ea, time: 2007m03" ;
    rdf:value "7093" ;    # not typed
    riese:dimension dim:geo/ea ;
    riese:dimension dim:time/2007m03 ;
    riese:dataset :bp010 .

dim:geo/ea a dim-schema:Geo .
    dc:title "Euro area (EA11-2000, EA12-2006, EA13-2007, EA15)" .

dim:time/2007m03 a dim-schema:Time .
    dc:title "" .    # oops

dim-schema:Geo rdfs:subClassOf riese:Dimension ;
    dc:title "Geo" .

dim-schema:Time rdfs:subClassOf riese:Dimension ;
    dc:title "Time" .
(A lot of this is available in dic.nt (39 MB).)
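Given that structure, pulling a time series out of a dataset is a single query shape. Here's a sketch (I haven't run this against the riese endpoint; it assumes the prefixes from the snippet above plus the usual rdf: binding):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX riese: <http://riese.joanneum.at/schema/core#>
PREFIX dim-schema: <http://riese.joanneum.at/schema/dimension/>
PREFIX : <http://riese.joanneum.at/data/>

# Every value of dataset bp010, paired with its time dimension
SELECT ?time ?value WHERE {
  ?item a riese:Item ;
        riese:dataset :bp010 ;
        riese:dimension ?time ;
        rdf:value ?value .
  ?time a dim-schema:Time .
}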
Summary
In summary, these three examples show three distinct approaches for modeling statistics:
- Simple, point-in-time statistics. Predicates that fully describe each statistic relate a (geographic, in this case) entity to the statistic's value. There's no way to represent time (or other dimensions) in this model other than to create a new predicate for every combination of dimensions (e.g. country:bolivia stat:1990population18-30male 123456). Queries are flat and rely on knowledge of, or metadata (e.g. rdfs:label) about, the predicates. There's no easy way to generate tables of related values. Observation: this approach effectively builds a model of the real world, ignoring statistical artifacts such as time, tables, and subtables.
- Complex, point-in-time statistics. An initial predicate relates a (geographic, in this case) entity both to an aggregate value for the statistic and (via blank nodes) to other predicates that represent dimensions. Aggregate values are available off of any point in the predicate chain. Applications need to be aware of the hierarchical predicate structure of the statistics for queries, but can reuse (and therefore link) some predicates amongst different statistics. Nested tables can easily be constructed from this model. Observation: this approach effectively builds a model of the statistical domain in question (demographics, geography, economics, etc., as broken into statistical tables).
- Complex statistics over time. Each statistic (each number) is represented as an item with a value. Dimensions (including time) are also described as resources with values, titles, etc. In this approach, the entire model is described by a small number of predicates. Applications can flexibly query for different combinations of time and other dimensions, though they still must know the identifying information for the dimensions in which they are interested. Applications can fairly easily construct nested tables from this model. Observation: this approach effectively uses a model of statistics (in general) which in turn is used to express statistics about the domains in question.
Statistical Abstract data
Simple with time
One of the simplest data tables in the Statistical Abstract gives statistics for airline on-time arrivals and departures. A sample of how this table is laid out is:
| Airport | On-time Arrivals, 2006 Q1 | On-time Arrivals, 2006 Q2 | On-time Departures, 2006 Q1 | On-time Departures, 2006 Q2 |
|---|---|---|---|---|
| Total major airports | 77.0 | 76.7 | 79.0 | 78.5 |
| Atlanta, Hartsfield | 73.9 | 75.5 | 76.0 | 74.3 |
| Boston, Logan International | 75.6 | 66.8 | 80.5 | 74.8 |
Overall, this is fairly simple. Every airport, for each time period, has an on-time arrival percentage and an on-time departure percentage. If we simplified it even further by removing the multiple time periods, then it's just a simple grid spreadsheet (relating airports to arrival % and departure %). This does have the interesting (?) twist that the aggregate data (total major airports) is not simply a sum of the constituent data items (since we're dealing in percentages).
Simple point-in-time approach
If we ignore time (and choose 2006 Q1 as our point in time), then this data models as:
ex:ATL ex:ontime-arrivals 73.9 ;
    ex:ontime-departures 76.0 .

ex:BOS ex:ontime-arrivals 75.6 ;
    ex:ontime-departures 80.5 .

ex:us-major-airports ex:ontime-arrivals 77.0 ;
    ex:ontime-departures 79.0 .
This is simple, but ignores time. It also doesn't give any hint that ex:us-major-airports is a total/aggregate of the other data. We could encode time in the predicates themselves (ex:ontime-arrivals-2006-q1), but I think everyone would agree that that's a bad idea. We could also let each time range be a blank node off the subjects, but that assumes all subjects have data conforming to the same time increments. Any such approach starts to get close to the complex point-in-time approach, so let's look at that.
Complex point-in-time approach
If we ignore time and view the "total major airports" as unrelated to the individual airports, then we have no "nested tables" and this approach degenerates to the simple point-in-time approach, effectively:
ex:ATL a ex:Airport ;
    dcterms:isPartOf ex:us-major-airports ;
    stat:details [
        ex:on-time-arrivals 73.9 ;
        ex:on-time-departures 76.0
    ] .

ex:BOS a ex:Airport ;
    dcterms:isPartOf ex:us-major-airports ;
    stat:details [
        ex:on-time-arrivals 75.6 ;
        ex:on-time-departures 80.5
    ] .

ex:us-major-airports dcterms:hasPart ex:ATL, ex:BOS ;
    stat:details [
        ex:on-time-arrivals 77.0 ;
        ex:on-time-departures 79.0 ;
    ] .
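Queries against this shape traverse the stat:details blank node. A sketch, where ex: and stat: stand in for the hypothetical namespaces used in these examples:

PREFIX ex: <http://example.org/stats/>          # hypothetical namespace
PREFIX stat: <http://example.org/schema/stat/>  # hypothetical namespace

# Arrival and departure percentages for every individual airport
SELECT ?airport ?arrivals ?departures WHERE {
  ?airport a ex:Airport ;
           stat:details ?d .
  ?d ex:on-time-arrivals ?arrivals ;
     ex:on-time-departures ?departures .
}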
We could treat time as a special-case that conditionalizes the statistics (stat:details) for any particular subject, such as:
ex:ATL a ex:Airport ;
    dcterms:isPartOf ex:us-major-airports ;
    stat:details [
        stat:start "2006-01-01"^^xsd:date ;
        stat:end "2006-03-31"^^xsd:date ;
        stat:details [
            ex:on-time-arrivals 73.9 ;
            ex:on-time-departures 76.0
        ]
    ] .
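Retrieving a statistic then means selecting the enclosing period by its dates. A sketch (same hypothetical namespaces; note that xsd:date comparison support varies between SPARQL engines):

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX ex: <http://example.org/stats/>          # hypothetical namespace
PREFIX stat: <http://example.org/schema/stat/>  # hypothetical namespace

# ATL's on-time arrival rate for whichever period covers 2006-02-01
SELECT ?arrivals WHERE {
  ex:ATL stat:details ?period .
  ?period stat:start ?start ;
          stat:end ?end ;
          stat:details ?d .
  ?d ex:on-time-arrivals ?arrivals .
  FILTER(?start <= "2006-02-01"^^xsd:date && "2006-02-01"^^xsd:date <= ?end)
}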
If we ignore time but view the "total major airports" statistics as an aggregate of the individual airports (which are subtables, then), we get this RDF structure:
ex:us-major-airports
    ex:on-time-arrivals 77.0 ;
    ex:on-time-departures 79.0 ;
    ex:ATL [
        ex:on-time-arrivals 73.9 ;
        ex:on-time-departures 76.0
    ] ;
    ex:BOS [
        ex:on-time-arrivals 75.6 ;
        ex:on-time-departures 80.5
    ] .
This is interesting because it treats the individual airports as subtables of the dataset. I don't think it's really a great way to model the data, however.
Complex Statistics Over Time
In the riese style, each statistic becomes an item qualified by dimension resources:

ex:ontime-flights a stat:Dataset ;
    dc:title "On-time Flight Arrivals and Departures at Major U.S. Airports: 2006" ;
    stat:date_start "2006-01-01"^^xsd:date ;
    stat:date_end "2006-12-31"^^xsd:date ;
    stat:structure "... something that explains how to display the stats ? ..." ;
    stat:datasetOf ex:atl-arr-2006q1, ex:atl-dep-2006q1, ... .

ex:atl-arr-2006q1 a stat:Item ;
    rdf:value 73.9 ;
    stat:dataset ex:ontime-flights ;
    stat:dimension ex:Q12006 ;
    stat:dimension ex:arrivals ;
    stat:dimension ex:ATL .

ex:atl-dep-2006q1 a stat:Item ;
    rdf:value 76.0 ;
    stat:dataset ex:ontime-flights ;
    stat:dimension ex:Q12006 ;
    stat:dimension ex:departures ;
    stat:dimension ex:ATL .

... more data items ...

ex:Q12006 a stat:TimePeriod ;
    dc:title "2006 Q1" ;
    stat:date_start "2006-01-01"^^xsd:date ;
    stat:date_end "2006-03-31"^^xsd:date .

ex:arrivals a stat:ScheduledFlightTime ;
    dc:title "Arrival time" .

ex:departures a stat:ScheduledFlightTime ;
    dc:title "Departure time" .

ex:ATL a stat:Airport ;
    dc:title "Atlanta, Hartsfield" .

... more dimension values ...

stat:TimePeriod rdfs:subClassOf stat:Dimension ;
    dc:title "time period" .

stat:ScheduledFlightTime rdfs:subClassOf stat:Dimension ;
    dc:title "arrival or departure" .

stat:Airport rdfs:subClassOf stat:Dimension ;
    dc:title "airport" .
First, this seems to be the most verbose. It also seems to give the greatest flexibility in terms of modeling time and querying the resulting data. One related alternative to this approach would replace dimension objects with dimension predicates, as in:
ex:atl-arr-2006q1 a stat:Item ;
    rdf:value 73.9 ;
    stat:dataset ex:ontime-flights ;
    stat:date_start "2006-01-01"^^xsd:date ;
    stat:date_end "2006-03-31"^^xsd:date ;
    stat:airport ex:ATL ;
    stat:scheduled-flight-time ex:arrivals .

stat:airport rdfs:subPropertyOf stat:dimension ;
    dc:title "airport" .
This may be a bit less verbose, but loses the ability to have multivalued dimensions such as stat:TimePeriod in the first example.
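To make the flexibility claim concrete, here's a sketch of a query against the first (dimension-objects) version above. It slices out every airport's on-time arrival rate for 2006 Q1; swapping ex:Q12006 or ex:arrivals for other dimension values slices along a different axis (ex: and stat: are again the hypothetical example namespaces):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex: <http://example.org/stats/>          # hypothetical namespace
PREFIX stat: <http://example.org/schema/stat/>  # hypothetical namespace

# On-time arrival rate for every airport in 2006 Q1
SELECT ?airport ?value WHERE {
  ?item a stat:Item ;
        rdf:value ?value ;
        stat:dimension ex:Q12006 ;
        stat:dimension ex:arrivals ;
        stat:dimension ?airport .
  ?airport a stat:Airport .
}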
Conclusion
The riese approach seems the best combination of flexibility and usability. It should allow us to recreate the data-table structures with a reasonable degree of fidelity in another environment (e.g. on the Web), as well as to construct a basic semantic repository by attaching definitions to the various statistical entities, facets, and properties. All that said, the proof's in the pudding, and until then I'm quite open to other suggestions.
Complex Statistics Over Time is a great way to encode tabular data, and I think I want to redo my Census dataset with that approach.
Some comments:
What about encoding the relation between Total and ATL?
ex:Total a stat:Airport.
ex:ATL a stat:Airport ;
stat:nonAdditiveDivisionOf ex:Total.
The dataset's date_start and date_end I think should be min, max, (and increment) properties on the dimensions.
What if the table has a column for Number of Flights Per Day? While it would be on the horizontal axis visually, the values do not have Time or ScheduledFlightTime dimension values. So it would seem that not all data points would have to be valued for all dimensions. On the other hand, it shows that some columns *do* sum for the total column while others don't.
With dimension predicates, I don't think you lose the ability to have complex dimensions -- just as with stat:airport ex:ATL, you can have stat:time time2006. I don't really see any difference between the two approaches, except with dimension predicates things are a little redundantly encoded (which may not be a bad thing). Or maybe I missed the point?
I had this piece sitting in a browser tab for quite a while, and I'm glad I read it now. A concise, to the point, and crystal clear look at the issue. Well done.
About the Eurostat D2R Server, our intention was indeed to model the statistics like a description of the real world. The idea was to create different named graphs for different points in time, and annotate each named graph to state its time period.
Unfortunately, D2RQ doesn't have the ability to generate multiple named graphs yet, and we couldn't find the time/budget to do the necessary coding, so we gave up and did just one, the most recent, graph.
I think I like the named graphs approach best, as the “description of reality” modelling is very intuitive, and treating time as different “realities” seems quite natural to me. If we have to stay in triplespace, then I like the riese approach best, it's verbose but can be queried and sliced very easily.
Again, great job on coming up with the “real world / statistical domain / statistics in general” categories, this is a great way to explain the issue.
Thanks to both of you for the thoughtful comments!
@Josh:
"""
What about encoding the relation between Total and ATL?
ex:Total a stat:Airport.
ex:ATL a stat:Airport ;
stat:nonAdditiveDivisionOf ex:Total.
"""
Yes, that's a good point / idea. I wonder if it would be semantically pure to use something like skos:member for this relation? It's probably best to use a predicate specifically for the statistical connotation, as in your example. With a proper subproperty hierarchy, something like stat:divisionOf and stat:additiveDivisionOf (or in the other direction: stat:aggregates and stat:sums or something similar?), it could give enough information to present the statistics well visually.
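A quick Turtle sketch of what such a hierarchy might look like (every name and namespace below is hypothetical):

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix stat: <http://example.org/schema/stat/> .  # hypothetical namespace
@prefix ex: <http://example.org/stats/> .          # hypothetical namespace

stat:divisionOf a rdf:Property .                                 # any part-of-a-total relation
stat:additiveDivisionOf rdfs:subPropertyOf stat:divisionOf .     # parts sum to the total
stat:nonAdditiveDivisionOf rdfs:subPropertyOf stat:divisionOf .  # parts don't sum (e.g. percentages)

ex:ATL stat:nonAdditiveDivisionOf ex:Total .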
Thanks for the date_start -> min, date_end -> max suggestion. That could be applied more broadly to other dimensions, which is nice. Increment is a bit tricky if we're reusing these properties for things like dates and also integers/reals.
You're absolutely right that different (visual) columns have different dimensions. In fact, that's extremely common in the Statistical Abstract dataset; most of the tables have various heterogeneous statistics captured. As to the aggregate column, it would be nice to capture the semantics of when component parts can be summed, but I don't think it's actually necessary right now. I'll have to see.
And lastly, you're right about the dimension predicate approach. It's basically putting semantics into the predicate instead of (or as you point out in some cases in addition to) the rdf:type of the object. *shrug*
@Richard:
I have the feeling that -- in the long run -- the best thing to do might be to take both approaches. For my particular use case, modeling the real world with named graph metadata for time is insufficient: part of what I'm actually trying to model (to be able to recreate it in other contexts) is the statistics themselves. On the other hand, there's clearly value in recording the semantics of the actual facts that the statistics denote. Also, I think it's far more difficult to accurately map from statistical tables to real-world semantics without cramming every last bit of semantics into each predicate.