At the Semantic Technologies Conference in San Jose in May, Brand Niemann of the U.S. EPA and I are presenting Getting to Web Semantics for Spreadsheets in the U.S. Government. In particular, Brand and I are working to exploit the semantics implicit in the nearly 1,500 spreadsheets that are in the U.S. Census Bureau's annual Statistical Abstract of the United States. The rest of this post discusses various strategies for modeling this sort of statistical data in RDF (for more information on the background of this work, please see my presentation from the February 5, 2008, SICoP Special Conference).
The data for the Statistical Abstract is effectively time-based statistics. There are a variety of ways that this information can be modeled as semantic data. The approaches differ in simplicity/complexity, semantic expressivity, and verbosity. At least as interestingly, they vary in precisely what they are modeling: statistical data or a particular domain of discourse. The goal of this effort is to examine the potential approaches to modeling this information in terms of ease of reuse, ease of query, ability to integrate with information from all 1,500 spreadsheets (and other sources), and the ability to enhance the model incrementally with richer semantics. There are surely other approaches to modeling this information as well: I'd love to hear any ideas or suggestions for other approaches to consider.
D2R Server for Eurostat
The D2R server guys host an RDF copy of the Eurostat collection of European economic, demographic, political, and geographic data. From the start, they make the simplifying assumption that:
Most statistical data are time series, therefore only the latest available value is provided here.
In other words, they do not try to capture historic statistics at all. The disclaimer also notes that what is modeled in RDF is a small subset of the available data tables.
Executing a SELECT DISTINCT ?p { ?s ?p ?o } query to learn more about this dataset tells us:
db:eurostat/population_total
db:eurostat/electricity_consumption_GWh
db:eurostat/killed_in_road_accidents
db:eurostat/RnD_exp_mio_euro
db:eurostat/parentcountry
db:eurostat/population_male
rdfs:label
db:eurostat/RnD_personel_percent_of_act_pop
db:eurostat/total_average_population
db:eurostat/population_female
db:eurostat/unemployment_rate_total
db:eurostat/avg_annual_population_growth
db:eurostat/total_area_km2
db:eurostat/name_encoded
db:eurostat/disposable_income
db:eurostat/injured_in_road_accidents
db:eurostat/electricity_production_capacity_MWh
db:eurostat/hospital_beds_per100000hab
db:eurostat/name
db:eurostat/landuse_total
db:eurostat/GDP
db:eurostat/geocode
owl:sameAs
rdf:type
db:eurostat/level_of_internetaccess_households
db:eurostat/death_rate
db:eurostat/fertility_rate_total
db:eurostat/level_of_internet_access
db:eurostat/marriages
db:eurostat/ecommerce_via_internet
db:eurostat/pupils_and_students
db:eurostat/inflation_rate
db:eurostat/employment_rate_total
db:eurostat/average_exit_age_from_laborforce
db:eurostat/comparative_price_levels
db:eurostat/GDP_current_prices
db:eurostat/GDP_per_capita_PPP
db:eurostat/monthly_labour_costs
I make a few observations from this: each predicate here fuses the full description of a statistic into a single property, so queries stay flat but depend on knowing (or discovering, via rdfs:label) the right predicate names, and there is no room for time or any other dimension short of minting a new predicate per combination.
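To see just how flat the resulting queries are, here is a minimal SPARQL sketch. The predicate IRIs are stand-ins (I have not reproduced the D2R server's actual namespace), though the local names come from the list above:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# One triple pattern per statistic; all the semantics live in the predicate name.
SELECT ?country ?label ?pop ?gdp
WHERE {
  ?country rdfs:label ?label ;
           <http://example.org/eurostat/population_total> ?pop ;
           <http://example.org/eurostat/GDP> ?gdp .
}

Note that there is no way to ask this model "as of when?"; the answer is simply whatever latest value was loaded.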
The 2000 U.S. Census
Joshua Tauberer converted the 2000 U.S. Census data into 1 billion RDF triples. He provides a well-documented Perl script that can convert various subsets of the census data into N3. One mode in which this script can be run outputs the schema from SAS table layout files. Joshua's about page provides an overview of the data. In particular, I note that he is working with tables that are multiple levels deep (e.g. population by sex and then by age).
The most useful part of the writeup, though, is the section specifically about modeling the census data in RDF. In general, Joshua models nested levels of statistical tables (representing multiple facets of the data) as a chain of predicates (with the interim nodes as blank nodes). If a particular criterion is further subdivided, then the aggregate total at that level is linked with rdf:value; otherwise, the value is given as the object itself. Note that the subjects are not real-world entities ("the U.S.") but rather data tables ("the U.S. census tables"). The entities themselves are related to the data tables via a details predicate. The excerpt below combines both types of information (the entity itself followed by the data tables about the entity):
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix : <tag:govshare.info,2005:rdf/census/details/100pct> .
@prefix politico: <http://www.rdfabout.com/rdf/schema/politico/> .
@prefix census: <http://www.rdfabout.com/rdf/schema/census/> .
<http://www.rdfabout.com/rdf/usgov/geo/us>
a politico:country ;
dc:title "United States" ;
census:households 115904641 ;
census:waterarea "664706489036 m^2" ;
census:population 281421906 ;
census:details <http://www.rdfabout.com/rdf/usgov/geo/us/censustables> ;
dcterms:hasPart <http://www.rdfabout.com/rdf/usgov/geo/us/al>, <http://www.rdfabout.com/rdf/usgov/geo/us/az>, ...
.
<http://www.rdfabout.com/rdf/usgov/geo/us/censustables> :totalPopulation 281421906 ; # P001001
:totalPopulation [
dc:title "URBAN AND RURAL (P002001)";
rdf:value 281421906 ; # P002001
:urban [
rdf:value 222360539 ; # P002002
:insideUrbanizedAreas 192323824 ; # P002003
:insideUrbanClusters 30036715 ; # P002004
] ;
:rural 59061367 ;
] ;
:totalPopulation [
dc:title "RACE (P003001)";
rdf:value 281421906 ; # P003001
:populationOfOneRace [
rdf:value 274595678 ; # P003002
:whiteAlone 211460626 ; # P003003
:blackOrAfricanAmericanAlone 34658190 ; # P003004
:americanIndianAndAlaskaNativeAlone 2475956 ; # P003005
]
...
This modeling is inconsistent (as Joshua himself admits in the description). Note, for instance, how :totalPopulation > :urban has an rdf:value link to the aggregate US urban population. When you go one level deeper, though, :totalPopulation > :urban > :insideUrbanizedAreas has an object which is itself the value of that statistic.
As I see it, this inconsistency could be avoided in two ways:
- Always insist that a statistic hangs off of a resource (URI or blank node) via the rdf:value predicate.
- Allow a criterion/classification predicate to point both to a literal (aggregate) value and to further subdivisions. This would allow the above example to have the triple :totalPopulation > :urban > 222360539 in addition to the further nested :totalPopulation > :urban > :insideUrbanizedAreas > 192323824.
The second approach seems simpler to me (fewer triples). It can be queried with an isLiteral filter restriction. The first approach might make for a slightly simpler query, as it would always just query for rdf:value. (The queries would be about the same size, but the rdf:value approach is a bit clearer to read than the isLiteral filter approach.)
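To make the trade-off concrete, here is a hedged sketch of both query styles, reusing the census excerpt's namespaces; the first assumes a model following option 1, the second a model following option 2, and both fetch the aggregate U.S. urban population:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX : <tag:govshare.info,2005:rdf/census/details/100pct>

# Option 1: every statistic hangs off a node via rdf:value.
SELECT ?urbanTotal WHERE {
  <http://www.rdfabout.com/rdf/usgov/geo/us/censustables>
      :totalPopulation [ :urban [ rdf:value ?urbanTotal ] ] .
}

# Option 2: the criterion predicate points at the aggregate literal directly,
# so we pick the literal out from among its objects.
SELECT ?urbanTotal WHERE {
  <http://www.rdfabout.com/rdf/usgov/geo/us/censustables>
      :totalPopulation [ :urban ?urbanTotal ] .
  FILTER (isLiteral(?urbanTotal))
}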
As an aside, this statement from Joshua speaks to the value of what we are doing with the U.S. Statistical Abstract data:
(If you followed Region > households > nonFamilyHouseholds you would get the number of households, not people, that are nonFamilyHouseHolds. To know what a "non-family household" is, you would have to consult the PDFs published by the Census.)
Riese: RDFizing and Interlinking the EuroStat Data Set Effort
Riese is another effort to convert the EuroStat data to RDF. It seeks to expand on the coverage of the D2R effort. Project discussion is available on an ESW wiki page, but the main details of the effort are on the project's about page. Currently, riese provides only five million of the three billion triples that it eventually seeks to serve.
The under the hood section of the about page links to the riese schema. (Note: this is a simple RDF schema; no OWL in sight.) The schema models statistics as items that link to times, datasets, dimensions, geo information, and a value (using rdf:value).
Every statistical data item is a riese:item. riese:items are qualified with riese:dimensions, one of which is, in particular, dimension:Time.
The "ask" page gives two sample queries over the EuroStat RDF data, but those only deal in the datasets. RDF can be retrieved for the various Riese tables and data items by appending /content.rdf to the items' URIs and doing an HTTP GET. Here's an example of some of the RDF for a particular data item (this is not strictly legal Turtle, but you'll get the point):
@prefix : <http://riese.joanneum.at/data/> .
@prefix riese: <http://riese.joanneum.at/schema/core#> .
@prefix dim: <http://riese.joanneum.at/dimension/> .
@prefix dim-schema: <http://riese.joanneum.at/schema/dimension/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
:bp010 a riese:dataset ;
# all dc:title's repeated as rdfs:label
dc:title "Current account - monthly: Total" ;
riese:data_start "2002m10" ; # proprietary format?
riese:data_end "2007m09" ;
riese:structure "geo\time" ; # not sure of this format
riese:datasetOf :bp010/2007m03_ea .
:bp010/2007m03_ea a riese:Item ;
dc:title "Table: bp010, dimensions: ea, time: 2007m03" ;
rdf:value "7093" ; # not typed
riese:dimension dim:geo/ea ;
riese:dimension dim:time/2007m03 ;
riese:dataset :bp010 .
dim:geo/ea a dim-schema:Geo ;
dc:title "Euro area (EA11-2000, EA12-2006, EA13-2007, EA15)" .
dim:time/2007m03 a dim-schema:Time ;
dc:title "" . # oops -- an empty title in the source data
dim-schema:Geo rdfs:subClassOf riese:Dimension ; dc:title "Geo" .
dim-schema:Time rdfs:subClassOf riese:Dimension ; dc:title "Time" .
(A lot of this is available in dic.nt (39 MB).)
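As a quick check on how queryable this shape is, here is a hedged sketch that pulls every value of the bp010 dataset for the euro area, one row per time point. The names are taken from the fragment above; I have not verified them against the live riese endpoint:

PREFIX riese: <http://riese.joanneum.at/schema/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?item ?time ?value WHERE {
  ?item riese:dataset <http://riese.joanneum.at/data/bp010> ;
        riese:dimension <http://riese.joanneum.at/dimension/geo/ea> ;
        riese:dimension ?time ;
        rdf:value ?value .
  ?time a <http://riese.joanneum.at/schema/dimension/Time> .
}

Because every item carries the same small set of predicates (riese:dataset, riese:dimension, rdf:value), this query shape works unchanged for any dataset and any combination of dimensions.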
Summary
In summary, these three examples show three distinct approaches for modeling statistics:
- Simple, point-in-time statistics. Predicates that fully describe each statistic relate a (geographic, in this case) entity to the statistic's value. There is no way to represent time (or other dimensions) in this model other than to create a new predicate for every combination of dimensions (e.g. country:bolivia stat:1990population18-30male 123456). Queries are flat and rely on knowledge of, or metadata (e.g. rdfs:label) about, the predicates. There is no easy way to generate tables of related values. Observation: this approach effectively builds a model of the real world, ignoring statistical artifacts such as time, tables, and subtables.
- Complex, point-in-time statistics. An initial predicate relates a (geographic, in this case) entity both to an aggregate value for the statistic and (via blank nodes) to other predicates that represent dimensions. Aggregate values are available at any point in the predicate chain. Applications need to be aware of the hierarchical predicate structure of the statistics for queries, but can reuse (and therefore link) some predicates among different statistics. Nested tables can easily be constructed from this model. Observation: this approach effectively builds a model of the statistical domain in question (demographics, geography, economics, etc., as broken into statistical tables).
- Complex statistics over time. Each statistic (each number) is represented as an item with a value. Dimensions (including time) are also described as resources with values, titles, etc. In this approach, the entire model is described by a small number of predicates. Applications can flexibly query for different combinations of time and other dimensions, though they still must know the identifying information for the dimensions in which they are interested. Applications can fairly easily construct nested tables from this model. Observation: this approach effectively uses a model of statistics (in general), which in turn is used to express statistics about the domains in question.
Statistical Abstract data
Simple with time
One of the simplest data tables in the Statistical Abstract gives statistics for airline on-time arrivals and departures. A sample of how this table is laid out is:
Airport                     | On-time Arrivals    | On-time Departures
                            | 2006 Q1 | 2006 Q2   | 2006 Q1 | 2006 Q2
Total major airports        |    77.0 |    76.7   |    79.0 |    78.5
Atlanta, Hartsfield         |    73.9 |    75.5   |    76.0 |    74.3
Boston, Logan International |    75.6 |    66.8   |    80.5 |    74.8
Overall, this is fairly simple. Every airport, for each time period, has an on-time arrival percentage and an on-time departure percentage. If we simplified it even further by removing the multiple time periods, then it's just a simple grid spreadsheet (relating airports to arrival % and departure %). This does have the interesting (?) twist that the aggregate data (total major airports) is not simply a sum of the constituent data items (since we're dealing in percentages).
Simple point-in-time approach
If we ignore time (and choose 2006 Q1 as our point in time), then this data models as:
ex:ATL ex:ontime-arrivals 73.9 ; ex:ontime-departures 76.0 .
ex:BOS ex:ontime-arrivals 75.6 ; ex:ontime-departures 80.5 .
ex:us-major-airports ex:ontime-arrivals 77.0 ; ex:ontime-departures 79.0 .
This is simple, but ignores time. It also doesn't give any hint that ex:us-major-airports is a total/aggregate of the other data. We could encode time in the predicates themselves (ex:ontime-arrivals-2006-q1), but I think everyone would agree that that's a bad idea. We could also let each time range be a blank node off the subjects, but that assumes all subjects have data conforming to the same time increments. Any such approach starts to get close to the complex point-in-time approach, so let's look at that.
Complex point-in-time approach
If we ignore time and view the "total major airports" as unrelated to the individual airports, then we have no "nested tables" and this approach degenerates to the simple point-in-time approach, effectively:
ex:ATL a ex:Airport ;
dcterms:isPartOf ex:us-major-airports ;
stat:details [
ex:on-time-arrivals 73.9 ;
ex:on-time-departures 76.0
] .
ex:BOS a ex:Airport ;
dcterms:isPartOf ex:us-major-airports ;
stat:details [
ex:on-time-arrivals 75.6 ;
ex:on-time-departures 80.5
] .
ex:us-major-airports
dcterms:hasPart ex:ATL, ex:BOS ;
stat:details [
ex:on-time-arrivals 77.0 ;
ex:on-time-departures 79.0 ;
] .
We could treat time as a special case that conditionalizes the statistics (stat:details) for any particular subject, such as:
ex:ATL a ex:Airport ;
dcterms:isPartOf ex:us-major-airports ;
stat:details [
stat:start "2006-01-01"^^xsd:date ;
stat:end "2006-03-31"^^xsd:date ;
stat:details [
ex:on-time-arrivals 73.9 ;
ex:on-time-departures 76.0
]
] .
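Querying this shape means walking the chain of blank nodes. Here is a hedged sketch, with stat: and ex: standing in for the hypothetical namespaces of the fragments above:

PREFIX stat: <http://example.org/stat#>   # hypothetical namespace
PREFIX ex: <http://example.org/flights#>  # hypothetical namespace
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# ATL's on-time arrival percentage for the period starting 2006-01-01.
SELECT ?arrivals WHERE {
  ex:ATL stat:details ?period .
  ?period stat:start "2006-01-01"^^xsd:date ;
          stat:details [ ex:on-time-arrivals ?arrivals ] .
}

The application has to know that the time qualifiers sit one stat:details level above the statistics themselves; that is exactly the awareness of hierarchical predicate structure this approach demands.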
If we ignore time but view the "total major airports" statistics as an aggregate of the individual airports (which then become subtables), we get this RDF structure:
ex:us-major-airports
ex:on-time-arrivals 77.0 ;
ex:on-time-departures 79.0 ;
ex:ATL [
ex:on-time-arrivals 73.9 ;
ex:on-time-departures 76.0
] ;
ex:BOS [
ex:on-time-arrivals 75.6 ;
ex:on-time-departures 80.5
] .
This is interesting because it treats the individual airports as subtables of the dataset. I don't think it's really a great way to model the data, however.
Complex Statistics Over Time
ex:ontime-flights a stat:Dataset ;
dc:title "On-time Flight Arrivals and Departures at Major U.S. Airports: 2006" ;
stat:date_start "2006-01-01"^^xsd:date ;
stat:date_end "2006-12-31"^^xsd:date ;
stat:structure "... something that explains how to display the stats ? ..." ;
stat:datasetOf ex:atl-arr-2006q1, ex:atl-dep-2006q1, ... .
ex:atl-arr-2006q1 a stat:Item ;
rdf:value 73.9 ;
stat:dataset ex:ontime-flights ;
stat:dimension ex:Q12006 ;
stat:dimension ex:arrivals ;
stat:dimension ex:ATL .
ex:atl-dep-2006q1 a stat:Item ;
rdf:value 76.0 ;
stat:dataset ex:ontime-flights ;
stat:dimension ex:Q12006 ;
stat:dimension ex:departures ;
stat:dimension ex:ATL .
... more data items ...
ex:Q12006 a stat:TimePeriod ;
dc:title "2006 Q1" ;
stat:date_start "2006-01-01"^^xsd:date ;
stat:date_end "2006-03-31"^^xsd:date .
ex:arrivals a stat:ScheduledFlightTime ;
dc:title "Arrival time" .
ex:departures a stat:ScheduledFlightTime ;
dc:title "Departure time" .
ex:ATL a stat:Airport ;
dc:title "Atlanta, Hartsfield" .
... more dimension values ...
stat:TimePeriod rdfs:subClassOf stat:Dimension ; dc:title "time period" .
stat:ScheduledFlightTime rdfs:subClassOf stat:Dimension ; dc:title "arrival or departure" .
stat:Airport rdfs:subClassOf stat:Dimension ; dc:title "airport" .
This seems to be the most verbose of the approaches so far. It also seems to give the greatest flexibility in terms of modeling time and querying the resulting data. One related alternative to this approach would replace dimension objects with dimension predicates, as in:
ex:atl-arr-2006q1 a stat:Item ;
rdf:value 73.9 ;
stat:dataset ex:ontime-flights ;
stat:date_start "2006-01-01"^^xsd:date ;
stat:date_end "2006-03-31"^^xsd:date ;
stat:airport ex:ATL ;
stat:scheduled-flight-time ex:arrivals .
stat:airport rdfs:subPropertyOf stat:dimension ; dc:title "airport" .
This may be a bit less verbose, but it loses the ability to have multivalued dimensions such as the stat:TimePeriod resource in the first example (which carries a title plus start and end dates).
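Either way, the flexibility claimed above is easy to demonstrate. Here is a hedged sketch against the first (dimension-object) style that builds a table of 2006 Q1 on-time arrival percentages, one row per airport; stat: and ex: again stand in for hypothetical namespaces:

PREFIX stat: <http://example.org/stat#>   # hypothetical namespace
PREFIX ex: <http://example.org/flights#>  # hypothetical namespace
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?airportTitle ?value WHERE {
  ?item a stat:Item ;
        stat:dataset ex:ontime-flights ;
        stat:dimension ex:Q12006 ;
        stat:dimension ex:arrivals ;
        stat:dimension ?airport ;
        rdf:value ?value .
  ?airport a stat:Airport ;
           dc:title ?airportTitle .
}

Swapping ex:Q12006 or ex:arrivals for another dimension value, or dropping one of them to get all values along that dimension, requires no change to the query's shape.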
Conclusion
The riese approach seems the best combination of flexibility and usability. It should allow us to recreate the data-table structures with a reasonable degree of fidelity in another environment (e.g. on the Web), as well as to construct a basic semantic repository by attaching definitions to the various statistical entities, facets, and properties. All that said, the proof is in the pudding, and until we've tried these approaches out against the real data I'm quite open to other suggestions.