Using RDF on the Web: A Vision


(This is the second part of two posts about using RDF on the Web. The first post was a survey of approaches for creating RDF-data-driven Web applications.) All existing implementations referred to in this post are discussed in more detail and linked to in part one.

Here's what I would like to see, along with some thoughts on what is or is not implemented. It's by no means a complete solution and there are plenty of unanswered questions. I'd also never claim that it's the right solution for all or most applications. But I think it has a certain elegance and power that would make developing certain types of Web applications straightforward, quick, and enjoyable. Whenever I refer to "the application" or "the app", I'm talking about a browser-based Web application implemented in JavaScript.

  • To begin with, I imagine servers around the Web storing domain-specific RDF data. This could be actual, materialized RDF data or virtual RDF views of underlying data in other formats. This first piece of the vision is, of course, widely implemented (e.g. Jena, Sesame, Boca, Oracle, Virtuoso).

  • The application fetches RDF from such a server. This may be done in a variety of ways:

    • An HTTP GET request for a particular RDF/XML or Turtle document
    • An HTTP GET request for a particular named graph within a quad store (a la Boca or Sesame)
    • A SPARQL CONSTRUCT query extracting and transforming the pieces of the domain-specific data that are most relevant to the application
    • A SPARQL DESCRIBE query requesting RDF about a particular resource (URI)

    In my mind, the CONSTRUCT approach is the most appealing method here: it allows the application to massage data which it may be receiving from multiple data sources into a single domain-specific RDF model that can be as close as possible to the application's own view of the world. In other words, reading the RDF via a query effectively allows the application to define its own API.

    Once again, the software for this step already exists via traditional Web servers and SPARQL protocol endpoints.
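
    To make the CONSTRUCT approach concrete, here's a minimal sketch of fetching RDF from a SPARQL protocol endpoint with XMLHttpRequest. The endpoint URL, the example query, the Accept media type, and the handleRdfText function are all illustrative assumptions, not references to any existing deployment.

        // Hypothetical endpoint and query (illustrative only).
        var endpoint = 'http://data.example.org/sparql';
        var query =
          'PREFIX foaf: <http://xmlns.com/foaf/0.1/> ' +
          'CONSTRUCT { ?person foaf:name ?name } ' +
          'WHERE { ?person foaf:name ?name }';

        var req = new XMLHttpRequest();
        req.open('GET', endpoint + '?query=' + encodeURIComponent(query), true);
        // The media type to ask for depends on the endpoint; Turtle is assumed here.
        req.setRequestHeader('Accept', 'text/turtle');
        req.onreadystatechange = function () {
          if (req.readyState == 4 && req.status == 200) {
            handleRdfText(req.responseText); // hand the text off to a parser (next step)
          }
        };
        req.send(null);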

  • Second, the application must parse the RDF into a client-side model. Precisely how this is done depends on the form taken by the RDF received from the server:

    • The server returns RDF/XML. In this case, the client can use Jim Ley's parser to end up with a list of triples representing the RDF graph. The software to do this is already implemented.
    • The server returns Turtle. In this case, the client can use Masahide Kanzaki's parser to end up with a list of triples representing the RDF graph. The software to do this is already implemented.
    • The server returns RDF/JSON. In this case, the client can use Douglas Crockford's JSON parsing library (effectively a regular-expression security check followed by a call to eval(...)). While the software is implemented here, the RDF/JSON standard which I've cavalierly tossed about so far does not yet exist. Here, I'm imagining a specification which defines RDF/JSON based on the common JavaScript data structure used by the above two parsers. (A bit of work probably still needs to be done if this were to become a full RDF/JSON specification, as I do not believe the current format used by the two parsers can distinguish blank-node subjects from subjects with URIs.)

    In any case, we now have on the client a simple RDF graph of data specific to the domain of our application. Yet as I've said before, we'd like to make application development easier by moving away from triples at this point into data structures which more closely represent the concepts being manipulated by the application.
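
    For orientation, here is roughly the shape of data I have in mind when I say "a list of triples" on the client. The property names below are illustrative; the structures actually produced by Jim Ley's and Masahide Kanzaki's parsers differ in their details (and, as noted above, don't cleanly distinguish blank nodes).

        // Illustrative only: one possible shape for a client-side triple list.
        var triples = [
          { subject:   'http://example.org/people/alice',
            predicate: 'http://xmlns.com/foaf/0.1/name',
            object:    'Alice',
            type:      'literal' },
          { subject:   'http://example.org/people/alice',
            predicate: 'http://xmlns.com/foaf/0.1/knows',
            object:    'http://example.org/people/bob',
            type:      'uri' }
        ];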

  • The next step, then, is to map the RDF model into an application-friendly JavaScript object model. If I understand ActiveRDF correctly (and in all fairness I've only had the chance to play with it a very limited amount), it will examine either the ontological statements or instance data within an RDF model and will generate a Ruby class hierarchy accordingly. The introduction to ActiveRDF explains the dirty-but-well-appreciated trick that is used: "Just use the part of the URI behind the last ”/” or ”#” and Active RDF will figure out what property you mean on its own." Of course, sometimes there will be ambiguities, clashes, or properties written to which did not already exist (with full URIs) in the instance data received; in these cases, manual intervention will be necessary. But I'd suggest that in many, many cases, applying this sort of best-effort heuristic to a domain-specific RDF model (especially one which the application has selected via a CONSTRUCT query) will result in extremely natural object hierarchies.

    None of this piece is implemented at all. I'd imagine that it would not be too difficult, following the model set forth by the ActiveRDF folks; a rough sketch of the heuristic appears at the end of this item.

    Late-breaking news: Niklas Lindström, developer of the Python RDF ORM system Oort, followed up on my last post and said (among other interesting things):

    I use an approach of "removing dimensions": namespaces, I18N (optionally), RDF-specific distinctions (collections vs. multiple properties) and other forms of graph traversing.

    Sounds like there would be some more simplification processes that could be adapted from Oort in addition to those adapted from ActiveRDF.
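
    To give a flavor of the heuristic, here is a rough sketch of the localname trick in JavaScript. The function names and the triple structure are hypothetical (matching the illustrative triple list above), and the sketch deliberately ignores clashes, blank nodes, and multi-valued properties.

        // Hypothetical sketch: build plain JavaScript objects from a triple list,
        // using the local name of each predicate URI as the property name.
        function localName(uri) {
          return uri.substring(Math.max(uri.lastIndexOf('/'), uri.lastIndexOf('#')) + 1);
        }

        function toDomainObjects(triples) {
          var objects = {};                    // keyed by subject URI
          for (var i = 0; i < triples.length; i++) {
            var t = triples[i];
            var obj = objects[t.subject];
            if (!obj) {
              obj = objects[t.subject] = { uri: t.subject };
            }
            obj[localName(t.predicate)] = t.object; // last value wins; clashes need manual help
          }
          return objects;
        }

        // e.g. toDomainObjects(triples)['http://example.org/people/alice'].name == 'Alice'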

  • The main logic of the Web application (and the work of the application developer) goes here. The developer receives a domain model and can render it and attach logic to it in any way he or she sees fit. Often this will be via a traditional model-view-controller approach: this approach is facilitated by toolkits such as Dojo or by a system such as nike templates (née microtemplates). Thus, the software to enable this meat-and-potatoes part of application development already exists.

    In the course of the user interacting with the application, certain data values change, new data values are added, and/or some data items are deleted. The application controller handles these mutations via the domain-specific object structures, without regard to any RDF model.

  • When it comes time to commit the changes (this could happen as changes occur or once the user saves/commits his or her work), standard JavaScript (i.e. a reusable library, rather than application-specific code) recognizes what has changed and maps (inverts) the objects back to the RDF model (as before, represented as arrays of triples). This inversion is probably performed by the same library that automatically generated the object structure from the RDF model in the first place. As with that piece of this puzzle, this library does not yet exist.

    Reversing the RDF ORM mapping is clearly challenging, especially when new data is added which has not been previously seen by the library. In some cases--perhaps even in most?--the application will need to provide hints to the library to help the inversion. I imagine that the system probably needs to keep an untouched deep copy of the original domain objects to allow it to find new, removed, and dirty data at this point. (An alternative would be requiring adds, deletes, and mutations to be performed via methods, but this constrains the natural use of the domain objects.)
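
    As an illustration of the deep-copy idea, here's a sketch (all names hypothetical): take a pristine snapshot when the objects are handed to the application, then compare at commit time. Only scalar-valued properties are compared here; a real library would need to recurse and to track additions and deletions of whole objects.

        // Hypothetical sketch of snapshot-based change detection.
        function deepCopy(value) {
          if (value === null || typeof value != 'object') return value;
          var copy = (value instanceof Array) ? [] : {};
          for (var key in value) copy[key] = deepCopy(value[key]);
          return copy;
        }

        // 'domainObjects' stands in for the objects produced by the RDF-to-object mapping.
        var pristine = deepCopy(domainObjects);

        function changedProperties(original, current) {
          var changes = [];
          for (var key in current) {
            if (original[key] !== current[key]) changes.push(key); // new or modified
          }
          for (var key in original) {
            if (!(key in current)) changes.push(key);              // deleted
          }
          return changes;
        }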

  • Next, we determine the RDF difference between our original model and our updated model. The canonical work on RDF deltas is a design note by Tim Berners-Lee and Dan Connolly. Basically, though, an RDF diff amounts simply to a collection of triples to remove and a collection of triples to add to a graph. No (JavaScript) code yet exists to calculate RDF graph diffs, though the algorithms are widely implemented in other environments including cwm, rdf-utils, and SemVersion. We also often work with RDF diffs in Boca (when the Boca client replicates changes to a Boca server). I'd hope that this implementation experience would translate easily to a JavaScript implementation.
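
    In JavaScript, the diff itself could be little more than a set difference over serialized triple keys. A minimal sketch, reusing the illustrative triple structure from above and ignoring blank nodes entirely (which, as discussed under "Other Caveats" below, is the genuinely hard part):

        // Hypothetical sketch: compute the triples to add and remove between two graphs.
        function tripleKey(t) {
          return t.subject + ' ' + t.predicate + ' ' + t.object + ' ' + t.type;
        }

        function diffGraphs(oldTriples, newTriples) {
          var oldIndex = {}, newIndex = {};
          for (var i = 0; i < oldTriples.length; i++) oldIndex[tripleKey(oldTriples[i])] = oldTriples[i];
          for (var j = 0; j < newTriples.length; j++) newIndex[tripleKey(newTriples[j])] = newTriples[j];

          var diff = { add: [], remove: [] };
          for (var key in newIndex) if (!(key in oldIndex)) diff.add.push(newIndex[key]);
          for (var key in oldIndex) if (!(key in newIndex)) diff.remove.push(oldIndex[key]);
          return diff;
        }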

  • Finally, we serialize the RDF diffs and send them back to the data source. This requires two components that are not yet well-defined:

    • A serialization format for the RDF diffs. Tim and Dan's note uses the ability to quote graphs within N3 combined with a handful of predicates (diff:replacement, diff:deletion, and diff:insertion). I can also imagine a simple extension of (whatever ends up being) the RDF/JSON format to specify the triples to remove and add:
        {
          'add' : [ RDF/JSON triple structures go here ],
          'remove' : [ RDF/JSON triple structures go here ]
        }
      
    • An endpoint or protocol which accepts this RDF diff serialization. Once we've expressed the changes to our source data, of course, we need somewhere to send them. Preferably, there would be a standard protocol (à la the SPARQL Protocol) for sending these changes to a server. To my knowledge, endpoints that accept RDF diffs to update RDF data are not currently implemented. (Late-breaking addition: on my first post, Chris and Richard both pointed me to Mark Baker's work on RDF forms. While I'm not very familiar with any existing uses of this work, it looks like it might be an interesting way to describe the capabilities of an RDF update endpoint.)

    As an alternative for this step, the entire client-side RDF model could be serialized (to RDF/XML or to N-Triples or to RDF/JSON) and HTTP PUT back to an origin server. This strategy seems to make the most sense in a document-oriented system; to my knowledge this is also not currently implemented.
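
    To round out the picture, here's a hypothetical sketch of sending such a diff to an update endpoint. Since no such endpoint or protocol exists yet, the URL, the media type, and the serialization are all assumptions; the JSON shape simply follows the add/remove structure sketched above (and the naive serializer below does no string escaping).

        // Hypothetical sketch: serialize a triple list and POST an RDF diff
        // to an (as yet nonexistent) update endpoint.
        function toJsonTriples(triples) {
          var parts = [];
          for (var i = 0; i < triples.length; i++) {
            var t = triples[i];
            parts.push('{ "subject": "' + t.subject + '", "predicate": "' + t.predicate +
                       '", "object": "' + t.object + '", "type": "' + t.type + '" }');
          }
          return '[' + parts.join(', ') + ']';
        }

        function sendDiff(updateEndpoint, diff) {
          var body = '{ "add": ' + toJsonTriples(diff.add) +
                     ', "remove": ' + toJsonTriples(diff.remove) + ' }';
          var req = new XMLHttpRequest();
          req.open('POST', updateEndpoint, true);
          req.setRequestHeader('Content-Type', 'application/json');
          req.send(body);
        }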

That's my vision, as raw and underdeveloped as it may be. There are a large number of extensions, challenges and related work that I have not yet mentioned, but which will need to be addressed when creating or working with this type of Web application. Some discussion of these is also in order.

Handling Multiple Sources of Data

To use the above Web-application-development environment to create Web 2.0-style mash-ups, most of the steps would need to be performed once per data source being integrated. This adds to the system a provenance requirement, whereby the libraries could offer the application a unified view of the domain-specific data while still maintaining links between individual data elements and their source graphs/servers/endpoints to facilitate update. When the RDF diffs are computed, they would need to be sent back to the proper origins. Also, the sample JavaScript structures that I've mentioned as a base for RDF/JSON and the RDF/JSON diff serialization would likely need to be augmented with a URI identifying the source graph of each triple. (That is, we'd end up working with a quad system, though we'd probably be able to ignore that in the object hierarchy that the application deals with.) In many cases, though, an application that reads from many data sources will write only to a single source; it does not seem particularly onerous for the application to specify a default "write-back" endpoint.
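
The per-source bookkeeping might look something like the following (hypothetical, extending the illustrative triple structure from earlier): each triple carries the URI of its source graph or endpoint, and at commit time the diffs are grouped by that source (falling back to a default write-back endpoint).

    // Hypothetical sketch: a quad is just a triple plus the URI of its source.
    var quads = [
      { subject:   'http://example.org/people/alice',
        predicate: 'http://xmlns.com/foaf/0.1/name',
        object:    'Alice',
        type:      'literal',
        source:    'http://data.example.org/graphs/people' }
    ];

    // Group statements by source so each diff can be sent back to the right origin.
    function groupBySource(quads, defaultSource) {
      var groups = {};
      for (var i = 0; i < quads.length; i++) {
        var src = quads[i].source || defaultSource;
        if (!groups[src]) groups[src] = [];
        groups[src].push(quads[i]);
      }
      return groups;
    }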

Inverting SPARQL CONSTRUCT Queries

An appealing part of the above system (to me, at least) is the use of CONSTRUCT queries to map origin data to a common RDF model before merging it on the client and then mapping it into a domain-specific JavaScript object structure. Such transformations, however, make it hard to automatically send the proper updates back to the origin servers: we'd need a way of inverting the CONSTRUCT query which generated the triples the application has (indirectly) worked with, and while I have not given it much thought, I imagine that that is quite difficult, if not impossible, in the general case.

SPARQL UPDATE

The DAWG has postponed any work on updating graphs for the initial version of SPARQL, but Max Völkel and Richard Cyganiak have started a bit of discussion on what update in SPARQL might look like (though Richard has apparently soured on the idea a bit since then). At first blush, using SPARQL to update data seems like a natural counterpart to using SPARQL to retrieve the data. However, in the vision I describe above, the application would likely need to craft a corresponding SPARQL UPDATE query for each SPARQL CONSTRUCT query that is used to retrieve the data in the first place. This would be a larger burden on the application developer, so should probably be avoided.

Related Work

I wanted to acknowledge that in several ways this whole pattern is closely related to but (in some mindset, at least) the inverse of a paradigm that Danny Ayers has floated in the past. Danny has suggested using SPARQL CONSTRUCT queries to transition from domain-specific models to domain-independent models (for example, a reporting model). Data from various sources (and disparate domains) can be merged at the domain-independent level and then (perhaps via XSLT) used to generate Web pages summarizing and analyzing the data in question. In my thoughts above, we're also using the CONSTRUCT queries to generate an agreed-upon model, but in this case we're seeking an extremely domain-specific model to make it easier for the Web-application developer to deal with RDF data (and related data from multiple sources).

Danny also sent some related material to www-archive. It's not the same vision, but parts of it sound familiar.

Other Caveats

Updating data has security implications, of course. I haven't even begun to think about them.

Blank nodes complicate almost everything; this may be sacrilege in some circles, but in most cases I'm willing to pretend that blank nodes don't exist for my data-integration needs. Incorporating blank nodes makes the RDF/JSON structures (slightly) more complicated; it raises the question of smushing together nodes when joining various models; and it significantly complicates the process of specifying which triples to remove when serializing the RDF diffs. I'd guess that it's all doable using functional and inverse-functional properties and/or with told bnodes, but it probably requires more help from the application developer.

I have some worries about concurrency issues for update. Again, I haven't thought about that much and I know that the Queso guys have already tackled some of those problems (as have many, many other people I'm sure), so I'm willing to assert that these issues could be overcome.

In many rich-client applications, data is retrieved incrementally in response to user-initiated actions. I don't think that this presents a problem for the above scheme, but we'd need to ensure that newly arriving data could be seamlessly incorporated not only into the RDF models but also into the object hierarchies that the application works with.

Bill de hÓra raised some questions about the feasibility of roundtripping RDF data with HTML forms a while back. There's some interesting conversation in the comments there which ties into what I've written here. That said, I don't think the problems he illustrates apply here--there's power above and beyond HTML forms in putting an extra JavaScript-based layer of code between the data entry interface (whether it be an HTML form or a more specialized Web UI) and the data update endpoint(s).


OK, that's more than enough for now. These are still ideas clearly in progress, and none of the ideas are particularly new. That said, the environment as I envision it doesn't exist, and I suppose I'm claiming that if it did, it would demonstrate some of the utility of Semantic Web technologies by easing the development of data- and integration-driven Web applications. As always, I'd enjoy feedback on these thoughts and also any pointers to work I might not know about.

8 Comments

Hello, I am part of the ActiveRDF team.

Let me first say that this post does a great job of putting together a lot of pieces of the "web application for the semantic web" puzzle.

I have some general comments, and some ActiveRDF specific ones:

* While your list of ways to get RDF data from a triple store is fairly complete, there is at least one currently implemented way of posting/writing RDF data that goes unmentioned: HTTP PUT of RDF data is currently supported in Sesame 2, but it only works on a "replace the whole RDF graph with the graph I give you now" basis. It is not possible to add just a few triples or a diff to an RDF graph. The Sesame developers said they might consider this for the future.

* Using SPARQL CONSTRUCT to transform data from multiple sources on the fly into the domain-specific model is a very good idea. There is enormous potential there, but I think a lot of SPARQL endpoints currently don't support CONSTRUCT queries. Also, this does not deal with merging of blank nodes, or smushing based on RDF Schema and OWL inferences. I believe smushing will have to be done on the client side, or by a third entity specialised in graph merging.

* In your first post you said that the traditional client/server model is too old-fashioned for accessing RDF data, yet in this post explaining your vision you rely on it. The web application which is written in JavaScript is such a client, and it gets its data from various RDF sources, which can be seen as servers. Of course, your client has its own in-memory triple store, or some other means to access the RDF data. Only after that step of getting the RDF data to the client does the possibility exist for the client to generate an object-oriented representation of the RDF data which is specific to both the domain of the application (e.g. cars, accounts, users) and the programming model used in the web application (e.g. JavaScript objects).

* In your vision, the fetching of data from multiple data sources, the merging, and the creation of the domain-specific representation are done through JavaScript in the web browser. You did not specify the reasons leading you to believe that using JavaScript in the web browser is superior to using server-side scripting to achieve your vision.

* Without changing the implications and goals of your vision, all of these steps could be done on a web server, using e.g. Ruby on Rails or Java.
* Implementing your vision on the web server, rather than in the web browser, using ActiveRDF and Ruby on Rails, is the next step for the ActiveRDF team here at DERI. We call this idea the Semantic Web Application Framework (SWAF). It is, of course, enabled by the Semantic Web on Rails Development (SWORD) plugin :)

The SWORD plugin for Ruby on Rails currently exists as a proof of concept, with a lot of ongoing development on it.
How to try it out is described at: http://wiki.activerdf.org/SWORD

The Semantic Web Application Framework idea is described at: http://www.semanticdesktop.org/xwiki/bin/view/Wiki/SWAF-plan
and at http://www.semanticdesktop.org/xwiki/bin/view/Wiki/SWAF .

The main motivation behind SWAF is to give developers of web applications for the semantic web a framework that adheres to the main principles behind the web (Interoperability, Decentralization, Universality, and Evolvability) out of the box. These principles change the assumptions for a web application significantly:

* not only one database, but multiple decentralized data sources with different provenance
* the controller has to be able to integrate data from those heterogeneous sources
* not only one view, but the possibility of describing a view in a decentralized way using fresnel

I agree that the appeal of using SPARQL CONSTRUCTs is that it gets you graphs (rather than variable-binding tables). Hyperdata is importantly, and often definingly, graph-structured, so of course we need to work with the graphs.

But I'd take this one step further and say that this means the graphs, and not their decomposition into triples or flattening into variable-bindings, should be the base level of shared data abstraction. I think those serious problems you run into by the end of your architectural sketch are the fault of the foundation you assume at the beginning. RDF and SPARQL are both a level too low. To get to an actual usable Web of Data I'm pretty sure we're going to need a shared object-based data-model (not one independently reconstructed out of triples by every individual application) and a graph-based query language (i.e. SQL:SPARQL::XPATH:?) to go along with it.

Hi Benjamin,

Thanks for all of the comments. I'm going to reply to many of them here (the ones I don't reply to I agree with completely :-)

+ I never meant to imply that the client/server model is too old-fashioned :) In fact, it's great for many, many applications. I've just been seeking a model with more emphasis on the client, for a variety of reasons (see my next point...)

+ I agree with you that *almost* all of this vision could be done on the server-side in Ruby or another language. And I'm thrilled to hear more about SWAF and SWORD (I'm on the ActiveRDF mail list, but admittedly not following along as closely as I'd like.)

For me, there are two major selling points to doing most of the work on the client:

1) I want to be able to integrate data from multiple Web sites. This is a sword that cuts both ways. On one hand, I would say that server-side models that pull in data from across the Web before answering a client's HTTP request are very unusual (but doable!). On the other hand, performing the data integration on the server is one solution to the XMLHttpRequest cross-site security restrictions. I think that ActiveRDF supports an adaptor which uses a SPARQL protocol endpoint as the data source?

My second point is that *something* needs to be returned to the browser at *some point*. This can be as little as static HTML, but if you want any sort of client-side-only user interaction (which I'd say is a large part of what most users feel is new about "Web 2.0"), then you need some sort of object model on the client (in JavaScript) to manipulate and interact with the user experience. In this case, performing the rest of the solution on the server means that we're now maintaining both server-side code infrastructure and client-side code infrastructure. In many cases this is fine, but it does require additional skill sets and maintenance of two environments of code (Ruby and JavaScript).

Anyway, those are my reasons, for what they're worth. :-)

I'm particularly glad to hear that you're working with fresnel to describe views. We've done some work using fresnel in an SWT context at our labs, and we think it's a promising direction forward.

thanks again for the comments, Benjamin.

Lee

Hi glenn,

I find your comment very interesting :-) When I started reading it, I found myself nodding my head at the possibility that RDF and SPARQL are "too low level". Then I got to "we're going to need a shared object-based data-model" and I scratched my chin a bit. Without individual applications constructing their own object data models to suit their own needs, how might we get different data publishers to agree on a shared object model? To me, at least, that sounds like quite the daunting challenge!

I like the model which allows individual applications to define their own models, because it sets a very low bar for data publishers to meet in order to be useful, and it doesn't force application developers to shoehorn an already existing model into the way their user experience works.

Lee

Oh, by "shared object-based data-model" I didn't mean that we have to pre-define the actual domain-specific ontologies, just that we need a basic data structure that's a level up from bare RDF triples. Something a little more like typed objects connected by named relationships, I suspect, where the "nodes" have a little more built-in heft than they do in RDF, and thus do a better job of sustaining whatever individual data abstraction the user is trying to believe. C:Ruby::RDF:?, maybe.

Inverting SPARQL CONSTRUCT Queries

In essence what you are doing is using SPARQL to define views and then to do updates on those views. There's quite a bit of literature on the topic of updateable views in the database community and in particular in the deductive database community.


With respect to views defined with SPARQL CONSTRUCT there actually seems to be a nice subset of view definitions that are trivially updateable (queries for which every variable in the select part also appears in the construct part and that do not contain unions and the like). These you can really just turn around (replace the select part with the construct part and vice versa) to compute the updates for the root graph.

(albeit you will then need a notion of "inadmissible update" in cases where the select part of the inverted query does not match an update request)

JDIL (Json Data Integration Layer - http://jdil.org) is a JSON-based syntax for URI-labeled object graphs - a stab at the "level up from bare RDF triples" that Glenn McDonald mentions above. JDIL works as syntax for RDF when nomenclature is added (no new encoding methods required).


JDIL transports calls and results between client and server in Platial's mapping applications ( http://platial.com ). As an experiment, we also expose our map content as JDIL feeds. We've been motivated by considerations expressed here in "A Vision" - in particular the need to arrive at an "application-friendly object model" at the client based on RDF data emitted by the server. In our case, the client-side model is pretty much the object (rather than triple-set) form of the original RDF structure; qualified names are used for properties (e.g. x["dc:title"] = ...). Using a JSON- rather than XML-based syntax for client-server communication is certainly the more efficient solution in this circumstance.
