A number of tools and technologies were considered for use in SPQR. Here we document the tools considered and chosen, and the reasons for those choices.
Creation – RDFization – combining information
SPQR is concerned with several types of data sources which need to be translated into the RDF model. These sources are:
- Web pages (XHTML) – these may be viewed as “schema-less XML documents”.
- XML documents with defined schemas – for example EpiDoc or ArcheoML.
- Relational databases.
We have the problem of converting to the RDF data model from two source data models: XML documents and relational tables. To be useful, this mapping exercise needs to go beyond simple “format translation” and include the definition of vocabularies and a dereferenceable URI scheme. In this section we discuss possible tools for “format translation”.
XML => RDF/XML
The RDF model has an XML serialisation called RDF/XML. A standard approach for XML-to-XML conversion is an XSLT transform. One could create a mapping transform for each XML schema and use one of the XSLT engines (e.g. Xalan, http://www.stylusstudio.com/xslt/xalan.html) to map the data. There are several downsides to this approach. One is that XSLT is a rather verbose XML-based language. Another is that RDF/XML cannot represent all RDF models; for example, it is impossible to express a blank node that is the object of two statements. Of course, the output of an XSLT transform could be any format, not only RDF/XML. The main problem with XSLT is the awkwardness of providing user-defined functions and, as we will see, the conversion process will often require specialised functions that, for example, look up URIs from external web services.
An alternative approach is applied in Europeana (http://www.europeana.eu/portal/). There the conversion process starts by converting – in a generic, as-is way – any XML document to the RDF model. This is relatively simple to imagine if one views an XML document as a tree, which in turn is a special type of graph. Of course, this is a simple view and there are many details that need to be taken care of. The point is that the Europeana tool can automatically and consistently convert an XML document to its RDF representation. Once this is done the translation becomes a matter of mapping between two RDF models. This is done within the RDF framework used by Europeana and consists of writing sets of rules that match subgraphs in the source model and emit triples for the target model. This technique is also used for mapping third-party ontologies to the Europeana data model. This approach is much more practical than XSLT, as users have the full expressive power of an underlying Prolog-based runtime at their disposal, further enriched by an RDF-specific API.
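The generic, as-is XML-to-RDF step can be sketched in a few lines. The sketch below is ours, not Europeana's: the URI prefix, the `hasChild` predicate and the use of `rdf:value` for element text are invented for illustration, and a real tool would handle namespaces, mixed content and stable URI minting.

```python
import itertools
import xml.etree.ElementTree as ET

def xml_to_triples(root):
    """Generically convert an XML tree into (subject, predicate, object)
    triples: each element becomes a resource, attributes and text become
    literal properties, and child elements become linked resources."""
    ids = itertools.count(1)
    triples = []

    def walk(element):
        subject = f"http://example.org/node/{next(ids)}"
        triples.append((subject, "rdf:type", element.tag))
        for name, value in element.attrib.items():
            triples.append((subject, name, value))
        if element.text and element.text.strip():
            triples.append((subject, "rdf:value", element.text.strip()))
        for child in element:
            triples.append((subject, "hasChild", walk(child)))
        return subject

    walk(root)
    return triples

doc = ET.fromstring('<inscription lang="grc"><text>ΔΗΜΟΣ</text></inscription>')
triples = xml_to_triples(doc)
```

Once every source document is in this uniform triple form, the real mapping work reduces to rules over triples rather than per-schema transforms.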
DB => RDF
Exposing relational databases as RDF can be done dynamically using D2R (http://www4.wiwiss.fu-berlin.de/bizer/d2r-server/). The tool allows the mapping to be customised, for example to use specific vocabularies. It is most useful in scenarios where the underlying database changes frequently and where exposing dumps would lead to stale data. Often, finer-grained control over the mapping process will be needed. In such situations one needs to go beyond the D2R tool, but the initial RDF model generated by D2R could still be of use; for example, one could refine it further using Europeana-style translation rules.
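The core of the table-to-RDF mapping that D2R automates is straightforward: mint one resource per row from the primary key and emit one triple per non-key column. A minimal sketch using an in-memory SQLite database – the table, column names and URI patterns are invented for illustration and are not D2R's own conventions:

```python
import sqlite3

def table_to_triples(conn, table, key_column, base_uri):
    """Map each row of a relational table to RDF triples: one resource
    per row (URI minted from the primary key), one triple per non-key
    column whose value is not NULL."""
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{base_uri}{table}/{record[key_column]}"
        for column, value in record.items():
            if column != key_column and value is not None:
                triples.append((subject, f"{base_uri}vocab/{column}", value))
    return triples

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE coins (id INTEGER PRIMARY KEY, mint TEXT)")
conn.execute("INSERT INTO coins VALUES (1, 'Athens')")
triples = table_to_triples(conn, "coins", "id", "http://example.org/")
# e.g. ('http://example.org/coins/1', 'http://example.org/vocab/mint', 'Athens')
```

D2R adds to this the dynamic serving of such triples over HTTP and a mapping language for substituting specific vocabularies for the auto-generated ones.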
We initially evaluated RDFTEF (http://rdftef.sourceforge.net/), which supports XML=>RDF. However, we discovered that the project has been stagnant for 5 years, it only supports a small subset of TEI, it uses a product-specific ontology (which includes names in Italian) and the RDF output did not appear to be correct.
Triplr (http://triplr.org, 23/02/11) was also tried. This is an online REST-based service which takes various types of input (e.g. RDF, Turtle, JSON) and outputs triples. We used this on the http://nomisma.org data by simply viewing the corresponding Triplr URL in a browser.
Triplr interprets this as a request to:
- Visit http://nomisma.org/nomisma.org.xml (which is just a large XML document)
- Return the result as an RDF document.
However, we decided to go with a variation of the Europeana approach and use Clojure scripts (http://clojure.org/) to express the mapping. Clojure is a Lisp-like functional programming language that targets the Java Virtual Machine. Clojure allows us to use the wealth of Java libraries, including existing semantic web frameworks like Jena.
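The rule-based mapping style borrowed from Europeana can be sketched as follows. We use Python rather than Clojure purely for brevity here; the source predicate name is invented, the matching is simplified to predicates (real rules match subgraphs), and only the Dublin Core term URI is a real vocabulary term:

```python
def apply_rules(source_triples, rules):
    """Apply mapping rules to a source RDF model: each rule matches
    triples in the source (here, by predicate, for simplicity) and
    emits triples for the target model."""
    target = []
    for s, p, o in source_triples:
        for match_predicate, emit in rules:
            if p == match_predicate:
                target.extend(emit(s, o))
    return target

# One illustrative rule: rewrite an invented source predicate to the
# Dublin Core 'title' term.
rules = [
    ("src:title", lambda s, o: [(s, "http://purl.org/dc/terms/title", o)]),
]
source = [("http://example.org/doc/1", "src:title", "Res Gestae")]
mapped = apply_rules(source, rules)
```

In the actual SPQR pipeline the same pattern is expressed in Clojure, where rules can also call out to Java libraries (e.g. Jena) or external web services to look up URIs.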
Semantic Programming Frameworks
The choice of which semantic programming framework to use is largely a matter of taste, a personal decision of a developer. We decided to use Jena because it is Java-based and therefore directly usable from Clojure. Also, Jena has been chosen by other JISC-funded Linked Data projects, so by using it we have access to the wealth of experience accumulated by the JISC Linked Data community.
There are multiple triple store implementations available, both open source and commercial. A sample list of available products can be found at the Wikipedia article on triple stores (http://en.wikipedia.org/wiki/Triplestore). Evaluating triple stores is a hard problem without a clear set of requirements. When choosing the right product the following questions may be asked:
- Is the product free or paid for?
- Is it open source or closed source?
- What are the licencing conditions?
- Is it actively developed (regular releases)?
- Is there an active community formed around the product?
- Will the solution scale with the number of stored triples? What are its performance characteristics?
- Does it support UTF-8 characters in stored documents and in queries?
- Is the product easy to deploy and maintain?
- Is the programming API sensible?
- What is the level of reasoning support (RDFS / OWL DLP)?
- Does it provide a SPARQL end-point web service?
- If planning to use third party tools to build applications on top of the triple store, does the triple store support the APIs you’ll need?
For SPQR, our requirements are:
- Preferably open source
- UTF-8 support. This is crucial, not only in stored triples but also in SPARQL queries. Many epigraphic data sets contain non-ASCII characters, e.g. those dealing with Greek inscriptions.
- Support for SPARQL. The ability to query the data is crucial.
- Query processing performance may be important, especially if an interactive RDF browser is needed – the faster results are returned, the more responsive the browser is.
- REST-based updates. This could be useful if we want to support annotations. Of course, one can always use the triple store API and perform updates at the back-end. However, doing it in a RESTful style is more accessible for third-party developers wanting to develop alternative clients for any data we publish.
- Support for authentication and authorisation in the SPARQL endpoint, for updates.
- Scalability is not a big issue. We are unlikely to have tens of terabytes of data.
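The UTF-8 requirement above is worth making concrete: a SPARQL query over Greek inscriptions will itself contain Greek characters, which must survive percent-encoding on their way to the endpoint. A small sketch using only the standard library – the endpoint URL and the predicate are invented; a real store would need the same bytes to arrive intact:

```python
from urllib.parse import urlencode

# A SPARQL query containing a Greek literal; the store must accept UTF-8
# both in stored triples and in the query text itself.
query = 'SELECT ?s WHERE { ?s <http://purl.org/dc/terms/title> "ΔΗΜΟΣ" }'

# SPARQL Protocol endpoints accept the query in a 'query' parameter;
# urlencode percent-encodes the UTF-8 bytes (Δ becomes %CE%94).
url = "http://example.org/sparql?" + urlencode({"query": query})
```

A store that mangles these bytes anywhere along this path would silently return empty results for perfectly valid queries, which is why we treated UTF-8 support as a hard requirement rather than a nice-to-have.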
For performance, we decided that should it become an issue we would consult the many performance reports available on the web. Links to benchmarks can be found at the RDF Store Benchmarking page (http://esw.w3.org/RdfStoreBenchmarking).
We evaluated the two popular triple stores – Jena (http://jena.sourceforge.net/) and Sesame (http://www.openrdf.org/) – in detail. Other tools of possible application to SPQR, e.g. the Danno Annotation Server (http://metadata.net/sites/danno-1.3.1/), use these APIs, so it made sense to focus on these two.
Later we also decided to adopt AllegroGraph (http://www.franz.com/agraph/allegrograph/). This decision was primarily motivated by its support for an advanced graphical browser – Gruff, see below. AllegroGraph, though a commercial product, was free of charge, easy to set up, and provided support for a SPARQL endpoint and for full-text searches, a feature of great interest to the target users.
To evaluate the utility of linking data to epigraphy researchers we planned to develop our own graphical browser and query tool. However, this meant we would have nothing for researchers to use until the very late stages of the project, so we decided to use the Gruff browser as a basis for the evaluation. Gruff allows RDF data to be browsed graphically and features a graphical SPARQL query builder and a full-text search facility. It is, however, limited in that it is closed source (though free) and must be run locally (so cannot, for example, be run from a web browser). Nevertheless, it was considered a useful vehicle for assessing, in principle, the utility of linking data for epigraphy researchers, while also allowing requirements for an open source, client-server graphical browser to be gathered. We also found it useful ourselves when exploring epigraphic data that was already in Linked Data (LD) format or that we converted into an LD format; in some cases it allowed us to identify inconsistencies in the original data sets.
The Pellet OWL 2 reasoner (http://clarkparsia.com/pellet/, 11/10/11, AGPL or commercial), which supports reasoning about resources and properties, computing classification hierarchies, and consistency checking, was downloaded and worked without problems with an example Jena deployment.
Epigraphers, our target user community, were most interested in the possibility of running full-text searches across their data: entering a keyword occurring in a document and getting back a list of documents containing that keyword. We deemed the open source Solr search server (http://lucene.apache.org/solr/) to be a candidate to support this functionality. Solr provides powerful full-text search capabilities based upon the open source Lucene (http://lucene.apache.org/) text indexing and search libraries. URIs for resources in our LD datasets could be dereferenced, the resource descriptions parsed, and the text indexed. A full-text search could then return a list of URIs of all resources whose descriptions contain the text of interest.
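The dereference-parse-index-search loop just described boils down to an inverted index from keywords to resource URIs. A toy sketch of that idea – the URIs and description texts are invented, and Solr/Lucene of course add tokenisation, stemming, ranking and scale on top of this:

```python
import re
from collections import defaultdict

def build_index(documents):
    """Build a minimal inverted index from {resource URI: description text},
    mapping each lower-cased keyword to the set of URIs whose text
    contains it."""
    index = defaultdict(set)
    for uri, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(uri)
    return index

docs = {
    "http://example.org/inscription/1": "Dedication to the emperor Hadrian",
    "http://example.org/inscription/2": "Funerary inscription for a soldier",
}
index = build_index(docs)
index["hadrian"]  # -> {'http://example.org/inscription/1'}
```

A keyword search is then just a set lookup returning resource URIs, which a client can dereference to retrieve the full LD descriptions.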