Linked data tools

A variety of tools is available for producing, manipulating and exploring linked data (LD). Here, we provide an overview of the tools available, categorised by function. The licensing of each tool and the date of the last update to the tool’s web site are also given.

Creation and storage

LD can either be created from scratch or via conversion from a legacy format, e.g. relational data, XML data, Microsoft Excel spreadsheets, or text files. Conversion can be done either by hand or using a tool. Conversion tools exist either to convert data prior to its publication or to convert it “on the fly” when it is needed.

Creation from scratch or conversion prior to publication

If creating LD from scratch then the LD is typically stored directly in triple-stores optimised for RDF or in dedicated relational back-ends. The advantage of storing the LD as LD is potentially more efficient data access and querying since, unlike “on the fly” approaches, there is no need for data translation every time the data is accessed.

Converter tools and related resources include:

DB2RDF (http://sourceforge.net/projects/db2rdf/, 17/07/09, GPL). A tool that converts relational data to RDF. It also provides a SPARQL endpoint for querying the data.
GRDDL (http://www.w3.org/TR/grddl/, 11/09/07, W3C document licence). A W3C specification that defines a method for exposing XML as RDF via XSLT (a technology for mapping XML-to-XML).
RDFTEF (http://rdftef.sourceforge.net/, 04/10/05, GPL). A tool that converts XML documents consisting of a subset of TEI (Text Encoding Initiative) XML into RDF. TEI (http://www.tei-c.org/index.xml) is a standard for representing texts in digital form and has been used for epigraphic data, which is the application area for SPQR (e.g. the IAphrodisias dataset).
Krextor (http://trac.kwarc.info/krextor, 12/08, Lesser-GPL). An XSLT framework for XML-to-RDF conversion which can be invoked via shell scripts or Java.

An alternative is to write custom scripts to extract and translate legacy data into RDF, e.g. applying XSL transforms to XML, or running queries against RESTful endpoints exposed by online data sources and then “web scraping” the results (parsing the HTML query result pages into RDF).
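As an illustration of the custom-script route, the following is a minimal sketch that turns comma-separated text into N-Triples using only the Python standard library. The namespaces, the column names and the sample rows are assumptions made up for this example; a real script would use the vocabulary chosen for the dataset and would need to handle datatypes and language tags.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespaces, chosen only for this illustration.
BASE = "http://example.org/inscription/"
VOCAB = "http://example.org/vocab#"


def escape_literal(value):
    """Escape a string so it can be written as an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_ntriples(csv_text, id_column="id"):
    """Yield one N-Triples line for every non-empty cell in the CSV text."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = "<" + BASE + quote(row[id_column], safe="") + ">"
        for column, value in row.items():
            # Skip the key column, empty cells and cells without a header.
            if column is None or column == id_column or not value:
                continue
            predicate = "<" + VOCAB + quote(column, safe="") + ">"
            yield '%s %s "%s" .' % (subject, predicate, escape_literal(value))


# Made-up sample data: two inscriptions, one with a missing date.
SAMPLE = "id,place,date\ni1,Aphrodisias,2nd century\ni2,Rome,\n"
lines = list(csv_to_ntriples(SAMPLE))
```

Each row becomes one subject and each remaining column becomes one predicate, so the resulting lines can be loaded into any triple store that reads N-Triples.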

Triple stores

For storing LD, a number of established products exist, including:

Virtuoso (http://virtuoso.openlinksw.com/, 22/09/10, versions available licenced under commercial licence or GPL (OpenLink Virtuoso)). A data server that supports various data representations including relational, XML and RDF. It provides an RDF triple-store and supports SPARQL endpoints and Sesame and Jena APIs allowing it to be used with those products.
Sesame (http://www.openrdf.org/, 12/09, BSD-style Sesame licence). An RDF framework supporting SPARQL and other query languages.
Jena (http://jena.sourceforge.net/, 18/02/11, BSD). A semantic web framework with Java APIs for RDF manipulation and serialization to a relational database. Unlike Sesame it has support for OWL.
Talis (http://www.talis.com, 19/01/11, custom pricing model, free up to certain data volumes). A semantic web application platform offered as a service. Talis will host open source linked data and provide a SPARQL endpoint, content negotiation and access control.
AllegroGraph (http://www.franz.com/agraph/allegrograph/, 17/01/11, free and commercial licences, closed source). An RDF database with support for SPARQL queries and Prolog reasoning.
Mulgara (http://www.mulgara.org/, 01/10/10, Open Software Licence). A 100% Java RDF database which supports REST interfaces for SPARQL queries and for inserting, updating or deleting triples.
Cliopatria (http://cliopatria.swi-prolog.org/home, 27/01/11, free and open source, licence unknown). An RDF database with web server, user management, SPARQL query support and Prolog reasoning.

“On the fly” conversion

In “on the fly” approaches, requests for LD-compliant data are translated, by a wrapper, into calls to the underlying data held in legacy formats. This allows data providers to continue to use their existing datasets but make them available for integration into an LD environment. A range of tools support such “on the fly” exposure:

Virtuoso supports RDF Views, which allow relational data to be mapped dynamically to RDF and exposed via a SPARQL endpoint. Mappings are expressed declaratively.
D2R (http://www4.wiwiss.fu-berlin.de/bizer/d2r-server/, 29/11/10, Apache 2.0). A server for exposing relational data as RDF. The mapping is expressed via a declarative notation. It can also expose relational data held in Sesame or Jena frameworks.
METAmorphoses (http://metamorphoses.sourceforge.net/, 21/01/08, Lesser GPL). A tool suite supporting generation of RDF from relational data and augmenting HTML pages with RDF.
SquirrelRDF (http://jena.sourceforge.net/SquirrelRDF/, 06/06/06, BSD). Allows relational data and LDAP servers to be queried via SPARQL.
TopBraid Composer (http://www.topquadrant.com/products/TB_Composer.html, 02/11, free and priced versions). Supports exposure of data held within Jena, AllegroGraph, Sesame and Oracle as well as relational and XML data and Microsoft Excel spreadsheets.
XLWrap (http://xlwrap.sourceforge.net/, 16/03/10, Apache 2.0). A wrapper to expose Microsoft Excel and OpenDocument spreadsheets, and files of comma-separated values, as RDF graphs and via a SPARQL endpoint. The mappings are specified in a custom XLWrap notation.
Anzo For Excel (http://www.cambridgesemantics.com/products/anzo_for_excel, 01/02/11, closed source, commercial). A Microsoft Excel plug-in.
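The wrapper idea can be sketched in a few lines: the data stays in its relational table, and triples are only produced when a resource is requested. This is an illustrative sketch using an in-memory SQLite database, not the mechanism of any tool listed above; the table, the namespaces and the column-to-predicate mapping are all hypothetical.

```python
import sqlite3

# Hypothetical namespaces and a declarative column-to-predicate mapping.
BASE = "http://example.org/person/"
VOCAB = "http://example.org/vocab#"
MAPPING = {"name": VOCAB + "name", "city": VOCAB + "city"}


def make_legacy_db():
    """Create a stand-in for an existing relational dataset."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    conn.executemany("INSERT INTO person VALUES (?, ?, ?)",
                     [(1, "Gaius", "Rome"), (2, "Julia", "Ostia")])
    return conn


def describe(conn, uri):
    """Translate a request for one resource URI into SQL at request time."""
    if not uri.startswith(BASE):
        return []
    key = uri[len(BASE):]
    if not key.isdigit():
        return []
    # Column names come from the fixed mapping, never from the request.
    columns = sorted(MAPPING)
    sql = "SELECT %s FROM person WHERE id = ?" % ", ".join(columns)
    row = conn.execute(sql, (int(key),)).fetchone()
    if row is None:
        return []
    return [(uri, MAPPING[column], value)
            for column, value in zip(columns, row) if value is not None]
```

Because nothing is materialised in advance, a change to the underlying table is visible in the next call to `describe`, at the cost of running the translation on every request.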

A survey of approaches for relational data is available at S. Sahoo, W. Halb, S. Hellmann, K. Idehen, T. Thibodeau Jr, S. Auer, J. Sequeda, A. Ezzat. A Survey of Current Approaches for Mapping of Relational Databases to RDF, W3C RDB2RDF Incubator Group, W3C, 2009. http://esw.w3.org/Rdb2RdfXG/StateOfTheArt (last accessed, 22/02/11).

Publication

LD can be published directly, often just as a downloadable RDF dump. Instead of this, or in addition to it, LD can be exposed via a queryable endpoint. It is common to provide endpoints that accept queries expressed in SPARQL – see below.

Exploration

The subject-predicate-object nature of LD lends itself to exploration. Users can browse from one resource to another in much the same way that they explore the web by following links from one page to another. A key requirement in publishing LD for SPQR is providing a researcher with access to integrated datasets that they can explore in an intuitive way, following paths through the data from one dataset to another via common attributes (e.g. names, places or dates).
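The following minimal sketch shows what following paths through the data means in terms of triples: a breadth-first search that walks links in either direction until it connects two resources. The triples and the `ex:` names are invented for this example.

```python
from collections import deque

# Made-up triples; "ex:" stands for a hypothetical namespace.
TRIPLES = [
    ("ex:inscription1", "ex:foundAt", "ex:Aphrodisias"),
    ("ex:inscription1", "ex:mentions", "ex:personA"),
    ("ex:inscription2", "ex:mentions", "ex:personA"),
    ("ex:Aphrodisias", "ex:locatedIn", "ex:Caria"),
]


def neighbours(resource):
    """Return (predicate, other resource) pairs linked in either direction."""
    linked = []
    for subject, predicate, obj in TRIPLES:
        if subject == resource:
            linked.append((predicate, obj))
        elif obj == resource:
            linked.append((predicate, subject))
    return linked


def find_path(start, goal):
    """Breadth-first search for a shortest chain of linked resources."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _, other in neighbours(path[-1]):
            if other not in seen:
                seen.add(other)
                queue.append(path + [other])
    return None
```

Here `find_path("ex:inscription2", "ex:Caria")` reaches the region through the person shared with the first inscription, which is the kind of cross-dataset path a browser lets a researcher follow by hand.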

There have been a number of attempts to produce LD browsers that support a similar mode of interaction to that of a web browser, allowing links between LD resources to be traversed and information about the linked resources displayed. Examples of browsers currently available include:

Disco (http://www4.wiwiss.fu-berlin.de/bizer/ng4j/disco/, 04/03/07, BSD)
Tabulator (http://www.w3.org/2005/ajar/tab, 07, Creative Commons). Available as a Mozilla Firefox plug-in or as a web application. An online demo is available.
SIOC browser (http://sioc-project.org/browser/, 31/05/06, unclear)
Marbles (http://marbles.sourceforge.net/, 04/07/09, Apache 2.0). A server-side application that manages formatted presentation of Semantic Web data. An online demo is available.

These browsers are primarily text based, some just displaying RDF triples, others, like Marbles, providing a more formatted presentation of information. Work has been done, though, on supporting a more graphical model of interaction, exploiting the fact that RDF triples form graphs:

Gruff (http://www.franz.com/agraph/gruff/, 17/01/11, free and commercial licences, closed source). A browser for AllegroGraph that displays LD in terms of nodes and arcs, allows the execution of queries, built using a graphical query builder, and full text searches.
CLink (http://conceptlinkage.org, 12/11/09, GPL 2). A graphical tool from the JISC-funded Concept Linkage in Knowledge Repositories (http://www.jisc.ac.uk/whatwedo/programmes/inf11/jiscri/clkr.aspx) project. It searches for the links that connect two named Wikipedia resources and shows the related resources that inter-link them, providing a context around the two resources.

More common is the provision of portals to browse specific linked data sets, as these can support a presentation tailored to the data itself. For example, the Europeana portal (http://eculture.cs.vu.nl/europeana/session/search) allows the browsing of data about cultural heritage resources (paintings, sculptures etc.). It renders images of the resources and presents their RDF triples as tables, with resources hyperlinked so that the underlying RDF can be explored.

Queries

To make LD more usable and to support information discovery, LD publication tools typically allow LD to be exposed via SPARQL endpoints; examples have already been listed (e.g. Virtuoso, D2R, Sesame and Jena). This allows queries to be run over the underlying LD so information of interest can be found more quickly than by link traversal – analogous to the difference between exploring web pages by following links and running searches using Google.
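A client typically talks to such an endpoint over HTTP, passing the query text in a `query` parameter as defined by the SPARQL Protocol. The sketch below only builds the request and does not send it; the endpoint URL and the vocabulary in the query are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and vocabulary, used for illustration only.
ENDPOINT = "http://example.org/sparql"

QUERY = """
SELECT ?inscription ?place
WHERE { ?inscription <http://example.org/vocab#foundAt> ?place }
LIMIT 10
"""


def build_request(endpoint, query):
    """Build an HTTP GET request that carries the query to the endpoint."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the JSON serialisation of SELECT results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
# Sending is left out so the sketch runs offline; against a real endpoint
# one would call urllib.request.urlopen(request) and parse the JSON body.
```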

Since SPARQL queries can return either RDF or result sets, tools have been created to allow SPARQL endpoints to themselves be exposed as LD, for example Pubby (http://www4.wiwiss.fu-berlin.de/pubby/, 26/01/11, open source).

Integration

If a user wishes to run a query across multiple data sets then the typical model when using LD is to pull all the LD into one single data repository and then run the query. This is impractical for any but small numbers of datasets or datasets of a small size:

Time is needed to pull the data sets of interest into the data repository.
The data repository needs enough space available to store all the data downloaded.
It is inefficient to go to all this effort if the user's interest involves only a small fraction of the intersection between two of the data sets.
If the original data changes then the user has to get the most recent update.

There is a growing requirement for distributed query processing that allows queries to be run over distributed LD data sets. This can contribute to providing integrated views across multiple heterogeneous datasets, allowing researchers to explore (browse) or search (query) within and across those datasets. SPARQL 1.1 defines a SERVICE keyword as a starting point towards federated SPARQL queries.
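To give a feel for the SERVICE keyword, the sketch below assembles a federated query in which one graph pattern is matched locally and another is delegated to a remote endpoint. The prefix, the property names and the endpoint URL are assumptions made for this example, and whether a given endpoint accepts SERVICE depends on the product and its configuration.

```python
def federated_query(remote_endpoint):
    """Return a SPARQL 1.1 query that joins local data with a remote endpoint."""
    # The ex: vocabulary below is hypothetical.
    return """
PREFIX ex: <http://example.org/vocab#>
SELECT ?inscription ?place ?modernName
WHERE {
  ?inscription ex:foundAt ?place .
  SERVICE <%s> {
    ?place ex:modernName ?modernName .
  }
}
""" % remote_endpoint


query = federated_query("http://example.org/places/sparql")
```

The query processor that receives this query sends the inner pattern to the named endpoint and joins the answers on `?place`, so only the matching fraction of the remote data has to be transferred.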

Tools are emerging to support distributed query processing across SPARQL endpoints and to support more powerful search capabilities within LD datasets. These include:

OGSA-DAI RDF extensions (http://sourceforge.net/projects/ogsa-dai, 23/02/11, Apache 2.0). Extensions under development at the University of Madrid that allow distributed SPARQL queries to be run, exploiting similarities to distributed relational queries.
Distributed SPARQL (http://www.uni-koblenz-landau.de/koblenz/fb4/AGStaab/Research/systeme/DistributedSPARQL, 01/01/08, LGPL 3)
DARQ (http://darq.sourceforge.net/, 28/06/06, BSD). This extends the ARQ implementation of Jena with a query planner and executor. This project is no longer live and seems to have been superseded by Jena’s ARQ itself, see http://jena.sourceforge.net/ARQ/service.html.

Linking

When publishing LD, researchers need to create links within their data and also to create outgoing links to existing datasets to provide a richer context for exploration. Conversely, they also need to persuade third parties to create incoming links to their LD. These capabilities are also required if a researcher wishes to augment existing datasets with their own observations or opinions. There are three main approaches to identifying candidate links:

Manual approaches
Semi-automated approaches
Automated approaches based upon string matching, the use of common keys or patterns in the linked datasets, using naming schemes already present in the data, or, most complex, property or graph-based approaches.
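The simplest of the automated approaches, string matching, can be sketched as follows. Labels from two datasets are normalised and compared, and pairs scoring above a threshold are proposed as candidate links. The threshold of 0.85, the normalisation steps and the sample URIs are assumptions chosen for this example, and the output is a list of candidates for review rather than a set of asserted links.

```python
import unicodedata
from difflib import SequenceMatcher


def normalise(label):
    """Lower-case, strip accents and collapse whitespace before comparing."""
    decomposed = unicodedata.normalize("NFKD", label)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.lower().split())


def candidate_links(local, remote, threshold=0.85):
    """Compare every local label with every remote label.

    Both arguments map a resource URI to its label. The result is a list
    of (local URI, remote URI, score) tuples, best matches first.
    """
    candidates = []
    for local_uri, local_label in local.items():
        for remote_uri, remote_label in remote.items():
            score = SequenceMatcher(
                None, normalise(local_label), normalise(remote_label)).ratio()
            if score >= threshold:
                candidates.append((local_uri, remote_uri, round(score, 2)))
    return sorted(candidates, key=lambda candidate: -candidate[2])
```

Comparing every pair of labels is quadratic in the number of resources, so this form only suits small datasets; larger ones would need an index or a blocking key to limit the comparisons.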

The use of automated approaches may be limited depending upon the nature of the data. For example, in the epigraphic data that is the focus for SPQR, there may be a number of possibilities as to who a person might be or which ancient places map to which modern ones. In these cases, manual intervention is needed to make the decision and, equally importantly, information about that decision (e.g. who made it) also needs to be recorded.

A review of the approaches is given in Burger, T., Morozova, O., Zaihrayeu, I., Andrews, P., Pane, J. Report on Methods and Algorithms for Linking User-generated Semantic Annotations to Semantic Web and Supporting their Evolution in Time, Technical Report DISI-10-010, University of Trento, Italy, January 2010. http://eprints.biblio.unitn.it/archive/00001811/01/010.pdf (last accessed 23/02/11).

Annotations

In an application area like epigraphy it can be useful if data can be augmented or annotated by a researcher. This may arise if the researcher has found out new information about some artefact or inscription, has found information that contradicts what is currently known, or identifies a link between two resources.

Tools that support annotations include:

Dannotate Web Annotator (http://metadata.net/sfprojects/dannotate.html, 14/05/09, available via Danno). A tool for marking up web pages, adding information about this markup and then storing it in the Danno Annotation Server (http://metadata.net/sfprojects/danno.html, 23/02/11, GPL). Danno is based on the W3C Annotea protocol (http://www.w3.org/2001/Annotea/) for managing web page annotations.
LORE (http://itee.uq.edu.au/~eresearch/projects/aus-e-lit/lore.php, 17/02/11, GPL 3.0). A Firefox extension for annotating web resources about literature and creating links between them.
The OpenAnnotation project (http://www.openannotation.org/) is looking at developing an annotation environment to enable, support, share and manage annotations via the development of specifications and tools.

Semantic programming frameworks

There are several semantic web frameworks available. All of them provide similar functionality but target different runtime environments. A typical set of features includes:

Parsers and extractors for reading different RDF formats.
Serialisers for dumping RDF models to files (different RDF renderings).
RDF storage (triple store).
RDF querying (via APIs and/or SPARQL).
SPARQL endpoints.
Inferencing support (built in or via plug-ins).
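To make the first four features concrete, here is a toy store written with only the Python standard library. It is a sketch and not the API of any framework named in this section: the parser accepts a simplified subset of N-Triples (IRIs and plain literals only, with no blank nodes, language tags or datatypes), and querying is limited to matching a single triple pattern.

```python
import re

# Simplified N-Triples line: <s> <p> followed by <o> or "literal", then a dot.
LINE = re.compile(
    r'^\s*<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"((?:[^"\\]|\\.)*)")\s*\.\s*$')


class TinyStore:
    """An in-memory set of triples with parse, match and serialise."""

    def __init__(self):
        self.triples = set()

    def parse(self, text):
        """Read simplified N-Triples; literals are kept in escaped form."""
        for line in text.splitlines():
            if not line.strip() or line.lstrip().startswith("#"):
                continue
            found = LINE.match(line)
            if found is None:
                raise ValueError("unsupported line: %r" % line)
            subject, predicate, iri, literal = found.groups()
            obj = ("iri", iri) if iri is not None else ("literal", literal)
            self.triples.add((subject, predicate, obj))

    def match(self, s=None, p=None, o=None):
        """Return triples matching a pattern; None acts as a wildcard."""
        return sorted(t for t in self.triples
                      if (s is None or t[0] == s)
                      and (p is None or t[1] == p)
                      and (o is None or t[2] == o))

    def serialise(self):
        """Write the store back out as N-Triples, one triple per line."""
        lines = []
        for subject, predicate, (kind, value) in sorted(self.triples):
            obj = "<%s>" % value if kind == "iri" else '"%s"' % value
            lines.append("<%s> <%s> %s ." % (subject, predicate, obj))
        return "\n".join(lines)


# Made-up sample data using a hypothetical namespace.
DATA = """
<http://example.org/i1> <http://example.org/vocab#foundAt> <http://example.org/Aphrodisias> .
<http://example.org/i1> <http://example.org/vocab#text> "Dedication to Zeus" .
"""
store = TinyStore()
store.parse(DATA)
```

A call such as `store.match(p="http://example.org/vocab#foundAt")` plays the role of the API-level querying listed above; a real framework adds indexes, full SPARQL and inferencing on top of the same basic model.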

In terms of runtime environments, examples include:

ARC (http://arc.semsol.org/, 2010, W3C software licence) targeting PHP programmers and LAMP based deployments.
RDFLib (http://www.rdflib.net/, 21/02/11, New BSD) targeting Python programmers.
Jena, based on Java.
Sesame, based on Java.

Further information

The W3C Semantic Web pages maintain a list of Semantic Web and linked data tools and technologies at http://www.w3.org/2001/sw/wiki/Tools.