Assessing Jena and Sesame

We undertook an evaluation of the use and functionality of the triple stores provided by the Sesame (http://www.openrdf.org) and Jena (http://jena.sourceforge.net/) semantic frameworks. This evaluation was done under Linux and assumes that Java 1.6 and Python (http://www.python.org/) are available.

Getting RDF data

We focused on populating the triple stores with existing RDF resources. It is most convenient to obtain local copies prior to bulk loading. We used the wget tool to fetch epigraphic data that is already available as linked data (LD):

wget http://nomisma.org/nomisma.org.rdf
wget http://classics.uc.edu/troy/grbpottery/nt/grbpilion.nt
wget -nd -r -l1 http://lkws1.rdg.ac.uk/ure/metadata/ ; rm `ls -1 | egrep -v '\.rdf$'`
wget -nd -r -l1 http://lkws1.rdg.ac.uk/ure/vlma/rdf/ ; rm `ls -1 | egrep -v '\.rdf$'`

Setting up and populating Jena

Now, we download the Joseki and TDB distributions (the evaluation used Joseki 3.4.2).

We unzip all of the downloaded archives.

Now, we go to the Joseki directory and copy the TDB-backed configuration into place:

cp joseki-config-tdb.ttl joseki-config.ttl

Edit joseki-config.ttl to point to the directory where you want TDB to store its data: look for the following configuration triple and change the tdb:location value:

## ---- A whole dataset managed by TDB
<#dataset> rdf:type      tdb:DatasetTDB ;
    tdb:location "/home/bdobrzel/SPQR/data/triplestore/jena" ;
    .

We can now populate the triple store. TDB provides a tdbloader tool that bulk loads RDF files into a TDB store. Since some of our datasets consist of multiple RDF files, it is useful to have a script that bulk loads a whole directory in one batch.

The tdbdirloader.py script can be found in the SPQR code repository, http://code.google.com/p/spqr/source/browse/#svn%2Ftrunk%2Fsrc%2Fpython%2Fmain.
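For illustration, a minimal sketch of such a directory loader is given below. It is not the actual tdbdirloader.py: it assumes that the tdbloader script from the TDB distribution is on the PATH and that the tdb:location value can be read from the Joseki configuration file with a simple regular expression.

#!/usr/bin/env python
# Hypothetical sketch of a directory bulk loader; see the SPQR repository
# for the real tdbdirloader.py.
import os, re, subprocess, sys

config_file, data_dir = sys.argv[1], sys.argv[2]

# Extract the TDB store location from the Joseki configuration file.
with open(config_file) as f:
    location = re.search(r'tdb:location\s+"([^"]+)"', f.read()).group(1)

# Invoke tdbloader once for every RDF file in the data directory.
for name in sorted(os.listdir(data_dir)):
    if name.endswith((".rdf", ".nt")):
        subprocess.check_call(["tdbloader", "--loc=" + location,
                               os.path.join(data_dir, name)])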

Then we need to set some environment variables. From the TDB distribution directory, invoke:

export TDBROOT=`pwd`
export PATH=$PATH:$TDBROOT/bin
chmod a+x bin/*

We are now ready to bulk load the datasets. The script needs to be pointed to the Joseki/TDB configuration file and the directory containing the RDF data, e.g. for the GRB Pottery data:

./tdbdirloader.py /home/bdobrzel/SPQR/soft/Joseki-3.4.2/joseki-config.ttl \
                  /home/bdobrzel/SPQR/data/raw/grbpottery/

The bulk load for the grbpottery dataset will fail with:

ERROR [line: 2221, col: 105] ... org.openjena.riot.RiotException: [line: 2221, col: 105] ...

Unfortunately, this arises because there is a corrupt entry in the dataset. The easiest way around it is to delete the offending line (2221) from the relevant file. After the line is removed, the bulk load will succeed, reporting:

-- Start triples data phase
** Load into triples table with existing data
Load: /home/bdobrzel/SPQR/data/raw/grbpottery//grbpilion.nt -- 2010/08/31 11:44:26 BST
-- Finish triples data phase
10,807 triples loaded in 6.60 seconds [Rate: 1,636.43 per second]
-- Start triples index phase
-- Finish triples index phase
-- Finish triples load
** Completed: 10,807 triples loaded in 6.61 seconds [Rate: 1,633.71 per second]

The bulk load of the nomisma.org dataset will report incorrect URIs, but should complete successfully:

WARN  {W107} Bad URI: <http://Roses,_Girona> Code: 28/NOT_DNS_NAME in HOST: \
      The host component did not meet the restrictions on DNS names.
WARN  {W107} Bad URI: <[nm:thickness> Code: 0/ILLEGAL_CHARACTER in SCHEME: \
      The character violates the grammar rules for URIs/IRIs.

The bulk load for the ure datasets will take a long time because these datasets contain multiple RDF files: the tdbloader tool is Java-based, so a new JVM is spawned for each RDF file. One way around this would be a tool that concatenates multiple RDF files into a single file (a sketch follows the listing below). Both ure datasets also contain empty RDF documents and empty files. These are listed below:

ure_metadata:
img.594.rdf  img.629.rdf  img.593.rdf  img.627.rdf  img.704.rdf  img.795.rdf  img.979.rdf
img.928.rdf  img.982.rdf  img.921.rdf  img.926.rdf  img.793.rdf  img.25.rdf   img.626.rdf
img.592.rdf  img.584.rdf  img.927.rdf  img.983.rdf  img.630.rdf  img.984.rdf  img.605.rdf
img.33.rdf   img.923.rdf  img.587.rdf  img.924.rdf  img.989.rdf  img.987.rdf  img.601.rdf
img.788.rdf  img.39.rdf   img.30.rdf   img.26.rdf   img.28.rdf   img.598.rdf  img.596.rdf
img.31.rdf   img.44.rdf   img.586.rdf  img.590.rdf  img.623.rdf  img.624.rdf  img.29.rdf
img.791.rdf  img.796.rdf  img.27.rdf
ure_metadata_empty:
img.59.rdf  img.57.rdf
ure_vlma:
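To illustrate the concatenation workaround mentioned above, the sketch below merges all non-empty RDF/XML files in a directory into a single N-Triples file, which can then be handed to tdbloader in one invocation. It assumes the rdflib library is available; the SPQR tooling itself does not necessarily use it.

#!/usr/bin/env python
# Hypothetical sketch: merge a directory of RDF/XML files into one N-Triples
# file so that tdbloader only has to be started once. Assumes rdflib.
import os, sys
from rdflib import Graph

data_dir, out_file = sys.argv[1], sys.argv[2]

merged = Graph()
for name in sorted(os.listdir(data_dir)):
    path = os.path.join(data_dir, name)
    # Skip non-RDF files and the zero-byte files found in the ure datasets.
    if not name.endswith(".rdf") or os.path.getsize(path) == 0:
        continue
    merged.parse(path, format="xml")

merged.serialize(destination=out_file, format="nt")
print("wrote %d triples to %s" % (len(merged), out_file))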

After populating the store we can launch the web server. Joseki comes with an embedded Jetty container. Go to the Joseki directory, set the JOSEKIROOT variable and launch the server:

export JOSEKIROOT=`pwd`
chmod a+x bin/*
./bin/rdfserver

To query the triple store we can point a browser to http://host:2020/sparql.html and run a sample query:

SELECT ?p ?o
WHERE {
<http://classics.uc.edu/troy/grbpottery/html/K-L16-17.0118.html> ?p ?o
}

or query the SPARQL endpoint directly via:

http://host:2020/sparql?query=SELECT ?p ?o WHERE {<http://classics.uc.edu/troy/grbpottery/html/K-L16-17.0118.html> ?p ?o}
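In practice the query string has to be URL-encoded: a browser does much of this silently, but a programmatic client must do it explicitly. A minimal sketch using Python 3's urllib is given below; the host name is an assumption, and 2020 is the standalone Joseki port used above.

#!/usr/bin/env python
# Sketch of querying the Joseki SPARQL endpoint with the query URL-encoded.
import urllib.parse, urllib.request

ENDPOINT = "http://localhost:2020/sparql"
QUERY = """
SELECT ?p ?o
WHERE { <http://classics.uc.edu/troy/grbpottery/html/K-L16-17.0118.html> ?p ?o }
"""

# urlencode percent-encodes the spaces, braces and angle brackets in the query.
params = urllib.parse.urlencode({"query": QUERY})
print(urllib.request.urlopen(ENDPOINT + "?" + params).read().decode("utf-8"))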

The Joseki SPARQL endpoint is now available for queries. There is one more thing we might want to do: to simplify administration, we would like the Joseki SPARQL endpoint to run inside the same container as the Sesame web application. The commands below achieve this ($CATALINA_HOME points to the Tomcat installation). Starting from the Joseki directory:

cp -R webapps/joseki $CATALINA_HOME/webapps
mkdir $CATALINA_HOME/webapps/joseki/WEB-INF/lib
mkdir $CATALINA_HOME/webapps/joseki/WEB-INF/classes
cp `find lib/* | egrep -v '(lib/jetty.*|lib/servlet.*)'` $CATALINA_HOME/webapps/joseki/WEB-INF/lib
cp joseki-config.ttl $CATALINA_HOME/webapps/joseki/WEB-INF/classes

We can now restart the Tomcat container and try to query the Joseki SPARQL endpoint via:

http://host:9000/joseki/sparql?query=SELECT ?p ?o WHERE {<http://classics.uc.edu/troy/grbpottery/html/K-L16-17.0118.html> ?p ?o}

Setting up and populating Sesame

First we need to download the Sesame SDK (and Apache Tomcat, if it is not already available).

Now, unzip the downloaded archives.

Copy the war/openrdf-sesame.war file from the Sesame distribution to the webapps directory of the Tomcat distribution.

Sesame uses a JVM argument to set the default repository location. We can pass this argument to the Tomcat JVM by setting an environment variable, e.g.:

export JAVA_OPTS='-Dinfo.aduna.platform.appdata.basedir=/home/bdobrzel/SPQR/data/triplestore/sesame'

Alternatively, we can edit the Tomcat startup script (bin/catalina.sh) to set this variable.

Note that Sesame uses a low-level file locking scheme that will not work on directories residing on NFS-mounted volumes. If Sesame reports problems with creating a repository and we can see

java.security.AccessController.doPrivileged(Native Method)

somewhere in the stack trace, then we are probably dealing with a network-attached file system.

We now start Tomcat and, after openrdf-sesame is deployed, remove the webapps/openrdf-sesame.war file to avoid redeployment on server restart.

The repository can be initialized either via the Sesame API or with the console program. Creating a new repository and loading single RDF files are best done using the console. For efficiency, we are going to connect to the file system repository directly instead of going through the web application. To avoid problems with locking, shut down Tomcat prior to invoking the console. An example session, in which we load the GRB Pottery data, is as follows:

export PATH=$PATH:`pwd`/bin
console.sh -d /opt/data/openrdf-sesame
Connected to /opt/data
Commands end with '.' at the end of a line
Type 'help.' for help
> create native-rdfs.
Please specify values for the following variables:
Repository ID [native-rdfs]: spqr
Repository title [Native store with RDF Schema inferencing]: SPQR Native with RDFS
Triple indexes [spoc,posc]:
Repository created
> open spqr.
Opened repository 'spqr'
spqr> load /home/bdobrzel/SPQR/data/raw/grbpottery/grbpilion.nt.
Loading data...
Data has been added to the repository (7911 ms)
spqr> quit.
Closing repository 'spqr'...
Disconnecting from /opt/data/openrdf-sesame
Bye

We have created a repository with support for RDFS inferencing. Consult the Sesame Users Guide (http://www.openrdf.org/doc/sesame2/users/ch07.html) for more information about repository types.

We are now ready to query our store. We start up the web container and point our browser to:

http://host:9000/openrdf-sesame/repositories/spqr?query=SELECT ?p ?o WHERE {<http://classics.uc.edu/troy/grbpottery/html/K-L16-17.0118.html> ?p ?o}

If all is well you should see some query results.

The console is convenient for loading individual files; for multiple files it is useful to automate the process. We have provided a simple script that should be copied to the bin directory of the Sesame SDK. Before using this script, ensure that the Tomcat container is shut down. The script, sesamedirloader.py, is at http://code.google.com/p/spqr/source/browse/#svn%2Ftrunk%2Fsrc%2Fpython%2Fmain. An example of its use is:

cp sesamedirloader.py bin/
sesamedirloader.py /opt/data/openrdf-sesame spqr /home/bdobrzel/SPQR/data/raw/ure_metadata 2> /dev/null

The sesamedirloader.py script loads all the RDF files in a given directory (stderr is redirected to /dev/null to suppress the logging messages). The loader also reports any empty files it detects in a dataset, e.g.:

ure_metadata:
img.59.rdf  img.57.rdf
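An alternative, sketched below, is to load over HTTP using the Sesame REST interface while Tomcat is running, POSTing each file to the repository's statements resource. This is not how sesamedirloader.py works (that script operates on the file system repository with Tomcat shut down); the host, port and repository name below are assumptions to be adjusted.

#!/usr/bin/env python
# Hypothetical sketch: load every RDF file in a directory into a Sesame
# repository over HTTP (Tomcat must be running).
import os, sys
import urllib.request

repo_url = "http://localhost:9000/openrdf-sesame/repositories/spqr/statements"
data_dir = sys.argv[1]
mime = {".rdf": "application/rdf+xml", ".nt": "text/plain"}

for name in sorted(os.listdir(data_dir)):
    ext = os.path.splitext(name)[1]
    path = os.path.join(data_dir, name)
    if ext not in mime or os.path.getsize(path) == 0:
        continue  # skip non-RDF and empty files
    with open(path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(repo_url, data=data,
                                 headers={"Content-Type": mime[ext]})
    urllib.request.urlopen(req)
    print("loaded", name)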

General comments

In terms of security, access control is rather limited. Using proxies would be a good idea. There may be ways of enforcing some access policies via Tomcat.

For data quality checking, the Jena TDB bulk loader is much more verbose: it warns us about invalid URIs and about empty RDF documents. The Sesame console loader, by contrast, raised no complaints about empty RDF documents and only reported problems with empty files.

For UTF-8 support, we tested both triple stores with a triple containing a literal with Greek letters:

  <rdf:Description rdf:about="http://nomisma.org/id/test">
    <skos:definition xml:lang="el">?????</skos:definition>
  </rdf:Description>

Both triple stores stored this triple without complaint, and both correctly answered the following query via their Java APIs.

SELECT ?s ?p
WHERE {?s ?p "?????"@el .}

Attempts to execute this query by sending a URL-encoded version to the SPARQL endpoints from a web browser failed. This may be due to incorrect URL encoding and would be investigated further if such a capability proves crucial.
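Should this capability become important, a first step would be to encode the query explicitly rather than relying on the browser. The sketch below does so; the Greek literal is a hypothetical stand-in (the original test literal did not survive this report's encoding) and the endpoint URL is an assumption.

#!/usr/bin/env python
# Sketch: send a query containing a UTF-8 literal with explicit URL encoding.
# The Greek word below is a stand-in for the original test literal.
import urllib.parse, urllib.request

endpoint = "http://localhost:2020/sparql"
query = 'SELECT ?s ?p WHERE { ?s ?p "δοκιμή"@el . }'

# urlencode encodes the non-ASCII characters as UTF-8 before percent-encoding.
params = urllib.parse.urlencode({"query": query})
print(urllib.request.urlopen(endpoint + "?" + params).read().decode("utf-8"))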