USEWOD 2013 Data Challenge

The USEWOD 2013 Data Challenge invites research and applications built on the USEWOD 2013 Dataset. Accepted submissions will be presented at USEWOD2013, where a winner will be chosen. Examples of analyses and research that could be done with the dataset include, but are not limited to, the following:

  • correlations between linked data requests and real-world events
  • types of structured queries
  • linked data access vs. conventional access
  • analysis of user agents visiting the sites
  • geographical analysis of requests
  • detection and visualisation of trends
  • correlations between site traffic and available datasets
  • etc. - let your imagination run wild!

Please note that there is now also an additional On-site Hacking Challenge!

USEWOD 2013 Dataset

The USEWOD dataset consists of CLF server logs from four major web services publishing datasets on the Web of linked data, as well as two additional datasets containing queries posted to SPARQL endpoints. In particular, the dataset contains logs from the following sources:

  • CLF Server Logs:
    • DBpedia: slices of log data spanning several months from the linked data twin of Wikipedia, one of the focal points of the Web of data. The logs were kindly made available to us for the challenge by OpenLink Software! Further details about this part of the dataset to follow.
    • SWDF: Semantic Web Dog Food is a constantly growing dataset of publications, people and organisations in the Web and Semantic Web area, covering several of the major conferences and workshops, including WWW, ISWC and ESWC. The logs contain over four years of requests to the server from about 12/2008 until 01/2013.
    • Linked Open Geo Data: Linked Open Geo Data offers information collected by OpenStreetMap as RDF.
    • Bio2RDF (KEGG): Bio2RDF offers Linked Data for the life sciences. KEGG (Kyoto Encyclopedia of Genes and Genomes) is one of about 40 atomic datasets served by the Bio2RDF project (find the list at http://sourceforge.net/apps/mediawiki/bio2rdf/index.php?title=Datasets) and, judging by the logs we have, the most frequently used one.
  • SPARQL Endpoint Logs
    • Open-BioMed.org.uk: This service offers gene expression search for Drosophila research, as well as drug discovery for Alzheimer's disease. The dataset included here consists of time-stamped SPARQL queries that were posted to Open-BioMed between 02/2011 and 11/2012.
    • BioPortal: BioPortal provides access to commonly used biomedical ontologies. Included here are time-stamped SPARQL queries from 12/2012 to 02/2013.

Content

The first four datasets are all regular server log files (see below), but differ slightly in their content. All four datasets contain SPARQL queries (.../sparql?...) to their respective services, while the SWDF dataset and parts of the DBpedia dataset (3.3 and 3.4) also contain direct linked data requests for individual resources.
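
As a minimal sketch of how the SPARQL requests might be pulled out of the server logs, the following Python function decodes the query text from a request target. It assumes the query is passed in a URL parameter named `query`, as the SPARQL protocol specifies for GET requests; the logs may also contain requests that do not follow this pattern. The function name and the example request are illustrative and not taken from the dataset.

```python
from urllib.parse import urlsplit, parse_qs


def extract_sparql_query(request_target):
    """Return the decoded SPARQL query from a request target, or None.

    Assumption: the endpoint path ends in /sparql and the query text
    is carried in the `query` URL parameter.
    """
    parts = urlsplit(request_target)
    if not parts.path.rstrip("/").endswith("/sparql"):
        return None
    values = parse_qs(parts.query).get("query")
    return values[0] if values else None


# Illustrative request target, not an entry from the logs.
example = "/sparql?query=SELECT+%3Fs+WHERE+%7B%3Fs+%3Fp+%3Fo%7D&format=json"
decoded = extract_sparql_query(example)
```

Requests for individual resources return None here, so the same function can be used to separate SPARQL traffic from direct linked data requests.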

Where direct resource requests are contained in the logs, different resource representations can be distinguished. For example, in DBpedia you will find:

  • http://dbpedia.org/resource/Berlin - the URI for Berlin
  • http://dbpedia.org/data/Berlin.rdf - the RDF/XML representation of Berlin
  • http://dbpedia.org/page/Berlin - the HTML representation of Berlin

Similarly, in SWDF you will find:

  • http://data.semanticweb.org/person/tim-berners-lee - the URI for Sir Tim Berners-Lee
  • http://data.semanticweb.org/person/tim-berners-lee/rdf - the RDF/XML representation of Tim Berners-Lee
  • http://data.semanticweb.org/person/tim-berners-lee/html - the HTML representation of Tim Berners-Lee
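
The URI patterns listed above can be turned into a simple classifier for the representation being requested. This is only a sketch based on the examples given here: the function names and the labels "uri", "rdf" and "html" are our own, and the logs may contain further path patterns that are not covered.

```python
import re

# Path patterns taken from the DBpedia examples above.
DBPEDIA_PATTERNS = [
    (re.compile(r"^/resource/"), "uri"),
    (re.compile(r"^/data/.+\.rdf$"), "rdf"),
    (re.compile(r"^/page/"), "html"),
]


def dbpedia_representation(path):
    """Classify a DBpedia request path, or return None if it matches no pattern."""
    for pattern, label in DBPEDIA_PATTERNS:
        if pattern.search(path):
            return label
    return None


def swdf_representation(path):
    """Classify an SWDF resource path by its suffix.

    Assumption: the caller has already filtered out non-resource
    requests such as SPARQL queries and keyword searches.
    """
    if path.endswith("/rdf"):
        return "rdf"
    if path.endswith("/html"):
        return "html"
    return "uri"
```

For instance, `dbpedia_representation("/page/Berlin")` yields "html", while `swdf_representation("/person/tim-berners-lee")` yields "uri".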

Furthermore, the URIs in SWDF are structured based on the kind of resource they denote:

  • .../person/PERSON_NAME - URIs for people
  • .../organization/ORG_NAME - URIs for organisations
  • .../(conference|workshop)/EVENT_NAME/EVENT_YEAR/... - URIs for everything to do with a particular event, such as papers, talks, chairs, etc.

The SWDF logs also contain keyword searches (.../search/node/...).
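
The SWDF path structure described above lends itself to a small lookup that labels each request by the kind of resource it targets. The sketch below is an assumption-laden illustration: the function name, the label strings and the example event path are hypothetical, and anything that matches none of the listed patterns is simply labelled "other".

```python
import re

# Patterns follow the SWDF URI structure described above.
SWDF_KINDS = [
    ("person", re.compile(r"^/person/[^/]+")),
    ("organization", re.compile(r"^/organization/[^/]+")),
    ("event", re.compile(r"^/(?:conference|workshop)/[^/]+/[^/]+")),
    ("search", re.compile(r"^/search/node/")),
]


def swdf_kind(path):
    """Return the kind of SWDF resource a request path refers to."""
    for kind, pattern in SWDF_KINDS:
        if pattern.match(path):
            return kind
    return "other"


# Hypothetical paths for illustration only.
kinds = [
    swdf_kind("/person/tim-berners-lee/rdf"),
    swdf_kind("/conference/iswc/2012/paper/123"),
    swdf_kind("/search/node/linked+data"),
]
```

Counting these labels per day or per user agent would be one way to approach several of the analyses suggested for the challenge.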

Format

The first four log datasets are provided in the Apache Combined Log Format. However, there are slight differences between the datasets because they have been provided by different parties. The IP addresses in all datasets have been anonymised, but in different ways:

  • Bio2RDF: all Bio2RDF log entries show requests that have been routed through the service's web interface, and therefore have identical IP addresses and user agents.
  • DBpedia 3.3, 3.4 and SWDF: all IP addresses have been set to "0.0.0.0". Two additional fields have been added to the end of each log entry: the country code of the request IP (determined using the GeoLite Country API), and a hash of the IP.
  • DBpedia 3.5.1: all IP addresses have been set to "0.0.0.0". All agent strings have been replaced with "preprocessed".
  • DBpedia 3.6 and 3.8: all IP address fields have been replaced with a hash of the address. All timestamps have been set to 04:00.
  • LGD: all IP addresses have been replaced with "0.0.0.X", where X=1 for the first IP encountered and X is incremented by one for each new IP after that.
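
A minimal sketch of a parser for these entries is shown below. It matches the standard Apache Combined Log Format fields and collects anything that follows the user agent into an `extra` list, which for DBpedia 3.3, 3.4 and SWDF should hold the country code and the IP hash. How those two trailing fields are delimited is an assumption (whitespace-separated, optionally quoted), and the sample line is invented for illustration rather than taken from the collection.

```python
import re

# Standard Combined Log Format fields, plus any trailing fields.
COMBINED = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>(?:[^"\\]|\\.)*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>(?:[^"\\]|\\.)*)" '
    r'"(?P<agent>(?:[^"\\]|\\.)*)"'
    r'(?P<extra>.*)$'
)


def parse_log_line(line):
    """Parse one log entry into a dict, or return None if it does not match."""
    match = COMBINED.match(line.rstrip("\n"))
    if match is None:
        return None
    entry = match.groupdict()
    # Assumption: extra fields are whitespace-separated and may be quoted.
    entry["extra"] = [field.strip('"') for field in entry["extra"].split()]
    return entry


# Invented sample line in the style described for DBpedia 3.3/3.4 and SWDF.
sample = ('0.0.0.0 - - [21/Sep/2009:00:00:00 -0600] '
          '"GET /resource/Berlin HTTP/1.1" 200 1234 "-" "ExampleBot/1.0" '
          '"DE" "abc123"')
entry = parse_log_line(sample)
```

Because the `host` field holds either "0.0.0.0", a hash or a "0.0.0.X" pseudonym depending on the source, per-visitor analyses need to pick the identifying field per dataset: the hash in `extra` for DBpedia 3.3, 3.4 and SWDF, and `host` for DBpedia 3.6, 3.8 and LGD.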

The Open-BioMed and BioPortal SPARQL logs contain time stamps followed by a query that was processed by the server. The BioPortal logs additionally contain a line indicating the execution time and how many results were returned for the query. Here is an example from Open-BioMed:

------------
# timestamp: 2011-07-06T16:38:45
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX chado: <http://purl.org/net/chado/schema/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX so: <http://purl.org/obo/owl/SO#>
PREFIX syntype: <http://purl.org/net/flybase/synonym-types/>
SELECT DISTINCT ?flybaseID ?symbol ?annotationSymbol ?fullName 
WHERE {
  ?feature skos:altLabel "schuy" ;
    a so:SO_0000704 ;
    chado:organism <http://purl.org/net/open-biomed/id/flybase/organism/Drosophila_melanogaster> ;
    chado:uniquename ?flybaseID ;
    chado:name ?symbol ;
      chado:annotationSymbol ?annotationSymbol ;
      chado:fullName ?fullName .
  FILTER (regex(str(?annotationSymbol), "^CG[0-9]*$"))
}			 
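
Reading such a log back into (timestamp, query) pairs could look roughly like the sketch below. It assumes every record starts with a "# timestamp:" line in the format shown above and that lines consisting only of dashes are separators; both points are inferred from the single example here. The additional execution-time line in the BioPortal logs is not handled, since its format is not shown. The function name is our own.

```python
from datetime import datetime

TIMESTAMP_PREFIX = "# timestamp:"


def parse_sparql_log(lines):
    """Group log lines into (datetime, query_text) records."""
    records = []
    stamp, query = None, []
    for raw in lines:
        line = raw.rstrip()
        if line.startswith(TIMESTAMP_PREFIX):
            # A new timestamp closes the previous record, if any.
            if stamp is not None:
                records.append((stamp, "\n".join(query).strip()))
            value = line[len(TIMESTAMP_PREFIX):].strip()
            stamp = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
            query = []
        elif stamp is not None and set(line) != {"-"}:
            # Assumption: dash-only lines are separators, not query text.
            query.append(line)
    if stamp is not None:
        records.append((stamp, "\n".join(query).strip()))
    return records


# Shortened, made-up records in the format of the example above.
sample_lines = [
    "------------",
    "# timestamp: 2011-07-06T16:38:45",
    "SELECT ?s",
    "WHERE { ?s ?p ?o }",
    "",
    "# timestamp: 2011-07-06T16:40:00",
    "ASK { ?s ?p ?o }",
]
records = parse_sparql_log(sample_lines)
```

Passing an open file object instead of a list works the same way, because the function only iterates over its argument line by line.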
			

Changes from Last Year's Challenge Data

The USEWOD2013 dataset subsumes last year's challenge dataset, and makes the following additions:

  • DBpedia data: an additional set of DBpedia (v3.8) logs has been added
  • SWDF data: an additional set of SWDF (Semantic Web Dog Food) data has been added, extending the coverage from 1,086 days to 1,490 days
  • Open-BioMed SPARQL queries: queries to the SPARQL endpoint available from open-biomed.org.uk have been added
  • BioPortal SPARQL queries: queries to the SPARQL endpoint available from bioportal.bioontology.org have been added

Download and License

To get access to the USEWOD2013 dataset, please print, sign and scan the usage agreement, and email the scan to the USEWOD2012 Chairs. Once your form is processed (usually within 1-3 days), you will be sent a URL and a password with which you can download the collection.

Here is a sample of CLF log entries from the collection.

Thanks

We would like to thank a number of people for their support in making the collection of this dataset possible (in alphabetical order):

  • Sören Auer (Universität Leipzig, Germany)
  • Chris Bizer (FU Berlin, Germany)
  • Kingsley Idehen (OpenLink Software, US)
  • Patrick van Kleef (OpenLink Software, US)
  • Marc-Alexandre Nolin (Laval University, Canada)
  • Natasha Noy (Stanford University, US)
  • Claus Stadler (Universität Leipzig, Germany)
  • Jun Zhao (University of Oxford, UK)