Mighty Storage Challenge – Tasks and Training Data

Task 1: RDF Data Ingestion

Summary Description

The constant growth of the Linked Data Web in velocity and volume has increased the need for triple stores to ingest streams of data and to query this data efficiently. The aim of this task is to measure the performance of SPARQL query processing systems faced with streams of data from industrial machinery, in terms of efficiency and completeness. The experimental setup will hence be as follows: we will increase the size and velocity of the RDF data used in our benchmarks to evaluate how well a system can store streaming RDF data obtained from industry. The data will be generated from one or multiple resources in parallel and will be inserted using SPARQL INSERT queries. To the best of our knowledge, this facet of triple stores has never been benchmarked before. SPARQL SELECT queries will be used to test the system’s ingestion performance and storage abilities.

Testing and Training Data

The input data for this task consists of data derived from mimicking algorithms trained on real industrial datasets (see Use Cases for details). Each training dataset will include RDF triples generated within a period of time (e.g., a production cycle). Each event (e.g., a sensor measurement or a tweet) carries a timestamp that indicates when it was generated. The datasets differ in size with regard to the number of triples per second. During the test, the data to be ingested will be generated by data agents (in the form of distributed threads). An agent is a data generator that is responsible for inserting its assigned set of triples into a triple store using SPARQL INSERT queries. Each agent will emulate a dataset that covers the duration of the benchmark. All agents will operate in parallel and independently of each other; as a result, the benchmarked storage solution will have to support concurrent inserts. The insertion of a triple is scheduled based on its generation timestamp. To emulate, within a shorter time frame, the ingestion of streaming RDF triples produced over large time periods, we will apply a time dilatation factor to the timestamps of the triples. Our benchmark allows for testing ingestion performance in terms of precision and recall by deploying datasets that vary in volume (number of triples and timestamps), and by using different dilatation values, different numbers of agents, and different sizes of update queries. The testing and training data are public transport, Twitter, transportation, and molding machine datasets, and are available here:
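The dilatation mechanism can be sketched as follows. This is a minimal illustration in Python; the function and parameter names are ours and not part of the benchmark implementation. The idea is that triples generated over a long real-world period are replayed within a shorter benchmark window by dividing each timestamp offset by a dilatation factor.

```python
# Illustrative sketch of a time dilatation factor (names are hypothetical,
# not the benchmark's API): generation timestamps spanning a long period
# are compressed onto a shorter replay schedule.

def dilate(timestamps, dilatation_factor, benchmark_start=0.0):
    """Map generation timestamps (seconds) onto a compressed insertion schedule."""
    t0 = min(timestamps)
    return [benchmark_start + (t - t0) / dilatation_factor for t in timestamps]

# One hour of sensor events replayed 60x faster fits into one minute.
events = [0, 900, 1800, 3600]            # generation times in seconds
schedule = dilate(events, dilatation_factor=60)
# schedule == [0.0, 15.0, 30.0, 60.0]
```

Each agent would then issue its SPARQL INSERT query at the dilated time rather than at the original generation time.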

Evaluation Methodology

Our evaluation focuses on two KPIs, precision and recall, together with the derived F-measure. To compute the reference result set for each of our SELECT queries, we will use a reference SPARQL implementation that passes all the micro-tests devised by the W3C. The relevant data will be loaded into that system using a bulk load. We will determine the precision, recall, and F-measure achieved by the systems at hand by comparing the set of query solutions contained in their result sets with the set of query solutions contained in the reference result set. Transparency will be assured by releasing the dataset generators as well as the configurations.
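The comparison of result sets can be sketched as follows (a minimal Python illustration; the function name and the tuple encoding of query solutions are our own assumptions, not the evaluation harness itself):

```python
# Illustrative computation of precision, recall, and F-measure over sets of
# query solutions (each solution encoded here as a tuple of bindings).

def precision_recall_f1(system_solutions, reference_solutions):
    system, reference = set(system_solutions), set(reference_solutions)
    tp = len(system & reference)                      # correctly returned solutions
    precision = tp / len(system) if system else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = precision_recall_f1({("s1", "o1"), ("s2", "o2")},
                              {("s1", "o1"), ("s3", "o3")})
# one true positive out of two returned and two expected: p == r == f == 0.5
```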

Availability of resources

We will provide the participants with the test data by January 13th, 2017 and ask them to submit their final system version as a Docker image by April 7th, 2017. The final tests will be carried out with the Docker images uploaded by the participants by April 7th, 2017, using version 1.0 of the HOBBIT platform. The results will be available by April 21st and will be made public at the ESWC 2017 conference. We have already gathered all the datasets necessary for this task and are currently completing the implementation of the HOBBIT benchmarking platform. We are hence confident that we will be able to run all the benchmarks on the platform by April 7th. The baseline implementations of triple stores are already available and will include OpenLink’s Virtuoso.

Use Cases

This task aims to reflect real loads on triple stores used in real applications. We will hence use the following datasets:

  • Public Transport Data
  • Social network data from Twitter (Twitter Data)
  • Car traffic data derived from sensor readings gathered by TomTom (Transportation Data)
  • Sensor data from plastic injection moulding industrial plants of Weidmüller (Molding Machines Data)

The descriptions of the datasets can be found here.

Task 2: Data Storage Benchmark

Summary Description

In this task, we will develop an RDF benchmark that measures how datastores perform with interactive, simple read SPARQL queries as well as with complex, business intelligence (BI) queries. Running the queries will be accompanied by a high insert rate of data (SPARQL UPDATE queries) in order to mimic real use cases in which READ and WRITE operations are bundled together. Typical bulk-loading scenarios will also be supported. The queries and query mixes will be designed to stress the system under test in different choke-point areas, while remaining credible and realistic.

Testing and Training Data

The LDBC Social Network Benchmark will be used as a starting point for this benchmark. The dataset generator developed for that benchmark will be modified to produce synthetic RDF datasets, available in different sizes, that are more realistic and more RDF-like. The structuredness of the dataset will be in line with real-world RDF datasets, unlike the LDBC Social Network Benchmark dataset, which is designed to be more generic and very well structured. The output of the generator will be split into three parts: the dataset that should be loaded by the system under test, a set of update streams containing update queries, and a set of files containing the different parameter bindings that the driver will use to generate the read queries of the workloads. The data for the task are available at: ftp://hobbitdata.informatik.uni-leipzig.de/mighty-storage-challenge/Task2/

Evaluation Methodology

After generating the dataset of the desired size, the whole dataset will be bulk loaded and the loading time will be measured. Running the benchmark consists of three separate parts: validating the query implementations, warming up the database, and performing the benchmark run. The queries are validated by means of the official validation datasets that we will provide. The auditor must load the provided dataset and run the driver in validation mode, which tests that the queries produce the official results. The warm-up will be performed using the driver. A valid benchmark run must last at least 2 hours of simulation time (datagen time). The only relevant KPI is throughput. A challenge participant may specify a different target throughput to test by “squeezing” together or “stretching” apart the queries of the workload. This is achieved by means of the “Time Compression Ratio”, which multiplies the frequencies of the queries.
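The effect of the Time Compression Ratio can be sketched as follows (an illustrative Python snippet under our own naming; multiplying query frequencies by the ratio is equivalent to dividing the gaps between scheduled query issue times):

```python
# Illustrative sketch of a "Time Compression Ratio" (TCR): a TCR > 1
# squeezes the workload's query schedule (higher target throughput),
# a TCR < 1 stretches it. Names are hypothetical, not the driver's API.

def compress_schedule(issue_times, tcr):
    """Rescale a sorted list of query issue times by the given TCR."""
    t0 = issue_times[0]
    return [t0 + (t - t0) / tcr for t in issue_times]

# TCR = 2 doubles the target throughput: queries arrive twice as fast.
original = [0.0, 10.0, 20.0, 30.0]
squeezed = compress_schedule(original, tcr=2.0)
# squeezed == [0.0, 5.0, 10.0, 15.0]
```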

Availability of resources

We will provide the driver responsible for executing the whole workload, but each triple store provider must add the missing parts, such as the connection to the store, initialization, loading scripts, etc. We will provide the participants with the final version of the data generator by January 13th, 2017 and ask them to submit their final system version as a Docker image by April 7th, 2017.

Use Cases

The use case of this task is an online social network since it is the most representative and relevant use case of modern graph-like applications. A social network site represents a relevant use case for the following reasons:

  • It is simple to understand for a large audience, as it is present in our everyday life in different shapes and forms.
  • It allows testing a complete range of interesting challenges, by means of different workloads targeting systems of different nature and characteristics.
  • A social network can be scaled, allowing the design of a scalable benchmark targeting systems of different sizes and budgets.

Task 3: Versioning RDF Data

Summary Description

The evolution of datasets often requires storing different versions of the same dataset, so that interlinked datasets can refer to older versions of an evolving dataset and upgrade at their own pace, if at all. Supporting access to, and querying of, past versions of an evolving dataset is the main challenge for archiving/versioning systems. In this sub-challenge we propose a benchmark that will be used to test the ability of versioning systems to efficiently manage evolving datasets and to evaluate queries across the multiple versions of said datasets.

Testing and Training Data

The Semantic Publishing Benchmark (SPB) generator will be used to produce datasets and versions thereof. SPB was developed in the context of the LDBC project and is inspired by the media/publishing industry, in particular by the BBC’s “Dynamic Semantic Publishing” (DSP) concept. We will use the SPB generator, which relies on ontologies and reference datasets provided by the BBC, to produce sets of creative works. Creative works are metadata, represented in RDF, about real-world events (e.g., sport events, elections). The data generator supports the creation of arbitrarily large RDF datasets, in the order of billions of triples, that mimic the characteristics of the real BBC datasets. Data generation follows three principles: data clustering, in which the number of creative works produced diminishes as time goes by; correlation of entities, where two or three entities are used to tag creative works for a fixed period of time; and, last, random tagging of entities. The data generator follows distributions obtained from real-world datasets, thereby producing data with characteristics similar to real ones. The versioning benchmark used in this sub-challenge includes datasets and versions thereof that respect the aforementioned principles.

The data that participants will use for training purposes has been produced in a way that guarantees the coverage of a broad spectrum of use cases (many changes on small graphs, few changes on large graphs, etc.).

In more detail, the training data consist of 12 datasets, which are located at: ftp://hobbitdata.informatik.uni-leipzig.de/mighty-storage-challenge/Task3. These datasets have the following characteristics:

  • Datasets of different Scale Factors (including triples of all versions)
    • SF0: 1M triples (the generated creative works span a period of 1 year, starting from 2016).
    • SF4: 16M triples (the generated creative works span a period of 2 years, starting from 2015).
    • SF7: 128M triples (the generated creative works span a period of 5 years, starting from 2012).
    • SF10: 1B triples (the generated creative works span a period of 10 years, starting from 2007).
  • Varying number of versions
    • 10
    • 50
    • 100

Each of the generated datasets is identified by a name of the form “generated_sf[SF#]v[V#]”. For example, a dataset containing 1M triples (SF0) in 10 versions is identified by “generated_sf0v10.tar.gz“. Each version can be found in its own directory, from V0 to Cend. In particular, V0 denotes the starting dataset and each of C1, C2, …, Cend the changeset with respect to the previous version (a set of added triples in our case). Each version is thus computed as follows: V1 = V0 + C1, V2 = V0 + C1 + C2, …, Vend = V0 + C1 + … + Cend.
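The materialization of versions from the initial dataset and its changesets can be sketched as follows (a minimal Python illustration; the function name and set-of-strings encoding of triples are ours, chosen only to show the additive scheme described above):

```python
# Illustrative reconstruction of versions from V0 plus additive changesets
# C1..Cend (in this task a changeset only adds triples). Each triple is
# represented here simply as a string; names are hypothetical.

def materialize_versions(v0, changesets):
    """Return [V0, V1, ..., Vend] where Vi = V0 ∪ C1 ∪ ... ∪ Ci."""
    versions = [set(v0)]
    for change in changesets:
        versions.append(versions[-1] | set(change))
    return versions

v0 = {"t1", "t2"}
versions = materialize_versions(v0, [{"t3"}, {"t4", "t5"}])
# versions[2] == {"t1", "t2", "t3", "t4", "t5"}
```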

A more detailed analysis of the number of triples in each generated version/changeset (including the sizes of V0 to Cend) can be found in data_stats.xlsx (one sheet per dataset).

Evaluation Methodology

To test the ability of a versioning system to store multiple versions of datasets, our versioning benchmark will produce versions of an initial dataset using as parameters (a) the number of required versions and (b) the total number of triples (including the triples of all versions). The number of versions will be specified by the user of the benchmark, who will also be able to specify the starting time as well as the duration of the generated data. In this manner the user will be able to check how well the storage system addresses the requirements raised by the nature of the versioned data. The benchmark will include a number of queries that will be used to test the ability of the system to answer historical and cross-version queries. These queries will be specified in terms of the ontology of the Semantic Publishing Benchmark and written in SPARQL 1.1.

In our evaluation we will focus on 3 KPIs:

  • Storage space: We will measure the space required to store the different versioned datasets used in our experiments. This KPI is essential to understand whether the system can choose the best strategy (e.g., full materialization, delta-based, annotated triples, or hybrid) for storing the versions and how efficiently that strategy is implemented.
  • Ingestion time: We will measure the time that a system needs to store a newly arriving version. This KPI is essential to quantify the possible overhead of complex computations, such as delta computation, during data ingestion.
  • Query performance: For each of the eight versioning query types (e.g., version materialization, single-version queries, cross-version queries, etc.) we will measure the average time required to answer the benchmark queries.

Availability of resources

We will provide the driver responsible for executing the whole workload, but each versioning system must add the missing parts, such as the connection to the store, initialization, loading scripts, etc. We will provide the participants with the final version of the data generator by January 13th, 2017 and ask them to submit their final system version as a Docker image by April 7th, 2017.

Use Cases

The use cases considered by this benchmark are those that address versioning problems. Such use cases span different domains and applications of interest, such as the energy domain, semantic publishing, biology, etc. For this task we will employ data from the semantic publishing domain.

Task 4: Faceted Browsing Benchmark

Summary Description

Faceted browsing is a session-based (state-dependent) interactive method for query formulation over a multi-dimensional information space. It provides a user with an effective way to explore a search space. After the initial search space, i.e., the set of resources of interest to the user, has been defined, a browsing scenario consists of applying (or removing) filter restrictions on object-valued properties or of changing the range of number-valued properties. Using such operations to select resources with desired properties, the user browses from state to state, where a state consists of the currently chosen facets and facet values and the current set of instances satisfying all chosen constraints. The task on faceted browsing checks existing solutions for their capability of enabling faceted browsing through large-scale RDF datasets; that is, it analyses their efficiency in navigating through large datasets, where the navigation is driven by intelligent iterative restrictions. We aim to measure the performance relative to dataset characteristics, such as overall size and graph characteristics.

Testing and training data

For this task, the transport dataset of linked connections will be used. The transport dataset is provided by a data generator and consists of train connections modelled using the transport ontology following GTFS (General Transit Feed Specification) standards – see here for more details. The datasets may be generated in different sizes, while the underlying ontology remains the same – see here for a visualization of the ontology relevant to the task.

A participating system is required to answer a sequence of SPARQL queries that simulate browsing scenarios through the underlying dataset. The browsing scenarios are motivated by the natural navigation behaviour of a user (such as a data scientist) through the data, as well as by the need to check participating systems on certain choke points. The queries involve temporal (time slices), spatial (different map views), and structural (ontology-related) aspects.

For training, we provide a dataset of triples in Turtle format coming from our generator, as well as a list of SPARQL queries for sample browsing scenarios. Two scenarios are similar to the ones used in the testing phase, while a third is meant to illustrate all the possible choke points that we aim to test. The training data are available at: ftp://hobbitdata.informatik.uni-leipzig.de/mighty-storage-challenge/Task4/

(21.02.2017 — Please note that there has been a slight change in the training data: latitude and longitude values are now modelled as xsd:decimal.)

In addition to the training dataset that we provide, we will make use of the Transport Disruption Ontology.

A list of possible choke points for participating systems can be found here.

Evaluation Methodology

At each state of the simulated browsing scenario through the dataset two types of queries are to be answered correctly:

  1. Facet counts (in the form of SPARQL SELECT COUNT queries):
    For a specific facet, we ask for the number of instances that remain relevant after restricting over this facet. To increase efficiency, approximate counts (e.g., obtained via different indexing techniques) may be returned by a participating system.
  2. Instance retrieval (in the form of SPARQL SELECT queries):
    After selecting a certain facet as a further filter on the solution space, the remaining instances are required to be returned.

One browsing scenario consists of 8 to 11 changes of the solution space (instance retrievals), where each step may be the selection of a certain facet, a change in the value range of a literal property (which may be indirectly related through a complex property path), or the action of undoing a previously chosen facet or range restriction.

The evaluation is based on the following performance KPIs.

  1. Time: The time required by the system is measured separately for the two tasks, facet counts and instance retrievals. The results are reported via a score function computing the number of answered queries per second. For the instance retrieval queries, we additionally compute the queries-per-second score for several choke points separately.
  2. Accuracy of counts: The facet counts are checked for correctness. For each facet count, we record the distance of the returned count from the correct count in terms of absolute value, and we record the error in relation to the size of the solution space (relative error). We both sum and average over all steps of the browsing scenario, resulting in four overall error terms:
    1. overall absolute error (sum of all errors)
    2. average absolute error
    3. overall relative error (sum of all errors over sum of all counts)
    4. average relative error (average of the relative errors over all count queries)
  3. Accuracy of instance retrievals: For each instance retrieval we collect the true positives, false positives, and false negatives to compute overall precision, recall, and F1-score. Additionally, we compute precision, recall, and F1-score for each of several choke points separately.
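The four error terms for facet counts can be sketched as follows (an illustrative Python snippet; the function name and dictionary keys are ours, chosen only to mirror the definitions above):

```python
# Illustrative computation of the four facet-count error terms:
# overall/average absolute error and overall/average relative error.

def count_errors(returned_counts, correct_counts):
    abs_errors = [abs(r - c) for r, c in zip(returned_counts, correct_counts)]
    rel_errors = [abs(r - c) / c for r, c in zip(returned_counts, correct_counts)]
    return {
        "overall_absolute": sum(abs_errors),                    # sum of all errors
        "average_absolute": sum(abs_errors) / len(abs_errors),
        "overall_relative": sum(abs_errors) / sum(correct_counts),
        "average_relative": sum(rel_errors) / len(rel_errors),
    }

# Two count queries: system returns 95 (correct: 100) and 210 (correct: 200).
errs = count_errors(returned_counts=[95, 210], correct_counts=[100, 200])
```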

Use Cases

Intelligent browsing by humans aims to find specific information under certain assumptions along temporal, spatial, or other dimensions of statistical data. “Since plain web browsers support sessional browsing in a very primitive way (just back and forth), there is a need for more effective and flexible methods that allow users to progressively reach a state that satisfies them”, as Tzitzikas et al. point out in their recent survey on faceted browsing (DOI: 10.1007/s10844-016-0413-8). The ability to efficiently perform such faceted browsing is therefore important for the exploration of most datasets, for example in human-controlled information retrieval from topic-oriented datasets. We will include a use case in which a data analyst wants to explore the characteristics of a train network (e.g., delays in a particular region at certain times of day) based on the Linked Connections dataset (see here for details).


A final version of a participating system needs to be submitted as a docker image by April 7th, 2017.