Mighty Storage Challenge – Tasks and Training Data

Task 1: RDF Data Ingestion

Summary Description

The constant growth of the Linked Data Web in velocity and volume has increased the need for triple stores to ingest streams of data and to query this data efficiently. The aim of this task is to measure the performance of SPARQL query processing systems, in terms of efficiency and completeness, when faced with streams of data from industrial machinery. The experimental setup will hence be as follows: we will increase the size and velocity of the RDF data used in our benchmarks to evaluate how well a system can store streaming RDF data obtained from industry. The data will be generated from one or multiple resources in parallel and will be inserted using SPARQL INSERT queries. This facet of triple stores has (to the best of our knowledge) never been benchmarked before. SPARQL SELECT queries will be used to test the system’s ingestion performance and storage abilities.

Testing and Training Data

The input data for this task consists of data derived from mimicking algorithms trained on real industrial datasets (see Use Cases for details). Each training dataset will include RDF triples generated within a period of time (e.g., a production cycle). Each event (e.g., each sensor measurement or tweet) will have a timestamp that indicates when it was generated. The datasets will differ in size regarding the number of triples per second. During the test, the data to be ingested will be generated using data agents (in the form of distributed threads). An agent is a data generator that is responsible for inserting its assigned set of triples into a triple store using SPARQL INSERT queries. Each agent will emulate a dataset that covers the duration of the benchmark. All agents will operate in parallel and will be independent of each other. As a result, the benchmarked storage solution will have to support concurrent inserts. The insertion of a triple is based on its generation timestamp. To emulate, within a shorter time frame, the ingestion of streaming RDF triples produced over large time periods, we will use a time dilatation factor that will be applied to the timestamps of the triples. Our benchmark allows testing the performance of the ingestion in terms of precision and recall by deploying datasets that vary in volume (number of triples and timestamps) and by using different dilatation values, numbers of agents and sizes of update queries. The testing and training data are public transport, Twitter, transportation and molding machine datasets and are available here:

Evaluation Methodology

Our evaluation consists of two KPIs: precision and recall. To compute the reference result set for each of our SELECT queries, we will compute the result set returned by a reference SPARQL implementation which passes all the micro-tests devised by the W3C. The relevant data will be loaded into that system using a bulk load. We will determine the precision, recall and F-measure achieved by the systems at hand by comparing the set of query solutions contained in their result set with the set of query solutions contained in the reference result set. Transparency will be assured by releasing the dataset generators as well as the configurations.
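
Concretely, if R denotes the set of query solutions returned by the benchmarked system for a SELECT query and G the reference result set, the standard definitions apply:

precision = |R ∩ G| / |R|,   recall = |R ∩ G| / |G|,   F-measure = 2 · precision · recall / (precision + recall)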

Availability of resources

We will provide the participants with the test data by January 13th, 2017 and ask them to submit their final system version as a docker image by April 30th, 2017. The final tests will be carried out with the docker images uploaded by the participants by April 30th, 2017, using version 1.0 of the HOBBIT platform. The results will be made public at the ESWC 2017 conference closing ceremony.

Use Cases

This task aims to reflect real loads on triple stores used in real applications. We will hence use the following datasets:

  • Public Transport Data
  • Social network data from Twitter (Twitter Data)
  • Car traffic data gathered from TomTom sensors (Transportation Data)
  • Sensor data from plastic injection moulding industrial plants of Weidmüller (Molding Machines Data)

The descriptions of the datasets can be found here.

Implementation

For this task participants must:

  • provide their solution as a docker image. First install docker using the instructions found here and then follow the guide on how to create your own docker image found here.
  • provide a SystemAdapter class in their preferred programming language. The SystemAdapter is the main component that establishes the communication between the other benchmark components and the participant’s system. The functionality of a SystemAdapter is divided into five steps:
    • Initialization of the storage system
    • Retrieval of triples in the form of an INSERT SPARQL query and insertion of the aforementioned triples into the storage.
    • Retrieval of string representation of the graph name of each data generator.
    • Retrieval and execution of SELECT SPARQL queries against the storage system, followed by sending the results to the EvaluationStorage component.
    • Shut down of the storage system.
      For more information on how to create a SystemAdapter please follow the instructions here.
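
For participants using Java, these steps map directly onto the methods of the AbstractSystemAdapter class described in the Java API section further below. A bare skeleton (the class name and comments are illustrative only) might look as follows:

public class MySystemAdapter extends AbstractSystemAdapter {

    @Override
    public void init() throws Exception {
        super.init();
        // start the storage system (e.g., create and start its docker container)
    }

    @Override
    public void receiveGeneratedData(byte[] data) {
        // phase 2: receive the graph name of a data generator
        // phase 3: receive an INSERT SPARQL query and insert its triples into the storage
    }

    @Override
    public void receiveGeneratedTask(String taskId, byte[] data) {
        // execute the received SELECT SPARQL query and send the results to the EvaluationStorage
    }

    @Override
    public void close() throws IOException {
        // shut down the storage system
        super.close();
    }
}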
  • provide a storage system that processes INSERT SPARQL queries. The data insertion will be performed via INSERT SPARQL queries that will be generated by different data generators. The triple store must be able to handle multiple INSERT SPARQL queries at the same time, since each data generator runs independently of the others. First, an INSERT query will be created by a data generator using an Apache Jena RDF model (dataModel) that includes a set of triples generated at a particular point in time. The INSERT query will be created, saved into a file and then transformed into a byte array using the following Java commands. Please note that each data generator will perform INSERT queries against its own graph, so the INSERT query will include the GRAPH clause with the name of the corresponding graph.
// Build the INSERT query from the Jena model and serialise it into a file.
UpdateRequest insertQuery = UpdateRequestUtils.createUpdateRequest(dataModel, ModelFactory.createDefaultModel());
OutputStream outStream = null;
String fileName = "insertQuery.sparql";
try {
    outStream = new FileOutputStream(fileName);
} catch (FileNotFoundException e) {
    e.printStackTrace();
}
IndentedWriter out = new IndentedWriter(outStream);
insertQuery.output(out);

// Read the serialised query back as a UTF-8 string.
String fileContent = null;
try {
    fileContent = FileUtils.readFileToString(new File(fileName), Charsets.UTF_8);
} catch (IOException e) {
    e.printStackTrace();
}

// Wrap the query string into a length-prefixed byte array and send it to the SystemAdapter.
byte[][] insertData = new byte[1][];
insertData[0] = RabbitMQUtils.writeByteArrays(new byte[][] { RabbitMQUtils.writeString(fileContent) });
sendDataToSystemAdapter(RabbitMQUtils.writeByteArrays(insertData));

The fileContent variable contains a UTF-8 String representation of the INSERT query. The function RabbitMQUtils.writeString(String str) creates a byte array representation of the given String str using UTF-8 encoding. The function RabbitMQUtils.writeByteArrays(byte[][] array) returns a byte array containing all given input arrays and places their lengths in front of them. Then, insertData will be sent to the SystemAdapter by the data generator using the function sendDataToSystemAdapter(byte[] insertData). The SystemAdapter must be able to receive the INSERT query as a byte array and then transform it into a UTF-8 encoded String. If a participant chooses Java as their programming language, the received byte array can be wrapped into a buffer using ByteBuffer buffer = ByteBuffer.wrap(insertData) and then read with RabbitMQUtils.readString(buffer) to obtain the UTF-8 encoded String of the INSERT query. Then, the query must be performed against the storage system.

  • provide a storage solution that can process the UTF-8 string representation of the name of the graph against which the SELECT and INSERT queries will be performed. A graph name will be converted by each data generator using the command byte[] insertData = RabbitMQUtils.writeByteArrays(new byte[][] { RabbitMQUtils.writeString(graphName) }); and then sent to the SystemAdapter using sendDataToSystemAdapter(byte[] insertData).
  • provide a storage solution that can process SELECT SPARQL queries. The performance of a storage system will be explored using SELECT SPARQL queries. Each SELECT query will be executed after a predefined number of INSERT queries has been performed against the system (a configurable platform parameter). A SELECT query will be created by the corresponding data generator and then sent to a task generator, along with the expected results, all combined in the form of a byte array. The task generator will read both the SELECT query and the expected results, and will send the SELECT query as a byte array to the SystemAdapter along with the task’s unique identifier, which is a UTF-8 encoded String. The expected results will also be read by the task generator and sent to the evaluation storage component as a byte array.
    The SystemAdapter must be able to receive the SELECT query as a byte array and then transform it into a UTF-8 encoded String (as described above for the Java case). Then, the SystemAdapter performs the SELECT query against the storage system and the retrieved results must be serialised into JSON format (abiding by the W3C standard https://www.w3.org/TR/rdf-sparql-json-res/). The JSON strings should be UTF-8 encoded. Finally, the results in JSON will be transformed into a byte array and sent to the evaluation storage along with the task’s unique identifier. Please note that each SELECT query will include the GRAPH clause with the URI of the corresponding graph into which the data must be inserted.
  • provide any necessary parameters to their systems that grant access for inserting triples into their storage system.

The following example is a description of HOBBIT’s API for participants that use Java as their programming language.

Firstly, read the article here, which will give you a general idea of how to develop a SystemAdapter in Java.

As explained in the aforementioned article, the SystemAdapter class in Java must extend the abstract class AbstractSystemAdapter.java (https://github.com/hobbit-project/core/blob/master/src/main/java/org/hobbit/core/components/AbstractSystemAdapter.java).

A SystemAdapter must override the following methods:

  • public void init() throws Exception{}: this method is responsible for initializing the storage system by executing a command that starts the system’s docker container. First, this function must call super.init(). Then, the participant can run the container by calling the function createContainer(String imageName, String[] environmentVariables). This function returns the name of the container that is running the system. The participant must store this value in order to be able to terminate the system’s container later. The environment variables should be given to the function in the form String[]{“key1=value1”, “key2=value2”, ...}; a minimal sketch is shown below.
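
A minimal sketch of such an init() method (the image name and environment variables are illustrative placeholders for the participant’s own system):

private String containerName;

@Override
public void init() throws Exception {
    super.init();
    // Start the container that runs the participant's storage system
    // (hypothetical image name and environment variables).
    String[] envVariables = new String[] { "SPARQL_PORT=8890" };
    containerName = createContainer("example/my-triple-store", envVariables);
}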
  • public void receiveGeneratedData(byte[] arg0){}: this method is responsible for two tasks:
    • During Phase 2, the SystemAdapter must be able to receive from each data generator the URI of the graph into which that data generator will insert data and from which it will select data. The parameter arg0 contains the URI of the graph of each data generator. The data generator will create a UTF-8 String representation of the graph’s name using the command:
byte[] insertData = RabbitMQUtils.writeByteArrays(new byte[][] { RabbitMQUtils.writeString(graphName) });
    • During Phase 3, the SystemAdapter must be able to insert data into the storage system. The parameter arg0 contains the corresponding INSERT SPARQL query generated by a data generator. The INSERT query will be created, saved into a file and then transformed into a byte array using the commands described above.

It is suggested that participants use the following commands in the body of the receiveGeneratedData function to convert the byte array to a String:

ByteBuffer buffer = ByteBuffer.wrap(insertData);   
String insertQuery = RabbitMQUtils.readString(buffer); 

Here, ByteBuffer.wrap(arg0) wraps the byte array into a buffer, and RabbitMQUtils.readString(buffer) transforms the buffer into a UTF-8 encoded String.

In order for the system to discriminate between Phases 2 and 3, we suggest using a boolean field (phase2) inside the SystemAdapter class that takes the value true until the BULK_LOAD_DATA_GEN_FINISHED signal is received, and false afterwards. In this manner, the receiveGeneratedData function can choose which task to perform. Note that after BULK_LOAD_DATA_GEN_FINISHED is received, the SystemAdapter must also send a BULK_LOADING_DATA_FINISHED signal to the controller. For the SystemAdapter to receive the BULK_LOAD_DATA_GEN_FINISHED signal, it must override the function public void receiveCommand(byte command, byte[] data). Here is an example:

@Override
public void receiveCommand(byte command, byte[] data) {
    if (VirtuosoSystemAdapterConstants.BULK_LOAD_DATA_GEN_FINISHED == command) {
        phase2 = false;
        try {
            // Acknowledge that bulk loading has finished.
            sendToCmdQueue(VirtuosoSystemAdapterConstants.BULK_LOADING_DATA_FINISHED);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    super.receiveCommand(command, data);
}
  • public void receiveGeneratedTask(String arg0, byte[] arg1){}: this method is responsible for performing a SELECT SPARQL query against the storage system. The parameter arg0 contains the taskID of that particular SELECT task and the parameter arg1 represents the SELECT query in the form of a byte array. The taskID and the SELECT query will be sent to the SystemAdapter as a String and a byte array, respectively, from a task generator using the following command:
sendTaskToSystemAdapter(String taskID, byte[] task);

The participants should use the commands mentioned above in the receiveGeneratedData description to convert the task into a String.

Once the SELECT query is executed and the system’s response is received inside the receiveGeneratedTask body, the result set must be converted to JSON format that abides by the W3C standard. If the results of the SELECT query are an instance of the org.apache.jena.query.ResultSet class, then the conversion to JSON can be done using the commands:

ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
ResultSetFormatter.outputAsJSON(outputStream, results);


Once the results are serialised to JSON format, they must be sent to the evaluation storage as a byte[] along with the taskID.

sendResultToEvalStorage(String taskID, byte[] data); 

If a participant has used the previous commands to convert the results into JSON, then they can use the following command to convert them into a byte[]:

byte[] data = outputStream.toByteArray();
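
Putting the pieces above together, a minimal receiveGeneratedTask might look as follows (the SPARQL endpoint URL is an illustrative placeholder; participants access their own system however it is exposed):

@Override
public void receiveGeneratedTask(String taskId, byte[] data) {
    // Decode the SELECT query sent by the task generator.
    ByteBuffer buffer = ByteBuffer.wrap(data);
    String selectQuery = RabbitMQUtils.readString(buffer);
    // Execute the query against the storage system (hypothetical endpoint URL).
    try (QueryExecution qe = QueryExecutionFactory.sparqlService("http://localhost:8890/sparql", selectQuery)) {
        ResultSet results = qe.execSelect();
        // Serialise the result set to W3C-compliant JSON.
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        ResultSetFormatter.outputAsJSON(outputStream, results);
        // Send the JSON results to the evaluation storage together with the task ID.
        sendResultToEvalStorage(taskId, outputStream.toByteArray());
    } catch (Exception e) {
        e.printStackTrace();
    }
}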
  • public void close() throws IOException{}: this method is responsible for shutting down the storage system so that it will not receive any further SELECT or INSERT queries. First, the participants can stop the container by calling the function stopContainer(String containerName). Then, this function must call super.close(). As containerName, the participant must use the container name obtained in the init() function.

For further explanation of the benchmark components, please read this tutorial: https://github.com/hobbit-project/platform/wiki/Develop-a-system-adapter  and https://github.com/hobbit-project/platform/wiki/Develop-a-system-adapter-in-Java

Evaluation

Our evaluation consists of three KPIs:

  • Recall: The INSERT queries created by each data generator will be sent to the triple store (note that the insertion of triples via INSERT queries will not be done at equal time intervals but based on their real generation time stamps, emulating a realistic scenario). After a stream of INSERT queries has been performed against the triple store, a SELECT query will be created by the corresponding data generator. The SELECT query will be sent to the task generator along with the expected answers. Then, the task generator will send the SELECT query to the SystemAdapter and the expected results to the evaluation storage. As explained above, once the SystemAdapter performs the SELECT query against the triple store, it sends the retrieved results to the evaluation storage as well. At the end of each experiment, we will compute the recall of each SELECT query by comparing the expected and retrieved results, as well as the micro and macro average recall of the whole benchmark. The expected results for each SELECT query will be computed prior to the system evaluation by inserting the data into and querying an instance of the Jena TDB storage solution.
  • Triples per second: at the end of each stream, once the corresponding SELECT query has been performed against the system, we will measure the triples per second as the total number of triples that were inserted during that stream divided by the total time needed for those triples to be inserted (start time of the SELECT query minus start time of the first INSERT query of the stream); see the formula after this list.
  • Average answer time: we will report the average delay between the time stamp at which the SELECT query was executed and the time stamp at which the results are sent to the evaluation storage. No additional effort is needed from the participants to calculate the aforementioned time stamps, since the first time stamp is generated by the task generator and the second by the evaluation storage.
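
In other words, for a stream during which N triples are inserted:

triples per second = N / (t_SELECT − t_first INSERT)

where t_SELECT is the start time of the stream’s SELECT query and t_first INSERT is the start time of the stream’s first INSERT query.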

Transparency will be assured by releasing the dataset generators as well as the configurations.

Task 2: Data Storage Benchmark

Summary Description

In this task, we will develop an RDF benchmark that measures how datastores perform with interactive, simple read SPARQL queries as well as with complex, business intelligence (BI) queries. Running the queries will be accompanied by a high insert rate of data (SPARQL UPDATE queries), in order to mimic real use cases where READ and WRITE operations are bundled together. Typical bulk loading scenarios will also be supported. The queries and query mixes will be designed to stress the system under test in different choke-point areas, while being credible and realistic.

Testing and Training Data

The LDBC Social Network Benchmark will be used as a starting point for this benchmark. The dataset generator developed for the aforementioned benchmark will be modified in order to produce synthetic RDF datasets available in different sizes, but more realistic and more RDF-like. The structuredness of the dataset will be in line with real-world RDF datasets, unlike the LDBC Social Network Benchmark dataset, which is designed to be more generic and very well structured. The output of the generator will be split into three parts: the dataset that should be loaded by the system under test, a set of update streams containing update queries and a set of files containing the different parameter bindings that will be used by the driver to generate the read queries of the workloads. The data for this task are available at: ftp://hobbitdata.informatik.uni-leipzig.de/mighty-storage-challenge/Task2/

Evaluation Methodology

After generating the dataset of the desired size, the whole dataset will be bulk loaded and the loading time will be measured. Running the benchmark consists of three separate parts: validating the query implementations, warming up the database and performing the benchmark run. The queries are validated by means of the official validation datasets that we will provide. The auditor must load the provided dataset and run the driver in validation mode, which will test that the queries return the official results. The warm-up will be performed using the driver. A valid benchmark run must last at least 2 hours of simulation time (datagen time). The only relevant KPI is throughput. The challenge participant may specify a different target throughput to test, by “squeezing” together or “stretching” apart the queries of the workload. This is achieved by means of the “Time Compression Ratio”, which multiplies the frequencies of the queries.

Availability of resources

We will provide the driver responsible for executing the whole workload, but each triple store provider must add the missing parts of it, such as the connection to the store, initialization, loading scripts, etc. We will provide the participants with the final version of the data generator by January 13th, 2017 and ask them to submit their final system version as a docker image by April 30th, 2017.

Use Cases

The use case of this task is an online social network since it is the most representative and relevant use case of modern graph-like applications. A social network site represents a relevant use case for the following reasons:

  • It is simple to understand for a large audience, as it is arguably present in our everyday life in different shapes and forms.
  • It allows testing a complete range of interesting challenges, by means of different workloads targeting systems of different nature and characteristics.
  • A social network can be scaled, allowing the design of a scalable benchmark targeting systems of different sizes and budgets.

Implementation

Participants must:

  • provide their solution as a docker image (same as in Task 1)
  • provide a SystemAdapter class that:
    • Receives generated data that come from Data Generators
      The implemented SystemAdapter has to retrieve the triples representing the dataset to be bulk loaded from RDF files (in standard Turtle format). The receiveGeneratedData(byte[] data) method should handle the received data in the same way as in Task 3.
    • Receives tasks that come from Task Generators:
      There are two different types of tasks:

      1. SPARQL SELECT queries: there are 14 different query types, with many query parameters, that should be executed in a specific order according to a specified query mix
      2. SPARQL UPDATE queries: there are 8 different types of updates, inserting different types of entities into the triple store

All the queries are written in the SPARQL 1.1 standard.
The receiveGeneratedTask(String tId, byte[] data) method should handle the received data as follows (a decoding sketch is given after this list):

  • First 4 bytes → length of query text (queryTextLength)
  • Next queryTextLength bytes → the query text
    • Sends the results to the evaluation storage as a byte array, for:
      1. SPARQL SELECT query:
        • First 4 bytes → length of query execution time (executionTimeLength)
        • Next executionTimeLength bytes → the query execution time (in ms)
        • Next 4 bytes → length of total number of results (resultRowCountLength)
        • Next resultRowCountLength bytes → the total number of returned results
        • Next 4 bytes → length of query’s result set (resultLength)
        • Next resultLength bytes → the result set in JSON format, as described in Task1
      2. SPARQL UPDATE query:
        • First 4 bytes → length of query execution time (executionTimeLength)
        • Next executionTimeLength bytes → the query execution time (in ms)
  • provide a storage solution that can handle SELECT and UPDATE SPARQL queries
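
The byte layouts above map naturally onto java.nio.ByteBuffer. The following Java sketch assumes that the 4-byte length fields are big-endian integers and that the execution time and result count are transmitted as UTF-8 strings; the variable names executionTimeMs, resultRowCount and resultSetJson are illustrative placeholders for values computed by the participant’s system:

// Decode an incoming task (layout: 4-byte length followed by the query text).
ByteBuffer in = ByteBuffer.wrap(data);
byte[] queryBytes = new byte[in.getInt()];
in.get(queryBytes);
String queryText = new String(queryBytes, StandardCharsets.UTF_8);

// Encode a SELECT result (execution time, result count and JSON result set,
// each preceded by the 4-byte length of its UTF-8 representation).
byte[] execTime = Long.toString(executionTimeMs).getBytes(StandardCharsets.UTF_8);
byte[] rowCount = Long.toString(resultRowCount).getBytes(StandardCharsets.UTF_8);
byte[] json = resultSetJson.getBytes(StandardCharsets.UTF_8);
ByteBuffer out = ByteBuffer.allocate(12 + execTime.length + rowCount.length + json.length);
out.putInt(execTime.length).put(execTime);
out.putInt(rowCount.length).put(rowCount);
out.putInt(json.length).put(json);
byte[] resultPayload = out.array(); // to be passed to sendResultToEvalStorage(tId, resultPayload)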

Evaluation

After generating the dataset of the desired size, the whole dataset will be bulk loaded and the loading time will be measured. Running the benchmark consists of three separate parts: validating the query implementations, warming up the database and performing the benchmark run. The queries are validated by means of the official validation datasets that we will provide. The warm-up will be performed before the official measurement. The relevant KPIs are throughput (queries per second) and bulk loading time.

Task 3: Versioning RDF Data

Summary Description

The evolution of datasets often requires storing different versions of the same dataset, so that interlinked datasets can refer to older versions of an evolving dataset and upgrade at their own pace, if at all. Supporting the functionality of accessing and querying past versions of an evolving dataset is the main challenge for archiving/versioning systems. In this sub-challenge we propose a benchmark that will be used to test the ability of versioning systems to efficiently manage evolving datasets and to evaluate queries across the multiple versions of said datasets.

Testing and Training Data

The Semantic Publishing Benchmark (SPB) generator will be used to produce datasets and versions thereof. SPB was developed in the context of the LDBC project and is inspired by the Media/Publishing industry, and in particular by BBC’s “Dynamic Semantic Publishing” (DSP) concept. We will use the SPB generator, which uses ontologies and reference datasets provided by BBC, to produce sets of creative works. Creative works are metadata represented in RDF about real-world events (e.g., sport events, elections). The data generator supports the creation of arbitrarily large RDF datasets in the order of billions of triples that mimic the characteristics of the real BBC datasets. Data generation follows three principles: clustering of data, in which the number of creative works produced diminishes as time goes by; correlation of entities, where two or three entities are used to tag creative works for a fixed period of time; and, last, random tagging of entities. The data generator follows distributions that have been obtained from real-world datasets, thereby producing data that bear similar characteristics to real ones. The versioning benchmark that will be used in this sub-challenge includes datasets and versions thereof that respect the aforementioned principles.

The data that will be used by the participants for training purposes has been produced in such a way as to guarantee coverage of a broad spectrum of use cases (many changes on small graphs, few changes on large graphs, etc.).

In more detail, the training data consist of 12 datasets which are located at ftp://hobbitdata.informatik.uni-leipzig.de/mighty-storage-challenge/Task3. These datasets have the following characteristics:

  • Datasets of different Scale Factors (including triples of all versions)
    • SF0: 1M triples (the generated creative works cover a period of 1 year starting from 2016).
    • SF4: 16M triples (the generated creative works cover a period of 2 years starting from 2015).
    • SF7: 128M triples (the generated creative works cover a period of 5 years starting from 2012).
    • SF10: 1B triples (the generated creative works cover a period of 10 years starting from 2007).
  • Varying number of versions
    • 10
    • 50
    • 100

Each one of the generated datasets is identified by a name of the form “generated_sf[SF#]v[V#]”. For example, a dataset containing 1M triples (SF0) in 10 versions is identified by “generated_sf0v10.tar.gz“. Each version can be found in its own directory, from V0 to Cend. In particular, V0 denotes the starting dataset and every C1, C2, …, Cend the change set with respect to the previous version (a set of added triples in our case). So each version is computed as follows: V1 = V0 + C1, V2 = V0 + C1 + C2, …, Vend = V0 + C1 + … + Cend.

A more detailed analysis of the number of triples in each generated version/changeset (including the sizes of V0 to Cend) can be found in data_stats.xlsx (one sheet per dataset).

Evaluation Methodology

To test the ability of a versioning system to store multiple versions of datasets, our versioning benchmark will produce versions of an initial dataset using as parameters (a) the number of required versions and (b) the total number of triples (including triples of all versions). The number of versions will be specified by the user of the benchmark, who will also be able to specify the starting time as well as the duration of the generated data. In this manner, the user will be able to check how well the storage system can address the requirements raised by the nature of the versioned data. The benchmark will include a number of queries that will be used to test the ability of the system to answer historical and cross-version queries. These queries will be specified in terms of the ontology of the Semantic Publishing Benchmark and written in SPARQL 1.1.

In our evaluation we will focus on 3 KPIs:

  • Storage space: We will measure the space required to store the different versioned datasets that we will use in our experiments. This KPI is essential to understand whether the system can choose the best strategy (e.g. full materialization, delta-based, annotated triples or hybrid) for storing the versions and how well such a strategy is implemented.
  • Ingestion time: We will measure the time that a system needs to store a newly arriving version. This KPI is essential to quantify the possible overhead of complex computations, such as delta computation, during data ingestion.
  • Query performance: For each of the eight versioning query types (e.g. version materialization, single version queries, cross-version queries etc.) we will measure the average time required to answer the benchmark queries.

Availability of resources

We will provide the driver responsible for executing the whole workload, but each versioning system must add the missing parts of it, such as the connection to the store, initialization, loading scripts, etc. We will provide the participants with the final version of the data generator by January 13th, 2017 and ask them to submit their final system version as a docker image by April 30th, 2017.

Use Cases

The use cases that are considered by this benchmark are those that address versioning problems. Such use cases span different domains and applications of interest, such as the energy domain, semantic publishing, biology, etc. For this task we will employ data from the semantic publishing domain.

Implementation

Participants must:

  • provide their solution as a docker image (same as in Task 1).
  • provide a SystemAdapter class that:
    • (Same as in Task 1, except for the way that new triples are retrieved.)
      • The implemented SystemAdapter has to retrieve the triples that compose the new versions from RDF files (in a standard RDF format, such as N-Triples or RDF/XML) so that they can be loaded into the benchmarked systems.
      • More specifically, the receiveGeneratedData(byte[] data) method should handle the received data as follows (a decoding sketch is given after this list):
        • First 4 bytes → length of the file name path (fileNameLength)
        • Next fileNameLength bytes → the file name path
        • Next 4 bytes → length of the file content (fileContentLength)
        • Next fileContentLength bytes → the file content
    • Receives tasks that come from Task Generators:
      • There are three types of tasks:
        1. Ingestion task: as many tasks as the total number of versions
          • Query text: “Version X, Ingestion task”
        2. Storage space task: one task to get the total disk space used for storing all of the dataset’s versions
          • Query text: “Storage space task”
        3. Query performance task: SELECT SPARQL queries of 8 different types (there may be more than one query per query type, depending on the benchmark parameter querySubstitutionParameters)
          • Query text: the SPARQL query
      • The receiveGeneratedTask(String tId, byte[] data) method should handle the received data as follows:
        • First 4 bytes→ length of task type (taskTypeLength)
        • Next taskTypeLength bytes → the task type (1, 2 or 3 as previously defined)
        • Next 4 bytes → length of query text (queryTextLength)
        • Next queryTextLength bytes → the query text as previously defined
      • Tasks of type 3 (SPARQL queries) are written in SPARQL 1.1, assuming that each version is stored in its own named graph. Systems that follow a different storage implementation or use their own enhanced version of SPARQL to query versions have to rewrite the queries accordingly.
      • Depending on the task type, the results should be sent to the evaluation storage as a byte array structured as follows:
        • If 1:
          • First 4 bytes → length of the task type (taskTypeLength)
          • Next taskTypeLength bytes → the task type (1, 2 or 3 as previously defined)
          • Next 4 bytes → length of the total number of triples after the version’s load (totalTriplesLength)
          • Next totalTriplesLength bytes → the total number of triples after loading of the current version has finished
          • Next 4 bytes → length of the loading time (loadingTimeLength)
          • Next loadingTimeLength bytes → the total time (in ms) required to load the current version
        • If 2:
          • First 4 bytes→ length of task type (taskTypeLength)
          • Next taskTypeLength bytes →the task type (1, 2 or 3 as previously defined)
          • Next 4 bytes → length of the total storage space cost (totalStorageSpaceLength)
          • Next totalStorageSpaceLength bytes → the total storage space cost (in bytes) for storing all versions
        • If 3:
          • First 4 bytes→ length of task type (taskTypeLength)
          • Next taskTypeLength bytes →the task type (1, 2 or 3 as previously defined)
          • Next 4 bytes → length of query type (queryTypeLength)
          • Next queryTypeLength bytes → the query type (of the eight different ones)
          • Next 4 bytes → length of query execution time (executionTimeLength)
          • Next executionTimeLength bytes → the query execution time (in ms)
          • Next 4 bytes → length of total number of results (resultRowCountLength)
          • Next resultRowCountLength bytes → the total number of returned results
          • Next 4 bytes → length of query’s result set (resultLength)
          • Next resultLength bytes → the result set in JSON format, as described in Task1
  • provide a storage solution that can handle SELECT SPARQL queries on top of versioned data. Such SPARQL queries will be produced by the task generator based on a set of templates that correspond to eight versioning query types. The task generator will send the queries as byte arrays to the SystemAdapter along with the task’s unique identifier.
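
As in Task 2, the layouts above can be handled with java.nio.ByteBuffer. A minimal sketch (assuming big-endian 4-byte length fields) of decoding the data passed to receiveGeneratedData; the result payloads for the three task types can be assembled analogously to the Task 2 sketch:

// Decode a generated-data message (layout: file name length, file name,
// file content length, file content).
ByteBuffer buffer = ByteBuffer.wrap(data);
byte[] nameBytes = new byte[buffer.getInt()];
buffer.get(nameBytes);
String fileName = new String(nameBytes, StandardCharsets.UTF_8);
byte[] fileContent = new byte[buffer.getInt()];
buffer.get(fileContent);
// Write the received RDF file to disk so that it can be loaded into the versioning system.
try {
    Files.write(Paths.get(fileName), fileContent);
} catch (IOException e) {
    e.printStackTrace();
}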

Evaluation

In our evaluation we will focus on the following KPIs:

  • Ingestion time: We will measure the time that a system needs to store a newly arriving version. This KPI is essential to quantify the possible overhead of complex computations, such as delta computation, during data ingestion.
    • Initial version ingestion speed (triples per second)
    • Average ingestion speed of new versions (applied changes per second)
  • Storage space: We will measure the space required to store the different versioned datasets that we will use in our experiments. This KPI is essential to understand whether the system can choose the best strategy (e.g. full materialization, delta-based, annotated triples or hybrid) for storing the versions and how well such a strategy is implemented. After the completion of the data ingestion process, systems are responsible for reporting the overall storage space required (based on the solution they use to store the underlying data).
    • Total storage space cost
  • Query performance: For each of the eight versioning query types (e.g. version materialization, single version queries, cross-version queries etc.) we will measure the average time required to answer the benchmark queries.
    • Average execution time of queries of type X, where X = 1, 2, … 8

Task 4: Faceted Browsing Benchmark

Summary Description

Faceted browsing stands for a session-based (state-dependent) interactive method for query formulation over a multi-dimensional information space. It provides a user with an effective way to explore a search space. After having defined the initial search space, i.e., the set of resources of interest to the user, a browsing scenario consists of applying (or removing) filter restrictions on object-valued properties or of changing the range of a number-valued property. Using such operations, aimed at selecting resources with desired properties, the user browses from state to state, where a state consists of the currently chosen facets and facet values and the current set of instances satisfying all chosen constraints. The task on faceted browsing checks existing solutions for their capabilities of enabling faceted browsing through large-scale RDF datasets; that is, it analyses their efficiency in navigating through large datasets, where the navigation is driven by intelligent iterative restrictions. We aim to measure the performance relative to dataset characteristics, such as overall size and graph characteristics.

Testing and training data

For this task, the transport dataset of linked connections will be used. The transport dataset is provided by a data generator and consists of train connections modelled using the transport ontology following GTFS (General Transit Feed Specification) standards – see here for more details. The datasets may be generated in different sizes, while the underlying ontology remains the same – see here  for a visualization of the ontology relevant to the task.

A participating system is required to answer a sequence of SPARQL queries, which simulate browsing scenarios through the underlying dataset. The browsing scenarios are motivated by the natural navigation behaviour of a user (such as a data scientist) through the data, as well as by the aim of checking participating systems on certain choke points. The queries involve temporal (time slices), spatial (different map views) and structural (ontology-related) aspects.

For training we provide a dataset of triples in Turtle format coming from our generator, as well as a list of SPARQL queries for sample browsing scenarios. Two scenarios are similar to the ones used in the testing phase, while a third is meant to illustrate all the possible choke points that we aim to test. The training data are available at: ftp://hobbitdata.informatik.uni-leipzig.de/mighty-storage-challenge/Task4/

(21.02.2017: Please note that there has been a slight change in the training data. Latitude and longitude values are now modelled as xsd:decimal.)

Next to the training dataset that we provide, we will make use of the Transport Disruption Ontology.

A list of possible choke points for participating systems can be found here.

Evaluation Methodology

At each state of the simulated browsing scenario through the dataset, two types of queries are to be answered correctly:

  1. Facet counts (in form of SPARQL SELECT COUNT queries):
    For a specific facet, we ask for the number of instances that remain relevant after restriction over this facet. To increase efficiency, approximate counts (e.g. obtained by different indexing techniques) may be returned by a participating system.
  2. Instance retrieval (in form of SPARQL SELECT queries):
    After selecting a certain facet as a further filter on the solution space, the remaining instances are required to be returned.

One browsing scenario consists of between 8 and 11 changes of the solution space (instance retrievals), where each step may be the selection of a certain facet, a change in the range value of a literal property (which may be indirectly related through a complex property path), or the action of undoing a previously chosen facet or range restriction.

The evaluation is based on the following performance KPIs.

  1. Time: The time required by the system is measured for the two tasks, facet count and instance retrieval, separately. The results are returned as a score computing the number of answered queries per second. For the instance retrieval queries, we additionally compute the queries-per-second score for several choke points separately.
  2. Accuracy of counts: The facet counts are checked for correctness. For each facet count, we record the distance of the returned count to the correct count in terms of absolute value, and we record the error in relation to the size of the solution space (relative error). We both sum and average over all steps of the browsing scenario, resulting in four overall error terms (defined formally after this list):
    1. overall absolute error (sum of all errors)
    2. average absolute error
    3. overall relative error (sum of all errors over sum of all counts)
    4. average relative error (average of relative error over all count queries)
  3. Accuracy of instance retrievals: For each instance retrieval we collect the true positives, the false positives and the false negatives to compute an overall precision, recall and F1-score. Additionally, we compute precision, recall and F1-score for each of several choke points separately.
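
Following the descriptions above, if c_i denotes the count returned for the i-th of the n facet count queries and t_i the correct count, the four error terms are:

overall absolute error = Σ_i |c_i − t_i|
average absolute error = (1/n) · Σ_i |c_i − t_i|
overall relative error = (Σ_i |c_i − t_i|) / (Σ_i t_i)
average relative error = (1/n) · Σ_i (|c_i − t_i| / t_i)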

Use Cases

Intelligent browsing by humans aims to find specific information under certain assumptions along temporal, spatial or other dimensions of statistical data. “Since plain web browsers support sessional browsing in a very primitive way (just back and forth), there is a need for more effective and flexible methods that allow users to progressively reach a state that satisfies them”, as Tzitzikas et al. point out in their recent survey on faceted browsing (DOI: 10.1007/s10844-016-0413-8). The ability to efficiently perform such faceted browsing is therefore important for the exploration of most datasets, for example in human-controlled information retrieval from topic-oriented datasets. We will include a use case in which a data analyst wants to explore the characteristics of a train network (e.g. delays in a particular region at certain times of the day) based on the Linked Connections dataset (see here for details).

Implementation

Participants must:

  • provide their solution as a docker image (as in Task 1)
  • provide a SystemAdapter receiving and executing SELECT and SELECT COUNT SPARQL queries and subsequently sending the results to the EvaluationStorage in the formats defined below. The platform sends out a new query to the system adapter only after a reply to the former query has been recorded.
  • the incoming SPARQL queries have to be read from incoming byte arrays as defined by the task queue, where the ‘data’ part of the byte array contains the SPARQL query as a UTF-8 encoded String.
  • for instance retrieval (SELECT) queries, the result list should be returned as a byte array following the result queue standard, with the ‘data’ part of the byte array containing a UTF-8 String with the results as a comma-separated list of URIs. The byte array needs to be sent to the evaluation storage.
  • for facet count (SELECT COUNT) queries, the result should be returned as a byte array following the result queue standard, with the ‘data’ part consisting of the count (integer) value as a UTF-8 encoded String.
  • A participating system needs to answer SPARQL `SELECT’ and `SELECT COUNT’ queries. In particular, systems have to correctly interpret the notation rdfs:subClassOf*, denoting a property path of zero or more occurrences of rdfs:subClassOf. A sketch of how a Java-based SystemAdapter could produce the two result formats is given after this list.
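
A minimal Java sketch of producing the two result formats above (the SPARQL endpoint URL and the way the two query types are distinguished are illustrative assumptions, not prescribed by the platform):

@Override
public void receiveGeneratedTask(String taskId, byte[] data) {
    // Decode the incoming SPARQL query (the 'data' part contains a UTF-8 String).
    String query = RabbitMQUtils.readString(ByteBuffer.wrap(data));
    try (QueryExecution qe = QueryExecutionFactory.sparqlService("http://localhost:8890/sparql", query)) {
        ResultSet results = qe.execSelect();
        String payload;
        if (query.contains("COUNT")) {
            // Facet count: return the single count value as a UTF-8 String.
            payload = results.hasNext()
                    ? results.next().get(results.getResultVars().get(0)).asLiteral().getLexicalForm()
                    : "0";
        } else {
            // Instance retrieval: return the result URIs as a comma-separated list.
            StringBuilder sb = new StringBuilder();
            while (results.hasNext()) {
                if (sb.length() > 0) {
                    sb.append(",");
                }
                sb.append(results.next().getResource(results.getResultVars().get(0)).getURI());
            }
            payload = sb.toString();
        }
        sendResultToEvalStorage(taskId, RabbitMQUtils.writeString(payload));
    } catch (Exception e) {
        e.printStackTrace();
    }
}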

Evaluation

During the simulated browsing scenario through the dataset, two types of queries are to be answered correctly:

  • Facet counts (in form of SPARQL SELECT COUNT queries):
    For a specific facet, we ask for the number of instances that remain relevant after restriction over this facet. To increase efficiency, approximate counts (e.g. obtained by different indexing techniques) may be returned by a participating system.
  • Instance retrieval (in form of SPARQL SELECT queries):
    After selecting a certain facet as a further filter on the solution space, the actual remaining instances are required to be returned.

One browsing scenario consists of between 8 and 11 changes of the solution space (instance retrievals), where each step may be the selection of a certain facet, a change in the range value of a literal property (which may be indirectly related through a complex property path), or the action of undoing a previously chosen facet or range restriction.

The evaluation is based on the following performance measures:

  • Time: The time required by the system is measured for the two tasks, facet count and instance retrieval, separately. The results are returned as a score computing the number of answered queries per second. For the instance retrieval queries, we additionally compute the queries-per-second score for several choke points separately.
  • Accuracy of counts: The facet counts are checked for correctness. For each facet count, we record the distance of the returned count to the correct count in terms of absolute value, and we record the error in relation to the size of the solution space (relative error). We both sum and average over all steps of the browsing scenario, resulting in four overall error terms:
    1. overall absolute error (sum of all errors)
    2. average absolute error
    3. overall relative error (sum of all errors over sum of all counts)
    4. average relative error (average of relative error over all count queries)
  • Accuracy of instance retrievals: For each instance retrieval we collect the true positives, the false positives and the false negatives to compute an overall precision, recall and F1-score. Additionally, we compute precision, recall and F1-score for several choke points separately.

Participation

A final version of a participating system needs to be submitted as a docker image by April 30th, 2017.