MOCHA2018 – Tasks and Training Data

Task 1: RDF Data Ingestion

Summary Description

The constant growth of the Linked Data Web in velocity and volume has increased the need for triple stores to ingest streams of data and to perform queries on this data efficiently. The aim of this task is to measure the performance of SPARQL query processing systems when faced with streams of data from industrial machinery, in terms of efficiency and completeness. The experimental setup will hence be as follows: we will increase the size and velocity of the RDF data used in our benchmarks to evaluate how well a system can store streaming RDF data obtained from industry. The data will be generated from one or multiple resources in parallel and will be inserted using SPARQL INSERT queries. This facet of triple stores has (to the best of our knowledge) never been benchmarked before. SPARQL SELECT queries will be used to test the system’s ingestion performance and storage abilities. The components of the benchmark for this task are implemented in Java.

Testing and Training Data

The input data for this task consists of data derived from mimicking algorithms trained on real industrial datasets (see Use Cases for details). Each training dataset will include RDF triples generated within a period of production (e.g., a production cycle). Each event (e.g., each sensor measurement or tweet) will have a timestamp that indicates when it was generated. The datasets will differ in size with respect to the number of triples per second. During the test, the data to be ingested will be generated using data agents (in the form of distributed threads). An agent is a data generator that is responsible for inserting its assigned set of triples into a triple store using a SPARQL INSERT query. Each agent will emulate a dataset that covers the duration of the benchmark. All agents will operate in parallel and will be independent of each other. As a result, the storage solution benchmarked will have to support concurrent inserts. The insertion of a triple is based on its generation timestamp. To emulate the ingestion of streaming RDF triples produced over large time periods within a shorter time frame, we will apply a time dilatation factor to the timestamps of the triples. Our benchmark allows for testing the ingestion performance in terms of precision and recall by deploying datasets that vary in volume (number of triples and timestamps), and by using different dilatation values, different numbers of agents and different sizes of update queries. The testing and training data are derived from public transport datasets and are available here:

Use Cases

This task aims to reflect real loads on triple stores used in real applications. We will hence use the public transport dataset. However, the benchmark itself can be used for other real-time applications including (but not limited to):

  • Social network data from Twitter (Twitter Data)
  • Car traffic data derived from sensor data gathered by TomTom (Transportation Data)
  • Sensor data from plastic injection moulding industrial plants of Weidmüller (Molding Machines Data)

The descriptions of the datasets can be found here.

Requirements

For task 1, participants must:

  • provide his/her solution as a docker image. First install Docker using the instructions found here: https://docs.docker.com/engine/installation/ and then follow the guide on how to create your own Docker image found here: https://docs.docker.com/engine/getstarted/step_four/.
  • provide a SystemAdapter class in their preferred programming language. The SystemAdapter is the main component that establishes the communication between the other benchmark components and the participant’s system. The functionality of a SystemAdapter is divided into the following steps:
    • Initialization of the storage system
    • Retrieval of triples in the form of an INSERT SPARQL query and insertion of the aforementioned triples into the storage.
    • Retrieval of string representation of the graph name of each data generator.
    • Retrieval and execution of SELECT SPARQL queries against the storage system, and sending of the results to the EvaluationStorage component.
    • Shut down of the storage system.
  • For more information on how to create a SystemAdapter please follow the instructions of this link:  https://github.com/hobbit-project/platform/wiki/Develop-a-system-adapter
  • provide a storage system that processes INSERT SPARQL queries. The data insertion will be performed via INSERT SPARQL queries that will be generated by different data generators. The triple store must be able to handle multiple INSERT SPARQL queries at the same time, since each data generator runs independently of the others. Firstly, an INSERT query will be created by a data generator using an Apache Jena RDF model (dataModel) that includes a set of triples generated at a particular point in time. The INSERT query will be created, saved into a file and then transformed into a byte array using the following Java commands. Please note that each data generator will perform INSERT queries against its own graph, so the INSERT query will include the GRAPH clause with the name of the corresponding graph.
// Create an INSERT query from the Jena model that holds the generated triples
UpdateRequest insertQuery = UpdateRequestUtils.createUpdateRequest(dataModel, ModelFactory.createDefaultModel());
// Write the query to a file
OutputStream outStream = null;
String fileName = "insertQuery.sparql";
try {
   outStream = new FileOutputStream(fileName);
} catch (FileNotFoundException e) {
   e.printStackTrace();
}
IndentedWriter out = new IndentedWriter(outStream);
insertQuery.output(out);
// Read the query back as a UTF-8 String
String fileContent = null;
try {
   fileContent = FileUtils.readFileToString(new File(fileName), Charsets.UTF_8);
} catch (IOException e) {
   e.printStackTrace();
}
// Encode the query as a byte array and send it to the SystemAdapter
byte[][] insertData = new byte[1][];
insertData[0] = RabbitMQUtils.writeByteArrays(new byte[][] { RabbitMQUtils.writeString(fileContent) });
sendDataToSystemAdapter(RabbitMQUtils.writeByteArrays(insertData));
  • The fileContent includes a UTF-8 String representation of the INSERT query. The function RabbitMQUtils.writeString(String str) creates a byte array representation of the given String str using UTF-8 encoding. The function RabbitMQUtils.writeByteArrays(byte[][] array) returns a byte array containing all given input arrays, with their lengths placed in front of them. Then, the insertData will be sent to the SystemAdapter by the data generator using the function sendDataToSystemAdapter(byte[] insertData). The SystemAdapter must be able to receive the INSERT query as a byte array and then transform it into a UTF-8 encoded String. If a participant chooses Java as his/her programming language, the received byte array can be wrapped into a buffer using ByteBuffer buffer = ByteBuffer.wrap(insertData), and the command RabbitMQUtils.readString(buffer) can then be used to obtain a UTF-8 encoded String for the INSERT query (see the sketch below). Then, the query must be performed against the storage system.
  • provide a storage solution that can process the UTF-8 string representation of the name of the graph against which the SELECT and INSERT queries will be performed. A graph name will be converted to a byte array by each data generator using the command byte[] insertData = RabbitMQUtils.writeByteArrays(new byte[][] { RabbitMQUtils.writeString(graphName) }); and then sent to the SystemAdapter using sendDataToSystemAdapter(byte[] insertData).
  • provide a storage solution that can process SELECT SPARQL queries. The performance of a storage system will be explored using SELECT SPARQL queries. Each SELECT query will be executed after a predefined set of INSERT queries has been performed against the system (a configurable platform parameter). A SELECT query will be constructed by the corresponding data generator and then sent to a task generator, along with the expected results, combined into a byte array. The task generator will read both the SELECT query and the expected results, and will send the SELECT query as a byte array to the SystemAdapter along with the task’s unique identifier, which is a UTF-8 encoded String. The expected results will be sent by the task generator to the evaluation storage component as a byte array.
  • The SystemAdapter must be able to receive the SELECT query as a byte array and then transform it into a UTF-8 encoded String (as described above for Java). Then, the SystemAdapter performs the SELECT query against the storage system, and the retrieved results must be serialised into JSON format (abiding by the W3C standard https://www.w3.org/TR/rdf-sparql-json-res/). The JSON strings should be UTF-8 encoded. Finally, the results in JSON will be transformed into a byte array and sent to the evaluation storage along with the task’s unique identifier. Please note that each SELECT query will include the GRAPH clause with the URI of the corresponding graph into which the data was inserted.
  • provide any necessary parameters  to their systems that grant access for inserting triples into their storage system.

The following example is a description of HOBBIT’s API for participants that use Java as their programming language.

Firstly, read the article at the following link: https://github.com/hobbit-project/platform/wiki/Develop-a-system-adapter-in-Java, which will give you a general idea of how to develop a System Adapter in Java.
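
As a minimal sketch of the byte-array handling described in the requirements above (assuming Java and the HOBBIT core library; the class name, method name and the use of a remote Jena update endpoint are illustrative assumptions, not part of the benchmark):

import java.nio.ByteBuffer;
import org.apache.jena.update.UpdateExecutionFactory;
import org.apache.jena.update.UpdateFactory;
import org.hobbit.core.rabbit.RabbitMQUtils;

public class InsertQueryHandlingSketch {
    // Decodes a received INSERT query and runs it against a SPARQL update endpoint.
    // Only ByteBuffer.wrap and RabbitMQUtils.readString are taken from the API above.
    public static void handleInsertQuery(byte[] insertData, String updateEndpoint) {
        // Wrap the received bytes into a buffer ...
        ByteBuffer buffer = ByteBuffer.wrap(insertData);
        // ... and read the UTF-8 encoded INSERT query back as a String.
        String insertQuery = RabbitMQUtils.readString(buffer);
        // One possible way to execute the update: Apache Jena against a remote endpoint.
        UpdateExecutionFactory.createRemote(UpdateFactory.create(insertQuery), updateEndpoint).execute();
    }
}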

Evaluation

Our evaluation consists of three KPIs:

  • Recall, Precision and F-measure: The INSERT queries created by each data generator will be sent to the triple store as bulk loads. Note that the insertion of triples via INSERT queries will not happen at equal time intervals but according to their real generation timestamps, emulating a realistic scenario. After a stream of INSERT queries has been performed against the triple store, a SELECT query will be constructed by the corresponding data generator. The SELECT query will be sent to the task generator along with the expected answers. Then, the task generator will send the SELECT query to the SystemAdapter and the expected results to the evaluation storage. As explained above, once the SystemAdapter performs the SELECT query against the triple store, it sends the retrieved results to the evaluation storage as well. At the end of each experiment, we will compute the recall, precision and F-measure of each SELECT query by comparing the expected and retrieved results, as well as the micro- and macro-averaged recall, precision and F-measure of the whole benchmark (see the sketch after this list). The expected results for each SELECT query will be computed prior to the system evaluation by inserting the data into, and querying, an instance of the Jena TDB storage solution.
  • Triples per second: at the end of each stream, once the corresponding SELECT query has been performed against the system, we will measure the triples per second as the total number of triples that were inserted during that stream divided by the total time needed for those triples to be inserted (start of the SELECT query minus start of the first INSERT query of the stream).
  • Average answer time: we will report the average delay between the timestamp at which the SELECT query was executed and the timestamp at which the results were sent to the evaluation storage. No additional effort is needed from the participants to compute these timestamps, since the first timestamp is generated by the task generator and the second timestamp is generated by the evaluation storage.
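
To make the first KPI concrete, the following sketch shows one way the per-query precision, recall and F-measure and their micro and macro averages could be computed from expected and retrieved result sets; the class and field names are illustrative, and guards against empty inputs are omitted for brevity.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SelectKpiSketch {
    // Expected and retrieved results of one SELECT query (e.g. serialised result rows).
    static class QueryResult {
        Set<String> expected = new HashSet<>();
        Set<String> retrieved = new HashSet<>();
    }

    public static void printAverages(List<QueryResult> results) {
        double sumPrecision = 0, sumRecall = 0;
        long tpAll = 0, retrievedAll = 0, expectedAll = 0;
        for (QueryResult r : results) {
            Set<String> truePositives = new HashSet<>(r.retrieved);
            truePositives.retainAll(r.expected);
            sumPrecision += truePositives.size() / (double) r.retrieved.size();
            sumRecall += truePositives.size() / (double) r.expected.size();
            tpAll += truePositives.size();
            retrievedAll += r.retrieved.size();
            expectedAll += r.expected.size();
        }
        // Macro average: mean of the per-query scores; F-measure as their harmonic mean.
        double macroP = sumPrecision / results.size();
        double macroR = sumRecall / results.size();
        double macroF = 2 * macroP * macroR / (macroP + macroR);
        // Micro average: scores computed from the counts pooled over all queries.
        double microP = tpAll / (double) retrievedAll;
        double microR = tpAll / (double) expectedAll;
        double microF = 2 * microP * microR / (microP + microR);
        System.out.printf("macro P/R/F: %.3f %.3f %.3f%n", macroP, macroR, macroF);
        System.out.printf("micro P/R/F: %.3f %.3f %.3f%n", microP, microR, microF);
    }
}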

Transparency will be assured by releasing the dataset generators as well as the configurations.

Task 2: Data Storage Benchmark

Summary Description

This task consists of an RDF benchmark that measures how data stores perform with interactive, simple read SPARQL queries. Running the queries is accompanied by a high insert rate of data (via SPARQL INSERT queries), in order to mimic real use cases where READ and WRITE operations are bundled together. Typical bulk loading scenarios are also supported. The queries and query mixes are designed to stress the system under test in different choke-point areas, while being credible and realistic.

Testing and Training Data

The LDBC Social Network Benchmark is used as a starting point for this benchmark. The dataset generator developed for that benchmark has been modified to produce synthetic RDF datasets available in different sizes, but more realistic and more RDF-like. The structuredness of the dataset is in line with real-world RDF datasets, unlike the original LDBC Social Network Benchmark dataset, which is designed to be more generic and very well structured. The output of the data generator is split into three parts: the dataset that should be loaded by the system under test, a set of update streams containing update queries, and a set of files containing the different parameter bindings that will be used by the driver to generate the read queries of the workloads. The testing and training data are available here:

Use Cases

The use case of this task is an online social network since it is the most representative and relevant use case of modern graph-like applications. A social network site represents a relevant use case for the following reasons:

  • It is simple to understand for a large audience, as it is arguably present in our everyday life in different shapes and forms.
  • It allows testing a complete range of interesting challenges, by means of different workloads targeting systems of different nature and characteristics.
  • A social network can be scaled, allowing the design of a scalable benchmark targeting systems of different sizes and budgets.

Requirements

For this task, participants must:

  • provide his/her solution as a docker image (same as Task 1)
  • provide a SystemAdapter class that:
    • Receives the generated data coming from the Data Generator. The SystemAdapter has to retrieve the triples representing the dataset to be bulk loaded, provided as RDF files (in standard Turtle format).
    • Receives the tasks coming from the Task Generators. There are two different types of tasks:
      • SPARQL SELECT queries: there are 21 different query types, with many query parameters, that have to be executed in a specific order, following a specified query mix.
      • SPARQL UPDATE queries: there are 8 different types of updates, inserting different types of entities into the triple store.
      • All queries are written in the SPARQL 1.1 standard.
    • Sends the results to the evaluation storage as a byte array
  • provide a storage solution that can handle SELECT and UPDATE SPARQL queries, as well as the bulk loading of the dataset files

Evaluation

After generating a dataset of the desired size, the whole dataset will be bulk loaded and the loading time will be measured. Afterwards, the queries (SPARQL INSERT and SPARQL SELECT) will be executed against the system under test, and their results will be sent to the evaluation storage, which is responsible for measuring their execution times and comparing the actual answers to the expected ones used as a gold standard.

The KPIs that will be relevant are:

  • Average query execution times per query type (in msec): The execution time is measured for every single query, and for each query type the average query execution time will be calculated (see the sketch after this list).
  • Bulk loading time: Time in milliseconds needed for initial bulk loading phase.
  • Query failures: The number of queries whose returned results are not as expected.
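
As a simple illustration of the first KPI, the sketch below averages the measured execution times per query type; the map layout (query type mapped to a list of runtimes) is an assumption made for the example.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExecutionTimeSketch {
    // Computes the average execution time (in ms) for each query type,
    // given all measured runtimes grouped by query type.
    public static Map<String, Double> averagePerType(Map<String, List<Long>> runtimesByType) {
        Map<String, Double> averages = new HashMap<>();
        for (Map.Entry<String, List<Long>> entry : runtimesByType.entrySet()) {
            double sum = 0;
            for (long runtimeMs : entry.getValue()) {
                sum += runtimeMs;
            }
            averages.put(entry.getKey(), sum / entry.getValue().size());
        }
        return averages;
    }
}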

Task 3: Versioning RDF Data

Summary Description

The evolution of datasets often requires storing different versions of the same dataset, so that interlinked datasets can refer to older versions of an evolving dataset and upgrade at their own pace, if at all. Supporting the functionality of accessing and querying past versions of an evolving dataset is the main challenge for archiving/versioning systems. In this task we propose the second version of the versioning benchmark (SPBv v2.0), which will be used to test the ability of versioning systems to efficiently manage evolving datasets, where triples are added or deleted, and queries are evaluated across the multiple versions of said datasets.

Testing and Training Data

The Semantic Publishing Benchmark (SPB) generator will be used to produce the initial version of the dataset as well as the add-sets of the upcoming versions. SPB was developed in the context of the Linked Data Benchmark Council (LDBC) project and is inspired by the Media/Publishing industry, and in particular by the BBC’s “Dynamic Semantic Publishing” (DSP) concept. We will use the SPB generator, which uses ontologies and a DBpedia reference dataset provided by the BBC, to produce sets of creative works. Creative works are metadata, represented in RDF, about real-world events (e.g., sport events, elections). The data generator supports the creation of arbitrarily large RDF datasets, in the order of billions of triples, that mimic the characteristics of the real BBC datasets. Data generation follows three principles:

  • data clustering in which the number of creative works produced diminishes as time goes by
  • correlation of entities, where two or three entities are used to tag creative works for a fixed period of time, and
  • random tagging of entities, where random data distributions are defined with a bias towards popular entities created when the tagging is performed.

The data generator follows distributions that have been obtained from real-world datasets, thereby producing data that bear similar characteristics to real ones. The versioning benchmark used in this task will include datasets, and versions thereof, that respect the aforementioned principles. In addition to the produced creative works, five different versions of the DBpedia reference dataset will be maintained as well. These versions include all triples in which the entities used for annotating creative works appear as subject.

The training data are available here. Please first read the documentation in the README.txt file.

Use Cases

The use cases that are considered by this benchmark are those that address versioning problems. Such use cases span different domains and applications of interest, such as the energy domain, semantic publishing, biology, etc. For this task we will employ data from the semantic publishing domain.

Requirements

For task 3, participants must:

  • provide his/her solution as a docker image. First install docker using the instructions found here and then follow the guide on how to create your own docker image found here.
  • upload their systems to the HOBBIT platform using the instructions found here.
  • provide a SystemAdapter class in their preferred programming language. The SystemAdapter is the main component that establishes the communication between the other benchmark components and the participant’s system. The functionality of a SystemAdapter is divided into the following steps:
    • Initialization of the storage system
    • Retrieval of UTF-8 string representation of the graph name
    • Retrieval of generated data in the form of RDF files in N-triples format.
    • Retrieval and execution of SELECT SPARQL queries against the storage system, and sending of the results to the EvaluationStorage component.
    • Shut down of the storage system.
  • participate in the appropriate sub-task of task 3 according to the archiving strategy that his/her versioning system implements. The three available sub-tasks are configured to send each version’s generated data as:
    • task3.1: Independent Copies. Suitable for systems implementing the Full Materialization archiving strategy.
    • task3.2: Change-sets (sets of added and deleted triples). Suitable for systems implementing the Delta-Based or Annotated Triples archiving strategy.
    • task3.3: Both Independent Copies and Change-sets. Suitable for systems implementing a Hybrid archiving strategy, where independent copies or change-sets may be stored.

Here are more details of some of the aforementioned steps:

 

Retrieval of string representation of the graph name:

  • When generated data are configured to be sent as independent copies, graph names have the following form:
    • http://datagen.version.{VERSION_NUM}.{FILE_NAME}, where VERSION_NUM determines the version into which the received data have to be loaded and FILE_NAME is the name of the sent data file.
  • When generated data are configured to be sent as change-sets, graph names have one of the following forms:
    • http://datagen.addset.set.{VERSION_NUM}.{FILE_NAME}, where VERSION_NUM determines the version that we will end up with after adding the currently sent data, and FILE_NAME indicates the name of the sent data file. The data files of versions other than 0 include the generated creative works and the DBpedia data, if any.
    • http://datagen.deleteset.set.{VERSION_NUM}.{FILE_NAME}, where VERSION_NUM determines the version that we will end up with after deleting the sent data from the previous version, and FILE_NAME is the name of the sent data file.
  • When generated data are configured to be sent both as independent copies and as change-sets, graph names can have both of the aforementioned forms (a sketch of how such graph names might be parsed is given below).
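
Purely as an illustration, and assuming the graph-name patterns listed above, a SystemAdapter could extract the version number and file name roughly as follows (class and method names are placeholders):

public class GraphNameSketch {
    // Parses graph names of the forms described above, e.g.
    // http://datagen.version.{VERSION_NUM}.{FILE_NAME} or
    // http://datagen.addset.set.{VERSION_NUM}.{FILE_NAME}
    public static void parse(String graphUri) {
        String versionPrefix = "http://datagen.version.";
        String addSetPrefix = "http://datagen.addset.set.";
        String deleteSetPrefix = "http://datagen.deleteset.set.";

        String rest;
        if (graphUri.startsWith(versionPrefix)) {
            rest = graphUri.substring(versionPrefix.length());
        } else if (graphUri.startsWith(addSetPrefix)) {
            rest = graphUri.substring(addSetPrefix.length());
        } else if (graphUri.startsWith(deleteSetPrefix)) {
            rest = graphUri.substring(deleteSetPrefix.length());
        } else {
            throw new IllegalArgumentException("Unexpected graph name: " + graphUri);
        }
        // VERSION_NUM is the part before the first '.', FILE_NAME is the remainder.
        int firstDot = rest.indexOf('.');
        int versionNum = Integer.parseInt(rest.substring(0, firstDot));
        String fileName = rest.substring(firstDot + 1);
        System.out.println("version = " + versionNum + ", file = " + fileName);
    }
}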

Retrieval and execution of SELECT SPARQL queries:

The performance of a storage system will be explored using 8 different types of SELECT SPARQL queries. The SystemAdapter must be able to receive a SELECT query as a byte array and then transform it into a UTF-8 encoded String. The received queries are written in SPARQL 1.1, assuming that each version is stored in its own named graph of the form http://graph.version.{VERSION_NUM}. Systems that follow a different storage implementation, or that use their own enhanced version of SPARQL to query versions, have to rewrite the queries accordingly. After the appropriate adjustments, the SystemAdapter performs the SELECT query against the storage system, and the retrieved results must be serialised into JSON format (abiding by the W3C standards). The JSON strings should be UTF-8 encoded. Finally, the results in JSON will be transformed into a byte array and sent to the evaluation storage along with the task’s unique identifier.
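
For illustration only, a received query could look like the following, assuming the named-graph convention above (the actual benchmark queries are more complex):

// Illustrative only: a simple query over version 2, assuming each version
// is stored in the named graph http://graph.version.{VERSION_NUM}.
String exampleVersionQuery =
    "SELECT ?s ?p ?o " +
    "WHERE { GRAPH <http://graph.version.2> { ?s ?p ?o } }";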

For more information on how to create a SystemAdapter please follow the instructions found here.

Here you can find an example SystemAdapter for the Virtuoso system, implemented in Java. To let Virtuoso manage evolving data, we considered that each version is stored in its own named graph with name http://graph.version.{VERSION_NUM}; thus, the Full Materialization archiving strategy was followed and the benchmark was configured to send data as independent copies.

Evaluation

To test the ability of a versioning system to store and query multiple versions of datasets, our versioning benchmark SPBv will produce versions of an initial dataset using the following parameters:

  • the generated data form (changesets or independent copies)
  • the number of required versions
  • the size of the initial version
  • the proportion of added and deleted triples from one version to another

The generated data form parameter is related to the storage strategy that the versioning system implements. If a system implements the Full Materialization archiving strategy, then it is able to receive the data exactly in the form it requires them – each version as an independent copy. If a system implements another policy, such as the Delta-Based or Annotated-Triples ones, it is able to receive the data as sets of added and deleted triples.

In addition, the number of versions will be specified by the user of the benchmark, who will also be able to specify the size of the initial version of the dataset and the type of changes between versions. In this manner, the user can check how well the storage system can address the requirements raised by the nature of the versioned data. The benchmark tests the ability of the system to answer eight different types of versioning queries, as described in Section 2.2 of Deliverable D5.2.1. Among these eight query types there are structured queries on top of the current, a past or multiple versions that are based on real DBpedia queries. All queries are written in SPARQL 1.1.

In our evaluation we will focus on the following KPIs:

  • Query failures: The number of queries that failed to be executed. By failure we mean that the returned results are not those expected.
  • Throughput (in queries per second): The execution rate per second for all queries.
  • Initial version ingestion speed (in triples per second): The total number of triples that can be loaded per second for the dataset’s initial version. We distinguish this from the ingestion speed of the other versions because the loading of the initial version differs greatly from the loading of the following ones, where different underlying procedures, such as computing deltas, reconstructing versions, or storing duplicated information between versions, may take place.
  • Applied changes speed (in triples per second): The average number of changes that can be stored by the benchmarked system per second after the loading of all new versions. This KPI tries to quantify the overhead of the underlying procedures, mentioned in the initial version ingestion speed KPI, that take place when a set of changes is applied to a previous version.
  • Storage space (in MB): This KPI measures the total storage space required to store all versions.
  • Average query execution time (in ms): The average execution time, in milliseconds, for each of the eight versioning query types (e.g., version materialization, single-version queries, cross-version queries, etc.).

Task 4: Faceted Browsing Benchmark

Summary Description

Faceted browsing stands for a session-based (state-dependent) interactive method for query formulation over a multi-dimensional information space. It provides a user with an effective way to explore a search space. After having defined the initial search space, i.e., the set of resources of interest to the user, a browsing scenario consists of applying (or removing) filter restrictions on object-valued properties or of changing the range of number-valued properties. Using such operations, aimed at selecting resources with the desired properties, the user browses from state to state, where a state consists of the currently chosen facets and facet values and the current set of instances satisfying all chosen constraints. The task on faceted browsing checks existing solutions for their capabilities of enabling faceted browsing through large-scale RDF datasets; that is, it analyses their efficiency in navigating through large datasets, where the navigation is driven by intelligent iterative restrictions. We aim to measure the performance relative to dataset characteristics, such as overall size and graph characteristics.

Testing and Training Data

For this task, the transport dataset of linked connections will be used. The transport dataset is provided by a data generator and consists of train connections modelled using the transport ontology following the GTFS (General Transit Feed Specification) standard – see here for more details. The datasets may be generated in different sizes, while the underlying ontology remains the same – see here for a visualization of the ontology relevant to the task.

A participating system is required to answer a sequence of SPARQL queries, which simulate browsing scenarios through the underlying dataset. The browsing scenarios are motivated by the natural navigation behaviour of a user (such as a data scientist) through the data, as well as by the need to check participating systems on certain choke points. The queries involve temporal (time slices), spatial (different map views) and structural (ontology-related) aspects.

For training we provide a dataset of triples in Turtle format coming from our generator, as well as a list of SPARQL queries for sample browsing scenarios. Two scenarios are similar to the ones used in the testing phase, while a third is meant to illustrate all the possible choke points that we aim to test.

The training data are available here. Please first read the documentation in the readme.txt file.

Evaluation Methodology

At each state of the simulated browsing scenario through the dataset, two types of queries are to be answered correctly:

  1. Facet counts (in the form of SPARQL SELECT COUNT queries): For a specific facet, we ask for the number of instances that remain relevant after restriction over this facet. To increase efficiency, approximate counts (e.g., obtained by different indexing techniques) may be returned by a participating system.
  2. Instance retrieval (in the form of SPARQL SELECT queries): After selecting a certain facet as a further filter on the solution space, the remaining instances are required to be returned.

One browsing scenario consists of 8 to 11 changes of the solution space (instance retrievals), where each step may be the selection of a certain facet, a change in the range value of a literal property (which may be indirectly related through a complex property path), or the action of undoing a previously chosen facet or range restriction.

The evaluation is based on the following performance KPIs.

  1. Time: The time required by the system is measured separately for the two tasks, facet counts and instance retrievals. The results are returned in a score function computing the number of answered queries per second. For the instance retrieval queries, we additionally compute the queries-per-second score for several choke points separately.
  2. Accuracy of counts: The facet counts are checked for correctness. For each facet count, we record the distance of the returned count from the correct count in terms of absolute value, and we record the error in relation to the size of the solution space (relative error). We both sum and average over all steps of the browsing scenario, resulting in four overall error terms (see the sketch after this list):
    1. overall absolute error (sum of all errors)
    2. average absolute error
    3. overall relative error (sum of all errors over sum of all counts)
    4. average relative error (average of relative over all count queries)
  3. Accuracy of instance retrievals: For each instance retrieval we collect the true positives, the false positives and false negatives to compute an overall precision, recall and F1-score. Additionally, we compute precision, recall and F1-score for each of several choke points separately.
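
As an illustration of these four error terms, the sketch below computes them from parallel arrays of returned and correct counts; the variable names are assumptions made for the example, and correct counts are assumed to be non-zero.

public class CountErrorSketch {
    // Computes the four error terms described above, one entry per count query.
    public static void printErrorTerms(long[] returned, long[] correct) {
        int n = returned.length;
        double sumAbsError = 0;   // sum of |returned - correct|
        double sumRelError = 0;   // sum of per-query relative errors
        double sumCorrect = 0;    // sum of all correct counts
        for (int i = 0; i < n; i++) {
            double absError = Math.abs(returned[i] - correct[i]);
            sumAbsError += absError;
            sumRelError += absError / correct[i];
            sumCorrect += correct[i];
        }
        double overallAbsoluteError = sumAbsError;              // 1. sum of all errors
        double averageAbsoluteError = sumAbsError / n;          // 2. average absolute error
        double overallRelativeError = sumAbsError / sumCorrect; // 3. sum of errors over sum of counts
        double averageRelativeError = sumRelError / n;          // 4. average of relative errors
        System.out.printf("overall abs: %.2f, avg abs: %.2f, overall rel: %.4f, avg rel: %.4f%n",
                overallAbsoluteError, averageAbsoluteError, overallRelativeError, averageRelativeError);
    }
}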

Use Cases

Intelligent browsing by humans aims to find specific information under certain assumptions along temporal, spatial or other dimensions of statistical data. “Since plain web browsers support sessional browsing in a very primitive way (just back and forth), there is a need for more effective and flexible methods that allow users to progressively reach a state that satisfies them”, as Tzitzikas et al. point out in their recent survey on faceted browsing (DOI: 10.1007/s10844-016-0413-8). The ability to efficiently perform such faceted browsing is therefore important for the exploration of most datasets, for example in human-controlled information retrieval from topic-oriented datasets. We will include a use case in which a data analyst wants to explore the characteristics of a train network (e.g., delays in a particular region at certain times of day) based on the Linked Connections dataset (see here for details).

Implementation

Participants must:

  • provide his/her solution as a docker image (as in Task 1)
  • provide a SystemAdapter receiving and executing SELECT and SELECT COUNT SPARQL queries and subsequently sending the results to the EvaluationStorage in the formats defined below. The platform sends out a new query to the system adapter only after a reply to the previous query has been recorded.
  • the incoming SPARQL queries have to be read from incoming byte arrays as defined by the task queue, where the ‘data’ part of the byte array contains the SPARQL query as a UTF-8 String.
  • for instance retrieval (SELECT) queries, the result list should be returned as a byte array following the result queue standard, with the ‘data’ part of the byte array containing a UTF-8 String with the results as a comma-separated list of URIs. The byte array needs to be sent to the evaluation storage (see the sketch after this list).
  • for facet count (SELECT COUNT) queries, the result should be returned as a byte array following the result queue standard, with the ‘data’ part consisting of the count (integer) value as a UTF-8 encoded String.
  • A participating system needs to answer SPARQL SELECT and SELECT COUNT queries. In particular, systems have to correctly interpret rdfs:subClassOf*, denoting a property path of zero or more occurrences of rdfs:subClassOf.
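
The following hedged sketch illustrates the result formats described above; the query-execution helpers are placeholders, and only RabbitMQUtils and the task/result handling names follow the HOBBIT API referenced in this document.

import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;
import org.hobbit.core.rabbit.RabbitMQUtils;

public class FacetedBrowsingAdapterSketch {
    // Called for each task; 'data' contains the SPARQL query as a UTF-8 String.
    public void receiveGeneratedTask(String taskId, byte[] data) {
        String query = RabbitMQUtils.readString(ByteBuffer.wrap(data));
        String resultString;
        if (query.contains("COUNT")) {   // simplistic heuristic for this sketch
            // Facet count query: return the count value as a UTF-8 encoded String.
            long count = executeCountQuery(query);          // placeholder
            resultString = Long.toString(count);
        } else {
            // Instance retrieval query: return a comma-separated list of URIs.
            List<String> uris = executeSelectQuery(query);  // placeholder
            resultString = String.join(",", uris);
        }
        sendResultToEvalStorage(taskId, RabbitMQUtils.writeString(resultString));
    }

    // Placeholders for participant-specific query execution and result sending.
    private long executeCountQuery(String query) { return 0L; }
    private List<String> executeSelectQuery(String query) { return Collections.emptyList(); }
    private void sendResultToEvalStorage(String taskId, byte[] result) { /* result queue */ }
}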

Datasets

Twitter Dataset

Our Twitter dataset (https://github.com/renespeck/TWIG) is derived from 1 million real tweets that were generated in June 2009. To ensure that we do not divulge any personal information, we used (1) a Markov model to generate text that resembles tweets and abides by the density distribution of words in tweets, and (2) a tweet time distribution model that allows scaling up the number of agents generating tweets as well as the distribution of tweet times. Therewith, we can ensure that the behaviour of systems that ingest our tweets is similar to that of systems which ingest real tweets generated by the same number of users over the same period of time. The dataset abides by a simple ontology which describes tweets by the user who generated them, the time at which they were generated and their content.

Weidmüller Dataset

The molding machine dataset is provided by our partner Weidmüller. The dataset consists of readings taken from sensors deployed on a plastic injection molding machine. The sensors measure various parameters of the production process: distance, pressure, time, frequency, volume, temperature C, time S, speed and force. Each measurement is a 120-dimensional vector consisting of values of different types, such as text, fractional and decimal, but mostly fractional values. Each measurement is timestamped and described with an IoT ontology. The dataset can be used in an anomaly detection scenario.

TomTom Dataset

A text file containing a simple textual representation of the trace data (GPS fixes). Each line of the text file represents a single GPS fix. The lines are sorted by the timestamp of the corresponding GPS fix (ascending). The format of each line is:
<UTC unix time stamp [ms]> <longitude [°]> <latitude [°]> <speed [m/s]>
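
A minimal sketch of how such a line could be parsed, assuming whitespace-separated fields in the order given above:

public class GpsFixSketch {
    // Holds one parsed GPS fix from the trace file.
    static class GpsFix {
        long timestampMs;   // UTC unix time stamp in milliseconds
        double longitude;   // degrees
        double latitude;    // degrees
        double speed;       // metres per second
    }

    // Parses a single line of the form:
    // <UTC unix time stamp [ms]> <longitude> <latitude> <speed [m/s]>
    public static GpsFix parseLine(String line) {
        String[] fields = line.trim().split("\\s+");
        GpsFix fix = new GpsFix();
        fix.timestampMs = Long.parseLong(fields[0]);
        fix.longitude = Double.parseDouble(fields[1]);
        fix.latitude = Double.parseDouble(fields[2]);
        fix.speed = Double.parseDouble(fields[3]);
        return fix;
    }
}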

LDBC Social Network Benchmark dataset

The Social Network Benchmark (SNB) provides a synthetic data generator (Datagen) which models an online social network (OSN), like Facebook. This Datagen will be modified in order to produce RDF datasets with real-world structuredness, as opposed to the large number of synthetic datasets used in benchmarking (which show a significant discrepancy in the level of structuredness compared to real-world RDF datasets). The dataset will be in TTL format. It is possible to generate datasets of different sizes. The benchmark defines a set of scale factors (SFs), targeting systems of different sizes and budgets. SFs are computed based on the ASCII size in gigabytes of the generated output files using the CSV serializer. For example, SF 1 weighs roughly 1 GB in CSV format, SF 3 weighs roughly 3 GB, and so on. The proposed SFs are the following: 1, 3, 10, 30, 100, 300, 1000. The size of the resulting dataset is mainly affected by the following configuration parameters: the number of persons and the number of years simulated. Different SFs are computed by scaling the number of Persons in the network, while fixing the number of years simulated. For example, SF 30 consists of the activity of a social network of 182K users during a period of three years. The data contains different types of entities and relations, such as persons with friendship relations among them, posts, comments or likes. Additionally, it reproduces many of the structural characteristics observed in real OSNs: attribute correlations, degree distributions, structure-attribute correlations, and spiky activity volume.

Semantic Publishing Benchmark Data

The SPB data generator uses ontologies and reference datasets provided by the BBC to produce sets of creative works. The data generator supports the creation of arbitrarily large RDF datasets, in the order of billions of triples, that mimic the characteristics of the reference BBC datasets. The generator produces creative works that are valid instances of the BBC ontologies, which define numerous concepts and properties employed to describe this content. SPB uses seven core and three domain RDF ontologies provided by the BBC. The former define the main entities and their properties required to describe the essential concepts of the benchmark, namely creative works, persons, documents, BBC products (news, music, sport, education, blogs), annotations (tags), provenance of resources and content management system information. The latter are used to express concepts from a domain of interest such as football, politics or entertainment. Reference datasets are employed by the data generator to produce the data of interest. These datasets are snapshots of the real datasets provided by the BBC; in addition, GeoNames and DBpedia reference datasets have been included for further enriching the annotations with geo-locations (to enable the formulation of geospatial queries) and person data. A creative work is described by a number of data-value and object-value properties; a creative work also has properties that link it to resources defined in the reference datasets: these are the about and mentions properties, and their values can be any resource. The generator models three types of relations in the data: clustering of data, correlations of entities and random tagging of entities (the interested reader can find more details in [2]). The versioning benchmark will comprise versions of SPB datasets that follow the aforementioned principles of data generation.

The transport dataset of linked connections

A significant portion of people use public transport for their travels. The countless public transport services worldwide, combined with their usage, lead to an enormous source of information. Many public transport companies worldwide provide this data using the GTFS** standard, which can be converted to Linked Data using the Linked Connections framework [1]. Such data is an ideal source for the benchmarking of systems because of its time and space dimensions. These datasets contain geospatial information about stops, temporal information about transit schedules and the interlinking between both. In many cases, benchmarking requires the ability to create synthetic datasets with specific properties of any given size. This is why we provide a public transport dataset generator that is able to create realistic public transport areas, networks and schedules. The generator can be configured to produce countless synthetic datasets using a wide range of parameters.

References

  1. Colpaert, P., Llaves, A., Verborgh, R., Corcho, O., Mannens, E., Van de Walle, R.: Intermodal public transit routing using linked connections. In: Proceedings of the 14th International Semantic Web Conference: Posters and Demos (2015)
  2. Kotsev, V., Minadakis, N., Papakonstantinou, V., Erling, O., Fundulaki, I., Kiryakov, A.: Benchmarking RDF query engines: The case of Semantic Publishing Benchmark. In: BLINK Proceedings (2016)

**https://developers.google.com/transit/gtfs/

Common API

A benchmark for evaluating a SPARQL-based system, as well as the system adapter of this system, should implement the API described in the following. The general API between a benchmark and a benchmarked system is already described in our GitHub wiki (https://github.com/hobbit-project/platform/wiki/Develop-a-system-adapter). The wiki page describes the command queue as well as the three additional RabbitMQ queues a system is connected to, i.e., the data queue, the task queue and the result queue. (Please have a look at the article before reading on.)

A SPARQL-based system goes through four phases during the benchmarking:

    1. Initialization phase
      Like every system, it has to initialize itself, e.g., start the needed components, load configurations, etc. This phase ends as soon as it sends the SYSTEM_READY_SIGNAL on the command queue (as described in the wiki and implemented in the AbstractSystemAdapter: https://github.com/hobbit-project/core/blob/master/src/main/java/org/hobbit/core/components/AbstractSystemAdapter.java).
    2. Loading phase
      After the system is running and the benchmark has started, the system can receive data from the data queue, which it should load into its triple store. This can be done as a bulk load. The benchmark controller will send a BULK_LOAD_DATA_GEN_FINISHED signal on the command queue when it has finished sending all data. The BULK_LOAD_DATA_GEN_FINISHED message comes with an Integer value (4 bytes) as additional data, representing the number of data messages the system adapter should have received. This number should be used by the system adapter to wait for all messages to arrive. Note that the benchmark controller might have to collect the number of generated data messages from the data generators. In addition, the BULK_LOAD_DATA_GEN_FINISHED message contains a flag that determines whether there are more data to be sent by the benchmark controller. This flag determines whether the system can enter the querying phase or has to wait for additional data. More specifically, the system will read the remaining data from the data queue, bulk load them into the store and send a BULK_LOADING_DATA_FINISHED signal on the command queue to the benchmark controller to indicate that it has finished the loading. If the flag of the BULK_LOAD_DATA_GEN_FINISHED command was false, it waits for the next batch of data, bulk loads it into the store and sends the BULK_LOADING_DATA_FINISHED signal on the command queue again. If the flag is true, it can proceed to the querying phase. The values of the aforementioned commands are:

      1. BULK_LOADING_DATA_FINISHED = (byte) 150;
      2. BULK_LOAD_DATA_GEN_FINISHED = (byte) 151;
      Data received during this phase is structured in the following way:
      1. Integer value (4 bytes) containing the length of the graph URI
      2. Graph URI (UTF-8 encoded String)
      3. N-Triples data (UTF-8 encoded String; the rest of the package/data stream)
      Example workflow:
      1. lastBulkLoad ← false
      2. while lastBulkLoad is false do
      3.      numberOfMessages ← X
      4.      benchmark sends data to system
      5.      if there are no more data for sending then
      6.           lastBulkLoad ← true
      7.      end if
      8.      benchmark sends BULK_LOAD_DATA_GEN_FINISHED { numberOfMessages,lastBulkLoad }
      9.      system loads data
      10.      system sends BULK_LOADING_DATA_FINISHED
      11. done
      12. system enters querying phase
      For the benchmarks that measure the time it takes a system to load the data, the time from step 8 to step 10 is measured.
    3. Querying phase
      During this phase the system can get two types of input (each query is preceded by its length):

      1. Data from the data queue that should be inserted into the store in the form of INSERT SPARQL queries.
      2. Tasks on the task queue, i.e., SPARQL queries (SELECT, INSERT, …), that it has to execute. The results for the single tasks (in JSON format) have to be sent together with the id of the task to the result queue.
    4. Termination phase
      As described in the wiki, the third phase ends when the system receives the TASK_GENERATION_FINISHED command and has consumed all the remaining messages from the task queue. (The AbstractSystemAdapter already contains this step.)

Note that not every task has to make use of all of these possibilities. For example, Task 1 does not need the loading phase, which means that the benchmark can directly send the BULK_LOAD_DATA_GEN_FINISHED signal to the system. Other tasks might not insert additional data during the benchmarking, i.e., they do not use the data queue during the querying phase. Therefore, please refer to the description of each task.

Guidelines for MOCHA System Adapter Implementation

In this section, we present the guidelines for implementing a System Adapter following the MOCHA Common API. The storage system’s SystemAdapter class in Java must extend the abstract class AbstractSystemAdapter.
A prototype of a SystemAdapter for the open source version of Virtuoso can be found here.

A SystemAdapter must override the following methods:

    • public void init() throws Exception {}: this method is responsible for initializing the storage system, by executing a command that starts the system’s docker container. This function must call super.init().
    • public void receiveCommand(byte command, byte[] data) {}: This method is called when a command that must be handled by the System Adapter is received on the command queue. In this case, the sender component is the Benchmark Controller and the receiver component is the SystemAdapter. This method is responsible for receiving the BULK_LOAD_DATA_GEN_FINISHED signals from the Benchmark Controller. Such a signal arrives every time a new bulk load has to be performed. Additionally, the Benchmark Controller will also send the information regarding the number of messages (dataset files) the System Adapter should have received, and a flag that determines whether there are more data (more bulk loads) to be sent by the benchmark controller. All this information is included in the byte[] data argument. The SystemAdapter can read this information using the following commands:
ByteBuffer buffer = ByteBuffer.wrap(data);
int numberOfMessages = buffer.getInt();
boolean lastBulkLoad = buffer.get() != 0;
  • After receiving the BULK_LOAD_DATA_GEN_FINISHED signal, the bulk loading phase (Phase 2) of the SystemAdapter for the received data is executed. During the bulk loading phase, the SystemAdapter should:
    1. Wait until all messages from each Data Generator are received.
    2. Once all messages are received, it should load them into the triplestore.
    3. Send the BULK_LOADING_DATA_FINISHED signal to the command queue for the Benchmark Controller to understand that the current bulk phase of the SystemAdapter is over.
  • public void receiveGeneratedData(byte[] data) {}: this method has two different usages:
      1. In Phase 2, it is responsible for receiving the data sent by the data generators and storing them locally, so that they are available for loading. Recall from the MOCHA API that byte[] data includes: an integer value containing the length of the graph URI, the UTF-8 encoded String of the graph URI, and the data as a UTF-8 encoded String. It is suggested that participants use the following commands in the body of the receiveGeneratedData function to get all the above information:
    ByteBuffer dataBuffer = ByteBuffer.wrap(data);
    String graphUri = RabbitMQUtils.readString(dataBuffer);
    byte[] dataContentBytes = new byte[dataBuffer.remaining()];
    dataBuffer.get(dataContentBytes, 0, dataBuffer.remaining());
      The graph URI is used for determining the name of the file containing the dataset, or the version into which the received data have to be loaded. Finally, every time the receiveGeneratedData function is called, the appropriate AtomicInteger variable that holds the number of received messages has to be updated.
      2. In Phase 3, it is responsible for receiving the INSERT queries from the data generators. The text of the query can be obtained in the following way:
    ByteBuffer dataBuffer = ByteBuffer.wrap(data);
    String insertQuery = RabbitMQUtils.readString(dataBuffer);
    • public void receiveGeneratedTask(String tId, byte[] data) {}: this method is responsible for receiving and performing SELECT and INSERT SPARQL queries against the storage system. The parameter tId contains the task ID, as a String, for that particular SELECT/INSERT task, and the parameter data contains the SELECT/INSERT query in the form of a byte array. The participant should use the following commands to get the SPARQL query string:
ByteBuffer dataBuffer = ByteBuffer.wrap(data);
String queryString = RabbitMQUtils.readString(dataBuffer);
    • For a SELECT query, once it is executed and the system response is received, the result set must be converted to JSON format that abides by the W3C standard. If the results of the SELECT query are an instance of the org.apache.jena.query.ResultSet class, then the conversion to JSON can be done using the commands:
ByteArrayOutputStream qBos = new ByteArrayOutputStream();
ResultSetFormatter.outputAsJSON(qBos, rs);
    • Once the results are serialised to JSON format, they must be sent to the evaluation storage as a byte[] along with the taskID, using the following commands:
byte[] results = qBos.toByteArray(); 
sendResultToEvalStorage(tId, results);
    • For INSERT queries, the system adapter must inform the evaluation storage by sending a message to it (e.g., an empty string) along with the taskID, using the following command:
sendResultToEvalStorage(tId, RabbitMQUtils.writeString(""));
  • public void close() throws IOException {}: this method is responsible for shutting down the storage system so that it will not receive any further queries. This function must call super.close().

In the following lines, a template for a System Adapter following the MOCHA API can be found:

public class VirtuosoSysAda extends AbstractSystemAdapter {
    // a flag indicating if the data loading phase has been finished
    private boolean dataLoadingFinished = false;
    // number of messages received from the data generators
    private AtomicInteger totalReceived = new AtomicInteger(0);
    // number of messages sent by the data generators
    private AtomicInteger totalSent = new AtomicInteger(0);
    // mutex for waiting all the messages from the data generators
    // before loading phase
    private Semaphore allDataReceivedMutex = new Semaphore(0);
    // current loading phase
    private int loadingNumber = 0;

    public VirtuosoSysAda() {
    }

    @Override
    public void receiveGeneratedData(byte[] arg0) {
     if (dataLoadingFinished == false) {
       ByteBuffer dataBuffer = ByteBuffer.wrap(arg0);   
       String fileName = RabbitMQUtils.readString(dataBuffer);
       byte [] content = new byte[dataBuffer.remaining()];
       dataBuffer.get(content, 0, dataBuffer.remaining());
       // Store the file locally for later bulk loading
        …
       // if all the messages are there, release the mutex
       if(totalReceived.incrementAndGet() == totalSent.get()) {
         allDataReceivedMutex.release();
       }
     }
     else {   
       ByteBuffer buffer = ByteBuffer.wrap(arg0);
       String insertQuery = RabbitMQUtils.readString(buffer);
       // process the insert query
       … 
     }
    }

    @Override
    public void receiveGeneratedTask(String taskId, byte[] data) {
      ByteBuffer buffer = ByteBuffer.wrap(data);
      String queryString = RabbitMQUtils.readString(buffer);
      if (queryString.contains("INSERT DATA")) {
        // process the insert query and inform the evaluation storage
      … 
      }
      else {
        // process the select query and send the results
        // to the evaluation storage
        … 
      }
    }

    @Override
    public void init() throws Exception {
      super.init();
      // internal initialization
      … 
    }

    @Override
    public void receiveCommand(byte command, byte[] data) {
      if (VirtuosoSystemAdapterConstants.BULK_LOAD_DATA_GEN_FINISHED == command) {
         ByteBuffer buffer = ByteBuffer.wrap(data);
         int numberOfMessages = buffer.getInt();
         boolean lastBulkLoad = buffer.get() != 0;
         // if all data have been received before
         // BULK_LOAD_DATA_GEN_FINISHED command received
         // release before acquire, so it can immediately proceed to
         // bulk loading
         if(totalReceived.get()==totalSent.addAndGet(numberOfMessages)){
            allDataReceivedMutex.release();
       }
       // wait for receiving all data for bulk load
       try {
         allDataReceivedMutex.acquire();
       } catch (InterruptedException e) {
          … 
       }
       // all data for bulk load received.
       // proceed to the loading...
       … 
        loadingNumber++;
        if (lastBulkLoad) {
          dataLoadingFinished = true;
        }
     }
     super.receiveCommand(command, data);
    }

    @Override
    public void close() throws IOException {
      // internal close
      …
      super.close();
    }
}