In one of our previous blog posts, we presented the approaches regarding the data storage benchmark. We introduced important modifications to the synthetic data generator and the dataset of the Social Network Benchmark, developed in the context of the FP7 Linked Data Benchmarking Council (LDBC), and transformed its SQL queries to SPARQL, all while preserving its most important features. This way, we introduce a benchmark for assessing Big Linked Data storage solutions for interactive applications. This blog post outlines how the first version of the Data Storage benchmark (DSB) for HOBBIT was developed and tested (for details, consider reading the Deliverable 5.1.1) and focuses on the improvements that will be introduced by the second version of benchmark, currently under development.
The Data Storage benchmark has been fully integrated into the HOBBIT platform and all interested parties can evaluate their system against it by running the benchmark through the platform website. Interested parties should follow the instructions from the platform Wiki page – in brief, they need to provide their system in the form of a Docker image and provide a system adapter which implements the corresponding API. The DSB has the following parameters which need to be set in order to execute the benchmark:
- Number of operations: The user must provide the total number of SPARQL queries that should be executed against the tested system. This number includes both SELECT and INSERT queries. The ratio between them, i.e. the number of queries per query type, has been specified in such a way that each query type has the same impact on the overall score of the benchmark. This means that the simpler and faster queries occur much more frequently than the complex and slower ones.
- Scale factor: The DSB can be executed using different sizes of the dataset, i.e. with different scale factors. The total number of triples by scale factor is provided in the following table:
The Data Generator of DSB is the component that creates the dataset for the benchmark. The workflow of the Data Storage benchmark, in the context of the HOBBIT platform and its design, makes it unnecessary to generate the synthetic RDF dataset on each individual benchmark run. Due to the different available scale factors for the dataset, the dataset can reach a size of several billion triples – such a dataset can take several hours to be generated, even on very powerful hardware, using all available resources. Therefore, we changed the benchmark approach and pre-generated the datasets for every size (scale factor). In order to provide fast access, we stored them on an FTP server installed on the same cluster where the system will be benchmarked.
The Task Generator is the component which creates the tasks, i.e. the SPARQL, SELECT and INSERT queries, based on the incoming data. In the first version of the DSB we use sequential tasks, i.e. the Task Generator sends the tasks one by one, waits for the system under test to process a task in full before sending the next one to the System Adapter. With this, we examine and appraise the best performance of the tested system for a given query, by allowing the system to have all resources available for the query in question.
The Evaluation Module is the component which evaluates the results received from the benchmarked system. For each performed query, the Evaluation Storage saves the execution start-time and end-time, along with the expected result set. Based on that, this component calculates all specified key performance indicators (KPIs), such as the average execution time per query type in milliseconds, the throughput (number of executed queries per second), loading time, number of unexpected answers, etc., and sends them to the platform as an RDF model.
The biggest change between the first and the second version of the benchmark is in the Task Generator. The queries will not be executed sequentially, single threaded, one after the other, but we allow the benchmark to test the ability of the system to run concurrent queries, multiple of them in the same time. Much more relevant performance indicator than the elementary average execution times should be the capability of the system to serve a specified number of social network users, e.g. to answer most of their requests in the specified time (given as a parameter). So, unlike the first version of DSB, the idea of the second version is to have the real workload, consisting of users’ requests that corresponds to the reality. Most benchmarks, while examine the concurrency of the system under test, simply submit a specified number of tasks at the same time, forming the uniform distribution of the queries. Usually, this scenario doesn’t reflect the reality, where spiky activity volumes can appear. The amount of activity should be reactive to real-world events. If a natural disaster occurs, we will observe people talking about it mostly after the time of the disaster and the associated activity volume will decay as the hours pass. This translates to a spiky activity along the line of the social network, mixing periods with a high load with periods where the load level is lower. Also, this means that the amount of messages produced by users and their topics are correlated with given points in time. However, these differences in the activity of users are not only related to the updates (INSERT queries) but to the reads (SELECT queries) as well. There are some intervals in time where there are a lot of reads, while there are periods with almost none. The ratio of read queries should be highly correlated with the ratio of updates, for which the Data Generator is responsible. Thus, the Task Generator of DSB v2.0 will mimic the real workload that will be driven by updates.
The benchmark becomes more demanding as the scale factor (i.e. the dataset size and the number of social network users) increases. Our idea is to provide the ability for the benchmark to stress the system under test with more demanding tasks, while having a fixed scale factor. That concept is realized through the Time Compression Ratio (TCR) approach, as a target throughput to test. It is used for “squeezing” together or “stretching” apart the queries of the workload. If TCR is equal to 1, the workload will correspond to real frequencies of the queries, while if it is 2, the queries will be submitted 2 times faster. Also, if TCR is less than 1, the queries will be issued at a slower rate. The main goal is to dynamically adjust the TCR, by running the benchmark faster and faster, until the system provides the answers in the interactive time, and this will be the main KPI of our benchmark.