Data Storage: Approaches and Benchmark

The demand for efficient RDF storage technologies has grown steadily in recent years. This is due to the increasing number of Linked Data datasets and of applications that exploit Linked Data resources. In particular, a growing number of applications require RDF storage solutions capable of answering interactive SPARQL queries. With more and more RDF storage solutions available, there is a continuous need for objective means of comparing technologies from different vendors, and therefore for representative benchmarks that mimic the workloads of real-world applications. In addition to helping developers, such benchmarks aim to stimulate technological progress among competing systems and thereby accelerate the maturation of Big Linked Data software tools.

One aspect of such efforts is the development of benchmarks for data storage. A number of benchmarks and benchmark generation frameworks for querying Linked Data have been developed over the past decade [1][9][10][11]. The requirements that data storage benchmarks need to meet are: a high insert rate with time-dependent and largely repetitive or cyclic data; possible exploitation of structure and physical organization adapted to the key dimensions of the data; bulk loading support; interactive complex read queries as well as simple lookups; and concurrency and high throughput [2][8].

Potential Data Storage Benchmarks

The Lehigh University Benchmark (LUBM) [9], the Berlin SPARQL Benchmark (BSBM) [3] and the SPARQL Performance Benchmark (SP2Bench) [12] use synthetic data and synthetic queries for different scenarios. LUBM provides data about the organizational structure of universities, SP2Bench models its data on the DBLP bibliographic database, and BSBM was developed for the e-commerce use case and supports synthetic updates. However, the queries in all three benchmarks lack complexity and thus cannot, in our opinion, produce relevant results on how a triple store handles complex queries under an interactive workload [10].
The DBpedia SPARQL Benchmark (DBPSB) uses real data and real queries. However, the results in [4], especially when compared with the results in [10], show that its queries are not complex enough to yield new insights.
FEASIBLE addresses this issue by providing real and complex queries, but it requires a query log to generate them [11]. Hence, it cannot provide complex queries for synthetic data.
The Waterloo SPARQL Diversity Test Suite (WatDiv) [1] provides a QueryGenerator with 125 query templates. However, the generator is restricted to conjunctive SELECT queries, which do not cover the requirements listed above.
The LDBC Social Network Benchmark (SNB) [7] provides a synthetic but realistic dataset with complex queries, so it can yield new insights into triple store performance while using synthetic data. An additional difficulty for RDF storage systems is the real-world distribution of the dataset, i.e. the low structuredness of the data [13], which makes the query optimizer's task much more challenging, increases the number of potentially optimal query plans, and complicates cardinality estimation. For example, the queries in this benchmark make it hard to estimate the number of Posts by friends of a Person, for two reasons: both the number of friends of a Person and the number of Posts per Person can vary significantly. This is a desirable property in benchmarks that represent real-world use cases. Some well-known and widely accepted benchmarks, such as TPC-H, do not have this feature [5].
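
The estimation problem described above can be illustrated with a minimal Python sketch. The toy graph, the person names and the Post counts are invented for illustration and are not taken from SNB. The estimator assumes a simple "average friends times average Posts" model, which is a deliberate simplification of what a real query optimizer does.

```python
from statistics import mean

# Hypothetical toy social graph: person -> list of friends.
friends = {
    "alice": ["bob", "carol", "dave", "erin"],
    "bob": ["alice"],
    "carol": ["alice"],
    "dave": ["alice"],
    "erin": ["alice"],
}

# Hypothetical number of Posts per person, skewed on purpose.
posts = {"alice": 100, "bob": 1, "carol": 2, "dave": 0, "erin": 1}


def estimated_posts_by_friends() -> float:
    """Uniform estimate: average friend count times average Posts per Person."""
    avg_friends = mean(len(f) for f in friends.values())
    avg_posts = mean(posts.values())
    return avg_friends * avg_posts


def actual_posts_by_friends(person: str) -> int:
    """Exact number of Posts created by the friends of one person."""
    return sum(posts[f] for f in friends[person])


if __name__ == "__main__":
    print("estimate:", estimated_posts_by_friends())
    for person in ("alice", "bob"):
        print(person, "actual:", actual_posts_by_friends(person))
```

In this toy data the uniform estimate is about 33 Posts, while the exact answer is 4 for alice and 100 for bob, so a query plan chosen from the estimate can be wrong in either direction.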

Therefore, instead of developing a new data storage benchmark from scratch, we analyzed the existing ones, taking into account their relevance, their popularity and representation, their pros and cons regarding scalability and realism, the key performance indicators they measure, etc.

Data Storage Benchmark for the HOBBIT Project

Taking into account the existing benchmarks briefly reviewed above, we decided to use the Social Network Benchmark (SNB), developed under the auspices of the Linked Data Benchmark Council (LDBC), as a starting point in constructing the Data Storage benchmark for the HOBBIT project. LDBC introduced a new choke-point driven methodology for developing benchmark workloads, which combines user input with input from expert systems architects [6]. Unlike other benchmarks, which are tied to a single technology, SNB is much more generic and can be used to evaluate pure graph database systems, systems intended to manage Semantic Web data conforming to the RDF data model, distributed graph processing systems, and traditional relational database systems that support recursive SQL.

As part of the HOBBIT project, the OpenLink Software team is focused on modifying the synthetic dataset generator (DATAGEN) from SNB, and on adapting the SPARQL queries to the changes that DATAGEN introduces into the synthetic dataset. Our team's output from the project will be a new Data Storage benchmark based on SNB. It will provide a synthetic dataset for an online social network with real-world data features (attribute correlation, degree distribution, structure-attribute correlations, spiky activity volume) and real-world RDF dataset coherence (lower structuredness than relational databases and benchmarks), together with a set of queries that test the choke-points of RDF data storage solutions.
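
To make the notion of structuredness more concrete, here is a minimal Python sketch of a coverage ratio: the share of (instance, property) slots that are actually filled for instances of one type. This simplified measure, the function name and the sample data are assumptions made for illustration. They are not necessarily the coherence metric used by the dataset generator.

```python
from typing import Dict, Set


def coverage(instances: Dict[str, Set[str]]) -> float:
    """Fraction of (instance, property) slots that are actually filled.

    `instances` maps each instance of a single type to the set of
    properties that are set for it.
    """
    all_properties: Set[str] = set()
    for props in instances.values():
        all_properties |= props
    slots = len(instances) * len(all_properties)
    if slots == 0:
        return 0.0
    filled = sum(len(props) for props in instances.values())
    return filled / slots


# Hypothetical relational-style data: every row fills every column.
relational_like = {
    "p1": {"name", "email", "birthday"},
    "p2": {"name", "email", "birthday"},
    "p3": {"name", "email", "birthday"},
}

# Hypothetical RDF-style data: properties are optional per instance.
rdf_like = {
    "p1": {"name", "email", "birthday"},
    "p2": {"name"},
    "p3": {"name", "email"},
}

if __name__ == "__main__":
    print("relational-like coverage:", coverage(relational_like))
    print("RDF-like coverage:", coverage(rdf_like))
```

The relational-style sample fills all nine slots, while the RDF-style sample fills six of nine, which is the kind of lower structuredness the generated dataset is meant to exhibit.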

Authors: Milos Jovanovik (OpenLink Software) and Mirko Spasić (OpenLink Software)

[1] Güneş Aluç, Olaf Hartig, M Tamer Özsu, and Khuzaima Daudjee. Diversified Stress Testing of RDF Data Management Systems. In International Semantic Web Conference (ISWC). 2014.
[2] Renzo Angles, Peter Boncz, Josep Larriba-Pey, Irini Fundulaki, Thomas Neumann, Orri Erling, Peter Neubauer, Norbert Martinez-Bazan, Venelin Kotsev, and Ioan Toma. The Linked Data Benchmark Council: A Graph and RDF Industry Benchmarking Effort. SIGMOD Rec., 43(1):27–31, May 2014.
[3] Christian Bizer and Andreas Schultz. The Berlin SPARQL Benchmark. Int. J. Semantic Web Inf. Syst., 5(2):1–24, 2009.
[4] Felix Conrads, Jens Lehmann, Muhammad Saleem, Mohamed Morsey, and Axel-Cyrille Ngonga Ngomo. IGUANA Feasible Benchmark 2016. March 2017.
[5] Orri Erling. In Hoc Signo Vinces – Virtuoso meets TPC-H. 2013. Accessed on 21 February 2017.
[6] Orri Erling, Alex Averbuch, Josep Larriba-Pey, Hassan Chafi, Andrey Gubichev, Arnau Prat, Minh-Duc Pham, and Peter Boncz. The LDBC Social Network Benchmark: Interactive Workload. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 619–630. ACM, 2015.
[7] Orri Erling, Alex Averbuch, Josep Larriba-Pey, Hassan Chafi, Andrey Gubichev, Arnau Prat, Minh-Duc Pham, and Peter Boncz. The LDBC Social Network Benchmark: Interactive Workload. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pages 619–630, New York, NY, USA, 2015. ACM.
[8] Y. Guo, A. Qasem, Z. Pan, and J. Heflin. A Requirements Driven Framework for Benchmarking Semantic Web Knowledge Base Systems. IEEE Transactions on Knowledge and Data Engineering, 19(2):297–309, Feb 2007.
[9] Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. LUBM: A Benchmark for OWL Knowledge Base Systems. J. Web Sem., 3(2-3):158–182, 2005.
[10] Mohamed Morsey, Jens Lehmann, Sören Auer, and Axel-Cyrille Ngonga Ngomo. Usage-Centric Benchmarking of RDF Triple Stores. In Proceedings of the 26th AAAI Conference on Artificial Intelligence (AAAI 2012), 2012.
[11] Muhammad Saleem, Qaiser Mehmood, and Axel-Cyrille Ngonga Ngomo. FEASIBLE: A feature-based SPARQL benchmark generation framework. In The Semantic Web – ISWC 2015 – 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part I, pages 52–69, 2015.
[12] Michael Schmidt, Thomas Hornung, Georg Lausen, and Christoph Pinkel. SP2Bench: A SPARQL Performance Benchmark. In International Conference on Data Engineering (ICDE), pages 222–233. IEEE, 2009.
[13] Mirko Spasić, Milos Jovanovik, and Arnau Prat-Perez. An RDF Dataset Generator for the Social Network Benchmark with Real-World Coherence. In Proceedings of the Workshop on Benchmarking Linked Data (BLINK 2016), pages 18–25, 2016.
