Versioning for Big Linked Data: approaches and benchmarks

As LOD datasets are constantly evolving, both at schema and instance level, there is a need for systems that support efficiently storing and querying such evolving data. Such archiving systems must support various types of queries on data, including queries that access multiple versions (cross-version queries), queries that access the evolution history (delta) itself, as well as combinations of the above. To support these functionalities, a variety of RDF archiving systems and frameworks have been proposed. These systems, along with their characteristics (the archiving strategy that each system/framework implements, its ability to answer SPARQL queries and to identify equivalent blank nodes across versions, and its support for versioning concepts such as committing, merging and branching), are shown in Table 1. In their simplest form, the systems store all the different snapshots (versions) of a dataset (full materialization); alternative proposals include delta-based approaches, the use of temporal annotations (Annotated Triples), and hybrid approaches that combine the above techniques.

System / Framework    Archiving Policy        SPARQL support    Blank Nodes support    Versioning Concepts
x-RDF-3X [8]          Annotated Triples
SemVersion [11]       Full Materialization
Cassidy et al. [1]    Delta Based
R&Wbase [10]          Annotated Triples
R43ples [4]           Delta Based
TailR [7]             Hybrid Approach
Im et al. [5]         Delta Based
Memento [9]           Full Materialization

Table 1: An overview of RDF archiving systems and frameworks
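
To make the archiving policies more concrete, the following is a minimal Python sketch of how the same three versions of a tiny dataset could be stored under full materialization, a delta-based policy and annotated triples. It is an illustration only and does not reflect the implementation of any listed system; the example data and all names (`build_deltas`, `materialize_annotated`, etc.) are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Delta = Tuple[Set[Triple], Set[Triple]]            # (added, removed)
Intervals = List[List[Optional[int]]]              # [start, end) pairs, end None = still valid

# Hypothetical example data: three versions of one small dataset.
V0: Set[Triple] = {("ex:alice", "ex:worksAt", "ex:forth")}
V1: Set[Triple] = V0 | {("ex:alice", "ex:knows", "ex:bob")}
V2: Set[Triple] = {("ex:alice", "ex:knows", "ex:bob")}
VERSIONS: List[Set[Triple]] = [V0, V1, V2]


# Full materialization: every snapshot is stored as is.
def materialize_full(store: List[Set[Triple]], i: int) -> Set[Triple]:
    return set(store[i])


# Delta-based: the first snapshot plus one (added, removed) pair per later version.
def build_deltas(versions: List[Set[Triple]]) -> Tuple[Set[Triple], List[Delta]]:
    deltas = [(curr - prev, prev - curr) for prev, curr in zip(versions, versions[1:])]
    return set(versions[0]), deltas


def materialize_delta(base: Set[Triple], deltas: List[Delta], i: int) -> Set[Triple]:
    current = set(base)
    for added, removed in deltas[:i]:              # replay the first i deltas
        current = (current - removed) | added
    return current


# Annotated triples: each triple carries the version intervals in which it holds.
def build_annotations(versions: List[Set[Triple]]) -> Dict[Triple, Intervals]:
    annotations: Dict[Triple, Intervals] = {}
    for i, version in enumerate(versions):
        for triple in version:
            intervals = annotations.setdefault(triple, [])
            if not intervals or intervals[-1][1] is not None:
                intervals.append([i, None])        # triple (re)appears in version i
        for triple, intervals in annotations.items():
            if triple not in version and intervals[-1][1] is None:
                intervals[-1][1] = i               # triple disappears in version i
    return annotations


def materialize_annotated(annotations: Dict[Triple, Intervals], i: int) -> Set[Triple]:
    return {
        triple
        for triple, intervals in annotations.items()
        if any(start <= i and (end is None or i < end) for start, end in intervals)
    }


BASE, DELTAS = build_deltas(VERSIONS)
ANNOTATIONS = build_annotations(VERSIONS)
```

All three representations reconstruct the same snapshot for any version index; they differ in how much is stored and in how much work is needed to rebuild a version or to read off the changes between versions.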

Given the complexity of the problem and the multitude of parameters that need to be considered, objectively evaluating the pros and cons of each system is a challenging task that requires appropriate benchmarks. Benchmarking is an important process that allows not only the evaluation of different systems across different dimensions, but also the identification of the weak and strong points of each one. Thus, benchmarks act as a driver for improvement, and also allow users to make informed decisions regarding the quality of different systems for different problem types and settings. The problem of benchmarking archiving systems for Linked Data has been considered only very recently and, to the best of our knowledge, only two such benchmarks exist to date.

Fernandez et al. [2,3] have proposed a blueprint for benchmarking archiving systems for Semantic Web data. More specifically, the authors provide theoretical foundations for the design of data and queries to evaluate RDF archiving systems. To this end, they provide a formalization of archives that allows them to effectively describe the data corpus, and they provide guidelines on the selection of relevant and comparable queries. To instantiate these foundations in a real-world scenario, they introduced the BEAR benchmark. BEAR provides a well-described data corpus and a basic, but extensible, query testbed, along with an implementation and evaluation of the three archiving strategies: Full Materialization, Delta-Based and Annotated Triples.
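
The kinds of queries that an archive has to answer (single-version, cross-version and delta queries, as well as their combinations) can be sketched over a list of snapshots as follows. This is a hedged illustration in Python and not BEAR's actual query testbed; the triple-pattern representation (with `None` as a wildcard), the example data and all function names are assumptions made for this example.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]   # None acts as a wildcard


def matches(triple: Triple, pattern: Pattern) -> bool:
    return all(p is None or p == t for t, p in zip(triple, pattern))


def query_version(versions: List[Set[Triple]], i: int, pattern: Pattern) -> Set[Triple]:
    """Single-version query: evaluate the pattern on version i only."""
    return {t for t in versions[i] if matches(t, pattern)}


def query_across_versions(versions: List[Set[Triple]], pattern: Pattern) -> List[int]:
    """Cross-version query: indexes of all versions in which the pattern has a match."""
    return [i for i in range(len(versions)) if query_version(versions, i, pattern)]


def query_delta(versions: List[Set[Triple]], i: int, j: int) -> Tuple[Set[Triple], Set[Triple]]:
    """Delta query: triples added and removed when going from version i to version j."""
    return versions[j] - versions[i], versions[i] - versions[j]


def changed_versions(versions: List[Set[Triple]], pattern: Pattern) -> List[int]:
    """Combined query: versions whose answer to the pattern differs from the previous one."""
    return [
        i
        for i in range(1, len(versions))
        if query_version(versions, i, pattern) != query_version(versions, i - 1, pattern)
    ]


# Hypothetical example archive with three versions.
ARCHIVE: List[Set[Triple]] = [
    {("ex:alice", "ex:worksAt", "ex:forth")},
    {("ex:alice", "ex:worksAt", "ex:forth"), ("ex:alice", "ex:knows", "ex:bob")},
    {("ex:alice", "ex:knows", "ex:bob")},
]
```

For instance, `query_across_versions(ARCHIVE, ("ex:alice", "ex:knows", None))` lists the versions in which Alice knows someone, while `query_delta(ARCHIVE, 0, 2)` returns what was added and removed between the first and the last version.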

Meimaris and Papastefanatos [6] have proposed the EvoGen Benchmark Suite, a generator for evolving RDF data, used for benchmarking versioning and change detection approaches and tools. EvoGen is based on the LUBM generator, extended with 10 new classes and 19 new properties in order to support schema evolution. Their benchmarking methodology is based on a set of requirements and parameters that affect: (a) the data generation process, (b) the context of the tested application, and (c) the query workload, as required by the nature of the evolving data. EvoGen is an extensible and highly configurable benchmark generator in terms of the number of generated versions and the number of changes occurring from version to version. Similarly, the query workload is generated adaptively, following the configured data generation process. EvoGen takes into account the archiving strategy of the system under test by providing input data in the appropriate format (full versions, deltas, etc.).
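
The idea of a generator that is configurable in the number of versions and the number of changes per version can be illustrated with a toy Python sketch. It is not EvoGen and does not use LUBM; the function `generate_versions`, its parameters and the synthetic `ex:` triples are all hypothetical and only show how such parameters could drive the generated data.

```python
import random
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def generate_versions(
    num_versions: int,
    initial_size: int,
    changes_per_version: int,
    seed: int = 0,
) -> List[Set[Triple]]:
    """Generate a sequence of evolving snapshots (toy stand-in for a real generator)."""
    rng = random.Random(seed)          # seeded so that runs are reproducible
    next_id = 0

    def fresh_triple() -> Triple:
        # Every new triple gets a unique subject, so additions never collide.
        nonlocal next_id
        triple = (f"ex:s{next_id}", "ex:p", f"ex:o{rng.randrange(10)}")
        next_id += 1
        return triple

    current: Set[Triple] = {fresh_triple() for _ in range(initial_size)}
    versions = [set(current)]
    for _ in range(num_versions - 1):
        for _ in range(changes_per_version):
            if current and rng.random() < 0.5:
                # Deletion: sort first so the choice does not depend on set ordering.
                current.remove(rng.choice(sorted(current)))
            else:
                current.add(fresh_triple())
        versions.append(set(current))
    return versions


# Example configuration: 5 versions, 20 initial triples, 4 changes per version.
GENERATED = generate_versions(num_versions=5, initial_size=20, changes_per_version=4)
```

A system based on full materialization could be fed the snapshots directly, whereas a delta-based system could be fed the set differences between consecutive snapshots.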

A more detailed presentation of the versioning approaches and benchmarks can be found in [12].

[1] Cassidy, S., Ballantine, J.: Version Control for RDF Triple Stores. ICSOFT (ISDM/EHST/DC) 7, 5–12 (2007)
[2] Fernandez Garcia, J.D., Umbrich, J., Knuth, M., Polleres, A.: Evaluating Query and Storage Strategies for RDF Archives. In: SEMANTiCS (2016, forthcoming)
[3] Fernandez Garcia, J.D., Umbrich, J., Polleres, A.: BEAR: Benchmarking the Efficiency of RDF Archiving. Tech. rep., Department für Informationsverarbeitung und Prozessmanagement, WU Vienna University of Economics and Business (2015)
[4] Graube, M., Hensel, S., Urbas, L.: R43ples: Revisions for triples. LDQ (2014)
[5] Im, D.H., Lee, S.W., Kim, H.J.: A version management framework for RDF triple stores. Int’l Journal of Software Engineering and Knowledge Engineering 22(01), 85–106 (2012)
[6] Meimaris, M., Papastefanatos, G.: The EvoGen Benchmark Suite for Evolving RDF Data. MeDAW (2016)
[7] Meinhardt, P., Knuth, M., Sack, H.: TailR: a platform for preserving history on the web of data. In: Int’l Conf. on Semantic Systems. pp. 57–64. ACM (2015)
[8] Neumann, T., Weikum, G.: x-RDF-3X: fast querying, high update rates, and consistency for RDF databases. VLDB Endowment 3(1-2), 256–263 (2010)
[9] Van de Sompel, H., Sanderson, R., Nelson, M.L., Balakireva, L.L., Shankar, H., Ainsworth, S.: An http-based versioning mechanism for linked data. arXiv preprint arXiv:1003.3661 (2010)
[10] Vander Sande, M., Colpaert, P., Verborgh, R., Coppens, S., Mannens, E., Van de Walle, R.: R&Wbase: git for triples. In: LDOW (2013)
[11] Völkel, M., Groza, T.: SemVersion: An RDF-based ontology versioning system. In: IADIS Int’l Conf. WWW/Internet. vol. 2006, p. 44 (2006)
[12] Papakonstantinou, V., Flouris, G., Fundulaki, I., Stefanidis, K., Roussakis, G.: Versioning for Linked Data: Archiving Systems and Benchmarks. In: BLINK (2016, forthcoming)

Vassilis Papakonstantinou, Irini Fundulaki, ICS-FORTH, Greece
