Mighty Storage Challenge – Datasets

Description of Datasets

Twitter Dataset

Our Twitter dataset (https://github.com/renespeck/TWIG) is derived from 1 million real tweets that were posted in June 2009. To ensure that we do not divulge any personal information, we used (1) a Markov model to generate text that resembles tweets and abides by the density distribution of words in tweets and (2) a tweet time distribution model that allows scaling up the number of agents generating tweets as well as the distribution of tweets over time. This way, we can ensure that the behavior of systems that ingest our tweets is similar to that of systems that ingest real tweets generated by the same number of users over the same period of time. The dataset follows a simple ontology which describes each tweet by the user who generated it, the time at which it was generated and its content.
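To illustrate the idea behind step (1), the following is a minimal sketch of a first-order (bigram) Markov text generator. It is not the TWIG implementation: the function names, the start and end markers, the word limit and the three sample sentences are all assumptions made for this example, and the order of the Markov model actually used is not stated above.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence boundary markers


def build_bigram_model(tweets):
    """Map each word to the list of words observed directly after it."""
    model = defaultdict(list)
    for tweet in tweets:
        words = [START] + tweet.split() + [END]
        for current, following in zip(words, words[1:]):
            model[current].append(following)
    return model


def generate_tweet(model, rng, max_words=30):
    """Walk the chain from START until END is drawn or max_words is reached."""
    words = []
    current = START
    while len(words) < max_words:
        # Successors are stored with repetition, so a uniform choice
        # reproduces the observed transition frequencies.
        current = rng.choice(model[current])
        if current == END:
            break
        words.append(current)
    return " ".join(words)


# Made-up training sentences standing in for real tweets.
sample_tweets = [
    "reading about linked data today",
    "linked data benchmarks are fun",
    "benchmarks are hard to design",
]
model = build_bigram_model(sample_tweets)
rng = random.Random(42)  # fixed seed so the run is repeatable
synthetic = generate_tweet(model, rng)
print(synthetic)
```

Because every generated word pair was observed in the training text, the output follows the word statistics of the input without reproducing any single input tweet by construction, although short corpora such as this one can still yield an exact copy by chance.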

Weidmüller Dataset

The molding machine dataset is provided by our partner Weidmüller. It consists of readings taken from sensors deployed on a plastic injection molding machine. The sensors measure various parameters of the production process: distance, pressure, time, frequency, volume, temperature (°C), time (s), speed and force. Each measurement is a 120-dimensional vector whose values are of different types (text, fractional, decimal), most of them fractional. Each measurement is timestamped and described using an IoT ontology. The dataset can be used in anomaly detection scenarios.

TomTom Dataset

The dataset is a text file containing a simple textual representation of the trace data (GPS fixes). Each line of the text file represents a single GPS fix. The lines are sorted by the time stamp of the corresponding GPS fix (ascending). The format of each line is:
<UTC unix time stamp [ms]> <longitude [°]> <latitude [°]> <speed [m/s]>
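
A line in this format could be parsed along the following lines. This is a minimal sketch, assuming that the four fields are separated by whitespace; the class name, the function names and the sample values are hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass
class GpsFix:
    timestamp: datetime  # UTC time of the fix
    longitude: float     # degrees
    latitude: float      # degrees
    speed: float         # metres per second


def parse_fix(line):
    """Parse one '<timestamp ms> <longitude> <latitude> <speed>' line."""
    ts_ms, lon, lat, speed = line.split()
    return GpsFix(
        # Integer milliseconds are added to the epoch to avoid float rounding.
        timestamp=EPOCH + timedelta(milliseconds=int(ts_ms)),
        longitude=float(lon),
        latitude=float(lat),
        speed=float(speed),
    )


def is_sorted_by_time(fixes):
    """Check the ascending time order that the file format promises."""
    return all(a.timestamp <= b.timestamp for a, b in zip(fixes, fixes[1:]))


# Made-up sample lines, not real trace data.
sample_lines = [
    "1467331200000 12.3731 51.3397 13.9",
    "1467331201000 12.3733 51.3398 14.2",
]
fixes = [parse_fix(line) for line in sample_lines]
print(fixes[0])
print(is_sorted_by_time(fixes))
```

Note that longitude comes before latitude in each line, which is the reverse of the order many mapping tools expect.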

LDBC Social Network Benchmark dataset

The Social Network Benchmark (SNB) provides a synthetic data generator (Datagen) which models an online social network (OSN) such as Facebook. This Datagen will be modified to produce RDF datasets with a real-world structuredness, as opposed to the many synthetic datasets used in benchmarking, which differ significantly from real-world RDF datasets in their level of structuredness. The dataset will be in TTL format. It is possible to generate datasets of different sizes. The benchmark defines a set of scale factors (SFs), targeting systems of different sizes and budgets. SFs are computed based on the ASCII size in gigabytes of the generated output files using the CSV serializer. For example, SF 1 weighs roughly 1 GB in CSV format, SF 3 weighs roughly 3 GB, and so on. The proposed SFs are the following: 1, 3, 10, 30, 100, 300, 1000. The size of the resulting dataset is mainly affected by two configuration parameters: the number of persons and the number of years simulated. Different SFs are obtained by scaling the number of persons in the network while fixing the number of years simulated. For example, SF 30 consists of the activity of a social network of 182K users during a period of three years. The data contains different types of entities and relations, such as persons with friendship relations among them, posts, comments or likes. Additionally, it reproduces many of the structural characteristics observed in real OSNs: attribute correlations, degree distributions, structure-attribute correlations, and spiky activity volume.

Semantic Publishing Benchmark Data

The SPB data generator uses ontologies and reference datasets provided by BBC to produce sets of creative works. The data generator supports the creation of arbitrarily large RDF datasets, in the order of billions of triples, that mimic the characteristics of the reference BBC datasets. The generator produces creative works that are valid instances of the BBC ontologies, which define numerous concepts and properties employed to describe this content. SPB uses seven core and three domain RDF ontologies provided by BBC. The former define the main entities and their properties, required to describe essential concepts of the benchmark, namely creative works, persons, documents, BBC products (news, music, sport, education, blogs), annotations (tags), provenance of resources and content management system information. The latter are used to express concepts from a domain of interest, such as football, politics and entertainment, among others. Reference datasets are employed by the data generator to produce the data of interest. These datasets are snapshots of the real datasets provided by BBC; in addition, GeoNames and DBpedia reference datasets have been included to further enrich the annotations with geo-locations, which enables the formulation of geospatial queries, and with person data. A creative work is described by a number of data value and object value properties; a creative work also has properties that link it to resources defined in reference datasets: those are the about and mentions properties, and their values can be any resource. The generator models three types of relations in the data: clustering of data, correlations of entities and random tagging of entities (the interested reader can find more details in [2]). The versioning benchmark will comprise versions of SPB datasets that follow the aforementioned principles of data generation.

The transport dataset of linked connections

A significant portion of people use public transport for their travels. The countless public transport services worldwide, combined with their usage, lead to an enormous source of information. Many public transport companies worldwide provide this data using the GTFS standard, which can be converted to Linked Data using the Linked Connections framework [1]. Such data is an ideal source for the benchmarking of systems because of its time and space dimensions. These datasets contain geospatial information about stops, temporal information about transit schedules and the interlinking between both. In many cases, benchmarking requires the ability to create synthetic datasets of any given size with specific properties. This is why we provide a public transport dataset generator that is able to create realistic public transport areas, networks and schedules. The generator can be configured to produce countless synthetic datasets using a wide range of parameters.


  1. Colpaert, P., Llaves, A., Verborgh, R., Corcho, O., Mannens, E., Van de Walle, R.: Intermodal public transit routing using linked connections. In: Proceedings of the 14th International Semantic Web Conference: Posters and Demos (2015)
  2. Kotsev, N. Minadakis, V.P.O.E.I.F., Kiryakov, A.: Benchmarking RDF query engines: The case of semantic publishing benchmark. In: BLINK Proceedings (2016)