Objective and motivation
HOBBIT is designed to provide a generic methodology as well as generic tools for benchmarking Big Linked Data frameworks. The objectives of the project were derived from (1) the requirements gathered from industrial partners, (2) previous work of the partners in Big Data projects such as BigDataEurope, GeoKnow, LDBC, DIACHRON and GrowSmarter, as well as (3) work in national projects and challenges including but not limited to SAKE, SmartDataWeb and Peer Energy Cloud.
In HOBBIT we employ mimicking algorithms that allow simulating real, industry-relevant data sources in the following domains:
- Print Machine Data
- IT Data
- Sensor Data from Plastic Injection Molding Machines (Weidmüller dataset)
- Social Network Data from Twitter
- Transport Data
Data is a, if not the, key asset of modern data-driven companies. Companies are therefore reluctant to make relevant internal data available for competitions, as doing so could significantly reduce their competitiveness. HOBBIT circumvents this problem by:
- employing mimicking algorithms that compute and reproduce the variables that characterize the structure of company data (e.g., for sensor data: the number of events per second, the type of value distribution (Gaussian, Poisson, etc.), and the mean and standard deviation; for graph data: the distribution of the branching factor, the number of edges per node, and the growth in nodes and edges per second; for textual data: the number of entities per document, the average document size, the number of documents per second, etc.) and
- feeding these characteristics into generators that generate data similar to the real company data without the real company data ever having to be made available to the public. The mimicking algorithms are implemented in such a way that they can be run and parameterized within the companies and simply return the parameters needed to feed the generators.
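As an illustration, the split between in-company parameter extraction and public data generation could look like the following sketch. The function names and the Gaussian inter-event model are our own simplification, not HOBBIT's actual implementation:

```python
import random
import statistics

def extract_parameters(event_timestamps):
    """Run inside the company: reduce raw data to summary parameters.
    Only these parameters ever leave the company."""
    gaps = [b - a for a, b in zip(event_timestamps, event_timestamps[1:])]
    return {"mean_gap": statistics.mean(gaps),
            "stdev_gap": statistics.stdev(gaps)}

def generate_events(params, n, start=0.0):
    """Run by the benchmark: reproduce a stream with the same
    inter-event characteristics, without seeing the original data."""
    t, events = start, []
    for _ in range(n):
        t += max(0.0, random.gauss(params["mean_gap"], params["stdev_gap"]))
        events.append(t)
    return events

# in-company step: summarize real timestamps (toy values here)
params = extract_parameters([0.0, 1.1, 1.9, 3.2, 4.0, 5.1])
# public step: generate synthetic events from the parameters alone
synthetic = generate_events(params, n=100)
```

Only the `params` dictionary crosses the company boundary; the generator can then be scaled to any number of events.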
Print Machine Data
The printing machine, in particular the offset printing machine, is a specific machine type in the production industry. It usually consists of different parts such as a feeder, several printing units, optional coating units and a delivery system. Machine operation is divided into printing jobs, which represent customer orders or parts of an order. The data used for mimicking are the event data generated during machine operation. A printing job usually starts with a start-job event and ends with a finish-job event. In between, several other events occur. Most of them are standard events within the expected operation of the machine; others indicate issues during operation. To mimic these data, USU data scientists analyzed the original machine data. Based on the time lags between different events, the probability distribution function was determined, and from it the cumulative probability distribution function was calculated. These functions are used within the mimicking algorithm to generate the mimicked data. The findings were validated by domain experts.
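A minimal sketch of how such a cumulative distribution function can drive event generation, using an empirical CDF over observed time lags and inverse-transform sampling. The helper names are hypothetical; the actual distributions were determined by USU's data scientists:

```python
import bisect
import random

def empirical_cdf(lags):
    """Build an empirical cumulative distribution from observed time lags."""
    values = sorted(lags)
    n = len(values)
    cdf = [(i + 1) / n for i in range(n)]
    return values, cdf

def sample_lag(values, cdf):
    """Inverse-transform sampling: draw u ~ U(0,1) and look up the
    smallest observed lag whose cumulative probability covers u."""
    u = random.random()
    i = bisect.bisect_left(cdf, u)
    return values[min(i, len(values) - 1)]

# toy observed lags (seconds) between consecutive events of a printing job
observed = [0.5, 0.7, 0.7, 1.2, 2.0, 3.5]
values, cdf = empirical_cdf(observed)
lag = sample_lag(values, cdf)  # a mimicked time lag until the next event
```

Repeating `sample_lag` between a start-job and a finish-job event yields a mimicked event stream for one job.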
The mimicking algorithm was implemented as a RESTful web service. The web service allows configuring the start and end date as well as the number of printing jobs per agent (printing machine). This makes it possible to simulate arbitrary setups of several printing machines and thereby to increase or decrease the amount of data generated for the benchmark. The web service currently supports the Linked Data formats XML (RDF/XML), N3, Turtle and N-Triples as well as JSON-LD. The ontology schema describes the general setup of the agent (printing machine) as well as the individual logged events. The principal algorithm for generating mimicked event data is generic and can therefore be applied to other machine types or devices producing event data. However, the main effort remains the identification of the distribution functions of the events.
IT Data
The IT Data use case concentrates on log data from USU's Big Data platform. Compute nodes process compute jobs; the data for job computation are read from an Apache Cassandra cluster and the results are written back into Cassandra. For efficient writing, Cassandra stores the data in partitioned tables called SSTables. To optimize reading, these tables are frequently compacted. Within the cluster, several measurement values are monitored, such as the number of SSTables and the compactions, as well as measures like CPU usage and network traffic. The cluster USU operates for one of its customers currently consists of 120 nodes, and this number is constantly rising. This currently results in about 15 GB of monitoring data per day. For the mimicking, we analysed the patterns of individual measurement values as well as the correlations between different values. We validated the findings with experts and finally built a model for the stochastic and deterministic simulation of the data. The model was integrated into a web service. The web service allows configuring the start and end date as well as the sampling rate for the mimicking. This makes it possible to simulate arbitrary setups of nodes within the cluster and thereby to increase or decrease the amount of data generated for the benchmark. The web service currently supports the Linked Data formats XML (RDF/XML), N3, Turtle and N-Triples as well as JSON-LD. The ontology schema largely complies with the one provided with the mimicked printing machine data. The setup includes a Docker file for easy deployment.
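To illustrate what a combined deterministic and stochastic simulation of a monitored value can look like, the toy sketch below superimposes a daily load curve (deterministic) and Gaussian noise (stochastic); it is our own stand-in, not USU's actual model, but it also shows how the configurable sampling rate scales the amount of generated data:

```python
import math
import random

def mimic_cpu_usage(start_s, end_s, sample_rate_s):
    """Generate (timestamp, value) samples for a CPU-usage-like metric:
    a deterministic daily sine curve plus stochastic Gaussian noise,
    clamped to the valid 0-100 % range."""
    t, samples = start_s, []
    while t < end_s:
        daily = 50 + 30 * math.sin(2 * math.pi * (t % 86400) / 86400)
        noise = random.gauss(0, 5)
        samples.append((t, min(100.0, max(0.0, daily + noise))))
        t += sample_rate_s
    return samples

# one simulated day at a one-minute sampling rate
day = mimic_cpu_usage(0, 86400, 60)
```

Halving the sampling rate doubles the volume of benchmark data without touching the model itself.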
Sensor Data from Plastic Injection Molding Machines (Weidmüller dataset)
The Weidmüller dataset consists of readings taken from sensors deployed on a plastic injection molding machine. The sensors measure various parameters of the production process: distance, pressure, time, frequency, volume, temperature, speed, force, etc. Each measurement is a timestamped, 120-dimensional vector consisting of values of different types, mostly fractional values. The first step of our mimicking approach is to automatically classify all dimensions into three groups: constant, trending and stateful. A constant dimension has only one value for all data instances. A trending dimension exhibits ascending or descending growth. All other dimensions are considered stateful.
In the second step we take each individual dimension and apply a mimicking technique based on how the dimension has been classified. For a constant dimension we take a random constant that is not far from the original dimension's value. For a trending dimension we take the first value and an increment, and produce every next value as the previous value plus the increment; at random moments we subtract some other value. For a stateful dimension we follow a more sophisticated scheme that allows us to mimic the states and state transitions found in the original data. This scheme includes three steps: first, we cluster the dimension with the k-means algorithm, using an automatically computed k for this dimension, which allows us to assign a cluster to each data instance; second, for each cluster we compute the mean value and the standard deviation; and last, iterating through the dimension, we compute the one-step cluster transition probabilities (a Markov model).
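The classification and the stateful scheme can be sketched as follows. This is a simplified illustration: the k-means step is assumed to have been run already (the `clusters` argument holds the resulting cluster ids), and all function names are ours:

```python
import random
from collections import Counter, defaultdict

def classify(dim):
    """Assign a dimension to one of the three groups."""
    if len(set(dim)) == 1:
        return "constant"
    diffs = [b - a for a, b in zip(dim, dim[1:])]
    if all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs):
        return "trending"
    return "stateful"

def mean_sd(values):
    m = sum(values) / len(values)
    return m, (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5

def mimic_stateful(dim, clusters, n):
    """Mimic a stateful dimension given the per-instance cluster ids
    (in practice obtained via k-means with an auto-computed k)."""
    # step 2: per-cluster mean and standard deviation
    by_cluster = defaultdict(list)
    for value, c in zip(dim, clusters):
        by_cluster[c].append(value)
    stats = {c: mean_sd(vs) for c, vs in by_cluster.items()}
    # step 3: one-step transition counts -> Markov transition model
    transitions = defaultdict(Counter)
    for a, b in zip(clusters, clusters[1:]):
        transitions[a][b] += 1
    # generate: sample a value from the current state, then transition
    state, out = clusters[0], []
    for _ in range(n):
        mu, sd = stats[state]
        out.append(random.gauss(mu, sd))
        nxt = transitions[state]
        if nxt:
            state = random.choices(list(nxt), weights=list(nxt.values()))[0]
    return out
```

The same `classify`/`mimic_*` split applies to all 120 dimensions of a measurement vector.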
Social Network Data from Twitter
The Twitter dataset is derived from 1 million real tweets generated in June 2009. To ensure that we do not divulge any personal information, we used (1) a Markov model to generate text that resembles tweets and abides by the density distribution of words in tweets and (2) a tweet time distribution model that allows scaling up the number of agents generating tweets as well as the distribution of tweet times. Therewith, we can ensure that the behavior of systems that ingest our tweets is similar to that of systems which ingest real tweets generated by the same number of users over the same period of time. The dataset uses a simple ontology that describes each tweet by the user who generated it, the time at which it was generated and its content.
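A word-level Markov model of the kind described can be sketched in a few lines. This is a bigram toy version trained on two made-up tweets; the actual model and its training corpus are not shown here:

```python
import random
from collections import defaultdict

def train_markov(tweets):
    """Bigram model over words; None marks tweet start and end."""
    model = defaultdict(list)
    for tweet in tweets:
        words = [None] + tweet.split() + [None]
        for a, b in zip(words, words[1:]):
            model[a].append(b)
    return model

def mimic_tweet(model, max_words=20):
    """Walk the chain from the start marker until the end marker,
    reproducing the word-transition frequencies of the training data."""
    word, out = None, []
    while len(out) < max_words:
        word = random.choice(model[word])
        if word is None:
            break
        out.append(word)
    return " ".join(out)

model = train_markov(["big data is big", "data is key"])
fake = mimic_tweet(model)  # text resembling the training tweets
```

Because successors are stored with repetition, frequent word pairs are sampled proportionally often, which preserves the word density distribution.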
Transport Data
The number of users of navigation services continues to grow, whether through a vehicle's built-in unit, a dedicated device or a smartphone application. This enables the collection of extensive amounts of floating car data that can be used to extract information relevant to a number of applications, such as road administration, traffic management and jam-avoiding routing services, among many others. However, collecting a sufficiently large dataset comes with difficulties such as costs and privacy concerns. At TomTom we take the user's right to privacy very seriously (see https://www.tomtom.com/privacy/), so we developed a Synthetic Trace Generator, which facilitates the creation of an arbitrary quantity of data from a few statistical descriptions of the traffic. More specifically, it generates a desired number of synthetic individual traces, a trace being the list of positions recorded by one device (phone, car, etc.) throughout one day.
The generator uses probability distributions for variables such as the start and end locations of trips, their starting time, and the device's update frequency. Using parameters sampled from these distributions, a map is then used to find an appropriate route for the trip, and successive points are generated at a regular time interval with typical speeds for each road.
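A much-simplified sketch of the trace generation step: a straight line stands in for the map-based routing, but the idea of sampling trip parameters from distributions and emitting position fixes at a regular update interval is kept. All names and distributions here are illustrative, not TomTom's actual generator:

```python
import random

def synthetic_trace(start, end, speed_mps, update_s):
    """Emit (t, x, y) fixes at a regular update interval along a trip.
    A straight line replaces the map-based routing step of the real
    generator; speed is assumed constant for the whole trip."""
    (x0, y0), (x1, y1) = start, end
    dist = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
    duration = dist / speed_mps
    n_updates = int(duration // update_s)
    trace = []
    for i in range(n_updates + 1):
        f = (i * update_s) / duration  # fraction of the trip completed
        trace.append((i * update_s, x0 + f * (x1 - x0), y0 + f * (y1 - y0)))
    return trace

# trip parameters sampled from (here, uniform stand-in) distributions
start = (random.uniform(0, 10_000), random.uniform(0, 10_000))
end = (random.uniform(0, 10_000), random.uniform(0, 10_000))
trace = synthetic_trace(start, end, speed_mps=13.9, update_s=10.0)
```

Generating one such trace per simulated device and day yields an arbitrarily large floating-car dataset without touching real user positions.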