Generating Big Linked Data for HOBBIT: Mimicking Production Machine and IT Data

With the ongoing digitalization of manufacturing, supported by European initiatives like Factory of the Future and national programmes like Industrie 4.0, processing big data from sensors, machine logs, Internet of Things (IoT) devices, and other sources is becoming increasingly important, especially when it comes to delivering added value to customers.

The new USU department Katana, a spin-off of the research department, addresses this need for its customers. Processing machine data, however, requires a stable IT platform, which therefore has to be monitored in order to quickly identify issues and adapt resources to platform usage. Within HOBBIT, USU simulates machine data, covering the big-data processing use case, and additionally simulates IT data, covering the monitoring use case.

Mimicking Production Machine Data

The simulated production data represent log-data of a printing machine (cf. Figure 1).

Figure 1: Ryobi offset press (By User Vohvelirauta on fi.wikipedia [Public domain], via Wikimedia Commons)

A printing machine processes print jobs of varying length. Each print job begins with a start event and ends with a finish event. In between there are several other events such as washing the plates, missing sheets, feeding issues, and other critical and non-critical events. Within HOBBIT, USU analysed the dependencies and correlations between these events and developed models for stochastic and deterministic simulations of this behaviour; the findings were validated with domain experts and against the original data sets. The simulation uses probability distribution functions f(t) that represent the time lags between events: beginning with randomly generated start events, the intermediate events are created by sampling from the previously fitted distribution functions. Figure 2 shows the original and the simulated data; Δt_{i,j} denotes the distribution function for the time lag between two events i and j.

Figure 2: Comparison between simulated and real event-data from printing machine
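As a minimal sketch of this idea (the event names, the log-normal form of f(t), and all parameters below are purely illustrative assumptions, not the distributions fitted within HOBBIT), one simulated print job could be generated like this:

```python
import random

# Assumed time-lag distributions Δt_{i,j}: for each pair of consecutive event
# types (i -> j) we pretend the lag in seconds follows a log-normal distribution.
# In practice these parameters would be fitted to the real machine logs.
TIME_LAG_PARAMS = {
    ("start", "wash_plates"):        (3.0, 0.4),   # (mu, sigma) of the log-normal
    ("wash_plates", "feeding_issue"): (5.0, 0.8),
    ("feeding_issue", "finish"):      (6.5, 0.5),
}

# For illustration the event order is fixed; the real models also vary which
# intermediate events occur and how often.
EVENT_SEQUENCE = ["start", "wash_plates", "feeding_issue", "finish"]


def simulate_print_job(job_start_time: float) -> list[tuple[float, str]]:
    """Generate one simulated print job as (timestamp, event) tuples."""
    events = [(job_start_time, EVENT_SEQUENCE[0])]
    t = job_start_time
    for prev, nxt in zip(EVENT_SEQUENCE, EVENT_SEQUENCE[1:]):
        mu, sigma = TIME_LAG_PARAMS[(prev, nxt)]
        t += random.lognormvariate(mu, sigma)   # sample the time lag Δt
        events.append((t, nxt))
    return events


if __name__ == "__main__":
    # Randomly generated start events, roughly one job every two hours.
    start = 0.0
    for _ in range(3):
        for ts, ev in simulate_print_job(start):
            print(f"{ts:12.1f}s  {ev}")
        start += random.expovariate(1 / 7200)
```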

Mimicking IT-Data

As mentioned above, USU processes the machine data on its big-data platform. The platform (cf. Figure 3) consists of a job server that distributes compute jobs among several compute nodes; these nodes read data from and write their results to the storage nodes. The compute jobs vary in their life spans (short and long-running jobs).

Figure 3: Big-Data platform monitoring
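A simplified model of such a platform might look like the following sketch, where the node count, the round-robin job assignment, and the short/long job mix are assumptions made for illustration only:

```python
import random

NUM_COMPUTE_NODES = 8   # assumed cluster size for the sketch


def sample_job_lifespan() -> float:
    """Mixture of short and long-running jobs (durations in seconds, assumed)."""
    if random.random() < 0.8:                  # assume 80% short jobs
        return random.expovariate(1 / 60)      # ~1 minute on average
    return random.expovariate(1 / 3600)        # ~1 hour on average


def distribute_jobs(num_jobs: int) -> dict[int, float]:
    """Job server: assign jobs round-robin and sum the busy time per node."""
    busy_time = {node: 0.0 for node in range(NUM_COMPUTE_NODES)}
    for i in range(num_jobs):
        node = i % NUM_COMPUTE_NODES
        busy_time[node] += sample_job_lifespan()
    return busy_time


if __name__ == "__main__":
    for node, seconds in distribute_jobs(200).items():
        print(f"node {node}: {seconds / 3600:.1f} h of compute time")
```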

Monitored measurement variables include, among others, network traffic, disk space, and CPU utilization, but also specific measures from the data storage Apache Cassandra. For fast writes, Cassandra writes data into fragmented tables, the so-called SSTables. At several stages Cassandra compacts these fragmented tables in order to achieve better read performance. The individual measurement variables and their correlations to each other are analysed and the findings are then validated. Figure 4 shows the amount of compacted data in bytes and the number of SSTables. As can be seen, the number of SSTables decreases with each compaction in the real as well as in the simulated data. The simulated data for this scenario therefore mimic the measurement values of the big-data platform.

Figure 4: Original and simulated measures and their correlation to each other
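As a rough illustration of that relationship (the compaction threshold and sizes below are invented for the sketch and are not Cassandra's actual internals), correlated metrics can be simulated by letting the SSTable count grow with writes and drop whenever a compaction fires:

```python
import random

COMPACTION_THRESHOLD = 32   # assumed: compact once this many SSTables exist


def simulate_cassandra_metrics(steps: int):
    """Yield (sstable_count, compacted_bytes) per time step."""
    sstables = 0
    for _ in range(steps):
        compacted_bytes = 0
        sstables += random.randint(1, 4)        # new SSTables from flushed writes
        if sstables >= COMPACTION_THRESHOLD:
            # Compaction merges the fragmented tables: the SSTable count drops
            # while the amount of compacted data spikes at the same time.
            compacted_bytes = sstables * random.randint(50, 150) * 1024 ** 2
            sstables = random.randint(1, 4)
        yield sstables, compacted_bytes


if __name__ == "__main__":
    for count, nbytes in simulate_cassandra_metrics(50):
        print(f"SSTables: {count:3d}   compacted: {nbytes / 1024**2:8.0f} MB")
```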

Some Statistics and Wrap-Up

This post presented the HOBBIT approach to simulating production and IT data. A typical printing machine produces about 100 MB of event data per month. With more than 3,000 machines installed worldwide, this amounts to more than 300 GB of data per month that need to be processed. For IT data, the volume can reach up to 120 MB per node and day. With more than 120 nodes installed for one USU customer, this currently produces about 15 GB of data per day and about 450 GB per month. The mimicking algorithms, however, allow simulating more machines for the event data and more nodes for the IT data if the use case or benchmark requires it.

The mimicking algorithms are provided as asynchronous REST web services, deployed within Docker containers to simplify deployment. Since the general approach does not depend on the use case, it can be applied to simulate data for other use cases as well; for event data this requires examining the correlations between the events and their time-lag distributions, which, unsurprisingly, turns out to be the hard part of the mimicking.
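A client of such an asynchronous service might interact with it roughly as follows; the endpoint paths, JSON fields, and port are hypothetical and do not represent the actual HOBBIT mimicking API:

```python
import time
import requests

BASE_URL = "http://localhost:8080"   # assumed: mimicking service running in Docker

# Submit a simulation job; an asynchronous service returns immediately with a job id.
resp = requests.post(
    f"{BASE_URL}/mimic/printing-machine",        # hypothetical endpoint
    json={"machines": 10, "duration_days": 30},  # hypothetical parameters
)
resp.raise_for_status()
job_id = resp.json()["jobId"]

# Poll until the generated data set is ready, then fetch the result.
while True:
    status = requests.get(f"{BASE_URL}/jobs/{job_id}").json()
    if status["state"] == "FINISHED":
        data = requests.get(f"{BASE_URL}/jobs/{job_id}/result").text
        print(data[:200])
        break
    time.sleep(5)
```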
