During the last decade, Twitter has become one of the most important microblogging services for online news and social networking on the Web, with around 310 million monthly active users as of March 2016. The increasing popularity of Twitter as a data source for a multitude of applications, ranging from entity extraction to sentiment analysis around products, makes it an important dataset for benchmarking. For this reason, a large number of reference datasets based on Twitter were created (e.g., the Twitter7 dataset with 476 million tweets from 20 million users, covering the six-month period from June 1, 2009 to December 31, 2009). However, a request from Twitter made the Twitter7 dataset and similar datasets no longer available for public use.
Within HOBBIT, we circumvent the problem of generating Twitter-like data by providing TWIG, the Twitter Benchmark Generator. TWIG has two main functions: it mimics the Twitter network data stream, and it stores the mimicked results in an RDF serialisation. The following figure is a partial visualisation of the TWIG ontology.
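To make the RDF output format concrete, here is a minimal sketch of serialising one synthetic tweet as N-Triples. Note that the namespace and the property names (`sentBy`, `sentAt`, `content`) are hypothetical placeholders for illustration only; the actual terms of the TWIG ontology are shown in the figure, not here.

```python
# Minimal sketch: write one synthetic tweet as N-Triples.
# The IRIs and property names below are HYPOTHETICAL placeholders,
# not the actual TWIG ontology terms.
TWIG = "http://example.org/twig#"  # assumed namespace

def tweet_to_ntriples(user_id, tweet_id, timestamp, text):
    """Serialise a single (user, time, text) triple set as N-Triples lines."""
    subject = f"<{TWIG}tweet/{tweet_id}>"
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return "\n".join([
        f'{subject} <{TWIG}sentBy> <{TWIG}user/{user_id}> .',
        f'{subject} <{TWIG}sentAt> "{timestamp}"'
        f'^^<http://www.w3.org/2001/XMLSchema#dateTime> .',
        f'{subject} <{TWIG}content> "{escaped}" .',
    ])

print(tweet_to_ntriples("42", "1001", "2009-06-01T12:30:00Z", "hello world"))
```

A real generator would stream many such records to a file on disk rather than printing them.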
TWIG is based on a crawl of Twitter that includes unique IDs for users, user tweet times and the tweets themselves. To generate synthetic tweets, TWIG first parses such a Twitter crawl. Based on the parsing results, it creates three probability distributions: one for the tweet daytime, which describes the probability of a tweet being sent at a specific minute of the day; one for the number of tweets, which describes the probability of a user sending a specific number of tweets in a given time period; and the Word Predecessor Successor Distribution, used for generating synthetic tweets, which describes the probability that a word is followed by another word under the Markov chain assumption. These probability distributions drive an automated process that generates random users, random tweet times and synthetic tweets. The generated data is stored in an RDF serialisation on the hard drive. Every synthetic tweet generated by the TWIG system is generated deterministically: the generator accepts a seed as a parameter, which is passed to its source of randomness.
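The word-level Markov chain and the seeded, deterministic generation can be sketched as follows. This is a simplified illustration of the technique, not TWIG's actual implementation; the function names and the toy crawl are made up for the example.

```python
import random
from collections import defaultdict

def build_successor_distribution(tweets):
    """Count, for every word, which words follow it in the crawl
    (a simple Word Predecessor Successor Distribution)."""
    successors = defaultdict(list)
    for tweet in tweets:
        words = tweet.split()
        for prev, nxt in zip(words, words[1:]):
            successors[prev].append(nxt)
        successors[words[-1]].append(None)  # end-of-tweet marker
    return successors

def generate_tweet(successors, start_word, seed, max_len=20):
    """Walk the Markov chain; a fixed seed makes the walk deterministic."""
    rng = random.Random(seed)  # the single source of randomness
    words = [start_word]
    while len(words) < max_len:
        nxt = rng.choice(successors[words[-1]])
        if nxt is None:  # reached an end-of-tweet marker
            break
        words.append(nxt)
    return " ".join(words)

# Toy "crawl" standing in for real parsed tweets
crawl = [
    "the cat sat on the mat",
    "the dog sat on the grass",
]
dist = build_successor_distribution(crawl)
print(generate_tweet(dist, "the", seed=42))
```

Because `random.Random(seed)` is initialised from the supplied seed, rerunning the generator with the same seed reproduces exactly the same synthetic tweets, which is what makes the benchmark data deterministic.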
With TWIG, we provide synthetic data that is very similar to real Twitter data and can be used to benchmark storage systems w.r.t. their performance when faced with Twitter streams. The main advantage of using TWIG is that it yields highly controllable, scalable and open data, which in turn enables clear and comparable benchmarking. We will thus use TWIG-generated data in the Mighty Storage Challenge at ESWC 2017. Check out the challenge and let us know whether you’d be interested in participating. More information on the other challenges organized by HOBBIT this year can be found here: project-hobbit.eu/challenges.
Authors: René Speck, Axel Ngonga Ngomo