Benchmarking Query Answering Systems for Linked Data using HOBBIT Benchmarks

“Alexa, how is the traffic?” This is one of the questions I ask most mornings in my kitchen, while trying to work out whether I still have time to complete my slow and painful brain-wake-up sequence or should rush out before the roads to work begin to fill up.
After the initial excitement, regular interactions with question answering systems have become more and more natural. As consumers’ expectations of systems able to answer questions formulated in natural language (QA systems) keep growing, so does the availability of such systems across settings, devices and languages. Use cases are proliferating well beyond my trivial desire to save a couple of minutes on my journey to work, ranging from disaster scenarios to child care. But how does a QA system work?

Question answering over Linked Data
A system that answers natural language questions on the basis of structured datasets can adopt a number of strategies. Most typically, however, such a system will analyse the question and translate it into an equivalent query in a query language; that query is then run against the dataset (the knowledge base) to retrieve the answer. Figure 1 summarises the basic steps of the process.

Figure 1 – From question to query
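
As a rough illustration of the analyse-and-translate step, here is a minimal Python sketch based on a single question template. It is not how any particular QA system works: real systems rely on much richer linguistic analysis. The `ex:` namespace, the `played` predicate and the `question_to_query` helper are all made up for the example and do not correspond to actual DBpedia identifiers.

```python
import re

# Hypothetical namespace, used only for illustration (not a DBpedia prefix).
PREFIX = "PREFIX ex: <http://example.org/>\n"

# Hypothetical question templates: a regex for the question paired with a
# SPARQL skeleton. Doubled braces are literal braces for str.format().
TEMPLATES = [
    (
        re.compile(r"^who played (?P<entity>[\w ]+)\?$", re.IGNORECASE),
        "SELECT ?x WHERE {{ ?x ex:played ex:{entity} . }}",
    ),
]


def question_to_query(question):
    """Return a SPARQL string for a recognised question, or None otherwise."""
    for pattern, skeleton in TEMPLATES:
        m = pattern.match(question.strip())
        if m:
            # Turn the surface form into an identifier-like name.
            entity = m.group("entity").strip().replace(" ", "_")
            return PREFIX + skeleton.format(entity=entity)
    return None


query = question_to_query("Who played Spock?")
print(query)
```

A question that matches no template, such as "How is the traffic?", simply yields `None` here; deciding what to do with such questions is one of the things that distinguishes real QA systems from this sketch.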

In HOBBIT, the knowledge base we use for the QA benchmarks is DBpedia. DBpedia organises factual information extracted from Wikipedia as a huge knowledge graph: a set of labelled nodes (entities) and directed arcs (relations) connecting them, much like the tiny fragment shown in Figure 2:

Figure 2 – A simple graph of facts[1]

Usually, relations between entities are represented as subject-predicate-object triples, such as (Leonard_Nimoy, played, Spock). DBpedia is, more or less, the connected graph defined by a very large set of such triples, which describe “facts” stated in Wikipedia.
Such a graph can be queried by means of the SPARQL query language, an example of which is at the bottom of Figure 1.
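
To make the idea of triples and graph queries concrete, here is a small Python sketch. It is a toy, not DBpedia or a SPARQL engine: the facts live in a plain list, and a hypothetical `match` helper plays the role of a single triple pattern, with `None` standing in for a query variable. Only the (Leonard_Nimoy, played, Spock) triple comes from the text above; the other triples and predicate names are illustrative assumptions.

```python
# Toy knowledge graph: each fact is a (subject, predicate, object) triple.
# Only the first triple is taken from the text; the rest are illustrative.
TRIPLES = [
    ("Leonard_Nimoy", "played", "Spock"),
    ("Leonard_Nimoy", "starredIn", "Star_Trek"),
    ("Spock", "characterIn", "Star_Trek"),
    ("Alec_Guinness", "played", "Obi-Wan_Kenobi"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a variable."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# "Who played Spock?" corresponds to the pattern (?actor, played, Spock).
actors = [s for (s, _, _) in match(TRIPLES, predicate="played", obj="Spock")]
print(actors)
```

A real SPARQL query combines several such patterns and joins them on shared variables, but each individual pattern behaves much like the single call to `match` shown here.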

How to benchmark
The crucial, original components of our QA benchmarks are datasets of natural language questions with associated SPARQL queries and answers, which are provided only at the training stage (see our previous blog post). All QA systems benchmarked on the HOBBIT platform are challenged with the same set of questions, so that precision (the proportion of answered questions that were answered correctly) and recall (the proportion of all questions in the dataset that were answered correctly) can be compared consistently.
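
The two measures, as defined in the paragraph above, can be sketched in a few lines of Python. This is a simplified illustration, not the HOBBIT platform's actual scoring code: the `precision_recall` helper and the sample data are hypothetical, and it assumes that an answer counts as correct only when the returned answer set matches the gold answer set exactly, and that an empty or missing answer means the question was left unanswered.

```python
def precision_recall(gold, predicted):
    """Compute question-level precision and recall.

    gold:      dict mapping each question to its set of correct answers
    predicted: dict mapping questions to the system's set of answers;
               a missing or empty entry counts as "not answered"
    """
    answered = {q: a for q, a in predicted.items() if q in gold and a}
    correct = sum(1 for q, a in answered.items() if set(a) == set(gold[q]))
    # Precision: correct answers over the questions the system answered.
    precision = correct / len(answered) if answered else 0.0
    # Recall: correct answers over every question in the dataset.
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Made-up data: four questions, three answered, two of them correctly.
gold = {"q1": {"a"}, "q2": {"b", "c"}, "q3": {"d"}, "q4": {"e"}}
predicted = {"q1": {"a"}, "q2": {"b", "c"}, "q3": {"x"}}

p, r = precision_recall(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f}")
```

A system that answers only the questions it is sure about can reach high precision with low recall, which is why both figures are reported side by side.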

In the future
In addition to a number of minor adjustments, we are working towards a better-designed multilingual QA benchmark, that is, one that can be used by QA systems answering questions in languages other than English. We expect a wide proliferation of such systems in the near future!


[1] From Gabrilovich, E., Murphy, K., Nickel, M., & Tresp, V. (2015). A Review of Relational Machine Learning for Knowledge Graphs: From Multi-Relational Link Prediction to Automated Knowledge Graph Construction. CoRR, abs/1503.00759.