Instance Matching Benchmark for Spatial Data Challenge
The number of datasets published in the Web of Data as part of the Linked Data Cloud is constantly increasing. The Linked Data paradigm is based on the unconstrained publication of information by different publishers and the interlinking of Web resources across knowledge bases. In most cases, the cross-dataset links are not explicit in the datasets and must be determined automatically, using Instance Matching (IM) tools among other approaches. The large variety of techniques requires their comparative evaluation to determine which one is best suited for a given context. Performing such an assessment generally requires well-defined and widely accepted benchmarks that reveal the weak and strong points of the proposed techniques and/or tools.
A number of real and synthetic benchmarks that address different data linking challenges have been proposed for evaluating the performance of such systems. So far, only a limited number of link discovery benchmarks target the problem of linking geo-spatial entities.
However, some of the largest knowledge bases on the Linked Open Data Web are geo-spatial knowledge bases (e.g., LinkedGeoData, with more than 30 billion triples). Linking spatial resources requires techniques that differ from the classical, mostly string-based approaches. In particular, considering the topology of the spatial resources and the topological relations between them is of central importance to systems driven by spatial data.
We believe that due to the large amount of available geo-spatial datasets employed in Linked Data and in several domains, it is critical that benchmarks for geo-spatial link discovery are developed.
The proposed challenge, entitled “Instance Matching Benchmark for Spatial Data”, has been accepted at ISWC 2017 within the OM workshop.
The aim of the Instance Matching Benchmark for Spatial Data challenge is to test the performance of IM tools that implement string-based as well as topological approaches for identifying matching spatial entities. The IM frameworks will be evaluated for both accuracy (precision, recall and F-measure) and time performance. Systems that do not support the matching of spatial data can still participate in Task 1.
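As an illustration of the accuracy criteria, the sketch below computes precision, recall and F-measure from a set of discovered mappings and a reference alignment. This is a minimal, hypothetical example; the challenge's own evaluation harness may differ in its input format and tie-breaking details.

```python
def evaluate(found, reference):
    """Precision, recall and F-measure of discovered mappings
    (source, target) pairs against a reference alignment."""
    found, reference = set(found), set(reference)
    tp = len(found & reference)  # correctly discovered mappings
    precision = tp / len(found) if found else 0.0
    recall = tp / len(reference) if reference else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Hypothetical mappings: one of the two discovered pairs is correct.
p, r, f = evaluate({("s1", "t1"), ("s2", "t3")},
                   {("s1", "t1"), ("s2", "t2")})
# p = 0.5, r = 0.5, f = 0.5
```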
Tasks and Training Data
We will use TomTom datasets to create the appropriate benchmarks. TomTom datasets contain representations of traces (GPS fixes). Each trace consists of a number of points, and each point has a timestamp, longitude, latitude and speed (value and metric). The points are sorted by the timestamp of the corresponding GPS fix in ascending order.
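The point structure described above can be sketched as follows. The field names and the sample values are purely illustrative; the actual TomTom schema may name or encode these attributes differently.

```python
from dataclasses import dataclass

@dataclass
class TracePoint:
    # Illustrative field names, not the actual TomTom schema.
    timestamp: int      # e.g., seconds since the Unix epoch
    longitude: float
    latitude: float
    speed: float        # speed value
    speed_metric: str   # unit of the speed value, e.g. "km/h"

# A trace is a list of points, sorted by timestamp (ascending).
trace = [
    TracePoint(1500000060, 23.7275, 37.9838, 41.0, "km/h"),
    TracePoint(1500000000, 23.7261, 37.9794, 38.5, "km/h"),
]
trace.sort(key=lambda p: p.timestamp)
```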
This version of the challenge will comprise the following tasks:
- Task 1 (Linking) will measure how well the systems can match traces that have been altered using string-based approaches, along with the addition and deletion of intermediate points. As the TomTom dataset only contains coordinates, and in order to apply string-based modifications based on LANCE [1], we have replaced a number of those coordinates with labels retrieved from the Google Maps API, the Foursquare API and the Nominatim OpenStreetMap API. This task also includes changes to date formats and coordinate formats.
- Task 2 (Spatial) measures how well the systems can identify DE-9IM (Dimensionally Extended nine-Intersection Model) topological relations. The supported spatial relations are: Disjoint, Touches, Contains/Within, Covers/CoveredBy, Intersects, Crosses and Overlaps; the traces are represented in Well-Known Text (WKT) format. For each relation, a different pair of source and target datasets will be given to the participants.
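To make the DE-9IM relations concrete, the sketch below checks two of them for a pair of toy WKT traces. It assumes the Shapely library purely for illustration; participants are free to use any geometry stack that implements these predicates.

```python
from shapely import wkt

# Two toy "traces" in WKT that cross at the point (1, 1).
a = wkt.loads("LINESTRING (0 0, 2 2)")
b = wkt.loads("LINESTRING (0 2, 2 0)")

print(a.relate(b))    # full DE-9IM intersection matrix as a 9-character string
print(a.crosses(b))   # True: the interiors meet in a single point
print(a.disjoint(b))  # False: the geometries do intersect
```

The `relate` call returns the raw DE-9IM matrix, while the named predicates (`crosses`, `touches`, `disjoint`, ...) test specific patterns of that matrix.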
1. T. Saveta, E. Daskalaki, G. Flouris, I. Fundulaki, and A.-C. Ngonga Ngomo. LANCE: Piercing to the Heart of Instance Matching Tools. In ISWC, 2015.
Prerequisites for participation
Each participant must:
- Submit results for one, more, or even all of the proposed tasks. Each task is organized into two tests of different scales (i.e., number of instances to match):
- Sandbox (small scale): It contains two datasets, called source and target, as well as the set of expected mappings (i.e., the reference alignment).
- Mainbox (medium/large scale): It contains two datasets, called source and target. This test is blind, meaning that the reference alignment is not given to the participants.
- In both tests, the goal is to discover the matching pairs (i.e., mappings) between the instances in the source dataset and the instances in the target dataset.
Registration and Submission