A linkage pipeline for place records using multi-view encoders
Main Author: | COUSSEAU, Vinícius de Moraes Rêgo |
---|---|
Other Authors: | BARBOSA, Luciano de Andrade |
Format: | masterThesis |
Language: | por |
Published: | Universidade Federal de Pernambuco, 2020 |
Subjects: | |
Online Access: | https://repositorio.ufpe.br/handle/123456789/38480 |
Summary: |
Extracting information about Web entities has become commonplace in academia and industry alike. In particular, place data stand out as a rich source of geolocated information and spatial context, serving as a foundation for a range of applications. These entities, however, are inherently noisy and introduce several normalization problems, which must be tackled in order to obtain a clean database. Record linkage, also known as entity resolution, refers to the detection of duplicate records from potentially multiple sources, and is one of the most critical cleaning processes to be conducted on a data set. This work presents a novel record linkage solution for large-scale Web-based place data, composed of three steps: generation of potential duplicate place pairs, place pair deduplication, and clustering of the classification results. The detection of duplicate places is the core of the solution, and is a complex and seldom addressed problem in this domain. Hence, the main contribution of this work is a model based on a deep neural network architecture, which uses encoders for different information levels of names, addresses, geographical coordinates, and categories. Each encoder uses distinct structures to generate representation vectors, which are concatenated, compared, and mapped into a feature space that represents duplications and non-duplications. Additionally, this work proposes alternative classification models for real-time usage by means of APIs. The complete solution is analyzed, with the classification model for place pairs being evaluated on two distinct data sets and compared against the state of the art. As a result, the proposed solution is shown to handle large quantities of data in a production environment, and the classification model outperforms the baselines on both data sets, thus constituting a complete and efficient solution for the record linkage problem in the place data domain. |
---|
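The three steps named in the summary (candidate pair generation, pair classification, clustering) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the toy records in `PLACES`, the grid-cell blocking in `candidate_pairs`, the 0.6 threshold, and above all the string-similarity check in `name_similarity`, which merely stands in for the multi-view neural classifier described above.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records: (id, name, latitude, longitude).
PLACES = [
    (0, "Cafe Central", -8.0631, -34.8711),
    (1, "Café Central Recife", -8.0632, -34.8712),
    (2, "Livraria Jaqueira", -8.0372, -34.9035),
    (3, "Cafe Central", -8.0633, -34.8710),
]


def candidate_pairs(places, cell=0.01):
    """Step 1 (assumed blocking scheme): pair only places in the same grid cell."""
    blocks = defaultdict(list)
    for pid, _name, lat, lon in places:
        blocks[(round(lat / cell), round(lon / cell))].append(pid)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def name_similarity(a, b):
    """Step 2 stand-in: character-level similarity in [0, 1], not a neural model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(ids, duplicate_pairs):
    """Step 3: merge pairwise decisions into groups with union-find."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for i in ids:
        groups[find(i)].append(i)
    return sorted(sorted(group) for group in groups.values())


def link_places(places, threshold=0.6):
    """Run the three steps end to end and return clusters of record ids."""
    names = {pid: name for pid, name, _lat, _lon in places}
    duplicates = [
        (a, b)
        for a, b in candidate_pairs(places)
        if name_similarity(names[a], names[b]) >= threshold
    ]
    return cluster(list(names), duplicates)


if __name__ == "__main__":
    print(link_places(PLACES))
```

On the toy records, records 0, 1, and 3 end up in one cluster and record 2 in its own. Fixed grid cells miss duplicates that fall on opposite sides of a cell boundary, so a real blocking step would need overlapping cells or a spatial index.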