A linkage pipeline for place records using multi-view encoders
Extracting information about Web entities has become commonplace in academia and industry alike. In particular, data about places stand out as rich sources of geolocated information and spatial context, serving as a foundation for a range of applications. These entities, however,...
Main Author: | COUSSEAU, Vinícius de Moraes Rêgo |
---|---|
Other Authors: | BARBOSA, Luciano de Andrade |
Format: | masterThesis |
Language: | por |
Published: |
Universidade Federal de Pernambuco
2020
|
Subjects: | |
Online Access: |
https://repositorio.ufpe.br/handle/123456789/38480 |
id |
ir-123456789-38480 |
---|---|
recordtype |
dspace |
spelling |
ir-123456789-38480 2020-11-04T05:17:07Z A linkage pipeline for place records using multi-view encoders COUSSEAU, Vinícius de Moraes Rêgo BARBOSA, Luciano de Andrade http://lattes.cnpq.br/2785182887468178 http://lattes.cnpq.br/7113249247656195 Databases (Banco de dados); Entity resolution (Resolução de entidades)
Extracting information about Web entities has become commonplace in academia and industry alike. In particular, data about places stand out as rich sources of geolocated information and spatial context, serving as a foundation for a range of applications. These entities, however, are inherently noisy and introduce several normalization problems, which must be tackled to obtain a clean database. Record linkage, also known as entity resolution, refers to the detection of replicated data from potentially multiple sources, and is one of the most critical cleaning processes to be conducted on a data set. This work presents a novel record linkage solution for large-scale Web-based place data, composed of three steps: generation of potential duplicate place pairs, place pair deduplication, and clustering of the classification results. The detection of duplicate places is the solution’s core, and is a complex and seldom addressed problem in this domain. Hence, the main contribution of this work is a model based on a deep neural network architecture, which uses encoders for different information levels of names, addresses, geographical coordinates, and categories. Each encoder uses distinct structures to generate representation vectors, which are concatenated, compared, and transported to a feature space that represents duplicates and non-duplicates. Additionally, this work proposes alternative classification models for real-time usage by means of APIs. The complete solution is analyzed, with the classification model for place pairs evaluated on two distinct data sets and compared against the state of the art. As a result, the proposed solution is shown to handle large quantities of data in a production environment, and the classification model outperforms the baselines on both data sets, thus constituting a complete and efficient solution for the record linkage problem in the place data domain.
Extracting information about Web entities is common practice in both academia and industry. In particular, data about points of interest stand out as a rich source of geolocated information and spatial context, serving as a basis for a variety of applications. These entities, however, are inherently noisy and introduce several normalization problems that must be resolved to obtain a clean database. Entity resolution, which refers to the detection of replicated data potentially coming from several sources, is one of the most critical cleaning processes to be carried out. This work presents an original entity resolution solution for large-scale point-of-interest data originating predominantly from the Web, composed of three steps: generation of potential duplicate point-of-interest pairs, classification of pairs as duplicates or non-duplicates, and generation of clusters from the results of the duplicate pair classification. The detection of duplicate points of interest is the core of the solution, being a complex and seldom addressed problem. As the main contribution of the work, therefore, a model based on a deep neural network architecture is presented, which uses encoders for the different information levels of names, addresses, geographic coordinates, and categories. Each encoder uses distinct structures to generate representation vectors, which are concatenated, compared, and transported to a feature space that represents duplicates and non-duplicates. Additionally, alternative classification models are proposed for real-time use through APIs. The complete solution is analyzed, with the model for classifying point-of-interest pairs evaluated on two distinct data sets and compared with the state of the art in the area. As a result, the proposed solution proves capable of handling large amounts of data in a production environment, and the classification model outperforms the compared models on both data sets, constituting a complete and effective solution to the problem.
2020-11-03T21:26:52Z 2020-11-03T21:26:52Z 2020-08-14 masterThesis COUSSEAU, Vinícius de Moraes Rêgo. A linkage pipeline for place records using multi-view encoders. 2020. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de Pernambuco, Recife, 2020. https://repositorio.ufpe.br/handle/123456789/38480 por openAccess Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/ application/pdf Universidade Federal de Pernambuco UFPE Brasil Programa de Pos Graduacao em Ciencia da Computacao |
institution |
REPOSITORIO UFPE |
collection |
REPOSITORIO UFPE |
language |
por |
topic |
Databases (Banco de dados); Entity resolution (Resolução de entidades) |
description |
Extracting information about Web entities has become commonplace in academia and industry alike. In particular, data about places stand out as rich sources of geolocated information and spatial context, serving as a foundation for a range of applications. These entities, however, are inherently noisy and introduce several normalization problems, which must be tackled to obtain a clean database. Record linkage, also known as entity resolution, refers to the detection of replicated data from potentially multiple sources, and is one of the most critical cleaning processes to be conducted on a data set. This work presents a novel record linkage solution for large-scale Web-based place data, composed of three steps: generation of potential duplicate place pairs, place pair deduplication, and clustering of the classification results. The detection of duplicate places is the solution’s core, and is a complex and seldom addressed problem in this domain. Hence, the main contribution of this work is a model based on a deep neural network architecture, which uses encoders for different information levels of names, addresses, geographical coordinates, and categories. Each encoder uses distinct structures to generate representation vectors, which are concatenated, compared, and transported to a feature space that represents duplicates and non-duplicates. Additionally, this work proposes alternative classification models for real-time usage by means of APIs. The complete solution is analyzed, with the classification model for place pairs evaluated on two distinct data sets and compared against the state of the art. As a result, the proposed solution is shown to handle large quantities of data in a production environment, and the classification model outperforms the baselines on both data sets, thus constituting a complete and efficient solution for the record linkage problem in the place data domain. |
author2 |
BARBOSA, Luciano de Andrade |
format |
masterThesis |
author |
COUSSEAU, Vinícius de Moraes Rêgo |
title |
A linkage pipeline for place records using multi-view encoders |
title_sort |
linkage pipeline for place records using multi-view encoders |
publisher |
Universidade Federal de Pernambuco |
publishDate |
2020 |
url |
https://repositorio.ufpe.br/handle/123456789/38480 |
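The three steps named in the description (candidate pair generation, pair classification, and clustering) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the grid-cell blocking rule, the token-overlap and distance rule that stands in for the neural classifier, the union-find clustering, and every name, threshold, and sample record below are hypothetical and introduced here only for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import asin, cos, radians, sin, sqrt


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi, dlmb = p2 - p1, radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))


def candidate_pairs(places, cell_deg=0.01):
    """Step 1 (assumed blocking rule): pair only places in the same grid cell.

    Places close to a cell border can be missed; a real system would also
    look at neighbouring cells or use a different blocking key.
    """
    buckets = defaultdict(list)
    for pid, p in places.items():
        key = (int(p["lat"] // cell_deg), int(p["lon"] // cell_deg))
        buckets[key].append(pid)
    for ids in buckets.values():
        yield from combinations(sorted(ids), 2)


def is_duplicate(a, b, max_dist_m=150.0, min_name_sim=0.5):
    """Step 2 stand-in: a hand-written rule, NOT the thesis's neural model."""
    ta, tb = set(a["name"].lower().split()), set(b["name"].lower().split())
    name_sim = len(ta & tb) / len(ta | tb) if ta | tb else 0.0  # Jaccard
    dist = haversine_m(a["lat"], a["lon"], b["lat"], b["lon"])
    return name_sim >= min_name_sim and dist <= max_dist_m


def cluster(place_ids, duplicate_pairs):
    """Step 3 (assumed): connected components over duplicate pairs (union-find)."""
    parent = {pid: pid for pid in place_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for pid in place_ids:
        groups[find(pid)].add(pid)
    return sorted(sorted(g) for g in groups.values())


def link(places):
    """Run the three steps end to end and return clusters of place ids."""
    dups = [(a, b) for a, b in candidate_pairs(places)
            if is_duplicate(places[a], places[b])]
    return cluster(places, dups)


# Hypothetical sample records, invented for this sketch.
PLACES = {
    "a": {"name": "Cafe Recife", "lat": -8.0631, "lon": -34.8711},
    "b": {"name": "Recife Cafe Bar", "lat": -8.0632, "lon": -34.8712},
    "c": {"name": "Livraria Central", "lat": -8.0633, "lon": -34.8713},
    "d": {"name": "Cafe Recife", "lat": -8.1500, "lon": -34.9500},
}

if __name__ == "__main__":
    print(link(PLACES))
```

In this sketch, blocking keeps the number of compared pairs far below the full quadratic set, and clustering by connected components makes duplicate decisions transitive: if x matches y and y matches z, all three end up in one cluster even when x and z were never compared directly.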