A solution to extractive summarization based on document type and a new measure for sentence similarity

The Internet is an enormous and fast-growing digital repository encompassing billions of documents that vary in subject, quality, reliability, etc. It is increasingly difficult to extract useful information from it. Thus, it is necessary to provide automatic techniques that allow use...


Main Author: MELLO, Rafael Ferreira Leite de
Other Authors: FREITAS, Frederico Gonçalves de
Format: doctoralThesis
Language: por
Published: UNIVERSIDADE FEDERAL DE PERNAMBUCO 2016
Subjects:
Online Access: https://repositorio.ufpe.br/handle/123456789/15257
id ir-123456789-15257
recordtype dspace
spelling ir-123456789-15257 2019-10-26T00:46:38Z
A solution to extractive summarization based on document type and a new measure for sentence similarity
MELLO, Rafael Ferreira Leite de
FREITAS, Frederico Gonçalves de
LINS, Rafael Dueire
http://lattes.cnpq.br/6190254569597745
http://lattes.cnpq.br/6195215666638965
Computer science; Artificial intelligence; Text mining; Natural language processing

The Internet is an enormous and fast-growing digital repository encompassing billions of documents that vary in subject, quality, reliability, etc. It is increasingly difficult to extract useful information from it. Thus, it is necessary to provide automatic techniques that allow users to save time and resources. Automatic text summarization techniques may offer a way out of this problem. Text summarization (TS) aims at automatically compressing one or more documents to present their main ideas in less space. TS platforms receive one or more documents as input and generate a summary. In recent years, a variety of text summarization methods have been proposed. However, because document types differ (such as news, blogs, and scientific articles), it has become difficult to build a general TS application that creates expressive summaries for each type. Another relevant, related problem is measuring the degree of similarity between sentences, which is used in applications such as text summarization, information retrieval, image retrieval, text categorization, and machine translation. Recent works report several efforts to evaluate sentence similarity by representing sentences as bag-of-words vectors or as trees of the syntactic relations among words. However, most of these approaches do not take into consideration sentence meaning or word order.

This thesis proposes: (i) a new text summarization solution that identifies the document type before performing the summarization; (ii) a new sentence similarity measure based on lexical, syntactic, and semantic evaluation to deal with the meaning and word order problems. Identifying the document type beforehand allows the summarization solution to select the methods that are more suitable for each type of text. This thesis also performs a detailed assessment of the most widely used text summarization methods to select those that create more informative summaries in the news, blog, and scientific article contexts. The proposed sentence similarity measure is completely unsupervised and achieves results similar to those of human annotators on the dataset proposed by Li et al. The proposed measure was satisfactorily applied to evaluate the similarity between summaries and to eliminate redundancy in multi-document summarization.

The number of text documents has grown considerably, driven mainly by the rapid growth of the Internet. There are thousands of news articles, e-books, scientific articles, blogs, etc. It is therefore necessary to apply automatic techniques to extract information from this large mass of data. Text summarization can be used to deal with this problem. Text summarization (TS) creates compressed versions of one or more text documents. In other words, TS platforms receive one or more documents as input and generate a summary of them. In recent years, a large number of summarization techniques have been proposed. However, given the large number of document types (for example, news, blogs, and scientific articles), it is difficult to find a technique generic enough to create summaries for all types efficiently. In addition, another topic widely studied in text mining is the analysis of similarity between sentences. This similarity can be used in applications such as text summarization, information retrieval, image retrieval, text categorization, and translation. In general, the proposed techniques are based on word vectors or syntactic trees, so two problems are not addressed: meaning and word order. This thesis proposes: (i) a new text summarization solution that identifies the document type before performing the summarization; (ii) a new sentence similarity measure based on lexical, syntactic, and semantic analysis. Identifying the document type allows the summarization solution to select the best methods for each type of text. This thesis also carries out a detailed study of summarization methods to select those that create more informative summaries in the news, blog, and scientific article contexts. The sentence similarity measure is completely unsupervised and achieves results similar to those of human annotators on the dataset proposed by Li et al. The proposed measure was also satisfactorily applied to evaluate the similarity between summaries and to eliminate redundancy in multi-document summarization.

2016-02-19T18:25:04Z 2016-02-19T18:25:04Z 2015-03-20 doctoralThesis https://repositorio.ufpe.br/handle/123456789/15257 por Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/ application/pdf UNIVERSIDADE FEDERAL DE PERNAMBUCO UFPE Brasil Programa de Pos Graduacao em Ciencia da Computacao
institution REPOSITORIO UFPE
collection REPOSITORIO UFPE
language por
topic Computer science
Artificial intelligence
Text mining
Natural language processing
author2 FREITAS, Frederico Gonçalves de
format doctoralThesis
author MELLO, Rafael Ferreira Leite de
author_sort MELLO, Rafael Ferreira Leite de
title A solution to extractive summarization based on document type and a new measure for sentence similarity
publisher UNIVERSIDADE FEDERAL DE PERNAMBUCO
publishDate 2016
url https://repositorio.ufpe.br/handle/123456789/15257