A solution to extractive summarization based on document type and a new measure for sentence similarity
The Internet is an enormous and fast-growing digital repository encompassing billions of documents that vary in subject, quality, reliability, etc. It is increasingly difficult to scavenge useful information from it. Thus, it is necessary to provide automatic techniques that allow use...
Main Author: | MELLO, Rafael Ferreira Leite de |
---|---|
Other Authors: | FREITAS, Frederico Gonçalves de |
Format: | doctoralThesis |
Language: | por |
Published: |
UNIVERSIDADE FEDERAL DE PERNAMBUCO
2016
|
Subjects: | |
Online Access: |
https://repositorio.ufpe.br/handle/123456789/15257 |
id |
ir-123456789-15257 |
---|---|
recordtype |
dspace |
spelling |
ir-123456789-15257 2019-10-26T00:46:38Z A solution to extractive summarization based on document type and a new measure for sentence similarity MELLO, Rafael Ferreira Leite de FREITAS, Frederico Gonçalves de LINS, Rafael Dueire http://lattes.cnpq.br/6190254569597745 http://lattes.cnpq.br/6195215666638965 Computer science Artificial intelligence Text mining Natural language processing The Internet is an enormous and fast-growing digital repository encompassing billions of documents that vary in subject, quality, reliability, etc. It is increasingly difficult to scavenge useful information from it. Thus, it is necessary to provide automatic techniques that allow users to save time and resources. Automatic text summarization techniques may offer a way out of this problem. Text summarization (TS) aims to automatically compress one or more documents and present their main ideas in less space. TS platforms receive one or more documents as input and generate a summary. In recent years, a variety of text summarization methods have been proposed. However, because document types differ (news, blogs, and scientific articles, for example), it has become difficult to build a general TS application that creates expressive summaries for each type. Another relevant, related problem is measuring the degree of similarity between sentences, which is used in applications such as text summarization, information retrieval, image retrieval, text categorization, and machine translation. Recent works report several efforts to evaluate sentence similarity by representing sentences as bag-of-words vectors or as trees of the syntactic information among words. However, most of these approaches do not take into consideration the sentence meaning or the word order. 
This thesis proposes: (i) a new text summarization solution that identifies the document type before performing the summarization; (ii) a new sentence similarity measure based on lexical, syntactic, and semantic evaluation to deal with the meaning and word order problems. Identifying the document type beforehand allows the summarization solution to select the methods that are most suitable for each type of text. This thesis also performs a detailed assessment of the most widely used text summarization methods to select those that create more informative summaries for news, blogs, and scientific articles. The proposed sentence similarity measure is completely unsupervised and reaches results similar to those of human annotators on the dataset proposed by Li et al. The proposed measure was satisfactorily applied to evaluate the similarity between summaries and to eliminate redundancy in multi-document summarization. The number of text documents has increased considerably, mainly owing to the rapid growth of the Internet. There are thousands of news articles, e-books, scientific articles, blogs, etc. It is therefore necessary to apply automatic techniques to extract information from this large mass of data. Text summarization can be used to deal with this problem. Text summarization (TS) creates compressed versions of one or more text documents. In other words, TS platforms receive one or more documents as input and generate a summary of them. In recent years, a large number of summarization techniques have been proposed. However, given the large number of document types (for example, news, blogs, and scientific articles), it is difficult to find a technique generic enough to create summaries for all types efficiently. In addition, another widely studied topic in text mining is the analysis of similarity between sentences. 
This similarity can be used in applications such as text summarization, information retrieval, image retrieval, text categorization, and translation. In general, the proposed techniques are based on word vectors or syntactic trees, so two problems are left unaddressed: meaning and word order. This thesis proposes: (i) a new text summarization solution that identifies the document type before performing the summarization; (ii) a new sentence similarity measure based on lexical, syntactic, and semantic analysis. Identifying the document type allows the summarization solution to select the best methods for each type of text. This thesis also carries out a detailed study of summarization methods to select those that create the most informative summaries for news, blogs, and scientific articles. The sentence similarity measure is completely unsupervised and achieves results similar to those of human annotators on the dataset proposed by Li et al. The proposed measure was also satisfactorily applied to evaluate the similarity between summaries and to eliminate redundancy in multi-document summarization. 2016-02-19T18:25:04Z 2016-02-19T18:25:04Z 2015-03-20 doctoralThesis https://repositorio.ufpe.br/handle/123456789/15257 por Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/ application/pdf UNIVERSIDADE FEDERAL DE PERNAMBUCO UFPE Brasil Programa de Pos Graduacao em Ciencia da Computacao |
institution |
REPOSITORIO UFPE |
collection |
REPOSITORIO UFPE |
language |
por |
topic |
Computer science Artificial intelligence Text mining Natural language processing |
spellingShingle |
Computer science Artificial intelligence Text mining Natural language processing MELLO, Rafael Ferreira Leite de A solution to extractive summarization based on document type and a new measure for sentence similarity |
description |
The Internet is an enormous and fast-growing digital repository encompassing billions of
documents that vary in subject, quality, reliability, etc. It is increasingly difficult
to scavenge useful information from it. Thus, it is necessary to provide automatic
techniques that allow users to save time and resources. Automatic text summarization
techniques may offer a way out of this problem. Text summarization (TS) aims to automatically
compress one or more documents and present their main ideas in less space. TS
platforms receive one or more documents as input and generate a summary. In recent years,
a variety of text summarization methods have been proposed. However, because document
types differ (news, blogs, and scientific articles, for example), it has become difficult
to build a general TS application that creates expressive summaries for each type. Another
relevant, related problem is measuring the degree of similarity between sentences, which is
used in applications such as text summarization, information retrieval, image retrieval, text
categorization, and machine translation. Recent works report several efforts to evaluate
sentence similarity by representing sentences as bag-of-words vectors or as trees of
the syntactic information among words. However, most of these approaches do not take
into consideration the sentence meaning or the word order. This thesis proposes: (i) a
new text summarization solution that identifies the document type before performing the
summarization; (ii) a new sentence similarity measure based on lexical,
syntactic, and semantic evaluation to deal with the meaning and word order problems.
Identifying the document type beforehand allows the summarization solution to select
the methods that are most suitable for each type of text. This thesis also performs a detailed
assessment of the most widely used text summarization methods to select those that create more
informative summaries for news, blogs, and scientific articles. The proposed sentence similarity
measure is completely unsupervised and reaches results similar to those of human
annotators on the dataset proposed by Li et al. The proposed measure was satisfactorily
applied to evaluate the similarity between summaries and to eliminate redundancy in
multi-document summarization. |
author2 |
FREITAS, Frederico Gonçalves de |
format |
doctoralThesis |
author |
MELLO, Rafael Ferreira Leite de |
author_sort |
MELLO, Rafael Ferreira Leite de |
title |
A solution to extractive summarization based on document type and a new measure for sentence similarity |
title_short |
A solution to extractive summarization based on document type and a new measure for sentence similarity |
title_full |
A solution to extractive summarization based on document type and a new measure for sentence similarity |
title_fullStr |
A solution to extractive summarization based on document type and a new measure for sentence similarity |
title_full_unstemmed |
A solution to extractive summarization based on document type and a new measure for sentence similarity |
title_sort |
solution to extractive summarization based on document type and a new measure for sentence similarity |
publisher |
UNIVERSIDADE FEDERAL DE PERNAMBUCO |
publishDate |
2016 |
url |
https://repositorio.ufpe.br/handle/123456789/15257 |
_version_ |
1648655320651661312 |
score |
13.657419 |