143 research outputs found

    Semantic relations between sentences: from lexical to linguistically inspired semantic features and beyond

    This thesis is concerned with the identification of semantic equivalence between pairs of natural language sentences, by studying and computing models to address Natural Language Processing tasks where some form of semantic equivalence is assessed. In such tasks, given two sentences, our models output either a class label, corresponding to the semantic relation between the sentences based on a predefined set of semantic relations, or a continuous score, corresponding to their similarity on a predefined scale. The former setup corresponds to the tasks of Paraphrase Identification and Natural Language Inference, while the latter corresponds to the task of Semantic Textual Similarity. We present several models for English and Portuguese, in which various types of features are considered, for instance features based on distances between alternative representations of each sentence, following lexical and semantic frameworks, or embeddings from pre-trained Bidirectional Encoder Representations from Transformers (BERT) models. For English, a new set of semantic features is proposed, derived from the formal semantic representation of Discourse Representation Structures (DRS). For Portuguese, suitable corpora are scarce and formal semantic representations are unavailable, hence an evaluation of currently available features and corpora is conducted, following the modelling setup employed for English. Competitive results are achieved on all tasks, for both English and Portuguese, particularly considering that our models are based on generally available tools and technologies, and that all features and models, except those based on embeddings, are suitable for computation on most modern computers. In particular, for English, our semantic features from DRS improve the performance of other models when integrated into their feature sets, and state-of-the-art results are achieved for Portuguese with models based on fine-tuning embeddings to a specific task.
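    A minimal sketch of the modelling setup described above, assuming simple lexical distance features (Jaccard overlap and a length ratio) and a scikit-learn classifier; the feature sets in the thesis (lexical, semantic and DRS-based) are far richer, and the data below is invented for illustration.

```python
# Toy lexical distance features for a sentence pair, fed to a classifier
# that predicts the semantic relation (here: paraphrase vs. not).
from sklearn.linear_model import LogisticRegression

def pair_features(s1: str, s2: str) -> list[float]:
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    jaccard = len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0
    len_ratio = min(len(t1), len(t2)) / max(len(t1), len(t2), 1)
    return [jaccard, len_ratio]

# Invented pairs: 1 = paraphrase, 0 = unrelated.
pairs = [("a man is playing a guitar", "a man plays the guitar", 1),
         ("a man is playing a guitar", "a dog runs in the park", 0)]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
clf = LogisticRegression().fit(X, y)
print(clf.predict(X))
```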

    Paraphrase detection in the Portuguese language using sentence embeddings

    Paraphrase detection (or identification) is the task of determining whether two or more sentences of arbitrary length have the same meaning. Methods for solving this task have potential applications in Natural Language Processing systems. This work investigates the combination of different sentence representation methods from vector space language models with linear classifiers for the problem of paraphrase detection in Portuguese. The results obtained in this work fall short of those obtained for the related task of textual entailment detection in the ASSIN evaluation for Portuguese; however, this work focuses on the application of sentence vector representations to paraphrase detection, and other features usually explored in systems of this kind can trivially be incorporated into our method to improve performance
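    A hypothetical sketch of this setup, assuming averaged word vectors as sentence embeddings and a linear classifier; a real system would load pre-trained Portuguese vectors, and the toy vocabulary and sentence pairs below are invented.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
VOCAB = ["o", "gato", "dorme", "felino", "corre"]
WORD_VECS = {w: rng.normal(size=50) for w in VOCAB}   # stand-in for real vectors

def embed(sentence: str) -> np.ndarray:
    """Average the word vectors of the sentence (a simple sentence embedding)."""
    vecs = [WORD_VECS[w] for w in sentence.lower().split() if w in WORD_VECS]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

def pair_vector(s1: str, s2: str) -> np.ndarray:
    e1, e2 = embed(s1), embed(s2)
    # One common way to combine two sentence embeddings for a linear model:
    # element-wise absolute difference concatenated with element-wise product.
    return np.concatenate([np.abs(e1 - e2), e1 * e2])

X = np.stack([pair_vector("o gato dorme", "o felino dorme"),   # paraphrase-like
              pair_vector("o gato dorme", "o gato corre")])     # not a paraphrase
y = [1, 0]
clf = LinearSVC().fit(X, y)
print(clf.predict(X))
```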

    Event extraction and representation: A case study for the Portuguese language

    Text information extraction is an important natural language processing (NLP) task, which aims to automatically identify, extract, and represent information from text. In this context, event extraction plays a relevant role, allowing actions, agents, objects, places, and time periods to be identified and represented. The extracted information can be represented by specialized ontologies, supporting knowledge-based reasoning and inference processes. In this work, we describe, in detail, our proposal for event extraction from Portuguese documents. The proposed approach is based on a pipeline of specialized natural language processing tools; namely, a part-of-speech tagger, a named-entity recognizer, a dependency parser, a semantic role labeler, and a knowledge extraction module. The architecture is language-independent, but its modules are language-dependent and can be built using adequate AI (i.e., rule-based or machine learning) methodologies. The developed system was evaluated with a corpus of Portuguese texts and the obtained results are presented and analysed. The current limitations and future work are discussed in detail
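    The pipeline architecture lends itself to a simple skeleton in which each language-dependent module adds an annotation layer to a shared document; the sketch below uses stub stages, with names and data structures invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Document:
    text: str
    annotations: dict = field(default_factory=dict)

# Each stage reads the text plus earlier layers and adds its own annotation.
def pos_tag(doc: Document) -> Document:
    doc.annotations["pos"] = [(tok, "NOUN") for tok in doc.text.split()]  # stub tagger
    return doc

def recognize_entities(doc: Document) -> Document:
    doc.annotations["entities"] = []  # stub NER: spans labelled PER/LOC/ORG/TIME
    return doc

def parse_dependencies(doc: Document) -> Document:
    doc.annotations["deps"] = []      # stub dependency parser
    return doc

def label_semantic_roles(doc: Document) -> Document:
    doc.annotations["srl"] = []       # stub semantic role labeler
    return doc

def extract_events(doc: Document) -> Document:
    # Stub knowledge extraction: combine the layers above into event frames
    # (action, agent, object, place, time) ready for an ontology.
    doc.annotations["events"] = []
    return doc

PIPELINE: list[Callable[[Document], Document]] = [
    pos_tag, recognize_entities, parse_dependencies,
    label_semantic_roles, extract_events,
]

def run(text: str) -> Document:
    doc = Document(text)
    for stage in PIPELINE:   # swapping a stage adapts the pipeline to a language
        doc = stage(doc)
    return doc

print(sorted(run("O presidente visitou Lisboa ontem.").annotations))
```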

    A question-answering machine learning system for FAQs

    With the increasing use of and dependence on the internet for gathering information, it is now essential to retrieve information efficiently according to users' needs. Question Answering (QA) systems aim to fulfil this need by providing the most relevant answer to a user's query expressed in natural language text or speech. Virtual assistants like Apple Siri and automated FAQ systems have become very popular, intensifying the drive to develop efficient, advanced and expedient QA systems. In the field of QA systems, this thesis addresses the problem of finding the FAQ question that is most similar to a user's query, whose foremost step is measuring the semantic similarity between the questions in the FAQ database and natural language text. The work explores unsupervised approaches to measuring semantic similarity for the development of a closed-domain QA system. To meet this objective, modern sentence representation techniques, such as BERT and FLAIR GloVe embeddings, are coupled with various similarity measures (cosine, Euclidean and Manhattan) to identify the best model. The developed models were tested with three FAQ datasets and the SemEval 2015 dataset for English; the best results were obtained by coupling BERT embeddings with the Euclidean distance measure, with a performance of 85.956% on one FAQ dataset. The model was also tested for Portuguese with the dataset of SNS24, the Portuguese health support phone line.
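    A sketch of the unsupervised matching step: embed the user query and every FAQ question, then rank by a similarity measure. A stand-in encoder replaces BERT or FLAIR GloVe here; only the three similarity measures named above follow the text, and the FAQ entries are invented.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in sentence encoder: a deterministic random vector per text.
    A real system would call a BERT or FLAIR GloVe encoder here."""
    seed = sum(ord(c) for c in text) % (2**32)
    return np.random.default_rng(seed).normal(size=768)

# The three measures compared in the thesis; distances are negated so that
# a higher score always means a closer match.
def cosine(a, b):    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
def euclidean(a, b): return -float(np.linalg.norm(a - b))
def manhattan(a, b): return -float(np.abs(a - b).sum())

faq = ["How do I renew my card?", "What are the opening hours?"]  # invented FAQ
query = "When are you open?"
q = embed(query)
for name, sim in [("cosine", cosine), ("euclidean", euclidean), ("manhattan", manhattan)]:
    scores = [sim(q, embed(question)) for question in faq]
    print(f"{name:9s} ->", faq[int(np.argmax(scores))])
```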

    Hybrid fuzzy multi-objective particle swarm optimization for taxonomy extraction

    Ontology learning refers to the automatic extraction of an ontology to produce the ontology learning layer cake, which consists of five kinds of output: terms, concepts, taxonomy relations, non-taxonomy relations and axioms. Term extraction is a prerequisite for all aspects of ontology learning; it is the automatic mining of complete terms from the input document. Another important part of an ontology is the taxonomy, or hierarchy of concepts, which presents a tree view of the ontology and shows the inheritance between subconcepts and superconcepts. In this research, two methods were proposed for improving the performance of the extraction results. The first method uses particle swarm optimization to optimize the weights of features. The advantage of particle swarm optimization is that it can calculate and adjust the weight of each feature towards an appropriate value; here it is used to improve the performance of term and taxonomy extraction. The second method uses a hybrid technique combining multi-objective particle swarm optimization with fuzzy systems, which ensures that the membership functions and fuzzy rule sets are optimized. The advantage of using a fuzzy system is that imprecise and uncertain feature weight values can be tolerated during the extraction process; this method is used to improve the performance of taxonomy extraction. In the term extraction experiment, five features were extracted for each term from the document, represented by feature vectors consisting of domain relevance, domain consensus, term cohesion, first occurrence and length of noun phrase. For taxonomy extraction, matches of Hearst lexico-syntactic patterns in documents and on the web, and hypernym information from WordNet, were used as the features representing each pair of terms from the texts. The two proposed methods were evaluated using a dataset of documents about tourism. For term extraction, the proposed method was compared with benchmark algorithms such as Term Frequency-Inverse Document Frequency, Weirdness, Glossary Extraction and Term Extractor, using precision as the evaluation measure. For taxonomy extraction, the proposed methods were compared with benchmark Feature-based and Support Vector Machine weighting methods, using the f-measure, precision and recall as evaluation measures. For the first method, the experimental results show that using particle swarm optimization to optimize the feature weights in term and taxonomy extraction improves the accuracy of the extraction results compared to the benchmark algorithms. For the second method, the results show that the hybrid technique combining multi-objective particle swarm optimization with fuzzy systems improves taxonomy extraction results compared to the benchmark methods, while adjusting the fuzzy membership functions and keeping the number of fuzzy rules to a minimum with a high degree of accuracy
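    A toy sketch of the first method, assuming an invented dataset and fitness function: a standard (single-objective) particle swarm searches for the five feature weights named above (domain relevance, domain consensus, term cohesion, first occurrence, noun-phrase length) that best score candidate terms.

```python
import numpy as np

rng = np.random.default_rng(42)
# Rows are candidate terms; columns stand in for the five features named above.
X = rng.random((20, 5))
true_w = np.array([0.4, 0.3, 0.1, 0.1, 0.1])   # hidden "ideal" weights (toy)
y = (X @ true_w > 0.5).astype(float)            # toy term/non-term labels

def fitness(w: np.ndarray) -> float:
    """Accuracy of labelling terms by thresholding the weighted feature score."""
    scores = X @ w
    pred = (scores > np.median(scores)).astype(float)
    return float((pred == y).mean())

n_particles, dim, iters = 15, 5, 50
pos = rng.random((n_particles, dim))
vel = np.zeros((n_particles, dim))
pbest, pbest_f = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_f.argmax()].copy()

for _ in range(iters):
    r1, r2 = rng.random((n_particles, dim)), rng.random((n_particles, dim))
    # Classic velocity update: inertia + pull towards personal and global bests.
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0.0, 1.0)
    f = np.array([fitness(p) for p in pos])
    improved = f > pbest_f
    pbest[improved], pbest_f[improved] = pos[improved], f[improved]
    gbest = pbest[pbest_f.argmax()].copy()

print("best weights:", np.round(gbest, 2), "accuracy:", fitness(gbest))
```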

    Encyclopaedic question answering

    Open-domain question answering (QA) is an established NLP task which enables users to search for specific pieces of information in large collections of texts. Instead of using keyword-based queries and a standard information retrieval engine, QA systems allow the use of natural language questions and return the exact answer (or a list of plausible answers) with supporting snippets of text. In the past decade, open-domain QA research has been dominated by evaluation fora such as TREC and CLEF, where shallow techniques relying on information redundancy have achieved very good performance. However, this performance is generally limited to simple factoid and definition questions because the answer is usually explicitly present in the document collection. Current approaches are much less successful in finding implicit answers and are difficult to adapt to more complex question types which are likely to be posed by users. In order to advance the field of QA, this thesis proposes a shift in focus from simple factoid questions to encyclopaedic questions: list questions composed of several constraints. These questions have more than one correct answer which usually cannot be extracted from one small snippet of text. To correctly interpret the question, systems need to combine classic knowledge-based approaches with advanced NLP techniques. To find and extract answers, systems need to aggregate atomic facts from heterogeneous sources as opposed to simply relying on keyword-based similarity. Encyclopaedic questions promote QA systems which use basic reasoning, making them more robust and easier to extend with new types of constraints and new types of questions. A novel semantic architecture is proposed which represents a paradigm shift in open-domain QA system design, using semantic concepts and knowledge representation instead of words and information retrieval. The architecture consists of two phases: analysis, responsible for interpreting questions and finding answers, and feedback, responsible for interacting with the user. This architecture provides the basis for EQUAL, a semantic QA system developed as part of the thesis, which uses Wikipedia as a source of world knowledge and employs simple forms of open-domain inference to answer encyclopaedic questions. EQUAL combines the output of a syntactic parser with semantic information from Wikipedia to analyse questions. To address natural language ambiguity, the system builds several formal interpretations containing the constraints specified by the user and addresses each interpretation in parallel. To find answers, the system then tests these constraints individually for each candidate answer, considering information from different documents and/or sources. The correctness of an answer is not proved using a logical formalism; instead a confidence-based measure is employed. This measure reflects the validation of constraints from raw natural language, automatically extracted entities, relations and available structured and semi-structured knowledge from Wikipedia and the Semantic Web. When searching for and validating answers, EQUAL uses the Wikipedia link graph to find relevant information. This method achieves good precision and allows only pages of a certain type to be considered, but is affected by the incompleteness of the existing markup targeted towards human readers. In order to address this, a semantic analysis module which disambiguates entities is developed to enrich Wikipedia articles with additional links to other pages. The module increases recall, enabling the system to rely more on the link structure of Wikipedia than on word-based similarity between pages. It also allows authoritative information from different sources to be linked to the encyclopaedia, further enhancing the coverage of the system. The viability of the proposed approach was evaluated in an independent setting by participating in two competitions at CLEF 2008 and 2009. In both competitions, EQUAL outperformed standard textual QA systems as well as semi-automatic approaches. Having established a feasible way forward for the design of open-domain QA systems, future work will attempt to further improve performance to take advantage of recent advances in information extraction and knowledge representation, as well as by experimenting with formal reasoning and inferencing capabilities. EThOS - Electronic Theses Online Service, United Kingdom
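    A schematic sketch of the constraint-testing idea, assuming invented constraint functions, candidates and thresholds: each candidate answer is scored against every constraint in a formal interpretation and accepted on a confidence threshold rather than a logical proof.

```python
from typing import Callable

Candidate = dict  # e.g. {"name": ..., "type": ..., "links": [...]}
Constraint = Callable[[Candidate], float]

def type_is(expected: str) -> Constraint:
    """Hard constraint on the semantic type of the candidate."""
    return lambda c: 1.0 if c["type"] == expected else 0.0

def linked_to(page: str) -> Constraint:
    """Soft evidence from a Wikipedia-style link graph."""
    return lambda c: 0.9 if page in c["links"] else 0.1

def confidence(c: Candidate, constraints: list[Constraint]) -> float:
    return sum(check(c) for check in constraints) / len(constraints)

# One formal interpretation of a list question such as
# "Which German cities ...?": a type constraint plus a link constraint.
constraints = [type_is("city"), linked_to("Germany")]
candidates = [
    {"name": "Heidelberg", "type": "city", "links": ["Germany", "Neckar"]},
    {"name": "Rhine", "type": "river", "links": ["Germany"]},
]
answers = [c["name"] for c in candidates if confidence(c, constraints) > 0.7]
print(answers)  # ['Heidelberg']
```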

    Readability assessment and automatic text simplification: the analysis of Basque complex structures

    301 p. (eus); 217 p. (eng). In this thesis, we take the first steps towards automatically analysing the complexity and simplification of Basque texts. To analyse text complexity, we draw on work in other languages aimed at automatic text simplification and on a linguistic analysis of Basque corpora. From these analyses, we establish the linguistic foundations for automatically simplifying texts. To analyse complexity automatically, we create and implement ErreXail, a system based on linguistic features and machine learning techniques. In addition, we design the architecture of EuTS (Euskarazko Testuen Sinplifikatzailea), a system to automatically simplify Basque texts, defining the operations to be performed in each of its modules and, as a case study, implementing Biografix, a multilingual tool that simplifies parenthetical structures containing biographical information. Finally, we compile ETSC (Euskarazko Testu Sinplifikatuen Corpusa), a corpus of simplified Basque texts, which we use to compare the approach derived from our simplification analyses with other approaches. To carry out these comparisons, we also define an annotation scheme
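    A minimal sketch of an ErreXail-style complexity classifier, assuming two shallow features (average sentence and word length) and invented training texts; the thesis uses a much richer set of Basque linguistic features.

```python
from sklearn.tree import DecisionTreeClassifier

def features(text: str) -> list[float]:
    """Shallow readability features: average sentence and word length."""
    sents = [s for s in text.split(".") if s.strip()]
    words = text.split()
    avg_sent_len = len(words) / max(len(sents), 1)
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
    return [avg_sent_len, avg_word_len]

texts = ["Etxea handia da. Txakurra dut.",   # short, simple sentences
         "Aurreko mendeetako literatura-testuen egitura sintaktiko korapilatsuak aztertu genituen."]
labels = ["simple", "complex"]
clf = DecisionTreeClassifier().fit([features(t) for t in texts], labels)
print(clf.predict([features("Katua txikia da.")]))
```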

    Enhancing extractive summarization with automatic post-processing

    Doctoral thesis, Informatics (Computer Science), Universidade de Lisboa, Faculdade de Ciências, 2015. Any solution or device that may help people optimize their time doing productive work is of great help. The steadily increasing amount of information that each person must handle every day, either in professional tasks or in personal life, is becoming harder to process. By reducing the texts to be handled, automatic text summarization is a very useful procedure that can significantly reduce the amount of time people spend on many of their reading tasks. In the context of handling several texts, dealing with redundancy and focusing on relevant information are the major problems to be addressed in automatic multi-document summarization. The most common approach to this task is to build a summary with sentences retrieved from the input texts; this approach is named extractive summarization. The main focus of current research on extractive summarization has been algorithm optimization, striving to enhance the selection of content. However, gains related to increasing algorithm complexity have not yet been proved, as the summaries remain difficult for humans to process in a satisfactory way. A text built from different documents by extracting sentences from them tends to form a textually fragile sequence of sentences, whose elements tend to be weakly related. In the present work, tasks that modify and relate the summary sentences are combined in a post-processing procedure. These tasks include sentence reduction, paragraph creation and insertion of discourse connectives, seeking to improve the textual quality of the final summary delivered to human users. Thus, this dissertation addresses automatic text summarization from a different perspective, exploring the impact of post-processing extraction-based summaries in order to build fluent, cohesive texts and improved summaries for human usage. Fundação para a Ciência e a Tecnologia (FCT), SFRH/BD/45133/200
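    A schematic sketch of the post-processing stage, assuming naive placeholder implementations of the three tasks named above (sentence reduction, paragraph creation, connective insertion); the lists of openers and connectives are invented.

```python
# An extractive summary (a flat list of sentences) passes through sentence
# reduction, paragraph grouping and connective insertion, each a stand-in
# for the linguistically informed operations developed in the thesis.
REDUNDANT_OPENERS = ("In other words,", "As mentioned before,")
CONNECTIVE = "Moreover,"

def reduce_sentence(s: str) -> str:
    """Drop dispensable opening material from a sentence."""
    for opener in REDUNDANT_OPENERS:
        if s.startswith(opener):
            s = s[len(opener):].strip().capitalize()
    return s

def group_paragraphs(sents: list[str], size: int = 2) -> list[list[str]]:
    """Group consecutive sentences into fixed-size paragraphs."""
    return [sents[i:i + size] for i in range(0, len(sents), size)]

def insert_connectives(par: list[str]) -> str:
    """Link the sentences of a paragraph with a discourse connective."""
    out = [par[0]] + [f"{CONNECTIVE} {s[0].lower()}{s[1:]}" for s in par[1:]]
    return " ".join(out)

summary = ["The reform was approved in May.",
           "As mentioned before, funding doubled.",
           "Regional offices opened in June."]
sents = [reduce_sentence(s) for s in summary]
print("\n\n".join(insert_connectives(p) for p in group_paragraphs(sents)))
```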