214 research outputs found

    Rapport: a fact-based question answering system for Portuguese

    Question answering is one of the longest-standing problems in natural language processing. Although natural language interfaces for computer systems have become more common, the same cannot yet be said of access to specific textual information. Any full-text search engine can easily retrieve documents containing user-specified or closely related terms; however, it is typically unable to answer user questions with small passages or short answers. The problem with question answering is that text is hard to process, due to its syntactic structure and, to a higher degree, its semantic content. At the sentence level, although the syntactic aspects of natural language have well-known rules, the size and complexity of a sentence may make its structure difficult to analyze. Furthermore, semantic aspects are still arduous to address, with text ambiguity being one of the hardest problems to handle. There is also the need to correctly process the question in order to define its target, and then select and process the answers found in a text. Additionally, the selected text that may yield the answer to a given question must be further processed in order to present just a passage instead of the full text. These issues also take longer to address in languages other than English, such as Portuguese, which have far fewer people working on them. This work focuses on question answering for Portuguese. In other words, our interest lies in presenting short answers, passages, and possibly full sentences, but not whole documents, in response to questions formulated in natural language. For that purpose, we have developed a system, RAPPORT, built upon open information extraction techniques for extracting triples, so-called facts, that characterize the information in text files, and then storing and using them to answer user queries posed in natural language. 
These facts, in the form of subject, predicate and object, alongside other metadata, constitute the basis of the answers presented by the system. Facts work both by storing short and direct information found in a text, typically entity-related information, and by containing in themselves the answers to the questions, already in the form of small passages. As for the results, although there is room for improvement, they are tangible proof of the adequacy of our approach and its different modules for storing information and retrieving answers in question answering systems. In the process, in addition to contributing a new approach to question answering for Portuguese, and validating the application of open information extraction to question answering, we have developed a set of tools that has been used in other natural language processing works, such as a lemmatizer, LEMPORT, which was built from scratch and has high accuracy. Many of these tools result from improving those found in the Apache OpenNLP toolkit, by pre-processing their input, post-processing their output, or both, and from training models for use in those tools or others, such as MaltParser. Other tools include interfaces to other resources containing, for example, synonyms, hypernyms, or hyponyms, and lists of, for instance, relations between verbs and agents, created using rules.
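
The idea of storing facts as triples and answering directly from them can be illustrated with a minimal sketch. This is not RAPPORT's implementation: the `Fact` class, the `answer` function, the sample facts and the naive token-overlap matching are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Fact:
    """A (subject, predicate, object) triple plus minimal metadata."""
    subject: str
    predicate: str
    obj: str
    source: str  # hypothetical metadata: id of the originating document


def tokens(text: str) -> Set[str]:
    """Lowercase whitespace tokens with basic punctuation stripped."""
    return {t.strip("?.,!").lower() for t in text.split()} - {""}


def answer(question: str, facts: List[Fact]) -> Optional[str]:
    """Return the fact sharing the most tokens with the question, as a short passage."""
    q = tokens(question)
    best, best_score = None, 0
    for fact in facts:
        score = len(q & tokens(f"{fact.subject} {fact.predicate} {fact.obj}"))
        if score > best_score:
            best, best_score = fact, score
    if best is None:
        return None  # no fact shares any token with the question
    return f"{best.subject} {best.predicate} {best.obj}"


# Invented sample facts, standing in for triples extracted from text.
facts = [
    Fact("Lisbon", "is the capital of", "Portugal", "doc1"),
    Fact("Camões", "wrote", "Os Lusíadas", "doc2"),
]
print(answer("Who wrote Os Lusíadas?", facts))
```

The point of the sketch is that the stored triple is itself the short passage returned to the user; a real system would add lemmatization, question typing and ranking on top of this.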

    Neural models for information retrieval: towards asymmetry sensitive approaches based on attention models

    This work is situated in the context of information retrieval (IR) using machine learning (ML) and deep learning (DL) techniques. 
It concerns different tasks requiring text matching, such as ad-hoc search, question answering and paraphrase identification. The objective of this thesis is to propose new approaches, using DL methods, to construct semantic-based models for text matching, and to overcome the vocabulary mismatch problems of the classical bag-of-words (BoW) representations used in traditional IR models. Indeed, traditional text matching methods are based on the BoW representation, which considers a given text as a set of independent words. The process of matching two sequences of text is based on exact matching between words. The main limitation of this approach is vocabulary mismatch. This problem occurs when the text sequences to be matched do not use the same vocabulary, even if their subjects are related. For example, the query may contain several words that are not necessarily used in the documents of the collection, including relevant documents. BoW representations ignore several aspects of a text sequence, such as the structure of the text and the context of words. These characteristics are important and make it possible to differentiate between two texts that use the same words but express different information. Another problem in text matching is related to the length of documents. The relevant parts can be distributed in different ways in the documents of a collection. This is especially true in large documents, which tend to cover a large number of topics and include variable vocabulary. A long document could thus contain several relevant passages that a matching model must capture. Unlike long documents, short documents are likely to concern a specific subject and tend to contain a more restricted vocabulary. Assessing their relevance is in principle simpler than assessing that of longer documents. In this thesis, we have proposed different contributions, each addressing one of the above-mentioned issues. 
First, in order to solve the problem of vocabulary mismatch, we used distributed representations of words (word embeddings) to allow semantic matching between different words. These representations have been used in IR applications where document/query similarity is computed by comparing all the term vectors of the query with all the term vectors of the document, indiscriminately. Unlike the state-of-the-art models, we studied the impact of query terms with regard to their presence or absence in a document. We have adopted different document/query matching strategies. The intuition is that the absence of query terms from relevant documents is in itself a useful aspect to take into account in the matching process. Indeed, these terms do not appear in documents of the collection for two possible reasons: either their synonyms have been used or they are not part of the context of the documents considered. The methods we have proposed make it possible, on the one hand, to perform inexact matching between the document and the query, and on the other hand, to evaluate the impact of the different terms of a query on the matching process. Although the use of word embeddings allows semantic-based matching between different text sequences, these representations combined with classical matching models still consider the text as a list of independent elements (a bag of vectors instead of a bag of words). However, the structure of the text as well as the order of the words is important. Any change in the structure of the text and/or the order of words alters the information expressed. In order to solve this problem, neural models have been used in text matching. In our case, we first studied different state-of-the-art neural models for text matching, then we proposed two main approaches. First, we built a model that takes into account the structure of a text and the importance of its words. 
Specifically, we combined a position-based model with an attention-based model to build a text matching approach using position-based representations combined with attention-based weights of words. We believe that when the model is aware of the position and importance of words, the representations learned will provide more relevant features for the comparison process. We concluded that position, combined in an asymmetric configuration with the attention given to a word in a sequence, significantly improves the results. In a second step, we analyzed different neural text matching applications and grouped them into two main categories: (1) symmetric matching problems, which consist in identifying whether two texts of the same nature are semantically similar; (2) asymmetric matching problems, which consist in evaluating whether an input text provides the information sought in another text of a different nature. By studying the various existing neural models, we found that all the proposed models are based on a global Siamese architecture in which the different inputs of the model undergo the same processing, whatever the nature of the task, (1) or (2). In order to take the nature of the matching task into consideration, we proposed an asymmetry-sensitive architecture for neural text matching. In particular, we used an attention model to build a general architecture that extends different state-of-the-art neural models. Finally, to address problems related to document size in ad-hoc search using neural networks, we proposed an approach to extract relevance signals at different levels in a long document, in particular at the level of words, passages and the complete document. More precisely, we proposed a global multi-layer architecture to measure relevance at different levels, using attention models. This architecture is then used to extend several state-of-the-art models and to examine the contribution of relevance measured at different levels. 
Based on this general architecture, we also proposed a model that uses a recurrent layer to perform a kind of competitive interaction between the previously selected passages of a document that are likely to be relevant.
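
The vocabulary mismatch problem described in this abstract can be made concrete with a small sketch that contrasts exact bag-of-words matching with embedding-based soft matching. It is not the thesis's model: the two-dimensional vectors in `EMBEDDINGS` are invented, and taking each query term's best cosine similarity and averaging is just one simple way to compare query and document term vectors.

```python
import math

# Toy 2-dimensional "embeddings", invented for illustration only.
EMBEDDINGS = {
    "car": (0.9, 0.1),
    "automobile": (0.85, 0.15),
    "banana": (0.1, 0.9),
}


def exact_match_score(query, document):
    """Bag-of-words overlap: number of query terms that literally occur in the document."""
    return len(set(query) & set(document))


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_match_score(query, document):
    """Average, over query terms, of the best cosine similarity to any document term."""
    best = [
        max(cosine(EMBEDDINGS[q], EMBEDDINGS[d]) for d in document)
        for q in query
    ]
    return sum(best) / len(best)


query = ["car"]
print(exact_match_score(query, ["automobile"]))  # exact matching finds no shared term
print(soft_match_score(query, ["automobile"]))   # soft matching scores the synonym highly
print(soft_match_score(query, ["banana"]))       # and an unrelated term much lower
```

Note that this still treats each text as a bag of vectors, which is exactly the limitation the abstract goes on to address with position- and attention-aware neural models.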

    Neural Methods for Effective, Efficient, and Exposure-Aware Information Retrieval

    Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from those in these other application areas. A common form of IR involves ranking documents, or short passages, in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms, such as a person's name or a product model number, not seen during training, and should avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, retrieval involves extremely large collections, such as the document index of a commercial Web search engine, containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to retrieve efficiently from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. Comment: PhD thesis, Univ College London (2020)
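
The inverted index mentioned in this abstract can be sketched in a few lines. This is a generic textbook-style illustration, not code from the thesis; the function names `build_inverted_index` and `retrieve`, the whitespace tokenization and the sample documents are assumptions.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def retrieve(index, query):
    """Return ids of documents containing every query term (conjunctive lookup)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)


# Invented sample collection.
docs = {
    1: "neural ranking models",
    2: "inverted index structures",
    3: "neural inverted index",
}
index = build_inverted_index(docs)
print(retrieve(index, "neural index"))
```

Only the posting sets of the query terms are touched at query time, which is why this structure scales to collections far too large to scan document by document.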

    Information Retrieval with Entity Linking

    Despite the advantages of their low resource requirements, traditional sparse retrievers depend on exact matching between high-dimensional bag-of-words (BoW) representations of both the queries and the collection. As a result, retrieval performance is restricted by semantic discrepancies and vocabulary gaps. On the other hand, transformer-based dense retrievers introduce significant improvements in information retrieval tasks by exploiting low-dimensional contextualized representations of the corpus. While dense retrievers are known for their relative effectiveness, they suffer from lower efficiency and weaker generalization when compared to sparse retrievers. For a lightweight retrieval task, high computational resources and time consumption are major barriers that encourage abandoning dense models despite their potential gains. In this work, I propose boosting the performance of sparse retrievers by expanding both the queries and the documents with linked entities, using two formats for the entity names: 1) explicit and 2) hashed. A zero-shot end-to-end dense entity linking system is employed for entity recognition and disambiguation to augment the corpus. By leveraging advanced entity linking methods, I believe that the effectiveness gap between sparse and dense retrievers can be narrowed. Experiments are conducted on the MS MARCO passage dataset using the original qrel set, the re-ranked qrels favoured by MonoT5, and the latter set further re-ranked by DuoT5. Since I am concerned with early-stage retrieval in the cascaded ranking architectures of large information retrieval systems, the results are evaluated using recall@1000. The suggested approach is also capable of retrieving documents for query subsets judged to be particularly difficult in prior work. In addition, it is demonstrated that the non-expanded runs and the expanded runs with both explicit and hashed entities retrieve complementary results. 
Consequently, run combination methods such as run fusion and classifier selection are explored to maximize the benefits of entity linking. Given the success of entity methods for sparse retrieval, the proposed approach is also tested on dense retrievers. The corresponding results are reported in MRR@10.
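
A rough sketch of the expansion step and the two metrics named in this abstract follows. It rests on several assumptions: the entity names are taken as already produced by some entity linker, the hashing scheme (a truncated SHA-256 digest) is hypothetical rather than the one used in the work, and all function names are invented for illustration.

```python
import hashlib


def expand(text, entities, mode="explicit"):
    """Append linked entity names to the text, verbatim or as hashed tokens."""
    if mode == "explicit":
        extra = list(entities)
    elif mode == "hashed":
        # Hypothetical scheme: one opaque token per entity, so that a
        # multi-word entity name matches as a single unit.
        extra = [
            hashlib.sha256(e.lower().encode("utf-8")).hexdigest()[:8]
            for e in entities
        ]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return " ".join([text, *extra])


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1/rank of the first relevant result in the top k, else 0.

    MRR@k is the mean of this value over all queries.
    """
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


print(expand("who founded apple", ["Apple Inc."]))
print(expand("who founded apple", ["Apple Inc."], mode="hashed"))
```

Because the same function is applied to queries and documents, a query and a passage that mention the same entity share an extra term that an exact-match retriever can pick up, even when their surface wording differs.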

    Modelos densos e híbridos para recuperação de informação

    In the era of Big Data, there is a need to find information easily and quickly, making it imperative for a search system to understand user intent more efficiently. Dense Retrieval focuses on this idea, by allowing models to capture the underlying meaning of queries and documents. Current models already surpass the classical BM-25 model in terms of accuracy. However, because they use a high number of dimensions to represent queries and documents, dense models are still not optimized for efficiency at search time. This work focuses on evaluating the need for that high number of dimensions, by analyzing different dimensionality reduction methods, trained for different purposes, and comparing the trade-offs between efficiency and accuracy. Master's degree in Informatics Engineering

    Question Answering using Syntactic Patterns in a Contextual Search Engine

    Question Answering (QA) systems promise to enhance both usability and accuracy when searching for knowledge. This thesis presents a prototype QA system built to leverage the extraction capabilities of a modern, context-aware search platform, Fast ESP. Questions in plain English are transformed into queries that target specific entities in the text corresponding to the identified answer types. A small set of unified patterns is shown to be adequate for classifying a wide variety of syntactic constructs. For the purpose of verifying the answers, a semantic lexicon is compiled using an automated procedure. The whole solution is based on pattern matching and presents this as a viable alternative to deeper linguistic methods.
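
As an illustration of pattern-based question classification, the sketch below maps a question to an expected answer type with a handful of regular expressions. The patterns, the type labels and the `answer_type` function are hypothetical; they are not the patterns used in the thesis and do not reflect any Fast ESP API.

```python
import re

# Hypothetical patterns mapping question forms to expected answer types.
PATTERNS = [
    (re.compile(r"^who\b", re.IGNORECASE), "PERSON"),
    (re.compile(r"^where\b", re.IGNORECASE), "LOCATION"),
    (re.compile(r"^when\b", re.IGNORECASE), "DATE"),
    (re.compile(r"^how (many|much)\b", re.IGNORECASE), "QUANTITY"),
]


def answer_type(question):
    """Return the answer type of the first matching pattern, or None."""
    text = question.strip()
    for pattern, label in PATTERNS:
        if pattern.search(text):
            return label
    return None


print(answer_type("Who invented the telephone?"))
print(answer_type("How many moons does Mars have?"))
```

A query built from the question could then be restricted to text spans tagged with the returned type, which is the general idea of targeting specific entities that the abstract describes.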

    Proceedings of the 9th Dutch-Belgian Information Retrieval Workshop
