73 research outputs found

    Adapting information retrieval systems to contexts: the case of difficult queries

    Get PDF
    The field of information retrieval (IR) studies how to find relevant information in one or more document collections in order to satisfy an information need. In an Information Retrieval System (IRS), the information sought takes the form of "documents" and the information need takes the form of a "query" formulated by the user. IRS performance depends on the query: queries for which the IRS fails (few or no relevant documents retrieved) are called "difficult queries" in the literature. This difficulty may be caused by term ambiguity, unclear query formulation, lack of context for the information need, the nature and structure of the document collection, etc. This thesis aims at adapting information retrieval systems to contexts, particularly in the case of difficult queries. The manuscript is organized into five main chapters, besides the acknowledgements, the general introduction, and the conclusions and perspectives. The first chapter is an introduction to IR; we develop the concept of relevance, the retrieval models from the literature, query expansion, and the evaluation framework used in the experiments that validate our proposals. Each of the following chapters presents one of our contributions: it states the research problem, reviews the related work, and presents our theoretical proposals and their validation on benchmark collections. In chapter two, we present our research on handling ambiguous queries. Query term ambiguity can indeed lead to poor document selection by search engines. In the related work, the disambiguation methods that yield good performance are supervised, but such methods are not applicable in a real IR context because they require information that is normally unavailable; moreover, term disambiguation for IR is reported in the literature to be suboptimal. In this context, we propose an unsupervised query disambiguation method and show its effectiveness. Our approach is interdisciplinary, between the fields of natural language processing and IR. The goal of the unsupervised disambiguation method we developed is to give more weight to the documents retrieved by the search engine that contain the query terms with the senses identified by disambiguation. This re-ranking provides a new document list that contains more documents that are potentially relevant to the user. We tested this post-disambiguation re-ranking method using two different classification techniques (Naïve Bayes [Chifu and Ionescu, 2012] and spectral clustering [Chifu et al., 2015]) on three document collections and query sets from the TREC evaluation campaign (TREC7, TREC8, WT10G). We showed that the disambiguation method performs well when the search engine retrieves few relevant documents (a 7.9% improvement over state-of-the-art methods). In chapter three, we present our work on query difficulty prediction. Indeed, if ambiguity is one difficulty factor, it is not the only one. We completed the range of difficulty predictors by building on the state of the art. Existing predictors are not sufficiently effective, so we introduce new difficulty prediction measures that combine predictors, and we also propose a robust method to evaluate query difficulty predictors. Using predictor combinations on the TREC7 and TREC8 collections, we obtain a 7.1% improvement in prediction quality compared to the state of the art [Chifu, 2013]. In the fourth chapter we focus on the application of difficulty predictors. Specifically, we propose a selective IR approach in which the predictors are used to decide which search engine, among several, would perform best for a given query. The decision model is learned by an SVM (Support Vector Machine). We tested our model on TREC benchmark collections (Robust, WT10G, GOV2): the learned models classified the test queries with over 90% accuracy, and retrieval performance improved by more than 11% compared to non-selective methods [Chifu and Mothe, 2014]. In the last chapter, we address an important issue in IR: query expansion by adding terms. It is very difficult to predict the expansion parameters or to anticipate whether a query needs expansion at all. We present our contribution to optimizing the lambda parameter of RM3 (a pseudo-relevance feedback model for query expansion) on a per-query basis. We tested several hypotheses, both with and without prior information, searching for the minimum amount of information needed for the optimization of the expansion parameter to be possible. The results are not satisfactory, even though we used a wide range of methods such as SVMs, regression, logistic regression, and similarity measures; these findings reinforce the conclusion that this optimization problem is hard. Part of this research was conducted during a three-month research visit at the Technion institute in Haifa, Israel, in 2013, and continued afterwards in collaboration with the Technion team; in Haifa, we worked with Professor Oren Kurland and PhD student Anna Shtok. In conclusion, in this thesis we propose new methods to improve the performance of IR systems by taking query difficulty into account. The results of the methods proposed in chapters two, three and four show significant improvements and open perspectives for future research. The analysis presented in chapter five confirms the difficulty of the parameter optimization problem and calls for further investigation of the settings of selective query expansion
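
    To make the RM3 setting concrete, here is a minimal Python sketch of the interpolation discussed above: a relevance model is estimated from pseudo-relevant documents and mixed with the original query model through the lambda parameter that the thesis tries to set per query. This is the textbook RM3 formulation, not the thesis code; the helper names and toy inputs are illustrative.

    from collections import Counter, defaultdict

    def rm1_model(feedback_docs, doc_scores):
        # RM1: P(w|R) is proportional to the sum over pseudo-relevant documents
        # of P(w|d) weighted by the document's retrieval score.
        rm1 = defaultdict(float)
        for doc, score in zip(feedback_docs, doc_scores):
            counts = Counter(doc)
            length = sum(counts.values())
            for w, c in counts.items():
                rm1[w] += (c / length) * score
        z = sum(rm1.values()) or 1.0
        return {w: p / z for w, p in rm1.items()}

    def rm3_query_model(query_terms, feedback_docs, doc_scores, lam=0.5, n_terms=20):
        # RM3: P(w|q') = (1 - lam) * P_mle(w|q) + lam * P(w|RM1), with RM1
        # truncated to its n_terms strongest terms and renormalized.
        top = dict(sorted(rm1_model(feedback_docs, doc_scores).items(),
                          key=lambda kv: -kv[1])[:n_terms])
        z = sum(top.values()) or 1.0
        expanded = {w: lam * p / z for w, p in top.items()}
        q_counts, q_len = Counter(query_terms), max(len(query_terms), 1)
        for w, c in q_counts.items():
            expanded[w] = expanded.get(w, 0.0) + (1.0 - lam) * c / q_len
        return expanded

    # Toy usage: two pseudo-relevant documents with their retrieval scores.
    docs = [["storm", "damage", "insurance", "claims"], ["storm", "wind", "damage"]]
    print(rm3_query_model(["storm", "damage"], docs, [0.7, 0.3], lam=0.4))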

    Rapid Exploitation and Analysis of Documents

    Full text link

    A model for information retrieval driven by conceptual spaces

    Get PDF
    A retrieval model describes the transformation of a query into a set of documents. The question is: what drives this transformation? For semantic information retrieval models, this transformation is driven by the content and structure of the semantic models. In this case, Knowledge Organization Systems (KOSs) are the semantic models that encode the meaning employed for monolingual and cross-language retrieval. The focus of this research is the relationship between these meaning representations and their role and potential in augmenting the effectiveness of existing retrieval models. The proposed approach is unique in explicitly interpreting a semantic reference as a pointer to a concept in the semantic model that activates all of its linked neighboring concepts. It is the formalization of the information retrieval model, together with the integration of knowledge resources from the Linguistic Linked Open Data cloud, that distinguishes it from other approaches. Preprocessing the semantic model with Formal Concept Analysis enables the extraction of conceptual spaces (formal contexts) based on sub-graphs of the original structure of the semantic model. The types of conceptual spaces built in this case are limited to the KOS structural relations relevant to retrieval: exact match, broader, narrower, and related. They capture the definitional and relational aspects of the concepts in the semantic model. Each formal context is also assigned an operational role in the flow of processes of the retrieval system, providing a clear path towards implementations of monolingual and cross-lingual systems. A retrieval system constructed following this model's theoretical description showed statistically significant results in both monolingual and bilingual settings when no query expansion methods were used. The test suite was run on the Cross-Language Evaluation Forum Domain Specific 2004-2006 collection, with additional extensions to match the specifics of this model
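
    As a small illustration of the central idea, the following Python sketch shows how a query term that matches a KOS concept activates that concept's neighbours along the retrieval-relevant relations (exact match, broader, narrower, related), yielding an enriched query representation. The toy KOS fragment and helper names are hypothetical; this is not the thesis implementation nor the actual CLEF Domain Specific thesaurus.

    from collections import namedtuple

    Concept = namedtuple("Concept", "exact_match broader narrower related")

    # Toy KOS fragment keyed by concept id; a real semantic model would be loaded
    # from a thesaurus in the Linguistic Linked Open Data cloud.
    KOS = {
        "monetary_policy": Concept(exact_match=["geldpolitik"],
                                   broader=["economic_policy"],
                                   narrower=["interest_rate_policy"],
                                   related=["central_banks"]),
    }

    def activate(concept_id, relations=("exact_match", "broader", "narrower", "related")):
        # A matched concept acts as a pointer that activates its linked neighbours.
        concept = KOS[concept_id]
        activated = {concept_id}
        for rel in relations:
            activated.update(getattr(concept, rel))
        return activated

    # The activated set enriches the (monolingual or cross-lingual) query representation.
    print(activate("monetary_policy"))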

    Closing the gap in WSD: supervised results with unsupervised methods

    Get PDF
    Word-Sense Disambiguation (WSD) holds promise for many NLP applications requiring broad-coverage language understanding, such as summarization (Barzilay and Elhadad, 1997) and question answering (Ramakrishnan et al., 2003). Recent studies have also shown that WSD can benefit machine translation (Vickrey et al., 2005) and information retrieval (Stokoe, 2005). Much work has focused on the computational treatment of sense ambiguity, primarily using data-driven methods. The most accurate WSD systems to date are supervised and rely on the availability of sense-labeled training data. This restriction poses a significant barrier to widespread use of WSD in practice, since such data is extremely expensive to acquire for new languages and domains. Unsupervised WSD holds the key to enabling such applications, as it does not require sense-labeled data. However, unsupervised methods fall far behind supervised ones in terms of accuracy and ease of use. In this thesis we explore the reasons for this, and present solutions to remedy the situation. We hypothesize that one of the main problems with unsupervised WSD is its lack of a standard formulation and of the general-purpose tools common to supervised methods. As a first step, we examine existing approaches to unsupervised WSD, with the aim of detecting independent principles that can be utilized in a general framework. We investigate ways of leveraging the diversity of existing methods using ensembles, a common tool in the supervised learning framework. This approach allows us to achieve accuracy beyond that of the individual methods, without the need for extensive modification of the underlying systems. Our examination of existing unsupervised approaches highlights the importance of using the predominant sense in case of uncertainty, and the effectiveness of statistical similarity methods as a tool for WSD. However, it also serves to emphasize the need for a way to merge and combine learning elements, and the potential of a supervised-style approach to the problem. Relying on existing methods does not take full advantage of the insights gained from the supervised framework. We therefore present an unsupervised WSD system which circumvents the question of the actual disambiguation method, which is the main source of discrepancy in unsupervised WSD, and deals directly with the data. Our method uses statistical and semantic similarity measures to produce labeled training data in a completely unsupervised fashion. This allows the training and use of any standard supervised classifier for the actual disambiguation. Classifiers trained with our method significantly outperform those using other methods of data generation, and represent a significant step in bridging the accuracy gap between supervised and unsupervised methods. Finally, we address a major drawback of classical unsupervised systems: their reliance on a fixed sense inventory and lexical resources. This dependence represents a substantial setback for unsupervised methods in cases where such resources are unavailable; unfortunately, these are exactly the areas in which unsupervised methods are most needed. Unsupervised sense discrimination, which does not share those restrictions, presents a promising solution to the problem. We therefore develop an unsupervised sense discrimination system. We base our system on a well-studied probabilistic generative model, Latent Dirichlet Allocation (Blei et al., 2003), which has many of the advantages of supervised frameworks. The model's probabilistic nature lends itself to easy combination and extension, and its generative aspect is well suited to linguistic tasks. Our model achieves state-of-the-art performance on the unsupervised sense induction task, while remaining independent of any fixed sense inventory, and thus represents a fully unsupervised, general-purpose WSD tool
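
    The data-generation idea can be sketched in a few lines of Python, under clearly toy assumptions: occurrences of an ambiguous word are labelled automatically by a gloss-overlap (Lesk-style) similarity, and the resulting pseudo-labels are used to train an ordinary supervised classifier. The sense inventory, the sentences, and the choice of scikit-learn's Naive Bayes are illustrative; the thesis relies on its own statistical and semantic similarity measures.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    SENSES = {  # toy sense inventory: sense id -> gloss
        "bank/finance": "financial institution that accepts deposits and lends money",
        "bank/river": "sloping land beside a body of water such as a river or stream",
    }

    def pseudo_label(context, senses=SENSES):
        # Choose the sense whose gloss shares the most words with the context.
        ctx = set(context.lower().split())
        return max(senses, key=lambda s: len(ctx & set(senses[s].split())))

    occurrences = [
        "she opened an account at the bank to deposit money",
        "they had a picnic on the grassy bank of the river",
        "the bank approved the loan after checking the deposits",
    ]
    pseudo_labels = [pseudo_label(o) for o in occurrences]  # no sense-labelled data used

    # Any standard supervised classifier can now be trained on the pseudo-labels.
    vec = CountVectorizer()
    clf = MultinomialNB().fit(vec.fit_transform(occurrences), pseudo_labels)
    print(clf.predict(vec.transform(["fishing from the river bank near the picnic spot"])))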

    Training deep neural networks and applications in natural language processing

    Get PDF
    Machine learning aims to leverage data so that computers can learn to solve the problems we want to entrust to them. Despite being invented close to sixty years ago, Artificial Neural Networks (ANNs) remain an area of active research and a powerful tool. Their resurgence in the context of deep learning has led to dramatic improvements in various domains, from computer vision and speech processing to natural language processing. The ever-growing quantity of available data and increases in computing power make it possible to train high-capacity models such as deep ANNs. However, some intrinsic learning difficulties, such as local minima, remain problematic. Deep learning therefore aims to find solutions to these problems, either by adding regularisation or by easing optimisation; unsupervised pre-training and the Dropout technique are examples of such solutions. The first two articles presented in this thesis follow this line of research. The first studies the problem of vanishing/exploding gradients in deep architectures and shows that simple choices, such as the activation function or the initialization of the network weights, have a strong influence; we propose the normalized initialization scheme to ease learning. The second focuses on the choice of activation function and introduces the rectifier, or rectified linear unit. This work was the first to emphasise piecewise linear activation functions for deep supervised neural networks; today, this type of activation function is an essential component of deep neural networks. The last two articles presented concern applications of ANNs to natural language processing. The first addresses domain adaptation for sentiment analysis using Stacked Denoising Auto-encoders and remains the state of the art to this day. The second deals with learning from multi-relational data using an energy-based model, which can also be applied to the task of word-sense disambiguation
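
    As a concrete illustration of the two ingredients named above, here is a brief numpy sketch assuming their usual formulations: the normalized (Glorot) uniform initialisation with limit sqrt(6 / (n_in + n_out)), and the rectifier max(0, x). This is a didactic sketch, not the code from the articles.

    import numpy as np

    def normalized_init(n_in, n_out, rng):
        # Normalized (Glorot) initialisation: uniform in [-limit, +limit]
        # with limit = sqrt(6 / (n_in + n_out)).
        limit = np.sqrt(6.0 / (n_in + n_out))
        return rng.uniform(-limit, limit, size=(n_in, n_out))

    def relu(x):
        # Rectified linear unit: max(0, x), applied element-wise.
        return np.maximum(0.0, x)

    # Forward pass through one hidden layer built from these two ingredients.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 100))        # a batch of 8 inputs with 100 features
    W = normalized_init(100, 50, rng)    # weights of a 100 -> 50 hidden layer
    h = relu(x @ W)                      # hidden activations
    print(h.shape)                       # (8, 50)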

    Computational approaches to semantic change

    Get PDF
    Semantic change — how the meanings of words change over time — has preoccupied scholars since well before modern linguistics emerged in the late 19th and early 20th century, ushering in a new methodological turn in the study of language change. Compared to changes in sound and grammar, semantic change is the least understood. Ever since, the study of semantic change has progressed steadily, accumulating a vast store of knowledge for over a century, encompassing many languages and language families. Historical linguists also early on realized the potential of computers as research tools, with papers at the very first international conferences in computational linguistics in the 1960s. Such computational studies still tended to be small-scale, method-oriented, and qualitative. However, recent years have witnessed a sea-change in this regard. Big-data empirical quantitative investigations are now coming to the forefront, enabled by enormous advances in storage capability and processing power. Diachronic corpora have grown beyond imagination, defying exploration by traditional manual qualitative methods, and language technology has become increasingly data-driven and semantics-oriented. These developments present a golden opportunity for the empirical study of semantic change over both long and short time spans. A major challenge presently is to integrate the hard-earned knowledge and expertise of traditional historical linguistics with cutting-edge methodology explored primarily in computational linguistics. The idea for the present volume came out of a concrete response to this challenge. The 1st International Workshop on Computational Approaches to Historical Language Change (LChange'19), at ACL 2019, brought together scholars from both fields. This volume offers a survey of this exciting new direction in the study of semantic change, a discussion of the many remaining challenges that we face in pursuing it, and considerably updated and extended versions of a selection of the contributions to the LChange'19 workshop, addressing both more theoretical problems — e.g., discovery of "laws of semantic change" — and practical applications, such as information retrieval in longitudinal text archives
