22 research outputs found

    A Machine Learning Method for Paraphrase Identification

    No full text
    This paper describes a new, effective algorithm for paraphrase identification developed using machine learning. The system is structured as a multilayer classifier: each lower-level classifier decides, according to its own strategy, whether a pair of sentences is a paraphrase, and an upper-level super-classifier makes the final decision. Experiments showed paraphrase detection precision comparable to that of the best existing state-of-the-art systems.
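    The two-level architecture described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' system: the two sub-classifier strategies (`token_overlap`, `length_ratio`), their thresholds, and the fixed weights in `super_classifier` are all hypothetical stand-ins, and a real super-classifier would normally be trained rather than hand-weighted.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def token_overlap(pair: Pair) -> float:
    """Jaccard overlap of the lowercase word sets (hypothetical strategy)."""
    a, b = (set(s.lower().split()) for s in pair)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def length_ratio(pair: Pair) -> float:
    """Token count of the shorter sentence over the longer (hypothetical strategy)."""
    la, lb = (len(s.split()) for s in pair)
    longer = max(la, lb)
    return min(la, lb) / longer if longer else 0.0


def make_sub_classifier(score: Callable[[Pair], float],
                        threshold: float) -> Callable[[Pair], int]:
    """Turn a scoring strategy into a lower-level classifier voting 0 or 1."""
    return lambda pair: 1 if score(pair) >= threshold else 0


# Lower level: each classifier decides independently with its own strategy.
SUB_CLASSIFIERS: List[Callable[[Pair], int]] = [
    make_sub_classifier(token_overlap, 0.5),   # assumed threshold
    make_sub_classifier(length_ratio, 0.7),    # assumed threshold
]


def super_classifier(votes: Sequence[int],
                     weights: Sequence[float] = (0.7, 0.3),
                     threshold: float = 0.5) -> int:
    """Upper level: combine the votes into a final decision.

    The weights are fixed by hand here purely for illustration.
    """
    combined = sum(w * v for w, v in zip(weights, votes))
    return 1 if combined >= threshold else 0


def is_paraphrase(pair: Pair) -> int:
    votes = [clf(pair) for clf in SUB_CLASSIFIERS]
    return super_classifier(votes)
```

    Calling `is_paraphrase(("the cat sat on the mat", "the cat sat on a mat"))` passes the pair through both lower-level classifiers and then through the combiner; swapping the hand-set weights for a trained model is the natural next step.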

    Syntax-Aware Multi-Sense Word Embeddings for Deep Compositional Models of Meaning

    Deep compositional models of meaning, which act on distributional representations of words to produce vectors for larger text constituents, are becoming a popular area of NLP research. We detail a compositional distributional framework based on a rich form of word embeddings that aims to facilitate the interactions between words in the context of a sentence. Embeddings and composition layers are jointly learned against a generic objective that enhances the vectors with syntactic information from the surrounding context. Furthermore, each word is associated with a number of senses, the most plausible of which is selected dynamically during the composition process. We evaluate the produced vectors qualitatively and quantitatively with positive results. At the sentence level, the effectiveness of the framework is demonstrated on the MSRPar task, for which we report results within the state-of-the-art range. Comment: Accepted for presentation at EMNLP 201

    Automatic detection of parallel sentences from comparable biomedical texts

    Parallel sentences provide semantically similar information which can vary along a given dimension, such as language or register. Parallel sentences with register variation (such as expert and non-expert documents) can be exploited for automatic text simplification. The aim of automatic text simplification is to make given information easier to access and understand. In the biomedical field, simplification may help patients understand medical and health texts. Yet no such resources are currently available. We propose to exploit comparable corpora distinguished by their registers (specialized and simplified versions) to detect and align parallel sentences. These corpora are in French and relate to the biomedical domain. Our purpose is to determine whether a given pair of specialized and simplified sentences should be aligned or not. Manually created reference data show 0.76 inter-annotator agreement. We treat this task as binary classification (alignment/non-alignment). We perform experiments on balanced and imbalanced data. The results on balanced data reach up to 0.96 F-measure. On imbalanced data, the results are lower but remain competitive when using classification models trained on balanced data. Moreover, among the three datasets exploited (semantic equivalence and inclusions), the detection of equivalence pairs is more efficient.
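    One reason F-measure tends to drop on imbalanced data can be shown with simple arithmetic. The sketch below holds a hypothetical classifier's true-positive and false-positive rates fixed and only changes the number of negative (non-aligned) pairs. All counts and rates are invented for illustration and are not taken from the paper.

```python
def f_measure(tp: float, fp: float, fn: float) -> float:
    """F1 score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def expected_counts(n_pos: int, n_neg: int, tpr: float, fpr: float):
    """Expected TP, FP, FN for a classifier with fixed error rates."""
    tp = tpr * n_pos
    fn = n_pos - tp
    fp = fpr * n_neg
    return tp, fp, fn


# Hypothetical classifier: finds 90% of aligned pairs, and wrongly
# accepts 10% of non-aligned pairs.
TPR, FPR = 0.9, 0.1

balanced = f_measure(*expected_counts(100, 100, TPR, FPR))     # 1:1 ratio
imbalanced = f_measure(*expected_counts(100, 1000, TPR, FPR))  # 1:10 ratio

print(f"balanced F1:   {balanced:.3f}")
print(f"imbalanced F1: {imbalanced:.3f}")
```

    Recall is unchanged between the two settings; the drop comes entirely from precision, because the same false-positive rate applied to ten times as many negatives yields ten times as many false alignments.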

    Parallel sentence retrieval from comparable corpora for biomedical text simplification

    Parallel sentences provide semantically similar information which can vary along a given dimension, such as language or register. Parallel sentences with register variation (such as expert and non-expert documents) can be exploited for automatic text simplification. The aim of automatic text simplification is to make given information easier to access and understand. In the biomedical field, simplification may help patients understand medical and health texts. Yet no such resources are currently available. We propose to exploit comparable corpora distinguished by their registers (specialized and simplified versions) to detect and align parallel sentences. These corpora are in French and relate to the biomedical domain. Manually created reference data show 0.76 inter-annotator agreement. Our purpose is to determine whether a given pair of specialized and simplified sentences is parallel and can be aligned. We treat this task as binary classification (alignment/non-alignment). We perform experiments with a controlled ratio of imbalance and on the highly imbalanced real data. Our results show that the method presented here can be used to automatically generate a corpus of parallel sentences from our comparable corpus.

    A Unified Kernel Approach For Learning Typed Sentence Rewritings

    Many high-level natural language processing problems can be framed as determining whether two given sentences are rewritings of each other. In this paper, we propose a class of kernel functions, referred to as type-enriched string rewriting kernels, which, when used in kernel-based machine learning algorithms, make it possible to learn sentence rewritings. Unlike previous work, this method can be fed external lexical semantic relations to capture a wider class of rewriting rules. It also does not assume preliminary syntactic parsing but is still able to provide a unified framework for capturing syntactic structure and alignments between the two sentences. We experiment on three different natural sentence rewriting tasks and obtain state-of-the-art results on all of them.

    An Introduction to String Re-Writing Kernel

    Learning for sentence re-writing is a fundamental task in natural language processing and information retrieval. In this paper, we propose a new class of kernel functions, referred to as string re-writing kernels, to address the problem. A string re-writing kernel measures the similarity between two pairs of strings. It can capture the lexical and structural similarity between sentence pairs without the need to construct syntactic trees. We further propose an instance of the string re-writing kernel which can be computed efficiently. Experimental results on benchmark datasets show that our method achieves results comparable to state-of-the-art methods on two sentence re-writing learning tasks: paraphrase identification and recognizing textual entailment.

    Automatic detection of parallel sentences in a comparable technical/simplified biomedical corpus

    Parallel sentences provide identical or semantically very similar information and give important clues about how language works. When sentences differ in register (such as expert vs. non-expert), they can be exploited for automatic text simplification. The aim of text simplification is to improve the understanding of texts. For instance, in the biomedical field, simplification may help patients better understand medical texts relating to their health. Yet there are currently very few resources for the simplification of French texts. We propose to exploit comparable corpora, distinguished by their technicality, to detect parallel sentences and align them. The reference data are created manually and show 0.76 inter-annotator agreement. We perform experiments on balanced and imbalanced data. The results on balanced data reach up to 0.94 F-measure. On imbalanced data, the results are lower (up to 0.92 F-measure) but remain competitive when using classification models trained on balanced data.

    Learning the Impact and Behavior of Syntactic Structure: A Case Study in Semantic Textual Similarity

    We present a case study on the role of syntactic structures in resolving the Semantic Textual Similarity (STS) task. Although various approaches have been proposed, the use of syntactic information to determine semantic similarity remains relatively under-researched. At the level of syntactic structure, it is not clear how significantly the syntactic structure contributes to the overall accuracy of the task. In this paper, we analyze the impact of syntactic structure on overall performance and its behavior in different score ranges of the STS semantic scale.

    Sentence Rewriting Kernels Equipped with Lexico-Semantic Types

    Many problems in natural language processing require determining whether two sentences are rewritings of each other. An effective solution is to learn rewritings using kernel methods that measure the similarity between two rewritings of sentence pairs. However, these methods generally cannot take into account semantic variations between words, which would make it possible to capture a larger number of rewriting rules. In this paper, we propose the definition and implementation of a new class of kernel functions, based on sentence rewriting enriched with typing, to fill this gap. We evaluate it on two tasks: paraphrase recognition and textual entailment recognition.