A new method for learning Phrase Based Machine Translation with Multivariate Mutual Information
International audience. Current statistical machine translation systems usually build initial word-to-word alignments before learning phrase translation pairs. This operation requires a large number of matches between different single words of the two languages considered. We propose a new approach for phrase-based machine translation which does not need any word alignment; it is based on inter-lingual triggers determined by Multivariate Mutual Information. The algorithm segments sentences into phrases and finds their alignments simultaneously. The main objective is to directly build valid alignments between source and target phrases. Despite the youth of this method, experiments showed that the results are competitive, but more effort is needed to surpass those of state-of-the-art methods.
Characterizing Health-Related Information Needs of Domain Experts (regular paper)
International audience. In the information retrieval literature, understanding the users' intents behind queries is critically important for gaining better insight into how to select relevant results. While many studies have investigated how users in general carry out exploratory health searches in digital environments, few have focused on how queries are formulated, specifically by domain expert users. This study intends to fill this gap by studying 173 health expert queries issued from 3 medical information retrieval tasks within 2 different evaluation campaigns. A statistical analysis was carried out to study both the variation and the correlation of health-query attributes such as length, clarity and specificity, for both clinical and non-clinical queries. The knowledge gained from the study has an immediate impact on the design of future health information seeking systems.
Exploratory analysis of medical expert queries: the case of the TREC and CLEF evaluation campaigns (regular paper)
International audience. In this paper, we analyze the information needs expressed by medical experts, with the aim of characterizing them and then measuring the impact of their structure on search results. To this end, we conduct an exploratory study based on multidimensional statistical analyses of query collections drawn from standard international evaluation campaigns, namely TREC and CLEF. Our study reveals significant variability in query morphology as well as in needs and performance, which we interpret in light of the objectives and specificities of the associated medical tasks. The results of this study have an impact on the design of medical information retrieval systems.
STATISTICAL MACHINE TRANSLATION IMPROVEMENT BASED ON PHRASE SELECTION
International audience. This paper describes the importance of introducing a phrase-based language model into the machine translation process. Current SMT systems translate with phrases, but their language models are still based on classical n-grams. In this paper we introduce a phrase-based language model (PBLM) in the decoding process to try to match the phrases of a translation table with those predicted by a language model. Furthermore, we propose a new way to retrieve phrases and their corresponding translations by using the principle of conditional mutual information. The resulting SMT system is compared to the baseline in terms of BLEU, TER and METEOR. The experimental results show that introducing the PBLM in translation decoding improves the results.
Training phrase-based SMT without explicit word alignment
International audience. Machine translation systems usually build an initial word-to-word alignment before training the phrase translation pairs. This approach requires a lot of matching between different single words of both considered languages. In this paper, we propose a new approach for phrase-based machine translation which does not require any word alignment. This method is based on inter-lingual triggers retrieved by Multivariate Mutual Information. This algorithm segments sentences into phrases and finds their alignments simultaneously. The main objective of this work is to build directly valid alignments between source and target phrases. The achieved results, in terms of performance, are satisfactory and the obtained translation table is smaller than the reference one; this approach could be considered as an alternative to the classical methods.
Training Statistical Machine Translation with Multivariate Mutual Information
International audience. In this paper, we describe a new model for phrase-based statistical machine translation. Roughly speaking, the statistical approach uses a language model and a translation model. The latter can be viewed as a lexical model and an alignment model. The approach we propose does not need any alignment; it is based on inter-lingual triggers determined by multivariate mutual information (MMI). This measure depends on conditional mutual information, which means that a source phrase is directly linked to a target one. The conditional mutual information is used in both directions (source-target and target-source languages). We present an experimental evaluation conducted on EUROPARL corpora (French and English) using the MOSES decoder. We then compare our results to those of a previous work in which we used inter-lingual triggers determined by simple mutual information (MI), as well as to those given by the baseline model (Koehn et al., 2003).
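The idea of an inter-lingual trigger can be illustrated with a minimal sketch. It is not the authors' MMI formulation: it uses the simpler word-level mutual information term over sentence co-occurrence, and the function name `interlingual_triggers`, the `top_k` parameter and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product

def interlingual_triggers(parallel, top_k=2):
    """Rank, for each source word, the target words it most strongly triggers.

    Assumption: co-occurrence is counted once per aligned sentence pair, and
    the score is the per-pair term of mutual information, p(s,t) * log(p(s,t) / (p(s) p(t))).
    """
    n = len(parallel)
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in parallel:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        pair_count.update(product(s_words, t_words))

    triggers = {}
    for (s, t), c in pair_count.items():
        # log(p(s,t) / (p(s) p(t))) with all probabilities estimated over n sentence pairs
        pmi = math.log((c * n) / (src_count[s] * tgt_count[t]))
        triggers.setdefault(s, []).append((t, (c / n) * pmi))
    return {s: sorted(cands, key=lambda x: -x[1])[:top_k]
            for s, cands in triggers.items()}

# Hypothetical toy French-English corpus, for illustration only.
corpus = [
    ("le chat dort", "the cat sleeps"),
    ("le chien dort", "the dog sleeps"),
    ("le chat mange", "the cat eats"),
    ("un chien mange", "a dog eats"),
]
triggers = interlingual_triggers(corpus)
```

On this toy corpus the strongest trigger for "chat" is "cat". Extending the sketch to phrases, and to the conditional, bidirectional measure described in the abstract, would require the definitions given in the paper itself.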
Tweet Contextualization Based on Wikipedia and Dbpedia
National audience. Limited to 140 characters, tweets are short and often do not follow formal grammar or proper spelling. These spelling variations increase the likelihood of vocabulary mismatch and make tweets difficult to understand without context. This paper addresses the tweet contextualization task, which aims to automatically provide a summary that explains a given tweet, allowing a reader to understand it. We propose different tweet expansion approaches based on Wikipedia and DBpedia as external knowledge sources. The proposed approaches are divided into two steps. The first step consists in generating the candidate terms for a given tweet, while the second consists in ranking and selecting these candidate terms using a similarity measure. The effectiveness of our methods is demonstrated through an experimental study conducted on the INEX 2014 collection.
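The two-step scheme (candidate generation, then ranking by a similarity measure) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which similarity measure is used, so cosine similarity over bags of words is swapped in, and the `knowledge` dictionary is a hypothetical in-memory stand-in for Wikipedia or DBpedia entries.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def expand_tweet(tweet, knowledge, top_k=2):
    """Return the top_k knowledge entries most similar to the tweet."""
    tweet_vec = Counter(tweet.lower().split())
    # Step 1: candidate generation, keeping entries that share a word with the tweet.
    candidates = [title for title, text in knowledge.items()
                  if tweet_vec.keys() & set(text.lower().split())]
    # Step 2: ranking and selection by similarity to the tweet.
    ranked = sorted(
        candidates,
        key=lambda title: cosine(tweet_vec, Counter(knowledge[title].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical knowledge entries, for illustration only.
knowledge = {
    "Python (language)": "python programming language code",
    "Python (snake)": "python snake reptile",
    "Java": "java virtual machine",
}
expansion = expand_tweet("learning python programming today", knowledge)
```

Here the programming-language entry ranks above the snake entry because it shares two words with the tweet, while the "Java" entry is never generated as a candidate.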
Phrase-based Machine Translation based on Text Mining and Statistical Language Modeling Techniques
International audience. In this paper, we introduce two new methods dedicated to phrase-based machine translation. Both are based on mining a parallel corpus in order to find the pairs of linguistic units which are translations of each other. The presented methods do not rely on any alignment, in contrast to what is usually done by the statistical machine translation community. Each of them proposes a complete translation table containing translations of single words and phrases. The first method is inspired by the well-known trigger language model, while the second is inspired by the association rules mining technique. All experiments are conducted on a large part of the EUROPARL corpus and highlight the utility of both proposed approaches.
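To make the association-rule view concrete, here is a minimal sketch that treats each aligned sentence pair as a transaction and keeps rules "source word → target word" that pass support and confidence thresholds. The function name, the thresholds and the toy corpus are assumptions for illustration; the abstract does not describe the authors' actual mining procedure.

```python
from collections import Counter
from itertools import product

def translation_rules(parallel, min_support=0.5, min_confidence=0.8):
    """Mine word-level rules s -> t from aligned sentence pairs.

    support(s -> t)    = fraction of sentence pairs containing both s and t
    confidence(s -> t) = fraction of pairs containing s that also contain t
    """
    n = len(parallel)
    src_count, pair_count = Counter(), Counter()
    for src, tgt in parallel:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        pair_count.update(product(s_words, t_words))

    rules = []
    for (s, t), c in pair_count.items():
        support, confidence = c / n, c / src_count[s]
        if support >= min_support and confidence >= min_confidence:
            rules.append((s, t, support, confidence))
    # Strongest rules first; ties broken alphabetically for a stable order.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))

# Hypothetical toy French-English corpus, for illustration only.
sentences = [
    ("le chat dort", "the cat sleeps"),
    ("le chien dort", "the dog sleeps"),
    ("le chat mange", "the cat eats"),
    ("un chien mange", "a dog eats"),
]
rules = translation_rules(sentences)
rule_pairs = {(s, t) for s, t, _, _ in rules}
```

On this corpus the rule "chat → cat" is kept, while "le → cat" is rejected for low confidence. Frequent target words such as "the" also pass the thresholds for several source words, which is one reason a real system would need further filtering.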