
    Answering List and Other questions

    The importance of Question Answering is growing with the expansion of information and text documents on the web. Techniques in Question Answering have improved significantly during the last decade, especially after the introduction of the TREC Question Answering track. Most work in this field has focused on answering Factoid questions. In this thesis, however, we present and evaluate two approaches to answering List and Other questions, which are just as important but have not been investigated as thoroughly as Factoid questions. Although answering List questions is not a new research area, answering them automatically remains a challenge: the median F-score of systems that participated in the TREC-2007 Question Answering track is still very low (0.085), and 74% of the questions had a median F-score of 0. In this thesis, we propose a novel approach to answering List questions based on the hypothesis that the answer instances to a List question co-occur within sentences of the documents related to the question and the topic. We use a clustering method to group the candidate answers that frequently co-occur, and to pinpoint the right cluster we use the target and the question keywords as spies. Using this approach, our system placed fourth among 21 teams in the TREC-2007 QA track with an F-score of 0.145. Other questions were introduced in the TREC-QA track to retrieve other interesting facts about a topic. In our thesis, Other questions are answered using the notion of interest-marking terms. To answer this type of question, our system extracts from Wikipedia articles a list of interest-marking terms related to the topic and uses them to extract and score sentences from the document collection where the answer should be found. Sentences are then re-ranked using universal interest markers that are not specific to the topic, and the top sentences are returned as possible answers. To evaluate our approach, we participated in the TREC-2006 and TREC-2007 QA tracks; our system placed third in both years with F-scores of 0.199 and 0.281, respectively.
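    Below is a minimal sketch of the co-occurrence clustering idea described in this abstract, with question keywords acting as "spies" to select the right cluster. The sentences, candidate answers, spy terms, and the greedy merging rule are all hypothetical illustrations; the thesis's actual clustering method and scoring are not spelled out in the abstract.

```python
# Sketch only: toy data and a simple greedy agglomeration stand in for the
# thesis's clustering method, which the abstract does not specify in detail.
from collections import defaultdict
from itertools import combinations

def cooccurrence_counts(sentences, candidates):
    """Count how often each pair of candidate answers appears in the same sentence."""
    counts = defaultdict(int)
    for sent in sentences:
        present = [c for c in candidates if c in sent]
        for a, b in combinations(sorted(present), 2):
            counts[(a, b)] += 1
    return counts

def cluster_candidates(candidates, counts, threshold=1):
    """Greedily merge clusters whose cross-cluster co-occurrence count meets the threshold."""
    clusters = [{c} for c in candidates]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            link = sum(counts.get(tuple(sorted((a, b))), 0)
                       for a in clusters[i] for b in clusters[j])
            if link >= threshold:
                clusters[i] |= clusters[j]
                del clusters[j]
                merged = True
                break
    return clusters

def pick_cluster_with_spies(clusters, sentences, spies):
    """Choose the cluster whose members co-occur most often with the spy terms."""
    def spy_score(cluster):
        return sum(1 for sent in sentences
                   for c in cluster for s in spies
                   if c in sent and s in sent)
    return max(clusters, key=spy_score)

# Hypothetical toy example for a List question like "Which airlines fly to Reykjavik?"
sentences = [
    "Icelandair and SAS both fly to Reykjavik daily.",
    "Icelandair, SAS and Lufthansa serve Reykjavik in summer.",
    "Boeing and Airbus build the aircraft used on these routes.",
]
candidates = ["Icelandair", "SAS", "Lufthansa", "Boeing", "Airbus"]
spies = ["Reykjavik", "fly"]

counts = cooccurrence_counts(sentences, candidates)
clusters = cluster_candidates(candidates, counts)
print(pick_cluster_with_spies(clusters, sentences, spies))  # the airline cluster
```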

    Domain Adaptation Techniques for Machine Translation and Their Evaluation in a Real-World Setting

    Statistical Machine Translation (SMT) is currently used in real-time and commercial settings to quickly produce initial translations for a document, which can later be edited by a human. SMT models specialized for one domain often perform poorly when applied to other domains, because the typical assumption that training and testing data are drawn from the same distribution no longer applies. This paper evaluates domain adaptation techniques for SMT systems in the context of end-user feedback in a real-world application. We present experiments with two adaptive techniques, one relying on log-linear models and the other on mixture models. We describe experimental results on legal and government data, and present a human evaluation of post-editing effort in addition to traditional automated scoring (BLEU). The human effort is measured primarily by the amount of time and the number of edits required by a professional post-editor to bring machine-generated translations up to industry standards. The experimental results in this paper show that the domain adaptation techniques can yield a significant increase in BLEU score (up to four points) and a significant reduction in post-editing time of about one second per word.
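    As a rough illustration of the mixture-model technique mentioned above, the sketch below linearly interpolates translation probabilities from a hypothetical in-domain and out-of-domain phrase table. The toy entries and the fixed mixture weight are assumptions; a real system would interpolate full phrase tables inside an SMT toolkit and tune the weight on held-out in-domain data.

```python
# Sketch only: linear mixture of two toy phrase tables with a hand-picked weight.

def mix_phrase_tables(in_domain, out_domain, lambda_in=0.7):
    """Linearly interpolate p(target | source) from two phrase tables.

    p_mix(t|s) = lambda_in * p_in(t|s) + (1 - lambda_in) * p_out(t|s)
    """
    mixed = {}
    for src in set(in_domain) | set(out_domain):
        p_in = in_domain.get(src, {})
        p_out = out_domain.get(src, {})
        mixed[src] = {
            tgt: lambda_in * p_in.get(tgt, 0.0) + (1 - lambda_in) * p_out.get(tgt, 0.0)
            for tgt in set(p_in) | set(p_out)
        }
    return mixed

# Hypothetical entries: the legal domain prefers "statute" for French "loi".
in_domain = {"loi": {"statute": 0.6, "law": 0.4}}
out_domain = {"loi": {"law": 0.9, "act": 0.1}}

for tgt, p in sorted(mix_phrase_tables(in_domain, out_domain)["loi"].items()):
    print(f"p({tgt} | loi) = {p:.2f}")
```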

    Leveraging diverse sources in statistical machine translation

    Statistical machine translation (SMT) is often faced with the problem of having insufficient training data for many language pairs. We propose several approaches to leveraging other available sources in SMT systems to enhance the quality of translation. In particular, we propose approaches suitable for four scenarios: (1) when an additional parallel corpus is available; (2) when parallel corpora between the source language and a third language and between that language and the target language are available; (3) when an abundant source-language monolingual corpus is available; and (4) when no additional resource is available. At the heart of these solutions lie two novel approaches: ensemble decoding and a graph propagation approach for paraphrasing out-of-vocabulary (OOV) words. Ensemble decoding combines a number of translation systems dynamically at the decoding step. Our experimental results show that ensemble decoding outperforms various strong baselines, including mixture models, the current state of the art for domain adaptation in machine translation. We extend ensemble decoding to perform triangulation on the fly when parallel corpora exist between the source language and one or more pivot languages and between those and the target language. These triangulated systems are dynamically combined together, and possibly with a direct source-target system. Experiments on 12 different language pairs show significant improvements over the baselines in terms of BLEU scores. Ensemble decoding can also be used to apply stacking to statistical machine translation. Stacking is an ensemble learning approach that enhances the bias of the models. We show that stacking can consistently and significantly improve over conventional SMT systems for two different language pairs and three different training sizes. In addition to ensemble decoding, we propose a novel approach to mining translations for OOV words using a monolingual corpus of the source language. We induce a lexicon by constructing a graph over source-language phrases and employ a graph propagation technique to find translations for those phrases. Experimental results in two different settings show that our graph propagation method significantly improves performance over two strong baselines under both intrinsic and extrinsic evaluation metrics.
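    The sketch below illustrates one way label propagation over a source-phrase graph could assign translation candidates to an OOV phrase, in the spirit of the graph propagation approach described above. The graph, edge weights (standing in for distributional similarity), and seed translation distributions are hypothetical toy values, and the simple iterative update is only one of several propagation variants.

```python
# Sketch only: toy label propagation; not the thesis's exact algorithm.

def propagate(edges, labels, iterations=20):
    """Propagate translation distributions from labeled to unlabeled phrase nodes.

    edges:  {node: {neighbour: weight}}
    labels: {node: {translation: prob}} for phrases already in the phrase table
    """
    dist = {n: dict(labels.get(n, {})) for n in edges}
    for _ in range(iterations):
        new_dist = {}
        for node, nbrs in edges.items():
            if node in labels:              # labeled nodes keep their seed distribution
                new_dist[node] = dict(labels[node])
                continue
            scores = {}
            for nbr, w in nbrs.items():     # weighted vote from neighbours
                for trans, p in dist.get(nbr, {}).items():
                    scores[trans] = scores.get(trans, 0.0) + w * p
            total = sum(scores.values())
            new_dist[node] = {t: s / total for t, s in scores.items()} if total else {}
        dist = new_dist
    return dist

# Hypothetical graph: "autoroute" is OOV; its neighbours are phrases with known translations.
edges = {
    "autoroute": {"route": 0.6, "chemin": 0.4},
    "route": {"autoroute": 0.6},
    "chemin": {"autoroute": 0.4},
}
labels = {
    "route": {"road": 0.8, "route": 0.2},
    "chemin": {"path": 0.7, "way": 0.3},
}
print(propagate(edges, labels)["autoroute"])
```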

    Ensemble Triangulation for Statistical Machine Translation

    State-of-the-art statistical machine translation systems rely heavily on training data, and insufficient training data usually results in poor translation quality. One solution to alleviate this problem is triangulation. Triangulation uses a third language as a pivot through which another source-target translation system can be built. In this paper, we dynamically create multiple such triangulated systems and combine them using a novel approach called ensemble decoding. Experimental results of this approach show significant improvements in BLEU score over the direct source-target system. Our approach also outperforms a strong linear mixture baseline.
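    For context, the sketch below shows the standard phrase-table triangulation identity that pivot systems build on, p(t|s) = sum over pivot phrases p of p(p|s) * p(t|p), using hypothetical French-English-German toy tables. The paper's actual contribution, combining several such triangulated systems dynamically during decoding via ensemble decoding, is not reproduced here.

```python
# Sketch only: offline triangulation of toy phrase tables through a pivot language.

def triangulate(src_to_pivot, pivot_to_tgt):
    """Build a source-to-target table by marginalising over pivot phrases."""
    src_to_tgt = {}
    for src, pivots in src_to_pivot.items():
        scores = {}
        for pivot, p_ps in pivots.items():          # p(pivot | source)
            for tgt, p_tp in pivot_to_tgt.get(pivot, {}).items():  # p(target | pivot)
                scores[tgt] = scores.get(tgt, 0.0) + p_ps * p_tp
        src_to_tgt[src] = scores
    return src_to_tgt

# Hypothetical tables: French -> English (pivot) and English -> German.
fr_en = {"maison": {"house": 0.7, "home": 0.3}}
en_de = {"house": {"Haus": 0.9, "Gebäude": 0.1}, "home": {"Zuhause": 0.6, "Haus": 0.4}}

print(triangulate(fr_en, en_de)["maison"])
# Expected (up to float rounding): Haus 0.75, Gebäude 0.07, Zuhause 0.18
```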