21 research outputs found
Lost and Found in Translation: Cross-Lingual Question Answering with Result Translation
Using cross-lingual question answering (CLQA), users can find information in languages that they do not know. In this thesis, we consider the broader problem of CLQA with result translation, where answers retrieved by a CLQA system must be translated back to the user's language by a machine translation (MT) system. This task is challenging because answers must be both relevant to the question and adequately translated in order to be correct. In this work, we show that integrating the MT closely with cross-lingual retrieval can improve result relevance and we further demonstrate that automatically correcting errors in the MT output can improve the adequacy of translated results. To understand the task better, we undertake detailed error analyses examining the impact of MT errors on CLQA with result translation. We identify which MT errors are most detrimental to the task and how different cross-lingual information retrieval (CLIR) systems respond to different kinds of MT errors. We describe two main types of CLQA errors caused by MT errors: lost in retrieval errors, where relevant results are not returned, and lost in translation errors, where relevant results are perceived irrelevant due to inadequate MT. To address the lost in retrieval errors, we introduce two novel models for cross-lingual information retrieval that combine complementary source-language and target-language information from MT. We show empirically that these hybrid, bilingual models outperform both monolingual models and a prior hybrid model. Even once relevant results are retrieved, if they are not translated adequately, users will not understand that they are relevant. Rather than improving a specific MT system, we take a more general approach that can be applied to the output of any MT system. 
Our adequacy-oriented automatic post-editors (APEs) use resources from the CLQA context and information from the MT system to automatically detect and correct phrase-level errors in MT at query time, focusing on the errors that are most likely to impact CLQA: deleted or missing content words and mistranslated named entities. Human evaluations show that these adequacy-oriented APEs can successfully adapt task-agnostic MT systems to the needs of the CLQA task. Since there is no existing test data for translingual QA or IR tasks, we create a translingual information retrieval (TLIR) evaluation corpus. Furthermore, we develop an analysis framework for isolating the impact of MT errors on CLIR and on result understanding, as well as evaluating the whole TLIR task. We use the TLIR corpus to carry out a task-embedded MT evaluation, which shows that our CLIR models address lost in retrieval errors, resulting in higher TLIR recall, and that the APEs successfully correct many lost in translation errors, leading to more adequately translated results.
Who, What, When, Where, Why? Comparing Multiple Approaches to the Cross-Lingual 5W Task
Cross-lingual tasks are especially difficult due to the compounding effect of errors in language processing and errors in machine translation (MT). In this paper, we present an error analysis of a new cross-lingual task: the 5W task, a sentence-level understanding task which seeks to return the English 5W's (Who, What, When, Where and Why) corresponding to a Chinese sentence. We analyze systems that we developed, identifying specific problems in language processing and MT that cause errors. The best cross-lingual 5W system was still 19% worse than the best monolingual 5W system, which shows that MT significantly degrades sentence-level understanding. Neither source-language nor target-language analysis was able to circumvent problems in MT, although each approach had advantages relative to the other. A detailed error analysis across multiple systems suggests directions for future research on the problem.
Characterization of Trapped Lignin-Degrading Microbes in Tropical Forest Soil
Lignin is often the most difficult portion of plant biomass to degrade, with fungi generally thought to dominate during late stage decomposition. Lignin in feedstock plant material represents a barrier to more efficient plant biomass conversion and can also hinder enzymatic access to cellulose, which is critical for biofuels production. Tropical rain forest soils in Puerto Rico are characterized by frequent anoxic conditions and fluctuating redox, suggesting the presence of lignin-degrading organisms and mechanisms that are different from known fungal decomposers and oxygen-dependent enzyme activities. We explored microbial lignin-degraders by burying bio-traps containing lignin-amended and unamended biosep beads in the soil for 1, 4, 13 and 30 weeks. At each time point, phenol oxidase and peroxidase enzyme activity was found to be elevated in the lignin-amended versus the unamended beads, while cellulolytic enzyme activities were significantly depressed in lignin-amended beads. Quantitative PCR of bacterial communities showed more bacterial colonization in the lignin-amended compared to the unamended beads after one and four weeks, suggesting that the lignin supported increased bacterial abundance. The microbial community was analyzed by small subunit 16S ribosomal RNA genes using microarray (PhyloChip) and by high-throughput amplicon pyrosequencing based on universal primers targeting bacterial, archaeal, and eukaryotic communities. Community trends were significantly affected by time and the presence of lignin on the beads. Lignin-amended beads had higher relative abundances of representatives from the phyla Actinobacteria, Firmicutes, Acidobacteria and Proteobacteria compared to unamended beads. This study suggests that in low and fluctuating redox soils, bacteria could play a role in anaerobic lignin decomposition.
Combining Signals for Cross-Lingual Relevance Feedback
We present a new cross-lingual relevance feedback model that improves a machine-learned ranker for a language with few training resources, using feedback from a better ranker for a language that has more training resources. The model focuses on linguistically non-local queries, such as [world cup] and [copa mundial], that have similar user intent in different languages, thus allowing the low-resource ranker to get direct relevance feedback from the high-resource ranker. Our model extends prior work by combining both query- and document-level relevance signals using a machine-learned ranker. On an evaluation with web data sampled from a real-world search engine, the proposed cross-lingual feedback model outperforms two state-of-the-art models across two different low-resource languages.
E-rating Machine Translation
We describe our submissions to the WMT11 shared MT evaluation task: MTeRater and MTeRater-Plus. Both are machine-learned metrics that use features from e-rater®, an automated essay scoring engine designed to assess writing proficiency. Despite using only features from e-rater and without comparing to translations, MTeRater achieves a sentence-level correlation with human rankings equivalent to BLEU. Since MTeRater only assesses fluency, we build a meta-metric, MTeRater-Plus, that incorporates adequacy by combining MTeRater with other MT evaluation metrics and heuristics. This meta-metric has a higher correlation with human rankings than either MTeRater or individual MT metrics alone. However, we also find that e-rater features may not have significant impact on correlation in every case.
Lessons Learned from a PLTL-CS Program
The Peer-Led Team Learning (PLTL) approach has previously been shown to be effective in recruiting and retaining students, particularly under-represented students, in undergraduate introductory CS courses. In PLTL, small groups of students are led by an undergraduate peer and work together to solve problems related to CS. At Columbia University, the Columbia Emerging Scholars Program has used PLTL in an effort to increase enrollment in CS courses beyond the introductory level, and to increase the number of students who select Computer Science as their major, by demonstrating that CS is necessarily a collaborative activity that focuses more on problem solving and algorithmic thinking than on programming. Over the past five semesters, 68 students have completed the program, and preliminary results indicate that this program has had a positive effect on increasing participation in the major. This paper discusses our experiences of building and expanding the Columbia Emerging Scholars Program, and addresses such topics as recruiting, training, scheduling, student behavior, and evaluation. We expect that this paper will provide a valuable set of lessons learned to other educators who seek to launch or grow a PLTL program at their institution.
Simultaneous multilingual search for translingual information retrieval
We consider the problem of translingual information retrieval, where monolingual searchers issue queries in a different language than the document language(s) and the results must be returned in the language they know, the query language. We present a framework for translingual IR that integrates document translation and query translation into the retrieval model. The corpus is represented as an aligned, jointly indexed “pseudo-parallel” corpus, where each document contains the text of the document along with its translation into the query language. The queries are formulated as multilingual structured queries, where each query term and its translations into the document language(s) are treated as synonym sets. This model leverages simultaneous search in multiple languages against jointly indexed documents to improve the accuracy of results over search using document translation or query translation alone. For query translation, we compared a statistical machine translation (SMT) approach to a dictionary-based approach. We found that using a Wikipedia-derived dictionary for named entities combined with an SMT-based dictionary worked better than SMT alone. Simultaneous multilingual search also has other important features suited to translingual search, since it can provide an indication of poor document translation when a match with the source document is found. We show how close integration of CLIR and SMT allows us to improve result translation in addition to IR results.
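The synonym-set idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy documents, and the raw term-frequency scoring are all assumptions made for illustration, since the abstract does not specify a ranking function. Each document is indexed as its source text plus its translation, and each query term is grouped with its translations so that a match in either language counts toward the same term.

```python
from collections import Counter


def build_joint_index(docs):
    """Index each document as source text plus its translation.

    `docs` maps a document id to a (source_text, translated_text) pair,
    a toy stand-in for a jointly indexed "pseudo-parallel" corpus.
    """
    index = {}
    for doc_id, (source_text, translated_text) in docs.items():
        tokens = source_text.lower().split() + translated_text.lower().split()
        index[doc_id] = Counter(tokens)
    return index


def score(doc_terms, structured_query):
    """Sum term frequencies within each synonym set.

    A query term and its translations are treated as one term, so a hit
    in either language contributes to the same set. Raw term frequency
    is an assumption here, chosen only to keep the sketch short.
    """
    total = 0
    for synonym_set in structured_query:
        total += sum(doc_terms[term] for term in synonym_set)
    return total


def search(index, structured_query):
    """Return ids of matching documents, best score first."""
    scored = {doc_id: score(terms, structured_query) for doc_id, terms in index.items()}
    matching = [doc_id for doc_id, s in scored.items() if s > 0]
    return sorted(matching, key=lambda doc_id: scored[doc_id], reverse=True)


# Hypothetical toy data: Spanish documents with English translations.
docs = {
    "d1": ("la copa mundial de futbol", "the world cup of soccer"),
    "d2": ("el precio del petroleo", "the price of oil"),
}
# Each set holds one English query term and its Spanish translation.
query = [{"cup", "copa"}, {"world", "mundial"}]

index = build_joint_index(docs)
results = search(index, query)
```

In this toy setup a document can match through its source text, its translation, or both, which is the property the abstract relies on when it notes that a source-side match can indicate a poor document translation.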