17 research outputs found

    Source Error-Projection for Sample Selection in Phrase-Based SMT for Resource-Poor Languages

    Get PDF
    Abstract The unavailability of parallel training corpora in resource-poor languages is a major bottleneck in cost-effective and rapid deployment of statistical machine translation (SMT) technology. This has spurred significant interest in active learning for SMT to select the most informative samples from a large candidate pool. This is especially challenging when irrelevant outliers dominate the pool. We propose two supervised sample selection methods, viz. greedy selection and integer linear programming (ILP), based on a novel measure of benefit derived from error analysis. These methods support the selection of diverse and high-impact, yet relevant batches of source sentences. Comparative experiments on multiple test sets across two resource-poor language pairs (English-Pashto and English-Dari) reveal that the proposed approaches achieve BLEU scores comparable to the full system using a very small fraction of all available training data (ca. 6% for E-P and 13% for E-D). We further demonstrate that the ILP method supports global constraints of significant practical value

    AUTOMATIC CLASSIFICATION OF QUESTION TURNS IN SPONTANEOUS SPEECH USING LEXICAL AND PROSODIC EVIDENCE

    No full text
    The ability to identify speech acts reliably is desirable in any spoken language system that interacts with humans. Minimally, such a system should be capable of distinguishing between question-bearing turns and other types of utterances. However, this is a non-trivial task, since spontaneous speech tends to have incomplete syntactic, and even ungrammatical, structure and is characterized by disfluencies, repairs and other non-linguistic vocalizations that make simple rule based pattern learning difficult. In this paper, we present a system for identifying question-bearing turns in spontaneous multi-party speech (ICSI Meeting Corpus) using lexical and prosodic evidence. On a balanced test set, our system achieves an accuracy of 71.9 % for the binary question vs. non-question classification task. Further, we investigate the robustness of our proposed technique to uncertainty in the lexical feature stream (e.g. caused by speech recognition errors). Our experiments indicate that classification accuracy of the proposed method is robust to errors in the text stream, dropping only about 0.8 % for every 10 % increase in word error rate (WER). Index Terms — question turn, speech act, dialog, prosody, spontaneous speech 1

    Automatic Prosodic Event Detection Using Acoustic, Lexical, and Syntactic Evidence

    No full text
    corecore