    A discriminative latent variable-based "DE" classifier for Chinese–English SMT

    Syntactic reordering on the source side is an effective way of handling word order differences. The DE construction is a flexible and ubiquitous syntactic structure in Chinese and a major source of translation errors. In this paper, we propose a new classifier model, the discriminative latent variable model (DPLVM), to classify the DE construction, with the aim of improving classification accuracy and hence translation quality. We also propose a new feature which can automatically learn the reordering rules to a certain extent. The experimental results show that the MT systems using the data reordered by our proposed model outperform the baseline systems by 6.42% and 3.08% relative in terms of BLEU score on PB-SMT and hierarchical phrase-based MT respectively. In addition, we analyse the impact of DE annotation on word alignment and on the SMT phrase table.

    IMPROVING MOLECULAR FINGERPRINT SIMILARITY VIA ENHANCED FOLDING

    Drug discovery depends on scientists finding similarity in molecular fingerprints to the drug target. A new way to improve the accuracy of molecular fingerprint folding is presented. The goal is to alleviate a growing challenge posed by excessively long fingerprints. The improved method generates a new, shorter fingerprint that is more accurate than the basic folded fingerprint. Information gathered during preprocessing is used to determine an optimal attribute order. The most commonly used blocks of bits can then be organized and used to generate an improved fingerprint that folds more effectively. We then apply the widely used Tanimoto similarity search algorithm to benchmark our results. We show an improvement in the final results when using this method to generate the fingerprint, compared with other traditional folding methods.
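
    For orientation, the baseline that this abstract compares against can be sketched as follows. This is a minimal illustration of basic modulo (OR) folding and the Tanimoto coefficient on bit lists, not the enhanced folding method the paper proposes. The function names `fold` and `tanimoto` are hypothetical, and returning 0.0 for two all-zero fingerprints is an assumed convention.

```python
from typing import Sequence


def fold(bits: Sequence[int], target_len: int) -> list[int]:
    """Basic folding: OR bit i of the long fingerprint into slot i % target_len."""
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    folded = [0] * target_len
    for i, bit in enumerate(bits):
        if bit:
            folded[i % target_len] = 1
    return folded


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto coefficient: bits set in both / bits set in either."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have equal length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumed convention: two empty fingerprints have similarity 0.0.
    return both / either if either else 0.0


if __name__ == "__main__":
    long_fp = [1, 0, 0, 0, 0, 1, 0, 0]  # made-up 8-bit fingerprint
    short_fp = fold(long_fp, 4)          # bits 0 and 5 land in slots 0 and 1
    print(short_fp, tanimoto(short_fp, [1, 0, 1, 0]))
```

    Folding trades length for collisions: distinct bits of the long fingerprint can land in the same slot, which is the loss of accuracy that an improved attribute ordering aims to reduce.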

    A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena

    Word reordering is one of the most difficult aspects of statistical machine translation (SMT), and an important factor in its quality and efficiency. Despite the vast amount of research published to date, the interest of the community in this problem has not decreased, and no single method appears to be strongly dominant across language pairs. Instead, the choice of the optimal approach for a new translation task still seems to be mostly driven by empirical trials. To orient the reader in this vast and complex research area, we present a comprehensive survey of word reordering viewed as a statistical modeling challenge and as a natural language phenomenon. The survey describes in detail how word reordering is modeled within different string-based and tree-based SMT frameworks and as a stand-alone task, including systematic overviews of the literature in advanced reordering modeling. We then ask why some approaches are more successful than others in different language pairs. We argue that, besides measuring the amount of reordering, it is important to understand which kinds of reordering occur in a given language pair. To this end, we conduct a qualitative analysis of word reordering phenomena in a diverse sample of language pairs, based on a large collection of linguistic knowledge. Empirical results in the SMT literature are shown to support the hypothesis that a few linguistic facts can be very useful to anticipate the reordering characteristics of a language pair and to select the SMT framework that best suits them. Comment: 44 pages, to appear in Computational Linguistics.

    Improved phrase-based SMT with syntactic reordering patterns learned from lattice scoring

    In this paper, we present a novel approach to incorporating source-side syntactic reordering patterns into phrase-based SMT. The main contribution of this work is to use the lattice scoring approach to exploit reordering information that is favoured by the baseline PBSMT system. By referring to the parse trees of the training corpus, we represent the observed reorderings with source-side syntactic patterns. The extracted patterns are then used to convert the parsed inputs into word lattices, which contain both the original source sentences and their potential reorderings. Weights of the word lattices are estimated from the observations of the syntactic reordering patterns in the training corpus. Finally, the PBSMT system is tuned and tested on the generated word lattices to show the benefits of adding potential source-side reorderings to the inputs. We confirmed the effectiveness of our proposed method on a medium-sized corpus for a Chinese-English machine translation task. Our method outperformed the baseline system by 1.67% relative on a randomly selected test set and 8.56% relative on the NIST 2008 test set in terms of BLEU score.

    Linguistic Structure in Statistical Machine Translation

    This thesis investigates the influence of linguistic structure in statistical machine translation. We develop a word reordering model based on syntactic parse trees, and we address the issues of pronouns and morphological agreement with a source discriminative word lexicon that predicts the translation of individual words using structural features. When used in phrase-based machine translation, the models improve translation for language pairs with different word order and morphological variation.

    Description of the Chinese-to-Spanish rule-based machine translation system developed with a hybrid combination of human annotation and statistical techniques

    Two of the most popular Machine Translation (MT) paradigms are rule-based (RBMT) and corpus-based, the latter including statistical systems (SMT). When only scarce parallel corpora are available, RBMT becomes particularly attractive. This is the case for the Chinese-Spanish language pair. This article presents the first RBMT system for Chinese to Spanish. We describe a hybrid method for constructing this system that takes advantage of available resources such as parallel corpora, which are used to extract dictionaries and lexical and structural transfer rules. The final system is freely available online and open source. Although performance lags behind standard SMT systems on an in-domain test set, the results show that the RBMT system's coverage is competitive and that it outperforms the SMT system on an out-of-domain test set. This RBMT system is available to the general public, it can be further enhanced, and it opens up the possibility of creating future hybrid MT systems. Peer Reviewed. Postprint (author's final draft).

    Timesharing in relation to broad ability domains

    The concept of a timesharing ability has been the subject of considerable interest in recent times. The present study set out to determine whether a timesharing factor can be identified when a number of competing tasks are presented in the midst of a range of single tests designed to sample a broad range of psychological dimensions. Evidence for the existence of such a factor would form an important addition to our knowledge of human cognitive functioning. The framework for the study was provided by the theory of fluid and crystallized intelligence. A battery of single and competing tasks was presented to 126 subjects. The competing tasks represented a variety of within- and across-factor combinations from different levels of the (Gf/Gc) hierarchy. Modality of presentation was also varied in some combinations. On the basis of evidence presented in this study, it would be premature to seek to include a timesharing factor in the (Gf/Gc) model of intelligence.