3 research outputs found

    k-simplex volume optimizing projection algorithms for high-dimensional data sets

    2021 Spring. Includes bibliographical references. Many applications produce data sets that contain hundreds or thousands of features and consequently lie in very high-dimensional spaces. For purposes of analysis, it is desirable to reduce the dimension in a way that preserves certain important properties. Previous work has established conditions necessary for projecting data into lower dimensions while preserving pairwise distances up to some tolerance threshold, and algorithms have been developed to do so optimally. However, although similar criteria for projecting data into lower dimensions while preserving k-simplex volumes have been established, there are currently no algorithms that seek to optimally preserve such embedded volumes. In this work, two new algorithms are developed and tested: one that seeks to optimize the smallest projected k-simplex volume, and another that optimizes the average projected k-simplex volume.
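
    The two objectives named in this abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: it only evaluates, for one fixed orthogonal projection, how much of each k-simplex volume survives, using the standard Gram-determinant volume formula. The function names, the sample points, and the choice of a coordinate projection are all assumptions made for illustration.

```python
import math
from itertools import combinations

import numpy as np


def simplex_volume(vertices):
    """Volume of the k-simplex whose k+1 vertices are the rows of `vertices`."""
    v = np.asarray(vertices, dtype=float)
    edges = v[1:] - v[0]                 # k edge vectors from the first vertex
    gram = edges @ edges.T               # k x k Gram matrix of the edges
    k = edges.shape[0]
    det = max(float(np.linalg.det(gram)), 0.0)  # clamp round-off below zero
    return math.sqrt(det) / math.factorial(k)


def projected_volume_ratios(points, basis, k):
    """Ratio projected/original volume for every non-degenerate k-simplex.

    `basis` is assumed to have orthonormal columns, so `points @ basis`
    gives coordinates of the orthogonal projection onto their span.
    """
    points = np.asarray(points, dtype=float)
    projected = points @ basis
    ratios = []
    for idx in combinations(range(len(points)), k + 1):
        original = simplex_volume(points[list(idx)])
        if original > 1e-12:             # skip degenerate simplices
            ratios.append(simplex_volume(projected[list(idx)]) / original)
    return ratios


# Hypothetical data: four points in R^3, projected onto the x-y plane.
points = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
basis = np.array([[1, 0], [0, 1], [0, 0]], dtype=float)

ratios = projected_volume_ratios(points, basis, k=2)
smallest = min(ratios)                   # objective of the first algorithm
average = sum(ratios) / len(ratios)      # objective of the second algorithm
```

    An optimizer in the spirit of the abstract would search over `basis` to raise `smallest` or `average`; how the thesis performs that search is not stated here.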

    Robust Neural Machine Translation

    This thesis aims at general, robust Neural Machine Translation (NMT) that is agnostic to the test domain. NMT has achieved high quality on benchmarks with closed datasets such as WMT and NIST but can fail when the translation input contains noise due to, for example, mismatched domains or spelling errors. The standard solution is to apply domain adaptation or data augmentation to build a domain-dependent system. In real life, however, input noise spans a wide range of domains and types that are unknown in the training phase. This thesis introduces five general approaches to improve NMT accuracy and robustness, three of which are invariant to models, test domains, and noise types. First, we describe a novel unsupervised text normalization framework, Lex-Var, to reduce lexical variation for NMT. Then, we apply phonetic encoding as auxiliary linguistic information and obtain a substantial improvement (5 BLEU points) in translation quality and robustness. Furthermore, we introduce the random clustering encoding method, which is based on our hypothesis of Semantic Diversity by Phonetics and generalizes to all languages. We also discuss two domain adaptation models for the known test domain. Finally, we provide a measurement of translation robustness based on the consistency of translation accuracy among samples and use it to evaluate our other methods. All these approaches are verified with extensive experiments across different languages and achieve significant and consistent improvements in translation quality and robustness over state-of-the-art NMT.