3 research outputs found

    Robust Neural Machine Translation

    This thesis aims for general, robust Neural Machine Translation (NMT) that is agnostic to the test domain. NMT has achieved high quality on benchmarks with closed datasets such as WMT and NIST, but it can fail when the translation input contains noise due to, for example, mismatched domains or spelling errors. The standard solution is to apply domain adaptation or data augmentation to build a domain-dependent system. However, in real life, the input noise spans a wide range of domains and types that are unknown in the training phase. This thesis introduces five general approaches to improve NMT accuracy and robustness, three of which are invariant to models, test domains, and noise types. First, we describe Lex-Var, a novel unsupervised text normalization framework that reduces lexical variation for NMT. Then, we apply phonetic encoding as auxiliary linguistic information and obtain a significant improvement (5 BLEU points) in translation quality and robustness. Furthermore, we introduce a random clustering encoding method, based on our hypothesis of Semantic Diversity by Phonetics, that generalizes to all languages. We also discuss two domain adaptation models for a known test domain. Finally, we provide a measure of translation robustness based on the consistency of translation accuracy among samples and use it to evaluate our other methods. All these approaches are verified with extensive experiments across different languages and achieve significant and consistent improvements in translation quality and robustness over state-of-the-art NMT.

    Short message service normalization for communication with a health information system

    Philosophiae Doctor - PhD. Short Message Service (SMS) is one of the most widely used services for communication between mobile phone users. In recent times it has also been proposed as a means of information access. However, several challenges must be overcome in order to process an SMS, especially when it is used as a query in an information retrieval system. SMS users often deliberately use compact and grammatically incorrect writing, which makes the message difficult to process with conventional information retrieval systems. To overcome this, a pre-processing step known as normalization is required. In this thesis, an investigation of SMS normalization algorithms is carried out. To this end, studies have been conducted into the design of algorithms for translating and normalizing SMS text. Character-based, unsupervised, and rule-based techniques are presented. An investigation was also undertaken into the design and development of a system for information access via SMS. A specific system was designed to access information related to a Frequently Asked Questions (FAQ) database in healthcare, using a case study. This study also secures SMS communication, especially for healthcare information systems. The proposed technique is to encipher the messages using the Secure Shell (SSH) protocol.

    Normalization of Homophonic Words in Chinese Microblogs

    No full text