
    Text Representation for Nonconcatenative Morphology

    Neural machine translation (NMT) has improved immensely over the last six years, reaching state-of-the-art translation quality with the help of neural networks, yet it still falls short of human-level translation. In this thesis, we propose new approaches to improving the language representation given as input to NMT systems. This can be achieved by exploiting language-specific knowledge such as phonetic alternations, morphology, and syntax. We exploit morphological phenomena in Turkish and Hebrew and show that the proposed segmentation approaches can improve translation quality. We compared several segmentation approaches with one another, all rooted in language-specific morphological analysis of Turkish and Hebrew, and examined the effect of each on translation quality. We trained six Transformer models with different segmentation approaches and compared them, evaluating the translation quality of each using two automatic metrics and human evaluation. The segmentation approaches improved translation quality under human evaluation but not under the automatic metrics. This underscores the importance of human evaluation for NMT and shows that automatic metrics can often be misleading.
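As an illustrative sketch only (not the thesis's actual analyzer), morphological segmentation as an NMT preprocessing step might strip known suffixes from an agglutinative word and mark them so subwords can be re-joined after decoding. The toy suffix inventory below is hypothetical:

```python
# Hedged sketch of morphological segmentation as NMT preprocessing.
# The suffix inventory is a toy, Turkish-like example, not a real analyzer.
SUFFIXES = ["ler", "lar", "in", "ın", "de", "da"]

def segment(word: str) -> list[str]:
    """Greedily strip known suffixes from the right, marking each with
    '@@' so the translation pipeline can re-join subwords after decoding."""
    morphemes = []
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            # Keep at least a two-character stem.
            if word.endswith(suffix) and len(word) > len(suffix) + 1:
                morphemes.insert(0, "@@" + suffix)
                word = word[: -len(suffix)]
                changed = True
                break
    return [word] + morphemes

# "evlerin" ("of the houses") -> stem "ev" plus plural and genitive suffixes
print(segment("evlerin"))  # -> ['ev', '@@ler', '@@in']
```

A real system would use a full morphological analyzer rather than greedy suffix stripping, but the input/output contract for the NMT model is the same: a sequence of marked morpheme tokens in place of whole words.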

    How Multilingual is Multilingual LLM?

    Large Language Models (LLMs), trained predominantly on extensive English data, often exhibit limitations when applied to other languages. Current research focuses primarily on enhancing the multilingual capabilities of these models through various tuning strategies. Despite their effectiveness in certain languages, our understanding of the multilingual abilities of LLMs remains incomplete. This study evaluates the multilingual capacity of LLMs through an exhaustive analysis across 101 languages and classifies languages with similar characteristics into four distinct quadrants. By delving into each quadrant, we shed light on the rationale behind the categorization and offer actionable guidelines for tuning these languages. Extensive experiments reveal that existing LLMs possess multilingual capabilities that surpass our expectations, and that multilingual performance can be significantly improved by focusing on the distinct attributes present in each quadrant.
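The abstract does not state which two properties define the quadrants, but the mechanics of such a scheme can be sketched: score each language on two axes and bucket it by thresholds. The axes below (pretraining-data coverage and tuning gain) and the example scores are assumptions for illustration:

```python
# Hedged sketch: bucketing languages into four quadrants along two
# hypothetical axes; the paper's actual criteria may differ.

def quadrant(data_coverage: float, tuning_gain: float,
             coverage_cut: float = 0.5, gain_cut: float = 0.5) -> str:
    """Return a quadrant label for a language given two scores in [0, 1]."""
    high_cov = data_coverage >= coverage_cut
    high_gain = tuning_gain >= gain_cut
    if high_cov and high_gain:
        return "Q1: well covered, tuning helps"
    if high_cov:
        return "Q2: well covered, tuning saturates"
    if high_gain:
        return "Q3: low resource, tuning helps"
    return "Q4: low resource, tuning struggles"

# Illustrative (made-up) scores for three languages.
langs = {"de": (0.9, 0.6), "sw": (0.2, 0.7), "en": (1.0, 0.1)}
for code, (cov, gain) in langs.items():
    print(code, quadrant(cov, gain))
```

Grouping languages this way makes the tuning guideline actionable: each quadrant gets one strategy instead of 101 per-language recipes.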

    SeamlessM4T-Massively Multilingual & Multimodal Machine Translation

    What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communicatio
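The improvements above are reported in BLEU, which scores a candidate translation by clipped n-gram overlap with a reference. A minimal sentence-level sketch of the metric (real evaluations use corpus-level tooling such as sacreBLEU, and ASR-BLEU first transcribes the speech output):

```python
# Minimal sentence-level BLEU sketch, for illustrating what the metric
# measures; not a replacement for corpus-level sacreBLEU.
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Clip each n-gram's count by its count in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * geo_mean

cand = "the cat sat on the mat".split()
print(bleu(cand, cand))  # identical sentences score 1.0
```

A gain of 1.3 BLEU points on this 0-100 scale (after the usual x100 scaling) is a meaningful margin between strong systems.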

    Proceedings of the IATS 2022 Panel on Tibetan Digital Humanities and Natural Language Processing


    PaLM: Scaling Language Modeling with Pathways

    Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system that enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state of the art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis of bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
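Few-shot learning as described here needs no gradient updates: task examples are simply concatenated ahead of the query in the prompt. A minimal sketch of that prompt assembly, with an illustrative format and made-up examples:

```python
# Sketch of few-shot prompt assembly: demonstrations are concatenated
# before the query so the model adapts in-context, without fine-tuning.
# The Q/A format and the arithmetic examples are illustrative only.

def few_shot_prompt(examples, query, input_label="Q", output_label="A"):
    """Build a prompt from (input, output) demonstration pairs plus a query."""
    lines = []
    for inp, out in examples:
        lines.append(f"{input_label}: {inp}")
        lines.append(f"{output_label}: {out}")
    # The query is left unanswered; the model completes after the label.
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")
    return "\n".join(lines)

shots = [("2 + 2", "4"), ("7 - 3", "4")]
print(few_shot_prompt(shots, "5 + 8"))
```

The number of demonstration pairs is the "shot count"; the abstract's point is that larger models extract more from the same handful of in-context examples.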

    Meaning refinement to improve cross-lingual information retrieval

    Magdeburg, University, Faculty of Computer Science, Dissertation, 2012, by Farag Ahme

    Investigation into the genetic basis of bovine horn development

    Get PDF
    The presence of horns in ruminants has financial and welfare implications for the farming of cattle, sheep and goats worldwide. The genetic interactions that lead to horn development are not known. Hornless, or polled, cattle occur naturally, but the known causative DNA variants (Celtic, Friesian, Mongolian and Guarani) lie in intergenic regions on bovine chromosome 1, and therefore their functions are not known. The leading hypothesis is that horns are derived from cranial neural crest cells and that the POLLED variants disrupt the migration or proliferation of these stem cells. The bovine POLLED region was explored through bioinformatics analyses, as horned animals may have genomic differences from hornless individuals or species near the POLLED DNA variants. The aim was to identify differences in gene synteny, lincRNAs, and topologically associating domain (TAD) structure between horned and hornless individuals or species. Horned (n = 1) and polled (Celtic; n = 1) Hi-C sequences produced the same TAD structures. The POLLED genomic region was refined to a 520-kb region encompassing all four POLLED variants. LOC526226 was unique to the bovine POLLED region and was not conserved in the species analysed (water buffalo, sheep, goat, pig, horse, dog and human); it may therefore be involved in horn development.

    Histological analyses of cranial tissues from homozygous horned and polled fetuses at day 58 of development were conducted. The aims were to 1) determine the differences in the structure of the horn bud region, and 2) compare immunohistochemistry staining of neural crest markers (SOX10 and NGFR) and RXFP2 between horned and polled tissues. Condensed cells were observed only in the horn bud mesenchyme of horned fetuses and may be progenitor cells. SOX10 and NGFR were not detected in these condensed cells; therefore, these cells either are not derived from the neural crest or have differentiated and no longer express neural crest markers. SOX10 and NGFR were detected in the peripheral nerves. RXFP2 was detected in peripheral nerves and in the horn bud epidermis.

    Transcriptomic analyses of cranial tissues from the horned and polled fetuses at day 58 of development were also conducted. The aims were to 1) identify genes that may be directly affected by the POLLED variants, and 2) identify genes and pathways important for horn development. Near the POLLED region, three genes (C1H21orf62, SON and EVA1C) and one lincRNA (LOC112447120) were differentially expressed between horned and polled fetuses. Previously identified candidate genes RXFP2, TWIST2 and ZEB2 were also differentially expressed. New candidates for the horn development pathway were proposed based on the analyses (MEIS2, PBX3, FZD8, CTNNB1 and LEF1). LOC526226 was not differentially expressed in the horn bud. Differentially expressed genes had functions in axon guidance, cytoskeletal structure and the extracellular region; these pathways may therefore be vital for horn development. Based on this research, it is now hypothesised that 1) horn stem cells are located in the mesenchyme and interact with the epidermis to initiate horn development, 2) the Celtic POLLED variant directly affects expression of C1H21orf62, SON, EVA1C and LOC112447120, and 3) the migration of horn stem cells is reduced by the effect of the POLLED variants upon C1H21orf62, SON, EVA1C and/or LOC112447120 expression.

    Thesis (Ph.D.) -- University of Adelaide, School of Animal and Veterinary Sciences, 202