
    Paradigm Completion for Derivational Morphology

    The generation of complex derived word forms has been an overlooked problem in NLP; we fill this gap by applying neural sequence-to-sequence models to the task. We give an overview of the theoretical motivation for a paradigmatic treatment of derivational morphology and introduce the task of derivational paradigm completion as a parallel to inflectional paradigm completion. State-of-the-art neural models, adapted from the inflection task, are able to learn a range of derivation patterns and outperform a non-neural baseline by 16.4%. However, due to the semantic, historical, and lexical considerations involved in derivational morphology, future work will be needed to achieve performance parity with inflection-generating systems. Comment: EMNLP 2017
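
    To make the task framing concrete, here is a minimal sketch of how derivational paradigm completion can be cast as character-level sequence-to-sequence learning. The tag names and the formatting helper below are illustrative assumptions, not the paper's actual data format.

    # Sketch: derivational paradigm completion as a character-level seq2seq
    # problem. Tags and helper names are illustrative only, not the paper's schema.

    def make_seq2seq_pair(lemma, derivation_tag, derived_form):
        # Source side: the characters of the lemma plus a derivational tag.
        # Target side: the characters of the derived form.
        source = list(lemma) + [derivation_tag]
        target = list(derived_form)
        return source, target

    # Hypothetical examples of derivation patterns such a model might learn.
    examples = [
        ("intense", "<NOMINAL>", "intensity"),
        ("approve", "<RESULT>", "approval"),
        ("employ", "<AGENT>", "employer"),
    ]

    for lemma, tag, form in examples:
        src, tgt = make_seq2seq_pair(lemma, tag, form)
        print(" ".join(src), "->", " ".join(tgt))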

    SMRT Chatbots: Improving Non-Task-Oriented Dialog with Simulated Multiple Reference Training

    Non-task-oriented dialog models suffer from poor-quality, non-diverse responses. To overcome limited conversational data, we apply Simulated Multiple Reference Training (SMRT; Khayrallah et al., 2020) and use a paraphraser to simulate multiple responses per training prompt. We find that SMRT improves over a strong Transformer baseline as measured by human and automatic quality scores and by lexical diversity. We also find that SMRT is comparable to pretraining on human evaluation of quality, and outperforms pretraining on automatic quality and lexical diversity, without requiring related-domain dialog data. Comment: EMNLP 2020 Camera Ready
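
    The core idea can be sketched as a simplified, offline variant: each training response is expanded into several simulated references before standard training. The `paraphrase` helper below is a hypothetical stand-in for a paraphrase model; the actual SMRT procedure may differ.

    # Sketch: expanding dialog training data with simulated references.
    # `paraphrase` is a hypothetical placeholder, not the SMRT implementation.
    import random

    def paraphrase(sentence, n):
        # Placeholder: a real system would sample n paraphrases from a trained
        # paraphrase model; here we return identical copies for illustration.
        return [sentence for _ in range(n)]

    def expand_with_simulated_references(dialog_pairs, n_refs=5):
        # For each (prompt, response) pair, add training pairs whose targets
        # are (simulated) alternative references for the same prompt.
        expanded = []
        for prompt, response in dialog_pairs:
            expanded.append((prompt, response))
            for alt in paraphrase(response, n_refs):
                expanded.append((prompt, alt))
        random.shuffle(expanded)
        return expanded

    data = [("how are you?", "i'm doing well, thanks for asking.")]
    print(expand_with_simulated_references(data, n_refs=2))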

    On-the-Fly Fusion of Large Language Models and Machine Translation

    We propose on-the-fly ensembling of a machine translation model with an LLM prompted on the same task and input. We perform experiments on 4 language pairs (both directions) with varying amounts of data. We find that a slightly weaker-at-translation LLM can improve the translations of an NMT model, and that ensembling with an LLM can produce better translations than ensembling two stronger MT models. We combine our method with various techniques from LLM prompting, such as in-context learning and translation context.
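
    The ensembling idea can be sketched as token-level interpolation of the two models' next-token distributions during decoding. This is a simplified illustration that assumes a shared vocabulary; the function and parameter names are ours, not the paper's.

    # Sketch: on-the-fly token-level ensembling of an NMT model and an LLM
    # prompted for the same translation. Assumes a shared vocabulary and that
    # each model exposes a next-token distribution; names are illustrative.
    import numpy as np

    def ensemble_step(p_nmt, p_llm, weight=0.5):
        # Interpolate the two next-token distributions and renormalize.
        p = weight * np.asarray(p_nmt) + (1.0 - weight) * np.asarray(p_llm)
        return p / p.sum()

    def greedy_ensemble_decode(nmt_next, llm_next, bos, eos, max_len=128, weight=0.5):
        # Greedy decoding: each step queries both models on the prefix so far.
        # `nmt_next` and `llm_next` return a probability vector over the vocabulary.
        prefix = [bos]
        for _ in range(max_len):
            p = ensemble_step(nmt_next(prefix), llm_next(prefix), weight)
            token = int(np.argmax(p))
            prefix.append(token)
            if token == eos:
                break
        return prefix

    # Toy demo with stub models that always predict the end-of-sequence token.
    stub = lambda prefix: np.eye(4)[1]  # vocabulary of 4 tokens, eos has id 1
    print(greedy_ensemble_decode(stub, stub, bos=0, eos=1, max_len=5))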

    Overcoming Data Challenges in Machine Translation

    Data-driven machine translation paradigms, which use machine learning to create translation models that can automatically translate from one language to another, have the potential to enable seamless communication across language barriers and improve global information access. For this to become a reality, machine translation must be available for all languages and styles of text. However, the translation quality of these models is sensitive to the quality and quantity of the data they are trained on. In this dissertation we address and analyze challenges arising from this sensitivity; we present methods that improve translation quality in difficult data settings and analyze the effect of data quality on machine translation quality. Machine translation models are typically trained on parallel corpora, but limited quantities of such data are available for most language pairs, leading to a low-resource problem. We present a method for transfer learning from a paraphraser to overcome data sparsity in low-resource settings. Even when training data is available in the desired language pair, it is frequently of a different style or genre than we would like to translate, leading to a domain mismatch; we present a method for improving translation quality in domain adaptation settings. A seemingly obvious approach when faced with a lack of data is to acquire more, but it is not always feasible to produce additional human translations. In such cases, one option is to crawl the web for additional training data. However, as we demonstrate, such data can be very noisy and can harm machine translation quality; our analysis motivated subsequent work on data filtering and cleaning by the broader community. The contributions in this dissertation not only improve translation quality in difficult data settings, but also serve as a reminder to carefully consider the impact of the data when training machine learning models.

    School Bullying among Intermediate School Students in Babylon Governorate

    The study aimed to investigate school bullying among intermediate (middle) school students in Babylon Governorate, in schools of the governorate's center and its districts, by constructing a school-bullying scale for intermediate-stage students and determining whether there are statistically significant differences in the rate of this behavior according to gender and place of residence. The study was carried out on a randomly selected sample of (300) male and female students in the second intermediate grade in Babylon Governorate for the 2022-2023 academic year, divided into (150) females and (150) males from four schools (two for boys and two for girls, split between rural and urban schools); the sample percentage was (21.27%), and the ratio of the sample to the original population was (0.024%). Based on a review of the literature and previous studies, a scale was developed. Its validity was confirmed by presenting its items to a group of specialist referees, who judged their suitability in terms of content and wording, and its reliability was established by administering the final version of the scale to a randomly chosen sample of (40) male and female students and re-administering it to the same students after (15) days. The questionnaire results were analyzed statistically to determine the rate of school bullying. The results indicated that the prevalence of school bullying was (2.27%) for the overall sample and (21.27%) for the research population, with a rate of (1.33%) among males and (9.33%) among females, indicating that (verbal) school bullying is more prevalent among females than among males. Statistical analysis also showed statistically significant differences in the rate of school bullying between rural and urban students, with the higher rate among rural students, while no statistically significant differences were found in the prevalence of school bullying among females between rural and urban areas. Based on this study, it is recommended to develop training programs for students on how to deal with school bullying, in order to understand their demands and psychological needs, reduce rates of school bullying, and help them achieve psychological, educational, and social adjustment, and to activate the work of the counseling centers and units in the Ministry of Education, the Ministry of Labor and Social Affairs, and the Ministry of Human Rights.

    On the Impact of Various Types of Noise on Neural Machine Translation

    We examine how various types of noise in the parallel training data impact the quality of neural machine translation systems. We create five types of artificial noise and analyze how they degrade performance in neural and statistical machine translation. We find that neural models are generally more harmed by noise than statistical models; for one especially egregious type of noise, they learn to simply copy the input sentence. Comment: Please cite as:

    @InProceedings{khayrallah-koehn:2018:WNMT,
      author    = {Khayrallah, Huda and Koehn, Philipp},
      title     = {On the Impact of Various Types of Noise on Neural Machine Translation},
      booktitle = {Proceedings of the Second Workshop on Neural Machine Translation and Generation},
      year      = {2018},
      address   = {Melbourne},
      publisher = {Association for Computational Linguistics}
    }
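
    The especially harmful noise type described in the abstract, where the target side is simply a copy of the source, can be simulated on a parallel corpus with a short sketch. This is illustrative only and not the paper's released code or exact procedure; it covers just one of the five noise types studied.

    # Sketch: injecting artificial "copied source" noise into a parallel corpus.
    # Illustrative only; not the paper's exact procedure.
    import random

    def inject_copy_noise(parallel_corpus, noise_ratio=0.1, seed=0):
        # Replace the target of a fraction of sentence pairs with a copy of the
        # source, producing a noisier corpus for controlled experiments.
        rng = random.Random(seed)
        noisy = []
        for src, tgt in parallel_corpus:
            if rng.random() < noise_ratio:
                noisy.append((src, src))  # target is now a copy of the source
            else:
                noisy.append((src, tgt))
        return noisy

    corpus = [("das ist ein test", "this is a test"),
              ("guten morgen", "good morning")]
    print(inject_copy_noise(corpus, noise_ratio=0.5))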

    SOTASTREAM: A Streaming Approach to Machine Translation Training

    Many machine translation toolkits make use of a data preparation step wherein raw data is transformed into a tensor format that can be used directly by the trainer. This preparation step is increasingly at odds with modern research and development practices because this process produces a static, unchangeable version of the training data, making common training-time needs difficult (e.g., subword sampling), time-consuming (preprocessing with large data can take days), expensive (e.g., disk space), and cumbersome (managing experiment combinatorics). We propose an alternative approach that separates the generation of data from the consumption of that data. In this approach, there is no separate pre-processing step; data generation produces an infinite stream of permutations of the raw training data, which the trainer tensorizes and batches as it is consumed. Additionally, this data stream can be manipulated by a set of user-definable operators that provide on-the-fly modifications, such as data normalization, augmentation, or filtering. We release an open-source toolkit, SOTASTREAM, that implements this approach: https://github.com/marian-nmt/sotastream. We show that it cuts training time, adds flexibility, reduces experiment management complexity, and reduces disk space, all without affecting the accuracy of the trained models.
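
    The separation of data generation from data consumption can be illustrated with a minimal sketch: an endless generator over permutations of the raw training data, with user-definable operators applied on the fly. This is a sketch of the idea only; the actual SOTASTREAM API and operator set differ (see the repository linked above).

    # Sketch: an infinite stream of permutations of the raw training data,
    # passed through a pipeline of user-defined operators. Illustrative only;
    # not the SOTASTREAM API.
    import random

    def stream(raw_lines, operators=(), seed=0):
        # Yield training lines forever: each pass is a fresh permutation of the
        # raw data, and each line runs through the operator pipeline. An operator
        # may modify a line or return None to filter it out.
        rng = random.Random(seed)
        while True:
            order = list(range(len(raw_lines)))
            rng.shuffle(order)
            for i in order:
                line = raw_lines[i]
                for op in operators:
                    line = op(line)
                    if line is None:
                        break
                if line is not None:
                    yield line

    # Example operators: normalization and length filtering.
    lowercase = lambda line: line.lower()
    drop_long = lambda line: line if len(line.split()) <= 100 else None

    data = ["Hello world .", "A second SENTENCE ."]
    s = stream(data, operators=(lowercase, drop_long))
    print([next(s) for _ in range(4)])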