
    Specialized translation at work for a small expanding company: my experience of internationalizing Bioretics© S.r.l. into Chinese

    Global markets are currently immersed in two all-encompassing and unstoppable processes: internationalization and globalization. While the former pushes companies to look beyond the borders of their country of origin and forge relationships with foreign trading partners, the latter fosters standardization across countries by reducing spatio-temporal distances and breaking down geographical, political, economic and socio-cultural barriers. In recent decades, a further force has emerged to propel these unifying drives: Artificial Intelligence, with its technologies that aim to reproduce human cognitive abilities in machines. The “Language Toolkit – Le lingue straniere al servizio dell’internazionalizzazione dell’impresa” project, promoted by the Department of Interpreting and Translation (Forlì Campus) in collaboration with the Romagna Chamber of Commerce (Forlì-Cesena and Rimini), seeks to help Italian SMEs make their way into the global market. This dissertation was conceived within that project. Its purpose is to present the English-to-Chinese translation and localization of a series of texts produced by Bioretics© S.r.l.: an investor deck, the company website and part of the installation and use manual of the Aliquis© framework software, the company's flagship product. The dissertation is structured as follows: Chapter 1 presents the project and the company in detail; Chapter 2 outlines the internationalization and globalization processes and the Artificial Intelligence market in both Italy and China; Chapter 3 provides the theoretical foundations for every aspect of specialized translation, including website localization; Chapter 4 describes the resources and tools used to perform the translations; Chapter 5 proposes an analysis of the source texts; Chapter 6 is a commentary on translation strategies and choices.

    Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora

    Grammatical error correction (GEC) is the task of correcting typos, spelling, punctuation and grammatical issues in text. Approaching the problem as a sequence-to-sequence task, we compare the use of a common subword unit vocabulary and byte-level encoding. Initial synthetic training data is created using an error-generating pipeline and used to finetune two subword-level models and one byte-level model. The models are then finetuned further on hand-corrected error corpora, including texts written by children, university students, dyslexic and second-language writers, and evaluated over different error types and origins. We show that a byte-level model enables higher correction quality than a subword approach, not only for simple spelling errors but also for more complex semantic, stylistic and grammatical issues. In particular, initial training on synthetic corpora followed by finetuning on a relatively small parallel corpus of real-world errors helps the byte-level model correct a wide range of commonly occurring errors. Our experiments are run for the Icelandic language but should hold for other similar languages, particularly morphologically rich ones.
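
    As a minimal illustration of the two input representations compared above, the snippet below contrasts a byte-level view of an Icelandic sentence (plain UTF-8 bytes, which can always represent rare characters such as “Þ”) with a toy subword vocabulary that must fall back to an unknown token. The toy vocabulary is invented for the example and is not the vocabulary used in the paper.

# Contrast a byte-level input representation with a toy subword vocabulary.
# The subword vocabulary below is a made-up example, not the one used in the paper.
text = "Þetta er prófun"  # Icelandic: "This is a test"

# Byte-level view: every string maps to a sequence of values in 0-255, so
# rare characters such as "Þ" are always representable.
byte_ids = list(text.encode("utf-8"))
print(byte_ids)

# Toy subword view: substrings outside the fixed vocabulary become <unk>.
vocab = {"Þetta": 0, "er": 1, "próf": 2, "un": 3, "<unk>": 4}

def toy_subword_tokenize(word, vocab):
    """Greedy longest-prefix match against the toy vocabulary."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("<unk>")
            i += 1
    return pieces

print([toy_subword_tokenize(w, vocab) for w in text.split()])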

    The automatic processing of multiword expressions in Irish

    It is well documented that multiword expressions (MWEs) pose a unique challenge to a variety of NLP tasks such as machine translation, parsing, information retrieval, and more. For low-resource languages such as Irish, these challenges can be exacerbated by the scarcity of data and a lack of research on this topic. In order to improve the handling of MWEs in various NLP tasks for Irish, this thesis addresses both the lack of resources specifically targeting MWEs in Irish and the question of how these resources can be applied to said NLP tasks. We report on the creation and analysis of a number of lexical resources as part of this PhD research. Ilfhocail, a lexicon of Irish MWEs, is created by extracting MWEs from other lexical resources such as dictionaries. A corpus annotated with verbal MWEs in Irish is created for the inclusion of Irish in the PARSEME Shared Task 1.2. Additionally, MWEs are tagged in a bilingual EN-GA corpus for inclusion in machine translation experiments. For the purposes of annotation, a categorisation scheme covering nine categories of Irish MWEs is created, based on a combination of linguistic analysis of these constructions and cross-lingual frameworks for defining MWEs. A case study in applying MWEs to NLP tasks is undertaken, exploring the incorporation of MWE information while training neural machine translation systems. Finally, the topic of automatic identification of Irish MWEs is explored, documenting the training of a system capable of automatically identifying Irish MWEs from a variety of categories, and the challenges associated with developing such a system. This research contributes towards a greater understanding of Irish MWEs and their applications in NLP, and provides a foundation for future work exploring other methods for the automatic discovery and identification of Irish MWEs and further developing the MWE resources described above.
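
    As a rough sketch of the lexicon-based side of the identification task described above, the snippet below matches multiword entries from a small lexicon against a tokenised Irish sentence. The three entries and the example sentence are illustrative stand-ins, not data drawn from the Ilfhocail lexicon.

# Toy lexicon-based MWE identification: mark token spans that form a known
# multiword entry. The entries and the sentence are illustrative only.
mwe_lexicon = {
    ("ar", "bith"),        # "any / at all"
    ("os", "comhair"),     # "in front of"
    ("i", "gcónaí"),       # "always"
}
max_len = max(len(entry) for entry in mwe_lexicon)

def identify_mwes(tokens, lexicon):
    """Greedy left-to-right longest match; returns (start, end, entry) spans."""
    spans, i = [], 0
    while i < len(tokens):
        for length in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + length])
            if candidate in lexicon:
                spans.append((i, i + length, candidate))
                i += length
                break
        else:
            i += 1
    return spans

sentence = "Bíonn sé anseo i gcónaí".split()
print(identify_mwes(sentence, mwe_lexicon))   # [(3, 5, ('i', 'gcónaí'))]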

    24th Nordic Conference on Computational Linguistics (NoDaLiDa)


    Translationese indicators for human translation quality estimation (based on English-to-Russian translation of mass-media texts)

    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.
    Human translation quality estimation is a relatively new and challenging area of research, because human translation quality is notoriously more subtle and subjective than machine translation, which attracts much more attention and effort from the research community. At the same time, human translation is routinely assessed by education and certification institutions, as well as at translation competitions. Do the quality labels and scores generated from real-life quality judgments align well with objective properties of translations? This thesis puts this question to a test using machine learning methods. Conceptually, this research is built around the hypothesis that linguistic properties characteristic of translations, as a specific form of communication, can correlate with translation quality. This assumption is often made in translation studies but has never been put to a rigorous empirical test. Exploring translationese features in a quality estimation task can help identify quality-related trends in translational behaviour and provide data-driven insights into professionalism to improve training. Using translationese for quality estimation fits well with the concept of quality in translation studies, because it is essentially a document-level property. Linguistically motivated translationese features are also more interpretable than popular distributed representations and can explain linguistic differences between quality categories in human translation. We investigated (i) an extended set of Universal Dependencies-based morphosyntactic features, as well as two lexical feature sets capturing (ii) collocational properties of translations and (iii) ratios of vocabulary items in various frequency bands along with entropy scores from n-gram models. To compare the performance of our feature sets in translationese classification and quality estimation tasks against other representations, the experiments were also run on tf-idf features, QuEst++ features and contextualised embeddings from a range of pre-trained language models, including the state-of-the-art multilingual solution for machine translation quality estimation. Our major focus was on document-level prediction; however, where the labels and features allowed, the experiments were extended to the sentence level. The corpus used in this research includes English-to-Russian parallel subcorpora of student and professional translations of mass-media texts, and a register-comparable corpus of non-translations in the target language. Quality labels for various subsets of student translations come from a number of real-life settings: translation competitions, graded student translations, error annotations and direct assessment. We overview approaches to benchmarking quality in translation and provide a detailed description of our own annotation experiments. Of the three proposed translationese feature sets, the morphosyntactic features returned the best results on all tasks. In many settings they were second only to contextualised embeddings. At the same time, performance on the various representations was contingent on the type of quality captured by the quality labels/scores. Using the outcomes of machine learning experiments and feature analysis, we established that translationese properties of translations were not equally reflected by the various labels and scores. For example, professionalism was much less related to translationese than expected. Labels from document-level holistic assessment demonstrated maximum support for our hypothesis: lower-ranking translations clearly exhibited more translationese. They bore more traces of mechanical translational behaviour associated with following source-language patterns whenever possible, which led to inflated frequencies of analytical passives, modal predicates and verbal forms, especially copula verbs and verbs in the finite form. As expected, lower-ranking translations were more repetitive and had longer, more complex sentences. Higher-ranking translations were indicative of greater skill in recognising and counteracting translationese tendencies. For document-level holistic labels as an approach to capturing quality, translationese indicators might provide a valuable contribution to an effective quality estimation pipeline. However, error-based scores, and especially scores from sentence-level direct assessment, proved to be much less correlated with translationese and fluency issues in general. This was confirmed by relatively low regression results across all representations that had access only to the target-language side of the dataset, by feature analysis, and by the correlation between error-based scores and scores from direct assessment.
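
    The following sketch illustrates the kind of document-level pipeline described above: a couple of hand-picked morphosyntactic ratios (passives and finite verbs per sentence, computed from pre-parsed UPOS tags and morphological features) are fed to a simple scikit-learn classifier. The feature choice, the toy documents and the model are illustrative assumptions, not the thesis's actual feature set or experimental setup.

# Illustrative document-level quality classification from morphosyntactic ratios.
# Documents are assumed to be pre-parsed into (UPOS, feats) token tuples.
from sklearn.linear_model import LogisticRegression

def doc_features(parsed_doc, n_sentences):
    """Crude translationese-style ratios: passives and finite verbs per sentence."""
    passives = sum(1 for upos, feats in parsed_doc
                   if upos == "VERB" and "Voice=Pass" in feats)
    finite = sum(1 for upos, feats in parsed_doc
                 if upos in ("VERB", "AUX") and "VerbForm=Fin" in feats)
    return [passives / n_sentences, finite / n_sentences]

# Toy training documents (made up): two "lower-ranking" docs with inflated
# passive/finite-verb ratios, two "higher-ranking" docs without.
docs = [
    ([("VERB", "Voice=Pass|VerbForm=Fin"), ("AUX", "VerbForm=Fin"), ("NOUN", "_")], 1),
    ([("VERB", "Voice=Pass|VerbForm=Fin"), ("VERB", "VerbForm=Fin")], 1),
    ([("VERB", "Voice=Act|VerbForm=Fin"), ("NOUN", "_"), ("NOUN", "_")], 2),
    ([("NOUN", "_"), ("VERB", "Voice=Act|VerbForm=Inf")], 2),
]
labels = [0, 0, 1, 1]  # 0 = lower-ranking translation, 1 = higher-ranking

X = [doc_features(doc, n_sent) for doc, n_sent in docs]
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))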

    adaptNMT: an open-source, language-agnostic development environment for neural machine translation

    adaptNMT streamlines all processes involved in the development and deployment of RNN and Transformer neural translation models. As an open-source application, it is designed for both technical and non-technical users who work in the field of machine translation. Built upon the widely adopted OpenNMT ecosystem, the application is particularly useful for new entrants to the field, since the setup of the development environment and the creation of train, validation and test splits are greatly simplified. Graphing, embedded within the application, illustrates the progress of model training, and SentencePiece is used for creating subword segmentation models. Hyperparameter customization is facilitated through an intuitive user interface, and a single-click model development approach has been implemented. Models developed by adaptNMT can be evaluated using a range of metrics and deployed as a translation service within the application. To support eco-friendly research in the NLP space, a green report also flags the power consumption and kgCO2 emissions generated during model development. The application is freely available (http://github.com/adaptNMT).
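
    adaptNMT wraps these preprocessing steps behind its interface; as a rough stand-alone sketch of two of them, the snippet below carves out train/validation/test splits and trains a SentencePiece subword model. File names, split sizes and the vocabulary size are illustrative values, not adaptNMT's actual defaults.

# Illustrative preprocessing only: split a parallel corpus and train a
# SentencePiece model. Paths, split sizes and vocab size are example values,
# and "train.src" is assumed to exist with one source sentence per line.
import random
import sentencepiece as spm

def split_corpus(pairs, n_valid=2000, n_test=2000, seed=42):
    """Shuffle sentence pairs and carve out validation and test sets."""
    random.Random(seed).shuffle(pairs)
    return (pairs[n_valid + n_test:],          # train
            pairs[:n_valid],                   # validation
            pairs[n_valid:n_valid + n_test])   # test

# pairs = list of (source_sentence, target_sentence) tuples loaded elsewhere
# train, valid, test = split_corpus(pairs)

# Train a subword segmentation model on the source-side training text.
spm.SentencePieceTrainer.train(
    input="train.src",
    model_prefix="spm_src",
    vocab_size=8000,
    model_type="unigram",
)
sp = spm.SentencePieceProcessor(model_file="spm_src.model")
print(sp.encode("an example sentence", out_type=str))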

    Towards Reliable and Inclusive Natural Language Generation

    Natural language generation (NLG) is an important subfield of natural language processing (NLP) that produces natural language output. Despite notable advancements made by large-scale pre-trained language models in NLG, several challenges remain unresolved. This thesis aims to enhance NLG in two significant respects: reliability and inclusiveness. For reliability, on the one hand, we introduce novel training objectives that improve the alignment of language generation models with desired model behaviors. To improve the answerability of model-generated questions, we use a question answering model to provide additional rewards to a question generation model, encouraging the production of more answerable questions. In addition, we propose to train language models with a mixture of forward and reverse cross-entropies, demonstrating that the resulting models yield better generated text without complex decoding strategies. On the other hand, we propose novel evaluation methods to assess the performance of NLG models accurately and comprehensively. By combining human and automatic evaluations, we strike a balance between reliability and reproducibility. We delve into the unexplored issue of unfaithfulness in extractive summaries and conclude that extractive summarization does not guarantee faithfulness. For inclusiveness, we extend the coverage of NLG techniques to low-resource or endangered languages. We develop the first machine translation system supporting translation between Cherokee, an endangered Native American language, and English, and we propose a roadmap for utilizing NLP to support language revitalization efforts. Additionally, we investigate the underrepresentation of low-resource languages during multilingual tokenization, a crucial data preprocessing step in training multilingual NLG models, and we present best practices for training multilingual tokenizers. Overall, this thesis works towards enhancing the trustworthiness of NLG models in practice and facilitating support for a more diverse range of languages worldwide.
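
    One of the training objectives mentioned above mixes forward and reverse cross-entropies. The sketch below is a token-level illustration of that idea, not the thesis's exact formulation: the standard forward cross-entropy is interpolated with a reverse cross-entropy computed against a label-smoothed target distribution, where the smoothing value and the mixing weight are assumptions introduced so that the reverse term stays finite.

# Token-level illustration of mixing forward and reverse cross-entropies.
# The label smoothing and the mixing weight alpha are illustrative choices.
import torch
import torch.nn.functional as F

def mixed_ce_loss(logits, targets, vocab_size, alpha=0.5, smoothing=0.1):
    """alpha * forward CE + (1 - alpha) * reverse CE, averaged over tokens."""
    log_p_model = F.log_softmax(logits, dim=-1)            # (n_tokens, vocab)
    p_model = log_p_model.exp()

    # Smoothed "data" distribution q so that log q is finite everywhere.
    q = torch.full_like(log_p_model, smoothing / (vocab_size - 1))
    q.scatter_(-1, targets.unsqueeze(-1), 1.0 - smoothing)

    forward_ce = -(q * log_p_model).sum(-1)                # -sum_v q log p
    reverse_ce = -(p_model * q.log()).sum(-1)              # -sum_v p log q
    return (alpha * forward_ce + (1 - alpha) * reverse_ce).mean()

logits = torch.randn(4, 10)            # 4 tokens, vocabulary of 10
targets = torch.tensor([1, 3, 5, 7])
print(mixed_ce_loss(logits, targets, vocab_size=10))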

    MorphPiece : Moving away from Statistical Language Representation

    Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers for Large Language Models are based on statistical analysis of text corpora, with little consideration of linguistic features. We propose a linguistically motivated tokenization scheme, MorphPiece, which is based partly on morphological segmentation of the underlying text. A GPT-style causal language model trained with this tokenizer (called MorphGPT) shows superior convergence compared to the same architecture trained with a standard BPE tokenizer. Specifically, we obtain language modeling performance comparable to that of a model six times larger. Additionally, we evaluate MorphGPT on a variety of NLP tasks in supervised and unsupervised settings and find superior performance across the board compared to the GPT-2 model.
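
    A hedged sketch of the general idea described above: attempt a morphological segmentation first and fall back to a statistical subword tokenizer for words the morphological rules do not cover. The tiny suffix table and the character-level fallback are invented placeholders, not MorphPiece's actual segmentation rules or fallback vocabulary.

# Toy morphology-first tokenizer with a statistical fallback.
# The suffix table and the fallback are illustrative, not MorphPiece's rules.
SUFFIXES = ["ing", "ness", "ed", "ly", "s"]

def morph_segment(word):
    """Strip one known suffix if the remaining stem is long enough."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return [word[: -len(suf)], "##" + suf]
    return None

def fallback_segment(word):
    """Stand-in for a trained BPE model: split into characters."""
    return [word[0]] + ["##" + ch for ch in word[1:]]

def tokenize(text):
    pieces = []
    for word in text.lower().split():
        pieces.extend(morph_segment(word) or fallback_segment(word))
    return pieces

print(tokenize("unhappiness tokenizers are learning quickly"))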

    Systematic Approaches for Telemedicine and Data Coordination for COVID-19 in Baja California, Mexico

    Conference proceedings info: ICICT 2023: The 6th International Conference on Information and Computer Technologies, Raleigh, NC, United States, March 24-26, 2023, pages 529-542. https://doi.org/10.1007/978-981-99-3236-
    We provide a model for the systematic implementation of telemedicine within a large evaluation center for COVID-19 in the area of Baja California, Mexico. Our model is based on human-centric design factors and cross-disciplinary collaborations for scalable, data-driven enablement of smartphone, cellular, and video teleconsultation technologies to link hospitals, clinics, and emergency medical services for point-of-care assessments of COVID-19 testing, and for subsequent treatment and quarantine decisions. A multidisciplinary team was rapidly created in cooperation with different institutions, including the Autonomous University of Baja California, the Ministry of Health, the Command, Communication and Computer Control Center of the Ministry of the State of Baja California (C4), the Colleges of Medicine, and the College of Psychologists. Our objective is to provide information to the public, to evaluate COVID-19 in real time, and to track regional, municipal, and state-wide data in real time that informs supply chains and resource allocation in anticipation of a surge in COVID-19 cases.