202 research outputs found

    Cross-Lingual Neural Network Speech Synthesis Based on Multiple Embeddings

    Get PDF
    The paper presents a novel architecture and method for speech synthesis in multiple languages, in the voices of multiple speakers and in multiple speaking styles, even when speech from a particular speaker in the target language was not present in the training data. The method is based on applying neural network embeddings to combinations of speaker and style IDs, and also to phones in particular phonetic contexts, without any prior linguistic knowledge of their phonetic properties. This enables the network not only to efficiently capture similarities and differences between speakers and speaking styles, but also to establish appropriate relationships between phones belonging to different languages, and ultimately to produce synthetic speech in the voice of a given speaker in a language that he/she has never spoken. The validity of the proposed approach has been confirmed through experiments with models trained on speech corpora of American English and Mexican Spanish. It has also been shown that the proposed approach supports the use of neural vocoders, i.e. that they are able to produce synthesized speech of good quality even in languages that they were not trained on.
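The conditioning scheme described above (trainable embeddings for speaker, style, and context-dependent phone IDs) can be sketched roughly as follows. Table sizes, dimensions, and the plain concatenation are illustrative assumptions, not values or details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative embedding tables (sizes are assumptions, not from the paper):
# each ID is mapped to a trainable dense vector.
speaker_emb = rng.normal(size=(10, 16))   # 10 speakers, 16-dim
style_emb   = rng.normal(size=(4, 8))     # 4 speaking styles, 8-dim
phone_emb   = rng.normal(size=(200, 32))  # 200 context-dependent phones, 32-dim

def condition_vector(speaker_id, style_id, phone_id):
    """Concatenate the looked-up embeddings into one conditioning vector
    for the acoustic model; cross-lingual synthesis amounts to pairing a
    speaker embedding with phone embeddings learned from another language."""
    return np.concatenate([
        speaker_emb[speaker_id],
        style_emb[style_id],
        phone_emb[phone_id],
    ])

v = condition_vector(speaker_id=3, style_id=1, phone_id=42)
print(v.shape)  # (56,)
```

Because speakers and phones live in separate tables, any speaker vector can in principle be combined with phones seen only in the other language, which is the mechanism the abstract relies on.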

    SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

    Full text link
    What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communicatio
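The reported gains are measured in BLEU, which scores a hypothesis by modified n-gram precision against a reference, scaled by a brevity penalty. A minimal single-reference sketch for illustration only; real evaluations use corpus-level BLEU with smoothing (e.g. sacreBLEU), not this toy version:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # Geometric mean of n-gram precisions, times the brevity penalty.
    log_mean = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_mean)

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 2))  # 1.0
```

A perfect match scores 1.0; any missing n-gram order zeroes the unsmoothed score, which is why corpus-level smoothing is used in practice.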

    Multilingual journalism, news translation and discourse: converging methods, converging theories

    Get PDF
    This study explores a methodological and theoretical framework suitable for investigating multilingual journalism, news translation and the discourses that these two meaning-making activities promote and transmit to global audiences. Starting from the consideration that journalism and translation are both multi-layered objects and may conceal power dynamics and struggles within society, I suggest that such complexities can only be addressed and analysed if deconstructed and successively reconstructed through a combination of methods and theoretical perspectives. This study merges different methodological approaches, including Critical Discourse Analysis, Corpus Linguistics and comparative analysis, in order to grasp the complexities and ramifications of different forms of journalism (i.e. broadcasting, online, written and a mix of the three) in different national, supranational and international contexts (i.e. Italy, the UK, Europe, Australia and the Internet). This combination of methods and theories is termed "convergence" to indicate the convergence of approaches from a variety of fields of study relevant to the investigation of both multilingual news discourse and news translation, ultimately echoing the increasingly pervasive phenomenon of media convergence. The "convergence" framework allows us to move past the traditional Source Text – Target Text opposition, favouring a more flexible and wider concept of translation, one better fitted to the reality of language transfer processes in the news. To demonstrate the validity of the framework, this thesis presents a corpus of multilingual audio-visual news transcripts (AVNews Corpus) and four case studies: two employing the AVNews Corpus, one based on a more traditional comparative analysis, and one making use of a small parallel corpus. The four case studies aim to showcase the validity of this framework, ultimately calling for larger and more systematic studies of language transfer activities in the news.

    Self-supervised learning in natural language processing

    Get PDF
    Most natural language processing (NLP) learning algorithms require labeled data. While such data exist for a select number of (mostly English) tasks, labeled data are sparse or non-existent for the vast majority of use-cases. To alleviate this, unsupervised learning and a wide array of data augmentation techniques have been developed (Hedderich et al., 2021a). However, unsupervised learning often requires massive amounts of unlabeled data and also fails to perform in difficult (low-resource) data settings, i.e., when there is an increased distance between the source and target data distributions (Kim et al., 2020). Such a distributional distance can arise from domain drift or a large linguistic distance between the source and target data. Unsupervised learning in itself does not exploit the highly informative (labeled) supervisory signals hidden in unlabeled data. In this dissertation, we show that by combining the right unsupervised auxiliary task (e.g., sentence pair extraction) with an appropriate primary task (e.g., machine translation), self-supervised learning can exploit these hidden supervisory signals more efficiently than purely unsupervised approaches, while requiring less labeled data than supervised approaches. Our self-supervised learning approach can be used to learn NLP tasks efficiently, even when training data are sparse or come with strong differences in their underlying distribution, e.g., stemming from unrelated languages. In our general approach, we applied unsupervised learning as an auxiliary task to learn a supervised primary task. Concretely, we focused on the auxiliary task of sentence pair extraction for sequence-to-sequence primary tasks (i.e., machine translation and style transfer), as well as language modeling, clustering, subspace learning and knowledge integration for primary classification tasks (i.e., hate speech detection and sentiment analysis).
For sequence-to-sequence tasks, we show that self-supervised neural machine translation (NMT) achieves competitive results on high-resource language pairs compared to unsupervised NMT while requiring less data. Further combining self-supervised NMT with augmentation techniques inspired by unsupervised NMT makes learning low-resource (similar, distant and unrelated) language pairs possible. Furthermore, using our self-supervised approach, we show how style transfer can be learned without parallel data, generating stylistic rephrasings with the highest overall performance on all tested tasks. For sequence-to-label tasks, we underline the benefit of auxiliary-task-based augmentation over primary-task augmentation. An auxiliary task that proved especially beneficial to primary task performance was subspace learning, which led to impressive gains in (cross-lingual) zero-shot classification performance on similar or distant target tasks, as well as on similar, distant and unrelated languages.
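Sentence pair extraction, the auxiliary task highlighted above, is typically done by scoring cross-lingual sentence embeddings for similarity and keeping pairs above a threshold. A minimal cosine-similarity sketch with toy vectors; a real system would use learned multilingual sentence embeddings, and the threshold here is an arbitrary assumption:

```python
import numpy as np

def extract_pairs(src_embs, tgt_embs, threshold=0.9):
    """Return (src_index, tgt_index) pairs whose embeddings have cosine
    similarity above the threshold; the mined pairs then serve as
    pseudo-parallel training data for the primary task (e.g. NMT)."""
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sim = src @ tgt.T  # cosine similarity matrix
    pairs = []
    for i in range(sim.shape[0]):
        j = int(np.argmax(sim[i]))  # best target candidate for each source
        if sim[i, j] >= threshold:
            pairs.append((i, j))
    return pairs

# Toy embeddings: sources 0 and 1 align with targets 1 and 0 respectively;
# source 2 has no sufficiently similar target and is discarded.
src = np.array([[1.0, 0.1], [0.1, 1.0], [0.7, 0.7]])
tgt = np.array([[0.1, 1.0], [1.0, 0.1]])
print(extract_pairs(src, tgt))  # [(0, 1), (1, 0)]
```

Discarding low-similarity sources is the point of the threshold: the auxiliary task only admits pairs it is confident about, which keeps the pseudo-parallel data clean.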

    Theory and practice of self-translation in the Basque Country

    Get PDF
    305 pp. (Basque); 216 pp. (English). This study aims to describe the tendencies Basque writers show when translating their own texts. To that end, following the theoretical chapters, I use a corpus-based methodology to analyse self-translation systematically; the use of corpora is still scarce in descriptive translation studies, as that tool has so far been employed chiefly by linguistics. The thesis has seven chapters: in the first, I situate the research within descriptive translation studies and establish the theoretical-methodological framework; in the second, I develop a theoretical reflection on self-translation, taking the practice in its full breadth as the focus; the third chapter is devoted to the cultural context, since translation is a social and cultural phenomenon and a discussion of the Basque literary system is therefore indispensable; in the fourth chapter, I present the catalogue of Literature Self-Translated from Basque (EUSAL); in the fifth, the path from the catalogue to a multilingual, digital, parallel corpus is described; based on that corpus, in the sixth chapter I carry out a detailed macro- and micro-level analysis by means of numerous examples; finally, in the seventh chapter, I set out the main conclusions. This study thus contributes both to the study of self-translation in general and, specifically, to reflection on a practice that is increasingly widespread in Basque literature. Beyond describing literary self-translation from Basque, the analysis also offers an account of the literary system.

    Benchmarking Arabic AI with Large Language Models

    Full text link
    With large Foundation Models (FMs), language technologies (AI in general) are entering a new paradigm: eliminating the need for developing large-scale task-specific datasets and supporting a variety of tasks through set-ups ranging from zero-shot to few-shot learning. However, understanding FMs' capabilities requires a systematic benchmarking effort that compares FMs' performance with state-of-the-art (SOTA) task-specific models. With that goal, past work has focused on the English language and included a few efforts with multiple languages. Our study contributes to ongoing research by evaluating FMs' performance on standard Arabic NLP and speech processing, covering a range of tasks from sequence tagging to content classification across diverse domains. We start with zero-shot learning using GPT-3.5-turbo, Whisper, and USM, addressing 33 unique tasks using 59 publicly available datasets, resulting in 96 test setups. For a few tasks, FMs perform on par with or exceed the performance of the SOTA models, but for the majority they under-perform. Given the importance of prompts for FMs' performance, we discuss our prompt strategies in detail and elaborate on our findings. Our future work on Arabic AI will explore few-shot prompting, expand the range of tasks, and investigate additional open-source models.
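Zero-shot evaluation of the kind described rests largely on prompt construction: the task instruction, the allowed label space, and the input are packed into a single prompt per test item, with no in-context examples. A generic sketch; the instruction wording, label set, and helper name are invented for illustration and are not taken from the benchmark:

```python
def build_zero_shot_prompt(text, labels, task_instruction):
    """Assemble a single zero-shot classification prompt: no examples,
    just the instruction, the allowed labels, and the input text."""
    return (
        f"{task_instruction}\n"
        f"Answer with exactly one of: {', '.join(labels)}.\n"
        f"Text: {text}\n"
        f"Label:"
    )

prompt = build_zero_shot_prompt(
    text="مباراة رائعة للمنتخب اليوم",
    labels=["sports", "politics", "tech"],
    task_instruction="Classify the topic of the following Arabic sentence.",
)
print(prompt)
```

Constraining the model's answer to an explicit label list is one common way to make free-text model output comparable against task-specific classifiers.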

    Negative vaccine voices in Swedish social media

    Get PDF
    Vaccinations are one of the most significant interventions in public health, but vaccine hesitancy among a portion of the population raises concerns in many countries, including Sweden. Since discussions on vaccine hesitancy often take place on social networking sites, data from Swedish social media are used to study and quantify the sentiment among the discussants on the vaccination-or-not topic during phases of the COVID-19 pandemic. A majority of the analyzed posts showed a predominantly negative sentiment, prevailing throughout the examined period, with some spikes or jumps attributable to certain vaccine-related events distinguishable in the results. Sentiment analysis can be a valuable tool to track public opinions regarding the use, efficacy, safety, and importance of vaccination.
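Sentiment quantification of this kind can, at its simplest, be approximated by lexicon-based scoring: each post is scored from the polarity of the words it contains, and corpus-level sentiment is the distribution of those scores over time. A toy sketch with an invented English lexicon; the actual study works on Swedish data with substantially richer methods:

```python
# Toy polarity lexicon; real studies use curated, language-specific lexicons
# or trained classifiers rather than a handful of hand-picked words.
LEXICON = {"safe": 1, "effective": 1, "good": 1,
           "dangerous": -1, "scared": -1, "bad": -1}

def post_sentiment(post):
    """Score one post: sum of word polarities, collapsed to -1, 0, or +1."""
    score = sum(LEXICON.get(w, 0) for w in post.lower().split())
    return (score > 0) - (score < 0)

posts = [
    "vaccines are safe and effective",
    "i am scared of side effects",
    "booked my appointment today",
]
print([post_sentiment(p) for p in posts])  # [1, -1, 0]
```

Aggregating these per-post labels by week or month yields the kind of time series in which event-driven spikes, as described above, become visible.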

    Word-Final /s/ in English

    Get PDF
    Synopsis: The complexities of speech production, perception, and comprehension are enormous. Theoretical approaches to these complexities have most recently faced the challenge of accounting for findings on subphonemic differences. The aim of the present dissertation is to establish a robust foundation of findings on such subphonemic differences. One rather popular case for differences in subphonemic detail is word-final /s/ and /z/ in English (henceforth S), as it realizes a number of morphological functions. Using word-final S, three general issues are investigated. First, are there subphonemic durational differences between different types of word-final S? If there are such differences, how can they be accounted for? Second, can such subphonemic durational differences be perceived? Third, do such subphonemic durational differences influence the comprehension of S? These questions are investigated in five highly controlled studies: a production task, an implementation of Linear Discriminative Learning, a same-different task, and two number-decision tasks. By using not only real words but also pseudowords as target items, potentially confounding effects of lexical storage are controlled for. Concerning the first issue, the results show that there are indeed durational differences between different types of word-final S: non-morphemic S is longest in duration, clitic S is shortest, and plural S falls in between. The durational differences appear to be connected to a word's semantic activation diversity and its phonological certainty. Regarding the second issue, subphonemic durational differences in word-final S can be perceived, with higher levels of perceptibility for differences of 35 ms and above. In regard to the third issue, subphonemic durational differences are found not to influence the speed of comprehension, but show a significant effect on the process of comprehension. The overall results give rise to a revision of various extant models of speech production, perception, and comprehension.