
    Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion

    Research on deep learning-powered voice conversion (VC) in speech-to-speech scenarios has become increasingly popular. Although many works in the field share a common global pipeline, there is considerable diversity in the underlying structures, methods, and neural sub-blocks used across research efforts. Obtaining a comprehensive understanding of why different methods are chosen at each stage of the voice conversion pipeline can therefore be challenging, and the actual hurdles in the proposed solutions are often unclear. To shed light on these aspects, this paper presents a scoping review that explores the use of deep learning in speech analysis, synthesis, and disentangled speech representation learning within modern voice conversion systems. We screened 621 publications from more than 38 different venues between the years 2017 and 2023, followed by an in-depth review of a final database consisting of 123 eligible studies. Based on the review, we summarise the most frequently used deep learning approaches to voice conversion and highlight common pitfalls within the community. Lastly, we condense the knowledge gathered, identify the main challenges, and provide recommendations for future research directions.
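    To make the "common global pipeline" mentioned above concrete, the sketch below shows one typical factoring into analysis (content and speaker encoders), disentangled representations, and synthesis (decoder). This is a hypothetical, minimal PyTorch illustration; the module names, layer choices, and dimensions are assumptions for exposition, not taken from any particular reviewed system.

```python
# Minimal, hypothetical sketch of a disentanglement-based VC pipeline:
# analysis (content + speaker encoders) -> disentangled codes -> synthesis.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Encodes a mel-spectrogram into an (ideally speaker-free) content sequence."""
    def __init__(self, n_mels=80, d_content=192):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, d_content, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_content, d_content, kernel_size=5, padding=2),
        )
    def forward(self, mel):          # mel: (B, n_mels, T)
        return self.net(mel)         # (B, d_content, T)

class SpeakerEncoder(nn.Module):
    """Summarises an utterance into a fixed-size speaker embedding."""
    def __init__(self, n_mels=80, d_spk=256):
        super().__init__()
        self.proj = nn.Conv1d(n_mels, d_spk, kernel_size=3, padding=1)
    def forward(self, mel):
        return self.proj(mel).mean(dim=-1)   # time-average -> (B, d_spk)

class Decoder(nn.Module):
    """Re-synthesises a mel-spectrogram from content plus target-speaker identity."""
    def __init__(self, d_content=192, d_spk=256, n_mels=80):
        super().__init__()
        self.net = nn.Conv1d(d_content + d_spk, n_mels, kernel_size=5, padding=2)
    def forward(self, content, spk):
        spk = spk.unsqueeze(-1).expand(-1, -1, content.size(-1))
        return self.net(torch.cat([content, spk], dim=1))

# Conversion: content from the source utterance, identity from the reference speaker.
src_mel, ref_mel = torch.randn(1, 80, 100), torch.randn(1, 80, 120)
content_enc, spk_enc, dec = ContentEncoder(), SpeakerEncoder(), Decoder()
converted_mel = dec(content_enc(src_mel), spk_enc(ref_mel))  # feed to a vocoder for audio
```

    In real systems the decoder output is passed to a neural vocoder to produce a waveform, and the disentanglement is enforced by training objectives (e.g. bottlenecks or adversarial losses) that this toy sketch omits.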

    Using Comparable Corpora to Augment Statistical Machine Translation Models in Low Resource Settings

    Previously, statistical machine translation (SMT) models have been estimated from parallel corpora, or pairs of translated sentences. In this thesis, we directly incorporate comparable corpora into the estimation of end-to-end SMT models. In contrast to parallel corpora, comparable corpora are pairs of monolingual corpora that have some cross-lingual similarities, for example topic or publication date, but that do not necessarily contain any direct translations. Comparable corpora are more readily available in large quantities than parallel corpora, which require significant human effort to compile. We use comparable corpora to estimate machine translation model parameters and show that doing so improves performance in settings where a limited amount of parallel data is available for training. The major contributions of this thesis are the following:

    * We release "language packs" for 151 human languages, which include bilingual dictionaries, comparable corpora of Wikipedia document pairs, comparable corpora of time-stamped news text that we harvested from the web, and, for non-Roman script languages, dictionaries of name pairs, which are likely to be transliterations.
    * We present a novel technique for using a small number of example word translations to learn a supervised model for bilingual lexicon induction that takes advantage of a wide variety of signals of translation equivalence that can be estimated over comparable corpora (a sketch of this idea follows below).
    * We show that using comparable corpora to induce new translations and estimate new phrase table feature functions improves end-to-end statistical machine translation performance for low resource language pairs as well as domains.
    * We present a novel algorithm for composing multiword phrase translations from multiple unigram translations and then use comparable corpora to prune the large space of hypothesis translations. We show that these induced phrase translations improve machine translation performance beyond that of the component unigrams.

    This thesis focuses on critical low resource machine translation settings, where insufficient parallel corpora exist for training statistical models. We experiment with both low resource language pairs and low resource domains of text. We present results from our novel error analysis methodology, which show that most translation errors in low resource settings are due to unseen source language words and phrases and unseen target language translations. We also find room for fixing errors due to how different translations are weighted, or scored, in the models. We target both error types: we use comparable corpora to induce new word and phrase translations and to estimate novel translation feature scores. Our experiments show that augmenting baseline SMT systems with new translations and features estimated over comparable corpora improves translation performance significantly. Additionally, our techniques expand the applicability of statistical machine translation to language pairs for which zero parallel text is available.
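    As a rough illustration of the bilingual lexicon induction contribution above, the hypothetical sketch below combines several signals of translation equivalence estimated over comparable corpora (temporal, topical, and a crude orthographic proxy) with a supervised classifier trained on a few seed translations. The feature functions, toy data, and names here are illustrative assumptions, not the thesis's actual models or features.

```python
# Hypothetical sketch: supervised bilingual lexicon induction over comparable
# corpora. Several similarity signals per candidate word pair are combined by a
# classifier trained on a handful of seed translations.
import numpy as np
from sklearn.linear_model import LogisticRegression

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def features(src_word, tgt_word, signals):
    """One vector of translation-equivalence signals for a candidate pair."""
    temporal = cosine(signals["time"][src_word], signals["time"][tgt_word])    # profiles over dated news
    topical  = cosine(signals["topic"][src_word], signals["topic"][tgt_word])  # Wikipedia topic profiles
    ortho    = 1.0 - abs(len(src_word) - len(tgt_word)) / max(len(src_word), len(tgt_word))
    return [temporal, topical, ortho]

# Toy signals: per-word temporal and topical profiles "estimated" from comparable corpora.
rng = np.random.default_rng(0)
vocab = ["casa", "perro", "house", "dog"]
signals = {"time": {w: rng.random(12) for w in vocab},
           "topic": {w: rng.random(20) for w in vocab}}

# A few seed translations (positives) and mismatched pairings (negatives) train the combiner.
X = [features("casa", "house", signals), features("perro", "dog", signals),
     features("casa", "dog", signals),   features("perro", "house", signals)]
y = [1, 1, 0, 0]
model = LogisticRegression().fit(X, y)

# Rank candidate target words for a source word by classifier score.
scores = {t: model.predict_proba([features("casa", t, signals)])[0, 1] for t in ["house", "dog"]}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

    The induced high-scoring pairs would then feed into the SMT phrase table as new translations and feature scores, per the contributions listed above.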

    SeamlessM4T-Massively Multilingual & Multimodal Machine Translation

    What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks than the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication.
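    For readers unfamiliar with the ASR-BLEU metric quoted above: translated speech cannot be scored with BLEU directly, so the output audio is first transcribed by an ASR system and the transcripts are then compared against reference translations. A minimal sketch, assuming sacrebleu for scoring; `transcribe` is a hypothetical stand-in for a real ASR model, not part of the paper's tooling:

```python
# Hedged sketch of ASR-BLEU: transcribe speech-to-speech output, then BLEU the
# transcripts against reference texts. `transcribe` is a placeholder stand-in.
import sacrebleu

def transcribe(audio) -> str:
    # A real implementation would run an ASR model on the translated waveform.
    return "placeholder transcript"

def asr_bleu(translated_audio_batch, reference_texts):
    hypotheses = [transcribe(a) for a in translated_audio_batch]
    return sacrebleu.corpus_bleu(hypotheses, [reference_texts]).score

print(asr_bleu([object(), object()], ["a reference", "another reference"]))
```

    Because the metric depends on the ASR system used for transcription, reported ASR-BLEU numbers are only comparable when the same transcriber is held fixed across systems.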

    How cultural identities are constructed in China's national holiday blockbuster: a perspective from political discourse analysis

    The recent Chinese national blockbuster My People, My Country (MPMC), a movie consisting of 7 stories recounting 7 memorable moments and events since the founding of the People's Republic of China, has evoked strong emotions among domestic Chinese citizens as well as Chinese diasporas overseas (Hou, 2019). According to data from Maoyan's website (2019), MPMC ranks among the top ten highest-grossing films in mainland China. As a propaganda film, the huge success of MPMC invites the questions: why is it so successful, and why did it receive such strong emotional responses? These questions merit investigation, as the answers will shed light on how cultural production is used to create a shared national identity and, further, to serve the political purpose of uniting the populace in today's new era (Gramsci, 1985; O'Shannassy, 2008). Echoing the claim that MPMC was "aiming to awaken the shared memories of Chinese people around the world" ("China Focus", 2019), I take the approach of political discourse analysis (PDA) to probe two specific questions: what strategies are used in constructing cultural identities? And how does MPMC differ from past propaganda films, which, according to Teo (2019) and Veg (2012), directly extol the virtues of the State and belong to high culture?

    To assess the effectiveness of the strategies the movie employs in constructing national identities, I conducted a small-scale (25 samples) questionnaire survey among overseas Chinese diasporas to understand their feelings towards and comments on the movie (Hall, 2014). The questionnaire consists of 5 open questions investigating the participants' feelings about the movie as well as which stories they liked or disliked the most. It was administered among 25 Chinese students studying at Ghent University. Feedback suggests that the audience is particularly impressed by elements with which they share affinity and familiarity. For instance, the national anthem and the theme song of the film (also entitled My People My Country) represent a shared memory: most, if not all, Chinese people, especially those born in the 1980s and 1990s, were taught this song repeatedly during their school and university years. Interestingly, apart from these two general shared memories, smaller-scale but more targeted cultural content is employed too, such as the different dialects spoken by different characters throughout the narratives in the movie; these represent the most widely spoken dialects in China. By employing cultural elements that are familiar to the audience, MPMC manages to create proximity and further evoke a highly affective reaction among the participants.

    Turning to the second research question, I focus on examining the topics and structures of the 7 seemingly independent stories in the movie, both of which are considered important in PDA (Dunmire, 2012; van Dijk, 1997). The topics of the 7 stories vary but share several commonalities: all relate to political events and ideologies, and all unfold in a highly similar structure, ending with success and happiness at the expense of personal sacrifice. Based on these findings, I further compare MPMC with previous nationalist films, such as Wolf Warrior 2 and Operation Red Sea, both of which are among the top ten highest-grossing films and typical of the patriotic style, as well as The Founding of a Republic, a tribute to the 60th national anniversary of the People's Republic of China. The comparison suggests an apparent shift from a focus on high-level or remote figures, such as soldiers from a special force or the navy, to an emphasis on popular culture in fostering patriotism: for example, inviting popular celebrities to act and presenting historical events from citizens' perspectives are among the strategies used. The findings will enable us to better understand how cultural content is used as a tool for political purposes, such as creating a unified national identity and maintaining cultural hegemony (Gramsci, 1985).

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built through a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
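    Since the checkpoints are openly released, they can be loaded through the Hugging Face transformers library, which the BigScience collaboration uses for distribution. The sketch below uses the small bigscience/bloom-560m variant so it runs on modest hardware; the prompt is an arbitrary example, and the full 176B checkpoint (bigscience/bloom) exposes the same interface but requires multi-GPU or offloaded inference.

```python
# Minimal usage sketch: load a small BLOOM variant from the Hugging Face Hub
# and generate a completion for a short prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

inputs = tokenizer("Translate to French: The cat sleeps.\nFrench:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```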