338 research outputs found
Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion
Research on deep learning-powered voice conversion (VC) in speech-to-speech
scenarios has become increasingly popular. Although many of the works in the
field of voice conversion share a common global pipeline, there is a
considerable diversity in the underlying structures, methods, and neural
sub-blocks used across research efforts. Thus, obtaining a comprehensive
understanding of the reasons behind the choice of the different methods in the
voice conversion pipeline can be challenging, and the actual hurdles in the
proposed solutions are often unclear. To shed light on these aspects, this
paper presents a scoping review that explores the use of deep learning in
speech analysis, synthesis, and disentangled speech representation learning
within modern voice conversion systems. We screened 621 publications from more
than 38 different venues between the years 2017 and 2023, followed by an
in-depth review of a final database consisting of 123 eligible studies. Based
on the review, we summarise the most frequently used approaches to voice
conversion based on deep learning and highlight common pitfalls within the
community. Lastly, we condense the knowledge gathered, identify the main
challenges, and provide recommendations for future research directions.
Using Comparable Corpora to Augment Statistical Machine Translation Models in Low Resource Settings
Previously, statistical machine translation (SMT) models have been estimated from parallel corpora, or pairs of translated sentences. In this thesis, we directly incorporate comparable corpora into the estimation of end-to-end SMT models. In contrast to parallel corpora, comparable corpora are pairs of monolingual corpora that have some cross-lingual similarities, for example topic or publication date, but that do not necessarily contain any direct translations. Comparable corpora are more readily available in large quantities than parallel corpora, which require significant human effort to compile. We use comparable corpora to estimate machine translation model parameters and show that doing so improves performance in settings where a limited amount of parallel data is available for training. The major contributions of this thesis are the following:
* We release "language packs" for 151 human languages, which include bilingual dictionaries, comparable corpora of Wikipedia document pairs, comparable corpora of time-stamped news text that we harvested from the web, and, for non-Roman script languages, dictionaries of name pairs, which are likely to be transliterations.
* We present a novel technique for using a small number of example word translations to learn a supervised model for bilingual lexicon induction which takes advantage of a wide variety of signals of translation equivalence that can be estimated over comparable corpora.
* We show that using comparable corpora to induce new translations and estimate new phrase table feature functions improves end-to-end statistical machine translation performance for low resource language pairs as well as domains.
* We present a novel algorithm for composing multiword phrase translations from multiple unigram translations and then use comparable corpora to prune the large space of hypothesis translations. We show that these induced phrase translations improve machine translation performance beyond that of component unigrams.
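The supervised lexicon-induction idea above can be sketched as follows. This is a minimal illustration, not the thesis's actual system: the words, frequency vectors, and hand-set weights are invented for the example, and the thesis combines many more signals of translation equivalence in a learned model rather than a fixed linear combination.

```python
import math

def cosine(u, v):
    # cosine similarity between two frequency vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def levenshtein(a, b):
    # standard dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def orthographic_sim(a, b):
    # normalised string similarity, a proxy for transliteration/cognate signal
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Hypothetical monthly frequency vectors, as might be harvested from
# time-stamped comparable corpora (same time buckets for every word).
temporal = {
    "night": [5, 1, 4, 2],
    "nacht": [6, 1, 5, 2],
    "tag":   [1, 5, 2, 6],
    "haus":  [2, 2, 2, 2],
}

def score(src, tgt, w_temporal=0.6, w_ortho=0.4):
    # linear combination of two signals; in the supervised setting the
    # weights would be learned from a few example word translations
    return (w_temporal * cosine(temporal[src], temporal[tgt])
            + w_ortho * orthographic_sim(src, tgt))

ranked = sorted(["tag", "haus", "nacht"], key=lambda t: score("night", t), reverse=True)
print(ranked[0])  # the candidate that best combines both signals
```

The point of the combination is that no single signal is reliable on its own: temporal correlation confuses topically related words, and string similarity confuses unrelated look-alikes, but candidates that score well on several independent signals are far more likely to be true translations.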
This thesis focuses on critical low resource machine translation settings, where insufficient parallel corpora exist for training statistical models. We experiment with both low resource language pairs and low resource domains of text. We present results from our novel error analysis methodology, which show that most translation errors in low resource settings are due to unseen source language words and phrases and unseen target language translations.
We also find room for fixing errors due to how different translations are weighted, or scored, in the models. We target both error types; we use comparable corpora to induce new word and phrase translations and estimate novel translation feature scores. Our experiments show that augmenting baseline SMT systems with new translations and features estimated over comparable corpora improves translation performance significantly. Additionally, our techniques expand the applicability of statistical machine translation to those language pairs for which zero parallel text is available.
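The compose-then-prune idea for multiword phrases described above can be sketched as follows. The unigram translation table and the tiny target-side corpus are invented for illustration; the thesis uses much richer signals estimated over comparable corpora to prune the hypothesis space, where this sketch uses only bigram attestation.

```python
from itertools import product
from collections import Counter

def compose_candidates(src_phrase, unigram_table):
    # cross-product of per-word unigram translation options
    options = [unigram_table[w] for w in src_phrase.split()]
    return [" ".join(combo) for combo in product(*options)]

def prune(candidates, target_sentences):
    # keep only candidates attested as bigrams in target-language text
    bigrams = Counter()
    for sent in target_sentences:
        toks = sent.split()
        for i in range(len(toks) - 1):
            bigrams[" ".join(toks[i:i + 2])] += 1
    return [c for c in candidates if bigrams[c] > 0]

# Toy, invented example: German-like unigram options, tiny target corpus.
table = {"big": ["gross", "stark"], "dog": ["hund", "haus"]}
corpus = ["der gross hund lief", "ein stark mann"]

candidates = compose_candidates("big dog", table)  # 4 hypothesis phrases
kept = prune(candidates, corpus)
print(kept)  # -> ['gross hund']
```

The composition step is what lets the system translate phrases it has never seen as units: the cross-product over unigram options grows exponentially in phrase length, which is exactly why the pruning step over monolingual target-side evidence is needed.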
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
What does it take to create the Babel Fish, a tool that can help individuals
translate speech between any two languages? While recent breakthroughs in
text-based models have pushed machine translation coverage beyond 200
languages, unified speech-to-speech translation models have yet to achieve
similar strides. More specifically, conventional speech-to-speech translation
systems rely on cascaded systems that perform translation progressively,
putting high-performing unified systems out of reach. To address these gaps, we
introduce SeamlessM4T, a single model that supports speech-to-speech
translation, speech-to-text translation, text-to-speech translation,
text-to-text translation, and automatic speech recognition for up to 100
languages. To build this, we used 1 million hours of open speech audio data to
learn self-supervised speech representations with w2v-BERT 2.0. Subsequently,
we created a multimodal corpus of automatically aligned speech translations.
Filtered and combined with human-labeled and pseudo-labeled data, we developed
the first multilingual system capable of translating from and into English for
both speech and text. On FLEURS, SeamlessM4T sets a new standard for
translations into multiple target languages, achieving an improvement of 20%
BLEU over the previous SOTA in direct speech-to-text translation. Compared to
strong cascaded models, SeamlessM4T improves the quality of into-English
translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in
speech-to-speech. Tested for robustness, our system performs better against
background noises and speaker variations in speech-to-text tasks compared to
the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and
added toxicity to assess translation safety. Finally, all contributions in this
work are open-sourced and accessible at
https://github.com/facebookresearch/seamless_communicatio
How cultural identities are constructed in China's national holiday blockbuster: a perspective from political discourse analysis
The recent Chinese national blockbuster My People, My Country (MPMC), a movie consisting of 7 stories recounting 7 memorable moments and events since the founding of the People's Republic of China, has evoked strong emotions among domestic Chinese citizens as well as Chinese diasporas overseas (Hou, 2019). Based on data from Maoyan's website (2019), MPMC is ranked among the top ten highest-grossing films in mainland China. As a propaganda film, the huge success of MPMC prompts us to ask: why is it so successful, and why did it receive such strong emotional responses? This question merits investigation, as the answer will shed light on how cultural production creates a shared national identity and further serves the political purpose of uniting the populace in today's new era (Gramsci, 1985; O'Shannassy, 2008). Echoing the claim that MPMC was "aiming to awaken the shared memories of Chinese people around the world" ("China Focus", 2019), I will take the approach of political discourse analysis (PDA) to probe into two specific questions: what strategies are used in constructing cultural identities? And how is MPMC different from past propaganda films, which, according to Teo (2019) and Veg (2012), directly extol the virtues of the State and belong to high culture?
In order to assess the effectiveness of the strategies employed in the movie in constructing national identities, I conducted a small-scale (25 samples) questionnaire survey among overseas Chinese diasporas to understand their feelings towards and comments on the movie (Hall, 2014). The questionnaire consists of 5 open questions investigating the participants' feelings about the movie as well as which stories they liked or disliked the most. It was administered to 25 Chinese students studying at Ghent University. Feedback suggests that the audience is particularly impressed by elements they share affinity and familiarity with. For instance, the national anthem and the theme song of the film (also entitled My People, My Country) represent a shared memory: most, if not all, Chinese people, especially those born in the 1980s and 1990s, were taught this song repeatedly during their school and university years. Interestingly, apart from these two general shared memories, smaller-scale but more targeted cultural content is employed too, such as the different dialects spoken by different characters throughout the narratives in the movie. These dialects represent the most widely spoken dialects in China. By employing cultural elements that are familiar to the audience, MPMC manages to create proximity and further evoke a highly affective reaction among the participants. Moving to the second research question, I will particularly focus on examining the topics and structures of the 7 seemingly independent stories in the movie, both of which are considered important in PDA (Dunmire, 2012; van Dijk, 1997). The topics of the 7 stories vary but share several commonalities: all are related to political events and ideologies, and all unfold in a highly similar structure, with each story ending in success and happiness at the expense of personal sacrifice.
Based on these findings, I will further compare MPMC with previous nationalist films, such as Wolf Warrior 2 and Operation Red Sea, both of which are among the top ten highest-grossing films and are typically patriotic in style, as well as The Founding of a Republic, a tribute to the 60th national anniversary of the People's Republic of China. The comparison suggests an apparent shift from a focus on high-level or remote figures, such as soldiers from a special force or the navy, to an emphasis on popular culture in fostering patriotism. For example, inviting popular celebrities to act and viewing historical events from citizens' perspectives are among the strategies used. The findings will enable us to better understand how cultural contents are used as tools for political purposes, such as creating a unified national identity and maintaining cultural hegemony (Gramsci, 1985).
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks
based on a few demonstrations or natural language instructions. While these
capabilities have led to widespread adoption, most LLMs are developed by
resource-rich organizations and are frequently kept from the public. As a step
towards democratizing this powerful technology, we present BLOOM, a
176B-parameter open-access language model designed and built thanks to a
collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer
language model that was trained on the ROOTS corpus, a dataset comprising
hundreds of sources in 46 natural and 13 programming languages (59 in total).
We find that BLOOM achieves competitive performance on a wide variety of
benchmarks, with stronger results after undergoing multitask prompted
finetuning. To facilitate future research and applications using LLMs, we
publicly release our models and code under the Responsible AI License.
- …