16,600 research outputs found
Evaluation of Tacotron Based Synthesizers for Spanish and Basque
In this paper, we describe the implementation and evaluation of Text-to-Speech synthesizers based on neural networks for Spanish and Basque. Several voices were built, all of them using a limited amount of data. The system applies Tacotron 2 to compute mel-spectrograms from the input sequence, followed by WaveGlow as a neural vocoder to obtain the audio signals from the spectrograms. The limited amount of training data leads to synthesis errors in some sentences. To detect those errors automatically, we developed a new method that finds the sentences that have lost the alignment during the inference process. To mitigate the problem, we implemented a guided attention mechanism that provides the system with the explicit duration of the phonemes. The resulting system was evaluated to assess its robustness, quality and naturalness with both objective and subjective measures. The results reveal the capacity of the system to produce good-quality and natural audio. This work was funded by the Basque Government (project refs. PIBA 2018-035, IT-1355-19). This work is part of the project Grant PID 2019-108040RB-C21 funded by MCIN/AEI/10.13039/501100011033.
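The abstract above mentions a method for automatically detecting sentences that lose attention alignment during inference. The paper's exact criterion is not given here, so the following is only a hypothetical sketch of one common heuristic based on the Tacotron 2 attention matrix; the function names and thresholds (`alignment_score`, `lost_alignment`, `min_focus`) are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch: flagging sentences whose attention alignment collapsed
# during inference. Illustrates one common heuristic over the Tacotron 2
# attention matrix; not the criterion used in the paper.
import numpy as np

def alignment_score(attn: np.ndarray) -> float:
    """attn: (decoder_steps, encoder_steps) attention weights.

    Returns the mean sharpness of attention (max weight per decoder step);
    a diffuse, collapsed attention matrix yields a low score."""
    return float(attn.max(axis=1).mean())

def lost_alignment(attn: np.ndarray, min_focus: float = 0.3) -> bool:
    """Heuristic: declare failure when attention is too diffuse or the peak
    stops moving forward over the input (non-monotonic alignment)."""
    peaks = attn.argmax(axis=1)
    monotonic_ratio = np.mean(np.diff(peaks) >= 0)  # share of forward moves
    return alignment_score(attn) < min_focus or monotonic_ratio < 0.8

# Example: a sharp diagonal alignment vs. a collapsed (uniform) one.
T, N = 200, 60
good = np.zeros((T, N))
good[np.arange(T), np.minimum(np.arange(T) * N // T, N - 1)] = 1.0
bad = np.full((T, N), 1.0 / N)
print(lost_alignment(good), lost_alignment(bad))  # False True
```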
AI-generated Content for Various Data Modalities: A Survey
AI-generated content (AIGC) methods aim to produce text, images, videos, 3D
assets, and other media using AI algorithms. Due to its wide range of
applications and the demonstrated potential of recent works, AIGC has been
attracting significant attention, and AIGC methods have been
developed for various data modalities, such as image, video, text, 3D shape (as
voxels, point clouds, meshes, and neural implicit fields), 3D scene, 3D human
avatar (body and head), 3D motion, and audio -- each presenting different
characteristics and challenges. Furthermore, there have also been many
significant developments in cross-modality AIGC methods, where generative
methods can receive conditioning input in one modality and produce outputs in
another. Examples include going from various modalities to image, video, 3D
shape, 3D scene, 3D avatar (body and head), 3D motion (skeleton and avatar),
and audio modalities. In this paper, we provide a comprehensive review of AIGC
methods across different data modalities, including both single-modality and
cross-modality methods, highlighting the various challenges, representative
works, and recent technical directions in each setting. We also survey the
representative datasets across these modalities and present comparative
results for various modalities. Finally, we discuss the challenges and
potential future research directions.
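For readers less familiar with the 3D shape representations listed above, the minimal sketch below shows how voxels, point clouds, meshes, and neural implicit fields are commonly laid out as data structures; the array shapes and the small MLP are illustrative assumptions, not taken from the survey.

```python
# Illustrative only (not from the survey): typical in-memory layouts of common
# 3D shape representations handled by generative models.
import numpy as np
import torch
import torch.nn as nn

voxels = np.zeros((64, 64, 64), dtype=np.float32)          # occupancy grid
point_cloud = np.random.rand(2048, 3).astype(np.float32)   # N x (x, y, z)

# Mesh: vertex positions plus integer triangle indices into the vertex array.
vertices = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=np.float32)
faces = np.array([[0, 1, 2]], dtype=np.int64)

# Neural implicit field: an MLP mapping a 3D coordinate to occupancy/SDF, so
# the shape is stored in the network weights rather than an explicit grid.
implicit_field = nn.Sequential(
    nn.Linear(3, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)
occupancy = torch.sigmoid(implicit_field(torch.rand(1024, 3)))  # (1024, 1)
```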
Modality-Agnostic Self-Supervised Learning with Meta-Learned Masked Auto-Encoder
Despite its practical importance across a wide range of modalities, recent
advances in self-supervised learning (SSL) have been primarily focused on a few
well-curated domains, e.g., vision and language, often relying on their
domain-specific knowledge. For example, Masked Auto-Encoder (MAE) has become
one of the most popular architectures in these domains, but its potential in
other modalities has been less explored. In this paper, we develop MAE as a
unified, modality-agnostic SSL framework. We argue that meta-learning is key
to interpreting MAE as a modality-agnostic learner, and we propose
enhancements to MAE aimed at jointly improving its SSL across diverse
modalities; we coin the resulting method MetaMAE. Our key idea is to view the
mask reconstruction of
MAE as a meta-learning task: masked tokens are predicted by adapting the
Transformer meta-learner through the amortization of unmasked tokens. Based on
this novel interpretation, we propose to integrate two advanced meta-learning
techniques. First, we adapt the amortized latent of the Transformer encoder
using gradient-based meta-learning to enhance the reconstruction. Then, we
maximize the alignment between amortized and adapted latents through task
contrastive learning which guides the Transformer encoder to better encode the
task-specific knowledge. Our experiments demonstrate the superiority of MetaMAE
in the modality-agnostic SSL benchmark (called DABS), significantly
outperforming prior baselines. Code is available at
https://github.com/alinlab/MetaMAE. Comment: Accepted to NeurIPS 2023. The first two authors contributed equally.
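As a rough, hedged sketch of the idea described above (not the authors' released code), the following outlines how mask reconstruction can be treated as a meta-learning task: the encoder amortizes a latent from unmasked tokens, one inner gradient step adapts that latent to the reconstruction task, and a contrastive term aligns the amortized and adapted latents. All names and hyperparameters (`metamae_step`, `inner_lr`, `tau`) are illustrative assumptions.

```python
# Rough illustrative sketch of the MetaMAE idea, not the authors' code.
import torch
import torch.nn.functional as F

def metamae_step(encoder, decoder, tokens, mask, inner_lr=0.1, tau=0.1):
    """tokens: (B, T, D) tokenized input; mask: (B, T) bool, True = masked."""
    visible = tokens * (~mask).unsqueeze(-1)   # zero out masked tokens
    z = encoder(visible)                       # amortized latent (B, T, D)

    # Inner loop: one gradient step on the reconstruction loss adapts the latent.
    recon = decoder(z)
    inner_loss = F.mse_loss(recon[mask], tokens[mask])
    (grad,) = torch.autograd.grad(inner_loss, z, create_graph=True)
    z_adapt = z - inner_lr * grad

    # Outer reconstruction loss computed with the adapted latent.
    recon_loss = F.mse_loss(decoder(z_adapt)[mask], tokens[mask])

    # Task-contrastive alignment between amortized and adapted latents.
    a = F.normalize(z.mean(dim=1), dim=-1)        # (B, D)
    b = F.normalize(z_adapt.mean(dim=1), dim=-1)  # (B, D)
    logits = a @ b.t() / tau
    labels = torch.arange(a.size(0), device=a.device)
    contrastive_loss = F.cross_entropy(logits, labels)

    return recon_loss + contrastive_loss
```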
Image Synthesis under Limited Data: A Survey and Taxonomy
Deep generative models, which target reproducing the given data distribution
to produce novel samples, have made unprecedented advancements in recent years.
Their technical breakthroughs have enabled unparalleled quality in the
synthesis of visual content. However, one critical prerequisite for their
tremendous success is the availability of a sufficient number of training
samples, which requires massive computation resources. When trained on limited
data, generative models tend to suffer from severe performance deterioration
due to overfitting and memorization. Accordingly, researchers have recently
devoted considerable attention to developing novel models capable of
generating plausible and diverse images from limited training data. Despite
numerous efforts to enhance training stability and synthesis quality in
limited-data scenarios, there is a lack of a systematic survey that provides 1)
a clear problem definition, critical challenges, and taxonomy of various tasks;
2) an in-depth analysis of the pros, cons, and remaining limitations of existing
literature; as well as 3) a thorough discussion on the potential applications
and future directions in the field of image synthesis under limited data. In
order to fill this gap and provide an informative introduction to researchers
who are new to this topic, this survey offers a comprehensive review and a
novel taxonomy on the development of image synthesis under limited data. In
particular, it covers the problem definition, requirements, main solutions,
popular benchmarks, and remaining challenges in a comprehensive and
all-around manner. Comment: 230 references, 25 pages. GitHub:
https://github.com/kobeshegu/awesome-few-shot-generatio
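As one concrete example of the "main solutions" such a survey covers, a widely used mitigation for limited-data GAN training is differentiable augmentation (DiffAugment-style), where the same differentiable augmentations are applied to both real and generated images before the discriminator to curb discriminator overfitting. The sketch below only illustrates that idea; the function names and the simple augmentation policy are assumptions, not code from the survey.

```python
# Illustrative sketch of DiffAugment-style training for limited-data GANs.
import torch
import torch.nn.functional as F

def diff_augment(x: torch.Tensor) -> torch.Tensor:
    """x: (B, C, H, W) in [-1, 1]. Brightness jitter + translation, built from
    differentiable ops so gradients still reach the generator."""
    b = x.size(0)
    # Brightness: add a per-sample offset.
    x = x + (torch.rand(b, 1, 1, 1, device=x.device) - 0.5)
    # Translation: roll the batch by a small random shift (shared shift here
    # for simplicity of the sketch).
    shift = int(0.125 * x.size(-1))
    dx = torch.randint(-shift, shift + 1, (1,)).item()
    dy = torch.randint(-shift, shift + 1, (1,)).item()
    return torch.roll(x, shifts=(dy, dx), dims=(2, 3))

def d_loss(discriminator, generator, real, z):
    fake = generator(z)
    # Apply the same augmentation policy to real and fake images.
    logits_real = discriminator(diff_augment(real))
    logits_fake = discriminator(diff_augment(fake.detach()))
    return (F.softplus(-logits_real).mean() + F.softplus(logits_fake).mean())
```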
Sparks of Large Audio Models: A Survey and Outlook
This survey paper provides a comprehensive overview of the recent
advancements and challenges in applying large language models to the field of
audio signal processing. Audio processing, with its diverse signal
representations and a wide range of sources--from human voices to musical
instruments and environmental sounds--poses challenges distinct from those
found in traditional Natural Language Processing scenarios. Nevertheless,
\textit{Large Audio Models}, epitomized by transformer-based architectures,
have shown marked efficacy in this sphere. By leveraging massive amounts of
data, these models have demonstrated prowess in a variety of audio tasks,
spanning from Automatic Speech Recognition and Text-To-Speech to Music
Generation, among others. Notably, these Foundational Audio Models, such as
SeamlessM4T, have recently begun to act as universal translators, supporting
multiple speech tasks for up to 100 languages without
any reliance on separate task-specific systems. This paper presents an in-depth
analysis of state-of-the-art methodologies regarding \textit{Foundational Large
Audio Models}, their performance benchmarks, and their applicability to
real-world scenarios. We also highlight current limitations and provide
insights into potential future research directions in the realm of
\textit{Large Audio Models} with the intent to spark further discussion,
thereby fostering innovation in the next generation of audio-processing
systems. Furthermore, to keep pace with the rapid development in this area, we
will continually update a repository of relevant recent articles and their
open-source implementations at
https://github.com/EmulationAI/awesome-large-audio-models. Comment: work in
progress, Repo URL: https://github.com/EmulationAI/awesome-large-audio-models
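As a minimal, hedged illustration of applying a transformer-based Large Audio Model to one of the tasks mentioned above (Automatic Speech Recognition), the snippet below uses the Hugging Face `transformers` pipeline; the checkpoint `openai/whisper-small` and the audio file path are assumptions for the example, not systems prescribed by the survey.

```python
# Illustrative only: transcribing speech with a transformer-based audio model
# via the Hugging Face `transformers` ASR pipeline.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("speech_sample.wav")   # 16 kHz mono audio works best
print(result["text"])
```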