Audio-Visual Speech Inpainting with Deep Learning
In this paper, we present a deep-learning-based framework for audio-visual
speech inpainting, i.e., the task of restoring the missing parts of an acoustic
speech signal from reliable audio context and uncorrupted visual information.
Recent work focuses on audio-only methods and generally aims at
inpainting music signals, whose structure differs markedly from that of speech.
Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to
investigate the contribution that vision can provide for gaps of different
duration. We also experiment with a multi-task learning approach where a phone
recognition task is learned together with speech inpainting. Results show that
the performance of audio-only speech inpainting approaches degrades rapidly
when gaps get large, while the proposed audio-visual approach is able to
plausibly restore missing information. In addition, we show that multi-task
learning is effective, although the largest contribution to performance comes
from vision.
Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing
Creating music is iterative, requiring varied methods at each stage. However,
existing AI music systems fall short in orchestrating multiple subsystems for
diverse needs. To address this gap, we introduce Loop Copilot, a novel system
that enables users to generate and iteratively refine music through an
interactive, multi-round dialogue interface. The system uses a large language
model to interpret user intentions and select appropriate AI models for task
execution. Each backend model is specialized for a specific task, and their
outputs are aggregated to meet the user's requirements. To ensure musical
coherence, essential attributes are maintained in a centralized table. We
evaluate the effectiveness of the proposed system through semi-structured
interviews and questionnaires, highlighting not only its utility in
facilitating music creation but also its potential for broader applications.
Comment: Source code and demo video are available at
\url{https://sites.google.com/view/loop-copilot
LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus
This paper introduces a new speech dataset called ``LibriTTS-R'' designed for
text-to-speech (TTS) use. It is derived by applying speech restoration to the
LibriTTS corpus, which consists of 585 hours of speech data at 24 kHz sampling
rate from 2,456 speakers and the corresponding texts. The constituent samples
of LibriTTS-R are identical to those of LibriTTS, with only the sound quality
improved. Experimental results show that the LibriTTS-R ground-truth samples
have significantly better sound quality than those in LibriTTS. In
addition, neural end-to-end TTS trained with LibriTTS-R achieved speech
naturalness on par with that of the ground-truth samples. The corpus is freely
available for download from \url{http://www.openslr.org/141/}.
Comment: Accepted to Interspeech 202
Mustango: Toward Controllable Text-to-Music Generation
With recent advancements in text-to-audio and text-to-music based on latent
diffusion models, the quality of generated content has been reaching new
heights. The controllability of musical aspects, however, has not been
explicitly explored in text-to-music systems yet. In this paper, we present
Mustango, a music-domain-knowledge-inspired text-to-music system based on
diffusion that extends the Tango text-to-audio model. Mustango aims to control
the generated music not only with general text captions but also with richer
captions that could include specific instructions related to chords, beats,
tempo, and key. As part of Mustango, we propose MuNet, a
Music-Domain-Knowledge-Informed UNet sub-module to integrate these
music-specific features, which we predict from the text prompt, as well as the
general text embedding, into the diffusion denoising process. To overcome the
limited availability of open datasets of music with text captions, we propose a
novel data augmentation method that includes altering the harmonic, rhythmic,
and dynamic aspects of music audio and using state-of-the-art Music Information
Retrieval methods to extract the music features which will then be appended to
the existing descriptions in text format. We release the resulting MusicBench
dataset which contains over 52K instances and includes music-theory-based
descriptions in the caption text. Through extensive experiments, we show that
the quality of the music generated by Mustango is state-of-the-art, and the
controllability through music-specific text prompts greatly outperforms other
models in terms of desired chords, beat, key, and tempo, on multiple datasets.
Sparks of Large Audio Models: A Survey and Outlook
This survey paper provides a comprehensive overview of the recent
advancements and challenges in applying large language models to the field of
audio signal processing. Audio processing, with its diverse signal
representations and a wide range of sources--from human voices to musical
instruments and environmental sounds--poses challenges distinct from those
found in traditional Natural Language Processing scenarios. Nevertheless,
\textit{Large Audio Models}, epitomized by transformer-based architectures,
have shown marked efficacy in this sphere. By leveraging massive amounts of
data, these models have demonstrated prowess in a variety of audio tasks,
spanning from Automatic Speech Recognition and Text-To-Speech to Music
Generation, among others. Notably, Foundational Audio Models such as
SeamlessM4T have recently started to show the ability to act as universal
translators, supporting multiple speech tasks for up to 100 languages without
any reliance on separate task-specific systems. This paper presents an in-depth
analysis of state-of-the-art methodologies regarding \textit{Foundational Large
Audio Models}, their performance benchmarks, and their applicability to
real-world scenarios. We also highlight current limitations and provide
insights into potential future research directions in the realm of
\textit{Large Audio Models} with the intent to spark further discussion,
thereby fostering innovation in the next generation of audio-processing
systems. Furthermore, to keep pace with the rapid development in this area, we will
regularly update the repository with relevant recent articles and
their open-source implementations at
https://github.com/EmulationAI/awesome-large-audio-models.
Comment: work in progress, Repo URL:
https://github.com/EmulationAI/awesome-large-audio-model
DMRN+18: Digital Music Research Network One-day Workshop 2023
Queen Mary University of London, Tuesday 19th December 2023. Keynote speaker: Stefan Bilbao.
The Digital Music Research Network (DMRN) aims to promote research in the area of digital music by bringing together researchers from UK and overseas universities, as well as industry, for its annual workshop. The workshop will include invited and contributed talks and posters, and will be an ideal opportunity for networking with other people working in the area.
Keynote title: Physics-based Audio: Sound Synthesis and Virtual Acoustics.
Abstract: Any acoustically produced sound must be the result of physical laws that describe the dynamics of a given system, always at least partly mechanical, and sometimes with an electronic element as well. One approach to the synthesis of natural acoustic timbres, thus, is through simulation, often referred to in this context as physical modelling, or physics-based audio. In this talk, the principles of physics-based audio and the various approaches to simulation are described, followed by a set of examples covering: various musical instrument types; the important related problem of the emulation of room acoustics or “virtual acoustics”; the embedding of instruments in a 3D virtual space; electromechanical effects; and also new modular instrument designs based on physical laws, but without a counterpart in the real world. Some more technical details follow, including the strengths, weaknesses and limitations of such methods, and pointers to some links to data-centred black-box approaches to sound generation and effects processing. The talk concludes with some musical examples and recent work on moving such algorithms to a real-time setting.
Bio: Stefan is a full Professor at the Reid School of Music, University of Edinburgh, where he holds the Personal Chair of Acoustics and Audio Signal Processing. He currently works on computational acoustics, for applications in sound synthesis and virtual acoustics. Special topics of interest include finite difference time domain methods, distributed nonlinear systems such as strings and plates, architectural acoustics, spatial audio in simulation, multichannel sound synthesis, and hardware and software realizations. More information: https://www.acoustics.ed.ac.uk/group-members/dr-stefan-bilbao/
DMRN+18 is sponsored by the UKRI Centre for Doctoral Training in Artificial Intelligence and Music (AIM), a leading PhD research programme aimed at the Music/Audio Technology and Creative Industries, based at Queen Mary University of London.
Face Image and Video Analysis in Biometrics and Health Applications
Computer Vision (CV) enables computers and systems to derive meaningful information from acquired visual inputs, such as images and videos, and make decisions based on the extracted information. Its goal is to acquire, process, analyze, and understand the information by developing a theoretical and algorithmic model. Biometrics are distinctive and measurable human characteristics used to label or describe individuals by combining computer vision with knowledge of human physiology (e.g., face, iris, fingerprint) and behavior (e.g., gait, gaze, voice). The face is one of the most informative biometric traits. Many studies have investigated the human face from the perspectives of different disciplines, ranging from computer vision and deep learning to neuroscience and biometrics. In this work, we analyze face characteristics from digital images and videos in the areas of morphing attack and defense, and autism diagnosis. For face morphing attack generation, we proposed a transformer-based generative adversarial network to generate more visually realistic morphing attacks by combining different losses, such as face matching distance, facial-landmark-based loss, perceptual loss and pixel-wise mean square error. In the face morphing attack detection study, we designed a fusion-based few-shot learning (FSL) method to learn discriminative features from face images for few-shot morphing attack detection (FS-MAD), and extended the current binary detection into multiclass classification, namely, few-shot morphing attack fingerprinting (FS-MAF). In the autism diagnosis study, we developed a discriminative few-shot learning method to analyze hour-long video data and explored the fusion of facial dynamics for facial trait classification of autism spectrum disorder (ASD) in three severity levels. The results show outstanding performance of the proposed fusion-based few-shot framework on the dataset.
We further explored the possibility of performing face micro-expression spotting and feature analysis on autism video data to classify ASD and control groups. The results indicate the effectiveness of subtle facial expression changes for autism diagnosis.