Play as You Like: Timbre-enhanced Multi-modal Music Style Transfer
Style transfer of polyphonic music recordings is a challenging task when
considering the modeling of diverse, imaginative, and reasonable music pieces
in the style different from their original one. To achieve this, learning
stable multi-modal representations for both domain-variant (i.e., style) and
domain-invariant (i.e., content) information of music in an unsupervised manner
is critical. In this paper, we propose an unsupervised music style transfer
method without the need for parallel data. Besides, to characterize the
multi-modal distribution of music pieces, we employ the Multi-modal
Unsupervised Image-to-Image Translation (MUNIT) framework in the proposed
system. This allows one to generate diverse outputs from the learned latent
distributions representing contents and styles. Moreover, to better capture the
granularity of sound, such as the perceptual dimensions of timbre and the
nuance in instrument-specific performance, cognitively plausible features
including mel-frequency cepstral coefficients (MFCC), spectral difference, and
spectral envelope, are combined with the widely-used mel-spectrogram into a
timbre-enhanced multi-channel input representation. The Relativistic average
Generative Adversarial Network (RaGAN) is also utilized to achieve fast
convergence and high stability. We conduct experiments on bilateral style
transfer tasks among three different genres, namely piano solo, guitar solo,
and string quartet. Results demonstrate the advantages of the proposed method
in music style transfer with improved sound quality and in allowing users to
manipulate the output.
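The abstract above names RaGAN only as one component of the system. As a rough illustration, assuming the standard relativistic average formulation (a sigmoid applied to the difference between a sample's critic score and the mean critic score of the opposite class), the two losses could be sketched in plain Python as follows. The function name `ragan_losses` and the example scores are hypothetical, and this is not the authors' implementation.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _mean(values):
    return sum(values) / len(values)


def ragan_losses(critic_real, critic_fake, eps=1e-12):
    """Return (discriminator_loss, generator_loss) from raw critic scores.

    critic_real / critic_fake are lists of unbounded critic outputs C(x)
    for real and generated samples (hypothetical inputs for this sketch).
    """
    mean_real = _mean(critic_real)
    mean_fake = _mean(critic_fake)

    # How much more realistic each real sample looks than the average fake,
    # and how much more realistic each fake looks than the average real.
    real_vs_fake = [_sigmoid(c - mean_fake) for c in critic_real]
    fake_vs_real = [_sigmoid(c - mean_real) for c in critic_fake]

    # Discriminator: real samples should win, fake samples should lose.
    d_loss = (-_mean([math.log(p + eps) for p in real_vs_fake])
              - _mean([math.log(1.0 - p + eps) for p in fake_vs_real]))

    # Generator: the same terms with the roles of real and fake swapped.
    g_loss = (-_mean([math.log(p + eps) for p in fake_vs_real])
              - _mean([math.log(1.0 - p + eps) for p in real_vs_fake]))
    return d_loss, g_loss


if __name__ == "__main__":
    print(ragan_losses([2.0, 2.0], [-2.0, -2.0]))
```

Because the generator loss also depends on the real samples, the generator receives a gradient signal from both batches, which is the property usually cited when relativistic losses are described as stabilizing training.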
Crossing You in Style: Cross-modal Style Transfer from Music to Visual Arts
Music-to-visual style transfer is a challenging yet important cross-modal
learning problem in the practice of creativity. Its major difference from the
traditional image style transfer problem is that the style information is
provided by music rather than images. Assuming that musical features can be
properly mapped to visual contents through semantic links between the two
domains, we solve the music-to-visual style transfer problem in two steps:
music visualization and style transfer. The music visualization network
utilizes an encoder-generator architecture with a conditional generative
adversarial network to generate image-based music representations from music
data. This network is integrated with an image style transfer method to
accomplish the style transfer process. Experiments are conducted on
WikiArt-IMSLP, a newly compiled dataset including Western music recordings and
paintings listed by decades. By utilizing such a label to learn the semantic
connection between paintings and music, we demonstrate that the proposed
framework can generate diverse image style representations from a music piece,
and these representations can unveil certain art forms of the same era.
Subjective testing results also emphasize the role of the era label in
improving the perceptual quality on the compatibility between music and visual
content.
AI-generated Content for Various Data Modalities: A Survey
AI-generated content (AIGC) methods aim to produce text, images, videos, 3D
assets, and other media using AI algorithms. Due to its wide range of
applications and the demonstrated potential of recent works, AIGC developments
have been attracting lots of attention recently, and AIGC methods have been
developed for various data modalities, such as image, video, text, 3D shape (as
voxels, point clouds, meshes, and neural implicit fields), 3D scene, 3D human
avatar (body and head), 3D motion, and audio -- each presenting different
characteristics and challenges. Furthermore, there have also been many
significant developments in cross-modality AIGC methods, where generative
methods can receive conditioning input in one modality and produce outputs in
another. Examples include going from various modalities to image, video, 3D
shape, 3D scene, 3D avatar (body and head), 3D motion (skeleton and avatar),
and audio modalities. In this paper, we provide a comprehensive review of AIGC
methods across different data modalities, including both single-modality and
cross-modality methods, highlighting the various challenges, representative
works, and recent technical directions in each setting. We also survey the
representative datasets throughout the modalities, and present comparative
results for various modalities. Moreover, we also discuss the challenges and
potential future research directions.
Deep Person Generation: A Survey from the Perspective of Face, Pose and Cloth Synthesis
Deep person generation has attracted extensive research attention due to its
wide applications in virtual agents, video conferencing, online shopping and
art/movie production. With the advancement of deep learning, visual appearances
(face, pose, cloth) of a person image can be easily generated or manipulated on
demand. In this survey, we first summarize the scope of person generation, and
then systematically review recent progress and technical trends in deep person
generation, covering three major tasks: talking-head generation (face),
pose-guided person generation (pose) and garment-oriented person generation
(cloth). More than two hundred papers are covered for a thorough overview, and
the milestone works are highlighted to witness the major technical
breakthrough. Based on these fundamental tasks, a number of applications are
investigated, e.g., virtual fitting, digital human, generative data
augmentation. We hope this survey could shed some light on the future prospects
of deep person generation, and provide a helpful foundation for full
applications towards digital human.
SkinCAN AI: A deep learning-based skin cancer classification and segmentation pipeline designed along with a generative model
Because melanoma skin cancer is rare, the datasets collected for it are limited and highly skewed, as benign moles can easily mimic the appearance of a melanoma-affected area. Such an imbalanced dataset makes training any deep learning classifier network harder by affecting the training stability. We have an intuition that synthesizing such skin lesion medical images could help solve the issue of overfitting in training networks and assist in enforcing the anonymization of actual patients. Despite multiple previous attempts, none of the models were practical for the fast-paced clinical environment. In this thesis, we propose a novel pipeline named SkinCAN AI, inspired by StyleGAN but designed explicitly considering the limitations of the skin lesion dataset and emphasizing the requirement of a faster optimized diagnostic tool that can be easily inferred and integrated into the clinical environment. Our SkinCAN AI model is equipped with its own adaptive discriminator augmentation module, which enables a limited target data distribution to be learned and artificial data points to be sampled, which further assists the classifier network in learning semantic features. We elucidate the novelty of our SkinCAN AI pipeline by integrating the soft attention module in the classifier network. This module yields an attention mask analyzed by DenseNet201 to focus on learning relevant semantic features from skin lesion images without the heavy computational burden of artifact removal software. The SkinGAN model achieves an FID score of 0.622 while allowing its synthetic samples to train the DenseNet201 model with an accuracy of 0.9494, AUC of 0.938, specificity of 0.969, and sensitivity of 0.695. We provide evidence in our thesis that our proposed pipelines outperform other state-of-the-art existing networks developed for this task of early diagnosis.
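The abstract above reports accuracy, specificity, and sensitivity side by side. As a small sketch of how these three metrics relate on a skewed dataset, the following Python computes them from confusion-matrix counts. The function name `binary_metrics` and the counts are hypothetical illustrations only and are not taken from the thesis.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, and specificity from confusion counts.

    tp/fn count malignant cases classified correctly/incorrectly;
    tn/fp count benign cases classified correctly/incorrectly.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # recall on the malignant class
        "specificity": tn / (tn + fp),   # recall on the benign class
    }


if __name__ == "__main__":
    # Hypothetical skewed test set: 10 malignant and 100 benign lesions.
    print(binary_metrics(tp=7, fp=5, tn=95, fn=3))
```

With these made-up counts, accuracy stays above 0.9 even though 3 of 10 malignant cases are missed, which is why sensitivity is worth reading separately from accuracy when the classes are imbalanced.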
Deep Learning for Music Information Retrieval in Limited Data Scenarios.
PhD Thesis
While deep learning (DL) models have achieved impressive results in settings
where large amounts of annotated training data are available, overfitting often
degrades performance when data is more limited. To improve the generalisation
of DL models, we investigate "data-driven priors" that exploit additional unlabelled
data or labelled data from related tasks. Unlike techniques such as data
augmentation, these priors are applicable across a range of machine listening
tasks, since their design does not rely on problem-specific knowledge.
We first consider scenarios in which parts of samples can be missing, aiming to
make more datasets available for model training. In an initial study focusing on
audio source separation (ASS), we exploit additionally available unlabelled music
and solo source recordings by using generative adversarial networks (GANs),
resulting in higher separation quality. We then present a fully adversarial
framework for learning generative models with missing data. Our discriminator
consists of separately trainable components that can be combined to train the
generator with the same objective as in the original GAN framework. We apply
our framework to image generation, image segmentation and ASS, demonstrating
superior performance compared to the original GAN.
To improve performance on any given MIR task, we also aim to leverage
datasets which are annotated for similar tasks. We use multi-task learning (MTL)
to perform singing voice detection and singing voice separation with one model,
improving performance on both tasks. Furthermore, we employ meta-learning
on a diverse collection of ten MIR tasks to find a weight initialisation for a
"universal MIR model" so that training the model on any MIR task with this
initialisation quickly leads to good performance.
Since our data-driven priors encode knowledge shared across tasks and
datasets, they are suited for high-dimensional, end-to-end models, instead of small
models relying on task-specific feature engineering, such as fixed spectrogram
representations of audio commonly used in machine listening. To this end, we
propose "Wave-U-Net", an adaptation of the U-Net, which can perform ASS
directly on the raw waveform while performing favourably to its spectrogram-based
counterpart. Finally, we derive "Seq-U-Net" as a causal variant of Wave-U-Net,
which performs comparably to Wavenet and Temporal Convolutional
Network (TCN) on a variety of sequence modelling tasks, while being more
computationally efficient.
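The abstract above describes Seq-U-Net as a causal variant of Wave-U-Net without giving details. As a minimal sketch of what "causal" means for a convolution over a sequence, the following Python computes each output from the current and past inputs only. The function name `causal_conv1d` is hypothetical, and this single layer is an illustration of the causality constraint, not the thesis architecture.

```python
def causal_conv1d(signal, kernel):
    """Causal 1D convolution: y[t] = sum_k kernel[k] * signal[t - k].

    Inputs before t = 0 are treated as zero, so the output has the same
    length as the input and y[t] never depends on signal[t + 1] or later.
    """
    output = []
    for t in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            if t - k >= 0:          # skip taps that would reach before t = 0
                acc += weight * signal[t - k]
        output.append(acc)
    return output


if __name__ == "__main__":
    print(causal_conv1d([1.0, 2.0, 3.0, 4.0], [0.5, 0.5]))
```

Changing the last input sample leaves all earlier outputs unchanged, which is the property that lets a causal model be used for autoregressive sequence modelling.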
On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator
Deployed image classification pipelines are typically dependent on the images captured in real-world environments. This means that images might be affected by different sources of perturbations (e.g. sensor noise in low-light environments). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification tasks. This challenge has, hence, attracted wide interest within the computer vision communities. We propose a transformation step that attempts to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of given images are determined using the CORF push-pull inhibition operator. Such an operation transforms an input image into a space that is more robust to noise before being processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. It turned out that the proposed CORF-augmented pipeline achieved comparable results on noise-free images to those of a conventional AlexNet classification model without CORF delineation maps, but it consistently achieved significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.
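The evaluation described above perturbs test images with Gaussian and uniform noise at several levels. A minimal sketch of such a perturbation step, assuming images are nested lists of pixel intensities in [0, 1], could look like the following. The function names, the noise levels, and the clipping choice are assumptions for illustration; the abstract does not specify them, and the CORF operator itself is not reproduced here.

```python
import random


def _clip01(value):
    # Keep pixel intensities inside the valid [0, 1] range.
    return min(1.0, max(0.0, value))


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [[_clip01(p + rng.gauss(0.0, sigma)) for p in row] for row in image]


def add_uniform_noise(image, width, rng):
    """Add noise drawn uniformly from [-width, width]."""
    return [[_clip01(p + rng.uniform(-width, width)) for p in row] for row in image]


if __name__ == "__main__":
    rng = random.Random(0)               # fixed seed for repeatable noise
    image = [[0.0, 0.5], [0.5, 1.0]]     # hypothetical 2x2 grayscale image
    for level in (0.05, 0.1, 0.2):       # hypothetical noise levels
        print(level, add_gaussian_noise(image, level, rng))
        print(level, add_uniform_noise(image, level, rng))
```

Applying the noise only to the test set, as the abstract describes, measures how well a model trained on clean images generalizes to perturbations it has not seen.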
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advancements in speech
recognition, text-to-speech synthesis, automatic speech recognition, and
emotion recognition, propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.