
    Play as You Like: Timbre-enhanced Multi-modal Music Style Transfer

    Style transfer of polyphonic music recordings is a challenging task: it requires modeling diverse, imaginative, yet plausible music pieces in a style different from their original one. To achieve this, it is critical to learn stable multi-modal representations of both the domain-variant (i.e., style) and domain-invariant (i.e., content) information of music in an unsupervised manner. In this paper, we propose an unsupervised music style transfer method that requires no parallel data. To characterize the multi-modal distribution of music pieces, we employ the Multi-modal Unsupervised Image-to-Image Translation (MUNIT) framework in the proposed system, which allows diverse outputs to be generated from the learned latent distributions representing content and style. Moreover, to better capture the granularity of sound, such as the perceptual dimensions of timbre and the nuances of instrument-specific performance, cognitively plausible features, including mel-frequency cepstral coefficients (MFCC), spectral difference, and spectral envelope, are combined with the widely used mel-spectrogram into a timbre-enhanced multi-channel input representation. The Relativistic average Generative Adversarial Network (RaGAN) is also used to achieve fast convergence and high training stability. We conduct experiments on bilateral style transfer tasks among three genres: piano solo, guitar solo, and string quartet. Results demonstrate the advantages of the proposed method in music style transfer, with improved sound quality and in allowing users to manipulate the output.
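
    The sketch below shows one way such a timbre-enhanced multi-channel input could be assembled with librosa; the channel layout, parameter values, and the use of the spectral centroid as a crude stand-in for the spectral envelope are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of a timbre-enhanced multi-channel input built with librosa.
import numpy as np
import librosa

def timbre_enhanced_input(path, sr=22050, n_fft=2048, hop=512, n_bins=80):
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop, n_mels=n_bins))
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_bins,
                                n_fft=n_fft, hop_length=hop)
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    # Spectral difference (flux): positive frame-to-frame magnitude change
    flux = np.concatenate([[0.0], (np.diff(S, axis=1).clip(min=0) ** 2).sum(axis=0)])
    # Crude placeholder for the spectral envelope
    centroid = librosa.feature.spectral_centroid(S=S, sr=sr)[0]
    # Tile the 1-D features so every channel has the same (n_bins, frames) shape
    flux_ch = np.tile(flux, (n_bins, 1))
    env_ch = np.tile(centroid, (n_bins, 1))
    return np.stack([mel, mfcc, flux_ch, env_ch], axis=0)  # (4, n_bins, frames)
```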

    Crossing You in Style: Cross-modal Style Transfer from Music to Visual Arts

    Music-to-visual style transfer is a challenging yet important cross-modal learning problem in the practice of creativity. Its major difference from the traditional image style transfer problem is that the style information is provided by music rather than by images. Assuming that musical features can be properly mapped to visual content through semantic links between the two domains, we solve the music-to-visual style transfer problem in two steps: music visualization and style transfer. The music visualization network uses an encoder-generator architecture with a conditional generative adversarial network to generate image-based music representations from music data. This network is integrated with an image style transfer method to accomplish the style transfer process. Experiments are conducted on WikiArt-IMSLP, a newly compiled dataset of Western music recordings and paintings organized by decade. By using this decade label to learn the semantic connection between paintings and music, we demonstrate that the proposed framework can generate diverse image style representations from a music piece, and that these representations can unveil certain art forms of the same era. Subjective testing results also emphasize the role of the era label in improving the perceived compatibility between music and visual content.
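
    As a loose illustration of the two-step pipeline described above, the sketch below shows a hypothetical music-conditioned generator whose output image is then handed to any standard image style transfer method. All module names, layer sizes, and the era-embedding dimension are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of the two-step inference flow:
# (1) a music-conditioned generator produces an image-like style representation,
# (2) an off-the-shelf image style transfer applies it to a content image.
import torch
import torch.nn as nn

class MusicVisualizer(nn.Module):
    def __init__(self, music_dim=128, era_classes=10, img_ch=3):
        super().__init__()
        self.embed_era = nn.Embedding(era_classes, 32)   # era-label conditioning
        self.fc = nn.Linear(music_dim + 32, 256 * 4 * 4)
        self.deconv = nn.Sequential(                     # 4x4 -> 64x64
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, img_ch, 4, 2, 1), nn.Tanh())

    def forward(self, music_feat, era_label):
        z = torch.cat([music_feat, self.embed_era(era_label)], dim=1)
        h = self.fc(z).view(-1, 256, 4, 4)
        return self.deconv(h)                            # image-based music representation

# style_image = MusicVisualizer()(music_feat, era_label)
# stylized = some_style_transfer(content_image, style_image)  # step 2: any standard method (placeholder)
```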

    AI-generated Content for Various Data Modalities: A Survey

    AI-generated content (AIGC) methods aim to produce text, images, videos, 3D assets, and other media using AI algorithms. Owing to their wide range of applications and the demonstrated potential of recent works, AIGC developments have attracted considerable attention, and AIGC methods have been developed for various data modalities, such as image, video, text, 3D shape (as voxels, point clouds, meshes, and neural implicit fields), 3D scene, 3D human avatar (body and head), 3D motion, and audio, each presenting different characteristics and challenges. Furthermore, there have been many significant developments in cross-modality AIGC methods, where generative methods receive conditioning input in one modality and produce outputs in another. Examples include going from various modalities to the image, video, 3D shape, 3D scene, 3D avatar (body and head), 3D motion (skeleton and avatar), and audio modalities. In this paper, we provide a comprehensive review of AIGC methods across different data modalities, covering both single-modality and cross-modality methods, and highlighting the various challenges, representative works, and recent technical directions in each setting. We also survey representative datasets across the modalities and present comparative results for various modalities. Finally, we discuss the challenges and potential future research directions.

    Deep Person Generation: A Survey from the Perspective of Face, Pose and Cloth Synthesis

    Deep person generation has attracted extensive research attention due to its wide applications in virtual agents, video conferencing, online shopping and art/movie production. With the advancement of deep learning, the visual appearance (face, pose, cloth) of a person image can be easily generated or manipulated on demand. In this survey, we first summarize the scope of person generation, and then systematically review recent progress and technical trends in deep person generation, covering three major tasks: talking-head generation (face), pose-guided person generation (pose) and garment-oriented person generation (cloth). More than two hundred papers are covered for a thorough overview, and milestone works are highlighted to mark the major technical breakthroughs. Based on these fundamental tasks, a number of applications are investigated, e.g., virtual fitting, digital humans, and generative data augmentation. We hope this survey can shed some light on the future prospects of deep person generation and provide a helpful foundation for building full applications of digital humans.

    SkinCAN AI: A deep learning-based skin cancer classification and segmentation pipeline designed along with a generative model

    The rarity of melanoma skin cancer means that the collected datasets are limited and highly skewed, and benign moles can easily mimic the appearance of melanoma-affected areas. Such an imbalanced dataset makes training any deep learning classifier harder by affecting training stability. Our intuition is that synthesizing such skin lesion medical images could help mitigate overfitting during training and help preserve the anonymity of actual patients. Despite multiple previous attempts, none of the models were practical for the fast-paced clinical environment. In this thesis, we propose a novel pipeline named SkinCAN AI, inspired by StyleGAN but designed explicitly around the limitations of the skin lesion dataset and the requirement of a fast, optimized diagnostic tool that can be easily deployed and integrated into the clinical environment. Our SkinCAN AI model is equipped with an adaptive discriminator augmentation module that enables the limited target data distribution to be learned and artificial data points to be sampled, which further assists the classifier network in learning semantic features. We underline the novelty of the SkinCAN AI pipeline by integrating a soft attention module into the classifier network. This module yields an attention mask that helps DenseNet201 focus on learning relevant semantic features from skin lesion images without the heavy computational burden of artifact-removal software. The SkinGAN model achieves an FID score of 0.622, and its synthetic samples allow the DenseNet201 model to be trained to an accuracy of 0.9494, AUC of 0.938, specificity of 0.969, and sensitivity of 0.695. We provide evidence in this thesis that the proposed pipelines outperform existing state-of-the-art networks developed for this early-diagnosis task.
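
    The sketch below illustrates one plausible way to attach a soft attention mask to DenseNet201 features for lesion classification, in the spirit of the pipeline described above; the 1x1-convolution attention head, layer sizes, and class count are assumptions for illustration, not the thesis implementation.

```python
# Hedged sketch of a soft-attention head on top of DenseNet201 features.
import torch
import torch.nn as nn
import torchvision.models as models

class SoftAttentionClassifier(nn.Module):
    def __init__(self, num_classes=2):                  # e.g. benign vs. melanoma (assumed)
        super().__init__()
        backbone = models.densenet201(weights=None)
        self.features = backbone.features               # (B, 1920, H, W) feature maps
        self.attn = nn.Conv2d(1920, 1, kernel_size=1)   # 1x1 conv -> attention logits
        self.classifier = nn.Linear(1920, num_classes)

    def forward(self, x):
        f = self.features(x)
        a = torch.sigmoid(self.attn(f))                 # soft attention mask in [0, 1]
        f = f * a                                       # re-weight spatially relevant regions
        f = torch.relu(f).mean(dim=(2, 3))              # global average pooling
        return self.classifier(f)
```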

    Deep Learning for Music Information Retrieval in Limited Data Scenarios.

    PhD Thesis. While deep learning (DL) models have achieved impressive results in settings where large amounts of annotated training data are available, overfitting often degrades performance when data is more limited. To improve the generalisation of DL models, we investigate "data-driven priors" that exploit additional unlabelled data or labelled data from related tasks. Unlike techniques such as data augmentation, these priors are applicable across a range of machine listening tasks, since their design does not rely on problem-specific knowledge. We first consider scenarios in which parts of samples can be missing, aiming to make more datasets available for model training. In an initial study focusing on audio source separation (ASS), we exploit additionally available unlabelled music and solo source recordings by using generative adversarial networks (GANs), resulting in higher separation quality. We then present a fully adversarial framework for learning generative models with missing data. Our discriminator consists of separately trainable components that can be combined to train the generator with the same objective as in the original GAN framework. We apply our framework to image generation, image segmentation and ASS, demonstrating superior performance compared to the original GAN. To improve performance on any given MIR task, we also aim to leverage datasets which are annotated for similar tasks. We use multi-task learning (MTL) to perform singing voice detection and singing voice separation with one model, improving performance on both tasks. Furthermore, we employ meta-learning on a diverse collection of ten MIR tasks to find a weight initialisation for a "universal MIR model" so that training the model on any MIR task with this initialisation quickly leads to good performance. Since our data-driven priors encode knowledge shared across tasks and datasets, they are suited for high-dimensional, end-to-end models, instead of small models relying on task-specific feature engineering, such as fixed spectrogram representations of audio commonly used in machine listening. To this end, we propose "Wave-U-Net", an adaptation of the U-Net, which can perform ASS directly on the raw waveform while performing favourably compared to its spectrogram-based counterpart. Finally, we derive "Seq-U-Net" as a causal variant of Wave-U-Net, which performs comparably to WaveNet and the Temporal Convolutional Network (TCN) on a variety of sequence modelling tasks, while being more computationally efficient.
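
    A minimal sketch of a Wave-U-Net-style 1D encoder-decoder that operates directly on the raw waveform is shown below, assuming a toy depth of two levels; the filter sizes, resampling choices, and channel counts are illustrative and differ from the thesis configuration.

```python
# Toy Wave-U-Net-style model: downsample the waveform, process, then upsample
# with skip connections and predict one waveform per source.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWaveUNet(nn.Module):
    def __init__(self, ch=24, n_sources=2):
        super().__init__()
        self.down1 = nn.Conv1d(1, ch, kernel_size=15, padding=7)
        self.down2 = nn.Conv1d(ch, 2 * ch, kernel_size=15, padding=7)
        self.bottleneck = nn.Conv1d(2 * ch, 2 * ch, kernel_size=15, padding=7)
        self.up2 = nn.Conv1d(4 * ch, ch, kernel_size=5, padding=2)
        self.up1 = nn.Conv1d(2 * ch, ch, kernel_size=5, padding=2)
        self.out = nn.Conv1d(ch + 1, n_sources, kernel_size=1)

    def forward(self, x):                                  # x: (B, 1, T), T divisible by 4
        d1 = F.leaky_relu(self.down1(x))
        d2 = F.leaky_relu(self.down2(d1[:, :, ::2]))       # decimate by 2
        b = F.leaky_relu(self.bottleneck(d2[:, :, ::2]))   # decimate by 2 again
        u2 = F.interpolate(b, scale_factor=2, mode='linear', align_corners=False)
        u2 = F.leaky_relu(self.up2(torch.cat([u2, d2], dim=1)))   # skip connection
        u1 = F.interpolate(u2, scale_factor=2, mode='linear', align_corners=False)
        u1 = F.leaky_relu(self.up1(torch.cat([u1, d1], dim=1)))   # skip connection
        return self.out(torch.cat([u1, x], dim=1))         # (B, n_sources, T)
```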

    On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator

    Deployed image classification pipelines typically depend on images captured in real-world environments, which means the images may be affected by various sources of perturbation (e.g., sensor noise in low-light conditions). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification, and this challenge has therefore attracted wide interest within the computer vision community. We propose a transformation step that aims to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of input images are computed using the CORF push-pull inhibition operator; this operation transforms an input image into a representation that is more robust to noise before it is processed by a CNN. We evaluated our approach on the Fashion-MNIST dataset with an AlexNet model. The proposed CORF-augmented pipeline achieved results on noise-free images comparable to those of a conventional AlexNet classifier without CORF delineation maps, but it consistently achieved significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.
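
    The sketch below loosely illustrates the push-pull inhibition idea as an image preprocessing step, using difference-of-Gaussians filters as a simplified stand-in for the full CORF operator, which combines model-neuron responses and is considerably more elaborate; all parameter choices here are assumptions.

```python
# Greatly simplified push-pull-style preprocessing: an excitatory ("push")
# filter response is suppressed by the response of an inverted, broader
# ("pull") filter, which tends to cancel noise while preserving contours.
import numpy as np
from scipy.ndimage import gaussian_filter

def push_pull_response(img, sigma=2.0, k=0.5, alpha=1.0, pull_scale=2.0):
    def dog(s):
        # On-centre difference-of-Gaussians at scale s
        return gaussian_filter(img, k * s) - gaussian_filter(img, s)
    push = np.maximum(dog(sigma), 0.0)                  # excitatory response
    pull = np.maximum(-dog(pull_scale * sigma), 0.0)    # inhibitory (inverted, broader)
    return np.maximum(push - alpha * pull, 0.0)

# delineation = push_pull_response(noisy_image.astype(np.float32))
```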

    A Review of Deep Learning Techniques for Speech Processing

    The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for major advances in automatic speech recognition, text-to-speech synthesis, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCCs and HMMs, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover the speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been applied to these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.