SCNet: Sparse Compression Network for Music Source Separation
Deep learning-based methods have made significant achievements in music source separation. However, obtaining good results while maintaining low model complexity remains challenging in super wide-band music source separation. Previous works either overlook the differences between subbands or inadequately address the loss of information when generating subband features. In this paper, we propose SCNet, a novel frequency-domain network that explicitly splits the spectrogram of the mixture into several subbands and introduces a sparsity-based encoder to model the different frequency bands. We use a higher compression ratio on subbands carrying less information to increase their information density, and concentrate modeling capacity on subbands carrying more. In this way, separation performance can be significantly improved at a lower computational cost. Experimental results show that the proposed model achieves a signal-to-distortion ratio (SDR) of 9.0 dB on the MUSDB18-HQ dataset without using extra data, outperforming state-of-the-art methods.
Specifically, SCNet's CPU inference time is only 48% of that of HT Demucs, one of the previous state-of-the-art models.
Comment: Accepted by ICASSP 2024
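The band-splitting idea above lends itself to a compact illustration. The sketch below (PyTorch) is hypothetical, not SCNet's actual architecture: the band boundaries, strides, and the `BandSplitCompressor` name are made-up choices. It splits a magnitude spectrogram into three subbands and downsamples each along the frequency axis with a different stride, so less informative bands are compressed more aggressively.

```python
import torch
import torch.nn as nn

class BandSplitCompressor(nn.Module):
    """Illustrative sketch: split the frequency axis into subbands and
    compress each with a different stride. Band edges and ratios are
    made-up values, not SCNet's configuration."""
    def __init__(self, n_freq=2048, channels=16):
        super().__init__()
        # (start, end, frequency stride): compress high bands harder.
        self.bands = [(0, 256, 1), (256, 1024, 2), (1024, n_freq, 4)]
        self.encoders = nn.ModuleList(
            nn.Conv2d(1, channels, kernel_size=(3, 3),
                      stride=(s, 1), padding=(1, 1))
            for _, _, s in self.bands
        )

    def forward(self, spec):            # spec: (batch, 1, freq, time)
        feats = []
        for (lo, hi, _), enc in zip(self.bands, self.encoders):
            feats.append(enc(spec[:, :, lo:hi, :]))  # per-band compression
        return torch.cat(feats, dim=2)  # re-stack along frequency

x = torch.randn(1, 1, 2048, 100)        # dummy magnitude spectrogram
print(BandSplitCompressor()(x).shape)   # frequency axis shrinks unevenly
```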
Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss
We devise a cascade GAN approach to generate talking-face video that is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions. Instead of learning a direct mapping from audio to video frames, we propose first to transfer audio to a high-level structure, i.e., the facial landmarks, and then to generate video frames conditioned on those landmarks. Compared to a direct audio-to-image approach, our cascade approach avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content. Humans are sensitive to temporal discontinuities and subtle artifacts in video. To avoid such pixel-jittering problems and to force the network to focus on audiovisual-correlated regions, we propose a novel dynamically adjustable pixel-wise loss with an attention mechanism. Furthermore, to generate sharper images with well-synchronized facial movements, we propose a novel regression-based discriminator structure that considers sequence-level information along with frame-level information. Extensive experiments on several datasets and real-world samples demonstrate significantly better results obtained by our method than by state-of-the-art methods in both quantitative and qualitative comparisons.
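One way to read the dynamically adjustable pixel-wise loss is as a reconstruction loss reweighted by an attention map that concentrates on audio-correlated regions such as the mouth. The sketch below is a minimal, hypothetical version in PyTorch: the attention map `att` would come from the network itself and is merely assumed here, and the paper's exact weighting scheme may differ.

```python
import torch

def attention_weighted_l1(generated, target, att, eps=1e-8):
    """Pixel-wise L1 loss reweighted by an attention map.
    generated, target: (batch, 3, H, W) frames.
    att: (batch, 1, H, W), non-negative, higher on audio-correlated
    regions. Normalized per image so the loss scale stays stable."""
    w = att / (att.sum(dim=(2, 3), keepdim=True) + eps)
    return (w * (generated - target).abs()).sum(dim=(2, 3)).mean()

g = torch.rand(2, 3, 128, 128)
t = torch.rand(2, 3, 128, 128)
att = torch.rand(2, 1, 128, 128)   # stand-in for a learned attention map
print(attention_weighted_l1(g, t, att))
```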
HDTR-Net: A Real-Time High-Definition Teeth Restoration Network for Arbitrary Talking Face Generation Methods
Talking Face Generation (TFG) aims to reconstruct facial movements and achieve natural lip motion from audio and facial features that are potentially correlated. Existing TFG methods have made significant advances in producing natural and realistic images. However, most of them pay little attention to visual quality, and it is challenging to ensure lip synchronization while avoiding visual quality degradation in cross-modal generation. To address this issue, we propose a universal High-Definition Teeth Restoration Network, dubbed HDTR-Net, for arbitrary TFG methods. HDTR-Net can enhance teeth regions at extremely fast speed while maintaining synchronization and temporal consistency. In particular, we propose a Fine-Grained Feature Fusion (FGFF) module to effectively capture fine texture features around the teeth and surrounding regions, and use these features to refine the feature map and enhance the clarity of the teeth. Extensive experiments show that our method can be adapted to arbitrary TFG methods without degrading lip synchronization or frame coherence. Another advantage of HDTR-Net is its real-time generation ability: even when performing high-definition restoration on synthesized talking-face video, its inference speed is faster than that of current state-of-the-art super-resolution-based face restoration.
Comment: 15 pages, 6 figures, PRCV 2023
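The pipeline shape implied above, enhancing only the mouth region of each generated frame and blending it back, can be sketched as simple post-processing. The version below is purely illustrative and not HDTR-Net itself: the `enhance` callable stands in for the restoration model, and the landmark-derived crop box is assumed. It shows the crop-enhance-blend pattern with a feathered mask so the seam does not break frame coherence.

```python
import numpy as np

def restore_teeth(frame, box, enhance, feather=8):
    """Crop a mouth region, enhance it, and blend it back with a
    feathered alpha mask. `frame` is an (H, W, 3) float image in [0, 1];
    `box` = (y0, y1, x0, x1) would come from facial landmarks;
    `enhance` is any patch-to-patch restoration model (stand-in here)."""
    y0, y1, x0, x1 = box
    patch = enhance(frame[y0:y1, x0:x1])
    # Feathered mask: 1 in the center, ramping to 0 at the patch border.
    h, w = patch.shape[:2]
    ry = np.minimum(np.arange(h), np.arange(h)[::-1])
    rx = np.minimum(np.arange(w), np.arange(w)[::-1])
    alpha = np.minimum(np.minimum.outer(ry, rx) / feather, 1.0)[..., None]
    out = frame.copy()
    out[y0:y1, x0:x1] = alpha * patch + (1 - alpha) * frame[y0:y1, x0:x1]
    return out

frame = np.random.rand(256, 256, 3)
sharpened = restore_teeth(frame, (150, 200, 90, 170),
                          enhance=lambda p: np.clip(p * 1.1, 0, 1))
```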
Quantum process tomography with unknown single-preparation input states
Quantum Process Tomography (QPT) methods aim at identifying, i.e. estimating, a given quantum process. QPT is a major quantum information processing tool, since it notably allows one to characterize the actual behavior of quantum gates, which are the building blocks of quantum computers. However, usual QPT procedures are complicated, since they set several constraints on the quantum states used as inputs of the process to be characterized. In this paper, we extend QPT so as to avoid two such constraints. On the one hand, usual QPT methods require one to know, hence to precisely control (i.e. prepare), the specific quantum states used as inputs of the considered quantum process, which is cumbersome. We therefore propose a Blind, or unsupervised, extension of QPT (BQPT), meaning that the approach uses input quantum states whose values are unknown and arbitrary, except that they are required to meet some general known properties (the approach also exploits the output states of the considered quantum process). On the other hand, usual QPT methods require one to be able to prepare many copies of the same (known) input state, which is constraining. We instead propose "single-preparation methods", i.e. methods which can operate with only one instance of each considered input state. These two new concepts are illustrated here with practical BQPT methods that are numerically validated in the case where: i) random pure states are used as inputs, and their required properties relate in particular to the statistical independence of the random variables that define them; ii) the considered quantum process is based on cylindrical-symmetry Heisenberg spin coupling. These concepts may be extended to a much wider class of processes and to BQPT methods based on other input quantum state properties.
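To make the test setting of point ii) concrete, the sketch below (NumPy, with arbitrary coupling strengths and evolution time; not the authors' code) simulates a cylindrical-symmetry (XXZ-type) Heisenberg coupling between two spin-1/2 particles, applied to random pure product states of which only a single copy per state is assumed, as in the single-preparation setting.

```python
import numpy as np

# Pauli matrices.
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def heisenberg_unitary(jxy, jz, t):
    """exp(-iHt) for a cylindrical-symmetry (XXZ-type) Heisenberg
    coupling of two spin-1/2 particles: H = jxy*(XX + YY) + jz*ZZ."""
    H = jxy * (np.kron(sx, sx) + np.kron(sy, sy)) + jz * np.kron(sz, sz)
    evals, evecs = np.linalg.eigh(H)
    return evecs @ np.diag(np.exp(-1j * evals * t)) @ evecs.conj().T

def random_product_state(rng):
    """Random pure product state of two qubits; its value is treated
    as unknown by the (blind) estimator."""
    def qubit():
        v = rng.normal(size=2) + 1j * rng.normal(size=2)
        return v / np.linalg.norm(v)
    return np.kron(qubit(), qubit())

rng = np.random.default_rng(0)
U = heisenberg_unitary(jxy=1.0, jz=0.5, t=0.3)      # arbitrary parameters
outputs = [U @ random_product_state(rng) for _ in range(5)]  # one copy each
```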
A new generalized projection and its application to acceleration of audio declipping
In convex optimization, it is often unavoidable to work with projectors onto convex sets composed with a linear operator. The need arises in both theory and applications, with signal processing being a prominent and broad field where convex optimization has recently been used. In this article, a novel projector is presented which generalizes previous results: it can work with a broader family of linear transforms than the state of the art but, on the other hand, is limited to box-type convex sets in the transformed domain. The new projector is described by an explicit formula, which makes it simple to implement and cheap to compute. The projector is interpreted within the framework of so-called proximal splitting theory. The convenience of the new projector is demonstrated on an example from signal processing, where it speeds up the convergence of a signal declipping algorithm by a factor of more than two.
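As context for the kind of object involved, the sketch below shows a classical special case, not the paper's more general result: when the linear operator A satisfies A Aᵀ = αI (here an orthonormal DCT, so α = 1), the projector onto {x : Ax ∈ B} for a box B has the explicit formula P(x) = x + (1/α) Aᵀ(P_box(Ax) − Ax). The transform choice and box bounds are arbitrary.

```python
import numpy as np
from scipy.fft import dct, idct

def project_box(z, lo, hi):
    return np.clip(z, lo, hi)

def project_transformed_box(x, lo, hi):
    """Projection onto {x : lo <= DCT(x) <= hi}, using the classical
    identity for A with A A^T = alpha*I (orthonormal DCT, alpha = 1):
        P(x) = x + (1/alpha) * A^T (P_box(Ax) - Ax)."""
    z = dct(x, norm='ortho')
    return x + idct(project_box(z, lo, hi) - z, norm='ortho')

x = np.random.default_rng(1).normal(size=256)
y = project_transformed_box(x, lo=-0.5, hi=0.5)
assert np.all(np.abs(dct(y, norm='ortho')) <= 0.5 + 1e-9)
```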
The Sound Demixing Challenge 2023 – Cinematic Demixing Track
This paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23). We provide a comprehensive summary of the challenge setup, detailing the structure of the competition and the datasets used. In particular, we detail CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions. The paper also offers insights into the most successful approaches employed by participants. Compared to the cocktail-fork baseline, the best-performing system trained exclusively on the simulated Divide and Remaster (DnR) dataset achieved an improvement of 1.8 dB in SDR, whereas the top-performing system on the open leaderboard, where any data could be used for training, achieved a significant improvement of 5.7 dB.
Comment: under review
Musical source separation with deep learning and large-scale datasets
Throughout this thesis we will explore automatic music source separation by utilizing modern (at the time of writing) techniques and tools from machine learning and big data processing. The bulk of this work was carried out between 2016 and 2019.
In Chapter 2 we conduct a review of source separation literature. We start by outlining a subset of applications of source separation in some depth. We describe some of the early, pioneering work in automatic source separation: Auditory Scene Analysis, and its digital counterpart, Computational Auditory Scene Analysis.
We then introduce matrix decomposition-based methods such as Independent Component Analysis and Non-negative Matrix Factorization, and pitch-informed methods, where the separation algorithm is guided by pitch information that is known a priori. We briefly discuss user-guided methods, before conducting a thorough review of deep learning-based source separation, including recurrent, convolutional, deep clustering-based, and generative adversarial network approaches.
We then proceed to describe common evaluation metrics and training datasets. Finally, we list a number of challenges and drawbacks of current systems.
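As a concrete illustration of the matrix-decomposition family reviewed above, the sketch below factorizes a magnitude spectrogram with NMF and reconstructs one component's contribution via a soft mask (scikit-learn, with made-up dimensions; a real system would group several components per source).

```python
import numpy as np
from sklearn.decomposition import NMF

# V: non-negative magnitude spectrogram (freq bins x time frames).
rng = np.random.default_rng(0)
V = rng.random((513, 200))

# Factorize V ~= W @ H: W holds spectral templates, H their activations.
model = NMF(n_components=8, init='nndsvda', max_iter=500, random_state=0)
W = model.fit_transform(V)
H = model.components_

# Reconstruct the part of the mixture explained by component 0,
# using a soft (Wiener-like) mask so the component estimates sum to V.
eps = 1e-12
source0 = V * (np.outer(W[:, 0], H[0]) / (W @ H + eps))
```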
Chapter 3 focuses on datasets for musical source separation. First we show the growth of dataset sizes for both machine learning in general and music information retrieval specifically. We give several examples of the complexities and idiosyncrasies that are intrinsic to music datasets. We then proceed to present a method for extracting ground truth data for source separation from large unstructured musical catalogs.
In Chapter 4 we design a novel deep learning-based source separation algorithm. Motivation is provided by means of a musicological study that showed the high importance of vocals, relative to other musical factors, in the minds of listeners. At the core of the vocal separation algorithm is the U-Net, a deep learning architecture that uses skip connections to preserve fine-grained detail. It was originally developed in the biomedical imaging domain, and later adapted to image-to-image translation. We adapt it to the source separation domain by treating spectrograms as images, and we use the dataset mining methods from Chapter 3 to generate sufficiently large training data. We evaluate our model objectively using standard evaluation metrics, and subjectively using crowdsourced human subjects. To the best of our knowledge, this is the first use of U-Nets for source separation.
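The adaptation described above, treating spectrograms as images and predicting a soft mask, can be sketched compactly. The PyTorch toy below is a two-level encoder-decoder with a single skip connection, far shallower than a real U-Net; it is only meant to show the mask-then-multiply pattern.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy U-Net-style masker: 2-level encoder/decoder with one skip
    connection. Predicts a soft mask applied to the input spectrogram."""
    def __init__(self, ch=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2,
                                            padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(ch * 2, ch, 2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv2d(ch * 2, ch, 3, padding=1),
                                  nn.ReLU(), nn.Conv2d(ch, 1, 1))

    def forward(self, mag):                 # mag: (B, 1, F, T)
        e1 = self.enc1(mag)
        d = self.up(self.down(e1))          # bottleneck and back up
        mask = torch.sigmoid(self.dec1(torch.cat([d, e1], dim=1)))
        return mask * mag                   # masked (vocal) spectrogram

mix = torch.rand(1, 1, 512, 128)            # dummy magnitude spectrogram
vocals = TinyUNet()(mix)
```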
In the introduction above we proposed joint learning to optimize source separation together with other objectives. In Chapter 5 we investigate one such instance: multi-task learning of vocal removal and vocal pitch tracking. We combine the vocal separation model from Chapter 4 with a state-of-the-art pitch salience estimation model, exploring several ways of combining the two models. We find that vocal pitch estimation benefits from joint learning when the two tasks are trained in sequence, with the source separation model preceding the pitch estimation model. We also report benefits from fine-tuning by iteratively applying the model.
Chapter 6 extends the U-Net model to multiple instruments. In order to minimize the phase artifacts that were a common issue in Chapter 4, we modify the model to operate in the complex domain. We run experiments with several loss functions: time-domain loss, magnitude-only frequency-domain loss, and joint time- and frequency-domain loss. Our experiments are evaluated both objectively and subjectively, and we carry out extensive qualitative analysis to investigate the effects of complex masking.
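The complex-domain variant and the joint loss can be illustrated with a short sketch (PyTorch; the mask network and the loss weighting are placeholders, not the thesis's tuned choices). A complex ratio mask scales and rotates each time-frequency bin, and the loss combines a time-domain term with a magnitude-spectrogram term.

```python
import torch

def complex_mask_separate(mix_wave, mask_net, n_fft=1024, hop=256):
    """Apply a predicted complex ratio mask to the mixture STFT and
    return the separated waveform. `mask_net` maps a (B, F, T, 2)
    real/imag stack to a mask of the same shape (stand-in here)."""
    win = torch.hann_window(n_fft)
    X = torch.stft(mix_wave, n_fft, hop, window=win, return_complex=True)
    M = torch.view_as_complex(mask_net(torch.view_as_real(X)))
    return torch.istft(M * X, n_fft, hop, window=win,
                       length=mix_wave.shape[-1])

def joint_loss(est_wave, ref_wave, n_fft=1024, hop=256, alpha=0.5):
    """Joint time-domain + magnitude-domain L1 loss (the 0.5 weighting
    is a made-up choice)."""
    win = torch.hann_window(n_fft)
    mag = lambda w: torch.stft(w, n_fft, hop, window=win,
                               return_complex=True).abs()
    return (alpha * (est_wave - ref_wave).abs().mean()
            + (1 - alpha) * (mag(est_wave) - mag(ref_wave)).abs().mean())

wave = torch.randn(1, 16000)
dummy_mask_net = lambda ri: torch.ones_like(ri) * 0.5   # stand-in network
est = complex_mask_separate(wave, dummy_mask_net)
print(joint_loss(est, wave))
```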
Finally, we conclude the thesis in Chapter 7 by summarizing this work and highlighting several directions for future research.
The Sound Demixing Challenge 2023 – Music Demixing Track
This paper summarizes the music demixing (MDX) track of the Sound Demixing Challenge (SDX'23). We provide a summary of the challenge setup and introduce the task of robust music source separation (MSS), i.e., training MSS models in the presence of errors in the training data. We propose a formalization of the errors that can occur in the design of a training dataset for MSS systems and introduce two new datasets that simulate such errors: SDXDB23_LabelNoise and SDXDB23_Bleeding1. We describe the methods that achieved the highest scores in the competition. Moreover, we present a direct comparison with the previous edition of the challenge (the Music Demixing Challenge 2021): the best-performing system under the standard MSS formulation achieved an improvement of over 1.6 dB in signal-to-distortion ratio over the winner of the previous competition when evaluated on MDXDB21. Besides relying on the signal-to-distortion ratio as an objective metric, we also performed a listening test with renowned producers/musicians to study the perceptual quality of the systems, and we report the results here. Finally, we provide our insights into the organization of the competition and our prospects for future editions.
Comment: under review
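For reference, the basic signal-to-distortion ratio used as the objective metric, in its simplest non-windowed form (the challenge's scoring pipeline may add framing and per-source aggregation), is a near one-liner:

```python
import numpy as np

def sdr(reference, estimate, eps=1e-12):
    """Basic signal-to-distortion ratio in dB:
    10 * log10(||s||^2 / ||s - s_hat||^2)."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps
    return 10 * np.log10(num / den + eps)

s = np.sin(np.linspace(0, 100, 44100))          # toy reference source
s_hat = s + 0.01 * np.random.default_rng(0).normal(size=s.shape)
print(f"SDR: {sdr(s, s_hat):.1f} dB")
```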
Evolving Multi-Resolution Pooling CNN for Monaural Singing Voice Separation
Monaural Singing Voice Separation (MSVS) is a challenging task that has been studied for decades. Deep neural networks (DNNs) are the current state-of-the-art methods for MSVS. However, existing DNNs are often designed manually, which is time-consuming and error-prone, and their architectures are usually pre-defined rather than adapted to the training data. To address these issues, we introduce a Neural Architecture Search (NAS) method for the structure design of DNNs for MSVS. Specifically, we propose a new multi-resolution Convolutional Neural Network (CNN) framework for MSVS, named Multi-Resolution Pooling CNN (MRP-CNN), which uses pooling operators of various sizes to extract multi-resolution features. Based on NAS, we then develop an evolving framework, named Evolving MRP-CNN (E-MRP-CNN), that automatically searches for effective MRP-CNN structures using genetic algorithms, optimized either with a single objective considering only separation performance, or with multiple objectives considering both separation performance and model complexity. The multi-objective E-MRP-CNN yields a set of Pareto-optimal solutions, each providing a trade-off between separation performance and model complexity. Quantitative and qualitative evaluations on the MIR-1K and DSD100 datasets demonstrate the advantages of the proposed framework over several recent baselines.
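The core building block named above, pooling the same feature map at several resolutions and fusing the results, can be sketched as follows (PyTorch; the pool sizes and channel counts are illustrative choices, not a configuration found by the E-MRP-CNN search):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionPooling(nn.Module):
    """Illustrative multi-resolution pooling block: pool the input at
    several scales, process each branch, upsample back, concatenate."""
    def __init__(self, in_ch=16, branch_ch=8, pool_sizes=(1, 2, 4, 8)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, 3, padding=1) for _ in pool_sizes
        )

    def forward(self, x):                       # x: (B, C, F, T)
        outs = []
        for p, conv in zip(self.pool_sizes, self.branches):
            y = F.avg_pool2d(x, p) if p > 1 else x
            y = conv(y)                          # per-scale features
            outs.append(F.interpolate(y, size=x.shape[2:], mode='nearest'))
        return torch.cat(outs, dim=1)            # fuse multi-scale features

x = torch.randn(1, 16, 64, 64)
print(MultiResolutionPooling()(x).shape)         # (1, 32, 64, 64)
```

A genetic search as described in the abstract would then treat choices like `pool_sizes` and `branch_ch` as genes, scoring candidates on separation performance (and, in the multi-objective variant, model complexity) to select a Pareto front.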