SCNet: Sparse Compression Network for Music Source Separation
Deep learning-based methods have made significant achievements in music source separation. However, obtaining good results while maintaining low model complexity remains challenging in super wide-band music source separation. Previous works either overlook the differences between subbands or inadequately address the loss of information when generating subband features. In this paper, we propose SCNet, a novel frequency-domain network that explicitly splits the spectrogram of the mixture into several subbands and introduces a sparsity-based encoder to model the different frequency bands. We use a higher compression ratio on subbands carrying less information to increase their information density, and concentrate modeling capacity on subbands carrying more. In this way, separation performance can be significantly improved at a lower computational cost. Experimental results show that the proposed model achieves a signal-to-distortion ratio (SDR) of 9.0 dB on the MUSDB18-HQ dataset without using extra data, outperforming state-of-the-art methods.
Specifically, SCNet's CPU inference time is only 48% of that of HT Demucs, one of the previous state-of-the-art models.
Comment: Accepted by ICASSP 2024
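The band-splitting idea above lends itself to a compact illustration. The sketch below (PyTorch) is hypothetical, not SCNet's actual architecture: the band boundaries, strides, and the `BandSplitCompressor` name are made-up choices. It splits a magnitude spectrogram into three subbands and downsamples each along the frequency axis with a different stride, so less informative bands are compressed more aggressively.

```python
import torch
import torch.nn as nn

class BandSplitCompressor(nn.Module):
    """Illustrative sketch: split the frequency axis into subbands and
    compress each with a different stride. Band edges and ratios are
    made-up values, not SCNet's configuration."""
    def __init__(self, n_freq=2048, channels=16):
        super().__init__()
        # (start, end, frequency stride): compress high bands harder.
        self.bands = [(0, 256, 1), (256, 1024, 2), (1024, n_freq, 4)]
        self.encoders = nn.ModuleList(
            nn.Conv2d(1, channels, kernel_size=(3, 3),
                      stride=(s, 1), padding=(1, 1))
            for _, _, s in self.bands
        )

    def forward(self, spec):            # spec: (batch, 1, freq, time)
        feats = []
        for (lo, hi, _), enc in zip(self.bands, self.encoders):
            feats.append(enc(spec[:, :, lo:hi, :]))  # per-band compression
        return torch.cat(feats, dim=2)  # re-stack along frequency

x = torch.randn(1, 1, 2048, 100)        # dummy magnitude spectrogram
print(BandSplitCompressor()(x).shape)   # frequency axis shrinks unevenly
```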
Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss
We devise a cascade GAN approach to generate talking-face video that is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions. Instead of learning a direct mapping from audio to video frames, we propose first to transfer audio to a high-level structure, i.e., the facial landmarks, and then to generate video frames conditioned on those landmarks. Compared to a direct audio-to-image approach, our cascade approach avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content. Humans are sensitive to temporal discontinuities and subtle artifacts in video. To avoid such pixel-jittering problems and to force the network to focus on audiovisual-correlated regions, we propose a novel dynamically adjustable pixel-wise loss with an attention mechanism. Furthermore, to generate sharper images with well-synchronized facial movements, we propose a novel regression-based discriminator structure that considers sequence-level information along with frame-level information. Extensive experiments on several datasets and real-world samples demonstrate significantly better results obtained by our method than by state-of-the-art methods in both quantitative and qualitative comparisons.
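One way to read the dynamically adjustable pixel-wise loss is as a reconstruction loss reweighted by an attention map that concentrates on audio-correlated regions such as the mouth. The sketch below is a minimal, hypothetical version in PyTorch: the attention map `att` would come from the network itself and is merely assumed here, and the paper's exact weighting scheme may differ.

```python
import torch

def attention_weighted_l1(generated, target, att, eps=1e-8):
    """Pixel-wise L1 loss reweighted by an attention map.
    generated, target: (batch, 3, H, W) frames.
    att: (batch, 1, H, W), non-negative, higher on audio-correlated
    regions. Normalized per image so the loss scale stays stable."""
    w = att / (att.sum(dim=(2, 3), keepdim=True) + eps)
    return (w * (generated - target).abs()).sum(dim=(2, 3)).mean()

g = torch.rand(2, 3, 128, 128)
t = torch.rand(2, 3, 128, 128)
att = torch.rand(2, 1, 128, 128)   # stand-in for a learned attention map
print(attention_weighted_l1(g, t, att))
```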
HDTR-Net: A Real-Time High-Definition Teeth Restoration Network for Arbitrary Talking Face Generation Methods
Talking Face Generation (TFG) aims to reconstruct facial movements and achieve natural lip motion from audio and facial features that are potentially correlated. Existing TFG methods have made significant advances in producing natural and realistic images. However, most of them pay little attention to visual quality, and it is challenging to ensure lip synchronization while avoiding visual quality degradation in cross-modal generation. To address this issue, we propose a universal High-Definition Teeth Restoration Network, dubbed HDTR-Net, for arbitrary TFG methods. HDTR-Net can enhance teeth regions at extremely fast speed while maintaining synchronization and temporal consistency. In particular, we propose a Fine-Grained Feature Fusion (FGFF) module to effectively capture fine texture features around the teeth and surrounding regions, and use these features to refine the feature map and enhance the clarity of the teeth. Extensive experiments show that our method can be adapted to arbitrary TFG methods without degrading lip synchronization or frame coherence. Another advantage of HDTR-Net is its real-time generation ability: even when performing high-definition restoration on synthesized talking-face video, its inference speed is faster than that of current state-of-the-art super-resolution-based face restoration.
Comment: 15 pages, 6 figures, PRCV 2023
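The pipeline shape implied above, enhancing only the mouth region of each generated frame and blending it back, can be sketched as simple post-processing. The version below is purely illustrative and not HDTR-Net itself: the `enhance` callable stands in for the restoration model, and the landmark-derived crop box is assumed. It shows the crop-enhance-blend pattern with a feathered mask so the seam does not break frame coherence.

```python
import numpy as np

def restore_teeth(frame, box, enhance, feather=8):
    """Crop a mouth region, enhance it, and blend it back with a
    feathered alpha mask. `frame` is an (H, W, 3) float image in [0, 1];
    `box` = (y0, y1, x0, x1) would come from facial landmarks;
    `enhance` is any patch-to-patch restoration model (stand-in here)."""
    y0, y1, x0, x1 = box
    patch = enhance(frame[y0:y1, x0:x1])
    # Feathered mask: 1 in the center, ramping to 0 at the patch border.
    h, w = patch.shape[:2]
    ry = np.minimum(np.arange(h), np.arange(h)[::-1])
    rx = np.minimum(np.arange(w), np.arange(w)[::-1])
    alpha = np.minimum(np.minimum.outer(ry, rx) / feather, 1.0)[..., None]
    out = frame.copy()
    out[y0:y1, x0:x1] = alpha * patch + (1 - alpha) * frame[y0:y1, x0:x1]
    return out

frame = np.random.rand(256, 256, 3)
sharpened = restore_teeth(frame, (150, 200, 90, 170),
                          enhance=lambda p: np.clip(p * 1.1, 0, 1))
```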
Quantum process tomography with unknown single-preparation input states
Quantum Process Tomography (QPT) methods aim at identifying, i.e. estimating, a given quantum process. QPT is a major quantum information processing tool, since it notably allows one to characterize the actual behavior of quantum gates, which are the building blocks of quantum computers. However, usual QPT procedures are complicated, since they set several constraints on the quantum states used as inputs of the process to be characterized. In this paper, we extend QPT so as to avoid two such constraints. On the one hand, usual QPT methods require one to know, hence to precisely control (i.e. prepare), the specific quantum states used as inputs of the considered quantum process, which is cumbersome. We therefore propose a Blind, or unsupervised, extension of QPT (BQPT), meaning that the approach uses input quantum states whose values are unknown and arbitrary, except that they are required to meet some general known properties (the approach also exploits the output states of the considered quantum process). On the other hand, usual QPT methods require one to be able to prepare many copies of the same (known) input state, which is constraining. We instead propose "single-preparation methods", i.e. methods which can operate with only one instance of each considered input state. These two new concepts are illustrated here with practical BQPT methods that are numerically validated in the case where: i) random pure states are used as inputs, and their required properties relate in particular to the statistical independence of the random variables that define them; ii) the considered quantum process is based on cylindrical-symmetry Heisenberg spin coupling. These concepts may be extended to a much wider class of processes and to BQPT methods based on other input quantum state properties.
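To make the test setting of point ii) concrete, the sketch below (NumPy, with arbitrary coupling strengths and evolution time; not the authors' code) simulates a cylindrical-symmetry (XXZ-type) Heisenberg coupling between two spin-1/2 particles, applied to random pure product states of which only a single copy per state is assumed, as in the single-preparation setting.

```python
import numpy as np

# Pauli matrices.
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def heisenberg_unitary(jxy, jz, t):
    """exp(-iHt) for a cylindrical-symmetry (XXZ-type) Heisenberg
    coupling of two spin-1/2 particles: H = jxy*(XX + YY) + jz*ZZ."""
    H = jxy * (np.kron(sx, sx) + np.kron(sy, sy)) + jz * np.kron(sz, sz)
    evals, evecs = np.linalg.eigh(H)
    return evecs @ np.diag(np.exp(-1j * evals * t)) @ evecs.conj().T

def random_product_state(rng):
    """Random pure product state of two qubits; its value is treated
    as unknown by the (blind) estimator."""
    def qubit():
        v = rng.normal(size=2) + 1j * rng.normal(size=2)
        return v / np.linalg.norm(v)
    return np.kron(qubit(), qubit())

rng = np.random.default_rng(0)
U = heisenberg_unitary(jxy=1.0, jz=0.5, t=0.3)      # arbitrary parameters
outputs = [U @ random_product_state(rng) for _ in range(5)]  # one copy each
```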
A new generalized projection and its application to acceleration of audio declipping
In convex optimization, it is often unavoidable to work with projectors onto convex sets composed with a linear operator. The need arises in both theory and applications, with signal processing being a prominent and broad field where convex optimization has recently been used. In this article, a novel projector is presented which generalizes previous results: it can work with a broader family of linear transforms than the state of the art but, on the other hand, is limited to box-type convex sets in the transformed domain. The new projector is described by an explicit formula, which makes it simple to implement and cheap to compute. The projector is interpreted within the framework of so-called proximal splitting theory. The convenience of the new projector is demonstrated on an example from signal processing, where it speeds up the convergence of a signal declipping algorithm by a factor of more than two.
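As context for the kind of object involved, the sketch below shows a classical special case, not the paper's more general result: when the linear operator A satisfies A Aᵀ = αI (here an orthonormal DCT, so α = 1), the projector onto {x : Ax ∈ B} for a box B has the explicit formula P(x) = x + (1/α) Aᵀ(P_box(Ax) − Ax). The transform choice and box bounds are arbitrary.

```python
import numpy as np
from scipy.fft import dct, idct

def project_box(z, lo, hi):
    return np.clip(z, lo, hi)

def project_transformed_box(x, lo, hi):
    """Projection onto {x : lo <= DCT(x) <= hi}, using the classical
    identity for A with A A^T = alpha*I (orthonormal DCT, alpha = 1):
        P(x) = x + (1/alpha) * A^T (P_box(Ax) - Ax)."""
    z = dct(x, norm='ortho')
    return x + idct(project_box(z, lo, hi) - z, norm='ortho')

x = np.random.default_rng(1).normal(size=256)
y = project_transformed_box(x, lo=-0.5, hi=0.5)
assert np.all(np.abs(dct(y, norm='ortho')) <= 0.5 + 1e-9)
```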
The Sound Demixing Challenge 2023 – Cinematic Demixing Track
This paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23). We provide a comprehensive summary of the challenge setup, detailing the structure of the competition and the datasets used. In particular, we detail CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions. The paper also offers insights into the most successful approaches employed by participants. Compared to the cocktail-fork baseline, the best-performing system trained exclusively on the simulated Divide and Remaster (DnR) dataset achieved an improvement of 1.8 dB in SDR, whereas the top-performing system on the open leaderboard, where any data could be used for training, achieved a significant improvement of 5.7 dB.
Comment: under review
Musical source separation with deep learning and large-scale datasets
Throughout this thesis we will explore automatic music source separation by utilizing modern (at the time of writing) techniques and tools from machine learning and big data processing. The bulk of this work was carried out between 2016 and 2019.
In Chapter 2 we conduct a review of source separation literature. We start by outlining a subset of applications of source separation in some depth. We describe some of the early, pioneering work in automatic source separation: Auditory Scene Analysis, and its digital counterpart, Computational Auditory Scene Analysis.
We then introduce matrix decomposition-based methods such as Independent Component Analysis and Non-negative Matrix Factorization, and pitch-informed methods, where the separation algorithm is guided by pitch information that is known a priori. We briefly discuss user-guided methods, before conducting a thorough review of deep learning-based source separation, including recurrent, convolutional, deep clustering-based, and generative adversarial network approaches.
We then proceed to describe common evaluation metrics and training datasets. Finally, we list a number of challenges and drawbacks of current systems.
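As a concrete illustration of the matrix-decomposition family reviewed above, the sketch below factorizes a magnitude spectrogram with NMF and reconstructs one component's contribution via a soft mask (scikit-learn, with made-up dimensions; a real system would group several components per source).

```python
import numpy as np
from sklearn.decomposition import NMF

# V: non-negative magnitude spectrogram (freq bins x time frames).
rng = np.random.default_rng(0)
V = rng.random((513, 200))

# Factorize V ~= W @ H: W holds spectral templates, H their activations.
model = NMF(n_components=8, init='nndsvda', max_iter=500, random_state=0)
W = model.fit_transform(V)
H = model.components_

# Reconstruct the part of the mixture explained by component 0,
# using a soft (Wiener-like) mask so the component estimates sum to V.
eps = 1e-12
source0 = V * (np.outer(W[:, 0], H[0]) / (W @ H + eps))
```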
Chapter 3 focuses on datasets for musical source separation. First we show the growth of dataset sizes for both machine learning in general and music information retrieval specifically. We give several examples of the complexities and idiosyncrasies that are intrinsic to music datasets. We then proceed to present a method for extracting ground truth data for source separation from large unstructured musical catalogs.
In Chapter 4 we design a novel deep learning-based source separation algorithm. Motivation is provided by means of a musicological study that showed the high importance of vocals, relative to other musical factors, in the minds of listeners. At the core of the vocal separation algorithm is the U-Net, a deep learning architecture that uses skip connections to preserve fine-grained detail. It was originally developed in the biomedical imaging domain, and later adapted to image-to-image translation. We adapt it to the source separation domain by treating spectrograms as images, and we use the dataset mining methods from Chapter 3 to generate sufficiently large training data. We evaluate our model objectively using standard evaluation metrics, and subjectively using crowdsourced human subjects. To the best of our knowledge, this is the first use of U-Nets for source separation.
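The adaptation described above, treating spectrograms as images and predicting a soft mask, can be sketched compactly. The PyTorch toy below is a two-level encoder-decoder with a single skip connection, far shallower than a real U-Net; it is only meant to show the mask-then-multiply pattern.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy U-Net-style masker: 2-level encoder/decoder with one skip
    connection. Predicts a soft mask applied to the input spectrogram."""
    def __init__(self, ch=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2,
                                            padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(ch * 2, ch, 2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv2d(ch * 2, ch, 3, padding=1),
                                  nn.ReLU(), nn.Conv2d(ch, 1, 1))

    def forward(self, mag):                 # mag: (B, 1, F, T)
        e1 = self.enc1(mag)
        d = self.up(self.down(e1))          # bottleneck and back up
        mask = torch.sigmoid(self.dec1(torch.cat([d, e1], dim=1)))
        return mask * mag                   # masked (vocal) spectrogram

mix = torch.rand(1, 1, 512, 128)            # dummy magnitude spectrogram
vocals = TinyUNet()(mix)
```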
In the introduction above we proposed joint learning to optimize source separation together with other objectives. In Chapter 5 we investigate one such instance: multi-task learning of vocal removal and vocal pitch tracking. We combine the vocal separation model from Chapter 4 with a state-of-the-art pitch salience estimation model, exploring several ways of combining the two models. We find that vocal pitch estimation benefits from joint learning when the two tasks are trained in sequence, with the source separation model preceding the pitch estimation model. We also report benefits from fine-tuning by iteratively applying the model.
Chapter 6 extends the U-Net model to multiple instruments. In order to minimize the phase artifacts that were a common issue in Chapter 4, we modify the model to operate in the complex domain. We run experiments with several loss functions: time-domain loss, magnitude-only frequency-domain loss, and joint time- and frequency-domain loss. Our experiments are evaluated both objectively and subjectively, and we carry out extensive qualitative analysis to investigate the effects of complex masking.
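The complex-domain variant and the joint loss can be illustrated with a short sketch (PyTorch; the mask network and the loss weighting are placeholders, not the thesis's tuned choices). A complex ratio mask scales and rotates each time-frequency bin, and the loss combines a time-domain term with a magnitude-spectrogram term.

```python
import torch

def complex_mask_separate(mix_wave, mask_net, n_fft=1024, hop=256):
    """Apply a predicted complex ratio mask to the mixture STFT and
    return the separated waveform. `mask_net` maps a (B, F, T, 2)
    real/imag stack to a mask of the same shape (stand-in here)."""
    win = torch.hann_window(n_fft)
    X = torch.stft(mix_wave, n_fft, hop, window=win, return_complex=True)
    M = torch.view_as_complex(mask_net(torch.view_as_real(X)))
    return torch.istft(M * X, n_fft, hop, window=win,
                       length=mix_wave.shape[-1])

def joint_loss(est_wave, ref_wave, n_fft=1024, hop=256, alpha=0.5):
    """Joint time-domain + magnitude-domain L1 loss (the 0.5 weighting
    is a made-up choice)."""
    win = torch.hann_window(n_fft)
    mag = lambda w: torch.stft(w, n_fft, hop, window=win,
                               return_complex=True).abs()
    return (alpha * (est_wave - ref_wave).abs().mean()
            + (1 - alpha) * (mag(est_wave) - mag(ref_wave)).abs().mean())

wave = torch.randn(1, 16000)
dummy_mask_net = lambda ri: torch.ones_like(ri) * 0.5   # stand-in network
est = complex_mask_separate(wave, dummy_mask_net)
print(joint_loss(est, wave))
```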
Finally, we conclude the thesis in Chapter 7 by summarizing this work and highlighting several directions for future research.
The Sound Demixing Challenge 2023 – Music Demixing Track
This paper summarizes the music demixing (MDX) track of the Sound Demixing Challenge (SDX'23). We provide a summary of the challenge setup and introduce the task of robust music source separation (MSS), i.e., training MSS models in the presence of errors in the training data. We propose a formalization of the errors that can occur in the design of a training dataset for MSS systems and introduce two new datasets that simulate such errors: SDXDB23_LabelNoise and SDXDB23_Bleeding1. We describe the methods that achieved the highest scores in the competition. Moreover, we present a direct comparison with the previous edition of the challenge (the Music Demixing Challenge 2021): the best-performing system under the standard MSS formulation achieved an improvement of over 1.6 dB in signal-to-distortion ratio over the winner of the previous competition when evaluated on MDXDB21. Besides relying on the signal-to-distortion ratio as an objective metric, we also performed a listening test with renowned producers/musicians to study the perceptual quality of the systems, and we report the results here. Finally, we provide our insights into the organization of the competition and our prospects for future editions.
Comment: under review
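For reference, the basic signal-to-distortion ratio used as the objective metric, in its simplest non-windowed form (the challenge's scoring pipeline may add framing and per-source aggregation), is a near one-liner:

```python
import numpy as np

def sdr(reference, estimate, eps=1e-12):
    """Basic signal-to-distortion ratio in dB:
    10 * log10(||s||^2 / ||s - s_hat||^2)."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps
    return 10 * np.log10(num / den + eps)

s = np.sin(np.linspace(0, 100, 44100))          # toy reference source
s_hat = s + 0.01 * np.random.default_rng(0).normal(size=s.shape)
print(f"SDR: {sdr(s, s_hat):.1f} dB")
```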
Evolving Multi-Resolution Pooling CNN for Monaural Singing Voice Separation
Monaural Singing Voice Separation (MSVS) is a challenging task that has been studied for decades. Deep neural networks (DNNs) are the current state-of-the-art methods for MSVS. However, existing DNNs are often designed manually, which is time-consuming and error-prone, and their architectures are usually pre-defined rather than adapted to the training data. To address these issues, we introduce a Neural Architecture Search (NAS) method for the structure design of DNNs for MSVS. Specifically, we propose a new multi-resolution Convolutional Neural Network (CNN) framework for MSVS, named Multi-Resolution Pooling CNN (MRP-CNN), which uses pooling operators of various sizes to extract multi-resolution features. Based on NAS, we then develop an evolving framework, named Evolving MRP-CNN (E-MRP-CNN), that automatically searches for effective MRP-CNN structures using genetic algorithms, optimized either with a single objective considering only separation performance, or with multiple objectives considering both separation performance and model complexity. The multi-objective E-MRP-CNN yields a set of Pareto-optimal solutions, each providing a trade-off between separation performance and model complexity. Quantitative and qualitative evaluations on the MIR-1K and DSD100 datasets demonstrate the advantages of the proposed framework over several recent baselines.
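The core building block named above, pooling the same feature map at several resolutions and fusing the results, can be sketched as follows (PyTorch; the pool sizes and channel counts are illustrative choices, not a configuration found by the E-MRP-CNN search):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionPooling(nn.Module):
    """Illustrative multi-resolution pooling block: pool the input at
    several scales, process each branch, upsample back, concatenate."""
    def __init__(self, in_ch=16, branch_ch=8, pool_sizes=(1, 2, 4, 8)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, 3, padding=1) for _ in pool_sizes
        )

    def forward(self, x):                       # x: (B, C, F, T)
        outs = []
        for p, conv in zip(self.pool_sizes, self.branches):
            y = F.avg_pool2d(x, p) if p > 1 else x
            y = conv(y)                          # per-scale features
            outs.append(F.interpolate(y, size=x.shape[2:], mode='nearest'))
        return torch.cat(outs, dim=1)            # fuse multi-scale features

x = torch.randn(1, 16, 64, 64)
print(MultiResolutionPooling()(x).shape)         # (1, 32, 64, 64)
```

A genetic search as described in the abstract would then treat choices like `pool_sizes` and `branch_ch` as genes, scoring candidates on separation performance (and, in the multi-objective variant, model complexity) to select a Pareto front.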