17 research outputs found

    SCNet: Sparse Compression Network for Music Source Separation

    Deep learning-based methods have made significant achievements in music source separation. However, obtaining good results while maintaining a low model complexity remains challenging in super wide-band music source separation. Previous works either overlook the differences in subbands or inadequately address the problem of information loss when generating subband features. In this paper, we propose SCNet, a novel frequency-domain network that explicitly splits the spectrogram of the mixture into several subbands and introduces a sparsity-based encoder to model different frequency bands. We use a higher compression ratio on subbands with less information to improve the information density and focus on modeling subbands with more information. In this way, the separation performance can be significantly improved at a lower computational cost. Experimental results show that the proposed model achieves a signal-to-distortion ratio (SDR) of 9.0 dB on the MUSDB18-HQ dataset without using extra data, which outperforms state-of-the-art methods. Specifically, SCNet's CPU inference time is only 48% of HT Demucs, one of the previous state-of-the-art models. Comment: Accepted by ICASSP 202
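
    The SDR figure quoted above is an energy ratio between a reference source and the error of its estimate. A minimal sketch of the plain energy-ratio form follows; the exact SDR variant used in the paper's evaluation is not stated in this abstract, and the function name `sdr_db`, the `eps` stabilizer, and the toy signals are assumptions made for illustration.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    `eps` is a small stabilizer (an assumption of this sketch) that avoids
    division by zero when the estimate is perfect or the reference is silent.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# Toy signals, not taken from any dataset: the estimate is the reference
# scaled by 0.9, so the error carries 1% of the reference energy.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
print(round(sdr_db(reference, estimate), 2))
```

    An error carrying 1% of the reference energy corresponds to about 20 dB, which gives a sense of scale for the 9.0 dB reported on real music mixtures.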

    Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss

    We devise a cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions. Instead of learning a direct mapping from audio to video frames, we propose first to transfer audio to high-level structure, i.e., the facial landmarks, and then to generate video frames conditioned on the landmarks. Compared to a direct audio-to-image approach, our cascade approach avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content. Humans are sensitive to temporal discontinuities and subtle artifacts in video. To avoid those pixel jittering problems and to force the network to focus on audiovisual-correlated regions, we propose a novel dynamically adjustable pixel-wise loss with an attention mechanism. Furthermore, to generate a sharper image with well-synchronized facial movements, we propose a novel regression-based discriminator structure, which considers sequence-level information along with frame-level information. Thoughtful experiments on several datasets and real-world samples demonstrate significantly better results obtained by our method than the state-of-the-art methods in both quantitative and qualitative comparisons.

    HDTR-Net: A Real-Time High-Definition Teeth Restoration Network for Arbitrary Talking Face Generation Methods

    Talking Face Generation (TFG) aims to reconstruct facial movements to achieve highly natural lip movements from audio and facial features that are under potential connections. Existing TFG methods have made significant advancements to produce natural and realistic images. However, most work rarely takes visual quality into consideration. It is challenging to ensure lip synchronization while avoiding visual quality degradation in cross-modal generation methods. To address this issue, we propose a universal High-Definition Teeth Restoration Network, dubbed HDTR-Net, for arbitrary TFG methods. HDTR-Net can enhance teeth regions at an extremely fast speed while maintaining synchronization and temporal consistency. In particular, we propose a Fine-Grained Feature Fusion (FGFF) module to effectively capture fine texture feature information around teeth and surrounding regions, and use these features to fine-grain the feature map to enhance the clarity of teeth. Extensive experiments show that our method can be adapted to arbitrary TFG methods without degrading lip synchronization or frame coherence. Another advantage of HDTR-Net is its real-time generation ability. Also, under the condition of high-definition restoration of talking face video synthesis, its inference speed is 300% faster than the current state-of-the-art face restoration based on super-resolution. Comment: 15 pages, 6 figures, PRCV202

    Quantum process tomography with unknown single-preparation input states

    Quantum Process Tomography (QPT) methods aim at identifying, i.e. estimating, a given quantum process. QPT is a major quantum information processing tool, since it especially allows one to characterize the actual behavior of quantum gates, which are the building blocks of quantum computers. However, usual QPT procedures are complicated, since they set several constraints on the quantum states used as inputs of the process to be characterized. In this paper, we extend QPT so as to avoid two such constraints. On the one hand, usual QPT methods require one to know, hence to precisely control (i.e. prepare), the specific quantum states used as inputs of the considered quantum process, which is cumbersome. We therefore propose a Blind, or unsupervised, extension of QPT (i.e. BQPT), which means that this approach uses input quantum states whose values are unknown and arbitrary, except that they are required to meet some general known properties (and this approach exploits the output states of the considered quantum process). On the other hand, usual QPT methods require one to be able to prepare many copies of the same (known) input state, which is constraining. On the contrary, we propose "single-preparation methods", i.e. methods which can operate with only one instance of each considered input state. These two new concepts are here illustrated with practical BQPT methods which are numerically validated, in the case when: i) random pure states are used as inputs and their required properties are especially related to the statistical independence of the random variables that define them, ii) the considered quantum process is based on cylindrical-symmetry Heisenberg spin coupling. These concepts may be extended to a much wider class of processes and to BQPT methods based on other input quantum state properties.

    A new generalized projection and its application to acceleration of audio declipping

    In convex optimization, it is often inevitable to work with projectors onto convex sets composed with a linear operator. Such a need arises from both the theory and applications, with signal processing being a prominent and broad field where convex optimization has been used recently. In this article, a novel projector is presented, which generalizes previous results in that it works with a broader family of linear transforms when compared with the state of the art but, on the other hand, it is limited to box-type convex sets in the transformed domain. The new projector is described by an explicit formula, which makes it simple to implement and computationally cheap. The projector is interpreted within the framework of the so-called proximal splitting theory. The convenience of the new projector is demonstrated on an example from signal processing, where it was possible to speed up the convergence of a signal declipping algorithm by a factor of more than two.
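
    To make "box-type convex set" concrete, the sketch below projects a candidate signal onto the set of signals consistent with a clipped observation. It covers only the simplest case, where the linear operator is the identity; it does not reproduce the generalized projector from the article. The function names, the clipping threshold, and the sample values are assumptions made for illustration.

```python
def declipping_bounds(observed, threshold):
    """Per-sample box bounds consistent with a signal clipped at +/- threshold.

    Samples at the positive clipping level may be any value >= threshold,
    samples at the negative level any value <= -threshold, and unclipped
    (reliable) samples are fixed to their observed value.
    """
    inf = float("inf")
    lower, upper = [], []
    for x in observed:
        if x >= threshold:        # clipped from above
            lower.append(threshold)
            upper.append(inf)
        elif x <= -threshold:     # clipped from below
            lower.append(-inf)
            upper.append(-threshold)
        else:                     # reliable sample
            lower.append(x)
            upper.append(x)
    return lower, upper


def project_onto_box(values, lower, upper):
    """Euclidean projection onto a box: clamp each entry to [lower, upper]."""
    return [min(max(v, lo), hi) for v, lo, hi in zip(values, lower, upper)]


# Hypothetical observation clipped at 0.5 and a candidate reconstruction.
observed = [0.2, 0.5, 0.5, -0.5, -0.1]
candidate = [0.25, 0.4, 0.8, -0.7, 0.0]
lower, upper = declipping_bounds(observed, threshold=0.5)
print(project_onto_box(candidate, lower, upper))
```

    The projection is an entrywise clamp because the constraints are separable per sample. Once the box lives in a transformed domain, the constraints are no longer separable in the signal domain, which is the setting the article's explicit formula addresses.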

    The Sound Demixing Challenge 2023 – Cinematic Demixing Track

    This paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23). We provide a comprehensive summary of the challenge setup, detailing the structure of the competition and the datasets used. In particular, we detail CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions. The paper also offers insights into the most successful approaches employed by participants. Compared to the cocktail-fork baseline, the best-performing system trained exclusively on the simulated Divide and Remaster (DnR) dataset achieved an improvement of 1.8 dB in SDR, whereas the top-performing system on the open leaderboard, where any data could be used for training, saw a significant improvement of 5.7 dB. Comment: under review

    The Sound Demixing Challenge 2023 – Music Demixing Track

    This paper summarizes the music demixing (MDX) track of the Sound Demixing Challenge (SDX'23). We provide a summary of the challenge setup and introduce the task of robust music source separation (MSS), i.e., training MSS models in the presence of errors in the training data. We propose a formalization of the errors that can occur in the design of a training dataset for MSS systems and introduce two new datasets that simulate such errors: SDXDB23_LabelNoise and SDXDB23_Bleeding1. We describe the methods that achieved the highest scores in the competition. Moreover, we present a direct comparison with the previous edition of the challenge (the Music Demixing Challenge 2021): the best-performing system under the standard MSS formulation achieved an improvement of over 1.6 dB in signal-to-distortion ratio over the winner of the previous competition, when evaluated on MDXDB21. Besides relying on the signal-to-distortion ratio as an objective metric, we also performed a listening test with renowned producers/musicians to study the perceptual quality of the systems and report the results here. Finally, we provide our insights into the organization of the competition and our prospects for future editions. Comment: under review

    Evolving Multi-Resolution Pooling CNN for Monaural Singing Voice Separation

    Monaural Singing Voice Separation (MSVS) is a challenging task and has been studied for decades. Deep neural networks (DNNs) are the current state-of-the-art methods for MSVS. However, the existing DNNs are often designed manually, which is time-consuming and error-prone. In addition, the network architectures are usually pre-defined, and not adapted to the training data. To address these issues, we introduce a Neural Architecture Search (NAS) method to the structure design of DNNs for MSVS. Specifically, we propose a new multi-resolution Convolutional Neural Network (CNN) framework for MSVS, namely Multi-Resolution Pooling CNN (MRP-CNN), which uses pooling operators of various sizes to extract multi-resolution features. Based on the NAS, we then develop an evolving framework, namely Evolving MRP-CNN (E-MRP-CNN), by automatically searching for effective MRP-CNN structures using genetic algorithms, optimized either for a single objective considering only separation performance, or for multiple objectives considering both the separation performance and the model complexity. The multi-objective E-MRP-CNN gives a set of Pareto-optimal solutions, each providing a trade-off between separation performance and model complexity. Quantitative and qualitative evaluations on the MIR-1K and DSD100 datasets are used to demonstrate the advantages of the proposed framework over several recent baselines.
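
    The notion of a Pareto-optimal set over separation performance and model complexity can be illustrated with a small filter. This is a generic sketch of Pareto dominance, not the genetic search described in the abstract; the candidate names and the (score, parameter count) values are made up for illustration and are not results from the paper.

```python
def pareto_front(candidates):
    """Return names of candidates not dominated by any other candidate.

    Each candidate is (name, performance, complexity): higher performance is
    better, lower complexity is better. One candidate dominates another if it
    is at least as good on both objectives and strictly better on one.
    """
    front = []
    for name, perf, cost in candidates:
        dominated = any(
            (p >= perf and c <= cost) and (p > perf or c < cost)
            for _, p, c in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical architectures: (name, separation score in dB, parameters in millions).
candidates = [
    ("A", 6.0, 10),
    ("B", 6.5, 30),
    ("C", 5.5, 20),  # worse and larger than A
    ("D", 6.5, 40),  # same score as B but larger
]
print(pareto_front(candidates))
```

    Here A and B survive: neither is better than the other on both objectives, so each represents a different trade-off between quality and size.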