77 research outputs found
Deep Clustering and Conventional Networks for Music Separation: Stronger Together
Deep clustering is the first method to handle general audio separation
scenarios with multiple sources of the same type and an arbitrary number of
sources, performing impressively in speaker-independent speech separation
tasks. However, little is known about its effectiveness in other challenging
situations such as music source separation. Contrary to conventional networks
that directly estimate the source signals, deep clustering generates an
embedding for each time-frequency bin, and separates sources by clustering the
bins in the embedding space. We show that deep clustering outperforms
conventional networks on a singing voice separation task, in both matched and
mismatched conditions, even though conventional networks have the advantage of
end-to-end training for best signal approximation; presumably, deep clustering's
more flexible objective engenders better regularization. Since the strengths of deep
clustering and conventional network architectures appear complementary, we
explore combining them in a single hybrid network trained via an approach akin
to multi-task learning. Remarkably, the combination significantly outperforms
either of its components.
Comment: Published in ICASSP 201
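As a rough illustration of the inference step this abstract describes, the sketch below clusters per-bin embeddings with k-means and turns the cluster labels into binary time-frequency masks. The network that would produce the embeddings, and all names here, are assumptions for illustration, not the authors' code.

```python
import numpy as np

def cluster_tf_bins(embeddings, n_sources=2, n_iter=20):
    """Assign each time-frequency bin to a source by k-means in the
    embedding space (the deep-clustering inference step)."""
    emb = np.asarray(embeddings, dtype=float)
    # Farthest-point initialisation keeps this sketch deterministic.
    centroids = [emb[0]]
    for _ in range(n_sources - 1):
        d = np.min([np.linalg.norm(emb - c, axis=1) for c in centroids], axis=0)
        centroids.append(emb[d.argmax()])
    centroids = np.stack(centroids)
    for _ in range(n_iter):
        d = np.linalg.norm(emb[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        for k in range(n_sources):
            if np.any(labels == k):
                centroids[k] = emb[labels == k].mean(axis=0)
    return labels

def binary_masks(labels, spec_shape, n_sources=2):
    """Turn per-bin labels into per-source binary masks over the
    (freq, time) spectrogram grid."""
    return np.stack([(labels == k).reshape(spec_shape).astype(float)
                     for k in range(n_sources)])
```

Each mask is then applied to the mixture spectrogram to recover one source; the masks partition the bins, so they sum to one everywhere.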
Evolving Multi-Resolution Pooling CNN for Monaural Singing Voice Separation
Monaural Singing Voice Separation (MSVS) is a challenging task and has been
studied for decades. Deep neural networks (DNNs) are the current
state-of-the-art methods for MSVS. However, the existing DNNs are often
designed manually, which is time-consuming and error-prone. In addition, the
network architectures are usually pre-defined, and not adapted to the training
data. To address these issues, we introduce a Neural Architecture Search (NAS)
method to the structure design of DNNs for MSVS. Specifically, we propose a new
multi-resolution Convolutional Neural Network (CNN) framework for MSVS, namely the
Multi-Resolution Pooling CNN (MRP-CNN), which uses pooling operators of various
sizes to extract multi-resolution features. Based on the NAS, we then
develop an evolving framework, namely the Evolving MRP-CNN (E-MRP-CNN), which
automatically searches for effective MRP-CNN structures using genetic
algorithms, optimized either for a single objective considering only
separation performance, or for multiple objectives considering both the separation
performance and the model complexity. The multi-objective E-MRP-CNN gives a set
of Pareto-optimal solutions, each providing a trade-off between separation
performance and model complexity. Quantitative and qualitative evaluations on
the MIR-1K and DSD100 datasets are used to demonstrate the advantages of the
proposed framework over several recent baselines.
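The multi-objective search above returns a set of Pareto-optimal architectures. As a small illustration of that selection step only (not the authors' code, and with hypothetical score values), a Pareto filter over (separation performance, model complexity) pairs can be sketched as:

```python
import numpy as np

def pareto_front(scores):
    """Return indices of Pareto-optimal candidates. Each row is
    (separation_score, model_complexity): higher score is better,
    lower complexity is better. A candidate is kept unless some other
    candidate is at least as good on both objectives and strictly
    better on one."""
    scores = np.asarray(scores, dtype=float)
    keep = []
    for i, (s_i, c_i) in enumerate(scores):
        dominated = any(
            s_j >= s_i and c_j <= c_i and (s_j > s_i or c_j < c_i)
            for j, (s_j, c_j) in enumerate(scores) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep
```

Each surviving index is one trade-off point; a user then picks along the front according to their compute budget.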
Singing Voice Separation from Monaural Recordings using Archetypal Analysis
Singing voice separation aims to separate the singing voice signal from the background music signal, given music recordings as input. This task is a cornerstone for numerous Music Information Retrieval (MIR) tasks, including automatic lyric recognition, singer identification, melody extraction and audio remixing. In this thesis, we investigate singing voice separation from monaural recordings by exploiting unsupervised machine learning methods. The motivation behind the employed methods is the fact that the music accompaniment lies in a low-rank subspace due to its repeating pattern, while the singing voice appears as a sparse pattern within a song. To this end, we decompose audio spectrograms as a superposition of low-rank and sparse components, capturing the spectrograms of the background music and the singing voice respectively, using the Robust Principal Component Analysis algorithm. Furthermore, considering the non-negative nature of the magnitude of audio spectrograms, we develop a variant of Archetypal Analysis with sparsity constraints, aiming to improve the separation.
Both methods are evaluated on the MIR-1K dataset, which is designed especially for singing voice separation. Experimental evaluation confirms that both methods perform singing voice separation successfully and achieve values above 3.0 dB in the GNSDR metric.
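The low-rank-plus-sparse decomposition described above can be sketched with a textbook inexact augmented-Lagrangian RPCA solver. Parameter choices below are common defaults from the RPCA literature, not the thesis's settings, and the code is an illustration, not the author's implementation.

```python
import numpy as np

def svd_shrink(X, tau):
    """Singular value thresholding: shrink all singular values by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(X, tau):
    """Elementwise soft thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca(M, lam=None, n_iter=300, rho=1.5, tol=1e-7):
    """Decompose M into L (low-rank: accompaniment spectrogram) plus
    S (sparse: vocal spectrogram) via an inexact augmented-Lagrangian
    scheme with a growing penalty mu."""
    M = np.asarray(M, dtype=float)
    if lam is None:
        lam = 1.0 / np.sqrt(max(M.shape))  # standard RPCA default
    mu = 0.25 * M.size / (np.abs(M).sum() + 1e-12)
    mu_bar = mu * 1e7
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # Lagrange multipliers
    norm_M = np.linalg.norm(M)
    for _ in range(n_iter):
        L = svd_shrink(M - S + Y / mu, 1.0 / mu)
        S = soft(M - L + Y / mu, lam / mu)
        R = M - L - S
        Y = Y + mu * R
        mu = min(mu * rho, mu_bar)
        if np.linalg.norm(R) <= tol * norm_M:
            break
    return L, S
```

Applied to a magnitude spectrogram, L is taken as the background-music estimate and S as the singing-voice estimate, typically followed by masking and inverse STFT.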
Singing voice correction using canonical time warping
Expressive singing voice correction is an appealing but challenging problem.
A robust time-warping algorithm which synchronizes two singing recordings can
provide a promising solution. We thereby propose to address the problem by
canonical time warping (CTW), which aligns amateur singing recordings to
professional ones. A new pitch contour is generated from the alignment
information, and a pitch-corrected singing voice is synthesized through the
vocoder. Objective evaluation shows that CTW is robust against
pitch-shifting and time-stretching effects, and a subjective test
demonstrates that CTW outperforms the other methods, including DTW and
commercial auto-tuning software. Finally, we demonstrate the applicability of
the proposed method in a practical, real-world scenario.
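For intuition, the DTW baseline mentioned above can be sketched as follows on two 1-D pitch contours. CTW additionally learns a shared subspace (via canonical correlation analysis) before warping, which this sketch omits; the function and its inputs are illustrative, not the paper's code.

```python
import numpy as np

def dtw_path(x, y):
    """Dynamic time warping between two 1-D contours.
    Returns the alignment path (list of index pairs) and total cost."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1],  # match
                                 D[i - 1, j],      # skip in x
                                 D[i, j - 1])      # skip in y
    # Backtrack from the end of both sequences.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = {(i - 1, j - 1): D[i - 1, j - 1],
                 (i - 1, j): D[i - 1, j],
                 (i, j - 1): D[i, j - 1]}
        i, j = min(moves, key=moves.get)
    path.reverse()
    return path, D[n, m]
```

The alignment path maps each frame of the amateur contour to a frame of the professional one; in the paper's pipeline, the corrected pitch contour is then resynthesized through a vocoder.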