    A computational framework for sound segregation in music signals

    Doctoral thesis. Electrical and Computer Engineering. Faculdade de Engenharia, Universidade do Porto. 200

    Real-time Sound Source Separation For Music Applications

    Sound source separation refers to the task of extracting individual sound sources from some number of mixtures of those sound sources. In this thesis, a novel sound source separation algorithm for musical applications is presented. It leverages the fact that the vast majority of commercially recorded music since the 1950s has been mixed down for two-channel reproduction, more commonly known as stereo. The algorithm presented in Chapter 3 of this thesis requires no prior knowledge or learning and performs separation based purely on azimuth discrimination within the stereo field. The algorithm exploits the use of the pan pot as a means to achieve image localisation within stereophonic recordings; as such, only an interaural intensity difference exists between the left and right channels for a single source. We use gain scaling and phase cancellation techniques to expose frequency-dependent nulls across the azimuth domain, from which source separation and resynthesis are carried out. The algorithm is demonstrated not only to be state of the art in the field of sound source separation but also to be a useful pre-processing step for other tasks such as music segmentation and surround sound upmixing.
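
    A minimal per-frame sketch of the nulling idea described above: scan gain-scaled phase cancellations of one stereo STFT frame and keep the bins whose null falls inside a chosen azimuth window. This is a simplified reading of the abstract, not the thesis implementation; the azimuth resolution, the window parameters, and the use of the left channel's phase for resynthesis are all assumptions.

```python
import numpy as np

def azimuth_separate_frame(L, R, beta=100, target=40, width=5):
    """Separate one stereo STFT frame by azimuth discrimination (sketch).

    L, R  : complex spectra of the left/right channels (1-D arrays)
    beta  : number of azimuth steps; `target` and `width` pick the
            azimuth window to extract (hypothetical example values)
    """
    g = np.arange(beta + 1) / beta                     # gain-scaling factors
    # Phase cancellation: |L - g*R| has a frequency-dependent null where
    # g matches the pan-pot intensity ratio of a panned source.
    az = np.abs(L[:, None] - g[None, :] * R[:, None])  # (bins, beta + 1)
    null_pos = az.argmin(axis=1)                       # azimuth index of each null
    null_mag = az.max(axis=1) - az.min(axis=1)         # recovered source magnitude
    keep = np.abs(null_pos - target) <= width          # bins in the chosen window
    return null_mag * keep * np.exp(1j * np.angle(L))  # spectrum for resynthesis
```

    Applying this frame by frame and inverting the STFT with overlap-add would yield the separated source.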

    Score-Informed Source Separation for Music Signals

    In recent years, the processing of audio recordings by exploiting additional musical knowledge has turned out to be a promising research direction. In particular, additional note information as specified by a musical score or a MIDI file has been employed to support various audio processing tasks such as source separation, audio parameterization, performance analysis, or instrument equalization. In this contribution, we provide an overview of approaches to score-informed source separation and illustrate their potential by discussing innovative applications and interfaces. Additionally, to illustrate some basic principles behind these approaches, we demonstrate how score information can be integrated into the well-known non-negative matrix factorization (NMF) framework. Finally, we compare this approach to advanced methods based on parametric models.
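
    One common way to integrate score information into NMF, sketched below under stated assumptions: the activation matrix is masked with note on/off regions derived from the score, and because the updates are multiplicative, entries initialized to zero stay zero. The Euclidean objective and update rules are illustrative choices; the overview above does not commit to a particular variant.

```python
import numpy as np

def score_informed_nmf(V, W, H_mask, n_iter=100, eps=1e-10):
    """NMF with score-derived constraints on the activations (sketch).

    V      : magnitude spectrogram, shape (freq, time)
    W      : initial spectral templates, one column per pitch/instrument
    H_mask : binary (pitches x time) matrix, 1 where the score allows a
             note to be active; its zeros survive multiplicative updates
    """
    H = np.random.rand(W.shape[1], V.shape[1]) * H_mask
    for _ in range(n_iter):
        # Multiplicative updates for the objective ||V - WH||^2.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

    A source s can then be reconstructed by soft masking, e.g. V * (W_s @ H_s) / (W @ H), followed by an inverse STFT using the mixture phase.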

    Pitch-Informed Solo and Accompaniment Separation

    This thesis addresses the development of a system for pitch-informed solo and accompaniment separation capable of separating main instruments from the music accompaniment regardless of the musical genre of the track or the type of accompaniment. For the solo instrument, only pitched monophonic instruments were considered, in a single-channel scenario where no panning or spatial location information is available. In the proposed method, pitch information is used as the initial stage of a sinusoidal modeling approach that estimates the spectral information of the solo instrument from a given audio mixture. Instead of estimating the solo instrument on a frame-by-frame basis, the proposed method gathers information into tone objects to perform separation. Tone-based processing allowed the inclusion of novel processing stages for attack refinement, transient interference reduction, common amplitude modulation (CAM) of tone objects, and better estimation of the non-harmonic elements that can occur in musical instrument tones. The proposed solo and accompaniment algorithm enables real-time processing and is thus suitable for real-world applications. A study was conducted to better model the magnitude, frequency, and phase of isolated musical instrument tones. As a result of this study, temporal envelope smoothness, the inharmonicity of musical instruments, and phase expectation were exploited in the proposed separation method. Additionally, an algorithm for harmonic/percussive separation based on phase expectation was proposed; it shows improved perceptual quality with respect to state-of-the-art methods for harmonic/percussive separation. The proposed solo and accompaniment method obtained perceptual quality scores comparable to other state-of-the-art algorithms in the SiSEC 2011 and SiSEC 2013 evaluation campaigns, and outperformed the comparison algorithm on the instrumental dataset described in this thesis. As a use case of solo and accompaniment separation, a listening test was conducted to assess separation quality requirements in the context of music education. Its results showed that the solo and accompaniment tracks should be optimized according to different quality criteria. The music learning software Songs2See already integrates the proposed separation method in a commercially available application.
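
    The sinusoidal-modeling core can be illustrated as a least-squares fit of harmonics at multiples of the given pitch, with the accompaniment obtained by subtraction. This frame-level sketch omits the tone-object processing, attack refinement, and CAM stages described above, and all parameter values are assumptions.

```python
import numpy as np

def separate_frame(x, f0, sr, n_harm=20):
    """Fit a harmonic sinusoidal model at multiples of f0 to one frame x
    (1-D float array) and split it into solo and accompaniment (sketch).
    """
    t = np.arange(len(x)) / sr
    freqs = f0 * np.arange(1, n_harm + 1)
    freqs = freqs[freqs < sr / 2]              # keep harmonics below Nyquist
    # Real-valued basis of cosines and sines at the harmonic frequencies.
    E = np.exp(2j * np.pi * np.outer(t, freqs))
    basis = np.hstack([E.real, E.imag])
    # Least-squares amplitudes/phases of the harmonics in this frame.
    coeff, *_ = np.linalg.lstsq(basis, x, rcond=None)
    solo = basis @ coeff
    return solo, x - solo                      # solo estimate, accompaniment
```

    In a full system the fit would be constrained across the frames of a tone object (e.g. smooth envelopes, expected phase progression) rather than solved independently per frame.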

    Separation and Count Estimation of Audio Signal Sources with Time and Frequency Overlap

    Everyday audio recordings involve mixture signals: music contains a mixture of instruments; in a meeting or conference there is a mixture of human voices. For these mixtures, automatically separating the sources or estimating their number is a challenging task. A common assumption when processing mixtures in the time-frequency domain is that the sources do not fully overlap. In this work, however, we consider cases where the overlap is severe, for instance when instruments play the same note (unison) or when many people speak concurrently (the "cocktail party" scenario), which highlights the need for new representations and more powerful models. To address the problems of source separation and count estimation, we use conventional signal processing techniques as well as deep neural networks (DNNs). We first address the source separation problem for unison instrument mixtures, studying the distinct spectro-temporal modulations caused by vibrato. To exploit these modulations, we developed a method based on time warping, informed by an estimate of the fundamental frequency. For cases where such estimates are not available, we present an unsupervised model inspired by the way humans group time-varying sources (common fate). This contribution comes with a novel representation that improves separation of overlapped and modulated sources in unison mixtures, and also improves vocal and accompaniment separation when used as the input to a DNN model. We then focus on estimating the number of sources in a mixture, which is important for real-world scenarios. Our work on count estimation was motivated by a study of how humans address this task, which led us to conduct listening experiments confirming that humans can correctly estimate the number of sources only up to about four. To answer the question of whether machines can perform similarly, we present a DNN architecture trained to estimate the number of concurrent speakers. Our results show improvements compared to other methods, and the model even outperformed humans on the same task. In both the source separation and the source count estimation task, the key contribution of this thesis is the concept of "modulation", which is important to computationally mimic human performance. Our proposed Common Fate Transform is an adequate representation to disentangle overlapping signals for separation, and an inspection of our DNN count estimation model revealed that it proceeds by finding modulation-like intermediate features.
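
    A rough sketch of the Common Fate Transform idea: apply a 2-D Fourier transform over local patches of the STFT, so that components sharing the same modulation pattern concentrate in the same modulation bins. The patch size and the non-overlapping tiling below are simplifying assumptions; a full implementation would typically use overlapping patches.

```python
import numpy as np
from scipy.signal import stft

def common_fate_transform(x, sr, n_fft=1024, patch=(32, 16)):
    """Tile the STFT of x into (freq x time) patches and take a 2-D FFT
    of each patch, yielding per-patch modulation spectra (sketch).
    """
    _, _, X = stft(x, fs=sr, nperseg=n_fft)
    pf, pt = patch
    F = (X.shape[0] // pf) * pf                 # trim to whole patches
    T = (X.shape[1] // pt) * pt
    tiles = (X[:F, :T]
             .reshape(F // pf, pf, T // pt, pt)
             .transpose(0, 2, 1, 3))            # (f-patch, t-patch, pf, pt)
    return np.fft.fft2(tiles, axes=(-2, -1))    # per-patch modulation spectra
```

    Sources with a common modulation (e.g. shared vibrato) then become separable in the patch-wise modulation domain even when they overlap in time and frequency.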

    Musical source separation using time-frequency source priors

    High-resolution sinusoidal analysis for resolving harmonic collisions in music audio signal processing

    Many music signals can largely be considered an additive combination of multiple sources, such as musical instruments or voice. If the musical sources are pitched instruments, the spectra they produce are predominantly harmonic and are thus well suited to an additive sinusoidal model. However, due to resolution limits inherent in time-frequency analyses, when the harmonics of multiple sources occupy equivalent time-frequency regions, their individual properties are additively combined in the time-frequency representation of the mixed signal. Any such time-frequency point in a mixture where multiple harmonics overlap produces a single observation from which the contributions of the individual harmonics cannot be trivially deduced. These overlaps are referred to as overlapping partials or harmonic collisions. If one wishes to infer information about individual sources in music mixtures, the information carried in regions containing collided harmonics becomes unreliable due to interference from other sources. This interference has ramifications for a variety of music signal processing applications such as multiple fundamental frequency estimation, source separation, and instrumentation identification. This thesis addresses harmonic collisions in music signal processing applications. As a solution to the harmonic collision problem, a class of signal-subspace-based high-resolution sinusoidal parameter estimators is explored. Specifically, the direct matrix pencil method, or equivalently the Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT) method, is used with the goal of estimating the salient parameters of individual harmonics that occupy equivalent time-frequency regions. This estimation method is adapted here to be applicable to time-varying signals such as musical audio. While high-resolution methods have been explored before in the context of music signal processing, previous work has not addressed whether such methods truly produce high-resolution sinusoidal parameter estimates for real-world music audio signals. This thesis therefore answers the question of whether high-resolution sinusoidal parameter estimators are really high-resolution for real music signals, directly exploring the capability of this form of sinusoidal parameter estimation to resolve collided harmonics. The capabilities of the analysis method are also explored in the context of music signal processing applications: potential benefits of high-resolution sinusoidal analysis are examined in experiments involving multiple fundamental frequency estimation and audio source separation. This work shows that there are indeed benefits to high-resolution sinusoidal analysis in music signal processing applications, especially compared to methods that derive sinusoidal parameter estimates from more traditional time-frequency representations. The benefits are most evident in multiple fundamental frequency estimation, where substantial performance gains are observed. High-resolution analysis in the context of source separation based on computational auditory scene analysis shows performance similar to existing comparable methods.
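
    For readers unfamiliar with this estimator family, here is a minimal single-frame ESPRIT sketch: build a Hankel data matrix from the samples, take its dominant singular subspace, and read the frequencies off the rotational-invariance relation between that subspace and its one-sample shift. The model order and matrix dimensions are illustrative, and the sketch assumes a locally stationary frame; the thesis adapts such estimators to time-varying musical audio, which this static version does not address.

```python
import numpy as np

def esprit_freqs(x, n_sines, order=None):
    """Estimate the frequencies of n_sines real sinusoids in frame x via
    ESPRIT; returns normalized frequencies in cycles/sample (sketch).
    """
    order = order or len(x) // 2
    # Hankel data matrix: each row is a length-`order` window of x.
    H = np.lib.stride_tricks.sliding_window_view(x, order)
    # Signal subspace: each real sinusoid contributes two complex
    # exponentials, hence 2 * n_sines dominant singular vectors.
    U, _, _ = np.linalg.svd(H.T, full_matrices=False)
    Us = U[:, :2 * n_sines]
    # Rotational invariance: shifting the subspace by one sample equals
    # multiplication by diag(exp(2j*pi*f_k)); recover f_k by eigenanalysis.
    Phi = np.linalg.pinv(Us[:-1]) @ Us[1:]
    f = np.angle(np.linalg.eigvals(Phi)) / (2 * np.pi)
    return np.sort(f[f > 0])          # keep the positive of each +/- pair
```

    Multiplying the returned values by the sample rate gives frequencies in Hz; unlike FFT-based peak picking, the resolution is not limited by the frame length's Fourier bin width.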

    Source Separation for Hearing Aid Applications

    Automatic Drum Transcription and Source Separation

    While research has been carried out on automated polyphonic music transcription, to date the problem of automated polyphonic percussion transcription has not received the same degree of attention. A related problem is that of sound source separation, which attempts to separate a mixture signal into its constituent sources. This thesis focuses on the tasks of polyphonic percussion transcription and sound source separation for a limited set of drum instruments, namely the drums found in the standard rock/pop drum kit. As there was little previous research on polyphonic percussion transcription, a broad review of music information retrieval methods, including previous polyphonic percussion systems, was carried out to determine whether any of them were of potential use for polyphonic drum transcription. Following on from this, a review was conducted of general source separation and redundancy reduction techniques, such as Independent Component Analysis and Independent Subspace Analysis, as these techniques have shown potential in separating mixtures of sources. Upon completion of the review, it was decided that combining the blind separation approach of Independent Subspace Analysis (ISA) with the use of prior knowledge, as employed in music information retrieval methods, was the best approach to both polyphonic percussion transcription and sound source separation. A number of new algorithms which combine the use of prior knowledge with the source separation abilities of techniques such as ISA are presented. These include sub-band ISA, Prior Subspace Analysis (PSA), and an automatic modelling and grouping technique which is used in conjunction with PSA to perform polyphonic percussion transcription. These approaches are demonstrated to be effective for polyphonic percussion transcription, and PSA is also shown to be capable of transcribing drums in the presence of pitched instruments.
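
    A compact sketch of the Prior Subspace Analysis idea: project the spectrogram onto prior drum frequency templates to obtain rough per-drum amplitude envelopes, then unmix those envelopes with ICA to reduce crosstalk between drums. The template projection and the use of FastICA are illustrative assumptions; onset detection by peak picking on the unmixed envelopes, and matching ICA outputs back to specific drums, are omitted.

```python
import numpy as np
from sklearn.decomposition import FastICA

def prior_subspace_analysis(X, priors):
    """Sketch of PSA: X is a magnitude spectrogram (freq x time) and
    priors holds one prior drum spectrum per column (freq x n_drums).
    """
    envelopes = priors.T @ X                    # rough per-drum envelopes
    # ICA improves the independence of the envelopes; note that ICA
    # output order and scale are arbitrary and need re-matching to drums.
    ica = FastICA(n_components=priors.shape[1], random_state=0)
    return ica.fit_transform(envelopes.T).T     # (n_drums, time)
```

    Peaks in each unmixed envelope would then serve as candidate drum onsets for transcription.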