276 research outputs found

    Separation and count estimation of audio signal sources with time and frequency overlap

    Everyday audio recordings involve mixture signals: music contains a mixture of instruments; in a meeting or conference, there is a mixture of human voices. For these mixtures, automatically separating or estimating the number of sources is a challenging task. A common assumption when processing mixtures in the time-frequency domain is that sources are not fully overlapped. However, in this work we consider some cases where the overlap is severe, for instance when instruments play the same note (unison) or when many people speak concurrently ("cocktail party"); such cases highlight the need for new representations and more powerful models. To address the problems of source separation and count estimation, we use conventional signal processing techniques as well as deep neural networks (DNNs). We first address the source separation problem for unison instrument mixtures, studying the distinct spectro-temporal modulations caused by vibrato. To exploit these modulations, we developed a method based on time warping, informed by an estimate of the fundamental frequency. For cases where such estimates are not available, we present an unsupervised model inspired by the way humans group time-varying sources (common fate). This contribution comes with a novel representation that improves separation for overlapped and modulated sources in unison mixtures, and also improves vocal and accompaniment separation when used as the input to a DNN model. Then, we focus on estimating the number of sources in a mixture, which is important for real-world scenarios. Our work on count estimation was motivated by a study on how humans address this task, which led us to conduct listening experiments confirming that humans can correctly estimate the number of sources only up to four. To answer the question of whether machines can perform similarly, we present a DNN architecture trained to estimate the number of concurrent speakers. Our results show improvements compared to other methods, and the model even outperformed humans on the same task. In both the source separation and source count estimation tasks, the key contribution of this thesis is the concept of "modulation", which is important to computationally mimic human performance. Our proposed Common Fate Transform is an adequate representation to disentangle overlapping signals for separation, and an inspection of our DNN count estimation model revealed that it proceeds to find modulation-like intermediate features.
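
    A minimal sketch may make the patch-based idea behind a Common-Fate-style representation concrete: the magnitude STFT is cut into fixed-size time-frequency patches and a 2D FFT is applied to each patch, so that amplitude and frequency modulations such as vibrato appear along dedicated modulation axes. All sizes below are illustrative assumptions, not the parameters used in the thesis.

    # Sketch of a Common-Fate-style tensor: STFT -> patches -> per-patch 2D FFT.
    import numpy as np
    from scipy.signal import stft

    def common_fate_tensor(x, fs, n_fft=1024, hop=256, patch_f=32, patch_t=16):
        _, _, X = stft(x, fs, nperseg=n_fft, noverlap=n_fft - hop)
        V = np.abs(X)
        # Trim so the spectrogram tiles exactly into (patch_f x patch_t) patches.
        F = (V.shape[0] // patch_f) * patch_f
        T = (V.shape[1] // patch_t) * patch_t
        V = V[:F, :T]
        # Block-reshape into patches: (n_f_patches, patch_f, n_t_patches, patch_t).
        P = V.reshape(F // patch_f, patch_f, T // patch_t, patch_t)
        # 2D FFT over each patch: the patch axes become modulation frequencies.
        return np.fft.fft2(P, axes=(1, 3))

    if __name__ == "__main__":
        fs = 16000
        t = np.arange(2 * fs) / fs
        # Synthetic vibrato tone: 440 Hz carrier with a 6 Hz frequency modulation.
        x = np.sin(2 * np.pi * 440 * t + 3 * np.sin(2 * np.pi * 6 * t))
        print(common_fate_tensor(x, fs).shape)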

    Audio source separation for music in low-latency and high-latency scenarios

    This thesis proposes specific methods to address the limitations of current music source separation methods in low-latency and high-latency scenarios. First, we focus on methods with low computational cost and low latency. We propose the use of Tikhonov regularization as a method for spectrum decomposition in the low-latency context. We compare it to existing techniques in pitch estimation and tracking tasks, crucial steps in many separation methods. We then use the proposed spectrum decomposition method in low-latency separation tasks targeting singing voice, bass and drums. Second, we propose several high-latency methods that improve the separation of singing voice by modeling components that are often not accounted for, such as breathiness and consonants. Finally, we explore using temporal correlations and human annotations to enhance the separation of drums and complex polyphonic music signals.
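
    To illustrate why Tikhonov regularization suits the low-latency setting, the sketch below decomposes a magnitude-spectrum frame over a fixed basis B (one column per spectral template; the basis here is random and purely hypothetical). Unlike iterative factorizations, the regularized least-squares solution is closed-form, so after a one-off precomputation each incoming frame costs a single matrix-vector product. Note that, unlike NMF, the resulting gains are not guaranteed non-negative.

    # Closed-form Tikhonov (ridge) decomposition of a spectrum frame: g = (B^T B + lam I)^-1 B^T x.
    import numpy as np

    def tikhonov_decomposer(B, lam=0.1):
        # Precompute the solve operator once; reuse it for every incoming frame.
        K = np.linalg.solve(B.T @ B + lam * np.eye(B.shape[1]), B.T)
        return lambda x: K @ x  # template gains for one magnitude-spectrum frame

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        B = np.abs(rng.standard_normal((513, 40)))   # 40 toy spectral templates
        decompose = tikhonov_decomposer(B)
        g_true = np.zeros(40)
        g_true[[3, 17]] = 1.0                        # two active templates
        x = B @ g_true + 0.01 * rng.standard_normal(513)
        print(np.argsort(decompose(x))[-2:])         # strongest activations: 3 and 17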

    An investigation into glottal waveform based speech coding

    Coding of voiced speech by extraction of the glottal waveform has shown promise in improving the efficiency of speech coding systems. This thesis describes an investigation into the performance of such a system. The effect of reverberation on the radiation impedance at the lips is shown to be negligible under normal conditions. Also, the accuracy of the Image Method for adding artificial reverberation to anechoic speech recordings is established. A new algorithm, Pre-emphasised Maximum Likelihood Epoch Detection (PMLED), for Glottal Closure Instant detection is proposed. The algorithm is tested on natural speech and is shown to be both accurate and robust. Two techniques for glottal waveform estimation, Closed Phase Inverse Filtering (CPIF) and Iterative Adaptive Inverse Filtering (IAIF), are compared. In tandem with an LF model fitting procedure, both techniques display a high degree of accuracy; however, IAIF is found to be slightly more robust. Based on these results, a Glottal Excited Linear Predictive (GELP) coding system for voiced speech is proposed and tested. Using a differential LF parameter quantisation scheme, the system achieves speech quality similar to that of US Federal Standard 1016 CELP at a lower mean bit rate while incurring no extra delay.
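
    As a rough illustration of the inverse-filtering operation underlying both CPIF and IAIF, the sketch below fits an all-pole vocal-tract model to a pre-emphasized speech frame and filters the frame with the inverse model, leaving an approximation of the glottal source. This is a single estimate-and-cancel pass under simplified assumptions, not the thesis's PMLED/GELP system; IAIF proper iterates the step.

    # One-pass LPC inverse filtering: estimate A(z), then remove the vocal-tract resonances.
    import numpy as np
    from scipy.linalg import toeplitz
    from scipy.signal import lfilter

    def lpc(x, order):
        # Autocorrelation-method LPC: solve the Yule-Walker normal equations.
        r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
        a = np.linalg.solve(toeplitz(r[:order]), r[1:order + 1])
        return np.concatenate(([1.0], -a))           # inverse-filter coefficients A(z)

    def inverse_filter(frame, order=16, mu=0.97):
        # Pre-emphasis flattens the spectral tilt of glottis and lip radiation,
        # concentrating the LPC fit on the vocal-tract resonances.
        emphasized = lfilter([1.0, -mu], [1.0], frame)
        A = lpc(emphasized, order)
        # The residual approximates the glottal flow derivative; integrate once.
        residual = lfilter(A, [1.0], frame)
        return np.cumsum(residual - residual.mean())

    if __name__ == "__main__":
        fs = 16000
        n = int(0.03 * fs)
        # Toy voiced frame: a 100 Hz impulse train through one formant resonance.
        exc = (np.arange(n) % (fs // 100) == 0).astype(float)
        pole = 0.98 * np.exp(2j * np.pi * 500 / fs)
        frame = lfilter([1.0], np.poly([pole, pole.conj()]).real, exc)
        print(inverse_filter(frame).shape)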

    Audio source separation techniques including novel time-frequency representation tools

    The thesis explores the development of tools for audio representation with applications in Audio Source Separation and in the Music Information Retrieval (MIR) field. A novel constant-Q transform, called the IIR-CQT, is introduced. The transform allows a flexible design and achieves low computational cost. An independent development of the Fan Chirp Transform (FChT), focused on the representation of simultaneous sources, is also studied; it has several applications in the analysis of polyphonic music signals. Different applications are explored in the MIR field, some of them directly related to the low-level representation tools that were analyzed. One of these applications is a visualization tool based on the FChT that proved to be useful for musicological analysis. The tool has been released as free, open-source software. The proposed transform has also been used to detect and track fundamental frequencies of harmonic sources in polyphonic music. In addition, the slope of the pitch was used to define a similarity measure between two harmonic components that are close in time; this measure helps clustering algorithms track multiple sources in polyphonic music. The FChT was also used in the context of a Query by Humming application. One of the main limitations of such applications is the construction of the search database. In this work, we propose an algorithm to automatically populate the database of an existing Query by Humming system, with promising results. Finally, two audio source separation techniques are studied. The first is the separation of harmonic signals based on the FChT. The second is an application for which the fundamental frequency of the sources is assumed to be known (the Score-Informed Source Separation problem).
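
    The constant-Q property itself is easy to demonstrate with a brute-force transform: each bin gets a geometrically spaced centre frequency and a window whose length keeps the frequency-to-bandwidth ratio constant. The direct form below is deliberately naive (the IIR-CQT exists precisely to avoid its cost), and the parameter values are illustrative.

    # Naive direct-form constant-Q transform for a single analysis frame.
    import numpy as np

    def naive_cqt_frame(x, fs, fmin=55.0, bins_per_octave=12, n_bins=48):
        Q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1.0)
        out = np.zeros(n_bins, dtype=complex)
        for k in range(n_bins):
            fk = fmin * 2 ** (k / bins_per_octave)
            Nk = int(np.ceil(Q * fs / fk))           # longer windows for lower bins
            n = np.arange(min(Nk, len(x)))
            w = np.hanning(len(n))
            out[k] = np.sum(x[:len(n)] * w * np.exp(-2j * np.pi * fk * n / fs)) / len(n)
        return out

    if __name__ == "__main__":
        fs = 22050
        t = np.arange(fs) / fs
        x = np.sin(2 * np.pi * 220 * t)              # A3
        C = np.abs(naive_cqt_frame(x, fs))
        print(np.argmax(C))                          # expect bin 24: two octaves above 55 Hz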

    Analysis and resynthesis of polyphonic music

    Get PDF
    This thesis examines applications of Digital Signal Processing to the analysis, transformation, and resynthesis of musical audio. First I give an overview of the human perception of music. I then examine in detail the requirements for a system that can analyse, transcribe, process, and resynthesise monaural polyphonic music, and describe and compare the possible hardware and software platforms. After this I describe a prototype hybrid system that attempts to carry out these tasks using a method based on additive synthesis. Next I present results from its application to a variety of musical examples, and critically assess its performance and limitations. I then address these issues in the design of a second system based on Gabor wavelets. I conclude by summarising the research and outlining suggestions for future developments.
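
    The additive-synthesis resynthesis step can be sketched in a few lines: given per-frame frequency and amplitude tracks for each partial (fabricated here; in practice obtained from spectral peak picking and tracking), sinusoids are summed with linearly interpolated parameters and phase obtained by integrating frequency. This is a generic sketch, not the prototype system described above.

    # Additive resynthesis from per-frame partial tracks.
    import numpy as np

    def additive_resynthesis(freqs, amps, hop, fs):
        # freqs, amps: (n_partials, n_frames) parameter tracks.
        n_samples = (freqs.shape[1] - 1) * hop
        frame_pos = np.arange(freqs.shape[1]) * hop
        sample_pos = np.arange(n_samples)
        y = np.zeros(n_samples)
        for f_track, a_track in zip(freqs, amps):
            f = np.interp(sample_pos, frame_pos, f_track)   # sample-rate frequency
            a = np.interp(sample_pos, frame_pos, a_track)   # sample-rate amplitude
            phase = 2 * np.pi * np.cumsum(f) / fs           # integrate frequency
            y += a * np.sin(phase)
        return y

    if __name__ == "__main__":
        fs, hop, n_frames = 44100, 512, 100
        # Two partials gliding upward, the second one fading out.
        freqs = np.vstack([np.linspace(220, 260, n_frames),
                           np.linspace(440, 520, n_frames)])
        amps = np.vstack([np.full(n_frames, 0.5),
                          np.linspace(0.5, 0.0, n_frames)])
        print(additive_resynthesis(freqs, amps, hop, fs).shape)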

    Gendering the Virtual Space: Sonic Femininities and Masculinities in Contemporary Top 40 Music

    This dissertation analyzes vocal placement—the apparent location of a voice in the virtual space created by a recording—and its relationship to gender. When listening to a piece of recorded music through headphones or stereo speakers, one hears various sound sources as though they were located in a virtual space (Clarke 2013). For instance, a specific vocal performance—once manipulated by various technologies in a recording studio—might evoke a concert hall, an intimate setting, or an otherworldly space. The placement of the voice within this space is one of the central musical parameters through which listeners ascribe cultural meanings to popular music. I develop an original methodology for analyzing vocal placement in recorded popular music. Combining close listening with music information retrieval tools, I precisely locate a voice’s placement in virtual space according to five parameters: (1) Width, (2) Pitch Height, (3) Prominence, (4) Environment, and (5) Layering. I use the methodology to conduct close and distant readings of vocal placement in twenty-first-century Anglo-American popular music. First, an analysis of “Love the Way You Lie” (2010), by Eminem feat. Rihanna, showcases how the methodology can be used to support close readings of individual songs. Through my analysis, I suggest that Rihanna’s wide vocal placement evokes a nexus of conflicting emotions in the wake of domestic violence. Eminem’s narrow placement, conversely, expresses anger, frustration, and violence. Second, I use the analytical methodology to conduct a larger-scale study of vocal placement in a corpus of 113 post-2008 Billboard chart-topping collaborations between two or more artists. By stepping away from close readings of individual songs, I show how gender stereotypes are engineered en masse in the popular music industry. I show that women artists are generally assigned vocal placements that are wider, more layered, and more reverberated than those of men. This vocal placement configuration—exemplified in “Love the Way You Lie”—creates a sonic contrast that presents women’s voices as ornamental and diffuse, and men’s voices as direct and relatable. I argue that these contrasting vocal placements sonically construct a gender binary, exemplifying one of the ways in which dichotomous conceptions of gender are reinforced through the sound of popular music.
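
    As an illustration of how such parameters can be quantified, the sketch below computes a toy proxy for Width on a stereo vocal stem: the ratio of side-channel to mid-channel energy, which is zero for a mono (centred) voice and grows as the channels decorrelate. This metric is a hypothetical stand-in, not the dissertation's methodology.

    # Toy stereo-width proxy: side-to-mid energy ratio of a stereo stem.
    import numpy as np

    def stereo_width(left, right, eps=1e-12):
        mid = 0.5 * (left + right)     # content common to both channels
        side = 0.5 * (left - right)    # content that differs between channels
        return np.sum(side ** 2) / (np.sum(mid ** 2) + eps)  # 0 = mono, larger = wider

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        v = rng.standard_normal(44100)
        print(stereo_width(v, v))                                          # identical channels: 0.0
        print(stereo_width(v, 0.5 * v + 0.5 * rng.standard_normal(44100))) # decorrelated: wider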

    Automatic determination of vocal ensemble intonation

    The objective of this study is a specific music signal processing task, primarily intended to help vocal ensemble singers practice their intonation. Here intonation is defined as small deviations of pitch, less than a semitone, relative to the note written in the score; these can be either intentional or unintentional. Practicing intonation is typically challenging without an external ear. The algorithm developed in this thesis, combined with the presented application concept, can act as that external ear, providing real-time information on intonation to support practicing. The method can also be applied to the analysis of recorded material. The music signal generated by a vocal ensemble is polyphonic: it contains multiple simultaneous tones with partly or completely overlapping harmonic partials. We need to be able to estimate the fundamental frequency of each tone, which in turn indicates the pitch each singer is producing. Our experiments show that the fundamental frequency estimation method developed in this thesis, based on Fourier analysis, can be applied to the automatic analysis of vocal ensemble intonation when the chord written in the score is used as prior information. A sufficient frequency resolution can be achieved without compromising the time resolution too much by using an adequately sized window. The accuracy and robustness can be further increased by taking advantage of solitary partials. The greatest challenge turned out to be the estimation of tones in octave and unison relationships; these intervals are fairly common in tonal music, so this question requires further investigation or another type of approach.
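
    The score-informed measurement idea can be sketched as follows: choose an analysis window long enough for the required frequency resolution, search for the spectral peak within a small range around the pitch written in the score, and refine it by parabolic interpolation for sub-bin accuracy. Window length and search range below are illustrative assumptions.

    # Score-informed F0 refinement: windowed FFT, peak search near the score pitch,
    # parabolic interpolation of the log-magnitude peak for sub-bin accuracy.
    import numpy as np

    def refine_f0(frame, fs, f_expected, search_semitones=0.5):
        w = np.hanning(len(frame))
        spec = np.abs(np.fft.rfft(frame * w))
        freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
        lo = f_expected * 2 ** (-search_semitones / 12)
        hi = f_expected * 2 ** (search_semitones / 12)
        band = np.where((freqs >= lo) & (freqs <= hi))[0]
        k = band[np.argmax(spec[band])]
        # Fit a parabola through the three log-magnitude bins around the peak.
        a, b, c = np.log(spec[k - 1:k + 2] + 1e-12)
        delta = 0.5 * (a - c) / (a - 2 * b + c)
        return (k + delta) * fs / len(frame)

    if __name__ == "__main__":
        fs, f0 = 44100, 442.3            # singer slightly sharp of A4 = 440 Hz
        n = 8192                         # ~5.4 Hz bin spacing before interpolation
        t = np.arange(n) / fs
        frame = np.sin(2 * np.pi * f0 * t)
        print(refine_f0(frame, fs, f_expected=440.0))   # close to 442.3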