Search CORE

381 research outputs found

Trennung und Schätzung der Anzahl von Audiosignalquellen mit Zeit- und Frequenzüberlappung [Separation and Count Estimation of Audio Signal Sources with Time and Frequency Overlap]

Author: Stöter Fabian-Robert
Publication date: 01/01/2020

Everyday audio recordings involve mixture signals: music contains a mixture of instruments; in a meeting or conference, there is a mixture of human voices. For these mixtures, automatically separating the sources or estimating their number is a challenging task. A common assumption when processing mixtures in the time-frequency domain is that sources are not fully overlapped. However, in this work we consider cases where the overlap is severe, for instance when instruments play the same note (unison) or when many people speak concurrently ("cocktail party"), which highlights the need for new representations and more powerful models. To address the problems of source separation and count estimation, we use conventional signal processing techniques as well as deep neural networks (DNN). We first address the source separation problem for unison instrument mixtures, studying the distinct spectro-temporal modulations caused by vibrato. To exploit these modulations, we developed a method based on time warping, informed by an estimate of the fundamental frequency. For cases where such estimates are not available, we present an unsupervised model, inspired by the way humans group time-varying sources (common fate). This contribution comes with a novel representation that improves separation for overlapped and modulated sources on unison mixtures and also improves vocal and accompaniment separation when used as an input for a DNN model. Then, we focus on estimating the number of sources in a mixture, which is important for real-world scenarios. Our work on count estimation was motivated by a study on how humans address this task, which led us to conduct listening experiments, confirming that humans can correctly estimate the number of sources only for up to four sources. To answer the question of whether machines can perform similarly, we present a DNN architecture, trained to estimate the number of concurrent speakers.
Our results show improvements compared to other methods, and the model even outperformed humans on the same task. In both the source separation and source count estimation tasks, the key contribution of this thesis is the concept of “modulation”, which is important to computationally mimic human performance. Our proposed Common Fate Transform is an adequate representation to disentangle overlapping signals for separation, and an inspection of our DNN count estimation model revealed that it proceeds to find modulation-like intermediate features.

Exploiting primitive grouping constraints for noise robust automatic speech recognition: studies with simultaneous speech

Author: Coy André
Publication venue: 'University of Sheffield Conference Proceedings'
Publication date: 01/01/2008

Significant strides have been made in the field of automatic speech recognition over the past three decades. However, the systems are not robust; their performance degrades in the presence of even moderate amounts of noise. This thesis presents an approach to developing a speech recognition system that takes inspiration from the approach of human speech recognition

Physiology, Psychoacoustics and Cognition in Normal and Impaired Hearing

Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

Neural processing of natural sounds

Author: A Calabrese, A Hsu, A Srivastava, AC Arcadi, AE Hoerl, AJ Doupe, AM Aertsen, AS Feng, B Hedwig, BN Pasley, C Huetz, C Koppl, C Perrodin, C Vignal, CA Atencio, CD Meliza, CF Moss, CI Petkov, CT Miller, D Al-Mana, D Margoliash, D Suta, DA Depireux, DH Sanes, DL Cheney, DL Ruderman, DM Schneider, E Schneidman, EC Smith, EP Simoncelli, ES Fortune, F Libersat, F Menardy, F Rieke, FA Rodriguez, FE Theunissen, Frédéric E. Theunissen, G Kim, GB Christianson, GB Keller, GJL Beckers, GW Gifford, H Asari, H Attias, H Kover, HC Gerhardt, I George, I Nelken, J Fritz, J Newman, JA Garcia-Lazaro, JA Grace, JA Miranda, JC Middlebrooks, JD Chen, JDW Greenlee, JE Elie, JF Prather, JH McDermott, JJ Eggermont, JM Grimsley, JM McFarland, JP Rauschecker, JS McCasland, Julie E. Elie, KI Nagel, KMM Walker, L Cohen, L Robles, LM Miller, M Kawasaki, M Penna, M Regev, M Sahani, M Schafer, M Ter-Mikaelian, MA Bee, MA Escabi, MB Ahrens, ME Hauber, MM Solis, MN Wallace, MS Lewicki, N Amin, N Ding, N Mesgarani, N Suga, N Ulanovsky, NC Rabinowitz, NC Singh, NL Carlson, O Darrigol, P Gill, P Marler, PR Marler, PZ Marmarelis, RC Moore, RF Voss, RM Seyfarth, S Andoni, SF Volman, SJ Eliades, SM Woolley, SMN Woolley, SV David, T Aubin, T Chi, T Sharpee, TM Elliott, TQ Gentner, V Bruns, W Bialek, W Teramoto, YE Cohen, YI Fishman
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

Natural sounds include animal vocalizations, environmental sounds such as wind, water and fire noises, and non-vocal sounds made by animals and humans for communication. These natural sounds have characteristic statistical properties that make them perceptually salient and that drive auditory neurons in optimal regimes for information transmission. Recent advances in statistics and computer sciences have allowed neurophysiologists to extract the stimulus-response function of complex auditory neurons from responses to natural sounds. These studies have shown a hierarchical processing that leads to the neural detection of progressively more complex natural sound features and have demonstrated the importance of the acoustical and behavioral contexts for the neural responses. High-level auditory neurons have been shown to be exquisitely selective for conspecific calls. This fine selectivity could play an important role for species recognition, for vocal learning in songbirds and, in the case of bats, for the processing of the sounds used in echolocation. Research that investigates how communication sounds are categorized into behaviorally meaningful groups (e.g. call types in animals, words in human speech) remains in its infancy. Animals and humans also excel at separating communication sounds from each other and from background noise. Neurons that detect communication calls in noise have been found, but the neural computations involved in sound source separation and natural auditory scene analysis remain overall poorly understood.
Thus, future auditory research will have to focus not only on how natural sounds are processed by the auditory system but also on the computations that allow for this processing to occur in natural listening situations. The complexity of the computations needed in the natural hearing task might require a high-dimensional representation provided by an ensemble of neurons, and the use of natural sounds might be the best solution for understanding the ensemble neural code.

eScholarship - University of California

Sound Object Recognition

Author: Patil Kailash
Publication venue: 'The Busan Gyeongnam Mathematical Society'
Publication date: 03/10/2018

Humans are constantly exposed to a variety of acoustic stimuli ranging from music and speech to more complex acoustic scenes like a noisy marketplace. The human auditory perception mechanism is able to analyze these different kinds of sounds and extract meaningful information, suggesting that the same processing mechanism is capable of representing different sound classes. In this thesis, we test this hypothesis by proposing a high-dimensional sound object representation framework that captures the various modulations of sound by performing a multi-resolution mapping. We then show that this model is able to capture a wide variety of sound classes (speech, music, soundscapes) by applying it to the tasks of speech recognition, speaker verification, musical instrument recognition and acoustic soundscape recognition. We propose a multi-resolution analysis approach that captures the detailed variations in the spectral characteristics as a basis for recognizing sound objects. We then show how such a system can be fine-tuned to capture both the message information (speech content) and the messenger information (speaker identity). This system is shown to outperform state-of-the-art systems for noise robustness at both automatic speech recognition and speaker verification tasks. The proposed analysis scheme, with the included ability to analyze temporal modulations, was used to capture musical sound objects. We showed that, using a model of cortical processing, we were able to accurately replicate human perceptual similarity judgments and also to obtain good classification performance on a large set of musical instruments. We also show that neither the spectral features alone nor the marginals of the proposed model are sufficient to capture human perception. Moreover, we were able to extend this model to continuous musical recordings by proposing a new method to extract notes from the recordings.
Complex acoustic scenes like a sports stadium have multiple sources producing sounds at the same time. We show that the proposed representation scheme can not only capture these complex acoustic scenes but also provides a flexible mechanism to adapt to target sources of interest. The human auditory perception system is known to be a complex system with both bottom-up analysis pathways and top-down feedback mechanisms. The top-down feedback enhances the output of the bottom-up system to better realize the target sounds. In this thesis we propose an implementation of a top-down attention module that is complementary to the high-dimensional acoustic feature extraction mechanism. This attention module is a distributed system operating at multiple stages of representation, effectively acting as a retuning mechanism that adapts the same system to different tasks. We showed that such an adaptation mechanism is able to substantially improve the performance of the system at detecting the target source in the presence of various distracting background sources.

JScholarship

Physiology, Psychoacoustics and Cognition in Normal and Impaired Hearing

Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

Dissertations of the University of Groningen