223 research outputs found

Large scale evaluation of importance maps in automatic speech recognition

Authors: Mandel, Michael I.; Trinh, Viet Anh
Publication venue: 'International Speech Communication Association'
Publication date: 21/05/2020

In this paper, we propose a metric that we call the structured saliency benchmark (SSBM) to evaluate importance maps computed for automatic speech recognizers on individual utterances. These maps indicate the time-frequency points of the utterance that are most important for correct recognition of a target word. Our evaluation technique is suitable not only for standard classification tasks but also for structured prediction tasks like sequence-to-sequence models. Additionally, we use this approach to perform a large-scale comparison between the importance maps created by our previously introduced technique, which uses "bubble noise" to identify important points through correlation, and a baseline approach based on smoothed speech energy and forced alignment. Our results show that the bubble analysis approach is better at identifying important speech regions than this baseline on 100 sentences from the AMI corpus.

Comment: submitted to INTERSPEECH 202
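The abstract says only that important points are identified "through correlation". As a rough illustration of that idea, and not the authors' implementation, the sketch below correlates, for each time-frequency point, whether the point was audible in a noisy mixture with whether the recognizer got the target word right. The names `pearson` and `importance_map`, the binary mask layout, and the toy data are all assumptions made for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return cov / (norm_x * norm_y)


def importance_map(masks, correct):
    """Correlate audibility of each point with recognition correctness.

    masks:   one flattened binary mask per noisy mixture (1 = point audible).
             This layout is a hypothetical choice for the sketch.
    correct: one 0/1 flag per mixture (1 = target word recognized).
    """
    n_points = len(masks[0])
    return [pearson([m[p] for m in masks], correct) for p in range(n_points)]


# Toy data: four mixtures, three time-frequency points.
masks = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 0]]
correct = [1, 1, 0, 0]
scores = importance_map(masks, correct)
```

In this toy data only the first point is audible exactly when recognition succeeds, so it receives the highest score; the other two points are uncorrelated with correctness.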

arXiv.org e-Print Archive

Crossref

EM localization and separation using interaural level and phase cues

Authors: Ellis, Daniel P. W.; Mandel, Michael I.
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2007

We describe a system for localizing and separating multiple sound sources from a reverberant two-channel recording. It consists of a probabilistic model of interaural level and phase differences and an EM algorithm for finding the maximum likelihood parameters of this model. By assigning points in the interaural spectrogram probabilistically to sources with the best-fitting parameters and then estimating the parameters of the sources from the points assigned to them, the system is able to separate and localize more sound sources than there are available channels. It is also able to estimate frequency-dependent level differences of sources in a mixture that correspond well to those measured in isolation. In experiments in simulated anechoic and reverberant environments, the proposed system improved the signal-to-noise ratio of target sources by 2.7 and 3.4 dB more than two comparable algorithms on average.
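For readers unfamiliar with the cues named above, the following minimal sketch computes an interaural level difference (ILD) and interaural phase difference (IPD) for a single time-frequency point, using their standard definitions. The function name `interaural_cues`, the `eps` floor, and the example values are assumptions for illustration; the paper's probabilistic model built on top of these cues is not reproduced here.

```python
import cmath
import math


def interaural_cues(left, right, eps=1e-12):
    """Return (ILD in dB, IPD in radians) for one time-frequency point.

    left, right: complex short-time Fourier coefficients of the two channels.
    eps: small floor (an assumption of this sketch) that avoids log(0).
    """
    # Level difference: ratio of channel magnitudes, expressed in decibels.
    ild_db = 20.0 * math.log10(max(abs(left), eps) / max(abs(right), eps))
    # Phase difference: angle of left relative to right, wrapped to (-pi, pi].
    ipd_rad = cmath.phase(left * right.conjugate())
    return ild_db, ipd_rad


# Example: left channel twice as strong and a quarter cycle behind the right.
ild, ipd = interaural_cues(2 + 0j, 0 + 1j)
```

Because the IPD is only known modulo 2π, delays are ambiguous at higher frequencies, which is one reason a probabilistic treatment of these cues is useful.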

CiteSeerX

Crossref

Columbia University Academic Commons

An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments

Authors: Ellis, Daniel P. W.; Jebara, Tony; Mandel, Michael I.
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2007

We present a method for localizing and separating sound sources in stereo recordings that is robust to reverberation and does not make any assumptions about the source statistics. The method consists of a probabilistic model of binaural multisource recordings and an expectation-maximization algorithm for finding the maximum likelihood parameters of that model. These parameters include distributions over delays and assignments of time-frequency regions to sources. We evaluate this method against two comparable algorithms on simulations of simultaneous speech from two or three sources. Our method outperforms the others in anechoic conditions and performs as well as the better of the two in the presence of reverberation.
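The alternation between soft assignment and parameter re-estimation can be sketched with a deliberately simplified stand-in: a one-dimensional Gaussian mixture fitted by EM. This is not the paper's model, which is defined over delays and time-frequency regions; the function `em_gmm_1d`, the variance floor, and the toy observations are assumptions chosen only to show the E-step and M-step structure.

```python
import math


def em_gmm_1d(xs, init_means, n_iter=20, var_floor=1e-3):
    """Fit a 1-D Gaussian mixture with EM; return means, variances, weights, responsibilities."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: soft-assign each observation to each source.
        resp = []
        for x in xs:
            likes = [
                w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(likes) or 1e-300  # guard against total underflow
            resp.append([like / total for like in likes])
        # M-step: re-estimate each source from the points softly assigned to it.
        for j in range(k):
            n_j = sum(r[j] for r in resp) or 1e-300
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / n_j
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / n_j
            variances[j] = max(var_floor, var_j)
            weights[j] = n_j / len(xs)
    return means, variances, weights, resp


# Toy observations clustered around -6 and +6 (for instance, level differences in dB).
xs = [-6.1, -5.9, -6.0, 5.8, 6.2, 6.0]
means, variances, weights, resp = em_gmm_1d(xs, init_means=[-1.0, 1.0])
```

Each row of `resp` plays the role of a soft mask entry: it gives the probability that an observation belongs to each source.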

Columbia University Academic Commons

Improving MIDI-audio alignment with acoustic features

Authors: Devaney, Johanna; Ellis, Daniel P. W.; Mandel, Michael I.
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2009

This paper describes a technique to improve the accuracy of dynamic time warping-based MIDI-audio alignment. The technique implements a hidden Markov model that uses aperiodicity and power estimates from the signal as observations and the results of a dynamic time warping alignment as a prior. In addition to improving the overall alignment, this technique also identifies the transient and steady-state sections of each note. This information is important for describing various aspects of a musical performance, including both pitch and rhythm.
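The dynamic time warping step that the technique starts from can be sketched as follows. This is the textbook algorithm on scalar features, not the hidden Markov model refinement the paper proposes; the function name `dtw_path`, the absolute-difference cost, and the pitch sequences are assumptions for illustration.

```python
def dtw_path(a, b):
    """Align non-empty sequences a and b with classic DTW; return (total_cost, path)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = abs(a[i - 1] - b[j - 1])
            cost[i][j] = dist + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Score pitches (MIDI note numbers) against a performance that holds the first note longer.
score = [60, 62, 64]
performance = [60, 60, 62, 64]
total, path = dtw_path(score, performance)
```

The returned path pairs each score index with the performance frames it maps to, which is the kind of coarse alignment the abstract describes using as a prior.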

Crossref

Columbia University Academic Commons

Combining Localization Cues and Source Model Constraints for Binaural Source Separation

Authors: Ellis, Daniel P. W.; Mandel, Michael I.; Weiss, Ron J.
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2011

We describe a system for separating multiple sources from a two-channel recording based on interaural cues and prior knowledge of the statistics of the underlying source signals. The proposed algorithm effectively combines information derived from low-level perceptual cues, similar to those used by the human auditory system, with higher-level information related to speaker identity. We combine a probabilistic model of the observed interaural level and phase differences with a prior model of the source statistics and derive an EM algorithm for finding the maximum likelihood parameters of the joint model. The system is able to separate more sound sources than there are observed channels in the presence of reverberation. In simulated mixtures of speech from two and three speakers, the proposed algorithm gives a signal-to-noise ratio improvement of 1.7 dB over a baseline algorithm that uses only interaural cues. Further improvement is obtained by incorporating eigenvoice speaker adaptation to enable the source model to better match the sources present in the signal. This improves performance over the baseline by 2.7 dB when the speakers used for training and testing are matched. However, the improvement is minimal when the test data is very different from that used in training.

Columbia University Academic Commons