Large scale evaluation of importance maps in automatic speech recognition
In this paper, we propose a metric that we call the structured saliency
benchmark (SSBM) to evaluate importance maps computed for automatic speech
recognizers on individual utterances. These maps indicate time-frequency points
of the utterance that are most important for correct recognition of a target
word. Our evaluation technique is not only suitable for standard classification
tasks, but is also appropriate for structured prediction tasks like
sequence-to-sequence models. Additionally, we use this approach to perform a
large scale comparison of the importance maps created by our previously
introduced technique using "bubble noise" to identify important points through
correlation with a baseline approach based on smoothed speech energy and forced
alignment. Our results show that the bubble analysis approach is better at
identifying important speech regions than this baseline on 100 sentences from
the AMI corpus. Comment: submitted to INTERSPEECH 202
EM localization and separation using interaural level and phase cues
We describe a system for localizing and separating multiple sound sources from a reverberant two-channel recording. It consists of a probabilistic model of interaural level and phase differences and an EM algorithm for finding the maximum likelihood parameters of this model. By assigning points in the interaural spectrogram probabilistically to sources with the best-fitting parameters and then estimating the parameters of the sources from the points assigned to them, the system is able to separate and localize more sound sources than there are available channels. It is also able to estimate frequency-dependent level differences of sources in a mixture that correspond well to those measured in isolation. In experiments in simulated anechoic and reverberant environments, the proposed system improved the signal-to-noise ratio of target sources by 2.7 and 3.4 dB more, on average, than two comparable algorithms.
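The interaural cues named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `interaural_cues`, the left-over-right sign convention, the dB scaling, and the `eps` guard are all assumptions made for illustration. The inputs are assumed to be complex spectrograms of the left and right channels with identical shapes.

```python
import numpy as np

def interaural_cues(left_spec, right_spec, eps=1e-12):
    """Return (ild_db, ipd_rad) for two complex spectrograms of equal shape.

    Assumed convention (not taken from the paper): cues are left relative to right.
    """
    left_spec = np.asarray(left_spec, dtype=complex)
    right_spec = np.asarray(right_spec, dtype=complex)
    # Interaural level difference in dB; eps avoids division by zero and log(0).
    ild_db = 20.0 * np.log10((np.abs(left_spec) + eps) / (np.abs(right_spec) + eps))
    # Interaural phase difference in radians, wrapped to (-pi, pi].
    ipd_rad = np.angle(left_spec * np.conj(right_spec))
    return ild_db, ipd_rad

# Toy example: the left channel is twice as loud and leads by 0.5 rad everywhere.
right = np.ones((3, 4), dtype=complex)
left = 2.0 * np.exp(1j * 0.5) * right
ild, ipd = interaural_cues(left, right)
```

In a system like the one described, each time-frequency point's level and phase difference would then be scored under per-source models inside an EM loop; that step is not shown here.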
An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments
We present a method for localizing and separating sound sources in stereo recordings that is robust to reverberation and does not make any assumptions about the source statistics. The method consists of a probabilistic model of binaural multisource recordings and an expectation maximization algorithm for finding the maximum likelihood parameters of that model. These parameters include distributions over delays and assignments of time-frequency regions to sources. We evaluate this method against two comparable algorithms on simulations of simultaneous speech from two or three sources. Our method outperforms the others in anechoic conditions and performs as well as the better of the two in the presence of reverberation.
Improving MIDI-audio alignment with acoustic features
This paper describes a technique to improve the accuracy of dynamic time warping-based MIDI-audio alignment. The technique implements a hidden Markov model that uses aperiodicity and power estimates from the signal as observations and the results of a dynamic time warping alignment as a prior. In addition to improving the overall alignment, this technique also identifies the transient and steady-state sections of the note. This information is important for describing various aspects of a musical performance, including both pitch and rhythm.
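The dynamic time warping step that this abstract builds on can be sketched as follows. This is textbook DTW over a precomputed cost matrix, not the paper's method: the hidden Markov model refinement is not shown, and the function name `dtw_path`, the step pattern (diagonal, vertical, horizontal), and the toy sequences are assumptions for illustration.

```python
import numpy as np

def dtw_path(cost):
    """Return (total_cost, path) for a pairwise cost matrix of shape (n, m).

    path is a list of (row, col) index pairs from (0, 0) to (n - 1, m - 1).
    """
    cost = np.asarray(cost, dtype=float)
    n, m = cost.shape
    # Accumulated cost with a padded border of infinities.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the end, always moving to the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n, m], path

# Toy example: the second sequence holds the middle value for one extra frame.
a = np.array([0.0, 1.0, 2.0])
b = np.array([0.0, 1.0, 1.0, 2.0])
total, path = dtw_path(np.abs(a[:, None] - b[None, :]))
```

In a MIDI-audio setting, the rows might index score events and the columns audio frames, with the resulting path serving as the coarse alignment that a later model refines.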
Combining Localization Cues and Source Model Constraints for Binaural Source Separation
We describe a system for separating multiple sources from a two-channel recording based on interaural cues and prior knowledge of the statistics of the underlying source signals. The proposed algorithm effectively combines information derived from low-level perceptual cues, similar to those used by the human auditory system, with higher-level information related to speaker identity. We combine a probabilistic model of the observed interaural level and phase differences with a prior model of the source statistics and derive an EM algorithm for finding the maximum likelihood parameters of the joint model. The system is able to separate more sound sources than there are observed channels in the presence of reverberation. In simulated mixtures of speech from two and three speakers, the proposed algorithm gives a signal-to-noise ratio improvement of 1.7 dB over a baseline algorithm that uses only interaural cues. Further improvement is obtained by incorporating eigenvoice speaker adaptation to enable the source model to better match the sources present in the signal. This improves performance over the baseline by 2.7 dB when the speakers used for training and testing are matched. However, the improvement is minimal when the test data is very different from that used in training.
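The signal-to-noise ratio improvements quoted in this abstract can be made concrete with a small sketch. The abstract does not give the exact metric definition, so the formula below (target energy over residual energy, in dB) and the names `snr_db` and `snr_improvement_db` are assumptions for illustration, and the signals are toy data.

```python
import numpy as np

def snr_db(target, estimate):
    """SNR in dB, treating (estimate - target) as the noise."""
    target = np.asarray(target, dtype=float)
    noise = np.asarray(estimate, dtype=float) - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))

def snr_improvement_db(target, mixture, separated):
    """How many dB the separated signal gains over the unprocessed mixture."""
    return snr_db(target, separated) - snr_db(target, mixture)

# Toy example: separation shrinks a constant interference from 0.5 to 0.05.
target = np.array([1.0, 0.0, -1.0, 0.0])
mixture = target + 0.5
separated = target + 0.05
improvement = snr_improvement_db(target, mixture, separated)
```

Reducing the residual amplitude by a factor of ten corresponds to a 20 dB improvement under this definition, which gives a sense of scale for gains of a few dB.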
- …