14 research outputs found
Individual identity in songbirds: signal representations and metric learning for locating the information in complex corvid calls
Bird calls range from simple tones to rich, dynamic, multi-harmonic structures.
The more complex calls, such as those of the scientifically important corvid
family (jackdaws, crows, ravens, etc.), remain poorly understood. Individual
birds can recognise familiar individuals from their calls, but where in the
signal is this identity encoded? We studied this question by applying a
combination of feature representations, including linear predictive coding
(LPC) and the adaptive discrete Fourier transform (aDFT), to a dataset of
jackdaw calls. We demonstrate through a classification paradigm that these
representations strongly outperform a standard spectrogram representation for
identifying individuals, and we apply metric learning to determine which
time-frequency regions contribute most strongly to robust individual
identification. Computational methods can thus help to direct the search for an
understanding of these complex biological signals.
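As an illustrative sketch only (not code from the paper), LPC coefficients of the kind used as one of the feature representations can be estimated from a signal frame with the autocorrelation method via the Levinson-Durbin recursion; the function name and parameters here are hypothetical:

```python
import numpy as np

def lpc(frame, order):
    """Estimate LPC coefficients by the autocorrelation method
    (Levinson-Durbin recursion). Returns (a, err) where a[0] = 1
    and a[1:] are the prediction coefficients."""
    # One-sided autocorrelation up to the required lag
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]  # prediction error energy
    for i in range(1, order + 1):
        # Reflection coefficient for this order
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / e
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        e *= (1.0 - k * k)
    return a, e
```

For a decaying exponential frame (the impulse response of a first-order all-pole filter with pole 0.9), the recursion recovers a first coefficient close to -0.9.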
Gaussian Process Experts for Voice Conversion
Conventional approaches to voice conversion typically use a GMM to represent the joint probability density of source and target features. This model is then used to perform spectral conversion between speakers. This approach is reasonably effective but can be prone to overfitting and oversmoothing of the target spectra. This paper proposes an alternative scheme that uses a collection of Gaussian process experts to perform the spectral conversion. Gaussian processes are robust to overfitting and oversmoothing and can predict the target spectra more accurately. Experimental results indicate that the objective performance of voice conversion can be improved using the proposed approach. Copyright © 2011 ISCA
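A minimal sketch of the core operation in such a scheme, assuming numpy only: posterior-mean prediction from a single Gaussian process regressor with an RBF kernel. The paper's full system combines a collection of such experts for spectral conversion; the function names and the fixed length scale here are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    """Squared-exponential (RBF) kernel between two 1-D input sets."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_predict(x_train, y_train, x_test, noise=1e-2):
    """GP regression posterior mean at x_test given noisy training pairs."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = rbf_kernel(x_test, x_train)
    alpha = np.linalg.solve(K, y_train)   # K^{-1} y
    return k_star @ alpha
```

Because the posterior mean is a weighted sum over all training targets, the prediction smoothly interpolates the mapping rather than collapsing to a few component means, which is the intuition behind its robustness to oversmoothing.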
Attention, sobriety checkpoint! Can humans determine by means of voice, if someone is drunk... and can automatic classifiers compete?
This paper analyzes the human performance of recognizing drunk speakers merely by voice and compares the results with the performance of an automatic statistical classifier. The study is carried out within the Interspeech 2011 Speaker State Challenge [1] employing the Alcohol Language Corpus (ALC) [2]. The 79 subjects yielded an average performance of 55.8% unweighted accuracy on a balanced intoxicated/non-intoxicated sample set. The statistical classifier developed in this study reaches a performance of 66.6% unweighted accuracy on the test set. In comparison, the subject with the highest performance yielded 70.0%. Our classifier is based on 4368 acoustic and prosodic features. Incorporating linguistic features along with feature selection using Information Gain Ratio (IGR) ranking added 0.7% absolute improvement while also reducing the feature space by 29%. Copyright © 2011 ISCA
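The IGR criterion used here for feature ranking can be sketched as follows; this is an illustrative implementation for discrete-valued features, not the authors' code (continuous acoustic features would first be discretized):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def info_gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, normalized by the
    feature's own entropy (split information), as in C4.5-style IGR."""
    h_y = entropy(labels)
    n = len(labels)
    h_y_given_x = 0.0
    for v in set(feature):
        idx = [i for i, f in enumerate(feature) if f == v]
        h_y_given_x += len(idx) / n * entropy([labels[i] for i in idx])
    split_info = entropy(feature)
    return (h_y - h_y_given_x) / split_info if split_info > 0 else 0.0
```

A feature that perfectly predicts the class scores 1.0; a feature independent of the class scores 0.0, so ranking by IGR discards uninformative dimensions.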
Graphone model interpolation and Arabic pronunciation generation
This paper extends n-gram graphone model pronunciation generation to use a mixture of such models. This technique is useful when only a small amount of pronunciation dictionary training data is available for a specific variant (or set of variants) of a language, such as a dialect. The performance of the interpolated n-gram graphone model is evaluated on Arabic phonetic pronunciation generation for words that cannot be handled by the Buckwalter Morphological Analyser. The pronunciations produced are also used to train an Arabic broadcast audio speech recognition system. In both cases the interpolated graphone model leads to improved performance. Copyright © 2011 ISCA
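At its core, interpolating graphone n-gram models amounts to a weighted linear combination of the component probability distributions. A minimal sketch over toy distributions follows; in practice the weight `lam` would be tuned on held-out variant-specific data, and the distributions would be conditional n-gram probabilities rather than the flat dictionaries shown here:

```python
def interpolate(p_variant, p_general, lam):
    """Linearly interpolate two probability distributions given as dicts,
    with weight lam on the variant-specific model."""
    keys = set(p_variant) | set(p_general)
    return {k: lam * p_variant.get(k, 0.0) + (1.0 - lam) * p_general.get(k, 0.0)
            for k in keys}
```

The interpolated model keeps coverage from the large general model while shifting probability mass toward the small variant-specific model.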
Compression Techniques Applied to Multiple Speech Recognition Systems
Speech recognition systems typically contain many Gaussian distributions, and hence a large number of parameters. This makes them both slow to decode speech, and large to store. Techniques have been proposed to decrease the number of parameters. One approach is to share parameters between multiple Gaussians, thus reducing the total number of parameters and allowing for shared likelihood calculation. Gaussian tying and subspace clustering are two related techniques which take this approach to system compression. These techniques can decrease the number of parameters with no noticeable drop in performance for single systems. However, multiple acoustic models are often used in real speech recognition systems. This paper considers the application of Gaussian tying and subspace compression to multiple systems. Results show that two speech recognition systems can be modelled using the same number of Gaussians as just one system, with little effect on individual system performance. Copyright © 2009 ISCA
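Gaussian tying can be sketched as clustering the component parameters to a smaller shared codebook, so many Gaussians reference the same stored entry. The toy version below clusters mean vectors only with plain k-means; real systems also account for variances and likelihood loss, and all names here are hypothetical:

```python
import numpy as np

def tie_gaussians(means, n_tied, n_iter=50, seed=0):
    """Tie a set of Gaussian mean vectors to n_tied shared codebook
    entries via k-means. Returns (codebook, assignment)."""
    rng = np.random.default_rng(seed)
    codebook = means[rng.choice(len(means), n_tied, replace=False)].copy()
    assign = np.zeros(len(means), dtype=int)
    for _ in range(n_iter):
        # Assign each Gaussian to its nearest codebook entry
        d = np.linalg.norm(means[:, None, :] - codebook[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Re-estimate each codebook entry from its members
        for k in range(n_tied):
            if np.any(assign == k):
                codebook[k] = means[assign == k].mean(axis=0)
    return codebook, assign
```

After tying, likelihoods are computed once per codebook entry and shared by all Gaussians mapped to it, which is what yields both the storage and decode-time savings described above.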
