Phoneme and sentence-level ensembles for speech recognition

Author: Bengio Samy
Dimitrakakis Christos
Publication date: 01/01/2011

We address the question of whether and how boosting and bagging can be used for speech recognition. To do this, we compare two boosting schemes, one at the phoneme level and one at the utterance level, with a phoneme-level bagging scheme. We control for many parameters and other choices, such as the state inference scheme used. In an unbiased experiment, we clearly show that the gain of boosting methods over a single hidden Markov model is in all cases only marginal, while bagging significantly outperforms all other methods. We thus conclude that bagging methods, which have so far been overlooked in favour of boosting, should be examined more closely as a potentially useful ensemble learning technique for speech recognition.
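The bagging idea named in this abstract can be illustrated with a minimal sketch. It is not the authors' phoneme-level scheme: the base learner here is a toy nearest-centroid classifier on a one-dimensional feature rather than a hidden Markov model, and every name (`bootstrap_sample`, `train_centroids`, `train_bagged`, `bagged_predict`) is hypothetical. It only shows the two ingredients of bagging: training each model on a bootstrap replicate of the data and combining predictions by majority vote.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (one bootstrap replicate)."""
    return [rng.choice(data) for _ in data]


def train_centroids(samples):
    """Toy base learner (an assumption, not an HMM): per-label mean of a 1-D feature."""
    sums, counts = {}, {}
    for x, label in samples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def train_bagged(data, n_models=15, seed=0):
    """Train n_models base learners, each on its own bootstrap replicate."""
    rng = random.Random(seed)
    return [train_centroids(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagged_predict(models, x):
    """Combine the ensemble by majority vote."""
    votes = Counter(predict(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (feature, label) pairs.
data = [(0.1, "a"), (0.2, "a"), (0.3, "a"), (0.9, "b"), (1.0, "b"), (1.1, "b")]
models = train_bagged(data)
```

Calling `bagged_predict(models, 0.15)` returns the label chosen by most of the ensemble members. Boosting differs in that later models are trained on reweighted data that emphasises earlier errors, which this sketch does not attempt.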

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

Chalmers Research

Hochschulschriftenserver - Universität Frankfurt am Main

Learning Distributed Representations for Multiple-Viewpoint Melodic Prediction

Author: Cherla S.
Garcez A.
Pearce M.
Weyde T.
Publication date: 01/01/2013

The analysis of sequences is important for extracting information from music owing to its fundamentally temporal nature. In this paper, we present a distributed model based on the Restricted Boltzmann Machine (RBM) for learning melodic sequences. The model is similar to a previous successful neural network model for natural language [2]. It is first trained to predict the next pitch in a given pitch sequence, and then extended to also make use of information in sequences of note-durations in monophonic melodies on the same task. In doing so, we also propose an efficient way of representing this additional information that takes advantage of the RBM’s structure. Results show that this RBM-based prediction model performs better than previously evaluated n-gram models and also outperforms them in certain cases. It is able to make use of information present in longer sequences more effectively than n-gram models, while scaling linearly in the number of free parameters required.

City Research Online

Audio-based event detection for sports video

Author: A.K. Jain
D. Keislar
D. Pye
J.P. Cambell Jr.
K. Kobla
L. Rabiner
Y. Wang
Publication venue: Springer Fachmedien Wiesbaden GmbH
Publication date: 01/01/2003

In this paper, we present an audio-based event detection approach shown to be effective when applied to sports broadcast data. The main benefit of this approach is the ability to recognise patterns that indicate high levels of crowd response, which can be correlated to key events. By applying Hidden Markov Model-based classifiers, where the predefined content classes are parameterised using Mel-Frequency Cepstral Coefficients, we were able to eliminate the need for defining a heuristic set of rules to determine event detection, thus avoiding a two-class approach shown not to be suitable for this problem. Experimentation indicated that this is an effective method for classifying crowd response in soccer matches, thus providing a basis for automatic indexing and summarisation.
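Classification with one hidden Markov model per content class is commonly done by scoring the observation sequence under each model and picking the most likely class. The sketch below illustrates that idea under simplifying assumptions that are not taken from the paper: observations are discrete symbols (0 = quiet frame, 1 = loud frame) rather than Mel-Frequency Cepstral Coefficient vectors, and the three class names and all probabilities are hypothetical.

```python
def forward_likelihood(obs, start, trans, emit):
    """Return P(obs | model) for a discrete HMM using the forward algorithm."""
    n = len(start)
    # Initialise with the first observation.
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    # Propagate through the remaining observations.
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
            for j in range(n)
        ]
    return sum(alpha)


def classify(obs, models):
    """Pick the class whose HMM assigns the highest likelihood to obs."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))


# Hypothetical class models: (start, transition, emission) with two states each.
models = {
    "crowd_cheer": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.8, 0.2], [0.1, 0.9]]),
    "speech": ([1.0, 0.0], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.6, 0.4]]),
    "music": ([1.0, 0.0], [[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]),
}

label_loud = classify([0, 1, 1, 1], models)   # mostly loud frames
label_quiet = classify([0, 0, 0, 0], models)  # only quiet frames
```

For long sequences the products of probabilities underflow, so a practical version would work with scaled or log-domain forward variables.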

CiteSeerX

Crossref

University of Strathclyde Institutional Repository

Enlighten

Audio-visual speech recognition with background music using single-channel source separation

Author: Erdoğan Hakan
Grais Emad Mounir
Topkaya İbrahim Saygın
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2012

In this paper, we consider audio-visual speech recognition with background music. The proposed algorithm is an integration of audio-visual speech recognition and single-channel source separation (SCSS). We apply the proposed algorithm to recognize speech that is mixed with music signals. First, the SCSS algorithm based on nonnegative matrix factorization (NMF) and spectral masks is used to separate the audio speech signal from the background music in the magnitude spectral domain. After the speech audio is separated from the music, regular audio-visual speech recognition (AVSR) is employed using multi-stream hidden Markov models. Employing the two approaches together, we try to improve recognition accuracy by both processing the audio signal with SCSS and supporting the recognition task with visual information. Experimental results show that combining audio-visual speech recognition with source separation gives remarkable improvements in the accuracy of the speech recognition system.
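The spectral-mask step mentioned in this abstract can be sketched for a single spectral frame. The abstract does not specify the mask, so the ratio mask below is an assumption, as are the function names `soft_mask` and `apply_mask`. The inputs `speech_est` and `music_est` stand for the per-frequency magnitude estimates that an NMF stage would produce for each source; here they are hypothetical numbers.

```python
def soft_mask(speech_est, music_est, eps=1e-12):
    """Ratio mask per frequency bin: speech share of the estimated total magnitude."""
    return [s / (s + m + eps) for s, m in zip(speech_est, music_est)]


def apply_mask(mixture_mag, mask):
    """Scale each bin of the mixture magnitude spectrum by the mask value."""
    return [x * w for x, w in zip(mixture_mag, mask)]


# Hypothetical single frame with three frequency bins.
speech_est = [3.0, 1.0, 0.0]   # assumed NMF estimate of the speech magnitude
music_est = [1.0, 1.0, 2.0]    # assumed NMF estimate of the music magnitude
mixture_mag = [4.0, 2.0, 2.0]  # observed mixture magnitude

mask = soft_mask(speech_est, music_est)
separated_speech = apply_mask(mixture_mag, mask)
```

Because the mask values lie between 0 and 1, the separated magnitude never exceeds the mixture magnitude in any bin. The separated magnitudes would then be combined with the mixture phase to resynthesise audio, which this sketch leaves out.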

CiteSeerX

Crossref

University of Surrey

Sabanci University Research Database

Surrey Research Insight