Sparse coding for speech recognition
The brain is a complex organ that is computationally powerful. Recent research in neurobiology helps scientists better understand the working of the brain, especially how it represents or codes external signals. This research shows that the neural code is sparse: a sparse code is one in which only a few neurons participate in the representation of a signal. Neurons communicate with each other by sending pulses, or spikes, at certain times. The spikes sent between several neurons over time are called a spike train. A spike train contains all the important information about the signal that it codes. This thesis shows how sparse coding can be used for speech recognition. The recognition process consists of three parts. First the speech signal is transformed into a spectrogram. Thereafter a sparse code to represent the spectrogram is found: the spectrogram serves as the input to a linear generative model, and the output of the model is a sparse code that can be interpreted as a spike train. Lastly a spike train model recognises the words that are encoded in the spike train. The algorithms that search for sparse codes to represent signals require many computations. We therefore propose an algorithm that is more efficient than current algorithms, making it possible to find sparse codes in reasonable time if the spectrogram is fairly coarse. The system achieves a word error rate of 19% with a coarse spectrogram, while a system based on Hidden Markov Models achieves a word error rate of 15% on the same spectrograms. Thesis (PhD)--University of Pretoria, 2008. Electrical, Electronic and Computer Engineering.
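As a rough illustration of the second stage described above — finding a sparse code for a spectrogram column under a linear generative model — the following numpy sketch uses iterative soft-thresholding (ISTA). This is a generic sparse-coding solver, not the more efficient algorithm the thesis proposes, and the dictionary here is random rather than learned:

```python
import numpy as np

def ista(D, x, lam=0.05, n_iter=200):
    """Sparse code a ~= argmin ||x - D a||^2 / 2 + lam * ||a||_1
    via iterative soft-thresholding (ISTA)."""
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = a - D.T @ (D @ a - x) / L          # gradient step on the data term
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # soft threshold
    return a

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 256))                 # dictionary: 256 atoms, 64 spectrogram bins
D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
x = 2.0 * D[:, 3] - 1.5 * D[:, 100]            # a "spectrogram column" built from two atoms
a = ista(D, x)
print(np.count_nonzero(np.abs(a) > 1e-3))      # only a handful of active coefficients
```

Most coefficients of the code come out exactly zero, so the nonzero entries can be read as spikes of a few active neurons.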
Sparse Coding of Neural Word Embeddings for Multilingual Sequence Labeling
In this paper we propose and carefully evaluate a sequence labeling framework
which solely utilizes sparse indicator features derived from dense distributed
word representations. The proposed model obtains (near) state-of-the art
performance for both part-of-speech tagging and named entity recognition for a
variety of languages. Our model relies only on a few thousand sparse
coding-derived features, without applying any modification of the word
representations employed for the different tasks. The proposed model has
favorable generalization properties as it retains over 89.8% of its average POS
tagging accuracy when trained on 1.2% of the total available training data,
i.e. ~150 sentences per language.
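The idea of turning a dense embedding into a few sparse indicator features can be sketched as follows — a toy numpy version using greedy orthogonal matching pursuit against a random dictionary. The paper learns its dictionary from the embedding matrix, and the feature-naming scheme below is purely illustrative:

```python
import numpy as np

def omp(D, x, k=5):
    """Greedy orthogonal matching pursuit: explain x with at most k atoms."""
    residual, support = x.copy(), []
    for _ in range(k):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    code = np.zeros(D.shape[1])
    code[support] = coef
    return code

rng = np.random.default_rng(1)
D = rng.normal(size=(50, 300))
D /= np.linalg.norm(D, axis=0)                 # dictionary over the embedding space
emb = rng.normal(size=50)                      # a dense word embedding (made up)
code = omp(D, emb)
# Indicator features: the identity and sign of each active dictionary atom.
features = [f"F{i}{'+' if code[i] > 0 else '-'}" for i in np.flatnonzero(code)]
print(features)
```

Each word thus contributes only a handful of discrete features to the sequence labeler, regardless of the embedding dimensionality.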
Low-rank and Sparse Soft Targets to Learn Better DNN Acoustic Models
Conventional deep neural networks (DNN) for speech acoustic modeling rely on
Gaussian mixture models (GMM) and hidden Markov model (HMM) to obtain binary
class labels as the targets for DNN training. Subword classes in speech
recognition systems correspond to context-dependent tied states or senones. The
present work addresses some limitations of GMM-HMM senone alignments for DNN
training. We hypothesize that the senone probabilities obtained from a DNN
trained with binary labels can provide more accurate targets to learn better
acoustic models. However, DNN outputs bear inaccuracies which are exhibited as
high dimensional unstructured noise, whereas the informative components are
structured and low-dimensional. We exploit principal component analysis (PCA)
and sparse coding to characterize the senone subspaces. Enhanced probabilities
obtained from low-rank and sparse reconstructions are used as soft-targets for
DNN acoustic modeling, which also enables training with untranscribed data.
Experiments conducted on the AMI corpus show a 4.6% relative reduction in word
error rate.
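The low-rank enhancement step can be illustrated with a toy numpy sketch: noisy "posteriors" whose clean versions lie in a low-dimensional subspace are denoised by truncated SVD and renormalised before use as soft targets. Sizes, noise level, and the prototype construction are made up, and the paper additionally uses sparse reconstructions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Clean "senone posteriors" drawn from 3 prototypes, so the matrix has rank 3.
prototypes = rng.dirichlet(np.ones(10) * 0.2, size=3)
clean = prototypes[rng.integers(0, 3, size=200)]          # 200 frames x 10 classes
noisy = clean + 0.05 * rng.normal(size=clean.shape)       # unstructured DNN noise

# Keep only the leading singular directions: the structured, informative part.
U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
r = 3
low_rank = (U[:, :r] * s[:r]) @ Vt[:r]

# Renormalise so the soft targets are valid probability distributions.
soft = np.clip(low_rank, 1e-8, None)
soft /= soft.sum(axis=1, keepdims=True)

print(np.abs(noisy - clean).mean(), np.abs(soft - clean).mean())
```

Because the noise is spread across all dimensions while the signal occupies only a few, truncation removes most of the noise energy and the enhanced posteriors sit closer to the clean ones.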
Exploiting Low-dimensional Structures to Enhance DNN Based Acoustic Modeling in Speech Recognition
We propose to model the acoustic space of deep neural network (DNN)
class-conditional posterior probabilities as a union of low-dimensional
subspaces. To that end, the training posteriors are used for dictionary
learning and sparse coding. Sparse representation of the test posteriors using
this dictionary enables projection to the space of training data. Relying on
the fact that the intrinsic dimensions of the posterior subspaces are indeed
very small and the matrix of all posteriors belonging to a class has a very low
rank, we demonstrate how low-dimensional structures enable further enhancement
of the posteriors and rectify the spurious errors due to mismatch conditions.
The enhanced acoustic modeling method leads to improvements in continuous
speech recognition task using hybrid DNN-HMM (hidden Markov model) framework in
both clean and noisy conditions, where up to 15.4% relative reduction in word
error rate (WER) is achieved.
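A toy version of the projection idea — coding a noisy test posterior as a sparse non-negative combination of training posteriors and taking the reconstruction as the enhanced posterior — might look like the following. The dictionary, sizes, and the non-negative ISTA solver are illustrative choices, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(2)
# Dictionary columns: (made-up) training posterior vectors, 20 classes x 100 exemplars.
D = rng.dirichlet(np.ones(20) * 0.2, size=100).T
test = D[:, 7] + 0.05 * rng.normal(size=20)    # a noisy test posterior

# Non-negative ISTA: sparse, non-negative combination of training posteriors.
L = np.linalg.norm(D, 2) ** 2
a = np.zeros(D.shape[1])
for _ in range(300):
    a = np.maximum(a - D.T @ (D @ a - test) / L - 0.01 / L, 0.0)

enhanced = D @ a                                # projection onto the training space
print(np.linalg.norm(test - D[:, 7]), np.linalg.norm(enhanced - D[:, 7]))
```

The non-negativity constraint guarantees the enhanced vector stays in the cone spanned by training posteriors, which is what rectifies spurious errors under mismatch.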
Bio-Inspired Multi-Layer Spiking Neural Network Extracts Discriminative Features from Speech Signals
Spiking neural networks (SNNs) enable power-efficient implementations due to
their sparse, spike-based coding scheme. This paper develops a bio-inspired SNN
that uses unsupervised learning to extract discriminative features from speech
signals, which can subsequently be used in a classifier. The architecture
consists of a spiking convolutional/pooling layer followed by a fully connected
spiking layer for feature discovery. The convolutional layer of leaky,
integrate-and-fire (LIF) neurons represents primary acoustic features. The
fully connected layer is equipped with a probabilistic spike-timing-dependent
plasticity learning rule. This layer represents the discriminative features
through probabilistic, LIF neurons. To assess the discriminative power of the
learned features, they are used in a hidden Markov model (HMM) for spoken digit
recognition. The experimental results show performance above 96%, which compares
favorably with popular statistical feature extraction methods. Our results
provide a novel demonstration of unsupervised feature acquisition in an SNN.
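The leaky integrate-and-fire dynamics used in both layers can be sketched in a few lines of numpy; the parameters here (time constant, threshold, input levels) are arbitrary toy values, not the paper's:

```python
import numpy as np

def lif(input_current, dt=1.0, tau=20.0, v_thresh=1.0, v_reset=0.0):
    """Leaky integrate-and-fire neuron: leak toward the input, spike at threshold."""
    v, spikes = 0.0, []
    for t, i_t in enumerate(input_current):
        v += dt / tau * (-v + i_t)             # leaky integration
        if v >= v_thresh:
            spikes.append(t)                   # emit a spike and reset
            v = v_reset
    return spikes

# Strong stimulus for the first 100 steps, weak (sub-threshold) afterwards.
current = np.where(np.arange(200) < 100, 1.5, 0.5)
spikes = lif(current)
print(spikes)
```

The neuron fires repeatedly while the drive exceeds the threshold's steady state and falls silent for the weak stimulus, so the spike train itself carries the stimulus information.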
LCANets++: Robust Audio Classification using Multi-layer Neural Networks with Lateral Competition
Audio classification aims at recognizing audio signals, including speech
commands or sound events. However, current audio classifiers are susceptible to
perturbations and adversarial attacks. In addition, real-world audio
classification tasks often suffer from limited labeled data. To help bridge
these gaps, previous work developed neuro-inspired convolutional neural
networks (CNNs) with sparse coding via the Locally Competitive Algorithm (LCA)
in the first layer (i.e., LCANets) for computer vision. LCANets learn in a
combination of supervised and unsupervised learning, reducing dependency on
labeled samples. Motivated by the fact that auditory cortex is also sparse, we
extend LCANets to audio recognition tasks and introduce LCANets++, which are
CNNs that perform sparse coding in multiple layers via LCA. We demonstrate that
LCANets++ are more robust than standard CNNs and LCANets against perturbations,
e.g., background noise, as well as black-box and white-box attacks, e.g.,
evasion and fast gradient sign method (FGSM) attacks.
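A single-layer version of sparse coding via the LCA — the mechanism LCANets++ apply in multiple layers — can be sketched in numpy. The dictionary here is random rather than learned, and the dynamics parameters are illustrative:

```python
import numpy as np

def lca(D, x, lam=0.1, dt=0.05, n_iter=300):
    """Locally Competitive Algorithm: neuron states u evolve under feed-forward
    drive and lateral inhibition; the sparse code a is the thresholded state."""
    G = D.T @ D - np.eye(D.shape[1])           # lateral competition weights
    b = D.T @ x                                # feed-forward drive
    u = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)  # soft threshold
        u += dt * (b - u - G @ a)              # leaky dynamics with inhibition
    return a

rng = np.random.default_rng(3)
D = rng.normal(size=(32, 128))
D /= np.linalg.norm(D, axis=0)                 # unit-norm dictionary atoms
x = 1.5 * D[:, 5] - 1.0 * D[:, 60]             # input built from two atoms
a = lca(D, x)
print(np.flatnonzero(np.abs(a) > 1e-3))        # the few winning neurons
```

Active neurons inhibit correlated neighbours through `G`, so only a few units stay above threshold — the competition that gives LCA-based layers their robustness to input perturbations.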
Speech Based Machine Learning Models for Emotional State Recognition and PTSD Detection
Recognition of emotional state and diagnosis of trauma-related illnesses such as posttraumatic stress disorder (PTSD) using speech signals have been active research topics over the past decade. A typical emotion recognition system consists of three components: speech segmentation, feature extraction and emotion identification. Various speech features have been developed for emotional state recognition, which can be divided into three categories, namely excitation, vocal tract and prosodic features. However, the capabilities of different feature categories and advanced machine learning techniques have not been fully explored for emotion recognition and PTSD diagnosis. For PTSD assessment, clinical diagnosis through structured interviews is a widely accepted means of diagnosis, but patients are often embarrassed to get diagnosed at clinics. A speech-signal-based system is a recently developed alternative. Unfortunately, PTSD speech corpora are limited in size, which presents difficulties in training complex diagnostic models. This dissertation proposes sparse coding methods and deep belief network models for emotional state identification and PTSD diagnosis. It also includes an additional transfer learning strategy for PTSD diagnosis. Deep belief networks are complex models that cannot work with small datasets like the PTSD speech database, so a transfer learning strategy was adopted to mitigate the small-data problem. Transfer learning aims to extract knowledge from one or more source tasks and apply it to a target task with the intention of improving learning; it has proved useful when the target task has limited high-quality training data. We evaluated the proposed methods on the speech under simulated and actual stress (SUSAS) database for emotional state recognition and on two PTSD speech databases for PTSD diagnosis.
Experimental results and statistical tests showed that the proposed models outperformed most state-of-the-art methods in the literature and are potentially efficient models for emotional state recognition and PTSD diagnosis.