Structured Sparsity Models for Multiparty Speech Recovery from Reverberant Recordings
We tackle the multi-party speech recovery problem by modeling the acoustics of reverberant chambers. Our approach exploits structured sparsity models to perform room modeling and speech recovery. We propose a scheme for characterizing the room acoustics from the unknown competing speech sources, relying on localization of the early images of the speakers by sparse approximation of the spatial spectra of the virtual sources in a free-space model. The images are then clustered by exploiting the low-rank structure of the spectro-temporal components belonging to each source. This enables us to identify the early support of the room impulse response function and its unique map to the room geometry. To further tackle the ambiguity of the reflection ratios, we propose a novel formulation of the reverberation model and estimate the absorption coefficients through convex optimization, exploiting a joint sparsity model formulated on the spatio-spectral sparsity of the concurrent speech representation. The acoustic parameters are then incorporated to separate the individual speech signals, either through structured sparse recovery or by inverse filtering of the acoustic channels. Experiments conducted on real data recordings demonstrate the effectiveness of the proposed approach for multi-party speech recovery and recognition.
Comment: 31 page
FPGA-Based Low-Power Speech Recognition with Recurrent Neural Networks
In this paper, a neural network based real-time speech recognition (SR)
system is developed using an FPGA for very low-power operation. The implemented
system employs two recurrent neural networks (RNNs); one is a
speech-to-character RNN for acoustic modeling (AM) and the other is for
character-level language modeling (LM). The system also employs a statistical
word-level LM to improve the recognition accuracy. The results of the AM, the
character-level LM, and the word-level LM are combined using a fairly simple
N-best search algorithm instead of the hidden Markov model (HMM) based network.
The RNNs are implemented using massively parallel processing elements (PEs) for
low latency and high throughput. The weights are quantized to 6 bits to store
all of them in the on-chip memory of an FPGA. The proposed algorithm is
implemented on a Xilinx XC7Z045, and the system can operate much faster than real time.
Comment: Accepted to SiPS 201
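The 6-bit weight quantization that lets all weights fit in the FPGA's on-chip memory can be sketched with simple uniform symmetric quantization; the paper's actual fixed-point scheme may differ, and the per-tensor scale factor here is an assumption.

```python
import numpy as np

def quantize_weights(w, n_bits=6):
    """Uniform symmetric quantization of a weight tensor to n_bits,
    a simple stand-in for 6-bit on-chip weight storage."""
    n_levels = 2 ** (n_bits - 1) - 1       # 31 magnitude levels for 6 bits
    scale = np.max(np.abs(w)) / n_levels   # per-tensor scale (an assumption)
    q = np.round(w / scale).astype(np.int8)  # integer codes in [-31, 31]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate float weights."""
    return q.astype(np.float32) * scale

# Toy demo: quantize a random 64x64 weight matrix and measure the error.
rng = np.random.default_rng(1)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_weights(w, n_bits=6)
w_hat = dequantize(q, s)
max_err = np.max(np.abs(w - w_hat))
```

Rounding bounds the per-weight error by half a quantization step, which is why narrow bit widths remain workable for RNN inference when the weight range is well controlled.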
Robust Sound Event Classification using Deep Neural Networks
The automatic recognition of sound events by computers is an important aspect of emerging applications such as automated surveillance, machine hearing and auditory scene understanding. Recent advances in machine learning, as well as in computational models of the human auditory system, have driven progress in this increasingly popular research field. Robust sound event classification, the ability to recognise sounds under real-world noisy conditions, is an especially challenging task. Classification methods translated from the speech recognition domain, using features such as mel-frequency cepstral coefficients, have been shown to perform reasonably well for the sound event classification task, although spectrogram-based or auditory image analysis techniques reportedly achieve superior performance in noise.
This paper outlines a sound event classification framework that compares auditory image front-end features with spectrogram image-based front-end features, using support vector machine and deep neural network classifiers. Performance is evaluated on a standard robust classification task at different levels of corrupting noise, and with several system enhancements, and is shown to compare very well with current state-of-the-art classification techniques.
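A spectrogram image-based front end of the kind compared here can be sketched in a few lines: compute a log spectrogram and pool it onto a fixed grid so every clip yields a feature vector of the same length. The grid size, log compression and normalisation below are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from scipy.signal import spectrogram

def spectrogram_image_features(x, fs=16000, grid=(20, 20)):
    """Reduce a signal to a fixed-size spectrogram 'image' feature vector
    (a minimal sketch of a spectrogram image-based front end)."""
    f, t, S = spectrogram(x, fs=fs, nperseg=256, noverlap=128)
    S = np.log(S + 1e-10)                  # log-compress the power spectrum
    # Average-pool onto a fixed (freq, time) grid so every clip yields
    # the same feature dimension regardless of duration.
    fi = np.linspace(0, S.shape[0], grid[0] + 1).astype(int)
    ti = np.linspace(0, S.shape[1], grid[1] + 1).astype(int)
    img = np.array([[S[fi[i]:fi[i+1], ti[j]:ti[j+1]].mean()
                     for j in range(grid[1])]
                    for i in range(grid[0])])
    # Normalise to [0, 1] so classifiers see a consistent input range.
    img = (img - img.min()) / (img.max() - img.min() + 1e-10)
    return img.ravel()

# Toy demo: a one-second 1 kHz tone at 16 kHz yields a 400-dim feature vector.
t = np.arange(16000) / 16000.0
feat = spectrogram_image_features(np.sin(2 * np.pi * 1000 * t))
```

The resulting vector can be fed directly to an SVM or DNN classifier; robustness in noise comes largely from the log compression and pooling, which suppress narrowband corruption.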
A Review of Chinese Academy of Sciences (CASIA) Gait Database As a Human Gait Recognition Dataset
Human gait has recently become a prominent biometric, and many researchers have studied it as the basis of new recognition systems. One important advantage of gait recognition over other biometrics is that it does not require the observed subject's attention or cooperation. Many human gait datasets have been created within the last 10 years. Widely used databases include the University of South Florida (USF) Gait Dataset, the Chinese Academy of Sciences (CASIA) Gait Dataset, and the Southampton University (SOTON) Gait Dataset. This paper analyzes the CASIA Gait Dataset in order to examine its characteristics. There are two pre-processing approaches: model-based and model-free. We use the 2D Discrete Wavelet Transform (DWT), selecting Haar wavelets to reduce and extract the features.
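A single level of the 2D Haar DWT used here for feature reduction can be sketched directly in NumPy: pairwise averages and differences along rows, then along columns, yield four sub-bands at half resolution. The silhouette below is a toy stand-in; real gait work would operate on silhouette sequences extracted from the dataset's video frames.

```python
import numpy as np

def haar_dwt2(img):
    """One level of the 2D Haar discrete wavelet transform (a minimal
    sketch; production work would typically use a wavelet library)."""
    # Crop to even dimensions so pixels pair up cleanly.
    img = img[:img.shape[0] // 2 * 2, :img.shape[1] // 2 * 2].astype(float)
    # Row transform: pairwise average (low-pass) and difference (high-pass).
    lo = (img[:, 0::2] + img[:, 1::2]) / 2.0
    hi = (img[:, 0::2] - img[:, 1::2]) / 2.0
    # Column transform on both halves gives the four sub-bands.
    LL = (lo[0::2, :] + lo[1::2, :]) / 2.0   # approximation (feature band)
    LH = (lo[0::2, :] - lo[1::2, :]) / 2.0   # horizontal detail
    HL = (hi[0::2, :] + hi[1::2, :]) / 2.0   # vertical detail
    HH = (hi[0::2, :] - hi[1::2, :]) / 2.0   # diagonal detail
    return LL, LH, HL, HH

# Toy demo: a 64x64 binary "silhouette" reduces to a 32x32 approximation.
sil = np.zeros((64, 64))
sil[10:54, 20:44] = 1.0
LL, LH, HL, HH = haar_dwt2(sil)
```

The LL band quarters the pixel count while preserving the silhouette's overall shape, which is why it serves as the reduced gait feature; further levels can be applied recursively to LL.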
The Efficacy of Deep Learning-Based Mixed Model for Speech Emotion Recognition
Human speech indirectly conveys the mental state or emotions of the speaker. The use of Artificial Intelligence (AI)-based techniques may bring a revolution to this modern era by recognizing emotion from speech. In this study, we introduce a robust method for emotion recognition from human speech using a well-performing preprocessing technique together with a deep learning-based mixed model consisting of a Long Short-Term Memory (LSTM) network and a Convolutional Neural Network (CNN). About 2800 audio files were extracted from the Toronto emotional speech set (TESS) database for this study. A high-pass filter and a Savitzky-Golay filter were used to obtain noise-free as well as smooth audio data. A total of seven types of emotions were used in this study: Angry, Disgust, Fear, Happy, Neutral, Pleasant-surprise, and Sad. Energy, fundamental frequency, and Mel Frequency Cepstral Coefficients (MFCC) were used to extract the emotion features, and these features yielded 97.5% accuracy with the mixed LSTM+CNN model. This mixed model is found to perform better than the usual state-of-the-art models in emotion recognition from speech, indicating that it could be effectively utilized in advanced research dealing with sound processing.
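The preprocessing stage described above, a high-pass filter followed by Savitzky-Golay smoothing, can be sketched with standard SciPy filters; the 80 Hz cutoff, filter order, window length and polynomial order below are illustrative assumptions, not the study's settings.

```python
import numpy as np
from scipy.signal import butter, sosfilt, savgol_filter

def preprocess_audio(x, fs=16000, cutoff=80.0):
    """High-pass then Savitzky-Golay smoothing, a minimal sketch of the
    noise-reduction preprocessing (all parameters are illustrative)."""
    # High-pass stage: remove DC offset and low-frequency rumble.
    sos = butter(4, cutoff, btype="highpass", fs=fs, output="sos")
    x = sosfilt(sos, x)
    # Smoothing stage: local polynomial fit over a short sliding window.
    return savgol_filter(x, window_length=11, polyorder=3)

# Toy demo: a 440 Hz tone with a DC offset and additive noise.
rng = np.random.default_rng(2)
t = np.arange(16000) / 16000.0
noisy = 0.5 + np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(16000)
clean = preprocess_audio(noisy)
```

The high-pass stage strips the 0.5 DC offset while the Savitzky-Golay window smooths high-frequency noise without flattening the tone, which is the stated goal of obtaining noise-free yet smooth audio before feature extraction.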