Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

Authors: Geiger Jürgen, Jin Wenyu, Mousa Amr El-Desoky, Pohjalainen Jouni, Schuller Björn, Zhang Zixing
Publication date: 01/01/2018

Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.

arXiv.org e-Print Archive

OPUS Augsburg

Crossref

Knock-Knock: Acoustic Object Recognition using Stacked Denoising Autoencoders

Authors: Althoefer K, Liu H, Luo S, Zhu L
Publication venue: 'Elsevier BV'
Publication date: 11/03/2017

This paper presents a successful application of deep learning for object recognition based on acoustic data. Previously employed approaches, which use handcrafted features to describe the acoustic data, have two shortcomings: the resulting representation is not widely applicable, and it risks capturing only characteristics that are insignificant for the task. In contrast, there is no need to define the feature representation format when using multilayer/deep learning architectures: features can be learned from raw sensor data without defining discriminative characteristics a priori. In this paper, stacked denoising autoencoders are applied to train a deep learning model. In our experiment, thirty different objects were successfully classified; each object was knocked 120 times with a marker pen to obtain the auditory data. With the proposed deep learning framework, a high accuracy of 91.50% was achieved. A traditional method using handcrafted features with a shallow classifier was taken as a benchmark, and its recognition rate was only 58.22%. Interestingly, a recognition rate of 82.00% was achieved when using a shallow classifier with raw acoustic data as input. In addition, we show that classifying one object with deep learning took far less time (by a factor of more than 6) than with the traditional method. We also explore how different model parameters in our deep architecture affect the recognition performance. Comment: 6 pages, 10 figures, Neurocomputing
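To make the idea of a denoising autoencoder concrete, here is a minimal single-layer sketch in NumPy. It is not the authors' implementation: the masking noise, the sigmoid encoder with a linear decoder, the squared-error loss, and all names and hyperparameters (`train_denoising_autoencoder`, `drop_prob`, `lr`, and so on) are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_denoising_autoencoder(x, n_hidden, n_steps=200, lr=0.1,
                                drop_prob=0.3, seed=0):
    """Train one denoising autoencoder layer with full-batch gradient descent.

    Hypothetical helper: the input is corrupted by random masking, and the
    network is trained to reconstruct the clean input from the corrupted one.
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w_enc = 0.1 * rng.standard_normal((d, n_hidden))
    b_enc = np.zeros(n_hidden)
    w_dec = 0.1 * rng.standard_normal((n_hidden, d))
    b_dec = np.zeros(d)
    losses = []
    for _ in range(n_steps):
        # Corrupt the input: zero each entry with probability drop_prob.
        mask = rng.random(x.shape) >= drop_prob
        x_noisy = x * mask
        # Forward pass: sigmoid encoder, linear decoder.
        h = sigmoid(x_noisy @ w_enc + b_enc)
        x_hat = h @ w_dec + b_dec
        # Squared reconstruction error against the CLEAN input.
        diff = x_hat - x
        losses.append(0.5 * np.mean(np.sum(diff ** 2, axis=1)))
        # Backward pass (gradients of the mean loss).
        err = diff / n
        grad_w_dec = h.T @ err
        grad_b_dec = err.sum(axis=0)
        dh = (err @ w_dec.T) * h * (1.0 - h)
        grad_w_enc = x_noisy.T @ dh
        grad_b_enc = dh.sum(axis=0)
        # Gradient descent update.
        w_dec -= lr * grad_w_dec
        b_dec -= lr * grad_b_dec
        w_enc -= lr * grad_w_enc
        b_enc -= lr * grad_b_enc
    return (w_enc, b_enc, w_dec, b_dec), losses

def encode(x, params):
    """Map clean inputs to the learned hidden representation."""
    w_enc, b_enc, _, _ = params
    return sigmoid(x @ w_enc + b_enc)
```

Stacking, in the usual sense, would mean training a second layer of this kind on `encode(x, params)` and so on, before fine-tuning the whole network with a classifier on top; the abstract does not give the layer sizes or noise type used in the paper.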

arXiv.org e-Print Archive

University of Liverpool Repository

Crossref

Queen Mary Research Online

King's Research Portal

RawNet: Fast End-to-End Neural Vocoder

Authors: He Yunchao, Wang Yujun, Zhang Haitong
Publication date: 10/04/2019

Neural network-based vocoders have recently demonstrated a powerful ability to synthesize high-quality speech. These models usually generate samples by conditioning on spectral features, such as the Mel-spectrum. However, these features are extracted by a speech analysis module that includes processing based on human knowledge. In this work, we propose RawNet, a truly end-to-end neural vocoder, which uses a coder network to learn a higher-level representation of the signal and an autoregressive voder network to generate speech sample by sample. The coder and voder together act like an auto-encoder network and can be jointly trained directly on the raw waveform without any human-designed features. Experiments on copy-synthesis tasks show that RawNet achieves synthesized speech quality comparable to LPCNet, with a smaller model architecture and faster speech generation at inference. Comment: Submitted to Interspeech 2019, Graz, Austria

arXiv.org e-Print Archive

Auto-encoder based deep learning for surface electromyography signal processing

Authors: Al-Jumaily AA, Ibrahim MFI
Publication venue: 'ASTES Journal'
Publication date: 01/01/2018

© 2018 Advances in Science, Technology and Engineering Systems. All Rights Reserved. Feature extraction is a vital part of bio-signal processing. There are two paths for identifying and selecting features in any system. The more popular one is handcrafted feature engineering, which depends mainly on the user's experience and the field of application. The other is feature learning, which trains the system to recognise and pick the features that best match the application. The main idea of feature learning is to create a model that learns the best features without human intervention, rather than resorting to traditional feature extraction or reduction methods that depend on the researcher's experience. In this paper, an auto-encoder is used as the feature learning algorithm to extract useful features from the surface electromyography signal. A deep learning method based on the auto-encoder is proposed to learn the features. Wavelet packet, spectrogram, and wavelet representations of the surface electromyography signal are used in the proposed model. The represented bio-signal is then fed to a stacked auto-encoder (2 stages) to learn features, and finally the behaviour of the proposed algorithm is evaluated with different classifiers: Extreme Learning Machine, Support Vector Machine, and a SoftMax layer. The Rectified Linear Unit (ReLU) is introduced as an activation function for the extreme learning machine classifier, alongside existing functions such as sigmoid and radial basis function (RBF). ReLU shows better classification ability than sigmoid and RBF for the wavelet, wavelet scale 5, and wavelet packet signal representations. For the spectrogram representation, ReLU performs better than sigmoid but worse than RBF. Both confidence intervals and Analysis of Variance are computed for the different classifiers. A classifier fusion layer is implemented to select the classifier that yields the best training and testing accuracies and thereby improve the results. The classifier fusion layer gives encouraging values for both training and testing accuracy.
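As a rough illustration of how an activation function plugs into an extreme learning machine (ELM), the sketch below follows the standard textbook formulation: random, fixed hidden weights and output weights solved by least squares. It is not the authors' code; the function names `train_elm` and `predict_elm`, the Gaussian weight initialisation, and the omission of the RBF variant are all assumptions made to keep the example short.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_elm(X, Y, n_hidden, activation, rng):
    """Fit a single-hidden-layer ELM (hypothetical helper).

    X: (n_samples, n_features) inputs; Y: (n_samples, n_classes) one-hot targets.
    """
    # Hidden-layer weights and biases are drawn at random and never trained.
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = activation(X @ W + b)
    # Output weights: least-squares solution via the Moore-Penrose pseudo-inverse.
    beta = np.linalg.pinv(H) @ Y
    return W, b, beta

def predict_elm(X, W, b, beta, activation):
    """Return class scores; take argmax over axis 1 for predicted labels."""
    return activation(X @ W + b) @ beta
```

Swapping `relu` for `sigmoid` in the `activation` argument is the only change needed to compare the two activations on the same features, which is the kind of comparison the abstract reports.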

OPUS - University of Technology Sydney