Very Deep Convolutional Neural Networks for Robust Speech Recognition
This paper describes the extension and optimization of our previous work on
very deep convolutional neural networks (CNNs) for effective recognition of
noisy speech in the Aurora 4 task. The appropriate number of convolutional
layers, the sizes of the filters, pooling operations, and input feature maps
are all modified: the filter and pooling sizes are reduced and the dimensions
of the input feature maps are extended to allow adding more convolutional
layers. Furthermore, appropriate input padding and input feature map selection
strategies are developed. In addition, an adaptation framework using joint
training of the very deep CNN with auxiliary i-vector and fMLLR features is
developed. These modifications give substantial word error rate reductions
over the standard CNN used as the baseline. Finally, the very deep CNN is
combined with an LSTM-RNN acoustic model, and it is shown that state-level
weighted log-likelihood score combination in a joint acoustic model decoding
scheme is very effective. On the Aurora 4 task, the very deep CNN achieves a
WER of 8.81%, further reduced to 7.99% with auxiliary-feature joint training,
and to 7.09% with LSTM-RNN joint decoding.
Comment: accepted by SLT 201
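The state-level weighted log-likelihood score combination described above can be sketched as a simple per-frame, per-state interpolation of the two models' scores. This is a minimal illustration, not the paper's implementation; the function name, array layout, and weight `w` are assumptions.

```python
import numpy as np

def combine_state_loglikes(loglik_cnn, loglik_lstm, w=0.5):
    """Weighted combination of two acoustic models' state-level scores.

    loglik_cnn, loglik_lstm: (frames, states) arrays of log-likelihoods
    from the very deep CNN and the LSTM-RNN acoustic models.
    w: interpolation weight for the CNN stream (assumed hyperparameter,
    typically tuned on a development set).
    """
    return w * np.asarray(loglik_cnn) + (1.0 - w) * np.asarray(loglik_lstm)
```

The combined scores would then feed a single decoding pass in place of either model's individual scores.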
A Convolutional Neural Network model based on Neutrosophy for Noisy Speech Recognition
Convolutional neural networks are sensitive to unknown noisy conditions at
test time, so their performance degrades on noisy data classification tasks,
including noisy speech recognition. In this research, a new convolutional
neural network (CNN) model with data uncertainty handling, referred to as NCNN
(Neutrosophic Convolutional Neural Network), is proposed for the
classification task. Here, speech signals are used as input data and their
noise is modeled as uncertainty. Using the speech spectrogram, a definition of
uncertainty is proposed in the neutrosophic (NS) domain. Uncertainty is
computed for each time-frequency point of the speech spectrogram, treated like
a pixel, so an uncertainty matrix of the same size as the spectrogram is
created in the NS domain. In the next step, a CNN classification model with
two parallel paths is proposed: the speech spectrogram is used as input to the
first path and the uncertainty matrix as input to the second. The outputs of
the two paths are combined to compute the final output of the classifier. To
show the effectiveness of the proposed method, it has been compared with a
conventional CNN on the isolated words of the Aurora2 dataset. The proposed
method achieves an average accuracy of 85.96% on noisy training data. It is
more robust against Car, Airport, and Subway noises, with accuracies of 90%,
88%, and 81% on test sets A, B, and C, respectively. Results show that the
proposed method outperforms the conventional CNN by 6, 5, and 2 percentage
points on test sets A, B, and C, respectively. This means the proposed method
is more robust against noisy data and handles such data effectively.
Comment: International Conference on Pattern Recognition and Image Analysis
(IPRIA 2019)
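The uncertainty matrix construction above can be illustrated with a per-bin map the same size as the spectrogram. The paper defines uncertainty in the neutrosophic domain; the sketch below approximates indeterminacy by the normalized local standard deviation around each time-frequency point, which is an assumption made for illustration, not the paper's exact definition.

```python
import numpy as np

def uncertainty_matrix(spectrogram, size=3):
    """Build an uncertainty matrix the same size as the spectrogram.

    Indeterminacy is approximated here by the normalized local standard
    deviation in a size x size neighbourhood of each time-frequency
    point -- an assumed stand-in for the paper's neutrosophic measure.
    """
    s = np.asarray(spectrogram, dtype=float)
    pad = size // 2
    sp = np.pad(s, pad, mode="edge")
    # Stack every shifted view of the padded spectrogram so that the
    # std over the first axis is the local std of each neighbourhood.
    windows = np.stack([sp[i:i + s.shape[0], j:j + s.shape[1]]
                        for i in range(size) for j in range(size)])
    std = windows.std(axis=0)
    rng = std.max() - std.min()
    return (std - std.min()) / rng if rng > 0 else np.zeros_like(std)
```

The resulting matrix would feed the second path of the two-path classifier, alongside the raw spectrogram in the first path.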
The role of articulatory feature representation quality in a computational model of human spoken-word recognition
Fine-Tracker is a speech-based model of human speech recognition. While previous work has shown that Fine-Tracker is successful at modelling aspects of human spoken-word recognition, its speech recognition performance is not comparable to human performance, possibly due to suboptimal intermediate articulatory feature (AF) representations. This study investigates the effect of improved AF representations, obtained using a state-of-the-art deep convolutional network, on Fine-Tracker's simulation and recognition performance. Although the improved AF quality resulted in improved speech recognition, it surprisingly did not lead to an improvement in Fine-Tracker's simulation power
Assistant robot through deep learning
This article presents work oriented to assistive robotics, where a scenario is established for a robot to deliver a tool into the hand of a user who has verbally requested it by name. For this, three convolutional neural networks are trained: one for recognition of a group of tools (scalpel, screwdriver, and scissors), which obtained an accuracy of 98% in identifying the tools established for the application; one for speech recognition, trained with the names of the tools in Spanish, whose validation accuracy reached 97.5% in recognizing the words; and another for recognition of the user's hand, considering the classification of two gestures, open and closed hand, where 96.25% accuracy was achieved. With these networks, real-time tests were performed, showing 100% accuracy in the delivery of each tool, i.e. the robot was able to identify correctly what the user requested, recognize each tool correctly, and deliver the one needed when the user opened their hand, taking an average of 45 seconds to execute the application
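The three-network flow described above can be sketched as a simple orchestration function. The three callables stand in for the paper's three CNNs; their names, signatures, and return values are assumptions made for illustration.

```python
def assist(audio, frame, recognize_speech, detect_tool, classify_hand):
    """Orchestration sketch of the assistive-robot pipeline.

    recognize_speech: maps an audio clip to a tool name (speech CNN).
    detect_tool: locates the named tool in a camera frame, or returns
    None if it is not visible (tool-recognition CNN).
    classify_hand: labels the user's hand gesture, "open" or "closed"
    (hand-gesture CNN).
    """
    requested = recognize_speech(audio)        # e.g. "scalpel"
    if detect_tool(frame, requested) is None:  # tool not in view
        return "tool not found"
    if classify_hand(frame) == "open":         # user ready to receive
        return "deliver " + requested
    return "wait for open hand"
```

In the paper's setup, each callable would wrap one of the trained networks; the robot only releases the tool once the open-hand gesture is recognized.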
Speech Quality Classifier Model based on DBN that Considers Atmospheric Phenomena
Current implementations of 5G networks consider a higher frequency range of operation than previous telecommunication networks, making it possible to offer higher data rates for different applications. On the other hand, atmospheric phenomena can have a more negative impact on transmission quality. Thus, the study of transmitted signal quality at high frequencies is relevant to guarantee the user's quality of experience. In this research, Recommendations ITU-R P.838-3 and ITU-R P.676-11, which are methodologies to estimate the signal degradations caused by rainfall and atmospheric gases, respectively, are implemented in a network scenario. Speech signals are encoded by the AMR-WB codec and transmitted, and the perceptual speech quality is evaluated using the algorithm described in ITU-T Rec. P.863, commonly known as POLQA. The novelty of this work is to propose a non-intrusive speech quality classifier that considers atmospheric phenomena. The classifier is based on Deep Belief Networks (DBN) and uses a Support Vector Machine with a radial basis function kernel (RBF-SVM) as the classifier to identify five predefined speech quality classes. Experimental results show that the proposed speech quality classifier reached an accuracy between 92% and 95% for each quality class, overcoming the results obtained by the non-intrusive standard described in ITU-T Recommendation P.563. Furthermore, subjective tests were carried out to validate the proposed classifier's performance, and it reached an accuracy of 94.8%
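The DBN-plus-RBF-SVM architecture described above can be approximated with off-the-shelf components. In the sketch below, two stacked scikit-learn BernoulliRBM layers stand in for the paper's Deep Belief Network (an approximation, not the authors' implementation), and all hyperparameters are placeholders.

```python
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

def build_quality_classifier():
    """Sketch of a DBN feature extractor feeding an RBF-SVM classifier.

    The stacked RBMs learn a compressed representation of the input
    features; the SVM with an RBF kernel then separates the five
    predefined speech quality classes. Layer sizes, learning rates,
    and SVM parameters are illustrative assumptions.
    """
    return Pipeline([
        ("rbm1", BernoulliRBM(n_components=64, learning_rate=0.05,
                              n_iter=10, random_state=0)),
        ("rbm2", BernoulliRBM(n_components=32, learning_rate=0.05,
                              n_iter=10, random_state=0)),
        ("svm", SVC(kernel="rbf", C=1.0, gamma="scale")),
    ])
```

In use, the pipeline would be fit on feature vectors extracted from degraded speech (with POLQA scores binned into the five quality classes as labels) and then predict a class for unseen signals without needing the clean reference.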