Syllable classification using static matrices and prosodic features
In this paper we explore the usefulness of prosodic features for
syllable classification. In order to do this, we represent the
syllable as a static analysis unit such that its acoustic-temporal
dynamics can be merged into a set of features that the SVM
classifier will consider as a whole. In the first part of our
experiment we used MFCCs as features for classification,
obtaining a maximum accuracy of 86.66%. The second part of
our study tests whether the prosodic information is
complementary to the cepstral information for syllable
classification. The results obtained show that combining the
two types of information does improve the classification, but
further analysis is necessary for a more successful
combination of the two types of features.
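The abstract does not say how a variable-length syllable is turned into a static unit. One common way to do it, shown here only as a sketch under that assumption, is to resample the sequence of per-frame feature vectors to a fixed number of frames and flatten the result into a single vector that a classifier such as an SVM can consume. The function names `resample_frames` and `static_vector` and the default of 10 frames are hypothetical and do not come from the paper.

```python
def resample_frames(frames, n_out):
    """Linearly interpolate a variable-length frame sequence to n_out frames.

    `frames` is a list of equal-length feature vectors (e.g. MFCCs per frame).
    This is an illustrative choice, not the method used in the paper.
    """
    if not frames:
        raise ValueError("need at least one frame")
    if n_out < 1:
        raise ValueError("n_out must be positive")
    n_in = len(frames)
    if n_in == 1:
        # A single frame is simply repeated.
        return [list(frames[0]) for _ in range(n_out)]
    out = []
    for i in range(n_out):
        # Position of output frame i on the input time axis.
        pos = i * (n_in - 1) / (n_out - 1) if n_out > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


def static_vector(frames, n_out=10):
    """Flatten the resampled (n_out x dim) matrix into one fixed-length vector."""
    return [x for frame in resample_frames(frames, n_out) for x in frame]
```

With this sketch, every syllable yields a vector of length `n_out * dim` regardless of its duration, so prosodic features could be appended to the same vector before classification.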
Personalized Acoustic Modeling by Weakly Supervised Multi-Task Deep Learning using Acoustic Tokens Discovered from Unlabeled Data
It is well known that recognizers personalized to each user are much more
effective than user-independent recognizers. With the popularity of smartphones
today, although it is not difficult to collect a large set of audio data for
each user, it is difficult to transcribe it. However, it is now possible to
automatically discover acoustic tokens from unlabeled personal data in an
unsupervised way. We therefore propose a multi-task deep learning framework
called a phoneme-token deep neural network (PTDNN), jointly trained from
unsupervised acoustic tokens discovered from unlabeled data and very limited
transcribed data for personalized acoustic modeling. We term this scenario
"weakly supervised". The underlying intuition is that the high degree of
similarity between the HMM states of acoustic token models and phoneme models
may help them learn from each other in this multi-task learning framework.
Initial experiments performed over a personalized audio data set recorded from
Facebook posts demonstrated that very good improvements can be achieved in both
frame accuracy and word accuracy over popularly-considered baselines such as
fDLR, speaker code and lightly supervised adaptation. This approach complements
existing speaker adaptation approaches and can be used jointly with such
techniques to yield improved results.
Comment: 5 pages, 5 figures, published in IEEE ICASSP 201
Speech recognition experiments with audiobooks
Under real-life conditions, several factors may be present that make the automatic recognition of speech difficult. The most obvious examples are background noise, peculiarities of the speaker's voice, sloppy articulation and strong emotional load. These all pose difficult problems for robust speech recognition, but it is not clear how much each contributes to the difficulty of the task. In this paper we examine the abilities of our best recognition technologies under near-ideal conditions. The optimal conditions will be simulated by working with the sound material of an audiobook, in which most of the disturbing factors mentioned above are absent. First, pure phone recognition experiments will be performed, in which neural net-based technologies will be tried alongside conventional Hidden Markov Models. Then we move on to large vocabulary recognition, where morph-based language models are applied to improve the performance of the standard word-based technology. The tests clearly justify our assertion that audiobooks pose a much easier recognition task than real-life databases. In both types of tasks we report the lowest error rates we have achieved so far in Hungarian continuous speech recognition.
Machine learning for Arabic phonemes recognition using electrolarynx speech
Automatic speech recognition is one of the essential ways of interacting with machines. Interest in speech-based intelligent systems has grown in the past few decades. Therefore, there is a need to develop more efficient methods for human speech recognition to ensure reliable communication between individuals and machines. This paper is concerned with Arabic phoneme recognition for speech produced with an electrolarynx device. An electrolarynx is a device used by cancer patients whose vocal cords have been removed. Speech recognition is considered here in order to find the machine learning model best suited to classifying phonemes produced with an electrolarynx device. Phoneme recognition employs different machine learning schemes, including convolutional neural network, recurrent neural network, artificial neural network (ANN), random forest, extreme gradient boosting (XGBoost), and long short-term memory. Modern Standard Arabic is used for the training and testing phases of the recognition system. The dataset covers both ordinary speech and electrolarynx speech recorded by the same person. Mel frequency cepstral coefficients are used as speech features. The results show that the ANN method outperformed the other methods, with an accuracy of 75%, a precision of 77%, and a phoneme error rate (PER) of 21.85%.
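The abstract reports a phoneme error rate but does not state how it was computed. A minimal sketch, assuming the conventional definition (substitutions, deletions and insertions from a Levenshtein alignment, divided by the number of reference phonemes), is given below. The function names and the example phoneme sequences in the usage note are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference phoneme
                cur[j - 1] + 1,          # insertion of a hypothesis phoneme
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """PER in percent, assuming the usual (S + D + I) / N definition."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

For a hypothetical reference `["k", "i", "t", "a", "b"]` and hypothesis `["k", "a", "t", "a"]`, the alignment contains one substitution and one deletion, so the PER is 40%.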
Arabic digits speech recognition and speaker identification in noisy environment using a hybrid model of VQ and GMM
This paper presents an automatic speaker identification and speech recognition system for Arabic digits in a noisy environment. The proposed system is able to identify a speaker after his voice has been saved in the database and noise has been added. Mel frequency cepstral coefficients (MFCC) are taken as the best-suited approach for building the program on the Matlab platform, and vector quantization (VQ) is used to generate the codebooks. Gaussian mixture modelling (GMM) algorithms are used to generate the templates for feature matching. For speaker modeling, we propose a system based on the MFCC-GMM and MFCC-VQ approaches on the one hand and on the hybrid MFCC-VQ-GMM approach on the other. White Gaussian noise is added to the clean speech at several signal-to-noise ratio (SNR) levels to test the system in a noisy environment. The proposed system gives good results in recognition rate.
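The abstract describes adding white Gaussian noise to clean speech at several SNR levels. The authors worked in Matlab and their code is not shown. The sketch below uses Python purely for illustration and assumes the standard relation SNR(dB) = 10 log10(P_signal / P_noise) to scale the noise. The names `add_white_noise` and `measured_snr_db` are hypothetical.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=None):
    """Return `signal` plus white Gaussian noise at the requested SNR in dB."""
    if not signal:
        raise ValueError("signal must not be empty")
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # From SNR_dB = 10 * log10(P_signal / P_noise):
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)
```

Calling `add_white_noise(clean, snr_db)` for each SNR level of interest yields one noisy copy of the test set per level. The empirical SNR of a finite signal only approximates the target, and the match improves with signal length.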