Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications
The representation learning of speech, without textual resources, is an area of significant interest for many low-resource speech applications. In this paper, we describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework. The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers. The learned "time-frequency" representations from the convolutional neural network (CNN) module are further processed with long short-term memory (LSTM) layers, which generate a contextual vector representation for every windowed segment. The HUC framework, which categorizes the representations into a small number of phoneme-like units, is used to train the model to learn semantically rich speech representations. The targets consist of phoneme-like pseudo labels for each audio segment, generated with an iterative k-means algorithm. We explore techniques that improve the speaker invariance of the learned representations and illustrate the effectiveness of the proposed approach in two settings: i) completely unsupervised speech applications on the sub-tasks described as part of the ZeroSpeech 2021 challenge, and ii) semi-supervised automatic speech recognition (ASR) applications on the TIMIT dataset and on the GramVaani challenge Hindi dataset. In these experiments, we achieve state-of-the-art results for various ZeroSpeech tasks. Further, on the ASR experiments, the HUC representations are shown to improve significantly over other established benchmarks based on Wav2vec, HuBERT, and Best-RQ.
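To make the pipeline concrete, here is a minimal sketch of one HUC-style training round, assuming PyTorch and scikit-learn; the layer sizes, cluster count, and names are illustrative stand-ins rather than the paper's actual configuration.

```python
# Minimal sketch of a HUC-style training round (hypothetical sizes and names,
# not the paper's configuration): CNN over raw audio, LSTM for context,
# k-means pseudo labels as classification targets.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class HUCEncoder(nn.Module):
    def __init__(self, hidden=256, n_units=50):
        super().__init__()
        # 1-D convolutions over windowed raw audio ("time-frequency" features)
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=8, stride=4), nn.ReLU(),
        )
        # LSTM yields a contextual vector for every windowed segment
        self.lstm = nn.LSTM(128, hidden, batch_first=True)
        # Classification head over phoneme-like pseudo labels
        self.head = nn.Linear(hidden, n_units)

    def forward(self, wav):                    # wav: (batch, samples)
        z = self.cnn(wav.unsqueeze(1))         # (batch, 128, frames)
        ctx, _ = self.lstm(z.transpose(1, 2))  # (batch, frames, hidden)
        return ctx, self.head(ctx)

model = HUCEncoder()
wav = torch.randn(4, 16000)                    # toy batch of 1-second clips

# One iteration of pseudo-labelling: cluster the current contextual
# representations, then train the model against the cluster IDs.
ctx, logits = model(wav)
flat = ctx.detach().reshape(-1, ctx.size(-1)).numpy()
labels = KMeans(n_clusters=50, n_init=10).fit(flat).labels_
targets = torch.from_numpy(labels).long()
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)), targets)
loss.backward()
```

In the full method the clustering runs over a whole corpus and alternates with training, and the paper adds speaker-invariance techniques on top; the loop above only shows the alternation at toy scale.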
Multinomial logistic regression probability ratio-based feature vectors for Malay vowel recognition
Vowel recognition is a part of automatic speech recognition (ASR) systems that classifies speech signals into groups of vowels. The performance of Malay vowel recognition (MVR), like any multiclass classification problem, depends largely on the feature vectors (FVs) used. FVs such as Mel-frequency cepstral coefficients (MFCC) have produced high error rates due to poor phoneme information. Classifier-transformed probabilistic features have proven a better alternative for conveying phoneme information. However, the high dimensionality of the probabilistic features introduces additional complexity that deteriorates ASR performance. This study aims to improve MVR performance by proposing an algorithm that transforms MFCC FVs into a new set of features using multinomial logistic regression (MLR), reducing the dimensionality of the probabilistic features. The study was carried out in four phases: pre-processing and feature extraction, generation of the best regression coefficients, feature transformation, and performance evaluation. The speech corpus consists of 1953 samples of the five Malay vowels /a/, /e/, /i/, /o/ and /u/, recorded from students of two public universities in Malaysia. Two algorithms were developed, DBRCs and FELT. The DBRCs algorithm determines the best set of regression coefficients (RCs) from the extracted 39-MFCC FVs through a resampling and data-swapping approach. The FELT algorithm transforms the 39-MFCC FVs into FELT FVs using a logistic transformation method. Vowel recognition rates of the FELT and 39-MFCC FVs were compared using four classification techniques: artificial neural networks, MLR, linear discriminant analysis, and k-nearest neighbour. Classification results showed that the FELT FVs surpass the 39-MFCC FVs in MVR: depending on the classifier, FELT attained improvements of 1.48% to 11.70% over MFCC. Furthermore, FELT significantly improved the recognition accuracy of the vowels /o/ and /u/, by 5.13% and 8.04% respectively. This study contributes two algorithms: one for determining the best set of RCs and one for generating FELT FVs from MFCC. The FELT FVs eliminate the need for a separate dimensionality-reduction step while achieving comparable performance, and they improved MVR for all five vowels, especially /o/ and /u/. The improved MVR performance will spur the development of Malay speech-based systems, especially for the Malaysian community.
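The general idea of probability-ratio features can be illustrated with a short sketch, assuming scikit-learn; the synthetic data, the choice of reference class, and the log-ratio construction are assumptions for exposition, not the paper's exact FELT procedure.

```python
# Hypothetical sketch: turn 39-MFCC vectors into low-dimensional
# probability-ratio features via multinomial logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1953, 39))        # stand-in for 39-MFCC feature vectors
y = rng.integers(0, 5, size=1953)      # five vowel classes /a, e, i, o, u/

mlr = LogisticRegression(max_iter=1000).fit(X, y)
proba = mlr.predict_proba(X)           # (1953, 5) class probabilities

# Log-probability ratios against the first vowel as a reference class:
# five probabilities (which sum to 1) collapse to four features.
eps = 1e-12
ratio_feats = np.log((proba[:, 1:] + eps) / (proba[:, [0]] + eps))
print(ratio_feats.shape)               # (1953, 4)
```

With five vowel classes the ratio features are four-dimensional, which is the sense in which such a transformation can sidestep a separate dimensionality-reduction step.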
Computer classification of stop consonants in a speaker independent continuous speech environment
In the English language there are six stop consonants: /b, d, g, p, t, k/. They account for over 17% of all phonemic occurrences. In continuous speech, phonetic recognition of stop consonants requires the ability to explicitly characterize the acoustic signal. Prior work has shown that high classification accuracy on discrete syllables and words can be achieved by characterizing the shape of the spectrally transformed acoustic signal. This thesis extends that concept to a multispeaker continuous speech database, using statistical moments of a distribution to characterize shape. A multivariate maximum-likelihood classifier was used to discriminate between classes. To reduce the number of features used by the discriminant model, a dynamic programming scheme was employed to optimize subset combinations. The top six moments were the mean, variance, and skewness in both frequency and energy. Results showed 85% classification accuracy on the full database of 952 utterances. Performance improved to 97% when the discriminant model was trained separately for male and female talkers.
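A brief sketch of moment-based spectral shape features, assuming NumPy; the windowing and normalization choices here are illustrative, not the thesis's exact measurement procedure.

```python
# Illustrative spectral shape features: treat the magnitude spectrum of a
# frame as a distribution over frequency and take its first three moments.
import numpy as np

def spectral_moments(frame, sr=16000):
    """Mean, variance, and skewness of the magnitude spectrum in frequency."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    p = spec / spec.sum()                  # normalize to a distribution
    mean = np.sum(p * freqs)
    var = np.sum(p * (freqs - mean) ** 2)
    skew = np.sum(p * (freqs - mean) ** 3) / var ** 1.5
    return mean, var, skew

frame = np.random.default_rng(0).normal(size=512)   # stand-in burst frame
print(spectral_moments(frame))
```

Moments "in energy" can be computed analogously by treating the frame's short-time energy contour, rather than its spectrum, as the distribution.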
SPEECH EMOTION DETECTION USING MACHINE LEARNING TECHNIQUES
Communication is the key to expressing one’s thoughts and ideas clearly. Amongst all forms of communication, speech is the most preferred and powerful form in humans. The era of the Internet of Things (IoT) is rapidly advancing, bringing more intelligent systems into everyday use. These applications range from simple wearables and widgets to complex self-driving vehicles and automated systems employed in various fields. Intelligent applications are interactive, require minimal user effort to function, and mostly operate on voice-based input. This creates the necessity for such computer applications to comprehend human speech fully. A speech percept can reveal information about the speaker, including gender, age, language, and emotion. Several existing speech recognition systems used in IoT applications are integrated with an emotion detection system in order to analyze the emotional state of the speaker. The performance of the emotion detection system can greatly influence the overall performance of the IoT application in many ways and can provide many advantages over the base functionality of these applications. This research presents a speech emotion detection system that improves on an existing system in terms of data, feature selection, and methodology, aiming to classify speech percepts by emotion more accurately.
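As an illustration of the general recipe only (not this paper's system), the sketch below extracts utterance-level MFCC statistics and fits a standard classifier, assuming librosa and scikit-learn; the clips and emotion labels are synthetic placeholders.

```python
# Illustrative emotion-classification baseline: utterance-level MFCC
# statistics fed to an SVM. All data here are stand-ins.
import numpy as np
import librosa
from sklearn.svm import SVC

def utterance_features(clip, sr=16000, n_mfcc=13):
    """Mean and standard deviation of each MFCC over the utterance."""
    mfcc = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

rng = np.random.default_rng(0)
clips = [rng.normal(size=16000) for _ in range(8)]  # stand-in utterances
labels = ["happy", "sad"] * 4                        # stand-in emotion labels

X = np.stack([utterance_features(c) for c in clips])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:2]))
```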
Neural approaches to spoken content embedding
Comparing spoken segments is a central operation in speech processing. Traditional approaches in this area have favored frame-level dynamic programming algorithms, such as dynamic time warping, because they require no supervision, but they are limited in performance and efficiency. As an alternative, acoustic word embeddings, fixed-dimensional vector representations of variable-length spoken word segments, have begun to be considered for such tasks as well. However, the current space of such discriminative embedding models, training approaches, and their application to real-world downstream tasks is limited. We start by considering "single-view" training losses, where the goal is to learn an acoustic word embedding model that separates same-word and different-word spoken segment pairs. Then, we consider "multi-view" contrastive losses. In this setting, acoustic word embeddings are learned jointly with embeddings of character sequences to generate acoustically grounded embeddings of written words, or acoustically grounded word embeddings.
In this thesis, we contribute new discriminative acoustic word embedding (AWE) and acoustically grounded word embedding (AGWE) approaches based on recurrent neural networks (RNNs). We improve model training in terms of both efficiency and performance. We take these developments beyond English to several low-resource languages and show that multilingual training improves performance when labeled data is limited. We apply our embedding models, both monolingual and multilingual, to the downstream tasks of query-by-example speech search and automatic speech recognition. Finally, we show how our embedding approaches compare with and complement more recent self-supervised speech models.
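A compact sketch of the multi-view setup in PyTorch; the encoder sizes, in-batch negative sampling, and margin are assumptions for illustration and differ from the thesis's actual objectives in detail.

```python
# Hypothetical multi-view AWE/AGWE sketch: an acoustic encoder and a written
# (character) encoder are trained so paired views embed close together.
import torch
import torch.nn as nn

class AcousticView(nn.Module):          # spoken segment -> fixed-dim embedding
    def __init__(self, feat=40, dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat, dim, batch_first=True)
    def forward(self, x):               # x: (batch, frames, feat)
        _, h = self.rnn(x)
        return nn.functional.normalize(h[-1], dim=-1)

class WrittenView(nn.Module):           # character sequence -> embedding
    def __init__(self, n_chars=30, dim=128):
        super().__init__()
        self.emb = nn.Embedding(n_chars, 64)
        self.rnn = nn.GRU(64, dim, batch_first=True)
    def forward(self, chars):           # chars: (batch, max_len) int ids
        _, h = self.rnn(self.emb(chars))
        return nn.functional.normalize(h[-1], dim=-1)

acoustic, written = AcousticView(), WrittenView()
segs = torch.randn(16, 100, 40)         # toy log-mel word segments
chars = torch.randint(0, 30, (16, 8))   # toy character ids for the same words
a, w = acoustic(segs), written(chars)

# Multi-view triplet loss: each spoken segment should sit closer to its own
# written word than to another word in the batch (rolled by one as negatives).
margin = 0.4
neg = w.roll(1, dims=0)
loss = torch.clamp(margin + (a - w).pow(2).sum(-1) - (a - neg).pow(2).sum(-1),
                   min=0).mean()
loss.backward()
```

A single-view variant drops the written-word encoder and instead forms positive pairs from two spoken instances of the same word.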