Adaptation of Whisper models to child speech recognition
Automatic Speech Recognition (ASR) systems often struggle with transcribing
child speech due to the lack of large child speech datasets required to
accurately train child-friendly ASR models. However, there are huge amounts of
annotated adult speech datasets which were used to create multilingual ASR
models, such as Whisper. Our work aims to explore whether such models can be
adapted to child speech to improve ASR for children. In addition, we compare
Whisper child-adaptations with finetuned self-supervised models, such as
wav2vec2. We demonstrate that finetuning Whisper on child speech yields
significant improvements in ASR performance on child speech, compared to
non-finetuned Whisper models. Additionally, utilizing self-supervised wav2vec2
models that have been finetuned on child speech outperforms Whisper finetuning.
Comment: Accepted in Interspeech 202
Multidisciplinary perspectives on automatic analysis of children's language samples : where do we go from here?
BACKGROUND: Language sample analysis (LSA) is invaluable to describe and understand child language use and development for clinical purposes and research. Digital tools supporting LSA are available, but many of the LSA steps have not been automated. Nevertheless, programs that include automatic speech recognition (ASR), the first step of LSA, have already reached mainstream applicability. SUMMARY: To better understand the complexity, challenges, and future needs of automatic LSA from a technological perspective, including the tasks of transcribing, annotating, and analysing natural child language samples, this article takes on a multidisciplinary view. Requirements of a fully automated LSA process are characterized, features of existing LSA software tools compared, and prior work from the disciplines of information science and computational linguistics reviewed. KEY MESSAGES: Existing tools vary in their extent of automation provided across the process of LSA. Advances in machine learning for speech recognition and processing have potential to facilitate LSA, but the specifics of child speech and language as well as the lack of child data complicate software design. A transdisciplinary approach is recommended as feasible to support future software development for LSA.
https://karger.com/fplhj2023 Centre for Augmentative and Alternative Communication (CAAC) Speech-Language Pathology and Audiolog
Parallelizing Legendre Memory Unit Training
Recently, a new recurrent neural network (RNN) named the Legendre Memory Unit (LMU) was proposed and shown to achieve state-of-the-art performance on several benchmark datasets. Here we leverage the linear time-invariant (LTI) memory component of the LMU to construct a simplified variant that can be parallelized during training (and yet executed as an RNN during inference), resulting in up to 200 times faster training. We note that our efficient parallelizing scheme is general and is applicable to any deep network whose recurrent components are LTI systems. We demonstrate the improved accuracy and decreased parameter count of our new architecture compared to the original LMU and a variety of published LSTM and transformer networks across seven benchmarks. For instance, our LMU sets a new state-of-the-art result on psMNIST, and uses half the parameters while outperforming DistilBERT and LSTM models on IMDB sentiment analysis
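The parallelization idea in this abstract rests on a general property of LTI systems: the state sequence of a linear recurrence equals a causal convolution of the input with the system's impulse response, so all time steps can be computed without stepping through the input in order. The following is a minimal sketch of that equivalence only, not the authors' implementation. The matrices `A` and `B` are random stand-ins (an assumption for illustration) rather than the Legendre-derived matrices of the LMU, and the function names are hypothetical.

```python
import numpy as np

def run_recurrent(A, B, u):
    """Sequential evaluation of m_t = A m_{t-1} + B u_t, starting from m = 0."""
    m = np.zeros(A.shape[0])
    states = []
    for u_t in u:
        m = A @ m + B * u_t
        states.append(m)
    return np.stack(states)

def run_parallel(A, B, u):
    """Same states, obtained by convolving u with the impulse response."""
    n = len(u)
    # Impulse response H[k] = A^k B for k = 0..n-1. It depends only on A and B,
    # so it can be precomputed once and reused for every input sequence.
    H = np.empty((n, A.shape[0]))
    h = B.copy()
    for k in range(n):
        H[k] = h
        h = A @ h
    # m_t = sum_{k=0}^{t} H[k] * u[t-k]: one causal convolution per state dimension.
    return np.stack(
        [np.convolve(u, H[:, i])[:n] for i in range(A.shape[0])], axis=1
    )

# Illustrative data (assumed): a small, stable random system and a random input.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))  # rescale to spectral radius 0.9
B = rng.standard_normal(4)
u = rng.standard_normal(50)
print(np.allclose(run_recurrent(A, B, u), run_parallel(A, B, u)))
```

In practice the convolution would typically be evaluated with FFTs or batched matrix products on an accelerator; `np.convolve` is used here only to keep the sketch short.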
Multinomial logistic regression probability ratio-based feature vectors for Malay vowel recognition
Vowel recognition is a part of automatic speech recognition (ASR) systems that classifies speech signals into groups of vowels. The performance of Malay vowel recognition (MVR), like any multiclass classification problem, depends largely on feature vectors (FVs). FVs such as Mel-frequency cepstral coefficients (MFCC) have produced high error rates due to poor phoneme information. Classifier-transformed probabilistic features have proved a better alternative in conveying phoneme information. However, the high dimensionality of the probabilistic features introduces additional complexity that deteriorates ASR performance. This study aims to improve MVR performance by proposing an algorithm that transforms MFCC FVs into a new set of features using Multinomial Logistic Regression (MLR) to reduce the dimensionality of the probabilistic features. The study was carried out in four phases: pre-processing and feature extraction, best regression coefficients generation, feature transformation, and performance evaluation. The speech corpus consists of 1953 samples of the five Malay vowels /a/, /e/, /i/, /o/ and /u/, recorded from students of two public universities in Malaysia. Two algorithms were developed: DBRCs and FELT. The DBRCs (determines the best regression coefficients) algorithm obtains the best set of regression coefficients (RCs) from the extracted 39-MFCC FVs through a resampling and data swapping approach. The FELT algorithm transforms 39-MFCC FVs into FELT FVs using a logistic transformation method. Vowel recognition rates of FELT and 39-MFCC FVs were compared using four classification techniques: Artificial Neural Network, MLR, Linear Discriminant Analysis, and k-Nearest Neighbour. Classification results showed that FELT FVs surpass the performance of 39-MFCC FVs in MVR. Depending on the classifier used, FELT improved on MFCC by 1.48% to 11.70%. Furthermore, FELT significantly improved the recognition accuracy of vowels /o/ and /u/ by 5.13% and 8.04% respectively. This study contributes two algorithms for determining the best set of RCs and generating FELT FVs from MFCC. The FELT FVs eliminate the need for dimensionality reduction with comparable performances. Furthermore, FELT FVs improved MVR for all five vowels, especially /o/ and /u/. The improved MVR performance will spur the development of Malay speech-based systems, especially for the Malaysian community.
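The general idea of turning high-dimensional acoustic vectors into low-dimensional class-probability features with multinomial logistic regression can be sketched as follows. This is an illustration under stated assumptions, not the DBRCs or FELT algorithms from the abstract, whose details are not given here: the data are synthetic Gaussian clusters standing in for 39-dimensional MFCC vectors of five vowels, the model is fitted by plain batch gradient descent, and all function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_mlr(X, y, n_classes, lr=0.05, epochs=300):
    """Fit multinomial logistic regression by batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        P = softmax(X @ W + b)
        W -= lr * X.T @ (P - Y) / n      # gradient of the mean cross-entropy
        b -= lr * (P - Y).mean(axis=0)
    return W, b

def transform(X, W, b):
    """Map each d-dimensional vector to a vector of class probabilities."""
    return softmax(X @ W + b)

# Synthetic stand-in for 39-dimensional feature vectors of five vowel classes.
rng = np.random.default_rng(0)
means = rng.standard_normal((5, 39))
y = rng.integers(0, 5, size=200)
X = means[y] + 0.5 * rng.standard_normal((200, 39))

W, b = fit_mlr(X, y, n_classes=5)
features = transform(X, W, b)
print(X.shape, "->", features.shape)
```

Each 39-dimensional input is mapped to five probabilities, one per class, which is the sense in which a regression-based transform can yield a compact feature vector for a downstream classifier.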
Robust learning of acoustic representations from diverse speech data
Automatic speech recognition is increasingly applied to new domains. A key challenge is
to robustly learn, update and maintain representations to cope with transient acoustic
conditions. A typical example is broadcast media, for which speakers and environments
may change rapidly, and available supervision may be poor. The concern of this
thesis is to build and investigate methods for acoustic modelling that are robust to the
characteristics and transient conditions as embodied by such media.
The first contribution of the thesis is a technique to make use of inaccurate transcriptions as supervision for acoustic model training. There is an abundance of audio
with approximate labels, but training methods can be sensitive to label errors, and their
use is therefore not trivial. State-of-the-art semi-supervised training makes effective
use of a lattice of supervision, inherently encoding uncertainty in the labels to avoid
overfitting to poor supervision, but does not make use of the transcriptions. Existing
approaches that do aim to make use of the transcriptions typically employ an algorithm
to filter or combine the transcriptions with the recognition output from a seed model,
but the final result does not encode uncertainty. We propose a method to combine the
lattice output from a biased recognition pass with the transcripts, crucially preserving
uncertainty in the lattice where appropriate. This substantially reduces the word error
rate on a broadcast task.
The second contribution is a method to factorise representations for speakers and
environments so that they may be combined in novel combinations. In realistic scenarios,
the speaker or environment transform at test time might be unknown, or there may be
insufficient data to learn a joint transform. We show that in such cases, factorised, or
independent, representations are required to avoid deteriorating performance. Using
i-vectors, we factorise speaker or environment information using multi-condition training
with neural networks. Specifically, we extract bottleneck features from networks trained
to classify either speakers or environments. The resulting factorised representations
prove beneficial when one factor is missing at test time, or when all factors are seen,
but not in the desired combination.
The third contribution is an investigation of model adaptation in a longitudinal
setting. In this scenario, we repeatedly adapt a model to new data, with the constraint
that previous data becomes unavailable. We first demonstrate the effect of such a
constraint, and show that using a cyclical learning rate may help. We then observe
that these successive models lend themselves well to ensembling. Finally, we show
that the impact of this constraint in an active learning setting may be detrimental to
performance, and suggest combining active learning with semi-supervised training to
avoid biasing the model.
The fourth contribution is a method to adapt low-level features in a parameter-efficient and interpretable manner. We propose to adapt the filters in a neural feature
extractor, known as SincNet. In contrast to traditional techniques that warp the
filterbank frequencies in standard feature extraction, adapting SincNet parameters is
more flexible and more readily optimised, whilst maintaining interpretability. On a task
adapting from adult to child speech, we show that this layer is well suited for adaptation
and is very effective with respect to the small number of adapted parameters.
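To illustrate why a sinc-based filter layer is parameter-efficient, the sketch below builds one band-pass FIR filter from just two scalars, a low cut-off and a bandwidth. It follows the commonly described SincNet-style parameterization (difference of two ideal low-pass filters, then a window) and is not taken from the thesis; the kernel size, sample rate, and function name are assumptions for illustration.

```python
import numpy as np

def sinc_bandpass(f_low, bandwidth, kernel_size=101, sample_rate=16000):
    """Band-pass FIR filter defined by a low cut-off and a bandwidth, both in Hz."""
    f1 = abs(f_low) / sample_rate            # normalised low cut-off (cycles/sample)
    f2 = f1 + abs(bandwidth) / sample_rate   # normalised high cut-off
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Difference of two ideal low-pass filters; np.sinc(x) is sin(pi x) / (pi x).
    h = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    return h * np.hamming(kernel_size)       # window to reduce ripple

# Two filters that differ only in their two scalar parameters.
original = sinc_bandpass(1000.0, 1000.0)
shifted = sinc_bandpass(1300.0, 1200.0)

freqs = np.fft.rfftfreq(4096, d=1 / 16000)
for name, h in [("original", original), ("shifted", shifted)]:
    response = np.abs(np.fft.rfft(h, 4096))
    print(name, "peak response near", round(freqs[response.argmax()]), "Hz")
```

Adapting such a layer would mean updating only `f_low` and `bandwidth` for each filter, and each adapted value remains directly readable as a frequency in Hz.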
Four Mode Based Dialogue Management with Modified POMDP Model
This thesis proposes a method to manage the interaction between the user and the system dynamically through speech or text input: at each time step, the method updates the user goals, selects system actions, and calculates a reward for each system response. The main focus is on the dialogue manager, which decides how to continue the dialogue. We use the POMDP technique because it maintains a belief distribution over the dialogue states based on the observations made during the dialogue, even in a noisy environment. Four contextual control modes are introduced into dialogue management as a decision-making mechanism and to keep track of machine behaviour for each dialogue state. The results obtained show that the proposed framework overcomes the limitations of prior POMDP methods and identifies the actual intention of the users within the available time, providing a highly interactive conversation between the user and the computer.
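The belief distribution mentioned in this abstract is maintained, in a standard POMDP, by a Bayesian update after each action and observation. The sketch below shows that textbook update only; it does not reproduce the thesis's modified POMDP model or its four control modes. The two-goal example, the array layout `T[a, s, s']` and `O[a, s', o]`, and the function name are assumptions for illustration.

```python
import numpy as np

def belief_update(belief, T, O, action, observation):
    """Standard POMDP update: b'(s') is proportional to
    O[a, s', o] * sum_s T[a, s, s'] * b(s)."""
    predicted = belief @ T[action]                      # predict the next state
    unnormalised = O[action, :, observation] * predicted  # weight by the observation
    total = unnormalised.sum()
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return unnormalised / total

# Hypothetical dialogue with two user goals and one system action ("ask").
# The goal does not change between turns, and the recogniser reports the
# true goal with probability 0.8.
T = np.array([np.eye(2)])                # T[a, s, s']
O = np.array([[[0.8, 0.2],
               [0.2, 0.8]]])             # O[a, s', o]

belief = np.array([0.5, 0.5])
for turn in range(2):
    belief = belief_update(belief, T, O, action=0, observation=0)
    print("belief after turn", turn + 1, ":", belief)
```

Because the belief is a full distribution and not a single guess, one misrecognised input shifts the probabilities without discarding the alternative goal, which is why this formulation tolerates noisy speech input.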