515 research outputs found
Modeling DNN as human learner
In previous experiments, human listeners demonstrated that they had the ability to adapt to
unheard, ambiguous phonemes after some initial, relatively short exposures. At the same time,
previous work in the speech community has shown that pre-trained deep neural network-based
(DNN) ASR systems, like humans, also have the ability to adapt to unseen, ambiguous phonemes
after retuning their parameters on a relatively small set. In the first part of this thesis, the time-course
of phoneme category adaptation in a DNN is investigated in more detail. By retuning the
DNNs with more and more tokens with ambiguous sounds and comparing classification accuracy
of the ambiguous phonemes in a held-out test across the time-course, we found out that DNNs, like
human listeners, also demonstrated fast adaptation: the accuracy curves were step-like in almost
all cases, showing very little adaptation after seeing only one (out of ten) training bins. However,
unlike our experimental setup mentioned above, in a typical
lexically guided perceptual learning
experiment, listeners are trained with individual words instead of individual phones, and thus to truly
model such a scenario, we would require a model that could take the context of a whole utterance
into account. Traditional speech recognition systems accomplish this through the use of hidden
Markov models (HMM) and WFST decoding. In recent years, bidirectional long short-term memory (Bi-LSTM) trained under connectionist temporal classification (CTC) criterion has also attracted
much attention. In the second part of this thesis, previous experiments on ambiguous phoneme
recognition were carried out again on a new Bi-LSTM model, and phonetic transcriptions of words
ending with ambiguous phonemes were used as training targets, instead of individual sounds that
consisted of a single phoneme. We found out that despite the vastly different architecture, the
new model showed highly similar behavior in terms of classification rate over the time course of
incremental retuning. This indicated that ambiguous phonemes in a continuous context could also
be quickly adapted by neural network-based models. In the last part of this thesis, our pre-trained
Dutch Bi-LSTM from the previous part was treated as a Dutch second language learner and was
asked to transcribe English utterances in a self-adaptation scheme. In other words, we used the
Dutch model to generate phonetic transcriptions directly and retune the model on the transcriptions
it generated, although ground truth transcriptions were used to choose a subset of all self-labeled
transcriptions. Self-adaptation is of interest as a model of human second language learning, but also
has great practical engineering value, e.g., it could be used to adapt speech recognition to a lowr-resource
language. We investigated two ways to improve the adaptation scheme, with the first being
multi-task learning with articulatory feature detection during training the model on Dutch and self-labeled
adaptation, and the second being first letting the model adapt to isolated short words before
feeding it with longer utterances.Ope
Towards Speech Emotion Recognition "in the wild" using Aggregated Corpora and Deep Multi-Task Learning
One of the challenges in Speech Emotion Recognition (SER) "in the wild" is
the large mismatch between training and test data (e.g. speakers and tasks). In
order to improve the generalisation capabilities of the emotion models, we
propose to use Multi-Task Learning (MTL) and use gender and naturalness as
auxiliary tasks in deep neural networks. This method was evaluated in
within-corpus and various cross-corpus classification experiments that simulate
conditions "in the wild". In comparison to Single-Task Learning (STL) based
state of the art methods, we found that our MTL method proposed improved
performance significantly. Particularly, models using both gender and
naturalness achieved more gains than those using either gender or naturalness
separately. This benefit was also found in the high-level representations of
the feature space, obtained from our method proposed, where discriminative
emotional clusters could be observed.Comment: Published in the proceedings of INTERSPEECH, Stockholm, September,
201
Auditory perceptual learning in autistic adults
The automatic retuning of phoneme categories to better adapt to the speech of a novel talker has been extensively documented across various (neurotypical) populations, including both adults and children. However, no studies have examined auditory perceptual learning effects in populations atypical in perceptual, social, and language processing for communication, such as populations with autism. Employing a classic lexically-guided perceptual learning paradigm, the present study investigated perceptual learning effects in Australian English autistic and non-autistic adults. The findings revealed that automatic attunement to existing phoneme categories was not activated in the autistic group in the same manner as for non-autistic control subjects. Specifically, autistic adults were able to both successfully discern lexical items and to categorize speech sounds; however, they did not show effects of perceptual retuning to talkers. These findings may have implications for the application of current sensory theories (e.g., Bayesian decision theory) to speech and language processing by autistic individuals. Lay Summary Lexically guided perceptual learning assists in the disambiguation of speech from a novel talker. The present study established that while Australian English autistic adult listeners were able to successfully discern lexical items and categorize speech sounds in their native language, perceptual flexibility in updating speaker-specific phonemic knowledge when exposed to a novel talker was not available. Implications for speech and language processing by autistic individuals as well as current sensory theories are discussed
- …