Deep Learning for Automatic Assessment and Feedback of Spoken English
Growing global demand for learning a second language (L2), particularly English, has led to considerable interest in automatic spoken language assessment, whether for use in computer-assisted language learning (CALL) tools or for grading candidates for formal qualifications. This thesis presents research conducted into the automatic assessment of spontaneous non-native English speech, with a view to providing meaningful feedback to learners. One of the challenges in automatic spoken language assessment is giving candidates feedback on particular aspects, or views, of their spoken language proficiency, in addition to the overall holistic score normally provided. Another is detecting pronunciation and other types of errors at the word or utterance level and feeding them back to the learner in a useful way.
It is usually difficult to obtain accurate training data with separate scores for different
views and, as examiners are often trained to give holistic grades, single-view scores can suffer from consistency issues. Conversely, holistic scores are available for various standard
assessment tasks such as Linguaskill. An investigation is thus conducted into whether
assessment scores linked to particular views of the speaker’s ability can be obtained from
systems trained using only holistic scores.
End-to-end neural systems are designed with structures and forms of input tuned to single
views, specifically each of pronunciation, rhythm, intonation and text. By training each
system on large quantities of candidate data, it should be possible to extract individual-view information. The relationships between the predictions of each system are evaluated to examine
whether they are, in fact, extracting different information about the speaker. Three methods
of combining the systems to predict the holistic score are investigated: averaging their predictions, concatenating their intermediate representations, and attending over those representations. The combined graders are compared to each other and to baseline approaches.
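To make the three combination strategies concrete, the sketch below shows prediction averaging, concatenation of intermediate representations, and attention over those representations for a batch of candidates; the array shapes, the random stand-in embeddings and the untrained output weights are illustrative assumptions, not the thesis architectures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative outputs from four hypothetical view-specific graders
# (pronunciation, rhythm, intonation, text) for a batch of 8 candidates.
n_views, n_cands, dim = 4, 8, 32
view_scores = rng.normal(3.0, 0.5, size=(n_views, n_cands))   # per-view holistic predictions
view_embeds = rng.normal(size=(n_views, n_cands, dim))        # intermediate representations

# 1) Prediction averaging: treat each view grader as an ensemble member.
avg_score = view_scores.mean(axis=0)                          # (n_cands,)

# 2) Concatenation: stack the view embeddings and apply a linear output layer
#    (left untrained here; in practice it would be learned on holistic scores).
W_cat = rng.normal(scale=0.01, size=(n_views * dim,))
concat = view_embeds.transpose(1, 0, 2).reshape(n_cands, n_views * dim)
concat_score = concat @ W_cat

# 3) Attention over views: score each view embedding against a query vector,
#    softmax over views, then predict from the attention-weighted mixture.
query = rng.normal(scale=0.1, size=(dim,))
logits = np.einsum('vnd,d->vn', view_embeds, query)           # (n_views, n_cands)
weights = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
mixed = np.einsum('vn,vnd->nd', weights, view_embeds)         # (n_cands, dim)
W_att = rng.normal(scale=0.01, size=(dim,))
attn_score = mixed @ W_att

print(avg_score.shape, concat_score.shape, attn_score.shape)
```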
The tasks of error detection and error tendency diagnosis become particularly challenging when the speech in question is spontaneous, especially given the inconsistency of human annotation of pronunciation errors. An approach to these tasks is
presented by distinguishing between lexical errors, wherein the speaker does not know how a
particular word is pronounced, and accent errors, wherein the candidate’s speech exhibits
consistent patterns of phone substitution, deletion and insertion. Three annotated corpora
of non-native English speech by speakers of multiple L1s are analysed, the consistency of
human annotation investigated, and a method presented for detecting individual accent and lexical errors and diagnosing accent error tendencies at the speaker level.
Statistical models for noise-robust speech recognition
A standard way of improving the robustness of speech recognition systems to noise is model compensation. This replaces a speech recogniser's distributions over clean speech by ones over noise-corrupted speech. For each clean speech component, model compensation techniques usually approximate the corrupted speech distribution with a diagonal-covariance Gaussian distribution. This thesis looks into improving on this approximation in two ways: firstly, by estimating full-covariance Gaussian distributions; secondly, by approximating corrupted-speech likelihoods without any parameterised distribution.
The first part of this work is about compensating for within-component feature correlations under noise. For this, the covariance matrices of the computed Gaussians should be full instead of diagonal. The estimation of off-diagonal covariance elements turns out to be sensitive to approximations. A popular approximation is the one that state-of-the-art compensation schemes, like VTS compensation, use for dynamic coefficients: the continuous-time approximation. Standard speech recognisers use both per-time-slice (static) coefficients and dynamic coefficients; the latter represent signal changes over time and are normally computed from a window of static coefficients. To remove the need for the continuous-time approximation, this thesis introduces a new technique. It first compensates a distribution over the window of statics, and then applies the same linear projection that extracts dynamic coefficients. Within this framework, it introduces a number of methods that address the correlation changes that occur in noise. The next problem is decoding speed with full covariances. This thesis re-analyses the previously-introduced predictive linear transformations, and shows how they can model feature correlations at low and tunable computational cost.
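For context, dynamic coefficients are conventionally obtained from a fixed linear regression over a window of static frames, which is why compensating a distribution over the window of statics and then applying the same projection is well defined. A minimal sketch of that standard delta projection (window length, feature dimension and the random static features here are arbitrary):

```python
import numpy as np

def delta_projection(window=2):
    """Weights of the standard regression-based delta formula:
    d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2), for k = 1..window."""
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    return np.array([k / denom for k in range(-window, window + 1)])

# Static features for a few frames (frames x coefficients); values are illustrative.
rng = np.random.default_rng(1)
statics = rng.normal(size=(11, 13))

w = delta_projection(window=2)
t = 5                                   # centre frame
delta_t = w @ statics[t - 2:t + 3]      # same projection applied to every coefficient

# Because the deltas are a linear map A of the stacked window of statics,
# a Gaussian over that window (mean mu, covariance Sigma) maps to a Gaussian
# over the dynamic coefficients with mean A mu and covariance A Sigma A^T.
print(delta_t.shape)
```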
The second part of this work removes the Gaussian assumption completely. It introduces a sampling method that, given speech and noise distributions and a mismatch function, in the limit calculates the corrupted speech likelihood exactly. For this, it transforms the integral in the likelihood expression, and then applies sequential importance resampling. Though it is too slow to use for recognition, it enables a more fine-grained assessment of compensation techniques, based on the KL divergence to the ideal compensation for one component. The KL divergence proves to predict the word error rate well. This technique also makes it possible to evaluate the impact of approximations that standard compensation schemes make. This work was supported by Toshiba Research Europe Ltd., Cambridge Research Laboratory.
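As a rough illustration of this kind of sampling-based assessment (a simplification, not the thesis's importance-resampling scheme), one can draw clean speech and noise from Gaussians, push them through a log-spectral mismatch function, and measure how much likelihood is lost by a diagonal-covariance fit relative to a full-covariance fit of the resulting corrupted-speech samples:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
dim, n = 4, 20000

# Illustrative clean-speech and noise Gaussians in a log-spectral domain.
x = rng.multivariate_normal(np.zeros(dim), np.eye(dim), size=n)
noise = rng.multivariate_normal(np.full(dim, -2.0), 0.5 * np.eye(dim), size=n)

# A commonly used log-domain mismatch function: y = x + log(1 + exp(n - x)).
y = x + np.log1p(np.exp(noise - x))

# Moment-match full- and diagonal-covariance Gaussians on one half of the samples.
fit, held_out = y[: n // 2], y[n // 2:]
mu_y = fit.mean(axis=0)
full_cov = np.cov(fit, rowvar=False)
diag_cov = np.diag(np.diag(full_cov))

# Average held-out log-likelihood under each approximation; the gap estimates the
# extra KL divergence the diagonal assumption incurs relative to the full fit.
ll_full = multivariate_normal(mu_y, full_cov).logpdf(held_out).mean()
ll_diag = multivariate_normal(mu_y, diag_cov).logpdf(held_out).mean()
print(f"full: {ll_full:.3f}  diag: {ll_diag:.3f}  gap: {ll_full - ll_diag:.3f}")
```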
Are super-face-recognisers also super-voice-recognisers? Evidence from cross-modal identification tasks
Individual differences in face identification ability range from prosopagnosia to super-recognition. The current study examined whether face identification ability predicts voice identification ability (participants: N = 529). Superior-face-identifiers (exceptional at face memory and matching), superior-face-recognisers (exceptional at face memory only), superior-face-matchers (exceptional face matchers only), and controls completed the Bangor Voice Matching Test, Glasgow Voice Memory Test, and a Famous Voice Recognition Test. Meeting predictions, those possessing exceptional face memory and matching skills outperformed typical-range face groups at voice memory and voice matching respectively. Proportionally more super-face-identifiers also achieved our super-voice-recogniser criteria on two or more tests. Underlying cross-modality (voices vs. faces) and cross-task (memory vs. perception) mechanisms may therefore drive superior performances. Dissociations between Glasgow Voice Memory Test voice and bell recognition also suggest voice-specific effects to match those found with faces. These findings have applied implications for policing, particularly in cases when only suspect voice clips are available
Automatic recognition of multiparty human interactions using dynamic Bayesian networks
Relating statistical machine learning approaches to the automatic analysis of multiparty
communicative events, such as meetings, is an ambitious research area. We
have investigated automatic meeting segmentation both in terms of “Meeting Actions”
and “Dialogue Acts”. Dialogue acts model the discourse structure at a fine-grained level, highlighting individual speaker intentions. Group meeting actions describe
the same process at a coarse level, highlighting interactions between different
meeting participants and showing overall group intentions.
A framework based on probabilistic graphical models such as dynamic Bayesian
networks (DBNs) has been investigated for both tasks. Our first set of experiments
is concerned with the segmentation and structuring of meetings (recorded using
multiple cameras and microphones) into sequences of group meeting actions such
as monologue, discussion and presentation. We outline four families of multimodal
features based on speaker turns, lexical transcription, prosody, and visual motion
that are extracted from the raw audio and video recordings. We relate these low-level multimodal features to complex group behaviours, proposing a multistream modelling framework based on dynamic Bayesian networks. Later experiments are
concerned with the automatic recognition of Dialogue Acts (DAs) in multiparty
conversational speech. We present a joint generative approach based on a switching
DBN for DA recognition in which segmentation and classification of DAs are
carried out in parallel. This approach models a set of features, related to lexical
content and prosody, and incorporates a weighted interpolated factored language
model. In conjunction with this joint generative model, we have also investigated
the use of a discriminative approach, based on conditional random fields, to perform
a reclassification of the segmented DAs.
The DBN-based approach yielded significant improvements when applied to both the meeting action and the dialogue act recognition tasks. On both tasks, the DBN
framework provided an effective factorisation of the state-space and a flexible infrastructure
able to integrate a heterogeneous set of resources such as continuous
and discrete multimodal features, and statistical language models. Although our
experiments have been principally targeted at multiparty meetings, the features, models, and methodologies developed in this thesis can be employed for a wide range of applications. Moreover, both group meeting actions and DAs offer valuable insights into the current conversational context, providing useful cues and features for several related research areas such as speaker addressing and focus-of-attention modelling, automatic speech recognition and understanding, and topic and decision detection.
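As a highly simplified illustration of joint segmentation and classification, the sketch below uses a plain HMM-style Viterbi decode in place of the switching DBN; the DA inventory, transition probabilities and random per-word scores are placeholders, with every label change marking a DA boundary.

```python
import numpy as np

# A heavily simplified, HMM-style stand-in for joint DA segmentation and
# classification (the thesis uses a switching DBN with lexical and prosodic
# features and a factored language model; everything below is illustrative).
DA_TAGS = ["statement", "question", "backchannel"]
K = len(DA_TAGS)

rng = np.random.default_rng(0)
T = 12                                                   # words in one conversation side
log_emis = np.log(rng.dirichlet(np.ones(K), size=T))     # per-word DA scores (stand-in)

stay, switch = 0.8, 0.2 / (K - 1)                        # favour staying in the current DA
log_trans = np.log(np.full((K, K), switch) + np.eye(K) * (stay - switch))

# Viterbi decode: the best label sequence implicitly yields the segmentation,
# since every label change marks a DA boundary.
delta = log_emis[0].copy()
back = np.zeros((T, K), dtype=int)
for t in range(1, T):
    scores = delta[:, None] + log_trans                  # (previous, current)
    back[t] = scores.argmax(axis=0)
    delta = scores.max(axis=0) + log_emis[t]

path = [int(delta.argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(back[t][path[-1]]))
path.reverse()

segments, start = [], 0
for t in range(1, T + 1):
    if t == T or path[t] != path[start]:
        segments.append((start, t, DA_TAGS[path[start]]))
        start = t
print(segments)
```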
Spoken language processing: piecing together the puzzle
Attempting to understand the fundamental mechanisms underlying spoken language processing, whether it is viewed as behaviour exhibited by human beings or as a faculty simulated by machines, is one of the greatest scientific challenges of our age. Despite tremendous achievements over the past 50 or so years, there is still a long way to go before we reach a comprehensive explanation of human spoken language behaviour and can create a technology with performance approaching or exceeding that of a human being. It is argued that progress is hampered by the fragmentation of the field across many different disciplines, coupled with a failure to create an integrated view of the fundamental mechanisms that underpin one organism's ability to communicate with another. This paper weaves together accounts from a wide variety of different disciplines concerned with the behaviour of living systems - many of them outside the normal realms of spoken language - and compiles them into a new model: PRESENCE (PREdictive SENsorimotor Control and Emulation). It is hoped that the results of this research will provide a sufficient glimpse into the future to give breath to a new generation of research into spoken language processing by mind or machine. (c) 2007 Elsevier B.V. All rights reserved
Auditory-based processing of communication sounds
This thesis examines the possible benefits of adapting a biologically-inspired model of human auditory processing as part of a machine-hearing system. Features were generated by an auditory model, and used as input to machine learning systems to determine the content of the sound. Features were generated using the auditory image model (AIM) and were used for speech recognition and audio search. AIM comprises processing to simulate the human cochlea, and a ‘strobed temporal integration’ process which generates a stabilised auditory image (SAI) from the input sound.
The communication sounds which are produced by humans, other animals, and many musical instruments take the form of a pulse-resonance signal: pulses excite resonances in the body, and the resonance following each pulse contains information both about the type of object producing the sound and its size. In the case of humans, vocal tract length (VTL) determines the size properties of the resonance. In the speech recognition experiments, an auditory filterbank was combined with a Gaussian fitting procedure to produce features which are invariant to changes in speaker VTL. These features were compared against standard mel-frequency cepstral coefficients (MFCCs) in a size-invariant syllable recognition task. The VTL-invariant representation was found to produce better results than MFCCs when the system was trained on syllables from simulated talkers of one range of VTLs and tested on those from simulated talkers with a different range of VTLs.
The image stabilisation process of strobed temporal integration was analysed. Based on the properties of the auditory filterbank being used, theoretical constraints were placed on the properties of the dynamic thresholding function used to perform strobe detection. These constraints were used to specify a simple, yet robust, strobe detection algorithm. The syllable recognition system described above was then extended to produce features from profiles of the SAI and tested with the same syllable database as before. For clean speech, performance of the features was comparable to that of those generated from the filterbank output. However when pink noise was added to the stimuli, performance dropped more slowly as a function of signal-to-noise ratio when using the SAI-based AIM features, than when using either the filterbank-based features or the MFCCs, demonstrating the noise-robustness properties of the SAI representation.
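The dynamic-thresholding idea behind strobe detection can be illustrated with a toy detector: a threshold that decays between strobes and is reset whenever the signal exceeds it. The decay constant and the synthetic pulse-resonance signal below are illustrative choices, not the constraints derived in the thesis.

```python
import numpy as np

def detect_strobes(signal, decay=0.9995):
    """Toy strobe detector: issue a strobe whenever the signal exceeds a dynamic
    threshold, reset the threshold to the signal value, and let it decay otherwise."""
    threshold, strobes = 0.0, []
    for i, s in enumerate(signal):
        if s > threshold:
            strobes.append(i)
            threshold = s
        else:
            threshold *= decay
    return strobes

# Toy pulse-resonance signal: a damped cosine re-excited every 160 samples.
period, n = 160, 640
signal = np.zeros(n)
for start in range(0, n, period):
    k = np.arange(n - start, dtype=float)
    signal[start:] += np.exp(-k / 40.0) * np.cos(2 * np.pi * k / 16.0)

print(detect_strobes(signal))  # with this slow decay, strobes fall on the excitation pulses
```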
The properties of the auditory filterbank in AIM were also analysed. Three models of the cochlea were considered: the static gammatone filterbank, dynamic compressive gammachirp (dcGC) and the pole-zero filter cascade (PZFC). The dcGC and gammatone are standard filterbank models, whereas the PZFC is a filter cascade, which more accurately models signal propagation in the cochlea. However, while the architecture of the filterbanks is different, they have all been successfully fitted to psychophysical masking data from humans. The abilities of the filterbanks to measure pitch strength were assessed, using stimuli which evoke a weak pitch percept in humans, in order to ascertain whether there is any benefit in the use of the more computationally efficient PZFC.
Finally, a complete sound effects search system using auditory features was constructed in collaboration with Google research. Features were computed from the SAI by sampling the SAI space with boxes of different scales. Vector quantization (VQ) was used to convert this multi-scale representation to a sparse code. The ‘passive-aggressive model for image retrieval’ (PAMIR) was used to learn the relationships between dictionary words and these auditory codewords. These auditory sparse codes were compared against sparse codes generated from MFCCs, and the best performance was found when using the auditory features
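A rough sketch of the sparse-coding step is given below; the box dimensions, codebook size and the use of k-means as the vector quantiser are assumptions for illustration rather than the actual system configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-ins for rectangular boxes sampled from SAI frames at several scales,
# flattened to fixed-length vectors (real features would come from AIM).
n_frames, n_boxes, box_dim = 200, 12, 48
boxes = rng.normal(size=(n_frames * n_boxes, box_dim))

# Learn a small VQ codebook and encode each frame as a sparse histogram of
# the codewords its boxes fall into.
codebook_size = 64
vq = KMeans(n_clusters=codebook_size, n_init=4, random_state=0).fit(boxes)

def sparse_code(frame_boxes):
    idx = vq.predict(frame_boxes)
    code = np.zeros(codebook_size)
    np.add.at(code, idx, 1.0)
    return code / len(frame_boxes)          # sparse bag-of-codewords for one frame

doc_code = sparse_code(boxes[:n_boxes])     # e.g. the first frame's boxes
print(int((doc_code > 0).sum()), "non-zero codewords out of", codebook_size)
```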
Comparing heterogeneous visual gestures for measuring the diversity of visual speech signals
Visual lip gestures observed whilst lipreading have a few working definitions, the most common two being: ‘the visual equivalent of a phoneme’ and ‘phonemes which are indistinguishable on the lips’. To date there is no formal definition, in part because we have not yet established a two-way relationship or mapping between visemes and phonemes. Some evidence suggests that visual speech is highly dependent upon the speaker. So here, we use a phoneme-clustering method to form new phoneme-to-viseme maps for both individual and multiple speakers. We test these phoneme-to-viseme maps to examine how similarly speakers talk visually, and we use signed rank tests to measure the distance between individuals. We conclude that, broadly speaking, speakers have the same repertoire of mouth gestures; where they differ is in their use of those gestures.
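One way such a phoneme-clustering method can be realised is by agglomerative clustering of a visual confusion matrix, as sketched below; the phone set, confusion counts, symmetrisation and linkage choice are illustrative, not the paper's exact procedure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

phones = ["p", "b", "m", "f", "v", "t", "d", "s", "z", "k"]
rng = np.random.default_rng(0)

# Stand-in visual confusion counts from a lipreading recogniser
# (rows: reference phoneme, columns: recognised phoneme).
confusions = rng.integers(0, 5, size=(10, 10)) + np.diag(rng.integers(20, 40, size=10))

# Turn confusions into a symmetric distance: frequently confused phones are close.
p = confusions / confusions.sum(axis=1, keepdims=True)
sim = 0.5 * (p + p.T)
dist = 1.0 - sim / sim.max()
np.fill_diagonal(dist, 0.0)

# Agglomerative clustering; cutting the dendrogram yields the viseme classes.
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=4, criterion="maxclust")
visemes = {}
for ph, lab in zip(phones, labels):
    visemes.setdefault(lab, []).append(ph)
print(visemes)
```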
A survey on perceived speaker traits: personality, likability, pathology, and the first challenge
The INTERSPEECH 2012 Speaker Trait Challenge aimed at a unified test-bed for perceived speaker traits – the first challenge of this kind: personality in the five OCEAN personality dimensions, likability of speakers, and intelligibility of pathologic speakers. In the present article, we give a brief overview of the state-of-the-art in these three fields of research and describe the three sub-challenges in terms of the challenge conditions, the baseline results provided by the organisers, and a new openSMILE feature set, which has been used for computing the baselines and which has been provided to the participants. Furthermore, we summarise the approaches and the results presented by the participants to show the various techniques that are currently applied to solve these classification tasks
Modelling the effects of speech rate variation for automatic speech recognition
Wrede B. Modelling the effects of speech rate variation for automatic speech recognition. Bielefeld (Germany): Bielefeld University; 2002. In automatic speech recognition it is a widely observed phenomenon that variations in speech rate cause severe degradation of speech recognition performance. This is due to the fact that standard stochastically based speech recognition systems specialise in the average speech rate. Although many approaches to modelling speech rate variation have been made, an integrated approach in a substantial system still has to be developed. General approaches to rate modelling are based on rate-dependent models which are trained with rate-specific subsets of the training data. During decoding, a signal-based rate estimation is performed, according to which the set of rate-dependent models is selected. While such approaches are able to reduce the word error rate significantly, they suffer from shortcomings such as the reduction of training data and the expensive training and decoding procedure.
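A schematic of this selection step, with a crude syllable-rate estimate standing in for the signal-based rate estimation (model names, thresholds and rate classes are placeholders):

```python
# Schematic of rate-dependent model selection; all names and thresholds are placeholders.
RATE_MODELS = {"slow": "am_slow.mdl", "medium": "am_medium.mdl", "fast": "am_fast.mdl"}

def estimate_rate(n_syllables, duration_s):
    """Crude signal-based rate estimate in syllables per second."""
    return n_syllables / max(duration_s, 1e-6)

def select_model(rate_sps, slow_thresh=3.5, fast_thresh=5.5):
    """Pick the acoustic model trained on the matching rate-specific subset."""
    if rate_sps < slow_thresh:
        return RATE_MODELS["slow"]
    if rate_sps > fast_thresh:
        return RATE_MODELS["fast"]
    return RATE_MODELS["medium"]

# e.g. 27 detected syllable nuclei in a 4.2 s utterance -> ~6.4 syl/s -> fast model
print(select_model(estimate_rate(27, 4.2)))
```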
However, phonetic investigations show that there is a systematic relationship between speech rate and the acoustic characteristics of speech. In fast speech a tendency of reduction can be observed which can be described in more detail as a centralisation effect and an increase in coarticulation. Centralisation means that the formant frequencies of vowels tend to shift towards the vowel space center while increased coarticulation denotes the tendency of the spectral features of a vowel to shift towards those of its phonemic neighbour.
The goal of this work is to investigate the possibility of incorporating knowledge of the systematic influence of speech rate variation on acoustic features into speech rate modelling.
In an acoustic-phonetic analysis of a large corpus of spontaneous speech, it was shown that an increased degree of the two effects of centralisation and coarticulation can be found in fast speech. Several measures for these effects were developed and used in speech recognition experiments with rate-dependent models.
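One simple centralisation measure in this spirit is the mean distance of vowel tokens' (F1, F2) values from the speaker's vowel-space centroid, which shrinks as vowels centralise; the formant values below are invented for illustration.

```python
import numpy as np

def vowel_space_dispersion(formants_hz):
    """Mean Euclidean distance of (F1, F2) tokens from the vowel-space centroid;
    lower values indicate more centralised (reduced) vowels."""
    formants_hz = np.asarray(formants_hz, dtype=float)
    centroid = formants_hz.mean(axis=0)
    return float(np.linalg.norm(formants_hz - centroid, axis=1).mean())

# Invented (F1, F2) tokens for the same vowels spoken normally and fast.
normal = [(300, 2300), (700, 1200), (350, 800), (600, 1800)]
fast   = [(380, 2050), (620, 1300), (420, 950), (560, 1700)]
print(vowel_space_dispersion(normal), ">", vowel_space_dispersion(fast))
```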
A thorough investigation of rate-dependent models showed that significant performance gains could be achieved with duration- and coarticulation-based measures. It was shown that, by the use of different measures, the models were adapted either to centralisation or to coarticulation. Further experiments showed that more detailed modelling with more rate classes yielded a further improvement. It was also observed that a general basis for the models is needed before rate adaptation can be performed. In a comparison with other sources of acoustic variation, it was shown that the effects of speech rate are as severe as those of speaker variation and environmental noise.
All these results show that, for a more substantial system that models rate variations accurately, it is necessary to focus on both durational and spectral effects. The systematic nature of the effects indicates that continuous modelling is possible.