
    ACCDIST: A Metric for comparing speakers' accents

    This paper introduces a new metric for the quantitative assessment of the similarity of speakers' accents. The ACCDIST metric is based on the correlation of inter-segment distance tables across speakers or groups. Basing the metric on segment similarity within a speaker ensures that it is sensitive to the speaker's pronunciation system rather than to his or her voice characteristics. The metric is shown to have an error rate of only 11% when classifying speakers into 14 English regional accents of the British Isles, half the error rate of a metric based directly on spectral information. The metric may also be useful for cluster analysis of accent groups.
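    The correlation-of-distance-tables idea behind ACCDIST can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the Euclidean segment distance, the Pearson correlation, and the function name are all assumptions.

    ```python
    import numpy as np

    def accdist_similarity(segments_a, segments_b):
        """Correlate the inter-segment distance tables of two speakers.

        segments_a, segments_b: (n_segments, n_features) arrays holding one
        acoustic representation per phonetic segment, in the same segment order.
        """
        def distance_table(segments):
            # Distances between a speaker's OWN segments: comparing segments
            # within a speaker factors out voice quality, leaving the
            # geometry of the pronunciation system.
            segments = np.asarray(segments)
            diffs = segments[:, None, :] - segments[None, :, :]
            dists = np.sqrt((diffs ** 2).sum(axis=-1))
            iu = np.triu_indices(len(segments), k=1)
            return dists[iu]  # condensed upper-triangular table

        table_a = distance_table(segments_a)
        table_b = distance_table(segments_b)
        # High correlation => similar pronunciation systems.
        return np.corrcoef(table_a, table_b)[0, 1]
    ```

    Note that a uniform rescaling of one speaker's feature space (a crude stand-in for voice differences) leaves the similarity unchanged, which is the point of working with within-speaker distances.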

    The new accent technologies: recognition, measurement and manipulation of accented speech


    Are super-face-recognisers also super-voice-recognisers? Evidence from cross-modal identification tasks

    Individual differences in face identification ability range from prosopagnosia to super-recognition. The current study examined whether face identification ability predicts voice identification ability (participants: N = 529). Superior-face-identifiers (exceptional at face memory and matching), superior-face-recognisers (exceptional at face memory only), superior-face-matchers (exceptional at face matching only), and controls completed the Bangor Voice Matching Test, Glasgow Voice Memory Test, and a Famous Voice Recognition Test. As predicted, those possessing exceptional face memory and matching skills outperformed typical-range face groups at voice memory and voice matching respectively. Proportionally more super-face-identifiers also achieved our super-voice-recogniser criteria on two or more tests. Underlying cross-modality (voices vs. faces) and cross-task (memory vs. perception) mechanisms may therefore drive superior performances. Dissociations between Glasgow Voice Memory Test voice and bell recognition also suggest voice-specific effects matching those found with faces. These findings have applied implications for policing, particularly in cases when only suspect voice clips are available.

    The effects of child language development on the performance of automatic speech recognition

    Compared with adults' speech, children's speech is more challenging for automatic speech recognition (ASR) and yields inferior results. It has been suggested that addressing this issue requires linguistic understanding of children's speech development, either to provide a solution or an explanation. The present work explores the influence of phonological effects associated with language acquisition (PEALA) on children's ASR and investigates whether they can be detected in systematic patterns of ASR phone confusion errors or evidenced in systematic patterns of acoustic feature structure. Findings from speech development research provide the framework within which a set of predictable error patterns is defined and guides the analysis of the experimental results reported. Several ASR experiments are conducted involving both children's and adults' speech. ASR phone confusion matrices are extracted and analysed according to a statistical significance test proposed for the purposes of this work. A mathematical model is introduced to interpret the emerging results. Additionally, bottleneck features and i-vectors representing the acoustic features in one of the systems developed are extracted and visualised using linear discriminant analysis (LDA). A qualitative analysis is conducted with reference to patterns that can be predicted through PEALA.
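    A phone confusion matrix of the kind analysed above is simply a tally of (reference phone, hypothesis phone) pairs from aligned reference and ASR output. A minimal sketch, assuming the alignment step has already produced such pairs (the function name and input format are illustrative):

    ```python
    from collections import Counter

    def phone_confusion_matrix(pairs):
        """Build a confusion matrix from (reference_phone, hypothesis_phone)
        pairs, e.g. obtained from a time alignment of reference transcripts
        against recogniser output.

        Returns the sorted phone inventory and a square count matrix whose
        rows are reference phones and columns are hypothesised phones.
        """
        counts = Counter(pairs)
        phones = sorted({p for pair in counts for p in pair})
        index = {p: i for i, p in enumerate(phones)}
        matrix = [[0] * len(phones) for _ in range(len(phones))]
        for (ref, hyp), n in counts.items():
            matrix[index[ref]][index[hyp]] += n
        return phones, matrix
    ```

    Off-diagonal cells are the confusion errors; systematic patterns (e.g. consistent substitutions predicted by PEALA) would show up as large off-diagonal counts.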

    Automated phoneme mapping for cross-language speech recognition

    This dissertation explores a unique automated approach to mapping one phoneme set to another, based on the acoustic distances between the individual phonemes. Although the focus of this investigation is on cross-language applications, the approach can be extended to same-language, different-database applications as well. The main goal is to use the data of a source language to train the initial acoustic models of a target language for which very little speech data may be available. To do this, an automatic technique for mapping the phonemes of the two data sets must be found. Such a technique would make it possible to accelerate the development of a speech recognition system for a new language. Current research in cross-language speech recognition has focused on manual methods of mapping phonemes. This investigation considers both an English-to-Afrikaans and an Afrikaans-to-English phoneme mapping; these language pairs have been studied before, but using manual phoneme mapping methods. To determine the best phoneme mapping, different acoustic distance measures are compared: the Kullback-Leibler divergence, the Bhattacharyya distance, the Mahalanobis measure, the Euclidean measure, the L2 metric and the Jeffreys-Matusita distance. The distance measures are tested by comparing the cross-database recognition results obtained on phoneme models created from the TIMIT speech corpus and a locally compiled South African SUN Speech database. By selecting the most appropriate distance measure, an automated procedure for mapping phonemes from the source language to the target language can be carried out. The best distance measure for the mapping gives recognition rates comparable to a manual mapping process undertaken by a phonetic expert.
    This study also investigates the effect of the number of Gaussian mixture components on the mapping and on the speech recognition system's performance. The results indicate that the recogniser's performance increases up to a limit as the number of mixtures increases. In addition, this study explores the effect of excluding the Mel frequency delta and acceleration cepstral coefficients; it is found that including these temporal features improves the mapping and the recognition system's phoneme recognition rate. Experiments are also carried out to determine the impact of the number of HMM recogniser states, and it is found that single-state HMMs deliver the optimum cross-language phoneme recognition results. After the mapping, speaker adaptation strategies are applied to the recognisers to improve their target-language performance. The models of a fully trained speech recogniser in a source language are adapted to target-language models using Maximum Likelihood Linear Regression (MLLR) followed by Maximum A Posteriori (MAP) techniques. Embedded Baum-Welch re-estimation (EBWR) is used to further adapt the models to the target language. These techniques result in a considerable improvement in the phoneme recognition rate. Although a combination of MLLR and MAP techniques has been used previously in speech adaptation studies, the combination of MLLR, MAP and EBWR in cross-language speech recognition is a unique contribution of this study. Finally, a data pooling technique is applied to build a new recogniser using the automatically mapped phonemes from the target language as well as the source language phonemes. This new recogniser demonstrates moderate bilingual phoneme recognition capabilities. The bilingual recogniser is then further adapted to the target language using MAP and embedded Baum-Welch re-estimation techniques.
    This combination of adaptation techniques together with the data pooling strategy is uniquely applied in the field of cross-language recognition. The results obtained using this technique outperform all other techniques tested in terms of phoneme recognition rates, although it requires a considerably more time-consuming training process. It displays only slightly poorer phoneme recognition than recognisers trained and tested on the same language database. Dissertation (MEng (Computer Engineering)), University of Pretoria, 2006. Electrical, Electronic and Computer Engineering.
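    The distance-based mapping described above can be sketched for one of the listed measures, the Bhattacharyya distance, assuming single-Gaussian, diagonal-covariance phoneme models. The closed form used here is the standard Gaussian Bhattacharyya distance; the greedy nearest-source mapping and all names are illustrative, not the dissertation's exact procedure:

    ```python
    import numpy as np

    def bhattacharyya_gaussian(mu1, var1, mu2, var2):
        """Bhattacharyya distance between two diagonal-covariance Gaussians:
        (1/8)(mu1-mu2)^T S^-1 (mu1-mu2) + (1/2) ln(det S / sqrt(det S1 det S2)),
        with S = (S1 + S2)/2 reduced to per-dimension variances."""
        mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
        var = (var1 + var2) / 2.0
        term_mean = 0.125 * np.sum((mu1 - mu2) ** 2 / var)
        term_cov = 0.5 * np.sum(np.log(var / np.sqrt(var1 * var2)))
        return term_mean + term_cov

    def map_phonemes(source_models, target_models):
        """Map each target phoneme to the acoustically closest source phoneme.
        Models are dicts: phoneme -> (mean vector, variance vector)."""
        mapping = {}
        for t, (mu_t, var_t) in target_models.items():
            mapping[t] = min(
                source_models,
                key=lambda s: bhattacharyya_gaussian(mu_t, var_t, *source_models[s]),
            )
        return mapping
    ```

    With multi-component mixtures (as studied in the dissertation) the distance between phoneme models no longer has a closed form and must be approximated, which is one reason the choice of measure and mixture count interact.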

    Investigating emerging self-awareness : its neural underpinnings, the significance of self-recognition, and the relationship with social interactions

    Up until now, self-recognition in the mirror, achieved at around 18 months, has been used to assess self-awareness in infancy. Even though the significance of this test is not universally accepted, this field has progressed very little over the last decades, in contrast to a broad volume of literature on the self in adults. However, a relationship between self-other differentiation and social cognitive abilities has recently been hypothesized, renewing interest in the mechanisms underlying emerging self-awareness. Adult studies have highlighted that brain networks, rather than isolated brain areas, support self-processing. Therefore, the first two studies of this thesis validated the use of advanced connectivity analyses on infant fNIRS data. Making use of these methods, one study demonstrated that functional connectivity between regions belonging to a network that has been related to abstract self-processing in adults gradually increases over the first two years of life. The same network was found to characterise infants who recognise themselves in the mirror. In another study, crucial regions of this network were shown to be engaged during self-recognition in 18-month-olds. As social interactions have been suggested to be fundamental for the construction of the self, the last two studies of this thesis investigated the relationship between emerging self-awareness and social interactions. To test this, I focused on mimicry, known to play an important role in affiliation and in mediating relationships. One study demonstrated that emerging self-awareness may affect infants' tendency to selectively mimic in-group members, which may indicate a possible role of self-comparison and identification processes. The last study did not find evidence for a relationship between mothers' tendency to imitate their infants at 4 months and emerging self-awareness. 
    Taken together, these studies enrich our understanding of the mechanisms underlying emerging self-awareness and they represent a pioneering starting point for further investigations into this topic.
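    The thesis does not specify which connectivity analyses were used, but a common starting point for fNIRS functional connectivity is pairwise Pearson correlation between channel time series. A minimal sketch under that assumption (function names and the network-strength summary are illustrative):

    ```python
    import numpy as np

    def functional_connectivity(signals):
        """Pairwise Pearson correlation between channel time series.

        signals: (n_channels, n_samples) array of haemodynamic responses.
        Returns an (n_channels, n_channels) connectivity matrix.
        """
        return np.corrcoef(signals)

    def network_strength(conn, channels):
        """Mean connectivity within a chosen subset of channels (e.g. the
        regions linked to self-processing), excluding the diagonal."""
        sub = conn[np.ix_(channels, channels)]
        off_diag = sub[~np.eye(len(channels), dtype=bool)]
        return off_diag.mean()
    ```

    A developmental increase in connectivity, as reported above, would appear as a rise in this within-network strength across age groups.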

    The avian dawn chorus across Great Britain: using new technology to study breeding bird song

    The avian dawn chorus is a period of high song output performed daily around sunrise during the breeding season. Singing at dawn is of such significance to birds that they remain motivated to do so amid the noise of numerous others. Yet, we still do not fully understand why the dawn chorus exists. Technological advances in recording equipment, data storage and sound analysis tools now enable collection and scrutiny of large acoustic datasets, encouraging research on sound-producing organisms and promoting ‘the soundscape’ as an indicator of ecosystem health. Using an unrivalled dataset of dawn chorus recordings collected during this thesis, I explore the chorus throughout Great Britain with the prospect of furthering our understanding and appreciation of this daily event. I first evaluate the performance of four automated signal recognition tools (‘recognisers’) when identifying the singing events of target species during the dawn chorus, and devise a new ensemble approach that improves detection of singing events significantly over each of the recognisers in isolation. I then examine daily variation in the timing and peak of the chorus across the country in response to minimum overnight temperature. I conclude that cooler temperatures result in later chorus onset and peak the following dawn, but that the magnitude of this effect is greater at higher latitude sites with cooler and less variable overnight temperature regimes. Next, I present evidence of competition for acoustic space during the dawn chorus between migratory and resident species possessing similar song traits, and infer that this may lead either to fine-scale temporal partitioning of song, such that each competitor maintains optimal output, or to one competitor yielding. 
    Finally, I investigate day-to-day attenuation of song during the leaf-out period from budburst through to full-leaf in woodland trees, and establish the potential for climate-driven advances in leaf-out phenology to attenuate song if seasonal singing activity in birds has not advanced to the same degree. I find that gradual attenuation of sound through the leaf-out process is dependent on the height of the receiver, and surmise that current advances in leaf-out phenology are unlikely to have undue effect on song propagation. This project illustrates the advantage of applying new technology to ecological studies of complex acoustic environments, and highlights areas in need of improvement, which is essential if we are to comprehend and preserve our natural soundscapes.
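    One plausible reading of the ensemble approach described above is a vote across recognisers: a singing event is accepted when enough independent recognisers detect it in the same window. This is a sketch of that idea only; the thesis's actual combination rule, and all names here, are assumptions.

    ```python
    from collections import Counter

    def ensemble_detections(recogniser_outputs, min_votes=2):
        """Combine per-window species detections from several recognisers.

        recogniser_outputs: one iterable per recogniser of
        (window_index, species) detection pairs.
        A detection is kept if at least `min_votes` recognisers report it
        for the same window.
        """
        votes = Counter()
        for detections in recogniser_outputs:
            # each recogniser contributes at most one vote per pair
            votes.update(set(detections))
        return sorted(pair for pair, n in votes.items() if n >= min_votes)
    ```

    Requiring agreement between recognisers trades a little recall for a large reduction in false positives, which is typically why ensembles outperform any single recogniser on noisy dawn-chorus audio.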

    The fundamental frequency of the voice and its effects on continuous speech recognition

    The aim of the study presented here is to analyse the effect of variation in the fundamental frequency on the characteristics of the speech signal, to study the effects of this parameter on the behaviour of continuous speech recognition systems, and to evaluate techniques for compensating the observed effects. To carry this out, several purpose-built databases were recorded, which are also described. The tests performed indicate that the system's recognition rate is affected by the mean value of the fundamental frequency, suffering a relative reduction of up to 38.8% for some extreme values of this parameter. Applying the MLLR technique partially compensates for this degradation, with error-rate reductions of up to 51.9% for the most problematic speakers using only 10 adaptation sentences.
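    The figures above are relative, not absolute, changes. For clarity, a relative reduction is computed as follows (the example numbers are hypothetical, chosen only to illustrate the arithmetic):

    ```python
    def relative_reduction(baseline, improved):
        """Relative reduction of an error (or accuracy) figure, in percent.

        E.g. a hypothetical word error rate dropping from 20.0% to 9.62%
        is a 51.9% relative reduction, even though the absolute drop is
        only 10.38 percentage points.
        """
        return 100.0 * (baseline - improved) / baseline
    ```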

    Computer lipreading via hybrid deep neural network hidden Markov models

    Constructing a viable lipreading system is a challenge because it is claimed that only 30% of the information in speech production is visible on the lips. Nevertheless, in small vocabulary tasks there have been several reports of high accuracies, while investigation of larger vocabulary tasks is rare. This work examines constructing a large vocabulary lipreading system using an approach based on Deep Neural Network Hidden Markov Models (DNN-HMMs). We present the historical development of computer lipreading technology and the state-of-the-art results in small and large vocabulary tasks. In preliminary experiments, we evaluate the performance of lipreading and audiovisual speech recognition on small vocabulary data sets. We then concentrate on improving lipreading systems at a more substantial vocabulary size with a multi-speaker data set, and tackle the problem of lipreading an unseen speaker. We investigate the effect of employing several steps to pre-process visual features. Moreover, we examine the contribution of language modelling in a lipreading system, where we use longer n-grams to recognise visual speech. Our lipreading system is constructed on the 6000-word vocabulary TCD-TIMIT audiovisual speech corpus. The results show that visual-only speech recognition can reach about 60% word accuracy on large vocabularies: we achieved a mean of 59.42%, measured via three-fold cross-validation on the speaker-independent setting of the TCD-TIMIT corpus, using deep autoencoder features and DNN-HMM models. This is the best word accuracy of a lipreading system in a large vocabulary task reported on the TCD-TIMIT corpus. In the final part of the thesis, we examine how the DNN-HMM model improves lipreading performance. We also give an insight into lipreading by providing a feature visualisation. Finally, we present an analysis of lipreading results and suggestions for future development.
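    The word accuracy figure quoted above is the standard speech recognition measure 100·(N − S − D − I)/N, where S, D and I are the substitutions, deletions and insertions in the minimum edit-distance alignment of hypothesis against reference. A minimal sketch (the function name is illustrative; this is the conventional metric, not code from the thesis):

    ```python
    def word_accuracy(reference, hypothesis):
        """Word accuracy, 100 * (N - S - D - I) / N, where S + D + I is the
        Levenshtein edit distance between the reference and hypothesis
        word sequences and N = len(reference)."""
        n, m = len(reference), len(hypothesis)
        # dp[i][j] = minimum edits turning reference[:i] into hypothesis[:j]
        dp = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            dp[i][0] = i
        for j in range(m + 1):
            dp[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                               dp[i - 1][j] + 1,          # deletion
                               dp[i][j - 1] + 1)          # insertion
        return 100.0 * (n - dp[n][m]) / n
    ```

    Note that insertions make word accuracy stricter than "percent of reference words recognised", and it can even go negative for very noisy output.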