
    The listening talker: A review of human and algorithmic context-induced modifications of speech

    Speech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns in response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. The review thus provides a roadmap for future work on improving the robustness of speech output.

    Characterising speech rhythm using spectral coherence between jaw displacement and the temporal envelope of speech

    Lower modulation rates in the temporal envelope (ENV) of the acoustic signal are believed to form the rhythmic backbone of speech, facilitating speech comprehension through neuronal entrainment at δ- and θ-rates (phonetically, these rates are comparable to the foot and syllable rates). The jaw plays the role of a carrier articulator, regulating mouth opening in a quasi-cyclical way that gives rise, as a physical consequence, to the low-frequency modulations. This paper describes a method to examine the joint roles of jaw oscillation and ENV in realizing speech rhythm using spectral coherence. Relative powers in the frequency bands corresponding to the δ- and θ-oscillations in the coherence (notated %δ and %θ respectively) were quantified as one possible way of revealing the amount of concomitant foot- and syllable-level rhythmicity carried by both the acoustic and articulatory domains. Two English corpora (mngu0 and MOCHA-TIMIT) were used for the proof of concept. For an initial analysis, %δ and %θ were regressed on utterance duration. Results showed that the degrees of foot- and syllable-sized rhythmicity differ and are contingent upon utterance length.
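    The %δ and %θ quantities above can be read as the share of total coherence falling in each frequency band. A minimal sketch, assuming a coherence spectrum (e.g. between the jaw trajectory and ENV) has already been computed, and using typical band edges of roughly 0.5-4 Hz for δ and 4-8 Hz for θ (the exact edges used in the paper may differ):

```python
def band_percent(freqs, coh, lo, hi):
    """Percentage of total coherence power in the [lo, hi) Hz band."""
    total = sum(coh)
    in_band = sum(c for f, c in zip(freqs, coh) if lo <= f < hi)
    return 100.0 * in_band / total

# Hypothetical coherence spectrum, sampled at 1 Hz steps (illustrative only).
freqs = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]
coh = [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

pct_delta = band_percent(freqs, coh, 0.5, 4.0)  # %δ
pct_theta = band_percent(freqs, coh, 4.0, 8.0)  # %θ
```

    By construction the two percentages sum to 100 here, since the two bands partition the sampled frequency range; a real analysis would compute the coherence itself with a cross-spectral estimator (e.g. Welch's method) before this step.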

    A description of the rhythm of Barunga Kriol using rhythm metrics and an analysis of vowel reduction

    Kriol is an English-lexifier creole language spoken by over 20,000 children and adults in the northern parts of Australia, yet much about the prosody of this language remains unknown. This thesis provides a preliminary description of the rhythm and patterns of vowel reduction of Barunga Kriol (a variety of Kriol local to Barunga Community, NT) and compares it to a relatively standard variety of Australian English. The thesis is divided into two studies. Study 1, the Rhythm Metric Study, describes the rhythm of Barunga Kriol and Australian English using rhythm metrics. Study 2, the Vowel Reduction Study, compares patterns of vowel reduction in Barunga Kriol and Australian English. This thesis contributes the first in-depth studies of vowel reduction patterns and rhythm using rhythm metrics in any variety of Kriol or Australian English. The research also sets an adult baseline of metric results and vowel reduction patterns for Barunga Kriol and Australian English, useful for future studies of child speech in these varieties. As rhythm is a major contributor to intelligibility, the findings of this thesis have the potential to inform teaching practice in English as a Second Language.

    Neural encoding of the speech envelope by children with developmental dyslexia.

    Developmental dyslexia is consistently associated with difficulties in processing phonology (linguistic sound structure) across languages. One view is that dyslexia is characterised by a cognitive impairment in the "phonological representation" of word forms, which arises long before the child presents with a reading problem. Here we investigate a possible neural basis for developmental phonological impairments. We assess the neural quality of speech encoding in children with dyslexia by measuring the accuracy of low-frequency speech envelope encoding using EEG. We tested children with dyslexia and chronological age-matched (CA) and reading-level matched (RL) younger children. Participants listened to semantically unpredictable sentences in a word report task. The sentences were noise-vocoded to increase reliance on envelope cues. Envelope reconstruction for envelopes between 0 and 10 Hz showed that the children with dyslexia had significantly poorer speech encoding in the 0-2 Hz band compared to both CA and RL controls. These data suggest that impaired neural encoding of low-frequency speech envelopes, related to speech prosody, may underpin the phonological deficit that causes dyslexia across languages. Medical Research Council (Grant ID: G0902375). This is the final version of the article; it first appeared from Elsevier via http://dx.doi.org/10.1016/j.bandl.2016.06.00
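    The low-frequency speech envelope that such studies reconstruct is, at its core, a rectified and low-pass-filtered version of the waveform. A proper implementation would use the Hilbert transform and a zero-phase low-pass filter at the band edges given above; the stdlib-only sketch below substitutes a moving average, and the window length is an illustrative assumption:

```python
def crude_envelope(signal, window=3):
    """Very crude amplitude envelope: full-wave rectification followed by
    a moving average of `window` samples. The moving average stands in
    for a proper low-pass filter; a real 0-10 Hz band would be set by
    choosing the filter cut-off relative to the sampling rate."""
    rect = [abs(s) for s in signal]
    half = window // 2
    env = []
    for i in range(len(rect)):
        seg = rect[max(0, i - half):i + half + 1]
        env.append(sum(seg) / len(seg))
    return env

# A fast alternation has a flat envelope: the rapid sign changes are
# discarded and only the slow amplitude contour remains.
print(crude_envelope([1.0, -1.0, 1.0, -1.0, 1.0]))  # → [1.0, 1.0, 1.0, 1.0, 1.0]
```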

    A computational model of the relationship between speech intelligibility and speech acoustics

    Speech intelligibility measures how well a speaker can be understood by a listener. Traditional measures of intelligibility, such as word accuracy, are not sufficient to reveal the reasons for intelligibility degradation. This dissertation investigates the underlying sources of intelligibility degradation from the perspectives of both the speaker and the listener. Segmental phoneme errors and suprasegmental lexical boundary errors are developed to reveal the perceptual strategies of the listener. A comprehensive set of automated acoustic measures is developed to quantify variations in the acoustic signal along three perceptual dimensions: articulation, prosody, and vocal quality. The developed measures have been validated on a dysarthric speech dataset spanning a range of severity degrees. Multiple regression analysis is employed to show that the developed measures can predict perceptual ratings reliably. The relationship between the acoustic measures and the listening errors is investigated to show the interaction between speech production and perception. The hypothesis is that the segmental phoneme errors are mainly caused by imprecise articulation, while the suprasegmental lexical boundary errors are due to unreliable phonemic information as well as abnormal rhythm and prosody patterns. To test this hypothesis, within-speaker variations are simulated in different speaking modes. Significant changes are detected in both the acoustic signals and the listening errors. Results of the regression analysis support the hypothesis by showing that changes in the articulation-related acoustic features are important in predicting changes in listening phoneme errors, while changes in both the articulation- and prosody-related features are important in predicting changes in lexical boundary errors. Moreover, significant correlation is achieved in the cross-validation experiment, indicating that it is possible to predict intelligibility variations from the acoustic signal. Doctoral Dissertation, Speech and Hearing Science, 201

    Exploring the influence of suprasegmental features of speech on rater judgements of intelligibility

    A thesis submitted to the University of Bedfordshire in partial fulfilment of the requirements for the degree of Doctor of Philosophy. The importance of suprasegmental features of speech to pronunciation proficiency is well known, yet limited research has been undertaken to identify how raters attend to suprasegmental features in the English-language speaking test encounter. Currently, such features appear to be underrepresented in language learning frameworks and are not always satisfactorily incorporated into the analytical rating scales used by major language testing organisations. This thesis explores the influence of lexical stress, rhythm and intonation on rater decision making in order to provide insight into their proper place in rating scales and frameworks. Data were collected from 30 raters, half of whom were experienced professional raters and half of whom lacked rater training and a background in language learning or teaching. The raters were initially asked to score 12 test-taker performances using a 9-point intelligibility scale. The performances were taken from the long turn of Cambridge English Main Suite exams and were selected on the basis of the inclusion of a range of notable suprasegmental features. Following scoring, the raters took part in a stimulated recall procedure to report the features that influenced their decisions. The resulting scores were quantitatively analysed using many-facet Rasch measurement analysis. Transcriptions of the verbal reports were analysed using qualitative methods. Finally, an integrated analysis of the quantitative and qualitative data was undertaken to develop a series of suprasegmental rating scale descriptors. The results showed that experienced raters do appear to attend to specific suprasegmental features in a reliable way, and that their decisions have a great deal in common with the way non-experienced raters regard such features.
This indicates that stress, rhythm, and intonation may be somewhat underrepresented on current speaking proficiency scales and frameworks. The study concludes with the presentation of a series of suprasegmental rating scale descriptors.

    Perceptual Restoration of Temporally Distorted Speech in L1 vs. L2: Local Time Reversal and Modulation Filtering

    Speech remains intelligible even when its temporal envelope is distorted. The current study investigates how native and non-native speakers perceptually restore temporally distorted speech. Participants were native English speakers (NS) and native Japanese speakers who spoke English as a second language (NNS). In Experiment 1, participants listened to “locally time-reversed speech”, in which every x ms of the speech signal was reversed on the temporal axis. Here, the local time reversal shifted the constituents of the speech signal forward or backward from their original positions, and the amplitude envelope of the speech was altered as a function of reversed-segment length. In Experiment 2, participants listened to “modulation-filtered speech”, in which the modulation frequency components of speech were low-pass filtered at a particular cut-off frequency. Here, the temporal envelope of speech was altered as a function of cut-off frequency. The results suggest that speech becomes gradually unintelligible as the length of reversed segments increases (Experiment 1) and as a lower cut-off frequency is imposed (Experiment 2). The two experiments exhibit equivalent levels of speech intelligibility across the six levels of degradation for native and non-native speakers respectively, which raises the question of whether the regular occurrence of local time reversal can be discussed in the modulation frequency domain, by simply converting the length of reversed segments (ms) into frequency (Hz).
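    Local time reversal as used in Experiment 1 is straightforward to sketch: split the signal into fixed-length segments and flip each one in place, leaving the segment order intact. A minimal sketch over a list of samples (segment length is given in samples here for illustration; the studies specify it in ms):

```python
def locally_time_reverse(signal, segment_len):
    """Reverse each consecutive segment_len-sample chunk of the signal,
    keeping the chunks themselves in their original order. A final
    shorter chunk is reversed as-is."""
    out = []
    for start in range(0, len(signal), segment_len):
        chunk = signal[start:start + segment_len]
        out.extend(reversed(chunk))
    return out

samples = [1, 2, 3, 4, 5, 6, 7]
print(locally_time_reverse(samples, 3))  # → [3, 2, 1, 6, 5, 4, 7]
```

    Converting a segment length in ms to samples is `int(seg_ms * sample_rate / 1000)`; the reciprocal of the segment duration is the rate at which reversals recur, which is the conversion the authors' closing question turns on.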

    An exploration of the rhythm of Malay

    In recent years there has been a surge of interest in speech rhythm. However, we still lack a clear understanding of the nature of rhythm and of rhythmic differences across languages. Various metrics have been proposed as means of measuring rhythm at the phonetic level and making typological comparisons between languages (Ramus et al., 1999; Grabe & Low, 2002; Dellwo, 2006), but debate is ongoing about the extent to which these metrics capture the rhythmic basis of speech (Arvaniti, 2009; Fletcher, in press). Furthermore, cross-linguistic studies of rhythm have covered a relatively small number of languages, and research on previously unclassified languages is necessary to fully develop the typology of rhythm. This study examines the rhythmic features of Malay, for which, to date, relatively little work has been carried out on aspects of rhythm and timing. The material for the analysis comprised 10 sentences produced by 20 speakers of standard Malay (10 males and 10 females). The recordings were first analysed using the rhythm metrics proposed by Ramus et al. (1999) and Grabe & Low (2002). These metrics (∆C, %V, rPVI, nPVI) are based on durational measurements of vocalic and consonantal intervals. The results indicated that Malay clustered with other so-called syllable-timed languages like French and Spanish on the basis of all metrics. However, underlying the overall findings for these metrics there was a large degree of variability in values across speakers and sentences, with some speakers having values in the range typical of stress-timed languages like English.
    Further analysis was carried out in light of Fletcher’s (in press) argument that measurements based on duration do not wholly reflect speech rhythm, as there are many other factors that can influence the values of consonantal and vocalic intervals, and Arvaniti’s (2009) suggestion that other features of speech should also be considered in descriptions of rhythm in order to discover what contributes to listeners’ perception of regularity. Spectrographic analysis of the Malay recordings brought to light two parameters that displayed consistency and regularity across all speakers and sentences: the duration of individual vowels and the duration of intervals between intensity minima. This poster presents the results of these investigations and points to connections between the features that seem to be consistently regulated in the timing of Malay connected speech and aspects of Malay phonology. The results are discussed in light of the current debate on descriptions of rhythm.
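    The rhythm metrics named above are simple functions of vocalic and consonantal interval durations: %V is the vocalic proportion of total duration, ∆C is the standard deviation of consonantal intervals, and nPVI is the mean normalised difference between successive intervals. A minimal sketch with toy durations (illustrative values only, not data from the study):

```python
from statistics import pstdev

def percent_v(vocalic, consonantal):
    """%V: vocalic proportion of total interval duration, in percent."""
    total = sum(vocalic) + sum(consonantal)
    return 100.0 * sum(vocalic) / total

def delta_c(consonantal):
    """∆C: standard deviation of consonantal interval durations."""
    return pstdev(consonantal)

def npvi(durations):
    """nPVI: mean pairwise difference between successive intervals,
    normalised by the local mean duration, scaled by 100."""
    pairs = list(zip(durations, durations[1:]))
    return 100.0 * sum(
        abs(a - b) / ((a + b) / 2.0) for a, b in pairs
    ) / len(pairs)

# Toy interval durations in ms.
vowels = [80.0, 120.0, 60.0, 140.0]
consonants = [90.0, 70.0, 110.0, 50.0]

print(round(percent_v(vowels, consonants), 1))  # → 55.6
```

    High ∆C and low %V are associated with so-called stress-timed languages, low ∆C and high %V with syllable-timed ones; the PVI normalisation makes the pairwise measure less sensitive to speech rate.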

    Factors affecting the perception of noise-vocoded speech: stimulus properties and listener variability.

    This thesis presents an investigation of two general factors affecting speech perception in normal-hearing adults. Two sets of experiments are described, in which speakers of English are presented with degraded (noise-vocoded) speech. The first set of studies investigates the importance of linguistic rhythm as a cue for perceptual adaptation to noise-vocoded sentences. Results indicate that the presence of native English rhythmic patterns benefits speech recognition and adaptation, but not when higher-level linguistic information is absent (i.e. when the sentences are in a foreign language). It is proposed that rhythm may help in the perceptual encoding of degraded speech in phonological working memory. Experiments in this strand also present evidence against a critical role for indexical characteristics of the speaker in the adaptation process. The second set of studies concerns the issue of individual differences in speech perception. A psychometric curve-fitting approach is selected as the preferred method of quantifying variability in noise-vocoded sentence recognition. Measures of working memory and verbal IQ are identified as candidate correlates of performance with noise-vocoded sentences. When the listener is exposed to noise-vocoded stimuli from different linguistic categories (consonants and vowels, isolated words, sentences), there is evidence for the interplay of two initial listening 'modes' in response to the degraded speech signal, representing 'top-down' cognitive-linguistic processing and 'bottom-up' acoustic-phonetic analysis. Detailed analysis of segment recognition reveals a perceptual role for temporal information across all the linguistic categories, and suggests that performance could be improved through training regimes that direct attention to the most informative acoustic properties of the stimulus. Across several experiments, the results also demonstrate long-term aspects of perceptual learning.
In sum, this thesis demonstrates that consideration of both stimulus-based and listener-based factors forms a promising approach to the characterization of speech perception processes in the healthy adult listener.
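    The psychometric curve-fitting approach mentioned above typically models proportion correct as a sigmoid of degradation level. A minimal sketch of a logistic psychometric function; the parameterisation and parameter names are generic assumptions, not the thesis's exact model:

```python
import math

def psychometric(x, threshold, slope, guess=0.0, lapse=0.0):
    """Logistic psychometric function: expected proportion correct at
    stimulus level x (e.g. number of vocoder channels). `guess` and
    `lapse` bound the curve away from 0 and 1; with both at zero, the
    function passes through 0.5 exactly at x == threshold."""
    core = 1.0 / (1.0 + math.exp(-slope * (x - threshold)))
    return guess + (1.0 - guess - lapse) * core

# With no guessing or lapses, performance is 50% at threshold.
print(psychometric(4.0, threshold=4.0, slope=1.2))  # → 0.5
```

    Fitting threshold and slope per listener (e.g. by maximum likelihood over trial data) compresses each listener's performance into two interpretable numbers, which is the kind of individual-difference measure that can then be correlated with working memory and verbal IQ scores.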