
    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
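The combination the survey describes, speech processing feeding into text IR, can be made concrete with a toy pipeline: hypothetical 1-best ASR transcripts are indexed like text documents and queried with TF-IDF scoring. This is an illustrative sketch, not an implementation from the survey; the transcripts, document ids, and scoring choices are all invented.

```python
# Minimal sketch of an SCR pipeline: spoken documents pass through ASR,
# the (errorful) transcripts are indexed, and standard text-IR scoring is
# applied. The transcripts here are hypothetical stand-ins for ASR output.
import math
from collections import Counter, defaultdict

# Hypothetical 1-best ASR transcripts, keyed by spoken-document id.
transcripts = {
    "ep01": "speech retrieval combines recognition with text indexing",
    "ep02": "spontaneous conversational speech is harder to recognise",
}

# Build an inverted index over the transcript terms.
index = defaultdict(dict)            # term -> {doc_id: term frequency}
for doc_id, text in transcripts.items():
    for term, tf in Counter(text.split()).items():
        index[term][doc_id] = tf

def search(query):
    """Rank spoken documents by a simple TF-IDF score over transcripts."""
    n_docs = len(transcripts)
    scores = Counter()
    for term in query.split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return scores.most_common()

print(search("conversational speech retrieval"))
```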

    Rhythmic unit extraction and modelling for automatic language identification

    This paper deals with an approach to Automatic Language Identification based on rhythmic modelling. Besides phonetics and phonotactics, rhythm is one of the most promising features to consider for language identification, even though its extraction and modelling are not straightforward. One of the main problems to address is what to model. In this paper, an algorithm for rhythm extraction is described: using a vowel detection algorithm, rhythmic units related to syllables are segmented. Several parameters are extracted (consonantal and vowel duration, cluster complexity) and modelled with a Gaussian mixture. Experiments are performed on read speech for 7 languages (English, French, German, Italian, Japanese, Mandarin and Spanish). Results reach up to 86 ± 6% correct discrimination between stress-timed, mora-timed and syllable-timed classes of languages, and 67 ± 8% correct language identification on average for the 7 languages with utterances of 21 seconds. These results are discussed and compared with those obtained with a standard acoustic Gaussian mixture modelling approach (88 ± 5% correct identification for the 7-language identification task).
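The modelling recipe in this abstract (per-unit duration and complexity features, one Gaussian mixture per language, maximum-likelihood decision) can be sketched as follows. The feature values, component count, and training data below are illustrative assumptions, not the paper's actual configuration.

```python
# Hedged sketch of rhythm-based language ID: each rhythmic unit yields a
# small feature vector (consonantal duration, vocalic duration, cluster
# complexity), one Gaussian mixture is fitted per language, and an utterance
# is assigned to the language whose mixture scores it highest.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulated per-unit features per language:
# [consonant duration (s), vowel duration (s), consonant-cluster size]
train_data = {
    "english": rng.normal([0.12, 0.09, 2.0], 0.03, size=(200, 3)),
    "spanish": rng.normal([0.08, 0.11, 1.2], 0.03, size=(200, 3)),
}

models = {
    lang: GaussianMixture(n_components=4, random_state=0).fit(feats)
    for lang, feats in train_data.items()
}

def identify(units):
    """Pick the language whose GMM gives the highest mean log-likelihood."""
    return max(models, key=lambda lang: models[lang].score(units))

test_utterance = rng.normal([0.08, 0.11, 1.2], 0.03, size=(30, 3))
print(identify(test_utterance))  # expected: "spanish"
```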

    Effects of variance and input distribution on the training of L2 learners' tone categorization

    Recent psycholinguistic findings have shown that (a) a multi-modal phonetic training paradigm that encodes visual, interactive information is more effective in training L2 learners' perception of novel categories, (b) decreasing the acoustic variance of a phonetic dimension allows learners to shift perceptual weight towards this dimension more effectively, and (c) using an implicit word learning task in which the words are contrasted with different lexical tones improves naïve listeners' categorization of Mandarin Chinese tones. This dissertation investigates the effectiveness of video game training, variance manipulation and high variability training in the context of implicit word learning, in which American English speakers without any tone language experience learn four Mandarin Chinese tones by playing a video game. A video game was created in which each of four different animals is associated with a Chinese tone. The task for the participants is to select each animal's favorite food to feed it. At the beginning of the game, each animal is clearly visible. As the game progresses, the images of the animals become increasingly blurred and eventually visually indistinguishable. However, the four Chinese tones associated with the animals are played throughout the game, so the participants must rely on the auditory information to clear the difficult levels. In terms of the training stimuli, the tone tokens were manipulated to have greater variance on the pitch height dimension but smaller variance on the pitch direction dimension, in order to shift the English listeners' perception to pitch direction, a dimension that native Chinese listeners crucially rely on. A variety of pretests and posttests were used to investigate both the English speakers' perception of the tones and their weighting of the acoustic dimensions. These training stimuli were compared to other types of training stimuli used in the literature, such as high variability natural stimuli and tones embedded in non-minimal pairs. A group of native English speakers served as a control group without any tone input, and a native control group was also included. The video game training for each speaker consisted of four 30-minute sessions on four different days, and 60 participants (including both the non-native and native control groups) took part in the experiments.
The crucial findings of the study include: (1) all naïve listeners in the training condition successfully associated lexical tones with different animals, without any explicit feedback, after only 2 hours of training; (2) both the resynthesized stimuli with smaller variance on pitch direction and the multi-talker stimuli allowed native English speakers to shift their cue-weighting toward pitch direction, and the multi-talker stimuli were more robust in shifting the cue-weighting despite their more heterogeneous distribution in the acoustic space; (3) the multi-talker training allowed better generalization, as trainees in multi-talker training identified tones produced by new talkers better than trainees in other conditions; (4) there was a main effect of tone on tone identification, and the falling tone was the most challenging; (5) there was a correlation between cue-weighting and tone discrimination performance before and after the training; (6) individuals differed in the amount of tone input they received during the video game training, and the number of tone tokens was a significant predictor of sensitivity to tones calculated as d'. Overall, the study showed an effect of talker variability and of the variances of a multidimensional acoustic space on English speakers' cue-weighting for tone perception and their tone categorization.
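Since sensitivity is reported as d', the standard signal-detection computation may help readers unfamiliar with the measure. This is the textbook formula, not code from the dissertation, and the response counts below are invented for illustration.

```python
# d' = z(hit rate) - z(false-alarm rate), with the usual correction that
# keeps rates of 0 or 1 away from the edges of the normal quantile function.
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Signal-detection sensitivity from a 2x2 count table."""
    def rate(k, n):
        # Clamp away from 0 and 1 so norm.ppf stays finite.
        return min(max(k / n, 0.5 / n), 1 - 0.5 / n)
    hit_rate = rate(hits, hits + misses)
    fa_rate = rate(false_alarms, false_alarms + correct_rejections)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Illustrative counts for one listener's tone-discrimination block.
print(round(d_prime(hits=42, misses=8,
                    false_alarms=12, correct_rejections=38), 2))
```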

    A syllable-based investigation of coarticulation

    Coarticulation has long been investigated in Speech Sciences and Linguistics (Kühnert & Nolan, 1999). This thesis explores coarticulation through a syllable-based model (Y. Xu, 2020). First, it is hypothesised that consonant and vowel are synchronised at the syllable onset in order to reduce temporal degrees of freedom, and that such synchronisation is the essence of coarticulation. Previous examinations of CV alignment mainly report onset asynchrony (Gao, 2009; Shaw & Chen, 2019). The first study of this thesis tested the synchrony hypothesis using articulatory and acoustic data in Mandarin. Departing from conventional approaches, a minimal triplet paradigm was applied, in which the CV onsets were determined through the consonant and vowel minimal pairs, respectively. Both articulatory and acoustic results showed that CV articulation started in close temporal proximity, supporting the synchrony hypothesis. The second study extended the research to English and to syllables with cluster onsets. Using acoustic data in conjunction with deep learning, supporting evidence was found for co-onset, in contrast to the widely reported c-center effect (Byrd, 1995). Second, the thesis investigated the mechanism that can maximise synchrony, Dimension Specific Sequential Target Approximation (DSSTA), which is highly relevant to what is commonly known as coarticulation resistance (Recasens & Espinosa, 2009). Evidence from the first two studies shows that, when conflicts arise from the articulation requirements of consonant and vowel, the CV gestures can be fulfilled by the same articulator on separate dimensions simultaneously. Finally, the last study tested the hypothesis that resyllabification is the result of coarticulation asymmetry between onset and coda consonants. It was found that neural-network-based models could infer the syllable affiliation of consonants, and that the inferred resyllabified codas had a coarticulatory structure similar to that of canonical onset consonants. In conclusion, this thesis found that many coarticulation-related phenomena, including local vowel-to-vowel anticipatory coarticulation, coarticulation resistance, and resyllabification, stem from the articulatory mechanism of the syllable.
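The final study's approach, inferring a consonant's syllable affiliation with a learned classifier and checking whether resyllabified codas pattern with onsets, can be caricatured as below. The two-dimensional feature space, the simulated distributions, and the use of a small MLP are assumptions for illustration; the thesis's actual models and features differ.

```python
# Toy version of syllable-affiliation inference: train a classifier on
# coarticulatory features of canonical onsets vs. codas, then probe a
# consonant from a potential resyllabification context.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
# Hypothetical per-consonant features (e.g. formant-transition slope and
# CV-onset lag); onsets and codas get different simulated distributions.
onsets = rng.normal([0.8, 0.1], 0.2, size=(150, 2))
codas = rng.normal([0.2, 0.5], 0.2, size=(150, 2))
X = np.vstack([onsets, codas])
y = np.array(["onset"] * 150 + ["coda"] * 150)

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X, y)

# If the model labels a word-final consonant before a vowel as "onset",
# its coarticulatory structure patterns with canonical onsets.
print(clf.predict([[0.75, 0.15]]))
```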

    Cloud-based Automatic Speech Recognition Systems for Southeast Asian Languages

    This paper provides an overall introduction to our Automatic Speech Recognition (ASR) systems for Southeast Asian languages. As little existing work has been carried out on such regional languages, a few difficulties must be addressed before building the systems: limited speech and text resources, lack of linguistic knowledge, etc. This work takes Bahasa Indonesia and Thai as examples to illustrate the strategies for collecting the various resources required for building ASR systems.
    Comment: Published at the 2017 IEEE International Conference on Orange Technologies (ICOT 2017).
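One resource-collection step such low-resource ASR work typically involves is building a pronunciation lexicon. The sketch below shows a toy greedy grapheme-to-phoneme pass for a language with fairly transparent orthography; the rule table is invented for illustration and is not the paper's actual Bahasa Indonesia or Thai lexicon.

```python
# Toy rule-based grapheme-to-phoneme conversion for lexicon bootstrapping.
G2P_RULES = {  # grapheme -> phoneme symbol, longest match wins
    "ng": "N", "ny": "J", "a": "a", "i": "i", "u": "u",
    "e": "@", "o": "o", "k": "k", "t": "t", "s": "s", "m": "m", "n": "n",
}

def g2p(word):
    """Greedy longest-match grapheme-to-phoneme conversion."""
    phones, i = [], 0
    graphemes = sorted(G2P_RULES, key=len, reverse=True)
    while i < len(word):
        for g in graphemes:
            if word.startswith(g, i):
                phones.append(G2P_RULES[g])
                i += len(g)
                break
        else:
            i += 1  # skip graphemes not covered by the toy rule set
    return " ".join(phones)

# Build lexicon entries for a word list scraped from text resources.
for word in ["makan", "nyanyi"]:
    print(word, g2p(word))
```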