434 research outputs found

Detecting autism, emotions and social signals using AdaBoost

Authors: Busa-Fekete Róbert; Gosztolya Gábor; Tóth László
Publication venue: Interspeech
Publication date: 01/01/2013

SZTE Publicatio Repozitórium - SZTE - Repository of Publications

Current trends in multilingual speech processing

Authors: Bourlard Hervé; Dines John; Garner Philip; Imseng David; Liang Hui; Magimai-Doss Mathew; Motlicek Petr; Saheer Lakshmi; Valente Fabio
Publication date: 18/06/2018

In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS) as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies at the heart of which lies multilingual speech processing.

RERO DOC Digital Library

TaL: a synchronised multi-speaker corpus of ultrasound tongue imaging, audio, and lip videos

Authors: Eshky Aciel; Renals Steve; Ribeiro Manuel Sam; Richmond Korin; Sanger Jennifer; Wrench Alan; Zhang Jing-Xuan
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 19/11/2020

We present the Tongue and Lips corpus (TaL), a multi-speaker corpus of audio, ultrasound tongue imaging, and lip videos. TaL consists of two parts: TaL1 is a set of six recording sessions of one professional voice talent, a male native speaker of English; TaL80 is a set of recording sessions of 81 native speakers of English without voice talent experience. Overall, the corpus contains 24 hours of parallel ultrasound, video, and audio data, of which approximately 13.5 hours are speech. This paper describes the corpus and presents benchmark results for the tasks of speech recognition, speech synthesis (articulatory-to-acoustic mapping), and automatic synchronisation of ultrasound to audio. The TaL corpus is publicly available under the CC BY-NC 4.0 license. Comment: 8 pages, 4 figures, accepted to SLT2021, IEEE Spoken Language Technology Workshop.

arXiv.org e-Print Archive

Edinburgh Research Explorer

Statistical Parametric Methods for Articulatory-Based Foreign Accent Conversion

Author: Aryal Sandesh
Publication date: 04/05/2016

Foreign accent conversion seeks to transform utterances from a non-native speaker (L2) to appear as if they had been produced by the same speaker but with a native (L1) accent. Such accent-modified utterances have been suggested to be effective in pronunciation training for adult second language learners. Accent modification involves separating the linguistic gestures and voice-quality cues from the L1 and L2 utterances, then transposing them across the two speakers. However, because of the complex interaction between these two sources of information, their separation in the acoustic domain is not straightforward. As a result, vocoding approaches to accent conversion result in a voice that is different from both the L1 and L2 speakers. In contrast, separation in the articulatory domain is straightforward since linguistic gestures are readily available via articulatory data. However, because of the difficulty in collecting articulatory data, conventional synthesis techniques based on unit selection are ill-suited for accent conversion given the small size of articulatory corpora and the inability to interpolate missing native sounds in the L2 corpus. To address these issues, this dissertation presents two statistical parametric methods for accent conversion that operate in the acoustic and articulatory domains, respectively. The acoustic method uses a cross-speaker statistical mapping to generate L2 acoustic features from the trajectories of L1 acoustic features in a reference utterance. Our results show significant reductions in the perceived non-native accents compared to the corresponding L2 utterance. The results also show strong voice similarity between accent conversions and the original L2 utterance. Our second (articulatory-based) approach consists of building a statistical parametric articulatory synthesizer for a non-native speaker, then driving the synthesizer with the articulators from the reference L1 speaker.
This statistical approach not only has low data requirements but also has the flexibility to interpolate missing sounds in the L2 corpus. In a series of listening tests, articulatory accent conversions were rated more intelligible and less accented than their L2 counterparts. In the final study, we compare the two approaches: acoustic and articulatory. Our results show that the articulatory approach, despite the direct access to the native linguistic gestures, is less effective in reducing perceived non-native accents than the acoustic approach.

Texas A&M Repository

Incremental Syllable-Context Phonetic Vocoding

Authors: Cernak Milos; Garner Philip N.; Lazaridis Alexandros; Motlicek Petr; Na Xingyu
Publication venue: Idiap
Publication date: 19/03/2015

Current very low bit rate speech coders are, due to complexity limitations, designed to work off-line. This paper investigates incremental speech coding that operates in real time and incrementally (i.e., encoded speech depends only on already-uttered speech without the need of future speech information). Since human speech communication is asynchronous (i.e., different information flows being simultaneously processed), we hypothesised that such an incremental speech coder should also operate asynchronously. To accomplish this task, we describe speech coding that reflects the human cortical temporal sampling that packages information into units of different temporal granularity, such as phonemes and syllables, in parallel. More specifically, a phonetic vocoder (cascaded speech recognition and synthesis systems) extended with syllable-based information transmission mechanisms is investigated. Two main aspects are evaluated in this work: synchronous and asynchronous coding. Synchronous coding refers to the case when the phonetic vocoder and speech generation process depend on the syllable boundaries during encoding and decoding respectively. On the other hand, asynchronous coding refers to the case when the phonetic encoding and speech generation processes are done independently of the syllable boundaries. Our experiments confirmed that the asynchronous incremental speech coding performs better, in terms of intelligibility and overall speech quality, mainly due to better alignment of the segmental and prosodic information. The proposed vocoding operates at an uncompressed bit rate of 213 bits/sec and achieves an average communication delay of 243 ms.

Infoscience - École polytechnique fédérale de Lausanne

An exploration of the rhythm of Malay

Authors: Docherty G. J.; Samoylova Ekaterina; Wan Ahmad Wan Aslynn Salwani
Publication date: 01/01/2010

In recent years there has been a surge of interest in speech rhythm. However, we still lack a clear understanding of the nature of rhythm and rhythmic differences across languages. Various metrics have been proposed as means for measuring rhythm on the phonetic level and making typological comparisons between languages (Ramus et al., 1999; Grabe & Low, 2002; Dellwo, 2006) but the debate is ongoing on the extent to which these metrics capture the rhythmic basis of speech (Arvaniti, 2009; Fletcher, in press). Furthermore, cross-linguistic studies of rhythm have covered a relatively small number of languages and research on previously unclassified languages is necessary to fully develop the typology of rhythm. This study examines the rhythmic features of Malay, for which, to date, relatively little work has been carried out on aspects of rhythm and timing. The material for the analysis comprised 10 sentences produced by 20 speakers of standard Malay (10 males and 10 females). The recordings were first analysed using rhythm metrics proposed by Ramus et al. (1999) and Grabe & Low (2002). These metrics (∆C, %V, rPVI, nPVI) are based on durational measurements of vocalic and consonantal intervals. The results indicated that Malay clustered with other so-called syllable-timed languages like French and Spanish on the basis of all metrics. However, underlying the overall findings for these metrics there was a large degree of variability in values across speakers and sentences, with some speakers having values in the range typical of stress-timed languages like English.
Further analysis has been carried out in light of Fletcher’s (in press) argument that measurements based on duration do not wholly reflect speech rhythm as there are many other factors that can influence values of consonantal and vocalic intervals, and Arvaniti’s (2009) suggestion that other features of speech should also be considered in descriptions of rhythm to discover what contributes to listeners’ perception of regularity. Spectrographic analysis of the Malay recordings brought to light two parameters that displayed consistency and regularity for all speakers and sentences: the duration of individual vowels and the duration of intervals between intensity minima. This poster presents the results of these investigations and points to connections between the features which seem to be consistently regulated in the timing of Malay connected speech and aspects of Malay phonology. The results are discussed in light of current debate on the descriptions of rhythm.
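
For readers unfamiliar with the four duration-based metrics named in this abstract (∆C, %V, rPVI, nPVI), the following is a minimal sketch of how they are commonly computed from lists of interval durations. It is an illustration only, not the authors' analysis code: the function names are hypothetical, durations are assumed to be in a single consistent unit, and the population standard deviation is assumed for ∆C.

```python
from statistics import mean, pstdev

def percent_v(vocalic, consonantal):
    """%V: percentage of total duration occupied by vocalic intervals."""
    total = sum(vocalic) + sum(consonantal)
    return 100.0 * sum(vocalic) / total

def delta_c(consonantal):
    """Delta-C: standard deviation of consonantal interval durations.
    Assumption: population standard deviation (pstdev)."""
    return pstdev(consonantal)

def rpvi(durations):
    """Raw PVI: mean absolute difference between successive intervals."""
    pairs = zip(durations, durations[1:])
    return mean([abs(a - b) for a, b in pairs])

def npvi(durations):
    """Normalised PVI: each successive difference is divided by the
    mean duration of the pair, then averaged and scaled by 100."""
    pairs = zip(durations, durations[1:])
    return 100.0 * mean([abs(a - b) / ((a + b) / 2) for a, b in pairs])
```

Each PVI function needs at least two intervals. A perfectly regular sequence of intervals gives a PVI of zero, and larger values indicate more variability between neighbouring intervals.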

The International Islamic University Malaysia Repository

Automatic Pronunciation Assessment -- A Review

Authors: Ali Ahmed; Chowdhury Shammur Absar; Kheir Yassine El
Publication date: 21/10/2023

Pronunciation assessment and its application in computer-aided pronunciation training (CAPT) have seen impressive progress in recent years. With the rapid growth in language processing and deep learning over the past few years, there is a need for an updated review. In this paper, we review methods employed in pronunciation assessment for both phonemic and prosodic aspects. We categorize the main challenges observed in prominent research trends, and highlight existing limitations and available resources. This is followed by a discussion of the remaining challenges and possible directions for future work. Comment: 9 pages, accepted to EMNLP Findings.

arXiv.org e-Print Archive

The production and perception of domain-initial strengthening in Seoul, Busan, and Ulsan Korean

Author: Yoo Kayeon
Publication venue: University of Cambridge
Publication date: 01/02/2020

Korean exhibits one of the most consistent examples of the cross-linguistic phenomenon of domain-initial strengthening (hereafter DIS; T. Cho & Keating, 2001; Keating, Cho, Fougeron, & Hsu, 2004). DIS is defined as temporal and/or spatial enhancement of segmental articulation in the initial position of prosodic domains. Broadly, this dissertation serves as a detailed case study of the production patterns and the perceptual benefits of this phenomenon. The recent findings of denasalisation and devoicing of the initial nasals in Korean (Young Shin Kim, 2011; Yoo, 2015a) suggest that there is a striking parallelism between the lenis stops /p, t, k/ and the nasal consonants /m, n/ in their patterns of DIS. Nevertheless, we currently lack an account that captures this parallelism. In addition, there is disagreement over the categorical nature of lenis stop voicing (S.-A. Jun, 1993; Docherty, 1995) and denasalisation (Yoshida, 2008; Young Shin Kim, 2011). Despite the obvious similarities between the arguably discrete processes of lenis stop voicing and denasalisation, and the kind of gradient effects widely reported for DIS, there has been no explicit investigation of the links among them. Thus, I examined the hypothesis that DIS, operating in the phonetic component, has given rise to the categorical rules of lenis stop voicing and denasalisation in the phrase-level phonology through rule scattering, as predicted by the theory of the life cycle of phonological processes (Bermúdez-Otero & Trousdale, 2012; Turton, 2014). Recordings were collected in Seoul, Busan, and Ulsan, and various auditory and acoustic analyses were conducted to examine the phonetic variation of the relevant stops. The study adopted the three-city design as these varieties were expected to be at different stages in the life cycle, particularly with regard to the stabilisation of denasalisation. 
In the second part of this dissertation, I conducted a perception experiment to investigate whether listeners are able to use DIS patterns as a cue to a prosodic boundary. According to the results, Seoul showed the most advanced patterns in the stabilisation of DIS. As predicted by rule scattering, speakers who showed evidence of categorical lenis stop voicing and/or denasalisation also showed an overlaid effect of a gradient phonetic process. The perception study strongly supported the hypothesis that listeners exploit DIS cues to detect the beginning of a prosodic domain. Based on these findings, this dissertation offers a unified account of lenis stop voicing, denasalisation, and DIS within a single framework, offering insights into the nature of DIS as well as its functional role in prosodic parsing. Cambridge Trust International Scholarship.

Apollo (Cambridge)

Methods in prosody

This book presents a collection of pioneering papers reflecting current methods in prosody research with a focus on Romance languages. The rapid expansion of the field of prosody research in the last decades has given rise to a proliferation of methods that has left little room for the critical assessment of these methods. The aim of this volume is to bridge this gap by embracing original contributions, in which experts in the field assess, reflect on, and discuss different methods of data gathering and analysis. The book might thus be of interest to scholars and established researchers as well as to students and young academics who wish to explore the topic of prosody, an expanding and promising area of study.

OAPEN Library

Incremental Syllable-Context Phonetic Vocoding

Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

Crossref