
    On the distribution of clicks and inbreaths in class presentations and spontaneous conversations: blending vocal and kinetic activities

    The present exploratory study compares the distribution of clicks and inbreaths in the productions of French students in two different communication settings (semi-read oral class presentations vs. spontaneous dyadic conversations). Grounded in a conversation analytic and discourse-pragmatic approach, mixing qualitative and quantitative methods, the study looks at the functions of clicks and inbreaths as well as accompanying kinetic behaviors (e.g. swallowing, facial expressions, hand movements) in discourse. Preliminary results show a higher rate of pre-utterance clicks and inbreaths during oral presentations, which reflects the type of talk produced (structured and clear, requiring planning and preparation). The qualitative analyses illustrate the ways speakers blend vocal and kinetic activities when producing clicks and inbreaths.
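
    A minimal sketch of the kind of quantitative comparison described above: per-speaker rates of pre-utterance clicks and inbreaths in the two settings. The annotations, field names, and counts below are illustrative assumptions, not the study's data.

    # Hypothetical sketch: compare pre-utterance click/inbreath rates across the
    # two settings (presentation vs. conversation). All values are invented.
    from collections import defaultdict

    # Each annotated event: (speaker, setting, event_type, is_pre_utterance)
    annotations = [
        ("S01", "presentation", "click", True),
        ("S01", "presentation", "inbreath", True),
        ("S01", "conversation", "click", False),
        ("S02", "conversation", "inbreath", True),
        # ... one tuple per annotated click or inbreath
    ]

    # Utterance counts per speaker and setting (assumed known from the corpus).
    utterance_counts = {("S01", "presentation"): 40, ("S01", "conversation"): 55,
                        ("S02", "conversation"): 60}

    pre_utterance_events = defaultdict(int)
    for speaker, setting, event, pre in annotations:
        if pre:
            pre_utterance_events[(speaker, setting)] += 1

    # Rate of pre-utterance clicks/inbreaths per utterance, by speaker and setting.
    for key, n_utts in utterance_counts.items():
        print(f"{key}: {pre_utterance_events[key] / n_utts:.2f} per utterance")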

    Distributional and Acoustic Characteristics of Filler Particles in German with Consideration of Forensic-Phonetic Aspects

    In this study, we investigate the use of the filler particles (FPs) uh, um, and hm, as well as glottal FPs and tongue clicks, by 100 male native German speakers in a corpus of spontaneous speech. For this purpose, the frequency distribution, FP duration, duration of pauses surrounding FPs, voice quality of FPs, and their vowel quality are investigated in two conditions, namely normal speech and Lombard speech. Speaker-specific patterns are investigated on the basis of twelve sample speakers. Our results show that tongue clicks and glottal FPs are as common as the typically described FPs and should be part of disfluency research. Moreover, the frequency of uh, um, and hm decreases in the Lombard condition, while the opposite is found for tongue clicks. Furthermore, along with the usual F1 increase, a considerable reduction in vowel space is found in the Lombard condition for the vowels in uh and um. A high degree of within- and between-speaker variation is found on the individual speaker level.
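
    As an illustration of the frequency comparison reported above, a small sketch (hypothetical tokens and speaking times, not the corpus data) of per-speaker filler-particle rates in the normal and Lombard conditions:

    # Hypothetical sketch: filler-particle rate per minute, by speaker and
    # condition (normal vs. Lombard). All tokens and durations are invented.
    from collections import Counter

    # One tuple per observed filler particle: (speaker, condition, fp_type)
    fp_tokens = [
        ("spk01", "normal", "uh"), ("spk01", "normal", "click"),
        ("spk01", "lombard", "click"), ("spk02", "lombard", "um"),
        # ... more annotated tokens
    ]

    # Speaking time in minutes per speaker and condition (assumed known).
    speaking_time = {("spk01", "normal"): 3.2, ("spk01", "lombard"): 2.9,
                     ("spk02", "lombard"): 3.5}

    counts = Counter((spk, cond) for spk, cond, _ in fp_tokens)
    for key, minutes in speaking_time.items():
        print(f"{key}: {counts[key] / minutes:.2f} FPs per minute")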

    Making Personal Digital Assistants Aware of What They Do Not Know

    Personal digital assistants (PDAs) are spoken (and typed) dialog systems that are expected to assist users without being constrained to a particular domain. Typically, it is possible to construct deep multi-domain dialog systems focused on a narrow set of head domains. For the long tail (or when the speech recognition is not correct), the PDA does not know what to do. Two common fallback approaches are to either acknowledge its limitation or show web search results. Either approach can severely undermine the user's trust in the PDA's intelligence if invoked at the wrong time. In this paper, we propose features that are helpful in predicting the right fallback response. We then use these features to construct dialog policies such that the PDA is able to correctly decide between invoking web search or acknowledging its limitation. We evaluate these dialog policies on real user logs gathered from a PDA, deployed to millions of users, using both offline (judged) and online (user-click) metrics. We demonstrate that our hybrid dialog policy significantly increases the accuracy of choosing the correct response, measured by analyzing click-rate in logs, and also enhances user satisfaction, measured by human evaluations of the replayed experience.
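
    A hedged sketch of the decision the paper describes: choosing between a web-search fallback and acknowledging a limitation for an unhandled turn. The feature names and the simple threshold rule below are assumptions for illustration; the paper builds its policy from features learned over real user logs rather than hand-written rules.

    # Illustrative fallback policy, not the paper's implementation: decide between
    # showing web search results and acknowledging a limitation. Features and
    # thresholds are invented for the sketch.
    from dataclasses import dataclass

    @dataclass
    class TurnFeatures:
        asr_confidence: float            # speech recognition confidence, 0..1
        domain_classifier_score: float   # best in-domain intent score, 0..1
        query_looks_informational: bool  # e.g. question words, named entities

    def fallback_policy(f: TurnFeatures) -> str:
        """Return 'web_search' or 'acknowledge_limitation' for an unhandled turn."""
        if f.asr_confidence < 0.4:
            # Likely a recognition error: web results would probably be irrelevant.
            return "acknowledge_limitation"
        if f.query_looks_informational and f.domain_classifier_score < 0.3:
            # Outside the head domains, but a sensible information need.
            return "web_search"
        return "acknowledge_limitation"

    print(fallback_policy(TurnFeatures(0.85, 0.1, True)))  # -> web_search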

    The phonetics of speech breathing: pauses, physiology, acoustics, and perception

    Speech is made up of a continuous stream of speech sounds that is interrupted by pauses and breathing. As phoneticians are primarily interested in describing the segments of the speech stream, pauses and breathing are often neglected in phonetic studies, even though they are vital for speech. The present work adds to a more detailed view of both pausing and speech breathing, with a special focus on the latter and the resulting breath noises, investigating their acoustic, physiological, and perceptual aspects. We present an overview of how a selection of corpora annotate pauses and pause-internal particles, as well as a recording setup that can be used for further studies on speech breathing. For pauses, this work emphasized their optionality and variability under different tempos, as well as the temporal composition of silence and breath noise in breath pauses.

    For breath noises, we first focused on acoustic and physiological characteristics: we explored the alignment of the onsets and offsets of audible breath noises with the start and end of expansion of both the rib cage and the abdomen. Further, we found similarities between speech breath noises and the aspiration phases of /k/, and found that breath noises may be produced with a more open and slightly more front place of articulation than realizations of schwa. We found positive correlations between acoustic and physiological parameters, suggesting that when speakers inhale faster, the resulting breath noises are more intense and produced further forward in the mouth. Inspecting the entire spectrum of speech breath noises, we found relatively flat spectra with several weak peaks. These peaks largely overlapped with resonances reported for inhalations produced with a central vocal tract configuration. We used 3D-printed vocal tract models representing four vowels and four fricatives to simulate in- and exhalations by reversing airflow direction. Airflow direction did not have a general effect across all models, but only affected those with high-tongue configurations, as opposed to the more open ones. Then, we compared inhalations produced with the schwa model to human inhalations in an attempt to approach the vocal tract configuration in speech breathing. There were some similarities; however, several complexities of human speech breathing that the models do not capture complicated the comparisons.

    In two perception studies, we investigated how much information listeners could auditorily extract from breath noises. First, we tested the categorization of breath noises into six types, based on airflow direction and airway usage (e.g. oral inhalation); around two thirds of all answers were correct. Second, we investigated how well breath noises could be used to discriminate between speakers and to extract coarse information on speaker characteristics, such as age (old/young) and sex (female/male). Listeners were able to distinguish between two breath noises coming from the same or different speakers in around two thirds of all cases. From a single breath noise, classification of sex was successful in around 64% of cases, while for age it was around 50%, suggesting that sex was more perceivable than age in breath noises. This work was funded by the Deutsche Forschungsgemeinschaft (DFG), project number 418659027: "Pause-internal phonetic particles in speech communication".
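
    A minimal sketch of the acoustic-physiological correlation described above, relating inhalation speed to breath-noise intensity. The numbers and parameter choices are invented for illustration, not the thesis data.

    # Illustrative sketch (invented measurements): Pearson correlation between
    # inhalation speed and breath-noise intensity, per breath noise.
    import math

    def pearson(x, y):
        """Pearson correlation coefficient of two equal-length sequences."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = math.sqrt(sum((a - mx) ** 2 for a in x))
        sy = math.sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    # Hypothetical per-breath values: inhalation speed (e.g. rib-cage expansion
    # per second, arbitrary units) and breath-noise intensity (dB).
    inhalation_speed = [1.2, 0.8, 1.9, 1.5, 0.6, 2.1]
    noise_intensity = [52.0, 47.5, 58.3, 55.1, 45.2, 60.4]

    print(f"r = {pearson(inhalation_speed, noise_intensity):.2f}")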

    DMRN+16: Digital Music Research Network One-day Workshop 2021

    DMRN+16: Digital Music Research Network One-day Workshop 2021, Queen Mary University of London, Tuesday 21st December 2021. Keynote 1: Prof. Sophie Scott, Director, Institute of Cognitive Neuroscience, UCL. Title: "Sound on the brain - insights from functional neuroimaging and neuroanatomy". Abstract: In this talk I will use functional imaging and models of primate neuroanatomy to explore how sound is processed in the human brain. I will demonstrate that sound is represented cortically in different parallel streams. I will expand this to show how this can impact on the concept of auditory perception, which arguably incorporates multiple kinds of distinct perceptual processes. I will address the roles that subcortical processes play in this, and also the contributions from hemispheric asymmetries. Keynote 2: Prof. Gus Xia, Assistant Professor at NYU Shanghai. Title: "Learning interpretable music representations: from human stupidity to artificial intelligence". Abstract: Gus has been leading the Music X Lab in developing intelligent systems that help people better compose and learn music. In this talk, he will show the importance of music representation for both humans and machines, and how to learn better music representations via the design of inductive bias. Once we have interpretable music representations, the potential applications are limitless.

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
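
    To make the combination of speech recognition and IR concrete, here is a minimal sketch in which ASR transcripts of spoken segments are indexed and ranked for a text query with a simple TF-IDF match. The transcripts, segment identifiers, and scoring are assumptions chosen for illustration, not components of any particular SCR system.

    # Minimal sketch: index ASR transcripts of spoken segments and rank them for a
    # text query with a simple TF-IDF score. All data below is invented.
    import math
    from collections import Counter

    segments = {
        "ep01_000s": "welcome to the show today we discuss speech retrieval",
        "ep01_120s": "spoken content retrieval combines speech recognition and ir",
        "ep02_045s": "the weather forecast for tomorrow is sunny",
    }

    def tokenize(text):
        return text.lower().split()

    doc_tokens = {sid: Counter(tokenize(txt)) for sid, txt in segments.items()}
    n_docs = len(segments)
    df = Counter(term for toks in doc_tokens.values() for term in toks)

    def score(query, sid):
        """TF-IDF style score of a query against one transcript segment."""
        toks = doc_tokens[sid]
        return sum(toks[t] * math.log(n_docs / df[t])
                   for t in tokenize(query) if t in toks)

    query = "speech retrieval"
    ranked = sorted(segments, key=lambda sid: score(query, sid), reverse=True)
    print(ranked[0])  # best-matching segment id, e.g. a time offset into an episode

    In a real SCR system the segments would come from a speech recognizer with word-level timestamps, and retrieval would also have to cope with recognition errors, which is part of what makes SCR harder than text IR.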

    The social stratification of clicks in English interaction

    This thesis investigates how phonetic clicks work as a sociolinguistic variable embedded in interaction, adding to the growing research on the social stratification of sounds on the margins of language. While clicks occur as phonemes in only a small number of languages, mainly in southern and eastern Africa, they are common as non-phonemic, interactional features in many languages, including English, and are anecdotally assumed to be used mostly to display a stance or attitude towards someone or something (e.g. Ogden 2013; Wright 2011; Gil 2013). There is also some evidence that clicks might vary in a similar way to more traditional linguistic variables (e.g. male and female speakers might perform clicks differently and at different rates; Ogden 2013; Pillion 2018). Previous work on clicks in English has shown that clicks can be produced with the full range of articulation (bilabial to alveolar-lateral) and occur alongside phonetic accompaniments, i.e. audible inbreath, creaky or nasal collocated speech, and/or particles such as uh and um (e.g. Wright 2011; Ogden 2013). Previous studies using Conversation Analysis have demonstrated that English clicks have two main interactional functions: sequence-managing functions in talk, e.g. marking word search, marking the shift from one speaker to another, or indexing the beginning of a new topic; and affect-laden functions, such as displaying disapproval, disagreement, or sympathy (Ogden 2013). Click presence in these interactional functions seems to vary. Clicks are rarely studied or discussed in conjunction with social factors, though there is some indication that clicks may vary according to region, style, and social factors (e.g. Ogden 2013; Moreno 2016). It remains unclear how click production or interactional function may vary according to speaker gender or age.

    This PhD thesis analyses clicks in a regional variety of Scottish English in Glasgow, by combining approaches from phonetics, variationist sociolinguistics, and Conversation Analysis. Specifically, it examines: (1) the phonetic form (i.e. auditorily identified place of articulation, acoustic characteristics such as spectral Centre of Gravity and duration, and phonetic accompaniments) and interactional function (sequence-managing or affect-laden) of Glasgow clicks; (2) how click phonetic form and interactional function vary according to linguistic and social factors in Glasgow; (3) how clicks in one particular interactional function, word search, are performed differently from clicks in other interactional functions, and whether linguistic or social factors promote click presence within word search. These research questions were investigated in the speech of a stratified sample of 50 native Glaswegian men and women between the ages of 17 and 60, who were recorded and filmed in same-gender, self-selected pairs. Participants were told to complain and tell stories of frustrating situations, in order to elicit stance-displaying clicks, which might have been less common in a different context (Moreno 2016). Clicks were identified auditorily and coded for place of articulation, phonetic accompaniments, position in the speaker's turn (i.e. before the turn, in the middle of the turn, after the turn, or in isolation), and interactional function. Word search sequences with and without a click were identified and transcribed using a strict set of criteria from Conversation Analysis (e.g. Goodwin and Goodwin 1986), and coded for phonetic accompaniments, in order to study the variation of clicks in their interactional context as well as how the interactional function itself varies with and without a click.

    Results revealed systematic patterning of clicks across phonetic features, interactional function, and age and gender. Glaswegian clicks were mostly produced with dental articulation and occurred with phonetic accompaniments. The presence of phonetic accompaniments was found to be subject to the interactional context, i.e. clicks with particles were more likely to be used in sequence-managing functions than in affect-laden functions. Acoustic analysis of clicks showed that spectral frequency can be used as a measure of click place of articulation, much as for phonemic stops (e.g. Chodroff and Wilson 2014), with dental, dento-alveolar, and alveolar clicks having the highest mean frequency, and labial, palatal, and alveolar-lateral clicks showing lower mean frequencies. Clicks' spectral frequency was constrained by speaker age, such that younger speakers produced clicks with higher spectral frequency, despite the lack of age-related physiological differences between younger and older speakers here. Click duration was also found to be indicative of the interactional function in which the click is embedded: clicks used in sequence-managing functions were shorter than affect-laden clicks. Clicks could have both sequence-managing and affect-laden functions, though sequence-managing functions were far more common. These interactional categories depended in complex ways on who performed the click and where; women were more likely than men to perform affect-laden clicks outside of their turn, i.e. as a listener response. Comparisons of word search clicks with clicks in other interactional functions revealed that the variation in clicks' phonetic patterning is largely due to the interactional function in which they occur; word search clicks were more likely to be produced with dental, dento-alveolar, or alveolar articulation, more likely to co-occur with a particle, and occurred more frequently mid-turn. When all word searches with and without clicks were examined, the presence of phonetic accompaniments did not vary according to whether or not a click was present for these Glaswegian speakers. However, age did play a role in who performs a click within an interactional function, such that younger speakers performed more word searches but fewer clicks than the older speakers. Together these results indicate that much of the phonetic form of clicks is due to the interactional function in which the click is embedded and to socio-indexical information about who produces the click. These findings highlight how interactional function can constrain phonetic variation in conjunction with social factors, demonstrating that examining phonetic features and social factors together with interactional context can contribute crucial information about variation in future research in phonetics, sociolinguistics, and Conversation Analysis.
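
    A minimal sketch (not the thesis code) of the acoustic measure mentioned above: the spectral centre of gravity of a click segment, computed as the amplitude-weighted mean frequency of its spectrum. The signal, sampling rate, and windowing choice are assumptions for illustration.

    # Illustrative sketch: spectral centre of gravity (CoG) of a short click
    # segment. The segment here is synthetic noise standing in for real audio.
    import numpy as np

    def spectral_cog(samples: np.ndarray, sample_rate: int) -> float:
        """Amplitude-weighted mean frequency of the segment's magnitude spectrum."""
        spectrum = np.abs(np.fft.rfft(samples * np.hanning(len(samples))))
        freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
        return float(np.sum(freqs * spectrum) / np.sum(spectrum))

    # Hypothetical click segment: 20 ms of noise-like signal at 44.1 kHz.
    sr = 44100
    click = np.random.default_rng(0).normal(size=int(0.02 * sr))
    print(f"CoG: {spectral_cog(click, sr):.0f} Hz")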

    Speech verification for computer assisted pronunciation training

    Computer assisted pronunciation training (CAPT) is an approach that uses computer technology and computer-based resources in teaching and learning pronunciation. It is also part of computer assisted language learning (CALL) technology, which has been widely applied to online learning platforms in recent years. This thesis deals with one of the central tasks in CAPT, i.e. speech verification. The goal is to provide a framework that identifies pronunciation errors in the speech data of second language (L2) learners and generates feedback with information and instruction for error correction. Furthermore, the framework is supposed to support adaptation to new L1-L2 language pairs with minimal adjustment and modification. The central result is a novel approach to L2 speech verification, which combines modern language technologies and linguistic expertise. For pronunciation verification, we select a set of L2 speech data, create alias phonemes from the errors annotated by linguists, then train an acoustic model with mixed L2 and gold standard data and perform HTK (Hidden Markov Toolkit) phoneme recognition to identify the error phonemes. For prosody verification, FD-PSOLA (pitch synchronous overlap and add) and dynamic time warping are both applied to verify differences in duration, pitch, and stress. Feedback is generated for both verifications. Our feedback is presented to learners not only visually, as in other existing CAPT systems, but also perceptually, by synthesizing the learner's own audio; e.g. for prosody verification, the gold standard prosody is transplanted onto the learner's own voice. The framework is self-adaptable under semi-supervision and requires only a certain amount of mixed gold standard and annotated L2 speech data for bootstrapping. Verified speech data is validated by linguists, annotated in case of wrong verification, and used in the next iteration of training. The Mary Annotation Tool (MAT) is developed as an open-source component of MARYTTS for both annotating and validating. To deal with uncertain pauses and interruptions in L2 speech, the silence model in HTK is also adapted and used in all components of the framework where forced alignment is required. Various evaluations are conducted that help us obtain insights into the applicability and potential of our CAPT system. The pronunciation verification shows high accuracy in both precision and recall, and encourages us to acquire more error-annotated L2 speech data to enhance the trained acoustic model. To test the effect of feedback, a progressive evaluation is carried out; it shows that our perceptual feedback helps learners realize errors that they could not otherwise observe from visual feedback and textual instructions. In order to improve the user interface, a questionnaire is also designed to collect the learners' experiences and suggestions.
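
    As a hedged sketch of the prosody comparison mentioned above (not the thesis implementation), the following computes a dynamic time warping cost between a learner's pitch contour and a gold-standard contour; the F0 values and frame rate are invented.

    # Illustrative sketch: DTW distance between two pitch contours, the kind of
    # duration/pitch comparison used in prosody verification. Data is invented.
    def dtw_distance(a, b):
        """Classic DTW with absolute difference as the local cost."""
        inf = float("inf")
        n, m = len(a), len(b)
        cost = [[inf] * (m + 1) for _ in range(n + 1)]
        cost[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(a[i - 1] - b[j - 1])
                cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                     cost[i][j - 1],      # deletion
                                     cost[i - 1][j - 1])  # match
        return cost[n][m]

    # Hypothetical F0 contours in Hz, one value per 10 ms frame.
    gold_f0 = [180, 185, 190, 200, 195, 185, 175]
    learner_f0 = [170, 172, 180, 182, 190, 188, 180, 176]

    print(f"DTW cost: {dtw_distance(gold_f0, learner_f0):.1f}")

    A large warping cost would indicate that the learner's contour deviates substantially from the gold standard and could be used to trigger the kind of perceptual feedback described above.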