
    Radio2Text: Streaming Speech Recognition Using mmWave Radio Signals

    Millimeter wave (mmWave) based speech recognition opens up new possibilities for audio-related applications, such as conference speech transcription and eavesdropping. However, for practical use in real scenarios, latency and recognizable vocabulary size are two critical factors that cannot be overlooked. In this paper, we propose Radio2Text, the first mmWave-based system for streaming automatic speech recognition (ASR) with a vocabulary size exceeding 13,000 words. Radio2Text is based on a tailored streaming Transformer that effectively learns representations of speech-related features, paving the way for streaming ASR with a large vocabulary. To alleviate the limitation that streaming networks cannot access entire future inputs, we propose Guidance Initialization, which transfers feature knowledge related to the global context from a non-streaming Transformer to the tailored streaming Transformer through weight inheritance. Further, we propose a cross-modal structure based on knowledge distillation (KD), named cross-modal KD, to mitigate the negative effect of low-quality mmWave signals on recognition performance. In cross-modal KD, an audio streaming Transformer provides feature and response guidance that passes rich and accurate speech information on to supervise the training of the tailored radio streaming Transformer. Experimental results show that Radio2Text achieves a character error rate of 5.7% and a word error rate of 9.4% on a vocabulary of over 13,000 words. Comment: Accepted by Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (ACM IMWUT/UbiComp 2023).
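    The feature and response guidance in the cross-modal KD can be made concrete with a short sketch. The following is a minimal PyTorch illustration, assuming the guidance takes the common forms of an L2 feature-matching loss and a temperature-softened KL divergence over output distributions; the function name, loss weights, and temperature are illustrative assumptions, not the authors' implementation.

        # Hypothetical cross-modal KD objective: the audio (teacher) streaming
        # Transformer supervises the radio (student) streaming Transformer.
        import torch.nn.functional as F

        def cross_modal_kd_loss(student_feats, teacher_feats,
                                student_logits, teacher_logits,
                                asr_loss, alpha=0.5, beta=0.5, tau=2.0):
            """Task loss plus feature guidance and response guidance."""
            # Feature guidance: match intermediate representations (L2).
            feat_loss = F.mse_loss(student_feats, teacher_feats.detach())
            # Response guidance: match softened output distributions (KL).
            resp_loss = F.kl_div(
                F.log_softmax(student_logits / tau, dim=-1),
                F.softmax(teacher_logits.detach() / tau, dim=-1),
                reduction="batchmean",
            ) * tau * tau
            return asr_loss + alpha * feat_loss + beta * resp_loss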

    Speech-based recognition of self-reported and observed emotion in a dimensional space

    The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings, and by examining how differences between these two types of ratings affect the development and performance of automatic emotion recognizers trained on them. A dimensional approach to emotion modeling is adopted: the ratings are based on continuous arousal and valence scales. We describe the TNO-Gaming Corpus, which contains spontaneous vocal and facial expressions elicited via a multiplayer videogame and includes emotion annotations obtained via self-report and via observation by outside observers. Comparisons show that there are discrepancies between self-reported and observed emotion ratings, and that these discrepancies are reflected in the performance of the resulting emotion recognizers. Using Support Vector Regression in combination with acoustic and textual features, recognizers of arousal and valence are developed that can predict points in a two-dimensional arousal-valence space. The results show that self-reported emotion is much harder to recognize than observed emotion, and that averaging ratings from multiple observers improves performance.
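    To make the regression setup concrete, here is a minimal scikit-learn sketch of predicting points in the arousal-valence space with Support Vector Regression. The random placeholder arrays stand in for the corpus's acoustic and textual features and observer-averaged ratings; all names and shapes are illustrative assumptions, not the paper's pipeline.

        import numpy as np
        from sklearn.multioutput import MultiOutputRegressor
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVR

        # Placeholder data: one row per utterance (acoustic + textual features
        # concatenated); targets are continuous [arousal, valence] ratings.
        rng = np.random.default_rng(0)
        X = rng.random((200, 40))
        y = rng.uniform(-1.0, 1.0, size=(200, 2))

        # One SVR per emotion dimension predicts a point in the 2-D space.
        model = MultiOutputRegressor(
            make_pipeline(StandardScaler(), SVR(kernel="rbf")))
        model.fit(X, y)
        arousal, valence = model.predict(X[:1])[0]
        print(f"predicted arousal={arousal:.2f}, valence={valence:.2f}")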

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
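    As a toy illustration of how ASR output feeds standard text-IR machinery in an SCR pipeline, the sketch below indexes transcripts with TF-IDF and ranks them against a query. The transcripts are fabricated stand-ins for real ASR output; a real SCR system would additionally have to cope with recognition errors and time alignment.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        # Transcripts as they might come out of an ASR system (fabricated).
        transcripts = [
            "the committee opened the meeting with a budget review",
            "a podcast episode about retrieving spoken content",
            "informal conversation recorded outside the studio",
        ]

        vectorizer = TfidfVectorizer()
        index = vectorizer.fit_transform(transcripts)  # text-IR index over speech
        query = vectorizer.transform(["budget meeting"])
        scores = cosine_similarity(query, index)[0]    # rank spoken documents
        best = int(scores.argmax())
        print(f"top hit: document {best} (score {scores[best]:.2f})")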

    Coping with gradient forms of /t/-deletion and lexical ambiguity in spoken word recognition

    This study investigates how listeners cope with gradient forms of deletion of word-final /t/ when recognising words in a phonological context that makes /t/-deletion viable. A corpus study confirmed a high incidence of /t/-deletion in an /st#b/ context in Dutch. A discrimination study showed that differences between released /t/, unreleased /t/ and fully deleted /t/ in this specific /st#b/ context were salient. Two online experiments were carried out to investigate whether lexical activation might be affected by this form variation. Even though unreleased and released variants were processed equally fast by listeners, a detailed analysis of the unreleased condition provided evidence for gradient activation. Activating a target ending in /t/ is slowest for the most reduced variant, because the phonological context has to be taken into account. Importantly, activation for a target with /t/ in the absence of cues for /t/ is reduced if there is a surface-matching lexical competitor.

    Gesture and Speech in Interaction - 4th edition (GESPIN 4)

    The fourth edition of Gesture and Speech in Interaction (GESPIN) was held in Nantes, France. With more than 40 papers, these proceedings show just what a flourishing field of enquiry gesture studies continues to be. The keynote speeches of the conference addressed three different aspects of multimodal interaction: gesture and grammar, gesture acquisition, and gesture and social interaction. In a talk entitled Qualities of event construal in speech and gesture: Aspect and tense, Alan Cienki presented an ongoing research project on narratives in French, German and Russian, a project that focuses especially on the verbal and gestural expression of grammatical tense and aspect in narratives in the three languages. Jean-Marc Colletta's talk, entitled Gesture and Language Development: towards a unified theoretical framework, described the joint acquisition and development of speech and early conventional and representational gestures. In Grammar, deixis, and multimodality between code-manifestation and code-integration, or why Kendon's Continuum should be transformed into a gestural circle, Ellen Fricke proposed a revisited grammar of noun phrases that integrates gestures as part of the semiotic and typological codes of individual languages. From a pragmatic and cognitive perspective, Judith Holler explored the use of gaze and hand gestures as means of organizing turns at talk as well as establishing common ground in a presentation entitled On the pragmatics of multi-modal face-to-face communication: Gesture, speech and gaze in the coordination of mental states and social interaction.

    Among the talks and posters presented at the conference, the vast majority of topics related, quite naturally, to gesture and speech in interaction, understood both in terms of mapping of units in different semiotic modes and of the use of gesture and speech in social interaction. Several presentations explored the effects of impairments (such as diseases or the natural ageing process) on gesture and speech. The communicative relevance of gesture and speech and audience design in natural interactions, as well as in more controlled settings like television debates and reports, was another topic addressed during the conference. Some participants also presented research on first and second language learning, while others discussed the relationship between gesture and intonation. While most participants presented research on gesture and speech from an observer's perspective, be it in semiotics or pragmatics, some nevertheless focused on another important aspect: the cognitive processes involved in language production and perception. Last but not least, participants also presented talks and posters on the computational analysis of gestures, whether involving external devices (e.g. mocap, Kinect) or concerning the use of specially designed computer software for the post-treatment of gestural data. Importantly, new links were made between semiotics and mocap data.

    Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification

    There are a number of studies on extracting bottleneck (BN) features from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases and triphone states in order to improve the performance of text-dependent speaker verification (TD-SV). However, only moderate success has been achieved. A recent study [1] presented a time-contrastive learning (TCL) concept to explore the non-stationarity of brain signals for classification of brain states. Speech signals have a similar non-stationarity property, and TCL has the further advantage of requiring no labeled data. We therefore present a TCL-based BN feature extraction method. The method uniformly partitions each speech utterance in a training dataset into a predefined number of multi-frame segments. Each segment in an utterance corresponds to one class, and class labels are shared across utterances. DNNs are then trained to discriminate all speech frames among the classes to exploit the temporal structure of speech. In addition, we propose a segment-based unsupervised clustering algorithm to re-assign class labels to the segments. TD-SV experiments were conducted on the RedDots challenge database. The TCL-DNNs were trained using speech data of fixed pass-phrases that were excluded from the TD-SV evaluation set, so the learned features can be considered phrase-independent. We compare the performance of the proposed TCL-BN feature with those of short-time cepstral features and BN features extracted from DNNs discriminating speakers, pass-phrases, speaker+pass-phrase, as well as monophones whose labels and boundaries are generated by three different automatic speech recognition (ASR) systems. Experimental results show that the proposed TCL-BN outperforms cepstral features and speaker+pass-phrase discriminant BN features, and its performance is on par with those of ASR-derived BN features. Moreover, ...
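    The core labeling step of the method can be sketched in a few lines of Python. The sketch below assumes the uniform partitioning described above, with frame-level class labels shared across utterances; the function name, segment count and feature shapes are illustrative assumptions, and the DNN trained on these labels (whose bottleneck layer yields the TCL-BN features) is not shown.

        import numpy as np

        def tcl_labels(n_frames: int, n_segments: int = 10) -> np.ndarray:
            """Assign each frame of an utterance to one of n_segments classes."""
            seg_len = max(1, n_frames // n_segments)   # uniform partition
            labels = np.arange(n_frames) // seg_len
            return np.minimum(labels, n_segments - 1)  # shape: (n_frames,)

        # Example: a 95-frame utterance of 20-dimensional cepstral features.
        feats = np.random.randn(95, 20)
        labels = tcl_labels(len(feats))  # classes 0..9, shared across utterances
        # (feats, labels) pairs from all training utterances then train a
        # frame-level DNN classifier; its bottleneck activations are the
        # TCL-BN features used for TD-SV.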

    Alzheimer’s Dementia Recognition Through Spontaneous Speech
