24,752 research outputs found

    Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition

    This paper presents a self-supervised method for visual detection of the active speaker in multi-person spoken interactions. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement acoustic detection of the active speaker, improving system robustness in noisy conditions. It can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Furthermore, the method does not rely on external annotations, consistent with the constraints of cognitive development: instead, it uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method on a large multi-person face-to-face interaction dataset. The results show good performance in a speaker-dependent setting, but significantly lower performance in a speaker-independent setting. We believe the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions. (Comment: 10 pages, IEEE Transactions on Cognitive and Developmental Systems)
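
    The core idea, using audio to supervise visual learning, can be sketched as cross-modal pseudo-labelling: a voice-activity detector on the audio track labels time-aligned face crops, and a visual classifier is trained against those labels. The sketch below (Python/PyTorch) is a minimal illustration of that training scheme, not the paper's architecture; FaceEncoder and the audio_vad callback are hypothetical names.

        import torch
        import torch.nn as nn

        class FaceEncoder(nn.Module):
            """Small CNN mapping a face crop to a speaking/not-speaking score."""
            def __init__(self):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1),
                )
                self.head = nn.Linear(32, 1)

            def forward(self, x):
                return self.head(self.features(x).flatten(1))

        def train_step(model, optimizer, face_crops, audio_frames, audio_vad):
            """face_crops: (B, 3, H, W) crops; audio_frames: time-aligned audio;
            audio_vad: hypothetical detector returning (B,) 0/1 pseudo-labels."""
            pseudo_labels = audio_vad(audio_frames)   # supervision from audio
            logits = model(face_crops).squeeze(1)     # visual prediction
            loss = nn.functional.binary_cross_entropy_with_logits(
                logits, pseudo_labels.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return loss.item()

        # e.g. optimizer = torch.optim.Adam(FaceEncoder().parameters(), lr=1e-4)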

    A summary of the 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition

    We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding zero-resource (unsupervised) speech technologies and related models of early language acquisition. Centered around the tasks of phonetic and lexical discovery, we consider unified evaluation metrics, present two new approaches for improving speaker independence in the absence of supervision, and evaluate the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations. Finally, we present two strategies for integrating zero-resource techniques into supervised settings, demonstrating the potential of unsupervised methods to improve mainstream technologies. (5 pages)
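
    As a concrete illustration of the Bayesian word segmentation component, the toy sketch below scores candidate segmentations of a subword-unit sequence under a Dirichlet-process unigram lexicon model (in the style of Goldwater et al.) and greedily flips word boundaries to improve the score. Published systems use Gibbs sampling rather than greedy search, and every constant here is illustrative.

        import math
        from collections import Counter

        def log_p0(word, n_unit_types=50, p_stop=0.8):
            """Base distribution: geometric length, uniform choice of units."""
            k = len(word)
            return ((k - 1) * math.log(1 - p_stop) + math.log(p_stop)
                    + k * math.log(1.0 / n_unit_types))

        def log_prob(words, alpha=20.0):
            """Chinese-restaurant-process score of a word sequence."""
            counts, lp = Counter(), 0.0
            for i, w in enumerate(words):
                lp += math.log((counts[w] + alpha * math.exp(log_p0(w)))
                               / (i + alpha))
                counts[w] += 1
            return lp

        def segment(units, iters=20):
            """units: tuple of subword tokens; greedy search over boundaries."""
            bounds = [False] * (len(units) - 1)

            def words():
                out, start = [], 0
                for i, b in enumerate(bounds, 1):
                    if b:
                        out.append(units[start:i])
                        start = i
                out.append(units[start:])
                return out

            best = log_prob(words())
            for _ in range(iters):
                improved = False
                for i in range(len(bounds)):
                    bounds[i] = not bounds[i]      # propose flipping boundary i
                    cand = log_prob(words())
                    if cand > best:
                        best, improved = cand, True
                    else:
                        bounds[i] = not bounds[i]  # revert
                if not improved:
                    break
            return words()

        # Toy usage on a character sequence standing in for subword units:
        print(segment(tuple("lookatthedoggie")))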

    An exploration of sarcasm detection in children with Attention Deficit Hyperactivity Disorder

    This document is the Accepted Manuscript version of the following article: Amanda K. Ludlow, Eleanor Chadwick, Alice Morey, Rebecca Edwards, and Roberto Gutierrez, ‘An exploration of sarcasm detection in children with Attention Deficit Hyperactivity Disorder’, Journal of Communication Disorders, Vol. 70: 25-34, November 2017. Under embargo until 31 October 2019. The Version of Record is available at https://doi.org/10.1016/j.jcomdis.2017.10.003. The present research explored the ability of children with ADHD to distinguish between sarcasm and sincerity. Twenty-two children with a clinical diagnosis of ADHD were compared with 22 age- and verbal-IQ-matched typically developing children using the Social Inference–Minimal Test from The Awareness of Social Inference Test (TASIT; McDonald, Flanagan, & Rollins, 2002), which assesses an individual’s ability to interpret naturalistic social interactions containing sincerity, simple sarcasm and paradoxical sarcasm. Children with ADHD demonstrated specific deficits in comprehending paradoxical sarcasm, performing significantly less accurately than the typically developing children. While the two groups did not differ significantly in comprehending sarcasm based on the speaker’s intentions and beliefs, the children with ADHD were significantly less accurate when basing their judgements on the speaker’s feelings and on what the speaker had said. Results are discussed in light of the difficulties children with a clinical diagnosis of ADHD show in understanding complex social cues and non-literal language. The importance of pragmatic language skills for detecting social and emotional information is highlighted. (Peer reviewed)

    Speaker Normalization Using Cortical Strip Maps: A Neural Model for Steady-State Vowel Categorization

    Auditory signals of speech are speaker-dependent, but representations of language meaning are speaker-independent. The transformation from speaker-dependent to speaker-independent language representations enables speech to be learned and understood from different speakers. A neural model is presented that performs speaker normalization to generate a pitch-independent representation of speech sounds while preserving information about speaker identity. This speaker-invariant representation is categorized into unitized speech items, which input to sequential working memories whose distributed patterns can be categorized, or chunked, into syllable and word representations. The proposed model fits into an emerging model of auditory streaming and speech categorization. The auditory streaming and speaker normalization parts of the model both use multiple strip representations and asymmetric competitive circuits, suggesting that these two circuits arose from similar neural designs. The normalized speech items are rapidly categorized and stably remembered by Adaptive Resonance Theory circuits. Simulations use synthesized steady-state vowels from the Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)] vowel database and achieve accuracy rates similar to those of human listeners. These results are compared to behavioral data and other speaker normalization models. (Supported by the National Science Foundation (SBE-0354378) and the Office of Naval Research (N00014-01-1-0624).)
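
    A deliberately simplified cartoon of the normalization step: if a change of speaker acts (approximately) as a translation of the spectral pattern along a log-frequency axis, then shifting the pattern so that an estimated anchor such as pitch lands at a fixed reference position yields a normalized representation, while the shift itself retains the speaker-specific information. This sketches the strip-map intuition only, not the model's neural circuitry.

        import numpy as np

        def normalize_spectrum(log_freq_spectrum, pitch_bin, ref_bin=20):
            """Circularly shift a log-frequency spectrum so the estimated
            pitch bin lands at a fixed reference position."""
            return np.roll(log_freq_spectrum, ref_bin - pitch_bin)

        # Two schematic "speakers" producing the same vowel at different F0:
        # on a log-frequency axis the pitch change is modeled (here) as a
        # pure translation, so both normalize to the same pattern.
        spec = np.zeros(64)
        spec[[10, 25, 33]] = 1.0                     # F0 plus two formants
        low  = normalize_spectrum(spec, pitch_bin=10)
        high = normalize_spectrum(np.roll(spec, 5), pitch_bin=15)
        assert np.allclose(low, high)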

    Effects of simultaneous speech and sign on infants’ attention to spoken language

    Objectives: To examine the hypothesis that infants receiving a degraded auditory signal have more difficulty segmenting words from fluent speech if familiarized with the words presented in both speech and sign than if familiarized with the words presented in speech only. Study Design: Experiment using an infant-controlled visual preference procedure. Methods: Twenty 8.5-month-old normal-hearing infants completed testing. Infants were familiarized with repetitions of words in either the speech + sign (n = 10) or the speech only (n = 10) condition. Results: Infants were then presented with four six-sentence passages using an infant-controlled visual preference procedure. Every sentence in two of the passages contained the words presented in the familiarization phase, whereas none of the sentences in the other two passages contained familiar words. Infants exposed to the speech + sign condition looked at familiar word passages for 15.3 seconds and at nonfamiliar word passages for 15.6 seconds, t(9) = -0.130, p = .45. Infants exposed to the speech only condition looked at familiar word passages for 20.9 seconds and at nonfamiliar word passages for 15.9 seconds, a statistically significant difference, t(9) = 2.076, p = .03. Conclusions: Infants' ability to segment words from degraded speech is negatively affected when these words are initially presented in simultaneous speech and sign. The current study suggests that a decreased ability to segment words from fluent speech may contribute to the poorer performance of pediatric cochlear implant recipients in total communication settings on a wide range of spoken language outcome measures.

    A Formal Framework for Linguistic Annotation

    `Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions -- audio, video and/or physiological recordings -- or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, `named entity' identification, co-reference annotation, and so on. While there are several ongoing efforts to provide formats and tools for such annotations and to publish annotated linguistic databases, the lack of widely accepted standards is becoming a critical problem. Proposed standards, to the extent they exist, have focused on file formats. This paper focuses instead on the logical structure of linguistic annotations. We survey a wide variety of existing annotation formats and demonstrate a common conceptual core, the annotation graph. This provides a formal framework for constructing, maintaining and searching linguistic annotations, while remaining consistent with many alternative data structures and file formats. (Comment: 49 pages)
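
    The annotation graph can be made concrete as a small data structure: nodes are anchors into the signal (optionally time-stamped), and typed, labeled arcs span pairs of nodes, so tiers such as words and part-of-speech tags are simply arc types over shared nodes. The field names below are illustrative, not the paper's exact formalism.

        from dataclasses import dataclass, field
        from typing import Optional

        @dataclass(frozen=True)
        class Node:
            id: int
            time: Optional[float] = None   # anchor into the signal, if known

        @dataclass(frozen=True)
        class Arc:
            src: int      # source node id
            dst: int      # destination node id
            type: str     # annotation tier, e.g. "word", "pos", "phone"
            label: str

        @dataclass
        class AnnotationGraph:
            nodes: dict = field(default_factory=dict)
            arcs: list = field(default_factory=list)

            def add_node(self, nid, time=None):
                self.nodes[nid] = Node(nid, time)

            def annotate(self, src, dst, type_, label):
                self.arcs.append(Arc(src, dst, type_, label))

            def tier(self, type_):
                """All arcs on one annotation tier, in source-node order."""
                return sorted((a for a in self.arcs if a.type == type_),
                              key=lambda a: a.src)

        # Usage: a word and its part-of-speech tag over the same time span.
        g = AnnotationGraph()
        g.add_node(0, 0.00)
        g.add_node(1, 0.32)
        g.annotate(0, 1, "word", "hello")
        g.annotate(0, 1, "pos", "UH")
        print([a.label for a in g.tier("word")])     # ['hello']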