
    Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition

    This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system's robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Furthermore, the method does not rely on external annotations and thus remains consistent with cognitive development. Instead, it uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction dataset. The results show good performance in a speaker-dependent setting, whereas in a speaker-independent setting the method yields significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions.
    Comment: 10 pages, IEEE Transactions on Cognitive and Developmental Systems
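    The abstract does not include an implementation, but the core idea, using the auditory modality to generate training signals for a visual active-speaker detector, can be sketched as follows. This is a minimal, hypothetical illustration with synthetic placeholder data: the energy-threshold voice-activity rule, the feature dimensionality, and the logistic-regression classifier are assumptions, not the authors' architecture.

```python
# Minimal sketch (not the paper's method): audio-derived pseudo-labels
# supervise a visual-only active-speaker classifier. All data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_frames = 2000

# Placeholder inputs: per-frame audio energy for one speaker's channel and
# per-frame visual features extracted from that speaker's face crop.
audio_energy = rng.gamma(shape=2.0, scale=1.0, size=n_frames)
visual_features = rng.normal(size=(n_frames, 32))

# Step 1: pseudo-labels from audio alone (simple energy-threshold VAD).
# No manual annotation is involved, which is the self-supervised aspect.
speaking = (audio_energy > np.quantile(audio_energy, 0.7)).astype(int)

# Make the synthetic visual features weakly informative about the label,
# purely so this demonstration has something to learn.
visual_features[:, 0] += 1.5 * speaking

# Step 2: train a visual-only classifier on the audio-derived pseudo-labels.
split = n_frames // 2
clf = LogisticRegression(max_iter=1000)
clf.fit(visual_features[:split], speaking[:split])

# Step 3: at test time the classifier uses vision only, so it can still flag
# the active speaker when the acoustic channel is too noisy to be trusted.
print(f"agreement with audio pseudo-labels: {clf.score(visual_features[split:], speaking[split:]):.2f}")
```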

    Sound localization accuracy in the blind population

    The ability to accurately locate a sound source is crucial for blind people to orient themselves and travel independently in the environment. Sound localization is accomplished by detecting binaural differences in the intensity and arrival time of incoming sound waves, along with phase differences and spectral cues. It depends on auditory sensitivity and processing; however, localization ability cannot be predicted from the audiogram or from an auditory processing evaluation. Auditory information is received not only from objects that emit sound but also from objects that reflect it. Auditory information used in this manner is called echolocation. Echolocation significantly enhances localization in the absence of vision, and research has shown that it is an important form of localization used by the blind to facilitate independent mobility. However, the ability to localize sound is not routinely evaluated in the blind population. Given the importance of localization and echolocation for independent mobility in the blind, it would seem appropriate to evaluate the accuracy of this skill set. Echolocation depends on the same auditory processes as localization; more specifically, localization is a precursor to echolocation. Therefore, localization ability will be evaluated in two normal-hearing groups: a young normal-vision population and a young blind population. Both groups will have normal hearing and auditory processing verified by an audiological evaluation that includes a central auditory screening. The localization assessment will be performed using a 24-speaker array in a sound-treated chamber under four testing conditions: (1) low-pass broadband stimuli in quiet, (2) low-pass broadband stimuli in noise, (3) high-pass broadband stimuli in quiet, and (4) high-pass broadband speech stimuli in noise. It is hypothesized that blind individuals may exhibit keener localization skills than their normal-vision counterparts, particularly if they are experienced, independent travelers. Results of this study may lead to future research in localization assessment, and possibly localization training, for blind individuals.
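    One of the binaural cues listed above, the interaural time difference, lends itself to a short worked example. The sketch below is purely illustrative and is not the planned assessment protocol: the head radius, source azimuth, sample rate, and noise stimulus are assumed values, and the cue is estimated by cross-correlating the two ear signals.

```python
# Illustrative ITD estimation via cross-correlation (assumed parameters).
import numpy as np

fs = 48_000                    # sample rate (Hz)
c = 343.0                      # speed of sound (m/s)
head_radius = 0.0875           # assumed head radius (m)
azimuth = np.deg2rad(30)       # assumed source azimuth

# Woodworth's spherical-head approximation for the ITD at this azimuth.
itd_true = (head_radius / c) * (azimuth + np.sin(azimuth))

# Synthesize a 100 ms noise burst and delay the far ear by the true ITD.
rng = np.random.default_rng(1)
x = rng.normal(size=fs // 10)
d = int(round(itd_true * fs))
left = x
right = np.concatenate([np.zeros(d), x[:len(x) - d]])

# Estimate the ITD as the lag maximizing the interaural cross-correlation.
lags = np.arange(-len(x) + 1, len(x))
itd_est = lags[np.argmax(np.correlate(right, left, mode="full"))] / fs

print(f"true ITD:      {itd_true * 1e6:.0f} us")
print(f"estimated ITD: {itd_est * 1e6:.0f} us")
```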

    Perceptual strategies in active and passive hearing of neotropical bats

    Basic spectral and temporal sound properties, such as frequency content and timing, are evaluated by the auditory system to build an internal representation of the external world and to generate auditory-guided behaviour. Using echolocating bats as a model system, I investigated aspects of spectral and temporal processing during echolocation and in relation to passive listening, as well as echo-acoustic object recognition for navigation.
    In the first project (chapter 2), spectral processing during passive and active hearing was compared in the echolocating bat Phyllostomus discolor. Sounds are ubiquitously used for many vital behaviours, such as communication, predator and prey detection, or echolocation. The frequency content of a sound is a major component for the correct perception of the transmitted information, but it is distorted while travelling from the sound source to the receiver. In order to correctly determine the frequency content of an acoustic signal, the receiver needs to compensate for these distortions. We first investigated whether P. discolor compensates for distortions of the spectral shape of transmitted sounds during passive listening. Bats were trained to discriminate lowpass-filtered from highpass-filtered acoustic impulses while hearing a continuous white-noise background with a flat spectral shape. We then assessed their spontaneous classification of acoustic impulses with varying spectral content depending on the background's spectral shape (flat or lowpass filtered). A lowpass-filtered noise background increased the proportion of highpass classifications of the same filtered impulses compared to a white-noise background. Like humans, the bats thus compensated for the background's spectral shape. In an active-acoustic version of the identical experiment, the bats had to classify filtered playbacks of their emitted echolocation calls instead of passively presented impulses. During echolocation, the classification of the filtered echoes was independent of the spectral shape of the passively presented background noise. Likewise, call structure did not change to compensate for the background's spectral shape. Hence, auditory processing differs between passive and active hearing, with echolocation representing an independent mode with its own rules of auditory spectral analysis.
    The second project (chapter 3) was concerned with the accurate measurement of the time of occurrence of auditory signals, and as such also with distance measurement in echolocation. In addition, the importance of passive listening relative to echolocation turned out to be an unexpected factor in this study. To measure the distance to objects, called ranging, bats measure the time delay between an outgoing call and its returning echo. Ranging accuracy has received considerable interest in echolocation research for several reasons: (i) behaviourally, it is important for the bat's ability to locate objects and navigate its surroundings; (ii) physiologically, the neuronal implementation of precise measurements of very short time intervals is a challenge; and (iii) the conjectured echo-acoustic receiver of bats is of interest for signal processing. Here, I trained the nectarivorous bat Glossophaga soricina to detect a jittering real target and found a biologically plausible distance accuracy of 4–7 mm, corresponding to a temporal accuracy of 20–40 μs. However, the bats presumably did not use the jittering echo delay as the first and most prominent cue; instead, they initially relied on passive acoustic listening, which could only be prevented by the playback of masking noise. This shows that even a non-gleaning bat relies heavily on passive acoustic cues and that measuring short time intervals is difficult. This result calls into question other studies reporting sub-microsecond jitter thresholds.
    The third project (chapter 4) linked the perception of echo-acoustic stimuli to the appropriate behavioural reactions, namely evasive flight manoeuvres around virtual objects presented in the flight paths of wild, untrained bats. Echolocating bats are able to orient in complete darkness solely by analysing the echoes of their emitted calls. They detect, recognize and classify objects based on the spectro-temporal reflection pattern received at the two ears. Auditory object analysis, however, is inevitably more complicated than visual object analysis, because the one-dimensional acoustic time signal only transmits range information, i.e., the object's distance and its longitudinal extent. All other object dimensions, like width and height, have to be inferred from comparative analysis of the signals at both ears and over time. The purpose of this study was to measure perceived object dimensions in wild, experimentally naïve bats by video-recording and analysing the bats' evasive flight manoeuvres in response to the presentation of virtual echo-acoustic objects with independently manipulated acoustic parameters. Flight manoeuvres were analysed by extracting the flight paths of all passing bats. As a control for our method, we also recorded the flight paths of bats in response to a real object. Bats avoided the real object by flying around it. However, we did not find any flight-path changes in response to the presentation of several virtual objects. We assume that the missing spatial extent of the virtual echo-acoustic objects, due to playback from only one loudspeaker, was the main reason for the failure to evoke evasive flight manoeuvres. This study therefore emphasises for the first time the importance of the spatial dimension of virtual objects, which has up to now been neglected in virtual-object presentations.
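    As a quick consistency check on the ranging numbers reported for the second project (assuming a speed of sound of roughly 343 m/s in air), the quoted temporal accuracy converts to distance accuracy via the two-way travel path of the echo:

```python
# Two-way travel time: a delay jitter dt corresponds to a range jitter of c*dt/2.
c = 343.0  # assumed speed of sound in air (m/s)

for dt_us in (20, 40):
    dd_mm = c * (dt_us * 1e-6) / 2 * 1000
    print(f"{dt_us} us delay jitter -> {dd_mm:.1f} mm range jitter")
# Prints roughly 3.4 mm and 6.9 mm, consistent with the reported 4-7 mm accuracy.
```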

    Auditory Displays and Assistive Technologies: the use of head movements by visually impaired individuals and their implementation in binaural interfaces

    Visually impaired people rely upon audition for a variety of purposes, among which is the use of sound to identify the position of objects in their surrounding environment. This is not limited to localising sound-emitting objects: obstacles and environmental boundaries can also be located, thanks to the ability to extract information from reverberation and sound reflections. All of this can contribute to effective and safe navigation, and it also serves a function in certain assistive technologies based on binaural auditory virtual reality. It is known that head movements in the presence of sound elicit changes in the acoustic signals arriving at each ear, and that these changes can mitigate common localisation problems in headphone-based auditory virtual reality, such as front-to-back reversals. The goal of the work presented here is to investigate whether visually impaired people naturally engage head movement to facilitate auditory perception and to what extent this may be applicable to the design of virtual auditory assistive technology. Three novel experiments are presented: a field study of head-movement behaviour during navigation, a questionnaire assessing the self-reported use of head movement in auditory perception by visually impaired individuals (each comparing visually impaired and sighted participants), and an acoustical analysis of interaural differences and cross-correlations as a function of head angle and sound-source distance. It is found that visually impaired people self-report using head movement for auditory distance perception. This is supported by the head movements observed during the field study, whilst the acoustical analysis showed that interaural correlations for sound sources within 5 m of the listener decreased as head angle or distance to the sound source increased, and that interaural differences and correlations in reflected sound were generally lower than those of the direct sound. Subsequently, relevant guidelines for designers of assistive auditory virtual reality are proposed.
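    The interaural cross-correlations referred to above can be illustrated with a short sketch. The code below is not the thesis' analysis pipeline: it computes a normalised interaural cross-correlation coefficient (the maximum of the normalised cross-correlation within an assumed ±1 ms lag window) on synthetic binaural signals, whereas the analysis described above used binaural measurements at different head angles and source distances.

```python
# Illustrative normalised interaural cross-correlation (IACC) on synthetic signals.
import numpy as np

def iacc(left: np.ndarray, right: np.ndarray, fs: int, max_lag_ms: float = 1.0) -> float:
    """Maximum normalised interaural cross-correlation within +/- max_lag_ms."""
    left = left - left.mean()
    right = right - right.mean()
    norm = np.sqrt(np.sum(left**2) * np.sum(right**2))
    full = np.correlate(left, right, mode="full") / norm
    center = len(right) - 1                      # zero-lag index of the 'full' output
    max_lag = int(fs * max_lag_ms / 1000)
    return float(full[center - max_lag:center + max_lag + 1].max())

fs = 48_000
rng = np.random.default_rng(2)
direct = rng.normal(size=fs // 2)

# Nearly identical ear signals (e.g., a distant frontal source): IACC close to 1.
print(iacc(direct, np.roll(direct, 10), fs))

# Partially decorrelated ear signals (e.g., strong independent reflections): lower IACC.
print(iacc(direct, 0.5 * direct + rng.normal(size=fs // 2), fs))
```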

    Time and information in perceptual adaptation to speech

    Presubmission manuscript and supplementary files (stimuli, stimulus presentation code, data, data analysis code).
    Perceptual adaptation to a talker enables listeners to efficiently resolve the many-to-many mapping between variable speech acoustics and abstract linguistic representations. However, models of speech perception have not delved into the variety or the quantity of information necessary for successful adaptation, nor how adaptation unfolds over time. In three experiments using speeded classification of spoken words, we explored how the quantity (duration), quality (phonetic detail), and temporal continuity of talker-specific context contribute to facilitating perceptual adaptation to speech. In single- and mixed-talker conditions, listeners identified phonetically confusable target words in isolation or preceded by carrier phrases of varying lengths and phonetic content, spoken by the same talker as the target word. Word identification was always slower in mixed-talker conditions than in single-talker ones. However, interference from talker variability decreased as the duration of preceding speech increased, but was not affected by the amount of preceding talker-specific phonetic information. Furthermore, efficiency gains from adaptation depended on temporal continuity between the preceding speech and the target word. These results suggest that perceptual adaptation to speech may be understood via models of auditory streaming, where perceptual continuity of an auditory object (e.g., a talker) facilitates allocation of attentional resources, resulting in more efficient perceptual processing.
    Funding: NIH NIDCD (R03DC014045).
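    The interference effect described above can be summarised as the mixed-minus-single difference in mean response time at each carrier-phrase duration. The sketch below is purely illustrative, not the authors' analysis code: the response times are synthetic and the carrier durations are assumed values.

```python
# Illustrative summary of a talker-variability interference effect (synthetic data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
rows = []
for dur in (0, 500, 1000, 2000):                 # assumed carrier durations (ms)
    for cond in ("single", "mixed"):
        # Synthetic RTs: mixed-talker trials are slower, and the penalty
        # shrinks as the duration of preceding same-talker speech grows.
        penalty = 80 * np.exp(-dur / 800) if cond == "mixed" else 0.0
        rts = rng.normal(loc=650 + penalty, scale=60, size=200)
        rows.append(pd.DataFrame({"duration_ms": dur, "condition": cond, "rt_ms": rts}))
data = pd.concat(rows, ignore_index=True)

means = data.groupby(["duration_ms", "condition"])["rt_ms"].mean().unstack()
print((means["mixed"] - means["single"]).rename("interference_ms"))
```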

    Perceptual compasses: spatial navigation in multisensory environments

    Moving through space is a crucial activity in daily human life. The main objective of my Ph.D. project was to investigate how people exploit the available multisensory sources of information (vestibular, visual, auditory) to navigate efficiently. Specifically, my Ph.D. aimed at i) examining the multisensory integration mechanisms underlying spatial navigation; ii) establishing the crucial role of vestibular signals in spatial encoding and processing, and their interaction with environmental landmarks; and iii) providing the neuroscientific basis to develop tailored assessment protocols and rehabilitation procedures to enhance orientation and mobility based on the integration of different sensory modalities, especially aimed at improving the compromised navigational performance of visually impaired (VI) people.
    To achieve these aims, we conducted behavioral experiments on adult participants, including psychophysical procedures, galvanic stimulation, and modeling. In particular, the experiments involved active spatial-navigation tasks with audio-visual landmarks and self-motion discrimination tasks with and without acoustic landmarks, using a motion platform (Rotational-Translational Chair) and an acoustic virtual reality tool. Moreover, we applied Galvanic Vestibular Stimulation to directly modulate signals coming from the vestibular system during behavioral tasks that involved interaction with audio-visual landmarks. When appropriate, we compared the obtained results with predictions from the Maximum Likelihood Estimation model, to verify the potentially optimal integration of the available multisensory cues.
    i) Results on multisensory navigation showed a sub-group of integrators and another of non-integrators, revealing inter-individual differences in audio-visual processing while moving through the environment. Finding these idiosyncrasies in a homogeneous sample of adults emphasizes the role of individual perceptual characteristics in multisensory perception, highlighting how important it is to plan tailored rehabilitation protocols that consider each individual's perceptual preferences and experiences. ii) We also found a robust inherent overestimation bias when estimating passive self-motion stimuli. This finding sheds new light on how our brain processes and elaborates the available cues to build a more functional representation of the world. We also demonstrated a novel impact of vestibular signals on the encoding of visual environmental cues in the absence of actual self-motion information. The role that vestibular inputs play in visual cue perception and space encoding has multiple consequences for humans' ability to functionally navigate in space and interact with environmental objects, especially when vestibular signals are impaired due to intrinsic (vestibular disorders) or environmental conditions (altered gravity, e.g. spaceflight missions). Finally, iii) the combination of the Rotational-Translational Chair and the acoustic virtual reality tool revealed a slight improvement in self-motion perception for VI people when exploiting acoustic cues. This approach proved to be a successful technique for evaluating audio-vestibular perception and improving the spatial-representation abilities of VI people, providing the basis for developing new rehabilitation procedures focused on multisensory perception.
    Overall, the findings resulting from my Ph.D. project broaden the scientific knowledge about spatial navigation in multisensory environments, yielding new insights into the brain mechanisms associated with mobility, orientation, and locomotion abilities.
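    The Maximum Likelihood Estimation model mentioned above has a standard textbook form that is worth spelling out: each cue is weighted by the inverse of its variance, and the combined estimate has lower variance than either cue alone. The sketch below is a generic illustration with assumed numbers, not the thesis' modeling code.

```python
# Standard MLE (inverse-variance weighted) cue combination, illustrative numbers.
import numpy as np

def mle_combine(estimates, variances):
    """Optimal combination of independent Gaussian single-cue estimates."""
    estimates = np.asarray(estimates, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)
    combined_estimate = np.sum(weights * estimates) / np.sum(weights)
    combined_variance = 1.0 / np.sum(weights)
    return combined_estimate, combined_variance

# Assumed example: auditory and visual estimates of the same self-motion
# rotation (deg), with the visual cue being the more reliable one.
est, var = mle_combine(estimates=[42.0, 38.0], variances=[16.0, 4.0])
print(f"combined estimate: {est:.1f} deg, combined variance: {var:.1f} deg^2")
# -> 38.8 deg, 3.2 deg^2: the prediction that an "integrator" should approach.
```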

    Visual and spatial audio mismatching in virtual environments

    This paper explores how vision affects spatial audio perception in virtual reality. We created four virtual environments with different reverberation characteristics and room sizes, and recorded binaural clicks in each one. We conducted two experiments: one in which participants judged how well the audio matched the visual scene, and another in which they pointed to the direction of the click. We found that vision influences spatial audio perception and that congruent audio-visual cues improve accuracy. We conclude with implications for virtual reality design and evaluation.