Spherical microphone array acoustic rake receivers
Several signal-independent acoustic rake receivers are proposed for speech dereverberation using spherical microphone arrays. The proposed rake designs take advantage of multipath propagation by separately capturing early reflections and combining them with the direct path. We investigate several approaches to combining reflections with the direct-path source signal, including the development of beam patterns that point nulls at all preceding reflections. The proposed designs are tested in experimental simulations and their dereverberation performance is evaluated using objective measures. For the tested configuration, the proposed designs achieve higher levels of dereverberation than conventional signal-independent beamforming systems, with up to 3.6 dB improvement in the direct-to-reverberant ratio over the plane-wave decomposition beamformer.
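For readers unfamiliar with the direct-to-reverberant ratio (DRR) quoted above, the following is a minimal sketch of one common way to compute it from a room impulse response. It is not taken from the paper, whose exact measure the abstract does not specify. The function name, the 2 ms window, the toy impulse response, and the assumption that the direct path is the largest tap are all illustrative choices.

```python
import math

def direct_to_reverberant_ratio_db(rir, fs, window_ms=2.0):
    """Return the DRR in dB for an impulse response `rir` sampled at `fs` Hz.

    Assumption: the direct path is the largest-magnitude tap. Energy inside a
    short window around that tap counts as direct; everything after the window
    counts as reverberant. Samples before the window are ignored.
    """
    peak = max(range(len(rir)), key=lambda n: abs(rir[n]))
    half = int(round(window_ms * 1e-3 * fs))       # half-width of the window in samples
    start = max(0, peak - half)
    stop = min(len(rir), peak + half + 1)
    direct = sum(x * x for x in rir[start:stop])
    reverberant = sum(x * x for x in rir[stop:])
    if direct == 0.0 or reverberant == 0.0:
        raise ValueError("DRR is undefined when either energy term is zero")
    return 10.0 * math.log10(direct / reverberant)

# Toy impulse response (hypothetical): unit direct path plus two reflections.
fs = 1000
rir = [0.0] * 100
rir[10] = 1.0    # direct path
rir[40] = 0.5    # first reflection
rir[70] = 0.5    # second reflection
drr = direct_to_reverberant_ratio_db(rir, fs)
print(f"DRR = {drr:.2f} dB")
```

In this toy case the direct energy is 1 and the reverberant energy is 0.5, so the ratio is 2, or about 3 dB. A beamformer that raises this figure attenuates the later arrivals relative to the direct path.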
Technical aspects of a demonstration tape for three-dimensional sound displays
This document was developed to accompany an audio cassette that demonstrates work in three-dimensional auditory displays, developed at the Ames Research Center Aerospace Human Factors Division. It provides a text version of the audio material, and covers the theoretical and technical issues of spatial auditory displays in greater depth than the cassette does. The technical procedures used in the production of the audio demonstration are documented, including the methods for simulating rotorcraft radio communication, synthesizing auditory icons, and using the Convolvotron, a real-time spatialization device.
Egocentric Auditory Attention Localization in Conversations
In a noisy conversation environment such as a dinner party, people often
exhibit selective auditory attention, or the ability to focus on a particular
speaker while tuning out others. Recognizing who somebody is listening to in a
conversation is essential for developing technologies that can understand
social behavior and devices that can augment human hearing by amplifying
particular sound sources. The computer vision and audio research communities
have made great strides towards recognizing sound sources and speakers in
scenes. In this work, we take a step further by focusing on the problem of
localizing auditory attention targets in egocentric video, or detecting who in
a camera wearer's field of view they are listening to. To tackle the new and
challenging Selective Auditory Attention Localization problem, we propose an
end-to-end deep learning approach that uses egocentric video and multichannel
audio to predict the heatmap of the camera wearer's auditory attention. Our
approach leverages spatiotemporal audiovisual features and holistic reasoning
about the scene to make predictions, and outperforms a set of baselines on a
challenging multi-speaker conversation dataset. Project page:
https://fkryan.github.io/saa
Cross-modal extinction in a boy with severely autistic behaviour and high verbal intelligence
Anecdotal reports from individuals with autism suggest a loss of awareness of stimuli from one modality in the presence of stimuli from another. Here we document such a case in a detailed study of T.M., a 13-year-old boy with autism in whom significant autistic behaviors are combined with an uneven IQ profile of superior verbal and low performance abilities. Although T.M.'s speech is often unintelligible and his behavior is dominated by motor stereotypies and impulsivity, he can communicate by typing or by pointing independently on a letter board. A series of experiments using simple and highly salient visual, auditory, and tactile stimuli demonstrated a hierarchy of cross-modal extinction, in which auditory information extinguished other modalities at various levels of processing. T.M. also showed deficits in shifting and sustaining attention. These results provide evidence for mono-channel perception in autism and suggest a general pattern of winner-takes-all processing in which a stronger stimulus-driven representation dominates behavior, extinguishing weaker representations.
Real-time Microphone Array Processing for Sound-field Analysis and Perceptually Motivated Reproduction
This thesis details real-time implementations of sound-field analysis and perceptually motivated reproduction methods for visualisation and auralisation purposes. For visualisation, various methods for displaying the relative distribution of sound energy from one point in space are investigated and contrasted, including a novel reformulation of the cross-pattern coherence (CroPaC) algorithm, which integrates a new side-lobe suppression technique. For auralisation, listening tests were conducted to compare ambisonics reproduction with a novel headphone formulation of the directional audio coding (DirAC) method. The results indicate that the side-lobe suppressed CroPaC method offers greater spatial selectivity in reverberant conditions than other popular approaches, and that the new DirAC formulation yields higher perceived spatial accuracy than the ambisonics method.
Cognitive performance in open-plan office acoustic simulations: Effects of room acoustics and semantics but not spatial separation of sound sources
The irrelevant sound effect (ISE) characterizes short-term memory performance
impairment during irrelevant sounds relative to quiet. Irrelevant sound
presentation in most laboratory-based ISE studies has been rather limited in
representing complex scenarios including open-plan offices (OPOs), and few
studies have considered serial recall of heard information. This paper
investigates ISE using an auditory-verbal serial recall task, wherein
performance was evaluated for relevant factors in simulating OPO acoustics: the
irrelevant sounds including the semanticity of speech, reproduction methods
over headphones, and room acoustics. Results (Experiments 1 and 2) show that
ISE was exhibited in most conditions with anechoic (irrelevant) nonspeech
sounds with/without speech, but the effect was substantially higher with
meaningful speech compared to foreign speech, suggesting a semantic effect.
Performance differences in conditions with diotic and binaural reproductions
were not statistically robust, suggesting a limited role of spatial separation of
sources. In Experiment 3, statistically robust ISEs were exhibited for binaural
room acoustic conditions with mid-frequency reverberation times, T30 (s) = 0.4,
0.8, 1.1, suggesting cognitive impairment regardless of sound absorption
representative of OPOs. Performance differences in T30 = 0.4 s relative to T30
= 0.8 and 1.1 s conditions were statistically robust. This emphasizes the
benefits for cognitive performance with increased sound absorption, reinforcing
extant room acoustic design recommendations. Performance differences in T30 =
0.8 s vs. 1.1 s were not statistically robust. Collectively, these results
suggest that certain findings from ISE studies with idiosyncratic acoustics may
not translate well to complex OPO acoustic environments.
Meta-analyses support a taxonomic model for representations of different categories of audio-visual interaction events in the human brain
Our ability to perceive meaningful action events involving objects, people and other animate agents is characterized in part by an interplay of visual and auditory sensory processing and their cross-modal interactions. However, this multisensory ability can be altered or dysfunctional in some hearing and sighted individuals, and in some clinical populations. The present meta-analysis sought to test current hypotheses regarding neurobiological architectures that may mediate audio-visual multisensory processing. Reported coordinates from 82 neuroimaging studies (137 experiments) that revealed some form of audio-visual interaction in discrete brain regions were compiled, converted to a common coordinate space, and then organized along specific categorical dimensions to generate activation likelihood estimate (ALE) brain maps and various contrasts of those derived maps. The results revealed brain regions (cortical “hubs”) preferentially involved in multisensory processing along different stimulus category dimensions, including (1) living versus non-living audio-visual events, (2) audio-visual events involving vocalizations versus actions by living sources, (3) emotionally valent events, and (4) dynamic-visual versus static-visual audio-visual stimuli. These meta-analysis results are discussed in the context of neurocomputational theories of semantic knowledge representations and perception, and the brain volumes of interest are available for download to facilitate data interpretation for future neuroimaging studies.
Multisensory Motion Perception in 3–4 Month-Old Infants
Human infants begin very early in life to take advantage of multisensory information by extracting the invariant amodal information that is conveyed redundantly by multiple senses. Here we addressed the question of whether infants can bind multisensory moving stimuli, and whether this occurs even if the motion produced by the stimuli is only illusory. Three- to 4-month-old infants were presented with two bimodal pairings: visuo-tactile and audio-visual. Visuo-tactile pairings consisted of apparently vertically moving bars (the Barber Pole illusion) moving in either the same or opposite direction with a concurrent tactile stimulus consisting of strokes given on the infant’s back. Audio-visual pairings consisted of the Barber Pole illusion in its visual and auditory version, the latter giving the impression of a continuous rising or ascending pitch. We found that infants were able to discriminate congruently (same direction) vs. incongruently moving (opposite direction) pairs irrespective of modality (Experiment 1). Importantly, we also found that congruently moving visuo-tactile and audio-visual stimuli were preferred over incongruently moving bimodal stimuli (Experiment 2). Our findings suggest that very young infants are able to extract motion as an amodal component and use it to match stimuli that only apparently move in the same direction.
Optimality and limitations of audio-visual integration for cognitive systems
Multimodal integration is an important process in perceptual decision-making. In humans, this process has often been shown to be statistically optimal, or near optimal: sensory information is combined in a fashion that minimizes the average error in perceptual representation of stimuli. However, sometimes there are costs that come with the optimization, manifesting as illusory percepts. We review audio-visual facilitations and illusions that are products of multisensory integration, and the computational models that account for these phenomena. In particular, the same optimal computational model can lead to illusory percepts, and we suggest that more studies are needed to detect and mitigate these illusions, which would appear as artifacts in artificial cognitive systems. We provide cautionary considerations when designing artificial cognitive systems with the view of avoiding such artifacts. Finally, we suggest avenues of research toward solutions to potential pitfalls in system design. We conclude that detailed understanding of multisensory integration and the mechanisms behind audio-visual illusions can benefit the design of artificial cognitive systems.
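As a concrete illustration of how a statistically optimal combination rule can also produce an illusion, here is a minimal sketch of textbook inverse-variance (maximum-likelihood) cue fusion for two independent Gaussian cues. It is an assumption that this is representative of the models the review covers; the function name and the numbers are hypothetical.

```python
def fuse_gaussian_cues(mu_a, sigma_a, mu_v, sigma_v):
    """Fuse an auditory and a visual estimate of the same quantity.

    Assumes each cue is corrupted by independent Gaussian noise. Each cue is
    weighted by its precision (1 / variance), which minimizes the variance of
    the fused estimate under that assumption.
    Returns (fused mean, fused variance, visual weight).
    """
    precision_a = 1.0 / sigma_a ** 2
    precision_v = 1.0 / sigma_v ** 2
    w_v = precision_v / (precision_a + precision_v)
    fused_mean = w_v * mu_v + (1.0 - w_v) * mu_a
    fused_var = 1.0 / (precision_a + precision_v)
    return fused_mean, fused_var, w_v

# Hypothetical ventriloquist-style scene: a sound at 10 degrees (noisy cue)
# and a moving mouth at 0 degrees (precise cue).
fused_mean, fused_var, w_v = fuse_gaussian_cues(mu_a=10.0, sigma_a=3.0,
                                                mu_v=0.0, sigma_v=1.0)
print(f"perceived location = {fused_mean:.2f} deg, "
      f"variance = {fused_var:.2f}, visual weight = {w_v:.2f}")
```

The fused variance (0.9) is lower than that of either cue alone, which is the optimality property. When the two cues actually come from different sources, however, the same rule pulls the perceived sound location to about 1 degree, far from the true 10 degrees, which is the kind of illusory percept the abstract warns about.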