Learning Mid-Level Auditory Codes from Natural Sound Statistics
Interaction with the world requires an organism to transform sensory signals into representations in which behaviorally meaningful properties of the environment are made explicit. These representations are derived through cascades of neuronal processing stages in which neurons at each stage recode the output of preceding stages. Explanations of sensory coding may thus involve understanding how low-level patterns are combined into more complex structures. Although models exist in the visual domain to explain how mid-level features such as junctions and curves might be derived from oriented filters in early visual cortex, little is known about analogous grouping principles for mid-level auditory representations. We propose a hierarchical generative model of natural sounds that learns combinations of spectrotemporal features from natural stimulus statistics. In the first layer the model forms a sparse convolutional code of spectrograms using a dictionary of learned spectrotemporal kernels. To generalize from specific kernel activation patterns, the second layer encodes patterns of time-varying magnitude of multiple first-layer coefficients. Because second-layer features are sensitive to combinations of spectrotemporal features, the representation they support encodes more complex acoustic patterns than the first layer. When trained on corpora of speech and environmental sounds, some second-layer units learned to group spectrotemporal features that occur together in natural sounds. Others instantiate opponency between dissimilar sets of spectrotemporal features. Such groupings might be instantiated by neurons in the auditory cortex, providing a hypothesis for mid-level neuronal computation. This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.
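The two-layer structure can be made concrete with a toy sketch. This is not the authors' code: the dictionaries below are random rather than learned, per-frame coding stands in for the convolutional code, and the ISTA solver and all sizes are illustrative assumptions. It only shows the shape of the idea: layer 1 sparse-codes a spectrogram, layer 2 sparse-codes the time-varying magnitudes of the layer-1 coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

def ista(X, D, lam=0.1, n_iter=200):
    """Minimize 0.5*||X - D@A||_F^2 + lam*||A||_1 over A via ISTA."""
    L = np.linalg.norm(D, 2) ** 2 + 1e-8   # Lipschitz constant of the gradient
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        A -= D.T @ (D @ A - X) / L                              # gradient step
        A = np.sign(A) * np.maximum(np.abs(A) - lam / L, 0.0)   # soft threshold
    return A

# Layer 1: sparse code of spectrogram frames with a dictionary of spectral
# kernels (a stand-in for the paper's convolutional code over learned
# spectrotemporal kernels; dictionary learning is omitted here).
n_freq, n_time, n_kernels = 64, 500, 32
spec = np.abs(rng.standard_normal((n_freq, n_time)))   # toy "spectrogram"
D1 = rng.standard_normal((n_freq, n_kernels))
D1 /= np.linalg.norm(D1, axis=0)
A1 = ista(spec, D1)                                    # kernel activations over time

# Layer 2: code the *magnitudes* of layer-1 coefficients, so a layer-2 unit
# responds to combinations (or opponent pairs) of layer-1 features.
env = np.abs(A1)
D2 = rng.standard_normal((n_kernels, 16))
D2 /= np.linalg.norm(D2, axis=0)
A2 = ista(env, D2)                                     # mid-level code
print(A2.shape)                                        # (16, 500)
```

Because layer-2 coefficients can be positive or negative over a dictionary of layer-1 envelopes, a single unit can express both grouping (co-activation) and opponency (one feature set suppressing another), matching the two kinds of learned units described above.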
Self-Supervised Audio-Visual Co-Segmentation
Segmenting objects in images and separating sound sources in audio are challenging tasks, in part because traditional approaches require large amounts of labeled data. In this paper we develop a neural network model for visual object segmentation and sound source separation that learns from natural videos through self-supervision. The model is an extension of recently proposed work that maps image pixels to sounds. Here, we introduce a learning approach to disentangle concepts in the neural networks, and assign semantic categories to network feature channels to enable independent image segmentation and sound source separation after audio-visual training on videos. Our evaluations show that the disentangled model outperforms several baselines in semantic segmentation and sound source separation. Comment: Accepted to ICASSP 2019.
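A schematic sketch of the pixels-to-sounds pairing this kind of model builds on may help; everything below (tensor shapes, the channel-gating readout, the channel index k) is an illustrative assumption rather than the paper's architecture. The point is only that once a feature channel is tied to a semantic category, that channel can be read out alone for visual segmentation (its activation map) and for separation (its audio component).

```python
import torch

# Toy shapes: C aligned channels, an H x W image grid, an F x T spectrogram.
C, H, W, F, T = 16, 28, 28, 256, 64
vis = torch.rand(C, H, W)           # per-pixel channel activations (from a CNN)
aud = torch.rand(C, F, T)           # per-channel spectrogram components

k = 3                               # hypothetical channel tied to, e.g., "guitar"
seg_mask = vis[k] > vis[k].mean()   # visual segmentation from channel k alone
source_k = vis[k].max() * aud[k]    # channel k's audio component, gated by vision

# The full audio reconstruction couples all channels; disentanglement is what
# makes the single-channel readouts above meaningful on their own.
mix = torch.einsum('chw,cft->ft', vis, aud) / (H * W)
```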
Ambient Sound Provides Supervision for Visual Learning
The sound of crashing waves, the roar of fast-moving cars -- sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds. Comment: ECCV 2016.
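The supervisory signal can be sketched as follows. The band decomposition and statistic set here are simplified stand-ins for the texture statistics used in the paper, and the final clustering step turns the time-averaged summaries into discrete labels that an image CNN could be trained to predict from a single frame.

```python
import numpy as np
from sklearn.cluster import KMeans

def sound_summary(wave, n_bands=16, win=1024):
    """Time-averaged statistical summary of a waveform (toy version)."""
    frames = wave[: len(wave) // win * win].reshape(-1, win)
    spec = np.abs(np.fft.rfft(frames, axis=1))               # frames x freq
    bands = np.array_split(spec, n_bands, axis=1)
    env = np.stack([b.mean(axis=1) for b in bands], axis=1)  # frames x bands
    mu, sd = env.mean(0), env.std(0)                         # band means/variability
    corr = np.corrcoef(env.T)[np.triu_indices(n_bands, 1)]   # band correlations
    return np.concatenate([mu, sd, corr])                    # fixed-length summary

clips = [np.random.randn(22050) for _ in range(100)]         # placeholder audio
summaries = np.stack([sound_summary(c) for c in clips])
labels = KMeans(n_clusters=10, n_init=10).fit_predict(summaries)  # CNN targets
```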
Visually Indicated Sounds
Objects make distinctive sounds when they are hit or scratched. These sounds reveal aspects of an object's material properties, as well as the actions that produced them. In this paper, we propose the task of predicting what sound an object makes when struck as a way of studying physical interactions within a visual scene. We present an algorithm that synthesizes sound from silent videos of people hitting and scratching objects with a drumstick. This algorithm uses a recurrent neural network to predict sound features from videos and then produces a waveform from these features with an example-based synthesis procedure. We show that the sounds predicted by our model are realistic enough to fool participants in a "real or fake" psychophysical experiment, and that they convey significant information about material properties and physical interactions.
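The two-stage pipeline lends itself to a compact sketch. The feature dimensions, the single-layer LSTM, and the snippet bank below are illustrative assumptions, not the paper's exact configuration: an RNN maps per-frame video features to sound features, and synthesis retrieves the training example whose sound features best match the prediction, reusing its waveform.

```python
import torch
import torch.nn as nn

class SoundPredictor(nn.Module):
    """RNN mapping video frame features to per-frame sound features."""
    def __init__(self, vid_dim=512, snd_dim=42, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(vid_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, snd_dim)

    def forward(self, video_feats):           # (batch, time, vid_dim)
        h, _ = self.rnn(video_feats)
        return self.out(h)                    # predicted sound features

model = SoundPredictor()
video = torch.randn(1, 90, 512)               # ~3 s of frame features (placeholder)
pred = model(video)                           # (1, 90, 42)

# Example-based synthesis: nearest neighbor over a bank of training snippets,
# then reuse of the best-matching snippet's waveform.
bank_feats = torch.randn(1000, 90, 42)        # sound features of training examples
bank_waves = [torch.randn(66150) for _ in range(1000)]   # matching waveforms
dists = ((bank_feats - pred) ** 2).flatten(1).sum(1)
wave = bank_waves[int(dists.argmin())]
```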
Inharmonic Speech: A Tool for the Study of Speech Perception and Separation
Sounds created by a periodic process have a Fourier representation with harmonic structure – i.e., components at multiples of a fundamental frequency. Harmonic frequency relations are a prominent feature of speech and many other natural sounds. Harmonicity is closely related to the perception of pitch and is believed to provide an important acoustic grouping cue underlying sound segregation. Here we introduce a method to manipulate the harmonicity of otherwise natural-sounding speech tokens, providing stimuli with which to study the role of harmonicity in speech perception. Our algorithm utilizes elements of the STRAIGHT framework for speech manipulation and synthesis, in which a recorded speech utterance is decomposed into voiced and unvoiced vocal excitation and vocal tract filtering. Unlike the conventional STRAIGHT method, we model voiced excitation as a combination of time-varying sinusoids. By individually modifying the frequency of each sinusoid, we introduce inharmonic excitation without changing other aspects of the speech signal. The resulting signal remains highly intelligible, and can be used to assess the role of harmonicity in the perception of prosody or in the segregation of speech from mixtures of talkers.
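The excitation manipulation can be sketched directly. This is not the STRAIGHT code, and the jitter rule and all parameters below are illustrative assumptions; it only shows the principle of modeling voiced excitation as a sum of sinusoids and shifting each component away from an exact multiple of f0.

```python
import numpy as np

sr, dur, f0, n_harm = 16000, 0.5, 120.0, 30
t = np.arange(int(sr * dur)) / sr
rng = np.random.default_rng(1)

def excitation(jitter=0.0):
    """Sum of sinusoids near multiples of f0; jitter=0 gives harmonic excitation."""
    sig = np.zeros_like(t)
    for k in range(1, n_harm + 1):
        fk = k * f0 * (1.0 + jitter * rng.uniform(-1, 1))  # per-component shift
        sig += np.cos(2 * np.pi * fk * t + rng.uniform(0, 2 * np.pi))
    return sig / n_harm

harmonic = excitation(jitter=0.0)     # components at exact multiples of f0
inharmonic = excitation(jitter=0.08)  # same structure, harmonicity broken
```

In a full system each sinusoid's frequency and amplitude would vary over time with the voiced source, and the result would be passed through the estimated vocal tract filter; the static version above isolates only the harmonic-to-inharmonic step.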
Spatial cues alone produce inaccurate sound segregation: The effect of interaural time differences
To clarify the role of spatial cues in sound segregation, this study explored whether interaural time differences (ITDs) are sufficient to allow listeners to identify a novel sound source from a mixture of sources. Listeners heard mixtures of two synthetic sounds, a target and distractor, each of which possessed naturalistic spectrotemporal correlations but otherwise lacked strong grouping cues, and which contained either the same or different ITDs. When the task was to judge whether a probe sound matched a source in the preceding mixture, performance improved greatly when the same target was presented repeatedly across distinct distractors, consistent with previous results. In contrast, performance improved only slightly with ITD separation of target and distractor, even when spectrotemporal overlap between target and distractor was reduced. However, when subjects localized, rather than identified, the sources in the mixture, sources with different ITDs were reported as two sources at distinct and accurately identified locations. ITDs alone thus enable listeners to perceptually segregate mixtures of sources, but the perceived content of these sources is inaccurate when other segregation cues, such as harmonicity and common onsets and offsets, do not also promote proper source separation.
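For concreteness, imposing an ITD on a mono source amounts to delaying one ear's copy of the waveform by a few hundred microseconds. The sketch below uses an integer-sample delay as a simplifying assumption (actual stimuli would typically use fractional-delay filtering), and the source is a placeholder.

```python
import numpy as np

sr = 44100
itd_s = 400e-6                      # 400 microseconds, within the natural human range
delay = int(round(itd_s * sr))      # ~18 samples at 44.1 kHz

src = np.random.randn(sr)           # 1 s placeholder source
left = np.concatenate([src, np.zeros(delay)])
right = np.concatenate([np.zeros(delay), src])  # right ear lags: heard on the left
stereo = np.stack([left, right], axis=1)        # (samples, 2) for headphone playback
```

Mixing two such stereo signals, each with its own ITD, yields the kind of two-source mixture used in the experiment.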
Development and use of prediction models for classification of cardiovascular risk of remote Indigenous Australians
Background: Cardiovascular disease (CVD) is the leading cause of death for Indigenous Australians. There is widespread belief that current tools have deficiencies for assessing CVD risk in this high-risk population. We sought to develop a 5-year CVD risk score using a wide range of known risk factors to further improve CVD risk prediction in this population.
Methods: We used clinical and demographic information on Indigenous people aged between 30 and 74 years without a history of CVD events who participated in the Well Person’s Health Check (WPHC), a community-based survey. Baseline assessments were conducted between 1998 and 2000, and data were linked to administrative hospitalisation and death records for identification of CVD events. We used Cox proportional hazards models to estimate the 5-year CVD risk, and Harrell’s c-statistic and the modified Hosmer-Lemeshow (mH-L) χ2 statistic to assess model discrimination and calibration, respectively.
Results: The study sample consisted of 1,583 individuals (48.1% male; mean age 45.0 years). The risk score consisted of sex, age, systolic blood pressure, diabetes mellitus, waist circumference, triglycerides, and albumin-creatinine ratio. The bias-corrected c-statistic was 0.72 and the bias-corrected mH-L χ2 statistic was 12.01 (p-value 0.212), indicating good discrimination and calibration, respectively. Using our risk score, the CVD risk of Indigenous Australians could be stratified to a greater degree than with a recalibrated Framingham risk score.
Conclusions: A seven-factor risk score could satisfactorily stratify 5-year risk of CVD in an Indigenous Australian cohort. These findings inform future research targeting CVD risk in Indigenous Australians.
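The modelling approach translates directly into code. The sketch below uses the lifelines library with simulated placeholder data; the column names and values are assumptions for illustration, not the WPHC dataset, and the seven covariates mirror the factors listed in the Results.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "male": rng.integers(0, 2, n),
    "age": rng.uniform(30, 74, n),
    "sbp": rng.normal(130, 15, n),              # systolic blood pressure
    "diabetes": rng.integers(0, 2, n),
    "waist_cm": rng.normal(95, 12, n),
    "triglycerides": rng.lognormal(0.3, 0.4, n),
    "acr": rng.lognormal(1.0, 0.8, n),          # albumin-creatinine ratio
    "time_years": rng.uniform(0.1, 10, n),      # follow-up time
    "cvd_event": rng.integers(0, 2, n),         # 1 = CVD event observed
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time_years", event_col="cvd_event")
print("Harrell's c:", cph.concordance_index_)   # discrimination

surv5 = cph.predict_survival_function(df, times=[5.0])  # S(5) per individual
risk5 = 1.0 - surv5.loc[5.0]                             # 5-year CVD risk
```

Calibration (the mH-L statistic in the paper) would additionally compare these predicted 5-year risks against observed event rates within risk deciles.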
Summary statistics in auditory perception
Sensory signals are transduced at high resolution, but their structure must be stored in a more compact format. Here we provide evidence that the auditory system summarizes the temporal details of sounds using time-averaged statistics. We measured discrimination of 'sound textures' characterized by particular statistical properties, of the sort that result from the superposition of many acoustic features in auditory scenes. When listeners discriminated examples of different textures, performance improved with excerpt duration. In contrast, when listeners discriminated different examples of the same texture, performance declined with duration, a paradoxical result given that the information available for discrimination grows with duration. These results indicate that once these sounds are of moderate length, the brain's representation is limited to time-averaged statistics, which, for different examples of the same texture, converge to the same values with increasing duration. Such statistical representations produce good categorical discrimination, but limit the ability to discern temporal detail. Howard Hughes Medical Institute.
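The core claim, that time-averaged statistics of different excerpts of the same texture converge as duration grows, can be illustrated with a toy simulation; the "texture" and the statistic below are stand-ins chosen for brevity, not the texture model from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def texture_excerpt(n):
    """Toy texture: smoothed noise with fixed statistics, random details."""
    return np.convolve(rng.standard_normal(n), np.ones(50) / 50, mode="same")

def summary(x, win=256):
    """Time-averaged statistic: mean magnitude spectrum across frames."""
    frames = x[: len(x) // win * win].reshape(-1, win)
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

for dur in [1_000, 10_000, 100_000]:
    a, b = summary(texture_excerpt(dur)), summary(texture_excerpt(dur))
    print(dur, np.linalg.norm(a - b) / np.linalg.norm(a + b))  # shrinks with duration
```

As duration grows, the summaries of two distinct excerpts become nearly identical, so an observer limited to such statistics would find same-texture exemplars harder, not easier, to tell apart, exactly the paradoxical pattern reported above.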