HCU400: An Annotated Dataset for Exploring Aural Phenomenology Through Causal Uncertainty
The way we perceive a sound depends on many aspects: its ecological
frequency, acoustic features, typicality, and most notably, its identified
source. In this paper, we present the HCU400: a dataset of 402 sounds ranging
from easily identifiable everyday sounds to intentionally obscured artificial
ones. It aims to lower the barrier for the study of aural phenomenology as the
largest available audio dataset to include an analysis of causal attribution.
Each sample has been annotated with crowd-sourced descriptions, as well as
familiarity, imageability, arousal, and valence ratings. We extend existing
calculations of causal uncertainty, automating and generalizing them with word
embeddings. Upon analysis we find that individuals will provide less polarized
emotion ratings as a sound's source becomes increasingly ambiguous; individual
ratings of familiarity and imageability, on the other hand, diverge as
uncertainty increases, despite a clear negative trend on average.
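The causal-uncertainty measure described above can be approximated in a few lines once the crowd-sourced descriptions are embedded. Below is a minimal sketch assuming agglomerative clustering over cosine distances and a Shannon-entropy formulation; the embedding model, clustering threshold, and aggregation choice are illustrative assumptions rather than the paper's implementation.

```python
# Hedged sketch: estimate causal uncertainty (H_cu) for one sound from
# crowd-sourced source descriptions. The clustering threshold and the
# entropy formulation here are illustrative assumptions; `embeddings`
# stands in for vectors from whatever embedding model is used.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def causal_uncertainty(embeddings: np.ndarray, distance_threshold: float = 0.4) -> float:
    """Cluster description embeddings, then take the entropy of cluster sizes."""
    # Agglomerative clustering on cosine distance groups descriptions that
    # name (roughly) the same sound source.
    dists = pdist(embeddings, metric="cosine")
    labels = fcluster(linkage(dists, method="average"),
                      t=distance_threshold, criterion="distance")
    # Probability mass of each hypothesized source = cluster size / #annotators.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Shannon entropy in bits: 0 when everyone agrees, larger when sources diverge.
    return float(-(p * np.log2(p)).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in embeddings for 10 annotator descriptions of one sound.
    fake_embeddings = rng.normal(size=(10, 300))
    print(f"H_cu = {causal_uncertainty(fake_embeddings):.2f} bits")
```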
On HRTF Notch Frequency Prediction Using Anthropometric Features and Neural Networks
High fidelity spatial audio often performs better when produced using a
personalized head-related transfer function (HRTF). However, the direct
acquisition of HRTFs is cumbersome and requires specialized equipment. Thus,
many personalization methods estimate HRTF features from easily obtained
anthropometric features of the pinna, head, and torso. The first HRTF notch
frequency (N1) is known to be a dominant feature in elevation localization, and
thus a useful feature for HRTF personalization. This paper describes the
prediction of N1 frequency from pinna anthropometry using a neural model.
Prediction is performed separately on three databases, both simulated and
measured, and then by domain mixing between the databases. The model
successfully predicts N1 frequency for individual databases and by domain
mixing between some databases. Prediction errors are comparable to or better
than those previously reported, showing significant improvement when evaluated
over a large database and with a larger output range.
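As a rough illustration of the kind of neural model described above, the sketch below regresses N1 frequency from a vector of pinna measurements. The feature count, hidden sizes, output scaling, and training loop are assumptions for illustration, not the paper's architecture.

```python
# Hedged sketch: a small feed-forward regressor from pinna anthropometry to the
# first HRTF notch frequency (N1). Input dimensionality, hidden sizes, and the
# kHz output range below are illustrative assumptions.
import torch
from torch import nn

N_PINNA_FEATURES = 10  # assumed number of anthropometric measurements per ear

model = nn.Sequential(
    nn.Linear(N_PINNA_FEATURES, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),  # predicted N1 frequency, e.g. in kHz
)


def train_step(x, y, optimizer, loss_fn=nn.MSELoss()):
    """One gradient step on a batch of (anthropometry, N1 frequency) pairs."""
    optimizer.zero_grad()
    loss = loss_fn(model(x).squeeze(-1), y)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    # Synthetic stand-in data; real training would separate or mix the
    # simulated and measured HRTF databases as described in the abstract.
    x = torch.randn(32, N_PINNA_FEATURES)
    y = 6.0 + 2.0 * torch.rand(32)  # toy N1 targets, roughly 6-8 kHz
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(5):
        print(f"loss = {train_step(x, y, opt):.3f}")
```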
Towards Improved Room Impulse Response Estimation for Speech Recognition
We propose to characterize and improve the performance of blind room impulse
response (RIR) estimation systems in the context of a downstream application
scenario, far-field automatic speech recognition (ASR). We first draw the
connection between improved RIR estimation and improved ASR performance, as a
means of evaluating neural RIR estimators. We then propose a GAN-based
architecture that encodes RIR features from reverberant speech and constructs
an RIR from the encoded features, and uses a novel energy decay relief loss to
optimize for capturing energy-based properties of the input reverberant speech.
We show that our model outperforms the state-of-the-art baselines on acoustic
benchmarks (by 72% on the energy decay relief and 22% on an early-reflection
energy metric), as well as in an ASR evaluation task (by 6.9% in word error
rate).
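The energy decay relief (EDR) referenced above is the backward-cumulated spectrogram energy per frequency bin, and a loss can compare the EDR of a predicted and a reference signal. Below is a minimal sketch assuming an STFT-based EDR in dB and an L1 distance; the transform settings and distance choice are illustrative, not the paper's exact loss.

```python
# Hedged sketch of an energy decay relief (EDR) loss between a predicted and a
# reference RIR (or reverberant signal). STFT settings, dB flooring, and the L1
# distance are illustrative assumptions.
import torch


def energy_decay_relief(x: torch.Tensor, n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    """Return EDR in dB with shape (freq_bins, frames)."""
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    energy = spec.abs() ** 2
    # Backward cumulative sum over time: energy remaining after each frame.
    edr = torch.flip(torch.cumsum(torch.flip(energy, dims=[-1]), dim=-1), dims=[-1])
    return 10.0 * torch.log10(edr + 1e-8)


def edr_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean absolute EDR difference, usable as a training loss."""
    return (energy_decay_relief(pred) - energy_decay_relief(target)).abs().mean()


if __name__ == "__main__":
    t = torch.arange(16000) / 16000.0
    target = torch.randn(16000) * torch.exp(-6.0 * t)  # toy exponentially decaying "RIR"
    pred = torch.randn(16000) * torch.exp(-4.0 * t)    # slower decay -> larger loss
    print(f"EDR loss = {edr_loss(pred, target).item():.3f} dB")
```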
Hearing Loss Detection from Facial Expressions in One-on-one Conversations
Individuals with impaired hearing experience difficulty in conversations,
especially in noisy environments. This difficulty often manifests as a change
in behavior and may be captured via facial expressions, such as the expression
of discomfort or fatigue. In this work, we build on this idea and introduce the
problem of detecting hearing loss from an individual's facial expressions
during a conversation. Building machine learning models that can represent
hearing-related facial expression changes is a challenge. In addition, models
need to disentangle spurious age-related correlations from hearing-driven
expressions. To this end, we propose a self-supervised pre-training strategy
tailored for the modeling of expression variations. We also use adversarial
representation learning to mitigate the age bias. We evaluate our approach on a
large-scale egocentric dataset with real-world conversational scenarios
involving subjects with hearing loss and show that our method for hearing loss
detection achieves superior performance over baselines.
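One common way to implement the adversarial age-debiasing described above is a gradient reversal layer feeding an age classifier, so the shared encoder is pushed to discard age cues while still supporting hearing-loss prediction. The sketch below assumes that formulation; the encoder, feature dimensionality, and head sizes are placeholders rather than the paper's model.

```python
# Hedged sketch of adversarial debiasing via a gradient reversal layer (GRL):
# the hearing-loss head is trained normally while an age head receives reversed
# gradients, pushing the shared encoder to discard age information. Encoder
# architecture and dimensions are placeholders, not the paper's actual model.
import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        # Flip the sign of the gradient flowing back into the encoder.
        return -ctx.lam * grad_out, None


encoder = nn.Sequential(nn.Linear(136, 256), nn.ReLU(), nn.Linear(256, 128))  # assumed facial-feature input size
hearing_head = nn.Linear(128, 2)  # hearing loss vs. normal hearing
age_head = nn.Linear(128, 4)      # coarse age buckets (illustrative)


def losses(x, y_hearing, y_age, lam=1.0):
    z = encoder(x)
    task_loss = nn.functional.cross_entropy(hearing_head(z), y_hearing)
    adv_loss = nn.functional.cross_entropy(age_head(GradReverse.apply(z, lam)), y_age)
    return task_loss + adv_loss


if __name__ == "__main__":
    x = torch.randn(8, 136)
    total = losses(x, torch.randint(0, 2, (8,)), torch.randint(0, 4, (8,)))
    total.backward()  # encoder gets task gradients minus age-classifier gradients
    print(f"combined loss = {total.item():.3f}")
```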
The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective
In recent years, the thriving development of research related to egocentric
videos has provided a unique perspective for the study of conversational
interactions, where both visual and audio signals play a crucial role. While
most prior work focuses on learning about behaviors that directly involve the
camera wearer, we introduce the Ego-Exocentric Conversational Graph Prediction
problem, marking the first attempt to infer exocentric conversational
interactions from egocentric videos. We propose a unified multi-modal framework
-- Audio-Visual Conversational Attention (AV-CONV), for the joint prediction of
conversation behaviors -- speaking and listening -- for both the camera wearer
as well as all other social partners present in the egocentric video.
Specifically, we adopt the self-attention mechanism to model the
representations across time, across subjects, and across modalities. To
validate our method, we conduct experiments on a challenging egocentric video
dataset that includes multi-speaker and multi-conversation scenarios. Our
results demonstrate the superior performance of our method compared to a series
of baselines. We also present detailed ablation studies to assess the
contribution of each component in our model. Check our project page at
https://vjwq.github.io/AV-CONV/
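A rough way to picture attention across time, subjects, and modalities is to flatten (subject, time, modality) tokens and let a standard transformer encoder attend over all of them before per-subject speaking and listening heads. The sketch below assumes that scheme with illustrative dimensions; AV-CONV's actual architecture may differ.

```python
# Hedged sketch of joint attention over (subject, time, modality) tokens, in the
# spirit of across-time / across-subject / across-modality attention. Token
# size, layer counts, and the flattening scheme are assumptions.
import torch
from torch import nn

D = 128             # assumed token embedding size
S, T, M = 4, 16, 2  # subjects, time steps, modalities (audio, visual)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2,
)
# Per-subject binary predictions: speaking and listening behaviors.
speak_head = nn.Linear(D, 1)
listen_head = nn.Linear(D, 1)


def forward(tokens: torch.Tensor):
    """tokens: (batch, subjects, time, modalities, D) -> per-subject logits."""
    b = tokens.shape[0]
    # Flatten so every (subject, time, modality) token attends to every other.
    x = encoder(tokens.reshape(b, S * T * M, D))
    # Pool back to one vector per subject before the behavior heads.
    subj = x.reshape(b, S, T * M, D).mean(dim=2)
    return speak_head(subj).squeeze(-1), listen_head(subj).squeeze(-1)


if __name__ == "__main__":
    speak_logits, listen_logits = forward(torch.randn(2, S, T, M, D))
    print(speak_logits.shape)  # (2, 4): one speaking logit per subject
```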
System specific power reduction techniques for wearable navigation technology
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016. By Ishwarya Ananthabhotla.
As a result of advances in computer vision, mapping, and controls, wearable technology for visually-impaired individuals has become a growing space of research within Assistive Technology. A team at the MIT Energy Efficient Circuits Group has made an important stride forward by presenting a wearable navigation prototype in a fully integrated hardware form factor, but one of the biggest barriers to usability of the device is its excessive power consumption. As such, the goal of this work is, broadly, to: (1) understand the largest sources of power consumption in the initial navigation prototype system, and expose relevant features for control; (2) develop a set of algorithms that can capitalize on the motion of a user, the motion of the environment around a user, and the proximity of obstacles within the environment to the user, in order to dynamically tune the exposed parameters to scale power as necessary; and (3) lay the foundation for the next-generation wearable navigation prototype by translating critical software operations and the power-scaling algorithms into a hardware architecture capable of working with a smaller and less power-intensive depth camera.
The first portion of this work focuses on the wearable navigation prototype built around Texas Instruments' OPT9220/9221 Time of Flight chipset. Illumination voltage, frame rate, and integration duty cycle are identified as key control features, and a step rate estimation algorithm, scene statistics algorithm, and frame skipping controller to tune these features are built and tested. The latter half of the work focuses on the newer OPT8320 evaluation platform, for which a Bluespec System Verilog implementation of these power algorithms and the point cloud generation operation is presented and tested. Overall, the work demonstrates the critical concept that simple, system-specific, fully integrated algorithms can effectively be used to reduce analog power system-wide.
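To make the power-scaling idea concrete, the sketch below implements a toy frame-skipping controller that raises the processed frame rate when the estimated step rate is high or an obstacle is close, and lowers it otherwise. The thresholds, rate bounds, and control law are illustrative assumptions, not the algorithms developed in the thesis.

```python
# Hedged sketch of a frame-skipping controller in the spirit of the thesis:
# scale the depth camera's processed frame rate with the user's estimated step
# rate and the proximity of the nearest obstacle. All constants below are
# illustrative assumptions.

MIN_FPS, MAX_FPS = 5.0, 30.0


def target_frame_rate(step_rate_hz: float, nearest_obstacle_m: float) -> float:
    """Higher frame rate when the user moves fast or an obstacle is close."""
    # Motion term: ~0 when standing still, ~1 at a brisk ~2 steps/s.
    motion = min(step_rate_hz / 2.0, 1.0)
    # Proximity term: ~1 when an obstacle is within 0.5 m, falling to 0 by 3 m.
    proximity = max(0.0, min(1.0, (3.0 - nearest_obstacle_m) / 2.5))
    demand = max(motion, proximity)
    return MIN_FPS + demand * (MAX_FPS - MIN_FPS)


def frames_to_skip(step_rate_hz: float, nearest_obstacle_m: float) -> int:
    """Sensor frames to drop between processed frames, assuming MAX_FPS capture."""
    return max(0, round(MAX_FPS / target_frame_rate(step_rate_hz, nearest_obstacle_m)) - 1)


if __name__ == "__main__":
    print(frames_to_skip(0.0, 4.0))  # stationary, clear path -> skip many frames
    print(frames_to_skip(1.8, 0.6))  # walking with a nearby obstacle -> skip none
```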
Cognitive Audio: Enabling Auditory Interfaces with an Understanding of How We Hear
Over the last several decades, neuroscientists, cognitive scientists, and psychologists have made strides in understanding the complex and mysterious processes that define the interaction between our minds and the sounds around us. Some of these processes, particularly at the lowest levels of abstraction relative to a sound wave, are well understood, and are easy to characterize across large sections of the human population; others, however, are the sum of both intuition and observations drawn from small-scale laboratory experiments, and remain as of yet poorly understood. In this thesis, I suggest that there is value in coupling insight into the workings of auditory processing, beginning with abstractions in pre-conscious processing, with new frontiers in interface design and state-of-the-art infrastructure for parsing and identifying sound objects, as a means of unlocking audio technologies that are much more immersive, naturalistic, and synergistic than those present in the existing landscape. From the vantage point of today's computational models and devices that largely represent audio at the level of the digital sample, I gesture towards a world of auditory interfaces that work deeply in concert with uniquely human tendencies, allowing us to altogether re-imagine how we capture, preserve, and experience bodies of sound -- towards, for example, augmented reality devices that manipulate sound objects to minimize distractions, lossy "codecs" that operate on semantic rather than time-frequency information, and soundscape design engines operating on large corpora of audio data that optimize for aesthetic or experiential outcomes instead of purely objective ones.
To do this, I aim to introduce and explore a new research direction focused on the marriage of principles governing pre-conscious auditory cognition with traditional HCI approaches to auditory interface design via explicit statistical modeling, termed "Cognitive Audio". Along the way, I consider the major roadblocks that present themselves in approaching this convergence: I ask how we might "probe" and measure a cognitive principle of interest robustly enough to inform system design, in the absence of immediately observable biophysical phenomena that may accompany, for example, visual cognition; I also ask how we might build reliable, meaningful statistical models from the resulting data that drive compelling experiences despite inherent noise, sparsity, and generalizations made at the level of the crowd.
I discuss early insights into these questions through the lens of a series of projects centered on auditory processing at different levels of abstraction. I begin with a discussion of early work focused on cognitive models of lower-level phenomena; these exercises then inform a comprehensive effort to construct general purpose estimators of gestalt concepts in sound understanding. I then demonstrate the affordances of these estimators in the context of application systems that I construct and characterize, incorporating additional explorations on methods for personalization that sit atop these estimators. Finally, I conclude with a dialogue on the intersection between the key contributions in this dissertation and a string of major themes relevant to the audio technology and computation world today.