
    HCU400: An Annotated Dataset for Exploring Aural Phenomenology Through Causal Uncertainty

    The way we perceive a sound depends on many aspects: its ecological frequency, acoustic features, typicality, and, most notably, its identified source. In this paper, we present the HCU400: a dataset of 402 sounds ranging from easily identifiable everyday sounds to intentionally obscured artificial ones. It aims to lower the barrier for the study of aural phenomenology as the largest available audio dataset to include an analysis of causal attribution. Each sample has been annotated with crowd-sourced descriptions, as well as familiarity, imageability, arousal, and valence ratings. We extend existing calculations of causal uncertainty, automating and generalizing them with word embeddings. Upon analysis we find that individuals will provide less polarized emotion ratings as a sound's source becomes increasingly ambiguous; individual ratings of familiarity and imageability, on the other hand, diverge as uncertainty increases despite a clear negative trend on average.
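
    The paper's embedding-based generalization of causal uncertainty is not spelled out here, but a minimal sketch of the general idea might look like the following, assuming precomputed sentence embeddings for each crowd-sourced description; the clustering step, the cluster count, and the function name `causal_uncertainty` are illustrative assumptions rather than the authors' exact pipeline.

```python
# Hypothetical sketch: causal uncertainty as entropy over clustered
# description embeddings. Not the HCU400 authors' exact implementation.
import numpy as np
from scipy.stats import entropy
from sklearn.cluster import KMeans

def causal_uncertainty(description_embeddings: np.ndarray, n_causes: int = 8) -> float:
    """Estimate causal uncertainty for one sound from the embeddings of its
    crowd-sourced descriptions (shape: [n_descriptions, embedding_dim])."""
    n_causes = min(n_causes, len(description_embeddings))
    # Group semantically similar descriptions into putative causes.
    labels = KMeans(n_clusters=n_causes, n_init=10).fit_predict(description_embeddings)
    # Empirical distribution over the attributed causes.
    counts = np.bincount(labels, minlength=n_causes).astype(float)
    probs = counts / counts.sum()
    # Higher entropy means listeners disagree more about the sound's source.
    return float(entropy(probs, base=2))
```

    Under this reading, sounds whose descriptions scatter across many clusters receive high uncertainty, while sounds that nearly everyone attributes to the same source score close to zero.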

    On HRTF Notch Frequency Prediction Using Anthropometric Features and Neural Networks

    High-fidelity spatial audio often performs better when produced using a personalized head-related transfer function (HRTF). However, the direct acquisition of HRTFs is cumbersome and requires specialized equipment. Thus, many personalization methods estimate HRTF features from easily obtained anthropometric features of the pinna, head, and torso. The first HRTF notch frequency (N1) is known to be a dominant feature in elevation localization, and thus a useful feature for HRTF personalization. This paper describes the prediction of N1 frequency from pinna anthropometry using a neural model. Prediction is performed separately on three databases, spanning both simulated and measured data, and then by domain mixing between the databases. The model successfully predicts N1 frequency for individual databases and by domain mixing between some databases. Prediction errors are better than or comparable to those previously reported, showing significant improvement when acquired over a large database and with a larger output range.
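
    As a rough illustration of the kind of model the abstract describes, the sketch below maps a small vector of pinna measurements to a single N1 frequency with a fully connected network; the input dimensionality, layer sizes, and training step are assumptions, not the paper's reported architecture.

```python
# Illustrative sketch only: a small regressor from pinna anthropometry to the
# first HRTF notch frequency (N1). Layer sizes and the 8-feature input are
# assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class N1Regressor(nn.Module):
    def __init__(self, n_features: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),  # predicted N1 frequency (e.g., in kHz)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = N1Regressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(features: torch.Tensor, n1_khz: torch.Tensor) -> float:
    """One gradient step on a batch of (anthropometry, N1 frequency) pairs."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(features).squeeze(-1), n1_khz)
    loss.backward()
    optimizer.step()
    return loss.item()
```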

    Towards Improved Room Impulse Response Estimation for Speech Recognition

    We propose to characterize and improve the performance of blind room impulse response (RIR) estimation systems in the context of a downstream application scenario, far-field automatic speech recognition (ASR). We first draw the connection between improved RIR estimation and improved ASR performance, as a means of evaluating neural RIR estimators. We then propose a GAN-based architecture that encodes RIR features from reverberant speech and constructs an RIR from the encoded features, and uses a novel energy decay relief loss to optimize for capturing energy-based properties of the input reverberant speech. We show that our model outperforms the state-of-the-art baselines on acoustic benchmarks (by 72% on the energy decay relief and 22% on an early-reflection energy metric), as well as in an ASR evaluation task (by 6.9% in word error rate).
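
    The energy decay relief (EDR) loss is only named in the abstract; one plausible way to compute an EDR-style distance between an estimated and a reference RIR is sketched below, using an STFT followed by backward (Schroeder-style) energy integration per frequency band. The STFT parameters and the L1 distance are assumptions and may differ from the paper's formulation.

```python
# A sketch of an energy-decay-relief (EDR) style loss between an estimated and
# a reference room impulse response. Parameter choices are assumptions.
import torch

def energy_decay_relief(rir: torch.Tensor, n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    """EDR(f, t): per-band energy remaining after each time frame, in dB."""
    spec = torch.stft(rir, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    energy = spec.abs() ** 2  # (freq, frames)
    # Backward cumulative sum over time: energy still to arrive in each band.
    edr = torch.flip(torch.cumsum(torch.flip(energy, dims=[-1]), dim=-1), dims=[-1])
    return 10.0 * torch.log10(edr + 1e-8)

def edr_loss(est_rir: torch.Tensor, ref_rir: torch.Tensor) -> torch.Tensor:
    """Mean absolute difference between the two EDR surfaces."""
    return torch.mean(torch.abs(energy_decay_relief(est_rir) - energy_decay_relief(ref_rir)))
```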

    Hearing Loss Detection from Facial Expressions in One-on-one Conversations

    Individuals with impaired hearing experience difficulty in conversations, especially in noisy environments. This difficulty often manifests as a change in behavior and may be captured via facial expressions, such as the expression of discomfort or fatigue. In this work, we build on this idea and introduce the problem of detecting hearing loss from an individual's facial expressions during a conversation. Building machine learning models that can represent hearing-related facial expression changes is a challenge. In addition, models need to disentangle spurious age-related correlations from hearing-driven expressions. To this end, we propose a self-supervised pre-training strategy tailored for the modeling of expression variations. We also use adversarial representation learning to mitigate the age bias. We evaluate our approach on a large-scale egocentric dataset with real-world conversational scenarios involving subjects with hearing loss, and show that our method for hearing loss detection achieves superior performance over baselines.
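
    A common way to realize the adversarial age-debiasing described above is a gradient reversal layer: an auxiliary age predictor is trained on the shared representation while its gradient is flipped before reaching the encoder. The sketch below is a generic illustration of that technique under assumed feature sizes and heads; it is not the paper's model.

```python
# Illustrative gradient-reversal sketch for adversarially removing age
# information from a hearing-loss representation. Encoder and head shapes
# are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient so the encoder learns to *hide* age information.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

encoder = nn.Sequential(nn.Linear(256, 128), nn.ReLU())  # expression features -> embedding
hearing_head = nn.Linear(128, 2)                         # hearing loss vs. typical hearing
age_head = nn.Linear(128, 1)                             # adversary predicting age

def total_loss(features, hearing_label, age):
    z = encoder(features)
    task_loss = nn.functional.cross_entropy(hearing_head(z), hearing_label)
    # The adversary sees the reversed-gradient embedding.
    adv_loss = nn.functional.mse_loss(age_head(grad_reverse(z)).squeeze(-1), age)
    return task_loss + adv_loss
```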

    The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective

    In recent years, the thriving development of research related to egocentric videos has provided a unique perspective for the study of conversational interactions, where both visual and audio signals play a crucial role. While most prior work focuses on learning about behaviors that directly involve the camera wearer, we introduce the Ego-Exocentric Conversational Graph Prediction problem, marking the first attempt to infer exocentric conversational interactions from egocentric videos. We propose a unified multi-modal framework, Audio-Visual Conversational Attention (AV-CONV), for the joint prediction of conversation behaviors (speaking and listening) for both the camera wearer and all other social partners present in the egocentric video. Specifically, we adopt the self-attention mechanism to model the representations across time, across subjects, and across modalities. To validate our method, we conduct experiments on a challenging egocentric video dataset that includes multi-speaker and multi-conversation scenarios. Our results demonstrate the superior performance of our method compared to a series of baselines. We also present detailed ablation studies to assess the contribution of each component in our model. Check our project page at https://vjwq.github.io/AV-CONV/
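
    The across-time, across-subject, and across-modality attention can be pictured as standard self-attention applied along different axes of a (modality, subject, time, feature) tensor. The block below is a generic sketch of that idea with assumed tensor shapes and a made-up `AxisAttention` module; it is not the published AV-CONV architecture.

```python
# A minimal sketch of attention applied along different axes of an
# audio-visual tensor, in the spirit of across-time / across-subject /
# across-modality modeling. Shapes and block structure are assumptions.
import torch
import torch.nn as nn

class AxisAttention(nn.Module):
    """Self-attention over one chosen axis of a (M, S, T, D) tensor."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, axis: int) -> torch.Tensor:
        # Move the attended axis next to the feature dim, flatten the rest as batch.
        x = x.movedim(axis, -2)                       # (..., L, D)
        shape = x.shape
        flat = x.reshape(-1, shape[-2], shape[-1])    # (batch, L, D)
        out, _ = self.attn(flat, flat, flat)
        # Residual connection, then restore the original axis order.
        return (flat + out).reshape(shape).movedim(-2, axis)

# Example: 2 modalities, 4 subjects, 16 time steps, 128-dim features.
time_attn, subj_attn, mod_attn = (AxisAttention(128) for _ in range(3))
x = torch.randn(2, 4, 16, 128)
x = time_attn(x, axis=2)   # attend across time
x = subj_attn(x, axis=1)   # attend across subjects
x = mod_attn(x, axis=0)    # attend across modalities
```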

    System specific power reduction techniques for wearable navigation technology

    Thesis: M.Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
    As a result of advances in computer vision, mapping, and controls, wearable technology for visually impaired individuals has become a growing space of research within assistive technology. A team at the MIT Energy Efficient Circuits Group has made an important stride forward by presenting a wearable navigation prototype in a fully integrated hardware form factor, but one of the biggest barriers to usability of the device is its excessive power consumption. As such, the goal of this work is, broadly, to: (1) understand the largest sources of power consumption in the initial navigation prototype system and expose relevant features for control; (2) develop a set of algorithms that capitalize on the motion of the user, the motion of the environment around the user, and the proximity of obstacles to the user in order to dynamically tune the exposed parameters and scale power as necessary; and (3) lay the foundation for the next-generation wearable navigation prototype by translating critical software operations and the power-scaling algorithms into a hardware architecture capable of working with a smaller and less power-intensive depth camera.
    The first portion of this work focuses on the wearable navigation prototype built around Texas Instruments' OPT9220/9221 time-of-flight chipset. Illumination voltage, frame rate, and integration duty cycle are identified as key control features, and a step-rate estimation algorithm, a scene-statistics algorithm, and a frame-skipping controller that tune these features are built and tested. The latter half of the work focuses on the newer OPT8320 evaluation platform, for which a Bluespec SystemVerilog implementation of these power algorithms and the point-cloud generation operation is presented and tested. Overall, the work demonstrates the critical concept that simple, system-specific, fully integrated algorithms can effectively be used to reduce analog power system-wide.
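
    The power-scaling idea reduces to a simple control loop: estimate how fast the user is moving and how close the nearest obstacle is, then pick the lowest sensor frame rate that is still safe. A hypothetical sketch is shown below; the thresholds, rates, and function name are illustrative assumptions, not values from the thesis.

```python
# Hypothetical sketch of a frame-skipping controller for a time-of-flight
# depth sensor: scale the frame rate with user motion, scene motion, and
# obstacle proximity. All thresholds and rates are illustrative assumptions.

def select_frame_rate(step_rate_hz: float, nearest_obstacle_m: float,
                      scene_motion: float) -> int:
    """Return a target frame rate (fps) for the depth sensor."""
    if nearest_obstacle_m < 1.0 or scene_motion > 0.5:
        return 30                  # close obstacle or busy scene: full rate
    if step_rate_hz > 1.5:         # user walking briskly
        return 15
    if step_rate_hz > 0.5:         # user walking slowly
        return 10
    return 5                       # user standing still: skip most frames
```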

    Cognitive Audio: Enabling Auditory Interfaces with an Understanding of How We Hear

    Over the last several decades, neuroscientists, cognitive scientists, and psychologists have made strides in understanding the complex and mysterious processes that define the interaction between our minds and the sounds around us. Some of these processes, particularly at the lowest levels of abstraction relative to a sound wave, are well understood and easy to characterize across large sections of the human population; others, however, are the sum of both intuition and observations drawn from small-scale laboratory experiments, and remain as yet poorly understood.
    In this thesis, I suggest that there is value in coupling insight into the workings of auditory processing, beginning with abstractions in pre-conscious processing, with new frontiers in interface design and state-of-the-art infrastructure for parsing and identifying sound objects, as a means of unlocking audio technologies that are much more immersive, naturalistic, and synergistic than those present in the existing landscape. From the vantage point of today's computational models and devices that largely represent audio at the level of the digital sample, I gesture towards a world of auditory interfaces that work deeply in concert with uniquely human tendencies, allowing us to altogether re-imagine how we capture, preserve, and experience bodies of sound: towards, for example, augmented reality devices that manipulate sound objects to minimize distractions, lossy "codecs" that operate on semantic rather than time-frequency information, and soundscape design engines operating on large corpora of audio data that optimize for aesthetic or experiential outcomes instead of purely objective ones.
    To do this, I aim to introduce and explore a new research direction focused on the marriage of principles governing pre-conscious auditory cognition with traditional HCI approaches to auditory interface design via explicit statistical modeling, termed "Cognitive Audio". Along the way, I consider the major roadblocks that present themselves in approaching this convergence: I ask how we might "probe" and measure a cognitive principle of interest robustly enough to inform system design, in the absence of the immediately observable biophysical phenomena that may accompany, for example, visual cognition; I also ask how we might build reliable, meaningful statistical models from the resulting data that drive compelling experiences despite inherent noise, sparsity, and generalizations made at the level of the crowd.
    I discuss early insights into these questions through the lens of a series of projects centered on auditory processing at different levels of abstraction. I begin with a discussion of early work focused on cognitive models of lower-level phenomena; these exercises then inform a comprehensive effort to construct general-purpose estimators of gestalt concepts in sound understanding. I then demonstrate the affordances of these estimators in the context of application systems that I construct and characterize, incorporating additional explorations of methods for personalization that sit atop these estimators. Finally, I conclude with a dialogue on the intersection between the key contributions in this dissertation and a string of major themes relevant to the audio technology and computation world today.

    Manipulating Causal Uncertainty in Sound Objects
