5 research outputs found

    Complex Neural Networks for Audio

    Audio is represented in two mathematically equivalent ways: the real-valued time domain (i.e., waveform) and the complex-valued frequency domain (i.e., spectrum). The frequency-domain representation has advantages: the human auditory system is known to process sound in the frequency domain, and linear time-invariant systems, which are convolved with sources in the time domain, may be factorized in the frequency domain. Neural networks have become rather useful when applied to audio tasks such as machine listening and audio synthesis, which are related by their dependence on high-quality acoustic models. Such models should ideally encapsulate fine-scale temporal structure, such as that encoded in the phase of frequency-domain audio, yet there are no authoritative deep learning methods for complex-valued audio. This manuscript is dedicated to addressing that shortcoming. Chapter 2 motivates complex networks by their affinity with complex-domain audio, while Chapter 3 contributes methods for building and optimizing complex networks. We show that the naive implementation of Adam optimization is incorrect for complex random variables and that the choice of input and output representation has a significant impact on the performance of a complex network. Experimental results with novel complex neural architectures are provided in the second half of this manuscript. Chapter 4 introduces a complex model for binaural audio source localization. We show that, like humans, the complex model can generalize to different anatomical filters, which is important in the context of machine listening; its performance is better than that of equivalent real-valued models, as well as real- and complex-valued baselines. Chapter 5 proposes a two-stage method for speech enhancement. In the first stage, a complex-valued stochastic autoencoder projects complex vectors to a discrete space. In the second stage, long-term temporal dependencies are modeled in the discrete space. The autoencoder raises the performance ceiling for state-of-the-art speech enhancement, but the dynamic enhancement model does not outperform other baselines. We discuss areas for improvement and note that the complex Adam optimizer improves training convergence over the naive implementation.
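    The point about Adam is concrete enough to sketch. A naive port of Adam to complex parameters squares the gradient elementwise, so the second-moment accumulator becomes complex-valued and stops behaving like a variance estimate; one standard correction is to accumulate the squared magnitude of the gradient instead. The NumPy sketch below illustrates only that correction; the function name and hyperparameters are illustrative, not the dissertation's code.

```python
import numpy as np

def adam_step_complex(param, grad, m, v, t, lr=1e-3,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a complex-valued parameter array (sketch).

    A naive port squares the gradient elementwise (grad**2), which is
    complex for complex grad and no longer a variance proxy. Here the
    second moment uses the squared magnitude |grad|^2, so v stays real
    and non-negative.
    """
    m = beta1 * m + (1.0 - beta1) * grad                 # complex first moment
    v = beta2 * v + (1.0 - beta2) * np.abs(grad) ** 2    # real second moment
    m_hat = m / (1.0 - beta1 ** t)                       # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

    With m initialized as a complex zero array, v as a real zero array, and t counted from 1, this reduces to ordinary Adam whenever the parameters happen to be real, since |g|^2 = g^2 on the reals; the fix only changes behavior in the complex case.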

    GROOVE KERNELS AS RHYTHMIC-ACOUSTIC MOTIF DESCRIPTORS

    The "groove" of a song correlates with enjoyment and bodily movement. Recent work has shown that humans often agree on whether a song has groove and on how much groove it has. It is therefore useful to develop algorithms that characterize the quality of groove across songs. We evaluate three unsupervised, tempo-invariant models for measuring pairwise musical groove similarity: a temporal model, a timbre-temporal model, and a pitch-timbre-temporal model. The temporal model uses a rhythm similarity metric proposed by Holzapfel and Stylianou, while the timbre-inclusive models are built on shift-invariant probabilistic latent component analysis. We evaluate the models using a dataset of over 8,000 real-world musical recordings spanning approximately 10 genres, several decades, multiple meters, a large range of tempos, and Western and non-Western localities. A blind perceptual study is conducted: given a random music query, humans rate the groove similarity of the top three retrievals chosen by each of the models, as well as three random retrievals.
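    Holzapfel and Stylianou's published metric compares scale-transform magnitudes of onset-strength autocorrelations: a tempo change rescales the lag axis, and the magnitude of the scale (Mellin) transform is invariant to such rescaling, which is what makes the descriptor tempo-invariant. The sketch below approximates that pipeline with librosa and a log-lag resampling trick (a Fourier transform in log time approximates the Mellin transform); it is an illustration of the idea, not the authors' implementation, and all parameter values are assumptions.

```python
import numpy as np
import librosa

def scale_transform_descriptor(path, n_lags=512, n_log=512):
    """Tempo-invariant rhythm descriptor, loosely after Holzapfel &
    Stylianou's scale-transform similarity (illustrative sketch)."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    onset = librosa.onset.onset_strength(y=y, sr=sr)    # onset-strength envelope
    ac = librosa.autocorrelate(onset, max_size=n_lags)  # periodicity over lags
    ac = ac / (np.max(np.abs(ac)) + 1e-12)
    # Mellin/scale transform ~ Fourier transform in log-lag: resample the
    # autocorrelation on an exponential lag grid, then take FFT magnitude.
    lags = np.arange(1, len(ac))
    log_grid = np.exp(np.linspace(np.log(lags[0]), np.log(lags[-1]), n_log))
    warped = np.interp(log_grid, lags, ac[1:])
    mag = np.abs(np.fft.rfft(warped))  # shift in log-lag -> magnitude unchanged
    return mag / (np.linalg.norm(mag) + 1e-12)

def groove_similarity(path_a, path_b):
    """Cosine similarity between two descriptors (1.0 = most similar)."""
    return float(np.dot(scale_transform_descriptor(path_a),
                        scale_transform_descriptor(path_b)))
```

    A faithful re-implementation would follow the authors' published parameters; the cosine similarity on magnitude descriptors here is just to make the tempo-invariance claim tangible.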

    Modeling and Predicting Song Adjacencies in Commercial Albums

    (Abstract to follow.)

    Tapping into the vibe of the city using VibN, a continuous sensing application for smartphones

    We present VibN, a mobile sensing application deployed at large scale through the Apple App Store and the Android Market. VibN has been built to determine “what's going on” around the user in real time by exploiting multiple sensor feeds. The application allows its users to explore live points of interest in the city by presenting real-time hotspots computed from sensor data. Each hotspot is characterized by a demographic breakdown of its inhabitants and a list of short audio clips. The audio clips augment traditional microblogging methods by allowing users to automatically and manually provide rich audio data about their locations. VibN also allows one to browse historical points of interest and view how locations in a city evolve over time. Additionally, VibN automatically determines a user's personal points of interest, which form a breadcrumb diary of the locations where they have spent significant amounts of time. In this paper, we present the design, evaluation, and results from the large-scale deployment of VibN through the popular Apple App Store and Android Market.
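    The abstract does not spell out how personal points of interest are detected, but stay-point (dwell-time) detection is the standard approach to finding "places where a user spent significant time": group consecutive location fixes that stay within a small radius and keep groups whose time span exceeds a threshold. The sketch below is that generic algorithm with hypothetical names and thresholds; it is not claimed to be VibN's implementation.

```python
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt

@dataclass
class Fix:
    lat: float
    lon: float
    t: float  # Unix timestamp, seconds

def haversine_m(a: Fix, b: Fix) -> float:
    """Great-circle distance between two fixes, in meters."""
    dlat, dlon = radians(b.lat - a.lat), radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * 6371000 * asin(sqrt(h))

def personal_pois(fixes, radius_m=150.0, min_dwell_s=1200.0):
    """Stay-point detection (illustrative, not VibN's published method):
    consecutive fixes within radius_m whose time span exceeds
    min_dwell_s become one point of interest (centroid + interval)."""
    pois, i, n = [], 0, len(fixes)
    while i < n:
        j = i + 1
        while j < n and haversine_m(fixes[i], fixes[j]) <= radius_m:
            j += 1
        if fixes[j - 1].t - fixes[i].t >= min_dwell_s:
            lat = sum(f.lat for f in fixes[i:j]) / (j - i)
            lon = sum(f.lon for f in fixes[i:j]) / (j - i)
            pois.append((lat, lon, fixes[i].t, fixes[j - 1].t))
            i = j  # resume after the stay
        else:
            i += 1
    return pois
```

    Feeding this a day of fixes sampled every minute or so yields home, work, and similar dwell locations; the radius and dwell thresholds trade off sensitivity against noise and are assumptions here, not values from the paper.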