
    Complex Neural Networks for Audio

    Audio is represented in two mathematically equivalent ways: the real-valued time domain (i.e., waveform) and the complex-valued frequency domain (i.e., spectrum). The frequency-domain representation has several advantages; for example, the human auditory system is known to process sound in the frequency domain. Furthermore, linear time-invariant systems are convolved with sources in the time domain, whereas they may be factorized in the frequency domain. Neural networks have become rather useful when applied to audio tasks such as machine listening and audio synthesis, which are related by their dependence on high-quality acoustic models. Such models ideally encapsulate fine-scale temporal structure, such as that encoded in the phase of frequency-domain audio, yet there are no authoritative deep learning methods for complex audio. This manuscript is dedicated to addressing that shortcoming. Chapter 2 motivates complex networks by their affinity with complex-domain audio, while Chapter 3 contributes methods for building and optimizing complex networks. We show that the naive implementation of Adam optimization is incorrect for complex random variables, and that the choice of input and output representation has a significant impact on the performance of a complex network. Experimental results with novel complex neural architectures are provided in the second half of this manuscript. Chapter 4 introduces a complex model for binaural audio source localization. We show that, like humans, the complex model can generalize to different anatomical filters, which is important in the context of machine listening. The complex model's performance is better than that of the real-valued models, as well as of real- and complex-valued baselines. Chapter 5 proposes a two-stage method for speech enhancement. In the first stage, a complex-valued stochastic autoencoder projects complex vectors to a discrete space. In the second stage, long-term temporal dependencies are modeled in the discrete space. The autoencoder raises the performance ceiling for state-of-the-art speech enhancement, but the dynamic enhancement model does not outperform other baselines. We discuss areas for improvement and note that the complex Adam optimizer improves training convergence over the naive implementation.
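
    To make the Adam claim concrete, here is a minimal NumPy sketch of the underlying idea: for a complex random variable the second moment is E[g * conj(g)], so the variance estimate should be real-valued and shared between the real and imaginary parts, rather than kept separately per part as naive Adam does. The function name complex_adam_step and the update layout are illustrative assumptions; the thesis's exact correction may differ in detail.

        import numpy as np

        def complex_adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
            """One Adam step for a complex parameter w with complex gradient g.

            Naive Adam treats Re(w) and Im(w) as independent reals and keeps a
            separate second-moment estimate for each part. For a complex random
            variable the second moment is E[g * conj(g)], so the estimate v here
            is real-valued and shared by both parts.
            """
            m = b1 * m + (1 - b1) * g                       # first moment (complex)
            v = b2 * v + (1 - b2) * (g * np.conj(g)).real   # second moment, |g|^2
            m_hat = m / (1 - b1 ** t)                       # bias corrections
            v_hat = v / (1 - b2 ** t)
            w = w - lr * m_hat / (np.sqrt(v_hat) + eps)     # real step scale
            return w, m, v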

    Musical Audio Synthesis Using Autoencoding Neural Nets

    With an optimal network topology and tuning of hyperparameters, artificial neural networks (ANNs) may be trained to learn a mapping from low-level audio features to one or more higher-level representations. Such artificial neural networks are commonly used in classification and regression settings to perform arbitrary tasks. In this work we suggest repurposing autoencoding neural networks as musical audio synthesizers. We offer an interactive musical audio synthesis system that uses feedforward artificial neural networks for musical audio synthesis, rather than for discriminative or regression tasks. In our system an ANN is trained on frames of low-level features, and a high-level representation of the musical audio is learned through an autoencoding neural net. Our real-time synthesis system allows one to interact directly with the parameters of the model and generate musical audio in real time. This work therefore proposes the exploitation of neural networks for creative musical applications.
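
    As a rough illustration of this design, the following PyTorch sketch trains a bottleneck autoencoder on spectral frames and then drives the decoder directly from user-set latent values. The class name FrameAutoencoder, the layer sizes, and the choice of magnitude-spectrum frames are assumptions for illustration, not details taken from the paper.

        import torch
        import torch.nn as nn

        class FrameAutoencoder(nn.Module):
            # Bottleneck autoencoder over spectral frames; sizes are placeholders.
            def __init__(self, n_bins=1025, n_hidden=256, n_latent=8):
                super().__init__()
                self.encoder = nn.Sequential(
                    nn.Linear(n_bins, n_hidden), nn.ReLU(),
                    nn.Linear(n_hidden, n_latent), nn.Sigmoid(),  # bounded latents
                )
                self.decoder = nn.Sequential(
                    nn.Linear(n_latent, n_hidden), nn.ReLU(),
                    nn.Linear(n_hidden, n_bins), nn.ReLU(),  # non-negative magnitudes
                )

            def forward(self, frames):
                return self.decoder(self.encoder(frames))

        # Training minimizes reconstruction error on feature frames; synthesis
        # then bypasses the encoder and drives the decoder interactively.
        model = FrameAutoencoder()
        knobs = torch.rand(1, 8)        # user-controlled "synth parameters"
        frame = model.decoder(knobs)    # one synthesized spectral frame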

    A new method for ecoacoustics? Toward the extraction and evaluation of ecologically-meaningful soundscape components using sparse coding methods

    Passive acoustic monitoring is emerging as a promising non-invasive proxy for ecological complexity, with potential as a tool for remote assessment and monitoring (Sueur and Farina, 2015). Rather than attempting to recognise species-specific calls, either manually or automatically, there is a growing interest in evaluating the global acoustic environment. Positioned within the conceptual framework of ecoacoustics, a growing number of indices have been proposed which aim to capture community-level dynamics (e.g. Pieretti et al., 2011; Farina, 2014; Sueur et al., 2008b) by providing statistical summaries of the frequency- or time-domain signal. Although promising, the ecological relevance of these indices, and their efficacy as a monitoring tool, is still unclear. In this paper we suggest that, by virtue of operating in the time or frequency domain, existing indices are limited in their ability to access key structural information in the spectro-temporal domain. Alternative methods in which time-frequency dynamics are preserved are considered. Sparse-coding and source separation algorithms (specifically, shift-invariant probabilistic latent component analysis in 2D) are proposed as a means to access and summarise time-frequency dynamics which may be more ecologically meaningful.
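
    To make the contrast concrete, the sketch below computes one index-style summary (spectral entropy) alongside a decomposition of the same spectrogram. Plain NMF from scikit-learn stands in for the paper's shift-invariant PLCA in 2D, and the signal, sample rate, and component count are all placeholder assumptions.

        import numpy as np
        from scipy.signal import stft
        from sklearn.decomposition import NMF

        fs = 22050
        x = np.random.randn(fs * 10)            # placeholder for a soundscape clip

        _, _, Z = stft(x, fs=fs, nperseg=1024)  # time-frequency representation
        S = np.abs(Z)

        # Index-style summary: spectral entropy collapses the spectrogram to a
        # single statistic, discarding spectro-temporal structure.
        p = S.mean(axis=1)
        p /= p.sum()
        spectral_entropy = -np.sum(p * np.log2(p + 1e-12)) / np.log2(len(p))

        # Decomposition-style alternative: factorize the spectrogram into a few
        # components whose bases and activations can themselves be summarised.
        # SIPLCA-2D additionally models time and frequency shifts per component.
        model = NMF(n_components=4, init="nndsvda", max_iter=300)
        W = model.fit_transform(S)              # spectral bases (freq x comp)
        H = model.components_                   # activations (comp x time)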

    Untitled  [Image no. 3177]

    For more information about this item, browse to http://hdl.handle.net/102.100.100/329

    Groove Kernels as Rhythmic-Acoustic Motif Descriptors

    The "groove" of a song correlates with enjoyment and bodily movement. Recent work has shown that humans often agree whether a song does or does not have groove, and how much groove a song has. It is therefore useful to develop algorithms that characterize the quality of groove across songs. We evaluate three unsupervised tempo-invariant models for measuring pairwise musical groove similarity: a temporal model, a timbre-temporal model, and a pitch-timbre-temporal model. The temporal model uses a rhythm similarity metric proposed by Holzapfel and Stylianou, while the timbre-inclusive models are built on shift-invariant probabilistic latent component analysis. We evaluate the models using a dataset of over 8000 real-world musical recordings spanning approximately 10 genres, several decades, multiple meters, a large range of tempos, and Western and non-Western localities. A blind perceptual study is conducted: given a random music query, humans rate the groove similarity of the top three retrievals chosen by each of the models, as well as three random retrievals.
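
    For the temporal model, the sketch below illustrates the tempo-invariance trick behind Holzapfel and Stylianou's metric: a tempo change rescales the lag axis of the onset-strength autocorrelation, which becomes a shift on a log-lag axis and is discarded by taking an FFT magnitude. The function names, the librosa feature choices, and the 128-bin resolution are assumptions; this approximates, rather than reproduces, the authors' scale-transform formulation.

        import numpy as np
        import librosa

        def rhythm_signature(path, n_bins=128):
            # Onset-strength autocorrelation resampled onto a log-lag axis; the
            # FFT magnitude then discards the tempo-induced log-lag shift.
            y, sr = librosa.load(path, sr=22050)
            env = librosa.onset.onset_strength(y=y, sr=sr)
            ac = librosa.autocorrelate(env)
            ac = ac[1:len(ac) // 2]                   # drop lag 0 and long lags
            lags = np.arange(1, len(ac) + 1)
            log_lags = np.geomspace(lags[0], lags[-1], n_bins)
            sig = np.abs(np.fft.rfft(np.interp(log_lags, lags, ac)))
            return sig / (np.linalg.norm(sig) + 1e-12)

        def groove_similarity(path_a, path_b):
            # Cosine similarity between unit-norm rhythm signatures.
            return float(np.dot(rhythm_signature(path_a), rhythm_signature(path_b)))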

    Modeling and Predicting Song Adjacencies in Commercial Albums

    (Abstract to follow.)

    Large-scale music tag recommendation with explicit multiple attributes

    MM '10: Proceedings of the ACM Multimedia 2010 International Conference, pp. 401-41. DOI: 10.1145/1873951.1874006