
Characterizing Audio Events for Video Soundtrack Analysis

Abstract

There is an entire emerging ecosystem of amateur video recordings on the internet today, in addition to the abundance of more professionally produced content. The ability to automatically scan and evaluate the content of these recordings would be very useful for search and indexing, especially as amateur content tends to be more poorly labeled and tagged than professional content. Although the visual content is often considered to be of primary importance, the audio modality contains rich information that can be very helpful in the context of video search and understanding. Any technology that could help to interpret video soundtrack data would also be applicable in a number of other scenarios, such as mobile device audio awareness, surveillance, and robotics. In this thesis we approach the problem of extracting information from these kinds of unconstrained audio recordings. Specifically, we focus on techniques for characterizing discrete audio events within the soundtrack (e.g. a dog bark or door slam), since we expect events to be particularly informative about content. Our task is made more complicated by the extremely variable recording quality and noise present in this type of audio. Initially we explore the idea of using the matching pursuit algorithm to decompose and isolate components of audio events. Using these components we develop an approach for non-exact (approximate) fingerprinting as a way to search audio data for similar recurring events, and we demonstrate a proof of concept for this idea. Subsequently we extend the use of matching pursuit to build a full audio fingerprinting system, with the goal of identifying simultaneously recorded amateur videos (i.e. videos taken in the same place at the same time by different people, which therefore contain overlapping audio). Automatic discovery of these simultaneous recordings is one particularly interesting facet of general video indexing. We evaluate this fingerprinting system on a database of 733 internet videos. Next we return to searching for features that directly characterize soundtrack events. We develop a system to detect transient sounds and represent each audio clip as a histogram of the transients it contains. We use this representation for video classification over a database of 1873 internet videos. When we combine these features with a spectral feature baseline system we achieve a relative improvement of 7.5% in mean average precision over the baseline. In another attempt to devise features that better describe and compare events, we investigate decomposing audio using a convolutional form of non-negative matrix factorization, resulting in event-like spectro-temporal patches. We use the resulting representation to build an event detection system that is more robust to additive noise than a comparative baseline system. Lastly we investigate a promising feature representation that has previously been used by others to describe event-like sound effect clips. These features derive from an auditory model and are meant to capture fine time structure in sound events. We compare these features and a related but simpler feature set on the task of video classification over 9317 internet videos. We find that combinations of these features with baseline spectral features produce a significant improvement in mean average precision over the baseline.
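
The abstract only names the matching pursuit decomposition in passing. For readers unfamiliar with it, the Python sketch below illustrates the general idea of greedily decomposing a signal against a fixed dictionary of unit-norm atoms. It is an illustrative sketch only: the names (decompose_mp, dictionary, n_iterations) and the toy random dictionary are assumptions made here for self-containment and do not correspond to the implementation or atoms actually used in the thesis.

    # Minimal matching pursuit sketch (illustrative, not the thesis code).
    import numpy as np

    def decompose_mp(signal, dictionary, n_iterations=50):
        """Greedily pick the atom most correlated with the residual,
        record (atom index, coefficient), and subtract its contribution."""
        residual = signal.astype(float).copy()
        atoms = []
        for _ in range(n_iterations):
            correlations = dictionary @ residual        # one score per atom
            best = int(np.argmax(np.abs(correlations)))
            coef = correlations[best]
            residual -= coef * dictionary[best]         # peel off that component
            atoms.append((best, coef))
        return atoms, residual

    # Toy usage: random unit-norm dictionary and a synthetic 1-D signal.
    rng = np.random.default_rng(0)
    D = rng.standard_normal((128, 1024))
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    x = 0.8 * D[3] + 0.5 * D[40] + 0.01 * rng.standard_normal(1024)
    atoms, residual = decompose_mp(x, D, n_iterations=10)

In practice the dictionary for audio decomposition would typically consist of time-frequency atoms (e.g. Gabor functions at multiple scales) rather than the random vectors used above, which serve only to keep the example runnable.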
