2,032 research outputs found
Using the Soundtrack to Classify Videos
Describes classifying environmental sounds, for a panel session discussing the development of "multimedia analytics": the science of how people can effectively and efficiently extract information from multimedia content.
Environmental Sound Recognition and Classification
Describes extracting information from soundtracks and environmental recordings.
Characterizing Audio Events for Video Soundtrack Analysis
There is an entire emerging ecosystem of amateur video recordings on the internet today, in addition to the abundance of more professionally produced content. The ability to automatically scan and evaluate the content of these recordings would be very useful for search and indexing, especially as amateur content tends to be more poorly labeled and tagged than professional content. Although the visual content is often considered to be of primary importance, the audio modality contains rich information which may be very helpful in the context of video search and understanding. Any technology that could help to interpret video soundtrack data would also be applicable in a number of other scenarios, such as mobile device audio awareness, surveillance, and robotics. In this thesis we approach the problem of extracting information from these kinds of unconstrained audio recordings. Specifically, we focus on techniques for characterizing discrete audio events within the soundtrack (e.g. a dog bark or door slam), since we expect events to be particularly informative about content. Our task is made more complicated by the extremely variable recording quality and noise present in this type of audio.

Initially we explore the idea of using the matching pursuit algorithm to decompose and isolate components of audio events. Using these components we develop an approach for non-exact (approximate) fingerprinting as a way to search audio data for similar recurring events, and we demonstrate a proof of concept for this idea. Subsequently we extend the use of matching pursuit to build an actual audio fingerprinting system, with the goal of identifying simultaneously recorded amateur videos (i.e. videos taken in the same place at the same time by different people, which contain overlapping audio). Automatic discovery of these simultaneous recordings is one particularly interesting facet of general video indexing. We evaluate this fingerprinting system on a database of 733 internet videos.

Next we return to searching for features to directly characterize soundtrack events. We develop a system to detect transient sounds and represent each audio clip as a histogram of the transients it contains. We use this representation for video classification over a database of 1873 internet videos. When we combine these features with a spectral feature baseline system we achieve a relative improvement of 7.5% in mean average precision over the baseline. In another attempt to devise features that better describe and compare events, we investigate decomposing audio using a convolutional form of non-negative matrix factorization, resulting in event-like spectro-temporal patches. We use the resulting representation to build an event detection system that is more robust to additive noise than a comparative baseline system.

Lastly we investigate a promising feature representation that has been used previously by others to describe event-like sound effect clips. These features derive from an auditory model and are meant to capture fine time structure in sound events. We compare these features and a related but simpler feature set on the task of video classification over 9317 internet videos. We find that combinations of these features with baseline spectral features produce a significant improvement in mean average precision over the baseline.
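To make the matching pursuit step concrete, the sketch below shows a greedy decomposition of one audio frame over a small Gabor-like dictionary. This is a minimal numpy-only illustration under stated assumptions (the dictionary scales, number of frequencies, and number of atoms are arbitrary choices), not the thesis implementation, which used a richer dictionary and proper time localization.

```python
# Minimal matching pursuit sketch (assumptions: frame-local Gabor-like atoms,
# arbitrary dictionary sizes; illustrative only, not the thesis code).
import numpy as np

def gabor_dictionary(frame_len, scales=(32, 64, 128), freqs=16):
    """Build a crude dictionary of Gaussian-windowed sinusoids (Gabor-like atoms)."""
    atoms = []
    t = np.arange(frame_len)
    for s in scales:
        for k in range(1, freqs + 1):
            env = np.exp(-0.5 * ((t - frame_len / 2) / s) ** 2)
            atom = env * np.cos(2 * np.pi * k * t / frame_len)
            atoms.append(atom / np.linalg.norm(atom))
    return np.array(atoms)                      # shape: (n_atoms, frame_len)

def matching_pursuit(x, dictionary, n_atoms=10):
    """Greedily pick the atoms that best explain the residual signal."""
    residual = x.astype(float).copy()
    decomposition = []
    for _ in range(n_atoms):
        corr = dictionary @ residual            # correlation with every atom
        best = int(np.argmax(np.abs(corr)))
        coeff = corr[best]
        residual -= coeff * dictionary[best]    # peel the chosen atom off
        decomposition.append((best, coeff))
    return decomposition, residual
```

In an approximate-fingerprinting setting of the kind the abstract describes, the indices of the strongest selected atoms (plus their coarse time positions in a proper time-localized version) could serve as the per-event hash; here the atoms are frame-local, so time offsets are omitted.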
A Review of Verbal and Non-Verbal Human-Robot Interactive Communication
In this paper, an overview of human-robot interactive communication is presented, covering verbal as well as non-verbal aspects of human-robot interaction. Following a historical introduction and motivation towards fluid human-robot communication, ten desiderata are proposed, which provide an organizational axis for both recent and future research on human-robot communication. Then, the ten desiderata are examined in detail, culminating in a unifying discussion and a forward-looking conclusion.
Using Cloudworks to Support OER Activities
This report forms the third and final output of the Pearls in the Clouds project, funded by the Higher Education Academy. It focuses on evaluation of the use of a social networking site, Cloudworks, to support evidence-based practice.
The aim of this project (Pearls in the Clouds) has been to evaluate the ways in which Web 2.0 tools like Cloudworks can support evidence-informed practices in relation to learning and teaching. We have reviewed evidence from empirically grounded studies surrounding the uses of Web 2.0 in higher education and highlighted the gap between using Web 2.0 to support learning and teaching, and using it to support learning about learning and teaching in an evidence-informed way (Conole and Alevizou, 2010). We have reported on findings from a case study focusing on the use of Cloudworks by a community of practice (educational technologists) reflecting upon, and negotiating, their role in enhancing teaching and learning in higher education (Galley et al., 2010). The object of this study is to explore and evaluate the use of the site by individuals and communities involved in the production of, and research on, the development, delivery and use of Open Educational Resources (OER).
Spectral vs. spectro-temporal features for acoustic event detection
Automatic detection of different types of acoustic events is an interesting problem in soundtrack processing. Typical approaches to the problem use short-term spectral features to describe the audio signal, with additional modeling on top to take temporal context into account. We propose an approach to detecting and modeling acoustic events that directly describes temporal context, using convolutive non-negative matrix factorization (NMF). NMF is useful for finding parts-based decompositions of data; here it is used to discover a set of spectro-temporal patch bases that best describe the data, with the patches corresponding to event-like structures. We derive features from the activations of these patch bases, and perform event detection on a database consisting of 16 classes of meeting-room acoustic events. We compare our approach with a baseline using standard short-term mel-frequency cepstral coefficient (MFCC) features. We demonstrate that the event-based system is more robust in the presence of added noise than the MFCC-based system, and that a combination of the two systems performs even better than either individually.
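As a rough illustration of the patch-based idea, the sketch below substitutes ordinary NMF over stacked spectrogram frames for the paper's convolutive NMF. The STFT settings, patch length, and number of bases are assumptions, and the activation matrix merely stands in for the event-detection features the abstract describes.

```python
# Sketch only: ordinary NMF over stacked spectrogram frames as a stand-in for
# convolutive NMF; parameter choices are assumptions, not the paper's setup.
import numpy as np
from scipy.signal import stft
from sklearn.decomposition import NMF

def patch_activations(audio, sr, patch_frames=8, n_bases=20):
    # Magnitude spectrogram: (freq_bins, time_frames)
    _, _, Z = stft(audio, fs=sr, nperseg=512, noverlap=256)
    S = np.abs(Z)

    # Stack patch_frames consecutive frames into one non-negative patch vector.
    patches = [S[:, i:i + patch_frames].ravel()
               for i in range(S.shape[1] - patch_frames + 1)]
    V = np.array(patches)                   # (n_patches, freq_bins * patch_frames)

    # Factorize: rows of nmf.components_ are spectro-temporal patch bases,
    # W holds per-patch activations that could feed an event detector.
    nmf = NMF(n_components=n_bases, init="nndsvda", max_iter=400)
    W = nmf.fit_transform(V)
    bases = nmf.components_.reshape(n_bases, S.shape[0], patch_frames)
    return W, bases
```

A true convolutive NMF would share each basis patch across all time shifts and update it jointly; the stacked-frame simplification approximates this by letting overlapping patches vote through their activations.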
Sonic Elongation: Creative Audition in Documentary Film
This paper investigates documentary films in which real-world sound captured from the location shoot has been treated more creatively than the captured image; in particular, instances when real-world noises pass freely between sound and musical composition. I call this process the sonic elongation from sound to music: a blurring that allows the soundtrack to keep one foot in the image, so that the film retains a loose grip on the traditional nonfiction aesthetic. With reference to several recent documentary feature films, I argue that such moments rely on a confusion between hearing and listening.
Recognizing and Classifying Environmental Sounds
Prof. Ellis presents a summary of LabROSA's new work, with a focus on recognizing environmental sounds, particularly for video classification by soundtrack.