401 research outputs found
Automatic Audio Content Analysis
This paper describes the theoretical framework and applications of automatic audio content analysis. Research in multimedia content analysis has so far concentrated on the video domain. We demonstrate the strength of automatic audio content analysis. We explain the algorithms we use, including analysis of amplitude, frequency, and pitch, as well as simulations of human audio perception. These algorithms serve as tools for further audio content analysis. We use them in applications such as the segmentation of audio data streams into logical units for further processing, the analysis of music, and the recognition of sounds indicative of violence, such as shots, explosions, and cries.
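The low-level tools this abstract lists (amplitude, frequency, and pitch analysis) can be sketched in a few lines. The following is a minimal illustration, not the paper's actual algorithms: frame-wise RMS amplitude plus a crude autocorrelation pitch estimator. Function names and parameter choices are assumptions.

```python
import numpy as np

def rms_amplitude(frame):
    """Root-mean-square (RMS) amplitude of one analysis frame."""
    return float(np.sqrt(np.mean(np.asarray(frame, dtype=float) ** 2)))

def pitch_autocorr(frame, sr, fmin=80.0, fmax=500.0):
    """Crude fundamental-frequency estimate: pick the autocorrelation
    peak within the lag range corresponding to [fmin, fmax] Hz."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return sr / lag

sr = 8000
t = np.arange(2048) / sr
tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)   # 220 Hz test tone
print(rms_amplitude(tone))        # close to 0.5 / sqrt(2), about 0.354
print(pitch_autocorr(tone, sr))   # close to 220 Hz
```

Real systems refine this with windowing, interpolation around the autocorrelation peak, and voicing decisions, but the two functions above capture the kind of primitives such an analysis toolchain builds on.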
MPEG-1 bitstreams processing for audio content analysis
In this paper, we present the MPEG-1 audio bitstream processing work in which our research group is involved. This work is primarily based on the processing of the encoded bitstream and the extraction of useful audio features for the purposes of analysis and browsing. To prepare for the discussion of these features, the MPEG-1 audio bitstream format is first described. The application programming interface (API) which we have been developing in C++ is then introduced, before the paper concludes with a discussion of audio feature extraction.
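As a hedged illustration of what working directly on the encoded bitstream involves at the lowest level, the sketch below parses the fixed 32-bit MPEG-1 audio frame header (field layout per ISO/IEC 11172-3). It handles MPEG-1 Layer II only and is in no way the paper's C++ API; the function name and returned dictionary are assumptions.

```python
def parse_mpeg1_audio_header(data):
    """Parse the 32-bit MPEG-1 audio frame header at the start of `data`.
    Bitrate lookup is given for Layer II only; illustrative, not complete."""
    BITRATE_L2_KBPS = [None, 32, 48, 56, 64, 80, 96, 112,
                       128, 160, 192, 224, 256, 320, 384, None]
    SAMPLE_RATES_HZ = [44100, 48000, 32000, None]

    h = int.from_bytes(data[:4], "big")
    if h >> 21 != 0x7FF:                 # 11-bit sync word
        raise ValueError("sync word not found")
    version = (h >> 19) & 0x3            # 0b11 means MPEG-1
    layer_code = (h >> 17) & 0x3         # 0b01=Layer III, 0b10=II, 0b11=I
    if version != 0b11 or layer_code != 0b10:
        raise ValueError("only MPEG-1 Layer II handled in this sketch")
    return {
        "layer": 2,
        "bitrate_kbps": BITRATE_L2_KBPS[(h >> 12) & 0xF],
        "sample_rate_hz": SAMPLE_RATES_HZ[(h >> 10) & 0x3],
        "padding": bool((h >> 9) & 0x1),
        "channel_mode": (h >> 6) & 0x3,  # 0=stereo ... 3=single channel
    }

# Hand-built header: MPEG-1 Layer II, 256 kbps, 44.1 kHz, stereo
hdr = parse_mpeg1_audio_header(bytes([0xFF, 0xFD, 0xC0, 0x00]))
```

Feature extraction in the compressed domain then proceeds frame by frame from fields like these, without fully decoding the audio.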
AudioPairBank: Towards A Large-Scale Tag-Pair-Based Audio Content Analysis
Recently, sound recognition has been used to identify sounds such as "car" and "river". However, sounds have nuances that may be better described by adjective-noun pairs such as "slow car" and verb-noun pairs such as "flying insects", which remain underexplored. Therefore, in this work we investigate the relation between audio content and both adjective-noun and verb-noun pairs. Due to the lack of datasets with these kinds of annotations, we collected and processed the AudioPairBank corpus, consisting of a combined total of 1,123 pairs and over 33,000 audio files. One contribution is the previously unavailable documentation of the challenges and implications of collecting audio recordings with this type of label. A second contribution is to show the degree of correlation between the audio content and the labels through sound recognition experiments, which yielded results of 70% accuracy, hence also providing a performance benchmark. The results and study in this paper encourage further exploration of the nuances in audio and are meant to complement similar research performed on images and text in multimedia analysis.
Comment: This paper is a revised version of "AudioSentibank: Large-scale Semantic Ontology of Acoustic Concepts for Audio Content Analysis".
Multi-channel approaches for musical audio content analysis
The goal of this research project is to undertake a critical evaluation of signal representations for musical audio content analysis. In particular, it will contrast three different means of undertaking the analysis of micro-rhythmic content in Afro-Latin American music, namely through the use of: i) stereo or mono mixed recordings; ii) separated sources obtained via state-of-the-art musical audio source separation techniques; and iii) perfectly separated multi-track stems.
In total the project comprises the following five objectives: i) to compile a dataset of mixed and multi-channel recordings of Brazilian Maracatu musicians; ii) to conceive methods for the analysis and pattern recognition of micro-rhythmic variations; iii) to explore diverse music source separation approaches that preserve micro-rhythmic content; iv) to evaluate the performance of several automatic onset estimation approaches; and v) to compare the rhythmic analysis obtained from the original multi-channel sources with that obtained from the separated ones, to evaluate separation quality with regard to microtiming identification.
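Objective iv) above, automatic onset estimation, can be illustrated with a common baseline: spectral-flux onset strength followed by peak picking. This is a generic sketch, not one of the approaches evaluated in the project; the frame size, hop, and threshold ratio are arbitrary assumptions.

```python
import numpy as np

def onset_strength(x, frame=512, hop=256):
    """Spectral flux: summed positive magnitude change between frames."""
    win = np.hanning(frame)
    n = 1 + (len(x) - frame) // hop
    mags = np.array([np.abs(np.fft.rfft(x[i * hop:i * hop + frame] * win))
                     for i in range(n)])
    return np.maximum(np.diff(mags, axis=0), 0.0).sum(axis=1)

def pick_onsets(flux, hop, sr, ratio=0.5):
    """Local maxima above a fraction of the global peak, as times in s."""
    thr = ratio * flux.max()
    idx = [i for i in range(1, len(flux) - 1)
           if flux[i] > thr and flux[i] >= flux[i - 1] and flux[i] > flux[i + 1]]
    return np.array(idx) * hop / sr

sr = 8000
x = np.zeros(sr)                 # one second of silence...
x[2000] = 1.0                    # ...with clicks near 0.25 s
x[6000] = 1.0                    # and 0.75 s
onsets = pick_onsets(onset_strength(x), 256, sr)
```

Objective v), the microtiming comparison, would then amount to differencing the onset times recovered from the mixed signal against those recovered from the separated stems.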
Iracema: a Python library for audio content analysis
Iracema is a Python library that aims to provide models for the extraction of meaningful information from recordings of monophonic pieces of music, for purposes of research in music performance. With this objective in mind, we propose an architecture that provides users with an abstraction level that simplifies the manipulation of different kinds of time series, as well as the extraction of segments from them. In this paper we: (1) introduce some key concepts at the core of the proposed architecture; (2) describe the current functionalities of the package; (3) give some examples of the application programming interface; and (4) give some brief examples of audio analysis using the system.
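The abstract's central idea, a single abstraction covering different kinds of time series (raw audio, derived feature curves) plus segment extraction, can be illustrated with a hypothetical wrapper class. This is explicitly not Iracema's actual API; the class and method names below are invented for illustration only.

```python
import numpy as np

class TimeSeries:
    """Hypothetical time-series wrapper (illustrative, not Iracema's API)."""

    def __init__(self, data, fs, start=0.0):
        self.data = np.asarray(data, dtype=float)
        self.fs = float(fs)        # sampling rate of this series (Hz)
        self.start = float(start)  # time of the first sample (s)

    @property
    def time(self):
        """Time axis in seconds, aligned with `data`."""
        return self.start + np.arange(len(self.data)) / self.fs

    def segment(self, t0, t1):
        """Extract the sub-series between times t0 and t1 (seconds)."""
        i0 = int(round((t0 - self.start) * self.fs))
        i1 = int(round((t1 - self.start) * self.fs))
        return TimeSeries(self.data[i0:i1], self.fs, start=t0)

    def rms(self, frame, hop):
        """Derived series: frame-wise RMS at a lower sampling rate."""
        n = 1 + max(0, len(self.data) - frame) // hop
        vals = [np.sqrt(np.mean(self.data[i * hop:i * hop + frame] ** 2))
                for i in range(n)]
        return TimeSeries(vals, self.fs / hop, start=self.start)

ts = TimeSeries(np.ones(1000), fs=100)
seg = ts.segment(2.0, 4.0)       # a note, phrase, etc.
env = ts.rms(frame=10, hop=5)    # feature curve, still a TimeSeries
```

The point of such a design is that a derived feature curve carries its own sampling rate and start time, so segments cut from the audio and from any feature stay time-aligned automatically.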
Audio Content Analysis for Unobtrusive Event Detection in Smart Homes
Institute of Engineering Sciences
Environmental sound signals are multi-source, heterogeneous, and varying in time. Many systems have been proposed to process such signals for event detection in ambient assisted living applications. Typically, these systems use feature extraction, selection, and classification. However, despite major advances, several important questions remain unanswered, especially in real-world settings. This paper contributes to the body of knowledge in the field by addressing the following problems for ambient sounds recorded in various real-world kitchen environments: 1) which features and which classifiers are most suitable in the presence of background noise? 2) what is the effect of signal duration on recognition accuracy? 3) how do the signal-to-noise ratio and the distance between the microphone and the audio source affect the recognition accuracy in an environment in which the system was not trained? We show that for systems that use traditional classifiers, it is beneficial to combine gammatone frequency cepstral coefficients and discrete wavelet transform coefficients and to use a gradient boosting classifier. For systems based on deep learning, we consider 1D and 2D convolutional neural networks (CNNs) using mel-spectrogram energies and mel-spectrogram images as inputs, respectively, and show that the 2D CNN outperforms the 1D CNN. We obtained competitive classification results for two such systems. The first, which uses a gradient boosting classifier, achieved an F1-score of 90.2% and a recognition accuracy of 91.7%. The second, which uses a 2D CNN with mel-spectrogram images, achieved an F1-score of 92.7% and a recognition accuracy of 96%.
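As a hedged illustration of the wavelet half of the traditional feature set above (the gammatone frequency cepstral coefficients are omitted), the sketch below computes per-subband log-energies from a Haar DWT written directly in NumPy. The wavelet choice, feature layout, and function names are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar DWT: approximation and detail coefficients."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:
        x = x[:-1]                          # drop a trailing odd sample
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)  # low-pass (approximation)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # high-pass (detail)
    return a, d

def dwt_features(x, levels=3):
    """Log-energy per subband over `levels` Haar decompositions."""
    feats, a = [], np.asarray(x, dtype=float)
    for _ in range(levels):
        a, d = haar_dwt(a)
        feats.append(np.log(np.mean(d ** 2) + 1e-12))
    feats.append(np.log(np.mean(a ** 2) + 1e-12))
    return np.array(feats)
```

In a pipeline of the kind the abstract describes, a vector like this would be concatenated with the cepstral features before being fed to the gradient boosting classifier.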
Deep CNN Framework for Audio Event Recognition using Weakly Labeled Web Data
The development of audio event recognition models requires labeled training data, which are generally hard to obtain. One promising source of recordings of audio events is the large amount of multimedia data on the web. In particular, if the audio content analysis must itself be performed on web audio, it is important to train the recognizers from such data. Training from these web data, however, poses several challenges, the most important being the availability of labels: labels, if any, that may be obtained for the data are generally "weak", and not of the kind conventionally required for training detectors or classifiers. We propose that learning algorithms that can exploit weak labels offer an effective method to learn from web data. We then propose a robust and efficient deep convolutional neural network (CNN) based framework to learn audio event recognizers from weakly labeled data. The proposed method can train from and analyze recordings of variable length in an efficient manner, and it outperforms a network trained with "strongly labeled" web data by a considerable margin.
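A common way to exploit weak (clip-level) labels is multiple-instance-style max pooling: segment-level scores are aggregated with a max, so a clip is scored positive if any of its segments fires. The toy NumPy sketch below trains a linear segment scorer this way on synthetic data; it illustrates only the pooling idea, not the paper's CNN, and all names and hyperparameters here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

# Toy weakly labeled data: each clip is 5 segments of 8-dim features.
# A positive clip hides the "event" (a mean shift) in ONE random segment;
# the clip-level label never says which segment carries it.
def make_clip(label):
    x = rng.normal(0.0, 1.0, size=(5, 8))
    if label:
        x[rng.integers(5)] += 2.0
    return x

clips = [(make_clip(y), y) for y in [0, 1] * 50]

w, b, lr = np.zeros(8), 0.0, 0.1
for _ in range(200):                 # epochs of plain SGD
    for x, y in clips:
        p_seg = sigmoid(x @ w + b)   # per-segment event probabilities
        k = int(np.argmax(p_seg))    # max pooling selects one segment
        g = p_seg[k] - y             # dBCE/dlogit at the pooled segment
        w -= lr * g * x[k]           # subgradient step through the max
        b -= lr * g

train_acc = np.mean([(sigmoid(x @ w + b).max() > 0.5) == bool(y)
                     for x, y in clips])
```

The same aggregation handles variable-length recordings naturally: however many segments a clip has, the max reduces them to a single clip-level prediction that the weak label can supervise.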