23 research outputs found
Anti-social behavior detection in audio-visual surveillance systems
In this paper we propose a general purpose framework for
detection of unusual events. The proposed system is based on the unsupervised method for unusual scene detection in web{cam images that was introduced in [1]. We extend their algorithm to accommodate data from different modalities and introduce the concept of time-space blocks. In addition, we evaluate early and late fusion techniques for our audio-visual data features. The experimental results on 192 hours of data show that data fusion of audio and video outperforms using a single modality
Using Sensor Metadata Streams to Identify Topics of Local Events in the City
In this paper, we study the emerging Information Retrieval (IR) task of local event retrieval using sensor metadata streams. Sensor metadata streams include information such as the crowd density from video processing, audio classifications, and social media activity. We propose to use these metadata streams to identify the topics of local events within a city, where each event topic corresponds to a set of terms representing a type of events such as a concert or a protest. We develop a supervised approach that is capable of mapping sensor metadata observations to an event topic. In addition to using a variety of sensor metadata observations about the current status of the environment as learning features, our approach incorporates additional background features to model cyclic event patterns. Through experimentation with data collected from two locations in a major Spanish city, we show that our approach markedly outperforms an alternative baseline. We also show that modelling background information improves event topic identification
Deep CNN Framework for Audio Event Recognition using Weakly Labeled Web Data
The development of audio event recognition models requires labeled training
data, which are generally hard to obtain. One promising source of recordings of
audio events is the large amount of multimedia data on the web. In particular,
if the audio content analysis must itself be performed on web audio, it is
important to train the recognizers themselves from such data. Training from
these web data, however, poses several challenges, the most important being the
availability of labels : labels, if any, that may be obtained for the data are
generally {\em weak}, and not of the kind conventionally required for training
detectors or classifiers. We propose that learning algorithms that can exploit
weak labels offer an effective method to learn from web data. We then propose a
robust and efficient deep convolutional neural network (CNN) based framework to
learn audio event recognizers from weakly labeled data. The proposed method can
train from and analyze recordings of variable length in an efficient manner and
outperforms a network trained with {\em strongly labeled} web data by a
considerable margin
Albayzin-2010 audio segmentation evaluation: evaluation setup and results
In this paper, we present the audio segmentation task from the Albayzín-2010 evaluation, and the results obtained by the
eight participants from Spanish and Portuguese universities.
The evaluation task consisted of the segmentation of audio files from the Catalan 3/24 TV channel into 5 acoustic classes:
music, speech, speech over music, speech over noise and other. The final results from all participants show that the problem of segmenting broadcast news is still challenging. We also present an analysis of the segmentation errors of the submitted systems. Additionally, the evaluation setup,
including the database and the segmentation metric, is also described.Peer ReviewedPostprint (published version
Automatic Environmental Sound Recognition: Performance versus Computational Cost
In the context of the Internet of Things (IoT), sound sensing applications
are required to run on embedded platforms where notions of product pricing and
form factor impose hard constraints on the available computing power. Whereas
Automatic Environmental Sound Recognition (AESR) algorithms are most often
developed with limited consideration for computational cost, this article seeks
which AESR algorithm can make the most of a limited amount of computing power
by comparing the sound classification performance em as a function of its
computational cost. Results suggest that Deep Neural Networks yield the best
ratio of sound classification accuracy across a range of computational costs,
while Gaussian Mixture Models offer a reasonable accuracy at a consistently
small cost, and Support Vector Machines stand between both in terms of
compromise between accuracy and computational cost
Audio Bank: A High-Level Acoustic Signal Representation for Audio Event Recognition
Automatic audio event recognition plays a pivotal role in making human robot
interaction more closer and has a wide applicability in industrial automation,
control and surveillance systems. Audio event is composed of intricate phonic
patterns which are harmonically entangled. Audio recognition is dominated by
low and mid-level features, which have demonstrated their recognition
capability but they have high computational cost and low semantic meaning. In
this paper, we propose a new computationally efficient framework for audio
recognition. Audio Bank, a new high-level representation of audio, is comprised
of distinctive audio detectors representing each audio class in
frequency-temporal space. Dimensionality of the resulting feature vector is
reduced using non-negative matrix factorization preserving its discriminability
and rich semantic information. The high audio recognition performance using
several classifiers (SVM, neural network, Gaussian process classification and
k-nearest neighbors) shows the effectiveness of the proposed method.Comment: 6 pages, 9 figures, published in IEEE International Conf ICCAS 2014
(Best paper award
Learning sound representations using trainable COPE feature extractors
Sound analysis research has mainly been focused on speech and music
processing. The deployed methodologies are not suitable for analysis of sounds
with varying background noise, in many cases with very low signal-to-noise
ratio (SNR). In this paper, we present a method for the detection of patterns
of interest in audio signals. We propose novel trainable feature extractors,
which we call COPE (Combination of Peaks of Energy). The structure of a COPE
feature extractor is determined using a single prototype sound pattern in an
automatic configuration process, which is a type of representation learning. We
construct a set of COPE feature extractors, configured on a number of training
patterns. Then we take their responses to build feature vectors that we use in
combination with a classifier to detect and classify patterns of interest in
audio signals. We carried out experiments on four public data sets: MIVIA audio
events, MIVIA road events, ESC-10 and TU Dortmund data sets. The results that
we achieved (recognition rate equal to 91.71% on the MIVIA audio events, 94% on
the MIVIA road events, 81.25% on the ESC-10 and 94.27% on the TU Dortmund)
demonstrate the effectiveness of the proposed method and are higher than the
ones obtained by other existing approaches. The COPE feature extractors have
high robustness to variations of SNR. Real-time performance is achieved even
when the value of a large number of features is computed.Comment: Accepted for publication in Pattern Recognitio