10,123 research outputs found
Time-Frequency Feature Fusion for Noise-Robust Audio Event Classification
This paper explores the use of three different two-dimensional time-frequency features for audio event classification with deep neural network back-end classifiers. The evaluations use spectrogram, cochleogram and constant-Q transform based images to classify 50 classes of audio events in varying levels of acoustic background noise, revealing performance patterns with respect to noise level, feature image type and classifier. Evidence is obtained that two well-performing features, the spectrogram and cochleogram, make use of information that is potentially complementary in the input features. Feature fusion is thus explored for each pair of features, as well as for all tested features. Results indicate that a fusion of spectrogram and cochleogram information is particularly beneficial, yielding a 50-class accuracy of over 96% at 0 dB SNR, and exceeding 99% accuracy at 10 dB SNR and above. Meanwhile, the cochleogram image feature is found to perform well in the extreme noise cases of -5 dB and -10 dB SNR.
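The noise conditions quoted in the abstract above (0 dB, 10 dB, -5 dB SNR, and so on) can be illustrated with a minimal sketch of how a noise recording is scaled to reach a target SNR before being added to a clean event. This is only the standard power-ratio definition of SNR; the function names `scale_noise_to_snr` and `measured_snr_db` are hypothetical, and the paper's own mixing procedure is not described in the abstract.

```python
import math

def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)

def scale_noise_to_snr(signal, noise, snr_db):
    """Return a copy of `noise` scaled so that signal/noise power equals `snr_db`.

    Assumption: SNR_dB = 10 * log10(P_signal / P_noise), solved for the noise gain.
    """
    gain = math.sqrt(mean_power(signal) / (mean_power(noise) * 10 ** (snr_db / 10)))
    return [gain * n for n in noise]

def measured_snr_db(signal, noise):
    """SNR in dB between a signal and a noise sequence."""
    return 10 * math.log10(mean_power(signal) / mean_power(noise))

def mix(signal, noise, snr_db):
    """Add noise to the signal at the requested SNR (sequences of equal length)."""
    scaled = scale_noise_to_snr(signal, noise, snr_db)
    return [s + n for s, n in zip(signal, scaled)]
```

A negative target such as -10 dB simply means the scaled noise carries ten times the power of the event.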
Musical notes classification with Neuromorphic Auditory System using FPGA and a Convolutional Spiking Network
In this paper, we explore the capabilities of a sound
classification system that combines both a novel FPGA cochlear
model implementation and a bio-inspired technique based on a
trained convolutional spiking network. The neuromorphic
auditory system that is used in this work produces a form of
representation that is analogous to the spike outputs of the
biological cochlea. The auditory system has been developed using
a set of spike-based processing building blocks in the frequency
domain. They form a set of band-pass filters in the spike domain
that split the audio information into 128 frequency channels, 64
for each of two audio sources. Address Event Representation
(AER) is used to connect the auditory system to the
convolutional spiking network. A layer of convolutional spiking
network is developed and trained on a computer with the ability
to detect two kinds of sound: artificial pure tones in the presence
of white noise and electronic musical notes. After the training
process, the presented system is able to distinguish the different
sounds in real time, even in the presence of white noise.
Ministerio de Economía y Competitividad TEC2012-37868-C04-0
General highlight detection in sport videos
Attention is a psychological measure of human response to a stimulus. We propose a general framework for highlight detection that compares attention intensity while viewers watch sports videos. Three steps are involved: adaptive selection of salient features, unified attention estimation and highlight identification. Adaptive selection computes feature correlation to decide on an optimal set of salient features. Unified estimation combines these features using the multi-resolution autoregressive (MAR) technique and thus creates a temporal curve of attention intensity. We rank the intensity of attention to discriminate the boundaries of highlights. Such a framework alleviates semantic uncertainty around sport highlights and leads to efficient and effective highlight detection. The advantages are as follows: (1) the capability of using data at coarse temporal resolutions; (2) robustness against noise caused by modality asynchronism, perception uncertainty and feature mismatch; (3) the use of Markovian constraints on content presentation; and (4) multi-resolution estimation of attention intensity, which enables the precise localization of event boundaries.
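The final step described above, ranking attention intensity to find highlight boundaries, might be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' method: the attention curve is taken as given (the MAR estimation is not reproduced), and the function `highlight_segments` and its `top_fraction` parameter are hypothetical.

```python
def highlight_segments(attention, top_fraction):
    """Return (start, end) index pairs of contiguous runs whose attention
    value ranks within the top `top_fraction` of all samples.

    Assumption: a highlight is any maximal run of samples at or above the
    rank-based cutoff; the paper's actual boundary rule is not specified here.
    """
    n_keep = max(1, int(len(attention) * top_fraction))
    cutoff = sorted(attention, reverse=True)[n_keep - 1]

    segments = []
    start = None
    for i, value in enumerate(attention):
        if value >= cutoff and start is None:
            start = i                        # a run begins
        elif value < cutoff and start is not None:
            segments.append((start, i - 1))  # the run ended at the previous sample
            start = None
    if start is not None:                    # a run reaches the end of the curve
        segments.append((start, len(attention) - 1))
    return segments

# Hypothetical attention curve sampled at ten time steps.
curve = [0.1, 0.2, 0.9, 0.8, 0.3, 0.1, 0.7, 0.95, 0.2, 0.1]
segments = highlight_segments(curve, 0.4)
```

Because the cutoff is rank-based rather than absolute, the same call adapts to videos whose overall attention level differs.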
A target guided subband filter for acoustic event detection in noisy environments using wavelet packets
This paper deals with acoustic event detection (AED), such as screams, gunshots, and explosions, in noisy environments. The main aim is to improve the detection performance under adverse conditions with a very low signal-to-noise ratio (SNR). A novel filtering method combined with an energy detector is presented. The wavelet packet transform (WPT) is first used for time-frequency representation of the acoustic signals. The proposed filter in the wavelet packet domain then uses a priori knowledge of the target event and an estimate of noise features to selectively suppress the background noise. It is in effect a content-aware band-pass filter that automatically passes the frequency bands that are more significant in the target than in the noise. Theoretical analysis shows that the proposed filtering method is capable of enhancing the target content while suppressing the background noise for signals with a low SNR. A condition to increase the probability of correct detection is also obtained. Experiments have been carried out on a large dataset of acoustic events that are contaminated by different types of environmental noise and white noise with varying SNRs. Results show that the proposed method is more robust and better adapted to noise than ordinary energy detectors, and it can work even with an SNR as low as -15 dB. A practical system for real-time processing and multi-target detection is also proposed in this work.
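The idea of a content-aware band-pass filter followed by an energy detector can be sketched roughly as below. This is a toy illustration, not the filter proposed in the paper: it assumes the wavelet packet decomposition has already produced one energy value per subband, and the selection rule (pass a band when its share of target energy exceeds its share of noise energy), the function names and the threshold are all assumptions.

```python
def band_mask(target_energy, noise_energy):
    """Return 1 for subbands that are relatively stronger in the target than
    in the noise, and 0 otherwise (energies normalised to fractions first)."""
    target_total = sum(target_energy)
    noise_total = sum(noise_energy)
    return [
        1 if t / target_total > n / noise_total else 0
        for t, n in zip(target_energy, noise_energy)
    ]

def filtered_energy(frame_energy, mask):
    """Total energy of a frame after discarding the masked-out subbands."""
    return sum(e * m for e, m in zip(frame_energy, mask))

def detect(frame_energy, mask, threshold):
    """Energy detector applied to the filtered frame."""
    return filtered_energy(frame_energy, mask) > threshold

# Hypothetical per-subband energies: the target is concentrated in bands 2-3,
# while the noise is concentrated in bands 0-1.
target_profile = [0.1, 0.2, 4.0, 3.0]
noise_profile = [2.0, 2.0, 0.5, 0.5]
mask = band_mask(target_profile, noise_profile)

noise_only_frame = [2.1, 1.9, 0.4, 0.6]
event_frame = [2.2, 2.1, 4.6, 3.4]
```

With these numbers an unfiltered energy detector sees totals of about 5.0 and 12.3, while the masked detector sees about 1.0 and 8.0, so the filtered energies of the two frames are separated by a much larger ratio.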
ModDrop: adaptive multi-modal gesture recognition
We present a method for gesture detection and localisation based on
multi-scale and multi-modal deep learning. Each visual modality captures
spatial information at a particular spatial scale (such as motion of the upper
body or a hand), and the whole system operates at three temporal scales. Key to
our technique is a training strategy which exploits: i) careful initialization
of individual modalities; and ii) gradual fusion involving random dropping of
separate channels (dubbed ModDrop) for learning cross-modality correlations
while preserving uniqueness of each modality-specific representation. We
present experiments on the ChaLearn 2014 Looking at People Challenge gesture
recognition track, in which we placed first out of 17 teams. Fusing multiple
modalities at several spatial and temporal scales leads to a significant
increase in recognition rates, allowing the model to compensate for errors of
the individual classifiers as well as noise in the separate channels.
Furthermore, the proposed ModDrop training technique ensures robustness of the
classifier to missing signals in one or several channels to produce meaningful
predictions from any number of available modalities. In addition, we
demonstrate the applicability of the proposed fusion scheme to modalities of
arbitrary nature by experiments on the same dataset augmented with audio.
Comment: 14 pages, 7 figures
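The "random dropping of separate channels" described in the abstract can be illustrated with a minimal sketch. It assumes each modality is represented by a flat feature vector and is zeroed independently with a fixed probability during training; the function `moddrop`, the `drop_prob` value and the modality names are hypothetical, and the paper's actual schedule and network are not reproduced.

```python
import random

def moddrop(modalities, drop_prob, rng):
    """Return a copy of `modalities` in which each modality's feature vector
    is replaced by zeros with probability `drop_prob`, independently.

    `modalities` maps a modality name to a list of floats; `rng` is a
    random.Random instance so that results are reproducible.
    """
    dropped = {}
    for name, features in modalities.items():
        if rng.random() < drop_prob:
            dropped[name] = [0.0] * len(features)  # whole channel dropped
        else:
            dropped[name] = list(features)         # channel kept unchanged
    return dropped

# Hypothetical training sample with three modalities.
sample = {
    "video": [0.2, 0.7, 0.1],
    "depth": [0.5, 0.4],
    "audio": [0.9, 0.3, 0.6, 0.1],
}
rng = random.Random(0)
augmented = moddrop(sample, drop_prob=0.3, rng=rng)
```

Dropping whole channels rather than individual units is what exposes the fused layers to inputs with a modality missing entirely, which matches the robustness to missing signals claimed in the abstract.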