2,626 research outputs found
Listening for Sirens: Locating and Classifying Acoustic Alarms in City Scenes
This paper is about alerting acoustic event detection and sound source
localisation in an urban scenario. Specifically, we are interested in spotting
the presence of horns and sirens of emergency vehicles. In order to obtain a
reliable system able to operate robustly despite the presence of traffic noise,
which can be copious, unstructured and unpredictable, we propose to treat the
spectrograms of incoming stereo signals as images, and apply semantic
segmentation, based on a U-Net architecture, to extract the target sound from
the background noise. In a multi-task learning scheme, together with signal
denoising, we perform acoustic event classification to identify the nature of
the alerting sound. Lastly, we use the denoised signals to localise the
acoustic source on the horizon plane, by regressing the direction of arrival of
the sound through a CNN architecture. Our experimental evaluation shows an
average classification rate of 94%, and a median absolute error on the
localisation of 7.5° when operating on audio frames of 0.5s, and of
2.5° when operating on frames of 2.5s. The system offers excellent
performance in particularly challenging scenarios, where the noise level is
remarkably high. Comment: 6 pages, 9 figures
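The paper regresses the direction of arrival with a CNN on denoised signals; purely for intuition, a classical cross-correlation (TDOA) baseline for horizon-plane localisation from one stereo frame can be sketched as follows (the mic spacing, sample rate, and test signal are hypothetical, not taken from the paper):

```python
import numpy as np

def doa_from_stereo(left, right, fs, mic_distance, c=343.0):
    """Estimate direction of arrival on the horizon plane from a stereo
    frame via cross-correlation (a classical TDOA baseline, not the
    paper's CNN regressor)."""
    n = len(left)
    # Cross-correlate the two channels to find the inter-channel lag.
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (n - 1)           # lag in samples
    tdoa = lag / fs                            # lag in seconds
    # Clip to the physically possible range, then convert to an angle.
    max_tdoa = mic_distance / c
    tdoa = np.clip(tdoa, -max_tdoa, max_tdoa)
    return np.degrees(np.arcsin(tdoa / max_tdoa))

# A source directly ahead reaches both mics simultaneously: DoA of 0 degrees.
fs = 16000
t = np.arange(1024) / fs
sig = np.sin(2 * np.pi * 440 * t)
print(doa_from_stereo(sig, sig, fs, mic_distance=0.2))  # 0.0
```

The CNN approach in the paper learns this mapping from data instead, which is what makes it robust to the unstructured traffic noise described above.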
Self Supervised Adversarial Domain Adaptation for Cross-Corpus and Cross-Language Speech Emotion Recognition
Despite the recent advancement in speech emotion recognition (SER) within a
single corpus setting, the performance of these SER systems degrades
significantly for cross-corpus and cross-language scenarios. The key reason is
the lack of generalisation in SER systems towards unseen conditions, which
causes them to perform poorly in cross-corpus and cross-language settings.
Recent studies focus on utilising adversarial methods to learn domain
generalised representation for improving cross-corpus and cross-language SER to
address this issue. However, many of these methods only focus on cross-corpus
SER without addressing the cross-language SER performance degradation due to a
larger domain gap between source and target language data. This contribution
proposes an adversarial dual discriminator (ADDi) network that uses the
three-players adversarial game to learn generalised representations without
requiring any target data labels. We also introduce a self-supervised ADDi
(sADDi) network that utilises self-supervised pre-training with unlabelled
data. We propose synthetic data generation as a pretext task in sADDi, enabling
the network to produce emotionally discriminative and domain invariant
representations and providing complementary synthetic data to augment the
system. The proposed model is rigorously evaluated using five publicly
available datasets in three languages and compared with multiple studies on
cross-corpus and cross-language SER. Experimental results demonstrate that the
proposed model achieves improved performance compared to the state-of-the-art
methods. Comment: Accepted in IEEE Transactions on Affective Computing
Audio Event Detection using Weakly Labeled Data
Acoustic event detection is essential for content analysis and description of
multimedia recordings. The majority of current literature on the topic learns
the detectors through fully-supervised techniques employing strongly labeled
data. However, the labels available for majority of multimedia data are
generally weak and do not provide sufficient detail for such methods to be
employed. In this paper we propose a framework for learning acoustic event
detectors using only weakly labeled data. We first show that audio event
detection using weak labels can be formulated as a Multiple Instance Learning
problem. We then suggest two frameworks for solving multiple-instance learning,
one based on support vector machines, and the other on neural networks. The
proposed methods can help remove the time-consuming and expensive process
of manually annotating data to facilitate fully supervised learning. Moreover,
they not only detect events in a recording but also provide the temporal
locations of events within it. This helps in obtaining a complete
description of the recording and is notable since temporal information is
absent from weakly labeled data in the first place. Comment: ACM Multimedia 201
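The Multiple Instance Learning view above can be sketched minimally: the bag (recording) score is an aggregate, typically the max, over per-instance (frame) scores, and the same instance scores give rough temporal locations. The per-frame posteriors below are hypothetical, not from the paper:

```python
import numpy as np

def bag_prediction(instance_scores):
    """Under MIL, a recording (bag) is positive for an event if at least one
    frame (instance) is, so the bag score is the max over instance scores."""
    return float(np.max(instance_scores))

def temporal_locations(instance_scores, threshold=0.5):
    """The same instance scores localise the event in time: return the
    indices of frames whose score crosses the threshold."""
    return np.flatnonzero(np.array(instance_scores) >= threshold).tolist()

# Hypothetical per-frame event posteriors for one weakly labeled recording.
scores = [0.1, 0.2, 0.9, 0.8, 0.1]
print(bag_prediction(scores))      # 0.9 -> the recording is labelled positive
print(temporal_locations(scores))  # [2, 3] -> frames where the event occurs
```

This is why the weak bag-level label suffices for training while still yielding the temporal detail the abstract highlights.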
Recent advances in LVCSR : A benchmark comparison of performances
Large Vocabulary Continuous Speech Recognition (LVCSR), which is characterized by high variability of the speech, is the most challenging task in automatic speech recognition (ASR). Believing that the evaluation of ASR systems on relevant and common speech corpora is one of the key factors that helps accelerate research, we present, in this paper, a benchmark comparison of the performances of current state-of-the-art LVCSR systems over different speech recognition tasks. Furthermore, we objectively highlight the best performing technologies and the best accuracy achieved so far in each task. The benchmarks have shown that Deep Neural Networks and Convolutional Neural Networks have proven their efficiency on several LVCSR tasks by outperforming the traditional Hidden Markov Models and Gaussian Mixture Models. They have also shown that despite the satisfying performances in some LVCSR tasks, the problem of large-vocabulary speech recognition is far from being solved in some others, where more research efforts are still needed.
Deep Spoken Keyword Spotting: An Overview
Spoken keyword spotting (KWS) deals with the identification of keywords in
audio streams and has become a fast-growing technology thanks to the paradigm
shift introduced by deep learning a few years ago. This has allowed the rapid
embedding of deep KWS in a myriad of small electronic devices with different
purposes like the activation of voice assistants. Prospects suggest a sustained
growth in terms of social use of this technology. Thus, it is not surprising
that deep KWS has become a hot research topic among speech scientists, who
constantly look for KWS performance improvement and computational complexity
reduction. This context motivates this paper, in which we conduct a literature
review into deep spoken KWS to assist practitioners and researchers who are
interested in this technology. Specifically, this overview has a comprehensive
nature by covering a thorough analysis of deep KWS systems (which includes
speech features, acoustic modeling and posterior handling), robustness methods,
applications, datasets, evaluation metrics, performance of deep KWS systems and
audio-visual KWS. The analysis performed in this paper allows us to identify a
number of directions for future research, including directions adopted from
automatic speech recognition research and directions that are unique to the
problem of spoken KWS.
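As an illustration of the posterior-handling stage this overview covers, here is a minimal sketch of moving-average smoothing of per-frame keyword posteriors followed by threshold-based detection; the window size, threshold, and posterior traces are hypothetical choices, not from any specific system in the survey:

```python
import numpy as np

def smooth_posteriors(posteriors, w=3):
    """Moving-average smoothing of per-frame keyword posteriors, a common
    posterior-handling step in deep KWS pipelines."""
    kernel = np.ones(w) / w
    return np.convolve(posteriors, kernel, mode="same")

def keyword_detected(posteriors, threshold=0.6, w=3):
    """Fire a detection when any smoothed posterior exceeds the threshold."""
    return bool(np.max(smooth_posteriors(posteriors, w)) >= threshold)

# An isolated spike is damped by smoothing; a sustained run survives it.
spike     = [0.0, 0.0, 0.9, 0.0, 0.0]
sustained = [0.1, 0.8, 0.9, 0.8, 0.1]
print(keyword_detected(spike))      # False
print(keyword_detected(sustained))  # True
```

Smoothing before thresholding is one of the simple tricks that trades a little latency for far fewer spurious activations on small devices.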
Learning from Very Few Samples: A Survey
Few sample learning (FSL) is significant and challenging in the field of
machine learning. The capability of learning and generalizing from very few
samples successfully is a noticeable demarcation separating artificial
intelligence and human intelligence since humans can readily establish their
cognition to novelty from just a single or a handful of examples whereas
machine learning algorithms typically entail hundreds or thousands of
supervised samples to guarantee generalization ability. Despite a long
history dating back to the early 2000s and the widespread attention received in
recent years with booming deep learning technologies, few surveys or reviews of
FSL have been available until now. In this context, we extensively review 300+ papers
of FSL spanning from the 2000s to 2019 and provide a timely and comprehensive
survey for FSL. In this survey, we review the evolution history as well as the
current progress on FSL, categorize FSL approaches into generative-model-based
and discriminative-model-based kinds, and place particular emphasis on
meta-learning-based FSL approaches. We also summarize
several recently emerging extensional topics of FSL and review the latest
advances on these topics. Furthermore, we highlight the important FSL
applications covering many research hotspots in computer vision, natural
language processing, audio and speech, reinforcement learning and robotics, data
analysis, etc. Finally, we conclude the survey with a discussion on promising
trends in the hope of providing guidance and insights to follow-up research. Comment: 30 pages
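As a concrete instance of the meta-learning-based FSL approaches the survey emphasizes, a prototypical-network-style classifier reduces to nearest class mean in embedding space: average the few support samples of each class into a prototype, then assign each query to the closest prototype. The 2-D embeddings below are hypothetical toy data:

```python
import numpy as np

def prototypes(support_x, support_y):
    """Class prototypes: the mean embedding of each class's few support
    samples (the core of prototypical-network-style meta learning)."""
    classes = np.unique(support_y)
    protos = np.stack([support_x[support_y == c].mean(axis=0) for c in classes])
    return classes, protos

def classify(query, support_x, support_y):
    """Assign each query to the class of its nearest prototype."""
    classes, protos = prototypes(support_x, support_y)
    d = np.linalg.norm(query[:, None, :] - protos[None, :, :], axis=-1)
    return classes[np.argmin(d, axis=1)]

# Toy 2-way 2-shot episode with hypothetical 2-D embeddings.
sx = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
sy = np.array([0, 0, 1, 1])
q  = np.array([[0.05, 0.1], [1.0, 0.9]])
print(classify(q, sx, sy))  # [0 1]
```

In actual meta-learning the embedding function is itself trained over many such episodes so that this simple nearest-prototype rule generalizes to novel classes.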
Cross-Domain Car Detection Model with Integrated Convolutional Block Attention Mechanism
Car detection, particularly through camera vision, has become a major focus
in the field of computer vision and has gained widespread adoption. While
current car detection systems are capable of good detection, reliable detection
can still be challenging due to factors such as proximity between cars,
light intensity, and environmental visibility. To address these issues, we
propose a cross-domain car detection model that we apply to car recognition for
autonomous driving and other areas. Our model includes several novelties:
1) building a complete cross-domain target detection framework; 2) developing an
unpaired target-domain image generation module with an integrated
convolutional attention mechanism; 3) adopting Generalized Intersection over
Union (GIoU) as the loss function of the target detection framework;
4) designing an object detection model integrated with a two-headed Convolutional
Block Attention Module (CBAM); and 5) utilizing an effective data enhancement method.
To evaluate the model's effectiveness, we applied a resolution-reduction
process to the data in the SSLAD dataset and used it as the benchmark dataset
for our task. Experimental results show that the performance of the
cross-domain car target detection model improves by 40% over the model without
our framework, and our improvements have a significant impact on cross-domain
car recognition
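The GIoU loss adopted in novelty 3) above extends IoU with a penalty based on the smallest enclosing box, so it stays informative even for non-overlapping boxes. The sketch below is the standard GIoU definition for two axis-aligned boxes, not the paper's exact implementation:

```python
def giou(box_a, box_b):
    """Generalized IoU between two axis-aligned boxes (x1, y1, x2, y2):
    GIoU = IoU - |C minus (A union B)| / |C|, with C the smallest enclosing box."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection area (zero if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    # Union area.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    area_c = cw * ch
    return iou - (area_c - union) / area_c

print(giou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 for identical boxes
print(giou((0, 0, 1, 1), (2, 2, 3, 3)))  # negative for disjoint boxes
```

Unlike plain IoU, which is flat at zero for all disjoint box pairs, GIoU yields a gradient that pulls a predicted box toward a distant target, which is why it is a popular detection loss.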