Listening for Sirens: Locating and Classifying Acoustic Alarms in City Scenes
This paper is about alerting acoustic event detection and sound source
localisation in an urban scenario. Specifically, we are interested in spotting
the presence of horns, and sirens of emergency vehicles. In order to obtain a
reliable system able to operate robustly despite the presence of traffic noise,
which can be copious, unstructured and unpredictable, we propose to treat the
spectrograms of incoming stereo signals as images, and apply semantic
segmentation, based on a Unet architecture, to extract the target sound from
the background noise. In a multi-task learning scheme, together with signal
denoising, we perform acoustic event classification to identify the nature of
the alerting sound. Lastly, we use the denoised signals to localise the
acoustic source on the horizon plane, by regressing the direction of arrival of
the sound through a CNN architecture. Our experimental evaluation shows an
average classification rate of 94%, and a median absolute error on the
localisation of 7.5° when operating on audio frames of 0.5 s, and of
2.5° when operating on frames of 2.5 s. The system offers excellent
performance in particularly challenging scenarios, where the noise level is
remarkably high.
Comment: 6 pages, 9 figures
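The localisation figures above are median absolute errors on the direction of arrival. A minimal sketch of how such a metric can be computed follows. It assumes angles in degrees and assumes that errors wrap around at 360°; the abstract does not state the exact angular range or error definition used, and the function names are hypothetical.

```python
import statistics


def angular_error_deg(predicted, actual):
    """Smallest absolute difference between two angles, in degrees (0 to 180).

    Assumption: angles wrap at 360 degrees, so 350 and 10 are 20 apart.
    """
    diff = (predicted - actual) % 360.0
    return min(diff, 360.0 - diff)


def median_absolute_error_deg(predictions, targets):
    """Median of the per-frame absolute angular errors."""
    errors = [angular_error_deg(p, t) for p, t in zip(predictions, targets)]
    return statistics.median(errors)


if __name__ == "__main__":
    # Hypothetical predictions and ground-truth directions, in degrees.
    predictions = [10.0, 350.0, 90.0]
    targets = [0.0, 10.0, 95.0]
    print(median_absolute_error_deg(predictions, targets))
```

The median is less sensitive than the mean to occasional large errors, such as front-back confusions, which may be one reason to report it for localisation.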
Listen, Think, and Understand
The ability of artificial intelligence (AI) systems to perceive and
comprehend audio signals is crucial for many applications. Although significant
progress has been made in this area since the development of AudioSet, most
existing models are designed to map audio inputs to pre-defined, discrete sound
label sets. In contrast, humans possess the ability to not only classify sounds
into general categories, but also to listen to the finer details of the sounds,
explain the reason for the predictions, think about what the sound infers, and
understand the scene and what action needs to be taken, if any. Such
capabilities beyond perception are not yet present in existing audio models. On
the other hand, modern large language models (LLMs) exhibit emerging reasoning
ability but they lack audio perception capabilities. Therefore, we ask the
question: can we build a model that has both audio perception and a reasoning
ability?
In this paper, we propose a new audio foundation model, called LTU (Listen,
Think, and Understand). To train LTU, we created a new OpenAQA-5M dataset
consisting of 1.9 million closed-ended and 3.7 million open-ended, diverse
(audio, question, answer) tuples, and have used an autoregressive training
framework with a perception-to-understanding curriculum. LTU demonstrates
strong performance and generalization ability on conventional audio tasks such
as classification and captioning. More importantly, it exhibits emerging audio
reasoning and comprehension abilities that are absent in existing audio models.
To the best of our knowledge, LTU is one of the first multimodal large language
models that focus on general audio (rather than just speech) understanding.
Comment: Accepted at ICLR 2024. Code, dataset, and models are available at
https://github.com/YuanGongND/ltu. The interactive demo is at
https://huggingface.co/spaces/yuangongfdu/lt
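The (audio, question, answer) tuples and the perception-to-understanding curriculum can be pictured with a small sketch. Everything here is an assumption made for illustration: the record fields, the `curriculum_order` helper, and the idea that closed-ended questions are presented before open-ended ones. The abstract does not describe the actual data schema or curriculum stages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioQA:
    """One (audio, question, answer) tuple; field names are hypothetical."""
    audio_path: str
    question: str
    answer: str
    closed_ended: bool


def curriculum_order(samples):
    """Order samples so closed-ended questions come before open-ended ones.

    Assumption: this stands in for a perception-to-understanding curriculum.
    sorted() is stable, so the original order is kept within each group.
    """
    return sorted(samples, key=lambda s: not s.closed_ended)


if __name__ == "__main__":
    samples = [
        AudioQA("a.wav", "What might happen next?", "A vehicle may pass.", False),
        AudioQA("a.wav", "What sound event is present?", "Siren.", True),
    ]
    for sample in curriculum_order(samples):
        print(sample.closed_ended, sample.question)
```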
Automatic detection of alarm sounds in a noisy hospital environment using model and non-model based approaches
Article published without peer review on arXiv.
In the noisy acoustic environment of a Neonatal Intensive Care Unit (NICU), a variety of alarms are frequently triggered by the biomedical equipment. In this paper, different approaches for the automatic detection of those alarm sounds are presented and compared: 1) a non-model-based approach that employs signal processing techniques; 2) a model-based approach based on neural networks; and 3) an approach that combines both non-model-based and model-based approaches. The performance of the detection systems developed under each of those approaches is assessed, analysed and compared at both the frame level and the event level, using an audio database recorded in a real-world hospital environment.
Preprint
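The difference between frame-level and event-level assessment can be illustrated with a minimal sketch. It assumes binary per-frame alarm flags and counts a reference event as detected when any predicted event overlaps it; that matching rule and all function names are assumptions, since the abstract does not give the paper's scoring criteria.

```python
def frames_to_events(flags):
    """Group consecutive active frames into (start, end) runs, end exclusive."""
    events = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(flags)))
    return events


def frame_level_scores(reference, predicted):
    """Precision and recall computed frame by frame."""
    tp = sum(1 for r, p in zip(reference, predicted) if r and p)
    fp = sum(1 for r, p in zip(reference, predicted) if not r and p)
    fn = sum(1 for r, p in zip(reference, predicted) if r and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def event_level_recall(reference, predicted):
    """Fraction of reference events overlapped by at least one predicted event.

    Assumption: any overlap counts as a detection.
    """
    ref_events = frames_to_events(reference)
    pred_events = frames_to_events(predicted)
    if not ref_events:
        return 0.0
    hits = sum(
        1
        for (rs, re) in ref_events
        if any(ps < re and rs < pe for (ps, pe) in pred_events)
    )
    return hits / len(ref_events)


if __name__ == "__main__":
    # Hypothetical per-frame labels: 1 means an alarm is active.
    reference = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
    predicted = [0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
    print(frame_level_scores(reference, predicted))
    print(event_level_recall(reference, predicted))
```

The two views can disagree: a detector may cover most frames of one long alarm yet miss a short alarm entirely, which lowers event-level recall much more than frame-level recall.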
AI-based soundscape analysis: Jointly identifying sound sources and predicting annoyance
Soundscape studies typically attempt to capture the perception and
understanding of sonic environments by surveying users. However, for long-term
monitoring or assessing interventions, sound-signal-based approaches are
required. To this end, most previous research focused on psycho-acoustic
quantities or automatic sound recognition. Few attempts were made to include
appraisal (e.g., in circumplex frameworks). This paper proposes an artificial
intelligence (AI)-based dual-branch convolutional neural network with
cross-attention-based fusion (DCNN-CaF) to analyze automatic soundscape
characterization, including sound recognition and appraisal. Using the DeLTA
dataset containing human-annotated sound source labels and perceived annoyance,
the DCNN-CaF is proposed to perform sound source classification (SSC) and
human-perceived annoyance rating prediction (ARP). Experimental findings
indicate that (1) the proposed DCNN-CaF using loudness and Mel features
outperforms the DCNN-CaF using only one of them. (2) The proposed DCNN-CaF with
cross-attention fusion outperforms other typical AI-based models and
soundscape-related traditional machine learning methods on the SSC and ARP
tasks. (3) Correlation analysis reveals that the relationship between sound
sources and annoyance is similar for humans and the proposed AI-based DCNN-CaF
model. (4) Generalization tests show that the proposed model's ARP in the
presence of model-unknown sound sources is consistent with expert expectations
and can explain previous findings from the literature on soundscape
augmentation.
Comment: The Journal of the Acoustical Society of America, 154 (5), 314
- …
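The cross-attention-based fusion mentioned in the soundscape abstract can be pictured with a generic scaled dot-product sketch. This is not the DCNN-CaF architecture: it has no learned projections, and the assignment of queries to one branch (for example Mel features) and keys and values to the other (for example loudness features) is an assumption made for illustration. All names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: vectors from one branch; keys and values: vectors from the other.
    Each output row is a weighted average of the value vectors.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return outputs


if __name__ == "__main__":
    # Hypothetical features: one query frame attending over two key/value frames.
    mel_queries = [[1.0, 0.0]]
    loudness_keys = [[0.0, 0.0], [0.0, 0.0]]
    loudness_values = [[1.0, 2.0], [3.0, 4.0]]
    print(cross_attention(mel_queries, loudness_keys, loudness_values))
```

With identical keys the attention weights are uniform, so the output is simply the mean of the value vectors; distinct keys would shift the weights towards the frames most similar to each query.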