18 research outputs found
Graph Attention for Automated Audio Captioning
State-of-the-art audio captioning methods typically use the encoder-decoder
structure with pretrained audio neural networks (PANNs) as encoders for feature
extraction. However, the convolution operation used in PANNs is limited in
capturing the long-time dependencies within an audio signal, thereby leading to
potential performance degradation in audio captioning. This letter presents a
novel method using graph attention (GraphAC) for encoder-decoder based audio
captioning. In the encoder, a graph attention module is introduced after the
PANNs to learn contextual associations (i.e., the dependencies among the audio
features across different time frames) through an adjacency graph, and a top-k
mask is used to mitigate the interference from noisy nodes. The learnt
contextual associations, combined with feature node aggregation, lead to a more
effective feature representation. As a result, the decoder can predict important
semantic information about the acoustic scene and events based on the
contextual associations learned from the audio signal. Experimental results
show that GraphAC outperforms the state-of-the-art methods with PANNs as the
encoders, thanks to the incorporation of the graph attention module into the
encoder for capturing the long-time dependencies within the audio signal. The
source code is available at https://github.com/LittleFlyingSheep/GraphAC.
Comment: Accepted by IEEE Signal Processing Letters
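As a rough illustration of the idea, the core of the graph attention step — pairwise similarity scores between time-frame features, a top-k mask that drops noisy neighbours, and softmax-weighted node aggregation — can be sketched in a few lines. This is a single-head dot-product simplification in NumPy; the paper's learnable projections and exact formulation are omitted.

```python
import numpy as np

def topk_graph_attention(features, k=3):
    """Sketch of one top-k masked graph attention step over time-frame features.

    features: (T, D) array, one PANNs feature vector per time frame.
    Returns aggregated features of the same shape.
    """
    T, _ = features.shape
    # Adjacency scores: pairwise similarity between time frames.
    scores = features @ features.T                      # (T, T)
    # Top-k mask: keep only the k strongest neighbours per node.
    mask = np.full((T, T), -np.inf)
    topk = np.argsort(scores, axis=1)[:, -k:]
    rows = np.arange(T)[:, None]
    mask[rows, topk] = 0.0
    # Softmax over the masked scores gives attention weights per node.
    masked = scores + mask
    weights = np.exp(masked - masked.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Feature node aggregation: each frame becomes a weighted sum of neighbours.
    return weights @ features

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))     # 8 time frames, 4-dim features (toy sizes)
y = topk_graph_attention(x, k=3)
print(y.shape)  # (8, 4)
```

The top-k mask is what keeps each node from attending to weakly related (noisy) frames: masked entries become -inf before the softmax, so their weights are exactly zero.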
Anomalous Sound Detection Using Self-Attention-Based Frequency Pattern Analysis of Machine Sounds
Different machines can exhibit diverse frequency patterns in their emitted
sound. This feature has recently been explored in anomalous sound detection,
reaching state-of-the-art performance. However, existing methods rely on the
manual or empirical determination of the frequency filter by observing the
effective frequency range in the training data, which may be impractical for
general application. This paper proposes an anomalous sound detection method
using self-attention-based frequency pattern analysis and spectral-temporal
information fusion. Our experiments demonstrate that the self-attention module
automatically and adaptively analyses the effective frequencies of a machine
sound and enhances that information in the spectral feature representation.
With spectral-temporal information fusion, the obtained audio feature
eventually improves the anomaly detection performance on the DCASE 2020
Challenge Task 2 dataset.
Comment: Published in INTERSPEECH 202
Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining
Existing contrastive learning methods for anomalous sound detection refine
the audio representation of each audio sample by using the contrast between the
samples' augmentations (e.g., with time or frequency masking). However, they
might be biased by the augmented data, due to the lack of physical properties
of machine sound, thereby limiting the detection performance. This paper uses
contrastive learning to refine audio representations for each machine ID,
rather than for each audio sample. The proposed two-stage method first uses
contrastive learning incorporating machine IDs to pretrain the audio
representation model, and then uses a self-supervised ID classifier to
fine-tune the learnt model, enhancing the relation between audio features from
the same ID. Experiments show that our method outperforms the state-of-the-art
methods using contrastive learning or self-supervised classification in overall
anomaly detection performance and stability on the DCASE 2020 Challenge Task 2
dataset.
Comment: To appear in IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP 2023)
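A minimal sketch of the pretraining objective, assuming a standard supervised-contrastive form in which embeddings sharing a machine ID act as positives (the paper's exact loss, projection head and augmentation pipeline may differ):

```python
import numpy as np

def machine_id_contrastive_loss(embeddings, machine_ids, temperature=0.1):
    """Supervised-contrastive-style loss: embeddings that share a machine ID
    are pulled together, all others are pushed apart."""
    # L2-normalize, then compute scaled cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature                         # (N, N)
    n = len(machine_ids)
    eye = np.eye(n, dtype=bool)
    total = 0.0
    for i in range(n):
        # Positives: other samples with the same machine ID.
        pos = (machine_ids == machine_ids[i]) & ~eye[i]
        if not pos.any():
            continue
        # log-sum-exp over all non-self similarities (the denominator).
        log_denom = np.log(np.exp(sim[i][~eye[i]]).sum())
        total += -(sim[i][pos] - log_denom).mean()
    return total / n

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))               # toy embeddings
ids = np.array([0, 0, 1, 1, 2, 2])          # two clips per machine ID
loss = machine_id_contrastive_loss(emb, ids)
print(loss > 0.0)  # True
```

Grouping positives by machine ID rather than by per-sample augmentations is the key difference from sample-wise contrastive learning described in the abstract.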
Hierarchical Metadata Information Constrained Self-Supervised Learning for Anomalous Sound Detection Under Domain Shift
Self-supervised learning methods have achieved promising performance for
anomalous sound detection (ASD) under domain shift, where the type of domain
shift is considered in feature learning by incorporating section IDs. However,
the attributes accompanying audio files under each section, such as machine
operating conditions and noise types, have not been considered, although they
are also crucial for characterizing domain shifts. In this paper, we present a
hierarchical metadata information constrained self-supervised (HMIC) ASD
method, where the hierarchical relation between section IDs and attributes is
constructed, and used as constraints to obtain finer feature representation. In
addition, we propose an attribute-group-center (AGC)-based method for
calculating the anomaly score under the domain-shift condition. Experiments are
performed to demonstrate its improved performance over the state-of-the-art
self-supervised methods on the DCASE 2022 Challenge Task 2 dataset.
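The abstract does not spell out the AGC scoring rule; one plausible reading — score a test embedding by its cosine distance to the nearest attribute-group centre learnt from normal data — can be sketched as follows (the function and its inputs are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def agc_anomaly_score(embedding, group_centers):
    """Hypothetical attribute-group-center scoring: the anomaly score is the
    cosine distance to the nearest attribute-group centre.

    embedding:     (D,) test-clip embedding.
    group_centers: (G, D) centres, one per attribute group of normal data.
    """
    z = embedding / np.linalg.norm(embedding)
    c = group_centers / np.linalg.norm(group_centers, axis=1, keepdims=True)
    # Nearest centre = highest cosine similarity; score = 1 - that similarity.
    return float(1.0 - (c @ z).max())

centers = np.array([[1.0, 0.0], [0.0, 1.0]])   # two toy attribute groups
near = agc_anomaly_score(np.array([0.9, 0.1]), centers)    # close to a centre
far = agc_anomaly_score(np.array([-1.0, -1.0]), centers)   # far from both
print(near < far)  # True
```

Scoring against per-attribute centres, rather than one global centre, is what lets the score account for the domain shifts encoded by the metadata.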
Transformer-based Autoencoder with ID Constraint for Unsupervised Anomalous Sound Detection
Unsupervised anomalous sound detection (ASD) aims to detect unknown anomalous
sounds of devices when only normal sound data is available. The autoencoder
(AE) and self-supervised learning based methods are two mainstream methods.
However, the AE-based methods can be limited, as the features learned from
normal sounds may also fit anomalous sounds, reducing the model's ability to
detect anomalies from sound. The self-supervised methods are not
always stable and perform differently, even for machines of the same type. In
addition, the anomalous sound may be short-lived, making it even harder to
distinguish from normal sound. This paper proposes an ID constrained
Transformer-based autoencoder (IDC-TransAE) architecture with weighted anomaly
score computation for unsupervised ASD. Machine ID is employed to constrain the
latent space of the Transformer-based autoencoder (TransAE) by introducing a
simple ID classifier, which learns the differences in distribution across
machines of the same type and enhances the model's ability to distinguish
anomalous sound. Moreover, weighted anomaly score computation is introduced to highlight
the anomaly scores of anomalous events that only appear for a short time.
Experiments performed on the DCASE 2020 Challenge Task 2 development dataset
demonstrate the effectiveness and superiority of our proposed method.
Comment: Accepted by EURASIP Journal on Audio, Speech, and Music Processing
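The exact weighting is not given in the abstract; a hypothetical softmax-style weighting that lets a few high-error frames dominate the clip-level score illustrates why this helps for short-lived anomalies, which a plain average over frames would wash out:

```python
import numpy as np

def weighted_anomaly_score(frame_errors, gamma=5.0):
    """Illustrative weighted anomaly score: per-frame reconstruction errors
    are pooled with softmax-style weights, so a few high-error frames
    (short-lived anomalous events) dominate the clip-level score."""
    e = np.asarray(frame_errors, dtype=float)
    w = np.exp(gamma * e)
    w /= w.sum()
    return float((w * e).sum())

normal = [0.1] * 100                     # uniformly low reconstruction error
short_anomaly = [0.1] * 95 + [2.0] * 5   # anomaly present in only 5 frames
print(weighted_anomaly_score(normal) < weighted_anomaly_score(short_anomaly))  # True
```

With a plain mean the short anomaly above only nudges the score (0.10 vs about 0.20), whereas the weighted score of the anomalous clip is pulled close to the peak frame error.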
First-Shot Unsupervised Anomalous Sound Detection With Unknown Anomalies Estimated by Metadata-Assisted Audio Generation
First-shot (FS) unsupervised anomalous sound detection (ASD) is a brand-new
task introduced in DCASE 2023 Challenge Task 2, where the anomalous sounds for
the target machine types are unseen in training. Existing methods often rely on
the availability of normal and abnormal sound data from the target machines.
However, due to the lack of anomalous sound data for the target machine types,
it becomes challenging when adapting the existing ASD methods to the first-shot
task. In this paper, we propose a new framework for the first-shot unsupervised
ASD, where metadata-assisted audio generation is used to estimate unknown
anomalies, by utilising the available machine information (i.e., metadata and
sound data) to fine-tune a text-to-audio generation model for generating the
anomalous sounds that contain the unique acoustic characteristics of each
machine type. We then use the method of Time-Weighted Frequency
domain audio Representation with Gaussian Mixture Model (TWFR-GMM) as the
backbone to achieve first-shot unsupervised ASD. Our proposed FS-TWFR-GMM
method achieves competitive performance amongst the top systems in DCASE 2023
Challenge Task 2, while requiring only 1% of the model parameters for
detection, as validated in our experiments.
Comment: Submitted to ICASSP 202
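A hedged sketch of the time-weighted pooling idea behind TWFR (the published method uses a ranked, generalized-mean style weighting; the energy-based weights here are an illustrative simplification):

```python
import numpy as np

def twfr(spec, alpha=2.0):
    """Illustrative time-weighted frequency representation: pool an (F, T)
    power spectrogram over time with weights that emphasise high-energy
    frames, yielding a single (F,) spectral vector per clip."""
    energy = spec.sum(axis=0)            # per-frame energy, shape (T,)
    w = energy ** alpha                  # emphasise energetic frames
    w = w / w.sum()
    return spec @ w                      # weighted average spectrum, (F,)

rng = np.random.default_rng(3)
spec = rng.random((64, 100))             # toy power spectrogram: 64 bins, 100 frames
v = twfr(spec)
print(v.shape)  # (64,)
```

In the full method, such pooled vectors computed from normal machine sounds are modelled with a Gaussian mixture, and a test clip is scored by its negative log-likelihood under that model.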
Robust reproduction of sound zones with local sound orientation
Pressure matching (PM) and planarity control (PC) methods can be used to
reproduce local sound with a certain orientation at the listening zone, while
suppressing the sound energy at the quiet zone. In this letter, regularized PM
and PC, incorporating coarse error estimation, are introduced to increase the
robustness in non-ideal reproduction scenarios. Facilitated by this, the
interaction between regularization, robustness, (tuned) personal audio
optimization and local directional performance is explored. Simulations show
that under certain conditions, PC and weighted PM achieve comparable
performance, while PC is more robust to a poorly selected regularization
parameter.
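Regularized pressure matching in its standard Tikhonov form — minimize ||Gw - p_t||^2 + lam*||w||^2 for the loudspeaker weights w — can be written directly. This is a generic sketch; the letter's coarse error estimation and the PC variant are not modelled here.

```python
import numpy as np

def regularized_pressure_matching(G, p_target, lam=1e-2):
    """Solve the Tikhonov-regularized pressure matching problem
    min ||G w - p_t||^2 + lam ||w||^2 via the normal equations.

    G:        (M, L) complex transfer matrix, microphones x loudspeakers.
    p_target: (M,) desired pressures (nonzero in the listening zone,
              zero in the quiet zone).
    """
    GhG = G.conj().T @ G
    return np.linalg.solve(GhG + lam * np.eye(G.shape[1]), G.conj().T @ p_target)

rng = np.random.default_rng(1)
G = rng.normal(size=(16, 8)) + 1j * rng.normal(size=(16, 8))   # toy transfer matrix
p_t = rng.normal(size=16) + 1j * rng.normal(size=16)           # toy target field
w = regularized_pressure_matching(G, p_t)
print(w.shape)  # (8,)
```

The regularization parameter lam trades reproduction accuracy against array effort and robustness; the letter's observation is that PC degrades more gracefully than PM when lam is poorly chosen.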
Anomalous Sound Detection using Spectral-Temporal Information Fusion
Unsupervised anomalous sound detection aims to detect unknown abnormal sounds
of machines from normal sounds. However, the state-of-the-art approaches are
not always stable and can perform dramatically differently even for machines of
the same type, making them impractical for general applications. This paper proposes
a spectral-temporal fusion based self-supervised method to model the feature of
the normal sound, which improves the stability and performance consistency in
detection of anomalous sounds from individual machines, even of the same type.
Experiments on the DCASE 2020 Challenge Task 2 dataset show that the proposed
method achieved 81.39%, 83.48%, 98.22% and 98.83% in terms of the minimum
AUC (worst-case detection performance amongst individuals) on four types of
real machines (fan, pump, slider and valve), respectively, giving 31.79%,
17.78%, 10.42% and 21.13% improvement compared to the state-of-the-art
method, i.e., Glow_Aff. Moreover, the proposed method has improved AUC
(average performance of individuals) for all the types of machines in the
dataset. The source code is available at
https://github.com/liuyoude/STgram_MFN
Comment: To appear at ICASSP 202
An experimental study on transfer function estimation using acoustic modelling and singular value decomposition
Transfer functions relating sound source strengths and the sound pressure at field points are important for sound field control. Recently, two modal-domain methods for transfer function estimation have been compared using numerical simulations. One is the spatial harmonic decomposition (SHD) method, which models a sound field with a series of cylindrical waves; the other is the singular value decomposition (SVD) method, which uses prior sound source location information to build an acoustic model and obtain basis functions for sound field modelling. In this paper, the feasibility of the SVD method using limited measurements to estimate transfer functions over densely-spaced field samples within a target region is demonstrated experimentally. Experimental results with various microphone placements and system configurations are reported to demonstrate the geometric flexibility of the SVD method compared to the SHD method. It is shown that the SVD method can estimate broadband transfer functions up to 3099 Hz for a target region with a radius of 0.083 m using three microphones, and allows flexibility in system geometry. Furthermore, an application example of acoustic contrast control is presented, showing that the proposed method is a promising approach to facilitating broadband sound zone control with limited microphones.
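A toy sketch of the SVD estimation pipeline: take the modelled transfer matrix, keep its leading left singular vectors as spatial basis functions, fit the basis coefficients from a few microphone measurements, then read off the field at every densely-spaced sample point. A random matrix stands in for the acoustic model here; in the paper it is built from prior source-location information, not random data.

```python
import numpy as np

def svd_estimate(G_model, measured_rows, p_measured, rank):
    """Estimate the field at all model points from a few measurements, using
    the leading left singular vectors of the modelled transfer matrix as a
    spatial basis.

    G_model:       (N, S) modelled transfer matrix, field points x sources.
    measured_rows: indices of the few field points actually measured.
    p_measured:    measured pressures at those points.
    rank:          number of singular vectors retained as basis functions.
    """
    U, _, _ = np.linalg.svd(G_model, full_matrices=False)
    B = U[:, :rank]                            # basis over all N field points
    # Least-squares fit of basis coefficients to the sparse measurements.
    c, *_ = np.linalg.lstsq(B[measured_rows], p_measured, rcond=None)
    return B @ c                               # estimated field at all points

rng = np.random.default_rng(2)
n_field, n_src = 20, 3
G_model = rng.normal(size=(n_field, n_src))    # stand-in acoustic model
q = rng.normal(size=n_src)                     # true source strengths
p_true = G_model @ q                           # field at all 20 sample points

mic_rows = [0, 7, 15]                          # only three microphones
p_est = svd_estimate(G_model, mic_rows, p_true[mic_rows], rank=3)
print(np.allclose(p_est, p_true))  # True
```

The toy case recovers the field exactly because the true field lies in the span of the retained basis and the three measurements pin down the three coefficients; with real measurement noise and model mismatch, the fit becomes a regularized least-squares estimate instead.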
Transfer functions relating sound source strengths and the sound pressure at field points areimportant for sound field control. Recently, two modal domain methods for transfer functionestimation have been compared using numerical simulations. One is the spatial harmonicdecomposition (SHD) method, which models a sound field with a series of cylindrical waves;while the other is the singular value decomposition (SVD) method, which uses prior sound sourcelocation information to build an acoustic model and obtain basis functions for sound fieldmodelling. In this paper, the feasibility of the SVD method using limited measurements to estimatetransfer functions over densely-spaced field samples within a target region is demonstratedexperimentally. Experimental results with various microphone placements and systemconfigurations are reported to demonstrate the geometric flexibility of the SVD method comparedto the SHD method. It is shown that the SVD method can estimate broadband transfer functionsup to 3099 Hz for a target region with a radius of 0.083 m using three microphones, and allowflexibility in system geometry. Furthermore, an application example of acoustic contrast control ispresented, showing that the proposed method is a promising approach to facilitating broadbandsound zone control with limited microphones