
    Graph Attention for Automated Audio Captioning

    State-of-the-art audio captioning methods typically use an encoder-decoder structure with pretrained audio neural networks (PANNs) as encoders for feature extraction. However, the convolution operation used in PANNs is limited in capturing long-time dependencies within an audio signal, which can degrade captioning performance. This letter presents a novel method using graph attention (GraphAC) for encoder-decoder based audio captioning. In the encoder, a graph attention module is introduced after the PANNs to learn contextual association (i.e., the dependency among audio features over different time frames) through an adjacency graph, and a top-k mask is used to mitigate interference from noisy nodes. The learnt contextual association leads to a more effective feature representation through feature node aggregation. As a result, the decoder can predict important semantic information about the acoustic scene and events based on the contextual associations learned from the audio signal. Experimental results show that GraphAC outperforms the state-of-the-art methods with PANNs as encoders, thanks to the incorporation of the graph attention module for capturing long-time dependencies within the audio signal. The source code is available at https://github.com/LittleFlyingSheep/GraphAC.
    Comment: Accepted by IEEE Signal Processing Letters
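
    As a rough illustration of the mechanism described above, the sketch below implements graph attention over PANNs time-frame features with a top-k adjacency mask. It is a minimal sketch under assumed shapes and an assumed value of k, not the authors' released code (see the linked repository for that).

```python
# Minimal sketch of top-k graph attention over encoder time-frame features.
# Shapes and k are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGraphAttention(nn.Module):
    def __init__(self, dim: int, k: int = 10):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, dim) -- one feature node per time frame from the PANNs encoder
        scores = self.query(x) @ self.key(x).transpose(1, 2) / x.size(-1) ** 0.5  # (B, T, T)
        # Keep only the k strongest edges per node to suppress noisy nodes
        kth = scores.topk(self.k, dim=-1).values[..., -1:]       # k-th largest score per row
        masked = scores.masked_fill(scores < kth, float("-inf"))
        adj = F.softmax(masked, dim=-1)                          # adjacency graph (attention weights)
        return adj @ x + x                                       # aggregate neighbours, residual connection

feats = torch.randn(2, 31, 2048)    # e.g. 31 frames of 2048-dim PANNs features
print(TopKGraphAttention(2048, k=10)(feats).shape)  # torch.Size([2, 31, 2048])
```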

    Anomalous Sound Detection Using Self-Attention-Based Frequency Pattern Analysis of Machine Sounds

    Different machines can exhibit diverse frequency patterns in their emitted sound. This feature has recently been exploited for anomalous sound detection, reaching state-of-the-art performance. However, existing methods rely on manual or empirical determination of the frequency filter by observing the effective frequency range in the training data, which may be impractical for general applications. This paper proposes an anomalous sound detection method using self-attention-based frequency pattern analysis and spectral-temporal information fusion. Our experiments demonstrate that the self-attention module automatically and adaptively analyses the effective frequencies of a machine sound and enhances that information in the spectral feature representation. With spectral-temporal information fusion, the resulting audio feature improves anomaly detection performance on the DCASE 2020 Challenge Task 2 dataset.
    Comment: Published in INTERSPEECH 202
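
    A hedged illustration of the core idea: self-attention applied across frequency bins, so the model can weight the effective frequency range itself rather than relying on a hand-picked filter. The single-layer design and all sizes below are assumptions for illustration, not the paper's configuration.

```python
# Sketch: treat each frequency bin of a spectrogram as a token and let
# self-attention enhance the informative bins in the spectral representation.
import torch
import torch.nn as nn

class FrequencyAttention(nn.Module):
    def __init__(self, n_frames: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(n_frames, n_heads, batch_first=True)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, freq_bins, frames); tokens run along the frequency axis
        weighted, _ = self.attn(spec, spec, spec)
        return spec + weighted  # enhanced spectral feature representation

spec = torch.randn(8, 128, 312)            # 128 mel bins x 312 frames (illustrative)
print(FrequencyAttention(312)(spec).shape) # torch.Size([8, 128, 312])
```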

    Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining

    Existing contrastive learning methods for anomalous sound detection refine the audio representation of each audio sample by contrasting it with its augmentations (e.g., with time or frequency masking). However, they may be biased by the augmented data, which lacks the physical properties of machine sound, thereby limiting detection performance. This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample. The proposed two-stage method uses contrastive learning incorporating machine ID to pretrain the audio representation model, and a self-supervised ID classifier to fine-tune the learnt model while enhancing the relation between audio features from the same ID. Experiments show that our method outperforms the state-of-the-art contrastive learning and self-supervised classification methods in overall anomaly detection performance and stability on the DCASE 2020 Challenge Task 2 dataset.
    Comment: To appear in the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023)
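
    The pretraining stage might look roughly like the sketch below: a supervised-contrastive-style loss in which clips sharing a machine ID form positive pairs. This is an assumed formulation for illustration, not the authors' implementation.

```python
# Sketch of machine-ID contrastive pretraining: embeddings of clips with the
# same machine ID are pulled together, all others pushed apart.
import torch
import torch.nn.functional as F

def machine_id_contrastive_loss(emb: torch.Tensor, ids: torch.Tensor, tau: float = 0.1):
    """emb: (N, D) audio embeddings; ids: (N,) machine IDs. tau is assumed."""
    z = F.normalize(emb, dim=1)
    sim = z @ z.t() / tau                                # pairwise similarities
    self_mask = torch.eye(len(ids), dtype=torch.bool)
    pos = (ids[:, None] == ids[None, :]) & ~self_mask    # same machine ID = positive pair
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    return -log_prob[pos].mean()

emb = torch.randn(16, 128)
ids = torch.randint(0, 4, (16,))
print(machine_id_contrastive_loss(emb, ids))
```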

    Hierarchical Metadata Information Constrained Self-Supervised Learning for Anomalous Sound Detection Under Domain Shift

    Self-supervised learning methods have achieved promising performance for anomalous sound detection (ASD) under domain shift, where the type of domain shift is considered in feature learning by incorporating section IDs. However, the attributes accompanying the audio files in each section, such as machine operating conditions and noise types, have not been considered, although they are also crucial for characterising domain shifts. In this paper, we present a hierarchical metadata information constrained self-supervised (HMIC) ASD method, where the hierarchical relation between section IDs and attributes is constructed and used as a constraint to obtain finer feature representations. In addition, we propose an attribute-group-center (AGC) based method for calculating the anomaly score under domain shift. Experiments demonstrate improved performance over the state-of-the-art self-supervised methods on DCASE 2022 Challenge Task 2.
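
    The AGC scoring step could plausibly be sketched as follows: the anomaly score is the distance from a test embedding to the nearest centre of the normal embeddings grouped by attribute. The choice of cosine distance and mean centres here is an assumption, not the paper's exact definition.

```python
# Sketch of attribute-group-center (AGC) scoring: distance to the nearest
# per-attribute centre of normal embeddings serves as the anomaly score.
import numpy as np

def agc_anomaly_score(test_emb, normal_embs, attrs):
    """test_emb: (D,); normal_embs: (N, D); attrs: (N,) attribute labels."""
    scores = []
    for a in np.unique(attrs):
        centre = normal_embs[attrs == a].mean(axis=0)
        cos = test_emb @ centre / (np.linalg.norm(test_emb) * np.linalg.norm(centre))
        scores.append(1.0 - cos)        # cosine distance to this attribute-group centre
    return min(scores)                  # nearest centre decides the score

rng = np.random.default_rng(0)
normal = rng.normal(size=(100, 64))
attrs = rng.integers(0, 3, size=100)
print(agc_anomaly_score(rng.normal(size=64), normal, attrs))
```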

    Transformer-based Autoencoder with ID Constraint for Unsupervised Anomalous Sound Detection

    Unsupervised anomalous sound detection (ASD) aims to detect unknown anomalous sounds of devices when only normal sound data is available. Autoencoder (AE) and self-supervised learning based methods are the two mainstream approaches. However, AE-based methods can be limited, as features learned from normal sounds may also fit anomalous sounds, reducing the model's ability to detect anomalies. Self-supervised methods are not always stable and perform differently even for machines of the same type. In addition, anomalous sound may be short-lived, making it even harder to distinguish from normal sound. This paper proposes an ID-constrained Transformer-based autoencoder (IDC-TransAE) architecture with weighted anomaly score computation for unsupervised ASD. Machine ID is employed to constrain the latent space of the Transformer-based autoencoder (TransAE) by introducing a simple ID classifier to learn the differences in distribution for the same machine type, enhancing the model's ability to distinguish anomalous sound. Moreover, weighted anomaly score computation is introduced to highlight the anomaly scores of anomalous events that appear only briefly. Experiments performed on the DCASE 2020 Challenge Task 2 development dataset demonstrate the effectiveness and superiority of our proposed method.
    Comment: Accepted by EURASIP Journal on Audio, Speech, and Music Processing
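
    A minimal sketch of the two ingredients described, with assumed layer sizes: an ID classification head on the Transformer bottleneck, and a weighted anomaly score whose softmax weighting emphasises frames with large reconstruction error, so short-lived anomalies are not averaged away.

```python
# Sketch of an ID-constrained Transformer autoencoder and a weighted anomaly
# score. Sizes and the temperature are illustrative assumptions.
import torch
import torch.nn as nn

class IDCTransAE(nn.Module):
    def __init__(self, n_mels=128, d_model=128, n_ids=4):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(d_model, n_mels)
        self.id_head = nn.Linear(d_model, n_ids)    # ID classifier constrains the latent space

    def forward(self, x):                           # x: (B, T, n_mels) log-mel frames
        z = self.encoder(self.proj(x))
        return self.decoder(z), self.id_head(z.mean(dim=1))

def weighted_anomaly_score(x, x_hat, temperature=1.0):
    err = ((x - x_hat) ** 2).mean(dim=-1)           # per-frame reconstruction error: (B, T)
    w = torch.softmax(err / temperature, dim=-1)    # up-weight short, strong anomalous frames
    return (w * err).sum(dim=-1)

model = IDCTransAE()
x = torch.randn(2, 64, 128)
x_hat, id_logits = model(x)
print(weighted_anomaly_score(x, x_hat))
```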

    First-Shot Unsupervised Anomalous Sound Detection With Unknown Anomalies Estimated by Metadata-Assisted Audio Generation

    First-shot (FS) unsupervised anomalous sound detection (ASD) is a brand-new task introduced in DCASE 2023 Challenge Task 2, where the anomalous sounds for the target machine types are unseen in training. Existing methods often rely on the availability of both normal and abnormal sound data from the target machines. However, due to the lack of anomalous sound data for the target machine types, adapting existing ASD methods to the first-shot task is challenging. In this paper, we propose a new framework for first-shot unsupervised ASD, where metadata-assisted audio generation is used to estimate unknown anomalies: the available machine information (i.e., metadata and sound data) is used to fine-tune a text-to-audio generation model to generate anomalous sounds with acoustic characteristics unique to each machine type. We then use the method of Time-Weighted Frequency domain audio Representation with Gaussian Mixture Model (TWFR-GMM) as the backbone to achieve first-shot unsupervised ASD. Our proposed FS-TWFR-GMM method achieves competitive performance amongst the top systems in DCASE 2023 Challenge Task 2, while requiring only 1% of the model parameters for detection, as validated in our experiments.
    Comment: Submitted to ICASSP 202
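
    The TWFR-GMM backbone might be sketched as below: a time-weighted pooling over frames (implemented here as a generalised mean, an assumption for illustration) yields one vector per clip, which is scored by a Gaussian mixture fitted on normal clips only.

```python
# Sketch of a time-weighted frequency representation scored by a GMM.
# The pooling exponent p and the number of mixture components are assumed.
import numpy as np
from sklearn.mixture import GaussianMixture

def twfr(spec: np.ndarray, p: float = 4.0) -> np.ndarray:
    """spec: (freq, time) magnitude spectrogram -> one vector per clip.
    Larger p weights high-energy time frames more heavily."""
    return np.mean(spec ** p, axis=1) ** (1.0 / p)

rng = np.random.default_rng(0)
normal = np.stack([twfr(np.abs(rng.normal(size=(128, 300)))) for _ in range(200)])
gmm = GaussianMixture(n_components=2).fit(normal)   # trained on normal sounds only
test = twfr(np.abs(rng.normal(size=(128, 300))))
print(-gmm.score_samples(test[None])[0])            # anomaly score = negative log-likelihood
```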

    Robust reproduction of sound zones with local sound orientation

    Pressure matching (PM) and planarity control (PC) methods can be used to reproduce local sound with a certain orientation at the listening zone, while suppressing the sound energy at the quiet zone. In this letter, regularized PM and PC incorporating coarse error estimation are introduced to increase robustness in non-ideal reproduction scenarios. Facilitated by this, the interaction between regularization, robustness, (tuned) personal audio optimization and local directional performance is explored. Simulations show that under certain conditions PC and weighted PM achieve comparable performance, while PC is more robust to a poorly selected regularization parameter.
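
    A minimal numerical sketch of regularized pressure matching (not the letter's exact formulation): loudspeaker weights q are chosen to match a target directional pressure in the listening zone and silence in the quiet zone, with Tikhonov regularization providing the robustness discussed above. Array dimensions and the regularization value are illustrative.

```python
# Sketch: solve min_q ||G q - d||^2 + lam ||q||^2 for loudspeaker weights q,
# where G stacks listening- and quiet-zone transfer functions.
import numpy as np

rng = np.random.default_rng(1)
L, Mb, Md = 8, 12, 12                              # loudspeakers, bright/dark control points
G_b = rng.normal(size=(Mb, L)) + 1j * rng.normal(size=(Mb, L))
G_d = rng.normal(size=(Md, L)) + 1j * rng.normal(size=(Md, L))
p_t = np.exp(1j * rng.uniform(0, 2 * np.pi, Mb))   # target pressures encoding local orientation

G = np.vstack([G_b, G_d])
d = np.concatenate([p_t, np.zeros(Md)])            # zero pressure requested in the quiet zone
lam = 1e-2                                         # larger lam = more robust, less accurate match
q = np.linalg.solve(G.conj().T @ G + lam * np.eye(L), G.conj().T @ d)
print("quiet-zone energy:", np.linalg.norm(G_d @ q) ** 2)
```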

    Anomalous Sound Detection using Spectral-Temporal Information Fusion

    Unsupervised anomalous sound detection aims to detect unknown abnormal sounds of machines from normal sounds. However, the state-of-the-art approaches are not always stable and perform dramatically differently even for machines of the same type, making them impractical for general applications. This paper proposes a spectral-temporal fusion based self-supervised method to model the features of normal sound, which improves the stability and consistency of detection performance across individual machines, even of the same type. Experiments on the DCASE 2020 Challenge Task 2 dataset show that the proposed method achieved 81.39%, 83.48%, 98.22% and 98.83% minimum AUC (worst-case detection performance amongst individuals) on four types of real machines (fan, pump, slider and valve), respectively, giving 31.79%, 17.78%, 10.42% and 21.13% improvements over the state-of-the-art method Glow_Aff. Moreover, the proposed method improves the AUC (average performance of individuals) for all machine types in the dataset. The source code is available at https://github.com/liuyoude/STgram_MFN
    Comment: To appear at ICASSP 202
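
    A hedged sketch of spectral-temporal fusion: a temporal branch on the raw waveform is concatenated with a log-mel spectral branch and fed to a self-supervised machine-ID classifier. The single-convolution temporal branch and all sizes are simplifying assumptions, not the released architecture in the linked repository.

```python
# Sketch: fuse frame-level features from the waveform (temporal) and the
# log-mel spectrogram (spectral) for a self-supervised ID classifier.
import torch
import torch.nn as nn

class SpectralTemporalFusion(nn.Module):
    def __init__(self, n_mels=128, n_classes=41):
        super().__init__()
        self.temporal = nn.Conv1d(1, n_mels, kernel_size=1024, stride=512)  # waveform -> frames
        self.classify = nn.Linear(2 * n_mels, n_classes)                    # machine-ID head

    def forward(self, wav, logmel):
        # wav: (B, samples); logmel: (B, n_mels, frames)
        t = self.temporal(wav.unsqueeze(1))                                 # (B, n_mels, frames')
        frames = min(t.size(-1), logmel.size(-1))
        fused = torch.cat([logmel[..., :frames], t[..., :frames]], dim=1)   # spectral + temporal
        return self.classify(fused.mean(dim=-1))

model = SpectralTemporalFusion()
print(model(torch.randn(2, 160000), torch.randn(2, 128, 313)).shape)  # torch.Size([2, 41])
```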

    An experimental study on transfer function estimation using acoustic modelling and singular value decomposition

    Transfer functions relating sound source strengths and the sound pressure at field points are important for sound field control. Recently, two modal domain methods for transfer function estimation have been compared using numerical simulations. One is the spatial harmonic decomposition (SHD) method, which models a sound field with a series of cylindrical waves; the other is the singular value decomposition (SVD) method, which uses prior sound source location information to build an acoustic model and obtain basis functions for sound field modelling. In this paper, the feasibility of the SVD method using limited measurements to estimate transfer functions over densely-spaced field samples within a target region is demonstrated experimentally. Experimental results with various microphone placements and system configurations are reported to demonstrate the geometric flexibility of the SVD method compared to the SHD method. It is shown that the SVD method can estimate broadband transfer functions up to 3099 Hz for a target region with a radius of 0.083 m using three microphones, and allows flexibility in system geometry. Furthermore, an application example of acoustic contrast control is presented, showing that the proposed method is a promising approach to facilitating broadband sound zone control with limited microphones.
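
    Numerically, the SVD method might be sketched as follows, under a free-field acoustic model and assumed geometry: a model built from the prior source locations is decomposed by SVD, a few microphone measurements fit the leading basis functions, and transfer functions over the dense target grid are then reconstructed from those coefficients.

```python
# Sketch of SVD-based transfer function estimation from three microphones.
# The free-field model, truncation rank and geometry are assumptions.
import numpy as np

def greens(src, pts, k):
    # Free-field Green's functions from each source to each field point
    r = np.linalg.norm(pts[:, None, :] - src[None, :, :], axis=-1)
    return np.exp(-1j * k * r) / (4 * np.pi * r)

rng = np.random.default_rng(0)
k = 2 * np.pi * 1000 / 343                        # wavenumber at 1 kHz
src = rng.uniform(-2.0, 2.0, size=(16, 3))        # prior source locations (assumed known)
dense = rng.uniform(-0.08, 0.08, size=(200, 3))   # densely-spaced samples in the target region
G_model = greens(src, dense, k)                   # acoustic model over the dense grid

U, s, Vh = np.linalg.svd(G_model, full_matrices=False)
B = U[:, :3]                                      # leading spatial basis functions
mic_idx = [0, 80, 160]                            # only three measurement microphones
G_mic = greens(src, dense[mic_idx], k)            # "measured" transfer functions at the mics

coef, *_ = np.linalg.lstsq(B[mic_idx], G_mic, rcond=None)
G_est = B @ coef                                  # transfer functions estimated over the grid
print(np.abs(G_est - G_model).max())              # small if three basis functions suffice
```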