14 research outputs found

    Use What You Have: Video Retrieval Using Representations From Collaborative Experts

    The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge. Human-generated queries for video datasets 'in the wild' vary widely in their degree of specificity, with some queries describing specific details such as the names of famous identities, content from speech, or text available on the screen. Our goal is to condense the multi-modal, extremely high-dimensional information from videos into a single, compact video representation for the task of video retrieval using free-form text queries, where the degree of specificity is open-ended. For this we exploit existing knowledge in the form of pre-trained semantic embeddings, which include 'general' features such as motion, appearance, and scene features from visual content. We also explore the use of more 'specific' cues from ASR and OCR, which are intermittently available for videos, and find that these signals remain challenging to use effectively for retrieval. We propose a collaborative experts model to aggregate information from these different pre-trained experts and assess our approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet. Code and data can be found at www.robots.ox.ac.uk/~vgg/research/collaborative-experts/. This paper contains a correction to results reported in the previous version.
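    As a rough, hypothetical sketch of the general idea rather than the authors' architecture: the snippet below projects several pre-trained 'expert' embeddings of a video into a shared space, combines them with fixed gating weights (learned in the real model), and ranks videos against a text-query embedding by cosine similarity. All dimensions, gate values, and projection matrices are illustrative stand-ins.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-9):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def aggregate_experts(expert_feats, gate_weights, projections):
    """Project each expert embedding to a shared space, weight it by a
    (here: fixed) gate, and concatenate into one compact video vector."""
    parts = []
    for name, feat in expert_feats.items():
        proj = feat @ projections[name]          # expert dim -> shared dim
        parts.append(gate_weights[name] * l2_normalize(proj))
    return l2_normalize(np.concatenate(parts))

rng = np.random.default_rng(0)
dims = {"appearance": 2048, "motion": 1024, "audio": 128}   # hypothetical experts
shared = 256
projections = {k: rng.normal(size=(d, shared)) for k, d in dims.items()}
gates = {"appearance": 0.5, "motion": 0.3, "audio": 0.2}    # would be learned in practice

# Build representations for a few "videos" and one stand-in text-query embedding.
videos = [aggregate_experts({k: rng.normal(size=d) for k, d in dims.items()},
                            gates, projections) for _ in range(5)]
query = l2_normalize(rng.normal(size=shared * len(dims)))

scores = [float(query @ v) for v in videos]                  # cosine similarity
print("ranking:", np.argsort(scores)[::-1])
```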

    Diversity-Robust Acoustic Feature Signatures Based on Multiscale Fractal Dimension for Similarity Search of Environmental Sounds

    This paper proposes new acoustic feature signatures based on the multiscale fractal dimension (MFD), which are robust against the diversity of environmental sounds, for content-based similarity search. The diversity of sound sources and acoustic compositions is a typical feature of environmental sounds. Several acoustic features have been proposed for environmental sounds, among them the widely used Mel-Frequency Cepstral Coefficients (MFCCs), which describe frequency-domain features. However, in addition to these frequency-domain features, environmental sounds have other important features in the time domain, at various time scales. In our previous paper, we proposed an enhanced multiscale fractal dimension signature (EMFD) for environmental sounds. This paper extends EMFD with a kernel density estimation method, which improves performance on similarity search tasks. Furthermore, we propose another acoustic feature signature based on MFD, namely the very-long-range multiscale fractal dimension signature (MFD-VL). The MFD-VL signature describes several features of the time-varying envelope over long periods of time, and it is stable and robust against background noise and against small fluctuations in the parameters of sound sources, which arise in field recordings. We discuss the effectiveness of these signatures for similarity sound search by comparing them with acoustic features proposed in the DCASE 2018 challenges. Owing to the unique descriptiveness of our proposed signatures, we confirmed that they are effective when used alongside other acoustic features.
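    The exact EMFD and MFD-VL definitions are given in the paper; as a hedged illustration of the underlying idea only, the sketch below estimates a fractal-dimension value at several time scales with a simple variation (oscillation) method, where the local slope of log-variation versus log-scale characterises the signal's roughness at that scale.

```python
import numpy as np

def variation(x, tau):
    """Mean oscillation (max - min) of the signal over windows of length tau."""
    n = len(x) // tau
    w = x[:n * tau].reshape(n, tau)
    return np.mean(w.max(axis=1) - w.min(axis=1))

def multiscale_fractal_dimension(x, scales):
    """Estimate a fractal dimension at each scale from the local slope of
    log-variation versus log-scale (variation method; illustrative only)."""
    logs = np.log(scales)
    logv = np.log([variation(x, s) + 1e-12 for s in scales])
    slopes = np.diff(logv) / np.diff(logs)
    return 2.0 - slopes          # D ~ 2 - slope for the graph of a 1-D signal

sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 220 * t) + 0.3 * np.random.default_rng(1).normal(size=sr)
scales = np.array([4, 8, 16, 32, 64, 128])
print(multiscale_fractal_dimension(signal, scales))
```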

    Semantic Sound Similarity with Deep Embeddings for Freesound

    Freesound is an online platform where people using sounds for various purposes can share or download audio clips. In such platforms, it is crucial that users are provided with accurate sound recommendations, which is challenging due to the large size of the audio collection, the complexity of sound properties, and the human aspect of the recommendations. To provide sound recommendations, Freesound features a "similar sounds" function. However, this function primarily relies on a digital representation of audio clips that captures the acoustic characteristics of sounds, which proves insufficient for accurately capturing their semantic properties. This limitation reduces the content-based retrieval capabilities of Freesound users. Moreover, the audio representation is created by hand-picking features that were engineered using domain knowledge. Today, in various fields related to audio, this approach has been replaced by using neural networks as feature extractors. In this work, we search for pretrained general-purpose neural networks that can be used to represent the semantic content of audio clips. We choose 8 such models and compare their semantic sound similarity performance both objectively and subjectively. During the integration of deep embeddings in the sound similarity system, we explore numerous design choices and share valuable insights. We use the FSD50K evaluation set for all experiments and report various objective metrics using the sound class hierarchy to perform multi-level analysis, including class- and family-level. We find that most of the neural networks outperform the hand-crafted representation both subjectively and objectively. Specifically, the multi-modal representation learning model CLAP, which uses natural language and audio as modalities, outperforms the other models by a significant margin, while the models that attempt to leverage the CLIP model for creating tri-modal representations fail.
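    A similarity function built on deep embeddings typically reduces to nearest-neighbour search in the embedding space. The sketch below is illustrative only; the clip embeddings would come from one of the pretrained models compared in this work. It indexes L2-normalised embeddings and retrieves the most similar clips by cosine similarity.

```python
import numpy as np

def build_index(embeddings):
    """L2-normalise once so cosine similarity reduces to a dot product."""
    e = np.asarray(embeddings, dtype=np.float32)
    return e / (np.linalg.norm(e, axis=1, keepdims=True) + 1e-9)

def similar_sounds(index, query_vec, k=5):
    """Return the indices and scores of the k most similar clips."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-9)
    scores = index @ q                     # cosine similarity against every clip
    top = np.argsort(scores)[::-1][:k]
    return list(zip(top.tolist(), scores[top].tolist()))

rng = np.random.default_rng(0)
clip_embeddings = rng.normal(size=(1000, 512))   # stand-in for model outputs
index = build_index(clip_embeddings)
print(similar_sounds(index, clip_embeddings[42], k=5))   # clip 42 should rank first
```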

    A new variance-based approach for discriminative feature extraction in machine hearing classification using spectrogram features

    Machine hearing is an emerging research field that is analogous to machine vision in that it aims to equip computers with the ability to hear and recognise a variety of sounds. It is a key enabler of natural human–computer speech interfacing, as well as of applications in automated security surveillance, environmental monitoring, and smart homes/buildings/cities. Recent advances in machine learning allow current systems to accurately recognise a diverse range of sounds under controlled conditions. However, doing so in real-world noisy conditions remains a challenging task. Several front-end feature extraction methods have been used for machine hearing, employing speech recognition features such as MFCC and PLP, as well as image-like features such as AIM and SIF. The best choice of feature is found to depend on the noise environment and the machine learning techniques used. Machine learning methods such as deep neural networks have been shown to be capable of inferring discriminative classification rules from less structured front-end features in related domains. In the machine hearing field, spectrogram image features have recently shown good performance for noise-corrupted classification using deep neural networks. However, there are many ways to extract features from spectrograms. This paper explores a novel data-driven feature extraction method that uses variance-based criteria to define spectral pooling of features from spectrograms. The proposed method, based on maximising the pooled spectral variance of foreground and background sound models, is shown to achieve very good performance for robust classification.
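    One plausible reading of variance-based spectral pooling, shown purely as an illustration (the paper's exact criterion differs in that it maximises pooled variance of foreground and background sound models): compute a magnitude spectrogram, measure the variance of each frequency bin over time, and place pool boundaries so that each pooled band carries roughly equal spectral variance before averaging within bands.

```python
import numpy as np

def spectrogram(x, n_fft=512, hop=256):
    """Magnitude STFT via framing and a real FFT; shape (frames, bins)."""
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))

def variance_pooling_edges(spec, n_pools=20):
    """Choose pool boundaries so each pool carries roughly equal spectral variance."""
    var = spec.var(axis=0)
    cdf = np.cumsum(var) / var.sum()
    return np.searchsorted(cdf, np.linspace(0, 1, n_pools + 1)[1:-1])

def pool_features(spec, edges):
    """Average the spectrogram within each variance-defined band."""
    bands = np.split(spec, edges, axis=1)
    return np.stack([b.mean(axis=1) for b in bands], axis=1)   # (frames, n_pools)

x = np.random.default_rng(0).normal(size=16000)
spec = spectrogram(x)
edges = variance_pooling_edges(spec, n_pools=20)
print(pool_features(spec, edges).shape)
```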

    A classification system for sounds of daily life

    The main goal of this project is the recognition and classification of different sounds associated with everyday activities. This field of study has potential applications in context recognition and, especially, in assistive technologies for people with some form of dependency. Sounds of daily life belong to a class of audio known as unstructured audio, so we investigated the features that best characterise these sounds for subsequent classification. It was also necessary to build a database covering the 11 sound classes used in the project, namely "footsteps", "shower", "toilet", "tap", "street", "car", "street with traffic", "washing dishes", "cooking", "microwave", and "sneezing". From the full set of features that can be extracted from a sound, 25 were selected: the first 14 MFCC components, 6 parameters from a Gabor-function decomposition with matching pursuit, and several temporal and spectral features. Once the database was complete, classification was performed with neural networks based on self-organising maps; refinements guided by the Kappa coefficient, computational cost, and the confusion matrix reduced the total number of features to 17 and produced a final classifier based on 4 maps. From these 4 maps, a set of tables was built to map each neuron to a class, together with optimisation and evaluation procedures for the classifier based on temporal windowing and histograms. Finally, two simulations of the system operating in real time were implemented, with good results.
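    As a hedged sketch of this kind of pipeline (not the project's exact feature set, map sizes, or lookup tables), the snippet below extracts 14 MFCCs per frame, trains a small self-organising map, assigns each neuron the majority class of the frames it wins, and smooths frame-level predictions with a temporal-window majority vote. It assumes the third-party librosa and minisom packages are available.

```python
import numpy as np
import librosa                      # assumed available, for MFCC extraction
from minisom import MiniSom         # assumed available, small SOM library

def frame_features(y, sr, n_mfcc=14):
    """Per-frame MFCC vectors, transposed to (frames, n_mfcc)."""
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def label_neurons(som, feats, labels, grid=(8, 8), n_classes=11):
    """Assign each neuron the majority class of the training frames it wins."""
    votes = np.zeros(grid + (n_classes,))
    for f, lab in zip(feats, labels):
        i, j = som.winner(f)
        votes[i, j, lab] += 1
    return votes.argmax(axis=-1)

def classify_clip(som, neuron_labels, feats, window=50):
    """Frame-level predictions smoothed by a majority vote over a time window."""
    preds = np.array([neuron_labels[som.winner(f)] for f in feats])
    counts = np.bincount(preds[-window:], minlength=int(neuron_labels.max()) + 1)
    return int(counts.argmax())

# Toy usage with random audio standing in for the 11-class daily-life database.
rng = np.random.default_rng(0)
y, sr = rng.normal(size=3 * 22050), 22050
feats = frame_features(y, sr)
labels = rng.integers(0, 11, size=len(feats))        # stand-in frame labels
som = MiniSom(8, 8, feats.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(feats, 2000)
neuron_labels = label_neurons(som, feats, labels)
print(classify_clip(som, neuron_labels, feats))
```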

    Large-Scale Content-Based Audio Retrieval from Text Queries

    In content-based audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This content-based approach differs from retrieval approaches that index media files using metadata such as file names and user tags. In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses free-form text queries rather than sound-sample-based queries, (2) searches by audio content rather than via textual metadata, and (3) can scale to a very large number of audio documents and a very rich query vocabulary. We address the problem of handling generic sounds, including a wide variety of sound effects, animal vocalizations, and natural scenes. We test a scalable approach based …
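    A minimal illustration of scoring audio documents directly from a free-form text query (not the paper's model or training procedure): represent the query as a bag of words, describe each audio document by an acoustic feature vector, and rank documents with a bilinear score whose weight matrix would be learned from paired text and audio data. All sizes and word indices below are hypothetical.

```python
import numpy as np

def score_documents(query_bow, audio_feats, W):
    """Bilinear relevance score s(q, a) = q^T W a for every audio document."""
    return (query_bow @ W) @ audio_feats.T        # shape: (n_docs,)

rng = np.random.default_rng(0)
vocab_size, feat_dim, n_docs = 1000, 128, 10000
W = rng.normal(scale=0.01, size=(vocab_size, feat_dim))   # would be learned in practice
audio_feats = rng.normal(size=(n_docs, feat_dim))         # acoustic features per document

query_bow = np.zeros(vocab_size)
query_bow[[12, 407]] = 1.0                                 # hypothetical query words
scores = score_documents(query_bow, audio_feats, W)
print("top documents:", np.argsort(scores)[::-1][:5])
```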