
    Shift-Invariant Kernel Additive Modelling for Audio Source Separation

    A major goal in blind source separation is to model the inherent characteristics of the sources in order to identify and separate them. While most state-of-the-art approaches are supervised methods trained on large datasets, interest in non-data-driven approaches such as Kernel Additive Modelling (KAM) remains high due to their interpretability and adaptability. KAM separates a given source by applying robust statistics to the time-frequency bins selected by a source-specific kernel function, commonly the K-NN function. This choice assumes that the source of interest repeats in both time and frequency. In practice, this assumption does not always hold. We therefore introduce a shift-invariant kernel function capable of identifying similar spectral content even under frequency shifts, which considerably increases the amount of suitable sound material available to the robust statistics. While this leads to an increase in separation performance, a naive formulation is computationally expensive. We therefore additionally present acceleration techniques that lower the overall computational complexity.
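    To make the baseline that the shift-invariant kernel extends concrete, here is a minimal NumPy sketch of KAM's plain K-NN kernel with a bin-wise median as the robust statistic. The function name and the Euclidean frame distance are illustrative assumptions, not the paper's implementation, and the shift-invariant extension itself is omitted.

```python
import numpy as np

def kam_knn_mask(V, k=10, eps=1e-12):
    """Minimal KAM-style estimate of a repeating source.

    V : magnitude spectrogram, shape (n_freq, n_frames).
    For every frame, find its k nearest-neighbour frames
    (here: Euclidean distance between spectra) and take a
    bin-wise median over them -- the robust statistic that
    KAM applies to the kernel-selected bins."""
    # pairwise Euclidean distances between frames
    d = np.linalg.norm(V[:, :, None] - V[:, None, :], axis=0)
    neighbours = np.argsort(d, axis=1)[:, :k]          # k-NN per frame
    est = np.stack([np.median(V[:, nb], axis=1)        # robust estimate
                    for nb in neighbours], axis=1)
    return np.minimum(est, V) / (V + eps)              # soft mask in [0, 1]
```

    Applying the returned mask to the complex STFT and inverting it yields the separated source; a shift-invariant kernel would additionally compare spectra under frequency shifts before selecting neighbours.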

    REPET for Background/Foreground Separation in Audio

    Repetition is a fundamental element in generating and perceiving structure. In audio, mixtures are often composed of structures where a repeating background signal is superimposed with a varying foreground signal (e.g., a singer overlaying varying vocals on a repeating accompaniment, or a varying speech signal mixed with a repeating background noise). On this basis, we present the REpeating Pattern Extraction Technique (REPET), a simple approach for separating the repeating background from the non-repeating foreground in an audio mixture. The basic idea is to find the repeating elements in the mixture, derive the underlying repeating models, and extract the repeating background by comparing the models to the mixture. Unlike other separation approaches, REPET does not depend on special parametrizations, does not rely on complex frameworks, and does not require external information. Because it is based only on repetition, it has the advantage of being simple, fast, blind, and therefore completely and easily automatable.
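    As an illustration of the basic idea, here is a simplified REPET-style masking step in NumPy. It assumes the repeating period (in frames) has already been estimated (the published method derives it from a beat spectrum, which is omitted here), and the helper name is hypothetical.

```python
import numpy as np

def repet_mask(V, period):
    """Simplified REPET: given the repeating period in frames,
    build the repeating background model and a soft mask.
    V : magnitude spectrogram (n_freq, n_frames)."""
    n_freq, n_frames = V.shape
    n_seg = int(np.ceil(n_frames / period))
    pad = n_seg * period - n_frames
    Vp = np.pad(V, ((0, 0), (0, pad)), constant_values=np.nan)
    segs = Vp.reshape(n_freq, n_seg, period)           # stack the segments
    model = np.nanmedian(segs, axis=1)                 # repeating model
    model = np.tile(model, n_seg)[:, :n_frames]
    model = np.minimum(model, V)                       # background <= mixture
    return model / (V + 1e-12)                         # soft mask in [0, 1]
```

    The median across period-length segments keeps what repeats (the background) and rejects what varies (the foreground), which is exactly the comparison of model and mixture described above.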

    Improving independent vector analysis in speech and noise separation tasks

    Independent vector analysis (IVA) is an efficient multichannel blind source separation method. However, the source models conventionally assumed in IVA have limitations in speech and noise separation tasks. Consequently, using better source models that overcome these limitations is expected to improve the source separation performance of IVA. In this work, an extension of IVA is proposed, with a new source model more suitable for speech and noise separation tasks. The proposed extended IVA was evaluated in a speech and noise separation task, where it was shown to improve separation performance over baseline IVA. Furthermore, extended IVA was evaluated with several post-filters, aiming to realize a setup analogous to a multichannel Wiener filter (MWF) system. This setup further increased the separation performance of IVA.
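    The MWF-like setup pairs the separator with a post-filter. As a hedged illustration only, the sketch below applies a generic single-channel Wiener post-filter to a pair of IVA outputs; it is not the paper's extended source model or its specific post-filters, and the gain floor is an illustrative choice.

```python
import numpy as np

def wiener_postfilter(S_hat, N_hat, floor=0.1):
    """Generic single-channel Wiener post-filter.

    S_hat, N_hat : complex STFTs of the IVA speech and noise
    outputs, shape (n_freq, n_frames). The gain is derived from
    their power ratio and floored to limit musical noise."""
    ps = np.abs(S_hat) ** 2                # speech power estimate
    pn = np.abs(N_hat) ** 2                # noise power estimate
    gain = ps / (ps + pn + 1e-12)          # Wiener gain in [0, 1]
    return np.maximum(gain, floor) * S_hat
```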

    Single channel overlapped-speech detection and separation of spontaneous conversations

    In this thesis, spontaneous conversation containing both speech mixtures and speech dialogue is considered. A speech mixture refers to speakers speaking simultaneously (i.e. overlapped speech); a speech dialogue refers to periods where only one speaker is actively speaking and the other is silent. The input conversation is first processed by overlapped-speech detection, and the two output signals are segregated into dialogue and mixture segments. The dialogue is processed by speaker diarization, whose outputs are the individual speech of each speaker. The mixture is processed by speech separation, whose outputs are the separated speech signals of the individual speakers. When the separation input contains only the mixture, a blind speech separation approach is used; when the separation is assisted by the outputs of the speaker diarization, it is informed speech separation. The research presents a novel overlapped-speech detection algorithm and two novel speech separation algorithms. The proposed overlapped-speech detection algorithm estimates the switching instants of the input; an optimization loop, based on principles of pattern recognition and k-means clustering, selects the best of the candidate audio features and discards the worst. Over 300 simulated conversations, the average False-Alarm Error is 1.9%, the average Missed-Speech Error is 0.4%, and the average Overlap-Speaker Error is 1%. These errors are approximately equal to those reported by the best recent speaker diarization systems on reliable corpora. The proposed blind speech separation algorithm consists of four sequential techniques: filter-bank analysis, Non-negative Matrix Factorization (NMF), speaker clustering, and filter-bank synthesis. Instead of the usually required speaker segmentation, an effective standard framing is contributed. Average objective scores (SAR, SDR and SIR) over 51 simulated conversations are 5.06 dB, 4.87 dB and 12.47 dB respectively. For the proposed informed speech separation algorithm, the outputs of the speaker diarization form a generated database. This database assists the separation by creating virtual targeted-speech and mixture signals; the virtual signals are trained to facilitate the separation by homogenising them with the NMF-matrix elements of the real mixture, and a contributed masking step optimizes the resulting speech. Average SAR, SDR and SIR over 341 simulated conversations are 9.55 dB, 1.12 dB and 2.97 dB respectively. According to the objective tests, the two speech separation algorithms are in the mid-range of well-known NMF-based audio and speech separation methods.
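    To make the blind separation pipeline concrete, here is a rough Python sketch (NumPy and scikit-learn) of the NMF-plus-clustering stage. The clustering criterion used here, grouping NMF bases by their normalized spectral shape, is a stand-in assumption; the thesis's actual speaker-clustering step is not specified in the abstract, and all names are illustrative.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

def nmf_speaker_masks(V, n_components=20, n_speakers=2):
    """Factorise the mixture spectrogram, group the NMF bases
    into speakers, and build one soft mask per speaker.
    V : magnitude spectrogram (n_freq, n_frames)."""
    model = NMF(n_components=n_components, init="nndsvda", max_iter=400)
    W = model.fit_transform(V)                 # spectral bases (n_freq, K)
    H = model.components_                      # activations   (K, n_frames)
    # cluster bases by normalized log-spectral shape -- a stand-in
    # for the thesis's speaker-clustering step
    feats = np.log1p(W / (W.sum(axis=0, keepdims=True) + 1e-12)).T
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(feats)
    masks = []
    for s in range(n_speakers):
        Vs = W[:, labels == s] @ H[labels == s, :]   # one speaker's model
        masks.append(Vs / (W @ H + 1e-12))           # Wiener-style mask
    return masks
```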

    Characterisation and recognition of water sounds for monitoring activities of daily living: an approach based on signal processing, acoustics and perception

    The analysis of instrumental activities of daily living is an important tool in the early diagnosis of dementias such as Alzheimer's disease. The IMMED project investigates tele-monitoring technologies to support doctors in the diagnosis and follow-up of these illnesses. The project aims to automatically produce indexes to facilitate the doctor's navigation through the individual video recordings. Water sound recognition is very useful for identifying everyday activities (e.g. hygiene, household chores, cooking). Classical methods of sound recognition, based on learning techniques, are ineffective on the IMMED corpus, where the data are very heterogeneous and different sound sources overlap substantially. Computational auditory scene analysis provides a theoretical framework for audio event detection in everyday-life recordings, and we review applications of single and multiple audio event detection in real life. We propose a new water flow recognition system based on a new audio feature called spectral cover. Our system obtains good results on more than seven hours of video and is thus integrated into the IMMED framework. A second classification stage notably improves the system's precision using Gammatone Cepstral Coefficients and Support Vector Machines. However, our systems remain limited by the difficulty of characterizing, and therefore recognizing, water sounds by a unique definition. To detect water sounds other than water flow, we drew on acoustics studies describing the origin of water sounds: a liquid sound arises mainly from the harmonic vibration of air bubbles entrained in the water. Building on these theoretical studies and on the analysis of real signals, we designed an original recognition system based on the frequency-domain detection of vibrating air bubbles. This new system detects a wide variety of liquid sounds but is limited by overly complex and noisy water flows, so it complements, rather than replaces, our water flow recognition system. To compare these results with human listening, we conducted a perceptive study. In a free categorization task performed on a large set of everyday liquid sounds, participants grouped sounds according to their causal similarity. The results reveal several categories of liquid sounds, which may reflect different cognitive strategies in identifying water and liquid sounds. A final experiment performed on these categories emphasizes the necessary and sufficient character of our two systems on a varied corpus of everyday water sounds. Our two approaches therefore appear relevant for characterizing and recognizing a large set of sounds produced by water.
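    The bubble-vibration detector rests on a standard acoustics result: an entrained air bubble resonates at its Minnaert frequency, which depends only on the bubble radius and the properties of the surrounding liquid. The short sketch below computes that frequency; the constants are textbook values for air bubbles in water, and the function name is illustrative, not from the thesis.

```python
import numpy as np

def minnaert_frequency(radius, gamma=1.4, p0=101325.0, rho=1000.0):
    """Minnaert resonance frequency (Hz) of an air bubble of the
    given radius (m) in water at atmospheric pressure:

        f0 = (1 / (2*pi*r)) * sqrt(3*gamma*p0 / rho)

    gamma: heat capacity ratio of air, p0: ambient pressure (Pa),
    rho: density of water (kg/m^3)."""
    return np.sqrt(3.0 * gamma * p0 / rho) / (2.0 * np.pi * radius)

# a 1 mm bubble resonates near 3.3 kHz, a 0.3 mm bubble near 11 kHz
print(minnaert_frequency(1e-3))
```

    A frequency-domain bubble detector can thus look for short, decaying harmonic components in roughly the 1-10 kHz band that this relation predicts for millimetre-scale bubbles.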