    Contribution of Statistical Tests to Sparseness-Based Blind Source Separation

    We address the problem of blind source separation in the underdetermined mixture case. Two statistical tests are proposed to reduce the number of empirical parameters involved in standard sparseness-based underdetermined blind source separation (UBSS) methods. The first test performs multisource selection of the suitable time-frequency points for source recovery and is fully automatic. The second is dedicated to autosource selection for mixing matrix estimation and requires fixing only two parameters, regardless of the instrumented SNRs. We experimentally show that the use of these tests incurs no performance loss and even improves the performance of standard weak-sparseness UBSS approaches.
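    To make the weak-sparseness machinery concrete, the sketch below shows a standard correlation-based selection of single-source time-frequency points in a two-channel mixture, the kind of step whose empirical thresholds the proposed tests aim to automate. It is a generic illustration, not the paper's tests; the `threshold`, neighbourhood size, and function names are assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter
from scipy.signal import stft

def single_source_mask(x1, x2, fs, size=(3, 5), threshold=0.98):
    """Flag TF points where a single source dominates both channels.

    Where one source is active, the two mixture STFTs are locally
    proportional, so their normalised cross-correlation over a small
    TF neighbourhood is close to 1. `threshold` is the kind of
    empirical parameter the paper's statistical tests replace.
    """
    _, _, X1 = stft(x1, fs=fs, nperseg=1024)
    _, _, X2 = stft(x2, fs=fs, nperseg=1024)
    cross = X1 * np.conj(X2)
    c_re = uniform_filter(cross.real, size)
    c_im = uniform_filter(cross.imag, size)
    p1 = uniform_filter(np.abs(X1) ** 2, size)
    p2 = uniform_filter(np.abs(X2) ** 2, size)
    corr = np.hypot(c_re, c_im) / np.sqrt(p1 * p2 + 1e-12)
    return corr > threshold  # points usable for mixing-matrix estimation
```

    At the retained points, the channel ratio X2/X1 clusters around the mixing-matrix column of the active source, which is how weak-sparseness UBSS methods estimate the mixing matrix before recovering the sources.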

    Separation and Count Estimation of Audio Signal Sources with Time and Frequency Overlap

    Everyday audio recordings involve mixture signals: music contains a mixture of instruments; in a meeting or conference, there is a mixture of human voices. For these mixtures, automatically separating or estimating the number of sources is a challenging task. A common assumption when processing mixtures in the time-frequency domain is that sources are not fully overlapped. However, in this work we consider some cases where the overlap is severe, for instance when instruments play the same note (unison) or when many people speak concurrently ("cocktail party"), highlighting the need for new representations and more powerful models. To address the problems of source separation and count estimation, we use conventional signal processing techniques as well as deep neural networks (DNN). We first address the source separation problem for unison instrument mixtures, studying the distinct spectro-temporal modulations caused by vibrato. To exploit these modulations, we developed a method based on time warping, informed by an estimate of the fundamental frequency. For cases where such estimates are not available, we present an unsupervised model, inspired by the way humans group time-varying sources (common fate). This contribution comes with a novel representation that not only improves separation for overlapped and modulated sources in unison mixtures but also improves vocal and accompaniment separation when used as the input to a DNN model. Then, we focus on estimating the number of sources in a mixture, which is important for real-world scenarios. Our work on count estimation was motivated by a study on how humans address this task, which led us to conduct listening experiments confirming that humans can correctly estimate the number of sources only up to four. To answer the question of whether machines can perform similarly, we present a DNN architecture trained to estimate the number of concurrent speakers. Our results show improvements compared to other methods, and the model even outperformed humans on the same task. In both the source separation and source count estimation tasks, the key contribution of this thesis is the concept of "modulation", which is important to computationally mimic human performance. Our proposed Common Fate Transform is an adequate representation to disentangle overlapping signals for separation, and an inspection of our DNN count estimation model revealed that it, too, derives modulation-like intermediate features.
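    A minimal sketch of a Common Fate-style representation follows, assuming the published patch-wise 2-D Fourier construction over the STFT; the patch and hop sizes are illustrative, not the thesis's settings.

```python
import numpy as np
from scipy.signal import stft

def common_fate_transform(x, fs, nperseg=1024, patch=(32, 16), hop=(16, 8)):
    """Patch-wise 2-D FFT of the STFT: a Common Fate-style tensor.

    Each (frequency x time) patch of the STFT is mapped to a 2-D
    modulation spectrum, so components that modulate together (e.g.
    the vibrato of one unison voice) concentrate in the same
    modulation bins and can be separated even where they overlap
    in the plain STFT.
    """
    _, _, X = stft(x, fs=fs, nperseg=nperseg)
    (pf, pt), (hf, ht) = patch, hop
    nf = (X.shape[0] - pf) // hf + 1
    nt = (X.shape[1] - pt) // ht + 1
    cft = np.empty((nf, nt, pf, pt), dtype=complex)
    for i in range(nf):
        for j in range(nt):
            cft[i, j] = np.fft.fft2(X[i * hf:i * hf + pf, j * ht:j * ht + pt])
    return cft
```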

    Automated interpretation of benthic stereo imagery

    Autonomous benthic imaging reduces human risk and increases the amount of collected data. However, manually interpreting these high volumes of data is onerous, time-consuming and, in many cases, infeasible. The objective of this thesis is to improve the scientific utility of these large image datasets. Fine-scale terrain complexity is typically quantified by rugosity and measured by divers using chains and tape measures. This thesis proposes a new technique for measuring terrain complexity from 3D stereo image reconstructions, which is non-contact and can be calculated at multiple scales over large spatial extents. Using robots, terrain complexity can be measured beyond scuba depths without endangering humans. Results show that this approach is more robust, flexible and easily repeatable than traditional methods. The proposed terrain complexity features are combined with visual colour and texture descriptors and applied to classifying imagery. New multi-dataset feature selection methods are proposed for performing feature selection across multiple datasets, and are shown to improve the overall classification performance. The results show that the most informative predictors of benthic habitat types are the new terrain complexity measurements. This thesis also presents a method that reduces human labelling effort while maximising classification performance by combining pre-clustering with active learning. The results show that utilising the structure of the unlabelled data in conjunction with uncertainty sampling can significantly reduce the number of labels required for a given level of accuracy. Typically only 0.00001–0.00007% of image data is annotated and processed for science purposes (20–50 points in 1–2% of the images). This thesis proposes a framework that uses existing human-annotated point labels to train a superpixel-based automated classification system, which can extrapolate the classified results to every pixel across all the images of an entire survey.
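    As a pointer to how terrain complexity can be computed without chains and tape measures, the sketch below estimates rugosity (true surface area over planar area) from a gridded depth map such as one rasterised from a stereo reconstruction. It is a single-scale illustration under an assumed grid spacing, not the thesis's multi-scale method.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def rugosity(depth, cell=0.01, scale=1):
    """Area-ratio rugosity of a gridded depth map.

    depth: 2-D array of heights in metres (e.g. rasterised from a
           stereo reconstruction); cell: grid spacing in metres.
    scale: optional smoothing window in cells, a crude stand-in for
           the multi-scale analysis described in the thesis.
    Each surface element exceeds its planar footprint by a factor of
    sqrt(1 + |grad z|^2); averaging that ratio gives rugosity >= 1,
    with flat terrain scoring exactly 1.
    """
    if scale > 1:
        depth = uniform_filter(depth, scale)
    dzdy, dzdx = np.gradient(depth, cell)
    return np.mean(np.sqrt(1.0 + dzdx**2 + dzdy**2))
```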

    Bayesian Microphone Array Processing

    Doctoral dissertation, Kyoto University (Doctor of Informatics), Graduate School of Informatics, Department of Intelligence Science and Technology. Examination committee: Prof. Hiroshi G. Okuno (chair), Prof. Tatsuya Kawahara, Assoc. Prof. Marco Cuturi, and Lecturer Kazuyoshi Yoshii.

    Speech Enhancement using Fiber Acoustic Sensor

    With the development of IoT (Internet of Things) services and devices, voice commands have become an increasingly important tool for human-computer interaction. However, the audio signal recorded by a conventional omnidirectional microphone is easily corrupted by environmental noise such as interfering speech. Although conventional beamforming techniques can point the main lobe of the beam pattern at the desired speaker, they require several omnidirectional microphones arranged as a microphone array, which occupies considerable space on an IoT device. Many researchers are therefore working to invent a small microphone that can create a directional beam pattern. Recently, researchers drew inspiration from the way spiders sense acoustic waves and invented a new small acoustic sensor made of spider silk. This acoustic sensor has a frequency-independent dipole beam pattern for wideband audio signals. Using this fiber acoustic sensor, two compact microphone arrays and corresponding speech enhancement systems can be constructed. The first microphone array consists of one omnidirectional microphone collocated with one fiber acoustic sensor; the second consists of two collocated fiber acoustic sensors with orthogonal dipole beam patterns. For the first microphone array, this thesis designs a first-order adaptive beamformer to reduce the effects of interfering speech and to separate speakers. In this design, an adaptive first-order beam pattern is formed by means of the normalized least mean square (NLMS) method. In a scenario where the desired and interfering speech are present at the same time, this adaptive beamformer can point the null of the beam pattern at the undesired speaker to reduce the interference. To verify this idea, numerical simulations are conducted in an ideal condition (clean speech without reverberation) and a realistic one (clean speech corrupted by white noise and reverberation). The results show that the design improves speech quality significantly in the ideal case; under white noise and reverberation the improvement persists, but at a much smaller scale. For the second array, a speech enhancement system is proposed that enables the collocated fiber acoustic sensors to capture speech from any direction. This system has three main parts. The first part performs DOA (direction of arrival) estimation aided by machine learning: the inter-channel acoustic intensity difference is used to compute raw DOA estimates in the presence of white noise and reverberation, and a wrapped Gaussian mixture model then refines them into a more accurate estimate. The proposed method is robust to both white noise and reverberation, has low computational complexity, and resolves the phase ambiguity problem (0 and π appear identical). In the second part, the orthogonality of the dipoles of the two collocated fiber acoustic sensors (one sin Ξ, the other cos Ξ), together with the DOA Ξ estimated by the wrapped Gaussian mixture model, is used to generate a steerable dipole beam pattern whose main lobe points at the speaker. In the third part, a noise reduction procedure is applied to the output of the steerable beamformer; it is based on a time-frequency mask that filters out time-frequency bins dominated by white noise and keeps those of the speech signal. To verify the effectiveness of the designed system, numerical simulations are conducted in the presence of both white noise and reverberation. The results show that the proposed DOA estimation method is robust to both white noise and reverberation, implying that this type of microphone array can obtain precise speaker spatial information. Meanwhile, the audio quality of the system's output signal is improved by at least 50%.
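    The steering step of the second system reduces to a two-line combination of the orthogonal dipole channels. The sketch below, with assumed variable names, shows that combination together with the raw intensity-based DOA estimate whose 0-vs-π ambiguity motivates the wrapped Gaussian mixture model; it is an illustration of the stated principle, not the thesis's code.

```python
import numpy as np

def raw_doa(x_cos, x_sin):
    """Raw DOA from inter-channel acoustic intensity.

    With dipole gains cos(phi) and sin(phi), the cross-power over the
    auto-power estimates tan(phi), so the result is ambiguous by pi
    (0 and pi look identical); the wrapped GMM stage resolves this.
    """
    return np.arctan(np.sum(x_sin * x_cos) / (np.sum(x_cos**2) + 1e-12))

def steer_dipole(x_cos, x_sin, theta0):
    """Steer the combined dipole toward the DOA estimate theta0.

    A source at angle phi reaches the two sensors with gains cos(phi)
    and sin(phi); the weighted sum below therefore has directivity
    cos(phi - theta0), i.e. a dipole whose main lobe points at theta0.
    """
    return np.cos(theta0) * x_cos + np.sin(theta0) * x_sin
```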

    Signal Processing in Space and Time: A Multidimensional Fourier Approach

    Sound waves propagate through space and time by transference of energy between the particles in the medium, which vibrate according to the oscillation patterns of the waves. These vibrations can be captured by a microphone and translated into a digital signal, representing the amplitude of the sound pressure as a function of time. The signal obtained by the microphone characterizes the time-domain behavior of the acoustic wave field, but has no information related to the spatial domain. The spatial information can be obtained by measuring the vibrations with an array of microphones distributed at multiple locations in space. This allows the amplitude of the sound pressure to be represented not only as a function of time but also as a function of space. The use of microphone arrays creates a new class of signals that is somewhat unfamiliar to Fourier analysis. Current paradigms try to circumvent the problem by treating the microphone signals as multiple "cooperating" signals, and applying the Fourier analysis to each signal individually. Conceptually, however, this is not faithful to the mathematics of the wave equation, which expresses the acoustic wave field as a single function of space and time, and not as multiple functions of time. The goal of this thesis is to provide a formulation of Fourier theory that treats the wave field as a single function of space and time, and allows it to be processed as a multidimensional signal using the theory of digital signal processing (DSP). We base this on a physical principle known as the Huygens principle, which essentially says that the wave field can be sampled at the surface of a given region in space and subsequently reconstructed in the same region, using only the samples obtained at the surface. To translate this into DSP language, we show that the Huygens principle can be expressed as a linear system that is both space- and time-invariant, and can be formulated as a convolution operation. If the input signal is transformed into the spatio-temporal Fourier domain, the system can also be analyzed according to its frequency response. In the first half of the thesis, we derive theoretical results that express the 4-D Fourier transform of the wave field as a function of the parameters of the scene, such as the number of sources and their locations, the source signals, and the geometry of the microphone array. We also show that the wave field can be effectively analyzed on a small scale using what we call the space/time-frequency representation space, consisting of a Gabor representation across the spatio-temporal manifold defined by the microphone array. These results are obtained by treating the signals as continuous functions of space and time. The second half of the thesis is dedicated to processing the wave field in discrete space and time, using Nyquist sampling theory and multidimensional filter banks theory. In particular, we show examples of orthogonal filter banks that effectively represent the wave field in terms of its elementary components while satisfying the requirements of critical sampling and perfect reconstruction of the input. We discuss the architecture of such filter banks, and demonstrate their applicability in real applications, such as spatial filtering and wave field coding.
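    The thesis's central move, treating the recorded wave field as one multidimensional signal, can be illustrated for a uniform linear array: a single 2-D Fourier transform of the space-time signal yields a frequency-wavenumber spectrum in which propagating waves occupy the cone |k| ≀ |ω|/c. The sketch below assumes a linear array with spacing `d`; it is a low-dimensional stand-in for the 4-D analysis developed in the thesis, not its implementation.

```python
import numpy as np

def spacetime_spectrum(p, d, fs):
    """2-D Fourier transform of a linear-array recording p[m, n].

    p: array signal indexed by microphone m (spacing d metres) and
       sample n (rate fs Hz), treated as one function of space and time.
    Returns the shifted spectrum plus the wavenumber (rad/m) and
    angular-frequency (rad/s) axes. Far-field plane waves concentrate
    along lines through the origin inside the cone |k| <= |omega| / c.
    """
    P = np.fft.fftshift(np.fft.fft2(p))
    k = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(p.shape[0], d))
    w = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(p.shape[1], 1.0 / fs))
    return P, k, w
```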

    The assessment and development of methods in (spatial) sound ecology

    As vital ecosystems across the globe come under uncharted pressure from climate change and industrial land use, understanding the processes driving ecosystem viability has never been more critical. Nuanced ecosystem understanding comes from well-collected field data and a wealth of associated interpretations. In recent years the most popular methods of ecosystem monitoring have shifted from often damaging and labour-intensive manual data collection to automated methods of data collection and analysis. Sound ecology describes the school of research that uses information transmitted through sound to infer properties about an area's species, biodiversity, and health. In this thesis, we explore and develop state-of-the-art automated monitoring with sound, specifically relating to data storage practice and to spatial acoustic recording and data analysis. In the first chapter, we explore the necessity and methods of ecosystem monitoring, focusing on acoustic monitoring, later exploring how and why sound is recorded and the current state of the art in acoustic monitoring. Chapter one concludes by setting out the aims and overall content of the following chapters. We begin the second chapter by exploring methods used to mitigate data storage expense, a widespread issue as automated methods quickly amass vast amounts of data which can be expensive and impractical to manage. Importantly, I explain how these data management practices are often used without known consequence, something I then address. Specifically, I present evidence that the most used data reduction methods (namely compression and temporal subsetting) have a surprisingly small impact on the information content of recorded sound compared to the method of analysis. This work also adds to the increasing evidence that deep learning-based methods of environmental sound quantification are more powerful and robust to experimental variation than more traditional acoustic indices. In the latter chapters, I focus on using multichannel acoustic recording for sound-source localisation. Knowing where a sound originated has a range of ecological uses, including counting individuals, locating threats, and monitoring habitat use. While an exciting application of acoustic technology, spatial acoustics has had minimal uptake owing to the expense, impracticality and inaccessibility of equipment. In my third chapter, I introduce MAARU (Multichannel Acoustic Autonomous Recording Unit), a low-cost, easy-to-use and accessible solution to this problem. I explain the software and hardware necessary for spatial recording and show how MAARU can be used to localise the direction of a sound to within ±10°. In the fourth chapter, I explore how MAARU devices deployed in the field can be used for enhanced ecosystem monitoring, spatially clustering individuals by calling direction for more accurate abundance approximations and crude species-specific habitat-usage monitoring. Most literature on spatial acoustics cites the need for many accurately synced recording devices over an area; this chapter provides the first evidence of advances made with just one recorder. Finally, I conclude this thesis by restating my aims and discussing my success in achieving them. Specifically, in the thesis' conclusion, I reiterate the contributions made to the field as a direct result of this work and outline some possible development avenues.
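    For readers unfamiliar with how a multichannel recorder can estimate a sound's bearing, the sketch below shows GCC-PHAT time-delay estimation between one microphone pair, a standard building block for this kind of direction estimate. The thesis's own MAARU pipeline may differ; the sound speed constant and function names here are assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air

def gcc_phat_delay(x, y, fs, max_tau):
    """Time delay of arrival of x relative to y via GCC-PHAT.

    The PHAT weighting whitens the cross-spectrum so only phase (i.e.
    delay) information remains, which keeps the correlation peak sharp
    in reverberant outdoor recordings.
    """
    n = len(x) + len(y)
    R = np.fft.rfft(x, n=n) * np.conj(np.fft.rfft(y, n=n))
    R /= np.abs(R) + 1e-12                     # keep phase only
    cc = np.fft.irfft(R, n=n)
    shift = min(int(fs * max_tau), n // 2)     # limit search to valid lags
    cc = np.concatenate((cc[-shift:], cc[:shift + 1]))
    return (np.argmax(np.abs(cc)) - shift) / fs

# Bearing for a pair spaced d metres apart (front/back ambiguous for a
# single pair; additional pairs on the device resolve it):
#   theta = arcsin(tau * SPEED_OF_SOUND / d)
```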
    • 

    corecore