1,420 research outputs found

    Online Audio-Visual Multi-Source Tracking and Separation: A Labeled Random Finite Set Approach

    Get PDF
    The dissertation proposes an online solution for separating an unknown and time-varying number of moving sources using audio and visual data. The random finite set framework is used for the modeling and fusion of audio and visual data. This enables an online tracking algorithm to estimate the source positions and identities for each time point. With this information, a set of beamformers can be designed to separate each desired source and suppress the interfering sources

    Suivi Multi-Locuteurs avec des Informations Audio-Visuelles pour la Perception des Robots

    Get PDF
    Robot perception plays a crucial role in human-robot interaction (HRI). Perception system provides the robot information of the surroundings and enables the robot to give feedbacks. In a conversational scenario, a group of people may chat in front of the robot and move freely. In such situations, robots are expected to understand where are the people, who are speaking, or what are they talking about. This thesis concentrates on answering the first two questions, namely speaker tracking and diarization. We use different modalities of the robot’s perception system to achieve the goal. Like seeing and hearing for a human-being, audio and visual information are the critical cues for a robot in a conversational scenario. The advancement of computer vision and audio processing of the last decade has revolutionized the robot perception abilities. In this thesis, we have the following contributions: we first develop a variational Bayesian framework for tracking multiple objects. The variational Bayesian framework gives closed-form tractable problem solutions, which makes the tracking process efficient. The framework is first applied to visual multiple-person tracking. Birth and death process are built jointly with the framework to deal with the varying number of the people in the scene. Furthermore, we exploit the complementarity of vision and robot motorinformation. On the one hand, the robot’s active motion can be integrated into the visual tracking system to stabilize the tracking. On the other hand, visual information can be used to perform motor servoing. Moreover, audio and visual information are then combined in the variational framework, to estimate the smooth trajectories of speaking people, and to infer the acoustic status of a person- speaking or silent. In addition, we employ the model to acoustic-only speaker localization and tracking. Online dereverberation techniques are first applied then followed by the tracking system. Finally, a variant of the acoustic speaker tracking model based on von-Mises distribution is proposed, which is specifically adapted to directional data. All the proposed methods are validated on datasets according to applications.La perception des robots joue un rĂŽle crucial dans l’interaction homme-robot (HRI). Le systĂšme de perception fournit les informations au robot sur l’environnement, ce qui permet au robot de rĂ©agir en consequence. Dans un scĂ©nario de conversation, un groupe de personnes peut discuter devant le robot et se dĂ©placer librement. Dans de telles situations, les robots sont censĂ©s comprendre oĂč sont les gens, ceux qui parlent et de quoi ils parlent. Cette thĂšse se concentre sur les deux premiĂšres questions, Ă  savoir le suivi et la diarisation des locuteurs. Nous utilisons diffĂ©rentes modalitĂ©s du systĂšme de perception du robot pour remplir cet objectif. Comme pour l’humain, l’ouie et la vue sont essentielles pour un robot dans un scĂ©nario de conversation. Les progrĂšs de la vision par ordinateur et du traitement audio de la derniĂšre dĂ©cennie ont rĂ©volutionnĂ© les capacitĂ©s de perception des robots. Dans cette thĂšse, nous dĂ©veloppons les contributions suivantes : nous dĂ©veloppons d’abord un cadre variationnel bayĂ©sien pour suivre plusieurs objets. Le cadre bayĂ©sien variationnel fournit des solutions explicites, rendant le processus de suivi trĂšs efficace. Cette approche est d’abord appliquĂ© au suivi visuel de plusieurs personnes. Les processus de crĂ©ations et de destructions sont en adĂ©quation avecle modĂšle probabiliste proposĂ© pour traiter un nombre variable de personnes. De plus, nous exploitons la complĂ©mentaritĂ© de la vision et des informations du moteur du robot : d’une part, le mouvement actif du robot peut ĂȘtre intĂ©grĂ© au systĂšme de suivi visuel pour le stabiliser ; d’autre part, les informations visuelles peuvent ĂȘtre utilisĂ©es pour effectuer l’asservissement du moteur. Par la suite, les informations audio et visuelles sont combinĂ©es dans le modĂšle variationnel, pour lisser les trajectoires et dĂ©duire le statut acoustique d’une personne : parlant ou silencieux. Pour experimenter un scenario oĂč l’informationvisuelle est absente, nous essayons le modĂšle pour la localisation et le suivi des locuteurs basĂ© sur l’information acoustique uniquement. Les techniques de dĂ©rĂ©verbĂ©ration sont d’abord appliquĂ©es, dont le rĂ©sultat est fourni au systĂšme de suivi. Enfin, une variante du modĂšle de suivi des locuteurs basĂ©e sur la distribution de von-Mises est proposĂ©e, celle-ci Ă©tant plus adaptĂ©e aux donnĂ©es directionnelles. Toutes les mĂ©thodes proposĂ©es sont validĂ©es sur des bases de donnĂ©es specifiques Ă  chaque application

    Recurrent Neural Networks and Matrix Methods for Cognitive Radio Spectrum Prediction and Security

    Get PDF
    In this work, machine learning tools, including recurrent neural networks (RNNs), matrix completion, and non-negative matrix factorization (NMF), are used for cognitive radio problems. Specifically addressed are a missing data problem and a blind signal separation problem. A specialized RNN called Cellular Simultaneous Recurrent Network (CSRN), typically used in image processing applications, has been modified. The CRSN performs well for spatial spectrum prediction of radio signals with missing data. An algorithm called soft-impute for matrix completion used together with an RNN performs well for missing data problems in the radio spectrum time-frequency domain. Estimating missing spectrum data can improve cognitive radio efficiency. An NMF method called tuning pruning is used for blind source separation of radio signals in simulation. An NMF optimization technique using a geometric constraint is proposed to limit the solution space of blind signal separation. Both NMF methods are promising in addressing a security problem known as spectrum sensing data falsification attack

    Algorithmic Analysis of Complex Audio Scenes

    Get PDF
    In this thesis, we examine the problem of algorithmic analysis of complex audio scenes with a special emphasis on natural audio scenes. One of the driving goals behind this work is to develop tools for monitoring the presence of animals in areas of interest based on their vocalisations. This task, which often occurs in the evaluation of nature conservation measures, leads to a number of subproblems in audio scene analysis. In order to develop and evaluate pattern recognition algorithms for animal sounds, a representative collection of such sounds is necessary. Building such a collection is beyond the scope of a single researcher and we therefore use data from the Animal Sound Archive of the Humboldt University of Berlin. Although a large portion of well annotated recordings from this archive has been available in digital form, little infrastructure for searching and sharing this data has been available. We describe a distributed infrastructure for searching, sharing and annotating animal sound collections collaboratively, which we have developed in this context. Although searching animal sound databases by metadata gives good results for many applications, annotating all occurences of a specific sound is beyond the scope of human annotators. Moreover, finding similar vocalisations to that of an example is not feasible by using only metadata. We therefore propose an algorithm for content-based similarity search in animal sound databases. Based on principles of image processing, we develop suitable features for the description of animal sounds. We enhance a concept for content-based multimedia retrieval by a ranking scheme which makes it an efficient tool for similarity search. One of the main sources of complexity in natural audio scenes, and the most difficult problem for pattern recognition, is the large number of sound sources which are active at the same time. We therefore examine methods for source separation based on microphone arrays. In particular, we propose an algorithm for the extraction of simpler components from complex audio scenes based on a sound complexity measure. Finally, we introduce pattern recognition algorithms for the vocalisations of a number of bird species. Some of these species are interesting for reasons of nature conservation, while one of the species serves as a prototype for song birds with strongly structured songs.Algorithmische Analyse Komplexer Audioszenen In dieser Arbeit untersuchen wir das Problem der Analyse komplexer Audioszenen mit besonderem Augenmerk auf natĂŒrliche Audioszenen. Eine der treibenden Zielsetzungen hinter dieser Arbeit ist es Werkzeuge zu entwickeln, die es erlauben ein auf LautĂ€ußerungen basierendes Monitoring von Tierarten in Zielregionen durchzufĂŒhren. Diese Aufgabenstellung, die hĂ€ufig in der Evaluation von Naturschutzmaßnahmen auftritt, fĂŒhrt zu einer Anzahl von Unterproblemen innerhalb der Audioszenen-Analyse. Eine wichtige Voraussetzung um Mustererkennungs-Algorithmen fĂŒr Tierstimmen entwickeln zu können, ist die VerfĂŒgbarkeit großer Sammlungen von Aufnahmen von Tierstimmen. Eine solche Sammlung aufzubauen liegt jenseits der Möglichkeiten eines einzelnen Forschers und wir verwenden daher Daten des Tierstimmenarchivs der Humboldt UniversitĂ€t Berlin. Obwohl eine große Anzahl gut annotierter Aufnahmen in diesem Archiv in digitaler Form vorlagen, gab es nur wenig unterstĂŒtzende Infrastruktur um diese Daten durchsuchen und verteilen zu können. Wir beschreiben eine verteilte Infrastruktur, mit deren Hilfe es möglich ist Tierstimmen-Sammlungen zu durchsuchen, sowie gemeinsam zu verwenden und zu annotieren, die wir in diesem Kontext entwickelt haben. Obwohl das Durchsuchen von Tierstimmen-Datenbank anhand von Metadaten fĂŒr viele Anwendungen gute Ergebnisse liefert, liegt es jenseits der Möglichkeiten menschlicher Annotatoren alle Vorkommen eines bestimmten GerĂ€uschs zu annotieren. DarĂŒber hinaus ist es nicht möglich einem Beispiel Ă€hnlich klingende GerĂ€usche nur anhand von Metadaten zu finden. Deshalb schlagen wir einen Algorithmus zur inhaltsbasierten Ähnlichkeitssuche in Tierstimmen-Datenbanken vor. Ausgehend von Methoden der Bildverarbeitung entwickeln wir geeignete Merkmale fĂŒr die Beschreibung von Tierstimmen. Wir erweitern ein Konzept zur inhaltsbasierten Multimedia-Suche um ein Ranking-Schema, dass dieses zu einem effizienten Werkzeug fĂŒr die Ähnlichkeitssuche macht. Eine der grundlegenden Quellen von KomplexitĂ€t in natĂŒrlichen Audioszenen, und das schwierigste Problem fĂŒr die Mustererkennung, stellt die hohe Anzahl gleichzeitig aktiver GerĂ€uschquellen dar. Deshalb untersuchen wir Methoden zur Quellentrennung, die auf Mikrofon-Arrays basieren. Insbesondere schlagen wir einen Algorithmus zur Extraktion einfacherer Komponenten aus komplexen Audioszenen vor, der auf einem Maß fĂŒr die KomplexitĂ€t von Audioaufnahmen beruht. Schließlich fĂŒhren wir Mustererkennungs-Algorithmen fĂŒr die LautĂ€ußerungen einer Reihe von Vogelarten ein. Einige dieser Arten sind aus GrĂŒnden des Naturschutzes interessant, wĂ€hrend eine Art als Prototyp fĂŒr Singvögel mit stark strukturierten GesĂ€ngen dient

    Studies on noise robust automatic speech recognition

    Get PDF
    Noise in everyday acoustic environments such as cars, traffic environments, and cafeterias remains one of the main challenges in automatic speech recognition (ASR). As a research theme, it has received wide attention in conferences and scientific journals focused on speech technology. This article collection reviews both the classic and novel approaches suggested for noise robust ASR. The articles are literature reviews written for the spring 2009 seminar course on noise robust automatic speech recognition (course code T-61.6060) held at TKK

    Distributed adaptive signal processing for frequency estimation

    Get PDF
    It is widely recognised that future smart grids will heavily rely upon intelligent communication and signal processing as enabling technologies for their operation. Traditional tools for power system analysis, which have been built from a circuit theory perspective, are a good match for balanced system conditions. However, the unprecedented changes that are imposed by smart grid requirements, are pushing the limits of these old paradigms. To this end, we provide new signal processing perspectives to address some fundamental operations in power systems such as frequency estimation, regulation and fault detection. Firstly, motivated by our finding that any excursion from nominal power system conditions results in a degree of non-circularity in the measured variables, we cast the frequency estimation problem into a distributed estimation framework for noncircular complex random variables. Next, we derive the required next generation widely linear, frequency estimators which incorporate the so-called augmented data statistics and cater for the noncircularity and a widely linear nature of system functions. Uniquely, we also show that by virtue of augmented complex statistics, it is possible to treat frequency tracking and fault detection in a unified way. To address the ever shortening time-scales in future frequency regulation tasks, the developed distributed widely linear frequency estimators are equipped with the ability to compensate for the fewer available temporal voltage data by exploiting spatial diversity in wide area measurements. This contribution is further supported by new physically meaningful theoretical results on the statistical behavior of distributed adaptive filters. Our approach avoids the current restrictive assumptions routinely employed to simplify the analysis by making use of the collaborative learning strategies of distributed agents. The efficacy of the proposed distributed frequency estimators over standard strictly linear and stand-alone algorithms is illustrated in case studies over synthetic and real-world three-phase measurements. An overarching theme in this thesis is the elucidation of underlying commonalities between different methodologies employed in classical power engineering and signal processing. By revisiting fundamental power system ideas within the framework of augmented complex statistics, we provide a physically meaningful signal processing perspective of three-phase transforms and reveal their intimate connections with spatial discrete Fourier transform (DFT), optimal dimensionality reduction and frequency demodulation techniques. Moreover, under the widely linear framework, we also show that the two most widely used frequency estimators in the power grid are in fact special cases of frequency demodulation techniques. Finally, revisiting classic estimation problems in power engineering through the lens of non-circular complex estimation has made it possible to develop a new self-stabilising adaptive three-phase transformation which enables algorithms designed for balanced operating conditions to be straightforwardly implemented in a variety of real-world unbalanced operating conditions. This thesis therefore aims to help bridge the gap between signal processing and power communities by providing power system designers with advanced estimation algorithms and modern physically meaningful interpretations of key power engineering paradigms in order to match the dynamic and decentralised nature of the smart grid.Open Acces

    Proceedings of the EAA Spatial Audio Signal Processing symposium: SASP 2019

    Get PDF
    International audienc
    • 

    corecore