7 research outputs found

    Bayesian framework for multiple acoustic source tracking

    Acoustic source (speaker) tracking in a room environment plays an important role in many speech and audio applications such as multimedia, hearing aids, hands-free speech communication and teleconferencing systems; the position information can be fed into a higher processing stage for high-quality speech acquisition, enhancement of a specific speech signal in the presence of competing talkers, or keeping a camera focused on the speaker in a video-conferencing scenario. Most existing systems focus on the single-source tracking problem, which assumes that exactly one source is active at all times, so that the state to be estimated is simply the source position. However, in practical scenarios, multiple speakers may be simultaneously active, and the tracking algorithm should be able to localise each individual source and estimate the number of sources. This thesis makes three contributions towards multiple acoustic source tracking in a moderately noisy and reverberant environment. The first contribution is a time delay of arrival (TDOA) estimation approach for multiple sources. Although the phase transform (PHAT) weighted generalised cross-correlation (GCC) method has been employed to extract the TDOAs of multiple sources, it is primarily used in single-source scenarios, and its performance for multiple TDOA estimation has not been comprehensively studied. The proposed approach combines the degenerate unmixing estimation technique (DUET) with the GCC method. Since the speech mixtures are assumed to be window-disjoint orthogonal (WDO) in the time-frequency domain, the spectrograms can be separated by employing DUET, and the GCC method can then be applied to the spectrogram of each individual source. The probabilities of detection and false alarm are also proposed as measures of TDOA estimation performance under a range of experimental parameters. Next, considering that multiple acoustic sources may appear nonconcurrently, an extended Kalman particle filter (EKPF) is developed for a special multiple acoustic source tracking problem, namely "nonconcurrent multiple acoustic tracking (NMAT)". The extended Kalman filter (EKF) is used to approximate the optimum weights, and the subsequent particle filtering (PF) naturally takes the previous position estimates as well as the current TDOA measurements into account. The proposed approach is thus able to lock on to a sharp change of the source position quickly and to avoid the tracking lag of the general sequential importance resampling (SIR) PF. Finally, these investigations are extended into an approach for tracking an unknown and time-varying number of acoustic sources. The DUET-GCC method is used to obtain the TDOA measurements for multiple sources, and a random finite set (RFS) based Rao-Blackwellised PF is employed and modified to track the sources. Each particle has an RFS form encapsulating the states of all sources and is capable of addressing source dynamics: source survival, new source appearance and source deactivation. A data association variable is defined to describe the source dynamics and their relation to the measurements. The Rao-Blackwellisation step is used to decompose the state: the source positions are marginalised by using an EKF, and only the data association variable needs to be handled by a PF. The performance of all the proposed approaches is extensively studied under different noisy and reverberant environments, and compares favourably with existing tracking techniques.
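
As a rough illustration of the PHAT-weighted GCC building block named in the abstract above, the following minimal sketch estimates a single TDOA between two channels. It is not the thesis's DUET-GCC method: there is no time-frequency masking, only one source, and the delay is circular. The function names (`dft`, `gcc_phat_delay`), the naive DFT, and the synthetic signal are all assumptions made for illustration.

```python
import cmath
import math
import random


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (adequate for short demo frames)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(ref, sig, max_lag):
    """Return the lag in samples by which `sig` trails `ref` (negative if it leads)."""
    n = len(ref)
    spec_ref, spec_sig = dft(ref), dft(sig)
    cross = []
    for r, s in zip(spec_ref, spec_sig):
        c = s * r.conjugate()
        mag = abs(c)
        # PHAT weighting: keep only the phase of the cross-spectrum.
        cross.append(c / mag if mag > 1e-12 else 0j)
    cc = [v.real for v in dft(cross, inverse=True)]
    # Search the peak of the circular cross-correlation over the allowed lags.
    return max(range(-max_lag, max_lag + 1), key=lambda lag: cc[lag % n])


# Hypothetical test signal: white noise and a copy delayed by 5 samples (circularly).
rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(64)]
sig = ref[-5:] + ref[:-5]
print(gcc_phat_delay(ref, sig, max_lag=10))  # prints 5
```

With several simultaneous talkers the correlation would show several competing peaks, which is the situation the DUET separation step described above is meant to address.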

    Multichannel source separation and tracking with phase differences by random sample consensus

    Blind audio source separation (BASS) is a fascinating problem that has been tackled from many different angles. The use case of interest in this thesis is that of multiple moving and simultaneously active speakers in a reverberant room. This is a common situation, for example, in social gatherings. Human beings have the remarkable ability to focus attention on a particular speaker while effectively ignoring the rest. This is referred to as the "cocktail party effect" and has been the holy grail of source separation for many decades. Replicating this feat in real time with a machine is the goal of BASS. Single-channel methods attempt to identify the individual speakers from a single recording. However, with the advent of hand-held consumer electronics, techniques based on microphone array processing are becoming increasingly popular. Multichannel methods record a sound field from various locations to incorporate spatial information. If the speakers move over time, we need an algorithm capable of tracking their positions in the room. For compact arrays with 1-10 cm of separation between the microphones, this can be accomplished by applying a temporal filter to estimates of the directions of arrival (DOA) of the speakers. In this thesis, we review recent work on BASS with inter-channel phase difference (IPD) features and provide extensions to the case of moving speakers. It is shown that IPD features compose a noisy circular-linear dataset. This data is clustered with the RANdom SAmple Consensus (RANSAC) algorithm in the presence of strong reverberation to simultaneously localize and separate speakers. The remarkable performance of RANSAC is due to its natural tendency to reject outliers. To handle the case of non-stationary speakers, a factorial wrapped Kalman filter (FWKF) and a factorial von Mises-Fisher particle filter (FvMFPF) are proposed that track source DOAs directly on the unit circle and unit sphere, respectively. These algorithms combine directional statistics, Bayesian filtering theory, and probabilistic data association techniques to track the speakers with mixtures of directional distributions.
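
To make the idea of filtering a DOA directly on the unit circle more concrete, here is a minimal sketch of a scalar Kalman step whose innovation is wrapped to the circle. It is a simplification under stated assumptions, not the thesis's FWKF: it tracks a single source, uses a random-walk motion model, and keeps only the nearest wrapped innovation instead of a mixture. The names `wrap_angle` and `circular_kalman_step` and the noise variances `q` and `r` are hypothetical.

```python
import math


def wrap_angle(theta):
    """Map an angle in radians to the interval [-pi, pi)."""
    return (theta + math.pi) % (2 * math.pi) - math.pi


def circular_kalman_step(mean, var, z, q=0.01, r=0.05):
    """One predict/update step for a random-walk DOA state (all angles in radians).

    q is the assumed process-noise variance, r the assumed measurement-noise variance.
    """
    var_pred = var + q                    # predict: random walk inflates the variance
    innovation = wrap_angle(z - mean)     # shortest signed angular difference
    gain = var_pred / (var_pred + r)      # scalar Kalman gain
    new_mean = wrap_angle(mean + gain * innovation)
    new_var = (1.0 - gain) * var_pred
    return new_mean, new_var


# A source near +pi observed just across the wrap-around point at -pi.
mean, var = circular_kalman_step(3.1, 1.0, -3.1)
print(mean, var)
```

Without the wrapping step, the raw difference between 3.1 and -3.1 rad would be treated as a jump of about 6.2 rad, even though the two directions are less than 0.1 rad apart on the circle.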

    Multi-Speaker Tracking with Audio-Visual Information for Robot Perception (Suivi Multi-Locuteurs avec des Informations Audio-Visuelles pour la Perception des Robots)

    Robot perception plays a crucial role in human-robot interaction (HRI). The perception system provides the robot with information about its surroundings and enables it to give feedback. In a conversational scenario, a group of people may chat in front of the robot and move freely. In such situations, robots are expected to understand where the people are, who is speaking, and what they are talking about. This thesis concentrates on answering the first two questions, namely speaker tracking and diarization. We use different modalities of the robot's perception system to achieve this goal. Like seeing and hearing for a human being, audio and visual information are the critical cues for a robot in a conversational scenario. Advances in computer vision and audio processing over the last decade have revolutionized robot perception abilities. This thesis makes the following contributions. We first develop a variational Bayesian framework for tracking multiple objects. The variational Bayesian framework gives closed-form, tractable solutions, which makes the tracking process efficient. The framework is first applied to visual multiple-person tracking. Birth and death processes are built jointly with the framework to deal with the varying number of people in the scene. Furthermore, we exploit the complementarity of vision and robot motor information. On the one hand, the robot's active motion can be integrated into the visual tracking system to stabilize the tracking. On the other hand, visual information can be used to perform motor servoing. Audio and visual information are then combined in the variational framework to estimate the smooth trajectories of speaking people and to infer the acoustic status of a person: speaking or silent. In addition, to handle scenarios where visual information is absent, we apply the model to acoustic-only speaker localization and tracking. Online dereverberation techniques are applied first, and their output is fed to the tracking system. Finally, a variant of the acoustic speaker tracking model based on the von Mises distribution is proposed, which is specifically adapted to directional data. All the proposed methods are validated on datasets specific to each application.

    Online Audio-Visual Multi-Source Tracking and Separation: A Labeled Random Finite Set Approach

    The dissertation proposes an online solution for separating an unknown and time-varying number of moving sources using audio and visual data. The random finite set framework is used for the modeling and fusion of audio and visual data. This enables an online tracking algorithm to estimate the source positions and identities for each time point. With this information, a set of beamformers can be designed to separate each desired source and suppress the interfering sources
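
The abstract above does not say which beamformer design is used, so the following is only a sketch of the simplest option, a delay-and-sum beamformer steered at a tracked direction. The uniform linear array, far-field model, microphone spacing, frequency, and the function names `steering_vector` and `das_response` are all assumptions for illustration. Note that delay-and-sum only attenuates off-target directions; it does not place explicit nulls on interfering sources.

```python
import cmath
import math


def steering_vector(doa_rad, n_mics, spacing_m, freq_hz, c=343.0):
    """Far-field phase shifts across a uniform linear array (assumed geometry)."""
    return [
        cmath.exp(-2j * math.pi * freq_hz * m * spacing_m * math.sin(doa_rad) / c)
        for m in range(n_mics)
    ]


def das_response(target_doa, test_doa, n_mics=4, spacing_m=0.05, freq_hz=2000.0):
    """Magnitude gain toward test_doa of a delay-and-sum beamformer steered at target_doa."""
    # Delay-and-sum weights: the normalised steering vector of the target direction.
    weights = [a / n_mics for a in steering_vector(target_doa, n_mics, spacing_m, freq_hz)]
    arriving = steering_vector(test_doa, n_mics, spacing_m, freq_hz)
    return abs(sum(w.conjugate() * a for w, a in zip(weights, arriving)))


# Gain toward the tracked source (0 rad) and toward an interferer at 60 degrees.
print(das_response(0.0, 0.0))
print(das_response(0.0, math.pi / 3))
```

In an online setting, the target direction would be refreshed from the tracker's position estimate at each time step and the weights recomputed.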

    Sequential estimation techniques and application to multiple speaker tracking and language modeling

    For many real-world applications, the data under consideration is given as a time sequence that becomes available in an orderly fashion, where the order carries important information about the entities of interest. The work presented in this thesis deals with two such cases by introducing new sequential estimation solutions. More precisely, we introduce: I. A sequential Bayesian estimation framework to solve the multiple speaker localization, detection and tracking problem. This framework is a complete pipeline that includes 1) new observation estimators, which extract a fixed number of potential locations per time frame; 2) new unsupervised Bayesian detectors, which classify these estimates into noise/speaker classes; and 3) new Bayesian filters, which use the speaker class estimates to track multiple speakers. This framework was developed to tackle the low detection rate for overlapping speakers and to reduce the number of constraints generally imposed in standard solutions. II. A sequential neural estimation framework for language modeling, which overcomes some of the shortcomings of standard approaches by merging different models in a hybrid architecture. That is, we introduce two solutions that tightly merge particular models and then show how a generalization can be achieved through a new mixture model. In order to speed up the training of large-vocabulary language models, we introduce a new extension of the noise contrastive estimation approach to batch training.