96 research outputs found
3D AUDIO-VISUAL SPEAKER TRACKING WITH AN ADAPTIVE PARTICLE FILTER
We propose an audio-visual fusion algorithm for 3D speaker tracking from a localised multi-modal sensor platform composed of a camera and a small microphone array. After extracting audio-visual cues from the individual modalities, we fuse them adaptively in a particle filter framework according to their reliability. The reliability of the audio signal is measured by the maximum Global Coherence Field (GCF) peak value at each frame. The visual reliability is based on colour-histogram matching of detection results against a reference image in RGB space. Experiments on the AV16.3 dataset show that the proposed adaptive audio-visual tracker outperforms both the individual modalities and a classical approach with fixed parameters in terms of tracking accuracy.
Qian, Xinyuan; Brutti, Alessio; Omologo, Maurizio; Cavallaro, Andrea
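The reliability-driven fusion step can be sketched as follows. This is an illustrative reduction under assumed interfaces: the function name, the reliability normalisation, and the power-weighted combination of likelihoods are hypothetical choices, not the authors' exact formulation.

```python
import numpy as np

def fuse_weights(audio_lik, video_lik, gcf_peak, hist_sim, gcf_max=1.0):
    """Adaptively weight per-particle likelihoods by modality reliability.

    audio_lik, video_lik : (N,) likelihoods for N particles
    gcf_peak             : current GCF peak value (audio reliability proxy)
    hist_sim             : colour-histogram similarity (video reliability proxy)
    """
    # Normalise the two reliability scores into a single mixing factor alpha.
    a = np.clip(gcf_peak / gcf_max, 0.0, 1.0)
    v = np.clip(hist_sim, 0.0, 1.0)
    alpha = a / (a + v + 1e-12)            # relative audio reliability in [0, 1]
    # Power-weighted product: a reliable modality dominates the fused weight.
    w = audio_lik ** alpha * video_lik ** (1.0 - alpha)
    return w / w.sum()                      # normalised particle weights

fuse_weights(np.array([0.1, 0.9]), np.array([0.5, 0.5]), 0.0, 1.0)
```

When the GCF peak is low (unreliable audio), alpha shrinks and the video likelihood dominates; when the histogram match is poor, the audio term takes over.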
Audio-visual tracking of concurrent speakers
Audio-visual tracking of an unknown number of concurrent speakers in 3D is a challenging task, especially when sound and video are collected with a compact sensing platform. In this paper, we propose a tracker that builds on generative and discriminative audio-visual likelihood models formulated in a particle filtering framework. We localize multiple concurrent speakers with a de-emphasized acoustic map assisted by 3D video observations derived from image detections. The 3D multimodal observations are either assigned to existing tracks for discriminative likelihood computation or used to initialize new tracks. The generative likelihoods rely on the color distribution of the target and the value of the de-emphasized acoustic map. Experiments on the AV16.3 and CAV3D datasets show that the proposed tracker outperforms the uni-modal trackers and state-of-the-art approaches both in 3D and on the image plane.
Multi-speaker tracking from an audio-visual sensing device
Compact multi-sensor platforms are portable and thus desirable for robotics and personal-assistance tasks. However, compared to physically distributed sensors, the size of these platforms makes person tracking more difficult. To address this challenge, we propose a novel 3D audio-visual people tracker that exploits visual observations (object detections) to guide the acoustic processing by constraining the acoustic likelihood to the horizontal plane defined by the predicted height of a speaker. This solution allows the tracker to estimate, with a small microphone array, the distance of a sound source. Moreover, we apply a color-based visual likelihood on the image plane to compensate for misdetections. Finally, we use a 3D particle filter and greedy data association to combine visual observations, color-based and acoustic likelihoods to track the positions of multiple simultaneous speakers. We compare the proposed multimodal 3D tracker against two state-of-the-art methods on the AV16.3 dataset and on a newly collected dataset with co-located sensors, which we make available to the research community. Experimental results show that our multimodal approach outperforms the other methods both in 3D and on the image plane.
Robust F0 estimation based on a multi-microphone periodicity function for distant-talking speech
This work addresses the problem of deriving F0 from distant-talking speech signals acquired by a microphone network. The proposed method exploits the redundancy across the channels by jointly processing the different signals. To this purpose, a multi-microphone periodicity function is derived from the magnitude spectra of all the channels. This function allows F0 to be estimated reliably, even under reverberant conditions, without the need for any post-processing or smoothing technique. Experiments conducted on real data showed that the proposed frequency-domain algorithm is more suitable than other time-domain based ones.
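A minimal sketch of a harmonic-sum periodicity function pooled over channels: the candidate grid, the harmonic count, and the nearest-bin lookup below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def multimic_f0(frames, fs, f0_min=80.0, f0_max=400.0, n_harm=5):
    """Estimate F0 from time-aligned frames of several microphones.

    frames : (n_channels, n_samples) array holding the same utterance frame.
    """
    n_fft = 4096
    # Pool redundancy across channels: sum the windowed magnitude spectra.
    win = np.hanning(frames.shape[1])
    mag = np.abs(np.fft.rfft(frames * win, n_fft, axis=1)).sum(axis=0)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    cands = np.arange(f0_min, f0_max, 1.0)
    # Periodicity score: magnitude summed at the first n_harm harmonics.
    score = [sum(mag[np.argmin(np.abs(freqs - h * f))]
                 for h in range(1, n_harm + 1))
             for f in cands]
    return float(cands[int(np.argmax(score))])
```

Because the score is built from the joint spectrum, a channel degraded by reverberation contributes little while cleaner channels still expose the harmonic structure.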
Time-frequency reassigned features for automatic chord recognition
This paper addresses feature extraction for automatic chord recognition systems. Most chord recognition systems use chroma features as a front-end together with some kind of classifier (HMM, SVM or template matching). The vast majority of feature extraction approaches map frequency bins from the spectrum or constant-Q spectrum to chroma bins. In this work, a set of new chroma features based on the time-frequency reassignment (TFR) technique is investigated. The proposed feature set was evaluated on the commonly used Beatles dataset and proved effective for the chord recognition task, outperforming standard chroma features.
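The baseline bin-to-chroma mapping that TFR refines can be sketched as follows. This is the textbook construction (fold each spectral bin onto one of 12 pitch classes), not the paper's reassigned variant.

```python
import numpy as np

def chroma_from_spectrum(mag, freqs, a4=440.0):
    """Fold a magnitude spectrum onto 12 pitch-class (chroma) bins."""
    chroma = np.zeros(12)
    valid = freqs > 20.0                          # skip DC / sub-audio bins
    # MIDI-style pitch number, then modulo 12 for the pitch class.
    pitch = 69 + 12 * np.log2(freqs[valid] / a4)
    cls = np.mod(np.round(pitch), 12).astype(int)
    np.add.at(chroma, cls, mag[valid])            # accumulate repeated classes
    return chroma / (chroma.sum() + 1e-12)        # normalised 12-bin profile
```

Reassignment sharpens this mapping by relocating each bin's energy to its instantaneous frequency before the fold, which reduces the smearing that plain STFT bins introduce.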
Exploiting inter-microphone agreement for hypothesis combination in distant speech recognition
A multi-microphone hypothesis combination approach, suitable for the distant-talking scenario, is presented in this paper. The method is based on the inter-microphone agreement of information extracted at the speech recognition level. In particular, temporal information is exploited to organize the clusters that shape the resulting confusion network and to reduce the global hypothesis search space. As a result, a single combined confusion network is generated from multiple lattices. The approach offers a novel perspective on solutions based on confusion network combination. The method was evaluated in a simulated domestic environment equipped with widely spaced microphones. The experimental evidence suggests that results comparable to, or in some cases better than, the state of the art can be achieved with the proposed method under optimal configurations.
Talker Localization And Speech Enhancement In A Noisy Environment Using A Microphone Array Based Acquisition System
This paper deals with the use of linear microphone arrays for the detection, localization and enhancement of a generic acoustic message produced in a noisy environment. A Crosspower-Spectrum Phase based analysis and a Coherence Measure representation are presented that allow an accurate time delay estimation, employed to hypothesize the acoustic source position. Preliminary results in terms of source localization accuracy are given. Once the source position is estimated, an enhanced version of the original acoustic message is derived, which can serve as the input to a speech recognition system.
Keywords: Microphone Arrays, Talker Localization, Speech Enhancement.
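Crosspower-Spectrum Phase analysis is the phase-transform cross-correlation widely known as GCC-PHAT. A minimal sketch of the time delay estimate for one microphone pair follows; the padding length and peak-search range are conventional choices, not necessarily the paper's.

```python
import numpy as np

def csp_delay(x1, x2, fs):
    """Crosspower-Spectrum Phase (GCC-PHAT) time delay of x1 relative to x2, in seconds."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    R = X1 * np.conj(X2)
    # Phase transform: keep only the phase, whitening the cross-spectrum.
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)
    max_shift = n // 2
    # Rearrange so index 0 of the search window corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs
```

The sharp, whitened correlation peak is what makes CSP robust for the delay-based source position hypotheses described above; the delays from several pairs are then intersected to localize the talker.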
Large-scale cover song identification using chord profiles
This paper focuses on cover song identification in datasets potentially containing millions of songs. A compact representation of music content plays an important role in large-scale analysis and retrieval. The proposed approach is based on high-level summarization of songs using chord profiles. Search is performed in two steps. In the first step, the Locality Sensitive Hashing (LSH) method is used to retrieve songs with similar chord profiles. In the second step, the resulting list of songs is processed to progressively refine the ranking. Experiments conducted on both the Million Song Dataset (MSD) and a subset of the Second Hand Songs (SHS) dataset showed the effectiveness of the proposed solution, which provides state-of-the-art results.
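The first-step retrieval can be sketched with random-hyperplane LSH over chord-profile vectors: profiles with small angular distance tend to fall into the same bucket. The 24-dimensional profile, the bit count, and the bucket layout below are illustrative assumptions, not the paper's exact indexing scheme.

```python
import numpy as np

class ChordProfileLSH:
    """Random-hyperplane LSH index over chord-profile vectors."""

    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        # Each row is a random hyperplane; one sign bit per plane.
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = {}

    def _hash(self, v):
        bits = (self.planes @ v) > 0          # side of each hyperplane
        return bits.tobytes()                 # hashable bucket key

    def add(self, song_id, profile):
        self.buckets.setdefault(self._hash(profile), []).append(song_id)

    def query(self, profile):
        # Candidate list for the second, rank-refinement step.
        return self.buckets.get(self._hash(profile), [])
```

A query costs one matrix-vector product and a dictionary lookup, which is what makes the first step feasible at million-song scale; the shortlist it returns is then re-ranked exactly.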