Search CORE

279 research outputs found

Audio‐Visual Speaker Tracking

Author: Kılıç Volkan
Wang Wenwu
Publication venue: 'IntechOpen'
Publication date: 12/07/2017
Field of study

Target motion tracking found its application in interdisciplinary fields, including but not limited to surveillance and security, forensic science, intelligent transportation system, driving assistance, monitoring prohibited area, medical science, robotics, action and expression recognition, individual speaker discrimination in multi‐speaker environments and video conferencing in the fields of computer vision and signal processing. Among these applications, speaker tracking in enclosed spaces has been gaining relevance due to the widespread advances of devices and technologies and the necessity for seamless solutions in real‐time tracking and localization of speakers. However, speaker tracking is a challenging task in real‐life scenarios as several distinctive issues influence the tracking process, such as occlusions and an unknown number of speakers. One approach to overcome these issues is to use multi‐modal information, as it conveys complementary information about the state of the speakers compared to single‐modal tracking. To use multi‐modal information, several approaches have been proposed which can be classified into two categories, namely deterministic and stochastic. This chapter aims at providing multimedia researchers with a state‐of‐the‐art overview of tracking methods, which are used for combining multiple modalities to accomplish various multimedia analysis tasks, classifying them into different categories and listing new and future trends in this field

IntechOpen

Crossref

Towards End-to-End Acoustic Localization using Deep Learning: from Audio Signal to Source Position Coordinates

Author: Macias-Guarasa Javier
Pizarro Daniel
Vera-Diaz Juan Manuel
Publication venue: 'MDPI AG'
Publication date: 29/07/2018
Field of study

This paper presents a novel approach for indoor acoustic source localization using microphone arrays and based on a Convolutional Neural Network (CNN). The proposed solution is, to the best of our knowledge, the first published work in which the CNN is designed to directly estimate the three dimensional position of an acoustic source, using the raw audio signal as the input information avoiding the use of hand crafted audio features. Given the limited amount of available localization data, we propose in this paper a training strategy based on two steps. We first train our network using semi-synthetic data, generated from close talk speech recordings, and where we simulate the time delays and distortion suffered in the signal that propagates from the source to the array of microphones. We then fine tune this network using a small amount of real data. Our experimental results show that this strategy is able to produce networks that significantly improve existing localization methods based on \textit{SRP-PHAT} strategies. In addition, our experiments show that our CNN method exhibits better resistance against varying gender of the speaker and different window sizes compared with the other methods.Comment: 18 pages, 3 figures, 8 table

arXiv.org e-Print Archive

Multidisciplinary Digital Publishing Institute

Directory of Open Access Journals

To separate speech! a system for recognizing simultaneous speech

Author: Dietrich Klakow
Emilian Stoimenov
John Mcdonough
Kenichi Kumatani
Matthias Wölfel
Stefan Schacht
Tobias Gehrig
Uwe Mayer
Publication venue
Publication date: 01/01/2007
Field of study

Abstract. The PASCAL Speech Separation Challenge (SSC) is based on a corpus of sentences from the Wall Street Journal task read by two speakers simultaneously and captured with two circular eight-channel microphone arrays. This work describes our system for the recognition of such simultaneous speech. Our system has four principal components: A person tracker returns the locations of both active speakers, as well as segmentation information for each utterance, which are often of unequal length; two beamformers in generalized sidelobe canceller (GSC) configuration separate the simultaneous speech by setting their active weight vectors according to a minimum mutual information (MMI) criterion; a postfilter and binary mask operating on the outputs of the beamformers further enhance the separated speech; and finally an automatic speech recognition (ASR) engine based on a weighted finite-state transducer (WFST) returns the most likely word hypotheses for the separated streams. In addition to optimizing each of these components, we investigated the effect of the filter bank design used to perform subband analysis and synthesis during beamforming. On the SSC development data, our system achieved a word error rate of 39.6%

CiteSeerX

Audio source separation into the wild

Author: Aichner
Anguera Miro
Araki
Araki
Arberet
Arberet
Arberet
Attias
Avargel
Avargel
Badeau
Benaroya
Benesty
Bertrand
Bertrand
Bishop
Bustamante
Cardoso
Cemgil
Chazan
Chazan
Cherkassky
Cook
Cox
Crochiere
Dempster
DiBiase
Dillon
Doclo
Doclo
Drude
Duong
Duong
Dvorkind
Evers
Evers
Fallon
Feng
Févotte
Févotte
Gannot
Gannot
Gannot
Gilloire
Girgis
Girin
Habets
Hadad
Hershey
Higuchi
Higuchi
Higuchi
Hild
Hori
Ikram
Kamkar-Parsi
Kleijn
Kounades-Bastian
Kounades-Bastian
Kounades-Bastian
Kounades-Bastian
Kounades-Bastian
Koutras
Kowalski
Kuttruff
Laufer
Lee
Leglaive
Leglaive
Leglaive
Li
Li
Li
Li
Liutkus
Loesch
Loizou
Luo
Lyon
Löllmann
Ma
Malik
Mandel
Markovich
Markovich-Golan
Markovich-Golan
Markovich-Golan
Markovich-Golan
Marquardt
Mitianoudis
Mukai
Nakadai
Nakadai
Narayanan
Nesta
Nugraha
O'Connor
O'Grady
Ozerov
Ozerov
Ozerov
Parra
Parra
Parsons
Pedersen
Pertilä
Plumbley
Prieto
Roman
Roman
Sawada
Sawada
Schmid
Schmidt
Schwartz
Schwartz
Schwartz
Simon
Smaragdis
Sturmel
Talmon
Talmon
Thiergart
Thiergart
Valin
Van Trees
Vijayasenan
Vincent
Vincent
Wang
Wang
Wang
Wang
Warsitz
Wehr
Weinstein
Widrow
Winter
Yilmaz
Yoshioka
Zeng
Zhang
Publication venue: 'Elsevier BV'
Publication date: 16/11/2018
Field of study

International audienceThis review chapter is dedicated to multichannel audio source separation in real-life environment. We explore some of the major achievements in the field and discuss some of the remaining challenges. We will explore several important practical scenarios, e.g. moving sources and/or microphones, varying number of sources and sensors, high reverberation levels, spatially diffuse sources, and synchronization problems. Several applications such as smart assistants, cellular phones, hearing aids and robots, will be discussed. Our perspectives on the future of the field will be given as concluding remarks of this chapter

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Hal-Diderot

Multimodal methods for blind source separation of audio sources

Author: Syed M.R. Naqvi (7200659)
Publication venue
Publication date: 01/01/2009
Field of study

The enhancement of the performance of frequency domain convolutive blind source separation (FDCBSS) techniques when applied to the problem of separating audio sources recorded in a room environment is the focus of this thesis. This challenging application is termed the cocktail party problem and the ultimate aim would be to build a machine which matches the ability of a human being to solve this task. Human beings exploit both their eyes and their ears in solving this task and hence they adopt a multimodal approach, i.e. they exploit both audio and video modalities. New multimodal methods for blind source separation of audio sources are therefore proposed in this work as a step towards realizing such a machine. The geometry of the room environment is initially exploited to improve the separation performance of a FDCBSS algorithm. The positions of the human speakers are monitored by video cameras and this information is incorporated within the FDCBSS algorithm in the form of constraints added to the underlying cross-power spectral density matrix-based cost function which measures separation performance. [Continues.

Loughborough University Institutional Repository