Real-Time Sound Source Localization in Videoconferencing Environments
[EN] Sound Source Localization (SSL) mechanisms have been extensively studied. Many applications like
teleconferencing or speech enhancement systems require the localization of one or more acoustic
sources. Moreover, it is essential to localize sources also in noisy and reverberant environments. It
has been shown that computing the Steered Response Power (SRP) is a more robust approach than two-stage,
direct time-difference-of-arrival methods. The problem with computing the SRP is that a fine
grid-search procedure is needed, which is too expensive for a real-time system. To this end, a new
strategy (the modified SRP-PHAT functional) has been introduced, which can be used in a real-time system
with a low computational cost. Moreover, it has been demonstrated that the statistical distribution of
location estimates when a speaker is active can be successfully used to discriminate between speech and
non-speech frames. The main objective of this work is to describe our new localization approach and
integrate it into a real-time speaker localization and detection system. The applicability of the method
will be shown for a real videoconferencing environment using an acoustically-driven steering camera.

Martí Guerola, A. (2010). Real-Time Sound Source Localization in Videoconferencing Environments. http://hdl.handle.net/10251/27143
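The baseline these abstracts build on is the conventional SRP-PHAT: compute a PHAT-weighted Generalized Cross-Correlation (GCC) per microphone pair, then sum the GCC values at each candidate point's theoretical TDOA over a fine spatial grid. A minimal sketch of that pipeline, where the sample rate, two-microphone geometry, and function names are illustrative rather than taken from the thesis:

```python
import numpy as np

FS = 16000          # sample rate (Hz), illustrative
C = 343.0           # speed of sound (m/s)

def gcc_phat(x1, x2, n_fft=1024):
    """GCC-PHAT: cross-power spectrum whitened by its magnitude.
    Returns the correlation re-centred so lag 0 sits at index n_fft//2."""
    X1 = np.fft.rfft(x1, n_fft)
    X2 = np.fft.rfft(x2, n_fft)
    cps = X1 * np.conj(X2)
    cps /= np.abs(cps) + 1e-12                       # PHAT weighting
    cc = np.fft.irfft(cps, n_fft)
    return np.concatenate((cc[-n_fft // 2:], cc[:n_fft // 2]))

def srp_grid(frames, mics, grid, n_fft=1024):
    """Fine grid search: accumulate each pair's GCC at the TDOA
    implied by every candidate point (this is the costly step)."""
    pairs = [(i, j) for i in range(len(mics)) for j in range(i + 1, len(mics))]
    ccs = {p: gcc_phat(frames[p[0]], frames[p[1]], n_fft) for p in pairs}
    srp = np.zeros(len(grid))
    for k, q in enumerate(grid):
        for (i, j), cc in ccs.items():
            tau = (np.linalg.norm(q - mics[i]) - np.linalg.norm(q - mics[j])) / C
            idx = int(round(tau * FS)) + n_fft // 2  # centred lag index
            if 0 <= idx < n_fft:
                srp[k] += cc[idx]
    return srp
```

The estimated source position is the grid point maximizing `srp`; the cost grows with the product of grid points and microphone pairs, which is exactly why the papers below coarsen the grid.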
Speaker Localization and Detection in Videoconferencing Environments Using a Modified SRP-PHAT Algorithm
[EN] The Steered Response Power - Phase Transform (SRP-PHAT) algorithm has been shown to be one of the most robust sound source localization approaches operating in noisy and reverberant environments. However, its practical implementation is usually based on a costly fine grid-search procedure, making the computational cost of the method a real issue. In this paper, we introduce an effective strategy which performs a full exploration of the sampled
space rather than computing the SRP at discrete spatial positions, increasing its robustness and allowing for a coarser spatial grid that reduces the computational cost required in a practical implementation. The modified SRP-PHAT functional has been successfully implemented in a real-time speaker localization system for multi-participant videoconferencing environments. Moreover, a localization-based speech/non-speech frame discriminator is presented.

This work was supported by the Ministry of Education and Science under the project TEC2009-14414-C03-01.

Martí Guerola, A.; Cobos Serrano, M.; Aguilera Martí, E.; López Monfort, JJ. (2011). Speaker Localization and Detection in Videoconferencing Environments Using a Modified SRP-PHAT Algorithm. Waves. 3:40-47. http://hdl.handle.net/10251/57648
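The discriminator rests on the observation, also made in the first abstract, that location estimates cluster tightly around an active speaker but scatter across the room during silence. A minimal sketch of that idea, with a window-based spread statistic; the threshold and window length are illustrative assumptions, not the values used in the paper:

```python
import numpy as np

def is_speech(estimates, max_spread=0.3):
    """Classify a window of 3-D location estimates (shape [n, 3]) as
    speech when their mean distance to the window centroid stays below
    max_spread metres (tight cluster = active speaker)."""
    pts = np.asarray(estimates, dtype=float)
    spread = np.linalg.norm(pts - pts.mean(axis=0), axis=1).mean()
    return spread < max_spread
```

In practice the threshold would be tuned to the room size and the localizer's variance; any dispersion statistic (variance, median absolute deviation) could stand in for the mean distance used here.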
An Immersive Multi-Party Conferencing System for Mobile Devices Using 3D Binaural Audio
[EN] The use of mobile telephony, along with the widespread
adoption of smartphones in the consumer market, is gradually displacing
traditional telephony. Fixed-line telephone conference
calls have been widely employed for carrying out
distributed meetings around the world in the last decades.
However, the powerful characteristics brought by
modern mobile devices and data networks allow for new
conferencing schemes based on immersive communication,
one of the fields having major commercial and technical
interest within the telecommunications industry today.
In this context, adding spatial audio features into conventional
conferencing systems is a natural way of creating
a realistic communication environment. In fact, the
human auditory system takes advantage of spatial audio
cues to locate, separate and understand multiple speakers
when they talk simultaneously. As a result, speech
intelligibility is significantly improved if the speakers are
simulated to be spatially distributed. This paper describes
the development of a new immersive multi-party conference
call service for mobile devices (smartphones and
tablets) that substantially improves the identification and
intelligibility of the participants. Headphone-based audio
reproduction and binaural sound processing algorithms
allow the user to locate the different speakers within a
virtual meeting room. Moreover, the use of a large touch
screen helps the user to identify and remember the participants
taking part in the conference, with the possibility
of changing their spatial location in an interactive
way.

This work has been partially supported by the government of Spain grant TEC-2009-14414-C03-01 and by the new technologies department of Telefónica.

Aguilera Martí, E.; López Monfort, JJ.; Cobos Serrano, M.; Macià Pina, L.; Martí Guerola, A. (2012). An Immersive Multi-Party Conferencing System for Mobile Devices Using 3D Binaural Audio. Waves. 4:5-14. http://hdl.handle.net/10251/57918
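The core rendering idea is that each talker is assigned a virtual direction and the headphone mix reproduces the interaural cues the brain uses to separate them. A crude binaural sketch using only interaural time and level differences; a real system such as the one described would convolve with measured HRTFs, and the constants here are textbook approximations, not the paper's processing:

```python
import numpy as np

FS = 16000             # sample rate (Hz), illustrative
HEAD_RADIUS = 0.0875   # average head radius (m)
C = 343.0              # speed of sound (m/s)

def spatialize(mono, azimuth_deg):
    """Return (left, right) for a source at azimuth_deg
    (0 = front, +90 = right) using Woodworth's ITD model
    and a simple broadband ILD."""
    az = np.radians(azimuth_deg)
    itd = HEAD_RADIUS / C * (az + np.sin(az))   # Woodworth ITD (s)
    shift = int(round(abs(itd) * FS))           # ITD in samples
    ild = 10 ** (-6.0 * abs(np.sin(az)) / 20)   # up to ~6 dB attenuation
    delayed = np.concatenate((np.zeros(shift), mono))
    direct = np.concatenate((mono, np.zeros(shift)))
    if azimuth_deg >= 0:   # source on the right: far (left) ear lags
        return delayed * ild, direct
    return direct, delayed * ild

def mix(talkers):
    """Sum several (signal, azimuth) pairs into one stereo stream."""
    rendered = [spatialize(s, az) for s, az in talkers]
    n = max(len(l) for l, _ in rendered)
    left = sum(np.pad(l, (0, n - len(l))) for l, _ in rendered)
    right = sum(np.pad(r, (0, n - len(r))) for _, r in rendered)
    return left, right
```

Placing simultaneous talkers at distinct azimuths with `mix` is what yields the intelligibility gain the abstract describes; the interactive touch-screen layout simply feeds new azimuths into the renderer.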
A modified SRP-PHAT functional for robust real-time sound source localization with scalable spatial sampling
The Steered Response Power – Phase Transform
(SRP-PHAT) algorithm has been shown to be one of the most robust
sound source localization approaches operating in noisy and
reverberant environments. However, its practical implementation
is usually based on a costly fine grid-search procedure, making
the computational cost of the method a real issue. In this letter,
we introduce an effective strategy that extends the conventional
SRP-PHAT functional with the aim of considering the volume
surrounding the discrete locations of the spatial grid. As a result,
the modified functional performs a full exploration of the sampled
space rather than computing the SRP at discrete spatial positions,
increasing its robustness and allowing for a coarser spatial grid.
To this end, the Generalized Cross-Correlation (GCC) function
corresponding to each microphone pair must be properly accumulated
according to the defined microphone setup. Experiments
carried out under different acoustic conditions confirm the validity
of the proposed approach.

Manuscript received September 06, 2010; revised October 22, 2010; accepted October 27, 2010. Date of publication November 11, 2010; date of current version December 16, 2010. This work was supported by the Spanish Ministry of Science and Innovation under the project TEC2009-14414-C03-01. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Constantine L. Kotropoulos.

Cobos Serrano, M.; Martí Guerola, A.; López Monfort, JJ. (2011). A modified SRP-PHAT functional for robust real-time sound source localization with scalable spatial sampling. IEEE Signal Processing Letters. 18:71-74. doi:10.1109/LSP.2010.2091502
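The key change over the conventional functional: instead of sampling each pair's GCC at the single TDOA of a grid point, the GCC is accumulated over the whole interval of TDOAs spanned by the volume around that point, so a source falling between coarse grid points still contributes. A sketch for one microphone pair, approximating the cell's TDOA interval from its corner points; all names and the corner-based bound are illustrative assumptions, not the letter's exact formulation:

```python
import numpy as np

FS, C = 16000, 343.0   # sample rate (Hz) and speed of sound (m/s), illustrative

def cell_tdoa_bounds(cell_corners, m1, m2):
    """TDOA interval (in samples) spanned by a cell, approximated
    by evaluating the TDOA at the cell's corner points."""
    taus = [(np.linalg.norm(q - m1) - np.linalg.norm(q - m2)) / C * FS
            for q in cell_corners]
    return min(taus), max(taus)

def modified_srp_cell(cc, cell_corners, m1, m2):
    """Accumulate the centred GCC (lag 0 at index len(cc)//2) over the
    cell's whole TDOA interval instead of at a single lag."""
    n = len(cc)
    lo, hi = cell_tdoa_bounds(cell_corners, m1, m2)
    lo = max(0, int(np.floor(lo)) + n // 2)
    hi = min(n - 1, int(np.ceil(hi)) + n // 2)
    return cc[lo:hi + 1].sum()
```

Summing `modified_srp_cell` over all microphone pairs gives the cell's accumulated value; because every GCC lag inside the interval counts, the grid can be made much coarser without the peak slipping between points.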
Multichannel audio processing for speaker localization, separation and enhancement
This thesis is related to the field of acoustic signal processing and its applications to emerging
communication environments. Acoustic signal processing is a very wide research area covering
the design of signal processing algorithms involving one or several acoustic signals to perform
a given task, such as locating the sound source that originated the acquired signals, improving
their signal to noise ratio, separating signals of interest from a set of interfering sources or recognizing
the type of source and the content of the message. Among the above tasks, Sound Source
localization (SSL) and Automatic Speech Recognition (ASR) have been specifically addressed in
this thesis. In fact, the localization of sound sources in a room has received a lot of attention in
the last few decades. Most real-world microphone array applications require the localization of one
or more active sound sources in adverse environments (low signal-to-noise ratio and high reverberation).
Some of these applications are teleconferencing systems, video-gaming, autonomous
robots, remote surveillance, hands-free speech acquisition, etc. Indeed, performing robust sound
source localization under high noise and reverberation is a very challenging task. One of the
most well-known algorithms for source localization in noisy and reverberant environments is
the Steered Response Power - Phase Transform (SRP-PHAT) algorithm, which constitutes the
baseline framework for the contributions proposed in this thesis. Another challenge in the design
of SSL algorithms is to achieve real-time performance and high localization accuracy with a reasonable
number of microphones and limited computational resources. Although the SRP-PHAT
algorithm has been shown to be an effective localization algorithm for real-world environments,
its practical implementation is usually based on a costly fine grid-search procedure, making the
computational cost of the method a real issue. In this context, several modifications and optimizations
have been proposed to improve its performance and applicability. An effective strategy
that extends the conventional SRP-PHAT functional is presented in this thesis. This approach
performs a full exploration of the sampled space rather than computing the SRP at discrete spatial
positions, increasing its robustness and allowing for a coarser spatial grid that reduces the
computational cost required in a practical implementation with a small hardware cost (reduced
number of microphones). This strategy makes it possible to implement real-time applications based on
location information, such as automatic camera steering or the detection of speech/non-speech
fragments in advanced videoconferencing systems.
As stated before, besides the contributions related to SSL, this thesis is also related to the
field of ASR. This technology allows a computer or electronic device to identify the words spoken
by a person so that the message can be stored or processed in a useful way. ASR is used on
a day-to-day basis in a number of applications and services such as natural human-machine
interfaces, dictation systems, electronic translators and automatic information desks. However,
there are still some challenges to be solved. A major problem in ASR is to recognize people
speaking in a room by using distant microphones. In distant-speech recognition, the microphone
not only receives the direct-path signal, but also delayed replicas as a result of multi-path
propagation. Moreover, there are multiple situations in teleconferencing meetings when multiple
speakers talk simultaneously. In this context, when multiple speaker signals are present, Sound
Source Separation (SSS) methods can be successfully employed to improve ASR performance
in multi-source scenarios. This is the motivation behind the training method for multi-talker
situations proposed in this thesis. This training, which is based on a robust transformed model
constructed from separated speech in diverse acoustic environments, makes use of a SSS method
as a speech enhancement stage that suppresses the unwanted interferences. The combination
of source separation and this specific training has been explored and evaluated under different
acoustical conditions, leading to improvements of up to 35% in ASR performance.

Martí Guerola, A. (2013). Multichannel audio processing for speaker localization, separation and enhancement [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/33101
A steered response power iterative method for high-accuracy acoustic source localization
Source localization using the steered response power (SRP) usually requires a costly grid-search procedure. To address this issue, a modified SRP algorithm was recently introduced, providing improved robustness when using coarser spatial grids. In this letter, an iterative method based on the modified SRP is presented. A coarse spatial grid is initially evaluated with the modified SRP, selecting the point with the highest accumulated value. Then, its corresponding volume is iteratively decomposed by using a finer spatial grid. Experiments have shown that this method provides almost the same accuracy as the fine-grid search with a substantial reduction of functional evaluations. (C) 2013 Acoustical Society of America.

The Spanish Ministry of Economy and Competitiveness and FEDER supported this work under the projects TEC2012-37945-C02-01/02.

Martí Guerola, A.; Cobos Serrano, M.; López Monfort, JJ.; Escolano Carrasco, J. (2013). A steered response power iterative method for high-accuracy acoustic source localization. Journal of the Acoustical Society of America. 134(4):2627-2630. https://doi.org/10.1121/1.4820885
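The coarse-to-fine loop described above can be sketched as follows: evaluate a coarse grid with an accumulated-SRP functional, keep the best cell, subdivide only that cell, and repeat until the cell is small enough. The sketch is 1-D for clarity and uses a toy functional with a known maximum in place of the modified SRP; the subdivision factor and stopping size are illustrative assumptions:

```python
import numpy as np

def iterative_search(srp_of_cell, lo, hi, cells_per_dim=4, min_size=0.01):
    """Coarse-to-fine search on the interval [lo, hi]: repeatedly score
    the sub-cells with srp_of_cell(a, b), descend into the best one,
    and stop when the surviving cell is smaller than min_size."""
    while hi - lo > min_size:
        edges = np.linspace(lo, hi, cells_per_dim + 1)
        scores = [srp_of_cell(edges[i], edges[i + 1])
                  for i in range(cells_per_dim)]
        best = int(np.argmax(scores))
        lo, hi = edges[best], edges[best + 1]
    return 0.5 * (lo + hi)            # centre of the final cell

# Toy accumulated functional: all SRP "mass" lies in [1.36, 1.38],
# so the overlap of a cell with that interval plays the role of the
# accumulated GCC value.
toy = lambda a, b: max(0.0, min(b, 1.38) - max(a, 1.36))
```

Each iteration costs only `cells_per_dim` functional evaluations, so reaching a resolution of `min_size` takes logarithmically many evaluations instead of the linear count a uniform fine grid would need, which is the reduction the abstract reports.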