6 research outputs found

    Real-Time Sound Source Localization in Videoconferencing Environments

    [EN] Sound Source Localization (SSL) mechanisms have been extensively studied. Many applications, such as teleconferencing or speech enhancement systems, require the localization of one or more acoustic sources. Moreover, it is essential to localize sources even in noisy and reverberant environments. Computing the Steered Response Power (SRP) has been shown to be a more robust approach than two-stage, direct time-difference-of-arrival methods. The problem with computing the SRP is that a fine grid search is needed, which is too expensive for a real-time system. To this end, a new strategy (a modified SRP-PHAT functional) has been introduced that can be used in a real-time system at a low computational cost. Moreover, it has been demonstrated that the statistical distribution of location estimates when a speaker is active can be successfully used to discriminate between speech and non-speech frames. The main objective of this work is to describe our new localization approach and integrate it into a real-time speaker localization and detection system. The applicability of the method is shown in a real videoconferencing environment using an acoustically driven steering camera. Martí Guerola, A. (2010). Real-Time Sound Source Localization in Videoconferencing Environments. http://hdl.handle.net/10251/27143

    Speaker Localization and Detection in Videoconferencing Environments Using a Modified SRP-PHAT Algorithm

    [EN] The Steered Response Power - Phase Transform (SRP-PHAT) algorithm has been shown to be one of the most robust sound source localization approaches operating in noisy and reverberant environments. However, its practical implementation is usually based on a costly fine grid-search procedure, making the computational cost of the method a real issue. In this paper, we introduce an effective strategy which performs a full exploration of the sampled space rather than computing the SRP at discrete spatial positions, increasing its robustness and allowing for a coarser spatial grid that reduces the computational cost required in a practical implementation. The modified SRP-PHAT functional has been successfully implemented in a real-time speaker localization system for multiparticipant videoconferencing environments. Moreover, a localization-based speech/non-speech frame discriminator is presented. This work was supported by the Ministry of Education and Science under the project TEC2009-14414-C03-01. Martí Guerola, A.; Cobos Serrano, M.; Aguilera Martí, E.; López Monfort, JJ. (2011). Speaker Localization and Detection in Videoconferencing Environments Using a Modified SRP-PHAT Algorithm. Waves. 3:40-47. http://hdl.handle.net/10251/57648

    An Immersive Multi-Party Conferencing System for Mobile Devices Using 3D Binaural Audio

    [EN] The use of mobile telephony, along with the widespread adoption of smartphones in the consumer market, is gradually displacing traditional telephony. Fixed-line telephone conference calls have been widely employed for carrying out distributed meetings around the world in recent decades. However, the powerful features of modern mobile devices and data networks allow for new conferencing schemes based on immersive communication, one of the fields of major commercial and technical interest within the telecommunications industry today. In this context, adding spatial audio features to conventional conferencing systems is a natural way of creating a realistic communication environment. In fact, the human auditory system takes advantage of spatial audio cues to locate, separate and understand multiple speakers when they talk simultaneously. As a result, speech intelligibility is significantly improved if the speakers are simulated to be spatially distributed. This paper describes the development of a new immersive multi-party conference call service for mobile devices (smartphones and tablets) that substantially improves the identification and intelligibility of the participants. Headphone-based audio reproduction and binaural sound processing algorithms allow the user to locate the different speakers within a virtual meeting room. Moreover, the use of a large touch screen helps the user to identify and remember the participants taking part in the conference, with the possibility of changing their spatial location interactively. This work has been partially supported by the government of Spain grant TEC-2009-14414-C03-01 and by the new technologies department of Telefónica. Aguilera Martí, E.; López Monfort, JJ.; Cobos Serrano, M.; Macià Pina, L.; Martí Guerola, A. (2012). An Immersive Multi-Party Conferencing System for Mobile Devices Using 3D Binaural Audio. Waves. 4:5-14. http://hdl.handle.net/10251/57918

    A modified SRP-PHAT functional for robust real-time sound source localization with scalable spatial sampling

    The Steered Response Power – Phase Transform (SRP-PHAT) algorithm has been shown to be one of the most robust sound source localization approaches operating in noisy and reverberant environments. However, its practical implementation is usually based on a costly fine grid-search procedure, making the computational cost of the method a real issue. In this letter, we introduce an effective strategy that extends the conventional SRP-PHAT functional with the aim of considering the volume surrounding the discrete locations of the spatial grid. As a result, the modified functional performs a full exploration of the sampled space rather than computing the SRP at discrete spatial positions, increasing its robustness and allowing for a coarser spatial grid. To this end, the Generalized Cross-Correlation (GCC) function corresponding to each microphone pair must be properly accumulated according to the defined microphone setup. Experiments carried out under different acoustic conditions confirm the validity of the proposed approach. Manuscript received September 06, 2010; revised October 22, 2010; accepted October 27, 2010. Date of publication November 11, 2010; date of current version December 16, 2010. The Spanish Ministry of Science and Innovation supported this work under the project TEC2009-14414-C03-01. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Constantine L. Kotropoulos. Cobos Serrano, M.; Martí Guerola, A.; López Monfort, JJ. (2011). A modified SRP-PHAT functional for robust real-time sound source localization with scalable spatial sampling. IEEE Signal Processing Letters. 18:71-74. doi:10.1109/LSP.2010.2091502
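
The accumulation idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 2D geometry, the sampling rate, the speed of sound, and the idealized single-peak GCC are all assumptions made for illustration. In particular, the lag bounds of a cell are approximated here from the TDOAs at the cell corners, which is a simplification chosen for this sketch and not the bound derivation used in the letter.

```python
import math
from itertools import product

SPEED_OF_SOUND = 343.0  # m/s (assumed value)
FS = 16000              # sampling rate in Hz (assumed value)

def tdoa_samples(point, mic_a, mic_b, fs=FS):
    """Time difference of arrival between two microphones, in samples."""
    return (math.dist(point, mic_a) - math.dist(point, mic_b)) / SPEED_OF_SOUND * fs

def lag_bounds(center, half, mic_a, mic_b, fs=FS):
    """Approximate the lag interval covered by a cell from its corner TDOAs.

    Assumption of this sketch: the extreme TDOAs occur at the cell corners.
    """
    corners = product(*[(c - h, c + h) for c, h in zip(center, half)])
    lags = [tdoa_samples(p, mic_a, mic_b, fs) for p in corners]
    return math.floor(min(lags)), math.ceil(max(lags))

def modified_srp(center, half, mic_pairs, gccs, fs=FS):
    """Sum each pair's GCC over all lags that map into the cell's volume.

    gccs[k] is a dict {integer lag: GCC value} for mic_pairs[k].
    """
    total = 0.0
    for (mic_a, mic_b), gcc in zip(mic_pairs, gccs):
        lo, hi = lag_bounds(center, half, mic_a, mic_b, fs)
        total += sum(gcc.get(lag, 0.0) for lag in range(lo, hi + 1))
    return total

def point_srp(point, mic_pairs, gccs, fs=FS):
    """Conventional SRP: read each GCC at the single lag of one grid point."""
    return sum(gcc.get(round(tdoa_samples(point, a, b, fs)), 0.0)
               for (a, b), gcc in zip(mic_pairs, gccs))

# Toy setup: one microphone pair and an idealized GCC with a single unit peak
# at the true source lag (a stand-in for a measured GCC-PHAT).
mic_pairs = [((0.0, 0.0), (0.5, 0.0))]
source = (1.0, 2.0)
gccs = [{round(tdoa_samples(source, *mic_pairs[0])): 1.0}]

half = (0.5, 0.5)                  # coarse 1 m x 1 m cells
near_cell = (1.25, 2.25)           # cell that contains the source
far_cell = (3.25, 0.75)            # cell far from the source

near_score = modified_srp(near_cell, half, mic_pairs, gccs)
far_score = modified_srp(far_cell, half, mic_pairs, gccs)
point_score = point_srp(near_cell, mic_pairs, gccs)
```

In this toy setup the conventional functional evaluated at the centre of the coarse cell misses the GCC peak, while the accumulated version still assigns the peak to the cell containing the source. That is the property that allows a coarser grid.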

    Multichannel audio processing for speaker localization, separation and enhancement

    This thesis is related to the field of acoustic signal processing and its applications to emerging communication environments. Acoustic signal processing is a very wide research area covering the design of signal processing algorithms involving one or several acoustic signals to perform a given task, such as locating the sound source that originated the acquired signals, improving their signal-to-noise ratio, separating signals of interest from a set of interfering sources or recognizing the type of source and the content of the message. Among the above tasks, Sound Source Localization (SSL) and Automatic Speech Recognition (ASR) have been especially addressed in this thesis. In fact, the localization of sound sources in a room has received a lot of attention in recent decades. Most real-world microphone array applications require the localization of one or more active sound sources in adverse environments (low signal-to-noise ratio and high reverberation). Some of these applications are teleconferencing systems, video-gaming, autonomous robots, remote surveillance, hands-free speech acquisition, etc. Indeed, performing robust sound source localization under high noise and reverberation is a very challenging task. One of the most well-known algorithms for source localization in noisy and reverberant environments is the Steered Response Power - Phase Transform (SRP-PHAT) algorithm, which constitutes the baseline framework for the contributions proposed in this thesis. Another challenge in the design of SSL algorithms is to achieve real-time performance and high localization accuracy with a reasonable number of microphones and limited computational resources. Although the SRP-PHAT algorithm has been shown to be an effective localization algorithm for real-world environments, its practical implementation is usually based on a costly fine grid-search procedure, making the computational cost of the method a real issue. 
In this context, several modifications and optimizations have been proposed to improve its performance and applicability. An effective strategy that extends the conventional SRP-PHAT functional is presented in this thesis. This approach performs a full exploration of the sampled space rather than computing the SRP at discrete spatial positions, increasing its robustness and allowing for a coarser spatial grid that reduces the computational cost required in a practical implementation with a small hardware cost (reduced number of microphones). This strategy makes it possible to implement real-time applications based on location information, such as automatic camera steering or the detection of speech/non-speech fragments in advanced videoconferencing systems. As stated before, besides the contributions related to SSL, this thesis is also related to the field of ASR. This technology allows a computer or electronic device to identify the words spoken by a person so that the message can be stored or processed in a useful way. ASR is used on a day-to-day basis in a number of applications and services such as natural human-machine interfaces, dictation systems, electronic translators and automatic information desks. However, there are still some challenges to be solved. A major problem in ASR is to recognize people speaking in a room by using distant microphones. In distant-speech recognition, the microphone receives not only the direct-path signal but also delayed replicas resulting from multi-path propagation. Moreover, there are many situations in teleconferencing meetings in which multiple speakers talk simultaneously. In this context, when multiple speaker signals are present, Sound Source Separation (SSS) methods can be successfully employed to improve ASR performance in multi-source scenarios. This is the motivation behind the training method for multiple talk situations proposed in this thesis. 
This training, which is based on a robust transformed model constructed from separated speech in diverse acoustic environments, makes use of an SSS method as a speech enhancement stage that suppresses the unwanted interferences. The combination of source separation and this specific training has been explored and evaluated under different acoustical conditions, leading to improvements of up to 35% in ASR performance. Martí Guerola, A. (2013). Multichannel audio processing for speaker localization, separation and enhancement [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/33101

    A steered response power iterative method for high-accuracy acoustic source localization

    Source localization using the steered response power (SRP) usually requires a costly grid-search procedure. To address this issue, a modified SRP algorithm was recently introduced, providing improved robustness when using coarser spatial grids. In this letter, an iterative method based on the modified SRP is presented. A coarse spatial grid is initially evaluated with the modified SRP, selecting the point with the highest accumulated value. Then, its corresponding volume is iteratively decomposed by using a finer spatial grid. Experiments have shown that this method provides almost the same accuracy as the fine-grid search with a substantial reduction in the number of functional evaluations. (C) 2013 Acoustical Society of America. The Spanish Ministry of Economy and Competitiveness and FEDER supported this work under the projects TEC2012-37945-C02-01/02. Martí Guerola, A.; Cobos Serrano, M.; López Monfort, JJ.; Escolano Carrasco, J. (2013). A steered response power iterative method for high-accuracy acoustic source localization. Journal of the Acoustical Society of America. 134(4):2627-2630. https://doi.org/10.1121/1.4820885
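
The coarse-to-fine procedure described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the published method: `coarse_to_fine_search`, `toy_cell_score`, the 2D room, the number of splits, and the iteration count are all hypothetical. The toy score merely stands in for a volume-aware functional such as the modified SRP, in that it is highest for the cell that contains the source.

```python
import math
from itertools import product

def coarse_to_fine_search(score, lo, hi, splits=4, iterations=5):
    """Repeatedly grid the current box, keep the best cell, and subdivide it.

    score(center, half_widths) -> float is a stand-in for a functional that
    accumulates evidence over the whole cell around `center`.
    Returns the final cell centre and the number of score evaluations.
    """
    lo, hi = list(lo), list(hi)
    evaluations = 0
    best_center = None
    for _ in range(iterations):
        widths = [(h - l) / splits for l, h in zip(lo, hi)]
        half = [w / 2 for w in widths]
        best_value = float("-inf")
        for idx in product(range(splits), repeat=len(lo)):
            center = [l + (i + 0.5) * w for l, i, w in zip(lo, idx, widths)]
            value = score(center, half)
            evaluations += 1
            if value > best_value:
                best_value, best_center = value, center
        # The winning cell becomes the search box of the next, finer pass.
        lo = [c - h for c, h in zip(best_center, half)]
        hi = [c + h for c, h in zip(best_center, half)]
    return best_center, evaluations

def toy_cell_score(center, half, source=(1.37, 2.71)):
    """Hypothetical score: 0 if the cell contains the source, negative otherwise."""
    gaps = [max(abs(c - s) - h, 0.0) for c, s, h in zip(center, source, half)]
    return -math.hypot(*gaps)

# Assumed 4 m x 6 m room, searched in 2D for brevity.
estimate, evaluations = coarse_to_fine_search(toy_cell_score, (0.0, 0.0), (4.0, 6.0))
```

With 4 splits per axis and 5 iterations, the sketch performs 80 evaluations and ends with cells of 1/1024 of the room size per axis, whereas a uniform grid at that final resolution would need 1024 × 1024 evaluations. The refinement only works because the score is informative for a whole cell and not just its centre point.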