Search CORE

741 research outputs found

Towards End-to-End Acoustic Localization using Deep Learning: from Audio Signal to Source Position Coordinates

Author: Macias-Guarasa Javier
Pizarro Daniel
Vera-Diaz Juan Manuel
Publication venue: 'MDPI AG'
Publication date: 29/07/2018
Field of study

This paper presents a novel approach for indoor acoustic source localization using microphone arrays and based on a Convolutional Neural Network (CNN). The proposed solution is, to the best of our knowledge, the first published work in which the CNN is designed to directly estimate the three dimensional position of an acoustic source, using the raw audio signal as the input information avoiding the use of hand crafted audio features. Given the limited amount of available localization data, we propose in this paper a training strategy based on two steps. We first train our network using semi-synthetic data, generated from close talk speech recordings, and where we simulate the time delays and distortion suffered in the signal that propagates from the source to the array of microphones. We then fine tune this network using a small amount of real data. Our experimental results show that this strategy is able to produce networks that significantly improve existing localization methods based on \textit{SRP-PHAT} strategies. In addition, our experiments show that our CNN method exhibits better resistance against varying gender of the speaker and different window sizes compared with the other methods.Comment: 18 pages, 3 figures, 8 table

arXiv.org e-Print Archive

Multidisciplinary Digital Publishing Institute

Directory of Open Access Journals

FinBTech: Blockchain-Based Video and Voice Authentication System for Enhanced Security in Financial Transactions Utilizing FaceNet512 and Gaussian Mixture Models

Author: Laila Prof N. Jeenath
Tamilpavai Dr G.
Publication venue
Publication date: 28/10/2023
Field of study

In the digital age, it is crucial to make sure that financial transactions are as secure and reliable as possible. This abstract offers a ground-breaking method that combines smart contracts, blockchain technology, FaceNet512 for improved face recognition, and Gaussian Mixture Models (GMM) for speech authentication to create a system for video and audio verification that is unmatched. Smart contracts and the immutable ledger of the blockchain are combined to offer a safe and open environment for financial transactions. FaceNet512 and GMM offer multi-factor biometric authentication simultaneously, enhancing security to new heights. By combining cutting-edge technology, this system offers a strong defense against identity theft and illegal access, establishing a new benchmark for safe financial transactions

arXiv.org e-Print Archive

Acoustic localization of people in reverberant environments using deep learning techniques

Author: Vera Díaz Juan Manuel
Publication venue
Publication date: 01/01/2023
Field of study

La localización de las personas a partir de información acústica es cada vez más importante en aplicaciones del mundo real como la seguridad, la vigilancia y la interacción entre personas y robots. En muchos casos, es necesario localizar con precisión personas u objetos en función del sonido que generan, especialmente en entornos ruidosos y reverberantes en los que los métodos de localización tradicionales pueden fallar, o en escenarios en los que los métodos basados en análisis de vídeo no son factibles por no disponer de ese tipo de sensores o por la existencia de oclusiones relevantes. Por ejemplo, en seguridad y vigilancia, la capacidad de localizar con precisión una fuente de sonido puede ayudar a identificar posibles amenazas o intrusos. En entornos sanitarios, la localización acústica puede utilizarse para controlar los movimientos y actividades de los pacientes, especialmente los que tienen problemas de movilidad. En la interacción entre personas y robots, los robots equipados con capacidades de localización acústica pueden percibir y responder mejor a su entorno, lo que permite interacciones más naturales e intuitivas con los humanos. Por lo tanto, el desarrollo de sistemas de localización acústica precisos y robustos utilizando técnicas avanzadas como el aprendizaje profundo es de gran importancia práctica. Es por esto que en esta tesis doctoral se aborda dicho problema en tres líneas de investigación fundamentales: (i) El diseño de un sistema extremo a extremo (end-to-end) basado en redes neuronales capaz de mejorar las tasas de localización de sistemas ya existentes en el estado del arte. (ii) El diseño de un sistema capaz de localizar a uno o varios hablantes simultáneos en entornos con características y con geometrías de arrays de sensores diferentes sin necesidad de re-entrenar. (iii) El diseño de sistemas capaces de refinar los mapas de potencia acústica necesarios para localizar a las fuentes acústicas para conseguir una mejor localización posterior. A la hora de evaluar la consecución de dichos objetivos se han utilizado diversas bases de datos realistas con características diferentes, donde las personas involucradas en las escenas pueden actuar sin ningún tipo de restricción. Todos los sistemas propuestos han sido evaluados bajo las mismas condiciones consiguiendo superar en términos de error de localización a los sistemas actuales del estado del arte

e_Buah - Biblioteca Digital de la Universidad de Alcalá

MetaPortrait: Identity-Preserving Talking Head Generation with Fast Personalized Adaptation

Author: Chen Dong
Chen Qifeng
Qi Chenyang
Wang Yong
Wen Fang
Wu HsiangTao
Zhang Bo
Zhang Bowen
Zhang Pan
Publication venue
Publication date: 26/03/2023
Field of study

In this work, we propose an ID-preserving talking head generation framework, which advances previous methods in two aspects. First, as opposed to interpolating from sparse flow, we claim that dense landmarks are crucial to achieving accurate geometry-aware flow fields. Second, inspired by face-swapping methods, we adaptively fuse the source identity during synthesis, so that the network better preserves the key characteristics of the image portrait. Although the proposed model surpasses prior generation fidelity on established benchmarks, to further make the talking head generation qualified for real usage, personalized fine-tuning is usually needed. However, this process is rather computationally demanding that is unaffordable to standard users. To solve this, we propose a fast adaptation model using a meta-learning approach. The learned model can be adapted to a high-quality personalized model as fast as 30 seconds. Last but not the least, a spatial-temporal enhancement module is proposed to improve the fine details while ensuring temporal coherency. Extensive experiments prove the significant superiority of our approach over the state of the arts in both one-shot and personalized settings.Comment: CVPR 2023, project page: https://meta-portrait.github.i

arXiv.org e-Print Archive