UniX-Encoder: A Universal X-Channel Speech Encoder for Ad-Hoc Microphone Array Speech Processing
The speech field is evolving to solve more challenging scenarios, such as
multi-channel recordings with multiple simultaneous talkers. Given the many
types of microphone setups out there, we present the UniX-Encoder. It's a
universal encoder designed for multiple tasks that works with any microphone
array, in both solo and multi-talker environments. Our research enhances
previous multi-channel speech processing efforts in four key areas: 1)
Adaptability: Unlike traditional models constrained to certain microphone
array configurations, our encoder is universally compatible. 2) Multi-Task
Capability: Beyond the single-task focus of previous systems, UniX-Encoder acts
as a robust upstream model, adeptly extracting features for diverse tasks
including ASR and speaker recognition. 3) Self-Supervised Training: The encoder
is trained without requiring labeled multi-channel data. 4) End-to-End
Integration: In contrast to models that first beamform and then process
a single channel, our encoder offers an end-to-end solution, bypassing explicit
beamforming or separation. To validate its effectiveness, we tested the
UniX-Encoder on a synthetic multi-channel dataset from the LibriSpeech corpus.
Across tasks like speech recognition and speaker diarization, our encoder
consistently outperformed combinations like the WavLM model with the BeamformIt
frontend. Comment: Submitted to ICASSP 202
Speaker characterization by means of attention pooling
State-of-the-art deep learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures are usually composed of a feature extractor front-end together with a pooling layer that encodes variable-length utterances into fixed-length speaker vectors. The authors have recently proposed the use of a Double Multi-Head Self Attention pooling for speaker recognition, placed between a CNN-based front-end and a set of fully connected layers. This has been shown to be an excellent approach for efficiently selecting the most relevant features captured by the front-end from the speech signal. In this paper we show excellent experimental results from adapting this architecture to other speaker characterization tasks, such as emotion recognition, sex classification and COVID-19 detection. Peer Reviewed. Postprint (published version)
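The pooling step described in this abstract, turning a variable number of frame features into one fixed-length vector, can be illustrated with a minimal sketch. This is plain single-head attention pooling, not the authors' Double Multi-Head Self Attention layer; the function name `attention_pool` and the `query` vector (which would be learned in a real system) are assumptions made for illustration.

```python
import math


def attention_pool(frames, query):
    """Pool a list of T frame vectors (each of length D) into one D-dim vector.

    `query` is a hypothetical D-dim scoring vector; in a trained model it
    would be a learned parameter.
    """
    dim = len(query)
    # Scaled dot-product score for every frame.
    scores = [
        sum(f * q for f, q in zip(frame, query)) / math.sqrt(dim)
        for frame in frames
    ]
    # Softmax over time, shifted by the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the frames: the output length is D for any T.
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


# Two "utterances" of different lengths map to vectors of the same size.
short_utt = [[1.0, 0.0, 2.0], [3.0, 1.0, 0.0]]
long_utt = [[0.5, 0.5, 0.5], [1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [2.0, 2.0, 2.0]]
query = [0.1, -0.2, 0.3]

vec_short, w_short = attention_pool(short_utt, query)
vec_long, w_long = attention_pool(long_utt, query)
print(len(vec_short), len(vec_long))
```

With an all-zero query every frame receives the same weight, so the sketch reduces to ordinary mean pooling; a non-zero query lets some frames contribute more than others.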