Distilling Multi-Level X-vector Knowledge for Small-footprint Speaker Verification

Authors: Kinnunen Tomi; Liu Xuechen; Sahidullah Md
Publication date: 19/12/2023

Even though deep speaker models have demonstrated impressive accuracy in speaker verification tasks, this often comes at the expense of increased model size and computation time, presenting challenges for deployment in resource-constrained environments. Our research focuses on addressing this limitation through the development of small-footprint deep speaker embedding extraction using knowledge distillation. While previous work in this domain has concentrated on speaker embedding extraction at the utterance level, our approach involves amalgamating embeddings from different levels of the x-vector model (teacher network) to train a compact student network. The results highlight the significance of frame-level information, with the student models exhibiting a remarkable size reduction of 85%-91% compared to their teacher counterparts, depending on the size of the teacher embeddings. Notably, by concatenating teacher embeddings, we achieve student networks that maintain comparable performance to the teacher while enjoying a substantial 75% reduction in model size. These findings and insights extend to other x-vector variants, underscoring the broad applicability of our approach.

Comment: Submitted to Data & Knowledge Engineering in Dec. 2023. Copyright may be transferred without notice.

arXiv.org e-Print Archive

FRILL: A Non-Semantic Speech Embedding for Mobile Devices

Authors: Garrison Jake; Joglekar Sachin; Patel Shwetak; Peplinski Jacob; Shor Joel
Publication date: 10/06/2021

Learned speech representations can drastically improve performance on tasks with limited labeled data. However, due to their size and complexity, learned representations have limited utility in mobile settings where run-time performance can be a significant bottleneck. In this work, we propose a class of lightweight non-semantic speech embedding models that run efficiently on mobile devices based on the recently proposed TRILL speech embedding. We combine novel architectural modifications with existing speed-up techniques to create embedding models that are fast enough to run in real-time on a mobile device and exhibit minimal performance degradation on a benchmark of non-semantic speech tasks. One such model (FRILL) is 32x faster on a Pixel 1 smartphone and 40% the size of TRILL, with an average decrease in accuracy of only 2%. To our knowledge, FRILL is the highest-quality non-semantic embedding designed for use on mobile devices. Furthermore, we demonstrate that these representations are useful for mobile health tasks such as non-speech human sounds detection and face-masked speech detection. Our models and code are publicly available.

Comment: Accepted to Interspeech 202

arXiv.org e-Print Archive

Leveraging Speaker Embeddings with Adversarial Multi-task Learning for Age Group Classification

Authors: Baeg Kwangje; Han Young-Sub; Jeon Byoung-Ki; Kim Yeong-Gwan
Publication date: 22/01/2023

Recently, researchers have utilized neural network-based speaker embedding techniques in speaker-recognition tasks to identify speakers accurately. However, speaker-discriminative embeddings do not always represent speech features such as age group well. In an embedding model that has been highly trained to capture speaker traits, the task of age group classification is closer to speech information leakage. Hence, to improve age group classification performance, we consider the use of speaker-discriminative embeddings derived from adversarial multi-task learning to align features and reduce the domain discrepancy in age subgroups. In addition, we investigated different types of speaker embeddings to learn and generalize the domain-invariant representations for age groups. Experimental results on the VoxCeleb Enrichment dataset verify the effectiveness of our proposed adaptive adversarial network in multi-objective scenarios and leveraging speaker embeddings for the domain adaptation task.

arXiv.org e-Print Archive

Compact recurrent neural networks for acoustic event detection on low-energy low-complexity platforms

Authors: Brutti Alessio; Cerutti Gianmarco; Farella Elisabetta; Prasad Rahul
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2020

Outdoor acoustic event detection is an exciting research field but is challenged by the need for complex algorithms and deep learning techniques, typically requiring many computational, memory, and energy resources. This challenge discourages IoT implementation, where an efficient use of resources is required. However, current embedded technologies and microcontrollers have increased their capabilities without penalizing energy efficiency. This paper addresses the application of sound event detection at the edge, by optimizing deep learning techniques on resource-constrained embedded platforms for the IoT. The contribution is two-fold: firstly, a two-stage student-teacher approach is presented to make state-of-the-art neural networks for sound event detection fit on current microcontrollers; secondly, we test our approach on an ARM Cortex M4, particularly focusing on issues related to 8-bit quantization. Our embedded implementation can achieve 68% accuracy in recognition on Urbansound8k, not far from state-of-the-art performance, with an inference time of 125 ms for each second of the audio stream, and power consumption of 5.5 mW in just 34.3 kB of RAM.

arXiv.org e-Print Archive

Archivio della ricerca - Fondazione Bruno Kessler

Baselines and Protocols for Household Speaker Recognition

Authors: Kinnunen Tomi; Liu Xuechen; Sahidullah Md; Sholokhov Alexey
Publication venue: International Speech Communication Association
Publication date: 01/06/2022

Speaker recognition on household devices, such as smart speakers, features several challenges: (i) robustness across a vast number of heterogeneous domains (households), (ii) short utterances, (iii) possibly absent speaker labels of the enrollment data (passive enrollment), and (iv) presence of unknown persons (guests). While many commercial products exist, there is less published research and no publicly-available evaluation protocols or open-source baselines. Our work serves to bridge this gap by providing an accessible evaluation benchmark derived from public resources (VoxCeleb and ASVspoof 2019 data) along with a preliminary pool of open-source baselines. This includes four algorithms for active enrollment (speaker labels available) and one algorithm for passive enrollment.

INRIA a CCSD electronic archive server

Deep Spoken Keyword Spotting: An Overview

Authors: Espejo Ivan Lopez; Hansen John; Jensen Jesper; Tan Zheng-Hua
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 20/11/2021

Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices with different purposes like the activation of voice assistants. Prospects suggest a sustained growth in terms of social use of this technology. Thus, it is not surprising that deep KWS has become a hot research topic among speech scientists, who constantly look for KWS performance improvement and computational complexity reduction. This context motivates this paper, in which we conduct a literature review into deep spoken KWS to assist practitioners and researchers who are interested in this technology. Specifically, this overview has a comprehensive nature by covering a thorough analysis of deep KWS systems (which includes speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, performance of deep KWS systems and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS.

arXiv.org e-Print Archive

VBN

Recommended from our members

Workshop Report: Developing a Research Agenda for the Energy Water Nexus

Authors: Reible Danny; Hightower Mike; Webber Michael E.
Publication venue: Center for Research in Water Resources, University of Texas at Austin
Publication date: 31/12/2013

The energy water nexus has attracted public scrutiny because of the concerns about their interdependence and the possibility for cascading vulnerabilities from one system to the other. There are trends toward more water-intensive energy (such as biofuels, unconventional oil and gas production, and regulations driving more water consumption for thermoelectric power production) and more energy-intensive water (such as desalination, or deeper ground water pumping and production). In addition, demographic trends of population and economic growth will likely drive up total and per capita water and energy demand, and due to climate change related distortions of the hydrologic cycle, it is expected that the existing interdependencies will become even more of a concern. Therefore, developing a research agenda and strategy to mitigate potential vulnerabilities and to meet economic and environmental targets for efficiently using energy and water would be very worthwhile. To address these concerns, the National Science Foundation (NSF) sponsored a workshop on June 10-11, 2013 in Arlington, VA (at NSF headquarters) to bring together technical, academic, and industry experts from across the country to help develop such a research agenda. The workshop was sponsored by NSF Grant Number CBET 1341032 from the Division of Chemical, Bioengineering, Environmental and Transport Systems. Supporting programs were: Thermal Transport Processes, Environmental Sustainability, and Environmental Engineering.

Texas ScholarWorks

Deep representation learning for speech recognition

Author: Równicka Joanna Małgorzata
Publication venue: The University of Edinburgh
Publication date: 31/07/2021

Representation learning is a fundamental ingredient of deep learning. However, learning a good representation is a challenging task. For speech recognition, such a representation should contain the information needed to perform well in this task. A robust representation should also be reusable, hence it should capture the structure of the data. Interpretability is another desired characteristic. In this thesis we strive to learn an optimal deep representation for speech recognition using feed-forward Neural Networks (NNs) with different connectivity patterns. First and foremost, we aim to improve the robustness of the acoustic models. We use attribute-aware and adaptive training strategies to model the underlying factors of variation related to the speakers and the acoustic conditions. We focus on low-latency and real-time decoding scenarios. We explore different utterance summaries (referred to as utterance embeddings), capturing various sources of speech variability, and we seek to optimise speaker adaptive training (SAT) with control networks acting on the embeddings. We also propose a multi-scale CNN layer, to learn factorised representations. The proposed multi-scale approach also tackles the computational and memory efficiency. We also present a number of different approaches as an attempt to better understand learned representations. First, with a controlled design, we aim to assess the role of individual components of deep CNN acoustic models. Next, with saliency maps, we evaluate the importance of each input feature with respect to the classification criterion. Then, we propose to evaluate layer-wise and model-wise learned representations in different diagnostic verification tasks (speaker and acoustic condition verification). We propose a deep CNN model as the embedding extractor, merging the information learned at different layers in the network. Similarly, we perform the analyses for the embeddings used in SAT-DNNs to gain more insight. 
For the multi-scale models, we also show how to compare learned representations (and assess their robustness) with a metric invariant to affine transformations.

Edinburgh Research Archive

Proceedings of the 8th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2023)

Publication venue: Tampere University Press
Publication date: 01/11/2023

This volume gathers the papers presented at the Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE2023), Tampere, Finland, during 21–22 September 2023.

Trepo - Institutional Repository of Tampere University