665 research outputs found
Distilling Multi-Level X-vector Knowledge for Small-footprint Speaker Verification
Even though deep speaker models have demonstrated impressive accuracy in
speaker verification tasks, this often comes at the expense of increased model
size and computation time, presenting challenges for deployment in
resource-constrained environments. Our research focuses on addressing this
limitation by developing small-footprint deep speaker embedding
extraction using knowledge distillation. While previous work in this domain has
concentrated on speaker embedding extraction at the utterance level, our
approach involves amalgamating embeddings from different levels of the x-vector
model (teacher network) to train a compact student network. The results
highlight the significance of frame-level information, with the student models
exhibiting a remarkable size reduction of 85%-91% compared to their teacher
counterparts, depending on the size of the teacher embeddings. Notably, by
concatenating teacher embeddings, we achieve student networks that maintain
comparable performance to the teacher while enjoying a substantial 75%
reduction in model size. These findings and insights extend to other x-vector
variants, underscoring the broad applicability of our approach.
Comment: Submitted to Data & Knowledge Engineering in Dec. 2023. Copyright may be transferred without notice.
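A minimal sketch of the multi-level distillation objective described above, assuming hypothetical teacher and student modules and a simple MSE matching loss; the paper's exact architecture and training objective may differ:

```python
import torch
import torch.nn as nn

class MultiLevelDistillLoss(nn.Module):
    """Match a compact student embedding to concatenated multi-level
    teacher embeddings (frame-level statistics + utterance-level)."""
    def __init__(self):
        super().__init__()
        self.mse = nn.MSELoss()

    def forward(self, student_emb, teacher_frame_emb, teacher_utt_emb):
        # Concatenating teacher embeddings from different levels is the
        # key idea: the student regresses the joint target vector.
        target = torch.cat([teacher_frame_emb, teacher_utt_emb], dim=-1)
        return self.mse(student_emb, target)

# Usage sketch: the teacher (a pretrained x-vector model) is frozen;
# only the compact student is updated.
# with torch.no_grad():
#     frame_emb, utt_emb = teacher(features)   # hypothetical interface
# loss = MultiLevelDistillLoss()(student(features), frame_emb, utt_emb)
```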
FRILL: A Non-Semantic Speech Embedding for Mobile Devices
Learned speech representations can drastically improve performance on tasks
with limited labeled data. However, due to their size and complexity, learned
representations have limited utility in mobile settings where run-time
performance can be a significant bottleneck. In this work, we propose a class
of lightweight non-semantic speech embedding models that run efficiently on
mobile devices based on the recently proposed TRILL speech embedding. We
combine novel architectural modifications with existing speed-up techniques to
create embedding models that are fast enough to run in real-time on a mobile
device and exhibit minimal performance degradation on a benchmark of
non-semantic speech tasks. One such model (FRILL) is 32x faster on a Pixel 1
smartphone and 40% the size of TRILL, with an average decrease in accuracy of
only 2%. To our knowledge, FRILL is the highest-quality non-semantic embedding
designed for use on mobile devices. Furthermore, we demonstrate that these
representations are useful for mobile health tasks such as non-speech human
sounds detection and face-masked speech detection. Our models and code are
publicly available.
Comment: Accepted to Interspeech 2021.
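As a usage sketch only: TRILL-family embeddings are published on TensorFlow Hub, and loading one looks roughly like the following. The module handle shown is TRILL's; the exact FRILL handle and output dimensions should be checked against TF Hub:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Load a TRILL-family non-semantic speech embedding from TF Hub.
# FRILL is published as a separate, lighter-weight module.
module = hub.load('https://tfhub.dev/google/nonsemantic-speech-benchmark/trill/3')

# Models in this family expect mono float audio at 16 kHz.
samples = tf.zeros([1, 16000], dtype=tf.float32)  # one second of silence
outputs = module(samples=samples, sample_rate=16000)
emb = outputs['embedding']  # (batch, num_frames, embedding_dim)
```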
Leveraging Speaker Embeddings with Adversarial Multi-task Learning for Age Group Classification
Recently, researchers have utilized neural network-based speaker embedding
techniques in speaker-recognition tasks to identify speakers accurately.
However, speaker-discriminative embeddings do not always represent speech
features such as age group well. In an embedding model trained primarily to capture speaker traits, age information survives mostly as leakage, which makes age group classification difficult. Hence, to improve age group
classification performance, we consider the use of speaker-discriminative
embeddings derived from adversarial multi-task learning to align features and
reduce the domain discrepancy across age subgroups. In addition, we investigate
different types of speaker embeddings to learn and generalize the
domain-invariant representations for age groups. Experimental results on the
VoxCeleb Enrichment dataset verify the effectiveness of our proposed adaptive
adversarial network in multi-objective scenarios and leveraging speaker
embeddings for the domain adaptation task.
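A minimal sketch of the standard gradient-reversal construction used in adversarial multi-task learning; the paper's adaptive adversarial network may differ, and all head names here are hypothetical:

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; reverses (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Sketch: a shared encoder feeds the age classifier directly and a
# domain discriminator through the reversal layer, so the encoder
# learns age-discriminative yet domain-invariant features.
# age_logits    = age_head(encoder(x))
# domain_logits = domain_head(grad_reverse(encoder(x), lambd=0.5))
# loss = ce(age_logits, age_labels) + ce(domain_logits, domain_labels)
```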
Compact recurrent neural networks for acoustic event detection on low-energy low-complexity platforms
Outdoor acoustic event detection is an exciting research field, but one
challenged by the need for complex algorithms and deep learning techniques,
typically requiring many computational, memory, and energy resources. This
challenge discourages IoT implementation, where an efficient use of resources
is required. However, current embedded technologies and microcontrollers have
increased their capabilities without penalizing energy efficiency. This paper
addresses the application of sound event detection at the edge, by optimizing
deep learning techniques on resource-constrained embedded platforms for the
IoT. The contribution is two-fold: firstly, a two-stage student-teacher
approach is presented to make state-of-the-art neural networks for sound event
detection fit on current microcontrollers; secondly, we test our approach on an
ARM Cortex M4, focusing in particular on issues related to 8-bit quantization.
Our embedded implementation can achieve 68% accuracy in recognition on
Urbansound8k, not far from state-of-the-art performance, with an inference time
of 125 ms for each second of the audio stream, and power consumption of 5.5 mW
in just 34.3 kB of RAM.
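As illustration only, a generic sketch of full-integer 8-bit post-training quantization, the usual TensorFlow Lite path toward Cortex-M deployment; the paper's exact toolchain is not specified here, and `trained_model` and `representative_samples` are placeholders:

```python
import tensorflow as tf

def quantize_to_int8(trained_model, representative_samples):
    """Convert a trained Keras model to a fully int8-quantized TFLite model."""
    def representative_dataset():
        for sample in representative_samples:
            # Calibration data lets the converter pick int8 value ranges.
            yield [tf.expand_dims(sample, 0)]

    converter = tf.lite.TFLiteConverter.from_keras_model(trained_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()  # bytes, ready to embed as a C array
```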
Baselines and Protocols for Household Speaker Recognition
Speaker recognition on household devices, such as smart speakers, features several challenges: (i) robustness across a vast number of heterogeneous domains (households), (ii) short utterances, (iii) possibly absent speaker labels of the enrollment data (passive enrollment), and (iv) presence of unknown persons (guests). While many commercial products exist, there is less published research and no publicly available evaluation protocols or open-source baselines. Our work serves to bridge this gap by providing an accessible evaluation benchmark derived from public resources (VoxCeleb and ASVspoof 2019 data) along with a preliminary pool of open-source baselines. This includes four algorithms for active enrollment (speaker labels available) and one algorithm for passive enrollment.
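A minimal sketch of what a cosine-scoring active-enrollment baseline of this kind might look like; the function names and thresholding rule are illustrative, not the paper's exact recipe:

```python
import numpy as np

def enroll(embeddings):
    """Active enrollment baseline: average labeled embeddings per speaker."""
    centroid = np.mean(embeddings, axis=0)
    return centroid / np.linalg.norm(centroid)

def score(test_embedding, centroid):
    """Cosine similarity between a test embedding and a speaker centroid."""
    test = test_embedding / np.linalg.norm(test_embedding)
    return float(np.dot(test, centroid))

# A guest/unknown decision then reduces to thresholding the score:
# accept if score(e, centroid) >= tau, with tau tuned on development data.
```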
Deep Spoken Keyword Spotting: An Overview
Spoken keyword spotting (KWS) deals with the identification of keywords in
audio streams and has become a fast-growing technology thanks to the paradigm
shift introduced by deep learning a few years ago. This has allowed the rapid
embedding of deep KWS in a myriad of small electronic devices for purposes such as activating voice assistants. Prospects suggest sustained growth in the social use of this technology. Thus, it is not surprising
that deep KWS has become a hot research topic among speech scientists, who
constantly look for KWS performance improvement and computational complexity
reduction. This context motivates this paper, in which we conduct a literature
review of deep spoken KWS to assist practitioners and researchers who are
interested in this technology. Specifically, this overview has a comprehensive
nature by covering a thorough analysis of deep KWS systems (which includes
speech features, acoustic modeling and posterior handling), robustness methods,
applications, datasets, evaluation metrics, performance of deep KWS systems and
audio-visual KWS. The analysis performed in this paper allows us to identify a
number of directions for future research, including directions adopted from
automatic speech recognition research and directions that are unique to the
problem of spoken KWS.
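To make the posterior-handling stage mentioned above concrete, here is a hedged sketch of one common scheme: moving-average smoothing of per-frame posteriors followed by thresholding. Window and threshold values are illustrative:

```python
import numpy as np

def smooth_posteriors(posteriors, w_smooth=30):
    """Moving-average smoothing of per-frame posteriors, shape (time, classes)."""
    kernel = np.ones(w_smooth) / w_smooth
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode='same'), 0, posteriors)

def detect_keywords(posteriors, threshold=0.7, w_smooth=30):
    """Fire a detection for any keyword whose smoothed posterior crosses
    the threshold; column 0 is assumed to be the non-keyword class."""
    smooth = smooth_posteriors(posteriors, w_smooth)
    return np.any(smooth[:, 1:] > threshold, axis=0)

# Example: 100 frames over 3 classes (filler + two keywords).
probs = np.random.dirichlet(np.ones(3), size=100)
print(detect_keywords(probs))  # one boolean flag per keyword
```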
Workshop Report: Developing a Research Agenda for the Energy Water Nexus
The energy-water nexus has attracted public scrutiny because of the concerns about their interdependence and the possibility for cascading vulnerabilities from one system to the other. There are trends toward more water-intensive energy (such as biofuels, unconventional oil and gas production, and regulations driving more water consumption for thermoelectric power production) and more energy-intensive water (such as desalination, or deeper groundwater pumping and production). In addition, demographic trends of population and economic growth will likely drive up total and per capita water and energy demand, and due to climate-change-related distortions of the hydrologic cycle, it is expected that the existing interdependencies will become even more of a concern. Therefore, developing a research agenda and strategy to mitigate potential vulnerabilities and to meet economic and environmental targets for efficiently using energy and water would be very worthwhile. To address these concerns, the National Science Foundation (NSF) sponsored a workshop on June 10–11, 2013 in Arlington, VA (at NSF headquarters) to bring together technical, academic, and industry experts from across the country to help develop such a research agenda. The workshop was sponsored by NSF Grant Number CBET 1341032 from the Division of Chemical, Bioengineering, Environmental and Transport Systems. Supporting programs were: Thermal Transport Processes, Environmental Sustainability, and Environmental Engineering.
Deep representation learning for speech recognition
Representation learning is a fundamental ingredient of deep learning. However, learning a good representation is a challenging task. For speech recognition, such a representation should contain the information needed to perform well in this task. A robust representation should also be reusable, hence it should capture the structure of the data. Interpretability is another desired characteristic. In this thesis we strive to learn an optimal deep representation for speech recognition using feed-forward Neural Networks (NNs) with different connectivity patterns.
First and foremost, we aim to improve the robustness of the acoustic models. We use attribute-aware and adaptive training strategies to model the underlying factors of variation related to the speakers and the acoustic conditions. We focus on low-latency and real-time decoding scenarios. We explore different utterance summaries (referred to as utterance embeddings), capturing various sources of speech variability, and we seek to optimise speaker adaptive training (SAT) with control networks acting on the embeddings. We also propose a multi-scale CNN layer to learn factorised representations. The proposed multi-scale approach also improves computational and memory efficiency.
We also present a number of different approaches in an attempt to better understand the learned representations. First, with a controlled design, we aim to assess the role of individual components of deep CNN acoustic models. Next, with saliency maps, we evaluate the importance of each input feature with respect to the classification criterion. Then, we propose to evaluate layer-wise and model-wise learned representations in different diagnostic verification tasks (speaker and acoustic condition verification). We propose a deep CNN model as the embedding extractor, merging the information learned at different layers in the network. Similarly, we perform the analyses for the embeddings used in SAT-DNNs to gain more insight. For the multi-scale models, we also show how to compare learned representations (and assess their robustness) with a metric invariant to affine transformations.
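The multi-scale CNN layer mentioned above admits a simple common formulation: parallel convolutions at several kernel sizes whose outputs are concatenated. A minimal sketch, with all dimensions illustrative rather than the thesis's exact configuration:

```python
import torch
import torch.nn as nn

class MultiScaleConv1d(nn.Module):
    """Parallel 1-D convolutions over time with different kernel sizes,
    concatenated along the channel axis to capture multiple temporal scales."""
    def __init__(self, in_ch, out_ch_per_scale, kernel_sizes=(3, 5, 9)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(in_ch, out_ch_per_scale, k, padding=k // 2)
             for k in kernel_sizes])

    def forward(self, x):            # x: (batch, channels, time)
        return torch.cat([b(x) for b in self.branches], dim=1)

# Example: 40-dim filterbank features, three scales of 32 channels each.
layer = MultiScaleConv1d(40, 32)
feats = torch.randn(8, 40, 200)
out = layer(feats)                   # (8, 96, 200)
```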
Proceedings of the 8th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2023)
This volume gathers the papers presented at the Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE2023), Tampere, Finland, during 21–22 September 2023.