Fast and Accurate OOV Decoder on High-Level Features
This work proposes a novel approach to the out-of-vocabulary (OOV) keyword search
(KWS) task. The proposed approach is based on using high-level features from an
automatic speech recognition (ASR) system, so-called phoneme posterior based
(PPB) features, for decoding. These features are obtained by calculating
time-dependent phoneme posterior probabilities from word lattices, followed by
their smoothing. For the PPB features we developed a novel OOV decoder that is
very fast, simple and efficient. Experimental results are presented on the
Georgian language from the IARPA Babel Program, which was the test language in
the OpenKWS 2016 evaluation campaign. The results show that in terms of the maximum
term weighted value (MTWV) metric and computational speed, for single ASR
systems, the proposed approach significantly outperforms the state-of-the-art
approach based on using in-vocabulary proxies for OOV keywords in the indexed
database. The comparison of the two OOV KWS approaches on the fusion results of
the nine different ASR systems demonstrates that the proposed OOV decoder
outperforms the proxy-based approach in terms of the MTWV metric at
comparable processing speed. Other important advantages of the OOV decoder
include extremely low memory consumption and simplicity of its implementation
and parameter optimization.
Comment: Interspeech 2017, August 2017, Stockholm, Sweden. 201
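For readers unfamiliar with the MTWV metric mentioned in this abstract, the following is a minimal sketch of how term weighted value is commonly computed in NIST keyword search evaluations: TWV(θ) = 1 − (P_miss(θ) + β·P_FA(θ)) with β = 999.9, averaged over keywords, and MTWV is the maximum over the decision threshold θ. The function names and the data layout (`detections`, `n_true`, `speech_seconds`) are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, List, Tuple

BETA = 999.9  # false-alarm weight used in NIST keyword search evaluations


def twv(detections: Dict[str, List[Tuple[float, bool]]],
        n_true: Dict[str, int],
        speech_seconds: float,
        threshold: float) -> float:
    """Term weighted value at one threshold.

    detections: keyword -> list of (score, is_correct) hypotheses (hypothetical layout)
    n_true:     keyword -> number of reference occurrences
    """
    miss_sum, fa_sum, n_keywords = 0.0, 0.0, 0
    for keyword, n_ref in n_true.items():
        if n_ref == 0:
            continue  # keywords with no reference occurrence are skipped
        kept = [ok for score, ok in detections.get(keyword, []) if score >= threshold]
        n_correct = sum(1 for ok in kept if ok)
        n_false_alarm = len(kept) - n_correct
        miss_sum += 1.0 - n_correct / n_ref
        # Non-target trials are approximated as one per second of speech.
        fa_sum += n_false_alarm / (speech_seconds - n_ref)
        n_keywords += 1
    if n_keywords == 0:
        raise ValueError("no keyword has reference occurrences")
    return 1.0 - (miss_sum / n_keywords + BETA * fa_sum / n_keywords)


def mtwv(detections: Dict[str, List[Tuple[float, bool]]],
         n_true: Dict[str, int],
         speech_seconds: float) -> float:
    """Maximum TWV over all thresholds induced by the observed scores."""
    thresholds = sorted({score for dets in detections.values() for score, _ in dets})
    thresholds.append(float("inf"))  # "reject everything" operating point, TWV = 0
    return max(twv(detections, n_true, speech_seconds, t) for t in thresholds)
```

Because β is large, a single false alarm costs far more than a single miss, which is why the best threshold often sits well above the score of the weakest correct detection.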
Dialogue history integration into end-to-end signal-to-concept spoken language understanding systems
This work investigates the embeddings for representing dialog history in
spoken language understanding (SLU) systems. We focus on the scenario when the
semantic information is extracted directly from the speech signal by means of a
single end-to-end neural network model. We propose to integrate dialogue
history into an end-to-end signal-to-concept SLU system. The dialog history is
represented in the form of dialog history embedding vectors (so-called
h-vectors) and is provided as additional information to end-to-end SLU
models in order to improve the system performance. The following three types of
h-vectors are proposed and experimentally evaluated in this paper: (1)
supervised-all embeddings predicting bag-of-concepts expected in the answer of
the user from the last dialog system response; (2) supervised-freq embeddings
focusing on predicting only a selected set of semantic concepts (corresponding
to the most frequent errors in our experiments); and (3) unsupervised
embeddings. Experiments on the MEDIA corpus for the semantic slot filling task
demonstrate that the proposed h-vectors improve the model performance.
Comment: Accepted for ICASSP 2020 (Submitted: October 21, 2019
Investigating Adaptation and Transfer Learning for End-to-End Spoken Language Understanding from Speech
This work investigates speaker adaptation and transfer learning for spoken language understanding (SLU). We focus on the direct extraction of semantic tags from the audio signal using an end-to-end neural network approach. We demonstrate that the learning performance of the target predictive function for the semantic slot filling task can be substantially improved by speaker adaptation and by various knowledge transfer approaches. First, we explore speaker adaptive training (SAT) for end-to-end SLU models and propose to use zero pseudo i-vectors for more efficient model initialization and pretraining in SAT. Second, in order to improve the learning convergence for the target semantic slot filling (SF) task, models trained for different tasks, such as automatic speech recognition and named entity extraction, are used to initialize neural end-to-end models trained for the target task. In addition, we explore the impact of the knowledge transfer for SLU from a speech recognition task trained in a different language. These approaches make it possible to develop end-to-end SLU systems in low-resource data scenarios when there is not enough in-domain semantically labeled data, but other resources, such as word transcriptions for the same or another language or named entity annotation, are available.
A Study of Gender Impact in Self-supervised Models for Speech-to-Text Systems
Self-supervised models for speech processing emerged recently as popular
foundation blocks in speech processing pipelines. These models are pre-trained
on unlabeled audio data and then used in speech processing downstream tasks
such as automatic speech recognition (ASR) or speech translation (ST). Since
these models are now used in research and industrial systems alike, it becomes
necessary to understand the impact caused by some features such as gender
distribution within pre-training data. Using French as our investigation
language, we train and compare gender-specific wav2vec 2.0 models against
models containing different degrees of gender balance in their pre-training
data. The comparison is performed by applying these models to two
speech-to-text downstream tasks: ASR and ST. Our results show that the type of
downstream integration matters. We observe lower overall performance using
gender-specific pre-training before fine-tuning an end-to-end ASR system.
However, when self-supervised models are used as feature extractors, the
overall ASR and ST results follow more complex patterns, in which the balanced
pre-trained model is not necessarily the best option. Lastly, our crude
'fairness' metric, the relative performance difference measured between female
and male test sets, does not display a strong variation from balanced to
gender-specific pre-trained wav2vec 2.0 models.
Comment: submitted to INTERSPEECH 202
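The 'fairness' metric described in this abstract is a relative performance difference between female and male test sets. The abstract does not give the exact formula, so the sketch below assumes one plausible form, the difference normalized by the male score. The function name `relative_gap` and the example numbers are hypothetical.

```python
def relative_gap(female_score: float, male_score: float) -> float:
    """Relative difference of the female-set score with respect to the male-set score.

    Assumed normalization (not confirmed by the abstract): divide by the male score.
    For an error metric such as WER, a positive value means the female set is worse.
    """
    if male_score == 0:
        raise ValueError("male_score must be non-zero")
    return (female_score - male_score) / male_score


# Hypothetical WER values in percent, for illustration only.
gap = relative_gap(female_score=12.0, male_score=10.0)
```

Comparing this gap across models pre-trained with different gender balances is one way to check whether the pre-training data shifts the metric.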
Language-independent speaker anonymization using orthogonal Householder neural network
Speaker anonymization aims to conceal a speaker's identity while preserving
content information in speech. Current mainstream neural-network speaker
anonymization systems disentangle speech into prosody-related, content, and
speaker representations. The speaker representation is then anonymized by a
selection-based speaker anonymizer that uses a mean vector over a set of
randomly selected speaker vectors from an external pool of English speakers.
However, the resulting anonymized vectors are subject to severe privacy leakage
against powerful attackers, reduction in speaker diversity, and language
mismatch problems for unseen language speaker anonymization. To generate
diverse, language-neutral speaker vectors, this paper proposes an anonymizer
based on an orthogonal Householder neural network (OHNN). Specifically, the
OHNN acts like a rotation to transform the original speaker vectors into
anonymized speaker vectors, which are constrained to follow the distribution
over the original speaker vector space. A basic classification loss is
introduced to ensure that anonymized speaker vectors from different speakers
have unique speaker identities. To further protect speaker identities, an
improved classification loss and similarity loss are used to push
original-anonymized sample pairs away from each other. Experiments on
VoicePrivacy Challenge datasets in English and the AISHELL-3 dataset in
Mandarin demonstrate the proposed anonymizer's effectiveness.
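The orthogonal transformation at the core of this abstract can be illustrated with plain Householder reflections, H = I − 2vvᵀ/(vᵀv). Each reflection is orthogonal, so any composition of them preserves vector lengths and angles. The sketch below is a minimal illustration only: in the paper's network the reflection vectors would be learned parameters, whereas here they are fixed, hypothetical values, and none of the training losses are modeled.

```python
import math
from typing import List, Sequence


def householder_reflect(x: Sequence[float], v: Sequence[float]) -> List[float]:
    """Apply H = I - 2 v v^T / (v^T v) to x without forming the matrix."""
    vv = sum(a * a for a in v)
    if vv == 0:
        raise ValueError("reflection vector must be non-zero")
    scale = 2.0 * sum(a * b for a, b in zip(v, x)) / vv
    return [xi - scale * vi for xi, vi in zip(x, v)]


def apply_reflections(x: Sequence[float], vs: Sequence[Sequence[float]]) -> List[float]:
    """Compose several reflections; the overall map stays orthogonal."""
    out = list(x)
    for v in vs:
        out = householder_reflect(out, v)
    return out


def norm(x: Sequence[float]) -> float:
    return math.sqrt(sum(a * a for a in x))


# Hypothetical 3-dimensional "speaker vector" and two fixed reflection vectors.
speaker = [3.0, 4.0, 0.0]
anonymized = apply_reflections(speaker, [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
```

Here `anonymized` points in a different direction from `speaker` but has the same norm, which is the length-preserving property that keeps the transformed vectors within the geometry of the original space.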
Privacy attacks for automatic speech recognition acoustic models in a federated learning framework
This paper investigates methods to effectively retrieve speaker information
from personalized speaker-adapted neural network acoustic models (AMs) in
automatic speech recognition (ASR). This problem is especially important in the
context of federated learning of ASR acoustic models where a global model is
learnt on the server based on the updates received from multiple clients. We
propose an approach to analyze information in neural network AMs based on a
neural network footprint on the so-called Indicator dataset. Using this method,
we develop two attack models that aim to infer speaker identity from the
updated personalized models without access to the actual users' speech data.
Experiments on the TED-LIUM 3 corpus demonstrate that the proposed approaches
are very effective and can achieve an equal error rate (EER) of 1-2%.
Comment: Submitted to ICASSP 202
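The equal error rate quoted in this abstract is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below shows one simple way to estimate it from two lists of verification scores by sweeping the threshold over the observed values. The function name and the score lists are hypothetical, and the paper's own scoring pipeline is not reproduced here.

```python
from typing import Sequence


def equal_error_rate(target_scores: Sequence[float],
                     nontarget_scores: Sequence[float]) -> float:
    """Estimate the EER by sweeping the threshold over all observed scores.

    A trial is accepted when its score is >= threshold. The EER is taken at the
    threshold where false acceptance and false rejection rates are closest.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(1 for s in target_scores if s < threshold) / len(target_scores)
        far = sum(1 for s in nontarget_scores if s >= threshold) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```

For an attack that tries to identify speakers, a low EER means that same-speaker and different-speaker trials are easy to tell apart, so the attack is more effective.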