Very Fast Keyword Spotting System with Real Time Factor below 0.01
In this paper, we present an architecture for a keyword spotting (KWS) system
that is based on modern neural networks, yields good performance on various
types of speech data and can run very fast. We focus mainly on the last aspect
and propose optimizations for all the steps required in a KWS design: signal
processing and likelihood computation, Viterbi decoding, spot candidate
detection and confidence calculation. We present time- and memory-efficient
modelling by bidirectional feedforward sequential memory networks (an
alternative to recurrent nets), using either standard triphones or so-called
quasi-monophones, and an entirely forward decoding of speech frames (with
minimal need for look-back). Several variants of the proposed scheme are
evaluated on 3 large Czech datasets (broadcast, internet and telephone, 17
hours in total) and their performance is compared by Detection Error Tradeoff
(DET) diagrams and real-time (RT) factors. We demonstrate that the complete
system can run in a single pass with an RT factor close to 0.001 if all
optimizations (including a GPU for likelihood computation) are applied. Comment: 11 pages, 3 figures
Latent Phrase Matching for Dysarthric Speech
Many consumer speech recognition systems are not tuned for people with speech
disabilities, resulting in poor recognition and user experience, especially for
severe speech differences. Recent studies have emphasized interest in
personalized speech models from people with atypical speech patterns. We
propose a query-by-example-based personalized phrase recognition system that is
trained using small amounts of speech, is language agnostic, does not assume a
traditional pronunciation lexicon, and generalizes well across speech
difference severities. On an internal dataset collected from 32 people with
dysarthria, this approach works regardless of severity and shows a 60%
improvement in recall relative to a commercial speech recognition system. On
the public EasyCall dataset of dysarthric speech, our approach improves
accuracy by 30.5%. Performance degrades as the number of phrases increases, but
the approach consistently outperforms ASR systems when trained with 50 unique phrases.
Multimedia information technology and the annotation of video
The state of the art in multimedia information technology has not progressed to the point where a single solution is available to meet all reasonable needs of documentalists and users of video archives. In general, we do not have an optimistic view of the usability of new technology in this domain, but digitization and digital power can be expected to cause a small revolution in the area of video archiving. The volume of data leads to two views of the future: on the pessimistic side, overload of data will cause lack of annotation capacity, and on the optimistic side, there will be enough data from which to learn selected concepts that can be deployed to support automatic annotation. At the threshold of this interesting era, we make an attempt to describe the state of the art in technology. We sample the progress in text, sound, and image processing, as well as in machine learning.
Open-vocabulary keyword spotting in any language through multilingual contrastive speech-phoneme pretraining
In this paper, we introduce a massively multilingual speech corpus with
fine-grained phonemic transcriptions, encompassing more than 115 languages from
diverse language families. Based on this multilingual dataset, we propose
CLAP-IPA, a multilingual phoneme-speech contrastive embedding model capable of
open-vocabulary matching between speech signals and phonemically transcribed
keywords or arbitrary phrases. The proposed model has been tested on two
fieldwork speech corpora in 97 unseen languages, exhibiting strong
generalizability across languages. Comparison with a text-based model shows
that using phonemes as modeling units enables much better crosslinguistic
generalization than orthographic texts. Comment: Preprint; Work in Progress
Cross-Lingual Topic Prediction for Speech Using Translations
Given a large amount of unannotated speech in a low-resource language, can we
classify the speech utterances by topic? We consider this question in the
setting where a small amount of speech in the low-resource language is paired
with text translations in a high-resource language. We develop an effective
cross-lingual topic classifier by training on just 20 hours of translated
speech, using a recent model for direct speech-to-text translation. While the
translations are poor, they are still good enough to correctly classify the
topic of 1-minute speech segments over 70% of the time - a 20% improvement over
a majority-class baseline. Such a system could be useful for humanitarian
applications like crisis response, where incoming speech in a foreign
low-resource language must be quickly assessed for further action. Comment: Accepted to ICASSP 202