Version Control of Speaker Recognition Systems
This paper discusses one of the most challenging practical engineering
problems in speaker recognition systems - the version control of models and
user profiles. A typical speaker recognition system consists of two stages: the
enrollment stage, where a profile is generated from user-provided enrollment
audio; and the runtime stage, where the voice identity of the runtime audio is
compared against the stored profiles. As technology advances, the speaker
recognition system needs to be updated for better performance. However, if the
stored user profiles are not updated accordingly, version mismatch will result
in meaningless recognition results. In this paper, we describe different
version control strategies for different types of speaker recognition systems,
according to how they are deployed in the production environment.
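As a rough illustration of the mismatch problem the abstract describes, a runtime check can refuse to score a profile generated by a different model version. This is a hedged sketch, not the paper's method; `Profile`, `RUNTIME_MODEL_VERSION`, the cosine scorer, and the threshold are all invented names for illustration.

```python
import math
from dataclasses import dataclass

# Illustrative version scheme: the deployed embedding model's version.
RUNTIME_MODEL_VERSION = 2

@dataclass
class Profile:
    user_id: str
    model_version: int      # version of the model that produced the embedding
    embedding: list         # speaker embedding generated at enrollment time

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def verify(profile, runtime_embedding, threshold=0.8):
    """Only compare embeddings produced by the same model version."""
    if profile.model_version != RUNTIME_MODEL_VERSION:
        # Scores across versions are meaningless; trigger a profile
        # migration (e.g., re-enrollment or re-embedding) instead.
        raise ValueError("profile/model version mismatch: migrate profile first")
    return cosine(profile.embedding, runtime_embedding) >= threshold
```

The key design point mirrored here is that a version mismatch is surfaced as an explicit error rather than silently producing a low (or spuriously high) similarity score.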
Health Diagnostics Using User Utterances
Respiratory illnesses can be hard to track and diagnose. Obtaining useful clinical data on these illnesses is difficult because it requires physical interaction, e.g., via nasal or sinus swab. It is known that respiratory illness can impact speech pathways. To this end, this disclosure describes techniques to use readily accessible software to obtain and classify potentially useful data. With user permission, utterances of the user, e.g., activation of a speech-activated device via a hotword, are analyzed to form speaker-ID models. These models are evaluated against additional utterances of the user in a sequential manner. The evaluation scores, along with the timestamps and details of the models, are aggregated to determine if the user has an interval of time where their speaker-ID models are unstable, inconsistent, or lacking self-similarity. This signal can be used as a proxy for detection or as a motivating factor for clinical investigation.
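One speculative way to operationalize the "lacking self-similarity" signal is to score each new utterance embedding against the mean of earlier ones and flag low-similarity intervals. This is a sketch under stated assumptions; the function names, the sequential-mean scoring, and the threshold are invented here and are not from the disclosure.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def sequential_scores(embeddings):
    """Score each utterance embedding against the mean of all earlier ones."""
    scores = []
    dim = len(embeddings[0])
    for i in range(1, len(embeddings)):
        mean = [sum(e[d] for e in embeddings[:i]) / i for d in range(dim)]
        scores.append(cosine(mean, embeddings[i]))
    return scores

def unstable_intervals(scores, threshold=0.7):
    """Indices where the speaker's voice lacks self-similarity; a candidate
    proxy signal for clinical follow-up, per the disclosure's framing."""
    return [i for i, s in enumerate(scores) if s < threshold]
```

A run of low scores at consecutive indices would correspond to the "interval of time" of instability that the disclosure proposes aggregating.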
Attention-Based Models for Text-Dependent Speaker Verification
Attention-based models have recently shown great performance on a range of
tasks, such as speech recognition, machine translation, and image captioning,
due to their ability to summarize relevant information that spans the entire
length of an input sequence. In this paper, we analyze the use of attention
mechanisms for the problem of sequence summarization in our end-to-end
text-dependent speaker recognition system. We explore different topologies and
variants of the attention layer, and compare different pooling methods on the
attention weights. Ultimately, we show that attention-based models can improve
the Equal Error Rate (EER) of our speaker verification system by a relative
14% compared to our non-attention LSTM baseline model.
Comment: Submitted to ICASSP 201
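A minimal sketch of attention-based pooling over frame-level features, the kind of sequence summarization the abstract refers to. The shapes and the single-vector scoring function are illustrative assumptions; the paper's actual layer topologies and pooling variants differ.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frames, w):
    """frames: list of T feature vectors of dim D; w: learned D-dim scorer.
    Returns one D-dim summary vector, weighted by attention over time."""
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(weights[t] * frames[t][d] for t in range(len(frames)))
            for d in range(dim)]

def mean_pool(frames):
    """Non-attention baseline: uniform average over time."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
```

With a zero scoring vector the attention weights are uniform, so attention pooling degenerates to mean pooling; a trained scorer instead concentrates weight on the most speaker-discriminative frames.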
Secure audio processing
Automatic speech recognizers (ASRs) are now nearly ubiquitous, finding application in smart assistants, smartphones, smart speakers, and other devices. An attack on an ASR that triggers such a device into carrying out false instructions can lead to severe consequences. Typically, speech recognition is performed using machine learning models, e.g., neural networks, whose intermediate outputs are not always fully concealed. Exposing such intermediate outputs makes the crafting of malicious input audio easier. This disclosure describes techniques that thwart attacks on speech recognition systems by moving model inference processing to a secure computing enclave. The memory and signals of the secure enclave are inaccessible to the user and to untrusted processes, and are therefore resistant to attacks.
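At the API level, the defensive boundary can be caricatured as follows: only the final decision crosses the boundary, while intermediate outputs such as logits never leave the inference routine. This is a toy sketch with an invented linear "model"; real enclave isolation is enforced by hardware and OS support, not by language-level encapsulation.

```python
class SealedRecognizer:
    """Toy stand-in for enclave-isolated inference: the caller receives only
    the final label; model weights and intermediate logits stay internal."""

    def __init__(self, weights, labels):
        self._weights = weights   # per-label weight rows (hidden state)
        self._labels = labels

    def recognize(self, features):
        # Intermediate outputs (logits) are computed and discarded inside
        # the boundary; exposing them would make it easier to craft
        # adversarial input audio by gradient- or score-guided search.
        logits = [sum(w * x for w, x in zip(row, features))
                  for row in self._weights]
        best = max(range(len(logits)), key=logits.__getitem__)
        return self._labels[best]  # final decision only
```

The point mirrored from the disclosure is the interface contract: an attacker probing the system observes only coarse final outputs, not the richer intermediate signals that facilitate crafting malicious audio.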