Security and privacy problems in voice assistant applications: A survey
Voice assistant applications have become ubiquitous nowadays. The two models that provide the most important functions for real-life applications (e.g., Google Home, Amazon Alexa, Siri) are Automatic Speech Recognition (ASR) models and Speaker Identification (SI) models. According to recent studies, security and privacy threats have also emerged with the rapid development of the Internet of Things (IoT). The security issues studied include attack techniques against the machine learning models and other hardware components widely used in voice assistant applications. The privacy issues include technical information stealing and policy-level privacy breaches. Voice assistant applications take a steadily growing market share every year, yet their privacy and security issues continue to cause substantial economic losses and to endanger users' sensitive personal information. It is therefore important to have a comprehensive survey that categorizes current research on the security and privacy problems of voice assistant applications. This paper summarizes and assesses five kinds of security attacks and three types of privacy threats in papers published at top-tier conferences in the cyber security and voice domains.
ALISA: An automatic lightly supervised speech segmentation and alignment tool
This paper describes the ALISA tool, which implements a lightly supervised method for sentence-level alignment of speech with imperfect transcripts. Its intended use is to enable the creation of new speech corpora from a multitude of resources in a language-independent fashion, avoiding the need to record or transcribe speech data. The method is designed to require minimal user intervention and expert knowledge, and it is able to align data in languages that employ alphabetic scripts. It comprises a GMM-based voice activity detector and a highly constrained grapheme-based speech aligner. The method is evaluated objectively against a gold-standard segmentation and transcription, as well as subjectively through building and testing speech synthesis systems from the retrieved data. Results show that, on average, 70% of the original data is correctly aligned, with a word error rate of less than 0.5%. In one case, subjective listening tests show a small but statistically significant preference for voices built on the gold transcript; in the other tests, no statistically significant differences are found between systems built from the fully supervised training data and those built using the proposed method.
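Since ALISA's output is scored by word error rate, the following minimal Python sketch shows the standard WER computation via Levenshtein alignment; the function and example are illustrative, not code from the paper.

    # Minimal word error rate (WER) computation via Levenshtein alignment.
    # Illustrative only; not taken from the ALISA paper.
    def wer(reference: list[str], hypothesis: list[str]) -> float:
        """WER = (substitutions + deletions + insertions) / len(reference)."""
        n, m = len(reference), len(hypothesis)
        # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
        dp = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            dp[i][0] = i  # deletions
        for j in range(m + 1):
            dp[0][j] = j  # insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + cost) # match/substitution
        return dp[n][m] / max(n, 1)

    print(wer("the cat sat".split(), "the cat sat down".split()))  # 0.333...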
CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings
Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous CHiME-5 recordings except for accurate array synchronization. The material was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech. This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition (Track 1) and unsegmented multispeaker speech recognition (Track 2). Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open-source baselines providing speech enhancement, speaker diarization, and speech recognition modules.
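For readers new to Track 2, the sketch below shows how the three baseline stages compose into one pipeline. All names and bodies are schematic placeholders, not the actual CHiME-6 baseline code or its APIs.

    # Schematic composition of the CHiME-6 Track 2 baseline stages.
    # All function bodies are trivial placeholders, not the real baseline.
    from typing import List, Tuple

    Segment = Tuple[str, float, float]  # (speaker_id, start_sec, end_sec)

    def enhance(multichannel_audio: List[List[float]]) -> List[float]:
        # Placeholder: the real baseline beamforms the distant mic arrays.
        return multichannel_audio[0]

    def diarize(audio: List[float]) -> List[Segment]:
        # Placeholder: the real baseline clusters speaker embeddings.
        return [("spk1", 0.0, 2.5), ("spk2", 2.5, 5.0)]

    def recognize(audio: List[float], seg: Segment) -> str:
        # Placeholder: the real baseline decodes each segment with ASR.
        return "<hypothesis>"

    def track2_pipeline(multichannel_audio: List[List[float]]):
        audio = enhance(multichannel_audio)
        return [(seg, recognize(audio, seg)) for seg in diarize(audio)]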
Current trends in multilingual speech processing
In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years, and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS), as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies, at the heart of which lies multilingual speech processing.
Decoding visemes: improving machine lip-reading
This thesis is about improving machine lip-reading, that is, the classification of speech from only visual cues of a speaker. Machine lip-reading is a niche research problem in both speech processing and computer vision.

Current challenges for machine lip-reading fall into two groups: the content of the video, such as the rate at which a person is speaking; and the parameters of the video recording, for example the video resolution. We begin our work with a literature review to understand the restrictions that current technology places on machine lip-reading recognition, and conduct an experiment into the effects of resolution. We show that high-definition video is not needed to lip-read successfully with a computer.

The term "viseme" is used in machine lip-reading to represent a visual cue or gesture which corresponds to a subgroup of phonemes, where the phonemes are indistinguishable in the visual speech signal. While a viseme is yet to be formally defined, we use the common working definition 'a viseme is a group of phonemes with identical appearance on the lips'. A phoneme is the smallest acoustic unit a human can utter. Because each viseme covers multiple phonemes, mapping between the units creates a many-to-one relationship. Many mappings have been presented, and we conduct an experiment to determine which mapping produces the most accurate classification. Our results show Lee's [82] is best. Lee's classification also outperforms machine lip-reading systems which use the popular Fisher [48] phoneme-to-viseme map.

Further to this, we propose three methods of deriving speaker-dependent phoneme-to-viseme maps and compare our new approaches to Lee's. Our results show the
sensitivity of phoneme clustering, and we use our new knowledge for our first suggested augmentation to the conventional lip-reading system.
Speaker independence in machine lip-reading classification is another unsolved obstacle. It has been observed, in the visual domain, that classifiers need training on the test subject to achieve the best classification; machine lip-reading is thus highly dependent upon the speaker. Speaker independence is the opposite of this: in other words, the classification of a speaker who is not present in the classifier's training data. We investigate the dependence of phoneme-to-viseme maps between speakers. Our results show that there is not high variability between visual cues, but there is high variability in trajectory between the visual cues of an individual speaker with the same ground truth. This implies a dependency upon the number of visemes within each set for each individual.
Finally, we investigate the optimal number of visemes within a set. We show that the phoneme-to-viseme maps in the literature rarely have enough visemes, and that the optimal number, which varies by speaker, ranges from 11 to 35. The last difficulty we address is decoding from visemes back to phonemes and into words. Traditionally this is completed using a language model whose unit is either the same as the classifier's (e.g. visemes or phonemes) or is words. In a novel approach, we use these optimal-range viseme sets within hierarchical training of phoneme-labelled classifiers. This new method of classifier training demonstrates a significant increase in classification accuracy with a word language network.
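To make the many-to-one phoneme-to-viseme relationship concrete, here is a small Python sketch; the groupings are simplified textbook examples, not Lee's [82] or Fisher's [48] actual maps.

    # Illustrative many-to-one phoneme-to-viseme mapping (simplified
    # example groupings; not the actual Lee or Fisher maps).
    VISEME_OF = {
        # bilabials look identical on the lips
        "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
        # labiodentals share one visual gesture
        "f": "V_labiodental", "v": "V_labiodental",
        # several alveolars are visually indistinguishable
        "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
    }

    def phonemes_to_visemes(phonemes: list[str]) -> list[str]:
        """Collapse a phoneme sequence to its (lossy) viseme sequence."""
        return [VISEME_OF[p] for p in phonemes]

    # Both sequences collapse to the same viseme string, which is why
    # decoding from visemes back to words needs a language model.
    print(phonemes_to_visemes(["p", "t"]))  # ['V_bilabial', 'V_alveolar']
    print(phonemes_to_visemes(["b", "d"]))  # same output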
Phoneme-based Video Indexing Using Phonetic Disparity Search
This dissertation presents and evaluates an approach to the video indexing problem, investigating a categorization method that transcribes audio content through Automatic Speech Recognition (ASR) combined with Dynamic Contextualization (DC), Phonetic Disparity Search (PDS) and Metaphone indexation. The suggested approach applies genome pattern-matching algorithms with computational summarization to build a database infrastructure that provides an indexed summary of the original audio content. PDS complements the contextual phoneme-indexing approach by optimizing topic search performance and accuracy in large video content structures. A prototype was established to translate news broadcast video into text and phonemes automatically using ASR utterance conversions. Each phonetic utterance extraction was then categorized, converted to Metaphones, and stored in a repository with contextual topical information attached and indexed for posterior search analysis. Following the original design strategy, a custom parallel interface was built to measure the capabilities of dissimilar phonetic queries and provide an interface for result analysis. The postulated solution provides evidence of superior topic matching when compared to traditional word and phoneme search methods. Experimental results demonstrate that PDS can be 3.7% better than the equivalent phoneme query, while Metaphone search proved to be 154.6% better than the equivalent phoneme search and 68.1% better than the equivalent word search.
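The core indexing idea can be sketched as follows; the phonetic key function below is a crude stand-in for the real Metaphone algorithm, and the index layout is an assumption for illustration, not the dissertation's design.

    # Toy phonetic index over ASR transcripts: each word is stored under a
    # crude phonetic key so near-homophone queries still match. The key
    # function is a simplistic stand-in for Metaphone, not the real thing.
    from collections import defaultdict

    def toy_phonetic_key(word: str) -> str:
        word = word.lower()
        # Keep the first letter, drop later vowels, collapse repeats.
        key = word[0]
        for ch in word[1:]:
            if ch in "aeiou" or ch == key[-1]:
                continue
            key += ch
        return key

    index: dict[str, list[tuple[str, float]]] = defaultdict(list)

    def index_utterance(video_id: str, timestamp: float, transcript: str) -> None:
        for word in transcript.split():
            index[toy_phonetic_key(word)].append((video_id, timestamp))

    def search(query: str) -> list[tuple[str, float]]:
        return index[toy_phonetic_key(query)]

    index_utterance("news_001", 12.5, "weather forecast for today")
    print(search("wether"))  # matches "weather": both key to 'wthr'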
Automatic Speech Recognition for ageing voices
With ageing, human voices undergo several changes which are typically characterised by increased hoarseness, breathiness, changes in articulatory patterns and a slower speaking rate. The focus of this thesis is to understand the impact of ageing on Automatic Speech Recognition (ASR) performance and to improve ASR accuracies for older voices.

Baseline results on three corpora indicate that the word error rates (WER) for older adults are significantly higher than those of younger adults, and the decrease in accuracy is greater for male speakers than for females.

Acoustic parameters such as jitter and shimmer, which measure glottal source disfluencies, were found to be significantly higher for older adults. However, the hypothesis that these changes explain the differences in WER between the two age groups proves incorrect: experiments with the artificial introduction of glottal source disfluencies into speech from younger adults show no significant impact on WERs. Changes in fundamental frequency, observed quite often in older voices, have only a marginal impact on ASR accuracies.

Analysis of phoneme errors between younger and older speakers shows a pattern of certain phonemes, especially lower vowels, being more affected by ageing; these changes, however, vary across speakers. Another factor strongly associated with ageing voices is a decrease in the rate of speech. Experiments analysing the impact of slower speaking rate on ASR accuracies indicate that insertion errors increase when decoding slower speech with models trained on relatively faster speech.

We then propose a way to characterise speakers in acoustic space based on speaker adaptation transforms, and observe that speakers (especially males) can be segregated with reasonable accuracy based on age. Inspired by this, we look at supervised hierarchical acoustic models based on gender and age. Significant improvements in word accuracies are achieved over the baseline results with such models. The idea is then extended to construct unsupervised hierarchical models, which also outperform the baseline models by a good margin.

Finally, we hypothesize that ASR accuracies can be improved by augmenting the adaptation data with speech from acoustically closest speakers, and propose a strategy to select the augmentation speakers. Experimental results on two corpora indicate that the hypothesis holds true only when the amount of available adaptation data is limited to a few seconds. The efficacy of such a speaker selection strategy is analysed for both younger and older adults.
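For reference, the jitter and shimmer measures mentioned above are commonly defined as relative cycle-to-cycle variability of the glottal period and peak amplitude; the sketch below uses these standard definitions and is illustrative, not code from the thesis.

    # Local jitter and shimmer: mean absolute difference between
    # consecutive glottal cycles, normalised by the overall mean.
    # Standard definitions; illustrative code, not from the thesis.

    def local_jitter(periods_ms: list[float]) -> float:
        """Cycle-to-cycle variability of the glottal period (relative)."""
        diffs = [abs(a - b) for a, b in zip(periods_ms, periods_ms[1:])]
        return (sum(diffs) / len(diffs)) / (sum(periods_ms) / len(periods_ms))

    def local_shimmer(amplitudes: list[float]) -> float:
        """Cycle-to-cycle variability of the peak amplitude (relative)."""
        diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
        return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))

    # Ageing voices tend to show higher values on both measures.
    print(local_jitter([5.0, 5.1, 4.9, 5.2]))    # ~0.0396
    print(local_shimmer([1.0, 0.95, 1.05, 0.9])) # ~0.1026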