You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation
Data augmentation is one of the most effective ways to make end-to-end
automatic speech recognition (ASR) perform close to the conventional hybrid
approach, especially when dealing with low-resource tasks. Using recent
advances in speech synthesis (text-to-speech, or TTS), we build our TTS system
on an ASR training database and then extend the data with synthesized speech to
train a recognition model. We argue that, when the training data amount is
relatively low, this approach can allow an end-to-end model to reach hybrid
systems' quality. For an artificial low-to-medium-resource setup, we compare
the proposed augmentation with the semi-supervised learning technique. We also
investigate the influence of vocoder usage on final ASR performance by
comparing the Griffin-Lim algorithm with our modified LPCNet. When applied with an
external language model, our approach outperforms a semi-supervised setup for
LibriSpeech test-clean and is only 33% worse than a comparable supervised setup.
Our system establishes a competitive result for end-to-end ASR trained on
the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on
test-other.
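The results above are reported as word error rate (WER). As a minimal sketch of how that metric is commonly computed (word-level edit distance divided by the number of reference words), and not the authors' scoring pipeline, the following Python function may help; the name `word_error_rate` and the whitespace tokenization are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting, which is
    an assumption and may differ from the scoring used in the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition, a WER of 4.3% corresponds to roughly 4.3 word-level errors (substitutions, deletions, and insertions combined) per 100 reference words.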
Synesthesia: Detecting Screen Content via Remote Acoustic Side Channels
We show that subtle acoustic noises emanating from within computer screens
can be used to detect the content displayed on the screens. This sound can be
picked up by ordinary microphones built into webcams or screens, and is
inadvertently transmitted to other parties, e.g., during a videoconference call
or archived recordings. It can also be recorded by a smartphone or "smart
speaker" placed on a desk next to the screen, or from as far as 10 meters away
using a parabolic microphone.
Empirically demonstrating various attack scenarios, we show how this channel
can be used for real-time detection of on-screen text, or users' input into
on-screen virtual keyboards. We also demonstrate how an attacker can analyze
the audio received during a video call (e.g., on Google Hangout) to infer whether
the other side is browsing the web in lieu of watching the video call, and
which web site is displayed on their screen.