Lip2AudSpec: Speech reconstruction from silent lip movements video
In this study, we propose a deep neural network for reconstructing
intelligible speech from silent lip-movement videos. We use the auditory
spectrogram as the spectral representation of speech, together with its
corresponding sound-generation method, resulting in more natural-sounding
reconstructed speech. Our proposed network consists of an autoencoder that
extracts bottleneck features from the auditory spectrogram, which are then
used as the target for our main lip-reading network, comprising CNN, LSTM,
and fully connected layers. Our experiments show that the autoencoder
reconstructs the original auditory spectrogram with 98% correlation and also
improves the quality of speech reconstructed by the main lip-reading network.
Our model, trained jointly on different speakers, is able to extract
individual speaker characteristics and gives promising results,
reconstructing intelligible speech with superior word-recognition accuracy.
Visual Speech Enhancement
When video is shot in a noisy environment, the voice of a speaker seen in the
video can be enhanced using the visible mouth movements, reducing background
noise. While most existing methods use audio-only inputs, improved performance
is obtained with our visual speech enhancement, based on an audio-visual neural
network. We include in the training data videos to which we added the voice of
the target speaker as background noise. Since the audio input alone is not
sufficient to separate the voice of a speaker from his own voice, the trained
model better exploits the visual input and generalizes well to different noise
types. The proposed model outperforms prior audio-visual methods on two public
lipreading datasets. It is also the first to be demonstrated on a dataset not
designed for lipreading, such as the weekly addresses of Barack Obama.
Comment: Accepted to Interspeech 2018. Supplementary video:
https://www.youtube.com/watch?v=nyYarDGpcY
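The augmentation described above, adding the target speaker's own voice as background noise, can be sketched as a simple SNR-controlled mix. The function name `mix_at_snr`, the 0 dB mixing level, and the synthetic signals are illustrative assumptions, not the paper's exact recipe:

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Mix `noise` into `clean` at the requested signal-to-noise ratio (dB).

    Both inputs are equal-length lists of float samples; the noise is
    scaled so that 10 * log10(P_clean / P_noise_scaled) == snr_db.
    """
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(s * s for s in noise) / len(noise)
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]

# Build one training example: the target speaker's own voice, time-shifted,
# serves as the background "noise" mixed in at 0 dB.
voice = [math.sin(0.05 * t) for t in range(16000)]
self_noise = voice[8000:] + voice[:8000]
noisy_input = mix_at_snr(voice, self_noise, snr_db=0.0)
```

Training on such self-mixed examples forces the model to rely on the visual stream, since the two audio sources are spectrally identical.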
Wavelet-based birdsong recognition for conservation : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Palmerston North, New Zealand
Listed in 2017 Dean's List of Exceptional Theses.
According to the International Union for the Conservation of Nature Red Data List,
nearly a quarter of the world's bird species are either threatened or at risk of extinction.
To be able to protect endangered species, we need accurate survey methods that reliably
estimate numbers and hence population trends. Acoustic monitoring is the most
commonly-used method to survey birds, particularly cryptic and nocturnal species,
not least because it is non-invasive, unbiased, and relatively time-effective. Unfortunately,
the resulting data still have to be analysed manually. The current practice,
manual spectrogram reading, is tedious, prone to bias due to observer variations, and
not reproducible.
While there is a large literature on automatic recognition of targeted recordings of
small numbers of species, automatic analysis of long field recordings has not been well
studied to date. This thesis considers this problem in detail, presenting experiments
demonstrating the true efficacy of recorders in natural environments under different
conditions, and then working to reduce the noise present in the recording, as well as to
segment and recognise a range of New Zealand native bird species.
The primary issues with field recordings are that the birds are at variable distances
from the recorder, that the recordings are corrupted by many different forms of noise,
that the environment affects the quality of the recorded sound, and that birdsong is
often relatively rare within a recording. Thus, methods of dealing with faint calls,
denoising, and effective segmentation are all needed before individual species can be
recognised reliably. Experiments presented in this thesis demonstrate clearly the effects
of distance and environment on recorded calls. Some of these results are unsurprising,
for example an inverse square relationship with distance is largely true. Perhaps more
surprising is that the height from which a call is transmitted has a significant effect on
the recorded sound. Statistical analyses of the experiments, which demonstrate many
significant environmental and sound factors, are presented.
Regardless of these factors, the recordings have noise present, and removing this
noise is helpful for reliable recognition. A method for denoising based on the wavelet
packet decomposition is presented and demonstrated to significantly improve the quality
of recordings. Following this, wavelets were also used to implement a call detection
algorithm that identifies regions of the recording with calls from a target bird species.
This algorithm is validated using four New Zealand native species namely Australasian
bittern (Botaurus poiciloptilus), brown kiwi (Apteryx mantelli ), morepork (Ninox novaeseelandiae),
and kakapo (Strigops habroptilus), but could be used for any species.
The results demonstrate high recall rates and tolerable false-positive rates when
compared to human experts.
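The denoising step above rests on wavelet coefficient thresholding: transform, shrink the small detail coefficients that noise dominates, then invert. A minimal pure-Python sketch of that idea, using a single-level Haar transform rather than the full wavelet packet decomposition the thesis develops:

```python
def haar_denoise(signal, threshold):
    """One-level Haar wavelet soft-threshold denoising (toy sketch).

    Splits the signal into approximation and detail coefficients,
    soft-thresholds the details, and reconstructs.
    """
    s = 2 ** 0.5
    approx = [(a + b) / s for a, b in zip(signal[0::2], signal[1::2])]
    detail = [(a - b) / s for a, b in zip(signal[0::2], signal[1::2])]

    def soft(x):
        # Shrink coefficients toward zero; kill those below the threshold.
        if abs(x) <= threshold:
            return 0.0
        return (abs(x) - threshold) * (1 if x > 0 else -1)

    detail = [soft(d) for d in detail]
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

# Slowly varying "call" plus alternating +/-0.1 noise.
smooth = [float(i % 8) for i in range(64)]
noisy = [x + (0.1 if i % 2 else -0.1) for i, x in enumerate(smooth)]
cleaned = haar_denoise(noisy, threshold=0.2)
```

The reconstructed signal sits measurably closer to the clean one than the noisy input does; the thesis's method generalizes this with deeper packet trees and data-driven threshold selection.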
Seeing Through Noise: Visually Driven Speaker Separation and Enhancement
Isolating the voice of a specific person while filtering out other voices or
background noises is challenging when video is shot in noisy environments. We
propose audio-visual methods to isolate the voice of a single speaker and
eliminate unrelated sounds. First, face motions captured in the video are used
to estimate the speaker's voice, by passing the silent video frames through a
video-to-speech neural network-based model. Then the speech predictions are
applied as a filter on the noisy input audio. This approach avoids using
mixtures of sounds in the learning process, as the number of such possible
mixtures is huge, and would inevitably bias the trained model. We evaluate our
method on two audio-visual datasets, GRID and TCD-TIMIT, and show that our
method attains significant SDR and PESQ improvements over the raw
video-to-speech predictions, and a well-known audio-only method.
Comment: Supplementary video: https://www.youtube.com/watch?v=qmsyj7vAzo
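Applying the speech prediction "as a filter" on the noisy audio can be realized as a soft mask over magnitude spectrograms. This is a hypothetical simplification; `apply_speech_filter` and the particular mask form are assumptions for illustration, not the authors' exact formulation:

```python
def apply_speech_filter(noisy_mag, predicted_mag, floor=1e-8):
    """Soft-mask a noisy magnitude spectrogram with a speech estimate.

    Spectrograms are nested lists (frames of frequency bins). Each bin is
    scaled by min(predicted / noisy, 1), so energy the visual model does
    not attribute to the speaker is attenuated and nothing is amplified.
    """
    filtered = []
    for n_frame, p_frame in zip(noisy_mag, predicted_mag):
        row = [min(p / (n + floor), 1.0) * n for n, p in zip(n_frame, p_frame)]
        filtered.append(row)
    return filtered

# Two frames, two frequency bins each.
noisy = [[1.0, 0.5], [0.2, 0.8]]
pred = [[0.9, 0.6], [0.0, 0.8]]
enhanced = apply_speech_filter(noisy, pred)
```

Because the mask is capped at 1, the filter can only suppress, never invent, energy, which matches the idea of using the visual prediction to select the speaker's time-frequency bins.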
A Home Security System Based on Smartphone Sensors
Several new smartphones are released every year. Many people upgrade to new phones, and their old phones are not put to any further use. In this paper, we explore the feasibility of using such retired smartphones and their on-board sensors to build a home security system. We observe that door-related events such as opening and closing have unique vibration signatures when compared to many types of environmental vibrational noise. These events can be captured by the accelerometer of a smartphone when the phone is mounted on a wall near a door. The rotation of a door can also be captured by the magnetometer of a smartphone when the phone is mounted on a door. We design machine learning and threshold-based methods to detect door opening events based on accelerometer and magnetometer data and build a prototype home security system that can detect door openings and notify the homeowner via email, SMS and phone calls upon break-in detection. To further augment our security system, we explore using the smartphone's built-in microphone to detect door and window openings across multiple doors and windows simultaneously. Experiments in a residential home show that the accelerometer-based detection can detect door open events with an accuracy higher than 98%, and magnetometer-based detection has 100% accuracy. By using the magnetometer method to automate the training phase of a neural network, we find that sound-based detection of door openings has an accuracy of 90% across multiple doors.
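The threshold-based branch of the accelerometer detector can be sketched in a few lines. The 1.5 m/s² deviation threshold, the refractory window, and the synthetic trace are illustrative assumptions; the learned classifiers the paper combines with thresholding are omitted:

```python
def detect_door_events(accel_mag, g=9.81, threshold=1.5, refractory=20):
    """Flag samples where accelerometer magnitude deviates from gravity.

    `accel_mag` is a list of magnitude readings (m/s^2). An event fires
    when |m - g| exceeds `threshold`; re-triggers within `refractory`
    samples of the last event are suppressed so one slam counts once.
    """
    events, last = [], -refractory
    for i, m in enumerate(accel_mag):
        if abs(m - g) > threshold and i - last >= refractory:
            events.append(i)
            last = i
    return events

# Synthetic 100-sample trace: quiet wall, one slam-like burst at sample 50.
trace = [9.81] * 100
for j, spike in enumerate([3.0, -2.5, 2.0, -1.0]):
    trace[50 + j] += spike
events = detect_door_events(trace)
```

The refractory window is what keeps the ringing vibration after a slam from registering as multiple door openings.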
The Conversation: Deep Audio-Visual Speech Enhancement
Our goal is to isolate individual speakers from multi-talker simultaneous
speech in videos. Existing works in this area have focussed on trying to
separate utterances from known speakers in controlled environments. In this
paper, we propose a deep audio-visual speech enhancement network that is able
to separate a speaker's voice given lip regions in the corresponding video, by
predicting both the magnitude and the phase of the target signal. The method is
applicable to speakers unheard and unseen during training, and for
unconstrained environments. We demonstrate strong quantitative and qualitative
results, isolating extremely challenging real-world examples.
Comment: To appear in Interspeech 2018. We provide supplementary material with
interactive demonstrations on
http://www.robots.ox.ac.uk/~vgg/demo/theconversatio
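Predicting both magnitude and phase means the enhanced complex spectrogram can be assembled directly, without reusing the noisy mixture's phase. A minimal sketch of that assembly step (the network producing the predictions is omitted, and `synth_complex_spec` is an assumed helper name):

```python
import cmath

def synth_complex_spec(magnitudes, phases):
    """Combine per-bin magnitude and phase predictions into a complex
    spectrogram (nested lists of frames), ready for an inverse STFT."""
    return [[m * cmath.exp(1j * p) for m, p in zip(m_frame, p_frame)]
            for m_frame, p_frame in zip(magnitudes, phases)]

# One frame, two bins: unit magnitude at phase pi/2, and 0.5 at phase 0.
spec = synth_complex_spec([[1.0, 0.5]], [[cmath.pi / 2, 0.0]])
```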
Expediting TTS Synthesis with Adversarial Vocoding
Recent approaches in text-to-speech (TTS) synthesis employ neural network
strategies to vocode perceptually-informed spectrogram representations directly
into listenable waveforms. Such vocoding procedures create a computational
bottleneck in modern TTS pipelines. We propose an alternative approach which
utilizes generative adversarial networks (GANs) to learn mappings from
perceptually-informed spectrograms to simple magnitude spectrograms which can
be heuristically vocoded. Through a user study, we show that our approach
significantly outperforms naïve vocoding strategies while being hundreds of
times faster than neural network vocoders used in state-of-the-art TTS systems.
We also show that our method can be used to achieve state-of-the-art results in
unsupervised synthesis of individual words of speech.
Comment: Published as a conference paper at INTERSPEECH 201