Analysing the importance of different visual feature coefficients
A study is presented to determine the relative importance of different visual features for speech recognition, including pixel-based, model-based, contour-based and physical features. Analysis to determine the discriminability of features is performed through F-ratio and J-measures for both static and temporal derivatives, the results of which were found to correlate highly with speech recognition accuracy (r = 0.97). Principal component analysis is then used to combine all visual features into a single feature vector, and further analysis is performed on the resulting basis functions. An optimal feature vector is obtained which outperforms the best individual feature (AAM) with 93.5% word accuracy.
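As a rough illustration of the F-ratio mentioned above, the sketch below scores a single feature coefficient as the variance of the per-class means divided by the mean within-class variance. The use of population variance, the function name `f_ratio` and the toy data are assumptions for illustration; the abstract does not give the exact formulation used in the study.

```python
from statistics import fmean, pvariance


def f_ratio(samples_by_class):
    """F-ratio of one feature: between-class variance / mean within-class variance.

    samples_by_class maps a class label to a list of feature values.
    Population variance is an assumption made for this sketch.
    """
    class_means = [fmean(values) for values in samples_by_class.values()]
    within = fmean(pvariance(values) for values in samples_by_class.values())
    if within == 0:
        raise ValueError("within-class variance is zero; F-ratio is undefined")
    return pvariance(class_means) / within


# Toy data (hypothetical): one feature measured for two classes.
well_separated = {"a": [1, 2, 3], "b": [7, 8, 9]}
overlapping = {"a": [1, 2, 3], "b": [2, 3, 4]}

score_separated = f_ratio(well_separated)
score_overlapping = f_ratio(overlapping)
```

A larger score indicates that the class means are spread widely relative to the scatter inside each class, which is the sense in which a feature is more discriminative.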
Objective measures for predicting the intelligibility of spectrally smoothed speech with artificial excitation
A study is presented on how well objective measures of speech quality and intelligibility can predict the subjective intelligibility of speech that has undergone spectral envelope smoothing and simplification of its excitation. Speech modifications are made by resynthesising speech that has been spectrally smoothed. Objective measures are applied to the modified speech and include measures of speech quality, signal-to-noise ratio and intelligibility; the normalised frequency-weighted spectral distortion (NFD) measure is also proposed. The measures are compared to subjective intelligibility scores, where it is found that several have high correlation (|r| ≥ 0.7), with NFD achieving the highest correlation (r = −0.81).
Report on Domestic Violence and Law Enforcement in a Rural Setting: The Case of Eastern Kentucky
A report submitted by Neil Websdale to the Research and Creative Productions Committee in 1992 on domestic violence and the criminal justice response to that violence in Eastern Kentucky
A Comparison of Perceptually Motivated Loss Functions for Binary Mask Estimation in Speech Separation
This work proposes and compares perceptually motivated loss functions for deep learning based binary mask estimation for speech separation. Previous loss functions have focused on maximising classification accuracy of mask estimation but we now propose loss functions that aim to maximise the hit minus false-alarm (HIT-FA) rate which is known to correlate more closely to speech intelligibility. The baseline loss function is binary cross-entropy (CE), a standard loss function used in binary mask estimation, which maximises classification accuracy. We propose first a loss function that maximises the HIT-FA rate instead of classification accuracy. We then propose a second loss function that is a hybrid between CE and HIT-FA, providing a balance between classification accuracy and HIT-FA rate. Evaluations of the perceptually motivated loss functions with the GRID database show improvements to HIT-FA rate and ESTOI across babble and factory noises. Further tests then explore application of the perceptually motivated loss functions to a larger vocabulary dataset.
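The HIT-FA rate referred to above can be sketched as an evaluation metric on two binary masks. This is the metric only, not the differentiable loss functions the paper proposes, whose form the abstract does not give. The function name `hit_fa_rate` and the example masks are hypothetical.

```python
def hit_fa_rate(target_mask, estimated_mask):
    """Return (HIT, FA, HIT - FA) for two equal-length binary masks.

    HIT: fraction of target-dominant units (1s in target_mask) estimated as 1.
    FA:  fraction of noise-dominant units (0s in target_mask) estimated as 1.
    """
    if len(target_mask) != len(estimated_mask):
        raise ValueError("masks must have the same length")
    pairs = list(zip(target_mask, estimated_mask))
    n_target = sum(1 for t, _ in pairs if t == 1)
    n_noise = len(pairs) - n_target
    hits = sum(1 for t, e in pairs if t == 1 and e == 1)
    false_alarms = sum(1 for t, e in pairs if t == 0 and e == 1)
    hit = hits / n_target if n_target else 0.0
    fa = false_alarms / n_noise if n_noise else 0.0
    return hit, fa, hit - fa


# Hypothetical flattened time-frequency masks.
target = [1, 1, 1, 1, 0, 0, 0, 0]
estimate = [1, 1, 1, 0, 1, 0, 0, 0]
hit, fa, hit_fa = hit_fa_rate(target, estimate)
```

In this example the estimate has the same classification accuracy (75%) as a mask that misses two target units and makes no false alarms, yet the two cases have different false-alarm rates; HIT-FA reports hits and false alarms separately, which accuracy alone does not.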
The Effect of Real-Time Constraints on Automatic Speech Animation
Machine learning has previously been applied successfully to speech-driven facial animation. To account for carry-over and anticipatory coarticulation a common approach is to predict the facial pose using a symmetric window of acoustic speech that includes both past and future context. Using future context limits this approach for animating the faces of characters in real-time and networked applications, such as online gaming. An acceptable latency for conversational speech is 200 ms and typically network transmission times will consume a significant part of this. Consequently, we consider asymmetric windows by investigating the extent to which decreasing the future context affects the quality of predicted animation using both deep neural networks (DNNs) and bi-directional LSTM recurrent neural networks (BiLSTMs). Specifically we investigate future contexts from 170 ms (fully-symmetric) to 0 ms (fully-asymmetric).
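The difference between symmetric and asymmetric context windows can be sketched as follows. The helper `context_window`, the edge-repetition padding and the 10 ms frame step are assumptions for illustration; the abstract does not specify the framing or padding used.

```python
def context_window(frames, t, past, future):
    """Return the frames from t - past to t + future inclusive.

    Indices that fall outside the sequence repeat the nearest edge frame
    (a padding choice assumed for this sketch).
    """
    if not frames:
        raise ValueError("frames must not be empty")
    last = len(frames) - 1
    return [frames[min(max(i, 0), last)] for i in range(t - past, t + future + 1)]


FRAME_STEP_MS = 10  # hypothetical frame step, not taken from the abstract

frames = list(range(10))  # stand-in for a sequence of acoustic feature vectors

symmetric = context_window(frames, t=5, past=2, future=2)   # needs 2 future frames
asymmetric = context_window(frames, t=5, past=2, future=0)  # no look-ahead
lookahead_ms = 2 * FRAME_STEP_MS  # latency added by the symmetric window
```

With `future=0` the prediction for frame `t` can be made as soon as that frame arrives, so the window itself adds no look-ahead latency; every additional future frame adds one frame step of delay.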
Audio speech enhancement using masks derived from visual speech
The aim of the work in this thesis is to explore how visual speech can be used within monaural masking based speech enhancement to remove interfering noise, with a focus on improving intelligibility. Visual speech has the advantage of not being corrupted by interfering noise and can therefore provide additional information within a speech enhancement framework. More specifically, this work considers audio-only, visual-only and audio-visual methods of mask estimation within deep learning architectures with application to both seen and unseen noise types.
To estimate masks from audio and visual speech information, models are developed using deep neural networks, specifically feed-forward (DNN) and recurrent (RNN) neural networks for temporal modelling and convolutional neural networks (CNN) for visual feature extraction. It was found that the proposed layer normalised bi-directional feed-forward hybrid network using gated recurrent units (LNBiGRUDNN) provided the best performance across all objective measures for temporal modelling. Also, extracting visual features using both pre-trained and end-to-end trained CNNs outperforms traditional active appearance model (AAM) feature extraction across all noise types and SNRs tested. End-to-end CNNs trained on images focused on mouth-only regions-of-interest provided the best performance for both audio-visual and visual-only models.
The best performing audio-visual masking method outperformed both audio-only and visual-only masking methods in both matched and unseen noise type and SNR dependent conditions. For example, in unseen cafeteria babble noise at -10 dB, audio-visual masking had an ESTOI of 46.8, while audio-only and visual-only masking scored 15.0 and 42.4 respectively, and the unprocessed audio scored 9.3. Formal tests show that visual information is critical for improving intelligibility at low SNRs and for generalisation to unseen noise conditions. Experiments in large unconstrained vocabulary speech confirm that the model architectures and approaches developed can generalise to unconstrained speech across noise independent conditions and can be considered for monaural speaker dependent real-world applications.
The Magnetic Distortion Calibration System of the LHCb RICH1 Detector
The LHCb RICH1 detector uses hybrid photon detectors (HPDs) as its optical
sensors. A calibration system has been constructed to provide corrections for
distortions that are primarily due to external magnetic fields. We describe
here the system design, construction, operation and performance. Comment: 9 pages, 14 figures
I spy with my little eye: a history of the policing of class and gender relations in Eugene, Oregon (USA)
My thesis is that local police in Eugene and Lane County, Oregon,
have been integral parts of a process of governmentality which was
directed at the constitution and reconstitution of various forms of
social order. In terms of class relations we find police mediating and
managing a number of antagonisms. This management role took both
coercive and consensual forms and was largely concerned with the historical
regulation of the proletariat. We witness a more passive role
for police in the field of patriarchy. Here law enforcement strategies
were non-interventionist vis-à-vis domestic violence, rape and prostitution.
This passivity tended to reproduce the sovereign powers of men
over women. In order to grasp the historical function of policing I
argue that we must consider its utility in terms of both class and
gender relations. While selective policing served to ensure the ongoing
governability of the increasing numbers of male wage workers, it also
allowed men in general to remain as sovereigns within families.
In Section I, I draw upon Marxism, Feminism, Poststructuralism and
Phenomenology to make explicit my theoretical and methodological
approach. My recognition of the importance of human agency is reflected
in my use of qualitative sources such as oral histories, government
documents, newspapers and court archival material. These sources are
augmented by a guarded quantitative analysis of census data, crime
statistics and police annual reports. Sections II and III provide
historical outlines of national, state and local levels of class (II)
and gender (III) relations respectively. In Section IV, I discuss the
rise of local policing and its relationship to other forms of
governmentality. This leads me into a detailed appreciation of the
policing of class (V) and gender conflict (VI).
Speaker-independent speech animation using perceptual loss functions and synthetic data
We propose a real-time speaker-independent speech-to-facial animation system that predicts lip and jaw movements on a reference face for audio speech taken from any speaker. Our approach is motivated by two key observations: 1) Speaker-independent facial animation can be generated from phoneme labels, but to perform this automatically a speech recogniser is needed which, due to contextual look-ahead, introduces too much time lag. 2) Audio-driven speech animation can be performed in real-time but requires large, multi-speaker audio-visual speech datasets, of which there are few. We adopt a novel three-stage training procedure that leverages the advantages of each approach. First we train a phoneme-to-visual speech model from a large single-speaker audio-visual dataset. Next, we use this model to generate the synthetic visual component of a large multi-speaker audio dataset for which the video is not available. Finally, we learn an audio-to-visual speech mapping using the synthetic visual features as the target. Furthermore, we increase the realism of the predicted facial animation by introducing two perceptually-based loss functions that aim to improve mouth closures and openings. The proposed method and loss functions are evaluated objectively using mean square error, global variance and a new metric that measures the extent of mouth opening. Subjective tests show that our approach produces facial animation comparable to that produced from phoneme sequences and that improved mouth closures, particularly for bilabial closures, are achieved.