Audio-Visual Speech Recognition using Red Exclusion and Neural Networks
PO BOX Q534, QVB POST OFFICE, SYDNEY,
AUSTRALIA, 123
Transfer Learning from Audio-Visual Grounding to Speech Recognition
Transfer learning aims to reduce the amount of data required to excel at a
new task by re-using the knowledge acquired from learning other related tasks.
This paper proposes a novel transfer learning scenario, which distills robust
phonetic features from grounding models that are trained to tell whether an image
and a spoken utterance are semantically correlated, without using any textual
transcripts. Because the semantics of speech are largely determined by its lexical
content, grounding models learn to preserve phonetic information while
disregarding uncorrelated factors, such as speaker and channel. To study the
properties of features distilled from different layers, we use the features from each
layer separately as input to train multiple speech recognition models. Empirical results
demonstrate that layers closer to the input retain more phonetic information, while
subsequent layers exhibit greater invariance to domain shift. Moreover, whereas
most previous studies include the speech recognition training data when training the
feature extractor, our grounding models are not trained on any of those data,
indicating broader applicability to new domains.
Comment: Accepted to Interspeech 2019. 4 pages, 2 figures
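The layer-wise setup described above (a frozen encoder whose per-layer activations each feed a separate recogniser) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's model: `ToySpeechEncoder`, its layer sizes, the ReLU layers and the random weights are all hypothetical stand-ins for the speech branch of a trained grounding model, whose architecture this abstract does not describe.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


class ToySpeechEncoder:
    """Hypothetical stand-in for the speech branch of a grounding model.

    Weights are random here; in the paper's setting they would come from a
    model trained only on image/speech pairs, then frozen.
    """

    def __init__(self, dims, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = [rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
                        for d_in, d_out in zip(dims[:-1], dims[1:])]

    def layer_features(self, frames):
        """Return the activations of every layer for a (T, D) frame matrix."""
        feats = []
        h = frames
        for w in self.weights:
            h = relu(h @ w)
            feats.append(h)  # one feature set per layer
        return feats


encoder = ToySpeechEncoder(dims=[40, 128, 128, 64])
# 100 frames of hypothetical 40-dimensional filterbank input.
utterance = np.random.default_rng(1).standard_normal((100, 40))
per_layer = encoder.layer_features(utterance)
for i, f in enumerate(per_layer, start=1):
    # Each feature set would be the input to its own speech recognition model.
    print(f"layer {i}: {f.shape}")
```

Keeping the encoder frozen mirrors the point that the feature extractor never sees the recognition training data; only the downstream recognisers are trained.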
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective.
The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives that measure the performance of multimedia search engines.
From a socio-economic perspective, we take stock of the impact and legal consequences of these technical advances and point out future directions of research.
Adaptive threshold optimisation for colour-based lip segmentation in automatic lip-reading systems
A thesis submitted to the Faculty of Engineering and the Built Environment,
University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for
the degree of Doctor of Philosophy.
Johannesburg, September 2016
Having survived the ordeal of a laryngectomy, the patient must come to terms with
the resulting loss of speech. With recent advances in portable computing power,
automatic lip-reading (ALR) may become a viable approach to voice restoration. This
thesis addresses the image processing aspect of ALR and makes three contributions
to colour-based lip segmentation.
The first contribution concerns the colour transform used to enhance the contrast
between the lips and skin. This thesis presents the most comprehensive study to
date by measuring the overlap between lip and skin histograms for 33 different
colour transforms. The hue component of HSV obtains the lowest overlap of 6.15%,
and results show that selecting the correct transform can increase the segmentation
accuracy by up to three times.
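The lip/skin histogram comparison above can be sketched for the hue channel as follows. This is a hedged illustration: the abstract does not define its overlap measure, so histogram intersection (the sum of bin-wise minima of two normalised histograms) is an assumption here, as are the bin count and the synthetic pixel values, which are not data from the thesis.

```python
import colorsys

import numpy as np


def hue_channel(rgb):
    """Hue component of HSV, in [0, 1), for an (N, 3) array of RGB values in [0, 1]."""
    return np.array([colorsys.rgb_to_hsv(r, g, b)[0] for r, g, b in rgb])


def histogram_overlap(a, b, bins=64, value_range=(0.0, 1.0)):
    """Assumed overlap measure: histogram intersection of two normalised histograms.

    Returns 0.0 for disjoint distributions and 1.0 for identical ones.
    """
    ha, _ = np.histogram(a, bins=bins, range=value_range)
    hb, _ = np.histogram(b, bins=bins, range=value_range)
    ha = ha / ha.sum()
    hb = hb / hb.sum()
    return float(np.minimum(ha, hb).sum())


# Synthetic example with hypothetical lip and skin colours.
rng = np.random.default_rng(0)
lip_rgb = np.clip([0.70, 0.30, 0.35] + 0.03 * rng.standard_normal((500, 3)), 0.0, 1.0)
skin_rgb = np.clip([0.80, 0.60, 0.50] + 0.03 * rng.standard_normal((500, 3)), 0.0, 1.0)
overlap = histogram_overlap(hue_channel(lip_rgb), hue_channel(skin_rgb))
print(f"hue histogram overlap: {100 * overlap:.2f}%")
```

Hue is circular, so reddish lip pixels can fall near both 0 and 1; a plain histogram like this one treats those ends as far apart, which a production comparison might need to handle.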
The second contribution is the development of a new lip segmentation algorithm
that utilises the best colour transforms from the comparative study. The algorithm
is tested on 895 images and achieves a percentage overlap (OL) of 92.23% and a segmentation
error (SE) of 7.39%.
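The two evaluation figures can be computed from binary masks as sketched below. The abstract does not define OL and SE, so the formulas are an assumption based on definitions commonly used in the lip segmentation literature: OL = 2|A1 ∩ A2| / (|A1| + |A2|) × 100%, and SE = (OLE + ILE) / (2·TL) × 100%, where OLE counts background pixels labelled as lip, ILE counts lip pixels labelled as background, and TL is the number of ground-truth lip pixels.

```python
import numpy as np


def overlap_and_error(pred, truth):
    """Percentage overlap (OL) and segmentation error (SE) for binary lip masks.

    Assumed definitions (common in the lip segmentation literature):
      OL = 2 |A1 & A2| / (|A1| + |A2|) * 100
      SE = (OLE + ILE) / (2 * TL) * 100
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    inter = np.logical_and(pred, truth).sum()
    ol = 200.0 * inter / (pred.sum() + truth.sum())
    ole = np.logical_and(pred, ~truth).sum()  # background labelled as lip
    ile = np.logical_and(~pred, truth).sum()  # lip labelled as background
    se = 100.0 * (ole + ile) / (2.0 * truth.sum())
    return ol, se


truth = np.zeros((4, 4), dtype=bool)
truth[1:3, :] = True       # 8 ground-truth lip pixels
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:] = True       # 6 predicted pixels, all inside the true lip region
ol, se = overlap_and_error(pred, truth)
print(f"OL = {ol:.2f}%, SE = {se:.2f}%")  # OL = 85.71%, SE = 12.50%
```

Under these definitions a perfect segmentation gives OL = 100% and SE = 0%, so higher OL and lower SE are both better.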
The third contribution focuses on the impact of the histogram threshold on the
segmentation accuracy, and introduces a novel technique called Adaptive Threshold
Optimisation (ATO) to select a better threshold value. The first stage of ATO
incorporates -SVR to train the lip shape model. ATO then uses feedback of shape
information to validate and optimise the threshold. After applying ATO, the SE
decreases from 7.65% to 6.50%, corresponding to an absolute improvement of 1.15 pp
or a relative improvement of 15.1%. While this thesis concerns lip segmentation in
particular, ATO is a threshold selection technique that can be used in various
segmentation applications.
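The idea of using shape feedback to adjust a histogram threshold can be illustrated with the sketch below. It is not the thesis's ATO procedure. The learned SVR lip shape model is replaced by a hypothetical `shape_score` that only compares the mask's area fraction with an assumed target, and the greedy step-by-step search, the step size, and the toy input are illustrative assumptions.

```python
import numpy as np


def segment(channel, threshold):
    """Binary lip mask: pixels whose transformed colour value exceeds the threshold."""
    return channel > threshold


def shape_score(mask, expected_area_fraction=0.25):
    """Hypothetical shape check (lower is better).

    Placeholder for a learned lip shape model: it only compares the mask's
    area fraction with an assumed target value.
    """
    return abs(mask.mean() - expected_area_fraction)


def adaptive_threshold(channel, initial, step=0.01, max_iter=100):
    """Greedy feedback loop: move the threshold while the shape score improves."""
    t = initial
    best = shape_score(segment(channel, t))
    for _ in range(max_iter):
        candidates = [t - step, t + step]
        scores = [shape_score(segment(channel, c)) for c in candidates]
        i = int(np.argmin(scores))
        if scores[i] >= best:
            break  # no neighbouring threshold gives a more plausible shape
        t, best = candidates[i], scores[i]
    return t, best


# Toy "transformed image": values rising smoothly from 0 to 1.
channel = np.linspace(0.0, 1.0, 100).reshape(10, 10)
t, score = adaptive_threshold(channel, initial=0.5)
print(f"threshold {t:.2f}, shape score {score:.3f}")  # threshold 0.75, shape score 0.000
```

A greedy local search keeps the sketch short; it can stall on plateaus where neighbouring thresholds produce the same mask, which a real implementation would need to handle.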