9 research outputs found

Zero-shot keyword spotting for visual speech recognition in-the-wild

Author: Fei Tao
JS Chung
K Audhkhasi
K He
M Cooke
S Fernández
S Hochreiter
S Watanabe
Z Akata
Publication date: 25/07/2018

Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information. This paper focuses on visual KWS for words unseen during training, a real-world, practical setting which so far has received no attention from the community. To this end, we devise an end-to-end architecture comprising (a) a state-of-the-art visual feature extractor based on spatiotemporal Residual Networks, (b) a grapheme-to-phoneme model based on sequence-to-sequence neural networks, and (c) a stack of recurrent neural networks which learn how to correlate visual features with the keyword representation. Unlike prior work on KWS, which tries to learn word representations merely from sequences of graphemes (i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder model which learns how to map words to their pronunciation. We demonstrate that our system obtains very promising visual-only KWS results on the challenging LRS2 database, for keywords unseen during training. We also show that our system outperforms a baseline which addresses KWS via automatic speech recognition (ASR), and that it drastically improves over other recently proposed ASR-free KWS methods.

Comment: Accepted at ECCV-201

arXiv.org e-Print Archive

Crossref

Speeching: Mobile Crowdsourced Speech Assessment to Support Self-Monitoring and Management for People with Parkinson's

Author: Audhkhasi Kartik
Bigham Jeffrey P.
Cavender Anna C.
Côté Nicolas
Evanini Keelan
Goto Masataka
McGraw Ian
Miller Nick
Parent Gabriel
Putnam R
Swan M
Wicks Paul
Wolters Maria K.
Ziegler Wolfram
Publication date: 01/01/2016

We present Speeching, a mobile application that uses crowdsourcing to support the self-monitoring and management of speech and voice issues for people with Parkinson's (PwP). The application allows participants to audio record short voice tasks, which are then rated and assessed by crowd workers. Speeching then feeds these results back to provide users with examples of how they were perceived by listeners unconnected to them (and thus not used to their speech patterns). We conducted our study in two phases. First, we assessed the feasibility of utilising the crowd to provide ratings of speech and voice that are comparable to those of experts. We then conducted a trial to evaluate how the provision of feedback, using Speeching, was valued by PwP. Our study highlights how applications like Speeching open up new opportunities for self-monitoring in digital health and wellbeing, and provide a means for those without regular access to clinical assessment services to practice, and get meaningful feedback on, their speech.

Northumbria Research Link

Crossref

Edinburgh Research Explorer

Lancaster E-Prints

Explore Bristol Research

Multilingual representations for low resource speech recognition and keyword search

Author: Audhkhasi K
Cui J
Cui X
Gales MJF
Golik P
Kingsbury B
Kislal E
Knill KM
Mangu L
Ney H
Nussbaum-Thom M
Picheny M
Ragni A
Ramabhadran B
Schluter R
Sethy A
Tüske Z
Wang H
Woodland P
Publication venue: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015 - Proceedings
Publication date: 01/01/2015

© 2015 IEEE. This paper examines the impact of multilingual (ML) acoustic representations on Automatic Speech Recognition (ASR) and keyword search (KWS) for low resource languages in the context of the OpenKWS15 evaluation of the IARPA Babel program. The task is to develop Swahili ASR and KWS systems within two weeks using as little as 3 hours of transcribed data. Multilingual acoustic representations proved to be crucial for building these systems under strict time constraints. The paper discusses several key insights on how these representations are derived and used. First, we present a data sampling strategy that can speed up the training of multilingual representations without appreciable loss in ASR performance. Second, we show that fusion of diverse multilingual representations developed at different LORELEI sites yields substantial ASR and KWS gains. Speaker adaptation and data augmentation of these representations improve both ASR and KWS performance (up to 8.7% relative). Third, incorporating un-transcribed data through semi-supervised learning improves WER and KWS performance. Finally, we show that these multilingual representations significantly improve ASR and KWS performance (relative 9% for WER and 5% for MTWV) even when forty hours of transcribed audio in the target language is available. Multilingual representations significantly contributed to the LORELEI KWS systems winning the OpenKWS15 evaluation.

Publikationsserver der RWTH Aachen University

Apollo (Cambridge)

White Rose Research Online

CUED - Cambridge University Engineering Department