9 research outputs found

    Zero-shot keyword spotting for visual speech recognition in-the-wild

    Full text link
    Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information. This paper focuses on visual KWS for words unseen during training, a real-world, practical setting which so far has received no attention by the community. To this end, we devise an end-to-end architecture comprising (a) a state-of-the-art visual feature extractor based on spatiotemporal Residual Networks, (b) a grapheme-to-phoneme model based on sequence-to-sequence neural networks, and (c) a stack of recurrent neural networks which learn how to correlate visual features with the keyword representation. Different to prior works on KWS, which try to learn word representations merely from sequences of graphemes (i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder model which learns how to map words to their pronunciation. We demonstrate that our system obtains very promising visual-only KWS results on the challenging LRS2 database, for keywords unseen during training. We also show that our system outperforms a baseline which addresses KWS via automatic speech recognition (ASR), while it drastically improves over other recently proposed ASR-free KWS methods.Comment: Accepted at ECCV-201

    Speeching: Mobile Crowdsourced Speech Assessment to Support Self-Monitoring and Management for People with Parkinson's

    Get PDF
    We present Speeching, a mobile application that uses crowdsourcing to support the self-monitoring and management of speech and voice issues for people with Parkinson's (PwP). The application allows participants to audio record short voice tasks, which are then rated and assessed by crowd workers. Speeching then feeds these results back to provide users with examples of how they were perceived by listeners unconnected to them (thus not used to their speech patterns). We conducted our study in two phases. First we assessed the feasibility of utilising the crowd to provide ratings of speech and voice that are comparable to those of experts. We then conducted a trial to evaluate how the provision of feedback, using Speeching, was valued by PwP. Our study highlights how applications like Speeching open up new opportunities for self-monitoring in digital health and wellbeing, and provide a means for those without regular access to clinical assessment services to practice-and get meaningful feedback on-their speech

    Multilingual representations for low resource speech recognition and keyword search

    Get PDF
    © 2015 IEEE. This paper examines the impact of multilingual (ML) acoustic representations on Automatic Speech Recognition (ASR) and keyword search (KWS) for low resource languages in the context of the OpenKWS15 evaluation of the IARPA Babel program. The task is to develop Swahili ASR and KWS systems within two weeks using as little as 3 hours of transcribed data. Multilingual acoustic representations proved to be crucial for building these systems under strict time constraints. The paper discusses several key insights on how these representations are derived and used. First, we present a data sampling strategy that can speed up the training of multilingual representations without appreciable loss in ASR performance. Second, we show that fusion of diverse multilingual representations developed at different LORELEI sites yields substantial ASR and KWS gains. Speaker adaptation and data augmentation of these representations improves both ASR and KWS performance (up to 8.7% relative). Third, incorporating un-transcribed data through semi-supervised learning, improves WER and KWS performance. Finally, we show that these multilingual representations significantly improve ASR and KWS performance (relative 9% for WER and 5% for MTWV) even when forty hours of transcribed audio in the target language is available. Multilingual representations significantly contributed to the LORELEI KWS systems winning the OpenKWS15 evaluation

    ALBAYZIN 2018 spoken term detection evaluation: a multi-domain international evaluation in Spanish

    No full text
    corecore