
    A Wavelet Transform Module for a Speech Recognition Virtual Machine

    This work explores the trade-offs between time and frequency information during the feature extraction stage of an automatic speech recognition (ASR) system when wavelet transform (WT) features are used instead of Mel-frequency cepstral coefficients (MFCCs), as well as the benefits of combining WT and MFCC features as inputs to an ASR system. A virtual machine from the Speech Recognition Virtual Kitchen resource (www.speechkitchen.org) is used as the context for implementing a wavelet signal processing module in a speech recognition system. Contributions include a comparison of MFCC and WT features on small and large vocabulary tasks, the application of combined MFCC and WT features to a noisy-environment task, and the implementation of an expanded signal processing module in an existing recognition system. The updated virtual machine, which allows straightforward comparisons of signal processing approaches, is available for research and education purposes.
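    As a rough illustration of the two feature types the abstract compares, the sketch below extracts MFCCs with librosa and subband energies from a wavelet decomposition with PyWavelets, then stacks them as a combined input. This is a minimal sketch, not the paper's actual VM module: the file name, Daubechies-4 wavelet, 5-level decomposition, and 25 ms framing are all assumptions.

    ```python
    # Minimal sketch of the two feature streams compared above.
    # Not the paper's module; wavelet family, level, and framing are assumptions.
    import librosa
    import numpy as np
    import pywt

    # Load one utterance at a typical ASR sampling rate (file name is hypothetical).
    signal, sr = librosa.load("utterance.wav", sr=16000)

    # Baseline features: 13 Mel-frequency cepstral coefficients per frame.
    mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

    # Wavelet features: log energy of each subband from a 5-level db4
    # decomposition, computed over 25 ms frames with 50% overlap.
    frame_len = int(0.025 * sr)
    frames = librosa.util.frame(signal, frame_length=frame_len,
                                hop_length=frame_len // 2)
    wt_feats = np.array([
        [np.log(np.sum(c ** 2) + 1e-10) for c in pywt.wavedec(f, "db4", level=5)]
        for f in frames.T
    ]).T

    # Combined input: stack both feature streams frame by frame.
    n = min(mfccs.shape[1], wt_feats.shape[1])
    combined = np.vstack([mfccs[:, :n], wt_feats[:, :n]])
    print(mfccs.shape, wt_feats.shape, combined.shape)
    ```

    The subband log-energies trade the MFCCs' fixed Mel-frequency resolution for the wavelet transform's multi-resolution time-frequency tiling, which is the trade-off the paper investigates.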

    Improving Speech Recognition for Interviews with both Clean and Telephone Speech

    High-quality automatic speech recognition (ASR) depends on the context of the speech: cleanly recorded speech yields better results than speech recorded over telephone lines. In telephone speech, the signal is band-pass filtered, which limits the frequencies available for computation. In addition, the transmitted speech signal may be distorted by noise, causing higher word error rates (WER). The main goal of this research project is to examine approaches to improving recognition of telephone speech, while maintaining or improving results for clean speech, in mixed telephone-clean speech recordings, by reducing mismatches between the test data and the available models. The test data consists of recorded interviews in which the interviewer was near a hand-held, single-channel recorder and the interviewee was on a speakerphone with its speaker near the recorder. Available resources include the Eesen offline transcriber and two acoustic models, trained on clean data and on telephone data (Switchboard) respectively. The Eesen offline transcriber runs on a virtual machine available through the Speech Recognition Virtual Kitchen and transcribes audio into text using a deep recurrent neural network acoustic model and a weighted finite state transducer decoder. This project addresses the high WER that results when telephone speech is decoded with cleanly-trained models by 1) replacing the clean model with a telephone model and 2) analyzing and addressing errors through data cleaning, correcting audio segmentation, and adding words to the dictionary. These approaches reduced the overall WER. This paper includes an overview of the transcriber, the acoustic models, and the methods used to improve speech recognition, as well as results of transcription performance. We expect these approaches to reduce the WER on telephone speech. Future work includes applying a variety of filters to the speech signal, which could reduce both the additive and convolutional noise introduced by the telephone channel.
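    The band-limiting the abstract describes can be simulated directly: a telephone channel passes roughly 300-3400 Hz. The sketch below applies such a band-pass filter to clean audio with SciPy to reproduce the clean-model/telephone-speech mismatch; the file names, band edges, and filter order are assumptions for illustration, not values from the project's transcriber pipeline.

    ```python
    # Sketch: simulate the telephone channel's band-pass filtering (~300-3400 Hz)
    # that causes the model mismatch described above. Band edges and filter
    # order are assumptions, not values taken from the project.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import butter, sosfiltfilt

    # Load a (hypothetical) cleanly recorded interview; mix down to mono.
    sr, clean = wavfile.read("clean_interview.wav")
    clean = clean.astype(np.float64)
    if clean.ndim > 1:
        clean = clean.mean(axis=1)

    # 4th-order Butterworth band-pass over the nominal telephone band,
    # applied forward and backward for zero phase distortion.
    sos = butter(4, [300.0, 3400.0], btype="bandpass", fs=sr, output="sos")
    telephone_like = sosfiltfilt(sos, clean)

    wavfile.write("telephone_like.wav", sr, telephone_like.astype(np.int16))
    ```

    Decoding such band-limited audio with a cleanly-trained acoustic model should exhibit the elevated WER the project targets; swapping in a Switchboard-trained (telephone) model is step 1 of the approach the abstract describes.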

    Structured evaluation of virtual environments for special-needs education

    This paper describes the development of a structured approach to evaluating experiential and communication virtual learning environments (VLEs) designed specifically for use in the education of children with severe learning difficulties at the Shepherd special-needs school in Nottingham, UK. Constructivist learning theory was used as the basis for an evaluation framework, which was applied to the design of three VLEs and to how students used them with respect to this learning theory. From an observational field study of student-teacher pairs using the VLEs, 18 behaviour categories were identified as relevant to five of the seven constructivist principles defined by Jonassen (1994). Analysis of student-teacher behaviour was used to provide evidence for, or against, each constructivist principle. The results show that the three VLEs meet the constructivist principles in very different ways, and recommendations for design modifications are put forward.

    Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

    First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even their intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict non-scripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants of 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labeled for a total of 39.6K action segments and 454.3K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground truths based on these narrations. We describe our object, action, and anticipation challenges, and evaluate several baselines over two test splits: seen and unseen kitchens. Published at the European Conference on Computer Vision (ECCV) 2018. Dataset and project page: http://epic-kitchens.github.io
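    The action-segment annotations for datasets of this kind are typically distributed as CSV tables; a minimal sketch of inspecting such a table with pandas follows. The file name and column names here are assumptions about the public release's schema (check the project page above for the actual format).

    ```python
    # Sketch: inspect action-segment annotations with pandas.
    # File and column names are assumed, not confirmed against the release.
    import pandas as pd

    labels = pd.read_csv("EPIC_train_action_labels.csv")

    # Each row is one narrated action segment: who, where, when, and what.
    print(len(labels), "action segments")
    print(labels[["participant_id", "video_id", "start_timestamp",
                  "stop_timestamp", "verb", "noun"]].head())

    # Class counts over verbs give a quick feel for the label distribution.
    print(labels["verb"].value_counts().head(10))
    ```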