4,902 research outputs found

    End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models

    Full text link
    Speech activity detection (SAD) plays an important role in current speech processing systems, including automatic speech recognition (ASR). SAD is particularly difficult in environments with acoustic noise. A practical solution is to incorporate visual information, increasing the robustness of the SAD approach. An audiovisual system has the advantage of being robust to different speech modes (e.g., whisper speech) or background noise. Recent advances in audiovisual speech processing using deep learning have opened opportunities to capture in a principled way the temporal relationships between acoustic and visual features. This study explores this idea proposing a \emph{bimodal recurrent neural network} (BRNN) framework for SAD. The approach models the temporal dynamic of the sequential audiovisual data, improving the accuracy and robustness of the proposed SAD system. Instead of estimating hand-crafted features, the study investigates an end-to-end training approach, where acoustic and visual features are directly learned from the raw data during training. The experimental evaluation considers a large audiovisual corpus with over 60.8 hours of recordings, collected from 105 speakers. The results demonstrate that the proposed framework leads to absolute improvements up to 1.2% under practical scenarios over a VAD baseline using only audio implemented with deep neural network (DNN). The proposed approach achieves 92.7% F1-score when it is evaluated using the sensors from a portable tablet under noisy acoustic environment, which is only 1.0% lower than the performance obtained under ideal conditions (e.g., clean speech obtained with a high definition camera and a close-talking microphone).Comment: Submitted to Speech Communicatio

    Cost-Driven Hardware-Software Co-Optimization of Machine Learning Pipelines

    Full text link
    Researchers have long touted a vision of the future enabled by a proliferation of internet-of-things devices, including smart sensors, homes, and cities. Increasingly, embedding intelligence in such devices involves the use of deep neural networks. However, their storage and processing requirements make them prohibitive for cheap, off-the-shelf platforms. Overcoming those requirements is necessary for enabling widely-applicable smart devices. While many ways of making models smaller and more efficient have been developed, there is a lack of understanding of which ones are best suited for particular scenarios. More importantly for edge platforms, those choices cannot be analyzed in isolation from cost and user experience. In this work, we holistically explore how quantization, model scaling, and multi-modality interact with system components such as memory, sensors, and processors. We perform this hardware/software co-design from the cost, latency, and user-experience perspective, and develop a set of guidelines for optimal system design and model deployment for the most cost-constrained platforms. We demonstrate our approach using an end-to-end, on-device, biometric user authentication system using a $20 ESP-EYE board

    Anticipatory Mobile Computing: A Survey of the State of the Art and Research Challenges

    Get PDF
    Today's mobile phones are far from mere communication devices they were ten years ago. Equipped with sophisticated sensors and advanced computing hardware, phones can be used to infer users' location, activity, social setting and more. As devices become increasingly intelligent, their capabilities evolve beyond inferring context to predicting it, and then reasoning and acting upon the predicted context. This article provides an overview of the current state of the art in mobile sensing and context prediction paving the way for full-fledged anticipatory mobile computing. We present a survey of phenomena that mobile phones can infer and predict, and offer a description of machine learning techniques used for such predictions. We then discuss proactive decision making and decision delivery via the user-device feedback loop. Finally, we discuss the challenges and opportunities of anticipatory mobile computing.Comment: 29 pages, 5 figure

    Robust Dual-Modal Speech Keyword Spotting for XR Headsets

    Full text link
    While speech interaction finds widespread utility within the Extended Reality (XR) domain, conventional vocal speech keyword spotting systems continue to grapple with formidable challenges, including suboptimal performance in noisy environments, impracticality in situations requiring silence, and susceptibility to inadvertent activations when others speak nearby. These challenges, however, can potentially be surmounted through the cost-effective fusion of voice and lip movement information. Consequently, we propose a novel vocal-echoic dual-modal keyword spotting system designed for XR headsets. We devise two different modal fusion approches and conduct experiments to test the system's performance across diverse scenarios. The results show that our dual-modal system not only consistently outperforms its single-modal counterparts, demonstrating higher precision in both typical and noisy environments, but also excels in accurately identifying silent utterances. Furthermore, we have successfully applied the system in real-time demonstrations, achieving promising results. The code is available at https://github.com/caizhuojiang/VE-KWS.Comment: Accepted to IEEE VR 202

    Using a common accessibility profile to improve accessibility

    Get PDF
    People have difficulties using computers. Some have more difficulties than others. There is a need for guidance in how to evaluate and improve the accessibility of systems for users. Since different users have considerably different accessibility needs, accessibility is a very complex issue.ISO 9241-171 defines accessibility as the "usability of a product, service, environment or facility by people with the widest range of capabilities." While this definition can help manufacturers make their products more accessible to more people, it does not ensure that a given product is accessible to a particular individual.A reference model is presented to act as a theoretical foundation. This Universal Access Reference Model (UARM) focuses on the accessibility of the interaction between users and systems, and provides a mechanism to share knowledge and abilities between users and systems. The UARM also suggests the role assistive technologies (ATs) can play in this interaction. The Common Accessibility Profile (CAP), which is based on the UARM, can be used to describe accessibility.The CAP is a framework for identifying the accessibility issues of individual users with particular systems configurations. It profiles the capabilities of systems and users to communicate. The CAP can also profile environmental interference to this communication and the use of ATs to transform communication abilities. The CAP model can be extended as further general or domain specific requirements are standardized.The CAP provides a model that can be used to structure various specifications in a manner that, in the future, will allow computational combination and comparison of profiles.Recognizing its potential impact, the CAP is now being standardized by the User Interface subcommittee the International Organization for Standardization and the International Electrotechnical Commission
    • …
    corecore