End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models
Speech activity detection (SAD) plays an important role in current speech
processing systems, including automatic speech recognition (ASR). SAD is
particularly difficult in environments with acoustic noise. A practical
solution is to incorporate visual information, increasing the robustness of the
SAD approach. An audiovisual system has the advantage of being robust to
different speech modes (e.g., whisper speech) or background noise. Recent
advances in audiovisual speech processing using deep learning have opened
opportunities to capture in a principled way the temporal relationships between
acoustic and visual features. This study explores this idea, proposing a
\emph{bimodal recurrent neural network} (BRNN) framework for SAD. The approach
models the temporal dynamics of the sequential audiovisual data, improving the
accuracy and robustness of the proposed SAD system. Instead of relying on
hand-crafted features, the study investigates an end-to-end training approach
in which acoustic and visual features are learned directly from the raw data
during training. The experimental evaluation considers a large audiovisual
corpus with over 60.8 hours of recordings, collected from 105 speakers. The
results demonstrate that the proposed framework leads to absolute improvements
of up to 1.2% in practical scenarios over an audio-only voice activity
detection (VAD) baseline implemented with a deep neural network (DNN). The
proposed approach achieves a 92.7% F1-score when evaluated using the sensors
of a portable tablet in a noisy acoustic environment, only 1.0% lower than the
performance obtained under ideal conditions (e.g., clean speech captured with
a high-definition camera and a close-talking microphone).
Comment: Submitted to Speech Communication
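As a concrete illustration of the fused architecture this abstract describes, here is a minimal PyTorch sketch of a bimodal recurrent SAD model. It is not the authors' implementation: the feature dimensions, hidden sizes, and the three-LSTM layout (one LSTM per modality plus a fusion layer) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BimodalRNN(nn.Module):
    """Sketch of a bimodal recurrent SAD model: per-modality LSTMs
    whose outputs are concatenated and fed to a fusion LSTM, followed
    by a frame-level speech/non-speech classifier."""

    def __init__(self, audio_dim=40, visual_dim=128, hidden=64):
        super().__init__()
        self.audio_rnn = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.visual_rnn = nn.LSTM(visual_dim, hidden, batch_first=True)
        self.fusion_rnn = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, 1)  # per-frame speech logit

    def forward(self, audio, visual):
        # audio: (batch, frames, audio_dim); visual: (batch, frames, visual_dim)
        a, _ = self.audio_rnn(audio)
        v, _ = self.visual_rnn(visual)
        fused, _ = self.fusion_rnn(torch.cat([a, v], dim=-1))
        return self.classifier(fused).squeeze(-1)  # (batch, frames)

model = BimodalRNN()
logits = model(torch.randn(2, 100, 40), torch.randn(2, 100, 128))
probs = torch.sigmoid(logits)  # frame-level speech activity probabilities
```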
Cost-Driven Hardware-Software Co-Optimization of Machine Learning Pipelines
Researchers have long touted a vision of the future enabled by a
proliferation of internet-of-things devices, including smart sensors, homes,
and cities. Increasingly, embedding intelligence in such devices involves the
use of deep neural networks. However, their storage and processing requirements
make them prohibitive for cheap, off-the-shelf platforms. Overcoming those
requirements is necessary for enabling widely-applicable smart devices. While
many ways of making models smaller and more efficient have been developed,
there is a lack of understanding of which ones are best suited for particular
scenarios. More importantly for edge platforms, those choices cannot be
analyzed in isolation from cost and user experience. In this work, we
holistically explore how quantization, model scaling, and multi-modality
interact with system components such as memory, sensors, and processors. We
perform this hardware/software co-design from the cost, latency, and
user-experience perspective, and develop a set of guidelines for optimal system
design and model deployment for the most cost-constrained platforms. We
demonstrate our approach with an end-to-end, on-device biometric user
authentication system built on a $20 ESP-EYE board.
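To make the cost and memory trade-off concrete, the following back-of-the-envelope Python sketch checks whether a quantized model's weights fit within an edge board's RAM. The parameter count, bit widths, and the 520 KiB SRAM figure are illustrative assumptions, not numbers from the paper.

```python
# Rough sizing check: do a model's quantized weights fit an edge board's RAM?
# All numbers below are illustrative assumptions, not figures from the paper.

def model_bytes(n_params: int, bits: int) -> int:
    """Storage for the weights alone at a given quantization bit width."""
    return n_params * bits // 8

BOARD_RAM = 520 * 1024   # ESP32-class internal SRAM, in bytes (assumed)
N_PARAMS = 1_200_000     # hypothetical on-device authentication model

for bits in (32, 16, 8, 4):
    size = model_bytes(N_PARAMS, bits)
    verdict = "fits" if size <= BOARD_RAM else "too large"
    print(f"{bits:>2}-bit weights: {size / 1024:8.1f} KiB -> {verdict}")
```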
Anticipatory Mobile Computing: A Survey of the State of the Art and Research Challenges
Today's mobile phones are far from the mere communication devices they were ten
years ago. Equipped with sophisticated sensors and advanced computing hardware,
phones can be used to infer users' location, activity, social setting and more.
As devices become increasingly intelligent, their capabilities evolve beyond
inferring context to predicting it, and then reasoning and acting upon the
predicted context. This article provides an overview of the current state of
the art in mobile sensing and context prediction, paving the way for
full-fledged anticipatory mobile computing. We present a survey of phenomena
that mobile phones can infer and predict, and offer a description of machine
learning techniques used for such predictions. We then discuss proactive
decision making and decision delivery via the user-device feedback loop.
Finally, we discuss the challenges and opportunities of anticipatory mobile
computing.
Comment: 29 pages, 5 figures
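As a toy illustration of the "predict" step in the sense-predict-act loop this survey covers, the sketch below fits a first-order Markov model over a hypothetical context history and predicts the most likely next context. The contexts and transition data are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy first-order Markov predictor for anticipatory context inference.
# History entries are invented; a real system would infer them from sensors.
history = ["home", "commute", "office", "lunch", "office", "commute", "home",
           "home", "commute", "office", "office", "commute", "home"]

# Count observed transitions between consecutive contexts.
transitions = defaultdict(Counter)
for prev, nxt in zip(history, history[1:]):
    transitions[prev][nxt] += 1

def predict_next(context: str) -> str:
    """Most frequently observed successor of the current context."""
    counts = transitions[context]
    return counts.most_common(1)[0][0] if counts else context

print(predict_next("commute"))  # e.g. 'office' -> proactively prefetch email
```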
Robust Dual-Modal Speech Keyword Spotting for XR Headsets
While speech interaction finds widespread utility within the Extended Reality
(XR) domain, conventional vocal speech keyword spotting systems continue to
grapple with formidable challenges, including suboptimal performance in noisy
environments, impracticality in situations requiring silence, and
susceptibility to inadvertent activations when others speak nearby. These
challenges, however, can potentially be surmounted through the cost-effective
fusion of voice and lip movement information. Consequently, we propose a novel
vocal-echoic dual-modal keyword spotting system designed for XR headsets. We
devise two different modal fusion approaches and conduct experiments to test the
system's performance across diverse scenarios. The results show that our
dual-modal system not only consistently outperforms its single-modal
counterparts, demonstrating higher precision in both typical and noisy
environments, but also excels in accurately identifying silent utterances.
Furthermore, we have successfully applied the system in real-time
demonstrations, achieving promising results. The code is available at
https://github.com/caizhuojiang/VE-KWS.
Comment: Accepted to IEEE VR 2024
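The abstract does not spell out its fusion methods, but a generic late-fusion sketch conveys the idea: per-keyword scores from an audio model and a lip-movement model are mixed with a weight that shifts trust toward the visual stream as acoustic conditions degrade. The keywords, scores, and SNR-based weighting rule below are illustrative assumptions, not the paper's approach.

```python
import numpy as np

# Late fusion of dual-modal keyword-spotting scores. All inputs are invented.
KEYWORDS = ["select", "back", "menu"]

def fuse(audio_scores, visual_scores, snr_db):
    """Weight audio more at high SNR, lip reading more in noise or silence."""
    audio_w = 1.0 / (1.0 + np.exp(-(snr_db - 5.0) / 3.0))
    return audio_w * audio_scores + (1.0 - audio_w) * visual_scores

audio = np.array([0.2, 0.1, 0.7])    # audio model, degraded by noise
visual = np.array([0.1, 0.8, 0.1])   # lip-movement model
fused = fuse(audio, visual, snr_db=-5.0)  # very noisy scene
print(KEYWORDS[int(np.argmax(fused))])    # 'back': the visual stream dominates
```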
Using a common accessibility profile to improve accessibility
People have difficulties using computers, and some have more difficulties than others. There is a need for guidance on how to evaluate and improve the accessibility of systems for users. Since different users have considerably different accessibility needs, accessibility is a very complex issue. ISO 9241-171 defines accessibility as the "usability of a product, service, environment or facility by people with the widest range of capabilities." While this definition can help manufacturers make their products more accessible to more people, it does not ensure that a given product is accessible to a particular individual.

A reference model is presented to act as a theoretical foundation. This Universal Access Reference Model (UARM) focuses on the accessibility of the interaction between users and systems, and provides a mechanism to share knowledge and abilities between users and systems. The UARM also suggests the role assistive technologies (ATs) can play in this interaction. The Common Accessibility Profile (CAP), which is based on the UARM, can be used to describe accessibility.

The CAP is a framework for identifying the accessibility issues of individual users with particular system configurations. It profiles the capabilities of systems and users to communicate. The CAP can also profile environmental interference with this communication and the use of ATs to transform communication abilities. The CAP model can be extended as further general or domain-specific requirements are standardized. The CAP provides a model that can be used to structure various specifications in a manner that, in the future, will allow computational combination and comparison of profiles.

Recognizing its potential impact, the CAP is now being standardized by the User Interface subcommittee of the International Organization for Standardization and the International Electrotechnical Commission.
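A much-simplified sketch of the profile-matching idea behind the CAP: compare the communication channels a system offers with those a user (possibly through an AT) can use. The attribute names and matching rule are invented for illustration; the actual CAP and UARM specifications define far richer structures.

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    """Hypothetical CAP-style profile of communication capabilities."""
    can_receive: set = field(default_factory=set)  # channels the party can perceive
    can_send: set = field(default_factory=set)     # channels the party can produce

def accessible(system: Profile, user: Profile) -> bool:
    """Interaction is possible if at least one shared channel exists each way."""
    return bool(system.can_receive & user.can_send) and \
           bool(system.can_send & user.can_receive)

kiosk = Profile(can_send={"visual", "audio"}, can_receive={"touch"})
user = Profile(can_receive={"audio"}, can_send={"touch", "speech"})
print(accessible(kiosk, user))  # True: audio output reaches user, touch input works
```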