5,141 research outputs found
A Modular Approach for Synchronized Wireless Multimodal Multisensor Data Acquisition in Highly Dynamic Social Settings
Existing data acquisition literature for human behavior research provides
wired solutions, mainly for controlled laboratory setups. In uncontrolled
free-standing conversation settings, where participants are free to walk
around, these solutions are unsuitable. While wireless solutions are employed
in the broadcasting industry, they can be prohibitively expensive. In this
work, we propose a modular and cost-effective wireless approach for
synchronized multisensor data acquisition of social human behavior. Our core
idea involves a cost-accuracy trade-off by using Network Time Protocol (NTP) as
a source reference for all sensors. While commonly used as a reference in
ubiquitous computing, NTP is widely considered to be insufficiently accurate as
a reference for video applications, where Precision Time Protocol (PTP) or
Global Positioning System (GPS) based references are preferred. We argue and
show, however, that the latency introduced by using NTP as a source reference
is adequate for human behavior research, and the subsequent cost and modularity
benefits are a desirable trade-off for applications in this domain. We also
describe one instantiation of the approach deployed in a real-world experiment
to demonstrate the practicality of our setup in-the-wild.Comment: 9 pages, 8 figures, Proceedings of the 28th ACM International
Conference on Multimedia (MM '20), October 12--16, 2020, Seattle, WA, USA.
First two authors contributed equall
Multimodal Polynomial Fusion for Detecting Driver Distraction
Distracted driving is deadly, claiming 3,477 lives in the U.S. in 2015 alone.
Although there has been a considerable amount of research on modeling the
distracted behavior of drivers under various conditions, accurate automatic
detection using multiple modalities and especially the contribution of using
the speech modality to improve accuracy has received little attention. This
paper introduces a new multimodal dataset for distracted driving behavior and
discusses automatic distraction detection using features from three modalities:
facial expression, speech and car signals. Detailed multimodal feature analysis
shows that adding more modalities monotonically increases the predictive
accuracy of the model. Finally, a simple and effective multimodal fusion
technique using a polynomial fusion layer shows superior distraction detection
results compared to the baseline SVM and neural network models.Comment: INTERSPEECH 201
An Immersive Telepresence System using RGB-D Sensors and Head Mounted Display
We present a tele-immersive system that enables people to interact with each
other in a virtual world using body gestures in addition to verbal
communication. Beyond the obvious applications, including general online
conversations and gaming, we hypothesize that our proposed system would be
particularly beneficial to education by offering rich visual contents and
interactivity. One distinct feature is the integration of egocentric pose
recognition that allows participants to use their gestures to demonstrate and
manipulate virtual objects simultaneously. This functionality enables the
instructor to ef- fectively and efficiently explain and illustrate complex
concepts or sophisticated problems in an intuitive manner. The highly
interactive and flexible environment can capture and sustain more student
attention than the traditional classroom setting and, thus, delivers a
compelling experience to the students. Our main focus here is to investigate
possible solutions for the system design and implementation and devise
strategies for fast, efficient computation suitable for visual data processing
and network transmission. We describe the technique and experiments in details
and provide quantitative performance results, demonstrating our system can be
run comfortably and reliably for different application scenarios. Our
preliminary results are promising and demonstrate the potential for more
compelling directions in cyberlearning.Comment: IEEE International Symposium on Multimedia 201
Real-time data acquisition, transmission and archival framework
Most human actions are a direct response to stimuli from their five senses. In the past few decades there has been a growing interest in capturing and storing the information that is obtained from the senses using analog and digital sensors. By storing this data it is possible to further analyze and better understand human perception. While many devices have been created for capturing and storing data, existing software and hardware architectures are aimed towards specialized devices and require expensive high-performance systems. This thesis aims to create a framework that supports capture and monitoring of a variety of sensors and can be scaled to run on low and high-performance systems such as netbooks, laptops and desktop systems. The proposed architecture was tested using aural and visual sensors due to their availability and higher bandwidth requirements compared to other sensors. Four different portable computing devices were used for testing with a varied set of hardware capabilities. On each of the systems the same suite of tests were run to benchmark and analyze CPU, memory, network, and storage usage statistics. From the results it was shown that on all of these platforms capturing data from multiple video, audio and other sensor sources was possible in real-time. Performance was shown to scale based on several factors, but the most important were CPU architecture, network topology and data interfaces used
Self-Localization of Ad-Hoc Arrays Using Time Difference of Arrivals
This work was supported by the U.K. Engineering and Physical Sciences Research Council (EPSRC) under Grant EP/K007491/1
An Immersive Telepresence System using RGB-D Sensors and Head-mounted Display
We present a tele-immersive system that enables people to interact with each other in a virtual world using body gestures in addition to verbal communication. Beyond the obvious applications, including general online conversations and gaming, we hypothesize that our proposed system would be particularly beneficial to education by offering rich visual contents and interactivity. One distinct feature is the integration of egocentric pose recognition that allows participants to use their gestures to demonstrate and manipulate virtual objects simultaneously. This functionality enables the instructor to effectively and efficiently explain and illustrate complex concepts or sophisticated problems in an intuitive manner. The highly interactive and flexible environment can capture and sustain more student attention than the traditional classroom setting and, thus, delivers a compelling experience to the students. Our main focus here is to investigate possible solutions for the system design and implementation and devise strategies for fast, efficient computation suitable for visual data processing and network transmission. We describe the technique and experiments in details and provide quantitative performance results, demonstrating our system can be run comfortably and reliably for different application scenarios. Our preliminary results are promising and demonstrate the potential for more compelling directions in cyberlearning
Computational Multimedia for Video Self Modeling
Video self modeling (VSM) is a behavioral intervention technique in which a learner models a target behavior by watching a video of oneself. This is the idea behind the psychological theory of self-efficacy - you can learn or model to perform certain tasks because you see yourself doing it, which provides the most ideal form of behavior modeling. The effectiveness of VSM has been demonstrated for many different types of disabilities and behavioral problems ranging from stuttering, inappropriate social behaviors, autism, selective mutism to sports training. However, there is an inherent difficulty associated with the production of VSM material. Prolonged and persistent video recording is required to capture the rare, if not existed at all, snippets that can be used to string together in forming novel video sequences of the target skill. To solve this problem, in this dissertation, we use computational multimedia techniques to facilitate the creation of synthetic visual content for self-modeling that can be used by a learner and his/her therapist with a minimum amount of training data. There are three major technical contributions in my research. First, I developed an Adaptive Video Re-sampling algorithm to synthesize realistic lip-synchronized video with minimal motion jitter. Second, to denoise and complete the depth map captured by structure-light sensing systems, I introduced a layer based probabilistic model to account for various types of uncertainties in the depth measurement. Third, I developed a simple and robust bundle-adjustment based framework for calibrating a network of multiple wide baseline RGB and depth cameras
- …