RGBD Datasets: Past, Present and Future
Since the launch of the Microsoft Kinect, scores of RGBD datasets have been
released. These have propelled advances in areas from reconstruction to gesture
recognition. In this paper we explore the field, reviewing datasets across
eight categories: semantics, object pose estimation, camera tracking, scene
reconstruction, object tracking, human actions, faces and identification. By
extracting relevant information in each category we help researchers to find
appropriate data for their needs, and we consider which datasets have succeeded
in driving computer vision forward and why.
Finally, we examine the future of RGBD datasets. We identify key areas which
are currently underexplored, and suggest that future directions may include
synthetic data and dense reconstructions of static and dynamic scenes.
Comment: 8 pages excluding references (CVPR style)
British Sign Language Recognition via Late Fusion of Computer Vision and Leap Motion with Transfer Learning to American Sign Language
In this work, we show that a late fusion approach to multimodality in sign language recognition improves the overall ability of the model in comparison to the singular approaches of image classification (88.14%) and Leap Motion data classification (72.73%). With a large synchronous dataset of 18 BSL gestures collected from multiple subjects, two deep neural networks are benchmarked and compared to derive the best topology for each. The vision model is implemented by a Convolutional Neural Network and an optimised Artificial Neural Network, and the Leap Motion model is implemented by an evolutionary search of Artificial Neural Network topology. Next, the two best networks are fused for synchronised processing, which results in a better overall result (94.44%), as complementary features are learnt in addition to the original task. The hypothesis is further supported by applying the three models to a set of completely unseen data, where the multimodality approach achieves the best results relative to the single-sensor methods. When transfer learning with the weights trained via British Sign Language, all three models outperform standard random weight initialisation when classifying American Sign Language (ASL), and the best model overall for ASL classification was the transfer-learning multimodality approach, which scored 82.55% accuracy.
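The late fusion idea described in this abstract can be illustrated with a score-level sketch: each modality produces its own per-class probabilities, and the two are combined after classification rather than at the feature level. The weights and class scores below are hypothetical, and the paper's actual fusion network (which learns complementary features) is more elaborate than this minimal averaging scheme.

```python
import numpy as np

def late_fusion(p_vision, p_leap, w_vision=0.5, w_leap=0.5):
    """Combine per-class probabilities from two modalities by a
    weighted average, then renormalise to a valid distribution."""
    fused = w_vision * np.asarray(p_vision) + w_leap * np.asarray(p_leap)
    return fused / fused.sum()

# Hypothetical softmax outputs over three gesture classes:
p_vision = [0.70, 0.20, 0.10]   # vision CNN
p_leap   = [0.40, 0.50, 0.10]   # Leap Motion ANN

fused = late_fusion(p_vision, p_leap)   # -> [0.55, 0.35, 0.10]
```

Here the vision model's confidence tips the fused decision to class 0 even though the Leap model alone would have picked class 1, which is the intuition behind fusing complementary sensors.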
Fusion of pose and head tracking data for immersive mixed-reality application development
This work addresses the creation of a development framework where application developers can create, in a natural way, immersive physical activities in which users experience a 3D first-person perception of full-body control. The proposed framework is based on commercial motion sensors and a Head-Mounted Display (HMD), and uses Unity 3D as a unifying environment where user pose, the virtual scene and immersive visualisation functions are coordinated. Our proposal is exemplified by the development of a toy application showing its practical use.
An advanced virtual dance performance evaluator
The ever-increasing availability of high-speed Internet access has led to a leap in technologies that support real-time, realistic interaction between humans in online virtual environments. In the context of this work, we wish to realise the vision of an online dance studio where a dance class is provided by an expert dance teacher and delivered to online students via the web. In this paper we study some of the technical issues that need to be addressed in this challenging scenario. In particular, we describe an automatic dance analysis tool that would be used to evaluate a student's performance and provide him/her with meaningful feedback to aid improvement.
Jester: A Device Abstraction and Data Fusion API for Skeletal Tracking
Humans naturally interact with the world in three dimensions. Traditionally, personal computers have relied on 2D mice for input because 3D user tracking systems were cumbersome and expensive. Recently, 3D input hardware has become accurate and affordable enough to be marketed to average consumers and integrated into niche applications. Presently, 3D application developers must learn a different API for each device their software will support, and there is no simple way to integrate sensor data if the system has multiple 3D input devices. This thesis presents Jester, a library designed to simplify the development and improve the accuracy of 3D input-supported applications by providing an easily-extensible set of sensor wrappers that abstract the hardware-specific details of capturing skeletal data and fusing sensor data in multiple 3D input device systems. Jester's capabilities are demonstrated by creating a toy application that uses a PrimeSense Carmine and Leap Motion Controller to provide full body and finger skeletal tracking. Jester was able to fuse the data in real time while using the Carmine's data to compensate for ambiguity in the Leap's tracking.
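The device-abstraction pattern the thesis describes, a common wrapper interface hiding each device's native API, with a fusion step that merges skeletons, can be sketched as follows. All class names, joint names and coordinates here are hypothetical illustrations, not Jester's actual (C++) API; the real library's fusion is more sophisticated than this simple "more specialised sensor wins" merge.

```python
from abc import ABC, abstractmethod

class SensorWrapper(ABC):
    """Hypothetical common interface hiding device-specific capture code."""
    @abstractmethod
    def poll(self):
        """Return a dict mapping joint names to (x, y, z) positions."""

class FullBodySensor(SensorWrapper):
    """Stands in for a depth camera giving coarse full-body joints."""
    def poll(self):
        return {"head": (0.00, 1.70, 2.00), "right_hand": (0.30, 1.00, 1.90)}

class HandSensor(SensorWrapper):
    """Stands in for a short-range hand tracker giving fine finger joints."""
    def poll(self):
        return {"right_hand": (0.31, 1.02, 1.88),
                "right_index_tip": (0.35, 1.05, 1.85)}

def fuse(sensors):
    """Merge skeletons from all sensors; later (more specialised)
    sensors override estimates for joints both devices report."""
    skeleton = {}
    for sensor in sensors:
        skeleton.update(sensor.poll())
    return skeleton

skeleton = fuse([FullBodySensor(), HandSensor()])
```

The application code only ever sees `SensorWrapper` and the fused skeleton, so adding a new device means writing one wrapper rather than rewriting the application.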
Towards the Design of a Natural User Interface for Performing and Learning Musical Gestures
A large variety of musical instruments, either acoustic or digital, are based on a keyboard scheme. Keyboard instruments can produce sounds through acoustic means, but they are increasingly used to control digital sound-synthesis processes in today's music. Interestingly, with all the different possibilities of sonic outcomes, the input remains a musical gesture. In this paper we present the conceptualization of a Natural User Interface (NUI), named the Intangible Musical Instrument (IMI), aiming to support both the learning of expert musical gestures and the performing of music as a unified user experience. The IMI is designed to recognize metaphors of pianistic gestures, focusing on subtle uses of the fingers and upper body. Based on a typology of musical gestures, a gesture vocabulary has been created, hierarchized from basic to complex. These piano-like gestures are finally recognized and transformed into sounds.
MIFTel: a multimodal interactive framework based on temporal logic rules
Human-computer and multimodal interaction are increasingly used in everyday life. Machines are able to get more from the surrounding world, assisting humans in different application areas. In this context, the correct processing and management of the signals provided by the environment is decisive for structuring the data. Different sources and acquisition times can be exploited to improve recognition results. On the basis of these assumptions, we propose a multimodal system that exploits Allen's temporal logic combined with a prediction method. The main objective is to correlate the user's events with the system's reactions. After post-processing the incoming data from different signal sources (RGB images, depth maps, sounds, proximity sensors, etc.), the system manages the correlations between recognition/detection results and events in real time to create an interactive environment for the user. To increase recognition reliability, a predictive model is also associated with the proposed method. The modularity of the system allows fully dynamic development and upgrades with custom modules. Finally, a comparison with other similar systems is shown, underlining the high flexibility and robustness of the proposed event-management method.
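Allen's temporal logic, which the abstract builds on, classifies how two time intervals relate (before, meets, overlaps, during, etc.), which is what lets a system correlate events from sensors with different acquisition times. A minimal sketch covering a subset of the 13 relations (the paper's rule engine is, of course, richer than this):

```python
def allen_relation(a, b):
    """Classify the temporal relation between intervals a = (start, end)
    and b = (start, end). Only a subset of Allen's 13 relations is shown."""
    a_s, a_e = a
    b_s, b_e = b
    if a_e < b_s:
        return "before"     # a ends strictly before b starts
    if a_e == b_s:
        return "meets"      # a ends exactly where b starts
    if a_s < b_s < a_e < b_e:
        return "overlaps"   # a starts first; the intervals overlap
    if a_s == b_s and a_e == b_e:
        return "equals"
    if b_s < a_s and a_e < b_e:
        return "during"     # a lies strictly inside b
    return "other"          # remaining relations not distinguished here

# e.g. a detected hand gesture ending just as a voice command begins:
relation = allen_relation((0.0, 1.2), (1.2, 2.0))   # "meets"
```

A rule such as "gesture *meets* voice command, therefore treat them as one multimodal event" is the kind of correlation between detection results and events that the system expresses over its signal sources.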