13 research outputs found

    An Efficient Image-Based Telepresence System for Videoconferencing


    3D Hand reconstruction from monocular camera with model-based priors

    As virtual and augmented reality (VR/AR) technology gains popularity, facilitating intuitive digital interactions in 3D is of crucial importance. Tools such as VR controllers exist, but such devices support only a limited range of interactions, mapped onto complex sequences of button presses that can be intimidating to learn. In contrast, users already have an instinctive understanding of manual interactions in the real world, which is readily transferable to the virtual world. This makes hands the ideal mode of interaction for downstream applications such as robotic teleoperation, sign-language translation, and computer-aided design. Existing hand-tracking systems come with several inconvenient limitations. Wearable solutions such as gloves and markers unnaturally limit the range of articulation. Multi-camera systems are not trivial to calibrate and have specialized hardware requirements which make them cumbersome to use. Given these drawbacks, recent research tends to focus on monocular inputs, as these do not constrain articulation and suitable devices are pervasive in everyday life. 3D reconstruction in this setting is severely under-constrained, however, due to occlusions and depth ambiguities. The majority of state-of-the-art works rely on a learning framework to resolve these ambiguities statistically; as a result, they have several limitations in common. For example, they require a vast amount of annotated 3D data that is labor-intensive to obtain and prone to systematic error. Additionally, traits that are hard to quantify with annotations, such as the details of individual hand appearance, are difficult to reconstruct in such a framework. Existing methods also make the simplifying assumption that only a single hand is present in the scene. Two-hand interactions, however, introduce additional challenges in the form of inter-hand occlusion, left-right confusion, and collision constraints that single-hand methods cannot address.
To tackle the aforementioned shortcomings of previous methods, this thesis advances the state of the art through the novel use of model-based priors to incorporate hand-specific knowledge. In particular, this thesis presents a training method that reduces the amount of annotations required and is robust to systematic biases; it presents the first tracking method that addresses the challenging two-hand-interaction scenario using monocular RGB video, and also the first probabilistic method to model image ambiguity for two-hand interactions. Additionally, this thesis contributes the first parametric hand texture model, with example applications in hand personalization.
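The abstract does not specify the form of the parametric hand texture model; linear (PCA-style) bases are a common choice for such statistical appearance models, so a minimal illustrative sketch, with entirely hypothetical names and shapes, might look like this:

```python
import numpy as np

def sample_texture(mean_tex, basis, coeffs):
    """Linear parametric texture model: a hand texture is the mean
    texture plus a weighted sum of learned basis textures."""
    return mean_tex + basis @ coeffs

def fit_coeffs(observed, mean_tex, basis):
    """Project an observed hand texture onto the model (least squares),
    e.g. to personalize the appearance of a tracked hand."""
    coeffs, *_ = np.linalg.lstsq(basis, observed - mean_tex, rcond=None)
    return coeffs
```

Fitting the low-dimensional coefficients rather than raw pixels is what lets such a prior regularize appearance details that are hard to annotate.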

    Online Audio-Visual Multi-Source Tracking and Separation: A Labeled Random Finite Set Approach

    The dissertation proposes an online solution for separating an unknown and time-varying number of moving sources using audio and visual data. The random finite set framework is used for the modeling and fusion of audio and visual data. This enables an online tracking algorithm to estimate the source positions and identities at each time point. With this information, a set of beamformers can be designed to separate each desired source and suppress the interfering sources.
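The abstract gives no implementation details for the beamforming stage; as an illustration of the general idea only, here is a minimal delay-and-sum beamformer that steers a microphone array toward one estimated source position (all names and the free-field geometry are assumptions, not the dissertation's design):

```python
import numpy as np

def delay_and_sum(frames, mic_positions, source_pos, fs, c=343.0):
    """Align per-channel propagation delays toward source_pos, then
    average the channels (delay-and-sum beamforming).

    frames: (n_mics, n_samples) time-domain signals
    mic_positions: (n_mics, dim) coordinates in metres
    source_pos: (dim,) estimated source position in metres
    fs: sample rate in Hz; c: speed of sound in m/s
    """
    n_mics, n_samples = frames.shape
    # Relative propagation delay from the source to each microphone.
    dists = np.linalg.norm(mic_positions - source_pos, axis=1)
    delays = (dists - dists.min()) / c
    # Apply fractional time advances in the frequency domain.
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spec = np.fft.rfft(frames, axis=1)
    shift = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    aligned = np.fft.irfft(spec * shift, n=n_samples, axis=1)
    return aligned.mean(axis=0)  # (n_samples,) enhanced signal
```

Averaging the aligned channels reinforces the steered source while other directions add incoherently, which is the suppression effect the abstract refers to.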

    Segmentation based coding of depth Information for 3D video

    Increased interest in 3D content, and the need to transmit, broadcast, and store the full information that represents a 3D view, has made this a hot topic in recent years. Since adding depth information to the views considerably increases the encoding bitrate, we decided to find a new approach to encoding and decoding the depth information for 3D video. In this project, different approaches to encoding and decoding the depth information are evaluated, and a new method is implemented whose results are compared to the best previously developed method in terms of both bitrate and quality (PSNR).
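Since methods here are compared on bitrate and PSNR, the quality metric itself is standard and can be sketched directly (an illustrative helper, not code from the project):

```python
import numpy as np

def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference depth map
    and its encoded/decoded reconstruction (higher is better)."""
    ref = reference.astype(np.float64)
    rec = reconstructed.astype(np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

A rate-distortion comparison would then plot PSNR of the decoded depth map against the bitrate spent on it for each codec configuration.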

    Vision-Based 2D and 3D Human Activity Recognition


    Interactive Remote Collaboration Using Augmented Reality

    With the widespread deployment of fast data connections and availability of a variety of sensors for different modalities, the potential of remote collaboration has greatly increased. While the now ubiquitous video conferencing applications take advantage of some of these capabilities, the use of video between remote users is limited to passively watching disjoint video feeds and provides no means for interaction with the remote environment. However, collaboration often involves sharing, exploring, referencing, or even manipulating the physical world, and thus tools should provide support for these interactions. We suggest that augmented reality is an intuitive and user-friendly paradigm to communicate information about the physical environment, and that integration of computer vision and augmented reality facilitates more immersive and more direct interaction with the remote environment than what is possible with today's tools. In this dissertation, we present contributions to realizing this vision on several levels. First, we describe a conceptual framework for unobtrusive mobile video-mediated communication in which the remote user can explore the live scene independently of the local user's current camera movement, and can communicate information by creating spatial annotations that are immediately visible to the local user in augmented reality. Second, we describe the design and implementation of several increasingly flexible and immersive user interfaces and system prototypes that implement this concept. Our systems do not require any preparation or instrumentation of the environment; instead, the physical scene is tracked and modeled incrementally using monocular computer vision. The emerging model then supports anchoring of annotations, virtual navigation, and synthesis of novel views of the scene.
Third, we describe the design, execution, and analysis of three user studies comparing our prototype implementations with more conventional interfaces and/or evaluating specific design elements. Study participants overwhelmingly preferred our technology, and their task performance was significantly better compared with a video-only interface, though no task performance difference was observed compared with a "static marker" interface. Last, we address a particular technical limitation of current monocular tracking and mapping systems which was found to be an impediment, and present a conceptual solution; namely, we describe a concept and proof-of-concept implementation for automatic model selection which allows tracking and modeling to cope with both parallax-inducing and rotation-only camera movements. We suggest that our results demonstrate the maturity and usability of our systems, and, more importantly, the potential of our approach to improve video-mediated communication and broaden its applicability.
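The automatic model selection is not spelled out in the abstract; one published criterion for choosing between a homography (rotation-only or planar motion) and a fundamental matrix (parallax-inducing motion) is Torr's GRIC score, sketched here in simplified form as an illustration (the parameter values are typical defaults, not necessarily the dissertation's):

```python
import numpy as np

def gric(residuals_sq, sigma, n_model_dim, n_params,
         r=4, lam1=2.0, lam2=4.0, lam3=2.0):
    """Simplified Torr GRIC score; lower means the model explains
    the correspondences better, penalizing model complexity.

    residuals_sq: squared geometric residuals, one per correspondence
    sigma: assumed noise standard deviation of the residuals
    n_model_dim: dimension of the model manifold (2 for H, 3 for F)
    n_params: number of model parameters (8 for H, 7 for F)
    r: dimension of a 2D-2D correspondence (4)
    """
    e = np.asarray(residuals_sq) / sigma ** 2
    rho = np.minimum(e, lam3 * (r - n_model_dim))  # clamped robust residual
    n = len(rho)
    return rho.sum() + lam1 * n_model_dim * n + lam2 * n_params

def prefer_homography(res_h_sq, res_f_sq, sigma=1.0):
    """Pick the homography (rotation-only/planar tracking mode) when
    its GRIC score beats the fundamental matrix's."""
    return gric(res_h_sq, sigma, 2, 8) < gric(res_f_sq, sigma, 3, 7)
```

In a tracking system, the residuals would come from fitting both models to the same feature matches each frame, and the winner determines whether mapping treats the motion as rotation-only or parallax-inducing.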

    MediaSync: Handbook on Multimedia Synchronization

    This book provides an approachable overview of the most recent advances in the fascinating field of media synchronization (mediasync), gathering contributions from the most representative and influential experts. Understanding the challenges of this field in the current multi-sensory, multi-device, and multi-protocol world is not an easy task. The book revisits the foundations of mediasync, including theoretical frameworks and models; highlights ongoing research efforts, like hybrid broadband broadcast (HBB) delivery and users' perception modeling (i.e., Quality of Experience or QoE); and paves the way for the future (e.g., towards the deployment of multi-sensory and ultra-realistic experiences). Although many advances around mediasync have been devised and deployed, this area of research is getting renewed attention to overcome remaining challenges in the next-generation (heterogeneous and ubiquitous) media ecosystem. Given the significant advances in this research area, its current relevance, and the multiple disciplines it involves, the availability of a reference book on mediasync becomes necessary. This book fills the gap in this context. In particular, it addresses key aspects and reviews the most relevant contributions within the mediasync research space from different perspectives. MediaSync: Handbook on Multimedia Synchronization is the perfect companion for scholars and practitioners who want to acquire strong knowledge about this research area, and also to approach the challenges behind ensuring the best mediated experiences by providing adequate synchronization between the media elements that constitute these experiences.

    Virtual Reality Applications and Development

    Virtual Reality (VR) has existed for many years; however, it has only recently gained widespread popularity and commercial use. This change stems from innovations in head-mounted displays (HMDs) and from the work of many software engineers crafting quality user experiences (UX). This thesis explores four areas within VR. The first is the use of VR for virtual environments and fire simulations. The second is the use of VR for eye tracking and medical simulations. The third is multiplayer development for more immersive collaborative simulations. The fourth is the development of typing in 3D for virtual reality. Extending from this final area, the thesis describes an application and offers practical, granular guidance on developing for VR with the real-time development platform Unity.

    Social navigation of autonomous robots in populated environments

    Doctoral Program in Biotechnology, Engineering and Chemical Technology. Research line: Computer Engineering (program code: DBI; line code: 19).
    Today, more and more mobile robots are coexisting with us in our daily lives. As a result, the behavior of robots that share space with humans in dynamic environments is a subject of intense investigation in robotics. Robots must respect human social conventions, guarantee the comfort of surrounding people, and maintain legibility so that humans can understand the robot's intentions. Robots that move in humans' vicinity should navigate in a socially compliant way; this is called human-aware navigation. These social behaviors are not easy to frame in mathematical expressions. Consequently, motion planners with pre-programmed constraints and hard-coded functions can fail to acquire proper behaviors related to human-awareness. All in all, it is easier to demonstrate socially acceptable behaviors than to define them mathematically. Therefore, learning these social behaviors from data seems a more principled approach. This thesis aims at endowing mobile robots with new social skills for autonomous navigation in spaces populated with humans. This work makes use of learning from demonstration (LfD) approaches to solve the problem of human-aware navigation. Different techniques and algorithms are explored and developed in order to transfer social navigation behaviors to a robot by using demonstrations of human experts performing the proposed tasks. The contributions of this thesis are in the field of learning from demonstration applied to human-aware navigation tasks. First, an LfD technique based on Inverse Reinforcement Learning (IRL) is employed to learn a policy for "social" local motion planning. Then, a novel learning algorithm combining LfD concepts and sampling-based path planners is presented.
Finally, other novel approaches combining different LfD techniques, such as deep learning, with path planners are investigated. The proposed methods are compared against state-of-the-art approaches and tested in different experiments with the real robots employed in the European projects FROG and TERESA.
    Universidad Pablo de Olavide de Sevilla, Departamento de Deporte e Informática.
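As an illustration of the kind of hand-coded social constraint the thesis argues is insufficient on its own, a typical proxemic cost assigns a Gaussian penalty around each detected person, which a motion planner then adds to its path cost (parameter values here are arbitrary assumptions):

```python
import numpy as np

def proxemic_cost(point, people, sigma=0.8, weight=10.0):
    """Hand-coded social cost: a Gaussian penalty around each person,
    encouraging the planner to keep a comfortable distance.

    point: (2,) candidate robot position in metres
    people: (n, 2) detected person positions in metres
    sigma: spread of the personal-space zone; weight: peak penalty
    """
    d2 = np.sum((people - point) ** 2, axis=1)  # squared distances
    return weight * np.exp(-d2 / (2 * sigma ** 2)).sum()
```

Learning from demonstration, as pursued in the thesis, effectively replaces such fixed weights and shapes with behavior inferred from expert trajectories.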

    Collaborative design and feasibility assessment of computational nutrient sensing for simulated food-intake tracking in a healthcare environment

    One in four older adults (65 years and over) is living with some form of malnutrition. This increases their odds of hospitalization four-fold and is associated with decreased quality of life and increased mortality. In long-term care (LTC), residents have more complex care needs, and the proportion affected is a staggering 54%, primarily due to low intake. Tracking intake is important for monitoring whether residents are meeting their nutritional needs; however, current methods are time-consuming, subjective, and prone to large margins of error. This reduces the utility of tracked data and makes it challenging to identify at-risk individuals in a timely fashion. While technologies exist for tracking food intake, they have not been designed for use within the LTC context and impose a large time burden on the user. Especially in light of the machine learning boom, there is great opportunity to harness learnings from that domain and apply them to the field of nutrition for enhanced food-intake tracking. Additionally, current approaches to food-intake tracking are limited by the nutritional database to which they are linked, making generalizability a challenge. Drawing inspiration from current methods, the desires of end-users (primary users: personal support workers, registered staff, dietitians), and machine learning approaches suitable for this context in which there is limited data available, we investigated novel methods for assessing needs in this environment and imagined an alternative approach. We leveraged image processing and machine learning to remove subjectivity while increasing accuracy and precision to support higher-quality food-intake tracking. This thesis presents the ideation, design, development, evaluation, and feasibility assessment of a collaboratively designed computational nutrient sensing system for simulated food-intake tracking in the LTC environment.
We sought to remove potential barriers to uptake through collaborative design and ongoing end-user engagement, developing solution concepts for a novel Automated Food Imaging and Nutrient Intake Tracking (AFINI-T) system while implementing the technology in parallel. More specifically, we demonstrated the effectiveness of applying a modified participatory iterative design process, modeled on the Google Sprint framework, in the LTC context, which identified priority areas and established functional criteria for usability and feasibility. Concurrently, we developed the novel AFINI-T system through the co-integration of image processing and machine learning, guided by the application of food-intake tracking in LTC, to address three questions: (1) where is there food? (i.e., food segmentation); (2) how much food was consumed? (i.e., volume estimation), using a fully automatic imaging system for quantifying food intake. We proposed a novel deep convolutional encoder-decoder food network with depth-refinement (EDFN-D) using an RGB-D camera for quantifying a plate's remaining food volume relative to reference portions in whole and modified-texture foods. To determine (3) what foods are present (i.e., feature extraction and classification), we developed a convolutional autoencoder to learn meaningful food-specific features, and developed classifiers which leverage a priori information about when certain foods are offered and the level of texture modification prescribed, to apply the real-world constraints of LTC. We sought to address real-world complexity by assessing a wide variety of food items through the construction of a simulated food-intake dataset emulating various degrees of food intake and modified textures (regular, minced, puréed). To ensure feasibility-related barriers to uptake were mitigated, we conducted a feasibility assessment using the collaboratively designed prototype.
Finally, this thesis explores the feasibility of applying biophotonic principles to food as a first step toward enhancing food database estimates. Motivated by a theoretical optical dilution model, a novel deep neural network (DNN) was evaluated for estimating the relative nutrient density of commercially prepared purées. For deeper analysis, we describe the link between color and two optically active nutrients, vitamin A and anthocyanins, and suggest it may be feasible to utilize the optical properties of foods to enhance nutritional estimation. This research demonstrates a transdisciplinary approach to designing and implementing a novel food-intake tracking system which addresses several shortcomings of the current method. Upon translation, this system may provide additional insights for supporting more timely nutritional interventions through enhanced monitoring of nutritional intake status among LTC residents.
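The abstract does not detail the volume-estimation step; as an illustrative sketch only, remaining food volume can be approximated from a top-down RGB-D view by integrating the height of the food surface above a reference scan of the empty plate, restricted to the segmentation mask (all names and the simplified flat-plate geometry are assumptions, not the AFINI-T implementation):

```python
import numpy as np

def food_volume(depth, plate_depth, mask, pixel_area):
    """Estimate food volume from a top-down depth map.

    depth: (H, W) distance from camera to observed surface, in metres
    plate_depth: (H, W) distance to the empty plate (reference scan)
    mask: (H, W) boolean food-segmentation mask (e.g. produced by a
          segmentation network such as the thesis's EDFN-D)
    pixel_area: ground area covered by one pixel at the plate, in m^2
    """
    # Food sits above the plate, i.e. closer to a downward-facing camera.
    height = np.clip(plate_depth - depth, 0.0, None)
    # Integrate height over the masked food region.
    return float(np.sum(height[mask]) * pixel_area)
```

Comparing this estimate against a reference-portion scan of the same dish would then yield the proportion of food consumed.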