
    Augmentieren von Personen in Monokularen Videodaten (Augmenting People in Monocular Video Data)

    When aiming at realistic video augmentation, i.e. the embedding of virtual, three-dimensional objects into a scene's original content, a series of challenging problems has to be solved. This is especially the case when working with solely monocular input material, as important 3D information is missing and has to be recovered during the process. In this work, I present a semi-automatic strategy that tackles this task by providing solutions to the individual problems, using virtual clothing as an example of realistic video augmentation. Starting with two different approaches for monocular pose and motion estimation, I show how to build a 3D human body model by estimating detailed shape information as well as basic surface material properties. This information further allows a dynamic illumination model to be extracted from the provided input material. The illumination model is particularly important for rendering a realistic virtual object and adds considerably to the realism of the final video augmentation. The animated human model is able to interact with virtual 3D objects and is used, in the context of virtual clothing, to animate simulated garments. To achieve the desired realism, I present an additional image-based compositing approach that realistically embeds the simulated garment into the original scene content. Combining the presented approaches provides an integrated strategy for the realistic augmentation of actors in monocular video sequences.
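    The image-based compositing step can be illustrated with a standard "over" operation: the rendered garment layer is blended into the original frame using its alpha matte, optionally after darkening the background with a shadow map. The sketch below is a minimal illustration under these assumptions; the function name, parameters and the optional shadow term are placeholders, not the thesis's actual implementation.

```python
import numpy as np

def composite_garment(frame, garment_rgb, garment_alpha, shadow_map=None):
    """Blend a rendered garment layer over the original video frame (sketch).

    frame, garment_rgb : float arrays in [0, 1], shape (H, W, 3)
    garment_alpha      : float array in [0, 1], shape (H, W, 1)
    shadow_map         : optional per-pixel attenuation in [0, 1], shape (H, W, 1)
    """
    background = frame
    if shadow_map is not None:
        # Darken the original content where the virtual garment casts a shadow.
        background = background * shadow_map
    # Standard "over" operator: garment in front of the (possibly shadowed) scene.
    return garment_alpha * garment_rgb + (1.0 - garment_alpha) * background
```

    In the thesis the rendered garment is lit with the dynamic illumination model recovered from the video, so the blended layer already matches the scene lighting; the sketch only shows the final blending.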

    3D-TV Production from Conventional Cameras for Sports Broadcast

    3DTV production of live sports events presents a challenging problem: the requirement to maintain broadcast stereo picture quality conflicts with the practical problems of developing robust systems for cost-effective deployment. In this paper we propose an alternative approach to stereo production for sports events that uses the conventional monocular broadcast cameras for 3D reconstruction of the event and subsequent stereo rendering. This approach has the potential advantage over stereo camera rigs of recovering full scene depth, allowing the inter-ocular distance and convergence to be adapted to the requirements of the target display and enabling stereo coverage from both existing and ‘virtual’ camera positions without additional cameras. A prototype system is presented, with results of sports TV production trials for the rendering of stereo and free-viewpoint video sequences of soccer and rugby.
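    Because full scene depth is recovered, the horizontal disparity of every rendered pixel follows directly from its depth, which is what makes the inter-ocular distance and convergence adjustable for the target display. Below is a minimal sketch of that relationship, assuming a parallel virtual camera pair whose zero-disparity (screen) plane is set by a convergence depth; names and sign convention are illustrative, not the paper's implementation.

```python
import numpy as np

def stereo_disparity(depth, focal_px, interocular_m, convergence_m):
    """Per-pixel horizontal disparity (in pixels) for a virtual stereo pair (sketch).

    depth         : scene depth in metres from the 3D reconstruction, shape (H, W)
    focal_px      : focal length of the virtual cameras, in pixels
    interocular_m : baseline between the virtual left/right cameras, in metres
    convergence_m : depth of the zero-disparity (screen) plane, in metres
    """
    # Points at the convergence depth land on the screen plane (zero disparity);
    # nearer points get crossed (positive) disparity, farther points uncrossed.
    return focal_px * interocular_m * (1.0 / convergence_m - 1.0 / np.asarray(depth))
```

    Re-rendering with a different interocular_m or convergence_m changes only how the reconstructed scene is projected for the left and right eyes, not the captured footage itself.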

    Freeform 3D interactions in everyday environments

    PhD Thesis. Personal computing is continuously moving away from traditional input using mouse and keyboard, as new input technologies emerge. Recently, natural user interfaces (NUI) have led to interactive systems that are inspired by our physical interactions in the real world, and focus on enabling dexterous freehand input in 2D or 3D. Another recent trend is Augmented Reality (AR), which follows a similar goal of further reducing the gap between the real and the virtual, but predominantly focuses on output, by overlaying virtual information onto a tracked real-world 3D scene. Whilst AR and NUI technologies have been developed for both immersive 3D output and seamless 3D input, they have mostly been looked at separately. NUI focuses on sensing the user and enabling new forms of input; AR traditionally focuses on capturing the environment around us and enabling new forms of output that are registered to the real world. The output of NUI systems is mainly presented on a 2D display, while the input technologies for AR experiences, such as data gloves and body-worn motion trackers, are often uncomfortable and restricting when interacting in the real world. NUI and AR can be seen as very complementary, and bringing these two fields together can lead to new user experiences that radically change the way we interact with our everyday environments. The aim of this thesis is to enable real-time, low-latency, dexterous input and immersive output without heavily instrumenting the user. The main challenge is to retain and to meaningfully combine the positive qualities that are attributed to both NUI and AR systems. I review work in the intersecting research fields of AR and NUI, and explore freehand 3D interactions with varying degrees of expressiveness, directness and mobility in various physical settings. There are a number of technical challenges that arise when designing a mixed NUI/AR system, which I address in this work: What can we capture, and how? How do we represent the real in the virtual? And how do we physically couple input and output? This is achieved by designing new systems, algorithms, and user experiences that explore the combination of AR and NUI.

    Real-time human performance capture and synthesis

    Most of the images one finds in the media, such as on the Internet or in textbooks and magazines, contain humans as the main point of attention. Thus, there is an inherent necessity for industry, society, and private persons to be able to thoroughly analyze and synthesize the human-related content in these images. One aspect of this analysis, and the subject of this thesis, is to infer the 3D pose and surface deformation using only visual information, which is also known as human performance capture. Human performance capture enables the tracking of virtual characters from real-world observations, and this is key for visual effects, games, VR, and AR, to name just a few application areas. However, traditional capture methods usually rely on multi-view (marker-based) systems that are prohibitively expensive for the vast majority of people, or they use depth sensors, which are still not as common as single color cameras. Recently, some approaches have attempted to solve the task by assuming that only a single RGB image is given. Nonetheless, they either cannot track the dense deforming geometry of the human, such as the clothing layers, or they are far from the real-time performance that is indispensable for many applications. To overcome these shortcomings, this thesis proposes two monocular human performance capture methods, which for the first time allow the real-time capture of the dense deforming geometry as well as previously unseen 3D accuracy for pose and surface deformations. At the technical core, this work introduces novel GPU-based and data-parallel optimization strategies in conjunction with other algorithmic design choices that are all geared towards real-time performance at high accuracy. Moreover, this thesis presents a new weakly supervised multi-view training strategy combined with a fully differentiable character representation that shows superior 3D accuracy.

    However, there is more to human-related Computer Vision than the analysis of people in images. It is equally important to synthesize new images of humans in unseen poses and from camera viewpoints that have not been observed in the real world. Such tools are essential for the movie industry because they, for example, allow the synthesis of photo-realistic virtual worlds with real-looking humans, or of content that is too dangerous for actors to perform on set. Video conferencing and telepresence applications can also benefit from photo-real 3D characters, as they can enhance the immersive experience of these applications. Here, the traditional Computer Graphics pipeline for rendering photo-realistic images involves many tedious and time-consuming steps that require expert knowledge and are far from real time: character rigging and skinning, the modeling of surface appearance properties, and physically based ray tracing. Recent learning-based methods attempt to simplify the traditional rendering pipeline and instead learn the rendering function from data, resulting in methods that are more easily accessible to non-experts. However, most of them model the synthesis task entirely in image space, such that 3D consistency cannot be achieved, and/or they fail to model motion- and view-dependent appearance effects. To this end, this thesis presents a method, and ongoing work, on character synthesis that allows controllable photo-real characters to be synthesized which achieve motion- and view-dependent appearance effects as well as 3D consistency, and which run in real time. This is technically achieved by a novel coarse-to-fine geometric character representation for efficient synthesis, which can be supervised solely on multi-view imagery. Furthermore, this work shows how such a geometric representation can be combined with an implicit surface representation to boost synthesis and geometric quality.

    ERC Consolidator Grant 4DRepL
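    To give a concrete idea of what weakly supervised multi-view training means here, the sketch below shows a generic multi-view reprojection term: 3D joints predicted by the differentiable character model are projected into several calibrated views and compared against 2D detections, so no 3D ground-truth annotations are needed. Shapes, names and the choice of a keypoint-only term are illustrative assumptions; the actual training objective in this work is richer than this.

```python
import torch

def multiview_keypoint_loss(joints_3d, projections, keypoints_2d, confidences):
    """Weakly supervised multi-view reprojection loss (sketch).

    joints_3d    : (J, 3) 3D joints predicted by the character model
    projections  : (C, 3, 4) projection matrices of C calibrated cameras
    keypoints_2d : (C, J, 2) 2D detections, e.g. from an off-the-shelf detector
    confidences  : (C, J) per-detection confidence weights
    """
    ones = torch.ones(joints_3d.shape[0], 1,
                      dtype=joints_3d.dtype, device=joints_3d.device)
    joints_h = torch.cat([joints_3d, ones], dim=1)      # homogeneous (J, 4)
    loss = joints_3d.new_zeros(())
    for P, kp, w in zip(projections, keypoints_2d, confidences):
        proj = joints_h @ P.T                            # (J, 3)
        proj_2d = proj[:, :2] / proj[:, 2:3]             # perspective divide
        loss = loss + (w * ((proj_2d - kp) ** 2).sum(dim=1)).mean()
    return loss / projections.shape[0]
```

    Because every operation above is differentiable, the same idea extends to dense terms such as silhouettes and lets gradients flow back into the character representation itself.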

    Pedestrian detection and tracking using stereo vision techniques

    Automated pedestrian detection, counting and tracking has received significant attention from the computer vision community of late. Many of the person detection techniques described so far in the literature work well in controlled environments, such as laboratory settings with a small number of people. This allows various assumptions to be made that simplify this complex problem. The performance of these techniques, however, tends to deteriorate when presented with unconstrained environments where pedestrian appearances, numbers, orientations, movements, occlusions and lighting conditions violate these convenient assumptions. Recently, 3D stereo information has been proposed as a technique to overcome some of these issues and to guide pedestrian detection. This thesis presents such an approach, whereby, after obtaining robust 3D information via a novel disparity estimation technique, pedestrian detection is performed via a 3D point clustering process within a region-growing framework. This clustering process avoids hard thresholds by using biometrically inspired constraints and a number of plan-view statistics. This pedestrian detection technique requires no external training and is able to robustly handle challenging real-world unconstrained environments from various camera positions and orientations. In addition, this thesis presents a continuous detect-and-track approach, with additional kinematic constraints and explicit occlusion analysis, to obtain robust temporal tracking of pedestrians over time. These approaches are experimentally validated using challenging datasets consisting of both synthetic data and real-world sequences gathered from a number of environments. In each case, the techniques are evaluated using both 2D and 3D ground-truth methodologies.
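    The core clustering idea, growing a group of reconstructed 3D points outwards from a seed while biometrically inspired limits keep the cluster pedestrian-sized instead of relying on hard global thresholds, can be sketched as follows. The seeding strategy, plan-view statistics and numeric limits used in the thesis are richer than this; everything below is an illustrative assumption.

```python
import numpy as np
from collections import deque

def grow_pedestrian_cluster(points, seed_idx, radius=0.3,
                            min_height=1.2, max_width=0.9):
    """Region-growing over 3D points, with person-sized constraints (sketch).

    points   : (N, 3) points from stereo, with z = height above the ground plane
    seed_idx : index of the seed point, e.g. a local maximum of a plan-view height map
    """
    in_cluster = np.zeros(len(points), dtype=bool)
    in_cluster[seed_idx] = True
    queue = deque([seed_idx])
    while queue:
        i = queue.popleft()
        # Candidate neighbours: close to the current point in the ground (plan-view) plane.
        d_xy = np.linalg.norm(points[:, :2] - points[i, :2], axis=1)
        for j in np.where((d_xy < radius) & ~in_cluster)[0]:
            trial = in_cluster.copy()
            trial[j] = True
            # Biometric constraint: the plan-view footprint must stay person-sized.
            if np.ptp(points[trial, :2], axis=0).max() <= max_width:
                in_cluster[j] = True
                queue.append(j)
    cluster = points[in_cluster]
    # Biometric constraint: discard clusters too short to be a standing pedestrian.
    return cluster if cluster[:, 2].max() >= min_height else None
```

    A full detector would run such growth from many seeds, merge overlapping clusters and then apply further plan-view statistics, as described in the abstract.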

    3D-LIVE : live interactions through 3D visual environments

    This paper explores Future Internet (FI) 3D-Media technologies and the Internet of Things (IoT) in real and virtual environments in order to sense and experiment with real-time interaction in live situations. The combination of FI testbeds and Living Labs (LL) would enable both researchers and users to explore the capacity to enter the 3D Tele-Immersive (TI) application market and to establish new requirements for FI technology and infrastructure. It is expected that combining both FI technology pull and TI market pull would promote and accelerate the creation and adoption, by user communities such as sport practitioners, of innovative TI services within sport events.

    A Deeper Look into DeepCap

    Human performance capture is a highly important computer vision problem with many applications in movie production and virtual/augmented reality. Many previous performance capture approaches either required expensive multi-view setups or did not recover dense space-time coherent geometry with frame-to-frame correspondences. We propose a novel deep learning approach for monocular dense human performance capture. Our method is trained in a weakly supervised manner based on multi-view supervision, completely removing the need for training data with 3D ground-truth annotations. The network architecture is based on two separate networks that disentangle the task into a pose estimation and a non-rigid surface deformation step. Extensive qualitative and quantitative evaluations show that our approach outperforms the state of the art in terms of quality and robustness. This work is an extended version of DeepCap, in which we provide more detailed explanations, comparisons and results, as well as applications.
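    The "two separate networks" mentioned above disentangle skeletal pose from non-rigid surface deformation. The sketch below only illustrates that split at the interface level, assuming a shared image feature vector; the layer sizes, the number of pose parameters and the deformation parameterisation (plain per-node displacements here, rather than the actual deformation parameters) are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PoseBranch(nn.Module):
    """Regresses skeletal pose parameters from image features (sketch)."""
    def __init__(self, feat_dim=512, num_dofs=57):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_dofs))

    def forward(self, feats):           # feats: (B, feat_dim)
        return self.head(feats)         # (B, num_dofs) joint angles / root pose


class DeformationBranch(nn.Module):
    """Regresses non-rigid deformation parameters of the template (sketch)."""
    def __init__(self, feat_dim=512, num_nodes=500):
        super().__init__()
        self.num_nodes = num_nodes
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, num_nodes * 3))

    def forward(self, feats):
        # Per-node 3D displacements that non-rigidly deform the posed template.
        return self.head(feats).view(-1, self.num_nodes, 3)
```

    The separation mirrors the pose/deformation split described in the abstract; a real training setup would attach the multi-view supervision to each branch with its own loss terms.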