102 research outputs found

    From images to augmented 3D models: improved visual SLAM and augmented point cloud modeling

    This thesis investigates the problem of using monocular image sequences to generate augmented models. The problem is decomposed into two subproblems: monocular visual simultaneous localization and mapping (VSLAM), and point cloud data (PCD) modeling. Accordingly, the thesis comprises two major parts. The first part, comprising Chapters 2, 3 and 4, leverages system observability theory to improve VSLAM accuracy. In Chapter 2, a piecewise linear system is developed to model VSLAM, and two necessary conditions for complete observability of the VSLAM are proved. Based on the first condition, an instantaneous condition for complete observability, the "Optimally Observable and Minimal Cardinality (OOMC) VSLAM" is presented in Chapter 3. The OOMC algorithm selects the feature subset of minimal required cardinality that forms the most strongly observable VSLAM subsystem. The selected feature subset is further used to improve data association in VSLAM. Based on the second condition, a temporal condition for complete observability, the "Good Features (GF) to Track for VSLAM" is presented in Chapter 4. The GF algorithm ranks individual features according to their contributions to system observability. Benchmarking experiments with both the OOMC and GF algorithms demonstrate improvements in VSLAM performance. The second part, comprising Chapters 5 and 6, addresses the PCD modeling problem in a geometry-driven manner. Chapter 5 presents an algorithm that models PCDs with planar patches via a sparsity-inducing optimization. Chapter 6 extends PCD modeling to models based on quadratic surface primitives. A method is further developed to retrieve high-level semantic information about the model components. Evaluation on PCDs generated from VSLAM demonstrates the effectiveness of these geometry-driven PCD modeling approaches. Ph.D.

    A COMPUTATION METHOD/FRAMEWORK FOR HIGH LEVEL VIDEO CONTENT ANALYSIS AND SEGMENTATION USING AFFECTIVE LEVEL INFORMATION

    Video segmentation facilitates efficient video indexing and navigation in large digital video archives. It is an important process in a content-based video indexing and retrieval (CBVIR) system. Many automated solutions performed segmentation by utilizing information about the "facts" of the video. These "facts" come in the form of labels that describe the objects captured by the camera. Solutions of this type achieved good and consistent results for some video genres, such as news programs and informational presentations. The content format of such videos is generally quite standard, and automated solutions were designed to follow these format rules. For example, in [1], the presence of news anchor persons was used as a cue to determine the start and end of a meaningful news segment. The same cannot be said for video genres such as movies and feature films, because the makers of such videos use different filming techniques to elicit particular affective responses from their target audience. Humans usually perform manual video segmentation by trying to relate changes in time and locale to discontinuities in meaning [2]. As a result, viewers usually have doubts about the boundary locations of a meaningful video segment due to their different affective responses. This thesis presents an entirely new view of the problem of high-level video segmentation. We developed a novel probabilistic method for affective-level video content analysis and segmentation. Our method has two stages. In the first stage, affective content labels were assigned to video shots by means of a dynamic Bayesian network (DBN). A novel hierarchical-coupled dynamic Bayesian network (HCDBN) topology was proposed for this stage. The topology was based on the pleasure-arousal-dominance (P-A-D) model of affect representation [3]. In principle, this model can represent a large number of emotions.
In the second stage, the visual, audio and affective information of the video was used to compute a statistical feature vector representing the content of each shot. Affective-level video segmentation was achieved by applying spectral clustering to the feature vectors. We evaluated the first stage of our proposal by comparing its emotion detection ability with all existing works related to the field of affective video content analysis. To evaluate the second stage, we used the time adaptive clustering (TAC) algorithm as our performance benchmark. The TAC algorithm was the best high-level video segmentation method [2]. However, it is very computationally intensive. To accelerate it, we developed a modified TAC (modTAC) algorithm designed to map easily onto a field programmable gate array (FPGA) device. Both the TAC and modTAC algorithms were used as performance benchmarks for our proposed method. Since affective video content is a perceptual concept, segmentation performance and human agreement rates were used as our evaluation criteria. To obtain our ground truth data and viewer agreement rates, a pilot panel study based on the work of Gross et al. [4] was conducted. Experimental results show the feasibility of our proposed method. For the first stage of our proposal, the results show an average improvement of as high as 38% over previous works. For the second stage, an improvement of as high as 37% was achieved over the TAC algorithm.

    Similarity, Retrieval, and Classification of Motion Capture Data

    Three-dimensional motion capture data is a digital representation of the complex spatio-temporal structure of human motion. Mocap data is widely used for the synthesis of realistic computer-generated characters in data-driven computer animation and also plays an important role in motion analysis tasks such as activity recognition. Both for efficiency and cost reasons, methods for the reuse of large collections of motion clips are gaining in importance in the field of computer animation. Here, an active field of research is the application of morphing and blending techniques for the creation of new, realistic motions from prerecorded motion clips. This requires the identification and extraction of logically related motions scattered within some data set. Such content-based retrieval of motion capture data, which is a central topic of this thesis, constitutes a difficult problem due to possible spatio-temporal deformations between logically related motions. Recent approaches to motion retrieval apply techniques such as dynamic time warping, which, however, are not applicable to large data sets due to their quadratic space and time complexity. In our approach, we introduce various kinds of relational features describing boolean geometric relations between specified body points and show how these features induce a temporal segmentation of motion capture data streams. By incorporating spatio-temporal invariance into the relational features and induced segments, we are able to adopt indexing methods allowing for flexible and efficient content-based retrieval in large motion capture databases. As a further application of relational motion features, a new method for fully automatic motion classification and retrieval is presented. We introduce the concept of motion templates (MTs), by which the spatio-temporal characteristics of an entire motion class can be learned from training data, yielding an explicit, compact matrix representation. 
The resulting class MT has a direct, semantic interpretation, and it can be manually edited, mixed, combined with other MTs, extended, and restricted. Furthermore, a class MT exhibits the characteristic as well as the variational aspects of the underlying motion class at a semantically high level. Classification is then performed by comparing a set of precomputed class MTs with unknown motion data and labeling matching portions with the respective motion class label. Here, the crucial point is that the variational (hence uncharacteristic) motion aspects encoded in the class MT are automatically masked out in the comparison, which can be thought of as locally adaptive feature selection.
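As an illustration of how boolean relational features can induce a temporal segmentation, the following minimal sketch merges consecutive frames that share the same feature vector into one segment. The two features and the frame values are hypothetical and chosen only for illustration; they are not taken from the thesis, and the thesis's actual feature set and indexing scheme may differ.

```python
from itertools import groupby


def segment_by_features(feature_frames):
    """Merge consecutive frames with identical boolean feature vectors.

    Returns a list of (feature_vector, start_frame, end_frame) tuples,
    with inclusive frame indices.
    """
    segments = []
    start = 0
    for vector, run in groupby(feature_frames):
        length = sum(1 for _ in run)
        segments.append((vector, start, start + length - 1))
        start += length
    return segments


# Hypothetical per-frame features: (left_foot_in_front, right_hand_raised)
frames = [(0, 0), (0, 0), (1, 0), (1, 0), (1, 1), (0, 1), (0, 1)]
segments = segment_by_features(frames)

# The same motion performed three times slower (each frame repeated)
slow_frames = [f for f in frames for _ in range(3)]
slow_segments = segment_by_features(slow_frames)
```

In this sketch the slowed-down motion yields the same sequence of feature vectors as the original, only with longer segments, which gives a rough intuition for why segment-level comparison tolerates temporal deformations.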

    Large-scale interactive exploratory visual search

    Large-scale visual search has been one of the challenging issues in the era of big data. It demands techniques that are not only highly effective and efficient but also allow users to conveniently express their information needs and refine their intents. In this thesis, we focus on developing an exploratory framework for large-scale visual search. We also develop a number of enabling techniques, including compact visual content representation for scalable search, near-duplicate video shot detection, and action-based event detection. We propose a novel scheme for extremely low bit rate visual search, which sends compressed visual words consisting of a vocabulary tree histogram and descriptor orientations rather than descriptors. Compact representation of video data is achieved by identifying keyframes of a video, which can also help users comprehend visual content efficiently. We propose a novel Bag-of-Importance model for static video summarization. Near-duplicate detection is one of the key issues for large-scale visual search, since there exist a large number of nearly identical images and videos. We propose an improved near-duplicate video shot detection approach for more effective shot representation. Event detection has been one of the solutions for bridging the semantic gap in visual search. We focus in particular on event detection centred on human actions. We propose an enhanced sparse coding scheme to model human actions. Our proposed approach significantly reduces computational cost while achieving recognition accuracy highly comparable to state-of-the-art methods. Finally, we propose an integrated solution that addresses the prime challenges arising from large-scale interactive visual search. The proposed system is also one of the first attempts at exploratory visual search. It provides users with more robust results to support their exploration.

    Robust and affordable localization and mapping for 3D reconstruction. Application to architecture and construction

    Simultaneous localization and mapping from a single moving camera is known as Monocular SLAM. This thesis addresses the problem with low-cost cameras, where the main challenge is robustness to noise, blurring and other artifacts that affect the image. The approach to the problem is discrete, using only salient image points to localize the camera and map the environment. The main contribution is a simplification of the pose graph that improves accuracy in the most common scenes, evaluated exhaustively on 4 datasets. The mapping results yield a 3D reconstruction of the scene that can be used in architecture and construction for Building Information Modeling (BIM). In the second part of the thesis we propose incorporating this information into an advanced visualization system based on WebGL that helps simplify the adoption of the BIM methodology. Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos). Doctorado en Informátic

    Computer-Assisted Choreography of Multi-Person Motion (컴퓨터를 활용한 여러 사람의 동작 연출)

    Doctoral thesis, Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, August 2017. 이제희.
    Choreographing motion is the process of converting written stories or messages into the real movement of actors. In performances and movies, directors spend considerable time and effort on it because it is the primary factor that audiences concentrate on. If multiple actors exist in the scene, choreography becomes more challenging. The fundamental difficulty is that the coordination between actors must be adjusted precisely. Spatio-temporal coordination is the first requirement that must be satisfied, and causality and mood are further important aspects of coordination. Directors use several assistant tools, such as storyboards or roughly crafted 3D animations, which can visualize the flow of movements, to organize ideas or to explain them to actors. However, the tools are difficult to use because artistry and considerable training effort are required. They also cannot offer suggestions or feedback. Finally, the amount of manual labor increases exponentially as the number of actors increases. In this thesis, we propose computational approaches to choreographing multiple-actor motion. The ultimate goal is to enable novice users to easily generate motions of multiple actors without substantial effort. We first show an approach to generating motions for shadow theatre, where actors should carefully collaborate to achieve the same goal. The results are comparable to those made by professional actors. Next, we present an interactive animation system for pre-visualization, where users exploit an intuitive graphical interface for scene description. Given a description, the system can generate motions for the characters in the scene that match the description. Finally, we propose two controller designs (combining regression with trajectory optimization, and evolutionary deep reinforcement learning) for physically simulated actors, which guarantee the physical validity of the resultant motions.
    Chapter 1 Introduction
    Chapter 2 Background
    2.1 Motion Generation Technique: 2.1.1 Motion Editing and Synthesis for Single-Character; 2.1.2 Motion Editing and Synthesis for Multi-Character; 2.1.3 Motion Planning; 2.1.4 Motion Control by Reinforcement Learning; 2.1.5 Pose/Motion Estimation from Incomplete Information; 2.1.6 Diversity on Resultant Motions
    2.2 Authoring System: 2.2.1 System using High-level Input; 2.2.2 User-interactive System
    2.3 Shadow Theatre: 2.3.1 Shadow Generation; 2.3.2 Shadow for Artistic Purpose; 2.3.3 Viewing Shadow Theatre as Collages/Mosaics of People
    2.4 Physics-based Controller Design: 2.4.1 Controllers for Various Characters; 2.4.2 Trajectory Optimization; 2.4.3 Sampling-based Optimization; 2.4.4 Model-Based Controller Design; 2.4.5 Direct Policy Learning; 2.4.6 Deep Reinforcement Learning for Control
    Chapter 3 Motion Generation for Shadow Theatre
    3.1 Overview
    3.2 Shadow Theatre Problem: 3.2.1 Problem Definition; 3.2.2 Approaches of Professional Actors
    3.3 Discovery of Principal Poses: 3.3.1 Optimization Formulation; 3.3.2 Optimization Algorithm
    3.4 Animating Principal Poses: 3.4.1 Initial Configuration; 3.4.2 Optimization for Motion Generation
    3.5 Experimental Results: 3.5.1 Implementation Details; 3.5.2 Animation; 3.5.3 3D Fabrication
    3.6 Discussion
    Chapter 4 Interactive Animation System for Pre-visualization
    4.1 Overview
    4.2 Graphical Scene Description
    4.3 Candidate Scene Generation: 4.3.1 Connecting Paths; 4.3.2 Motion Cascade; 4.3.3 Motion Selection For Each Cycle; 4.3.4 Cycle Ordering; 4.3.5 Generalized Paths and Cycles; 4.3.6 Motion Editing
    4.4 Scene Ranking: 4.4.1 Ranking Criteria; 4.4.2 Scene Ranking Measures
    4.5 Scene Refinement
    4.6 Experimental Results
    4.7 Discussion
    Chapter 5 Physics-based Design and Control
    5.1 Overview
    5.2 Combining Regression with Trajectory Optimization: 5.2.1 Simulation and Motor Skills; 5.2.2 Control Adaptation; 5.2.3 Control Parameterization; 5.2.4 Efficient Construction; 5.2.5 Experimental Results; 5.2.6 Discussion
    5.3 Example-Guided Control by Deep Reinforcement Learning: 5.3.1 System Overview; 5.3.2 Initial Policy Construction; 5.3.3 Evolutionary Deep Q-Learning; 5.3.4 Experimental Results; 5.3.5 Discussion
    Chapter 6 Conclusion
    6.1 Contribution
    6.2 Future Work
    Abstract in Korean (요약)
    Docto

    Discrete language models for video retrieval

    Finding relevant video content is important for producers of television news, documentaries and commercials. As digital video collections become more widely available, content-based video retrieval tools will likely grow in importance for an even wider group of users. In this thesis we investigate language modelling approaches, which have been the focus of recent attention within the text information retrieval community, for the video search task. Language models are smoothed discrete generative probability distributions, generally of text, and provide a neat information retrieval formalism that we believe is as applicable to traditional visual features as it is to text. We propose to model colour, edge and texture histogram-based features directly with discrete language models, and this approach is compatible with further traditional visual feature representations. We provide a comprehensive and robust empirical study of smoothing methods, hierarchical semantic and physical structures, and fusion methods for this language modelling approach to video retrieval. The advantage of our approach is that it provides a consistent, effective and relatively efficient model for video retrieval.
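To make the idea of a smoothed discrete language model over histogram features concrete, here is a minimal sketch of query-likelihood scoring with Jelinek-Mercer (linear interpolation) smoothing. The choice of smoothing method, the function name, and the toy histograms are all assumptions made for illustration; the abstract does not state which smoothing methods or feature quantizations the thesis actually uses.

```python
import math


def jm_log_likelihood(query_hist, shot_hist, collection_hist, lam=0.5):
    """Log-likelihood of query bin counts under a smoothed shot model.

    Each histogram is a list of non-negative bin counts of equal length.
    The shot model is interpolated with the collection model:
        p(bin) = (1 - lam) * p_shot(bin) + lam * p_collection(bin)
    Assumes every bin used by the query has a nonzero collection count.
    """
    shot_total = sum(shot_hist)
    coll_total = sum(collection_hist)
    score = 0.0
    for q, s, c in zip(query_hist, shot_hist, collection_hist):
        if q == 0:
            continue  # bins absent from the query contribute nothing
        p_shot = s / shot_total if shot_total else 0.0
        p_coll = c / coll_total
        score += q * math.log((1 - lam) * p_shot + lam * p_coll)
    return score


# Toy 3-bin colour histograms (hypothetical values)
shot_a = [6, 2, 0]
shot_b = [0, 2, 6]
collection = [a + b for a, b in zip(shot_a, shot_b)]
query = [3, 1, 0]

score_a = jm_log_likelihood(query, shot_a, collection)
score_b = jm_log_likelihood(query, shot_b, collection)
```

Here `shot_a` scores higher than `shot_b` because its histogram is closer to the query, while smoothing keeps the score of `shot_b` finite even though its first bin is empty.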