
    Computational Learning for Hand Pose Estimation

    Rapid advances in human–computer interaction interfaces have promised increasingly realistic environments for gaming and entertainment in the last few years. However, the use of traditional input devices such as trackballs, keyboards, or joysticks has been a bottleneck for natural interaction between a human and a computer, as the two degrees of freedom of these devices cannot suitably emulate interactions in three-dimensional space. Consequently, comprehensive hand-tracking technology is expected to provide a smart and intuitive alternative to these input devices and to enhance virtual and augmented reality experiences. In addition, the recent emergence of low-cost depth-sensing cameras has led to the broad use of RGB-D data in computer vision, raising expectations of a full 3D interpretation of hand movements for human–computer interaction interfaces. Although the use of hand gestures or hand postures has become essential for a wide range of applications in computer games and augmented/virtual reality, 3D hand pose estimation remains an open and challenging problem for the following reasons: (i) the hand pose exists in a high-dimensional space because each finger and the palm are associated with several degrees of freedom, (ii) the fingers exhibit self-similarity and often occlude each other, (iii) global 3D rotations make pose estimation more difficult, and (iv) hands occupy only a few pixels in images, and the noise in acquired data, coupled with fast finger movement, confounds continuous hand tracking. The success of hand tracking naturally depends on synthesizing our knowledge of the hand (i.e., geometric shape, constraints on pose configurations) and latent features of hand poses from the RGB-D data stream (i.e., region of interest, key feature points such as fingertips and joints, and temporal continuity).
In this thesis, we propose novel methods that leverage the paradigm of analysis by synthesis and create a prediction model using a population of realistic 3D hand poses. The overall goal of this work is to design a concrete framework through which computers can learn and understand the perceptual attributes of human hands (i.e., self-occlusions and self-similarities of the fingers) and to develop a pragmatic solution to the real-time hand pose estimation problem that is implementable on a standard computer. This thesis can be broadly divided into four parts: learning the hand (i) from recommendations of similar hand poses, (ii) from low-dimensional visual representations, (iii) by hallucinating geometric representations, and (iv) from a manipulated object. Each part covers our algorithmic contributions to the 3D hand pose estimation problem. Additionally, the research work in the appendix proposes a pragmatic technique for applying our ideas to mobile devices with low computational power. Following this structure, we first review the most relevant works on depth sensor-based 3D hand pose estimation in the literature, both with and without a manipulated object. The two approaches prevalent for categorizing hand pose estimation, model-based methods and appearance-based methods, are discussed in detail. In this chapter, we also introduce works relevant to deep learning and attempts to achieve efficient compression of network structures. Next, we describe a synthetic 3D hand model and its motion constraints for simulating realistic human hand movements. The primary research work starts in the following chapter, where we discuss our attempts to produce a better estimation model for 3D hand pose estimation by learning hand articulations from recommendations of similar poses. Specifically, the unknown pose parameters for input depth data are estimated by collaboratively learning the known parameters of all neighborhood poses.
Subsequently, we discuss deep-learned, discriminative, low-dimensional features and a hierarchical solution to the stated problem based on the matrix completion framework. This work is further extended by incorporating a function of geometric properties on the surface of the hand, described by heat diffusion, which robustly captures both the local geometry of the hand and global structural representations. The problem of the hand's interactions with a physical object is also considered in the following chapter. The main insight is that the interacting object can be a source of constraints on hand poses. In this view, we exploit the dependency of the pose on the shape of the object to learn discriminative features of the hand–object interaction, rather than losing hand information to partial or full object occlusions. Subsequently, we present a compressive learning technique in the appendix. Our approach is flexible, enabling us to add more layers and go deeper in the deep learning architecture while keeping the number of parameters the same. Finally, we conclude this thesis by summarizing the presented approaches to hand pose estimation and proposing future directions for further performance improvements through (i) realistically rendered synthetic hand images, (ii) incorporating RGB images as an input, (iii) hand personalization, (iv) use of unstructured point clouds, and (v) embedding sensing techniques.
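The idea of estimating unknown pose parameters from the known parameters of neighborhood poses can be illustrated with a toy nearest-neighbor sketch. All names, the feature dimension, and the 26-DoF pose vector below are illustrative assumptions, not the thesis's actual pipeline:

```python
import numpy as np

def estimate_pose(query_feat, db_feats, db_poses, k=5):
    """Estimate pose parameters as a distance-weighted blend of the
    k nearest database poses (a toy stand-in for collaboratively
    learning from neighborhood poses)."""
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    idx = np.argsort(dists)[:k]
    # Inverse-distance weights; epsilon avoids division by zero.
    w = 1.0 / (dists[idx] + 1e-8)
    w /= w.sum()
    return w @ db_poses[idx]

rng = np.random.default_rng(0)
db_feats = rng.normal(size=(100, 32))  # hypothetical depth features
db_poses = rng.normal(size=(100, 26))  # hypothetical 26-DoF joint-angle vectors
query = db_feats[0] + 0.01 * rng.normal(size=32)
pose = estimate_pose(query, db_feats, db_poses)
print(pose.shape)  # (26,)
```

In practice the thesis replaces this naive blend with a matrix completion formulation, but the intuition is the same: similar observations should share pose parameters.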

    Contributions to the Modelling of Auditory Hallucinations, Social robotics, and Multiagent Systems

    165 p. The Thesis covers three diverse lines of work tackled with the central endeavor of modeling and understanding the phenomena under consideration. Firstly, the Thesis addresses the problem of finding brain connectivity biomarkers of auditory hallucinations, a rather frequent phenomenon that can be related to some pathologies but is also present in the healthy population. We apply machine learning techniques to assess the significance of effective brain connections extracted by either dynamic causal modeling or Granger causality. Secondly, the Thesis deals with the usefulness of social-robotics storytelling as a therapeutic tool for children at risk of exclusion. The Thesis reports on observations gathered in several therapeutic sessions carried out in Spain and Bulgaria under the supervision of tutors and caregivers. Thirdly, the Thesis deals with the spatio-temporal dynamic modeling of social agents to explain the phenomenon of opinion survival among social minorities. The Thesis proposes an eco-social model endowed with spatial mobility of the agents. Such mobility and the spatial perception of the agents are found to be strong mechanisms explaining opinion propagation and survival.

    A study of deep learning and its applications to face recognition techniques

    The following work is the result of Fernando Suzacq's master's thesis. The thesis centered on research into 3D face recognition without depth reconstruction or the use of generic 3D models. This research resulted in the writing of a paper and its subsequent publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. Through the use of active illumination, 2D face recognition is improved and made more robust to low-light conditions and identity-spoofing attacks. The central idea of the work is the projection of a high-frequency light pattern onto the test face. From the capture of this image, it is possible to recover real 3D information, derived from the deformations of this pattern, together with a 2D image of the test face. This process avoids the difficult task of 3D reconstruction. The work presents the theory underlying this process, explains its construction, and provides the results of various experiments that support its validity and usefulness. The development of this research required studying the existing theory and reviewing the state of the art on this particular problem; part of that work is also presented in this document as the theoretical background for the publication.

    Inference of natural language predicates in the open domain

    Inference of predicates in natural language is a common task for humans in everyday scenarios, and thus for natural language processing by machines, such as in question answering. The question Did Arsenal beat Man United? can be affirmed by a text Arsenal obliterated Man United on Saturday if an inference is drawn that the text predicate obliterate entails beat in the question. In a world of vast and varied text resources, automatic language inference is necessary for bridging this gap between records and queries. A promising model of such inference between predicates is an Entailment Graph (EG), a structure of meaning postulates such as x obliterates y entails x defeats y. EGs are constructed using unsupervised distributional methods over a large corpus, learning representations of natural language predicates contained within. Entailment is directional, and correctly, EGs fail to confirm the opposite, that x defeats y entails x obliterates y; these distinctions are important for language understanding applications. In an EG, postulates are typically defined for a predicate argument pair (x, y) over a fixed vocabulary of such binary valence predicates, which relate two arguments. However, EG meaning postulates are limited in terms of their predicates in two ways. First, using the conventional approach, entailments may only be learned for predicates of the same valence, typically binary to binary entailment, ignoring entailments between valencies and their applications. For example, the binary relation Arsenal defeats Man United leads to an inference in humans that Arsenal is the winner, a unary relation applying to the subject Arsenal. Yet using conventional means, it is not possible to learn these in EGs. Second, only a limited vocabulary of predicates may be learned in training. This is because of the natural Zipfian frequency distribution of predicates in text corpora, which includes an unbounded long tail of rarely-mentioned predicates like obliterate. 
This distribution simultaneously makes it impractical to learn entailments for every predicate in a language by reading corpora, and also very likely that many of these unlearned predicates may be involved in real queries. This thesis explores inference in the open domain of natural language predicates beyond a fixed vocabulary of binary predicates. First, Entailment Graph valency is addressed. The distributional learning method is refined to enable learning entailments between predicates of different valencies. This improves recall in question answering by leveraging all available predicates in the reference text to answer questions. Second, the problem of overall predicate sparsity in EGs is explored, in which Language Model encoding is applied unsupervised with an EG. This provides a means of approximating missing premise predicates at test time, which improves both recall and precision. However, while approximating missing hypothesis predicates is shown to be possible in principle, it remains a challenge. Finally, a behavioral study is presented on Large Language Models (containing one billion parameters or more) which investigates their ability to perform language inference involving fully open-domain premise and hypothesis predicates. While superficially performant, this class of model is found to merely approximate language inference, utilizing unsound methods to mimic reasoning, including memorized training data and proxies learned from corpus distributions, which have no direct relationship with meaning.
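The directional nature of Entailment Graph inference can be sketched as a reachability check over entailment edges. The graph below is a toy hand-built example, not a learned EG:

```python
# Toy entailment graph: an edge means "premise entails hypothesis",
# e.g. "x obliterates y" entails "x defeats y" but not vice versa.
EG = {
    "x obliterates y": {"x defeats y"},
    "x defeats y": {"x beats y"},
}

def entails(premise, hypothesis, graph):
    """Directional entailment check via depth-first search over the
    transitive closure of the graph's entailment edges."""
    stack, seen = [premise], set()
    while stack:
        p = stack.pop()
        if p == hypothesis:
            return True
        if p in seen:
            continue
        seen.add(p)
        stack.extend(graph.get(p, ()))
    return False

print(entails("x obliterates y", "x beats y", EG))  # True
print(entails("x beats y", "x obliterates y", EG))  # False
```

The asymmetry of the second query is the point: a sound EG confirms entailment in only one direction, which paraphrase-based similarity models cannot do.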

    Synthesizing and Editing Photo-realistic Visual Objects

    In this thesis we investigate novel methods of synthesizing new images of a deformable visual object using a collection of images of the object. We investigate both parametric and non-parametric methods, as well as a combination of the two, for the problem of image synthesis. Our main focus is complex visual objects, specifically deformable objects and objects with varying numbers of visible parts. We first introduce a sketch-driven image synthesis system, which allows the user to draw ellipses and outlines to sketch the rough shape of an animal as a constraint on the synthesized image. This system interactively provides feedback in the form of ellipse and contour suggestions for the user's partial sketch. The user's sketch guides the non-parametric synthesis algorithm, which blends patches from two exemplar images in a coarse-to-fine fashion to create the final image. We evaluate the method and the synthesized images through two user studies. Instead of non-parametric blending of patches, a parametric model of appearance is more desirable, as its appearance representation is shared between all images of the dataset. Hence, we propose Context-Conditioned Component Analysis (C-CCA), a probabilistic generative parametric model that describes images as a linear combination of basis functions. The basis functions are evaluated for each pixel using a context vector computed from the local shape information. We evaluate C-CCA qualitatively and quantitatively on inpainting, appearance transfer, and reconstruction tasks. Drawing samples from C-CCA generates novel, globally coherent images, which, unfortunately, lack high-frequency details due to dimensionality reduction and misalignment. We therefore develop a non-parametric model that enhances the samples of C-CCA with locally coherent, high-frequency details. The non-parametric model efficiently finds patches from the dataset that match the C-CCA sample and blends the patches together. We analyze the results of the combined method on the datasets of horse and elephant images.

    Development Of A High Performance Mosaicing And Super-Resolution Algorithm

    In this dissertation, a high-performance mosaicing and super-resolution algorithm is described. The scale-invariant feature transform (SIFT)-based mosaicing algorithm builds an initial mosaic, which is iteratively updated by the robust super-resolution algorithm to produce the final high-resolution mosaic. Two different types of datasets are used for testing: high-altitude balloon data and unmanned aerial vehicle data. To evaluate the algorithm, five performance metrics are employed: mean square error, peak signal-to-noise ratio, singular value decomposition, slope of the reciprocal singular value curve, and cumulative probability of blur detection. Extensive testing shows that the proposed algorithm is effective in improving the captured aerial data and that the performance metrics accurately quantify the algorithm's performance.
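The first two metrics, mean square error and peak signal-to-noise ratio, have standard definitions and can be sketched as follows; the synthetic test image here is illustrative, not the dissertation's aerial data:

```python
import numpy as np

def mse(ref, test):
    """Mean square error between two images of equal shape."""
    diff = ref.astype(np.float64) - test.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    m = mse(ref, test)
    return float("inf") if m == 0 else 10.0 * np.log10(peak ** 2 / m)

rng = np.random.default_rng(1)
ref = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)
# Add bounded uniform noise to simulate a degraded reconstruction.
noisy = np.clip(ref.astype(int) + rng.integers(-10, 11, size=ref.shape),
                0, 255).astype(np.uint8)
print(round(mse(ref, noisy), 1), round(psnr(ref, noisy), 1))
```

Higher PSNR (lower MSE) indicates a reconstruction closer to the reference, which is why both are natural choices for judging super-resolved mosaics against ground truth.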