
    On the 3D point cloud for human-pose estimation

    This thesis aims at investigating methodologies for estimating a human pose from a 3D point cloud that is captured by a static depth sensor. Human-pose estimation (HPE) is important for a range of applications, such as human-robot interaction, healthcare, surveillance, and so forth. Yet, HPE is challenging because of the uncertainty in sensor measurements and the complexity of human poses. In this research, we focus on addressing challenges related to two crucial components in the estimation process, namely, human-pose feature extraction and human-pose modeling. In feature extraction, the main challenge involves reducing feature ambiguity. We propose a 3D-point-cloud feature called viewpoint and shape feature histogram (VISH) to reduce feature ambiguity by capturing geometric properties of the 3D point cloud of a human. The feature extraction consists of three steps: 3D-point-cloud pre-processing, hierarchical structuring, and feature extraction. In the pre-processing step, 3D points corresponding to a human are extracted and outliers from the environment are removed to retain the 3D points of interest. This step is important because it allows us to reduce the number of 3D points by keeping only those points that correspond to the human body for further processing. In the hierarchical structuring, the pre-processed 3D point cloud is partitioned and replicated into a tree structure as nodes. Viewpoint feature histogram (VFH) and shape features are extracted from each node in the tree to provide a descriptor to represent each node. As the features are obtained based on histograms, coarse-level details are highlighted in large regions and fine-level details are highlighted in small regions. Therefore, the features from the point cloud in the tree can capture coarse level to fine level information to reduce feature ambiguity. 
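The coarse-to-fine idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the VISH or VFH descriptor itself: the node feature here is a simple histogram of point-to-centroid distances, the tree is a binary split along the widest axis, and the names `node_histogram` and `hierarchical_descriptor` are hypothetical.

```python
import numpy as np

def node_histogram(points, bins=8):
    """Normalised histogram of distances from each point to the node centroid.

    Stand-in for a real per-node descriptor (assumption, not VFH).
    """
    d = np.linalg.norm(points - points.mean(axis=0), axis=1)
    hist, _ = np.histogram(d, bins=bins, range=(0.0, d.max() + 1e-9))
    return hist / hist.sum()

def hierarchical_descriptor(points, depth=2, bins=8):
    """Concatenate the histogram of a node with those of its sub-nodes.

    Large nodes summarise coarse shape; small nodes summarise fine detail.
    """
    points = np.asarray(points, dtype=float)
    parts = [node_histogram(points, bins)]
    if depth > 0 and len(points) >= 2:
        # Split at the median along the axis with the largest extent.
        extent = points.max(axis=0) - points.min(axis=0)
        axis = int(np.argmax(extent))
        order = np.argsort(points[:, axis])
        half = len(points) // 2
        for child in (points[order[:half]], points[order[half:]]):
            parts.append(hierarchical_descriptor(child, depth - 1, bins))
    return np.concatenate(parts)

# Usage: a random cloud of 100 points gives 7 nodes x 8 bins.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(100, 3))
descriptor = hierarchical_descriptor(cloud, depth=2, bins=8)
```

With `depth=2` the tree has one root, two children, and four grandchildren, so the descriptor concatenates seven histograms.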
In human-pose modeling, the main challenges involve reducing the dimensionality of human-pose space and designing appropriate factors that represent the underlying probability distributions for estimating human poses. To reduce the dimensionality, we propose a non-parametric action-mixture model (AMM). It represents high-dimensional human-pose space using low-dimensional manifolds in searching human poses. In each manifold, a probability distribution is estimated based on feature similarity. The distributions in the manifolds are then redistributed according to the stationary distribution of a Markov chain that models the frequency of human actions. After the redistribution, the manifolds are combined according to a probability distribution determined by action classification. Experiments were conducted using VISH features as input to the AMM. The results showed that the overall error and standard deviation of the AMM were reduced by about 7.9% and 7.1%, respectively, compared with a model without action classification. To design appropriate factors, we consider the AMM as a Bayesian network and propose a mapping that converts the Bayesian network to a neural network called NN-AMM. The proposed mapping consists of two steps: structure identification and parameter learning. In structure identification, we have developed a bottom-up approach to build a neural network while preserving the Bayesian-network structure. In parameter learning, we have created a part-based approach to learn synaptic weights by decomposing a neural network into parts. Based on the concept of distributed representation, the NN-AMM is further modified into a scalable neural network called NND-AMM. A neural-network-based system is then built by using VISH features to represent 3D-point-cloud input and the NND-AMM to estimate 3D human poses. The results showed that the proposed mapping can be utilized to design AMM factors automatically. 
The NND-AMM can provide more accurate human-pose estimates with fewer hidden neurons than both the AMM and NN-AMM can. Both the NN-AMM and NND-AMM can adapt to different types of input, showing the advantage of using neural networks to design factors.
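As a rough illustration of the redistribution step mentioned in this abstract, the sketch below computes the stationary distribution of a small action-transition Markov chain and uses it to reweight per-action scores. The transition matrix, the three actions, the example scores, and the function names are all hypothetical; the abstract does not specify the actual model, so this shows only the general idea.

```python
import numpy as np

def stationary_distribution(P, iters=1000, tol=1e-12):
    """Power iteration for pi with pi = pi @ P, where P is row-stochastic."""
    P = np.asarray(P, dtype=float)
    pi = np.full(P.shape[0], 1.0 / P.shape[0])  # start from uniform
    for _ in range(iters):
        nxt = pi @ P
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi

def reweight(scores, pi):
    """Scale per-action scores by long-run action frequency and renormalise."""
    w = np.asarray(scores, dtype=float) * pi
    return w / w.sum()

# Hypothetical transition probabilities between three actions.
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.3, 0.3, 0.4]])
pi = stationary_distribution(P)
weights = reweight([0.5, 0.3, 0.2], pi)
```

Power iteration is used here because it is short and converges for an irreducible, aperiodic chain such as this one; an eigenvector solver would serve equally well.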

    Going Deeper into Action Recognition: A Survey

    Understanding human actions in visual data is tied to advances in complementary research areas including object recognition, human dynamics, domain adaptation and semantic segmentation. Over the last decade, human action analysis has evolved from earlier schemes that were often limited to controlled environments to today's advanced solutions that can learn from millions of videos and apply to almost all daily activities. Given the broad range of applications, from video surveillance to human-computer interaction, scientific milestones in action recognition are achieved ever more rapidly, so that what used to be good quickly becomes obsolete. This motivated us to provide a comprehensive review of the notable steps taken towards recognizing human actions. To this end, we start our discussion with the pioneering methods that use handcrafted representations, and then navigate into the realm of deep-learning-based approaches. We aim to remain objective throughout this survey, touching upon encouraging improvements as well as inevitable setbacks, in the hope of raising fresh questions and motivating new research directions for the reader.

    Deep Learning for Free-Hand Sketch: A Survey

    Free-hand sketches are highly illustrative, and have been widely used by humans to depict objects or stories from ancient times to the present. The recent prevalence of touchscreen devices has made sketch creation a much easier task than ever and consequently made sketch-oriented applications increasingly popular. The progress of deep learning has immensely benefited free-hand sketch research and applications. This paper presents a comprehensive survey of the deep learning techniques oriented at free-hand sketch data, and the applications that they enable. The main contents of this survey include: (i) A discussion of the intrinsic traits and unique challenges of free-hand sketch, to highlight the essential differences between sketch data and other data modalities, e.g., natural photos. (ii) A review of the developments of free-hand sketch research in the deep learning era, by surveying existing datasets, research topics, and the state-of-the-art methods through a detailed taxonomy and experimental evaluation. (iii) Promotion of future work via a discussion of bottlenecks, open problems, and potential research directions for the community. Comment: This paper is accepted by IEEE TPAM

    Irish Machine Vision and Image Processing Conference Proceedings 2017


    Computer vision beyond the visible : image understanding through language

    In the past decade, deep neural networks have revolutionized computer vision. High-performing deep neural architectures trained for visual recognition tasks have pushed the field towards methods relying on learned image representations instead of hand-crafted ones, in pursuit of end-to-end learning methods to solve challenging tasks, ranging from long-standing ones such as image classification to newly emerging tasks like image captioning. As this thesis is framed in the context of the rapid evolution of computer vision, we present contributions that are aligned with three major changes in paradigm that the field has recently experienced, namely 1) the power of re-utilizing deep features from pre-trained neural networks for different tasks, 2) the advantage of formulating problems with end-to-end solutions given enough training data, and 3) the growing interest in describing visual data with natural language rather than pre-defined categorical label spaces, which can in turn enable visual understanding beyond scene recognition. The first part of the thesis is dedicated to the problem of visual instance search, where we particularly focus on obtaining meaningful and discriminative image representations which allow efficient and effective retrieval of similar images given a visual query. Contributions in this part of the thesis involve the construction of sparse Bag-of-Words image representations from convolutional features from a pre-trained image classification neural network, and an analysis of the advantages of fine-tuning a pre-trained object detection network using query images as training data. The second part of the thesis presents contributions to the problem of image-to-set prediction, understood as the task of predicting a variable-sized collection of unordered elements for an input image.
We conduct a thorough analysis of current methods for multi-label image classification, which are able to solve the task in an end-to-end manner by simultaneously estimating both the label distribution and the set cardinality. Further, we extend the analysis of set prediction methods to semantic instance segmentation, and present an end-to-end recurrent model that is able to predict sets of objects (binary masks and categorical labels) in a sequential manner. Finally, the third part of the dissertation takes insights learned in the previous two parts in order to present deep learning solutions to connect images with natural language in the context of cooking recipes and food images. First, we propose a retrieval-based solution in which the written recipe and the image are encoded into compact representations that allow the retrieval of one given the other. Second, as an alternative to the retrieval approach, we propose a generative model to predict recipes directly from food images, which first predicts ingredients as sets and subsequently generates the rest of the recipe one word at a time by conditioning both on the image and the predicted ingredients.

    Parametric active learning techniques for 3D hand pose estimation

    Active learning (AL) has recently gained popularity for deep learning (DL) models because it samples data efficiently and informatively, especially when the models require large-scale datasets. DL models designed for 3D hand pose estimation (3D-HPE) demand accurate and diverse large-scale datasets, which are time-consuming and costly to build and require expert annotation. This thesis explores AL primarily for the 3D-HPE task, for the first time. To address the data cost, it presents an AL methodology customised for 3D-HPE learners. Because the learners are predominantly regression-based algorithms, a Bayesian approximation of a DL architecture is presented to model uncertainties. This approximation generates data-dependent and model-dependent uncertainties, which are then combined with CoreSet, an AL function based on data representativeness, for sampling. Despite being the first work of its kind, it selects informative samples and achieves minimal joint errors with less training data on three well-known depth datasets. The second AL algorithm improves the selection further, following a new trend of parametric samplers. Specifically, it works in a task-agnostic way, using a Graph Convolutional Network (GCN) to provide higher-order representations between labelled and unlabelled data. The newly selected unlabelled images are ranked based on uncertainty or GCN feature distribution. Another novel sampler extends this idea and tackles AL issues such as cold start and distribution shift by training in a self-supervised way with contrastive learning. It leverages the visual concepts from labelled and unlabelled images while attaining state-of-the-art results. The last part of the thesis brings together the prior AL insights and achievements in a unified parametric sampler for the multi-modal 3D-HPE task. This sampler trains multi-variational auto-encoders to align the modalities and provide a better representation for selection.
Several query functions are studied to open a new direction in deep AL sampling.
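The representativeness-based selection that this abstract attributes to CoreSet is commonly approximated with a greedy k-center rule: repeatedly pick the unlabelled point farthest from everything already selected. The sketch below assumes plain Euclidean distance on precomputed feature vectors and a hypothetical function name `k_center_greedy`; how the thesis combines this with uncertainty is not specified in the abstract and is not modelled here.

```python
import numpy as np

def k_center_greedy(features, labelled_idx, budget):
    """Greedy k-center selection over feature vectors.

    Returns indices of `budget` points, each chosen as the point farthest
    from its nearest labelled or already-picked point.
    """
    X = np.asarray(features, dtype=float)
    # Distance from every point to its nearest labelled point.
    min_dist = np.full(len(X), np.inf)
    for i in labelled_idx:
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[i], axis=1))
    picked = []
    for _ in range(budget):
        j = int(np.argmax(min_dist))  # least-covered point so far
        picked.append(j)
        # The new pick now covers its neighbourhood.
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[j], axis=1))
    return picked

# Usage: four 1-D points, with point 0 already labelled.
points = np.array([[0.0], [1.0], [2.0], [10.0]])
selected = k_center_greedy(points, labelled_idx=[0], budget=2)
```

In this toy case the outlier at 10.0 is picked first, then the point at 2.0, since each is the farthest from the current set of centres at the time it is chosen.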

    Vision-based Person Re-identification in a Queue
