24,358 research outputs found

Learning Human Pose Estimation Features with Convolutional Networks

Authors: Andriluka Mykhaylo; Bregler Christoph; Jain Arjun; Taylor Graham W.; Tompson Jonathan
Publication date: 01/01/2014

This paper introduces a new architecture for human pose estimation using a multi-layer convolutional network and a modified learning technique that learns low-level features and higher-level weak spatial models. Unconstrained human pose estimation is one of the hardest problems in computer vision, and our new architecture and learning schema show significant improvement over the current state-of-the-art results. The main contribution of this paper is showing, for the first time, that a specific variation of deep learning is able to outperform all existing traditional architectures on this task. The paper also discusses several lessons learned while researching alternatives, most notably that it is possible to learn strong low-level feature detectors on features that might cover just a few pixels in the image. Higher-level spatial models somewhat improve the overall result, but to a much lesser extent than expected. Many researchers previously argued that kinematic structure and top-down information are crucial for this domain, but with our purely bottom-up approach and weak spatial model, we could improve on other, more complicated architectures that currently produce the best results. This mirrors what many other researchers, such as those in speech recognition, object recognition, and other domains, have experienced.

arXiv.org e-Print Archive

CiteSeerX

MPG.PuRe

Cross-domain self-supervised complete geometric representation learning for real-scanned point cloud based pathological gait analysis

Authors: Gu X; Guo Y; Lo B; Yang G-Z
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 22/08/2021

Accurate lower-limb pose estimation is a prerequisite of skeleton-based pathological gait analysis. To achieve this goal in free-living environments for long-term monitoring, a single depth sensor has been proposed in research. However, the depth map acquired from a single viewpoint encodes only partial geometric information of the lower limbs and exhibits large variations across different viewpoints. Existing off-the-shelf three-dimensional (3D) pose tracking algorithms and public datasets for depth-based human pose estimation are mainly targeted at activity recognition applications. They are relatively insensitive to skeleton estimation accuracy, especially at the foot segments. Furthermore, acquiring ground truth skeleton data for detailed biomechanics analysis also requires considerable effort. To address these issues, we propose a novel cross-domain self-supervised complete geometric representation learning framework, with knowledge transfer from the unlabelled synthetic point clouds of full lower-limb surfaces. The proposed method can significantly reduce the number of ground truth skeletons (using only 1%) in the training phase, while ensuring accurate and precise pose estimation and capturing discriminative features across different pathological gait patterns compared to other methods.

Spiral - Imperial College Digital Repository

Learning-based depth and pose prediction for 3D scene reconstruction in endoscopy

Author: Rau Anita
Publication venue: UCL (University College London)
Publication date: 28/09/2022

Colorectal cancer is the third most common cancer worldwide. Early detection and treatment of pre-cancerous tissue during colonoscopy is critical to improving prognosis. However, navigating within the colon and inspecting the endoluminal tissue comprehensively are challenging, and success in both varies with the endoscopist's skill and experience. Computer-assisted interventions in colonoscopy show much promise in improving navigation and inspection. For instance, 3D reconstruction of the colon during colonoscopy could promote more thorough examinations and increase adenoma detection rates, which are associated with improved survival rates. Given the stakes, this thesis seeks to advance the state of research from feature-based traditional methods closer to a data-driven 3D reconstruction pipeline for colonoscopy. More specifically, this thesis explores different methods that improve subtasks of learning-based 3D reconstruction. The main tasks are depth prediction and camera pose estimation. As training data is unavailable, the author, together with her co-authors, proposes and publishes several synthetic datasets and promotes domain adaptation models to improve applicability to real data. We show, through extensive experiments, that our depth prediction methods produce more robust results than previous work. Our pose estimation network trained on our new synthetic data outperforms self-supervised methods on real sequences. Our box embeddings allow us to interpret the geometric relationship and scale difference between two images of the same surface without the need for feature matches, which are often unobtainable in surgical scenes. Together, the methods introduced in this thesis help work towards a complete, data-driven 3D reconstruction pipeline for endoscopy.

UCL Discovery

Lucid Data Dreaming for Video Object Segmentation

Authors: Benenson Rodrigo; Brox Thomas; Ilg Eddy; Khoreva Anna; Schiele Bernt
Publication date: 01/01/2019

Convolutional networks reach top quality in pixel-level video object segmentation but require a large amount of training data (1k~100k) to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using 20x~1000x less annotated data than competing methods. Our approach is suitable for both single and multiple object segmentation. Instead of using large training sets hoping to generalize across domains, we generate in-domain training data using the provided annotation on the first frame of each video to synthesize ("lucid dream") plausible future video frames. In-domain per-video training data allows us to train high-quality appearance- and motion-based models, as well as tune the post-processing stage. This approach allows us to reach competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the video object segmentation task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and how much general "objectness" knowledge are required for the video object segmentation task.
Comment: Accepted in International Journal of Computer Vision (IJCV)

arXiv.org e-Print Archive

MPG.PuRe
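
The idea in the abstract above, turning one annotated frame into extra in-domain training pairs, can be illustrated with a toy sketch. This is not the authors' pipeline: the function name `synthesize_pair`, the grayscale list-of-lists image format, the pure translation of the object, and the constant `fill` value standing in for background inpainting are all assumptions made for illustration.

```python
def synthesize_pair(frame, mask, dy, dx, fill=0):
    """Toy 'lucid dream' step (hypothetical helper, not from the paper).

    frame: 2D list of grayscale values; mask: 2D list of bools marking the object.
    Moves the masked object by (dy, dx) and returns (new_frame, new_mask).
    """
    h, w = len(frame), len(frame[0])
    # Erase the object from its old position; `fill` is a crude stand-in
    # for a proper background inpainting step.
    new_frame = [[fill if mask[y][x] else frame[y][x] for x in range(w)]
                 for y in range(h)]
    new_mask = [[False] * w for _ in range(h)]
    # Paste the object pixels at the shifted position, dropping any pixel
    # that would land outside the image.
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    new_frame[ny][nx] = frame[y][x]
                    new_mask[ny][nx] = True
    return new_frame, new_mask


# One annotated frame: a 2x2 object of value 9 on a background of 1s.
frame = [[1] * 6 for _ in range(6)]
mask = [[False] * 6 for _ in range(6)]
for y in (1, 2):
    for x in (1, 2):
        frame[y][x] = 9
        mask[y][x] = True

new_frame, new_mask = synthesize_pair(frame, mask, dy=2, dx=1, fill=1)
```

Calling the function repeatedly with different shifts would give several (frame, mask) pairs from the single annotation; a realistic version would also vary appearance and deform the object rather than only translate it.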

Human Pose Estimation with a Depth Camera (Ihmisten asennon tunnistus syvyyskameralla)

Author: Arasalo Ossi
Publication date: 16/12/2019

Human pose estimation has many applications, from activity analysis to autonomous cars. Modern advances in deep learning research have enabled real-time multi-person pose estimation in complex environments. In this master's thesis, a state-of-the-art deep learning architecture is adapted to work with images from depth sensors. The training dataset is generated using computer graphics instead of annotating thousands of images by hand. Results are promising; the trained neural network detects humans in multi-person environments even when occlusion is present. However, the difference between the real world and the synthetically generated data causes problems that have to be addressed.

Aaltodoc Publication Archive

Use of Synthetic Data for 3D Hand Pose Estimation (3D 손 포즈 인식을 위한 인조 데이터의 이용)

Author: Yang John
Publication venue: Seoul National University Graduate School (서울대학교 대학원)
Publication date: 01/08/2021

Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Intelligent Convergence Systems major), 2021.8. 양한열.
3D hand pose estimation (HPE) based on RGB images has been studied for a long time. Relevant methods have focused mainly on optimizing neural frameworks for graphically connected finger joints. RGB-based HPE models have not been easy to train because of the scarcity of RGB hand pose datasets; unlike human body poses, the finger joints that span hand postures are structured delicately and exquisitely. Such structure makes it difficult to annotate each joint accurately with unique 3D world coordinates, which is why many conventional methods rely on synthetic data samples to cover large variations of hand postures. A synthetic dataset comes with very precise ground-truth annotations and further allows control over the variety of data samples, so that a learning model can be trained over a large pose space. Most studies, however, have performed frame-by-frame estimation based on independent static images. Synthetic visual data can provide practically infinite diversity and rich labels, while avoiding ethical issues with privacy and bias. However, for many tasks, current models trained on synthetic data generalize poorly to real data. The task of 3D human hand pose estimation is a particularly interesting example of this synthetic-to-real problem, because learning-based approaches perform reasonably well given real training data, yet labeled 3D poses are extremely difficult to obtain in the wild, limiting scalability. In this dissertation, we attempt not only to consider the appearance of a hand but also to incorporate the temporal movement information of a hand in motion into the learning framework for better 3D hand pose estimation performance, which creates the need for a large-scale dataset of sequential RGB hand images.
We propose a novel method that generates a synthetic dataset that mimics natural human hand movements by re-engineering annotations of an extant static hand pose dataset into pose-flows. With the generated dataset, we train a newly proposed recurrent framework, exploiting visuo-temporal features from sequential images of synthetic hands in motion and emphasizing temporal smoothness of estimations with a temporal consistency constraint. Our novel training strategy of detaching the recurrent layer of the framework during domain finetuning from synthetic to real allows preservation of the visuo-temporal features learned from sequential synthetic hand images. Hand poses that are estimated sequentially consequently produce natural and smooth hand movements, which lead to more robust estimations. We show that utilizing temporal information for 3D hand pose estimation significantly enhances general pose estimation, outperforming state-of-the-art methods in experiments on hand pose estimation benchmarks. Since a fixed dataset provides a finite distribution of data samples, the generalization of a learned pose estimation network is limited in terms of pose, RGB and viewpoint spaces. We further propose to augment the data automatically such that the augmented pose sampling is performed in favor of the pose estimator's generalization performance. Such auto-augmentation of poses is performed within a learned feature space in order to avoid the computational burden of generating synthetic samples for every update iteration. The proposed effort can be considered as generating and utilizing synthetic samples for network training in the feature space. This improves training efficiency by requiring fewer real data samples, and enhances generalization across multiple dataset domains as well as estimation performance through efficient augmentation.
Research on recognizing and reconstructing human hand shape and pose from 2D images aims to detect the 3D position of each finger joint.
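A hand pose consists of the finger joints, that is, the anatomical elements that make up the human hand, from the wrist joint through the MCP, PIP, and DIP joints. Hand pose information can be used in many fields, and in hand gesture recognition research it serves as a very good input feature. Applying hand pose detection to real systems requires high accuracy, real-time performance, and a model light enough to run on a variety of devices, and training a neural network model that achieves this requires a large amount of data. However, the devices that measure hand pose are rather unstable, and images of hands wearing these devices look very different from bare skin, which makes them unsuitable for training. To address this problem, this dissertation reprocesses and augments synthetically generated data for training, with the aim of achieving better learning outcomes. Although synthetic hand images may resemble real skin color, their detailed texture differs considerably, so a model trained on synthetic data performs markedly worse on real hand data. To narrow the domain gap between the two kinds of data, we first had the model learn the structure of the hand by reprocessing hand motion; then, excluding the part that had learned this temporal motion structure, we trained only the remainder on real hand images, which proved highly effective. For this, we presented a method that mimics real human hand motion. Second, we aligned the data from the two domains in the network's feature space. In addition, rather than augmenting synthetic poses with specific data, we proposed a scheme that defines a probabilistic model from which poses the network has rarely seen are generated and sampled. This dissertation proposed methods that use synthetic data more effectively, generating it without the effort of collecting more hard-to-annotate real data, and that improve pose estimation performance more safely by exploiting local and temporal features. We also proposed an automatic data augmentation method that lets the network find the data it needs and learn from it on its own. Combining the proposed methods yields better hand pose estimation performance.
Contents: 1. Introduction; 2. Related Works; 3. Preliminaries: 3D Hand Mesh Model; 4. SeqHAND: RGB-sequence-based 3D Hand Pose and Shape Estimation; 5. Hand Pose Auto-Augment; 6. Conclusion; Abstract (Korean); Acknowledgements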

SNU Open Repository and Archive
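
The hand pose abstract above mentions a temporal consistency constraint that encourages smooth estimates across frames but does not give its form. One common way to express such a constraint is a penalty on joint displacement between consecutive frames; the sketch below assumes that form. The function name `temporal_smoothness` and the nested-list pose format are hypothetical and not taken from the dissertation.

```python
def temporal_smoothness(poses):
    """Mean squared joint displacement between consecutive frames.

    poses: list of frames; each frame is a list of (x, y, z) joint tuples.
    This is an assumed form of a temporal consistency penalty, not the
    dissertation's actual loss.
    """
    if len(poses) < 2:
        return 0.0
    total = 0.0
    count = 0
    # Compare every frame with the one that follows it.
    for prev, curr in zip(poses, poses[1:]):
        for p, c in zip(prev, curr):
            total += sum((a - b) ** 2 for a, b in zip(p, c))
            count += 1
    return total / count


# A single joint moving one unit along x per frame.
moving = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
# The same joint held still for three frames.
still = [[(0.5, 0.5, 0.5)]] * 3

penalty_moving = temporal_smoothness(moving)
penalty_still = temporal_smoothness(still)
```

In training, a term like this would typically be added to the per-frame pose loss with a small weight, so that jittery sequences are penalized without forcing the hand to stay still.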