24,358 research outputs found
Learning Human Pose Estimation Features with Convolutional Networks
This paper introduces a new architecture for human pose estimation using a
multi-layer convolutional network architecture and a modified learning
technique that learns low-level features and higher-level weak spatial models.
Unconstrained human pose estimation is one of the hardest problems in computer
vision, and our new architecture and learning schema show significant
improvement over the current state-of-the-art results. The main contribution of
this paper is showing, for the first time, that a specific variation of deep
learning is able to outperform all existing traditional architectures on this
task. The paper also discusses several lessons learned while researching
alternatives, most notably, that it is possible to learn strong low-level
feature detectors on features that might even just cover a few pixels in the
image. Higher-level spatial models improve the overall result somewhat, but to
a much lesser extent than expected. Many researchers previously argued that the
kinematic structure and top-down information are crucial for this domain, but
with our purely bottom-up, weak spatial model, we could improve on other more
complicated architectures that currently produce the best results. This mirrors
what many other researchers, such as those in speech recognition, object
recognition, and other domains, have experienced.
Cross-domain self-supervised complete geometric representation learning for real-scanned point cloud based pathological gait analysis
Accurate lower-limb pose estimation is a prerequisite of skeleton based pathological gait analysis. To achieve this goal in free-living environments for long-term monitoring, single depth sensor has been proposed in research. However, the depth map acquired from a single viewpoint encodes only partial geometric information of the lower limbs and exhibits large variations across different viewpoints. Existing off-the-shelf three-dimensional (3D) pose tracking algorithms and public datasets for depth based human pose estimation are mainly targeted at activity recognition applications. They are relatively insensitive to skeleton estimation accuracy, especially at the foot segments. Furthermore, acquiring ground truth skeleton data for detailed biomechanics analysis also requires considerable efforts. To address these issues, we propose a novel cross-domain self-supervised complete geometric representation learning framework, with knowledge transfer from the unlabelled synthetic point clouds of full lower-limb surfaces. The proposed method can significantly reduce the number of ground truth skeletons (with only 1%) in the training phase, meanwhile ensuring accurate and precise pose estimation and capturing discriminative features across different pathological gait patterns compared to other methods.
Learning-based depth and pose prediction for 3D scene reconstruction in endoscopy
Colorectal cancer is the third most common cancer worldwide. Early detection and treatment of pre-cancerous tissue during colonoscopy is critical to improving prognosis. However, navigating within the colon and inspecting the endoluminal tissue comprehensively are challenging, and success in both varies based on the endoscopist's skill and experience. Computer-assisted interventions in colonoscopy show much promise in improving navigation and inspection. For instance, 3D reconstruction of the colon during colonoscopy could promote more thorough examinations and increase adenoma detection rates which are associated with improved survival rates. Given the stakes, this thesis seeks to advance the state of research from feature-based traditional methods closer to a data-driven 3D reconstruction pipeline for colonoscopy.
More specifically, this thesis explores different methods that improve subtasks of learning-based 3D reconstruction. The main tasks are depth prediction and camera pose estimation. As training data is unavailable, the author, together with her co-authors, proposes and publishes several synthetic datasets and promotes domain adaptation models to improve applicability to real data. We show, through extensive experiments, that our depth prediction methods produce more robust results than previous work. Our pose estimation network trained on our new synthetic data outperforms self-supervised methods on real sequences. Our box embeddings allow us to interpret the geometric relationship and scale difference between two images of the same surface without the need for feature matches that are often unobtainable in surgical scenes. Together, the methods introduced in this thesis help work towards a complete, data-driven 3D reconstruction pipeline for endoscopy.
Lucid Data Dreaming for Video Object Segmentation
Convolutional networks reach top quality in pixel-level video object
segmentation but require a large amount of training data (1k~100k) to deliver
such results. We propose a new training strategy which achieves
state-of-the-art results across three evaluation datasets while using 20x~1000x
less annotated data than competing methods. Our approach is suitable for both
single and multiple object segmentation. Instead of using large training sets
hoping to generalize across domains, we generate in-domain training data using
the provided annotation on the first frame of each video to synthesize ("lucid
dream") plausible future video frames. In-domain per-video training data allows
us to train high quality appearance- and motion-based models, as well as tune
the post-processing stage. This approach allows us to reach competitive results
even when training from only a single annotated frame, without ImageNet
pre-training. Our results indicate that using a larger training set is not
automatically better, and that for the video object segmentation task a smaller
training set that is closer to the target domain is more effective. This
changes the mindset regarding how many training samples and general
"objectness" knowledge are required for the video object segmentation task.
Comment: Accepted in International Journal of Computer Vision (IJCV)
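The idea of generating in-domain training data from the single annotated first frame can be illustrated with a very small sketch. This is not the authors' pipeline: the function name `synthesize_frame`, the pure translation of the object, and the mean-value fill used in place of real background inpainting are all assumptions made for illustration. The point it shows is that each synthesized frame comes with its segmentation mask for free.

```python
import random

def synthesize_frame(image, mask, max_shift=2, rng=None):
    """Hypothetical helper: move the annotated object by a random offset.

    image: 2D list of grey values; mask: 2D list of 0/1 (1 = object).
    Returns (new_image, new_mask). The vacated object region is filled
    with the mean background value, a crude stand-in for inpainting.
    """
    rng = rng or random.Random()
    h, w = len(image), len(image[0])
    dy = rng.randint(-max_shift, max_shift)
    dx = rng.randint(-max_shift, max_shift)

    # Estimate a fill value from the background pixels.
    bg = [image[y][x] for y in range(h) for x in range(w) if not mask[y][x]]
    fill = sum(bg) // len(bg) if bg else 0

    # Start from the background with the object removed.
    new_image = [[fill if mask[y][x] else image[y][x] for x in range(w)]
                 for y in range(h)]
    new_mask = [[0] * w for _ in range(h)]

    # Paste the object at its shifted position; pixels leaving the frame are dropped.
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    new_image[ny][nx] = image[y][x]
                    new_mask[ny][nx] = 1
    return new_image, new_mask
```

Calling the function repeatedly with different random offsets yields many (image, mask) pairs from one annotation; a fuller version would also vary illumination, deform the object, and move the background.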
Human Pose Estimation with a Depth Camera (Ihmisten asennon tunnistus syvyyskameralla)
Human pose estimation has many applications, from activity analysis to autonomous cars. Modern advances in deep learning research have enabled real-time multi-person pose estimation in complex environments. In this thesis, a state-of-the-art deep learning architecture is adapted to work with depth sensors. A dataset is generated using computer graphics instead of annotating thousands of images by hand. The results are promising: the trained neural network detects humans in multi-person environments even when occlusion is present. However, there are challenges arising from the difference between the real world and the synthetic data, which have to be addressed.
Using Synthetic Data for 3D Hand Pose Recognition
Doctoral dissertation -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Intelligent Convergence Systems major), 2021.8. ์ํ์ด.
3D hand pose estimation (HPE) based on RGB images has been studied for a long time. Relevant methods have focused mainly on optimizing neural frameworks for graphically connected finger joints. Training RGB-based HPE models has not been easy because of the scarcity of RGB hand pose datasets; unlike human body pose datasets, the finger joints that span hand postures are structured delicately and exquisitely. Such structure makes accurately annotating each joint with unique 3D world coordinates difficult, which is why many conventional methods rely on synthetic data samples to cover large variations of hand postures.
Synthetic datasets come with very precise ground-truth annotations and allow control over the variety of data samples, so that a learning model can be trained over a large pose space. Most studies, however, have performed frame-by-frame estimation based on independent static images. Synthetic visual data can provide practically infinite diversity and rich labels, while avoiding ethical issues with privacy and bias. However, for many tasks, current models trained on synthetic data generalize poorly to real data. The task of 3D human hand pose estimation is a particularly interesting example of this synthetic-to-real problem, because learning-based approaches perform reasonably well given real training data, yet labeled 3D poses are extremely difficult to obtain in the wild, limiting scalability.
In this dissertation, we attempt to not only consider the appearance of a hand but incorporate the temporal movement information of a hand in motion into the learning framework for better 3D hand pose estimation performance, which leads to the necessity of a large scale dataset with sequential RGB hand images.
We propose a novel method that generates a synthetic dataset that mimics natural human hand movements by re-engineering annotations of an extant static hand pose dataset into pose-flows. With the generated dataset, we train a newly proposed recurrent framework, exploiting visuo-temporal features from sequential images of synthetic hands in motion and emphasizing temporal smoothness of estimations with a temporal consistency constraint. Our novel training strategy of detaching the recurrent layer of the framework during domain finetuning from synthetic to real allows preservation of the visuo-temporal features learned from sequential synthetic hand images. Hand poses that are sequentially estimated consequently produce natural and smooth hand movements which lead to more robust estimations. We show that utilizing temporal information for 3D hand pose estimation significantly enhances general pose estimations by outperforming state-of-the-art methods in experiments on hand pose estimation benchmarks.
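The abstract does not give the form of the temporal consistency constraint. As a minimal sketch under that caveat, one common choice is to penalize the squared frame-to-frame displacement of each estimated joint; the function name `temporal_consistency_penalty` and the input layout are assumptions for illustration only.

```python
def temporal_consistency_penalty(poses):
    """Mean squared frame-to-frame displacement over all joint coordinates.

    poses: list of frames, each frame a list of (x, y, z) joint positions.
    Returns 0.0 for sequences with fewer than two frames.
    """
    if len(poses) < 2:
        return 0.0
    total = 0.0
    count = 0
    # Compare every frame with its successor, joint by joint.
    for prev, curr in zip(poses, poses[1:]):
        for (px, py, pz), (cx, cy, cz) in zip(prev, curr):
            total += (cx - px) ** 2 + (cy - py) ** 2 + (cz - pz) ** 2
            count += 3
    return total / count if count else 0.0


# A smoothly moving joint is penalized less than a jittering one.
smooth = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
jitter = [[(0.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]]
print(temporal_consistency_penalty(smooth) < temporal_consistency_penalty(jitter))
```

Added to a per-frame pose loss with a small weight, a term of this kind discourages jitter between consecutive estimates without forcing the hand to stay still.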
Since a fixed dataset provides a finite distribution of data samples, the generalization of a learned pose estimation network is limited in terms of pose, RGB and viewpoint spaces. We further propose to augment the data automatically such that the augmented pose sampling favors the generalization performance of the pose estimator being trained. Such auto-augmentation of poses is performed within the learned feature space in order to avoid the computational burden of generating a synthetic sample for every update iteration. The proposed effort can be considered as generating and utilizing synthetic samples for network training in the feature space. This improves training efficiency by requiring fewer real data samples, and yields enhanced generalization power over multiple dataset domains and better estimation performance through efficient augmentation.
Abstract (Korean): Research on recognizing and reconstructing human hand shape and pose from 2D images aims to detect the 3D position of each finger joint. A hand pose is made up of finger joints, that is, the physical elements that compose a human hand, from the wrist joint through the MCP, PIP and DIP joints. Hand pose information can be used in a variety of fields, and in hand gesture detection research it serves as a very good input feature.
Applying hand pose detection research to real systems requires high accuracy, real-time performance, and models light enough to run on a variety of devices, and training neural network models that make this possible requires a large amount of data. However, the devices that measure human hand poses are quite unstable, and images of hands wearing these devices differ greatly from the color of human skin, which makes them unsuitable for training. For this reason, this dissertation reprocesses and augments artificially generated data for use in training, with the aim of achieving better learning results.
Artificially generated hand images may resemble real skin color, but their detailed texture differs considerably, so a model trained on synthetic data performs markedly worse on real hand data. To reduce the gap between the two data domains, first, in order to teach the structure of the human hand, hand motion was re-engineered; after the movement structure had been learned, only the parts of the model other than the temporal information were trained on real hand image data, which proved highly effective. For this step, a methodology that mimics real human hand motion is presented.
Second, data from the two different domains were aligned in the network's feature space. In addition, rather than augmenting synthetic poses with specific data, a structure is proposed that defines a single probabilistic model from which poses the network has rarely seen are sampled.
This dissertation proposes methods that use synthetic data more effectively, generating synthetic data efficiently without the effort of collecting more real data that is hard to annotate, and that improve pose estimation performance more stably by exploiting local and temporal features. It also proposes an automatic data augmentation methodology that lets the network find the data it needs and learn from it. Combining the proposed methods further improves hand pose estimation performance.
1. Introduction
2. Related Works
3. Preliminaries: 3D Hand Mesh Model
4. SeqHAND: RGB-sequence-based 3D Hand Pose and Shape Estimation
5. Hand Pose Auto-Augment
6. Conclusion
Abstract (Korean)
Acknowledgements