665 research outputs found
2D Articulated Human Pose Estimation and Retrieval in (Almost) Unconstrained Still Images
We present a technique for estimating the spatial layout of humans in still images: the position of the head, torso and arms. The theme we explore is that once a person is localized using an upper body detector, the search for their body parts can be considerably simplified using weak constraints on position and appearance arising from that detection. Our approach is capable of estimating upper body pose in highly challenging uncontrolled images, without prior knowledge of background, clothing, lighting, or the location and scale of the person in the image. People are only required to be upright and seen from the front or the back (not side). We evaluate the stages of our approach experimentally using ground truth layout annotation on a variety of challenging material, such as images from the PASCAL VOC 2008 challenge and video frames from TV shows and feature films. We also propose and evaluate techniques for searching a video dataset for people in a specific pose. To this end, we develop three new pose descriptors and compare their classification and retrieval performance to two baselines built on state-of-the-art object detection model
Learning Human Pose Estimation Features with Convolutional Networks
This paper introduces a new architecture for human pose estimation using a
multi-layer convolutional network and a modified learning
technique that learns low-level features and higher-level weak spatial models.
Unconstrained human pose estimation is one of the hardest problems in computer
vision, and our new architecture and learning schema show significant
improvement over the current state-of-the-art results. The main contribution of
this paper is showing, for the first time, that a specific variation of deep
learning is able to outperform all existing traditional architectures on this
task. The paper also discusses several lessons learned while researching
alternatives, most notably that it is possible to learn strong low-level
feature detectors on features that might cover only a few pixels in the
image. Higher-level spatial models improve the overall result somewhat, but to
a much lesser extent than expected. Many researchers previously argued that
kinematic structure and top-down information are crucial for this domain, but
with our purely bottom-up approach and weak spatial model, we could improve on
other, more complicated architectures that currently produce the best results.
This mirrors what many other researchers, such as those in speech recognition,
object recognition, and other domains, have experienced.
Pareto Optimized Large Mask Approach for Efficient and Background Humanoid Shape Removal
The purpose of automated video object removal is not only to detect and remove the object of interest automatically, but also to use background context to inpaint the foreground area. Video inpainting requires filling spatiotemporal gaps in a video with convincing material, which demands both temporal and spatial consistency: the inpainted part must integrate seamlessly into the background in a variety of scenes, and it must maintain a consistent appearance in subsequent frames even if its surroundings change noticeably. We introduce a deep learning-based methodology for removing unwanted human-like shapes in videos. The method uses Pareto-optimized Generative Adversarial Networks (GANs), which is a novel contribution. The system automatically selects the Region of Interest (ROI) for each humanoid shape and uses a skeleton detection module to determine which humanoid shape to retain. The semantic masks of human-like shapes are created using a semantic-aware, occlusion-robust model that has four primary components: feature extraction and local, global, and semantic branches. The global branch encodes occlusion-aware information to make the extracted features resistant to occlusion, while the local branch retrieves fine-grained local characteristics. A modified large-mask inpainting approach is employed to remove a person from the image, leveraging fast Fourier convolutions and using polygonal chains and rectangles with unpredictable aspect ratios. The inpainter network takes the input image and the mask to create an output image excluding the background humanoid shapes. The generator uses an encoder-decoder structure with skip connections to recover spatial information, and dilated convolution and squeeze-and-excitation blocks to make the regions behind the humanoid shapes consistent with their surroundings.
The discriminator discourages dissimilar structure at the patch scale, and the refiner network captures features around the boundaries of each background humanoid shape. Performance was assessed using the Learned Perceptual Image Patch Similarity (LPIPS), Fréchet Inception Distance (FID), and Structural Similarity Index Measure (SSIM) metrics and showed promising results on the fully automated background person removal task. The method is evaluated on two video object segmentation datasets (DAVIS, with LPIPS of 0.02, FID of 5.01, and SSIM of 0.79, and YouTube-VOS, with 0.03, 6.22, and 0.78 respectively) as well as a database of 66 distinct video sequences of people behind a desk in an office environment (0.02, 4.01, and 0.78 respectively).
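For readers unfamiliar with the SSIM values reported in this abstract, the following is a minimal sketch of the standard SSIM formula evaluated over a single global window. It is not the evaluation code behind the reported numbers: `global_ssim` and `data_range` are illustrative names, and practical implementations usually average SSIM over local sliding windows rather than computing it once over the whole image.

```python
def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM for two equal-length sequences of pixel intensities.

    Assumption: the images are flattened to 1-D lists of grayscale values.
    """
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)

    # Means, variances and covariance of the two signals.
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n

    # Standard stabilizing constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    original = [10, 20, 30, 40]
    inverted = [40, 30, 20, 10]
    print(global_ssim(original, original))  # identical signals score (approximately) 1
    print(global_ssim(original, inverted))  # anti-correlated signals score lower
```

A score near 1 indicates that the two signals agree in luminance, contrast, and structure; lower scores indicate larger structural differences.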
Applications and Advances in Similarity-based Machine Learning
Similarity-based machine learning methods differ from traditional machine learning methods in that they also use pairwise similarity relations between objects to infer the labels of unlabeled objects. A recent comparative study for classification problems by Baumann et al. [2019] demonstrated that similarity-based techniques have superior performance and robustness when compared to well-established machine learning techniques. Similarity-based machine learning methods benefit from two advantages that could explain their superior performance: they can make use of the pairwise relations between unlabeled objects, and they are robust due to the transitive property of pairwise similarities. A challenge for similarity-based machine learning methods on large datasets is that the number of pairwise similarities grows quadratically in the size of the dataset. For large datasets, it thus becomes practically impossible to compute all possible pairwise similarities. In 2016, Hochbaum and Baumann proposed the technique of sparse computation to address this growth by computing only those pairwise similarities that are relevant. Their proposed implementation of sparse computation is still difficult to scale to millions of objects. This dissertation focuses on advancing the practical implementations of sparse computation to larger datasets and on two applications for which similarity-based machine learning was particularly effective. The applications that are studied here are cell identification in calcium-imaging movies and detecting aberrant linking behavior in directed networks. For sparse computation, we present faster, geometric algorithms and a technique, named sparse-reduced computation, that combines sparse computation with compression.
The geometric algorithms compute the exact same output as the original implementation of sparse computation, but identify the relevant pairwise similarities faster by using the concept of data shifting for identifying objects in the same or neighboring blocks. Empirical results on datasets with up to 10 million objects show a significant reduction in running time. Sparse-reduced computation combines sparse computation with a technique for compressing highly similar or identical objects, enabling the use of similarity-based machine learning on massively large datasets. The computational results demonstrate that sparse-reduced computation provides a significant reduction in running time with a minute loss in accuracy. A major problem facing neuroscientists today is cell identification in calcium-imaging movies. These movies are in-vivo recordings of thousands of neurons at cellular resolution. There is a great need for automated approaches to extract the activity of single neurons from these movies since manual post-processing takes tens of hours per dataset. We present the HNCcorr algorithm for cell identification in calcium-imaging movies. The name HNCcorr is derived from its use of the similarity-based Hochbaum's Normalized Cut (HNC) model with pairwise similarities derived from correlation. In HNCcorr, the task of cell detection is approached as a clustering problem. HNCcorr utilizes HNC to detect cells in these movies as coherent clusters of pixels that are highly distinct from the remaining pixels. HNCcorr guarantees, unlike existing methodologies for cell identification, a globally optimal solution to the underlying optimization problem. Of independent interest is a novel method, named similarity-squared, that we devised for measuring similarity between pixels.
We provide an experimental study and demonstrate that HNCcorr is a top performer on the Neurofinder cell identification benchmark and that it improves over algorithms based on matrix factorization. The second application is detecting aberrant agents, such as fake news sources or spam websites, based on their link behavior in networks. Across contexts, a distinguishing characteristic between normal and aberrant agents is that normal agents rarely link to aberrant ones. We refer to this phenomenon as aberrant linking behavior. We present a Markov Random Field (MRF) formulation, with links as the pairwise similarities, that detects aberrant agents based on aberrant linking behavior and any prior information (if given). This MRF formulation is solved optimally and in polynomial time. We compare the optimal solution for the MRF formulation to well-known algorithms based on random walks. In our empirical experiment with twenty-three different datasets, the MRF method outperforms the other detection algorithms. This work represents the first use of optimization methods for detecting aberrant agents as well as the first time that an MRF is applied to directed graphs.
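The idea of computing only the pairwise similarities between objects in the same or neighboring blocks can be illustrated with a minimal sketch. It assumes the objects have already been mapped to a low-dimensional space with coordinates scaled to [0, 1]; `relevant_pairs` and `resolution` are hypothetical names, and the brute-force neighbor lookup shown here is not the faster data-shifting algorithm the abstract describes.

```python
from collections import defaultdict
from itertools import product


def relevant_pairs(points, resolution):
    """Return index pairs (i, j), i < j, whose points share a grid block
    or lie in adjacent grid blocks.

    points: list of equal-length tuples with coordinates in [0, 1].
    resolution: number of grid intervals per dimension (illustrative parameter).
    """
    if not points:
        return set()
    dim = len(points[0])

    # Assign each point to a grid block; clamp so that 1.0 falls in the last block.
    blocks = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(min(int(c * resolution), resolution - 1) for c in point)
        blocks[key].append(idx)

    # Pair every point with the points in its own block and in all adjacent blocks.
    pairs = set()
    offsets = list(product((-1, 0, 1), repeat=dim))
    for key, members in blocks.items():
        for offset in offsets:
            neighbor = tuple(k + o for k, o in zip(key, offset))
            for i in members:
                for j in blocks.get(neighbor, ()):
                    if i < j:
                        pairs.add((i, j))
    return pairs


if __name__ == "__main__":
    pts = [(0.05, 0.05), (0.15, 0.05), (0.95, 0.95), (0.12, 0.18)]
    sparse = relevant_pairs(pts, resolution=10)
    all_pairs = len(pts) * (len(pts) - 1) // 2  # quadratic growth: n(n-1)/2
    print(f"{len(sparse)} of {all_pairs} pairs selected: {sorted(sparse)}")
```

In this toy input, the point near (0.95, 0.95) has no neighbors in adjacent blocks, so no similarity involving it would be computed; only the three pairs among the clustered points remain out of six possible pairs.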
A new framework for sign language recognition based on 3D handshape identification and linguistic modeling
Current approaches to sign recognition by computer generally have at least some of the following limitations: they rely on laboratory
conditions for sign production, are limited to a small vocabulary, rely on 2D modeling (and therefore cannot deal with occlusions
and off-plane rotations), and/or achieve limited success. Here we propose a new framework that (1) provides a new tracking method
less dependent than others on laboratory conditions and able to deal with variations in background and skin regions (such as the
face, forearms, or other hands); (2) allows for identification of 3D hand configurations that are linguistically important in American
Sign Language (ASL); and (3) incorporates statistical information reflecting linguistic constraints in sign production. For purposes of
large-scale computer-based sign language recognition from video, the ability to distinguish hand configurations accurately is critical.
Our current method estimates the 3D hand configuration to distinguish among 77 hand configurations linguistically relevant for
ASL. Constraining the problem in this way makes recognition of 3D hand configuration more tractable and provides the information
specifically needed for sign recognition. Further improvements are obtained by incorporation of statistical information about linguistic
dependencies among handshapes within a sign derived from an annotated corpus of almost 10,000 sign tokens.