Face Detection with the Faster R-CNN
The Faster R-CNN has recently demonstrated impressive results on various
object detection benchmarks. By training a Faster R-CNN model on the
large-scale WIDER face dataset, we report state-of-the-art results on two widely used
face detection benchmarks, FDDB and the recently released IJB-A.
Understanding the Dynamic Visual World: From Motion to Semantics
We live in a dynamic world, which is continuously in motion. Perceiving and interpreting the dynamic surroundings is an essential capability for an intelligent agent. Human beings have the remarkable capability to learn from limited data, with partial or little annotation, in sharp contrast to computational perception models that rely on large-scale, manually labeled data. Reliance on strongly supervised models with manually labeled data inherently prohibits us from modeling the dynamic visual world, as manual annotations are tedious, expensive, and not scalable, especially if we would like to solve multiple scene understanding tasks at the same time. Even worse, in some cases, manual annotations are completely infeasible, such as the motion vector of each pixel (i.e., optical flow), since humans cannot reliably produce these types of labeling. In fact, as we move around in our dynamic world, motion, arising from the moving camera, independently moving objects, and scene geometry, carries abundant information, revealing the structure and complexity of the dynamic visual world. As the famous psychologist James J. Gibson suggested, "we must perceive in order to move, but we also must move in order to perceive". In this thesis, we investigate how to use the motion information contained in unlabeled or partially labeled videos to better understand and synthesize the dynamic visual world.
This thesis consists of three parts. In the first part, we focus on the "move to perceive" aspect. When moving through the world, it is natural for an intelligent agent to associate image patterns with the magnitude of their displacement over time: as the agent moves, far-away mountains don't move much; nearby trees move a lot. This natural relationship between the appearance of objects and their apparent motion is a rich source of information about the relationship between the distance of objects and their appearance in images. We present a pretext task of estimating the relative depth of elements of a scene (i.e., ordering the pixels in an image according to distance from the viewer), recovered from the motion field of unlabeled videos. The goal of this pretext task is to induce useful feature representations in deep Convolutional Neural Networks (CNNs). These representations, induced from 1.1 million video frames crawled from YouTube within one hour and without any manual labeling, provide a valuable starting point for training neural networks on downstream tasks. Since all of our training data comes almost for free, this approach is a promising route to matching or even surpassing what ImageNet pre-training, which requires a huge amount of manual labeling, gives us today on tasks such as semantic image segmentation.
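To make the pretext task concrete, the sketch below shows one plausible way to train a network on relative-depth orderings derived from optical-flow magnitude, assuming (as motivated above) that larger apparent motion generally indicates a closer scene point. The function name, the pair-sampling scheme, and the margin ranking loss are illustrative choices, not the thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

def relative_depth_ranking_loss(pred_depth, flow_mag, num_pairs=1024):
    """Pretext-task sketch: order pixels by flow magnitude (a proxy for relative depth).

    pred_depth: (B, 1, H, W) network prediction; larger values mean farther away.
    flow_mag:   (B, 1, H, W) optical-flow magnitude from unlabeled video; larger
                motion is assumed to indicate a closer scene point.
    """
    B, _, H, W = pred_depth.shape
    device = pred_depth.device

    # Sample random pixel pairs per image.
    idx1 = torch.randint(0, H * W, (B, num_pairs), device=device)
    idx2 = torch.randint(0, H * W, (B, num_pairs), device=device)

    d = pred_depth.flatten(1)  # (B, H*W)
    m = flow_mag.flatten(1)
    d1, d2 = d.gather(1, idx1), d.gather(1, idx2)
    m1, m2 = m.gather(1, idx1), m.gather(1, idx2)

    # Pseudo ordinal label: +1 if pixel 1 moves less (is assumed farther), else -1.
    target = torch.where(m1 < m2, torch.ones_like(m1), -torch.ones_like(m1))

    # Ranking loss encourages the predicted relative depths to respect the ordering.
    return F.margin_ranking_loss(d1, d2, target, margin=0.1)
```

Any convolutional backbone that outputs a single-channel map can be pre-trained with such a loss; its features can then be transferred to downstream tasks such as semantic segmentation.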
In the second part, we study the "perceive to move" aspect. As we humans look around, we do not solve a single vision task at a time. Instead, we perceive our surroundings in a holistic manner, using all visual cues jointly. By solving multiple tasks simultaneously, one task can inform another. Specifically, we propose a neural network architecture, called SENSE, which shares common feature representations among four closely related tasks: optical flow estimation, disparity estimation from stereo, occlusion detection, and semantic segmentation. The key insight is that sharing features makes the network more compact and induces better feature representations. For real-world data, however, annotations for all four tasks are rarely available at the same time. To this end, we design loss functions that exploit interactions among the different tasks and require no manual annotations, allowing partially labeled data to be handled in a semi-supervised manner and leading to superior understanding of the dynamic visual world.
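The following sketch illustrates the general multi-task layout described above: a single shared encoder feeding separate heads for the four tasks, with a total loss that only includes terms whose labels are present for a given sample. The architecture and helper names are simplified illustrations, not the actual SENSE network.

```python
import torch
import torch.nn as nn

class SharedMultiTaskNet(nn.Module):
    """Illustrative shared-encoder, multi-head layout (not the actual SENSE model)."""

    def __init__(self, feat_dim=64, num_classes=19):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.flow_head = nn.Conv2d(feat_dim, 2, 3, padding=1)            # optical flow (u, v)
        self.disp_head = nn.Conv2d(feat_dim, 1, 3, padding=1)            # stereo disparity
        self.occ_head = nn.Conv2d(feat_dim, 1, 3, padding=1)             # occlusion logits
        self.seg_head = nn.Conv2d(feat_dim, num_classes, 3, padding=1)   # semantic logits

    def forward(self, image):
        feat = self.encoder(image)  # one shared representation feeds every task head
        return {
            "flow": self.flow_head(feat),
            "disparity": self.disp_head(feat),
            "occlusion": self.occ_head(feat),
            "segmentation": self.seg_head(feat),
        }

def semi_supervised_loss(outputs, labels, criteria):
    """Sum only the task losses whose ground truth is available for this sample."""
    total = 0.0
    for task, pred in outputs.items():
        if labels.get(task) is not None:
            total = total + criteria[task](pred, labels[task])
    return total
```

Self-supervised terms (e.g., photometric consistency for flow) can be added in the same fashion for samples without labels, which is the kind of semi-supervised handling of partially labeled data the abstract alludes to.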
Understanding the motion contained in a video enables us to perceive the dynamic visual world in a novel manner. In the third part, we present an approach, called SuperSloMo, which synthesizes slow-motion videos from a standard frame-rate video. Converting a plain video into a slow-motion version lets us see memorable moments in our lives that are otherwise hard to see clearly with the naked eye: a difficult skateboard trick, a dog catching a ball, etc. Such a technique also has wide applications, such as generating smooth view transitions on head-mounted virtual reality (VR) devices, compressing videos, and synthesizing videos with motion blur.
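As a rough illustration of how such frame interpolation can work, the sketch below synthesizes an intermediate frame from two input frames and their bidirectional optical flows, using a linear intermediate-flow approximation and backward warping. It omits the visibility maps and flow-refinement network of the full SuperSloMo method, so it is a simplified sketch rather than the actual system; `backward_warp` and `interpolate_frame` are names chosen here for illustration.

```python
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """Warp img with a per-pixel flow field (in pixels) via bilinear sampling."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=img.device, dtype=img.dtype),
        torch.arange(W, device=img.device, dtype=img.dtype),
        indexing="ij",
    )
    grid_x = (xs + flow[:, 0]) / (W - 1) * 2 - 1  # normalize to [-1, 1] for grid_sample
    grid_y = (ys + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2), (x, y) order
    return F.grid_sample(img, grid, align_corners=True)

def interpolate_frame(img0, img1, flow_01, flow_10, t=0.5):
    """Synthesize a frame at time t in (0, 1) between img0 and img1.

    flow_01, flow_10: bidirectional optical flows between the two input frames.
    """
    # Linearly approximate the flows from time t back to each endpoint.
    flow_t0 = -(1 - t) * t * flow_01 + t * t * flow_10
    flow_t1 = (1 - t) * (1 - t) * flow_01 - t * (1 - t) * flow_10
    warped0 = backward_warp(img0, flow_t0)
    warped1 = backward_warp(img1, flow_t1)
    # Time-weighted blend of the two warped frames (no occlusion reasoning here).
    return (1 - t) * warped0 + t * warped1
```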
SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation
Human-centric video frame interpolation has great potential for improving
people's entertainment experiences and finding commercial applications in the
sports analysis industry, e.g., synthesizing slow-motion videos. Although there
are multiple benchmark datasets available in the community, none of them is
dedicated to human-centric scenarios. To bridge this gap, we introduce
SportsSloMo, a benchmark consisting of more than 130K video clips and 1M video
frames of high-resolution (720p) slow-motion sports videos crawled from
YouTube. We re-train several state-of-the-art methods on our benchmark, and
their accuracy drops compared to other datasets. This highlights the difficulty
of our benchmark: human bodies are highly deformable and occlusions are
frequent in sports videos, posing significant challenges even for the
best-performing methods. To improve accuracy, we introduce two loss terms that
incorporate human-aware priors, where we add auxiliary supervision to panoptic
segmentation and human keypoint detection, respectively. The loss terms are
model-agnostic and can be easily plugged into any video frame interpolation
approach. Experimental results validate the effectiveness of the proposed loss
terms, leading to consistent performance improvements over five existing models
and establishing strong baselines on our benchmark. The dataset and code can be
found at: https://neu-vi.github.io/SportsSlomo/.
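The sketch below shows one plausible, model-agnostic way to combine a standard interpolation loss with human-aware auxiliary terms, using frozen off-the-shelf segmentation and keypoint models as teachers on the ground-truth frame. This illustrates the general idea only and is not necessarily the exact loss formulation in SportsSloMo; `seg_model`, `kpt_model`, and the weights are assumptions.

```python
import torch
import torch.nn.functional as F

def human_aware_losses(pred_frame, gt_frame, seg_model, kpt_model,
                       w_seg=0.1, w_kpt=0.1):
    """Combine an interpolation loss with human-aware auxiliary consistency terms.

    seg_model / kpt_model: frozen networks returning dense prediction tensors
    (e.g., segmentation logits and keypoint heatmaps) for an input frame.
    """
    recon = F.l1_loss(pred_frame, gt_frame)  # standard frame-interpolation loss

    with torch.no_grad():  # teachers provide pseudo-targets from the real frame
        seg_target = seg_model(gt_frame)
        kpt_target = kpt_model(gt_frame)

    seg_pred = seg_model(pred_frame)  # gradients flow back into the interpolated frame
    kpt_pred = kpt_model(pred_frame)

    seg_loss = F.l1_loss(seg_pred, seg_target)  # human-region consistency
    kpt_loss = F.l1_loss(kpt_pred, kpt_target)  # keypoint-heatmap consistency

    return recon + w_seg * seg_loss + w_kpt * kpt_loss
```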
Direct Superpoints Matching for Fast and Robust Point Cloud Registration
Although deep neural networks endow the downsampled superpoints with
discriminative feature representations, directly matching them is rarely used
on its own in state-of-the-art methods, mainly for two reasons. First, the
correspondences are inevitably noisy, so RANSAC-like refinement is usually
adopted. Such ad hoc postprocessing, however, is slow and not differentiable,
so it cannot be jointly optimized with feature learning. Second, superpoints
are sparse, so more RANSAC iterations are needed. Existing approaches instead
use a coarse-to-fine strategy to propagate the superpoint correspondences to
the point level; these point-level correspondences are not discriminative
enough, however, and still necessitate postprocessing refinement. In this
paper, we present a simple yet effective approach that extracts correspondences
by directly matching superpoints using a global softmax layer in an end-to-end
manner; these correspondences are then used to determine the rigid
transformation between the source and target point clouds. Compared with
methods that directly predict corresponding points, leveraging the rich
information in the superpoint matches yields a more accurate estimate of the
transformation and effectively filters out outliers without any postprocessing
refinement. As a result, our approach is not only fast but also achieves
state-of-the-art results on the challenging ModelNet and 3DMatch benchmarks.
Our code and model weights will be publicly released.
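To make the pipeline concrete, the sketch below pairs a softmax over the superpoint feature-similarity matrix with a weighted Kabsch (SVD) fit of the rigid transform. It is a minimal, fully differentiable illustration of this kind of matching-plus-estimation step, not the paper's actual network; the confidence weighting and temperature are assumptions.

```python
import torch

def match_and_estimate(src_feat, tgt_feat, src_xyz, tgt_xyz, temperature=0.1):
    """Directly match superpoints with a softmax and fit a rigid transform.

    src_feat: (N, D), tgt_feat: (M, D) superpoint descriptors.
    src_xyz:  (N, 3), tgt_xyz:  (M, 3) superpoint coordinates.
    Returns rotation R (3, 3) and translation t (3,) mapping source to target.
    """
    # Pairwise feature similarity and soft correspondences (softmax over targets).
    sim = src_feat @ tgt_feat.T / temperature        # (N, M)
    weights = torch.softmax(sim, dim=1)

    # Soft target coordinates and a per-correspondence confidence.
    tgt_soft = weights @ tgt_xyz                     # (N, 3)
    conf = weights.max(dim=1).values
    conf = conf / conf.sum()

    # Weighted Kabsch: closed-form rotation/translation from weighted centroids.
    src_mean = (conf[:, None] * src_xyz).sum(0)
    tgt_mean = (conf[:, None] * tgt_soft).sum(0)
    src_c, tgt_c = src_xyz - src_mean, tgt_soft - tgt_mean
    H = (conf[:, None] * src_c).T @ tgt_c            # 3x3 cross-covariance
    U, _, Vt = torch.linalg.svd(H)
    d = torch.det(Vt.T @ U.T).sign()                 # guard against reflections
    D = torch.diag(torch.stack([torch.ones_like(d), torch.ones_like(d), d]))
    R = Vt.T @ D @ U.T
    t = tgt_mean - R @ src_mean
    return R, t
```

Because every step is differentiable, such a matcher can, in principle, be trained jointly with the feature extractor, which is the motivation the abstract gives for avoiding RANSAC-style refinement.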
Diagnosing Human-object Interaction Detectors
Although we have witnessed significant progress in human-object interaction
(HOI) detection with increasingly high mAP (mean Average Precision), a single
mAP score is too concise to provide an informative summary of a model's
performance or to explain why one approach is better than another. In this
paper, we introduce a diagnosis toolbox for analyzing the error sources of
existing HOI detection models. We first conduct a holistic investigation of the
HOI detection pipeline, which consists of human-object pair detection followed
by interaction classification. We define a set of errors and an oracle that
fixes each of them. By measuring the mAP improvement obtained from fixing an
error with its oracle, we obtain a detailed analysis of the significance of the
different errors. We then delve into human-object detection and interaction
classification separately and examine the model's behavior. For the detection
task, we investigate both recall and precision, measuring the coverage of
ground-truth human-object pairs as well as the noisiness of the detections. For
the classification task, we compute mAP for interaction classification only,
without considering the detection scores. We also measure how well the models
differentiate human-object pairs with and without actual interactions, using
the AP (Average Precision) score. Our toolbox is applicable to different
methods across different datasets and is available at
https://github.com/neu-vi/Diag-HOI.
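A minimal sketch of the oracle-based protocol described above is given below, assuming a hypothetical `eval_map` function that computes HOI detection mAP and per-error-type oracle fixes supplied as callables; the names are illustrative, not the toolbox's actual API.

```python
def oracle_gain_report(detections, ground_truth, error_oracles, eval_map):
    """For each error type, apply its oracle fix and report the resulting mAP gain.

    error_oracles: dict mapping an error name to a callable that returns a copy
    of the detections with only that error type corrected using ground truth.
    """
    base_map = eval_map(detections, ground_truth)
    gains = {}
    for error_name, fix in error_oracles.items():
        fixed = fix(detections, ground_truth)   # oracle corrects this error type only
        gains[error_name] = eval_map(fixed, ground_truth) - base_map
    # A larger gain means this error source costs the model more mAP.
    return dict(sorted(gains.items(), key=lambda kv: kv[1], reverse=True))
```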