Contextual Action Recognition with R*CNN
There are multiple cues in an image which reveal what action a person is
performing. For example, a jogger has a pose that is characteristic of
jogging, but the scene (e.g. road, trail) and the presence of other joggers can
be an additional source of information. In this work, we exploit the simple
observation that actions are accompanied by contextual cues to build a strong
action recognition system. We adapt R-CNN to use more than one region for
classification while still maintaining the ability to localize the action. We
call our system R*CNN. The action-specific models and the feature maps are
trained jointly, allowing for action specific representations to emerge. R*CNN
achieves 90.2% mean AP on the PASCAL VOC Action dataset, outperforming all other
approaches in the field by a significant margin. Last, we show that R*CNN is
not limited to action recognition. In particular, R*CNN can also be used to
tackle fine-grained tasks such as attribute classification. We validate this
claim by reporting state-of-the-art performance on the Berkeley Attributes of
People dataset.
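For intuition, here is a minimal sketch of the multi-region scoring idea, assuming pooled per-region CNN features as input; the module and layer names are illustrative, not the authors' released implementation. The action score combines the person (primary) region with the most informative context region:

```python
import torch
import torch.nn as nn

class RStarCNNHead(nn.Module):
    """Sketch of R*CNN-style scoring: an action score combines the
    person (primary) region with the best-scoring context region."""

    def __init__(self, feat_dim: int, num_actions: int):
        super().__init__()
        self.primary_fc = nn.Linear(feat_dim, num_actions)  # scores the person box
        self.context_fc = nn.Linear(feat_dim, num_actions)  # scores each context box

    def forward(self, primary_feat, context_feats):
        # primary_feat: (feat_dim,) pooled feature of the person region
        # context_feats: (num_regions, feat_dim) pooled features of candidate regions
        primary_score = self.primary_fc(primary_feat)        # (num_actions,)
        context_scores = self.context_fc(context_feats)      # (num_regions, num_actions)
        best_context, _ = context_scores.max(dim=0)          # max over regions, per action
        return primary_score + best_context                  # (num_actions,)

# Toy usage with random pooled features standing in for a CNN backbone.
head = RStarCNNHead(feat_dim=256, num_actions=10)
scores = head(torch.randn(256), torch.randn(20, 256))
probs = scores.softmax(dim=0)
```

Because the max is taken over candidate regions per action, the choice of which contextual cue matters (scene, other people, objects) is learned implicitly rather than annotated.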
Detect-and-Track: Efficient Pose Estimation in Videos
This paper addresses the problem of estimating and tracking human body
keypoints in complex, multi-person video. We propose an extremely lightweight
yet highly effective approach that builds upon the latest advancements in human
detection and video understanding. Our method operates in two stages: keypoint
estimation in frames or short clips, followed by lightweight tracking to
generate keypoint predictions linked over the entire video. For frame-level
pose estimation we experiment with Mask R-CNN, as well as our own proposed 3D
extension of this model, which leverages temporal information over small clips
to generate more robust frame predictions. We conduct extensive ablative
experiments on the newly released multi-person video pose estimation benchmark,
PoseTrack, to validate various design choices of our model. Our approach
achieves an accuracy of 55.2% on the validation and 51.8% on the test set using
the Multi-Object Tracking Accuracy (MOTA) metric, and achieves state-of-the-art
performance on the ICCV 2017 PoseTrack keypoint tracking challenge.
Comment: In CVPR 2018. Ranked first in ICCV 2017 PoseTrack challenge (keypoint
tracking in videos). Code: https://github.com/facebookresearch/DetectAndTrack
and webpage: https://rohitgirdhar.github.io/DetectAndTrack
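To illustrate the lightweight tracking stage, the sketch below links per-frame person detections across consecutive frames by Hungarian matching on box IoU; the box format, threshold, and function names are assumptions for illustration, not the released code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def link_frames(prev_boxes, curr_boxes, iou_thresh=0.3):
    """Match detections in consecutive frames via Hungarian assignment on IoU.
    Returns (prev_idx, curr_idx) pairs for matched person tracks."""
    if not prev_boxes or not curr_boxes:
        return []
    cost = np.array([[1.0 - box_iou(p, c) for c in curr_boxes] for p in prev_boxes])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_thresh]

# Toy example: two people tracked from frame t to frame t+1.
prev = [(10, 10, 50, 100), (200, 40, 260, 160)]
curr = [(205, 42, 262, 158), (12, 12, 52, 103)]
print(link_frames(prev, curr))  # [(0, 1), (1, 0)]
```

Applying the same matching frame by frame chains per-frame keypoint predictions into tracks that span the whole video.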
Actions and Attributes from Wholes and Parts
We investigate the importance of parts for the tasks of action and attribute
classification. We develop a part-based approach by leveraging convolutional
network features inspired by recent advances in computer vision. Our part
detectors are a deep version of poselets and capture parts of the human body
under a distinct set of poses. For the tasks of action and attribute
classification, we train holistic convolutional neural networks and show that
adding parts leads to top-performing results for both tasks. In addition, we
demonstrate the effectiveness of our approach when we replace an oracle person
detector, as is the default in the current evaluation protocol for both tasks,
with a state-of-the-art person detection system.
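A minimal sketch of the whole-plus-parts classification idea, assuming features have already been pooled from the person box and from detected part boxes; the feature dimensions, part count, and layer sizes are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class WholeAndPartsClassifier(nn.Module):
    """Sketch: concatenate a holistic person feature with per-part features
    (e.g. head, torso, legs) and classify actions or attributes from the joint vector."""

    def __init__(self, feat_dim=256, num_parts=3, num_labels=9):
        super().__init__()
        in_dim = feat_dim * (1 + num_parts)  # whole-person feature + each part feature
        self.classifier = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, num_labels),
        )

    def forward(self, whole_feat, part_feats):
        # whole_feat: (batch, feat_dim); part_feats: (batch, num_parts, feat_dim)
        x = torch.cat([whole_feat, part_feats.flatten(start_dim=1)], dim=1)
        return self.classifier(x)

# Toy usage with random features standing in for pooled CNN activations.
model = WholeAndPartsClassifier()
logits = model(torch.randn(4, 256), torch.randn(4, 3, 256))  # (4, 9)
```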
Embodied Question Answering
We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- where
an agent is spawned at a random location in a 3D environment and asked a
question ("What color is the car?"). In order to answer, the agent must first
intelligently navigate to explore the environment, gather information through
first-person (egocentric) vision, and then answer the question ("orange").
This challenging task requires a range of AI skills -- active perception,
language understanding, goal-driven navigation, commonsense reasoning, and
grounding of language into actions. In this work, we develop the environments,
end-to-end-trained reinforcement learning agents, and evaluation protocols for
EmbodiedQA.
Comment: 20 pages, 13 figures, Webpage: https://embodiedqa.org
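The episode structure can be summarized as a navigate-then-answer loop. The sketch below uses entirely hypothetical Environment and Agent interfaces (a dummy environment and a random agent) purely to show the control flow, not the released simulator or models:

```python
import random

class DummyEnv:
    """Hypothetical stand-in for a 3D house environment (not the real simulator)."""
    def reset(self):
        return "frame_0"                 # an egocentric observation at a random spawn
    def step(self, action):
        return f"frame_after_{action}"   # next first-person frame

class RandomAgent:
    """Hypothetical agent that navigates randomly and answers with a constant."""
    def act(self, frame, question):
        return random.choice(["forward", "turn-left", "turn-right", "stop"])
    def answer(self, frame, question):
        return "orange"

def run_episode(env, agent, question, max_steps=100):
    """Navigate from a random spawn, observing egocentric frames, then answer."""
    frame = env.reset()
    for _ in range(max_steps):
        action = agent.act(frame, question)
        if action == "stop":
            break
        frame = env.step(action)
    return agent.answer(frame, question)

print(run_episode(DummyEnv(), RandomAgent(), "What color is the car?"))
```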
Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild
Recognizing scenes and objects in 3D from a single image is a longstanding
goal of computer vision with applications in robotics and AR/VR. For 2D
recognition, large datasets and scalable solutions have led to unprecedented
advances. In 3D, existing benchmarks are small in size and approaches
specialize in a few object categories and specific domains, e.g. urban driving
scenes. Motivated by the success of 2D recognition, we revisit the task of 3D
object detection by introducing a large benchmark, called Omni3D. Omni3D
re-purposes and combines existing datasets, resulting in 234k images annotated
with more than 3 million instances and 98 categories. 3D detection at such
scale is challenging due to variations in camera intrinsics and the rich
diversity of scene and object types. We propose a model, called Cube R-CNN,
designed to generalize across camera and scene types with a unified approach.
We show that Cube R-CNN outperforms prior works on the larger Omni3D and
existing benchmarks. Finally, we prove that Omni3D is a powerful dataset for 3D
object recognition and show that it improves single-dataset performance and can
accelerate learning on new smaller datasets via pre-training.
Comment: CVPR 2023, Project website: https://omni3d.garrickbrazil.com
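As a small worked example of the 3D output such a detector predicts, the sketch below converts a cuboid parameterized by center, dimensions, and yaw into its eight corners; the axis convention and corner ordering are illustrative, not Omni3D's exact parameterization:

```python
import numpy as np

def cuboid_corners(center, dims, yaw):
    """Return the 8 corners (8, 3) of a 3D box from its center (x, y, z),
    dimensions (w, h, l), and rotation `yaw` about the vertical axis.
    Axis/ordering convention chosen for illustration only."""
    w, h, l = dims
    # Corner offsets of an axis-aligned box centered at the origin.
    x = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * (w / 2)
    y = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * (h / 2)
    z = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * (l / 2)
    corners = np.stack([x, y, z], axis=1)               # (8, 3)
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])  # rotation about the y (vertical) axis
    return corners @ rot.T + np.asarray(center)

# Toy example: a 2 m wide, 1.5 m tall, 4 m long box 10 m in front of the camera.
print(cuboid_corners(center=(0.0, 0.0, 10.0), dims=(2.0, 1.5, 4.0), yaw=np.pi / 6))
```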