Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
We introduce a model for bidirectional retrieval of images and sentences
through a multi-modal embedding of visual and natural language data. Unlike
previous models that directly map images or sentences into a common embedding
space, our model works on a finer level and embeds fragments of images
(objects) and fragments of sentences (typed dependency tree relations) into a
common space. In addition to a ranking objective seen in previous work, this
allows us to add a new fragment alignment objective that learns to directly
associate these fragments across modalities. Extensive experimental evaluation
shows that reasoning on both the global level of images and sentences and the
finer level of their respective fragments significantly improves performance on
image-sentence retrieval tasks. Additionally, our model provides interpretable
predictions since the inferred inter-modal fragment alignment is explicit.
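The fragment-level scoring idea can be sketched in a few lines. The sketch below uses random placeholder features, assumed dimensions, and inner products as cross-modal alignment scores; it is an illustration of the idea, not the paper's exact model.

```python
# A minimal sketch of fragment embedding and cross-modal alignment scoring,
# assuming random placeholder features; the real model uses CNN object
# detections and typed dependency relations as fragments.
import numpy as np

rng = np.random.default_rng(0)
d = 64                               # common embedding dimension (assumed)

W_img = rng.normal(size=(d, 128))    # maps image-fragment features to common space
W_txt = rng.normal(size=(d, 300))    # maps sentence-fragment features to common space

img_frags = rng.normal(size=(5, 128))   # e.g. 5 detected object regions
txt_frags = rng.normal(size=(4, 300))   # e.g. 4 dependency-tree relations

v = img_frags @ W_img.T              # (5, d) image fragments in common space
s = txt_frags @ W_txt.T              # (4, d) sentence fragments in common space

# Fragment alignment scores: inner products between every cross-modal pair.
align = v @ s.T                      # (5, 4)

# A global image-sentence score aggregates fragment scores; here, the best
# image fragment per sentence fragment, averaged (one common choice).
global_score = align.max(axis=0).mean()
print(align.shape, global_score)
```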
Large-Scale Video Classification with Convolutional Neural Networks
Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3%, up from 43.9%).
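One way to extend a CNN's connectivity in time is "early fusion": concatenating a short clip's frames along the channel axis so the first-layer filters span multiple frames. The sketch below illustrates this on random frames with illustrative sizes, not the paper's exact architecture.

```python
# A minimal sketch of extending a CNN's first layer in time (early fusion),
# using random frames; shapes are illustrative, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
T, H, W, C = 4, 32, 32, 3            # a short clip of 4 RGB frames

clip = rng.normal(size=(T, H, W, C))

# Single-frame model: a first-layer filter sees one frame (C channels).
single_frame_input = clip[0]                       # (H, W, C)

# Early fusion: stack the T frames along channels so the same spatial
# filter now spans T*C channels and can pick up motion cues.
early_fusion_input = clip.transpose(1, 2, 0, 3).reshape(H, W, T * C)

print(single_frame_input.shape, early_fusion_input.shape)  # (32,32,3) (32,32,12)
```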
ImageNet Large Scale Visual Recognition Challenge
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in
object category classification and detection on hundreds of object categories
and millions of images. The challenge has been run annually from 2010 to
present, attracting participation from more than fifty institutions.
This paper describes the creation of this benchmark dataset and the advances
in object recognition that have been possible as a result. We discuss the
challenges of collecting large-scale ground truth annotation, highlight key
breakthroughs in categorical object recognition, provide a detailed analysis of
the current state of the field of large-scale image classification and object
detection, and compare the state-of-the-art computer vision accuracy with human
accuracy. We conclude with lessons learned in the five years of the challenge,
and propose future directions and improvements.
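ILSVRC classification is conventionally scored with top-5 error: a prediction counts as correct if the ground-truth class appears among the five highest-scoring classes. A minimal sketch of the metric, with made-up scores and labels:

```python
# A minimal sketch of the top-5 error metric used for ILSVRC classification,
# with made-up scores for 3 images over 10 classes (assumed data).
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 10))    # classifier scores per image
labels = np.array([2, 7, 7])         # ground-truth class indices (assumed)

top5 = np.argsort(-scores, axis=1)[:, :5]          # 5 highest-scoring classes
correct = (top5 == labels[:, None]).any(axis=1)    # hit if truth is among them
print("top-5 error:", 1.0 - correct.mean())
```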
Staged learning of agile motor skills
Motor learning lies at the heart of how humans and animals acquire their skills. Understanding of this process enables many benefits in Robotics, physics-based Computer Animation, and other areas of science and engineering. In this thesis, we develop a computational framework for learning of agile, integrated motor skills.
Our algorithm draws inspiration from the process by which humans and animals acquire their skills in nature. Specifically, all skills are learned through a process of staged, incremental learning, during which progressively more complex skills are acquired and subsequently integrated with prior abilities. Accordingly, our learning algorithm is comprised of three phases. In the first phase, a few seed motions that accomplish goals of a skill are acquired. In the second phase, additional motions are collected through active exploration. Finally, the third phase generalizes from observations made in the second phase to yield a dynamics model that is relevant to the goals of a skill.
We apply our learning algorithm to a simple, planar character in a physical simulation and learn a variety of integrated skills such as hopping, flipping, rolling, stopping, getting up and continuous acrobatic maneuvers. Aspects of each skill, such as length, height and speed of the motion can be interactively controlled through a user interface. Furthermore, we show that the algorithm can be used without modification to learn all skills for a whole family of parameterized characters of similar structure. Finally, we demonstrate that our approach also scales to a more complex quadruped character.
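The three-phase structure described above can be sketched schematically. In the sketch below all functions and quantities are hypothetical stand-ins for the thesis's simulation and models: a toy rollout produces (state, action, next state) data, and the final phase fits a simple linear dynamics model by least squares.

```python
# A schematic sketch of the three-phase staged-learning loop; all functions
# here are hypothetical stand-ins, not the thesis's actual framework.
import numpy as np

rng = np.random.default_rng(0)

def rollout(params):
    """Hypothetical rollout: returns (state, action, next_state) tuples."""
    s = rng.normal(size=3)
    a = params + 0.1 * rng.normal(size=3)
    return [(s, a, s + a)]

# Phase 1: acquire a few seed motions that accomplish the skill's goal.
seeds = [rng.normal(size=3) for _ in range(3)]
data = [t for p in seeds for t in rollout(p)]

# Phase 2: active exploration, perturbing the seeds to collect nearby motions.
for p in seeds:
    for _ in range(5):
        data.extend(rollout(p + 0.2 * rng.normal(size=3)))

# Phase 3: generalize the observations into a dynamics model; here a linear
# least-squares fit from (state, action) to next state.
X = np.array([np.concatenate([s, a]) for s, a, _ in data])
Y = np.array([sn for _, _, sn in data])
model, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("dynamics model shape:", model.shape)
```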
Object Discovery in 3D scenes via Shape Analysis
We present a method for discovering object models from 3D meshes of indoor environments. Our algorithm first decomposes the scene into a set of candidate mesh segments and then ranks each segment according to its "objectness", a quality that distinguishes objects from clutter. To do so, we propose five intrinsic shape measures: compactness, symmetry, smoothness, and local and global convexity. We additionally propose a recurrence measure, codifying the intuition that frequently occurring geometries are more likely to correspond to complete objects. We evaluate our method in both supervised and unsupervised regimes on a dataset of 58 indoor scenes collected using an open-source implementation of Kinect Fusion [1]. We show that our approach can reliably and efficiently distinguish objects from clutter, with an Average Precision score of 0.92. We make our dataset available to the public.
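Ranking segments by objectness can be sketched as scoring each candidate with a weighted combination of the shape measures. The values and weights below are placeholders; the supervised regime would instead fit the weights on labeled object/clutter segments.

```python
# A minimal sketch of ranking mesh segments by "objectness" as a weighted
# combination of shape measures; all values here are placeholders.
import numpy as np

rng = np.random.default_rng(0)

# One row per candidate segment; columns: compactness, symmetry, smoothness,
# local convexity, global convexity, recurrence (assumed normalized to [0, 1]).
measures = rng.uniform(size=(8, 6))

# Unsupervised regime: equal weights; a supervised regime would learn them.
weights = np.full(6, 1 / 6)

objectness = measures @ weights
ranking = np.argsort(-objectness)       # most object-like segments first
print("segments ranked by objectness:", ranking)
```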
Deep fragment embeddings for bidirectional image sentence mapping
We introduce a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data. Unlike previous models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. We then introduce a structured max-margin objective that allows our model to explicitly associate these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions for the image-sentence retrieval task since the inferred inter-modal alignment of fragments is explicit.
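A max-margin ranking objective over image-sentence scores can be sketched as follows; the score matrix, margin, and exact form below are illustrative, not the paper's precise structured objective.

```python
# A minimal sketch of a bidirectional max-margin ranking loss over a
# precomputed image-sentence score matrix (assumed); margin is illustrative.
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(4, 4))     # S[i, j]: score of image i with sentence j
margin = 1.0

loss = 0.0
for i in range(S.shape[0]):
    for j in range(S.shape[1]):
        if i == j:
            continue
        # Rank the true pair (i, i) above mismatched sentences and images.
        loss += max(0.0, margin + S[i, j] - S[i, i])   # sentence ranking term
        loss += max(0.0, margin + S[j, i] - S[i, i])   # image ranking term
print("ranking loss:", loss)
```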