CSGNet: Neural Shape Parser for Constructive Solid Geometry
We present a neural architecture that takes as input a 2D or 3D shape and
outputs a program that generates the shape. The instructions in our program are
based on constructive solid geometry principles, i.e., a set of boolean
operations on shape primitives defined recursively. Bottom-up techniques for
this shape parsing task rely on primitive detection and are inherently slow
since the search space over possible primitive combinations is large. In
contrast, our model uses a recurrent neural network that parses the input shape
in a top-down manner, which is significantly faster and yields a compact and
easy-to-interpret sequence of modeling instructions. Our model is also more
effective as a shape detector compared to existing state-of-the-art detection
techniques. We finally demonstrate that our network can be trained on novel
datasets without ground-truth program annotations through policy gradient
techniques. Comment: Accepted at CVPR-201
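To make the idea of a program built from boolean operations on shape primitives concrete, here is a minimal sketch. It is not the paper's instruction set or network output: the postfix ordering, the grid size, and the names `circle`, `square`, `OPS`, and `execute` are assumptions made for illustration.

```python
# Hypothetical postfix CSG program evaluated on a small 2D pixel grid.
GRID = 16  # assumed grid resolution


def circle(cx, cy, r):
    """Set of grid cells inside a circle of radius r centred at (cx, cy)."""
    return {(x, y) for x in range(GRID) for y in range(GRID)
            if (x - cx) ** 2 + (y - cy) ** 2 <= r * r}


def square(cx, cy, half):
    """Set of grid cells inside an axis-aligned square with half-width `half`."""
    return {(x, y) for x in range(GRID) for y in range(GRID)
            if abs(x - cx) <= half and abs(y - cy) <= half}


# Boolean operations on shapes, expressed as set operations on cells.
OPS = {
    "union": lambda a, b: a | b,
    "intersect": lambda a, b: a & b,
    "subtract": lambda a, b: a - b,
}


def execute(program):
    """Run a postfix program: primitives are pushed, operators pop two shapes."""
    stack = []
    for instr in program:
        if isinstance(instr, str):
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[instr](left, right))
        else:
            stack.append(instr)
    if len(stack) != 1:
        raise ValueError("malformed program")
    return stack[0]


# A square with a circular hole: square, circle, subtract.
program = [square(8, 8, 4), circle(8, 8, 2), "subtract"]
shape = execute(program)
```

A shape parser in this setting would emit the instruction sequence; the executor above only shows how such a sequence could be turned back into a shape.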
gMotion: A spatio-temporal grammar for the procedural generation of motion graphics
Creating compelling 2D animations by hand that choreograph several groups of shapes requires a large number of manual edits. We present a method to procedurally generate motion graphics with timeslice grammars. Timeslice grammars are to time what split grammars are to space. We use this grammar to formally model motion graphics, manipulating them in both their temporal and spatial components. We are able to combine both these aspects by representing animations as sets of affine transformations sampled uniformly in both space and time. Rules and operators in the grammar manipulate all spatio-temporal matrices as a whole, allowing us to expressively construct animations with few rules. The grammar animates shapes, which are represented as highly tessellated polygons, by applying the affine transforms to each shape vertex given the vertex position and the animation time. We introduce a small set of operators showing how we can produce 2D animations of geometric objects by combining the expressive power of the grammar model, the composability of the operators with themselves, and the capabilities that derive from using a unified spatio-temporal representation for animation data. Throughout the paper, we show how timeslice grammars can produce a wide variety of animations that would otherwise take artists hours of tedious work. In particular, in cases where shapes change frequently, our grammar can add motion detail to large collections of shapes with greater control over per-shape animations along with a compact rule structure.
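The per-vertex step described in the abstract (an affine transform chosen from the vertex position and the animation time, then applied to that vertex) can be sketched as follows. This is an illustrative assumption, not the paper's grammar or operators: the `wave` field, the 2x3 matrix layout, and the helper names are hypothetical.

```python
import math


def wave(position, t, amplitude=0.5):
    """Hypothetical spatio-temporal field: a 2x3 affine matrix that depends
    on both the vertex position and the normalized time t."""
    x, _ = position
    dy = amplitude * math.sin(2.0 * math.pi * (t + x))
    return [[1.0, 0.0, 0.0],
            [0.0, 1.0, dy]]


def apply_affine(m, vertex):
    """Apply a 2x3 affine matrix to a 2D vertex."""
    x, y = vertex
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


def animate(vertices, transform_at, t):
    """Transform every vertex with the matrix looked up at (vertex, t)."""
    return [apply_affine(transform_at(v, t), v) for v in vertices]


# A tessellated horizontal segment, evaluated at time t = 0.
segment = [(i / 8.0, 0.0) for i in range(9)]
frame = animate(segment, wave, 0.0)
```

Because the transform is looked up per vertex, a finely tessellated polygon bends smoothly even though each individual transform is affine.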
Solving Bongard Problems with a Visual Language and Pragmatic Reasoning
More than 50 years ago Bongard introduced 100 visual concept learning
problems as a testbed for intelligent vision systems. These problems are now
known as Bongard problems. Although they are well known in the cognitive
science and AI communities, only moderate progress has been made towards
building systems that can solve a substantial subset of them. In the system
presented here, visual features are extracted through image processing and then
translated into a symbolic visual vocabulary. We introduce a formal language
that allows representing complex visual concepts based on this vocabulary.
Using this language and Bayesian inference, complex visual concepts can be
induced from the examples that are provided in each Bongard problem. Contrary
to other concept learning problems, the examples from which concepts are induced
are not random in Bongard problems; instead, they are carefully chosen to
communicate the concept, hence requiring pragmatic reasoning. Taking pragmatic
reasoning into account we find good agreement between the concepts with high
posterior probability and the solutions formulated by Bongard himself. While
this approach is far from solving all Bongard problems, it solves the biggest
fraction yet.
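The effect of assuming that examples are chosen deliberately, not drawn at random, can be illustrated with a toy Bayesian concept learner. This sketch swaps in the standard "size principle" likelihood as a stand-in and uses integers in place of images; it is not the paper's visual language or its pragmatic model, and `UNIVERSE`, `HYPOTHESES`, and `posterior` are hypothetical names.

```python
from fractions import Fraction

# Toy stand-in for a space of visual scenes: the integers 1..20.
UNIVERSE = range(1, 21)

# Candidate concepts, each given by its extension (the items it covers).
HYPOTHESES = {
    "even": {n for n in UNIVERSE if n % 2 == 0},
    "multiple_of_4": {n for n in UNIVERSE if n % 4 == 0},
    "any": set(UNIVERSE),
}


def posterior(examples, hypotheses):
    """Posterior over concepts with a uniform prior and a size-principle
    likelihood: examples are assumed to be sampled from the concept itself,
    so smaller consistent concepts are favoured."""
    scores = {}
    for name, extension in hypotheses.items():
        if all(e in extension for e in examples):
            scores[name] = Fraction(1, len(extension)) ** len(examples)
        else:
            scores[name] = Fraction(0)
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no hypothesis is consistent with the examples")
    return {name: score / total for name, score in scores.items()}


post = posterior([4, 8, 16], HYPOTHESES)
```

All three concepts are consistent with the examples, yet the narrowest one receives most of the posterior mass, which is the qualitative behaviour one expects when examples are picked to communicate a concept.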
Representation and Detection of Shapes in Images
We present a set of techniques that can be used to represent and detect shapes in images. Our methods revolve around a particular shape representation based on the description of objects using triangulated polygons. This representation is similar to the medial axis transform and has important properties from a computational perspective. The first problem we consider is the detection of non-rigid objects in images using deformable models. We present an efficient algorithm to solve this problem in a wide range of situations, and show examples in both natural and medical images. We also consider the problem of learning an accurate non-rigid shape model for a class of objects from examples. We show how to learn good models while constraining them to the form required by the detection algorithm. Finally, we consider the problem of low-level image segmentation and grouping. We describe a stochastic grammar that generates arbitrary triangulated polygons while capturing Gestalt principles of shape regularity. This grammar is used as a prior model over random shapes in a low-level algorithm that detects objects in images.
Joint Video and Text Parsing for Understanding Events and Answering Queries
We propose a framework for parsing video and text jointly for understanding
events and answering user queries. Our framework produces a parse graph that
represents the compositional structures of spatial information (objects and
scenes), temporal information (actions and events) and causal information
(causalities between events and fluents) in the video and text. The knowledge
representation of our framework is based on a spatial-temporal-causal And-Or
graph (S/T/C-AOG), which jointly models possible hierarchical compositions of
objects, scenes and events as well as their interactions and mutual contexts,
and specifies the prior probabilistic distribution of the parse graphs. We
present a probabilistic generative model for joint parsing that captures the
relations between the input video/text, their corresponding parse graphs and
the joint parse graph. Based on the probabilistic model, we propose a joint
parsing system consisting of three modules: video parsing, text parsing and
joint inference. Video parsing and text parsing produce two parse graphs from
the input video and text respectively. The joint inference module produces a
joint parse graph by performing matching, deduction and revision on the video
and text parse graphs. The proposed framework has the following objectives:
first, we aim at deep semantic parsing of video and text that goes beyond the
traditional bag-of-words approaches; second, we perform parsing and reasoning
across the spatial, temporal and causal dimensions based on the joint S/T/C-AOG
representation; third, we show that deep joint parsing facilitates subsequent
applications such as generating narrative text descriptions and answering
queries in the form of who, what, when, where and why. We empirically
evaluated our system by comparison against ground truth as well as by the
accuracy of query answering, and obtained satisfactory results.
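As a rough illustration of how an And-Or graph defines a distribution over parse graphs, the sketch below finds the most probable parse in a tiny hand-written graph. It covers only the general And-Or idea (And-nodes decompose, Or-nodes choose one weighted alternative), not the S/T/C-AOG or the joint inference described above; the graph contents, branch weights, and the `best_parse` helper are all hypothetical.

```python
# Hypothetical And-Or graph. "and" nodes list their parts; "or" nodes list
# (alternative, branching probability) pairs. Unlisted names are terminals.
AOG = {
    "Event": ("or", [("DrinkWater", 0.7), ("MakeCall", 0.3)]),
    "DrinkWater": ("and", ["ApproachDispenser", "Drink"]),
    "MakeCall": ("and", ["PickUpPhone", "Talk"]),
}


def best_parse(node, graph):
    """Return (probability, tree) for the most probable parse rooted at node."""
    if node not in graph:
        return 1.0, node  # terminal node
    kind, children = graph[node]
    if kind == "and":
        # An And-node keeps all of its children; probabilities multiply.
        prob, subtrees = 1.0, []
        for child in children:
            p, tree = best_parse(child, graph)
            prob *= p
            subtrees.append(tree)
        return prob, (node, subtrees)
    # An Or-node keeps exactly one child: the best weighted alternative.
    best_p, best_tree = -1.0, None
    for child, weight in children:
        p, tree = best_parse(child, graph)
        if weight * p > best_p:
            best_p, best_tree = weight * p, tree
    return best_p, (node, [best_tree])


prob, tree = best_parse("Event", AOG)
```

A parse graph here is simply the tree that remains after every Or-node has committed to one branch; its prior probability is the product of the chosen branch weights.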