Thick 2D Relations for Document Understanding
We use a propositional language of qualitative rectangle relations to detect the reading order from document images. To this end, we define the notion of a document encoding rule and we analyze possible formalisms to express document encoding rules, such as LaTeX and SGML. Document encoding rules expressed in the propositional language of rectangles are used to build a reading order detector for document images. To achieve robustness and avoid brittleness when applying the system to real-life document images, the notion of a thick boundary interpretation for a qualitative relation is introduced. The framework is tested on a collection of heterogeneous document images, showing recall rates up to 89%.
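The idea of a thick boundary can be illustrated with a minimal sketch. It is not the authors' formalism: the `Rect` type, the `precedes_x` relation, and the `thickness` tolerance are hypothetical names chosen here, and the sketch assumes that "thick" means a relation still holds when two box edges miss each other by no more than a small margin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned bounding box of a text block (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float


def precedes_x(a: Rect, b: Rect, thickness: float = 0.0) -> bool:
    """Return True if `a` lies to the left of `b`.

    With thickness == 0 this is the crisp relation: the right edge of `a`
    must not pass the left edge of `b`. A positive thickness (an assumed
    reading of "thick boundary") tolerates an overlap of up to that amount.
    """
    return a.x2 <= b.x1 + thickness


# Two columns whose boxes overlap by 2 units, e.g. due to segmentation noise.
left_column = Rect(x1=0.0, y1=0.0, x2=102.0, y2=300.0)
right_column = Rect(x1=100.0, y1=0.0, x2=200.0, y2=300.0)

crisp = precedes_x(left_column, right_column)                 # overlap breaks the crisp relation
thick = precedes_x(left_column, right_column, thickness=3.0)  # tolerated with a thick boundary
```

Under the crisp reading the two-unit overlap is enough to break the "left of" relation, which is the kind of brittleness the abstract refers to; the tolerant reading keeps it.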
VideoGraph: Recognizing Minutes-Long Human Activities in Videos
Many human activities take minutes to unfold. To represent them, related
works opt for statistical pooling, which neglects the temporal structure.
Others opt for convolutional methods, such as CNNs and Non-Local networks. While successful in
learning temporal concepts, they fall short of modeling minutes-long temporal
dependencies. We propose VideoGraph, a method to achieve the best of two
worlds: represent minutes-long human activities and learn their underlying
temporal structure. VideoGraph learns a graph-based representation for human
activities. The graph, its nodes and edges are learned entirely from video
datasets, making VideoGraph applicable to problems without node-level
annotation. The result is an improvement over related works on two benchmarks:
Epic-Kitchen and Breakfast. In addition, we demonstrate that VideoGraph is able to
learn the temporal structure of human activities in minutes-long videos.
Siamese Instance Search for Tracking
In this paper we present a tracker, which is radically different from
state-of-the-art trackers: we apply no model updating, no occlusion detection,
no combination of trackers, no geometric matching, and still deliver
state-of-the-art tracking performance, as demonstrated on the popular online
tracking benchmark (OTB) and six very challenging YouTube videos. The presented
tracker simply matches the initial patch of the target in the first frame with
candidates in a new frame and returns the most similar patch by a learned
matching function. The strength of the matching function comes from being
extensively trained generically, i.e., without any data of the target, using a
Siamese deep neural network, which we design for tracking. Once learned, the
matching function is used as is, without any adapting, to track previously
unseen targets. It turns out that the learned matching function is so powerful
that a simple tracker built upon it, coined Siamese INstance search Tracker,
SINT, which only uses the original observation of the target from the first
frame, suffices to reach state-of-the-art performance. Further, we show the
proposed tracker even allows for target re-identification after the target was
absent for a complete video shot.
Comment: This paper is accepted to the IEEE Conference on Computer Vision and
Pattern Recognition, 201
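The matching step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the feature vectors stand in for the output of the learned Siamese network, cosine similarity stands in for the learned matching function, and the names `cosine_similarity` and `best_candidate` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_a * norm_b)


def best_candidate(target_feature, candidate_features):
    """Return (index, score) of the candidate most similar to the target.

    The target feature comes from the first frame only and is never
    updated, mirroring the "no model updating" design in the abstract.
    """
    scores = [cosine_similarity(target_feature, c) for c in candidate_features]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]


# Toy features: the target from frame 1 and three candidate patches from a new frame.
target = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.1, 1.9], [1.0, 1.0, 0.0]]
index, score = best_candidate(target, candidates)
```

Here the second candidate is selected because its feature vector points in almost the same direction as the target's.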
Unified Embedding and Metric Learning for Zero-Exemplar Event Detection
Event detection in unconstrained videos is conceived as a content-based video
retrieval with two modalities: textual and visual. Given a text describing a
novel event, the goal is to rank related videos accordingly. This task is
zero-exemplar: no video examples are given for the novel event.
Related works train a bank of concept detectors on external data sources.
These detectors predict confidence scores for test videos, which are ranked and
retrieved accordingly. In contrast, we learn a joint space in which the visual
and textual representations are embedded. The space casts a novel event as a
probability of pre-defined events. Also, it learns to measure the distance
between an event and its related videos.
Our model is trained end-to-end on the publicly available EventNet. When applied
to the TRECVID Multimedia Event Detection dataset, it outperforms the
state-of-the-art by a considerable margin.
Comment: IEEE CVPR 201
Explaining with Counter Visual Attributes and Examples
In this paper, we aim to explain the decisions of neural networks by
utilizing multimodal information, namely counter-intuitive attributes and
counter visual examples, which appear when perturbed samples are introduced.
Different from previous work on interpreting decisions using saliency maps,
text, or visual patches, we propose to use attributes and counter-attributes,
and examples and counter-examples as part of the visual explanations. When
humans explain visual decisions, they tend to do so by providing attributes and
examples. Hence, inspired by the way humans explain, in this paper we
provide attribute-based and example-based explanations. Moreover, humans also
tend to explain their visual decisions by adding counter-attributes and
counter-examples to explain what is not seen. We introduce directed
perturbations in the examples to observe which attribute values change when
classifying the examples into the counter classes. This delivers intuitive
counter-attributes and counter-examples. Our experiments with both coarse and
fine-grained datasets show that attributes provide discriminating and
human-understandable intuitive and counter-intuitive explanations.
Comment: arXiv admin note: substantial text overlap with arXiv:1910.07416,
arXiv:1904.0827
Diagnosing Rarity in Human-Object Interaction Detection
Human-object interaction (HOI) detection is a core task in computer vision.
The goal is to localize all human-object pairs and recognize their
interactions. An interaction defined by a tuple leads to a
long-tailed visual recognition challenge since many combinations are rarely
represented. The performance of the proposed models is limited especially for
the tail categories, but little has been done to understand the reason. To that
end, in this paper, we propose to diagnose rarity in HOI detection. We propose
a three-step strategy, namely Detection, Identification and Recognition where
we carefully analyse the limiting factors by studying state-of-the-art models.
Our findings indicate that the detection and identification steps are affected by
interaction signals such as occlusion and relative location, which in turn
limits the recognition accuracy.
Comment: Accepted at CVPR'20 Workshop on Learning from Limited Label