Data-Driven Sparse Structure Selection for Deep Neural Networks
Deep convolutional neural networks have demonstrated their extraordinary power on
various tasks. However, it remains very challenging to deploy state-of-the-art
models in real-world applications due to their high computational complexity.
How can we design a compact and effective network without massive experiments
and expert knowledge? In this paper, we propose a simple and effective
framework to learn and prune deep models in an end-to-end manner. In our
framework, a new type of parameter, the scaling factor, is first introduced to
scale the outputs of specific structures, such as neurons, groups or residual
blocks. Then we add sparsity regularizations on these factors, and solve this
optimization problem by a modified stochastic Accelerated Proximal Gradient
(APG) method. By forcing some of the factors to zero, we can safely remove the
corresponding structures and thus prune the unimportant parts of a CNN. Compared
with other structure selection methods that may need thousands of trials or
iterative fine-tuning, our method is trained fully end-to-end in one training
pass without bells and whistles. We evaluate our method, Sparse Structure
Selection, with several state-of-the-art CNNs, and demonstrate very promising
results with adaptive depth and width selection.
Comment: ECCV camera-ready version
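The scaling-factor idea lends itself to a short sketch. Below is a minimal PyTorch illustration, assuming a residual branch as the pruned structure: a learnable factor scales each branch's output, and a soft-thresholding (proximal) step drives factors toward exact zero. This is the plain proximal-gradient variant, a simplification of the paper's modified stochastic APG update.

```python
# Minimal sketch of the scaling-factor idea, assuming PyTorch. The paper uses
# a modified stochastic APG optimizer; this sketch applies a plain proximal
# (soft-threshold) step, which already yields exact zeros.
import torch
import torch.nn as nn

class ScaledBlock(nn.Module):
    """Wraps a structure (here, a residual branch) with a learnable scaling factor."""
    def __init__(self, branch):
        super().__init__()
        self.branch = branch
        self.scale = nn.Parameter(torch.ones(1))  # the factor to be sparsified

    def forward(self, x):
        # If scale reaches exactly zero, the whole branch can be pruned away.
        return x + self.scale * self.branch(x)

def proximal_l1_step(scale_params, lr, gamma):
    """Soft-thresholding: proximal operator of the L1 penalty gamma * |s|."""
    with torch.no_grad():
        for s in scale_params:
            s.copy_(torch.sign(s) * torch.clamp(s.abs() - lr * gamma, min=0.0))

# Usage sketch: after each SGD step on the network weights, apply the
# proximal step to the scaling factors only.
block = ScaledBlock(nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16)))
y = block(torch.randn(4, 16))
proximal_l1_step([block.scale], lr=0.1, gamma=0.01)
```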
Crowdsourcing in Computer Vision
Computer vision systems require large amounts of manually annotated data to
properly learn challenging visual concepts. Crowdsourcing platforms offer an
inexpensive method to capture human knowledge and understanding, for a vast
number of visual perception tasks. In this survey, we describe the types of
annotations computer vision researchers have collected using crowdsourcing, and
how they have ensured that this data is of high quality while annotation effort
is minimized. We begin by discussing data collection on both classic (e.g.,
object recognition) and recent (e.g., visual story-telling) vision tasks. We
then summarize key design decisions for creating effective data collection
interfaces and workflows, and present strategies for intelligently selecting
the most important data instances to annotate. Finally, we conclude with some
thoughts on the future of crowdsourcing in computer vision.
Comment: A 69-page meta-review of the field; Foundations and Trends in Computer Graphics and Vision, 2016
What Actions are Needed for Understanding Human Actions in Videos?
What is the right way to reason about human activities? What directions
forward are most promising? In this work, we analyze the current state of human
activity understanding in videos. The goal of this paper is to examine
datasets, evaluation metrics, algorithms, and potential future directions. We
look at the qualitative attributes that define activities such as pose
variability, brevity, and density. The experiments consider multiple
state-of-the-art algorithms and multiple datasets. The results demonstrate that
while there is inherent ambiguity in the temporal extent of activities, current
datasets still permit effective benchmarking. We discover that fine-grained
understanding of objects and pose, when combined with temporal reasoning, is
likely to yield substantial improvements in algorithmic accuracy. We present
the many kinds of information that will be needed to achieve substantial gains
in activity understanding: objects, verbs, intent, and sequential reasoning.
The software and additional information will be made available to provide other
researchers with detailed diagnostics for understanding their own algorithms.
Comment: ICCV 2017
Much Ado About Time: Exhaustive Annotation of Temporal Data
Large-scale annotated datasets allow AI systems to learn from and build upon
the knowledge of the crowd. Many crowdsourcing techniques have been developed
for collecting image annotations. These techniques often implicitly rely on the
fact that a new input image takes a negligible amount of time to perceive. In
contrast, we investigate and determine the most cost-effective way of obtaining
high-quality multi-label annotations for temporal data such as videos. Watching
even a short 30-second video clip requires a significant time investment from a
crowd worker; thus, requesting multiple annotations following a single viewing
is an important cost-saving strategy. But how many questions should we ask per
video? We conclude that the optimal strategy is to ask as many questions as
possible in a single Human Intelligence Task (HIT); in our experiments, up to 52
binary questions after watching a 30-second video clip. We demonstrate that
while workers may not correctly
answer all questions, the cost-benefit analysis nevertheless favors consensus
from multiple such cheap-yet-imperfect iterations over more complex
alternatives. When compared with a one-question-per-video baseline, our method
is able to achieve a 10% improvement in recall (76.7% ours versus 66.7%
baseline) at comparable precision (83.8% ours versus 83.0% baseline) in about
half the annotation time (3.8 minutes ours compared to 7.1 minutes baseline).
We demonstrate the effectiveness of our method by collecting multi-label
annotations of 157 human activities on 1,815 videos.
Comment: HCOMP 2016 camera-ready version
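As a back-of-the-envelope illustration of why consensus over cheap-yet-imperfect passes pays off, the following Python simulation compares a single answer per question against a three-worker majority vote. The per-answer accuracy is an assumed value for illustration, not a figure from the paper.

```python
# Illustrative simulation (not the paper's experiment): majority vote over
# several cheap, imperfect annotation passes vs. a single pass per question.
import random

def simulate(p=0.8, n_workers=3, n_questions=10000, seed=0):
    """Fraction of questions answered correctly under majority vote,
    where each worker answers correctly with assumed probability p."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_questions):
        votes = sum(rng.random() < p for _ in range(n_workers))
        if votes > n_workers / 2:  # majority recovers the true label
            correct += 1
    return correct / n_questions

print(f"single pass:   {simulate(n_workers=1):.3f}")  # ~0.80
print(f"3-worker vote: {simulate(n_workers=3):.3f}")  # ~0.90
```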
Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos
Every moment counts in action recognition. A comprehensive understanding of
human activity in video requires labeling every frame according to the actions
occurring, placing multiple labels densely over a video sequence. To study this
problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new
dataset of dense labels over unconstrained internet videos. Modeling multiple,
dense labels benefits from temporal relations within and across classes. We
define a novel variant of long short-term memory (LSTM) deep networks for
modeling these temporal relations via multiple input and output connections. We
show that this model improves action labeling accuracy and further enables
deeper understanding tasks ranging from structured retrieval to action
prediction.
Comment: To appear in IJCV
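A simplified sketch of the dense-labeling setup, assuming PyTorch: the paper's MultiLSTM variant adds multiple input and output connections over a temporal window, and the stand-in below only concatenates a window of frame features at each step, then emits per-frame sigmoid scores so each frame can carry several action labels at once.

```python
# Simplified sketch of dense multi-label action labeling with an LSTM,
# assuming PyTorch; a rough stand-in for the paper's MultiLSTM variant.
import torch
import torch.nn as nn

class DenseLabeler(nn.Module):
    def __init__(self, feat_dim, hidden, n_classes, window=3):
        super().__init__()
        self.window = window
        self.lstm = nn.LSTM(feat_dim * window, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)  # one sigmoid score per class

    def forward(self, feats):                      # feats: (B, T, feat_dim)
        pad = self.window // 2
        padded = nn.functional.pad(feats, (0, 0, pad, pad))
        # stack a window of neighboring frames as the input at each step
        windows = torch.cat(
            [padded[:, i:i + feats.size(1)] for i in range(self.window)], dim=-1
        )
        out, _ = self.lstm(windows)
        return torch.sigmoid(self.head(out))       # (B, T, n_classes), multi-label

model = DenseLabeler(feat_dim=512, hidden=256, n_classes=65)
scores = model(torch.randn(2, 100, 512))  # dense per-frame scores
```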
Adversarial attacks hidden in plain sight
Convolutional neural networks have been used to achieve a string of successes
during recent years, but their lack of interpretability remains a serious
issue. Adversarial examples are designed to deliberately fool neural networks
into making any desired incorrect classification, potentially with very high
certainty. Several defensive approaches increase robustness against adversarial
attacks, demanding attacks of greater magnitude, which lead to visible
artifacts. By considering human visual perception, we compose a technique that
allows us to hide such adversarial attacks in regions of high complexity, such
that they are imperceptible even to an astute observer. We carry out a user
study on classifying adversarially modified images to validate the perceptual
quality of our approach, and find significant evidence for its concealment with
regard to human visual perception.
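One way to make the idea concrete: compute a local-complexity map of the image and gate the perturbation by it, so changes concentrate where the eye tolerates them. The NumPy sketch below uses local standard deviation as an assumed stand-in for the paper's perceptual complexity measure.

```python
# Hedged sketch of masking an adversarial perturbation to high-complexity
# regions; local standard deviation is an assumed complexity proxy.
import numpy as np

def local_std(gray, k=7):
    """Local standard deviation over k-by-k windows (complexity proxy)."""
    pad = k // 2
    p = np.pad(gray, pad, mode="reflect")
    win = np.lib.stride_tricks.sliding_window_view(p, (k, k))
    mean = win.mean(axis=(-1, -2))
    mean_sq = (win ** 2).mean(axis=(-1, -2))
    return np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))

def mask_perturbation(image, delta):
    """Scale a perturbation delta by the normalized local complexity."""
    gray = image.mean(axis=-1) if image.ndim == 3 else image
    c = local_std(gray)
    c = c / (c.max() + 1e-8)                  # normalize to [0, 1]
    if image.ndim == 3:
        c = c[..., None]                      # broadcast over channels
    return np.clip(image + c * delta, 0.0, 1.0)
```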
Food Ingredients Recognition through Multi-label Learning
Automatically constructing a food diary that tracks the ingredients consumed
can help people follow a healthy diet. We tackle the problem of food
ingredients recognition as a multi-label learning problem. We propose a method
for adapting a high-performing state-of-the-art CNN so that it acts as a
multi-label predictor for learning recipes in terms of their list of
ingredients. We show that our model is able, given a picture, to predict its
list of ingredients, even if the recipe corresponding to the picture has never
been seen by the model. We make public two new datasets suitable for this
purpose. Furthermore, we show that a model trained on a high variability of
recipes and ingredients generalizes better to new data, and we visualize
how it specializes each of its neurons to different ingredients.
Comment: 8 pages
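The standard recipe for such an adaptation is to replace the CNN's softmax head with independent sigmoid outputs trained under binary cross-entropy. A minimal PyTorch sketch, with an assumed ResNet-50 backbone and a hypothetical ingredient vocabulary size:

```python
# Minimal sketch of adapting a pretrained CNN to multi-label ingredient
# prediction, assuming PyTorch/torchvision; ResNet-50 is an assumed backbone,
# not necessarily the one used in the paper.
import torch
import torch.nn as nn
from torchvision import models

n_ingredients = 1000  # hypothetical vocabulary size

# In practice, load pretrained weights (e.g., weights="IMAGENET1K_V1").
backbone = models.resnet50(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, n_ingredients)

criterion = nn.BCEWithLogitsLoss()  # independent sigmoid per ingredient

images = torch.randn(4, 3, 224, 224)                       # dummy batch
targets = torch.randint(0, 2, (4, n_ingredients)).float()  # multi-hot labels
loss = criterion(backbone(images), targets)

# At inference, threshold the sigmoid scores to get the ingredient list.
preds = torch.sigmoid(backbone(images)) > 0.5
```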
Hierarchical ResNeXt Models for Breast Cancer Histology Image Classification
Microscopic histology image analysis is a cornerstone in early detection of
breast cancer. However, these images are very large, and manual analysis is
error-prone and very time-consuming; thus, automating this process is in high demand.
We propose a hierarchical system of convolutional neural networks (CNNs) that
automatically classifies patches of these images into four pathologies: normal,
benign, in situ carcinoma and invasive carcinoma. We evaluated our system on
the BACH challenge dataset of image-wise classification and a small dataset
that we used to extend it. Using a 75%/25% train/test split, we achieved an
accuracy of 0.99 on the test split of the BACH dataset and 0.96 on that
of the extension. On the BACH challenge test set, we reached an accuracy
of 0.81, which ranked us 8th out of 51 teams.
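A hierarchical patch classifier can be sketched as a coarse binary net routing to specialist nets. The grouping below (carcinoma vs. non-carcinoma, then subtypes) is one plausible reading; the paper's exact hierarchy may differ.

```python
# Hedged sketch of two-stage hierarchical patch classification, assuming
# PyTorch; the carcinoma/non-carcinoma split is an assumption for illustration.
import torch
import torch.nn as nn

def classify_patch(patch, coarse_net, benign_net, carcinoma_net):
    """Route each patch through a coarse binary net, then a specialist net."""
    is_carcinoma = coarse_net(patch).argmax(dim=1) == 1       # (B,)
    fine = torch.where(
        is_carcinoma.unsqueeze(1),
        carcinoma_net(patch),   # logits: in situ vs. invasive
        benign_net(patch),      # logits: normal vs. benign
    )
    # Global labels: 0=normal, 1=benign, 2=in situ carcinoma, 3=invasive carcinoma
    return is_carcinoma.long() * 2 + fine.argmax(dim=1)

# Stub nets for illustration; the paper's models are ResNeXt-based CNNs.
make_stub = lambda: nn.Sequential(nn.Flatten(), nn.LazyLinear(2))
labels = classify_patch(torch.randn(4, 3, 64, 64), make_stub(), make_stub(), make_stub())
```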
