Embedding Visual Hierarchy with Deep Networks for Large-Scale Visual Recognition
In this paper, a level-wise mixture model (LMM) is developed by embedding
visual hierarchy with deep networks to support large-scale visual recognition
(i.e., recognizing thousands or even tens of thousands of object classes), and
a Bayesian approach is used to adapt a pre-trained visual hierarchy
automatically to the improvements of deep features (that are used for image and
object class representation) when more representative deep networks are learned
over time. Our LMM model provides an end-to-end approach for jointly
learning: (a) the deep networks to extract more discriminative deep features
for image and object class representation; (b) the tree classifier for
recognizing large numbers of object classes hierarchically; and (c) the visual
hierarchy adaptation for achieving more accurate indexing of large numbers of
object classes hierarchically. By supporting joint learning of the tree
classifier, the deep networks and the visual hierarchy adaptation, our LMM
algorithm provides an effective way to control inter-level error propagation, and thus achieves better accuracy on large-scale visual recognition. Our experiments are carried out on the ImageNet1K and ImageNet10K image sets, and our LMM algorithm achieves very competitive results in both accuracy and computational efficiency compared with the baseline methods.
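For intuition only, the sketch below shows one way a two-level tree classifier over deep features could be wired up. It is our own simplified reading of the level-wise idea, not the authors' code; the class-to-group assignment, dimensions, and names are placeholders.
```python
# A minimal sketch (not the authors' code) of a two-level tree classifier over
# deep features: a coarse head routes an image to a group of classes and a
# fine head per group refines the prediction. The class-to-group assignment
# ("hierarchy") is assumed given; in the paper it is adapted as features improve.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLevelTreeClassifier(nn.Module):
    def __init__(self, feat_dim, num_groups, classes_per_group):
        super().__init__()
        self.coarse = nn.Linear(feat_dim, num_groups)      # level 1: group scores
        self.fine = nn.ModuleList(
            [nn.Linear(feat_dim, classes_per_group) for _ in range(num_groups)]
        )                                                   # level 2: class scores within each group

    def forward(self, feats):
        group_logp = F.log_softmax(self.coarse(feats), dim=1)                 # log P(group | x)
        fine_logp = torch.stack(
            [F.log_softmax(head(feats), dim=1) for head in self.fine], dim=1
        )                                                                     # log P(class | group, x)
        # Joint log-probability over all leaf classes: log P(group) + log P(class | group)
        return (group_logp.unsqueeze(2) + fine_logp).flatten(1)

# Hypothetical usage: 512-d deep features, 10 groups x 100 classes = 1000 classes.
model = TwoLevelTreeClassifier(512, 10, 100)
feats = torch.randn(4, 512)
labels = torch.randint(0, 1000, (4,))
loss = F.nll_loss(model(feats), labels)    # trained end-to-end with the feature extractor
loss.backward()
```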
Jointly Localizing and Describing Events for Dense Video Captioning
Automatically describing a video with natural language is regarded as a
fundamental challenge in computer vision. The problem is nevertheless not trivial, especially when a video contains multiple events worthy of mention, as often happens in real videos. A valid question is how to
temporally localize and then describe events, which is known as "dense video
captioning." In this paper, we present a novel framework for dense video
captioning that unifies the localization of temporal event proposals and
sentence generation of each proposal, by jointly training them in an end-to-end
manner. To combine these two worlds, we integrate a new design, namely
descriptiveness regression, into a single shot detection structure to infer the
descriptive complexity of each detected proposal via sentence generation. This
in turn adjusts the temporal locations of each event proposal. Our model
differs from existing dense video captioning methods since we propose a joint
and global optimization of detection and captioning, and the framework uniquely
capitalizes on an attribute-augmented video captioning architecture. Extensive
experiments are conducted on ActivityNet Captions dataset and our framework
shows clear improvements when compared to the state-of-the-art techniques. More
remarkably, we obtain a new record: METEOR of 12.96% on ActivityNet Captions
official test set.
Comment: CVPR 2018 Spotlight, Rank 1 in ActivityNet Captions Challenge 2017
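As a rough illustration of the single-shot design described above (not the paper's implementation), the following sketch adds a descriptiveness score next to the usual segment offsets and eventness score at each temporal anchor; all module names and dimensions are assumptions.
```python
# Sketch of a single-shot temporal proposal head that, for every anchor position,
# regresses segment offsets together with an "eventness" score and a
# descriptiveness score. The descriptiveness score is meant to be supervised by
# how well a captioner can describe the segment, coupling detection and captioning.
import torch
import torch.nn as nn

class ProposalHead(nn.Module):
    def __init__(self, feat_dim, num_anchors):
        super().__init__()
        self.offsets = nn.Conv1d(feat_dim, 2 * num_anchors, kernel_size=3, padding=1)         # center/length offsets
        self.eventness = nn.Conv1d(feat_dim, num_anchors, kernel_size=3, padding=1)           # is there an event?
        self.descriptiveness = nn.Conv1d(feat_dim, num_anchors, kernel_size=3, padding=1)     # is it describable?

    def forward(self, clip_feats):                       # clip_feats: (batch, feat_dim, time)
        return (self.offsets(clip_feats),
                torch.sigmoid(self.eventness(clip_feats)),
                torch.sigmoid(self.descriptiveness(clip_feats)))

head = ProposalHead(feat_dim=500, num_anchors=4)
feats = torch.randn(2, 500, 128)                         # hypothetical clip-level video features
offsets, event_score, desc_score = head(feats)
# Proposals are ranked by combining the two scores before captioning each kept segment.
keep = (event_score * desc_score) > 0.5
```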
End-to-End Video Classification with Knowledge Graphs
Video understanding has attracted much research attention especially since
the recent availability of large-scale video benchmarks. In this paper, we
address the problem of multi-label video classification. We first observe that
there exists a significant knowledge gap between how machines and humans learn.
That is, while current machine learning approaches including deep neural
networks largely focus on the representations of the given data, humans often
look beyond the data at hand and leverage external knowledge to make better
decisions. Towards narrowing the gap, we propose to incorporate external
knowledge graphs into video classification. In particular, we unify traditional
"knowledgeless" machine learning models and knowledge graphs in a novel
end-to-end framework. The framework is flexible to work with most existing
video classification algorithms including state-of-the-art deep models.
Finally, we conduct extensive experiments on the largest public video dataset
YouTube-8M. The results are promising across the board, improving mean average
precision by up to 2.9%.
Comment: 9 pages, 5 figures
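One plausible, much simplified way to couple a "knowledgeless" classifier with a label knowledge graph is to smooth its per-label scores over the graph, as sketched below; the adjacency matrix, mixing weight, and function names are illustrative and not taken from the paper.
```python
# Minimal sketch: a classifier's per-label scores are blended with scores
# propagated over a row-normalized label-relation graph (e.g. co-occurrence),
# so related labels reinforce each other. In an end-to-end framework this
# propagation would be a differentiable layer.
import numpy as np

def propagate_scores(scores, adjacency, alpha=0.3):
    """scores: (num_videos, num_labels) raw classifier scores.
    adjacency: (num_labels, num_labels) label-relation graph."""
    deg = adjacency.sum(axis=1, keepdims=True)
    norm_adj = adjacency / np.maximum(deg, 1e-8)               # row-normalize the graph
    return (1 - alpha) * scores + alpha * scores @ norm_adj.T  # blend direct and graph-smoothed scores

scores = np.random.rand(5, 4)                                  # 5 videos, 4 hypothetical labels
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
refined = propagate_scores(scores, adj)
```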
Design Challenges of Multi-UAV Systems in Cyber-Physical Applications: A Comprehensive Survey, and Future Directions
Unmanned Aerial Vehicles (UAVs) have recently rapidly grown to facilitate a
wide range of innovative applications that can fundamentally change the way
cyber-physical systems (CPSs) are designed. CPSs are a modern generation of
systems with synergic cooperation between computational and physical potentials
that can interact with humans through several new mechanisms. The main
advantages of using UAVs in CPS applications are their exceptional features,
including their mobility, dynamism, effortless deployment, adaptive altitude,
agility, adjustability, and effective appraisal of real-world functions anytime
and anywhere. Furthermore, from the technology perspective, UAVs are predicted
to be a vital element of the development of advanced CPSs. Therefore, in this
survey, we aim to pinpoint the most fundamental and important design challenges
of multi-UAV systems for CPS applications. We highlight key and versatile
aspects that span the coverage and tracking of targets and infrastructure
objects, energy-efficient navigation, and image analysis using machine learning
for fine-grained CPS applications. Key prototypes and testbeds are also
investigated to show how these practical technologies can facilitate CPS
applications. We present and propose state-of-the-art algorithms to address
design challenges with both quantitative and qualitative methods and map these
challenges with important CPS applications to draw insightful conclusions on
the challenges of each application. Finally, we summarize potential new
directions and ideas that could shape future research in these areas.
Robust Visual Knowledge Transfer via EDA
We address the problem of visual knowledge adaptation by leveraging labeled
patterns from source domain and a very limited number of labeled instances in
target domain to learn a robust classifier for visual categorization. This
paper proposes a new extreme learning machine based cross-domain network
learning framework, called Extreme Learning Machine (ELM) based Domain
Adaptation (EDA). It allows us to learn a category transformation and an ELM
classifier with random projection by minimizing the l_(2,1)-norm of the network
output weights and the learning error simultaneously. The unlabeled target
data, as useful knowledge, is also integrated as a fidelity term to guarantee stability during cross-domain learning. It minimizes the matching error
between the learned classifier and a base classifier, such that many existing
classifiers can be readily incorporated as base classifiers. The network output
weights can not only be determined analytically, but are also transferable. Additionally, manifold regularization with a Laplacian graph is incorporated, which is beneficial for semi-supervised learning. We also extend the approach to a multi-view model, referred to as MvEDA. Experiments on benchmark
visual datasets for video event recognition and object recognition, demonstrate
that our EDA methods outperform existing cross-domain learning methods.
Comment: This paper has been accepted for publication in IEEE Transactions on Image Processing
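For readers unfamiliar with ELMs, the sketch below shows the random-projection backbone that EDA builds on, with a plain ridge penalty standing in for the l_(2,1)-norm and with the fidelity and manifold terms omitted; it is a simplified illustration, not the EDA algorithm itself.
```python
# Minimal extreme learning machine: inputs pass through a fixed random projection
# plus a nonlinearity, and only the output weights are solved for in closed form.
# A ridge (l2) penalty is used here for brevity instead of the l_(2,1)-norm.
import numpy as np

def elm_fit(X, Y, hidden_dim=256, reg=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], hidden_dim))      # random input weights (never trained)
    b = rng.standard_normal(hidden_dim)
    H = np.tanh(X @ W + b)                                 # random-projection hidden layer
    # Closed-form output weights: beta = (H^T H + reg I)^-1 H^T Y
    beta = np.linalg.solve(H.T @ H + reg * np.eye(hidden_dim), H.T @ Y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Hypothetical source-domain fit, then reuse of the same random projection on target data.
Xs = np.random.rand(100, 64)
Ys = np.eye(5)[np.random.randint(0, 5, 100)]               # one-hot labels for 5 classes
W, b, beta = elm_fit(Xs, Ys)
pred = elm_predict(np.random.rand(10, 64), W, b, beta).argmax(axis=1)
```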
Unsupervised Representation Learning by Sorting Sequences
We present an unsupervised representation learning approach using videos
without semantic labels. We leverage the temporal coherence as a supervisory
signal by formulating representation learning as a sequence sorting task. We
take temporally shuffled frames (i.e., in non-chronological order) as inputs
and train a convolutional neural network to sort the shuffled sequences.
Similar to comparison-based sorting algorithms, we propose to extract features
from all frame pairs and aggregate them to predict the correct order. As
sorting a shuffled image sequence requires an understanding of the statistical temporal structure of images, training with such a proxy task allows us to learn rich and generalizable visual representations. We validate the
effectiveness of the learned representation using our method as pre-training on
high-level recognition problems. The experimental results show that our method
compares favorably against state-of-the-art methods on action recognition,
image classification and object detection tasks.
Comment: ICCV 2017. Project page: http://vllab1.ucmerced.edu/~hylee/OPN
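A toy version of the sequence-sorting proxy task, written under our own assumptions rather than taken from the released OPN code, might look like the following: encode each shuffled frame, combine features over all frame pairs, and classify which permutation was applied.
```python
# Sketch of an order-prediction network: per-frame features are extracted by a
# toy encoder, concatenated for every frame pair, and a classifier predicts
# which of the n! possible orders produced the shuffled tuple.
import itertools
import torch
import torch.nn as nn

class OrderPredictionNet(nn.Module):
    def __init__(self, num_frames=3, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, feat_dim), nn.ReLU())
        self.pairs = list(itertools.combinations(range(num_frames), 2))
        self.pair_fc = nn.Linear(2 * feat_dim, 64)        # shared head applied to each frame pair
        num_orders = len(list(itertools.permutations(range(num_frames))))
        self.classifier = nn.Linear(64 * len(self.pairs), num_orders)

    def forward(self, frames):                            # frames: (batch, num_frames, 3, 32, 32)
        feats = [self.encoder(frames[:, i]) for i in range(frames.shape[1])]
        pair_feats = [torch.relu(self.pair_fc(torch.cat([feats[i], feats[j]], dim=1)))
                      for i, j in self.pairs]
        return self.classifier(torch.cat(pair_feats, dim=1))   # logits over all possible orders

model = OrderPredictionNet()
shuffled = torch.randn(4, 3, 3, 32, 32)                   # 4 shuffled 3-frame tuples (toy resolution)
order_logits = model(shuffled)                            # trained with cross-entropy on the true permutation
```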
cvpaper.challenge in 2015 - A review of CVPR2015 and DeepSurvey
The "cvpaper.challenge" is a group composed of members from AIST, Tokyo Denki
Univ. (TDU), and Univ. of Tsukuba that aims to systematically summarize papers
on computer vision, pattern recognition, and related fields. For this
particular review, we focused on reading all 602 conference papers presented at CVPR2015, the premier annual computer vision event held in June 2015, in order to grasp the trends in the field. Further, we propose "DeepSurvey" as a mechanism embodying the entire process, from reading all the papers and generating ideas to writing a paper.
Comment: Survey Paper
Improving Interpretability of Deep Neural Networks with Semantic Information
Interpretability of deep neural networks (DNNs) is essential since it enables
users to understand the overall strengths and weaknesses of the models, conveys an understanding of how the models will behave in the future, and shows how to diagnose and correct potential problems. However, it is challenging to reason
about what a DNN actually does due to its opaque or black-box nature. To
address this issue, we propose a novel technique to improve the
interpretability of DNNs by leveraging the rich semantic information embedded
in human descriptions. By concentrating on the video captioning task, we first
extract a set of semantically meaningful topics from the human descriptions
that cover a wide range of visual concepts, and integrate them into the model
with an interpretive loss. We then propose a prediction difference maximization
algorithm to interpret the learned features of each neuron. Experimental
results demonstrate its effectiveness in video captioning using the
interpretable features, which can also be transferred to video action
recognition. By clearly understanding the learned features, users can easily
revise false predictions via a human-in-the-loop procedure.
Comment: To appear in CVPR 2017
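As an illustration of how an interpretive loss could sit alongside the main captioning objective (our own sketch, with placeholder topic targets and weighting), consider the following.
```python
# Sketch of an interpretive objective: in addition to the main captioning loss,
# hidden features are asked to predict semantic topics mined from human
# descriptions, nudging learned features toward human-interpretable concepts.
import torch
import torch.nn as nn

task_loss_fn = nn.CrossEntropyLoss()          # e.g. next-word prediction in captioning
topic_loss_fn = nn.BCEWithLogitsLoss()        # multi-label topic presence

def interpretive_objective(word_logits, word_targets, topic_logits, topic_targets, lam=0.1):
    # Total loss = captioning loss + lambda * interpretive (topic) loss
    return task_loss_fn(word_logits, word_targets) + lam * topic_loss_fn(topic_logits, topic_targets)

# Hypothetical tensors: 8 decoding steps over a 1000-word vocabulary, 20 topics.
word_logits = torch.randn(8, 1000, requires_grad=True)
word_targets = torch.randint(0, 1000, (8,))
topic_logits = torch.randn(8, 20, requires_grad=True)
topic_targets = torch.randint(0, 2, (8, 20)).float()
loss = interpretive_objective(word_logits, word_targets, topic_logits, topic_targets)
loss.backward()
```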
Audio-Based Activities of Daily Living (ADL) Recognition with Large-Scale Acoustic Embeddings from Online Videos
Over the years, activity sensing and recognition has been shown to play a key
enabling role in a wide range of applications, from sustainability and
human-computer interaction to health care. While many recognition tasks have
traditionally employed inertial sensors, acoustic-based methods offer the
benefit of capturing rich contextual information, which can be useful when
discriminating complex activities. Given the emergence of deep learning
techniques and leveraging new, large-scale multimedia datasets, this paper
revisits the opportunity of training audio-based classifiers without the
onerous and time-consuming task of annotating audio data. We propose a
framework for audio-based activity recognition that makes use of millions of
embedding features from public online video sound clips. Based on the
combination of oversampling and deep learning approaches, our framework does
not require further feature processing or outlier filtering as in prior work.
We evaluated our approach in the context of Activities of Daily Living (ADL) by
recognizing 15 everyday activities with 14 participants in their own homes,
achieving 64.2% and 83.6% averaged within-subject accuracy in terms of top-1
and top-3 classification respectively. Individual class performance was also
examined in the paper to further study the co-occurrence characteristics of the
activities and the robustness of the framework.
Comment: 18 pages, 7 figures; new version: results updated
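A minimal sketch of the training recipe described above, assuming pre-computed 128-d audio embeddings (e.g. VGGish-style) and naive duplication-based oversampling; the data, classifier, and hyperparameters are illustrative only.
```python
# Sketch: pre-computed audio embeddings are used directly as features, minority
# activity classes are oversampled by duplication, and a small classifier is
# trained on top, with no extra feature engineering or outlier filtering.
import numpy as np
from sklearn.neural_network import MLPClassifier

def oversample(X, y, seed=0):
    rng = np.random.default_rng(seed)
    counts = np.bincount(y)
    target = counts.max()
    Xs, ys = [X], [y]
    for cls, n in enumerate(counts):
        if 0 < n < target:                                   # duplicate minority-class examples
            idx = rng.choice(np.where(y == cls)[0], size=target - n, replace=True)
            Xs.append(X[idx])
            ys.append(y[idx])
    return np.concatenate(Xs), np.concatenate(ys)

# Hypothetical embeddings for 15 activity classes.
X = np.random.rand(600, 128)
y = np.random.randint(0, 15, 600)
X_bal, y_bal = oversample(X, y)
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=50).fit(X_bal, y_bal)
top1 = clf.score(X, y)     # a within-subject evaluation would use held-out sessions instead
```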
Towards Storytelling from Visual Lifelogging: An Overview
Visual lifelogging consists of acquiring images that capture the daily
experiences of the user by wearing a camera over a long period of time. The
pictures taken offer considerable potential for knowledge mining concerning how
people live their lives, hence, they open up new opportunities for many
potential applications in fields including healthcare, security, leisure and
the quantified self. However, automatically building a story from a huge
collection of unstructured egocentric data presents major challenges. This
paper provides a thorough review of advances made so far in egocentric data
analysis, and in view of the current state of the art, indicates new lines of
research to move us towards storytelling from visual lifelogging.
Comment: 16 pages, 11 figures, submitted to IEEE Transactions on Human-Machine Systems