Track Everything: Limiting Prior Knowledge in Online Multi-Object Recognition
This paper addresses the problem of online tracking and classification of
multiple objects in an image sequence. Our proposed solution is to first track
all objects in the scene without relying on object-specific prior knowledge,
which in other systems can take the form of hand-crafted features or user-based
track initialization. We then classify the tracked objects with a fast-learning
image classifier that is based on a shallow convolutional neural network
architecture and demonstrate that object recognition improves when this is
combined with object state information from the tracking algorithm. We argue
that by transferring the use of prior knowledge from the detection and tracking
stages to the classification stage we can design a robust, general purpose
object recognition system with the ability to detect and track a variety of
object types. We describe our biologically inspired implementation, which
adaptively learns the shape and motion of tracked objects, and apply it to the
Neovision2 Tower benchmark data set, which contains multiple object types. An
experimental evaluation demonstrates that our approach is competitive with
state-of-the-art video object recognition systems that do make use of
object-specific prior knowledge in detection and tracking, while providing
additional practical advantages by virtue of its generality.
Comment: 15 pages
Person Search via A Mask-Guided Two-Stream CNN Model
In this work, we tackle the problem of person search, which is a challenging
task consisting of pedestrian detection and person re-identification (re-ID).
Instead of sharing representations in a single joint model, we find that
separating detector and re-ID feature extraction yields better performance. In
order to extract more representative features for each identity, we segment out
the foreground person from the original image patch. We propose a simple yet
effective re-ID method, which models foreground person and original image
patches individually, and obtains enriched representations from two separate
CNN streams. From the experiments on two standard person search benchmarks of
CUHK-SYSU and PRW, we achieve mAP of and respectively,
surpassing the state of the art by a large margin (more than 5pp).
Comment: accepted as poster to ECCV 201
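The mAP figure quoted in this abstract is a ranking metric. The sketch below shows one common way to compute it and is not taken from the paper: the function names are hypothetical, and it assumes every relevant gallery item appears somewhere in each ranked list, so the per-query normalizer is the number of hits found.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: list of booleans, best-ranked result first,
    True where the retrieved item matches the query identity.
    Assumption: all relevant items are present in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each relevant item.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_query_relevance):
    """Mean of the per-query average precisions (0.0 for no queries)."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps) if aps else 0.0
```

For a query whose ranked results are relevant at positions 1 and 3, the average precision is (1/1 + 2/3) / 2.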
Detection, Recognition and Tracking of Moving Objects from Real-time Video via Visual Vocabulary Model and Species Inspired PSO
In this paper, we address the basic problem of recognizing moving objects in
video images using a Visual Vocabulary model and Bag of Words, and we track our
object of interest in the subsequent video frames using species-inspired PSO.
Initially, shadow-free images are obtained by background modeling followed
by foreground modeling to extract the blobs of our object of interest.
Subsequently, we train a cubic SVM with human body datasets in accordance with
our domain of interest for recognition and tracking. During training, using the
principle of Bag of Words we extract necessary features of certain domains and
objects for classification. Subsequently, by matching these feature sets with
those of the extracted object blobs, which are obtained by subtracting the
shadow-free background from the foreground, we successfully detect our object of
interest in the test domain. The performance of the cubic SVM classifier is
reported with a confusion matrix and ROC curve reflecting
the accuracy of each module. After classification, our object of interest is
tracked in the test domain using species-inspired PSO. By combining the
adaptive learning tools with the efficient classification of description, we
achieve optimum accuracy in recognition of the moving objects. We evaluate our
algorithm on benchmark datasets: iLIDS, VIVID, Walking2, Woman. Comparative
analysis of our algorithm against existing state-of-the-art trackers shows
very satisfactory and competitive results.
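The Bag of Words step named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the visual vocabulary is assumed to have been learned beforehand (for example by clustering training descriptors), which the abstract does not specify. Each local descriptor is assigned to its nearest visual word, and the normalized word counts form the feature vector that a classifier such as the cubic SVM would consume.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary entry closest to the descriptor
    (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vocabulary)),
               key=lambda i: sq_dist(descriptor, vocabulary[i]))


def bow_histogram(descriptors, vocabulary):
    """Normalized histogram of visual-word occurrences for one image or blob.

    descriptors: iterable of equal-length numeric tuples (local features).
    vocabulary:  list of visual words (assumed precomputed cluster centres).
    """
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    total = sum(counts)
    # An empty blob yields an all-zero histogram rather than a division error.
    return [c / total for c in counts] if total else [0.0] * len(vocabulary)
```

With a two-word vocabulary `[(0, 0), (10, 10)]`, three descriptors near the origin and one near `(10, 10)` produce the histogram `[0.75, 0.25]`.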
Facial Expression Recognition in the Wild using Rich Deep Features
Facial Expression Recognition is an active area of research in computer
vision with a wide range of applications. Several approaches have been
developed to solve this problem for different benchmark datasets. However,
Facial Expression Recognition in the wild remains an area where much work is
still needed to serve real-world applications. To this end, in this paper we
present a novel approach towards facial expression recognition. We fuse rich
deep features with domain knowledge through encoding discriminant facial
patches. We conduct experiments on two of the most popular benchmark datasets:
CK and TFE. Moreover, we present a novel dataset that, unlike its predecessors,
consists of natural (not acted) expression images. Experimental results show
that our approach achieves state-of-the-art results on standard benchmarks
and our own dataset.
Comment: in International Conference in Image Processing, 201
A review on handwritten character and numeral recognition for Roman, Arabic, Chinese and Indian scripts
Handwritten character recognition (HCR) has been researched intensively for
almost four decades. Much of this research has addressed popular scripts such
as Roman, Arabic, Chinese and Indian. In this paper we
present a review of HCR work on these four scripts. We summarize
most of the papers published from 2005 to the present and analyze the various
methods for creating a robust HCR system. We also suggest some future directions
for research on HCR.
Comment: 8 pages
A Hajj And Umrah Location Classification System For Video Crowded Scenes
In this paper, a new automatic system for classifying ritual locations in
diverse Hajj and Umrah video scenes is investigated. This challenging subject
has mostly been ignored in the past due to several problems, one of which is the
lack of realistic annotated video datasets. The HUER dataset is defined to model
six different Hajj and Umrah ritual locations[26].
The proposed Hajj and Umrah ritual location classification system consists of
four main phases: preprocessing, segmentation, feature extraction, and location
classification. Shot boundary detection and background/foreground
segmentation algorithms are applied to prepare the input video scenes for the
KNN, ANN, and SVM classifiers. The system improves on the state-of-the-art results for
Hajj and Umrah location classification, and successfully recognizes the six
Hajj rituals with more than 90% accuracy. The experiments
show promising results.
Comment: 9 pages, 10 figures, 2 tables, 3 algorithms
Indic Handwritten Script Identification using Offline-Online Multimodal Deep Network
In this paper, we propose a novel approach to word-level Indic script
identification using only character-level data in the training stage. The
advantages of using character-level data for training have been outlined in
section I. Our method uses a multimodal deep network that takes both the offline
and online modalities of the data as input in order to exploit the information
from both modalities jointly for the script identification task. We take
handwritten data in either modality as input, and the opposite modality is
generated through intermodality conversion. Thereafter, we feed this
offline-online modality pair to our network. Hence, along with the advantage of
utilizing information from both modalities, it can work as a single
framework for both offline and online script identification simultaneously,
which removes the need to design two separate script identification
modules for the individual modalities. Another major contribution is that we propose
a novel conditional multimodal fusion scheme to combine the information from the
offline and online modalities, which takes into account the real origin of the
data being fed to our network and thus combines them adaptively. An exhaustive
experiment has been done on a data set consisting of English and six Indic
scripts. Our proposed framework outperforms frameworks based
on traditional classifiers with handcrafted features, as well as deep learning
based methods, by a clear margin. Extensive experiments show that using only
character-level training data can achieve state-of-the-art performance similar to
that obtained with traditional training using word-level data in our framework.
Comment: Accepted in Information Fusion, Elsevier
Hierarchical Spatial-aware Siamese Network for Thermal Infrared Object Tracking
Most thermal infrared (TIR) tracking methods are discriminative, treating the
tracking problem as a classification task. However, the objective of the
classifier (label prediction) is not coupled to the objective of the tracker
(location estimation). The classification task focuses on the between-class
difference of the arbitrary objects, while the tracking task mainly deals with
the within-class difference of the same objects. In this paper, we cast the TIR
tracking problem as a similarity verification task, which is coupled well to
the objective of the tracking task. We propose a TIR tracker via a Hierarchical
Spatial-aware Siamese Convolutional Neural Network (CNN), named HSSNet. To
obtain both spatial and semantic features of the TIR object, we design a
Siamese CNN that coalesces the multiple hierarchical convolutional layers.
Then, we propose a spatial-aware network to enhance the discriminative ability
of the coalesced hierarchical feature. Subsequently, we train this network end
to end on a large visible video detection dataset to learn the similarity
between paired objects before we transfer the network into the TIR domain.
Next, this pre-trained Siamese network is used to evaluate the similarity
between the target template and target candidates. Finally, we locate the
candidate that is most similar to the tracked target. Extensive experimental
results on the benchmarks VOT-TIR 2015 and VOT-TIR 2016 show that our proposed
method achieves favourable performance compared to the state-of-the-art
methods.
Comment: 20 pages, 7 figures
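The final step of this abstract, locating the candidate most similar to the target template, can be illustrated with a small sketch. It is only a stand-in under stated assumptions: HSSNet learns its similarity function end to end, whereas the code below uses plain cosine similarity, and the feature vectors stand in for embeddings that the Siamese CNN would produce. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def locate_target(template_feature, candidate_features):
    """Return (index, score) of the candidate most similar to the template.

    Assumption: a fixed cosine similarity replaces the learned
    similarity used by a real Siamese tracker.
    """
    if not candidate_features:
        raise ValueError("at least one candidate is required")
    scores = [cosine_similarity(template_feature, c) for c in candidate_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

In a tracking loop, the candidates would be features of image patches sampled around the previous target location, and the winning index would give the new location estimate.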
RiFCN: Recurrent Network in Fully Convolutional Network for Semantic Segmentation of High Resolution Remote Sensing Images
Semantic segmentation in high resolution remote sensing images is a
fundamental and challenging task. Convolutional neural networks (CNNs), such as
fully convolutional network (FCN) and SegNet, have shown outstanding
performance in many segmentation tasks. One key pillar of these successes is
mining useful information from features in convolutional layers for producing
high resolution segmentation maps. For example, FCN nonlinearly combines
high-level features extracted from last convolutional layers; whereas SegNet
utilizes a deconvolutional network which takes as input only coarse, high-level
feature maps of the last convolutional layer. However, how to better fuse
multi-level convolutional feature maps for semantic segmentation of remote
sensing images is underexplored. In this work, we propose a novel bidirectional
network called recurrent network in fully convolutional network (RiFCN), which
is end-to-end trainable. It has a forward stream and a backward stream. The
former is a classification CNN architecture for feature extraction, which takes
an input image and produces multi-level convolutional feature maps from shallow
to deep; while in the latter, to achieve accurate boundary inference and
semantic segmentation, boundary-aware high resolution feature maps in shallower
layers and high-level but low-resolution features are recursively embedded into
the learning framework (from deep to shallow) to generate a fused feature
representation that draws a holistic picture of not only high-level semantic
information but also low-level fine-grained details. Experimental results on
two widely-used high resolution remote sensing data sets for semantic
segmentation tasks, ISPRS Potsdam and Inria Aerial Image Labeling Data Set,
demonstrate competitive performance obtained by the proposed methodology
compared to other studied approaches.
Learning Semantic Correspondences in Technical Documentation
We consider the problem of translating high-level textual descriptions to
formal representations in technical documentation as part of an effort to model
the meaning of such documentation. We focus specifically on the problem of
learning translational correspondences between text descriptions and grounded
representations in the target documentation, such as formal representation of
functions or code templates. Our approach exploits the parallel nature of such
documentation, or the tight coupling between high-level text and the low-level
representations we aim to learn. Data is collected by mining technical
documents for such parallel text-representation pairs, which we use to train a
simple semantic parsing model. We report new baseline results on sixteen novel
datasets, including the standard library documentation for nine popular
programming languages across seven natural languages, and a small collection of
Unix utility manuals.
Comment: accepted to ACL-201