Geodesic Distance Histogram Feature for Video Segmentation
This paper proposes a geodesic-distance-based feature that encodes global
information for improved video segmentation algorithms. The feature is a joint
histogram of intensity and geodesic distances, where the geodesic distances are
computed as the shortest paths between superpixels via their boundaries. We
also incorporate adaptive voting weights and spatial pyramid configurations to
include spatial information into the geodesic histogram feature and show that
this further improves results. The feature is generic and can be used as part
of various algorithms. In experiments, we test the geodesic histogram feature
by incorporating it into two existing video segmentation frameworks. This leads
to significantly better performance in 3D video segmentation benchmarks on two
datasets.
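To make the feature concrete, here is a minimal sketch of a joint intensity/geodesic-distance histogram in Python, assuming superpixels are given as an integer label map; the edge weighting, bin counts, and function names are illustrative assumptions, not the paper's exact settings.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def geodesic_histogram(labels, intensity, n_bins=(8, 8)):
    """Per-superpixel joint histogram of intensity and geodesic distance."""
    n_sp = labels.max() + 1
    # Mean intensity per superpixel.
    means = np.array([intensity[labels == i].mean() for i in range(n_sp)])
    # Adjacency between touching superpixels; each edge is weighted by the
    # mean-intensity difference across the shared boundary (an assumption).
    rows, cols, weights = [], [], []
    for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
        mask = a != b
        for i, j in set(zip(a[mask], b[mask])):
            rows.append(i)
            cols.append(j)
            weights.append(abs(means[i] - means[j]) + 1e-6)  # avoid explicit zeros
    graph = csr_matrix((weights, (rows, cols)), shape=(n_sp, n_sp))
    # Geodesic distance = shortest path between superpixels over the graph.
    dist = dijkstra(graph, directed=False)
    dist[~np.isfinite(dist)] = dist[np.isfinite(dist)].max()
    # One joint (intensity, distance) histogram per superpixel, L1-normalized.
    feats = np.stack([np.histogram2d(means, dist[i], bins=n_bins)[0].ravel()
                      for i in range(n_sp)])
    return feats / np.maximum(feats.sum(axis=1, keepdims=True), 1e-8)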
Enhancement of ELDA Tracker Based on CNN Features and Adaptive Model Update
Appearance representation and the observation model are the most important components in designing a robust visual tracking algorithm for video-based sensors. The exemplar-based linear discriminant analysis (ELDA) model has shown good performance in object tracking. Building on it, we improve the ELDA tracking algorithm with deep convolutional neural network (CNN) features and adaptive model updates. Deep CNN features have been used successfully in various computer vision tasks, but extracting them on all of the candidate windows is time-consuming. To address this problem, a two-step CNN feature extraction method is proposed that computes the convolutional layers and the fully-connected layers separately. Owing to the strong discriminative ability of CNN features and the exemplar-based model, we update both the object and background models to improve their adaptivity and to balance discriminative ability against adaptivity. An object updating method is proposed to select the "good" models (detectors), which are highly discriminative and uncorrelated with the other selected models. Meanwhile, we build the background model as a Gaussian mixture model (GMM) to adapt to complex scenes; it is initialized offline and updated online. The proposed tracker is evaluated on a benchmark dataset of 50 video sequences with various challenges. It achieves the best overall performance among the compared state-of-the-art trackers, demonstrating the effectiveness and robustness of our tracking algorithm.
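The two-step extraction amounts to sharing the convolutional computation across all candidate windows. Below is a minimal sketch of that idea in PyTorch, assuming a standard VGG-16 backbone and torchvision's roi_align; the backbone, layer split, and crop size are assumptions for illustration, not the authors' exact network.

import torch
import torchvision
from torchvision.ops import roi_align

backbone = torchvision.models.vgg16(weights="IMAGENET1K_V1").eval()
conv = backbone.features        # step 1: convolutional layers, run once per frame
fc = backbone.classifier[:-1]   # step 2: fully-connected layers, run per window

@torch.no_grad()
def window_features(frame, boxes):
    """frame: (1, 3, H, W) tensor; boxes: (N, 4) float tensor of candidate windows."""
    fmap = conv(frame)                        # shared convolutional feature map
    stride = frame.shape[-1] / fmap.shape[-1]
    # Crop a fixed-size feature patch per candidate window from the shared map.
    crops = roi_align(fmap, [boxes], output_size=(7, 7), spatial_scale=1.0 / stride)
    return fc(crops.flatten(1))               # per-window CNN feature vectors

Computing the convolutional map once and reusing it for every window removes the per-window cost that makes naive extraction time-consuming.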
A Temporal Sequence Learning for Action Recognition and Prediction
In this work\footnote {This work was supported in part by the National
Science Foundation under grant IIS-1212948.}, we present a method to represent
a video with a sequence of words, and learn the temporal sequencing of such
words as the key information for predicting and recognizing human actions. We
leverage core concepts from the Natural Language Processing (NLP) literature
used in sentence classification to solve the problems of action prediction and
action recognition. Each frame is converted into a word that is represented as
a vector using the Bag of Visual Words (BoW) encoding method. The words are
then combined into a sentence that represents the video. The sequences of
words in different actions are learned with a simple but effective
Temporal Convolutional Neural Network (T-CNN) that captures the temporal
sequencing of information in a video sentence. We demonstrate that a key
characteristic of the proposed method is its low-latency, i.e. its ability to
predict an action accurately with a partial sequence (sentence). Experiments on
two datasets, \textit{UCF101} and \textit{HMDB51}, show that the method on
average reaches 95\% of its accuracy within half the video frames. Results
also demonstrate that our method achieves performance comparable to the state
of the art in action recognition (i.e., at the completion of the sentence) in
addition to action prediction.

Comment: 10 pages, 8 figures, 2018 IEEE Winter Conference on Applications of
Computer Vision (WACV)
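As a concrete reading of the abstract, here is a minimal sketch of a temporal CNN over per-frame BoW "words" in PyTorch; the vocabulary size, channel widths, and kernel sizes are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn

class TCNN(nn.Module):
    def __init__(self, vocab=1000, n_classes=101):
        super().__init__()
        self.net = nn.Sequential(
            # Convolve over time; each frame is one vocab-dimensional BoW vector.
            nn.Conv1d(vocab, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # pool over however many frames are available
        )
        self.fc = nn.Linear(256, n_classes)

    def forward(self, x):  # x: (batch, vocab, n_frames)
        return self.fc(self.net(x).squeeze(-1))

# Early prediction: score a partial "sentence" using only the frames seen so far.
model = TCNN()
logits = model(torch.rand(1, 1000, 30))  # e.g. the first 30 frames of a video

Because the pooling runs over whatever frames have been observed, the same model can classify a partial sequence, which is what gives the low-latency prediction behavior described above.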
Scraping social media photos posted in Kenya and elsewhere to detect and analyze food types
Monitoring population-level changes in diet could be useful for education and for implementing interventions to improve health. Research has shown that data from social media sources can be used for monitoring dietary behavior. We propose a scrape-by-location methodology to create food image datasets from Instagram posts. We used it to collect 3.56 million images over a period of 20 days in March 2019. We also propose a scrape-by-keywords methodology and used it to scrape ~30,000 images and their captions covering 38 Kenyan food types. We publish two datasets of 104,000 and 8,174 image/caption pairs, respectively. With the first dataset, Kenya104K, we train a Kenyan Food Classifier, called KenyanFC, to distinguish Kenyan food from non-food images posted in Kenya. We used the second dataset, KenyanFood13, to train a classifier KenyanFTR, short for Kenyan Food Type Recognizer, to recognize 13 popular food types in Kenya. KenyanFTR is a multimodal deep neural network that identifies the 13 types of Kenyan foods using both images and their corresponding captions. Experiments show that the average top-1 accuracy of KenyanFC is 99% over 10,400 tested Instagram images and that of KenyanFTR is 81% over 8,174 tested data points. Ablation studies show that three of the 13 food types are particularly difficult to categorize based on image content alone, and that adding analysis of captions to the image analysis yields a classifier that is 9 percentage points more accurate than a classifier that relies only on images. Our food trend analysis revealed that cakes and roasted meats were the most popular foods in photographs on Instagram in Kenya in March 2019.

Accepted manuscript
Object-Centric Open-Vocabulary Image-Retrieval with Aggregated Features
The task of open-vocabulary object-centric image retrieval involves the
retrieval of images containing a specified object of interest, delineated by an
open-set text query. As working on large image datasets becomes standard,
solving this task efficiently has gained significant practical importance.
Applications include targeted performance analysis of retrieved images using
ad-hoc queries and hard example mining during training. Recent advancements in
contrastive-based open vocabulary systems have yielded remarkable
breakthroughs, facilitating large-scale open vocabulary image retrieval.
However, these approaches use a single global embedding per image, thereby
constraining the system's ability to retrieve images containing relatively
small object instances. Alternatively, incorporating local embeddings from
detection pipelines faces scalability challenges, making it unsuitable for
retrieval from large databases.
In this work, we present a simple yet effective approach to object-centric
open-vocabulary image retrieval. Our approach aggregates dense embeddings
extracted from CLIP into a compact representation, essentially combining the
scalability of image retrieval pipelines with the object identification
capabilities of dense detection methods. We show the effectiveness of our
scheme on this task by achieving significantly better results than global
feature approaches on three datasets, increasing accuracy by up to 15 mAP
points. We further integrate our scheme into a large scale retrieval framework
and demonstrate our method's advantages in terms of scalability and
interpretability.

Comment: BMVC 202
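As an illustration of the aggregation idea, here is a minimal sketch that pools CLIP patch-level (dense) embeddings into one compact vector per image and scores it against an open-vocabulary text query, using the HuggingFace transformers CLIP API; projecting patch tokens through the visual projection and max-pooling them is a simplification for illustration, not the paper's exact aggregation.

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def image_descriptor(image):
    inputs = processor(images=image, return_tensors="pt")
    out = model.vision_model(**inputs)
    patches = model.vision_model.post_layernorm(out.last_hidden_state[:, 1:])
    patches = model.visual_projection(patches)        # (1, n_patches, d)
    patches = patches / patches.norm(dim=-1, keepdim=True)
    # Max-pool over patches: a small object that dominates only a few patches
    # can still dominate the pooled descriptor, unlike one global embedding.
    return patches.max(dim=1).values                  # (1, d) compact descriptor

@torch.no_grad()
def query_scores(text, descriptors):
    """descriptors: (N, d) stacked image descriptors; returns (N, 1) scores."""
    tok = processor(text=[text], return_tensors="pt", padding=True)
    q = model.get_text_features(**tok)
    q = q / q.norm(dim=-1, keepdim=True)
    return descriptors @ q.T                          # similarity per image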