Exploiting Deep Features for Remote Sensing Image Retrieval: A Systematic Investigation
Remote sensing (RS) image retrieval is of great significance for geological
information mining. Over the past two decades, a large amount of research on
this task has been carried out, which mainly focuses on the following three
core issues: feature extraction, similarity metric and relevance feedback. Due
to the complexity and multiformity of ground objects in high-resolution remote
sensing (HRRS) images, there is still room for improvement in the current
retrieval approaches. In this paper, we analyze the three core issues of RS
image retrieval and provide a comprehensive review on existing methods.
Furthermore, to advance the state of the art in HRRS image retrieval, we focus
on the feature extraction issue and delve into how powerful deep
representations can be used to address this task. We conduct a systematic
investigation of the correlative factors that may affect the performance
of deep features. By optimizing each factor, we acquire remarkable retrieval
results on publicly available HRRS datasets. Finally, we explain the
experimental phenomena in detail and draw conclusions from our
analysis. Our work can serve as a guide for research on content-based RS image
retrieval.
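Below is a minimal sketch of the deep-feature retrieval pipeline the abstract outlines: a pretrained CNN serves as the feature extractor and database images are ranked by a cosine similarity metric. The ResNet-50 backbone, the preprocessing and the similarity metric are illustrative assumptions, not the configuration studied in the paper.

```python
# Sketch: deep features for content-based image retrieval (assumed setup).
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

# Pretrained backbone with the classification head removed -> 2048-d features.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_feature(path: str) -> torch.Tensor:
    """L2-normalized deep feature for a single image."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return F.normalize(backbone(x), dim=1).squeeze(0)

def retrieve(query_path: str, db_paths: list[str], top_k: int = 10):
    """Rank database images by cosine similarity to the query."""
    q = extract_feature(query_path)
    db = torch.stack([extract_feature(p) for p in db_paths])
    scores = db @ q                      # cosine similarity (unit-norm features)
    order = torch.argsort(scores, descending=True)[:top_k]
    return [(db_paths[i], scores[i].item()) for i in order]
```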
Surgical Phase Recognition of Short Video Shots Based on Temporal Modeling of Deep Features
Recognizing the phases of a laparoscopic surgery (LS) operation from its
video constitutes a fundamental step for efficient content representation,
indexing and retrieval in surgical video databases. In the literature, most
techniques focus on phase segmentation of the entire LS video using
hand-crafted visual features, instrument usage signals, and recently
convolutional neural networks (CNNs). In this paper we address the problem of
phase recognition of short video shots (10s) of the operation, without
utilizing information about the preceding/forthcoming video frames, their phase
labels or the instruments used. We investigate four state-of-the-art CNN
architectures (AlexNet, VGG19, GoogLeNet, and ResNet101) for feature
extraction via transfer learning. Visual saliency was employed for selecting
the most informative region of the image as input to the CNN. Video shot
representation was based on two temporal pooling mechanisms. Most importantly,
we investigate the role of 'elapsed time' (from the beginning of the
operation), and we show that inclusion of this feature can increase performance
dramatically (from 69% to 75% mean accuracy). Finally, a long short-term memory
(LSTM) network was trained for video shot classification based on the fusion of
CNN features with 'elapsed time', increasing the accuracy to 86%. Our results
highlight the prominent role of visual saliency, long-range temporal recursion
and 'elapsed time' (a feature so far ignored), for surgical phase recognition.
Comment: 6 pages, 4 figures, 6 tables
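A minimal sketch of the shot-level classification stage described above: per-frame CNN features are fused with a normalized 'elapsed time' scalar and classified by an LSTM. The feature dimension, hidden size and number of surgical phases are placeholder assumptions, not the paper's exact settings.

```python
# Sketch: LSTM over CNN features fused with 'elapsed time' (assumed dimensions).
import torch
import torch.nn as nn

class ShotPhaseLSTM(nn.Module):
    def __init__(self, feat_dim: int = 2048, hidden: int = 256, n_phases: int = 7):
        super().__init__()
        # +1 input channel for the elapsed-time value appended to each frame.
        self.lstm = nn.LSTM(feat_dim + 1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_phases)

    def forward(self, frame_feats: torch.Tensor, elapsed: torch.Tensor):
        # frame_feats: (batch, n_frames, feat_dim) CNN features of the shot
        # elapsed:     (batch,) fraction of the operation elapsed, in [0, 1]
        t = elapsed[:, None, None].expand(-1, frame_feats.size(1), 1)
        x = torch.cat([frame_feats, t], dim=-1)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])          # phase logits per shot

# Example: 4 shots of 25 frames each, features from a pretrained CNN.
model = ShotPhaseLSTM()
logits = model(torch.randn(4, 25, 2048), torch.rand(4))
print(logits.shape)  # torch.Size([4, 7])
```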
Dynamic Unary Convolution in Transformers
It is uncertain whether the power of transformer architectures can complement existing convolutional neural networks. A few recent attempts have combined convolution with transformer design through a range of structures in series, whereas the main contribution of this paper is to explore a parallel design approach. While previous transformer-based approaches need to segment the image into patch-wise tokens, we observe that the multi-head self-attention conducted on convolutional features is mainly sensitive to global correlations and that the performance degrades when these correlations are not exhibited. We propose two parallel modules along with multi-head self-attention to enhance the transformer. For local information, a dynamic local enhancement module leverages convolution to dynamically and explicitly enhance positive local patches and suppress the response to less informative ones. For mid-level structure, a novel unary co-occurrence excitation module utilizes convolution to actively search for the local co-occurrence between patches. The parallel-designed Dynamic Unary Convolution in Transformer (DUCT) blocks are aggregated into a deep architecture, which is comprehensively evaluated across essential computer vision tasks in image-based classification, segmentation, retrieval and density estimation. Both qualitative and quantitative results show that our parallel convolutional-transformer approach with dynamic and unary convolution outperforms existing series-designed structures.
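The parallel design can be illustrated with a small block that runs multi-head self-attention for global correlations alongside a gated depthwise convolution for local enhancement. This is a hand-written approximation under stated assumptions (channel count, head count, gating design), not the paper's DUCT block.

```python
# Sketch: parallel convolution + self-attention block (assumed design).
import torch
import torch.nn as nn

class ParallelConvAttentionBlock(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Local branch: depthwise conv plus a sigmoid gate that re-weights patches.
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.Conv2d(dim, dim, 1),
        )
        self.gate = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, H, W) convolutional feature map
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (batch, H*W, dim)
        glob, _ = self.attn(tokens, tokens, tokens)    # global correlations
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        local = self.local(x) * self.gate(x)           # gated local enhancement
        return x + glob + local                        # parallel fusion

feat = torch.randn(2, 64, 14, 14)
out = ParallelConvAttentionBlock()(feat)
print(out.shape)  # torch.Size([2, 64, 14, 14])
```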
Fine-Grained Product Class Recognition for Assisted Shopping
Assistive solutions for a better shopping experience can improve the quality
of life of people, in particular of visually impaired shoppers. We present
a system that visually recognizes the fine-grained product classes of items on
a shopping list in shelf images taken with a smartphone in a grocery store.
Our system consists of three components: (a) We automatically recognize useful
text on product packaging, e.g., product name and brand, and build a mapping of
words to product classes based on the large-scale GroceryProducts dataset. When
the user populates the shopping list, we automatically infer the product class
of each entered word. (b) We perform fine-grained product class recognition
when the user is facing a shelf. We discover discriminative patches on product
packaging to differentiate between visually similar product classes and to
increase the robustness against continuous changes in product design. (c) We
continuously improve the recognition accuracy through active learning. Our
experiments show the robustness of the proposed method against cross-domain
challenges, and the scalability to an increasing number of products with
minimal re-training.
Comment: Accepted at ICCV Workshop on Assistive Computer Vision and Robotics (ICCV-ACVR) 201
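Component (a), the word-to-class mapping, can be sketched as a simple inverted index from packaging vocabulary to product classes. The toy catalogue entries below are hypothetical; the paper builds the mapping from the large-scale GroceryProducts dataset.

```python
# Sketch: mapping shopping-list words to product classes (toy catalogue).
from collections import defaultdict

# Hypothetical catalogue: product class -> words seen on its packaging.
catalogue = {
    "coffee/espresso": ["espresso", "roast", "arabica"],
    "coffee/instant":  ["instant", "coffee", "classic"],
    "pasta/spaghetti": ["spaghetti", "pasta", "durum"],
}

word_to_classes = defaultdict(set)
for cls, words in catalogue.items():
    for w in words:
        word_to_classes[w.lower()].add(cls)

def infer_classes(shopping_list_entry: str) -> set[str]:
    """Product classes whose packaging vocabulary matches the entered words."""
    classes = set()
    for word in shopping_list_entry.lower().split():
        classes |= word_to_classes.get(word, set())
    return classes

print(infer_classes("instant coffee"))  # {'coffee/instant'}
```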
Adversarial Virtual Exemplar Learning for Label-Frugal Satellite Image Change Detection
Satellite image change detection aims at finding occurrences of targeted
changes in a given scene taken at different instants. This task is highly
challenging due to the acquisition conditions and also to the subjectivity of
changes. In this paper, we investigate satellite image change detection using
active learning. Our method is interactive and relies on a question and answer
model that asks the oracle (user) questions about the most informative display
(dubbed virtual exemplars) and, according to the user's responses, updates the
change detections. The main contribution of our method is a novel adversarial
model that allows frugally probing the oracle with only the most
representative, diverse and uncertain virtual exemplars. The latter are learned
to maximally challenge the trained change decision criteria, which ultimately
leads to a better re-estimation of these criteria in subsequent iterations of
active learning. The conducted experiments show that our proposed adversarial
display model outperforms other display strategies as well as the related work.
Comment: arXiv admin note: substantial text overlap with arXiv:2203.1155
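The core idea of probing the oracle with uncertain virtual exemplars can be sketched as gradient-based optimization of exemplar features toward maximum predictive entropy of the current change classifier. The entropy-only objective (dropping the representativity and diversity terms) and the feature dimensionality are deliberate simplifications, not the paper's adversarial criterion.

```python
# Sketch: learning uncertain "virtual exemplars" to show the oracle (simplified).
import torch
import torch.nn.functional as F

def learn_virtual_exemplars(classifier, feat_dim=128, n_exemplars=8, steps=100):
    """Optimize virtual exemplars toward maximum predictive entropy."""
    exemplars = torch.randn(n_exemplars, feat_dim, requires_grad=True)
    opt = torch.optim.Adam([exemplars], lr=0.05)
    for _ in range(steps):
        probs = F.softmax(classifier(exemplars), dim=1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
        loss = -entropy                      # maximize classifier uncertainty
        opt.zero_grad()
        loss.backward()
        opt.step()
    return exemplars.detach()                # displayed to the oracle for labeling

# Example with a toy binary change / no-change classifier.
clf = torch.nn.Linear(128, 2)
display = learn_virtual_exemplars(clf)
print(display.shape)  # torch.Size([8, 128])
```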
Efficient Video Transformers with Spatial-Temporal Token Selection
Video transformers have achieved impressive results on major video
recognition benchmarks, but they suffer from high computational cost. In
this paper, we present STTS, a token selection framework that dynamically
selects a few informative tokens in both temporal and spatial dimensions
conditioned on input video samples. Specifically, we formulate token selection
as a ranking problem that estimates the importance of each token through a
lightweight scorer network; only the tokens with the top scores are used for
downstream evaluation. In the temporal dimension, we keep the frames that are
most relevant to the action categories, while in the spatial dimension, we
identify the most discriminative region in feature maps without affecting the
spatial context used in a hierarchical way in most video transformers. Since
the decision of token selection is non-differentiable, we employ a
perturbed-maximum based differentiable Top-K operator for end-to-end training.
We mainly conduct extensive experiments on Kinetics-400 with a recently
introduced video transformer backbone, MViT. Our framework achieves similar
results while requiring 20% less computation. We also demonstrate our approach
is generic for different transformer architectures and video datasets. Code is
available at https://github.com/wangjk666/STTS.
Comment: Accepted by ECCV 202
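The token-selection step can be sketched as a lightweight scorer followed by a hard Top-K at inference time. The paper's perturbed-maximum differentiable Top-K operator, needed for end-to-end training, is omitted here, and the scorer architecture and dimensions are assumptions.

```python
# Sketch: scorer-based token selection with a hard Top-K (inference only).
import torch
import torch.nn as nn

class TokenSelector(nn.Module):
    def __init__(self, dim: int = 96, keep: int = 64):
        super().__init__()
        self.keep = keep
        self.scorer = nn.Sequential(nn.Linear(dim, dim // 4),
                                    nn.GELU(),
                                    nn.Linear(dim // 4, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_tokens, dim) spatio-temporal tokens
        scores = self.scorer(tokens).squeeze(-1)             # (batch, n_tokens)
        idx = scores.topk(self.keep, dim=1).indices          # hard Top-K
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, idx)                         # keep selected tokens

selector = TokenSelector(dim=96, keep=64)
kept = selector(torch.randn(2, 392, 96))
print(kept.shape)  # torch.Size([2, 64, 96])
```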