Divide and Fuse: A Re-ranking Approach for Person Re-identification
Re-ranking is a necessary step for boosting person re-identification
(re-ID) performance on large-scale datasets, so feature diversity becomes
crucial to person re-ID, both for designing pedestrian
descriptions and for re-ranking based on feature fusion. However, in many
circumstances, only one type of pedestrian feature is available. In this paper,
we propose a "Divide and Fuse" re-ranking framework for person re-ID. It
exploits the diversity from different parts of a high-dimensional feature
vector for fusion-based re-ranking, even when no other features are accessible.
Specifically, given an image, the extracted feature is divided into
sub-features. Then the contextual information of each sub-feature is
iteratively encoded into a new feature. Finally, the new features from the same
image are fused into one vector for re-ranking. Experimental results on two
person re-ID benchmarks demonstrate the effectiveness of the proposed
framework. In particular, our method outperforms the state of the art on the
Market-1501 dataset. Comment: Accepted by BMVC201
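The divide, encode, and fuse steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the k-nearest-neighbour context encoding, the single (non-iterative) encoding pass, and fusion by concatenation are all assumptions chosen for illustration, and the features are random toy data.

```python
import numpy as np

def split_features(features, num_parts):
    """Divide each row (one image's feature vector) into equal-width sub-features."""
    return np.split(features, num_parts, axis=1)

def encode_context(sub, k):
    """Hypothetical context encoding: describe each sample by its cosine
    similarity to its k nearest neighbours (zeros everywhere else)."""
    normed = sub / np.linalg.norm(sub, axis=1, keepdims=True)
    sim = normed @ normed.T                    # cosine similarity, shape (n, n)
    top_k = np.argsort(-sim, axis=1)[:, :k]    # indices of the k most similar samples
    rows = np.arange(sim.shape[0])[:, None]
    encoded = np.zeros_like(sim)
    encoded[rows, top_k] = sim[rows, top_k]
    return encoded

def divide_and_fuse(features, num_parts=4, k=3):
    """Split, encode each sub-feature independently, then fuse by concatenation."""
    parts = split_features(features, num_parts)
    encoded = [encode_context(part, k) for part in parts]
    return np.concatenate(encoded, axis=1)

# Toy data: 10 images with 32-dimensional features (assumed sizes).
rng = np.random.default_rng(0)
features = rng.normal(size=(10, 32))
fused = divide_and_fuse(features)
```

Each image ends up with one fused vector built from several neighbourhood encodings of the same original feature, which is the source of diversity the abstract refers to; a ranking could then be computed from distances between rows of `fused`.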
Target-Tailored Source-Transformation for Scene Graph Generation
Scene graph generation aims to provide a semantic and structural description
of an image, denoting the objects (with nodes) and their relationships (with
edges). The best performing works to date are based on exploiting the context
surrounding objects or relations, e.g., by passing information among objects. In
these approaches, transforming the representation of source objects is a
critical process for extracting information for use by target objects. In
this work, we argue that a source object should give what the target object needs
and give different objects different information rather than contributing
common information to all targets. To achieve this goal, we propose a
Target-Tailored Source-Transformation (TTST) method to efficiently propagate
information among object proposals and relations. Particularly, for a source
object proposal which will contribute information to other target objects, we
transform the source object feature to the target object feature domain by
simultaneously taking both the source and target into account. We further
explore more powerful representations by integrating language prior with the
visual context in the transformation for the scene graph generation. By doing
so the target object is able to extract target-specific information from the
source object and source relation accordingly to refine its representation. Our
framework is validated on the Visual Genome benchmark and demonstrates
state-of-the-art performance for scene graph generation. The experimental
results show that the performance of object detection and visual relationship
detection is mutually improved by our method.
Deep Image Retrieval: A Survey
In recent years a vast amount of visual content has been generated and shared
from various fields, such as social media platforms, medical images, and
robotics. This abundance of content creation and sharing has introduced new
challenges. In particular, searching databases for similar content, i.e., content-based
image retrieval (CBIR), is a long-established research area, and more
efficient and accurate methods are needed for real-time retrieval. Artificial
intelligence has made progress in CBIR and has significantly facilitated the
process of intelligent search. In this survey we organize and review recent
CBIR works that are developed based on deep learning algorithms and techniques,
including insights and techniques from recent papers. We identify and present
the commonly used benchmarks and evaluation methods in the field. We
collect common challenges and propose promising future directions. More
specifically, we focus on image retrieval with deep learning and organize the
state-of-the-art methods according to the types of deep network structure, deep
features, feature enhancement methods, and network fine-tuning strategies. Our
survey considers a wide variety of recent methods, aiming to promote a global
view of the field of instance-based CBIR. Comment: 20 pages, 11 figures
MoMask: Generative Masked Modeling of 3D Human Motions
We introduce MoMask, a novel masked modeling framework for text-driven 3D
human motion generation. In MoMask, a hierarchical quantization scheme is
employed to represent human motion as multi-layer discrete motion tokens with
high-fidelity details. Starting at the base layer, with a sequence of motion
tokens obtained by vector quantization, the residual tokens of increasing
orders are derived and stored at the subsequent layers of the hierarchy. This
is followed by two distinct bidirectional transformers. For the
base-layer motion tokens, a Masked Transformer is designated to predict
randomly masked motion tokens conditioned on text input at the training stage.
During the generation (i.e., inference) stage, starting from an empty sequence, our
Masked Transformer iteratively fills in the missing tokens. Subsequently, a
Residual Transformer learns to progressively predict the next-layer tokens
based on the results from current layer. Extensive experiments demonstrate that
MoMask outperforms the state-of-the-art methods on the text-to-motion generation
task, with an FID of 0.045 (vs e.g. 0.141 of T2M-GPT) on the HumanML3D dataset,
and 0.228 (vs 0.514) on KIT-ML, respectively. MoMask can also be seamlessly
applied in related tasks without further model fine-tuning, such as text-guided
temporal inpainting. Comment: Project webpage: https://ericguo5513.github.io/momask
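The hierarchical quantization described in this abstract (base-layer tokens from vector quantization, with residual tokens stored at later layers) follows the general pattern of residual vector quantization, which can be sketched as below. This is a toy illustration, not MoMask's code: the codebooks are random rather than learned, and the function names and array sizes are assumptions.

```python
import numpy as np

def nearest_code(x, codebook):
    """Return, for each row of x, the index of the closest codebook entry."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def residual_quantize(x, codebooks):
    """Quantize x layer by layer; each layer encodes what earlier layers left over."""
    residual = x.copy()
    tokens = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        tokens.append(idx)                     # discrete tokens for this layer
        residual = residual - codebook[idx]    # pass the remainder to the next layer
    return np.stack(tokens, axis=0), residual

def reconstruct(tokens, codebooks):
    """Sum the selected codebook entries across all layers."""
    return sum(codebook[idx] for codebook, idx in zip(codebooks, tokens))

# Toy data: 6 frames of an 8-dimensional latent, 3 layers of 16 codes each
# (all sizes are assumed; real codebooks would be learned, not random).
rng = np.random.default_rng(0)
motion = rng.normal(size=(6, 8))
codebooks = [rng.normal(size=(16, 8)) * scale for scale in (1.0, 0.5, 0.25)]
tokens, residual = residual_quantize(motion, codebooks)
approx = reconstruct(tokens, codebooks)
```

Here `tokens[0]` plays the role of the base-layer token sequence and `tokens[1:]` the residual layers; by construction, `approx + residual` recovers the original input exactly.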
SWBT: Similarity Weighted Behavior Transformer with the Imperfect Demonstration for Robotic Manipulation
Imitation learning (IL), aiming to learn optimal control policies from expert
demonstrations, has been an effective method for robot manipulation tasks.
However, previous IL methods either only use expensive expert demonstrations
and omit imperfect demonstrations or rely on interacting with the environment
and learning from online experiences. In the context of robotic manipulation,
we aim to overcome these two challenges and propose a novel framework named
Similarity Weighted Behavior Transformer (SWBT). SWBT effectively learns from
both expert and imperfect demonstrations without interaction with environments.
We show that easy-to-obtain imperfect demonstrations, such as forward and
inverse dynamics, significantly enhance the network by providing useful
information. To the best of our knowledge, we are the first to attempt to
integrate imperfect demonstrations into the offline imitation learning setting
for robot manipulation tasks. Extensive experiments on the ManiSkill2 benchmark
built on the high-fidelity Sapien simulator and real-world robotic manipulation
tasks demonstrate that the proposed method can extract better features and
improve the success rates for all tasks. Our code will be released upon
acceptance of the paper. Comment: 8 pages, 5 figures