Retrieve Anyone: A General-purpose Person Re-identification Task with Instructions
Humans can retrieve any person according to both visual and language descriptions. However, the computer vision community currently studies specific person re-identification (ReID) tasks for different scenarios separately, which limits real-world applications. This paper strives to resolve this problem by proposing a new instruct-ReID task that requires the model to retrieve images according to given image or language instructions. Our instruct-ReID is a more general ReID setting, in which existing ReID tasks can be viewed as special cases obtained by designing different instructions. We propose a large-scale OmniReID benchmark and an adaptive triplet loss as a baseline method to facilitate research in this new setting. Experimental results show that the baseline model trained on our OmniReID benchmark improves mAP by +0.5% and +3.3% on Market1501 and CUHK03 for traditional ReID; by +2.1%, +0.2%, and +15.3% on PRCC, VC-Clothes, and LTCC for clothes-changing ReID; by +12.5% on COCAS+ real2 for clothes-template-based clothes-changing ReID when using only RGB images; and by +25.5% on COCAS+ real2 for our newly defined language-instructed ReID. The dataset, model, and code will be available at https://github.com/hwz-zju/Instruct-ReID
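The adaptive triplet loss mentioned above builds on the standard triplet objective; the following is only a rough sketch of that standard starting point (the margin, batch shapes, and any adaptive reweighting are illustrative assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet loss on L2-normalized embeddings.

    The adaptive variant mentioned in the abstract presumably adjusts the
    margin or sample weighting; that part is not reproduced here
    (assumption: its details differ from this sketch).
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negative = F.normalize(negative, dim=-1)
    d_ap = (anchor - positive).pow(2).sum(-1)  # anchor-positive distance
    d_an = (anchor - negative).pow(2).sum(-1)  # anchor-negative distance
    return F.relu(d_ap - d_an + margin).mean()

# toy usage: random 256-d embeddings for a batch of 8 triplets
a, p, n = (torch.randn(8, 256) for _ in range(3))
loss = triplet_loss(a, p, n)
```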
A survey on generative adversarial networks for imbalance problems in computer vision tasks
Any computer vision application development starts by acquiring images and data, followed by preprocessing and pattern recognition steps to perform a task. When the acquired images are highly imbalanced or inadequate, the desired task may not be achievable. Unfortunately, in certain complex real-world problems such as anomaly detection, emotion recognition, medical image analysis, fraud detection, metallic surface defect detection, and disaster prediction, the occurrence of imbalance problems in acquired image datasets is inevitable. The performance of computer vision algorithms can deteriorate significantly when the training dataset is imbalanced. In recent years, Generative Adversarial Networks (GANs) have gained immense attention from researchers across a variety of application domains due to their capability to model complex real-world image data. Notably, GANs can not only be used to generate synthetic images; their adversarial learning idea has also shown good potential for restoring balance in imbalanced datasets. In this paper, we examine the most recent developments in GAN-based techniques for addressing imbalance problems in image data. The real-world challenges and implementations of synthetic image generation based on GANs are extensively covered in this survey. Our survey first introduces various imbalance problems in computer vision tasks and their existing solutions, and then examines key concepts such as deep generative image models and GANs. After that, we propose a taxonomy that summarizes GAN-based techniques for addressing imbalance problems in computer vision tasks into three major categories: (1) image-level imbalances in classification, (2) object-level imbalances in object detection, and (3) pixel-level imbalances in segmentation tasks. We elaborate on the imbalance problems in each category and provide GAN-based solutions for them. Readers will understand how GAN-based techniques can handle imbalance problems and boost the performance of computer vision algorithms.
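To make the core rebalancing idea concrete, here is a minimal, hedged sketch of drawing extra minority-class samples from an already-trained conditional generator; the generator interface G(z, y), the latent dimension, and the tensor shapes are assumptions for illustration, not taken from any specific surveyed method:

```python
import torch

@torch.no_grad()
def oversample_minority(G, minority_label, n_needed, latent_dim=100, device="cpu"):
    """Sample n_needed synthetic images of one minority class from a trained
    conditional generator G(z, y) -> image tensor.

    G, its signature, and the output shape are illustrative assumptions;
    real GAN-based rebalancing pipelines vary across the surveyed papers.
    """
    z = torch.randn(n_needed, latent_dim, device=device)
    y = torch.full((n_needed,), minority_label, dtype=torch.long, device=device)
    fake_images = G(z, y)  # e.g. a tensor of shape (n_needed, 3, 64, 64)
    return fake_images, y

# The synthetic (image, label) pairs would then be concatenated with the real
# minority samples before training the downstream classifier or detector.
```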
Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark
In this paper, we introduce a large Multi-Attribute and Language Search dataset for text-based person retrieval, called MALS, and explore the feasibility of pre-training on attribute recognition and image-text matching jointly. In particular, MALS contains 1,510,330 image-text pairs, which is about 37.5 times larger than the prevailing CUHK-PEDES dataset, and all images are annotated with 27 attributes. Considering privacy concerns and annotation costs, we leverage off-the-shelf diffusion models to generate the dataset. To verify the feasibility of learning from the generated data, we develop a new joint Attribute Prompt Learning and Text Matching Learning (APTM) framework that exploits the knowledge shared between attributes and text. As the name implies, APTM contains an attribute prompt learning stream and a text matching learning stream. (1) The attribute prompt learning stream leverages attribute prompts for image-attribute alignment, which enhances the text matching learning. (2) The text matching learning stream facilitates representation learning on fine-grained details and, in turn, boosts the attribute prompt learning. Extensive experiments validate the effectiveness of pre-training on MALS, achieving state-of-the-art retrieval performance via APTM on three challenging real-world benchmarks. In particular, APTM achieves consistent improvements of +6.96%, +7.68%, and +16.95% in Recall@1 accuracy on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, respectively.
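A rough sketch of how the two streams could be combined in a single training objective follows; the embedding shapes, the symmetric contrastive formulation, the temperature, and the weighting alpha are illustrative assumptions, not the released APTM implementation:

```python
import torch
import torch.nn.functional as F

def aptm_style_loss(img_emb, txt_emb, attr_prompt_emb, attr_labels,
                    temperature=0.07, alpha=0.5):
    """Illustrative two-stream objective: image-text matching plus
    attribute-prompt alignment.

    img_emb:         (B, D) image embeddings
    txt_emb:         (B, D) caption embeddings, paired row-wise with images
    attr_prompt_emb: (A, D) embeddings of attribute prompts
    attr_labels:     (B, A) multi-hot attribute annotations
    All interfaces and the weighting alpha are assumptions for illustration.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    attr_prompt_emb = F.normalize(attr_prompt_emb, dim=-1)

    # text matching stream: symmetric contrastive loss over the batch
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    itm = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    # attribute prompt stream: multi-label alignment between images and prompts
    attr_logits = img_emb @ attr_prompt_emb.t() / temperature
    apl = F.binary_cross_entropy_with_logits(attr_logits, attr_labels.float())

    return itm + alpha * apl
```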
Can adversarial networks hallucinate occluded people with a plausible aspect?
When you see a person in a crowd, occluded by other people, you miss visual information that could be used to recognize, re-identify, or simply classify them. You can only imagine their appearance from your experience, nothing more. Similarly, AI solutions can try to hallucinate the missing information with specific deep learning architectures, suitably trained on people with and without occlusions. The goal of this work is to generate a complete image of a person, given an occluded version as input, that should be (a) without occlusion, (b) similar at the pixel level to the completely visible person shape, and (c) able to preserve the visual attributes (e.g. male/female) of the original person. For this purpose, we propose a new approach that integrates state-of-the-art neural network architectures, namely U-Nets and GANs, as well as discriminative attribute classification nets, into an architecture specifically designed to de-occlude person shapes. The network is trained to optimize a loss function that takes into account the aforementioned objectives. We also propose two datasets for testing our solution: the first, occluded RAP, is created automatically by occluding real shapes from the RAP dataset of Li et al. (2016) (which also collects attributes of people's appearance); the second, AiC, is a large synthetic dataset generated in computer graphics from data extracted from the GTA video game, which contains 3D data of occluded objects by construction. Results are impressive and outperform previous proposals. This result could be an initial step toward further research on recognizing people and their behavior in an open, crowded world.
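A hedged sketch of a combined objective in the spirit described above (pixel similarity to the fully visible person, an adversarial realism term, and attribute preservation) follows; the weights, the L1 choice for the pixel term, and the discriminator/attribute-net interfaces are assumptions rather than the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def deocclusion_loss(generated, target, disc_logits_fake, attr_logits, attr_labels,
                     w_pix=1.0, w_adv=0.1, w_attr=0.5):
    """Illustrative de-occlusion objective for a U-Net-style generator.

    generated:        (B, 3, H, W) images produced from occluded inputs
    target:           (B, 3, H, W) fully visible ground-truth person images
    disc_logits_fake: (B, 1) discriminator logits on the generated images
    attr_logits:      (B, A) attribute-classifier logits on the generated images
    attr_labels:      (B, A) multi-hot attributes of the original person
    Terms, weights, and interfaces are assumptions for illustration.
    """
    # (a)/(b) pixel-level similarity to the unoccluded ground truth
    pix = F.l1_loss(generated, target)
    # adversarial term: encourage the discriminator to judge outputs as real
    adv = F.binary_cross_entropy_with_logits(disc_logits_fake,
                                             torch.ones_like(disc_logits_fake))
    # (c) preserve the visual attributes (e.g. male/female) of the original person
    attr = F.binary_cross_entropy_with_logits(attr_logits, attr_labels.float())
    return w_pix * pix + w_adv * adv + w_attr * attr
```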