Visual Pretraining on Large-Scale Image Datasets
This thesis focuses on large-scale visual pretraining in computer vision and addresses various
limitations of previous approaches. It introduces a novel technique called Relative Contrastive Loss
(RCL) to learn feature representations that encompass real-world semantic variations while
respecting positive-negative relativeness. The thesis also presents UniVCL, a unified framework for
unsupervised visual contrastive learning methods, leveraging a graph convolutional network (GCN)
layer for improved object recognition accuracy. Additionally, the thesis explores the transferability gap
between unsupervised and supervised pretraining, emphasizing the role of the multilayer perceptron
(MLP) projector in enhancing transfer performance. HumanBench, a comprehensive benchmark for
human-centric downstream tasks, is proposed, and a pretraining method called PATH is introduced
to learn knowledge of human bodies. The findings confirm the effectiveness of the proposed methods
in enhancing the practicality and performance of large-scale visual pretraining.
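The abstract does not spell out the form of the Relative Contrastive Loss; as a rough illustration of the idea of respecting positive-negative relativeness, the sketch below replaces the hard positive/negative labels of a standard InfoNCE-style contrastive loss with per-candidate "relativeness" weights. All names and the exact weighting scheme are assumptions for illustration, not the paper's formulation.

```python
import math

def relative_contrastive_loss(anchor, candidates, relativeness, temperature=0.1):
    """Toy InfoNCE-style loss where each candidate carries a 'relativeness'
    weight in [0, 1] instead of a hard positive/negative label.
    All vectors are plain Python lists; similarity is the dot product."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Softmax over temperature-scaled similarities to all candidates.
    logits = [dot(anchor, c) / temperature for c in candidates]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Weighted cross-entropy: candidates deemed more 'relative' to the
    # anchor should receive more probability mass.
    w_total = sum(relativeness)
    return -sum(w * math.log(p) for w, p in zip(relativeness, probs)) / w_total
```

With this weighting, pulling the anchor toward a semantically close candidate lowers the loss more than pulling it toward a distant one, which is the relativeness intuition the abstract describes.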
Continual representation learning for biometric identification
With the explosion of digital data in recent years, continuously learning new
tasks from a stream of data without forgetting previously acquired knowledge
has become increasingly important. In this paper, we propose a new continual
learning (CL) setting, namely ``continual representation learning'', which
focuses on learning better representations in a continuous way. We also provide
two large-scale multi-step benchmarks for biometric identification, where the
visual appearances of different classes are closely related. Rather than
requiring the model to recognize more learned classes, we aim to learn feature
representations that generalize better not only to previously unseen
images but also to unseen classes/identities. For the new setting, we propose a
novel approach that performs knowledge distillation over a large number of
identities, applying neighbourhood selection and consistency relaxation
strategies to improve the scalability and flexibility of the continual
learning model. We demonstrate that existing CL methods can improve the
representation in the new setting, and that our method outperforms competing
approaches.
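The abstract names neighbourhood selection and consistency relaxation but not their exact form; the sketch below is one minimal reading of those two ideas, distilling the old model's similarity structure only over an anchor's nearest identities and ignoring discrepancies below a slack threshold. The function name, the top-k selection, and the hinge-style slack are illustrative assumptions.

```python
def neighbourhood_distillation(old_feats, new_feats, anchor_idx, k=2, slack=0.05):
    """Toy sketch: distil similarity structure from an old embedding into a
    new one, but only over the anchor's k nearest neighbours in the old
    space (neighbourhood selection), and ignore discrepancies smaller than
    `slack` (consistency relaxation). Features are plain Python lists."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    anchor_old = old_feats[anchor_idx]
    anchor_new = new_feats[anchor_idx]

    # Neighbourhood selection: keep only the k identities most similar to
    # the anchor under the old model, instead of distilling over all of them.
    sims_old = [(dot(anchor_old, f), i)
                for i, f in enumerate(old_feats) if i != anchor_idx]
    neighbours = [i for _, i in sorted(sims_old, reverse=True)[:k]]

    # Consistency relaxation: penalise only discrepancies above the slack,
    # so the new model is not forced to copy the old one exactly.
    loss = 0.0
    for i in neighbours:
        gap = abs(dot(anchor_old, old_feats[i]) - dot(anchor_new, new_feats[i]))
        loss += max(0.0, gap - slack)
    return loss / k
```

Restricting distillation to a neighbourhood keeps the cost bounded as the number of identities grows, which is the scalability concern the abstract raises.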
Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy
Background: It is essential to discover protein function from novel primary
sequences. Wet-lab experimental procedures are not only time-consuming but
also costly, so reliably predicting protein structure and function from the
amino acid sequence alone has significant value. TATA-binding protein (TBP)
is a DNA-binding protein that plays a key role in transcription regulation.
Our study proposes an automatic approach for identifying TATA-binding
proteins efficiently, accurately, and conveniently, which can also guide the
identification of other specific proteins with computational intelligence
strategies. Results: First, we propose novel
fingerprint features for TBP based on pseudo amino acid composition,
physicochemical properties, and secondary structure. Second, hierarchical
feature dimensionality reduction strategies are employed to further improve
performance. Pretata currently achieves 92.92% TATA-binding protein
prediction accuracy, better than all other existing methods.
Conclusions: The experiments demonstrate that our method greatly improves
prediction accuracy and speed, making large-scale NGS data prediction
practical. A web server has been developed to support other researchers and
can be accessed at http://server.malab.cn/preTata/.
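As a rough sketch of the pipeline this abstract describes, the snippet below computes the simplest ingredient of composition-style fingerprint features (per-residue amino acid fractions) and applies a crude variance-based column pruning as a stand-in for the hierarchical dimensionality reduction step. Pretata's actual features and reduction strategy are richer; everything here is a simplified illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(sequence):
    """Fraction of each of the 20 standard amino acids in a sequence:
    the most basic building block of composition-based fingerprint features."""
    sequence = sequence.upper()
    n = max(len(sequence), 1)
    return [sequence.count(aa) / n for aa in AMINO_ACIDS]

def prune_low_variance(feature_rows, keep):
    """Crude stand-in for a dimensionality-reduction step: keep the `keep`
    feature columns with the highest variance across samples."""
    n = len(feature_rows)
    dims = len(feature_rows[0])
    variances = []
    for d in range(dims):
        col = [row[d] for row in feature_rows]
        mean = sum(col) / n
        variances.append((sum((x - mean) ** 2 for x in col) / n, d))
    kept = sorted(d for _, d in sorted(variances, reverse=True)[:keep])
    return [[row[d] for d in kept] for row in feature_rows]
```

In a full system, the reduced feature vectors would then feed a classifier trained to separate TBPs from non-TBPs.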
Collective Intelligence for Object Manipulation with Mobile Robots
While natural systems often present collective intelligence that allows them
to self-organize and adapt to changes, the equivalent is missing in most
artificial systems. We explore the possibility of such a system in the context
of cooperative object manipulation using mobile robots. Although conventional
works demonstrate potential solutions for the problem in restricted settings,
they have computational and learning difficulties. More importantly, these
systems do not possess the ability to adapt when facing environmental changes.
In this work, we show that by distilling a planner derived from a
gradient-based soft-body physics simulator into an attention-based neural
network, our multi-robot manipulation system can achieve better performance
than baselines. In addition, our system generalizes to configurations unseen
during training and adapts toward task completion when external disturbances
and environmental changes are applied.
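The core component of the student described here is an attention-based network, which is what lets the policy handle varying numbers of robots and objects. The sketch below is a plain single-head scaled dot-product attention step, the generic building block such a distilled network would stack; the actual architecture and training pipeline in the paper are assumptions beyond this.

```python
import math

def scaled_dot_attention(queries, keys, values):
    """Single-head scaled dot-product attention over plain Python lists:
    the core block of the kind of attention-based student network a slow
    physics-simulator planner could be distilled into."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Attention-weighted average of the values.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

Because the output for each query is a weighted sum over all keys, the same network weights apply regardless of how many robot or object tokens are present, which is one reason attention suits multi-robot settings.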
DetToolChain: a new prompting paradigm to unleash detection ability of MLLM
We present DetToolChain, a novel prompting paradigm that unleashes the zero-shot object detection ability of multimodal large language models (MLLMs), such as GPT-4V and Gemini. Our approach consists of a detection prompting toolkit inspired by high-precision detection priors and a new chain-of-thought to implement these prompts. Specifically, the prompts in the toolkit are designed to guide the MLLM to focus on regional information (e.g., zooming in), read coordinates according to measurement standards (e.g., overlaying rulers and compasses), and infer from contextual information (e.g., overlaying scene graphs). Building upon these tools, the new detection chain-of-thought can automatically decompose the task into simple subtasks, diagnose the predictions, and plan progressive box refinements. The effectiveness of our framework is demonstrated across a spectrum of detection tasks, especially hard cases. Compared to existing state-of-the-art methods, GPT-4V with our DetToolChain improves on state-of-the-art object detectors by +21.5% AP50 on the MS COCO novel class set for open-vocabulary detection, +24.23% accuracy on the RefCOCO val set for zero-shot referring expression comprehension, and +14.5% AP on the FULL setting of the D-cube described object detection benchmark.
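The "diagnose and progressively refine" loop can be sketched as an iterative zoom: start from the whole image, ask the model for a box inside the current region, check it, then zoom to that box and repeat. In the sketch below, `predict` is a hypothetical stand-in for the tool-prompted MLLM call, not an API from the paper, and the degeneracy check is a simplified diagnosis step.

```python
def progressive_box_refinement(predict, image_size, steps=3):
    """Toy sketch of a progressive box-refinement loop. `predict(region)`
    returns (x0, y0, x1, y1) in absolute image coordinates and is a
    hypothetical callable standing in for an MLLM prompted with visual
    tools (zooming, rulers, etc.)."""
    region = (0.0, 0.0, float(image_size[0]), float(image_size[1]))
    for _ in range(steps):
        box = predict(region)
        # Diagnose: clip the prediction to the current region so each
        # refinement step can only zoom in, never drift outside.
        x0 = max(box[0], region[0]); y0 = max(box[1], region[1])
        x1 = min(box[2], region[2]); y1 = min(box[3], region[3])
        if x1 <= x0 or y1 <= y0:
            break  # degenerate prediction: stop refining
        region = (x0, y0, x1, y1)
    return region
```

Each iteration narrows the region the model must reason about, which is why the abstract emphasizes regional prompts such as zooming in.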
UniHCP: A Unified Model for Human-Centric Perceptions
Human-centric perceptions (e.g., pose estimation, human parsing, pedestrian
detection, person re-identification, etc.) play a key role in industrial
applications of visual models. While specific human-centric tasks have their
own relevant semantic aspect to focus on, they also share the same underlying
semantic structure of the human body. However, few works have attempted to
exploit such homogeneity and design a general-purpose model for human-centric
tasks. In this work, we revisit a broad range of human-centric tasks and unify
them in a minimalist manner. We propose UniHCP, a Unified Model for
Human-Centric Perceptions, which unifies a wide range of human-centric tasks in
a simplified end-to-end manner with the plain vision transformer architecture.
With large-scale joint training on 33 human-centric datasets, UniHCP can
outperform strong baselines on several in-domain and downstream tasks by direct
evaluation. When adapted to a specific task, UniHCP achieves new SOTAs on a
wide range of human-centric tasks, e.g., 69.8 mIoU on CIHP for human parsing,
86.18 mA on PA-100K for attribute prediction, 90.3 mAP on Market1501 for ReID,
and 85.8 JI on CrowdHuman for pedestrian detection, performing better than
specialized models tailored for each task.
Comment: Accepted for publication at the IEEE/CVF Conference on Computer
Vision and Pattern Recognition 2023 (CVPR 2023).
HumanBench: Towards General Human-centric Perception with Projector Assisted Pretraining
Human-centric perceptions include a variety of vision tasks, which have
widespread industrial applications, including surveillance, autonomous driving,
and the metaverse. It is desirable to have a general pretraining model for
versatile human-centric downstream tasks. This paper forges ahead along this
path from the aspects of both benchmark and pretraining methods. Specifically,
we propose \textbf{HumanBench}, based on existing datasets, to comprehensively
evaluate on a common ground the generalization abilities of different
pretraining methods on 19 datasets from 6 diverse downstream tasks, including
person ReID, pose estimation, human parsing, pedestrian attribute recognition,
pedestrian detection, and crowd counting. To learn both coarse-grained and
fine-grained knowledge of human bodies, we further propose a \textbf{P}rojector
\textbf{A}ssis\textbf{T}ed \textbf{H}ierarchical pretraining method
(\textbf{PATH}) to learn diverse knowledge at different granularity levels.
Comprehensive evaluations on HumanBench show that our PATH achieves new
state-of-the-art results on 17 downstream datasets and on-par results on the
other 2 datasets. The code will be made publicly available at
\href{https://github.com/OpenGVLab/HumanBench}{https://github.com/OpenGVLab/HumanBench}.
Comment: Accepted to CVPR 2023.
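The hierarchical idea behind PATH, pooling the same backbone features at different granularities so that separate projectors and losses can act at each level, can be sketched minimally as below. The part names, the slice-based part layout, and mean pooling are all illustrative assumptions, not the paper's design.

```python
def hierarchical_features(patch_feats, part_slices):
    """Toy of projector-assisted hierarchical pretraining: pool the same
    backbone patch features at two granularities (whole body vs. parts)
    so that a separate projector/loss can act at each level.
    `part_slices` maps an illustrative part name to the contiguous patch
    index range it covers."""
    def mean_pool(feats):
        n = len(feats)
        return [sum(f[j] for f in feats) / n for j in range(len(feats[0]))]

    # Coarse granularity: one vector for the whole body.
    global_feat = mean_pool(patch_feats)
    # Fine granularity: one vector per body part.
    part_feats = {name: mean_pool(patch_feats[s:e])
                  for name, (s, e) in part_slices.items()}
    return global_feat, part_feats
```

Feeding the coarse vector to one projector and the part vectors to another lets pretraining supervise both levels at once, which is the granularity split the abstract describes.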