Semantic-aware Consistency Network for Cloth-changing Person Re-Identification
Cloth-changing Person Re-Identification (CC-ReID) is a challenging task that
aims to retrieve the target person across multiple surveillance cameras when
the person's clothing may change. Despite recent progress in CC-ReID, existing
approaches are still hindered by the interference of clothing variations since
they lack effective constraints to keep the model consistently focused on
clothing-irrelevant regions. To address this issue, we present a Semantic-aware
Consistency Network (SCNet) to learn identity-related semantic features by
proposing effective consistency constraints. Specifically, we generate a
black-clothing image by erasing the pixels in the clothing area, which explicitly
mitigates the interference from clothing variations. In addition, to fully
exploit the fine-grained identity information, a head-enhanced attention module
is introduced, which learns soft attention maps by utilizing the proposed
part-based matching loss to highlight head information. We further design a
semantic consistency loss to facilitate the learning of high-level
identity-related semantic features, forcing the model to focus on semantically
consistent cloth-irrelevant regions. By using the consistency constraint, our
model does not require any extra auxiliary segmentation module to generate the
black-clothing image or locate the head region during the inference stage.
Extensive experiments on four cloth-changing person Re-ID datasets (LTCC, PRCC,
VC-Clothes, and DeepChange) demonstrate that our proposed SCNet achieves
significant improvements over prior state-of-the-art approaches. Our code is
available at: https://github.com/Gpn-star/SCNet.
Comment: Accepted by ACM MM 2023
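For illustration, a minimal sketch of the clothing-erasing step described above, assuming a human-parsing mask is available; the mask convention, the clothing label id, and the loss form are illustrative assumptions, not taken from the released code:

    import torch
    import torch.nn.functional as F

    def erase_clothing(images, parsing_masks, clothing_label=2):
        # images: (B, 3, H, W) floats; parsing_masks: (B, H, W) integer labels.
        # clothing_label is an assumed id for clothing regions in the parsing map.
        clothing = (parsing_masks == clothing_label).unsqueeze(1)  # (B, 1, H, W)
        return images.masked_fill(clothing, 0.0)  # black out clothing pixels

    def semantic_consistency_loss(feat_orig, feat_black):
        # Pull features of the original and black-clothing views together so the
        # model learns to rely on cloth-irrelevant regions.
        return 1.0 - F.cosine_similarity(feat_orig, feat_black, dim=1).mean()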
OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation
Referring video object segmentation (RVOS) aims at segmenting an object in a
video following human instruction. Current state-of-the-art methods fall into
an offline pattern, in which each clip independently interacts with the text
embedding for cross-modal understanding. These methods usually hold that the
offline pattern is necessary for RVOS, yet they model only limited temporal
association within each clip. In this work, we challenge this offline belief and propose a
simple yet effective online model using explicit query propagation, named
OnlineRefer. Specifically, our approach leverages target cues that carry
semantic information and positional priors to make referring predictions for
the current frame more accurate and easier. Furthermore, we generalize our
online model into a semi-online framework to be compatible with video-based
backbones. To show the effectiveness of our method, we evaluate it on four
benchmarks, i.e., Refer-Youtube-VOS, Refer-DAVIS17, A2D-Sentences, and
JHMDB-Sentences. Without bells and whistles, our OnlineRefer with a Swin-L
backbone achieves 63.5 J&F and 64.8 J&F on Refer-Youtube-VOS and Refer-DAVIS17,
outperforming all other offline methods.
Comment: Accepted by ICCV2023. The code is at https://github.com/wudongming97/OnlineRefer
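As a rough sketch of the explicit query propagation idea, assuming a DETR-style per-frame decoder; the decoder signature below is hypothetical, not OnlineRefer's actual interface:

    def online_rvos(frame_features, text_embed, decoder, init_queries):
        # Propagate object queries frame by frame: each frame reuses the previous
        # frame's output embeddings as its input queries, carrying target cues
        # (semantics and position) forward through the video.
        queries = init_queries
        outputs = []
        for feats in frame_features:
            queries = decoder(queries, feats, text_embed)  # cross-modal decoding
            outputs.append(queries)
        return outputs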
Manipulating Predictions over Discrete Inputs in Machine Teaching
Machine teaching often involves the creation of an optimal (typically
minimal) dataset to help a model (referred to as the 'student') achieve
specific goals given by a teacher. While such studies are abundant in the
continuous domain, studies of the effectiveness of machine teaching in the
discrete domain are relatively limited. This paper focuses on machine teaching
in the discrete domain, specifically on manipulating a student model's
predictions toward the teacher's goals by efficiently changing the training data. We formulate this
task as a combinatorial optimization problem and solve it by proposing an
iterative searching algorithm. Our algorithm demonstrates significant numerical
merit in scenarios where a teacher attempts to correct erroneous predictions to
improve the student model, or to maliciously manipulate the model into
misclassifying specific samples as a target class for the teacher's own
benefit. Experimental results show that our proposed algorithm manipulates the
model's predictions effectively and efficiently, surpassing conventional baselines.
Comment: 8 pages, 2 figures
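A minimal sketch of one plausible iterative search of this kind; the abstract does not specify the algorithm, so the greedy loop and the retrain_and_score callback below are illustrative assumptions:

    def greedy_teaching_search(train_set, candidate_edits, retrain_and_score, budget):
        # Greedily pick, one at a time, the discrete training-data edit that most
        # improves the teacher's objective (e.g., fixing or flipping predictions).
        candidate_edits = list(candidate_edits)
        chosen = []
        for _ in range(budget):
            best_edit, best_score = None, float("-inf")
            for edit in candidate_edits:
                score = retrain_and_score(train_set + chosen + [edit])
                if score > best_score:
                    best_edit, best_score = edit, score
            if best_edit is None:
                break
            chosen.append(best_edit)
            candidate_edits.remove(best_edit)
        return chosen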
Referring Multi-Object Tracking
Existing referring understanding tasks tend to involve the detection of a
single text-referred object. In this paper, we propose a new and general
referring understanding task, termed referring multi-object tracking (RMOT).
Its core idea is to employ a language expression as a semantic cue to guide the
prediction of multi-object tracking. To the best of our knowledge, this is the
first work to predict an arbitrary number of referent objects in
videos. To push forward RMOT, we construct a benchmark with scalable
expressions based on KITTI, named Refer-KITTI. Specifically, it provides 18
videos with 818 expressions, and each expression in a video is annotated with
an average of 10.7 objects. Further, we develop a transformer-based
architecture, TransRMOT, to tackle the new task in an online manner, which
achieves impressive detection performance and outperforms other counterparts.
The dataset and code will be available at https://github.com/wudongming97/RMOT.
Comment: Accepted by CVPR 2023
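To make the "arbitrary number of referents" point concrete, a minimal sketch of score-based track selection; the embedding shapes and the threshold are illustrative, not TransRMOT's actual interface:

    import torch

    def select_referred_tracks(track_embeds, text_embed, threshold=0.5):
        # track_embeds: (num_tracks, dim); text_embed: (dim,).
        # Keep every track whose referring score passes the threshold, so the
        # output may contain zero, one, or many referents.
        scores = torch.sigmoid(track_embeds @ text_embed)
        return torch.nonzero(scores > threshold).flatten()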
M2DF: Multi-grained Multi-curriculum Denoising Framework for Multimodal Aspect-based Sentiment Analysis
Multimodal Aspect-based Sentiment Analysis (MABSA) is a fine-grained
sentiment analysis task that has attracted growing research interest
recently. Existing work mainly utilizes image information to improve the
performance of the MABSA task. However, most studies overestimate the
importance of images, since the datasets contain many noisy images unrelated to
the text, which negatively affects model learning. Although some work attempts
to filter out low-quality noisy images by setting thresholds, thresholding
inevitably discards a lot of useful image information.
Therefore, in this work, we focus on whether the negative impact of noisy
images can be reduced without modifying the data. To achieve this goal, we
borrow the idea of Curriculum Learning and propose a Multi-grained
Multi-curriculum Denoising Framework (M2DF), which can achieve denoising by
adjusting the order of training data. Extensive experimental results show that
our framework consistently outperforms state-of-the-art work on three sub-tasks
of MABSA.
Comment: Accepted by EMNLP 2023
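A minimal sketch of curriculum-style denoising under the assumption that noise is estimated with an image-text relevance score (the relevance function is a stand-in); note that the data are only reordered, never discarded:

    def curriculum_order(examples, relevance_fn):
        # Sort examples so cleaner (more image-text relevant) ones come first.
        return sorted(examples, key=relevance_fn, reverse=True)

    def curriculum_schedule(examples, relevance_fn, num_stages=3):
        # Yield progressively larger, progressively noisier training subsets.
        ordered = curriculum_order(examples, relevance_fn)
        for stage in range(1, num_stages + 1):
            yield ordered[: len(ordered) * stage // num_stages]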
Language Prompt for Autonomous Driving
A new trend in the computer vision community is to capture objects of
interest following flexible human command represented by a natural language
prompt. However, the progress of using language prompts in driving scenarios is
stuck in a bottleneck due to the scarcity of paired prompt-instance data. To
address this challenge, we propose the first object-centric language prompt set
for driving scenes within 3D, multi-view, and multi-frame space, named
NuPrompt. It expands the nuScenes dataset by constructing a total of 35,367
language descriptions, each referring to an average of 5.3 object tracks. Based
on the object-text pairs from the new benchmark, we formulate a new
prompt-based driving task, i.e., employing a language prompt to predict the
described object trajectory across views and frames. Furthermore, we provide a
simple end-to-end baseline model based on Transformer, named PromptTrack.
Experiments show that our PromptTrack achieves impressive performance on
NuPrompt. We hope this work can provide new insights for the autonomous
driving community. The dataset and code will be made public at
https://github.com/wudongming97/Prompt4Driving.
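A minimal sketch of the prompt-to-tracks pairing such a benchmark implies, with one prompt mapped to every object track it describes across views and frames; all field names here are hypothetical:

    from dataclasses import dataclass, field

    @dataclass
    class PromptAnnotation:
        prompt: str            # natural-language description
        track_ids: list[int]   # referred object tracks (about 5.3 per prompt on average)

    @dataclass
    class PromptedScene:
        scene_id: str
        prompts: list[PromptAnnotation] = field(default_factory=list)

    example = PromptedScene(
        scene_id="scene-0001",
        prompts=[PromptAnnotation("the cars turning left ahead", track_ids=[3, 7])],
    )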
Mechanical deformation mechanism and verification of sections at junctions of light and dark tunnel in a mountain area
Projects involving junctions of light and dark tunnel sections in mountainous areas are complex engineering problems that combine the tunnel structure, the slope rock-soil mass, and protection works. Such junctions are subjected to complex and changeable loads, and the stress and deformation at the junction vary under different conditions, which complicates construction and monitoring. In this paper, according to the load conditions at a junction of light and dark tunnel sections, we divide the junction into thrust, compression, and combined thrust-compression types. The three types of structures were simulated by numerical analysis, and we explored their structural deformation and stress under different conditions, so that the mechanical deformation mechanism and the weak points of the structure can be identified before construction. Based on the weak parts, monitoring points were installed and four field sites were chosen for monitoring. The monitoring results show that the actual deformation, stress, and structural failure locations are basically consistent with the numerical simulation results. The deformation mechanism of the light and dark tunnel junction obtained here can provide a basis for selecting treatment measures and controlling structural deformation. Furthermore, the results can serve as a reference for similar engineering design, construction, and site monitoring projects.
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
Graphical User Interface (GUI) agents are designed to automate complex tasks
on digital devices, such as smartphones and desktops. Most existing GUI agents
interact with the environment through extracted structured data, which can be
notably lengthy (e.g., HTML) and occasionally inaccessible (e.g., on desktops).
To alleviate this issue, we propose a novel visual GUI agent -- SeeClick, which
only relies on screenshots for task automation. In our preliminary study, we
have discovered a key challenge in developing visual GUI agents: GUI grounding
-- the capacity to accurately locate screen elements based on instructions. To
tackle this challenge, we propose to enhance SeeClick with GUI grounding
pre-training and devise a method to automate the curation of GUI grounding
data. Along with the efforts above, we have also created ScreenSpot, the first
realistic GUI grounding benchmark that encompasses mobile, desktop, and web
environments. After pre-training, SeeClick demonstrates significant improvement
in ScreenSpot over various baselines. Moreover, comprehensive evaluations on
three widely used benchmarks consistently support our finding that advancements
in GUI grounding directly correlate with enhanced performance in downstream GUI
agent tasks. The model, data and code are available at
https://github.com/njucckevin/SeeClick
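A minimal sketch of the GUI grounding interface the paper describes: map a screenshot plus an instruction to a click point. The model call below is a placeholder, not SeeClick's actual API:

    from PIL import Image

    def ground_instruction(model, screenshot: Image.Image, instruction: str):
        # Return (x, y) in [0, 1]^2 locating the element the instruction refers to.
        x, y = model.predict(image=screenshot, text=instruction)  # hypothetical call
        return max(0.0, min(1.0, x)), max(0.0, min(1.0, y))       # clamp to screen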