An Asymptotic Analysis of Minibatch-Based Momentum Methods for Linear Regression Models
Momentum methods have been shown, in both theory and practice, to accelerate the convergence of standard gradient descent. In particular, minibatch-based gradient descent methods with momentum (MGDM) are widely used to solve large-scale optimization problems on massive datasets. Despite the practical success of MGDM methods, their theoretical properties remain underexplored. To this end, we investigate the theoretical properties of MGDM methods under linear regression models. We first study the numerical convergence properties of the MGDM algorithm and provide the theoretically optimal specification of the tuning parameters for a faster convergence rate. In addition, we explore the relationship between the statistical properties of the resulting MGDM estimator and the tuning parameters. Based on these theoretical findings, we give conditions under which the resulting estimator achieves optimal statistical efficiency. Finally, extensive numerical experiments are conducted to verify our theoretical results.
Comment: 45 pages, 5 figures
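Below is a minimal NumPy sketch of the setting this abstract studies: minibatch gradient descent with heavy-ball momentum on a linear regression model. It is not the paper's code; the learning rate, momentum weight, and batch size here are illustrative placeholders, whereas the paper derives theoretically optimal choices.

```python
# Minimal sketch (not the paper's code): minibatch gradient descent with
# momentum (MGDM) for linear regression y = X @ beta + noise.
import numpy as np

rng = np.random.default_rng(0)
n, p = 10_000, 20
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

def mgdm(X, y, lr=0.05, momentum=0.9, batch_size=256, epochs=50):
    n, p = X.shape
    beta = np.zeros(p)
    velocity = np.zeros(p)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # Gradient of the least-squares loss on the minibatch.
            grad = X[b].T @ (X[b] @ beta - y[b]) / len(b)
            # Heavy-ball momentum update.
            velocity = momentum * velocity - lr * grad
            beta = beta + velocity
    return beta

beta_hat = mgdm(X, y)
print("estimation error:", np.linalg.norm(beta_hat - beta_true))
```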
PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation
Depth-aware Video Panoptic Segmentation (DVPS) is a new and challenging vision problem that aims to predict panoptic segmentation and depth in a video simultaneously. Previous work solves this task by extending an existing panoptic segmentation method with an extra dense depth prediction head and an instance tracking head. However, the relationship between depth and panoptic segmentation is not well explored: simply combining existing methods leads to competition between the tasks and requires careful loss weight balancing. In this paper, we present PolyphonicFormer, a vision transformer that unifies these sub-tasks under the DVPS task and leads to more robust results. Our principal insight is that depth can be harmonized with panoptic segmentation through our proposed paradigm of predicting instance-level depth maps with object queries. The relationship between the two tasks is then explored via query-based learning. Our experiments demonstrate the benefits of this design for both depth estimation and panoptic segmentation. Since each thing query also encodes instance-wise information, it is natural to perform tracking directly via appearance learning. Our method achieves state-of-the-art results on two DVPS datasets (Semantic KITTI, Cityscapes) and ranks 1st on the ICCV-2021 BMTT Challenge video + depth track. Code is available at https://github.com/HarborYuan/PolyphonicFormer .
Comment: Accepted by ECCV 2022
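A minimal PyTorch sketch of the query-based paradigm the abstract describes: each object query emits both a mask and an instance-level depth map from shared pixel features. Module names and shapes are assumptions, not the released PolyphonicFormer code.

```python
# Minimal sketch (assumed names/shapes, not the released code): a query-based
# head predicting a mask and an instance-level depth map per object query, so
# depth and panoptic segmentation share the same query representation.
import torch
import torch.nn as nn

class QueryMaskDepthHead(nn.Module):
    def __init__(self, dim=256, num_classes=19):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.mask_embed = nn.Linear(dim, dim)
        self.depth_embed = nn.Linear(dim, dim)

    def forward(self, queries, pixel_feats):
        # queries: (B, N, C); pixel_feats: (B, C, H, W)
        logits = self.cls_head(queries)  # (B, N, K+1)
        masks = torch.einsum("bnc,bchw->bnhw",
                             self.mask_embed(queries), pixel_feats)
        # One dense depth map per query; a full depth map can later be
        # composed from these using the predicted masks.
        depth = torch.einsum("bnc,bchw->bnhw",
                             self.depth_embed(queries), pixel_feats).relu()
        return logits, masks, depth

head = QueryMaskDepthHead()
logits, masks, depth = head(torch.randn(2, 100, 256),
                            torch.randn(2, 256, 64, 128))
print(logits.shape, masks.shape, depth.shape)
```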
Multi-Task Learning with Multi-Query Transformer for Dense Prediction
Previous multi-task dense prediction studies developed complex pipelines such as multi-modal distillation in multiple stages or searching for task relational contexts for each task. The core insight behind these methods is to maximize the mutual effects between tasks. Inspired by recent query-based Transformers, we propose a simpler pipeline named Multi-Query Transformer (MQTransformer) that is equipped with multiple queries from different tasks to facilitate reasoning among multiple tasks and simplify the cross-task pipeline. Instead of modeling the dense per-pixel context among different tasks, we seek a task-specific proxy to perform cross-task reasoning via multiple queries, where each query encodes task-related context. The MQTransformer is composed of three key components: a shared encoder, a cross-task attention module, and a shared decoder. We first model each task with a task-relevant and scale-aware query; then both the image feature output by the feature extractor and the task-relevant query feature are fed into the shared encoder, which encodes the query feature from the image feature. Second, we design a cross-task attention module to reason about the dependencies among multiple tasks and feature scales from two perspectives: different tasks at the same scale and different scales of the same task. Finally, a shared decoder gradually refines the image features with the reasoned query features from different tasks. Extensive experiments on two dense prediction datasets (NYUD-v2 and PASCAL-Context) show that the proposed method is effective and achieves state-of-the-art results.
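The cross-task attention idea can be sketched as plain multi-head attention over the concatenated per-task queries. The module below is a hedged illustration with assumed shapes, not the MQTransformer implementation.

```python
# Minimal sketch (hypothetical shapes/names): cross-task reasoning via
# attention over per-task queries instead of dense per-pixel context.
import torch
import torch.nn as nn

class CrossTaskAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, task_queries):
        # task_queries: list of (B, Q, C), one entry per task.
        sizes = [q.shape[1] for q in task_queries]
        x = torch.cat(task_queries, dim=1)  # (B, sum(Q), C)
        out, _ = self.attn(x, x, x)         # queries attend across tasks
        x = self.norm(x + out)              # residual + norm
        return list(x.split(sizes, dim=1))  # back to per-task queries

module = CrossTaskAttention()
seg_q, depth_q, normal_q = (torch.randn(2, 10, 256) for _ in range(3))
seg_q, depth_q, normal_q = module([seg_q, depth_q, normal_q])
print(seg_q.shape)
```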
Transformer-Based Visual Segmentation: A Survey
Visual segmentation seeks to partition images, video frames, or point clouds
into multiple segments or groups. This technique has numerous real-world
applications, such as autonomous driving, image editing, robot sensing, and
medical analysis. Over the past decade, deep learning-based methods have made
remarkable strides in this area. Recently, transformers, a type of neural
network based on self-attention originally designed for natural language
processing, have considerably surpassed previous convolutional or recurrent
approaches in various vision processing tasks. Specifically, vision
transformers offer robust, unified, and even simpler solutions for various
segmentation tasks. This survey provides a thorough overview of
transformer-based visual segmentation, summarizing recent advancements. We
first review the background, encompassing problem definitions, datasets, and
prior convolutional methods. Next, we summarize a meta-architecture that
unifies all recent transformer-based approaches. Based on this
meta-architecture, we examine various method designs, including modifications
to the meta-architecture and associated applications. We also present several
closely related settings, including 3D point cloud segmentation, foundation
model tuning, domain-aware segmentation, efficient segmentation, and medical
segmentation. Additionally, we compile and re-evaluate the reviewed methods on
several well-established datasets. Finally, we identify open challenges in this
field and propose directions for future research. The project page can be found
at https://github.com/lxtGH/Awesome-Segmenation-With-Transformer. We will also
continually monitor developments in this rapidly evolving field.
Comment: Work in progress. GitHub: https://github.com/lxtGH/Awesome-Segmenation-With-Transformer
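As an illustration of the meta-architecture the survey refers to, here is a minimal MaskFormer-style sketch in PyTorch: learnable queries are decoded against pixel features, and each query emits a class prediction and a mask. All names and dimensions are illustrative assumptions.

```python
# Minimal sketch of a query-based mask-classification meta-architecture
# (MaskFormer-style); backbone and pixel decoder are abstracted away.
import torch
import torch.nn as nn

class SegMetaArchitecture(nn.Module):
    def __init__(self, dim=256, num_queries=100, num_classes=133):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.cls_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.mask_embed = nn.Linear(dim, dim)

    def forward(self, pixel_feats):
        # pixel_feats: (B, C, H, W) from any backbone + pixel decoder.
        B, C, H, W = pixel_feats.shape
        memory = pixel_feats.flatten(2).transpose(1, 2)      # (B, HW, C)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        q = self.decoder(q, memory)                          # (B, N, C)
        logits = self.cls_head(q)                            # (B, N, K+1)
        masks = torch.einsum("bnc,bchw->bnhw",
                             self.mask_embed(q), pixel_feats)
        return logits, masks

model = SegMetaArchitecture()
logits, masks = model(torch.randn(2, 256, 32, 32))
print(logits.shape, masks.shape)
```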
A well-preserved ‘placoderm’ (stem-group Gnathostomata) upper jaw from the Early Devonian of Mongolia clarifies jaw evolution
The origin of jaws and teeth remains contentious in vertebrate evolution. ‘Placoderms’ (Silurian-Devonian armoured jawed fishes) are central to debates on the origins of these anatomical structures. ‘Acanthothoracids’ are generally considered the most primitive ‘placoderms’, but they are so far known mainly from disarticulated skeletal elements that are typically incomplete. The structure of the jaws, particularly the jaw hinge, is poorly known, leaving open questions about jaw function and about comparisons with other placoderms and modern gnathostomes. Here we describe a near-complete ‘acanthothoracid’ upper jaw, allowing us to reconstruct the likely orientation and angle of the bite and to compare its morphology with that of other known ‘placoderm’ groups. We clarify that the bite position is located on the upper jaw cartilage rather than on the dermal cheek, and thus show that there is a highly conserved bite morphology among most groups of ‘placoderms’, regardless of their overall cranial geometry. Incorporation of the dermal skeleton appears to provide a sound biomechanical basis for jaw origins. ‘Acanthothoracid’ dentitions were fundamentally similar in location to those of arthrodire ‘placoderms’ rather than to those of bony fishes. Irrespective of current phylogenetic uncertainty, the new data resolve the likely general condition for ‘placoderms’ as a whole and, as such, the ancestral morphology of known jawed vertebrates.
Tube-Link: A Flexible Cross Tube Framework for Universal Video Segmentation
Video segmentation aims to accurately segment and track every pixel in diverse scenarios. In this paper, we present Tube-Link, a versatile framework that addresses multiple core tasks of video segmentation with a unified architecture. Our framework is a near-online approach that takes a short subclip as input and outputs the corresponding spatial-temporal tube masks. To enhance the modeling of cross-tube relationships, we propose an effective way to perform tube-level linking via attention along the queries. In addition, we introduce temporal contrastive learning to learn instance-wise discriminative features for tube-level association. Our approach offers flexibility and efficiency for both short and long video inputs, as the length of each subclip can be varied according to the needs of the dataset or scenario. Tube-Link outperforms existing specialized architectures by a significant margin on five video segmentation datasets. Specifically, it achieves almost 13% relative improvement on VIPSeg and 4% on KITTI-STEP over the strong baseline Video K-Net. Using a ResNet50 backbone on YouTube-VIS 2019 and 2021, Tube-Link boosts IDOL by 3% and 4%, respectively. Code is available at https://github.com/lxtGH/Tube-Link
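A hedged sketch of tube-level linking via attention along the queries: the queries of the current subclip cross-attend to those of the previous subclip, and (contrastively learned) query similarity can then drive tube association. Shapes and names are assumptions, not the released Tube-Link code.

```python
# Minimal sketch (hypothetical, not the released code): linking adjacent
# subclip "tubes" by cross-attention between their query sets.
import torch
import torch.nn as nn

class TubeLinker(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cur_queries, prev_queries):
        # cur_queries, prev_queries: (B, N, C) tube queries of two subclips.
        out, _ = self.attn(cur_queries, prev_queries, prev_queries)
        return self.norm(cur_queries + out)

linker = TubeLinker()
prev_q, cur_q = torch.randn(2, 100, 256), torch.randn(2, 100, 256)
linked = linker(cur_q, prev_q)
# Tube association can then use query similarity as match scores:
sim = torch.einsum("bnc,bmc->bnm", linked, prev_q)  # (B, N, N)
print(linked.shape, sim.shape)
```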
Towards Open Vocabulary Learning: A Survey
In the field of visual scene understanding, deep neural networks have made impressive advancements in various core tasks such as segmentation, tracking, and detection. However, most approaches operate on the closed-set assumption, meaning that the model can only identify pre-defined categories present in the training set. Recently, open vocabulary settings have been proposed, driven by the rapid progress of vision-language pre-training. These new approaches seek to locate and recognize categories beyond the annotated label space. The open vocabulary approach is more general, practical, and effective than weakly supervised and zero-shot settings. This paper provides a thorough review of open vocabulary learning, summarizing and analyzing recent developments in the field. In particular, we begin by comparing it to related concepts such as zero-shot learning, open-set recognition, and out-of-distribution detection. Then, we review several closely related tasks for segmentation and detection, including long-tail problems and few-shot and zero-shot settings. For the method survey, we first present the preliminaries of closed-set detection and segmentation. Next, we examine various scenarios in which open vocabulary learning is used, identifying common design elements and core ideas. Then, we compare recent detection and segmentation approaches on commonly used datasets and benchmarks. Finally, we conclude with insights, open issues, and discussions regarding future research directions. To our knowledge, this is the first comprehensive literature review of open vocabulary learning. We keep tracking related works at https://github.com/jianzongwu/Awesome-Open-Vocabulary.
Comment: Project page at https://github.com/jianzongwu/Awesome-Open-Vocabulary
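The core mechanism shared by most open vocabulary methods the survey reviews can be sketched in a few lines: region features are classified by similarity to text embeddings of free-form category names, so the label space need not be fixed at training time. The encoders are stand-ins for a pretrained vision-language model such as CLIP; names and dimensions are illustrative.

```python
# Minimal sketch of open-vocabulary classification: match visual region
# features against text embeddings of arbitrary class-name prompts.
import torch
import torch.nn.functional as F

def open_vocab_classify(region_feats, text_embeds, temperature=0.01):
    # region_feats: (R, D) visual features for R detected regions.
    # text_embeds:  (K, D) embeddings of K free-form class-name prompts.
    region = F.normalize(region_feats, dim=-1)
    text = F.normalize(text_embeds, dim=-1)
    logits = region @ text.T / temperature  # scaled cosine similarity
    return logits.softmax(dim=-1)           # (R, K) class probabilities

probs = open_vocab_classify(torch.randn(5, 512), torch.randn(20, 512))
print(probs.shape, probs.sum(dim=-1))
```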