POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition
Facial Expression Recognition (FER) has received increasing interest in the
computer vision community. As a challenging task, there are three key issues
especially prevalent in FER: inter-class similarity, intra-class discrepancy,
and scale sensitivity. Existing methods typically address some of these issues,
but do not tackle them all in a unified framework. Therefore, in this paper, we
propose a two-stream Pyramid crOss-fuSion TransformER network (POSTER) that
aims to holistically solve these issues. Specifically, we design a
transformer-based cross-fusion paradigm that enables effective collaboration of
facial landmark and direct image features to maximize proper attention to
salient facial regions. Furthermore, POSTER employs a pyramid structure to
promote scale invariance. Extensive experimental results demonstrate that our
POSTER outperforms SOTA methods on RAF-DB with 92.05%, FERPlus with 91.62%,
AffectNet (7 cls) with 67.31%, and AffectNet (8 cls) with 63.34%, respectively.
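The cross-fusion idea can be illustrated with plain cross-attention, in which each stream uses its own tokens as queries and the other stream's tokens as keys and values. The following is only a minimal sketch under that assumption, not the POSTER architecture: the function names (`attend`, `cross_fuse`) and the toy token values are hypothetical, and the learned projections, multiple heads, and pyramid levels are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def cross_fuse(landmark_tokens, image_tokens):
    """Each stream queries the other stream's tokens (assumed fusion scheme)."""
    lm_out = attend(landmark_tokens, image_tokens, image_tokens)
    img_out = attend(image_tokens, landmark_tokens, landmark_tokens)
    return lm_out, img_out


# Toy 2-dimensional tokens, chosen only for illustration.
landmark_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
lm_out, img_out = cross_fuse(landmark_tokens, image_tokens)
```

Each output keeps the token count of the stream that supplied the queries, so the landmark stream returns two fused tokens and the image stream returns three.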
Exploring Parameter-Efficient Fine-tuning for Improving Communication Efficiency in Federated Learning
Federated learning (FL) has emerged as a promising paradigm for enabling the
collaborative training of models without centralized access to the raw data on
local devices. In the typical FL paradigm (e.g., FedAvg), model weights are
sent to and from the server each round to participating clients. However, this
can quickly put a massive communication burden on the system, especially if
more capable models beyond very small MLPs are employed. Recently, the use of
pre-trained models has been shown effective in federated learning optimization
and improving convergence. This opens the door for new research questions. Can
we adjust the weight-sharing paradigm in federated learning, leveraging strong
and readily-available pre-trained models, to significantly reduce the
communication burden while simultaneously achieving excellent performance? To
this end, we investigate the use of parameter-efficient fine-tuning in
federated learning. Specifically, we systematically evaluate the performance of
several parameter-efficient fine-tuning methods across a variety of client
stability, data distribution, and differential privacy settings. By only
locally tuning and globally sharing a small portion of the model weights,
significant reductions in the total communication overhead can be achieved
while maintaining competitive performance in a wide range of federated learning
scenarios, providing insight into a new paradigm for practical and effective
federated systems.
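The communication argument can be made concrete with a little arithmetic. The sketch below assumes a FedAvg-style round in which every participating client downloads and uploads the shared parameters once; the parameter counts, client count, and the names `fedavg` and `round_cost` are hypothetical choices for illustration, not figures or code from the paper.

```python
def fedavg(client_updates, client_sizes):
    """Weighted average of per-client parameter dicts (FedAvg-style)."""
    total = sum(client_sizes)
    keys = client_updates[0].keys()
    return {
        k: [
            sum(u[k][i] * n for u, n in zip(client_updates, client_sizes)) / total
            for i in range(len(client_updates[0][k]))
        ]
        for k in keys
    }


def round_cost(num_params, num_clients, bytes_per_param=4):
    """Bytes moved in one round: one download plus one upload per client."""
    return 2 * num_clients * num_params * bytes_per_param


# Hypothetical sizes, chosen only for illustration.
FULL_PARAMS = 86_000_000    # a large pre-trained backbone
SHARED_PARAMS = 860_000     # a small tuned subset (1% of the backbone)
CLIENTS_PER_ROUND = 8

full_cost = round_cost(FULL_PARAMS, CLIENTS_PER_ROUND)
peft_cost = round_cost(SHARED_PARAMS, CLIENTS_PER_ROUND)
reduction = full_cost / peft_cost

# Only the small tuned subset is exchanged and averaged;
# the frozen backbone never leaves the client.
updates = [{"adapter": [1.0, 2.0]}, {"adapter": [3.0, 6.0]}]
global_adapter = fedavg(updates, client_sizes=[1, 3])
```

Under these assumed sizes, sharing 1% of the weights cuts the per-round traffic by a factor of 100, because the cost is linear in the number of parameters exchanged.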
A Lightweight Graph Transformer Network for Human Mesh Reconstruction from 2D Human Pose
Existing deep learning-based human mesh reconstruction approaches have a
tendency to build larger networks in order to achieve higher accuracy.
Computational complexity and model size are often neglected, despite being key
characteristics for practical use of human mesh reconstruction models (e.g.
virtual try-on systems). In this paper, we present GTRS, a lightweight
pose-based method that can reconstruct human mesh from 2D human pose. We
propose a pose analysis module that uses graph transformers to exploit
structured and implicit joint correlations, and a mesh regression module that
combines the extracted pose feature with the mesh template to reconstruct the
final human mesh. We demonstrate the efficiency and generalization of GTRS by
extensive evaluations on the Human3.6M and 3DPW datasets. In particular, GTRS
achieves better accuracy than the SOTA pose-based method Pose2Mesh while only
using 10.2% of the parameters (Params) and 2.5% of the FLOPs on the challenging
in-the-wild 3DPW dataset. Code will be publicly available.
GFM: Building Geospatial Foundation Models via Continual Pretraining
Geospatial technologies are becoming increasingly essential in our world for
a wide range of applications, including agriculture, urban planning, and
disaster response. To help improve the applicability and performance of deep
learning models on these geospatial tasks, various works have begun
investigating foundation models for this domain. Researchers have explored two
prominent approaches for introducing such models in geospatial applications,
but both have drawbacks in terms of limited performance benefit or prohibitive
training cost. Therefore, in this work, we propose a novel paradigm for
building highly effective geospatial foundation models with minimal resource
cost and carbon impact. We first construct a compact yet diverse dataset from
multiple sources to promote feature diversity, which we term GeoPile. Then, we
investigate the potential of continual pretraining from large-scale
ImageNet-22k models and propose a multi-objective continual pretraining
paradigm, which leverages the strong representations of ImageNet while
simultaneously providing the freedom to learn valuable in-domain features. Our
approach outperforms previous state-of-the-art geospatial pretraining methods
in an extensive evaluation on seven downstream datasets covering various tasks
such as change detection, classification, multi-label classification, semantic
segmentation, and super-resolution.
FedPerfix: Towards Partial Model Personalization of Vision Transformers in Federated Learning
Personalized Federated Learning (PFL) represents a promising solution for
decentralized learning in heterogeneous data environments. Partial model
personalization has been proposed to improve the efficiency of PFL by
selectively updating local model parameters instead of aggregating all of them.
However, previous work on partial model personalization has mainly focused on
Convolutional Neural Networks (CNNs), leaving a gap in understanding how it can
be applied to other popular models such as Vision Transformers (ViTs). In this
work, we investigate where and how to partially personalize a ViT model.
Specifically, we empirically evaluate the sensitivity to data distribution of
each type of layer. Based on the insights that the self-attention layer and the
classification head are the most sensitive parts of a ViT, we propose a novel
approach called FedPerfix, which leverages plugins to transfer information from
the aggregated model to the local client as a personalization. Finally, we
evaluate the proposed approach on CIFAR-100, OrganAMNIST, and Office-Home
datasets and demonstrate its effectiveness in improving the model's performance
compared to several advanced PFL methods.
Comment: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)
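Partial model personalization in general can be sketched as follows: the server averages only a designated set of shared parameters, while each client keeps the remaining ones local. This is a generic illustration and not FedPerfix's plugin mechanism, which the abstract does not specify. The layer names, the scalar stand-ins for whole layers, and the choice to keep the attention and head entries local (motivated by the sensitivity finding above) are all assumptions.

```python
def aggregate_shared(client_models, shared_keys):
    """Average only the shared entries; personalized entries are not touched."""
    n = len(client_models)
    return {k: sum(m[k] for m in client_models) / n for k in shared_keys}


def apply_global(client_model, global_shared):
    """Overwrite a client's shared entries, keeping its personalized ones."""
    merged = dict(client_model)
    merged.update(global_shared)
    return merged


# Hypothetical split: scalar values stand in for whole layers.
SHARED_KEYS = ["patch_embed", "mlp"]
clients = [
    {"patch_embed": 1.0, "mlp": 2.0, "attention": 10.0, "head": 0.5},
    {"patch_embed": 3.0, "mlp": 4.0, "attention": 20.0, "head": 1.5},
]

global_shared = aggregate_shared(clients, SHARED_KEYS)
clients = [apply_global(c, global_shared) for c in clients]
```

After the round, both clients hold the same averaged `patch_embed` and `mlp` values, while their `attention` and `head` values remain client-specific.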
HeatER: An Efficient and Unified Network for Human Reconstruction via Heatmap-based TransformER
Recently, vision transformers have shown great success in 2D human pose
estimation (2D HPE), 3D human pose estimation (3D HPE), and human mesh
reconstruction (HMR) tasks. In these tasks, heatmap representations of the
human structural information are often extracted first from the image by a CNN,
and then further processed with a transformer architecture to provide the final
HPE or HMR estimation. However, existing transformer architectures are not able
to process these heatmap inputs directly, forcing an unnatural flattening of
the features prior to input. Furthermore, much of the performance benefit in
recent HPE and HMR methods has come at the cost of ever-increasing computation
and memory needs. Therefore, to simultaneously address these problems, we
propose HeatER, a novel transformer design which preserves the inherent
structure of heatmap representations when modeling attention while reducing the
memory and computational costs. Taking advantage of HeatER, we build a unified
and efficient network for 2D HPE, 3D HPE, and HMR tasks. A heatmap
reconstruction module is applied to improve the robustness of the estimated
human pose and mesh. Extensive experiments demonstrate the effectiveness of
HeatER on various human pose and mesh datasets. For instance, HeatER
outperforms the SOTA method MeshGraphormer by requiring 5% of Params and 16% of
MACs on Human3.6M and 3DPW datasets. Code will be publicly available.