FITS: Modeling Time Series with $10k$ Parameters
In this paper, we introduce FITS, a lightweight yet powerful model for time
series analysis. Unlike existing models that directly process raw time-domain
data, FITS operates on the principle that time series can be manipulated
through interpolation in the complex frequency domain. By discarding
high-frequency components with negligible impact on time series data, FITS
achieves performance comparable to state-of-the-art models for time series
forecasting and anomaly detection tasks, while having a remarkably compact size
of only approximately $10k$ parameters. Such a lightweight model can be easily
trained and deployed on edge devices, creating opportunities for various
applications. The anonymous code repo is available at:
\url{https://anonymous.4open.science/r/FITS
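To make the frequency-domain interpolation idea concrete, here is a minimal NumPy sketch of a FITS-style forward pass: an rFFT, low-pass truncation, a complex-valued linear map that interpolates to a longer spectrum, and an inverse rFFT at the extended length. The cutoff ratio and the random (untrained) weights are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of a FITS-style forward pass (illustrative, untrained weights).
import numpy as np

def fits_forward(x, horizon, cutoff_ratio=0.25, rng=np.random.default_rng(0)):
    """x: 1-D time-series window of length L; returns a series of length L + horizon."""
    L = len(x)
    out_len = L + horizon
    # 1) Move to the complex frequency domain.
    spec = np.fft.rfft(x)                        # L//2 + 1 complex bins
    # 2) Low-pass filter: keep only the lowest-frequency bins.
    keep = max(1, int(len(spec) * cutoff_ratio))
    spec = spec[:keep]
    # 3) Complex-valued linear layer interpolating to the longer spectrum.
    #    (In FITS this weight is learned; here it is random for illustration.)
    out_bins = out_len // 2 + 1
    W = (rng.standard_normal((out_bins, keep))
         + 1j * rng.standard_normal((out_bins, keep))) / keep
    spec_up = W @ spec
    # 4) Back to the time domain at the extended length (lookback + forecast).
    return np.fft.irfft(spec_up, n=out_len)

y = fits_forward(np.sin(np.linspace(0, 8 * np.pi, 96)), horizon=24)
print(y.shape)  # (120,)
```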
Time Series is a Special Sequence: Forecasting with Sample Convolution and Interaction
Time series is a special type of sequence data: a set of observations
collected at evenly spaced time intervals and ordered chronologically. Existing deep
learning techniques use generic sequence models (e.g., recurrent neural
network, Transformer model, or temporal convolutional network) for time series
analysis, which ignore some of its unique properties. In particular, three
components characterize a time series: trend, seasonality, and an irregular
component; the former two enable forecasting with reasonable accuracy, a
property that other types of sequence data lack. Motivated by this, we propose
a novel neural network architecture, \textbf{SCINet}, which conducts sample
convolution and interaction for temporal modeling, and we apply it to the time
series forecasting problem. Compared to conventional dilated causal convolution
architectures, the proposed downsample-convolve-interact architecture enables
multi-resolution analysis in addition to expanding the receptive field of the
convolution operation, which facilitates the extraction of temporal relation
features with enhanced predictability. Experimental results show that SCINet
achieves significant improvements in prediction accuracy over existing solutions
across various real-world time series forecasting datasets.
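As a rough illustration of the downsample-convolve-interact idea, below is a minimal PyTorch sketch of a single SCI-Block-style module: the input is split into even- and odd-indexed sub-series, each is convolved, and the two branches exchange information multiplicatively and additively. The kernel size, hidden width, and exact interaction form are assumptions based on the abstract, not a faithful reimplementation.

```python
# Minimal PyTorch sketch of one "downsample-convolve-interact" block.
import torch
import torch.nn as nn

class SCIBlockSketch(nn.Module):
    def __init__(self, channels, kernel_size=5, hidden=32):
        super().__init__()
        def conv():
            return nn.Sequential(
                nn.Conv1d(channels, hidden, kernel_size, padding=kernel_size // 2),
                nn.LeakyReLU(),
                nn.Conv1d(hidden, channels, kernel_size, padding=kernel_size // 2),
                nn.Tanh(),
            )
        self.phi, self.psi, self.rho, self.eta = conv(), conv(), conv(), conv()

    def forward(self, x):
        # x: (batch, channels, length) with even length.
        x_even, x_odd = x[..., ::2], x[..., 1::2]        # downsample into two sub-series
        # Interactive learning: each sub-series is modulated by features of the other.
        x_even_s = x_even * torch.exp(self.phi(x_odd))
        x_odd_s = x_odd * torch.exp(self.psi(x_even))
        x_even_out = x_even_s + self.rho(x_odd_s)
        x_odd_out = x_odd_s - self.eta(x_even_s)
        return x_even_out, x_odd_out

block = SCIBlockSketch(channels=7)
even, odd = block(torch.randn(8, 7, 96))
print(even.shape, odd.shape)  # torch.Size([8, 7, 48]) torch.Size([8, 7, 48])
```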
One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer
Whole-body mesh recovery aims to estimate the 3D human body, face, and hands
parameters from a single image. It is challenging to perform this task with a
single network due to resolution issues, i.e., the face and hands are usually
located in extremely small regions. Existing works usually detect the hands and
face, enlarge their cropped regions, feed them into part-specific networks to
predict their parameters, and finally fuse the results. While this copy-paste pipeline can
capture the fine-grained details of the face and hands, the connections between
different parts cannot be easily recovered in late fusion, leading to
implausible 3D rotation and unnatural pose. In this work, we propose a
one-stage pipeline for expressive whole-body mesh recovery, named OSX, without
separate networks for each part. Specifically, we design a Component Aware
Transformer (CAT) composed of a global body encoder and a local face/hand
decoder. The encoder predicts the body parameters and provides a high-quality
feature map for the decoder, which performs a feature-level upsample-crop
scheme to extract high-resolution part-specific features and adopts
keypoint-guided deformable attention to estimate the hands and face precisely. The
whole pipeline is simple yet effective without any manual post-processing and
naturally avoids implausible prediction. Comprehensive experiments demonstrate
the effectiveness of OSX. Lastly, we build a large-scale Upper-Body dataset
(UBody) with high-quality 2D and 3D whole-body annotations. It contains persons
with partially visible bodies in diverse real-life scenarios to bridge the gap
between the basic task and downstream applications.
Comment: Accepted to CVPR 2023; Top-1 on the AGORA SMPLX benchmark; Project Page:
https://osx-ubody.github.io
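For intuition, the following PyTorch sketch mirrors the one-stage structure described above: a global encoder over image tokens predicts the body parameters, its feature map is upsampled, and learnable part queries decode hand and face parameters from the higher-resolution features. The patch embedding, parameter dimensions, and the plain cross-attention decoder standing in for keypoint-guided deformable attention (and for the keypoint-guided cropping) are all simplifying assumptions, not OSX's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComponentAwareTransformerSketch(nn.Module):
    """Structural sketch only: global body encoder + local face/hand decoder."""
    def __init__(self, dim=256, body_params=66, hand_params=90, face_params=53):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # stand-in backbone
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.body_head = nn.Linear(dim, body_params)
        self.part_queries = nn.Parameter(torch.randn(2, dim))            # [hands, face]
        self.hand_head = nn.Linear(dim, hand_params)
        self.face_head = nn.Linear(dim, face_params)

    def forward(self, img):                                  # img: (B, 3, 256, 256)
        b = img.size(0)
        feat = self.patch_embed(img)                         # (B, dim, 16, 16)
        tokens = self.encoder(feat.flatten(2).transpose(1, 2))
        body = self.body_head(tokens.mean(dim=1))            # global body parameters
        # Feature-level upsample; OSX additionally crops part regions guided by
        # predicted keypoints, which this sketch omits.
        fmap = tokens.transpose(1, 2).reshape(b, -1, 16, 16)
        up = F.interpolate(fmap, scale_factor=2, mode="bilinear", align_corners=False)
        memory = up.flatten(2).transpose(1, 2)               # high-resolution tokens
        queries = self.part_queries.unsqueeze(0).expand(b, -1, -1)
        part = self.decoder(queries, memory)                 # local hand/face decoding
        return body, self.hand_head(part[:, 0]), self.face_head(part[:, 1])

model = ComponentAwareTransformerSketch()
body, hands, face = model(torch.randn(2, 3, 256, 256))
print(body.shape, hands.shape, face.shape)
```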
Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes
Humans have long been recorded in a variety of forms since antiquity. For
example, sculptures and paintings were the primary media for depicting human
beings before the invention of cameras. However, most current human-centric
computer vision tasks like human pose estimation and human image generation
focus exclusively on natural images in the real world. Artificial humans, such
as those in sculptures, paintings, and cartoons, are commonly neglected, making
existing models fail in these scenarios. As an abstraction of life, art
incorporates humans in both natural and artificial scenes. We take advantage of
this and introduce the Human-Art dataset to bridge related tasks in natural and
artificial scenarios. Specifically, Human-Art contains 50k high-quality images
with over 123k person instances from 5 natural and 15 artificial scenarios,
which are annotated with bounding boxes, keypoints, self-contact points, and
text information for humans represented in both 2D and 3D. It is, therefore,
comprehensive and versatile for various downstream tasks. We also provide a
rich set of baseline results and detailed analyses for related tasks, including
human detection, 2D and 3D human pose estimation, image generation, and motion
transfer. As a challenging dataset, we hope Human-Art can provide insights for
relevant research and open up new research questions.
Comment: CVPR 2023
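Purely to illustrate the annotation types listed above (bounding boxes, keypoints, self-contact points, and text), here is what a single COCO-style record might look like; the field names and values are hypothetical, not Human-Art's actual schema.

```python
# Hypothetical record; field names are assumptions, not Human-Art's real schema.
example_annotation = {
    "image_id": 12345,
    "scenario": "oil_painting",             # one of 5 natural / 15 artificial scenarios
    "bbox": [102.0, 55.0, 180.0, 420.0],    # [x, y, width, height] in pixels
    "keypoints": [[140.0, 90.0, 2]] * 17,   # (x, y, visibility) per body joint
    "self_contact": [[150.0, 210.0]],       # 2D self-contact points
    "caption": "a painted figure leaning against a column",
}
print(len(example_annotation["keypoints"]), "keypoints")
```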
Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code
Text-guided diffusion models have revolutionized image generation and
editing, offering exceptional realism and diversity. Specifically, in the
context of diffusion-based editing, where a source image is edited according to
a target prompt, the process commences by acquiring a noisy latent vector
corresponding to the source image via the diffusion model. This vector is
subsequently fed into separate source and target diffusion branches for
editing. The accuracy of this inversion process significantly impacts the final
editing outcome, influencing both essential content preservation of the source
image and edit fidelity according to the target prompt. Prior inversion
techniques aimed to find a unified solution for both the source and target
diffusion branches. However, our theoretical and empirical analyses reveal that
disentangling these branches leads to a distinct separation of responsibilities
for preserving essential content and ensuring edit fidelity. Building on this
insight, we introduce "Direct Inversion," a novel technique achieving optimal
performance of both branches with just three lines of code. To assess image
editing performance, we present PIE-Bench, an editing benchmark with 700 images
showcasing diverse scenes and editing types, accompanied by versatile
annotations and comprehensive evaluation metrics. Compared to state-of-the-art
optimization-based inversion techniques, our solution not only yields superior
performance across 8 editing methods but also achieves nearly an order of
magnitude speed-up.
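The disentanglement idea, as read from the abstract, can be sketched as follows: the source branch is kept exactly on the stored inversion trajectory at every denoising step (preserving content), while the target branch denoises freely under the target prompt; in practice the two branches interact inside the denoising step (e.g., via shared attention), which is why the source branch is still run. The `denoise_step` callable and the latent dictionary below are hypothetical stand-ins, not the paper's actual three lines or API.

```python
import torch

def edit_with_disentangled_branches(inversion_latents, denoise_step,
                                    src_prompt, tgt_prompt, num_steps):
    """inversion_latents[t] is the stored latent at noise level t from DDIM-style inversion."""
    z_src = z_tgt = inversion_latents[num_steps]      # both branches start from z_T
    for t in reversed(range(num_steps)):
        z_src = denoise_step(z_src, t, src_prompt)    # source branch (content preservation)
        z_tgt = denoise_step(z_tgt, t, tgt_prompt)    # target branch (edit fidelity)
        # Correction sketched from the abstract's idea: snap the source branch
        # back onto the inversion trajectory instead of forcing one solution
        # shared by both branches.
        z_src = inversion_latents[t]
    return z_tgt

# Toy usage with dummy latents and a dummy denoiser, just to show the call shape.
latents = {t: torch.randn(4, 64, 64) for t in range(51)}
edited = edit_with_disentangled_branches(latents, lambda z, t, p: 0.9 * z,
                                         "a cat", "a tiger", num_steps=50)
print(edited.shape)
```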
DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation
We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D
Expression and Gesture generation with arbitrary length. While previous works
focused on co-speech gesture or expression generation individually, the joint
generation of synchronized expressions and gestures remains barely explored. To
address this, our diffusion-based co-speech motion generation transformer
enables uni-directional information flow from expression to gesture,
facilitating improved matching of joint expression-gesture distributions.
Furthermore, we introduce an outpainting-based sampling strategy for arbitrarily
long sequence generation in diffusion models, offering flexibility and
computational efficiency. Our method provides a practical solution for producing
high-quality, synchronized expressions and gestures driven by speech. Evaluated
on two public datasets, our approach achieves
state-of-the-art performance both quantitatively and qualitatively.
Additionally, a user study confirms the superiority of DiffSHEG over prior
approaches. By enabling the real-time generation of expressive and synchronized
motions, DiffSHEG showcases its potential for various applications in the
development of digital humans and embodied agents.
Comment: Accepted by CVPR 2024. Project page:
https://jeremycjm.github.io/proj/DiffSHE
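To illustrate the outpainting-based sampling strategy mentioned above, here is a hedged PyTorch sketch: motion is generated window by window, and while denoising each new window its first `overlap` frames are repeatedly pinned to re-noised copies of the previous window's tail, so consecutive windows stay consistent. The window/overlap sizes and the `denoise_step`/`add_noise` callables are assumptions, not DiffSHEG's actual interface.

```python
import torch

def sample_long(denoise_step, add_noise, audio_feats, motion_dim,
                window=90, overlap=10, num_steps=50):
    """Generate an arbitrarily long motion sequence in overlapping windows.
    Assumes the final chunk is at least `overlap` frames long."""
    total = audio_feats.shape[0]
    chunks, start, prev_tail = [], 0, None
    while start < total:
        cond = audio_feats[start:start + window]          # speech features for this window
        x = torch.randn(cond.shape[0], motion_dim)        # start the window from noise
        for t in reversed(range(num_steps)):
            if prev_tail is not None:
                # Outpainting constraint: pin the overlapping frames to the
                # previous window's tail, re-noised to the current step.
                x[:overlap] = add_noise(prev_tail, t)
            x = denoise_step(x, t, cond)
        prev_tail = x[-overlap:].clone()
        chunks.append(x if start == 0 else x[overlap:])   # drop the duplicated overlap
        start += window - overlap
    return torch.cat(chunks, dim=0)

# Toy usage with dummy callables, just to show the shapes involved.
dummy_denoise = lambda x, t, cond: 0.98 * x
dummy_noise = lambda x, t: x + 0.01 * torch.randn_like(x)
motion = sample_long(dummy_denoise, dummy_noise, torch.randn(300, 64), motion_dim=32)
print(motion.shape)  # torch.Size([300, 32])
```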