Implicit Stacked Autoregressive Model for Video Prediction
Future frame prediction has been approached through two primary methods:
autoregressive and non-autoregressive. Autoregressive methods rely on the
Markov assumption and can achieve high accuracy in the early stages of
prediction when errors are not yet accumulated. However, their performance
tends to decline as the number of time steps increases. In contrast,
non-autoregressive methods can achieve relatively high performance, but their
predictions at different time steps are not correlated with one another. In this paper, we propose
an Implicit Stacked Autoregressive Model for Video Prediction (IAM4VP), which
is an implicit video prediction model that applies a stacked autoregressive
method. Like non-autoregressive methods, stacked autoregressive methods use the
same observed frame to estimate all future frames. However, they use their own
predictions as input, similar to autoregressive methods. As the number of time
steps increases, predictions are sequentially stacked in the queue. To evaluate
the effectiveness of IAM4VP, we conducted experiments on three common future
frame prediction benchmark datasets and weather and climate prediction benchmark
datasets. The results demonstrate that our proposed model achieves
state-of-the-art performance.
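The stacked autoregressive rollout described in this abstract can be illustrated with a minimal sketch. It is not the IAM4VP network: `toy_predictor` is a hypothetical stand-in that linearly extrapolates scalar "frames", and the function names are assumptions made for illustration. The sketch only shows the control flow: every step receives the same observed frames, plus a queue of the model's own earlier predictions that grows by one entry per step.

```python
def toy_predictor(observed, stacked):
    """Hypothetical stand-in for a learned model (assumption, not IAM4VP).

    Estimates a constant per-step change from the observed frames and
    applies it to the most recent frame available.
    """
    velocity = (observed[-1] - observed[0]) / (len(observed) - 1)
    last = stacked[-1] if stacked else observed[-1]
    return last + velocity


def stacked_autoregressive_rollout(observed, horizon, predictor):
    """Predict `horizon` future frames in a stacked autoregressive manner."""
    queue = []  # the model's own predictions, stacked in order
    for _ in range(horizon):
        # The same observed frames are used at every step (as in
        # non-autoregressive methods), together with all predictions
        # made so far (as in autoregressive methods).
        next_frame = predictor(observed, list(queue))
        queue.append(next_frame)
    return queue


# Frames are plain floats here to keep the example self-contained.
print(stacked_autoregressive_rollout([1.0, 2.0, 3.0], 3, toy_predictor))
# [4.0, 5.0, 6.0]
```

In a real model the predictor would be a neural network operating on image tensors; the queue logic stays the same.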
A Billion-scale Foundation Model for Remote Sensing Images
As the potential of foundation models in visual tasks has garnered
significant attention, pretraining these models before downstream tasks has
become a crucial step. The three key factors in pretraining foundation models
are the pretraining method, the size of the pretraining dataset, and the number
of model parameters. Recently, research in the remote sensing field has focused
primarily on the pretraining method and the size of the dataset, with limited
emphasis on the number of model parameters. This paper addresses this gap by
examining the effect of increasing the number of model parameters on the
performance of foundation models in downstream tasks such as rotated object
detection and semantic segmentation. We pretrained foundation models with
varying numbers of parameters, including 86M, 605.26M, 1.3B, and 2.4B, to
determine whether performance in downstream tasks improved with an increase in
parameters. To the best of our knowledge, this is the first billion-scale
foundation model in the remote sensing field. Furthermore, we propose an
effective method for scaling up and fine-tuning a vision transformer in the
remote sensing field. To evaluate general performance in downstream tasks, we
employed the DOTA v2.0 and DIOR-R benchmark datasets for rotated object
detection, and the Potsdam and LoveDA datasets for semantic segmentation.
Experimental results demonstrated that, across all benchmark datasets and
downstream tasks, the performance of the foundation models and data efficiency
improved as the number of parameters increased. Moreover, our models achieve
state-of-the-art performance on several datasets, including DIOR-R, Potsdam,
and LoveDA.
Comment: This work has been submitted to the IEEE for possible publication.
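To give a sense of how parameter counts in a vision transformer scale with depth and width, the following sketch counts the parameters of a plain ViT encoder without a classification head. It assumes the standard ViT-Base configuration (12 blocks, width 768, MLP ratio 4, 16x16 patches on 224x224 RGB input), which is consistent with the 86M figure above but is not stated in the abstract. The configurations of the larger models are not given there and are not guessed here. The function name `vit_param_count` is hypothetical.

```python
def vit_param_count(depth, width, mlp_ratio=4, patch=16, image=224, channels=3):
    """Count the parameters of a plain ViT encoder (no classification head)."""
    hidden = mlp_ratio * width
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * width * width + 4 * width
    # Two linear layers (width -> hidden -> width), each with a bias.
    mlp = 2 * width * hidden + hidden + width
    # Two LayerNorms per block, each with a scale and a shift vector.
    norms = 2 * 2 * width
    block = attention + mlp + norms

    # Patch embedding is a linear map from flattened patches to `width`.
    patch_embed = patch * patch * channels * width + width
    tokens = (image // patch) ** 2 + 1  # patch tokens plus one class token
    pos_embed = tokens * width
    cls_token = width
    final_norm = 2 * width
    return depth * block + patch_embed + pos_embed + cls_token + final_norm


vit_base = vit_param_count(depth=12, width=768)
print(f"{vit_base / 1e6:.1f}M")  # 85.8M
```

The per-block cost is dominated by the `12 * width**2` term, so the count grows linearly with depth and roughly quadratically with width.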