Joint learning of images and videos with a single Vision Transformer
In this study, we propose a method for jointly learning images and videos
using a single model. In general, images and videos are trained with
separate models. In this paper, we propose a method that feeds a batch of
images to a Vision Transformer, IV-ViT, together with a set of video frames
aggregated temporally by late fusion. Experimental results on two image
datasets and two action recognition datasets are presented.

Comment: MVA2023 (18th International Conference on Machine Vision
Applications), Hamamatsu, Japan, 23-25 July 2023
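The joint-training idea above can be sketched with a toy late-fusion aggregator: a single shared backbone embeds every frame, and an image is just a one-frame "video". The random linear projection here is only a stand-in for the paper's Vision Transformer; all names, shapes, and parameters are illustrative assumptions, not taken from IV-ViT.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for a shared ViT backbone: a fixed linear projection
# from a flattened 16x16 frame to an 8-d feature vector.
W = rng.standard_normal((16 * 16, 8))

def embed_frame(frame):
    """Embed a single 16x16 frame with the shared backbone."""
    return frame.reshape(-1) @ W

def embed_clip(frames):
    """Late fusion: embed each frame with the same backbone, then average
    the frame-level features over time."""
    feats = np.stack([embed_frame(f) for f in frames])
    return feats.mean(axis=0)

image = rng.standard_normal((16, 16))
video = rng.standard_normal((5, 16, 16))

img_feat = embed_clip(image[None])  # an image is a one-frame "video"
vid_feat = embed_clip(video)
assert img_feat.shape == vid_feat.shape == (8,)
```

Because late fusion reduces any number of frames to one fixed-size feature, the same model (and loss) can consume image batches and video clips interchangeably.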
GEO-BLEU: Similarity Measure for Geospatial Sequences
In recent geospatial research, the importance of modeling large-scale human
mobility data and predicting trajectories is rising, in parallel with progress
in text generation using large-scale corpora in natural language processing.
Whereas there are already plenty of feasible approaches applicable to
geospatial sequence modeling itself, there seems to be room to improve with
regard to evaluation, specifically about measuring the similarity between
generated and reference trajectories. In this work, we propose a novel
similarity measure, GEO-BLEU, which can be especially useful in the context of
geospatial sequence modeling and generation. As the name suggests, this work is
based on BLEU, one of the most popular measures used in machine translation
research, while introducing spatial proximity into the idea of n-grams. We compare
this measure with an established baseline, dynamic time warping, applying it to
actual generated geospatial sequences. Using crowdsourced annotated data on the
similarity between geospatial sequences collected from over 12,000 cases, we
quantitatively and qualitatively show the proposed method's superiority …
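The description above (BLEU-style n-gram precision, softened by spatial proximity instead of exact token matching) can be sketched as follows. The exponential-decay kernel, the greedy n-gram matching, the `beta` parameter, and the omitted brevity penalty are all illustrative simplifications, not the paper's exact formulation.

```python
import math

def ngrams(seq, n):
    """All length-n windows of a sequence of (x, y) points."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def point_sim(p, q, beta=0.5):
    # Spatial proximity: decays exponentially with Euclidean distance.
    # beta is an illustrative decay parameter, not from the paper.
    return math.exp(-beta * math.dist(p, q))

def ngram_sim(a, b, beta=0.5):
    """Similarity of two n-grams: product of point-wise proximities."""
    prod = 1.0
    for p, q in zip(a, b):
        prod *= point_sim(p, q, beta)
    return prod

def soft_precision(gen, ref, n, beta=0.5):
    """Soft n-gram precision: each generated n-gram greedily claims its
    closest remaining reference n-gram (a simplification of optimal
    one-to-one matching)."""
    g, r = ngrams(gen, n), ngrams(ref, n)
    if not g or not r:
        return 0.0
    remaining = list(r)
    total = 0.0
    for a in g:
        sims = [ngram_sim(a, b, beta) for b in remaining]
        j = max(range(len(sims)), key=sims.__getitem__)
        total += sims[j]
        remaining.pop(j)
        if not remaining:
            break
    return total / len(g)

def geo_bleu_like(gen, ref, max_n=3, beta=0.5):
    """Geometric mean of soft n-gram precisions, BLEU-style
    (brevity penalty omitted in this sketch)."""
    ps = [soft_precision(gen, ref, n, beta) for n in range(1, max_n + 1)]
    ps = [p for p in ps if p > 0]
    if not ps:
        return 0.0
    return math.exp(sum(math.log(p) for p in ps) / len(ps))

ref  = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
near = [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1)]
print(geo_bleu_like(near, ref))  # close to 1 for nearby trajectories
```

Unlike dynamic time warping, which returns an unbounded cumulative distance, a BLEU-style score stays in [0, 1] and rewards matching local sub-trajectories at several granularities at once.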