Non-Autoregressive Coarse-to-Fine Video Captioning
It is encouraging to see the progress made in bridging videos and
natural language. However, mainstream video captioning methods suffer from slow
inference due to the sequential nature of autoregressive decoding, and
tend to generate generic descriptions owing to insufficient training of
visual words (e.g., nouns and verbs) and an inadequate decoding paradigm. In this
paper, we propose a non-autoregressive decoding based model with a
coarse-to-fine captioning procedure to alleviate these defects. In
implementation, we employ a bi-directional self-attention based network as our
language model to achieve inference speedup, on top of which we decompose the
captioning procedure into two stages, each with a different focus.
Specifically, given that visual words determine the semantic correctness of
captions, we design a mechanism of generating visual words to not only promote
the training of scene-related words but also capture relevant details from
videos to construct a coarse-grained sentence "template". Thereafter, we devise
dedicated decoding algorithms that fill in the "template" with suitable words
and modify inappropriate phrasing via iterative refinement to obtain a
fine-grained description. Extensive experiments on two mainstream video
captioning benchmarks, i.e., MSVD and MSR-VTT, demonstrate that our approach
achieves state-of-the-art performance, generates diverse descriptions, and
obtains high inference efficiency. Our code is available at
https://github.com/yangbang18/Non-Autoregressive-Video-Captioning.
Comment: 9 pages, 6 figures, to be published in AAAI 2021.
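The two-stage procedure resembles mask-predict style iterative refinement; the sketch below illustrates that idea under assumptions, with a hypothetical `model` that returns, in one parallel pass, a probability distribution over the vocabulary for every position. The masking schedule and all names are illustrative, not the paper's exact algorithm.

```python
MASK_ID = 0  # hypothetical id of the [MASK] placeholder token

def coarse_to_fine_decode(model, video_feats, length, n_iters=5):
    """Sketch of mask-predict style coarse-to-fine decoding.

    `model(video_feats, tokens)` is assumed to return a list of
    per-position probability distributions (one list of floats per
    position), computed in a single parallel pass.
    """
    # Coarse stage: start from an all-[MASK] "template" and fill every
    # position at once, surfacing the visual words first.
    tokens = [MASK_ID] * length
    dists = model(video_feats, tokens)
    probs = [max(d) for d in dists]
    tokens = [d.index(max(d)) for d in dists]

    # Fine stage: iteratively re-mask the least confident positions and
    # re-predict them, letting the model fix inappropriate phrasing.
    for t in range(1, n_iters):
        n_mask = length * (n_iters - t) // n_iters  # linearly decaying
        if n_mask == 0:
            break
        remask = sorted(range(length), key=lambda i: probs[i])[:n_mask]
        for i in remask:
            tokens[i] = MASK_ID
        dists = model(video_feats, tokens)
        for i in remask:
            probs[i] = max(dists[i])
            tokens[i] = dists[i].index(max(dists[i]))
    return tokens
```

Because every pass scores all positions in parallel, the number of model calls is bounded by `n_iters` rather than by the caption length, which is where the inference speedup over left-to-right decoding comes from.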
OpenPerf: A Benchmarking Framework for the Sustainable Development of the Open-Source Ecosystem
Benchmarking involves designing scientific test methods, tools, and
frameworks to quantitatively and comparably assess specific performance
indicators of certain test subjects. With the development of artificial
intelligence, AI benchmarking datasets such as ImageNet and DataPerf have
gradually become consensus standards in both academic and industrial fields.
However, constructing a benchmarking framework remains a significant challenge
in the open-source domain due to the diverse range of data types, the wide
array of research issues, and the intricate nature of collaboration networks.
This paper introduces OpenPerf, a benchmarking framework designed for the
sustainable development of the open-source ecosystem. The framework defines 9
benchmarking tasks in open-source research, covering 3 data types (time
series, text, and graph) and addressing 6 research problems:
regression, classification, recommendation, ranking, network building, and
anomaly detection. Based on the above tasks, we implemented 3 data science task
benchmarks, 2 index-based benchmarks, and 1 standard benchmark. Notably, the
index-based benchmarks have been adopted by the China Electronics
Standardization Institute as evaluation criteria for open-source community
governance. Additionally, we have developed a comprehensive toolkit for
OpenPerf, which not only offers robust data management, tool integration, and
user interface capabilities but also adopts a Benchmarking-as-a-Service (BaaS)
model to serve academic institutions, industries, and foundations. Through its
application in renowned companies and institutions such as Alibaba, Ant Group,
and East China Normal University, we have validated OpenPerf's pivotal role in
the healthy evolution of the open-source ecosystem.
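The task taxonomy (3 data types crossed with 6 research problems) could be organized as a simple validated registry; the following is a hypothetical sketch of that structure, not the actual OpenPerf API, and all names are invented for illustration.

```python
from dataclasses import dataclass

# Data types and research problems as enumerated in the abstract.
DATA_TYPES = {"time_series", "text", "graph"}
PROBLEMS = {"regression", "classification", "recommendation",
            "ranking", "network_building", "anomaly_detection"}

@dataclass(frozen=True)
class BenchmarkTask:
    """A benchmarking task tagged with its data type and problem."""
    name: str
    data_type: str
    problem: str

    def __post_init__(self):
        # Reject tasks outside the declared taxonomy.
        if self.data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {self.data_type}")
        if self.problem not in PROBLEMS:
            raise ValueError(f"unknown problem: {self.problem}")

REGISTRY: dict = {}

def register(task: BenchmarkTask) -> BenchmarkTask:
    """Add a task to the framework's registry and return it."""
    REGISTRY[task.name] = task
    return task

# Example registration (task name invented for illustration).
register(BenchmarkTask("developer_activity_forecast",
                       "time_series", "regression"))
```

A registry of this shape makes it straightforward to enumerate which (data type, problem) combinations a framework covers and to reject misclassified submissions early.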
VNI-Net: Vector Neurons-based Rotation-Invariant Descriptor for LiDAR Place Recognition
LiDAR-based place recognition plays a crucial role in Simultaneous
Localization and Mapping (SLAM) and LiDAR localization.
Despite the emergence of various deep learning-based and hand-crafted
methods, rotation-induced place recognition failure remains a critical
challenge.
Existing studies address this limitation through specific training strategies
or network structures.
However, the former does not produce satisfactory results, while the latter
focuses mainly on the reduced problem of SO(2) rotation invariance. Methods
targeting SO(3) rotation invariance suffer from limitations in discrimination
capability.
In this paper, we propose a new method that employs Vector Neurons Network
(VNN) to achieve SO(3) rotation invariance.
We first extract rotation-equivariant features from neighboring points and
map low-dimensional features to a high-dimensional space through VNN.
Afterwards, we calculate the Euclidean and cosine distances in the
rotation-equivariant feature space as rotation-invariant feature descriptors.
Finally, we aggregate the features using GeM pooling to obtain global
descriptors.
To address the significant information loss when formulating
rotation-invariant descriptors, we propose computing distances between features
at different layers within the Euclidean space neighborhood.
This greatly improves the discriminability of the point cloud descriptors
while ensuring computational efficiency.
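The invariance argument rests on the fact that a global rotation R preserves norms and inner products, so |Ra - Rb| = |a - b| and cos(Ra, Rb) = cos(a, b). A minimal sketch of this idea follows, pairing features from two layers as the abstract describes; the function names and the exact pairing are assumptions, not the VNI-Net implementation.

```python
import math

def invariant_descriptor(feats_a, feats_b, p=3.0):
    """Sketch: rotation-invariant global descriptor from two sets of
    rotation-equivariant 3D features (e.g., from two network layers).

    `feats_a` and `feats_b` are lists of 3D vectors, one per point.
    """
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    per_point = []
    for a, b in zip(feats_a, feats_b):
        # Both quantities are unchanged by a shared global rotation.
        euclid = norm([x - y for x, y in zip(a, b)])
        cosine = (sum(x * y for x, y in zip(a, b))
                  / (norm(a) * norm(b) + 1e-9))
        per_point.append((euclid, cosine))

    def gem(values):
        # GeM (generalized-mean) pooling: (mean of v^p)^(1/p),
        # clamped to keep the power well-defined.
        return (sum(max(v, 1e-9) ** p for v in values)
                / len(values)) ** (1.0 / p)

    # Aggregate per-point invariants into a global descriptor.
    return [gem([e for e, _ in per_point]),
            gem([c for _, c in per_point])]
```

Applying the same rotation to both feature sets leaves the descriptor unchanged, which is exactly the property that makes place recognition robust to viewpoint rotation.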
Experimental results on public datasets show that our approach significantly
outperforms other baseline methods implementing rotation invariance, while
achieving comparable results with current state-of-the-art place recognition
methods that do not consider rotation issues.