Decentralized Cooperative Planning for Automated Vehicles with Hierarchical Monte Carlo Tree Search
Today's automated vehicles lack the ability to cooperate implicitly with
others. This work presents a Monte Carlo Tree Search (MCTS) based approach for
decentralized cooperative planning using macro-actions for automated vehicles
in heterogeneous environments. Based on cooperative modeling of other agents
and Decoupled-UCT (a variant of MCTS), the algorithm evaluates the
state-action-values of each agent in a cooperative and decentralized manner,
explicitly modeling the interdependence of actions between traffic
participants. Macro-actions allow for temporal extension over multiple time
steps and increase the effective search depth, requiring fewer iterations to
plan over longer horizons. Without predefined policies for macro-actions, the
algorithm simultaneously learns policies over and within macro-actions. The
proposed method is evaluated in several conflict scenarios, showing that it
achieves effective cooperative planning with learned macro-actions in
heterogeneous environments.
QAScore -- An Unsupervised Unreferenced Metric for the Question Generation Evaluation
Question Generation (QG) aims to automate the task of composing questions for
a passage with a set of chosen answers found within the passage. In recent
years, the introduction of neural generation models has resulted in substantial
improvements of automatically generated questions in terms of quality,
especially compared to traditional approaches that employ manually crafted
heuristics. However, the metrics commonly applied in QG evaluations have been
criticized for their low agreement with human judgement. We therefore propose
QAScore, a new reference-free evaluation metric that has the potential to
provide a better mechanism for evaluating QG systems. Instead of fine-tuning a
language model to maximize its correlation with human judgements, QAScore
evaluates a question by computing the cross entropy according to the
probability that the language model can correctly generate the masked words in
the answer to that question. Furthermore, we conduct a new crowd-sourced human
evaluation experiment for QG evaluation to investigate how well QAScore and
other metrics correlate with human judgements. Experiments show that QAScore
obtains a stronger correlation with the results of our proposed human
evaluation method than existing word-overlap-based metrics such as BLEU and
ROUGE, as well as the existing pretrained-model-based metric BERTScore.
Comment: 19 pages, 5 figures, 7 tables
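The scoring idea can be sketched as follows: mask each answer word in turn and average the log-probability a pretrained language model assigns to it, with higher scores indicating better questions. The toy `token_prob` lookup below stands in for a real masked language model and is illustrative only:

```python
import math

def qascore_like(answer_tokens, token_prob):
    """Average log-probability the (stand-in) masked LM assigns to each
    answer token. A real implementation would mask one token at a time
    and query a pretrained LM conditioned on passage + question."""
    logps = [math.log(token_prob(tok)) for tok in answer_tokens]
    return sum(logps) / len(logps)

# Toy stand-in probabilities instead of a real masked LM:
probs = {"the": 0.9, "Eiffel": 0.5, "Tower": 0.8}
score = qascore_like(["the", "Eiffel", "Tower"], probs.get)
```

Since no reference question is needed, only the passage, question, and answer, the metric is reference-free, which is the property the abstract emphasizes.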
MetaFormer Baselines for Vision
MetaFormer, the abstracted architecture of Transformer, has been found to
play a significant role in achieving competitive performance. In this paper, we
further explore the capacity of MetaFormer, again, without focusing on token
mixer design: we introduce several baseline models under MetaFormer using the
most basic or common mixers, and summarize our observations as follows: (1)
MetaFormer ensures a solid lower bound of performance. By merely adopting
identity mapping as the token mixer, the MetaFormer model, termed
IdentityFormer, achieves >80% accuracy on ImageNet-1K. (2) MetaFormer works
well with arbitrary token mixers. When specifying the token mixer as even a
random matrix to mix tokens, the resulting model RandFormer yields an accuracy
of >81%, outperforming IdentityFormer. One can thus rest assured of
MetaFormer's results when new token mixers are adopted. (3) MetaFormer
effortlessly offers state-of-the-art results. With conventional token mixers
dating back five years, the models instantiated from MetaFormer already beat the state of the
art. (a) ConvFormer outperforms ConvNeXt. Taking the common depthwise separable
convolutions as the token mixer, the model termed ConvFormer, which can be
regarded as pure CNNs, outperforms the strong CNN model ConvNeXt. (b) CAFormer
sets a new record on ImageNet-1K. By simply applying depthwise separable
convolutions as token mixer in the bottom stages and vanilla self-attention in
the top stages, the resulting model CAFormer sets a new record on ImageNet-1K:
it achieves an accuracy of 85.5% at 224x224 resolution, under normal supervised
training without external data or distillation. In our expedition to probe
MetaFormer, we also find that a new activation, StarReLU, reduces activation
FLOPs by 71% compared with GELU yet achieves better performance. We expect
StarReLU to find great potential in MetaFormer-like models alongside other
neural networks. Comment: Accepted to TPAMI. Code: https://github.com/sail-sg/metaforme
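The StarReLU mentioned at the end is simple to state: a squared ReLU with a scalar scale and bias (learnable in the paper). A minimal numpy sketch, where the 0.8944 and -0.4472 initial values follow the paper's variance-based derivation:

```python
import numpy as np

def star_relu(x, s=0.8944, b=-0.4472):
    """StarReLU: s * ReLU(x)**2 + b. Only a square, a multiply, and an
    add per element, versus GELU's costlier erf/tanh evaluation, which
    is where the claimed activation-FLOP reduction comes from."""
    return s * np.square(np.maximum(x, 0.0)) + b
```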
MetaFormer Is Actually What You Need for Vision
Transformers have shown great potential in computer vision tasks. A common
belief is that their attention-based token mixer module contributes most to
their competence. However, recent works show that the attention-based module in
Transformers can be replaced by spatial MLPs and the resulting models still
perform quite well. Based on this observation, we hypothesize that the general
architecture of the Transformers, instead of the specific token mixer module,
is more essential to the model's performance. To verify this, we deliberately
replace the attention module in Transformers with an embarrassingly simple
spatial pooling operator to conduct only basic token mixing. Surprisingly, we
observe that the derived model, termed PoolFormer, achieves competitive
performance on multiple computer vision tasks. For example, on ImageNet-1K,
PoolFormer achieves 82.1% top-1 accuracy, surpassing well-tuned Vision
Transformer/MLP-like baselines DeiT-B/ResMLP-B24 by 0.3%/1.1% accuracy with
35%/52% fewer parameters and 50%/62% fewer MACs. The effectiveness of
PoolFormer verifies our hypothesis and urges us to initiate the concept of
"MetaFormer", a general architecture abstracted from Transformers without
specifying the token mixer. Based on extensive experiments, we argue that
MetaFormer is the key player in achieving superior results for recent
Transformer and MLP-like models on vision tasks. This work calls for more
future research dedicated to improving MetaFormer instead of focusing on the
token mixer modules. Additionally, our proposed PoolFormer could serve as a
starting baseline for future MetaFormer architecture design. Code is available
at https://github.com/sail-sg/poolformer. Comment: CVPR 2022 (Oral)
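The pooling token mixer is striking in its simplicity: local average pooling with the input subtracted, so the identity carried by the block's residual connection is not double-counted. A 1-D numpy sketch of the idea, not the paper's 2-D implementation:

```python
import numpy as np

def pool_mixer(x, pool=3):
    """PoolFormer-style token mixer: local average pooling minus the
    input (the subtraction cancels the identity contributed by the
    residual branch). x has shape (tokens, channels)."""
    pad = pool // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    pooled = np.stack([xp[i:i + pool].mean(axis=0) for i in range(len(x))])
    return pooled - x
```

The mixer has no parameters at all, which is what makes PoolFormer's competitive accuracy such strong evidence for the MetaFormer hypothesis.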
ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation
Joint speech-language training is challenging due to the large demand for
training data and GPU resources, as well as the modality gap between speech
and language. We present ComSL, a speech-language model built atop a composite
architecture of public pretrained speech-only and language-only models and
optimized data-efficiently for spoken language tasks. In particular, we propose
to incorporate cross-modality learning into transfer learning and to conduct
both simultaneously for downstream tasks in a multi-task learning manner. Our
approach has demonstrated effectiveness in end-to-end speech-to-text
translation tasks, achieving a new state-of-the-art average BLEU score of 31.5
on the multilingual speech to English text translation task for 21 languages,
as measured on the public CoVoST2 evaluation set.
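The "cross-modality learning plus transfer learning, conducted simultaneously" recipe amounts to a weighted multi-task objective. A sketch with illustrative weights, not the paper's values:

```python
def composite_loss(st_loss, asr_loss, align_loss, w=(1.0, 0.3, 0.3)):
    """Multi-task objective in the spirit of ComSL: the end-to-end
    speech-to-text translation loss plus auxiliary speech-recognition
    and cross-modality alignment terms, optimized jointly rather than
    in separate pretraining/fine-tuning stages."""
    return w[0] * st_loss + w[1] * asr_loss + w[2] * align_loss
```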
Enabling Efficient Interaction between an Algorithm Agent and an LLM: A Reinforcement Learning Approach
Large language models (LLMs) encode a vast amount of world knowledge acquired
from massive text datasets. Recent studies have demonstrated that LLMs can
assist an algorithm agent in solving complex sequential decision-making tasks
in embodied environments by providing high-level instructions. However,
interacting with LLMs can be time-consuming: in many practical scenarios, they
require so much storage that they can only be deployed on remote cloud server
nodes. Additionally, using commercial LLMs can be costly, since they may charge
since they may charge based on usage frequency. In this paper, we explore how
to enable efficient and cost-effective interactions between the agent and an
LLM. We propose a reinforcement learning based mediator model that determines
when it is necessary to consult LLMs for high-level instructions to accomplish
a target task. Experiments on 4 MiniGrid environments that entail planning
sub-goals demonstrate that our method can learn to solve target tasks with only
a few necessary interactions with an LLM, significantly reducing interaction
costs in testing environments, compared with baseline methods. Experimental
results also suggest that by learning a mediator model to interact with the
LLM, the agent's performance becomes more robust against both exploratory and
stochastic environments. Comment: 10 pages
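The mediator idea can be sketched as a learned gate: at each step, a policy decides whether to pay for a fresh LLM instruction or reuse the cached one. Everything below (the names and the per-query cost) is hypothetical, not the paper's interface:

```python
import random

def mediator_step(state, query_prob, llm_plan, cached_plan, query_cost=0.1):
    """One mediator decision: consult the LLM only when the learned
    policy (represented here by query_prob(state)) says to; otherwise
    reuse the last instruction. Returns (instruction, reward penalty),
    so RL training is pushed to query only when it pays off."""
    if random.random() < query_prob(state):
        return llm_plan(state), -query_cost  # fresh instruction, pay cost
    return cached_plan, 0.0                  # reuse cached plan for free
```

Training the gate with the query cost folded into the reward is what drives interaction counts down while keeping task success high.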