Asynchronous and Segmented Bidirectional Encoding for NMT
With the rapid advancement of Neural Machine Translation (NMT), enhancing
translation efficiency and quality has become a focal point of research.
Despite the commendable performance of general models such as the Transformer
in various aspects, they still fall short in processing long sentences and
fully leveraging bidirectional contextual information. This paper introduces an
improved model based on the Transformer, implementing an asynchronous and
segmented bidirectional decoding strategy aimed at elevating translation
efficiency and accuracy. Compared with traditional unidirectional translation,
whether left-to-right or right-to-left, our method achieves higher efficiency
and better translation quality, particularly in handling long
sentences. Experimental results on the IWSLT2017 dataset confirm the
effectiveness of our approach in accelerating translation and increasing
accuracy, especially surpassing traditional unidirectional strategies in long
sentence translation. Furthermore, this study analyzes the impact of sentence
length on decoding outcomes and explores the model's performance in various
scenarios. The findings of this research not only provide an effective encoding
strategy for the NMT field but also open new avenues for future studies.
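To make the decoding strategy above concrete, here is a minimal sketch of segmented decoding with alternating directions run asynchronously. The fixed segment length, the dummy decode_l2r/decode_r2l stand-ins, and the thread-pool scheduling are illustrative assumptions, not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the two decoding directions; a real system would
# run beam search over a Transformer decoder here.
def decode_l2r(segment_tokens):
    return [f"{tok}_l2r" for tok in segment_tokens]

def decode_r2l(segment_tokens):
    # Decode right-to-left, then flip so the segment reads left-to-right again.
    return [f"{tok}_r2l" for tok in reversed(segment_tokens)][::-1]

def segmented_bidirectional_decode(tokens, segment_len=8):
    # Split the sequence into fixed-length segments (an illustrative choice).
    segments = [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]
    # Decode alternate segments in opposite directions, asynchronously.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(decode_l2r if i % 2 == 0 else decode_r2l, seg)
                   for i, seg in enumerate(segments)]
        decoded = [f.result() for f in futures]
    # Stitch the per-segment outputs back into a single translation.
    return [tok for seg in decoded for tok in seg]

print(segmented_bidirectional_decode([f"w{i}" for i in range(20)]))
```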
Generalized Activation via Multivariate Projection
Activation functions are essential to introduce nonlinearity into neural
networks, with the Rectified Linear Unit (ReLU) often favored for its
simplicity and effectiveness. Motivated by the structural similarity between a
shallow Feedforward Neural Network (FNN) and a single iteration of the
Projected Gradient Descent (PGD) algorithm, a standard approach for solving
constrained optimization problems, we consider ReLU as a projection from R onto
the nonnegative half-line R+. Building on this interpretation, we extend ReLU
by substituting it with a generalized projection operator onto a convex cone,
such as the Second-Order Cone (SOC) projection, thereby naturally extending it
to a Multivariate Projection Unit (MPU), an activation function with multiple
inputs and multiple outputs. We further provide a mathematical proof establishing
that FNNs activated by SOC projections outperform those using ReLU in terms of
expressive power. Experimental evaluations on widely adopted architectures
further corroborate the effectiveness of the MPU against a broad range of
existing activation functions.
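The projection view of ReLU and its second-order-cone generalization can be sketched directly from the standard closed-form SOC projection; the NumPy implementation and the pairing of activations into one (x, t) block below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def relu(z):
    # ReLU is the univariate special case: projection of a scalar onto R+.
    return np.maximum(z, 0.0)

def soc_projection(x, t):
    # Project (x, t) onto the second-order cone {(x, t) : ||x||_2 <= t}
    # using the standard closed-form projection.
    nx = np.linalg.norm(x)
    if nx <= t:                      # already inside the cone
        return x, t
    if nx <= -t:                     # inside the polar cone: project to the origin
        return np.zeros_like(x), 0.0
    scale = (t + nx) / 2.0           # otherwise project onto the cone boundary
    return scale * x / nx, scale

# A 3-input, 3-output activation block: two "x" coordinates paired with one "t".
x_out, t_out = soc_projection(np.array([3.0, -4.0]), 1.0)
print(x_out, t_out)                  # [1.8, -2.4], 3.0 (on the cone boundary)
print(relu(np.array([-2.5, 0.7])))   # [0.0, 0.7]
```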
DTPP: Differentiable Joint Conditional Prediction and Cost Evaluation for Tree Policy Planning in Autonomous Driving
Motion prediction and cost evaluation are vital components in the
decision-making system of autonomous vehicles. However, existing methods often
ignore the importance of cost learning and treat the two as separate modules. In
this study, we employ a tree-structured policy planner and propose a
differentiable joint training framework for both ego-conditioned prediction and
cost models, resulting in a direct improvement of the final planning
performance. For conditional prediction, we introduce a query-centric
Transformer model that performs efficient ego-conditioned motion prediction.
For planning cost, we propose a learnable context-aware cost function with
latent interaction features, facilitating differentiable joint learning. We
validate our proposed approach using the real-world nuPlan dataset and its
associated planning test platform. Our framework not only matches
state-of-the-art planning methods but also outperforms other learning-based methods
in planning quality, while operating more efficiently in terms of runtime. We
show that joint training delivers significantly better performance than
separate training of the two modules. Additionally, we find that
tree-structured policy planning outperforms the conventional single-stage
planning approach.
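As a rough illustration of how a tree-structured planner couples ego-conditioned prediction with a learned cost, the sketch below enumerates a two-stage tree of ego maneuvers, attaches a stubbed prediction to each branch, and selects the branch minimizing a linear cost over simple features; the maneuver set, features, and weights are hypothetical and not taken from DTPP.

```python
import itertools
import numpy as np

# Hypothetical first- and second-stage ego maneuvers forming a two-level tree.
STAGE1 = ["keep_lane", "shift_left", "shift_right"]
STAGE2 = ["accelerate", "hold_speed", "brake"]

def predict_agents(branch):
    # Stub for ego-conditioned prediction: returns a fake "closest gap" in meters
    # that depends on the ego branch; a real model conditions agent futures on it.
    rng = np.random.default_rng(abs(hash(branch)) % (2**32))
    return {"closest_gap_m": float(rng.uniform(2.0, 20.0))}

def branch_features(branch, prediction):
    progress = 1.0 if "accelerate" in branch else 0.5 if "hold_speed" in branch else 0.2
    comfort = 0.0 if "keep_lane" in branch else 1.0
    safety = 1.0 / prediction["closest_gap_m"]
    return np.array([progress, comfort, safety])

# In DTPP the cost would be learnable and context-aware; here it is a fixed
# linear weighting purely for illustration.
COST_WEIGHTS = np.array([-1.0, 0.3, 5.0])  # reward progress, penalize discomfort/risk

def plan():
    branches = [a + "->" + b for a, b in itertools.product(STAGE1, STAGE2)]
    costs = {b: float(COST_WEIGHTS @ branch_features(b, predict_agents(b)))
             for b in branches}
    return min(costs, key=costs.get), costs

best, costs = plan()
print("selected branch:", best)
```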
BatchSampler: Sampling Mini-Batches for Contrastive Learning in Vision, Language, and Graphs
In-Batch contrastive learning is a state-of-the-art self-supervised method
that brings semantically-similar instances close while pushing dissimilar
instances apart within a mini-batch. The key to its success is the negative-sharing
strategy, in which every instance serves as a negative for the others within
the mini-batch. Recent studies aim to improve performance by sampling hard
negatives \textit{within the current mini-batch}, whose quality is bounded by
the mini-batch itself. In this work, we propose to improve contrastive learning
by sampling mini-batches from the input data. We present
BatchSampler\footnote{The code is available at
\url{https://github.com/THUDM/BatchSampler}} to sample mini-batches of
hard-to-distinguish instances (i.e., instances that are hard and true negatives
of each other). To reduce the number of false negatives in each mini-batch, we
construct a proximity graph over randomly selected instances. To form the
mini-batch, we run random walk with restart on the proximity graph to sample
hard-to-distinguish
instances. BatchSampler is a simple and general technique that can be directly
plugged into existing contrastive learning models in vision, language, and
graphs. Extensive experiments on datasets of three modalities show that
BatchSampler can consistently improve the performance of powerful contrastive
models, as shown by significant improvements of SimCLR on ImageNet-100, SimCSE
on STS (language), and GraphCL and MVGRL on graph datasets.
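A rough sketch of the batch-construction idea: build a k-nearest-neighbor proximity graph over a randomly selected pool of instance embeddings, then run random walk with restart from a seed to collect a mini-batch of hard-to-distinguish instances. The cosine similarity, the value of k, and the restart probability are illustrative assumptions; the authors' implementation is available at the repository linked above.

```python
import numpy as np

def knn_proximity_graph(embeddings, k=5):
    # Cosine-similarity kNN graph over a randomly selected pool of instances.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)
    return {i: list(np.argsort(sims[i])[-k:]) for i in range(len(embeddings))}

def rwr_sample_batch(graph, seed, batch_size, restart_p=0.15, rng=None):
    # Random walk with restart: frequently visited nodes are close to the seed,
    # hence hard to distinguish from it and from each other.
    rng = rng or np.random.default_rng(0)
    visits = np.zeros(len(graph))
    node = seed
    for _ in range(batch_size * 50):
        if rng.random() < restart_p:
            node = seed
        else:
            node = int(rng.choice(graph[node]))
        visits[node] += 1
    visits[seed] = np.inf  # always keep the seed itself in the batch
    return list(np.argsort(visits)[-batch_size:])

pool = np.random.default_rng(1).normal(size=(200, 32))   # stand-in embeddings
graph = knn_proximity_graph(pool, k=5)
batch = rwr_sample_batch(graph, seed=0, batch_size=16)
print(batch)
```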
Does Negative Sampling Matter? A Review with Insights into its Theory and Applications
Negative sampling has swiftly risen to prominence as a focal point of
research, with wide-ranging applications spanning machine learning, computer
vision, natural language processing, data mining, and recommender systems. This
growing interest raises several critical questions: Does negative sampling
really matter? Is there a general framework that can incorporate all existing
negative sampling methods? In what fields is it applied? Addressing these
questions, we propose a general framework that leverages negative sampling.
Delving into its history, we trace the development of negative sampling through
five evolutionary paths. We dissect and categorize
the strategies used to select negative sample candidates, detailing global,
local, mini-batch, hop, and memory-based approaches. Our review categorizes
current negative sampling methods into five types: static, hard, GAN-based,
auxiliary-based, and in-batch methods, providing a clear structure for
understanding negative sampling. Beyond detailed categorization, we highlight
the application of negative sampling in various areas, offering insights into
its practical benefits. Finally, we briefly discuss open problems and future
directions for negative sampling.
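To make two of the surveyed categories concrete, the sketch below contrasts static (uniform) negative sampling with similarity-based hard negative sampling for a single anchor; the scoring function, embedding sizes, and sample counts are illustrative assumptions rather than any specific surveyed method.

```python
import numpy as np

rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 16))        # stand-in item embeddings
anchor = items[0]
positives = {0}                            # indices known to be positive for the anchor

def static_negatives(n):
    # Static sampling: draw negatives uniformly, independent of the anchor.
    candidates = [i for i in range(len(items)) if i not in positives]
    return rng.choice(candidates, size=n, replace=False)

def hard_negatives(n):
    # Hard sampling: prefer candidates the model currently scores closest to the anchor.
    scores = items @ anchor
    ranked = [i for i in np.argsort(-scores) if i not in positives]
    return np.array(ranked[:n])

print("static:", static_negatives(5))
print("hard  :", hard_negatives(5))
```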
ChatGLM-RLHF: Practices of Aligning Large Language Models with Human Feedback
ChatGLM is a free-to-use AI service powered by the ChatGLM family of large
language models (LLMs). In this paper, we present the ChatGLM-RLHF pipeline --
a reinforcement learning from human feedback (RLHF) system -- designed to
enhance ChatGLM's alignment with human preferences. ChatGLM-RLHF encompasses
three major components: the collection of human preference data, the training
of the reward model, and the optimization of policies. Throughout the process
of integrating ChatGLM-RLHF into production, we encountered and addressed
several unprecedented challenges. We introduce strategies to mitigate reward
variance for stabilized large-scale training, implement model parallelism with
fused gradient descent, and design regularization constraints
to avoid catastrophic forgetting in LLMs. Experiments show that ChatGLM-RLHF
brings significant improvements in alignment tasks compared to the supervised
fine-tuned (SFT) version of ChatGLM. For instance, it achieves on average 15\%
more wins against ChatGLM-SFT in Chinese alignment tasks. The work presents our
practices of aligning LLMs with human preferences, offering insights into the
challenges and solutions in RLHF implementations.
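One of the stabilization strategies mentioned above, mitigating reward variance, can be illustrated by per-prompt whitening of reward-model scores before policy optimization; the grouping and normalization below are a generic sketch, not necessarily the exact scheme used in ChatGLM-RLHF.

```python
from collections import defaultdict
import numpy as np

def normalize_rewards_per_prompt(samples, eps=1e-6):
    # samples: list of (prompt_id, reward) pairs from the reward model.
    # Whitening rewards within each prompt group keeps the scale of the
    # policy-gradient signal comparable across prompts.
    by_prompt = defaultdict(list)
    for pid, r in samples:
        by_prompt[pid].append(r)
    stats = {pid: (np.mean(rs), np.std(rs) + eps) for pid, rs in by_prompt.items()}
    return [(pid, (r - stats[pid][0]) / stats[pid][1]) for pid, r in samples]

raw = [("p1", 2.0), ("p1", 4.0), ("p1", 6.0), ("p2", -1.0), ("p2", 1.0)]
print(normalize_rewards_per_prompt(raw))
```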
CritiqueLLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation
Since the natural language processing (NLP) community began using large
language models (LLMs) such as GPT-4 as critics to evaluate the quality of
generated texts, most existing work has only trained critique generation models
of a specific scale on specific datasets. We argue that a comprehensive
investigation of key factors of LLM-based evaluation models, such as scaling
properties, is still lacking, leaving it inconclusive whether these models have
the potential to replace GPT-4's evaluation in practical scenarios. In this
paper, we propose a new critique generation model called CritiqueLLM, which
includes a dialogue-based prompting method for constructing high-quality referenced
and reference-free evaluation data. Experimental results show that our model can
achieve evaluation performance comparable to GPT-4, especially in system-level
correlations, and even outperform GPT-4 in 3 out of 8 tasks in a challenging
reference-free setting. We conduct a detailed analysis to show the promising scaling
properties of our model in the quality of generated critiques. We also
demonstrate that our generated critiques can act as scalable feedback to
directly improve the generation quality of LLMs.
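The dialogue-based prompting idea can be illustrated with a hypothetical template for referenced or reference-free critique generation; the instruction wording, score scale, and message format below are assumptions for illustration, not the prompts used to construct CritiqueLLM's data.

```python
# Hypothetical dialogue-style prompt for critique generation; the actual
# CritiqueLLM data-construction prompts may differ in wording and structure.
def build_critique_messages(instruction, model_response, reference=None):
    task = (
        "You are an evaluation assistant. Critique the response to the user "
        "instruction below, point out concrete strengths and weaknesses, and "
        "end with an overall score from 1 to 10 on the line 'Score: <n>'."
    )
    if reference is not None:
        task += "\nA reference answer is provided; compare the response against it."
    user_turn = f"Instruction:\n{instruction}\n\nResponse:\n{model_response}"
    if reference is not None:
        user_turn += f"\n\nReference:\n{reference}"
    return [{"role": "system", "content": task},
            {"role": "user", "content": user_turn}]

msgs = build_critique_messages("Summarize the plot of Hamlet in two sentences.",
                               "Hamlet is a prince who hesitates to avenge his father.")
for m in msgs:
    print(m["role"], ":", m["content"][:60], "...")
```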