Multiobjective Gate Assignment Based on Passenger Walking Distance and Fairness
Passenger walking distance is an important index of airport service quality. How to shorten walking distance while balancing service quality across airlines is the focus of much research on the airport gate assignment problem. To address these passenger service quality concerns, an optimized gate assignment model is established. The model is based on minimizing the total walking distance of all passengers and balancing the average walking distance of passengers among different airlines. Lingo is used to simulate gate assignment at a large airport. Test results show that the optimization model can effectively reduce the average walking distance of passengers, increase the number of flights assigned to gates, balance airline service quality, and enhance the overall service level of airports and airlines. The model provides a reference for airport gate pre-assignment.
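The two objectives the abstract names lend themselves to a small mixed-integer sketch: total passenger walking distance plus a weighted fairness term over pairwise differences in per-airline average distance. The sketch below uses Python with PuLP; all data (`pax`, `dist`, `airline_of`) and the weight `lam` are hypothetical, and the authors' actual Lingo formulation (e.g., gate time-conflict constraints) is not reproduced here.

```python
import pulp

flights = ["F1", "F2", "F3"]
gates = ["G1", "G2"]
pax = {"F1": 180, "F2": 120, "F3": 90}             # passengers per flight (hypothetical)
dist = {"G1": 200.0, "G2": 550.0}                  # gate-to-exit walking distance, metres
airline_of = {"F1": "A", "F2": "A", "F3": "B"}
airlines = sorted(set(airline_of.values()))
lam = 1.0                                           # fairness weight (assumption)

prob = pulp.LpProblem("gate_assignment", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (flights, gates), cat="Binary")

# Each flight is assigned exactly one gate (time-conflict constraints omitted).
for f in flights:
    prob += pulp.lpSum(x[f][g] for g in gates) == 1

# Passenger-weighted average walking distance per airline.
avg = {
    a: pulp.lpSum(pax[f] * dist[g] * x[f][g]
                  for f in flights if airline_of[f] == a for g in gates)
       / sum(pax[f] for f in flights if airline_of[f] == a)
    for a in airlines
}

# One-sided deviations; both orderings together bound the absolute gap
# |avg[a] - avg[b]| from above when the objective is minimized.
dev = pulp.LpVariable.dicts("dev", (airlines, airlines), lowBound=0)
for a in airlines:
    for b in airlines:
        prob += avg[a] - avg[b] <= dev[a][b]

# Objective: total walking distance plus the fairness penalty.
prob += (pulp.lpSum(pax[f] * dist[g] * x[f][g] for f in flights for g in gates)
         + lam * pulp.lpSum(dev[a][b] for a in airlines for b in airlines))

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for f in flights:
    print(f, "->", next(g for g in gates if x[f][g].value() == 1))
```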
Understanding Emergent Abilities of Language Models from the Loss Perspective
Recent studies have called into question the belief that emergent abilities in language models are exclusive to large models. This skepticism arises from two observations: 1) smaller models can also exhibit high performance on emergent abilities, and 2) there are doubts about the discontinuous metrics used to measure these abilities. In this paper, we propose to study emergent abilities through the lens of pre-training loss, instead of model size or training compute. We demonstrate that models with the same pre-training loss, but different model and data sizes, achieve the same performance on various downstream tasks. We also discover that a model exhibits emergent abilities on certain tasks -- regardless of the continuity of metrics -- when its pre-training loss falls below a specific threshold. Before reaching this threshold, its performance remains at the level of random guessing. This inspires us to redefine emergent abilities as those that manifest in models with lower pre-training losses, highlighting that these abilities cannot be predicted by merely extrapolating the performance trends of models with higher pre-training losses.
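The central claim can be pictured with a toy calculation (synthetic numbers, not the paper's data): if downstream accuracy is a function of pre-training loss alone, then checkpoints of different sizes that reach the same loss score the same, and performance sits at chance until the loss crosses a threshold.

```python
# Toy illustration of the loss-threshold view of emergence; all constants
# (CHANCE, THRESHOLD, the linear rise) are hypothetical.
CHANCE = 0.25          # e.g., 4-way multiple-choice baseline
THRESHOLD = 2.2        # hypothetical pre-training loss where the ability emerges

def downstream_accuracy(pretrain_loss: float) -> float:
    """Accuracy stays at chance until loss drops below the threshold."""
    if pretrain_loss >= THRESHOLD:
        return CHANCE
    return CHANCE + 0.5 * (THRESHOLD - pretrain_loss)

for family, losses in {"1B-params": [2.8, 2.4, 2.0], "10B-params": [2.4, 2.0, 1.6]}.items():
    print(family, [(l, round(downstream_accuracy(l), 3)) for l in losses])
# Checkpoints from different families with equal loss (e.g., 2.0) print equal
# accuracy, and extrapolating the at-chance regime would not predict the jump.
```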
Revisiting Parallel Context Windows: A Frustratingly Simple Alternative and Chain-of-Thought Deterioration
We identify two crucial limitations in the evaluation of the recent parallel-integration method Parallel Context Windows (PCW), which extends the maximum context length of language models, e.g., 2048 for LLaMA, by harnessing window-wise attention and positional embedding techniques. We first show that a simple yet strong baseline, the weighted sum ensemble, is missing from the in-context few-shot classification evaluation. Moreover, on more challenging Chain-of-Thought (CoT) reasoning (e.g., HotpotQA), PCW exhibits unexpected deterioration in the form of question miscomprehension and false inference. Based on our findings, we suggest that the existing PCW design may not guarantee sufficient improvement and practicality for handling lengthy documents in real-world applications. More community effort should be devoted to enabling language models' long-context understanding.
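A minimal sketch of what such a weighted-sum-ensemble baseline could look like: score the query under each demonstration window independently, then combine the per-window class distributions. The numbers below are made up, and the paper's exact weighting scheme is not specified here.

```python
import numpy as np

def weighted_sum_ensemble(per_window_probs: np.ndarray, weights=None) -> int:
    """per_window_probs: (n_windows, n_labels) class distributions, one per
    demonstration window; returns the label with the highest combined score."""
    n_windows = per_window_probs.shape[0]
    w = np.full(n_windows, 1.0 / n_windows) if weights is None else np.asarray(weights)
    combined = (w[:, None] * per_window_probs).sum(axis=0)
    return int(np.argmax(combined))

# Three windows of few-shot demos, binary classification (illustrative values):
probs = np.array([[0.6, 0.4], [0.7, 0.3], [0.4, 0.6]])
print(weighted_sum_ensemble(probs))  # -> 0
```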
AgentTuning: Enabling Generalized Agent Abilities for LLMs
Open large language models (LLMs) with strong performance across various tasks have significantly advanced the development of the field. However, they remain far inferior to commercial models such as ChatGPT and GPT-4 when acting as agents to tackle complex real-world tasks. These agent tasks employ LLMs as the central controller responsible for planning, memorization, and tool utilization, necessitating both fine-grained prompting methods and robust LLMs to achieve satisfactory performance. Though many prompting methods have been proposed to complete particular agent tasks, there is a lack of research focused on improving the agent capabilities of LLMs themselves without compromising their general abilities. In this work, we present AgentTuning, a simple and general method to enhance the agent abilities of LLMs while maintaining their general LLM capabilities. We construct AgentInstruct, a lightweight instruction-tuning dataset containing high-quality interaction trajectories, and employ a hybrid instruction-tuning strategy that combines AgentInstruct with open-source instructions from general domains. AgentTuning is used to instruction-tune the Llama 2 series, resulting in AgentLM. Our evaluations show that AgentTuning endows LLMs with agent capabilities without compromising their general abilities. AgentLM-70B is comparable to GPT-3.5-turbo on unseen agent tasks, demonstrating generalized agent capabilities. We open-source AgentInstruct and the AgentLM-7B, 13B, and 70B models at https://github.com/THUDM/AgentTuning, serving as open and powerful alternatives to commercial LLMs for agent tasks.
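In data-loading code, a hybrid mixture of this kind can be sketched as sampling each training example from the agent pool with some probability and from the general pool otherwise. The mixing ratio `eta` and the sampler below are assumptions, not the paper's actual recipe.

```python
import random

def mixed_batch(agent_pool, general_pool, batch_size, eta=0.2, seed=0):
    """Draw each example from the agent-trajectory pool with probability eta,
    else from the general-domain instruction pool."""
    rng = random.Random(seed)
    return [rng.choice(agent_pool) if rng.random() < eta else rng.choice(general_pool)
            for _ in range(batch_size)]

agent_pool = ["<agent trajectory 1>", "<agent trajectory 2>"]
general_pool = ["<general instruction 1>", "<general instruction 2>"]
print(mixed_batch(agent_pool, general_pool, batch_size=4))
```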
ChatGLM-RLHF: Practices of Aligning Large Language Models with Human Feedback
ChatGLM is a free-to-use AI service powered by the ChatGLM family of large language models (LLMs). In this paper, we present the ChatGLM-RLHF pipeline -- a reinforcement learning from human feedback (RLHF) system -- designed to enhance ChatGLM's alignment with human preferences. ChatGLM-RLHF encompasses three major components: the collection of human preference data, the training of the reward model, and the optimization of policies. Throughout the process of integrating ChatGLM-RLHF into production, we encountered and addressed several unprecedented challenges. We introduce strategies to mitigate reward variance for stabilized large-scale training, implement model parallelism with fused gradient descent, and design regularization constraints to avoid catastrophic forgetting in LLMs. Experiments show that ChatGLM-RLHF brings significant improvements in alignment tasks compared to the supervised fine-tuned (SFT) version of ChatGLM. For instance, it achieves on average 15% more wins against ChatGLM-SFT in Chinese alignment tasks. This work presents our practices of aligning LLMs with human preferences, offering insights into the challenges and solutions in RLHF implementations.
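Two of the named stabilizers have standard forms, sketched below under stated assumptions: per-prompt reward whitening to reduce reward variance, and a KL penalty against the SFT reference to guard against catastrophic forgetting. The coefficient `beta` and the exact ChatGLM-RLHF formulations are assumptions, not the paper's specification.

```python
import numpy as np

def normalize_rewards_per_prompt(rewards: np.ndarray) -> np.ndarray:
    """Whiten reward-model scores within each prompt's group of samples
    (rewards: (n_prompts, n_samples)) to reduce cross-prompt reward variance."""
    mu = rewards.mean(axis=1, keepdims=True)
    sd = rewards.std(axis=1, keepdims=True) + 1e-8
    return (rewards - mu) / sd

def shaped_reward(rm_score, logp_policy, logp_reference, beta=0.05):
    """KL-penalized reward: discourage drifting from the SFT reference model,
    a common guard against catastrophic forgetting (beta is hypothetical)."""
    return rm_score - beta * (logp_policy - logp_reference)

print(normalize_rewards_per_prompt(np.array([[2.0, 4.0], [-1.0, 1.0]])))
print(shaped_reward(rm_score=1.2, logp_policy=-10.0, logp_reference=-11.5))
```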
xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited to either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. We propose a unified protein language model, xTrimoPGLM, to address these two types of tasks simultaneously through an innovative pre-training framework. Our key technical contribution is an exploration of the compatibility and the potential for joint optimization of the two types of objectives, which has led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. Our extensive experiments reveal that 1) xTrimoPGLM significantly outperforms other advanced baselines on 18 protein understanding benchmarks across four categories. The model also facilitates an atomic-resolution view of protein structures, leading to an advanced 3D structural prediction model that surpasses existing language-model-based tools. 2) xTrimoPGLM can not only generate de novo protein sequences following the principles of natural ones, but also perform programmable generation after supervised fine-tuning (SFT) on curated sequences. These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences, contributing to the evolving landscape of foundation models in protein science.
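One way to see how a single objective can serve both understanding and generation is a GLM-style span-corruption example: the model encodes the corrupted sequence bidirectionally and recovers the masked span autoregressively. The span positions and special tokens below are illustrative, not xTrimoPGLM's actual pre-training recipe.

```python
def glm_example(seq: str, span_start: int, span_len: int):
    """Build (input, target) for span corruption: the input sees [MASK] where
    the span was; the target reproduces the span after a [sop] start token."""
    span = seq[span_start:span_start + span_len]
    corrupted = seq[:span_start] + "[MASK]" + seq[span_start + span_len:]
    model_input = corrupted + "[sop]"   # autoregressive generation continues here
    target = span + "[eop]"
    return model_input, target

protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence
inp, tgt = glm_example(protein, span_start=8, span_len=6)
print(inp)
print(tgt)
```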
CritiqueLLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation
Since the natural language processing (NLP) community started to make large language models (LLMs), such as GPT-4, act as critics to evaluate the quality of generated texts, most work has trained only a critique generation model of a specific scale on specific datasets. We argue that a comprehensive investigation of the key factors of LLM-based evaluation models, such as scaling properties, is lacking, so it remains inconclusive whether these models have the potential to replace GPT-4's evaluation in practical scenarios. In this paper, we propose a new critique generation model called CritiqueLLM, which includes a dialogue-based prompting method for collecting high-quality referenced / reference-free evaluation data. Experimental results show that our model can achieve evaluation performance comparable to GPT-4, especially in system-level correlations, and even outperform GPT-4 in 3 out of 8 tasks in a challenging reference-free setting. We conduct detailed analysis to show the promising scaling properties of our model in the quality of generated critiques. We also demonstrate that our generated critiques can act as scalable feedback to directly improve the generation quality of LLMs.
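System-level correlation, the setting where CritiqueLLM is reported to match GPT-4, averages each system's per-example scores before correlating. A small sketch with made-up scores, using SciPy's Spearman correlation:

```python
import numpy as np
from scipy.stats import spearmanr

def system_level_spearman(model_by_system: dict, human_by_system: dict) -> float:
    """Correlate per-system mean scores from the model critic and from humans."""
    systems = sorted(model_by_system)
    model_means = [np.mean(model_by_system[s]) for s in systems]
    human_means = [np.mean(human_by_system[s]) for s in systems]
    rho, _ = spearmanr(model_means, human_means)
    return rho

model = {"sysA": [7, 8, 6], "sysB": [5, 4, 6], "sysC": [9, 9, 8]}  # illustrative
human = {"sysA": [6, 7, 7], "sysB": [4, 5, 5], "sysC": [9, 8, 9]}
print(round(system_level_spearman(model, human), 3))
```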
Grafted c-kit+/SSEA1− eye-wall progenitor cells delay retinal degeneration in mice by regulating neural plasticity and forming new graft-to-host synapses
Background: Despite diverse pathogenesis, the common pathological change observed in age-related macular degeneration and in most hereditary retinal degeneration (RD) diseases is photoreceptor loss. Photoreceptor replacement by cell transplantation may be a feasible treatment for RD. The major obstacles to clinical translation of stem-cell-based therapy in RD remain the difficulty of obtaining sufficient quantities of appropriate and safe donor cells and the poor integration of grafted stem-cell-derived photoreceptors into the remaining retinal circuitry.

Methods: Eye-wall c-kit+/stage-specific embryonic antigen 1 (SSEA1)− cells were isolated via fluorescence-activated cell sorting, and their self-renewal and differentiation potential were assessed by immunochemistry and flow cytometry in vitro. After labeling with quantum nanocrystal dots and transplantation into the subretinal space of rd1 RD mice, differentiation and synapse formation by daughter cells of the eye-wall c-kit+/SSEA1− cells were evaluated by immunochemistry and western blotting. Morphological changes of the inner retina of rd1 mice after cell transplantation were demonstrated by immunochemistry. Retinal function of rd1 mice that received cell grafts was tested via flash electroretinograms and the light/dark transition test.

Results: Eye-wall c-kit+/SSEA1− cells were self-renewing and clonogenic, and they retained their proliferative potential through more than 20 passages. Additionally, they were capable of differentiating into multiple retinal cell types, including photoreceptors, bipolar cells, horizontal cells, amacrine cells, Müller cells, and retinal pigment epithelium cells, and of transdifferentiating into smooth muscle cells and endothelial cells in vitro. The levels of synaptophysin and postsynaptic density-95 in the retinas of cell-transplanted rd1 mice were significantly increased at 4 weeks post-transplantation. The c-kit+/SSEA1− cells were capable of differentiating into functional photoreceptors that formed new synaptic connections with recipient retinas in rd1 mice. Transplantation also partially corrected the abnormalities of the inner retina of rd1 mice. At 4 and 8 weeks post-transplantation, the rd1 mice that received c-kit+/SSEA1− cells showed significant increases in a-wave and b-wave amplitude and in the percentage of time spent in the dark area.

Conclusions: Grafted c-kit+/SSEA1− cells restored the retinal function of rd1 mice by regulating neural plasticity and forming new graft-to-host synapses.
GLM-130B: An Open Bilingual Pre-trained Model
We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and to unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we faced numerous unexpected technical and engineering challenges, particularly with loss spikes and divergence. In this paper, we introduce the training process of GLM-130B, including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resulting GLM-130B model significantly outperforms GPT-3 175B (davinci) on a wide range of popular English benchmarks, a performance advantage not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B -- the largest Chinese language model -- across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post-training, with almost no performance loss, making it the first among 100B-scale models and, more importantly, allowing its effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible, and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/.
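For intuition, weight-only INT4 quantization of the kind referred to here can be sketched as symmetric per-channel rounding to the 16 levels in [-8, 7]. This generic illustration is not GLM-130B's actual kernel or the scaling analysis that makes it lossless there.

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Per-output-channel symmetric quantization to INT4 levels [-8, 7]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # packed 2-per-byte in practice
    return q, scale

def dequantize_int4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int4(w)
err = np.abs(w - dequantize_int4(q, s)).mean()
print(f"mean abs quantization error: {err:.4f}")
```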
AgentBench: Evaluating LLMs as Agents
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there is an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn, open-ended generation setting. Our extensive test of 27 API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability to act as agents in complex environments, there is a significant performance disparity between them and their OSS competitors. We identify the typical causes of failure in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction-following abilities are the main obstacles to developing usable LLM agents. Training on code and high-quality multi-turn alignment data could improve agent performance. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.
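The multi-turn, open-ended setting reduces to an observation-action loop with a turn budget; the sketch below uses hypothetical `env` and `llm_act` stand-ins, not the released package's API.

```python
def run_episode(env, llm_act, max_turns: int = 10) -> float:
    """Alternate observation -> model action until the environment terminates;
    return the final task reward (env and llm_act are hypothetical)."""
    observation = env.reset()
    history = []
    for _ in range(max_turns):
        action = llm_act(history, observation)       # model decides from the dialogue so far
        observation, reward, done = env.step(action)
        history.append((action, observation))
        if done:
            return reward
    return 0.0  # ran out of turns: counts as failure
```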