Revisiting Parallel Context Windows: A Frustratingly Simple Alternative and Chain-of-Thought Deterioration
We identify two crucial limitations in the evaluation of the recent
parallel-integration method Parallel Context Windows (PCW), which extends the
maximum context length of language models (e.g., 2048 for LLaMA) by harnessing
window-wise attention and positional embedding techniques. We first show that a
simple yet strong baseline, the weighted-sum ensemble, is missing from the
in-context few-shot classification evaluation. Moreover, on more challenging
Chain-of-Thought (CoT) reasoning tasks (e.g., HotpotQA), PCW exhibits unexpected
deterioration in the form of question miscomprehension and false inference. Based on
our findings, we suggest that the existing PCW design does not guarantee
sufficient improvement and practicality for handling lengthy documents in
real-world applications. More community effort should be devoted to enabling
language models' long-context understanding ability.
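As a rough illustration of the missing baseline, the sketch below scores each context window of demonstrations separately and combines the per-window class probabilities by a weighted sum. The window-splitting and weighting scheme here are illustrative assumptions, not the paper's implementation.

```python
def weighted_sum_ensemble(per_window_probs, weights=None):
    """Combine per-window class probabilities by a weighted sum.

    per_window_probs: one probability list per context window, each
    obtained by running the LM on that window's demonstrations alone
    (how probabilities are obtained is left abstract here).
    weights: optional per-window weights; defaults to uniform.
    """
    n = len(per_window_probs)
    if weights is None:
        weights = [1.0] * n
    total = sum(weights)
    n_classes = len(per_window_probs[0])
    combined = [0.0] * n_classes
    for w, probs in zip(weights, per_window_probs):
        for c, p in enumerate(probs):
            combined[c] += (w / total) * p  # weighted average over windows
    label = max(range(n_classes), key=lambda c: combined[c])
    return label, combined

# Two windows disagree; the ensemble averages their votes.
label, probs = weighted_sum_ensemble([[0.7, 0.3], [0.4, 0.6]])
```

The appeal of this baseline is that it needs no modified attention or positional embeddings: each window fits the model's native context length, and only the output distributions are merged.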
AgentTuning: Enabling Generalized Agent Abilities for LLMs
Open large language models (LLMs) with strong performance on various tasks
have significantly advanced the development of the field. However, they are far
inferior to commercial models such as ChatGPT and GPT-4 when acting as agents
to tackle complex tasks in the real world. These agent tasks employ LLMs as the
central controller responsible for planning, memorization, and tool
utilization, necessitating both fine-grained prompting methods and robust LLMs
to achieve satisfactory performance. Though many prompting methods have been
proposed to complete particular agent tasks, there is a lack of research focused
on improving the agent capabilities of LLMs themselves without compromising
their general abilities. In this work, we present AgentTuning, a simple and
general method to enhance the agent abilities of LLMs while maintaining their
general LLM capabilities. We construct AgentInstruct, a lightweight
instruction-tuning dataset containing high-quality interaction trajectories. We
employ a hybrid instruction-tuning strategy by combining AgentInstruct with
open-source instructions from general domains. AgentTuning is used to
instruction-tune the Llama 2 series, resulting in AgentLM. Our evaluations show
that AgentTuning enables LLMs' agent capabilities without compromising general
abilities. The AgentLM-70B is comparable to GPT-3.5-turbo on unseen agent
tasks, demonstrating generalized agent capabilities. We open source the
AgentInstruct and AgentLM-7B, 13B, and 70B models at
https://github.com/THUDM/AgentTuning, serving as open and powerful alternatives
to commercial LLMs for agent tasks.
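The hybrid instruction-tuning strategy can be sketched as sampling a training mixture in which a fraction of examples comes from agent trajectories and the rest from general open-source instructions. The mixing ratio and the function below are illustrative assumptions; the abstract does not specify the exact procedure.

```python
import random

def mix_instruction_data(agent_data, general_data, agent_ratio, n_samples, seed=0):
    """Sample a hybrid instruction-tuning set: a fraction `agent_ratio`
    of examples comes from agent interaction trajectories, the rest from
    general-domain instructions. The ratio is a tunable hyperparameter."""
    rng = random.Random(seed)
    n_agent = int(round(agent_ratio * n_samples))
    mixed = ([rng.choice(agent_data) for _ in range(n_agent)] +
             [rng.choice(general_data) for _ in range(n_samples - n_agent)])
    rng.shuffle(mixed)  # interleave the two sources
    return mixed

# Toy corpora standing in for AgentInstruct and general instructions.
agent = [{"source": "agent", "id": i} for i in range(100)]
general = [{"source": "general", "id": i} for i in range(1000)]
batch = mix_instruction_data(agent, general, agent_ratio=0.2, n_samples=50)
```

Keeping general-domain instructions in the mix is what lets the tuned model gain agent skills without losing its general abilities.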
Incidence of and factors associated with early non-response in first-treatment and drug-naïve patients with schizophrenia: a real-world study
Background: Schizophrenia is a severe and persistent mental condition that causes disability. For subsequent clinical care, it is extremely useful to differentiate early between patients who respond quickly to therapy and those who do not. This study set out to document the incidence of, and risk factors for, early non-response.
Methods: The study included 143 individuals with first-treatment and drug-naïve (FTDN) schizophrenia. Patients were classified as early non-responders if their Positive and Negative Syndrome Scale (PANSS) score fell by less than 20% after 2 weeks of treatment, and as early responders otherwise. Differences in demographic and general clinical data between the clinical subgroups were compared, and variables related to early non-response to therapy were examined.
Results: After two weeks, 73 patients were classified as early non-responders, an incidence of 51.05%. The early non-response subgroup had significantly higher PANSS scores, positive symptom subscale (PSS) scores, general psychopathology subscale (GPS) scores, Clinical Global Impression-Severity of Illness (CGI-SI) scores and fasting blood glucose (FBG) levels than the early-response subgroup. CGI-SI and FBG were risk factors for early non-response.
Conclusion: High rates of early non-response were seen in FTDN schizophrenia patients, and CGI-SI scores and FBG levels are risk variables for predicting it. However, more in-depth studies are needed to confirm how far these two parameters generalize.
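The classification rule described above reduces to a simple threshold on the PANSS score reduction. The sketch below uses the plain percent reduction; note that some studies instead compute the reduction relative to the PANSS minimum score of 30, and the abstract does not state which formula was used.

```python
def is_early_non_responder(panss_baseline, panss_week2, threshold=0.20):
    """Classify early non-response as a PANSS total score reduction
    of less than `threshold` (20%) after 2 weeks of treatment.

    Assumption: plain percent reduction. Some studies use
    (baseline - week2) / (baseline - 30), since 30 is the PANSS
    minimum; the abstract does not specify the exact formula.
    """
    reduction = (panss_baseline - panss_week2) / panss_baseline
    return reduction < threshold

# A drop from 100 to 85 is a 15% reduction -> early non-responder.
print(is_early_non_responder(100, 85))  # True
print(is_early_non_responder(100, 75))  # False (25% reduction)
```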
CritiqueLLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation
Since the natural language processing (NLP) community started to use large
language models (LLMs) such as GPT-4 as critics to evaluate the quality of
generated texts, most prior work has trained only a critique generation model of a
specific scale on specific datasets. We argue that a comprehensive
investigation into the key factors of LLM-based evaluation models, such as scaling
properties, is lacking, leaving it inconclusive whether these models
have the potential to replace GPT-4's evaluation in practical scenarios. In this
paper, we propose a new critique generation model called CritiqueLLM, which
includes a dialogue-based prompting method for high-quality referenced /
reference-free evaluation data. Experimental results show that our model can
achieve comparable evaluation performance to GPT-4 especially in system-level
correlations, and even outperform GPT-4 in 3 out of 8 tasks in a challenging
reference-free setting. We conduct detailed analysis to show promising scaling
properties of our model in the quality of generated critiques. We also
demonstrate that our generated critiques can act as scalable feedback to
directly improve the generation quality of LLMs.
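One way to picture referenced versus reference-free evaluation is a single prompt scaffold that optionally includes a reference answer. The template below is purely illustrative and is not CritiqueLLM's actual prompt.

```python
def build_critique_prompt(question, answer, reference=None):
    """Assemble an evaluation prompt for an LLM critic. The wording is
    an illustrative assumption, not CritiqueLLM's template; the point
    is that one scaffold covers referenced and reference-free modes."""
    parts = [
        "You are an evaluator of generated text.",
        f"Question: {question}",
        f"Model answer: {answer}",
    ]
    if reference is not None:  # referenced evaluation
        parts.append(f"Reference answer: {reference}")
    parts.append("Write a critique explaining the answer's strengths and "
                 "weaknesses, then give a score from 1 to 10.")
    return "\n".join(parts)
```

In the reference-free setting the critic must judge quality from the question and answer alone, which is the more challenging regime in which the abstract reports outperforming GPT-4 on 3 of 8 tasks.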
xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Protein language models have shown remarkable success in learning biological
information from protein sequences. However, most existing models are limited
by either autoencoding or autoregressive pre-training objectives, which makes
them struggle to handle protein understanding and generation tasks
concurrently. We propose a unified protein language model, xTrimoPGLM, to
address these two types of tasks simultaneously through an innovative
pre-training framework. Our key technical contribution is an exploration of the
compatibility and the potential for joint optimization of the two types of
objectives, which has led to a strategy for training xTrimoPGLM at an
unprecedented scale of 100 billion parameters and 1 trillion training tokens.
Our extensive experiments reveal that 1) xTrimoPGLM significantly outperforms
other advanced baselines in 18 protein understanding benchmarks across four
categories. The model also facilitates an atomic-resolution view of protein
structures, leading to an advanced 3D structural prediction model that
surpasses existing language model-based tools. 2) xTrimoPGLM can not only
generate de novo protein sequences following the principles of natural ones,
but also perform programmable generation after supervised fine-tuning (SFT)
on curated sequences. These results highlight the substantial capability and
versatility of xTrimoPGLM in understanding and generating protein sequences,
contributing to the evolving landscape of foundation models in protein science.
GLM-130B: An Open Bilingual Pre-trained Model
We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language
model with 130 billion parameters. It is an attempt to open-source a 100B-scale
model at least as good as GPT-3 (davinci) and unveil how models of such a scale
can be successfully pre-trained. Over the course of this effort, we face
numerous unexpected technical and engineering challenges, particularly on loss
spikes and divergence. In this paper, we introduce the training process of
GLM-130B including its design choices, training strategies for both efficiency
and stability, and engineering efforts. The resultant GLM-130B model
significantly outperforms GPT-3 175B (davinci) on a wide range of popular
English benchmarks, an advantage not observed in OPT-175B and
BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN
3.0 260B -- the largest Chinese language model -- across related benchmarks.
Finally, we leverage a unique scaling property of GLM-130B to reach INT4
quantization without post training, with almost no performance loss, making it
the first among 100B-scale models and more importantly, allowing its effective
inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the
most affordable GPUs required for using 100B-scale models. The GLM-130B model
weights are publicly accessible and its code, training logs, related toolkit,
and lessons learned are open-sourced at
\url{https://github.com/THUDM/GLM-130B/}. Accepted to ICLR 2023.
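Back-of-the-envelope arithmetic shows why INT4 quantization makes the hardware claim plausible. The figures below cover model weights only and ignore activations, the KV cache, and framework overhead.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for model weights alone (GiB); ignores
    activations, KV cache, and framework overhead."""
    return n_params * bits_per_param / 8 / 2**30

n = 130e9  # GLM-130B parameters
fp16 = weight_memory_gib(n, 16)  # ~242 GiB: far beyond a single node of consumer GPUs
int4 = weight_memory_gib(n, 4)   # ~61 GiB: fits 4 x 24 GB (RTX 3090)
                                 # or 8 x 11 GB (RTX 2080 Ti)
```

Quantizing to 4 bits per weight cuts the footprint by 4x relative to FP16, which is what brings a 100B-scale model within reach of the consumer GPUs named in the abstract.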
Impact of emission controls on air quality in Beijing during APEC 2014: implications from water-soluble ions and carbonaceous aerosol in PM2.5 and their precursors
Stringent emission controls during the Asia-Pacific Economic Cooperation Summit (APEC; November 5–11, 2014) provide a valuable opportunity to examine the impact of such measures on the chemical properties of PM2.5 and other air pollutants. Here, we measured the water-soluble inorganic ions (WSII) and carbonaceous species in PM2.5, NH3 and NO2 at multiple sites in Beijing between September and November 2014. Relative to the pre-APEC period (September and October 2014), significant reductions in the average concentrations of WSII (69% for NO3−, 68% for SO42−, 78% for NH4+, and 29–71% for other species), elemental carbon (EC, 43%) and organic carbon (OC, 45%) in PM2.5 were found during the APEC period. The contributions of secondary inorganic ions (SIA, comprising SO42−, NO3−, and NH4+) to PM2.5 were significantly lower during the APEC period (9–44%), indicating a combination of lower gaseous precursor emissions and relatively weak secondary aerosol formation. Ion-balance calculations indicated that PM2.5 samples were alkaline in the pre-APEC period but acidic during the APEC period. Relatively low mean concentrations of EC (1.5 μg m−3), OC (10.5 μg m−3), secondary organic carbon (SOC, 3.3 μg m−3), secondary organic aerosol (SOA, 5.9 μg m−3) and primary organic aerosol (POA, 10.0 μg m−3) were observed during the APEC period. The average concentrations of NH3 and NO2 at all road sites were significantly reduced, by 48% and 60% respectively, during the APEC period, consistent with clear reductions in satellite NH3 columns over Beijing in the same period. This finding suggests that reducing traffic emissions could be a feasible way to control urban NH3 pollution. During the APEC period, concentrations of PM2.5, PM10, NO2, SO2 and CO from the Beijing city monitoring network showed significant reductions at urban (20–60%) and rural (18–57%) sites, whereas O3 concentrations increased significantly (by 93% and 53%, respectively).
The control measures taken during the APEC period thus substantially reduced PM2.5 pollution but can increase ground-level O3, which also merits attention.
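The ion-balance acidity assessment mentioned above is commonly done by comparing cation and anion equivalents; the exact procedure and ion set used in the study are not given in the abstract, so the sketch below is a simplified assumption covering only the three major secondary ions.

```python
# Molar masses (g/mol), charges, and type of the major water-soluble ions.
IONS = {
    "SO4^2-": (96.06, 2, "anion"),
    "NO3^-":  (62.00, 1, "anion"),
    "NH4^+":  (18.04, 1, "cation"),
}

def ion_balance(conc_ug_m3):
    """Return the cation/anion equivalent ratio for measured ion mass
    concentrations (ug/m3). A ratio > 1 suggests alkaline particles,
    < 1 acidic ones (a simplified screening criterion)."""
    cat = an = 0.0
    for ion, c in conc_ug_m3.items():
        molar_mass, charge, kind = IONS[ion]
        eq = c / molar_mass * charge  # micro-equivalents per m3
        if kind == "cation":
            cat += eq
        else:
            an += eq
    return cat / an

# Ammonium-rich sample -> ratio above 1 (alkaline under this criterion).
ratio = ion_balance({"SO4^2-": 10.0, "NO3^-": 12.0, "NH4^+": 8.0})
```

A fuller analysis would include the remaining measured ions (e.g., Ca2+, K+, Cl−) in the same equivalent sums; the cutoff at 1 is the standard charge-balance reading of cation deficit versus excess.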