83 research outputs found
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Evaluating the general abilities of foundation models to tackle human-level
tasks is a vital aspect of their development and application in the pursuit of
Artificial General Intelligence (AGI). Traditional benchmarks, which rely on
artificial datasets, may not accurately represent human-level capabilities. In
this paper, we introduce AGIEval, a novel benchmark specifically designed to
assess foundation models in the context of human-centric standardized exams,
such as college entrance exams, law school admission tests, math competitions,
and lawyer qualification tests. We evaluate several state-of-the-art foundation
models, including GPT-4, ChatGPT, and Text-Davinci-003, using this benchmark.
Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math
competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5%
accuracy on the English test of the Chinese national college entrance exam.
This demonstrates the extraordinary performance of contemporary foundation
models. In contrast, we also find that GPT-4 is less proficient in tasks that
require complex reasoning or specific domain knowledge. Our comprehensive
analyses of model capabilities (understanding, knowledge, reasoning, and
calculation) reveal these models' strengths and limitations, providing valuable
insights into future directions for enhancing their general capabilities. By
concentrating on tasks pertinent to human cognition and decision-making, our
benchmark delivers a more meaningful and robust evaluation of foundation
models' performance in real-world scenarios. The data, code, and all model
outputs are released at https://github.com/microsoft/AGIEval.
Comment: 19 pages
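Exam-style evaluation of the kind described above typically reduces to exact-match accuracy over multiple-choice answers. The sketch below is a minimal, hypothetical illustration of that scoring loop; the question fields and the `predict` callable are assumptions for illustration, not AGIEval's actual data format or API.

```python
# Hypothetical sketch: scoring a model on multiple-choice exam questions.
# The dict fields and predict() signature are illustrative assumptions.

def accuracy(questions, predict):
    """Fraction of questions where the predicted option letter matches the gold label."""
    correct = sum(1 for q in questions
                  if predict(q["passage"], q["options"]) == q["label"])
    return correct / len(questions)

sample = [
    {"passage": "If 2x + 3 = 11, what is x?",
     "options": ["A) 3", "B) 4", "C) 5"], "label": "B"},
    {"passage": "Which of the following is prime?",
     "options": ["A) 9", "B) 15", "C) 7"], "label": "C"},
]

# A trivial stand-in "model" that always picks the first option;
# a real evaluation would call an LLM here instead.
baseline = lambda passage, options: options[0][0]
print(accuracy(sample, baseline))  # 0.0 for this always-A baseline
```

In practice the per-exam accuracies (SAT, LSAT, Gaokao English, etc.) would be computed separately with the same loop and then compared against average human performance.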
CMB: A Comprehensive Medical Benchmark in Chinese
Large Language Models (LLMs) offer the possibility of a major breakthrough in
medicine. Establishing a standardized medical benchmark is a fundamental
cornerstone for measuring progress. However, medical environments in different
regions have their own local characteristics, e.g., the ubiquity and
significance of traditional Chinese medicine within China. Therefore, merely
translating English-based medical evaluations may result in \textit{contextual
incongruities} for a local region. To address this issue, we
propose a localized medical benchmark called CMB, a Comprehensive Medical
Benchmark in Chinese, designed and rooted entirely within the native Chinese
linguistic and cultural framework. While traditional Chinese medicine is
integral to this evaluation, it does not constitute its entirety. Using this
benchmark, we have evaluated several prominent large-scale LLMs, including
ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical
domain. It is worth noting that our benchmark is not devised as a leaderboard
competition but as an instrument for self-assessment of model advancements. We
hope this benchmark could facilitate the widespread adoption and enhancement of
medical LLMs within China. Details are available at
\url{https://cmedbenchmark.llmzoo.com/}.
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
Despite the existence of various benchmarks for evaluating natural language
processing models, we argue that human exams are a more suitable means of
evaluating general intelligence for large language models (LLMs), as they
inherently demand a much wider range of abilities such as language
understanding, domain knowledge, and problem-solving skills. To this end, we
introduce M3Exam, a novel benchmark sourced from real and official human exam
questions for evaluating LLMs in a multilingual, multimodal, and multilevel
context. M3Exam exhibits three unique characteristics: (1) multilingualism,
encompassing questions from multiple countries that require strong multilingual
proficiency and cultural knowledge; (2) multimodality, accounting for the
multimodal nature of many exam questions to test the model's multimodal
understanding capability; and (3) multilevel structure, featuring exams from
three critical educational periods to comprehensively assess a model's
proficiency at different levels. In total, M3Exam contains 12,317 questions in
9 diverse languages with three educational levels, where about 23\% of the
questions require processing images for successful solving. We assess the
performance of top-performing LLMs on M3Exam and find that current models,
including GPT-4, still struggle with multilingual text, particularly in
low-resource and non-Latin script languages. Multimodal LLMs also perform
poorly with complex multimodal questions. We believe that M3Exam can be a
valuable resource for comprehensively evaluating LLMs by examining their
multilingual and multimodal abilities and tracking their development. Data and
evaluation code are available at \url{https://github.com/DAMO-NLP-SG/M3Exam}.
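The per-language findings reported above (e.g., weaker performance on low-resource and non-Latin-script languages) come from slicing overall accuracy by language. A minimal sketch of that breakdown follows; the record fields (`language`, `pred`, `gold`) are illustrative assumptions, not M3Exam's actual schema.

```python
# Hypothetical sketch: breaking benchmark accuracy down by language,
# as one would to compare results across M3Exam's 9 languages.
from collections import defaultdict

def accuracy_by_language(records):
    """records: iterable of dicts with 'language', 'pred', and 'gold' keys."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["language"]] += 1
        hits[r["language"]] += int(r["pred"] == r["gold"])
    return {lang: hits[lang] / totals[lang] for lang in totals}

records = [
    {"language": "en", "pred": "A", "gold": "A"},
    {"language": "en", "pred": "B", "gold": "C"},
    {"language": "th", "pred": "D", "gold": "D"},
]
print(accuracy_by_language(records))  # {'en': 0.5, 'th': 1.0}
```

The same grouping applies along the benchmark's other two axes (modality and educational level) by swapping the key field.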
Explicit Contextual Semantics for Text Comprehension
Who did what to whom is a major focus in natural language understanding, and
it is precisely the aim of the semantic role labeling (SRL) task. Although the
two tasks share many processing characteristics and even a common purpose, it
is surprising that jointly considering them has never been formally reported
in previous work. This paper therefore makes the first attempt to let SRL
enhance text comprehension and inference by specifying verbal predicates and
their corresponding semantic roles. In our deep learning models, embeddings
are enhanced with explicit contextual semantic role labels for more fine-grained
semantics. We show that the salient labels can be conveniently added to
existing models and significantly improve deep learning models in challenging
text comprehension tasks. Extensive experiments on benchmark machine reading
comprehension and inference datasets verify that the proposed semantic
learning helps our system reach a new state of the art over strong baselines
that are already enhanced by well-pretrained language models from the latest
progress.
Comment: Proceedings of the 33rd Pacific Asia Conference on Language,
Information and Computation (PACLIC 33)
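The core idea in this abstract, augmenting word representations with explicit semantic role labels, can be sketched as a simple embedding concatenation. The tiny vocabularies, dimensions, and function names below are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch: concatenating a semantic-role-label embedding onto each
# word embedding, so a downstream comprehension model sees "who did what to
# whom" explicitly. All embeddings here are toy values for illustration.

WORD_EMB = {"cats": [0.1, 0.2], "chase": [0.3, 0.4], "mice": [0.5, 0.6]}
ROLE_EMB = {"ARG0": [1.0, 0.0], "V": [0.0, 1.0], "ARG1": [1.0, 1.0]}

def embed_with_srl(tokens, roles):
    """Return one vector per token: word embedding ++ role-label embedding."""
    return [WORD_EMB[t] + ROLE_EMB[r] for t, r in zip(tokens, roles)]

vecs = embed_with_srl(["cats", "chase", "mice"], ["ARG0", "V", "ARG1"])
print(vecs[0])  # [0.1, 0.2, 1.0, 0.0] -- word dims followed by role dims
```

In a real system the role labels would come from an SRL tagger and both embedding tables would be learned, but the interface, a per-token concatenation that leaves the base model unchanged, matches the "conveniently added to existing models" claim above.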