Assessment in anatomy
From an educational perspective, assessment is a central problem, both for establishing competency and as a selection criterion for different professional purposes. Among the issues to be addressed are the methods of assessment and/or the types of tests, the range of scores, and the definition of honours degrees. The methods of assessment comprise such different forms as the spotter examination, short or long essay questions, short answer questions, true-false questions, single best answer questions, multiple choice questions, extended matching questions, and several forms of oral assessment such as the viva voce examination. Knowledge about these formats is important when assessing different educational objectives: objectives from the cognitive domain require different assessment instruments than objectives from the psychomotor domain or the affective domain. There is no golden rule for which type of assessment instrument or format best measures a given educational objective, but one has to accept that no single assessment instrument is capable of assessing objectives from all domains. Whereas the first two or three levels of progress can be assessed by well-structured written examinations such as multiple choice or multiple answer questions, higher levels of progress require other instruments, such as a thesis or direct observation. This is no issue at all for assessment tools in which students are required to select the appropriate answer from a given set of choices, as in true-false questions, MCQs, EMQs, etc.; in these cases, standard setting is done by the selection of the correct answer.
Automated Reading Passage Generation with OpenAI's Large Language Model
The widespread usage of computer-based assessments and individualized
learning platforms has resulted in an increased demand for the rapid production
of high-quality items. Automated item generation (AIG), the process of using
item models to generate new items with the help of computer technology, was
proposed to reduce reliance on human subject experts at each step of the
process. AIG has been used in test development for some time. Still, the use of
machine learning algorithms has introduced the potential to improve the
efficiency and effectiveness of the process greatly. The approach presented in
this paper utilizes OpenAI's latest transformer-based language model, GPT-3, to
generate reading passages. Existing reading passages were used in carefully
engineered prompts to ensure the AI-generated text has similar content and
structure to a fourth-grade reading passage. For each prompt, we generated
multiple passages; the final passage was selected according to its Lexile score
agreement with the original passage. In the final round, the selected passage
went through a simple revision by a human editor to ensure the text was free of
any grammatical and factual errors. All AI-generated passages, along with the
original passages, were evaluated by human judges according to their coherence,
appropriateness for fourth graders, and readability.
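The selection step this abstract describes (keeping the generated passage whose readability best matches the original) can be sketched as follows. This is not the authors' code: the actual Lexile analyzer is proprietary, so `score` is a placeholder for any callable mapping text to a numeric readability estimate, and `toy_score` below is only a crude stand-in (mean word length, not Lexile) for demonstration.

```python
def select_passage(original, candidates, score):
    """Return the candidate whose readability score is closest to the original's."""
    target = score(original)
    return min(candidates, key=lambda text: abs(score(text) - target))

def toy_score(text):
    """Stand-in readability proxy: mean word length. NOT a Lexile measure."""
    words = text.split()
    return sum(len(w) for w in words) / len(words)

original = "The fox ran over the old wooden bridge."
candidates = [
    "A quick fox crossed the creaky bridge.",
    "Notwithstanding considerable trepidation, the vulpine specimen traversed.",
]
best = select_passage(original, candidates, toy_score)
# `best` is the first candidate: its mean word length is far closer to the original's.
```

With a real readability scorer plugged in for `score`, the same one-line `min(..., key=...)` selection implements the "Lexile agreement" filter the abstract describes.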
Component skills of inferential processing in older readers
Thesis (M.S.)--Boston University. The ability to make inferences has been shown to be a crucial component of successful reading in older students. The current project investigates differences in comprehension of text-based (factual) and inferential information across grade levels and modalities, and seeks to determine which component language and reading skills are important in making inferences.
1,836 students in grades 6-12 were tested on a computerized battery of language subtests in the auditory and written modalities. Eleven subtests examining performance on lower levels of language were administered, in addition to a measure of factual and inferential discourse comprehension.
Results demonstrated that students performed better overall in the written modality. Students in older grades were consistently faster and more accurate. Vocabulary knowledge had the largest effect on performance on inferential questions in the written modality in middle school, while sentence-level skills were most important in high school. In the auditory modality, sentence-level skills were most predictive across question types and grade levels. Implications for theories of inferential processing and for teaching inferences within literacy education frameworks are discussed.
Building capacity in climate change policy analysis and negotiation: methods and technologies
Capacity building is often cited as the reason "we cannot just pour money into developing countries", and as the reason so many development projects fail: their design does not address local conditions. It is therefore a key technical and political concept in international development.
Some of the poorest countries in the world are also some of the most vulnerable to the impacts of climate change. Their vulnerability is in part due to a lack of capacity to plan and anticipate the effects of climate change on crops, water resources, urban electricity demand etc. What capacities do these countries lack to deal with climate change? How will they cope? What steps can they take to reduce their vulnerability?
This innovative and high-profile research project was part of a larger project (called C3D) and was conducted with non-governmental organisations in Senegal, South Africa and Sri Lanka. The research involved several participatory workshops and a questionnaire sent to all three research centres.
Proceedings of QG2010: The Third Workshop on Question Generation
These are the peer-reviewed proceedings of "QG2010, The Third Workshop on Question Generation". The workshop included a special track for "QGSTEC2010: The First Question Generation Shared Task and Evaluation Challenge".
QG2010 was held as part of The Tenth International Conference on Intelligent Tutoring Systems (ITS2010).
Evaluating Large Language Models: A Comprehensive Survey
Large language models (LLMs) have demonstrated remarkable capabilities across
a broad spectrum of tasks. They have attracted significant attention and been
deployed in numerous downstream applications. Nevertheless, akin to a
double-edged sword, LLMs also present potential risks. They could suffer from
private data leaks or yield inappropriate, harmful, or misleading content.
Additionally, the rapid progress of LLMs raises concerns about the potential
emergence of superintelligent systems without adequate safeguards. To
effectively capitalize on LLM capacities as well as ensure their safe and
beneficial development, it is critical to conduct a rigorous and comprehensive
evaluation of LLMs.
This survey endeavors to offer a panoramic perspective on the evaluation of
LLMs. We categorize the evaluation of LLMs into three major groups: knowledge
and capability evaluation, alignment evaluation and safety evaluation. In
addition to the comprehensive review on the evaluation methodologies and
benchmarks on these three aspects, we collate a compendium of evaluations
pertaining to LLMs' performance in specialized domains, and discuss the
construction of comprehensive evaluation platforms that cover LLM evaluations
on capabilities, alignment, safety, and applicability.
We hope that this comprehensive overview will stimulate further research
interests in the evaluation of LLMs, with the ultimate goal of making
evaluation serve as a cornerstone in guiding the responsible development of
LLMs. We envision that this will channel their evolution into a direction that
maximizes societal benefit while minimizing potential risks. A curated list of
related papers has been publicly available at
https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.
A Survey on Evaluation of Large Language Models
Large language models (LLMs) are gaining increasing popularity in both
academia and industry, owing to their unprecedented performance in various
applications. As LLMs continue to play a vital role in both research and daily
use, their evaluation becomes increasingly critical, not only at the task
level, but also at the society level for better understanding of their
potential risks. Over the past years, significant efforts have been made to
examine LLMs from various perspectives. This paper presents a comprehensive
review of these evaluation methods for LLMs, focusing on three key dimensions:
what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide
an overview from the perspective of evaluation tasks, encompassing general
natural language processing tasks, reasoning, medical usage, ethics,
education, natural and social sciences, agent applications, and other areas.
Secondly, we answer the `where' and `how' questions by diving into the
evaluation methods and benchmarks, which serve as crucial components in
assessing the performance of LLMs. Then, we summarize the success and failure cases
of LLMs in different tasks. Finally, we shed light on several future challenges
that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to
researchers in the realm of LLMs evaluation, thereby aiding the development of
more proficient LLMs. Our key point is that evaluation should be treated as an
essential discipline to better assist the development of LLMs. We consistently
maintain the related open-source materials at:
https://github.com/MLGroupJLU/LLM-eval-survey.