
    Assessment in anatomy

    From an educational perspective, a very important problem is that of assessment, both for establishing competency and as a selection criterion for different professional purposes. Among the issues to be addressed are the methods of assessment and/or the types of tests, the range of scores, and the definition of honours degrees. The methods of assessment comprise forms as different as the spotter examination, short or long essay questions, short answer questions, true-false questions, single best answer questions, multiple choice questions, extended matching questions, and several forms of oral approaches such as viva voce examinations. Knowledge about this is important when assessing different educational objectives; assessing educational objectives from the cognitive domain requires different assessment instruments than assessing objectives from the psychomotor or even the affective domain. There is no golden rule as to which type of assessment instrument or format best measures certain educational objectives, but one has to accept that no single assessment instrument is capable of assessing educational objectives from all domains. Whereas the first two or three levels of progress can be assessed by well-structured written examinations such as multiple choice questions or multiple answer questions, higher levels of progress need other instruments, such as a thesis or direct observation. This is no issue at all in assessment tools where students are required to select the appropriate answer from a given set of choices, as in true-false questions, MCQs, EMQs, etc.; in these cases the standard setting is done by the selection of the true answer.

    Automated Reading Passage Generation with OpenAI's Large Language Model

    The widespread usage of computer-based assessments and individualized learning platforms has resulted in an increased demand for the rapid production of high-quality items. Automated item generation (AIG), the process of using item models to generate new items with the help of computer technology, was proposed to reduce reliance on human subject-matter experts at each step of the process. While AIG has been used in test development for some time, the use of machine learning algorithms has introduced the potential to greatly improve the efficiency and effectiveness of the process. The approach presented in this paper utilizes OpenAI's latest transformer-based language model, GPT-3, to generate reading passages. Existing reading passages were used in carefully engineered prompts to ensure the AI-generated text has content and structure similar to a fourth-grade reading passage. For each prompt, we generated multiple passages; the final passage was selected according to its Lexile score agreement with the original passage. In the final round, the selected passage went through a simple revision by a human editor to ensure the text was free of any grammatical and factual errors. All AI-generated passages, along with the original passages, were evaluated by human judges according to their coherence, appropriateness for fourth graders, and readability.
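
    The selection step described in the abstract above (generate several candidate passages per prompt, then keep the one whose Lexile score best matches the original passage) can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions: generate_passage and estimate_lexile are hypothetical placeholders standing in for the paper's GPT-3 prompt calls and its Lexile scoring tool, neither of which is reproduced here.

        # Minimal sketch of the generate-then-select step (assumed workflow, not the authors' code).
        from typing import Callable, List

        def select_passage(
            prompt: str,
            original_lexile: float,
            generate_passage: Callable[[str], str],   # placeholder for a GPT-3 completion call
            estimate_lexile: Callable[[str], float],  # placeholder for a Lexile/readability scorer
            n_candidates: int = 5,
        ) -> str:
            """Generate several candidates for one prompt and keep the one whose
            estimated Lexile score is closest to the original passage's score."""
            candidates: List[str] = [generate_passage(prompt) for _ in range(n_candidates)]
            return min(candidates, key=lambda text: abs(estimate_lexile(text) - original_lexile))

    In the pipeline described above, the selected passage would then go to a human editor for the light grammatical and factual revision before evaluation by human judges.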

    Component skills of inferential processing in older readers

    Thesis (M.S.)--Boston University. The ability to make inferences has been shown to be a crucial component of successful reading in older students. The current project investigates differences in comprehension of text-based (factual) and inferential information across grade levels and modalities, and seeks to determine which component language and reading skills are important in making inferences. 1,836 students in grades 6-12 were tested on a computerized battery of language subtests in the auditory and written modalities. Eleven subtests examining performance on lower levels of language were administered in addition to a measure of factual and inferential discourse comprehension. Results demonstrated that students performed better overall in the written modality. Students in older grades were consistently faster and more accurate. Vocabulary knowledge had the largest effect on performance on inferential questions in the written modality in middle school, while sentence-level skills were most important in high school. In the auditory modality, sentence-level skills were most predictive across question types and grade levels. Implications for theories of inferential processing and for teaching inferences within literacy education frameworks are discussed.

    Evaluating Large Language Models: A Comprehensive Survey

    Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks. They have attracted significant attention and been deployed in numerous downstream applications. Nevertheless, akin to a double-edged sword, LLMs also present potential risks. They could suffer from private data leaks or yield inappropriate, harmful, or misleading content. Additionally, the rapid progress of LLMs raises concerns about the potential emergence of superintelligent systems without adequate safeguards. To effectively capitalize on LLM capacities and ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation of LLMs. This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment evaluation, and safety evaluation. In addition to a comprehensive review of the evaluation methodologies and benchmarks for these three aspects, we collate a compendium of evaluations pertaining to LLMs' performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover LLM evaluations of capabilities, alignment, safety, and applicability. We hope that this comprehensive overview will stimulate further research interest in the evaluation of LLMs, with the ultimate goal of making evaluation serve as a cornerstone in guiding the responsible development of LLMs. We envision that this will channel their evolution into a direction that maximizes societal benefit while minimizing potential risks. A curated list of related papers is publicly available at https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.

    A Survey on Evaluation of Large Language Models

    Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level but also at the society level, for a better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. First, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Second, we answer the 'where' and 'how' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs on different tasks. Finally, we shed light on several future challenges that lie ahead in LLM evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLM evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at https://github.com/MLGroupJLU/LLM-eval-survey.