
    Evaluation of Multiple Choice Test Items from the Mid-Semester Assessment in Thematic Learning for Grade V at SDN Gladak Anyar 4 Pamekasan

    This research was conducted to evaluate the validity, difficulty level, discriminating power, distractor effectiveness, and reliability of the multiple choice questions in the Mid-Semester Assessment for fifth-grade thematic learning at SDN Gladak Anyar 4 Pamekasan. The research method is a quantitative descriptive approach. The Mid-Semester Examination covers two themes: theme 6 with 20 questions and theme 7 with 19 questions. Validity, difficulty level, discriminating power, distractor effectiveness, and reliability were evaluated using Microsoft Excel 2010. The subjects of this study were fifth-grade students, and data were collected using documentation techniques. The results indicate that the quality of the questions is high. (1) For validity, 19 questions (95%) in theme 6 and 18 questions (94.74%) in theme 7 were declared valid. (2) For difficulty level, theme 6 contained 13 questions (68.42%) categorized as easy and 2 questions (10.53%) categorized as difficult; theme 7 contained 11 questions (61.11%) categorized as easy, so some questions have difficulty levels that do not meet good quality. (3) For discriminating power, theme 6 contained 10 items (52.63%) categorized as poor and 1 item (5.26%) categorized as good; theme 7 contained 6 items (33.33%) categorized as poor and 3 items (16.67%) categorized as good, so the questions fall into the moderate discriminating power category. (4) For distractor effectiveness, theme 6 contained 1 item (5.26%) categorized as very good, 8 items (42.11%) categorized as good, and 6 items (31.58%) categorized as poor; theme 7 contained 4 items (22.22%) categorized as very good, 4 items (22.22%) categorized as good, and 3 items (16.67%) categorized as poor, so the questions fall into the good distractor effectiveness category. (5) The reliability is 0.9592 for theme 6 and 0.8950 for theme 7, indicating that the questions have high reliability and high quality.
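
    The measures evaluated above are standard classical test theory item statistics. As a rough illustration of how they can be computed outside a spreadsheet, the Python sketch below derives the difficulty index, an upper-lower-group discrimination index, distractor effectiveness, and KR-20 reliability from a matrix of student responses; the 27% grouping rule and the 5% functional-distractor threshold are common conventions assumed here, not details taken from the paper.

```python
import numpy as np

def item_analysis(scores):
    """scores: (students x items) 0/1 matrix, 1 = item answered correctly."""
    scores = np.asarray(scores, dtype=float)
    n_students, n_items = scores.shape

    # Difficulty index: proportion of students answering each item correctly.
    difficulty = scores.mean(axis=0)

    # Discrimination index: correct rate in the top 27% of students minus the
    # bottom 27%, ranked by total score (upper-lower group method).
    totals = scores.sum(axis=1)
    order = np.argsort(totals)
    k = max(1, int(round(0.27 * n_students)))
    discrimination = scores[order[-k:]].mean(axis=0) - scores[order[:k]].mean(axis=0)

    # KR-20 reliability for dichotomously scored items.
    p, q = difficulty, 1.0 - difficulty
    kr20 = (n_items / (n_items - 1)) * (1.0 - (p * q).sum() / totals.var(ddof=1))
    return difficulty, discrimination, kr20

def distractor_effectiveness(responses, key, min_rate=0.05):
    """responses: (students x items) chosen option letters; key: correct options.
    A distractor is conventionally counted as functioning when at least ~5%
    of students choose it."""
    responses = np.asarray(responses)
    report = []
    for j, correct in enumerate(key):
        col = responses[:, j]
        rates = {opt: float((col == opt).mean())
                 for opt in np.unique(col) if opt != correct}
        report.append({opt: (rate, rate >= min_rate) for opt, rate in rates.items()})
    return report
```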

    Development of a Two-Tier Multiple Choice-Based Mathematics Learning Evaluation Tool Using iSpring Suite 9

    In the era of industrial revolution 4.0 and of the pandemic, innovation is needed in the development of technology-based evaluation tools; one option is iSpring Suite 9. An evaluation tool is important because it helps educators obtain information on the achievement of results during the learning process. Beyond learning outcomes, evaluation can also reveal students' ability to understand concepts; one instrument for this is the two-tier multiple choice test, in which each item has an answer tier and a reasoning tier. The purpose of this study is to develop a learning evaluation tool that can assess students' conceptual understanding and can be used online. The research and development model used is 4D. The research instruments were interview sheets, validation sheets, test instruments, and questionnaires. Data analysis techniques were both qualitative and quantitative. The result of this study is a two-tier multiple choice based mathematics evaluation tool built with iSpring Suite 9, assessed as follows: (1) the validation percentage from media experts is 90.5% and from material experts 96.5%, both in the very feasible category; (2) for item quality, 8 items were valid with a reliability of 0.815; by difficulty level, 10% of items are difficult, 80% moderate, and 10% easy; for discriminating power, 5 questions are in the good category, 3 are quite good, 1 is very good, and 1 is bad; and the distractor analysis found 9 distractors selected by more than 5% of all students; (3) students' conceptual understanding after the two-tier multiple choice evaluation is 50.5%, in the sufficient category, and the students' response score is 82%, in the very interesting category.
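
    For readers unfamiliar with the format, the following minimal sketch shows one common way two-tier multiple choice responses are scored for conceptual understanding: a student is credited on an item only when both the answer tier and the reasoning tier are correct. The scoring rule and the toy data are illustrative assumptions, not the scoring used by the iSpring Suite 9 tool described above.

```python
def score_two_tier(responses, key):
    """responses/key: lists of (answer_option, reason_option) pairs, one per item.
    Credit conceptual understanding only when both tiers match the key."""
    understood = sum(1 for resp, correct in zip(responses, key) if resp == correct)
    return understood / len(key)  # fraction of items showing understanding

# Toy example: both tiers right on 2 of 4 items -> 0.5, i.e. 50%,
# comparable in spirit to the 50.5% "sufficient" level reported above.
key = [("A", "2"), ("C", "1"), ("B", "3"), ("D", "4")]
student = [("A", "2"), ("C", "2"), ("B", "3"), ("A", "4")]
print(score_two_tier(student, key))  # 0.5
```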

    Crowdsourcing Multiple Choice Science Questions

    We present a novel method for obtaining high-quality, domain-targeted multiple choice questions from crowd workers. Generating these questions can be difficult without trading away originality, relevance or diversity in the answer options. Our method addresses these problems by leveraging a large corpus of domain-specific text and a small set of existing questions. It produces model suggestions for document selection and answer distractor choice which aid the human question generation process. With this method we have assembled SciQ, a dataset of 13.7K multiple choice science exam questions (Dataset available at http://allenai.org/data.html). We demonstrate that the method produces in-domain questions by providing an analysis of this new dataset and by showing that humans cannot distinguish the crowdsourced questions from original questions. When SciQ is used as additional training data alongside existing questions, we observe accuracy improvements on real science exams. Comment: accepted for the Workshop on Noisy User-generated Text (W-NUT) 201
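
    The abstract describes model suggestions that help crowd workers pick plausible answer distractors. The snippet below is a simplified stand-in for that idea, not the authors' actual model: it ranks candidate distractors by TF-IDF similarity to the question and correct answer so that topically related wrong options surface first. The example question, answer, and candidate pool are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_distractors(question, answer, candidates, top_k=3):
    """Rank candidate distractors by lexical similarity to question + answer."""
    texts = [question + " " + answer] + candidates
    tfidf = TfidfVectorizer().fit_transform(texts)
    sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
    ranked = sorted(zip(candidates, sims), key=lambda pair: -pair[1])
    # Drop anything identical to the correct answer before suggesting.
    return [c for c, _ in ranked if c.lower() != answer.lower()][:top_k]

question = "What type of organism is commonly used in the preparation of foods such as cheese and yogurt?"
answer = "mesophilic organisms"
candidates = ["protozoa", "gymnosperms", "viruses", "thermophilic organisms"]
print(rank_distractors(question, answer, candidates))
```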

    STARC: Structured Annotations for Reading Comprehension

    We present STARC (Structured Annotations for Reading Comprehension), a new annotation framework for assessing reading comprehension with multiple choice questions. Our framework introduces a principled structure for the answer choices and ties them to textual span annotations. The framework is implemented in OneStopQA, a new high-quality dataset for evaluation and analysis of reading comprehension in English. We use this dataset to demonstrate that STARC can be leveraged for a key new application for the development of SAT-like reading comprehension materials: automatic annotation quality probing via span ablation experiments. We further show that it enables in-depth analyses and comparisons between machine and human reading comprehension behavior, including error distributions and guessing ability. Our experiments also reveal that the standard multiple choice dataset in NLP, RACE, is limited in its ability to measure reading comprehension. 47% of its questions can be guessed by machines without accessing the passage, and 18% are unanimously judged by humans as not having a unique correct answer. OneStopQA provides an alternative test set for reading comprehension which alleviates these shortcomings and has a substantially higher human ceiling performance. Comment: ACL 2020. OneStopQA dataset, STARC guidelines and human experiments data are available at https://github.com/berzak/onestop-q
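
    The 47% guessability figure for RACE comes from answer-only baselines that never see the passage. As a hedged illustration of that probing idea (much simpler than the machine readers used in the paper), the sketch below guesses by picking the option with the largest word overlap with the question; the example item is invented.

```python
def guess_without_passage(question, options):
    """Pick the option sharing the most words with the question, passage unseen."""
    q_words = set(question.lower().split())
    overlaps = [len(q_words & set(opt.lower().split())) for opt in options]
    return max(range(len(options)), key=lambda i: overlaps[i])

question = "Why did the author most likely write the passage?"
options = ["to entertain readers with a story",
           "to explain why the author wrote it",
           "to advertise a new product",
           "to list important historical dates"]
print(guess_without_passage(question, options))  # -> 1, the option overlapping the question most
```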

    Using item response theory to explore the psychometric properties of extended matching questions examination in undergraduate medical education

    BACKGROUND: As assessment has been shown to direct learning, it is critical that the examinations developed to test clinical competence in medical undergraduates are valid and reliable. The use of extended matching questions (EMQ) has been advocated to overcome some of the criticisms of using multiple-choice questions to test factual and applied knowledge. METHODS: We analysed the results from the Extended Matching Questions Examination taken by 4th year undergraduate medical students in the academic year 2001 to 2002. Rasch analysis was used to examine whether the set of questions used in the examination mapped on to a unidimensional scale, the degree of difficulty of questions within and between the various medical and surgical specialties, and the pattern of responses within individual questions to assess the impact of the distractor options. RESULTS: Analysis of a subset of items and of the full examination demonstrated internal construct validity and the absence of bias on the majority of questions. Three main patterns of response selection were identified. CONCLUSION: Modern psychometric methods based upon the work of Rasch provide a useful approach to the calibration and analysis of EMQ undergraduate medical assessments. The approach allows for a formal test of the unidimensionality of the questions and thus the validity of the summed score. Given the metric calibration which follows from fit to the model, it also allows for the establishment of item banks to facilitate continuity and equity in exam standards.
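
    For reference, the dichotomous Rasch model used in the analysis above places students and items on one scale: the probability of a correct response depends only on the difference between a student's ability and the item's difficulty. A minimal sketch, with illustrative parameter values:

```python
import math

def rasch_probability(theta, b):
    """Rasch model: P(correct) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A student of average ability (theta = 0) facing an easy item (b = -1)
# answers correctly about 73% of the time; a hard item (b = +2) drops
# that to roughly 12%.
print(rasch_probability(0.0, -1.0))  # ~0.731
print(rasch_probability(0.0, 2.0))   # ~0.119
```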

    Learning to Reuse Distractors to support Multiple Choice Question Generation in Education

    Multiple choice questions (MCQs) are widely used in digital learning systems, as they allow for automating the assessment process. However, due to the increased digital literacy of students and the advent of social media platforms, MCQ tests are widely shared online, and teachers are continuously challenged to create new questions, which is an expensive and time-consuming task. A particularly sensitive aspect of MCQ creation is to devise relevant distractors, i.e., wrong answers that are not easily identifiable as being wrong. This paper studies how a large existing set of manually created answers and distractors for questions over a variety of domains, subjects, and languages can be leveraged to help teachers in creating new MCQs, by the smart reuse of existing distractors. We built several data-driven models based on context-aware question and distractor representations, and compared them with static feature-based models. The proposed models are evaluated with automated metrics and in a realistic user test with teachers. Both automatic and human evaluations indicate that context-aware models consistently outperform a static feature-based approach. For our best-performing context-aware model, on average 3 distractors out of the 10 shown to teachers were rated as high-quality distractors. We create a performance benchmark, and make it public, to enable comparison between different approaches and to introduce a more standardized evaluation of the task. The benchmark contains a test set of 298 educational questions covering multiple subjects & languages and a 77k multilingual pool of distractor vocabulary for future research. Comment: 24 pages and 4 figures. Accepted for publication in IEEE Transactions on Learning Technologies
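
    As a rough sketch of the distractor-reuse idea (not the authors' context-aware models), the snippet below embeds a new question together with its correct answer and retrieves the semantically closest entries from an existing distractor pool for a teacher to review. The encoder name, the library choice (sentence-transformers), and the tiny pool are assumptions for illustration only.

```python
from sentence_transformers import SentenceTransformer, util

# Any reasonable sentence encoder would do; this model name is an assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")

def suggest_from_pool(question, answer, distractor_pool, top_k=3):
    """Rank pooled distractors by semantic closeness to the question + answer."""
    query_vec = model.encode(question + " " + answer, convert_to_tensor=True)
    pool_vecs = model.encode(distractor_pool, convert_to_tensor=True)
    scores = util.cos_sim(query_vec, pool_vecs)[0]
    ranked = [distractor_pool[int(i)] for i in scores.argsort(descending=True)]
    # Never suggest the correct answer itself as a distractor.
    return [d for d in ranked if d.lower() != answer.lower()][:top_k]

pool = ["osmosis", "mitosis", "evaporation", "fermentation", "respiration"]
print(suggest_from_pool("Which process lets plants make glucose from sunlight?",
                        "photosynthesis", pool))
```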