The effect of response order on candidate viewing behaviour and item difficulty in a multiple-choice listening test
Studies from various disciplines have reported that the spatial location of options in relation to processing order affects which option is ultimately chosen. A large number of studies have found a primacy effect, that is, a tendency to prefer the first option. In this paper we report evidence that the position of the key in four-option multiple-choice (MC) listening test items may affect item difficulty and thereby potentially introduce construct-irrelevant variance. Two sets of analyses were undertaken. In Study 1 we explored 30 test takers' processing via eye-tracking on listening items from the Aptis Test. An unexpected finding concerned the amount of processing given to the different response options on the MC questions, depending on their order. Building on this, in Study 2 we examined the direct effect of key position on item difficulty in a sample of 200 live Aptis items with around 6,000 test takers per item. The results suggest that the spatial location of the key in MC listening tests affects both the amount of processing it receives and the item's difficulty. Given the widespread use of MC tasks in language assessments, these findings are important, particularly for tests that randomize response order: candidates who by chance have many keys in last position might be significantly disadvantaged.
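The Study 2 analysis described above relates an item's key position to its difficulty. As a loose illustration of that comparison (not the study's actual statistical model), one might group item facility values, i.e. proportions correct, by key position; the item data below are invented for demonstration.

```python
# Hypothetical illustration: does the key's position (1-4) in a
# four-option MC item relate to item facility (proportion correct)?
# The (key_position, facility) pairs are invented, not Aptis data.
from statistics import mean

items = [
    (1, 0.78), (1, 0.74), (2, 0.71), (2, 0.69),
    (3, 0.66), (3, 0.64), (4, 0.58), (4, 0.61),
]

def mean_facility_by_position(items):
    """Average facility of items sharing the same key position."""
    by_pos = {}
    for pos, fac in items:
        by_pos.setdefault(pos, []).append(fac)
    return {pos: mean(vals) for pos, vals in sorted(by_pos.items())}

summary = mean_facility_by_position(items)
for pos, fac in summary.items():
    print(f"key position {pos}: mean facility {fac:.2f}")
```

In these invented numbers, items keyed in the last position come out hardest (lowest facility), mirroring the direction of the effect the abstract reports; a live analysis would of course test the effect formally across thousands of responses per item.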
Re-examining the content validation of a grammar test: the (im)possibility of distinguishing vocabulary and structural knowledge
"Vocabulary and structural knowledge" (Grabe, 1991, p. 379) appears to be a key component of reading ability. However, is this component to be taken as a unitary one or is structural knowledge a separate factor that can therefore also be tested in isolation in, say, a test of syntax? If syntax can be singled out (e.g. in order to investigate its contribution to reading ability), this test of syntactic knowledge would require validation. The usefulness and reliability of expert judgments as a means of analysing the content or difficulty of test items in language assessment has been questioned for more than two decades. Still, groups of expert judges are often called upon, as they are perceived to be the only, or at least a very convenient, way of establishing key features of items. Such judgments, however, are particularly opaque and thus problematic when judges are required to make categorizations where categories are only vaguely defined or are ontologically questionable in themselves. This is, for example, the case when judges are asked to classify the content of test items based on a distinction between lexis and syntax, a dichotomy corpus linguistics has suggested cannot be maintained. The present paper scrutinizes a study by Shiotsu (2010) that employed expert judgments, on the basis of which claims were made about the relative significance of the components "syntactic knowledge" and "vocabulary knowledge" in reading in a second language. By both replicating and partially replicating Shiotsu's (2010) content analysis study, the paper problematizes not only the use of expert judgments, but, more importantly, their usefulness in distinguishing between construct components that might, in fact, be difficult to distinguish anyway. This is particularly important for an understanding and diagnosis of learners' strengths and weaknesses in reading in a second language.
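Replications of expert-judgment content analyses like the one discussed above typically report chance-corrected inter-judge agreement, most often Cohen's kappa. A minimal sketch, with invented labels for two hypothetical judges classifying items as "lexis" or "syntax", shows why raw percent agreement can overstate consensus:

```python
# Hypothetical sketch: Cohen's kappa for two judges classifying test
# items as "lexis" or "syntax". The labels are invented; the point is
# that kappa discounts the agreement expected by chance alone.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both judges labelled independently at
    # their own marginal rates.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

judge_1 = ["lexis", "syntax", "lexis", "lexis", "syntax", "lexis"]
judge_2 = ["lexis", "syntax", "syntax", "lexis", "lexis", "lexis"]
print(f"kappa = {cohens_kappa(judge_1, judge_2):.2f}")
```

Here the judges agree on 4 of 6 items (67% raw agreement), yet kappa is only 0.25 once chance agreement is removed; low kappa on a lexis/syntax classification would be consistent with the paper's argument that the dichotomy itself is hard to maintain.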
Moving the field of vocabulary assessment forward: The need for more rigorous test development and validation
Copyright © Cambridge University Press 2019. Recently, a large number of vocabulary tests have been made available to language teachers, testers, and researchers. Unfortunately, most of them have been launched with inadequate validation evidence. The field of language testing has become increasingly rigorous in the area of test validation, but developers of vocabulary tests have generally not given validation sufficient attention in the past. This paper argues for more rigorous and systematic procedures for test development, starting from a more precise specification of the test's purpose, intended test takers and educational context, the particular aspects of vocabulary knowledge being measured, and the way in which the test scores should be interpreted. It also calls for greater assessment literacy among vocabulary test developers, and greater support for the end users of the tests, for instance through the provision of detailed users' manuals. Overall, the authors present what they feel are the minimum requirements for vocabulary test development and validation. They argue that the field should police itself more rigorously to ensure that these requirements are met or exceeded, and that they are made explicit for those using vocabulary tests.
Looking into listening: Using eye-tracking to establish the cognitive validity of the Aptis Listening Test
This study investigated the cognitive processing of 30 test-takers while they completed the Aptis Listening Test. Specifically, it examined whether test-takers' cognitive processes and the types of information they used corresponded to those targeted at the different CEFR levels. To this end, a detailed analysis of test-takers' verbal recalls was conducted, stimulated by a replay of the eye-traces recorded while they had been solving the items. The study also explored the usefulness of quantitative analyses of eye-tracking metrics captured during listening tests. The stimulated recall findings indicate that the Aptis Listening Test successfully taps into the range of cognitive processes and types of information intended by the test developers. The data also show, however, that the differences between the CEFR levels in relation to the intended cognitive processes could be more pronounced, and that the process of "discourse construction" could be more evident for B2 items. It is therefore suggested that a different item type could help elicit this type of higher-order processing. In terms of the types of information candidates used to answer items correctly, a clear difference and progression between the CEFR levels was observed. The quantitative analysis of the eye-tracking metrics revealed interesting results. A linear mixed-effects model, with visit duration on the response options as the dependent variable, showed that test-takers looked at the response options of higher-level items significantly longer than at those of lower-level items. The results also showed that response options higher up on the screen were looked at significantly longer than those lower down, regardless of item level. In addition, better readers focused on the response options significantly longer than poorer readers.
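The quantitative analysis reported above fit a linear mixed-effects model with visit duration as the dependent variable and, implicitly, participants as a grouping factor. As a loose pure-Python sketch of the comparison underlying that model (not the model itself), one can compute each participant's mean visit duration on higher- versus lower-level item options and average the within-participant differences; all numbers below are invented.

```python
# Hypothetical sketch of the contrast behind the reported mixed-effects
# result: per-participant mean visit duration (ms) on response options
# of higher- vs lower-level items. Durations are invented; the study
# itself fit a linear mixed-effects model, not this paired contrast.
from statistics import mean

# participant -> {"lower": [durations...], "higher": [durations...]}
visits = {
    "p01": {"lower": [820, 790, 860], "higher": [1040, 990, 1110]},
    "p02": {"lower": [700, 750, 720], "higher": [930, 880, 910]},
    "p03": {"lower": [910, 880, 940], "higher": [1000, 1060, 1020]},
}

def mean_within_participant_difference(visits):
    """Average (higher - lower) difference in mean visit duration,
    computed within each participant first, then averaged."""
    diffs = [mean(d["higher"]) - mean(d["lower"]) for d in visits.values()]
    return mean(diffs)

print(f"mean within-participant difference: "
      f"{mean_within_participant_difference(visits):.1f} ms")
```

Computing the difference within each participant first is what the random participant effect in the actual model accomplishes: it keeps individual reading-speed differences from masquerading as an item-level effect.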