Does the text matter in a multiple-choice test of comprehension?

Language Testing

Abstract

The current study addresses a specific construct-validity issue regarding multiple-choice language-comprehension tests by focusing on TOEFL's minitalk passages: Is there evidence that examinees attend to the text passages in answering the test items? To address this problem, we analysed a large sample (n = 337) of minitalk items. The content and structure of the items and their associated text passages were represented by a set of predictor variables that included a wide variety of text and item characteristics identified from the experimental language-comprehension literature. Stepwise and hierarchical regression techniques showed that at least 33% of the item-difficulty variance could be accounted for, primarily by variables that reflected the content and structure of the whole passage and/or selected portions of the passage; item characteristics, by contrast, accounted for very little of the variance. The pattern of these results was interpreted, with qualifications, as favouring the construct validity of TOEFL's minitalks. Our methodology also allowed a detailed comparison between TOEFL reading and listening (minitalk) items. Several criticisms concerning multiple-choice language-comprehension tests were addressed, and future work is suggested.

I Introduction

1 Purpose of current study

The purpose of this study is to examine a very specific method of evaluating the construct validity of TOEFL's minitalks. The problem to be addressed is whether there is evidence that examinees attend to the text passages when answering the test items.

Roy Freedle and Irene Kostin

2 Background

Several extreme criticisms regarding the construct validity of multiple-choice tests of language comprehension, especially for reading, were summarized by … We interpret the Drum et al. result as showing that there are different 'levels' of construct validity that can be discerned.
In the correlational sense examined here, the lowest 'level' of validity requires finding any significant support for the effect of text variables on test-item difficulty. A higher 'level' requires that, if both text and item variables are significant, the text variables be more important than the item variables in accounting for test-item difficulty variance. Freedle and Kostin (1996) presented evidence for three logically different types of predictor variables: text variables, item variables, and text-by-item overlap variables. For example, regarding the latter type, suppose some content words that occur in the passage are also used in one or more of an item's options. One might hypothesize that people will tend to choose an option if it contains more text words than the other options. Such a variable for predicting item difficulty represents a text-by-item overlap variable, because measuring it requires examining both the item content and the passage content in order to enter the appropriate lexical-overlap count. The three distinct 'levels' of construct validity proposed by …

3 The use of listening and reading tests as measures of general language comprehension

While we certainly do not claim that reading and listening are synonymous, we do claim that they are strongly related, because both require the exercise of a general faculty of language comprehension. Indeed, several studies have demonstrated strong intercorrelations between listening and reading comprehension.

• Negations.
• Referentials. Abrahamsen and Shelton (1989) demonstrated improved comprehension of texts that were modified, in part, so that full noun phrases were substituted for referential expressions. This improvement suggests that texts with few referential expressions may be easier than texts with many.
• Rhetorical organizers.
• Fronted structures.
• Serial position effects. The work of …
• Lexical overlap.
• Vocabulary, sentence length, passage length and abstractness of text. Longer sentences and longer, less frequently occurring words tend to be associated with greater difficulty of text comprehension, as can be inferred from their use in traditional readability formulas (see …).

b New predictor variables tailored for the TOEFL's minitalk listening environment. Above, we have hypothesized that reading and listening tests are measurements of a general language-comprehension ability. In part a study by …

1) The frequency of emphatic text words. Texts that contain a fairly large number of emphatic words introduce a variety into the stimulus that is absent from minitalk texts with no emphatic words. Texts that use many emphatic words may thus be more memorable; hence, items may be easier when they are associated with texts that have many emphatic words. (Anticipating the method section: we used the emphatic words that were highlighted in the script used to make the minitalk recordings; the tapes themselves were not listened to, owing to time constraints in completing the study. For further clarification, see variable v48 below.)

2) The frequency of filled or unfilled pauses. A filled pause consists of sounds such as 'um' or 'er'. An unfilled pause is a pause of about one second or longer without any audible speech sound in that interval. The professional speakers who recorded the minitalks inserted these filled and unfilled pauses intentionally, to lend the minitalks a greater appearance of authenticity. It is difficult to anticipate the likely effect of variable numbers of pauses on items associated with minitalks: Blau (1990) found that pauses at constituent boundaries improved second-language (L2) listening comprehension, but Chaudron and Richards (1986) found that pause fillers did not aid lecture retention.
Perhaps the effect of pauses on comprehension will prove to be weakly facilitative. Filled pauses ('um', 'er') were indicated on the typescript for some of the minitalks; long unfilled pauses were indicated by '%' on the same typescript. (Again anticipating the method section: although these two types of pause are logically distinct, they were combined for this study because each occurred at low frequency; see variable v49 below for further clarification.)

3) Estimates of the redundancy of information. As an example of elaboration, they give: 'The food % is very hearty and delicious. Hearty and delicious food is nourishing and tasty.' Repetition of information would consist of repeated segments or paraphrases of information. (The above example is, in fact, a combination of both repeated and elaborated information over successive sentences.)

Given the above evidence, it appeared desirable to develop predictor variables that reflect how test items tap various redundant properties of the minitalks, because such measures of redundancy are likely to be correlated with listening-item difficulty. Several scores were developed. Suppose an item's correct option consists of a phrase or clause that essentially restates information repeated in more than one sentence of the minitalk. Such an item should be easier than one whose correct option paraphrases information present in just one sentence of the minitalk, because the first item's information is more redundantly represented in the text. Two redundancy scores of this kind were developed, designated below as ii10 and ii12 (as applied to inference item types) and ss12 and ss14 (as applied to supporting item types). Another score involved both redundancy and what we call complementary information.
That is, a correct option sometimes implicates one sentence in the minitalk that is an incomplete paraphrase, in the sense that one must search additional parts of the minitalk to locate the complementary information needed to complete it. Such items will probably prove harder than items that simply paraphrase information in a single minitalk sentence, since the examinee must do more cognitive work in locating the complementary information in memory and linking it to the partial paraphrase in order to arrive at the correct answer. This other type of 'redundancy' score was developed for both inference items (see ii11 below) and supporting items (ss13 below).

4) A new lexical overlap measure.

II Materials and method

The total item sample consists of 337 listening-comprehension items associated with 69 minitalk passages. The passages and associated items were taken from 47 TOEFL forms administered between 1981 and 1992.

Sequence of events for each minitalk

The sequence of events associated with each minitalk is as follows: (1) the lead-in (sometimes containing brief contextual information about where a lecture has been given or what its topical content is, e.g., 'Listen to the report on biology'); (2) the minitalk itself; (3) the item stem that introduces the problem to be solved; and (4) the typed options from which the examinee makes a selection. (The first three events are presented auditorily.) Each of these components of a minitalk was assigned a number of variables intended to predict item difficulty (see Appendix A).

Types of item studied

Four types of item were studied: detail explicit, detail implicit, gist explicit and gist implicit.

• Detail explicit (n = 183); also called supporting-ideas items. Example: 'What year did the teacher give for the critical experiment?'
• Detail implicit (n = 117); this item type consists of two subtypes: plain inference items and inference-application items.
Eighty-two of these were inference items, where the inference can be made on the basis of information available in the text. Example: 'One can infer that the speaker intends to do which of the following?' These simple inference items were quite similar to those found in the TOEFL reading section. For the other subtype, inference-application items, listeners must use their background knowledge to make the inference; there were 35 such items. Example: 'Who is the person giving the lecture?' Usually no direct information about the lecturer's identity is given in the minitalk, so it must be inferred strictly from background knowledge that the examinee brings to the task. (Incidentally, inference-application items are not found in the reading section.)
• Gist explicit and implicit (n = 37); also called main-idea items. Because the total sample of gist items was so small, we decided not to divide this category into its two respective item types.

The measure of item difficulty for each item (equated delta) was based on the performance of approximately 2000 examinees, randomly selected from the much larger pool of examinees who responded to each TOEFL test form. The equated delta value slightly adjusts the difficulty of each item across forms so that items can be meaningfully compared across groups of people taking different test forms; the adjustment is needed because the examinees who respond to a particular test form differ slightly in overall ability level from those responding to other forms.

Independent and dependent variables used in this study

Appendix A presents a list of all the coded variables along with a brief definition of each. Not all variables were used in the analyses.
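The equated-delta index described above can be sketched in code. The paper does not give the formula, so the sketch below assumes the standard ETS delta definition (an inverse-normal transform of the proportion correct, rescaled to mean 13 and standard deviation 4) and omits the cross-form equating adjustment:

```python
from statistics import NormalDist

def equated_delta_raw(p_correct: float) -> float:
    """Raw delta index: an inverse-normal transform of the proportion of
    examinees answering correctly, scaled to mean 13 and SD 4. Harder
    items (lower p) get larger deltas. The equating step that adjusts
    for ability differences across test forms is omitted here."""
    return 13.0 - 4.0 * NormalDist().inv_cdf(p_correct)

# An item answered correctly by half the examinees sits at the scale midpoint:
print(round(equated_delta_raw(0.50), 2))  # 13.0
# A hard item (30% correct) has a larger delta than an easy one (80% correct):
print(equated_delta_raw(0.30) > equated_delta_raw(0.80))  # True
```

This ordering matches the text's statement that difficult items have large equated deltas and easy items small ones.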
In Appendix A, the variables without a superscript are those used in the analyses reported here; the few variables having a superscript were deleted owing to low frequency of occurrence or collinearity with other predictor variables (collinearity is defined as a correlation of .8 or higher with any other variable; see …). There are four groups of independent variables used to predict listening-item difficulty; the dependent variable constitutes one additional variable.

a Four groups of independent variables

1) Item variables. These constitute our so-called 'pure' item variables referred to below; that is, they can be coded without reference to the contents of the minitalk passage: only the contents of the item itself are used to quantify them.

2) Text variables. These characterize the content and structure of the minitalk passage itself; the contents of the items do not figure in their coding.

3) Text/item overlap variables. We define text-by-item (or, alternatively, text/item) overlap variables as variables that necessarily reflect the contents of both the test items and the text to which those items apply.

4) Item types. We define item type as a special kind of text/item overlap, even though item type per se appears to be a pure item variable. On reflection, however, an item's type cannot be correctly determined without checking how the item functions for the passage to which it refers; for this reason item types are another kind of text/item overlap code, because both the item and the text must be scanned to arrive at a proper coding.

A total of 18 variables were deleted (see the rationale given above), leaving 11 item variables, 31 text variables and 39 text/item overlap variables (the 39 text/item overlap variables include the 4 item-type variables).

b Dependent variable. There was one dependent variable: equated delta.
The dependent variable (v54) is an item's equated delta. Difficult items have large equated deltas; easy items have small equated deltas.

Reliability of variables requiring subjective judgement

While many of our predictor variables are arrived at objectively (e.g., by counting the number of words in a passage), the variables listed below required some degree of subjective judgement. The following percentage agreement was obtained for two raters using a sample of 35 cases: coherence, 74%; referentials, 92%; negations, 96%; frontings, 93%; rhetorical organizers, 89%; location of relevant text, 84%; and abstractness/concreteness, 87%. In general, these subjective measures yielded high reliabilities. Because of the high reliability, only one coder was used to code the remaining items and passages: half the variables were coded by author RF and the remaining half by author IK.

III Results and discussion

1 MANOVAs to determine possible interactions with predictor variables

We conducted a series of MANOVAs to determine whether there were significant interactions between the predictor variables and the four item types. Only one of the 42 MANOVA analyses suggested a significant interaction; this could easily have been due to chance, since roughly two of 42 tests would be expected to reach the .05 level by chance alone. Because of these results, all further analyses used the combined item-type samples. (For further information on these analyses, see ….)

2 Two item samples

Two samples of items were defined. The first consisted of 337 items (37 main-idea items, 82 inference items, 183 supporting-ideas items and 35 inference-application items). The second consisted of 302 items (i.e., the 337-item sample with the 35 inference-application items deleted).
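The percentage-agreement statistic used above to check coder reliability is simple enough to state exactly. A minimal sketch (the rating vectors shown are hypothetical illustrations, not the study's actual codes):

```python
def percent_agreement(rater_a, rater_b):
    """Inter-rater reliability as simple percentage agreement: the share
    of cases on which two raters assign identical codes. (No correction
    for chance agreement, unlike Cohen's kappa.)"""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be two equal-length, non-empty sequences")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

# Hypothetical binary codes for 10 double-coded cases; raters disagree once:
a = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
b = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
print(percent_agreement(a, b))  # 0.9
```

With categorical codes such as the rhetorical-organizer or abstractness judgements, the same function applies unchanged; only the code labels differ.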
The reason for deleting the inference-application items from the second sample is as follows: the sample of 302 items is more directly comparable to the reading items studied by ….

3 Correlational results

Appendix B presents data that help to identify the variables that are correlationally significant in predicting minitalk item difficulty. There we see that 43 different variables, in either or both of the two samples presented in the table, yielded a significant correlation (p < .05) with item difficulty (equated delta). Thus, of the 81 variables examined (i.e., the non-superscripted variables listed in Appendix A), slightly more than half (43) were significant. A perusal of the correlations for the two item samples (n = 302 and n = 337) reveals the great similarity of the two samples. First we use portions of Appendix B to assess the apparent adequacy of those variables that our literature review suggested would be pertinent to predicting item difficulty, pointing out along the way similarities and differences between the reading-study results (Freedle and Kostin, 1993) and our current set of listening (minitalk) variables. Overall, the correlational results suggest that many of the variables found to influence comprehension in the experimental language-comprehension literature also influence our multiple-choice listening data. We note that seven of the significant variables are pure item variables.
However, the very fact that many of the text and text/item overlap variables are significant can itself be taken as support for one view of construct validity of the TOEFL minitalk passages and their associated items: the view that contradicts the extreme claim that examinees do not have to pay attention to the passage in order to get the items correct (see the review by …). For example, in Appendix B we see that the presence of negatives was associated with more difficult listening items, as expected. This increased difficulty was found when negatives were present in the correct options (v7), the incorrect options (v11) and the text (v47). The first two scores had a similar effect for TOEFL reading items (see …). We also see in Appendix B that the presence of more referentials had a significant effect on listening-item comprehension, but the effects were not uniform across item components. Thus, the presence of referentials in the item stem (v4) was associated with easier items, whereas the presence of referentials in the correct option (v9) was associated with harder items. The presence of many extratextual or 'special' text referentials (v45) was associated with easier items. (Terms such as 'you' and 'we' used in a minitalk were classified as extratextual or 'special' referentials.) Because it was clear to whom the 'you' or 'we' referred in the minitalks (the listener, and both speaker and listener, respectively), the occurrence of these referentials seemed to personalize the listening material and probably, for that reason, tended to make items associated with such texts easier. Another explanation is that 'you' and 'we' may have occurred primarily in nonacademic subjects, and such subjects tend to contain easier or more familiar concepts than academic ones.
In fact, there was some evidence for this latter explanation: a high positive correlation of .54 (p < .01) between the nonacademic subjects and the occurrence of these extratextual referentials. Three rhetorical organizers had a significant effect on listening-item difficulty (v29, v31, v32): the list structure was associated with easier items, while the problem/solution and compare structures were associated with harder items. Similar results were found for reading regarding the problem/solution and list organizers (see Freedle and Kostin, 1993: 157). Fronted structures also had a significant effect: if the minitalk had a long string of fronted structures in successive sentences (text variable v41), this pattern was associated with harder minitalk items. A related fronted variable (also a text variable) was found in the Freedle and Kostin (1993) reading study to be associated with more difficult reading items. Concrete texts (i.e., texts that did not deal primarily with abstract concepts; v16) were associated with easier listening items, as expected; this pattern was also found for reading items (see …). Average sentence length and passage length were not significant for listening items. However, a few other variables assessing additional aspects of length (v1, v10, ss8) were significant: in all three cases, the longer the stem, the incorrect options, or the stretch of text encountered before the relevant item information, the more difficult the items tended to be, a result consistent with prediction. One of these 'length' effects (v10, the number of words in the incorrect options) was significant for both the reading and the minitalk tests; several additional length effects were significant for reading items (see …). The vocabulary effect (v14) was significant when applied to the minitalk texts.
However, the result was in the direction opposite to that expected: the more multisyllabic words (three or more syllables) in the text, the easier were the items associated with that text. This result is counterintuitive and differs from the result reported in Freedle and Kostin's (1993) reading study; we do not have a clear explanation for it, and the development of an alternative vocabulary score (v15) failed to clarify the problem. However, vocabulary did not emerge in any of our regression analyses and for this reason has not affected our conclusion regarding construct validity. Serial-position effects were strongly represented among the predictor variables (mm2, mm4, mm12, ii2, ii3, ii5, ss4, ss5). In general, items dealing with information presented early in the minitalk (mm2, mm4, ii2, ss4) tended to be easier items.
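As a rough check on the correlational screening reported in these results, one can ask how large a correlation must be to reach two-tailed significance at p < .05. The sketch below uses the large-sample normal approximation to the t test of a correlation against zero; with n = 337 the approximation is very close:

```python
from math import sqrt
from statistics import NormalDist

def critical_r(n: int, alpha: float = 0.05) -> float:
    """Approximate minimum |r| for two-tailed significance when testing
    rho = 0, via the normal approximation r_crit ~= z / sqrt(n - 2)
    (valid for small r, where the factor sqrt(1 - r**2) is near 1)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z / sqrt(n - 2)

# With 337 items, even modest correlations reach p < .05:
print(round(critical_r(337), 3))  # 0.107
```

This is why fairly small predictor-difficulty correlations can be significant in a sample of this size; smaller item samples would require substantially larger correlations.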
