    Topic and background knowledge effects on performance in speaking assessment

    This study explores the extent to which topic and background knowledge of topic affect spoken performance in a high-stakes speaking test. It is argued that evidence of a substantial influence may introduce construct-irrelevant variance and undermine test fairness. Data were collected from 81 non-native speakers of English who performed on 10 topics across three task types. Background knowledge and general language proficiency were measured using self-report questionnaires and C-tests, respectively. Score data were analysed using many-facet Rasch measurement and multiple regression. Findings showed that for two of the three task types, the topics used in the study generally exhibited difficulty measures that were statistically distinct. However, the size of the differences in topic difficulties was too small to have a large practical effect on scores. Participants' different levels of background knowledge were shown to have a systematic effect on performance. However, these statistically significant differences also failed to translate into practical significance. Findings hold implications for speaking performance assessment.
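
    A common formulation of the many-facet Rasch model used in analyses of this kind relates the log-odds of adjacent score categories to examinee ability, topic difficulty, rater severity, and a category threshold; the facet labels below follow the abstract and are not necessarily the authors' exact model specification:

        \log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k

    Here P_{nijk} is the probability that examinee n receives score category k on topic i from rater j, B_n is examinee ability, D_i is topic difficulty, C_j is rater severity, and F_k is the difficulty of category k relative to category k-1.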

    Integration of a web-based rating system with an oral proficiency interview test: argument-based approach to validation

    This dissertation focuses on the validation of the Oral Proficiency Interview (OPI), a component of the Oral English Certification Test for international teaching assistants. The rating of oral responses was implemented through an innovative computer technology, a web-based rating system called Rater-Platform (R-Plat). The main purpose of the dissertation was to investigate the validity of interpretations and uses of the OPI scores derived from raters' assessment of examinees' performance during the web-based rating process. Following the argument-based validation approach (Kane, 2006), an interpretive argument for the OPI was constructed. The interpretive argument specifies a series of inferences, warrants for each inference, and the underlying assumptions and specific types of backing necessary to support those assumptions. Of the seven inferences (domain description, evaluation, generalization, extrapolation, explanation, utilization, and impact), this study focuses on two. Specifically, it aims to obtain validity evidence for three assumptions underlying the evaluation inference and three assumptions underlying the generalization inference. The research questions addressed: (1) raters' perceptions of R-Plat in terms of clarity, effectiveness, satisfaction, and comfort level; (2) the quality of raters' diagnostic descriptor markings; (3) the quality of raters' comments; (4) the quality of OPI scores; (5) the quality of individual raters' OPI ratings; (6) prompt difficulty; and (7) raters' rating practices. A mixed-methods design was employed to collect and analyze qualitative and quantitative data. Qualitative data consisted of: (a) 14 raters' responses to open-ended questions about their perceptions of R-Plat, (b) 5 recordings of individual/focus group interviews eliciting raters' perceptions, and (c) 1,900 evaluative units extracted from raters' comments about examinees' speaking performance. Quantitative data included: (a) 14 raters' responses to six-point scale statements about their perceptions, (b) 2,524 diagnostic descriptor markings of examinees' speaking ability, (c) OPI scores for 279 examinees, (d) 803 individual raters' ratings, (e) individual prompt ratings divided by each intended prompt level, given by each rater, and (f) individual raters' ratings on the given prompts, grouped by test administration. The results showed that the assumptions for the evaluation inference were supported. Raters' responses to the questionnaire and the individual/focus group interviews revealed positive attitudes towards R-Plat. Diagnostic descriptor markings and raters' comments, analyzed by chi-square tests, distinguished between different speaking ability levels. OPI scores were distributed across different proficiency levels throughout different test administrations. For the generalization inference, both positive and negative evidence was obtained. MFRM analyses showed that OPI scores reliably separated examinees into different speaking ability levels. Observed prompt difficulty matched intended prompt levels, although several problematic prompts were identified. Finally, while the raters used the rating scales adequately and consistently within the same test administration, they were not consistent in their severity. Overall, the foundational parts of the validity argument were successfully established. The findings of this study allow for moving forward with the investigation of the subsequent inferences in order to construct a complete OPI validity argument. They also suggest important implications for argument-based validation research, for the study of rater and task variability, and for future applications of web-based rating systems in speaking assessment.
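
    The chi-square analyses of diagnostic descriptor markings described above can be illustrated with a minimal sketch; the contingency table below is hypothetical and does not reproduce the dissertation's data.

        # Minimal sketch of a chi-square test of independence between examinee
        # proficiency level and diagnostic descriptor markings (hypothetical counts).
        from scipy.stats import chi2_contingency

        observed = [
            [120, 80],   # lower-level examinees: positive vs. negative descriptors
            [200, 60],   # mid-level examinees
            [310, 30],   # higher-level examinees
        ]

        chi2, p, dof, expected = chi2_contingency(observed)
        print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4g}")

    A significant result of this kind would indicate that descriptor markings are not independent of proficiency level, which is the sense in which such markings can be said to distinguish ability levels.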

    A comparison of holistic, analytic, and part marking models in speaking assessment

    This mixed-methods study examined holistic, analytic, and part marking models (MMs) in terms of their measurement properties and impact on candidate CEFR classifications in a semi-direct online speaking test. Speaking performances of 240 candidates were first marked holistically and by part (phase 1). On the basis of phase 1 findings, which suggested stronger measurement properties for the part MM, phase 2 focused on a comparison of part and analytic MMs. Speaking performances of 400 candidates were rated analytically and by part during that phase. Raters provided open comments on their marking experiences. Results suggested a significant impact of MM; approximately 30% and 50% of candidates in phases 1 and 2 respectively were awarded different (adjacent) CEFR levels depending on the choice of MM used to assign scores. There was a trend of higher CEFR levels with the holistic MM and lower CEFR levels with the part MM. While strong correlations were found between all pairings of MMs, further analyses revealed important differences. The part MM was shown to display superior measurement qualities, particularly in allowing raters to make finer distinctions between different speaking ability levels. These findings have implications for the scoring validity of speaking tests.
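
    The kind of classification comparison reported above can be sketched as follows; the candidates and the two sets of CEFR levels are invented for illustration and do not reproduce the study's scores or cut-offs.

        # Minimal sketch of comparing CEFR classifications produced by two marking
        # models (e.g., holistic vs. part); the data below are hypothetical.
        import pandas as pd

        df = pd.DataFrame({
            "cefr_holistic": ["B1", "B2", "B2", "C1", "B1", "B2"],
            "cefr_part":     ["B1", "B1", "B2", "B2", "A2", "B2"],
        })

        # Cross-tabulate the two classifications and report the share of candidates
        # placed at a different CEFR level depending on the marking model.
        print(pd.crosstab(df["cefr_holistic"], df["cefr_part"]))
        disagreement = (df["cefr_holistic"] != df["cefr_part"]).mean()
        print(f"Candidates classified differently: {disagreement:.0%}")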

    Unstressed Vowels in German Learner English: An Instrumental Study

    This study investigates the production of vowels in unstressed syllables by advanced German learners of English in comparison with native speakers of Standard Southern British English. Two acoustic properties were measured: duration and formant structure. The results indicate that the duration of unstressed vowels is similar in the two groups, though there is some variation depending on the phonetic context. In terms of formant structure, learners produce slightly higher F1 and considerably lower F2, the difference in F2 being statistically significant for each learner. Formant values varied as a function of context and the orthographic representation of the vowel.
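
    The per-learner F2 comparison implied above can be sketched with a simple two-sample test; the formant values (in Hz) below are invented for illustration and are not the study's measurements, and the study's actual statistical procedure may differ.

        # Minimal sketch: compare one learner's unstressed-vowel F2 values against
        # native-speaker reference tokens using Welch's t-test (hypothetical data).
        from scipy.stats import ttest_ind

        learner_f2 = [1350, 1420, 1380, 1290, 1410, 1330]  # learner tokens (Hz)
        native_f2  = [1620, 1580, 1650, 1540, 1600, 1570]  # native-speaker tokens (Hz)

        t, p = ttest_ind(learner_f2, native_f2, equal_var=False)
        print(f"t = {t:.2f}, p = {p:.4f}")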

    Children at risk: their phonemic awareness development in holistic instruction

    Includes bibliographical references (p. 17-19).

    Speaking English performance assessment with the facet Rasch measurement model

    This study aims to assess students' English-speaking ability on the basis of peer assessment. It is a quantitative study involving 10 students. Data were collected using tests and a student speaking assessment rubric with scores ranging from 1 to 5. The speaking assessment criteria are pronunciation, grammar, vocabulary, fluency, and understanding. Data were analyzed using many-facet Rasch measurement (MFRM), which allows the interaction between respondents and items to be examined simultaneously. The results show that the separation indices for the criteria (6.39), speakers (0.51), and raters (5.32), together with the standard deviation values, clearly indicate a good distribution of item difficulty. Reliability is 0.98 for the criteria, 0.21 for speakers, and 0.97 for raters.
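
    In Rasch measurement, the separation index G and the corresponding reliability R are linked by R = G^2 / (1 + G^2). Applying this relation to the reported indices reproduces the reported reliability figures, which is how the three values are matched to criteria, speakers, and raters above:

        R_{\text{criteria}} = \frac{6.39^2}{1 + 6.39^2} \approx 0.98, \qquad R_{\text{speakers}} = \frac{0.51^2}{1 + 0.51^2} \approx 0.21, \qquad R_{\text{raters}} = \frac{5.32^2}{1 + 5.32^2} \approx 0.97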

    The development of automatic speech evaluation system for learners of English

    Degree system: new; report number: Kō No. 3183; degree type: Doctorate (Education); date conferred: 2010/11/30; Waseda University degree record number: Shin 547