It is of paramount importance that formative feedback is meaningful in order to drive student learning. Achieving this, however, relies upon a clear and constructively aligned model of quality being applied consistently across submissions. This poster presentation raises concerns about the inter-rater reliability of code reviews conducted by teaching assistants in the absence of such a model. Five teaching assistants each reviewed 12 purposely selected programs submitted by introductory programming students. An analysis of their reliability revealed that while teaching assistants were self-consistent, they each assessed code quality in different ways. This suggests a need for standard models of program quality and rubrics, alongside supporting technology, to be used during code reviews to improve the reliability of formative feedback.

Ghinea, G

Scott, MJ

English

Reliability in the Assessment of Program Quality by
Teaching Assistants During Code Reviews
Michael James Scott
Department of Computer Science
Brunel University London
United Kingdom
michael.scott@brunel.ac.uk
Gheorghita Ghinea
Department of Computer Science
Brunel University
United Kingdom
george.ghinea@brunel.ac.uk
ABSTRACT
It is of paramount importance that formative feedback is
meaningful in order to drive student learning. Achieving
this, however, relies upon a clear and constructively
aligned model of quality being applied consistently across
submissions. This poster presentation raises concerns about
the inter-rater reliability of code reviews conducted by
teaching assistants in the absence of such a model. Five
teaching assistants each reviewed 12 purposely selected
programs submitted by introductory programming students.
An analysis of their reliability revealed that while teaching
assistants were self-consistent, they each assessed code
quality in different ways. This suggests a need for standard
models of program quality and rubrics, alongside supporting
technology, to be used during code reviews to improve the
reliability of formative feedback.
Categories and Subject Descriptors
K.3.2 [Computers and Education]: Computer and
Information Science Education.
Keywords
Programming, Code Review, Grading, Quality, Assessment,
Reliability, Concordance, Agreement, Consistency.
1. INTRODUCTION
Guidance is important when first learning computer
programming. This is because students often need help to
develop an appreciation for program quality. Such guidance
often consists of formative feedback provided during code
reviews. However, in large undergraduate cohorts, such code
reviews may be conducted by teams of teaching assistants.
For feedback to be meaningful to students, it should be
clear, reliable and constructively align with relevant learning
objectives (cf. [3, 5]). This is because conflicting feedback
from different sources could cause confusion. Previous work
suggests that reviews by experienced faculty tend to be
correlated, but different reasoning is sometimes applied [1].
It is not clear, then, whether assessments made by teaching
assistants would be as consistent. Of particular concern is
that assessments of program quality may reflect more on
the reviewer than on the student (see [4] for detail on the
idiosyncratic rater effect).
Copyright is held by the author/owner(s).
ITiCSE’15, July 6–8, 2015, Vilnius, Lithuania.
ACM 978-1-4503-2078-8/13/07.
Table 1: Reliability of Assessments (α ≥ 0.667)

Measure                                  Reliability α
Self-Consistency                         .841
Agreement Between Teaching Assistants    .607
Agreement with Faculty Assessments       .522

2. FINDINGS
Five teaching assistants, each with at least one year
of experience, reviewed 12 purposely selected programs
submitted by first-year computing students and made
holistic assessments of their quality using a 3-point scale
(pass, merit, distinction). Minimal instruction was provided
to reflect a less formal formative (rather than summative)
context. After two weeks, they re-reviewed the programs.
On each occasion the programs were presented in a random
order and some elements (e.g., identifiers) were transformed.
The data were analysed using Krippendorff’s alpha [2].
The results in Table 1 show that while the assessments
were adequately self-consistent (α = .841), inter-rater
reliability was low (α = .607) and there was considerable
disagreement with ratings provided by faculty (α = .522),
with both values falling below the .667 threshold. This finding
suggests that teaching assistants use different notions or
standards of program quality when conducting code reviews
and therefore need support. As such, this study provides
a foundation for future work on the development and
evaluation of code review processes, program quality rubrics,
and supporting technologies.
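
The agreement statistic named above can be sketched in a few lines
of Python. This is a minimal illustration, not the analysis code
used in the study: the function name krippendorff_alpha, the
choice of distance metrics, and the encoding pass = 1, merit = 2,
distinction = 3 are all assumptions, and the ratings in the demo
are invented. Each unit is the list of ratings one program
received; for self-consistency, a unit could instead hold the two
ratings one teaching assistant gave the same program.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha(units, metric="nominal"):
    """Krippendorff's alpha from a list of units.

    Each unit is a list of the numeric ratings that one item received;
    missing ratings are simply left out. Units with fewer than two
    ratings cannot be paired and are ignored.
    """
    pairable = [u for u in units if len(u) >= 2]

    # Coincidence matrix: every ordered pair of ratings within a unit,
    # weighted by 1 / (number of ratings in that unit - 1).
    coincidences = Counter()
    for ratings in pairable:
        weight = 1.0 / (len(ratings) - 1)
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += weight

    # Marginal totals per rating value and the overall total.
    totals = Counter()
    for (a, _), w in coincidences.items():
        totals[a] += w
    n = sum(totals.values())
    values = sorted(totals)

    def delta2(c, k):
        """Squared difference between two rating values."""
        if metric == "nominal":
            return 0.0 if c == k else 1.0
        if metric == "interval":
            return float(c - k) ** 2
        if metric == "ordinal":
            lo, hi = sorted((c, k))
            between = sum(totals[g] for g in values if lo <= g <= hi)
            return (between - (totals[c] + totals[k]) / 2.0) ** 2
        raise ValueError("unknown metric: " + metric)

    observed = sum(w * delta2(c, k) for (c, k), w in coincidences.items()) / n
    expected = sum(
        totals[c] * totals[k] * delta2(c, k) for c in values for k in values
    ) / (n * (n - 1))
    if expected == 0:
        return float("nan")  # undefined when only one value is ever used
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Invented ratings for four programs by three reviewers
    # (1 = pass, 2 = merit, 3 = distinction). Not the study's data.
    demo = [[1, 1, 2], [2, 3, 3], [3, 3, 3], [1, 2, 2]]
    print(krippendorff_alpha(demo, metric="ordinal"))
```

The ordinal metric is used in the demo because pass, merit and
distinction are ordered; the paper does not state which metric was
applied, so that choice is an assumption here.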
3. REFERENCES
[1] S. Fitzgerald, B. Hanks, R. Lister, R. McCauley, and
L. Murphy. What are we thinking when we grade
programs? In SIGCSE ’13, pages 471–476. ACM, 2013.
[2] A. Hayes and K. Krippendorff. Answering the call for a
standard reliability measure for coding data. Commun.
Methods & Measures, 1(1):77–89, 2007.
[3] A. Pears, J. Harland, M. Hamilton, and R. Hadgraft.
Four feed-forward principles enhance students’
perception of feedback as meaningful. In LaTiCE ’14,
pages 272–277. IEEE, 2014.
[4] S. E. Scullen, M. K. Mount, and M. Goff.
Understanding the latent structure of job performance
ratings. Journal of Applied Psychology, 85(6):956, 2000.
[5] M. Stegeman, E. Barendsen, and S. Smetsers. Towards
an empirically validated model for assessment of code
quality. In Koli Calling ’14, pages 99–108. ACM, 2014.


Brunel University Research Archive

Reliability in the Assessment of Program Quality by
Teaching Assistants During Code Reviews
Michael James Scott
Department of Computer Science
Brunel University London
United Kingdom
michael.scott@brunel.ac.uk
Gheorghita Ghinea
Department of Computer Science
Brunel University London
United Kingdom
george.ghinea@brunel.ac.uk
ABSTRACT
It is of paramount importance that formative feedback is
meaningful in order to drive student learning. Achieving
this, however, relies upon a clear and constructively
aligned model of quality being applied consistently across
submissions. This poster presentation raises concerns about
the inter-rater reliability of code reviews conducted by
teaching assistants in the absence of such a model. Five
teaching assistants each reviewed 12 purposely selected
programs submitted by introductory programming students.
An analysis of their reliability revealed that while teaching
assistants were self-consistent, they each assessed code
quality in different ways. This suggests a need for standard
models of program quality, alongside supporting rubrics and
other tools, to be used during code reviews to improve the
reliability of formative feedback.
Categories and Subject Descriptors
K.3.2 [Computers and Education]: Computer and
Information Science Education
Keywords
Programming, Code Review, Code Inspection, Grading,
Quality, Assessment, Reliability, Agreement, Consistency.
1. INTRODUCTION
Guidance is important when first learning computer
programming to help students develop an appreciation for
quality. This often consists of feedback provided during
code reviews. However, for such feedback to be meaningful,
it should be clear, reliable and constructively align with
relevant learning objectives (c.f. [2, 4]). This is because
conflicting feedback from different teaching assistants could
cause confusion. Previous work suggests that reviews by
experienced faculty tend to be correlated, but different
reasoning is sometimes applied [1]. However, it remains
unclear whether those done by teaching assistants are as
consistent. Of particular concern is that the reviews may
reflect more on the reviewer than on the student (see [3] for
detail on the idiosyncratic rater effect).
Permission to make digital or hard copies of part or all of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. Copyrights for
third-party components of this work must be honored. For all other uses,
contact the Owner/Author. Copyright is held by the owner/author(s).
ITiCSE’15, July 04–08, 2015, Vilnius, Lithuania
ACM 978-1-4503-3440-2/15/07.
http://dx.doi.org/10.1145/2729094.2754844
Table 1: Reliability of Assessment (E(α) ≥ .667)

Measure                                  Krippendorff’s α
Self-Consistency                         .841
Agreement Between Teaching Assistants    .607
Agreement with Faculty Assessments       .522

2. FINDINGS
Five experienced teaching assistants (> 1 yr) reviewed 12
programs selected from first-year undergraduate computing
submissions and made holistic assessments of their quality
using a 3-point scale (pass, merit, distinction). Minimal
instruction was provided to reflect a less formal context.
After two weeks, they re-reviewed the programs. On each
occasion the programs were presented in a random order and
some elements (e.g., identifiers) were transformed. The data
were analysed using Krippendorff’s alpha.
The results, shown in Table 1, reveal that while
the assessments were adequately self-consistent, there
was low inter-rater reliability and there was considerable
disagreement with ratings provided by a team of faculty.
This finding suggests that teaching assistants apply different
standards of program quality when conducting code reviews
and therefore require support to improve reliability. As
such, this study provides a foundation for future work on
the development and evaluation of code review processes,
models of program quality, as well as rubrics and other tools.
3. REFERENCES
[1] S. Fitzgerald, B. Hanks, R. Lister, R. McCauley, and
L. Murphy. What are we thinking when we grade
programs? In SIGCSE ’13, pages 471–476. ACM, 2013.
[2] A. Pears, J. Harland, M. Hamilton, and R. Hadgraft.
Four feed-forward principles enhance students’
perception of feedback as meaningful. In LaTiCE ’14,
pages 272–277. IEEE, 2014.
[3] S. E. Scullen, M. K. Mount, and M. Goff.
Understanding the latent structure of job performance
ratings. Journal of Applied Psychology, 85(6):956, 2000.
[4] M. Stegeman, E. Barendsen, and S. Smetsers. Towards
an empirically validated model for assessment of code
quality. In Koli Calling ’14, pages 99–108. ACM, 2014.

