thesis

An Exploration of Traditional and Data Driven Predictors of Programming Performance

Abstract

This thesis investigates factors that can be used to predict the success or failure of students taking an introductory programming course. Four studies were performed to explore how aspects of the teaching context, static factors based upon traditional learning theories, and data-driven metrics derived from aspects of programming behaviour were related to programming performance. In the first study, a systematic review into the worldwide outcomes of programming courses revealed an average pass rate of 67.7\%. This was found to have not significantly changed over time, or to have differed based upon aspects of the teaching context, such as the programming language taught to students. The second study showed that many of the factors based upon traditional learning theories, such as learning styles, are context dependent, and fail to consistently predict programming performance when they are applied across different teaching contexts. The third study explored data-driven metrics derived from the programming behaviour of students. Analysing data logged from students using the BlueJ IDE, 10 new data-driven metrics were identified and validated on three independently gathered datasets. Weaker students were found to make a greater percentage of successive errors, and spend a greater percentage of their lab time resolving errors than stronger students. The Robust Relative algorithm was developed to hybridize four of the strongest data-driven metrics into a performance predictor. The novel relative scoring of students based upon how their resolve times for different types of errors compared to the resolve times of their peers, resulted in a predictor which could explain a large proportion of the variance in the performance of three independent cohorts, R2R^2 = 42.19\%, 43.65\% and 44.17\% - almost double the variance which could be explained by Jadud's Error Quotient metric. The fourth study situated the findings of this thesis within the wider literature, by applying meta-analysis techniques to statistically synthesise fifty years of conflicting research, such that the most important factors for learning programming could be identified. 482 results describing the effects of 116 factors on programming performance were synthesised and consolidated to form a six class theoretical framework. The results showed that the strongest predictors identified over the past fifty years are data-driven metrics based upon programming behaviour. Several of the traditional predictors were also found to be influential, suggesting that both a certain level of scientific maturity and self-concept are necessary for programming. Two thirds of the weakest predictors were based upon demographic and psychological factors, suggesting that age, gender, self-perceived abilities, learning styles, and personality traits have no relevance for programming performance. This thesis argues that factors based upon traditional learning theories struggle to consistently predict programming performance across different teaching contexts because they were not intended to be applied for this purpose. In contrast, the main advantage of using data-driven approaches to derive metrics based upon students' programming processes, is that these metrics are directly based upon the programming behaviours of students, and therefore can encapsulate such changes in their programming knowledge over time. Researchers should continue to explore data-driven predictors in the future

    Similar works