119 research outputs found

    Interpreting random forest models using a feature contribution method

    Model interpretation is one of the key aspects of the model evaluation process. The explanation of the relationship between model variables and outputs is easy for statistical models, such as linear regressions, thanks to the availability of model parameters and their statistical significance. For “black box” models, such as random forest, this information is hidden inside the model structure. This work presents an approach for computing feature contributions for random forest classification models. It allows for the determination of the influence of each variable on the model prediction for an individual instance. Interpretation of feature contributions for two UCI benchmark datasets shows the potential of the proposed methodology. The robustness of the results is demonstrated through an extensive analysis of feature contributions calculated for a large number of generated random forest models.
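The decomposition behind such feature-contribution methods can be sketched in a few lines: a tree's prediction equals the root (bias) value plus the change in node value at each split along the decision path, with each change credited to the feature tested at that split. The toy tree and instance below are hypothetical, not the authors' data or code; a forest's contributions would simply average these per-tree contributions.

```python
def tree_contributions(node, x, contributions=None):
    """Walk a decision tree (nested dicts) and accumulate per-feature
    contributions for instance x. Internal nodes hold 'feature',
    'threshold', 'value', 'left', 'right'; leaves hold only 'value'."""
    if contributions is None:
        contributions = {}
    while "feature" in node:                        # internal node
        child = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
        delta = child["value"] - node["value"]      # change in mean outcome at this split
        contributions[node["feature"]] = contributions.get(node["feature"], 0.0) + delta
        node = child
    return node["value"], contributions

# Hypothetical toy tree: root (bias) value 0.5, splits on features 0 and 1.
tree = {
    "value": 0.5, "feature": 0, "threshold": 2.0,
    "left":  {"value": 0.2, "feature": 1, "threshold": 1.0,
              "left": {"value": 0.1}, "right": {"value": 0.4}},
    "right": {"value": 0.9},
}

pred, contribs = tree_contributions(tree, x={0: 1.5, 1: 2.0})
bias = tree["value"]
# The decomposition is exact: prediction = bias + sum of contributions.
assert abs(pred - (bias + sum(contribs.values()))) < 1e-9
```

The signed contributions (here, feature 0 lowers the prediction, feature 1 raises it) are what make the per-instance interpretation possible.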

    RobustSPAM for Inference from Noisy Longitudinal Data and Preservation of Privacy

    The availability of complex temporal datasets in social, health and consumer contexts has driven the development of pattern mining techniques that enable the use of classical machine learning tools for model building. In this work we introduce a robust temporal pattern mining framework for finding predictive patterns in complex timestamped multivariate and noisy data. We design an algorithm, RobustSPAM, that enables mining of temporal patterns from data with noisy timestamps. We apply our algorithm to social care data from a local government body and investigate how the efficiency and accuracy of the method depend on the level of noise. We further explore the trade-off between the loss of predictivity due to perturbation of timestamps and the risk of person re-identification.

    Development of an expected possession value model to analyse team attacking performances in rugby league.

    This study aimed to evaluate team attacking performances in rugby league via expected possession value (EPV) models. Location data from 59,233 plays across 180 matches of the 2019 Super League season were used. Six EPV models were generated using arbitrary zone sizes (EPV-308 and EPV-77) or aggregated according to the total zone value generated during a match (EPV-37, EPV-19, EPV-13 and EPV-9). Attacking sets were modelled as Markov chains, allowing the value of each zone visited to be estimated based on the outcome of the possession. The Kullback-Leibler divergence was used to evaluate the reproducibility of the value generated from each zone (the reward distribution) by teams between matches. Decreasing the number of zones improved the reproducibility of reward distributions between matches but reduced the variation in zone values. After six previous matches, the subsequent match's zones had been visited on 95% or more occasions for EPV-19 (95±4%), EPV-13 (100±0%) and EPV-9 (100±0%). The KL divergence values were infinity (EPV-308), 0.52±0.05 (EPV-77), 0.37±0.03 (EPV-37), 0.20±0.02 (EPV-19), 0.13±0.02 (EPV-13) and 0.10±0.02 (EPV-9). This study supports the use of EPV-19 and EPV-13, but not EPV-9 (too little variation in zone values), to evaluate team attacking performance in rugby league.
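The Kullback-Leibler comparison of reward distributions can be sketched as follows. The zone probabilities below are made-up illustrations, not the study's data; the sketch also shows why a very fine-grained zoning such as EPV-308 yields infinite divergence: any zone visited in one match but never in the other sends KL(P ‖ Q) to infinity.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete zone-reward distributions given as
    probability lists. Returns inf if Q is zero where P is positive,
    which happens when a zone is visited in one match but not the other."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue                 # lim p->0 of p*log(p/q) = 0
        if qi == 0.0:
            return math.inf          # unvisited zone -> infinite divergence
        total += pi * math.log(pi / qi)
    return total

# Hypothetical reward distributions over four zones in two matches.
match_a = [0.40, 0.30, 0.20, 0.10]
match_b = [0.35, 0.35, 0.20, 0.10]
print(kl_divergence(match_a, match_b))               # small: similar matches
print(kl_divergence(match_a, [0.5, 0.5, 0.0, 0.0]))  # inf: zones unvisited
```

Lower values indicate more reproducible reward distributions between matches, which is why the coarser zonings scored better above.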

    Application of Machine Learning Techniques to Predict Teenage Obesity Using Earlier Childhood Measurements from Millennium Cohort Study

    Obesity is a major global concern, with more than 2.1 billion people overweight or obese worldwide, which amounts to almost 30% of the global population. If the current trend continues, the overweight and obese population is likely to increase to 41% by 2030. Individuals developing signs of weight gain or obesity are also at risk of developing serious illnesses such as type 2 diabetes, respiratory problems, heart disease, stroke, and even death. It is essential to detect childhood obesity as early as possible, since children who are either overweight or obese at a young age tend to stay obese in their adult lives. This research utilises the vast amount of data available via the UK's Millennium Cohort Study to construct a machine-learning-driven framework to predict young people at risk of becoming overweight or obese. The focus of this paper is to develop a framework to predict childhood obesity using earlier childhood data and other relevant features. The use of a novel data-balancing technique and the inclusion of additional relevant features resulted in sensitivity, specificity, and F1-score of 77.32%, 76.81%, and 77.02%, respectively. The proposed technique utilises easily obtainable features, making it suitable for use in both clinical and non-clinical environments.
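The two building blocks here, balancing an imbalanced dataset and reporting sensitivity/specificity/F1, can be sketched in plain Python. Random oversampling below is a generic stand-in for the paper's (unspecified) novel balancing technique, and the confusion-matrix counts are hypothetical, not the study's results.

```python
import random

def oversample(data, seed=0):
    """Balance a binary-labelled dataset [(features, label), ...] by
    resampling the minority class with replacement -- a simple stand-in
    for the study's data-balancing technique."""
    rng = random.Random(seed)
    pos = [d for d in data if d[1] == 1]
    neg = [d for d in data if d[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    balanced = majority + minority + [rng.choice(minority)
                                      for _ in range(len(majority) - len(minority))]
    rng.shuffle(balanced)
    return balanced

def screening_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and F1 from a confusion matrix."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    f1 = 2 * prec * sens / (prec + sens)
    return sens, spec, f1

# Hypothetical imbalanced dataset: 10 obese (1) vs 40 non-obese (0) children.
data = [(i, 1) for i in range(10)] + [(i, 0) for i in range(40)]
balanced = oversample(data)

# Hypothetical confusion-matrix counts, purely for illustration.
sens, spec, f1 = screening_metrics(tp=77, fp=23, tn=77, fn=23)
```

Sensitivity matters most for a screening tool of this kind: a missed at-risk child (a false negative) is costlier than a false alarm.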

    Clustering of match running and performance indicators to assess between- and within-playing position similarity in professional rugby league

    This study aimed to determine the similarity between and within positions in professional rugby league in terms of technical performance and match displacement. The analyses were repeated on 3 different datasets, consisting of technical features only, displacement features only, and a combined dataset including both. Each dataset contained 7617 observations from the 2018 and 2019 Super League seasons, including 366 players from 11 teams. For each dataset, feature selection was initially used to rank features by their importance for predicting a player’s position for each match. Subsets of 12, 11, and 27 features were retained for the technical, displacement, and combined datasets for subsequent analyses. Hierarchical cluster analyses were then carried out on the positional means to find logical groupings. For the technical dataset, 3 clusters were found: (1) props, loose forwards, second-row, hooker; (2) halves; (3) wings, centres, fullback. For displacement, 4 clusters were found: (1) second-rows, halves; (2) wings, centres; (3) fullback; (4) props, loose forward, hooker. For the combined dataset, 3 clusters were found: (1) halves, fullback; (2) wings and centres; (3) props, loose forward, hooker, second-rows. These positional clusters can be used to standardise positional groups in research investigating technical, displacement, or both constructs within rugby league.
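Clustering positional means can be sketched with a naive agglomerative procedure. The per-position feature means below are invented for illustration (the study's retained feature subsets are not reproduced), and single linkage on squared Euclidean distance is one simple choice of linkage, not necessarily the one used in the paper.

```python
def single_linkage(points, n_clusters):
    """Naive agglomerative clustering (single linkage, squared Euclidean)
    over a dict of name -> feature-mean tuples. Repeatedly merges the two
    closest clusters until n_clusters remain."""
    clusters = [[name] for name in points]

    def dist(a, b):
        return min(sum((points[p][k] - points[q][k]) ** 2
                       for k in range(len(points[p])))
                   for p in a for q in b)

    while len(clusters) > n_clusters:
        pairs = [(dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)            # merge the closest pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical per-position means for two technical features
# (e.g. tackles and carries per match) -- illustrative values only.
positions = {
    "prop":   (35.0, 12.0), "hooker": (38.0, 10.0),
    "half":   (12.0, 5.0),
    "wing":   (4.0, 14.0),  "centre": (6.0, 13.0), "fullback": (5.0, 11.0),
}
print(single_linkage(positions, 3))
```

On these toy numbers the procedure recovers a forwards / halves / outside-backs split, the same flavour of grouping the study reports for its technical dataset.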

    Comparing the CORAL and random forest approaches for modelling the in vitro cytotoxicity of silica nanomaterials

    Nanotechnology is one of the most important technological developments of the twenty-first century. In silico methods to predict toxicity, such as quantitative structure-activity relationships (QSARs), promote the safe-by-design approach for the development of new materials, including nanomaterials. In this study, a set of cytotoxicity experimental data comprising 19 data points for silica nanomaterials was investigated to compare the widely employed CORAL and Random Forest approaches in terms of their usefulness for developing so-called “nano-QSAR” models. “External” leave-one-out cross-validation (LOO) analysis was performed to validate the two different approaches. An analysis of variable importance measures and signed feature contributions for both algorithms was undertaken in order to interpret the models developed. CORAL showed a more pronounced difference between the average coefficient of determination (R²) for training and for LOO (0.83 and 0.65, respectively) compared to Random Forest (0.87 and 0.78 without bootstrap sampling, 0.90 and 0.78 with bootstrap sampling), which may be due to overfitting. Amongst the nanomaterials’ physico-chemical properties, the aspect ratio and zeta potential were found to be the two most important variables for Random Forest, and the average feature contributions calculated for the corresponding descriptors were consistent with the clear trends observed in the dataset: less negative zeta potential values and lower aspect ratio values were associated with higher cytotoxicity. In contrast, CORAL failed to capture these trends.
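The LOO validation statistic compared above can be sketched directly: refit on n−1 points, predict the held-out point, and compute R² over the held-out predictions. The descriptor/activity pairs below are invented, and a 1-nearest-neighbour regressor stands in for CORAL or Random Forest purely to keep the sketch self-contained.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def loo_predictions(xs, ys, fit_predict):
    """Leave-one-out: refit on n-1 points, predict the held-out one."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        preds.append(fit_predict(train_x, train_y, xs[i]))
    return preds

def nn_fit_predict(train_x, train_y, x):
    # 1-nearest-neighbour "model": a hypothetical stand-in, not CORAL or RF.
    j = min(range(len(train_x)), key=lambda k: abs(train_x[k] - x))
    return train_y[j]

# Hypothetical descriptor/activity pairs (e.g. zeta potential vs cytotoxicity).
xs = [-40.0, -30.0, -20.0, -10.0, -5.0, 0.0]
ys = [0.1, 0.2, 0.35, 0.55, 0.7, 0.8]
q2 = r_squared(ys, loo_predictions(xs, ys, nn_fit_predict))
```

A large gap between training R² and this LOO statistic is the overfitting signature the abstract attributes to CORAL.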

    Identification of pattern mining algorithm for rugby league players positional groups separation based on movement patterns

    The application of pattern mining algorithms to extract movement patterns from sports big data can improve training specificity by facilitating a more granular evaluation of movement. Since movement patterns can only occur as consecutive, non-consecutive, or non-sequential, this study aimed to identify the best set of movement patterns for player movement profiling in professional rugby league and to quantify the similarity among distinct movement patterns. Three pattern mining algorithms (l-length Closed Contiguous [LCCspm], Longest Common Subsequence [LCS] and AprioriClose) were used to extract patterns profiling the match-game movements of elite rugby league hookers (n = 22 players) and wingers (n = 28 players) across 319 matches. The Jaccard similarity score was used to quantify the similarity between the algorithms’ movement patterns, and machine learning classification modelling identified which algorithm’s movement patterns best separated playing positions. LCCspm and LCS movement patterns shared a Jaccard similarity score of 0.19. AprioriClose movement patterns shared no significant Jaccard similarity with LCCspm (0.008) or LCS (0.009) patterns. The closed contiguous movement patterns profiled by LCCspm best separated players into playing positions. A multi-layer perceptron classification algorithm achieved the highest accuracy of 91.02%, with precision, recall and F1 scores of 0.91. Therefore, we recommend the extraction of closed contiguous (consecutive) movement patterns over non-consecutive and non-sequential ones for separating groups of players.
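The Jaccard comparison between two algorithms' pattern sets is straightforward to sketch. The pattern sets below are hypothetical examples of mined movement-unit sequences, not output from LCCspm or LCS.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of mined movement patterns:
    |A intersect B| / |A union B| (0.0 for two empty sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a or b else 0.0

# Hypothetical patterns: tuples of movement units mined by two algorithms.
lccspm = {("walk", "jog"), ("jog", "run"), ("walk", "walk", "jog")}
lcs    = {("walk", "jog"), ("jog", "sprint"), ("run",), ("jog", "run")}
print(jaccard(lccspm, lcs))  # 2 shared of 5 distinct patterns -> 0.4
```

Scores near zero, as between AprioriClose and the other two algorithms above, indicate the algorithms are extracting essentially disjoint pattern vocabularies.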

    Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Datasets

    The ability to interpret the predictions made by quantitative structure activity relationships (QSARs) offers a number of advantages. Whilst QSARs built using non-linear modelling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modelling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting non-linear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to two widely used linear modelling approaches: linear Support Vector Machines (SVM), or Support Vector Regression (SVR), and Partial Least Squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions, using novel scoring schemes for assessing Heat Map images of substructural contributions. We critically assess different approaches to interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed, public domain benchmark datasets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modelling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpreting non-linear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using Open Source programs, which we have made available to the community. 
These programs are the rfFC package [https://r-forge.r-project.org/R/?group_id=1725] for the R Statistical Programming Language, along with a Python program, HeatMapWrapper [https://doi.org/10.5281/zenodo.495163], for Heat Map generation.

    Moving beyond velocity derivatives; using global positioning system data to extract sequential movement patterns at different levels of rugby league match-play

    This study aims to (a) quantify the movement patterns during rugby league match-play and (b) identify whether differences exist by level of competition within the movement patterns and units extracted through the sequential movement pattern (SMP) algorithm. Global Positioning System data were analysed from three competition levels: four Super League regular-season matches (regular-SL), three Super League (semi-)finals (final-SL) and four international rugby league matches (international). The SMP framework extracted movement pattern data for each athlete within the dataset. Differences between competition levels were analysed using linear discriminant analysis (LDA). Movement patterns were decomposed into their composite movement units; Kruskal-Wallis rank-sum tests and Dunn post-hoc tests were then used to identify differences. The SMP algorithm found 121 movement patterns, comprised mainly of "walk" and "jog" based movement units. The LDA had an accuracy score of 0.81, showing good separation between competition levels. Linear discriminants 1 and 2 explained 86% and 14% of the variance, respectively. The Kruskal-Wallis tests found differences between competition levels for 9 of 17 movement units. Differences were primarily present between regular-SL and international, with other combinations showing fewer differences. Movement units showing significant differences between competition levels were mainly composed of low velocities with mixed accelerations and turning angles. Of the 121 movement patterns found across all levels of match-play, all 9 that differed significantly did so between international and domestic levels, whereas only four differed within the domestic levels. This study shows that the SMP algorithm can be used to differentiate between levels of rugby league and that higher levels of competition may have greater velocity demands.
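The core idea of extracting sequential movement patterns from GPS traces can be sketched in two steps: discretise per-second speeds into movement units, then count consecutive subsequences. The speed thresholds and trace below are hypothetical illustrations, not the SMP framework's actual parameters.

```python
def to_units(speeds, jog=2.0, run=4.0):
    """Discretise per-second speeds (m/s) into coarse movement units.
    Thresholds are hypothetical, not those used by the SMP framework."""
    return ["walk" if s < jog else "jog" if s < run else "run" for s in speeds]

def sequential_patterns(units, length=3):
    """Count all consecutive movement patterns of a given length."""
    counts = {}
    for i in range(len(units) - length + 1):
        pat = tuple(units[i:i + length])
        counts[pat] = counts.get(pat, 0) + 1
    return counts

# Hypothetical 8-second GPS speed trace for one athlete.
speeds = [1.0, 1.5, 2.5, 3.0, 5.0, 1.2, 1.0, 2.1]
units = to_units(speeds)
patterns = sequential_patterns(units)
```

Comparing such pattern counts across competition levels is what the LDA and Kruskal-Wallis analyses above operate on.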
