121 research outputs found

    Making the cut: forecasting non impact injuries in professional soccer

    Get PDF
    This paper proposes a methodology to forecast non-traumatic injuries in professional soccer players. The task is framed as a classification problem on the player's status over a 72-hour window. The dataset corresponds to complete training records of the players of the Belgrano de Córdoba professional soccer team in the first division of Argentina. The chosen model is a GBM with an AUC of 0.7. SHAP-based interpretation exercises are performed on the chosen model to analyse the features that drive its predictions. In addition, possible extensions are proposed, such as using the model's results at the time of contractual negotiation, given the estimated proportion of time the player will be sidelined by injury and the economic cost of those absences, given, at least, by the direct salary cost of that player. Another approach to the injury-forecasting problem, based on survival-time models, is also discussed.
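The AUC of 0.7 reported above summarises how well the model ranks at-risk players above healthy ones. As a minimal illustrative sketch (not the paper's code; the function name and toy data are hypothetical), AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one.

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney pairwise formulation.

    labels: iterable of 0/1 (1 = positive, e.g. injury within the window)
    scores: model scores, higher = more likely positive
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Count correctly ordered positive/negative pairs; ties count half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy check: one of the four positive/negative pairs is mis-ranked.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

This pairwise form makes explicit why AUC is insensitive to class imbalance: it only depends on the relative ordering of positives and negatives.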

    Detection of Software Vulnerability Communication in Expert Social Media Channels: A Data-driven Approach

    Get PDF
    Conceptually, a vulnerability is: a flaw or weakness in a system's design, implementation, or operation and management that could be exploited to violate the system's security policy. Some of these flaws can go undetected and be exploited for long periods of time after software release. Although some software providers are making efforts to avoid this situation, inevitably, users are still exposed to vulnerabilities that allow criminal hackers to take advantage. These vulnerabilities are constantly discussed in specialised forums on social media. Therefore, from a cyber security standpoint, the information found in these places can be used for countermeasure actions against malicious exploitation of software. However, manual inspection of the vast quantity of shared content in social media is impractical. For this reason, in this thesis, we analyse the real applicability of supervised classification models to automatically detect software vulnerability communication in expert social media channels. We cover the following three principal aspects: Firstly, we investigate the applicability of classification models in a range of 5 different datasets collected from 3 Internet domains: Dark Web, Deep Web and Surface Web. Since supervised models require labelled data, we have provided a systematic labelling process using multiple annotators to guarantee accurate labels to carry out experiments. Using these datasets, we have investigated the classification models with different combinations of learning-based algorithms and traditional feature representations. Also, by oversampling the positive instances, we have achieved an increase of 5% in Positive Recall (on average) in these models. On top of that, we have applied Feature Reduction, Feature Extraction and Feature Selection techniques, which provided a reduction in the dimensionality of these models without damaging the accuracy, thus providing computationally efficient models.
Furthermore, in addition to traditional feature representations, we have investigated the performance of robust language models, such as Word Embedding (WEMB) and Sentence Embedding (SEMB), on the accuracy of classification models. Regarding WEMB, our experiment has shown that this model trained with a small security-vocabulary dataset provides comparable results with WEMB trained on a very large general-vocabulary dataset. Regarding the SEMB model, our experiment has shown that its use overcomes the WEMB model in detecting vulnerability communication, recording 8% of Avg. Class Accuracy and 74% of Positive Recall. In addition, we investigate two Deep Learning algorithms as classifiers, text CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network)-based algorithms, which have improved our model, resulting in the best overall performance for our task.
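The ~5% Positive Recall gain above comes from oversampling the positive class before training. A minimal sketch of random oversampling with replacement (the function name and toy data are hypothetical, assuming a binary-labelled dataset):

```python
import random

def oversample_positives(X, y, seed=0):
    """Duplicate randomly chosen positive examples until the two
    classes are balanced (random oversampling with replacement)."""
    rng = random.Random(seed)
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]
    # Draw enough extra positives to match the negative count.
    extra = [rng.choice(pos_idx) for _ in range(len(neg_idx) - len(pos_idx))]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]

X = [[0.1], [0.2], [0.3], [0.4], [0.9]]   # toy feature vectors
y = [0, 0, 0, 0, 1]                        # heavily skewed labels
Xb, yb = oversample_positives(X, y)
print(sum(yb), len(yb) - sum(yb))          # 4 4
```

Because positives are duplicated rather than synthesised, this raises recall on the minority class at some risk of overfitting to the repeated examples.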

    AUC-Based Extreme Learning Machines for Supervised and Semi-Supervised Imbalanced Classification

    Full text link

    Distributed context discovering for predictive modeling

    Get PDF
    Click prediction has applications in various areas such as advertising, search and online sales. Usually, user-intent information such as query terms and previous click history is used in click prediction. However, this information is not always available. For example, there are no queries from users on the webpages of content publishers, such as personal blogs. The information available for click prediction in this scenario is implicitly derived from users, such as visiting time and IP address. Thus, the existing approaches utilizing user-intent information may be inapplicable in this scenario, and the click prediction problem in this scenario remains, to our knowledge, unexplored. In addition, the challenges of handling skewed data streams also arise in this prediction task, since webpages often receive heavy traffic yet few visitors click on them. In this thesis, we propose to use a pattern-based classification approach to tackle the click prediction problem. Attributes of webpage visits are combined by a pattern mining algorithm to enhance their predictive power. To make the pattern-based classification handle skewed data streams, we adopt a sliding window to capture recent data and an undersampling technique to handle the skewness. As a side problem raised by the pattern-based approach, mining patterns from large datasets is addressed by a distributed pattern sampling algorithm we propose; this algorithm shows its scalability in experiments. We validate our pattern-based approach to click prediction on a real-world dataset from a Dutch portal website. The experiments show our pattern-based approach can achieve an average AUC of 0.675 over a period of 36 days with a 5-day sliding window, which surpasses the baseline (a statically trained classification model without patterns) by 0.002. In addition, the average weighted F-measure of our approach is 0.009 higher than the baseline.
Therefore, our proposed approach can slightly improve classification performance; yet whether this improvement is worth deploying in real scenarios remains an open question.
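The stream-handling recipe described in this abstract, a sliding window over recent days plus undersampling of the majority no-click class, can be sketched as follows; the function names, record layout and toy stream are illustrative, not the thesis code:

```python
import random
from collections import deque

def windowed_training_set(daily_batches, window_days=5, seed=0):
    """Keep only the last `window_days` days of (features, clicked)
    records, then undersample non-clicks to match the click count."""
    rng = random.Random(seed)
    window = deque(maxlen=window_days)   # old days fall off automatically
    for day in daily_batches:
        window.append(day)
    records = [r for day in window for r in day]
    clicks = [r for r in records if r[1] == 1]
    no_clicks = [r for r in records if r[1] == 0]
    sampled = rng.sample(no_clicks, min(len(clicks), len(no_clicks)))
    return clicks + sampled

# Toy stream: 7 days, 4 visits per day, one click per day.
days = [[(("ip%d" % d, h), 1 if h == 0 else 0) for h in range(4)]
        for d in range(7)]
train = windowed_training_set(days)
print(len(train))  # 10: 5 clicks + 5 sampled non-clicks
```

The window keeps the model current with recent traffic, while undersampling keeps the classifier from collapsing onto the dominant no-click class.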

    Measuring Integrity for Selection into Medical School : Development of a Situational Judgement Test

    Get PDF
    This thesis examines the use of a Situational Judgement Test (SJT) for selection into medical school based on the noncognitive attribute of integrity. Specifically, it describes five studies that investigate how several test characteristics influence a number of quality criteria of an SJT in measuring integrity among medical school applicants.

    Automation of Patient Trajectory Management: A deep-learning system for critical care outreach

    Full text link
    The application of machine learning models to big data has become ubiquitous; however, their successful translation into clinical practice is currently mostly limited to the field of imaging. Despite much interest and promise, many complex and interrelated barriers exist in clinical settings, which must be addressed systematically before wide-spread adoption of these technologies. There is limited evidence of comprehensive efforts to consider not only their raw performance metrics but also their effective deployment, particularly in terms of the ways in which they are perceived, used and accepted by clinicians. The critical care outreach team at St Vincent's Public Hospital want to automatically prioritise their workload by predicting in-patient deterioration risk, presented as a watch-list application. This work proposes that the proactive management of in-patients at risk of serious deterioration provides a comprehensive case study in which to understand clinician readiness to adopt deep-learning technology, given the significant known limitations of existing manual processes. Herein is described the development of a proof-of-concept application that uses as its input the subset of real-time clinical data available in the EMR. This dataset has the noteworthy challenge of not including any electronically recorded vital-signs data. Despite this, the system meets or exceeds similar benchmark models for predicting in-patient death and unplanned ICU admission, using a recurrent neural network architecture extended with a novel data-augmentation strategy. This augmentation method has been re-implemented on the public MIMIC-III dataset to confirm its generalisability. The method is notable for its applicability to discrete time-series data.
Furthermore, it is rooted in knowledge of how data entry is performed within the clinical record and is therefore not restricted to a single clinical domain, instead having the potential for wide-ranging impact. The system was presented to likely end-users to understand their readiness to adopt it into their workflow, using the Technology Adoption Model. In addition to confirming the feasibility of predicting risk from this limited dataset, this study investigates clinician readiness to adopt artificial intelligence in the critical care setting. This is done with a two-pronged strategy, addressing technical and clinically focused research questions in parallel.
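The abstract does not spell out the augmentation method itself, so the following is only a generic illustration of one common augmentation for discretely charted clinical time series: randomly hiding observations and carrying the last seen value forward, mimicking irregular data entry. All names and values are hypothetical.

```python
import random

def mask_and_carry_forward(series, drop_prob=0.3, seed=0):
    """Augment a discrete time series by randomly hiding observations
    and carrying the last seen value forward (a generic augmentation
    for irregularly charted clinical data; not the thesis' exact method)."""
    rng = random.Random(seed)
    out, last = [], None
    for value in series:
        if last is None or rng.random() >= drop_prob:
            last = value          # observation kept (first value always kept)
        out.append(last)          # hidden steps repeat the last observation
    return out

obs = [7.1, 7.4, 7.2, 7.9, 8.3, 8.0]   # e.g. a charted lab value over time
aug = mask_and_carry_forward(obs)
print(len(aug) == len(obs))  # True
```

Training on many such perturbed copies exposes a recurrent model to the gaps and repetitions it will see in real charting behaviour.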

    An interpretable machine learning approach to multimodal stress detection in a simulated office environment

    Get PDF
    Background and objective: Work-related stress affects a large part of today’s workforce and is known to have detrimental effects on physical and mental health. Continuous and unobtrusive stress detection may help prevent and reduce stress by providing personalised feedback and allowing for the development of just-in-time adaptive health interventions for stress management. Previous studies on stress detection in work environments have often struggled to adequately reflect real-world conditions in controlled laboratory experiments. To close this gap, in this paper, we present a machine learning methodology for stress detection based on multimodal data collected from unobtrusive sources in an experiment simulating a realistic group office environment (N=90). Methods: We derive mouse, keyboard and heart rate variability features to detect three levels of perceived stress, valence and arousal with support vector machines, random forests and gradient boosting models using 10-fold cross-validation. We interpret the contributions of features to the model predictions with SHapley Additive exPlanations (SHAP) value plots. Results: The gradient boosting models based on mouse and keyboard features obtained the highest average F1 scores of 0.625, 0.631 and 0.775 for the multiclass prediction of perceived stress, arousal and valence, respectively. Our results indicate that the combination of mouse and keyboard features may be better suited to detect stress in office environments than heart rate variability, despite physiological signal-based stress detection being more established in theory and research. The analysis of SHAP value plots shows that specific mouse movement and typing behaviours may characterise different levels of stress. 
Conclusions: Our study fills several methodological gaps in research on the automated detection of stress in office environments, such as approximating real-life conditions in a laboratory and combining physiological and behavioural data sources. Implications for field studies of personalised, interpretable ML-based systems for real-time stress detection in real office environments are also discussed.
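The F1 scores reported above are weighted averages: per-class F1 weighted by class support, which matters when the three stress levels are unevenly represented. A minimal illustrative sketch (not the study's code):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += n / total * f1      # weight by class frequency
    return score

print(round(weighted_f1([0, 0, 1], [0, 1, 1]), 3))  # 0.667
```

Unlike plain accuracy, this metric cannot be gamed by always predicting the most common stress level.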

    Improving Engagement Assessment by Model Individualization and Deep Learning

    Get PDF
    This dissertation studies methods that improve engagement assessment for pilots. The major work addresses two challenging problems involved in the assessment: individual variation among pilots and the lack of labeled data for training assessment models. Task engagement is usually assessed by analyzing physiological measurements collected from subjects who are performing a task. However, physiological measurements such as Electroencephalography (EEG) vary from subject to subject. An assessment model trained for one subject may not be applicable to other subjects. We proposed a dynamic classifier selection algorithm for model individualization and compared it to two other methods: baseline normalization and similarity-based model replacement. Experimental results showed that baseline normalization and dynamic classifier selection can significantly improve cross-subject engagement assessment. For complex tasks such as piloting an airplane, labeling engagement levels for pilots is challenging. Without enough labeled data, it is very difficult for traditional methods to train valid models for effective engagement assessment. This dissertation proposed to utilize deep learning models to address this challenge. Deep learning models are capable of learning valuable feature hierarchies by taking advantage of both labeled and unlabeled data. Our results showed that deep models are better tools for engagement assessment when label information is scarce. To further verify the power of deep learning techniques with scarce labeled data, we applied the deep learning algorithm to another small dataset, the ADNI dataset. The ADNI dataset is a public dataset containing MRI and PET scans of Alzheimer's Disease (AD) patients for AD diagnosis. We developed a robust deep learning system incorporating dropout and stability selection techniques to identify the different progression stages of AD patients.
The experimental results showed that deep learning is very effective in AD diagnosis. In addition, we studied several imbalanced-learning techniques that are useful when data is highly imbalanced, i.e., when majority classes have many more training samples than minority classes. Conventional machine learning techniques usually tend to classify all data samples into majority classes and to perform poorly on minority classes. Imbalanced-learning techniques can balance datasets before training and can improve learning performance.
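The dynamic classifier selection idea above, choosing per test sample the individual model that performs best on similar validation samples, can be sketched in a toy one-dimensional form; all names, models and data here are illustrative, not the dissertation's algorithm:

```python
def dynamic_select(classifiers, val_x, val_y, x, k=3):
    """Pick the classifier with the best accuracy on the k validation
    samples nearest to x, then use it to predict x's label."""
    nearest = sorted(range(len(val_x)), key=lambda i: abs(val_x[i] - x))[:k]
    best = max(classifiers,
               key=lambda clf: sum(clf(val_x[i]) == val_y[i] for i in nearest))
    return best(x)

# Two toy "subject models": one correct on low values, one on high values.
low_model = lambda v: 0
high_model = lambda v: 1
val_x = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
val_y = [0, 0, 0, 1, 1, 1]
print(dynamic_select([low_model, high_model], val_x, val_y, 0.15))  # 0
print(dynamic_select([low_model, high_model], val_x, val_y, 0.95))  # 1
```

This local-competence rule is what lets a pool of per-subject models adapt to a new subject without retraining: the selection happens at prediction time.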