
    FixOut: an ensemble approach to fairer models

    In this paper, we address the question of process and model fairness. We propose FixOut, a human-centered and model-agnostic framework that uses any explanation method based on feature importance to assess a model's reliance on sensitive features. Given a pre-trained classifier, FixOut first checks whether it relies on user-defined sensitive features. If it does, FixOut employs feature dropout to produce a pool of simplified classifiers, which are then aggregated into an ensemble classifier. We present empirical results using different models on several real-world datasets, showing a consistent improvement in widely used fairness metrics and a decreased reliance on sensitive features, without compromising the classifier's accuracy.
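
    To make the workflow concrete, here is a minimal sketch of a FixOut-style loop, not the authors' implementation: it uses a toy dataset, a coefficient-magnitude proxy in place of a real explanation method, and hypothetical sensitive-feature indices, then averages the predicted probabilities of the feature-dropout variants.

```python
# Sketch of a FixOut-style fairness loop (illustrative, not the paper's code).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data; columns 0 and 2 play the role of user-defined sensitive features.
X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
sensitive = [0, 2]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Stand-in for an explanation method: rank features by |coefficient| and
# check whether any sensitive feature lands among the top three.
importance = np.abs(base.coef_).ravel()
top_features = set(np.argsort(importance)[-3:].tolist())

if top_features & set(sensitive):
    # Feature dropout: one simplified classifier per removed sensitive feature.
    pool = []
    for s in sensitive:
        keep = [j for j in range(X.shape[1]) if j != s]
        pool.append((keep, LogisticRegression(max_iter=1000).fit(X_tr[:, keep], y_tr)))
    # Aggregate the pool by averaging predicted probabilities.
    proba = np.mean([clf.predict_proba(X_te[:, keep]) for keep, clf in pool], axis=0)
    print("ensemble accuracy:", (proba.argmax(axis=1) == y_te).mean())
```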

    Explainable machine learning for project management control

    Project control is a crucial phase within project management aimed at ensuring —in an integrated manner— that the project objectives are met according to plan. Earned Value Management —along with its various refinements— is the most popular and widespread method for top-down project control. For project control under uncertainty, Monte Carlo simulation and statistical/machine learning models extend the earned value framework by allowing the analysis of deviations, expected times and costs during project progress. Recent advances in explainable machine learning, in particular attribution methods based on Shapley values, can be used to link project control to activity properties, facilitating the interpretation of interrelations between activity characteristics and control objectives. This work proposes a new methodology that adds an explainability layer based on SHAP —Shapley Additive exPlanations— to different machine learning models fitted to Monte Carlo simulations of the project network at tracking control points. Specifically, our method allows for both prospective and retrospective analyses, which have different utilities: forward analysis helps to identify key relationships between the different tasks and the desired outcomes, thus being useful for making execution/replanning decisions; and backward analysis serves to identify the causes of the project status during project progress. Furthermore, this method is general, model-agnostic and provides quantifiable and easily interpretable information, hence constituting a valuable tool for project control in uncertain environments.
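
    As a rough illustration of the approach (with a toy three-activity network standing in for a real project schedule, and SHAP's TreeExplainer as one possible attribution method), the sketch below fits a model to Monte Carlo samples of activity durations and attributes the predicted project duration back to individual activities.

```python
# Illustrative only: SHAP attribution over Monte Carlo project simulations.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
# Monte Carlo samples of three activity durations (A, B, C).
durations = rng.lognormal(mean=[2.0, 2.2, 1.5], sigma=0.3, size=(n, 3))
# Toy network: A and B run in parallel, then C follows.
project_duration = np.maximum(durations[:, 0], durations[:, 1]) + durations[:, 2]

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(durations, project_duration)

# SHAP attributes each predicted project duration to the activity durations;
# mean |SHAP| gives a forward-looking ranking of activity criticality.
shap_values = shap.TreeExplainer(model).shap_values(durations[:200])
print("mean |SHAP| per activity:", np.abs(shap_values).mean(axis=0))
```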

    Hybrid Data-driven Framework for Shale Gas Production Performance Analysis via Game Theory, Machine Learning and Optimization Approaches

    A comprehensive and precise analysis of shale gas production performance is crucial for evaluating resource potential, designing field development plans, and making investment decisions. However, quantitative analysis can be challenging because production performance is dominated by complex interactions among a series of geological and engineering factors. In this study, we propose a hybrid data-driven procedure for analyzing shale gas production performance, which consists of a complete workflow for dominant factor analysis, production forecasting, and development optimization. More specifically, game theory and machine learning models are coupled to determine the dominant geological and engineering factors. Shapley values, which carry a definite physical meaning, are employed to quantitatively measure the effects of individual factors. A multi-model-fused stacked model is trained for production forecasting, on the basis of which derivative-free optimization algorithms are introduced to optimize the development plan. The complete workflow is validated with actual production data collected from the Fuling shale gas field, Sichuan Basin, China. The validation results show that the proposed procedure can draw rigorous conclusions with quantified evidence and thereby provide specific and reliable suggestions for development plan optimization. Compared with traditional, experience-based approaches, the hybrid data-driven procedure offers advantages in terms of both efficiency and accuracy.
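
    The following sketch shows the shape of such a pipeline under stated assumptions: synthetic geological and engineering factors, a scikit-learn stacked model, mean |SHAP| values for dominant-factor analysis, and scipy's differential evolution as a stand-in derivative-free optimiser. None of the data, feature layout, or bounds come from the paper.

```python
# Illustrative pipeline: stacking + SHAP + derivative-free optimisation.
import numpy as np
import shap
from scipy.optimize import differential_evolution
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 5))      # cols 0-2: geological; cols 3-4: engineering
y = 2 * X[:, 0] + X[:, 3] ** 2 + rng.normal(0, 0.05, 500)  # toy production

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(random_state=1)),
                ("gb", GradientBoostingRegressor(random_state=1))],
    final_estimator=Ridge(),
).fit(X, y)

# Dominant-factor analysis: mean |SHAP| per factor via a model-agnostic explainer.
explanation = shap.Explainer(stack.predict, X[:100])(X[:100])
print("mean |SHAP| per factor:", np.abs(explanation.values).mean(axis=0))

# Development optimisation: tune the engineering factors for one well's fixed
# geological setting with a derivative-free algorithm.
geological = X[0, :3]
def neg_production(engineering):
    return -stack.predict(np.r_[geological, engineering].reshape(1, -1))[0]

result = differential_evolution(neg_production, bounds=[(0, 1), (0, 1)])
print("optimised engineering factors:", result.x)
```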

    Security Aspects of Internet of Things aided Smart Grids: a Bibliometric Survey

    The integration of sensors and communication technology in power systems, known as the smart grid, is an emerging topic in science and technology. One of the critical issues in the smart grid is its increased vulnerability to cyber threats. As such, various types of threats and defense mechanisms have been proposed in the literature. This paper offers a bibliometric survey of research papers focused on the security aspects of Internet of Things (IoT) aided smart grids. To the best of the authors' knowledge, this is the first bibliometric survey in this specific field. A bibliometric analysis of all journal articles is performed, and the findings are sorted by date, authorship, and key concepts. Furthermore, this paper summarizes the types of cyber threats facing the smart grid, the various security mechanisms proposed in the literature, and the research gaps in the field of smart grid security.

    Assessing eligibility for lung cancer screening using parsimonious ensemble machine learning models: A development and validation study

    BACKGROUND: Risk-based screening for lung cancer is currently being considered in several countries; however, the optimal approach to determining eligibility remains unclear. Ensemble machine learning could support the development of highly parsimonious prediction models that maintain the performance of more complex models while maximising simplicity and generalisability, supporting the widespread adoption of personalised screening. In this work, we aimed to develop and validate ensemble machine learning models to determine eligibility for risk-based lung cancer screening. METHODS AND FINDINGS: For model development, we used data from 216,714 ever-smokers recruited between 2006 and 2010 to the UK Biobank prospective cohort and 26,616 high-risk ever-smokers recruited between 2002 and 2004 to the control arm of the US National Lung Screening Trial (NLST), a randomised controlled trial. The NLST randomised high-risk smokers from 33 US centres with at least a 30 pack-year smoking history and fewer than 15 quit-years to annual CT or chest radiography screening for lung cancer. We externally validated our models among 49,593 participants in the chest radiography arm and all 80,659 ever-smoking participants of the US Prostate, Lung, Colorectal and Ovarian (PLCO) Screening Trial. The PLCO trial, which recruited from 1993 to 2001, analysed the impact of chest radiography versus no chest radiography for lung cancer screening. We primarily validated in the PLCO chest radiography arm so that we could benchmark against comparator models developed within the PLCO control arm. Models were developed to predict the risk of two outcomes within 5 years from baseline: diagnosis of lung cancer and death from lung cancer. We assessed model discrimination (area under the receiver operating characteristic curve, AUC), calibration (calibration curves and expected/observed ratio), overall performance (Brier scores), and net benefit with decision curve analysis. Models predicting lung cancer death (UCL-D) and incidence (UCL-I) using three variables (age, smoking duration, and pack-years) achieved or exceeded parity in discrimination, overall performance, and net benefit with comparators currently in use, despite requiring only one-quarter of the predictors. In external validation in the PLCO trial, UCL-D had an AUC of 0.803 (95% CI: 0.783, 0.824) and was well calibrated with an expected/observed (E/O) ratio of 1.05 (95% CI: 0.95, 1.19); UCL-I had an AUC of 0.787 (95% CI: 0.771, 0.802) and an E/O ratio of 1.00 (95% CI: 0.92, 1.07). At 5-year risk thresholds of 0.68% and 1.17%, the sensitivity of UCL-D was 85.5% and that of UCL-I was 83.9%, respectively 7.9% and 6.2% higher than the USPSTF-2021 criteria at the same specificity. The main limitation of this study is that the models have not been validated outside of UK and US cohorts. CONCLUSIONS: We present parsimonious ensemble machine learning models to predict the risk of lung cancer in ever-smokers, demonstrating a novel approach that could simplify the implementation of risk-based lung cancer screening in multiple settings.
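
    For readers unfamiliar with the reported metrics, this small sketch computes AUC, the expected/observed (E/O) calibration ratio, the Brier score, and sensitivity at a fixed risk threshold on synthetic predictions; the event rate, risk model, and threshold are illustrative only, not the study's data.

```python
# Synthetic illustration of the reported validation metrics.
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(42)
y_true = rng.binomial(1, 0.02, 50_000)     # 5-year outcome (illustrative rate)
# Hypothetical predicted 5-year risks, informative but imperfect.
risk = np.clip(0.015 + 0.05 * y_true + rng.normal(0, 0.01, 50_000), 0, 1)

print("AUC:", roc_auc_score(y_true, risk))             # discrimination
print("E/O ratio:", risk.sum() / y_true.sum())         # calibration-in-the-large
print("Brier score:", brier_score_loss(y_true, risk))  # overall performance

threshold = 0.0068                          # e.g. a 0.68% 5-year risk cutoff
screened = risk >= threshold
sensitivity = (screened & (y_true == 1)).sum() / (y_true == 1).sum()
print("sensitivity at threshold:", sensitivity)
```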

    Informed classification of sweeteners/bitterants compounds via explainable machine learning

    Perception of taste is an emergent phenomenon arising from complex molecular interactions between chemical compounds and specific taste receptors. Among all taste perceptions, the dichotomy of sweet and bitter tastes has been the subject of several machine learning studies for classification purposes. While previous studies have provided accurate sweetener/bitterant classifiers, there is ample scope to enhance these models by enriching the understanding of the molecular basis of bitter-sweet tastes. Towards these goals, our study focuses on the development and testing of several machine learning strategies coupled with SHapley Additive exPlanations (SHAP) for rational sweetness/bitterness classification. This enables the identification of the chemical descriptors of interest, supporting a more informed approach toward the rational design and screening of sweeteners/bitterants. To support future research in this field, we make all datasets and machine learning models publicly available and present an easy-to-use code for bitter-sweet taste prediction.
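
    A brief sketch of the informed-classification step, assuming a precomputed molecular-descriptor matrix; the descriptor names and the toy labelling rule are hypothetical placeholders that merely keep the example self-contained.

```python
# Illustrative SHAP-informed bitter/sweet classification on toy descriptors.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
names = ["MolWt", "LogP", "TPSA", "NumHDonors"]   # hypothetical descriptors
X = rng.normal(size=(600, len(names)))
y = (X[:, 1] - 0.5 * X[:, 2] > 0).astype(int)     # toy rule: 1 = sweet, 0 = bitter

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)
clf = RandomForestClassifier(random_state=7).fit(X_tr, y_tr)

# SHAP links each classification to the descriptors that drove it, pointing to
# candidates for the rational design and screening of sweeteners/bitterants.
sv = np.asarray(shap.TreeExplainer(clf).shap_values(X_te))
if sv.ndim == 3:                                  # per-class values; keep "sweet"
    sv = sv[1] if sv.shape[0] == 2 else sv[..., 1]
for name, score in sorted(zip(names, np.abs(sv).mean(axis=0)), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```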

    Artificial Intelligence and Machine Learning Approaches to Energy Demand-Side Response: A Systematic Review

    Recent years have seen an increasing interest in Demand Response (DR) as a means to provide flexibility, and hence improve the reliability of energy systems, in a cost-effective way. Yet the high complexity of the tasks associated with DR, combined with their use of large-scale data and the frequent need for near real-time decisions, means that Artificial Intelligence (AI) and Machine Learning (ML) — a branch of AI — have recently emerged as key technologies for enabling demand-side response. AI methods can be used to tackle various challenges, ranging from selecting the optimal set of consumers to respond, through learning their attributes and preferences, dynamic pricing, and the scheduling and control of devices, to learning how to incentivise participants in DR schemes and how to reward them in a fair and economically efficient way. This work provides an overview of AI methods utilised for DR applications, based on a systematic review of over 160 papers, 40 companies and commercial initiatives, and 21 large-scale projects. The papers are classified with regard to both the AI/ML algorithm(s) used and the application area in energy DR. Next, commercial initiatives (including both start-ups and established companies) and large-scale innovation projects where AI methods have been used for energy DR are presented. The paper concludes with a discussion of the advantages and potential limitations of the reviewed AI techniques for different DR tasks, and outlines directions for future research in this fast-growing area.

    Screening the risk of obstructive sleep apnea by utilizing supervised learning techniques based on anthropometric features and snoring events

    OBJECTIVES: Obstructive sleep apnea (OSA) is typically diagnosed by polysomnography (PSG). However, PSG is time-consuming and has some clinical limitations. This study thus aimed to establish machine learning models to screen for the risk of having moderate-to-severe or severe OSA based on easily acquired features. METHODS: We collected PSG data on 3529 patients from Taiwan and further derived the number of snoring events. Their baseline characteristics and anthropometric measures were obtained, and correlations among the collected variables were investigated. Next, six common supervised machine learning techniques were utilized: random forest (RF), extreme gradient boosting (XGBoost), k-nearest neighbor (kNN), support vector machine (SVM), logistic regression (LR), and naïve Bayes (NB). First, the data were separated into a training and validation dataset (80%) and an independent test dataset (20%). The approach with the highest accuracy in the training and validation phase was employed to classify the test dataset. Next, feature importance was investigated by calculating the Shapley value of every factor, which represents each factor's impact on OSA risk screening. RESULTS: The RF produced the highest accuracy (>70%) in the training and validation phase for screening both OSA severities. Hence, we employed the RF to classify the test dataset; the results showed 79.32% accuracy for moderate-to-severe OSA and 74.37% accuracy for severe OSA. The number of snoring events and the visceral fat level were the most and second-most important features for screening OSA risk. CONCLUSIONS: The established models can be considered for screening for the risk of having moderate-to-severe or severe OSA.
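
    A condensed sketch of the pipeline as described, with synthetic features standing in for the anthropometric and snoring variables: an 80/20 split, selection among several supervised models by validation accuracy, and Shapley-based feature importance on the test set.

```python
# Condensed sketch: model selection by validation accuracy + Shapley importance.
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-ins for anthropometric features and snoring-event counts.
X, y = make_classification(n_samples=3529, n_features=8, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "RF": RandomForestClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
}
# Keep the model with the highest cross-validated (training/validation) accuracy.
best_name, best = max(models.items(),
                      key=lambda kv: cross_val_score(kv[1], X_dev, y_dev).mean())
best.fit(X_dev, y_dev)
print(best_name, "test accuracy:", best.score(X_test, y_test))

# Shapley values quantify each feature's impact on the risk screening.
explanation = shap.Explainer(best.predict, X_dev[:100])(X_test[:100])
print("mean |SHAP| per feature:", np.abs(explanation.values).mean(axis=0))
```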