11 research outputs found

    Master your Metrics with Calibration

    Machine learning models deployed in real-world applications are often evaluated with precision-based metrics such as F1-score or AUC-PR (Area Under the Curve of Precision Recall). Heavily dependent on the class prior, such metrics make it difficult to interpret the variation of a model's performance over different subpopulations/subperiods in a dataset. In this paper, we propose a way to calibrate the metrics so that they can be made invariant to the prior. We conduct a large number of experiments on balanced and imbalanced data to assess the behavior of calibrated metrics and show that they improve interpretability and provide better control over what is really measured. We describe specific real-world use cases where calibration is beneficial, such as model monitoring in production, reporting, or fairness evaluation.
    Comment: Presented at IDA202
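
    One common way to make precision independent of the class prior is to rewrite it in terms of the true and false positive rates at a fixed reference prior. The sketch below illustrates that idea; it is an assumed formulation for illustration, not necessarily the exact calibration proposed in the paper.

```python
import numpy as np

def calibrated_metrics(y_true, y_pred, pi0=0.5):
    """Precision, recall and F1 computed at a fixed reference prior pi0."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tpr = (y_pred & y_true).sum() / max(y_true.sum(), 1)       # recall (prior-free)
    fpr = (y_pred & ~y_true).sum() / max((~y_true).sum(), 1)   # false positive rate (prior-free)
    precision = pi0 * tpr / (pi0 * tpr + (1 - pi0) * fpr + 1e-12)
    f1 = 2 * precision * tpr / (precision + tpr + 1e-12)
    return precision, tpr, f1

# Two test sets with the same TPR/FPR but very different class balances yield
# identical calibrated metrics, unlike ordinary precision or F1.
```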

    Deep ROC Analysis and AUC as Balanced Average Accuracy to Improve Model Selection, Understanding and Interpretation

    Optimal performance is critical for decision-making tasks from medicine to autonomous driving, however common performance measures may be too general or too specific. For binary classifiers, diagnostic tests or prognosis at a timepoint, measures such as the area under the receiver operating characteristic curve, or the area under the precision recall curve, are too general because they include unrealistic decision thresholds. On the other hand, measures such as accuracy, sensitivity or the F1 score are measures at a single threshold that reflect a single probability or predicted risk, rather than a range of individuals or risks. We propose a method in between, deep ROC analysis, that examines groups of probabilities or predicted risks for more insightful analysis. We translate esoteric measures into familiar terms: AUC and the normalized concordant partial AUC are balanced average accuracy (a new finding); the normalized partial AUC is average sensitivity; and the normalized horizontal partial AUC is average specificity. Along with post-test measures, we provide a method that can improve model selection in some cases and provide interpretation and assurance for patients in each risk group. We demonstrate deep ROC analysis in two case studies and provide a toolkit in Python.
    Comment: 14 pages, 6 Figures, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), currently under review
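
    As an illustration of the group-wise reading of the ROC curve described above, the sketch below computes each group's normalized partial AUC, read as the average sensitivity within that FPR group. It is not the authors' toolkit; the equal-width FPR grouping and the synthetic scores are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def group_average_sensitivity(y_true, y_score,
                              groups=((0.0, 1/3), (1/3, 2/3), (2/3, 1.0))):
    """Normalized partial AUC per FPR group, i.e. mean TPR over that FPR range."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    out = {}
    for lo, hi in groups:
        grid = np.linspace(lo, hi, 200)
        tpr_on_grid = np.interp(grid, fpr, tpr)    # TPR as a function of FPR
        out[(lo, hi)] = float(tpr_on_grid.mean())  # normalized pAUC = average sensitivity
    return out

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
score = 0.6 * y + rng.normal(0, 0.5, 500)          # toy scores
print("overall AUC:", round(roc_auc_score(y, score), 3))
for (lo, hi), avg_sens in group_average_sensitivity(y, score).items():
    print(f"FPR group [{lo:.2f}, {hi:.2f}]: average sensitivity = {avg_sens:.3f}")
```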

    Cooperative co-evolutionary module identification with application to cancer disease module discovery

    Module identification or community detection in complex networks has become increasingly important in many scientific fields because it provides insight into the relationship and interaction between network function and topology. In recent years, module identification algorithms based on stochastic optimization algorithms such as evolutionary algorithms have been demonstrated to be superior to other algorithms on small- to medium-scale networks. However, the scalability and resolution limit (RL) problems of these module identification algorithms have not been fully addressed, which has impeded their application to real-world networks. This paper proposes a novel module identification algorithm called cooperative co-evolutionary module identification to address these two problems. The proposed algorithm employs a cooperative co-evolutionary framework to handle large-scale networks. We also incorporate a recursive partitioning scheme into the algorithm to effectively address the RL problem. The performance of our algorithm is evaluated on 12 benchmark complex networks. As a medical application, we apply our algorithm to identify disease modules that differentiate low- and high-grade glioma tumors to gain insights into the molecular mechanisms that underpin the progression of glioma. Experimental results show that the proposed algorithm has a very competitive performance compared with other state-of-the-art module identification algorithms.
    He, S and Jia, G and Zhu, Z and Tennant, DA and Huang, Q and Tang, K and Liu, J and Musolesi, M and Heath, JK and Yao, X
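
    The cooperative co-evolutionary machinery itself is beyond a short snippet, but the recursive partitioning idea used against the resolution limit can be sketched as below. This is a stand-in that uses networkx's greedy modularity optimizer rather than the authors' evolutionary algorithm, and the 0.3 modularity threshold is an arbitrary illustrative choice.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

def recursive_modules(G, min_size=6, min_gain=0.3, depth=0, max_depth=3):
    """Partition G, then re-partition any module whose internal structure is
    itself strongly modular -- the recursive scheme aimed at the resolution limit."""
    if G.number_of_nodes() <= min_size or depth >= max_depth:
        return [set(G.nodes())]
    parts = [set(c) for c in greedy_modularity_communities(G)]
    if len(parts) <= 1:
        return [set(G.nodes())]
    modules = []
    for part in parts:
        if len(part) <= min_size:
            modules.append(part)
            continue
        sub = G.subgraph(part)
        sub_parts = [set(c) for c in greedy_modularity_communities(sub)]
        if len(sub_parts) > 1 and modularity(sub, sub_parts) > min_gain:
            modules.extend(recursive_modules(sub, min_size, min_gain, depth + 1, max_depth))
        else:
            modules.append(part)
    return modules

G = nx.karate_club_graph()
modules = recursive_modules(G)
print(len(modules), "modules; overall modularity =", round(modularity(G, modules), 3))
```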

    Review of feature selection techniques in Parkinson's disease using OCT-imaging data

    Several spectral-domain optical coherence tomography (OCT) studies reported thinning in the macular region of the retina in Parkinson’s disease. Yet, the implication of retinal thinning for visual disability is still unclear. Macular scans acquired from patients with Parkinson’s disease (n = 100) and a control group (n = 248) were used to train several supervised classification models. The goal was to determine the most relevant retinal layers and regions for diagnosis, for which univariate and multivariate filter and wrapper feature selection methods were used. In addition, we evaluated the models’ ability to classify the patient group to assess the applicability of OCT measurements as a biomarker of the disease.
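
    For a flavour of the filter and wrapper approaches mentioned above, the sketch below compares a univariate ANOVA filter with a recursive-feature-elimination wrapper. The data are synthetic stand-ins for the OCT thickness features; only the sample size and class imbalance echo the n = 100 / n = 248 split.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in: 348 subjects, 60 layer/region features, 8 truly informative.
X, y = make_classification(n_samples=348, n_features=60, n_informative=8,
                           weights=[0.71, 0.29], random_state=0)

filter_pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),                                   # univariate filter
    ("clf", LogisticRegression(max_iter=2000)),
])
wrapper_pipe = Pipeline([
    ("select", RFE(LogisticRegression(max_iter=2000), n_features_to_select=10)),  # wrapper
    ("clf", LogisticRegression(max_iter=2000)),
])

for name, pipe in [("filter (ANOVA)", filter_pipe), ("wrapper (RFE)", wrapper_pipe)]:
    auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:15s} mean CV AUC = {auc:.3f}")
```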

    A study on the prediction of flight delays of a private aviation airline

    Delay is a crucial performance indicator of any transportation system, and flight delays have financial and economic consequences for passengers and airlines. Hence, recognizing them through prediction may improve strategic and operational decisions. The goal is to use machine learning techniques to predict a long-standing aviation challenge: flight delay above 15 minutes on departure, using data from a private airline. Business and data understanding of this particular segment of aviation are reviewed against the literature, and data preparation, modelling and evaluation are carried out to produce a model that can support decision-making in a private aviation environment. The results show which algorithms performed best and which variables contribute most to the model and, consequently, to departure delay.
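
    To make the modelling step concrete, here is a minimal sketch of framing "delay above 15 minutes on departure" as a binary target and inspecting which features a tree ensemble relies on most. The data and feature names are synthetic and hypothetical, not the airline's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "scheduled_hour": rng.integers(5, 23, n),        # hypothetical features
    "weekday": rng.integers(0, 7, n),
    "turnaround_min": rng.normal(60, 20, n),
    "crew_duty_hours": rng.normal(6, 2, n),
})
# Synthetic departure delay driven mostly by turnaround time and hour of day.
delay_min = 5 + 0.3 * (df["turnaround_min"] - 60) + 0.8 * df["scheduled_hour"] + rng.normal(0, 10, n)
df["delayed"] = (delay_min > 15).astype(int)          # binary target: delay > 15 min

X_tr, X_te, y_tr, y_te = train_test_split(df.drop(columns="delayed"), df["delayed"],
                                          stratify=df["delayed"], random_state=0)
model = GradientBoostingClassifier().fit(X_tr, y_tr)
print("test AUC:", round(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]), 3))
print(pd.Series(model.feature_importances_, index=X_tr.columns).sort_values(ascending=False))
```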

    Applications Of Wearable Sensors In Delivering Biologically Relevant Signals

    With continued advancements in wearable technologies, the applications for their use are growing. Wearable sensors can be found in smart watches, fitness trackers, and even our cellphones. The common applications in everyday life are usually step counting, activity tracking, and heart rate monitoring. However, researchers have developed ways to use these same sensors for clinically relevant diagnostic measures, as well as improved athletic training and performance. Two areas of interest for the use of wearable sensors are mental health diagnostics in children and heart rate monitoring during intense physical activity from new locations, which are discussed further in this thesis. About 20% of children will experience an anxiety or depressive disorder. These disorders, if left untreated, can lead to comorbidity, substance abuse, and even suicide. Current methods for diagnosis are time consuming and only offered to those most at risk (i.e., reported or referred by a teacher, doctor, or parent). For the children that do get referred to a specialist, the process is often inaccurate. Researchers began using mood induction tasks to observe behavioral responses to specific stimuli in hopes of improving the accuracy of diagnostics. However, these methods involve long hours of training and watching videos of the activities. Recently, a few studies have focused on using wearable sensors during mood induction tasks in hopes of picking up on relevant movements to distinguish those with and without an internalizing disorder. The first study presented in this thesis focuses on using wearable inertial measurement units during the ‘Bubbles’ mood induction task. A decision tree was developed to identify children with internalizing disorders; the accuracy of this model was 71% based on leave-one-subject-out cross validation. The second study focuses on estimating heart rate using wearable photoplethysmography sensors at multiple body locations. Heart rate is an important vital sign used across a variety of contexts. For example, athletes use heart rate to determine whether they are hitting their desired heart rate zones during training, and doctors can use heart rate as an early indicator of disease. With the advancements made in wearables, photoplethysmography can now be used to collect signals from devices anywhere on the body. However, estimating heart rate accurately during periods of intense physical activity remains a challenge due to signal corruption caused by motion artifacts. This study focuses on evaluating algorithms for accurately estimating heart rate from photoplethysmograms and determining the optimal body location for wear. A phase vocoder and Wiener filtering approach was used to estimate heart rate from the forearm, shank, and sacrum. The algorithm estimated heart rate to within 6.2, 6.9, and 6.7 beats per minute average absolute error for the forearm, shank, and sacrum, respectively, across a wide variety of physical activities selected to induce varying levels of motion artifact.
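
    A minimal sketch of the evaluation scheme used in the first study: leave-one-subject-out cross-validation of a shallow decision tree, so no child appears in both training and test folds. The features here are synthetic stand-ins for the IMU-derived measurements.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_subjects, samples_per_subject = 30, 10
subjects = np.repeat(np.arange(n_subjects), samples_per_subject)         # subject id per sample
labels = np.repeat(rng.integers(0, 2, n_subjects), samples_per_subject)  # one diagnosis per subject
X = rng.normal(0, 1, (subjects.size, 12)) + 0.4 * labels[:, None]        # weakly informative features

scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                         X, labels, groups=subjects, cv=LeaveOneGroupOut())
print(f"LOSO accuracy: {scores.mean():.2f} over {len(scores)} held-out subjects")
```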

    Evaluating Classifiers During Dataset Shift

    Deployment of a classifier into a machine learning application likely begins with training different types of algorithms on a subset of the available historical data and then evaluating them on datasets that are drawn from identical distributions. The goal of this evaluation process is to select the classifier that is believed to be most robust in maintaining good future performance, and then deploy that classifier to end-users who use it to make predictions on new data. Oftentimes, predictive models are deployed in conditions that differ from those used in training, meaning that dataset shift has occurred. In these situations, there are no guarantees that predictions made by the predictive model in deployment will be as reliable and accurate as they were during training. This study demonstrated a technique that others can use when selecting a classifier for deployment, and it is the first comparative study to evaluate machine learning classifier performance on synthetic datasets with different levels of prior-probability, covariate, and concept dataset shift. The results show the impact of dataset shift on the performance of different classifiers for two real-world datasets related to teacher retention in Wisconsin and detecting fraud in testing, and they demonstrate a framework that others can use when selecting a classifier for deployment. By using the methods from this study as a proactive approach to evaluating classifiers on synthetic dataset shift, different classifiers would have been considered for deployment of both predictive models, compared to only using evaluation datasets drawn from identical distributions. The results from both real-world datasets also showed that no classifier dealt well with prior-probability shift and that classifiers were affected less by covariate and concept shift than expected. Two supplemental demonstrations of the methodology showed that it can be extended for additional purposes when evaluating classifiers on dataset shift. Results from analyzing the effects of hyperparameter choices on classifier performance under dataset shift, as well as the effects of actual dataset shift on classifier performance, showed that different hyperparameter configurations affect not only a classifier’s overall performance but also how robust that classifier might be to dataset shift.
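
    A minimal sketch of the kind of synthetic-shift evaluation described above: a single classifier trained on unshifted data and re-scored under simulated prior-probability, covariate, and concept shift. The toy Gaussian data stand in for the Wisconsin teacher-retention and test-fraud datasets, and the shift mechanisms are illustrative assumptions, not the study's generators.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)

def sample(n, pos_rate=0.5, x_shift=0.0, flip_concept=False):
    """Draw a labelled sample; the keyword arguments simulate the three shift types."""
    y = (rng.random(n) < pos_rate).astype(int)
    X = rng.normal(0.0, 1.0, (n, 5)) + 0.8 * y[:, None] + x_shift
    if flip_concept:
        y = 1 - y                        # the X -> y relationship is reversed
    return X, y

X_tr, y_tr = sample(5000)                # training data: no shift
clf = LogisticRegression().fit(X_tr, y_tr)

scenarios = {
    "no shift":        sample(2000),
    "prior shift":     sample(2000, pos_rate=0.05),
    "covariate shift": sample(2000, x_shift=1.0),
    "concept shift":   sample(2000, flip_concept=True),
}
for name, (X_te, y_te) in scenarios.items():
    pred = clf.predict(X_te)
    print(f"{name:15s} accuracy={accuracy_score(y_te, pred):.2f} "
          f"F1={f1_score(y_te, pred):.2f}")
```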