Master your Metrics with Calibration
Machine learning models deployed in real-world applications are often
evaluated with precision-based metrics such as the F1-score or AUC-PR (area
under the precision-recall curve). Heavily dependent on the class prior, such
metrics make it difficult to interpret the variation of a model's performance
over different subpopulations/subperiods in a dataset. In this paper, we
propose a way to calibrate the metrics so that they can be made invariant to
the prior. We conduct a large number of experiments on balanced and imbalanced
data to assess the behavior of calibrated metrics and show that they improve
interpretability and provide better control over what is really measured. We
describe specific real-world use cases where calibration is beneficial, such
as model monitoring in production, reporting, or fairness evaluation.
Comment: Presented at IDA202
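The abstract does not spell out the calibration itself, but one standard way to make precision invariant to the class prior is to rewrite it in terms of TPR and FPR and substitute a fixed reference prior. A minimal sketch under that assumption (the function name and the reference prior `pi0` are illustrative, not from the paper):

```python
import numpy as np

def calibrated_precision(y_true, y_pred, pi0=0.5):
    """Precision re-expressed at a reference class prior pi0.

    Precision depends on the test-set prior pi = P(y=1):
        precision = TPR*pi / (TPR*pi + FPR*(1-pi)),
    so substituting a fixed pi0 yields a prior-invariant variant that is
    comparable across subpopulations or subperiods.
    """
    y_true, y_pred = np.asarray(y_true, bool), np.asarray(y_pred, bool)
    tpr = (y_pred & y_true).sum() / max(y_true.sum(), 1)        # recall
    fpr = (y_pred & ~y_true).sum() / max((~y_true).sum(), 1)
    denom = tpr * pi0 + fpr * (1.0 - pi0)
    return tpr * pi0 / denom if denom > 0 else 0.0

# Same TPR/FPR behavior on two sets with different priors (0.5 vs 0.25):
# raw precision differs (0.5 vs 0.25), the calibrated value does not.
print(calibrated_precision([1, 0, 1, 0], [1, 1, 0, 0]))                    # 0.5
print(calibrated_precision([1, 1, 0, 0, 0, 0, 0, 0],
                           [1, 0, 1, 1, 1, 0, 0, 0]))                      # 0.5
```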
Deep ROC Analysis and AUC as Balanced Average Accuracy to Improve Model Selection, Understanding and Interpretation
Optimal performance is critical for decision-making tasks from medicine to
autonomous driving; however, common performance measures may be too general or
too specific. For binary classifiers, diagnostic tests or prognosis at a
timepoint, measures such as the area under the receiver operating
characteristic curve, or the area under the precision recall curve, are too
general because they include unrealistic decision thresholds. On the other
hand, measures such as accuracy, sensitivity, or the F1 score apply at a
single threshold and reflect an individual probability or predicted risk
rather than a range of individuals or risks. We propose a method in
between, deep ROC analysis, that examines groups of probabilities or predicted
risks for more insightful analysis. We translate esoteric measures into
familiar terms: AUC and the normalized concordant partial AUC are balanced
average accuracy (a new finding); the normalized partial AUC is average
sensitivity; and the normalized horizontal partial AUC is average specificity.
Along with post-test measures, we provide a method that can improve model
selection in some cases and provide interpretation and assurance for patients
in each risk group. We demonstrate deep ROC analysis in two case studies and
provide a toolkit in Python.
Comment: 14 pages, 6 Figures, submitted to IEEE Transactions on Pattern
Analysis and Machine Intelligence (TPAMI), currently under review
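The identities above can be checked numerically. The sketch below (the function names and the interpolation grid are our own choices, not the paper's toolkit) computes an empirical ROC curve and the normalized partial AUC over an FPR interval, which the abstract identifies with average sensitivity over that range; the full range recovers the AUC:

```python
import numpy as np

def empirical_roc(y_true, scores):
    # Sweep thresholds from the highest score down; prepend the (0, 0) point.
    order = np.argsort(-np.asarray(scores, float))
    y = np.asarray(y_true)[order]
    tpr = np.concatenate(([0.0], np.cumsum(y) / y.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(1 - y) / (1 - y).sum()))
    return fpr, tpr

def avg_sensitivity(y_true, scores, fpr_lo=0.0, fpr_hi=1.0):
    """Normalized partial AUC over [fpr_lo, fpr_hi], i.e. the average
    sensitivity across that FPR range; fpr_lo=0, fpr_hi=1 gives the AUC."""
    fpr, tpr = empirical_roc(y_true, scores)
    grid = np.linspace(fpr_lo, fpr_hi, 201)
    sens = np.interp(grid, fpr, tpr)
    widths = np.diff(grid)
    # Trapezoidal integration, normalized by the width of the FPR range.
    return ((sens[1:] + sens[:-1]) / 2 * widths).sum() / (fpr_hi - fpr_lo)
```

An analogous average specificity over a TPR range would use the horizontal partial AUC instead.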
Cooperative co-evolutionary module identification with application to cancer disease module discovery
Module identification or community detection in complex networks has become increasingly important in many scientific fields because it provides insight into the relationship and interaction between network function and topology. In recent years, module identification algorithms based on stochastic optimization algorithms such as evolutionary algorithms have been demonstrated to be superior to other algorithms on small- to medium-scale networks. However, the scalability and resolution limit (RL) problems of these module identification algorithms have not been fully addressed, which has impeded their application to real-world networks. This paper proposes a novel module identification algorithm called cooperative co-evolutionary module identification to address these two problems. The proposed algorithm employs a cooperative co-evolutionary framework to handle large-scale networks. We also incorporate a recursive partitioning scheme into the algorithm to effectively address the RL problem. The performance of our algorithm is evaluated on 12 benchmark complex networks. As a medical application, we apply our algorithm to identify disease modules that differentiate low- and high-grade glioma tumors to gain insights into the molecular mechanisms that underpin the progression of glioma. Experimental results show that the proposed algorithm has a very competitive performance compared with other state-of-the-art module identification algorithms.
He, S; Jia, G; Zhu, Z; Tennant, DA; Huang, Q; Tang, K; Liu, J; Musolesi, M; Heath, JK; Yao, X
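The abstract does not state the fitness function, but module identification methods of this kind commonly optimize Newman-Girvan modularity or a variant of it, and the resolution limit mentioned above is a known property of modularity maximization. A minimal sketch of evaluating modularity for a candidate partition (an assumption for illustration, not necessarily the paper's exact objective):

```python
import numpy as np

def modularity(adj, labels):
    """Newman-Girvan modularity Q of a partition of an undirected graph:
    Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j)."""
    A = np.asarray(adj, float)
    k = A.sum(axis=1)                      # node degrees
    two_m = k.sum()                        # 2m: twice the edge count
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)

# Two disconnected triangles, partitioned into their natural modules.
tri = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
A = np.block([[tri, np.zeros((3, 3))], [np.zeros((3, 3)), tri]])
print(modularity(A, [0, 0, 0, 1, 1, 1]))  # ~0.5
```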
Review of feature selection techniques in Parkinson's disease using OCT-imaging data
Several spectral-domain optical coherence tomography (OCT) studies reported thinning
of the macular region of the retina in Parkinson’s disease. Yet, the implication of
retinal thinning for visual disability is still unclear.
Macular scans acquired from patients with Parkinson’s disease (n = 100) and a control
group (n = 248) were used to train several supervised classification models. The goal was
to determine the most relevant retinal layers and regions for diagnosis, for which
univariate and multivariate filter and wrapper feature selection methods were used.
In addition, we evaluated the ability to classify the patient group in order to
assess the applicability of OCT measurements as a biomarker of the disease.
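As a rough illustration of the two families of methods mentioned above (the scores, stand-in classifier, and toy data are all our own, not the study's): a univariate filter ranks each feature independently of the others, while a wrapper searches feature subsets guided by a classifier's score:

```python
import numpy as np

def t_filter(X, y, k):
    """Univariate filter: rank features by |two-sample t statistic|."""
    X, y = np.asarray(X, float), np.asarray(y)
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = np.abs(a.mean(axis=0) - b.mean(axis=0)) / np.where(se == 0, 1.0, se)
    return np.argsort(-t)[:k]                 # indices of the top-k features

def centroid_score(X, y):
    # Resubstitution accuracy of a nearest-centroid classifier (a simple
    # stand-in for the supervised models trained in the study).
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    pred = (((X - c1) ** 2).sum(axis=1) < ((X - c0) ** 2).sum(axis=1)).astype(int)
    return (pred == y).mean()

def greedy_wrapper(X, y, k, score=centroid_score):
    """Wrapper: greedy forward selection maximizing the classifier's score."""
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = max(remaining, key=lambda j: score(X[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data: feature 0 separates the groups, features 1-2 are noise.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 3))
X[:, 0] += 3.0 * y
print(t_filter(X, y, 1), greedy_wrapper(X, y, 1))
```

Both selectors recover the informative feature here; on real OCT data the two approaches can disagree, which is why studies typically compare several.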
A study on the prediction of flight delays of a private aviation airline
Delay is a crucial performance indicator of any transportation system, and flight delays
cause financial and economic consequences to passengers and airlines. Hence, recognizing
them through prediction may improve marketing decisions. The goal is to use machine learning
techniques to address an aviation challenge: predicting flight delays above 15 minutes on
departure for a private airline. Business and data understanding of this particular segment
of aviation are reviewed against the literature, and data preparation, modelling, and
evaluation are addressed to lead towards a model that may serve as decision-making support
in a private aviation environment. The results show which algorithms performed best and
which variables contribute the most to the model, and hence to delay on departure.
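The target described above can be sketched in a few lines (the times and column layout are invented for illustration): a flight is labelled positive when its departure delay exceeds 15 minutes, and the label's base rate is worth checking before modelling, since it sets the accuracy of a trivial majority-class baseline:

```python
import numpy as np

# Hypothetical departure times, in minutes since midnight.
scheduled = np.array([600, 615, 700, 930])
actual    = np.array([605, 650, 700, 950])

delay_min = actual - scheduled
delayed = (delay_min > 15).astype(int)   # binary target: delayed > 15 min?
print(delayed.tolist())                  # [0, 1, 0, 1]
print("base rate:", delayed.mean())      # 0.5 here; real delay data is usually
                                         # far more imbalanced
```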
Applications Of Wearable Sensors In Delivering Biologically Relevant Signals
With continued advancements in wearable technologies, the applications for their use are growing. Wearable sensors can be found in smart watches, fitness trackers, and even our cellphones. The common applications in everyday life are usually step counting, activity tracking, and heart rate monitoring. However, researchers have developed ways to use similar sensors for clinically relevant diagnostic measures, as well as improved athletic training and performance. Two areas of interest for the use of wearable sensors are mental health diagnostics in children and heart rate monitoring during intense physical activity from new body locations, both of which are discussed further in this thesis.
About 20% of children will experience an anxiety or depressive disorder. These disorders, if left untreated, can lead to comorbidity, substance abuse, and even suicide. Current methods for diagnosis are time consuming and only offered to those most at risk (i.e., reported or referred by a teacher, doctor, or parent). For the children who do get referred to a specialist, the process is often inaccurate. Researchers began using mood induction tasks to observe behavioral responses to specific stimuli in hopes of improving the accuracy of diagnostics. However, these methods involve long hours of training and watching videos of the activities. Recently, a few studies have focused on using wearable sensors during mood induction tasks in hopes of picking up on relevant movements that distinguish those with and without an internalizing disorder. The first study presented in this thesis focuses on using wearable inertial measurement units during the ‘Bubbles’ mood induction task. A decision tree was developed to identify children with internalizing disorders; the accuracy of this model was 71% based on leave-one-subject-out cross-validation.
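Leave-one-subject-out cross-validation, as used for the 71% figure above, holds out every sample from one child per fold so the model is never tested on a subject it was trained on. A generic sketch (the classifier interface and toy data are our own; the thesis used a decision tree):

```python
import numpy as np

def centroid_fit_predict(Xtr, ytr, Xte):
    # Stand-in classifier: nearest class centroid.
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    return (((Xte - c1) ** 2).sum(axis=1) < ((Xte - c0) ** 2).sum(axis=1)).astype(int)

def loso_accuracy(X, y, subjects, fit_predict=centroid_fit_predict):
    """Leave-one-subject-out CV: each fold holds out ALL samples from one
    subject, so the estimate reflects generalization to unseen children."""
    X, y, subjects = map(np.asarray, (X, y, subjects))
    correct = total = 0
    for s in np.unique(subjects):
        test = subjects == s
        preds = fit_predict(X[~test], y[~test], X[test])
        correct += (preds == y[test]).sum()
        total += test.sum()
    return correct / total

# Toy example: a 1-D feature that separates the classes for every subject.
X = np.array([[0.0], [10.0], [1.0], [11.0], [0.5], [10.5]])
y = np.array([0, 1, 0, 1, 0, 1])
subjects = np.array([0, 0, 1, 1, 2, 2])
print(loso_accuracy(X, y, subjects))  # 1.0
```

Splitting by subject rather than by sample matters because repeated measurements from the same child are correlated; a random split would leak subject identity into the test folds and inflate accuracy.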
The second study focuses on estimating heart rate using wearable photoplethysmography sensors at multiple body locations. Heart rate is an important vital sign used across a variety of contexts. For example, athletes use heart rate to determine whether they are hitting their desired heart rate zones during training, and doctors can use heart rate as an early indicator of disease. With the advancements made in wearables, photoplethysmography can now be used to collect signals from devices anywhere on the body. However, estimating heart rate accurately during periods of intense physical activity remains a challenge due to signal corruption caused by motion artifacts. This study focuses on evaluating algorithms for accurately estimating heart rate from photoplethysmograms and determining the optimal body location for wear. A phase vocoder and Wiener filtering approach was used to estimate heart rate from the forearm, shank, and sacrum. The algorithm estimated heart rate to within 6.2, 6.9, and 6.7 beats per minute average absolute error for the forearm, shank, and sacrum, respectively, across a wide variety of physical activities selected to induce varying levels of motion artifact.
Evaluating Classifiers During Dataset Shift
Deployment of a classifier into a machine learning application likely begins with training different types of algorithms on a subset of the available historical data and then evaluating them on datasets drawn from identical distributions. The goal of this evaluation process is to select the classifier that is believed to be most robust in maintaining good future performance, and then deploy that classifier to end-users who use it to make predictions on new data. Oftentimes, however, predictive models are deployed in conditions that differ from those used in training, meaning that dataset shift has occurred. In these situations, there are no guarantees that predictions made by the model in deployment will be as reliable and accurate as they were during training. This study demonstrates a technique that others can use when selecting a classifier for deployment, and presents the first comparative study evaluating machine learning classifier performance on synthetic datasets with different levels of prior-probability, covariate, and concept dataset shift.
The results from this study show the impact of dataset shift on the performance of different classifiers for two real-world datasets, one related to teacher retention in Wisconsin and one to detecting fraud in testing, and demonstrate a framework that others can use when selecting a classifier for deployment. Had the methods from this study been used as a proactive approach to evaluating classifiers under synthetic dataset shift, different classifiers would have been considered for deployment of both predictive models, compared to using only evaluation datasets drawn from identical distributions. The results from both real-world datasets also showed that no classifier dealt well with prior-probability shift, and that classifiers were affected less by covariate and concept shift than was expected. Two supplemental demonstrations of the methodology showed that it can be extended for additional purposes of evaluating classifiers on dataset shift. Results from analyzing the effects of hyperparameter choices on classifier performance under dataset shift, as well as the effects of actual dataset shift on classifier performance, showed that different hyperparameter configurations have an impact on the performance of a classifier in general, but can also have an impact on how robust that classifier might be to dataset shift.
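The kind of synthetic-shift evaluation described above can be sketched with toy Gaussian data (the generator, stand-in classifier, and shift magnitudes are invented for illustration, not the study's setup): train once on one distribution, then score the frozen model under prior-probability and covariate shift:

```python
import numpy as np

rng = np.random.default_rng(7)

def make_data(n, prior, shift=0.0):
    # Two 1-D Gaussian classes: `prior` sets P(y=1) (prior-probability shift),
    # `shift` translates the features (a crude covariate shift).
    y = (rng.random(n) < prior).astype(int)
    X = rng.normal(loc=2.0 * y + shift, scale=1.0, size=n).reshape(-1, 1)
    return X, y

def fit_threshold(X, y):
    # Midpoint-between-class-means rule, standing in for any trained classifier.
    t = (X[y == 0].mean() + X[y == 1].mean()) / 2
    return lambda X: (X[:, 0] > t).astype(int)

Xtr, ytr = make_data(2000, prior=0.5)
clf = fit_threshold(Xtr, ytr)          # trained once, then frozen

results = {}
for name, kw in [("iid", dict(prior=0.5)),
                 ("prior_shift", dict(prior=0.1)),
                 ("covariate_shift", dict(prior=0.5, shift=1.0))]:
    Xte, yte = make_data(2000, **kw)
    pred = clf(Xte)
    results[name] = (float((pred == yte).mean()), float(yte[pred == 1].mean()))
    print(name, "accuracy=%.2f precision=%.2f" % results[name])
```

With these toy settings, accuracy barely moves under prior-probability shift while precision collapses, and covariate shift degrades accuracy directly, which is why evaluating several shift types (and several metrics) before deployment is informative.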