Search CORE

The Benefits of the Matthews Correlation Coefficient (MCC) Over the Diagnostic Odds Ratio (DOR) in Binary Classification Assessment

Author: Chicco D.
Jurman G.
Starovoitov V.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2021
Field of study

To assess the quality of a binary classification, researchers often take advantage of a four-entry contingency table called confusion matrix, containing true positives, true negatives, false positives, and false negatives. To recap the four values of a confusion matrix in a unique score, researchers and statisticians have developed several rates and metrics. In the past, several scientific studies already showed why the Matthews correlation coefficient (MCC) is more informative and trustworthy than confusion-entropy error, accuracy, F1 score, bookmaker informedness, markedness, and balanced accuracy. In this study, we compare the MCC with the diagnostic odds ratio (DOR), a statistical rate employed sometimes in biomedical sciences. After examining the properties of the MCC and of the DOR, we describe the relationships between them, by also taking advantage of an innovative geometrical plot called confusion tetrahedron, presented here for the first time. We then report some use cases where the MCC and the DOR produce discordant outcomes, and explain why the Matthews correlation coefficient is more informative and reliable between the two. Our results can have a strong impact in computer science and statistics, because they clearly explain why the trustworthiness of the information provided by the Matthews correlation coefficient is higher than the one generated by the diagnostic odds ratio

Belarusian State University of Informatics and Radioelectronics Repository

A machine learning pipeline for quantitative phenotype prediction from genotype data

Author: AJ Smola
C Ambroise
C De Mol
C Furlanello
CC Chang
Cesare Furlanello
E Liu
E Wooten
G Jurman
G Jurman
G Jurman
Giorgio Guzzetta
Giuseppe Jurman
H Zhou
J Marchini
JH Moore
K Baggerly
L Shi
LA Cupples
M Chierici
N Yosef
P Fardin
P Kraft
R Tibshirani
SH Lee
SI Lee
T Casci
The MicroArray Quality Control (MAQC) Consortium
W Valdar
Z Wei
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Abstract Background Quantitative phenotypes emerge everywhere in systems biology and biomedicine due to a direct interest for quantitative traits, or to high individual variability that makes hard or impossible to classify samples into distinct categories, often the case with complex common diseases. Machine learning approaches to genotype-phenotype mapping may significantly improve Genome-Wide Association Studies (GWAS) results by explicitly focusing on predictivity and optimal feature selection in a multivariate setting. It is however essential that stringent and well documented Data Analysis Protocols (DAP) are used to control sources of variability and ensure reproducibility of results. We present a genome-to-phenotype pipeline of machine learning modules for quantitative phenotype prediction. The pipeline can be applied for the direct use of whole-genome information in functional studies. As a realistic example, the problem of fitting complex phenotypic traits in heterogeneous stock mice from single nucleotide polymorphims (SNPs) is here considered. Methods The core element in the pipeline is the L1L2 regularization method based on the naïve elastic net. The method gives at the same time a regression model and a dimensionality reduction procedure suitable for correlated features. Model and SNP markers are selected through a DAP originally developed in the MAQC-II collaborative initiative of the U.S. FDA for the identification of clinical biomarkers from microarray data. The L1L2 approach is compared with standard Support Vector Regression (SVR) and with Recursive Jump Monte Carlo Markov Chain (MCMC). Algebraic indicators of stability of partial lists are used for model selection; the final panel of markers is obtained by a procedure at the chromosome scale, termed ’saturation’, to recover SNPs in Linkage Disequilibrium with those selected. Results With respect to both MCMC and SVR, comparable accuracies are obtained by the L1L2 pipeline. Good agreement is also found between SNPs selected by the L1L2 algorithms and candidate loci previously identified by a standard GWAS. The combination of L1L2-based feature selection with a saturation procedure tackles the issue of neglecting highly correlated features that affects many feature selection algorithms. Conclusions The L1L2 pipeline has proven effective in terms of marker selection and prediction accuracy. This study indicates that machine learning techniques may support quantitative phenotype prediction, provided that adequate DAPs are employed to control bias in model selection.</p

Springer - Publisher Connector

Supervised classification of combined copy number and gene expression data

Author: Furlanello C.
Jurman G.
Merler S.
Paoli S.
Quattrone A.
Riccadonna S.
Publication venue
Publication date: 01/12/2007
Field of study

Summary In this paper we apply a predictive profiling method to genome copy number aberrations (CNA) in combination with gene expression and clinical data to identify molecular patterns of cancer pathophysiology. Predictive models and optimal feature lists for the platforms are developed by a complete validation SVM-based machine learning system. Ranked list of genome CNA sites (assessed by comparative genomic hybridization arrays – aCGH) and of differentially expressed genes (assessed by microarray profiling with Affy HG-U133A chips) are computed and combined on a breast cancer dataset for the discrimination of Luminal/ ER+ (Lum/ER+) and Basal-like/ER- classes. Different encodings are developed and applied to the CNA data, and predictive variable selection is discussed. We analyze the combination of profiling information between the platforms, also considering the pathophysiological data. A specific subset of patients is identified that has a different response to classification by chromosomal gains and losses and by differentially expressed genes, corroborating the idea that genomic CNA can represent an independent source for tumor classification

arXiv.org e-Print Archive

Open Access Repository

Algebraic Comparison of Partial Lists in Bioinformatics

Author: A Gobbi
A Kalousis
A Kossenkov
A Sboner
AC Haury
AL Boulesteix
Arkady B. Khodursky
B Di Camillo
B Efron
B Efron
B Efron
B Schowe
C Cortes
C Cortes
C Furlanello
C Schneider
C Schneider
C Soneson
C Yao
Cesare Furlanello
Consortium The MicroArray Quality Control (MAQC)
D Albanese
D Cai
D Corrada
D Critchlow
D Saari
D Witten
G Guzzetta
G Jurman
G Jurman
G Lance
G Lance
G Smyth
Giuseppe Jurman
GS Cheon
I Guyon
I Jeffery
I Lönnstedt
J Bar-Ilan
J Borda
J Chen
J Ioannidis
J Neter
J Storey
L Ein-Dor
L Kuncheva
L Yu
L Zhang
M Desarkar
M Kauers
M Kauers
M Kendall
M Schimek
M Schimek
M Slawski
M Villarino
M Villarino
O Bousquet
P Baldi
P Diaconis
P Diaconis
P Hall
P Hall
P Krízek
PC Boutros
R Fagin
R Gentleman
R Graham
R Pearson
R Pique-Regi
R Pique-Regi
R Simon
Roberto Visintainer
S Abramov
S Dudoit
S Lin
S Lin
S Mukherjee
S Setlur
S Simićc
S Vanderlooy
Samantha Riccadonna
SK Lau
T Bø
T Calders
V Tusher
Visintainer
W Fury
W Hoeffding
W Shi
X Wang
X Yang
Y Xiao
Y Xiao
Z He
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 08/04/2010
Field of study

The outcome of a functional genomics pipeline is usually a partial list of genomic features, ranked by their relevance in modelling biological phenotype in terms of a classification or regression model. Due to resampling protocols or just within a meta-analysis comparison, instead of one list it is often the case that sets of alternative feature lists (possibly of different lengths) are obtained. Here we introduce a method, based on the algebraic theory of symmetric groups, for studying the variability between lists ("list stability") in the case of lists of unequal length. We provide algorithms evaluating stability for lists embedded in the full feature set or just limited to the features occurring in the partial lists. The method is demonstrated first on synthetic data in a gene filtering task and then for finding gene profiles on a recent prostate cancer dataset

Public Library of Science (PLOS)

Effect of Size and Heterogeneity of Samples on Biomarker Discovery: Synthetic and Real Data Assessment

Author: A Buness
A Subramanian
AL Boulesteix
Annalisa Barla
B Di Camillo
B Di Camillo
Barbara Di Camillo
C Ambroise
C Desmedt
C Furlanello
C Furlanello
C Sotiriou
CA Davis
Cesare Furlanello
Claudio Cobelli
D Cai
DS Oh
Francesco Sambo
G Jurman
G Jurman
G Jurman
Gianna Toffolo
Giuseppe Jurman
HY Chuang
I Guyon
JE Larkin
Jo-Ann L. Stanton
JP Ioannidis
L Ein-Dor
L Ein-Dor
L Shi
LD Miller
M Zucknick
Margherita Squillario
Matteo Martini
ML Siegal
P Baldi
RA Irizarry
RA Irizarry
S Riccadonna
SY Kim
T Abeel
Tiziana Sanavia
VG Tusher
VK Mootha
VN Vapnik
X Solé
Y Benjamini
Y Sun
Z He
Publication venue: Public Library of Science
Publication date: 01/01/2012
Field of study

MOTIVATION: The identification of robust lists of molecular biomarkers related to a disease is a fundamental step for early diagnosis and treatment. However, methodologies for the discovery of biomarkers using microarray data often provide results with limited overlap. These differences are imputable to 1) dataset size (few subjects with respect to the number of features); 2) heterogeneity of the disease; 3) heterogeneity of experimental protocols and computational pipelines employed in the analysis. In this paper, we focus on the first two issues and assess, both on simulated (through an in silico regulation network model) and real clinical datasets, the consistency of candidate biomarkers provided by a number of different methods. METHODS: We extensively simulated the effect of heterogeneity characteristic of complex diseases on different sets of microarray data. Heterogeneity was reproduced by simulating both intrinsic variability of the population and the alteration of regulatory mechanisms. Population variability was simulated by modeling evolution of a pool of subjects; then, a subset of them underwent alterations in regulatory mechanisms so as to mimic the disease state. RESULTS: The simulated data allowed us to outline advantages and drawbacks of different methods across multiple studies and varying number of samples and to evaluate precision of feature selection on a benchmark with known biomarkers. Although comparable classification accuracy was reached by different methods, the use of external cross-validation loops is helpful in finding features with a higher degree of precision and stability. Application to real data confirmed these results

Archivio istituzionale della ricerca - Università di Genova

Archivio istituzionale della ricerca - Università di Padova

Institutional Research Information System University of Turin

FigShare

Reverse Engineering Gene Networks with ANN: Variability in Network Inference Algorithms

Author: A Barabasi
A Baralla
A Braunstein
A Braunstein
A Krishnan
A Margolin
B Di Camillo
B Matthews
C Bishop
C Marr
C Steinhoff
D Marbach
D Stokic
E Dimitrova
E Keedwell
F He
F Markowetz
F Mordelet
G Altay
G Altay
G Karlebach
Giuseppe Jurman
I Nemenman
J Faith
J Peregrin-Alvarez
J Supper
L Song
M Bailly-Bechet
M Bansal
Marco Grimaldi
MB Eisen
N Friedman
P Baldi
P Erdös
P Langfelder
P Meyer
Paolo Provero
R De Smet
R Neal
R Neal
Roberto Visintainer
S Kauffman
S Lahabar
S Tuna
T Cover
VA Huynh-Thu
Y Zhang
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 24/09/2010
Field of study

Motivation :Reconstructing the topology of a gene regulatory network is one of the key tasks in systems biology. Despite of the wide variety of proposed methods, very little work has been dedicated to the assessment of their stability properties. Here we present a methodical comparison of the performance of a novel method (RegnANN) for gene network inference based on multilayer perceptrons with three reference algorithms (ARACNE, CLR, KELLER), focussing our analysis on the prediction variability induced by both the network intrinsic structure and the available data. Results: The extensive evaluation on both synthetic data and a selection of gene modules of "Escherichia coli" indicates that all the algorithms suffer of instability and variability issues with regards to the reconstruction of the topology of the network. This instability makes objectively very hard the task of establishing which method performs best. Nevertheless, RegnANN shows MCC scores that compare very favorably with all the other inference methods tested. Availability: The software for the RegnANN inference algorithm is distributed under GPL3 and it is available at the corresponding author home page (http://mpba.fbk.eu/grimaldi/regnann-supmat

arXiv.org e-Print Archive

CiteSeerX

Public Library of Science (PLOS)

Universidade do Minho: RepositoriUM

Neuromechanical and environment aware machine learning tool for human locomotion intent recognition

Author: AJ Young
AJ Young
B Chen
C Pizzolato
D Novak
G Jurman
H Huang
Hartmut Geyer
J Figueiredo
K Lebel
M Liu
MD Djurić-Jovičić
MR Tucker
T Afzal
Y He
Y Zhang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020
Field of study

Current research suggests the emergent need to recognize and predict locomotion modes (LMs) and LM transitions to allow a natural and smooth response of lower limb active assistive devices such as prostheses and orthosis for daily life locomotion assistance. This Master dissertation proposes an automatic and user-independent recognition and prediction tool based on machine learning methods. Further, it seeks to determine the gait measures that yielded the best performance in recognizing and predicting several human daily performed LMs and respective LM transitions. The machine learning framework was established using a Gaussian support vector machine (SVM) and discriminative features estimated from three wearable sensors, namely, inertial, force and laser sensors. In addition, a neuro-biomechanical model was used to compute joint angles and muscle activations that were fused with the sensor-based features. Results showed that combining biomechanical features from the Xsens with environment-aware features from the laser sensor resulted in the best recognition and prediction of LM (MCC = 0.99 and MCC = 0.95) and LM transitions (MCC = 0.96 and MCC = 0.98). Moreover, the predicted LM transitions were determined with high prediction time since their detection happened one or more steps before the LM transition occurrence. The developed framework has potential to improve the assistance delivered by locomotion assistive devices to achieve a more natural and smooth motion assistance.This work has been supported in part by the Fundação para a Ciência e Tecnologia (FCT) with the Reference Scholarship under Grant SFRH/BD/108309/2015, and part by the FEDER Funds through the Programa Operacional Regional do Norte and national funds from FCT with the project SmartOs -Controlo Inteligente de um Sistema Ortótico Ativo e Autónomo- under Grant NORTE-01-0145-FEDER-030386, and by the FEDER Funds through the COMPETE 2020—Programa Operacional Competitividade e Internacionalização (POCI)—with the Reference Project under Grant POCI-01-0145-FEDER-006941