154 research outputs found

MissForest - nonparametric missing value imputation for mixed-type data

Author: D. J. Stekhoven, Harley, Kurgan, Latal, LITTLE, Oba, P. Buhlmann, Smit, Troyanskaya, van Buuren, Wille, Wu
Publication venue: Oxford University Press (OUP)
Publication date: 27/09/2011

Modern data acquisition based on high-throughput technology is often facing the problem of missing data. Algorithms commonly used in the analysis of such large-scale data often depend on a complete set. Missing value imputation offers a solution to this problem. However, the majority of available imputation methods are restricted to one type of variable only: continuous or categorical. For mixed-type data, the different types are usually handled separately. Therefore, these methods ignore possible relations between variable types. We propose a nonparametric method which can cope with different types of variables simultaneously. We compare several state-of-the-art methods for the imputation of missing values. We propose and evaluate an iterative imputation method (missForest) based on a random forest. By averaging over many unpruned classification or regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forest, we are able to estimate the imputation error without the need of a test set. Evaluation is performed on multiple data sets coming from a diverse selection of biological fields with artificially introduced missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values, particularly in data sets including different types of variables. In our comparative study, missForest outperforms other methods of imputation, especially in data settings where complex interactions and nonlinear relations are suspected. The out-of-bag imputation error estimates of missForest prove to be adequate in all settings. Additionally, missForest exhibits attractive computational efficiency and can cope with high-dimensional data. Comment: Submitted to Oxford Journal's Bioinformatics on 3rd of May 201

arXiv.org e-Print Archive

Repository for Publications and Research Data

Crossref
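
The iterative imputation idea in the missForest abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: to stay within the Python standard library, a k-nearest-neighbour average stands in for the random forest, only numeric columns are handled, and the function names, the mean-based starting guess and the stopping rule (stop once the imputed values no longer change by less than in the previous pass) are assumptions made for illustration. Every column is assumed to have at least one observed value.

```python
import math


def knn_predict(train_x, train_y, query, k=2):
    # Stand-in learner (NOT a random forest): average the targets of the
    # k training rows closest to the query row.
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: math.dist(pair[0], query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)


def iterative_impute(rows, k=2, max_iter=10):
    """Fill None entries of a numeric table by repeated per-column prediction."""
    n_rows, n_cols = len(rows), len(rows[0])
    missing = [[v is None for v in row] for row in rows]

    # Starting guess (an assumption): mean of the observed values per column.
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(sum(observed) / len(observed))
    data = [[col_means[j] if v is None else float(v) for j, v in enumerate(row)]
            for row in rows]

    prev_change = math.inf
    for _ in range(max_iter):
        new = [row[:] for row in data]
        for j in range(n_cols):
            def others(row, j=j):
                # All columns except the one currently being imputed.
                return [row[c] for c in range(n_cols) if c != j]

            # Train on rows where column j was observed, predict the rest.
            train = [i for i in range(n_rows) if not missing[i][j]]
            train_x = [others(new[i]) for i in train]
            train_y = [new[i][j] for i in train]
            for i in range(n_rows):
                if missing[i][j]:
                    new[i][j] = knn_predict(train_x, train_y, others(new[i]), k)

        # Squared change of the imputed cells between two passes.
        change = sum((new[i][j] - data[i][j]) ** 2
                     for i in range(n_rows) for j in range(n_cols)
                     if missing[i][j])
        if change >= prev_change:
            break  # no further improvement: keep the previous pass
        data, prev_change = new, change
    return data


table = [[1, 2], [2, 4], [3, None], [4, 8], [5, 10]]
filled = iterative_impute(table)
print(filled[2])
```

Replacing `knn_predict` with a random forest regressor or classifier per column would bring the sketch closer to what the abstract describes, including support for categorical variables.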

A Spatio-Temporal Data Imputation Model for Supporting Analytics at the Edge

Author: D Bertsimas, D Stekhoven, G Carpenter, H Cai, J Honaker, J Xing, L Jiang, L Kim, M Satyanarayanan, N Jiang, NC Guan, O Troyanskaya, P Schmitt, PJ Escamilla-Ambrosio, R Little, R Mazumder, S Buuren, S Oba, T Bo, T Raghunathan, X Wang, Y He
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2019

Current applications developed for the Internet of Things (IoT) usually involve the processing of collected data for delivering analytics and supporting efficient decision making. The basis for any processing mechanism is data analysis, whose outcome is usually a set of responses to various analytics queries defined by end users or applications. However, as already noted in the respective literature, data analysis cannot be efficient when missing values are present. The research community has already proposed various missing data imputation methods, paying more attention to the statistical aspect of the problem. In this paper, we study the problem and propose a method that combines machine learning and a consensus scheme. We focus on the clustering of the IoT devices, assuming they observe the same phenomenon and report the collected data to the edge infrastructure. Through a sliding window approach, we try to detect IoT nodes that report similar contextual values to edge nodes and rely on them to deliver the replacement value for missing data. We provide the description of our model together with results retrieved by an extensive set of simulations on top of real data. Our aim is to reveal the potential of the proposed scheme and place it in the respective literature.

Crossref

Enlighten
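
The sliding-window idea in the abstract above (find devices whose recent readings resemble those of the device with the gap, then derive the replacement from them) might look roughly like the sketch below. The abstract does not specify the clustering or the consensus scheme, so everything here is hypothetical: the function name `impute_from_peers`, the mean absolute difference as similarity measure, the `tolerance` threshold and the plain average used as "consensus". The readings inside the window are assumed to be complete.

```python
def impute_from_peers(histories, target, window=3, tolerance=1.0):
    """Replace the target's missing latest reading using similar peers.

    histories: dict mapping device id -> list of readings; the last reading
    of `target` is None (missing).
    """
    # The `window` readings that precede the missing value.
    recent = histories[target][-(window + 1):-1]
    peer_values = []
    for device, series in histories.items():
        if device == target or series[-1] is None:
            continue
        peer_recent = series[-(window + 1):-1]
        # Similarity (an assumption): mean absolute difference over the window.
        gap = sum(abs(a - b) for a, b in zip(recent, peer_recent)) / len(recent)
        if gap <= tolerance:
            peer_values.append(series[-1])
    if not peer_values:
        return None  # no similar peer found, leave the value missing
    # "Consensus" (an assumption): plain average of the similar peers.
    return sum(peer_values) / len(peer_values)


readings = {
    "a": [20.0, 20.5, 21.0, None],
    "b": [20.2, 20.4, 21.1, 21.6],
    "c": [20.1, 20.6, 20.9, 21.4],
    "d": [30.0, 30.5, 31.0, 31.5],  # observes a clearly different level
}
print(impute_from_peers(readings, "a"))
```

In this toy data, devices "b" and "c" track "a" closely and contribute to the replacement, while "d" is excluded by the tolerance.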

Combining macula clinical signs and patient characteristics for age-related macular degeneration diagnosis: a machine learning approach

Author: A Liaw, C Cortes, C Nadeau, Carlo Enrico Traverso, CM Bishop, D Pauleikhoff, DB Rein, DJ Stekhoven, DW Hosmer Jr, Dympna OSullivan, E Dimitriadou, EA Pifer, EL Lamoureux, FG Schlanitz, FL Ferris, G Quellec, IE Murdoch, J Friedman, K Hornik, L Breiman, M Bonetto, M Mehryar, Massimo Nicolo, Mattia Prosperi, Mauro Giacomini, MHA Hijazi, Monica Bonetto, MR Hee, NM Bressler, P Fraccaro, P Serrano-Aguilar, Paolo Fraccaro, Peter Weller, S Kankanahalli, S Sivasankari, T Fawcett, Y Freund, Y Kanagasingam, Y Zheng, YY Liu
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

Background: To investigate machine learning methods, ranging from simpler interpretable techniques to complex (non-linear) “black-box” approaches, for automated diagnosis of Age-related Macular Degeneration (AMD). Methods: Data from healthy subjects and patients diagnosed with AMD or other retinal diseases were collected during routine visits via an Electronic Health Record (EHR) system. Patients’ attributes included demographics and, for each eye, presence/absence of major AMD-related clinical signs (soft drusen, retinal pigment epithelium, defects/pigment mottling, depigmentation area, subretinal haemorrhage, subretinal fluid, macula thickness, macular scar, subretinal fibrosis). Interpretable techniques known as white-box methods, including logistic regression and decision trees, as well as less interpretable techniques known as black-box methods, such as support vector machines (SVM), random forests and AdaBoost, were used to develop models (trained and validated on unseen data) to diagnose AMD. The gold standard was confirmed diagnosis of AMD by physicians. Sensitivity, specificity and area under the receiver operating characteristic curve (AUC) were used to assess performance. Results: Study population included 487 patients (912 eyes). In terms of AUC, random forests, logistic regression and AdaBoost showed a mean performance of 0.92, followed by SVM and decision trees (0.90). All machine learning models identified soft drusen and age as the most discriminating variables in clinicians’ decision pathways to diagnose AMD. Conclusions: Both black-box and white-box methods performed well in identifying diagnoses of AMD and their decision pathways. Machine learning models developed through the proposed approach, relying on clinical signs identified by retinal specialists, could be embedded into EHR to provide physicians with real-time (interpretable) support.

City Research Online

Crossref

Springer - Publisher Connector

TRAP

PubMed Central

The University of Manchester - Institutional Repository

Archivio istituzionale della ricerca - Università di Genova
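
The three performance measures named in the abstract above (sensitivity, specificity and AUC) can be computed as in the following sketch. The labels and scores are made-up toy values, not data from the study, and the function names are illustrative. AUC is computed here through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 1/0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): 1 = disease present, 0 = absent.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]

print(sensitivity_specificity(labels, predicted))
print(auc(labels, scores))
```

Sensitivity and specificity depend on the chosen decision threshold (0.5 here, an arbitrary choice), whereas AUC summarises performance across all thresholds.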

Machine Learning Approach for Prescriptive Plant Breeding

Author: A Liaw, A Singh, AK Singh, AT Mastrodomenico, BH Menze, BL Ma, BS Christenson, C Penone, C Ziyomo, D Pauli, Deniz Akdemir, DJ Stekhoven, DM Lambert, DS Harris, ER Cober, F Gao, FA van Eeuwijk, FH Andrade, G Machado, I Guyon, J Crain, J Jin, J Zhang, James B. Holland, JE Board, JE Specht, JE Vogelmann, Jessica Rutkoski, JH Friedman, JJ Suhre, JL Araus, JL De Bruin, JW Singer, K Rincker, Kyle Parmley, L Breiman, M de Felipe, M Garriga, M Kuhn, M Reynolds, MJ Morrison, NR Keep, NV McKinney, OA Montesinos-López, PM Granitto, R Díaz-Uriarte, R Wei, R Wells, RD Cook, RK Teal, RP Koester, S Ghosal, S Thapa, SC Rowntree, SM Hock, SP Conley, WF Schillinger, WJ Ethredge, WR Fehr, X Liu, X Xiao
Publication venue: Iowa State University Digital Repository
Publication date: 20/11/2019

We explored the capability of fusing high-dimensional phenotypic trait (phenomic) data with a machine learning (ML) approach to provide plant breeders the tools to do both in-season seed yield (SY) prediction and prescriptive cultivar development for targeted agro-management practices (e.g., row spacing and seeding density). We phenotyped 32 SoyNAM parent genotypes in two independent studies, each with contrasting agro-management treatments (two row spacings, three seeding densities). Phenotypic trait data (canopy temperature, chlorophyll content, hyperspectral reflectance, leaf area index, and light interception) were generated using an array of sensors at three growth stages during the growing season, and SY was determined by machine harvest. Random forest (RF) was used to train models for SY prediction using phenotypic traits (predictor variables) to identify the optimal temporal combination of variables to maximize accuracy and resource allocation. RF models were trained using data from both experiments and individually for each agro-management treatment. We report the most important traits agnostic of agro-management practices. Several predictor variables showed conditional importance dependent on the agro-management system. We assembled predictive models to enable in-season SY prediction, enabling the development of a framework to integrate phenomics information with powerful ML for prediction-enabled prescriptive plant breeding.

Digital Repository @ Iowa State University (ISU)

Crossref

Feline calicivirus and other respiratory pathogens in cats with Feline calicivirus-related symptoms and in clinically healthy cats in Switzerland

Crossref

Risk of pancreatic cancer associated with family history of cancer and other medical conditions by accounting for smoking among relatives

Author: A Carrato, A Farré, A Scarpa, A Tardón, Amundadottir, Austin, Bo Kong, Breslow, Brune, C W Michalski, D O’Driscoll, E Costello, E Domínguez-Muñoz, E Molina-Montes, F X Real, Fehringer, Fiederling, Frank, Hanley, Hariri, Hiripi, Hosmer, I Poves, J Balsells, J Huang, J Kleeff, J Mora, J Perea, Jacobs, Khoury, Klein, Kristman, L Ilzarbe, L Murray, L Muñoz-Bellvís, L Sharp, Lepage, Lin, Liu, M Hidalgo, M Iglesias, M Löhr, M Márquez, M O’Rorke, M Rava, Maisonneuve, N Malats, Neuhaus, P Gomez-Rubio, Permuth-Wey, Pierce, R Core Development Team, Roberts, Schoenfield, Schulte, Siegel, Stekhoven, T Crnogorac-Jurcevic, T Gress, Turati, V M Barberà, W Greenhalf, W Ye, Wang, X Molero, Zhen, Zimmerman
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2018

Background: Family history (FH) of pancreatic cancer (PC) has been associated with an increased risk of PC, but little is known regarding the role of inherited/environmental factors or that of FH of other comorbidities in PC risk. We aimed to address these issues using multiple methodological approaches. Methods: Case-control study including 1431 PC cases and 1090 controls and a reconstructed-cohort study (N = 16 747) made up of their first-degree relatives (FDR). Logistic regression was used to evaluate PC risk associated with FH of cancer, diabetes, allergies, asthma, cystic fibrosis and chronic pancreatitis by relative type and number of affected relatives, by smoking status and other potential effect modifiers, and by tumour stage and location. Familial aggregation of cancer was assessed within the cohort using Cox proportional hazard regression. Results: FH of PC was associated with an increased PC risk [odds ratio (OR) = 2.68; 95% confidence interval (CI): 2.27-4.06] when compared with cancer-free FH, the risk being greater when ≥ 2 FDRs suffered PC (OR = 3.88; 95% CI: 2.96-9.73) and among current smokers (OR = 3.16; 95% CI: 2.56-5.78, interaction FHPC*smoking P-value = 0.04). PC cumulative risk by age 75 was 2.2% among FDRs of cases and 0.7% in those of controls [hazard ratio (HR) = 2.42; 95% CI: 2.16-2.71]. PC risk was significantly associated with FH of cancer (OR = 1.30; 95% CI: 1.13-1.54) and diabetes (OR = 1.24; 95% CI: 1.01-1.52), but not with FH of other diseases. Conclusions: The concordant findings using both approaches strengthen the notion that FH of cancer, PC or diabetes confers a higher PC risk. Smoking notably increases PC risk associated with FH of PC. Further evaluation of these associations should be undertaken to guide PC prevention strategies.

Repositorio Institucional de la Universidad de Alicante

University of Liverpool Repository

Crossref

Catalogo dei prodotti della ricerca

Queen Mary Research Online
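
For readers unfamiliar with the odds ratios and 95% confidence intervals reported in the abstract above, the sketch below shows how an unadjusted odds ratio and its Wald interval are obtained from a 2x2 case-control table. The counts are invented for illustration and are not taken from the study, whose estimates come from logistic regression models with additional covariates; the function name is likewise hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: exposed cases      b: unexposed cases
    c: exposed controls   d: unexposed controls
    z: normal quantile (1.96 for a 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 20 of 100 cases and 10 of 100 controls are exposed.
estimate, lower, upper = odds_ratio_ci(20, 80, 10, 90)
print(f"OR = {estimate:.2f}; 95% CI: {lower:.2f}-{upper:.2f}")
```

Because the interval is built on the log scale, it is symmetric around the odds ratio only after taking logarithms, which is why reported intervals look lopsided on the natural scale.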