MissForest - nonparametric missing value imputation for mixed-type data
Modern data acquisition based on high-throughput technology often faces
the problem of missing data. Algorithms commonly used in the analysis of such
large-scale data often depend on a complete set. Missing value imputation
offers a solution to this problem. However, the majority of available
imputation methods are restricted to one type of variable only: continuous or
categorical. For mixed-type data the different types are usually handled
separately. Therefore, these methods ignore possible relations between variable
types. We propose a nonparametric method which can cope with different types of
variables simultaneously. We compare several state-of-the-art methods for the
imputation of missing values. We propose and evaluate an iterative imputation
method (missForest) based on a random forest. By averaging over many unpruned
classification or regression trees random forest intrinsically constitutes a
multiple imputation scheme. Using the built-in out-of-bag error estimates of
random forest we are able to estimate the imputation error without the need of
a test set. Evaluation is performed on multiple data sets coming from a diverse
selection of biological fields with artificially introduced missing values
ranging from 10% to 30%. We show that missForest can successfully handle
missing values, particularly in data sets including different types of
variables. In our comparative study missForest outperforms other methods of
imputation especially in data settings where complex interactions and nonlinear
relations are suspected. The out-of-bag imputation error estimates of
missForest prove to be adequate in all settings. Additionally, missForest
exhibits attractive computational efficiency and can cope with high-dimensional
data.
Comment: Submitted to Oxford Journal's Bioinformatics on 3rd of May 201
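The iterative scheme described in the abstract above can be sketched roughly as follows. This is a minimal illustration for continuous variables only, assuming NumPy and scikit-learn are available; the function name `rf_impute`, the parameter defaults, and the simplified stopping rule are assumptions made for this example and not the authors' implementation. It also assumes every column has at least one observed value.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rf_impute(X, max_iter=5, n_trees=50, seed=0):
    """Iteratively impute NaNs in a numeric array with random forests.

    Hypothetical helper: continuous columns only, simplified stopping rule.
    Returns the imputed array and the per-column out-of-bag R^2 scores
    of the most recently fitted forests.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: fill each column with the mean of its observed values.
    col_means = np.nanmean(X, axis=0)
    X_imp = np.where(missing, col_means, X)
    # Visit columns in order of increasing number of missing values.
    order = [j for j in np.argsort(missing.sum(axis=0)) if missing[:, j].any()]
    oob = {}
    prev_diff = np.inf
    for _ in range(max_iter):
        X_old = X_imp.copy()
        for j in order:
            obs, mis = ~missing[:, j], missing[:, j]
            others = np.delete(X_imp, j, axis=1)
            forest = RandomForestRegressor(
                n_estimators=n_trees, oob_score=True, random_state=seed
            )
            # Train on rows where column j is observed, predict the rest.
            forest.fit(others[obs], X_imp[obs, j])
            X_imp[mis, j] = forest.predict(others[mis])
            oob[int(j)] = forest.oob_score_
        # Stop once the imputed values change at least as much as before,
        # and fall back to the previous iteration's result.
        diff = np.sum((X_imp[missing] - X_old[missing]) ** 2)
        if diff >= prev_diff:
            X_imp = X_old
            break
        prev_diff = diff
    return X_imp, oob
```

The out-of-bag scores returned here stand in for the built-in error estimate the abstract mentions: each forest is scored on the samples its trees did not see, so no separate test set is required. Handling categorical columns would need a classifier in place of the regressor.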
A Spatio-Temporal Data Imputation Model for Supporting Analytics at the Edge
Current applications developed for the Internet of Things (IoT) usually involve processing collected data to deliver analytics and support efficient decision making. The basis for any processing mechanism is data analysis, whose outcome is usually a set of responses to analytics queries defined by end users or applications. However, as already noted in the literature, data analysis cannot be efficient when missing values are present. The research community has already proposed various missing data imputation methods, paying more attention to the statistical aspect of the problem. In this paper, we study the problem and propose a method that combines machine learning and a consensus scheme. We focus on clustering the IoT devices, assuming they observe the same phenomenon and report the collected data to the edge infrastructure. Through a sliding window approach, we try to detect IoT nodes that report similar contextual values to edge nodes and rely on them to deliver the replacement value for missing data. We describe our model and present results from an extensive set of simulations on real data. Our aim is to reveal the potential of the proposed scheme and place it in the respective literature.
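The sliding-window idea in the abstract above can be illustrated with a small sketch. The abstract does not specify the similarity measure or the consensus rule, so everything here is an assumption for illustration: the function name `impute_from_peers`, the use of mean absolute difference over the window as the similarity, and the plain average of the `k` most similar peers as the replacement value.

```python
from statistics import mean


def impute_from_peers(streams, target, t, window=5, k=2):
    """Replace the missing value of `target` at time t using similar peers.

    `streams` maps a node id to a list of readings (None marks a gap).
    Hypothetical sketch: similarity is the mean absolute difference over
    the last `window` time steps, and the replacement is the average of
    the readings reported at time t by the k most similar peers.
    """
    start = max(0, t - window)
    scores = []
    for node, values in streams.items():
        if node == target or values[t] is None:
            continue
        # Compare only time steps where both nodes reported a value.
        pairs = [
            (a, b)
            for a, b in zip(streams[target][start:t], values[start:t])
            if a is not None and b is not None
        ]
        if pairs:
            distance = mean(abs(a - b) for a, b in pairs)
            scores.append((distance, node))
    if not scores:
        return None  # no peer shares recent history with the target
    nearest = sorted(scores)[:k]
    return mean(streams[node][t] for _, node in nearest)
```

With toy readings where nodes `"b"` and `"c"` track node `"a"` closely and node `"d"` does not, the gap in `"a"` is filled from `"b"` and `"c"` only, which is the behaviour the abstract describes for nodes observing the same phenomenon.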
Combining macula clinical signs and patient characteristics for age-related macular degeneration diagnosis: a machine learning approach
Background: To investigate machine learning methods, ranging from simpler interpretable techniques to complex (non-linear) “black-box” approaches, for automated diagnosis of Age-related Macular Degeneration (AMD).
Methods: Data from healthy subjects and patients diagnosed with AMD or other retinal diseases were collected during routine visits via an Electronic Health Record (EHR) system. Patients’ attributes included demographics and, for each eye, presence/absence of major AMD-related clinical signs (soft drusen, retinal pigment epithelium defects/pigment mottling, depigmentation area, subretinal haemorrhage, subretinal fluid, macula thickness, macular scar, subretinal fibrosis). Interpretable techniques known as white box methods, including logistic regression and decision trees, as well as less interpretable techniques known as black box methods, such as support vector machines (SVM), random forests and AdaBoost, were used to develop models (trained and validated on unseen data) to diagnose AMD. The gold standard was confirmed diagnosis of AMD by physicians. Sensitivity, specificity and area under the receiver operating characteristic curve (AUC) were used to assess performance.
Results: Study population included 487 patients (912 eyes). In terms of AUC, random forests, logistic regression and AdaBoost showed a mean performance of 0.92, followed by SVM and decision trees (0.90). All machine learning models identified soft drusen and age as the most discriminating variables in clinicians’ decision pathways to diagnose AMD.
Conclusions: Both black-box and white box methods performed well in identifying diagnoses of AMD and their decision pathways. Machine learning models developed through the proposed approach, relying on clinical signs identified by retinal specialists, could be embedded into EHR to provide physicians with real-time (interpretable) support.
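The three performance measures named in the Methods above (sensitivity, specificity and AUC) can be computed as in the sketch below. The function names are chosen for this example, and the AUC uses the standard rank-based (Mann-Whitney) formulation rather than any particular library; it is an illustration of the metrics, not the study's evaluation code.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = disease)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Equals the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative one, with ties counted as 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Sensitivity and specificity depend on the threshold used to turn scores into predicted labels, whereas AUC summarises ranking quality across all thresholds, which is why studies like this one typically report all three.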
Machine Learning Approach for Prescriptive Plant Breeding
We explored the capability of fusing high-dimensional phenotypic trait (phenomic) data with a machine learning (ML) approach to provide plant breeders with the tools to do both in-season seed yield (SY) prediction and prescriptive cultivar development for targeted agro-management practices (e.g., row spacing and seeding density). We phenotyped 32 SoyNAM parent genotypes in two independent studies, each with contrasting agro-management treatments (two row spacings, three seeding densities). Phenotypic trait data (canopy temperature, chlorophyll content, hyperspectral reflectance, leaf area index, and light interception) were generated using an array of sensors at three growth stages during the growing season, and seed yield (SY) was determined by machine harvest. Random forest (RF) was used to train models for SY prediction using phenotypic traits (predictor variables) to identify the optimal temporal combination of variables to maximize accuracy and resource allocation. RF models were trained using data from both experiments and individually for each agro-management treatment. We report the most important traits agnostic of agro-management practices. Several predictor variables showed conditional importance dependent on the agro-management system. We assembled predictive models to enable in-season SY prediction, enabling the development of a framework to integrate phenomics information with powerful ML for prediction-enabled prescriptive plant breeding.
Risk of pancreatic cancer associated with family history of cancer and other medical conditions by accounting for smoking among relatives
Background
Family history (FH) of pancreatic cancer (PC) has been associated with an increased risk of PC, but little is known regarding the role of inherited/environmental factors or that of FH of other comorbidities in PC risk. We aimed to address these issues using multiple methodological approaches.
Methods
Case-control study including 1431 PC cases and 1090 controls and a reconstructed-cohort study (N = 16 747) made up of their first-degree relatives (FDR). Logistic regression was used to evaluate PC risk associated with FH of cancer, diabetes, allergies, asthma, cystic fibrosis and chronic pancreatitis by relative type and number of affected relatives, by smoking status and other potential effect modifiers, and by tumour stage and location. Familial aggregation of cancer was assessed within the cohort using Cox proportional hazard regression.
Results
FH of PC was associated with an increased PC risk [odds ratio (OR) = 2.68; 95% confidence interval (CI): 2.27-4.06] when compared with cancer-free FH, the risk being greater when ≥ 2 FDRs suffered PC (OR = 3.88; 95% CI: 2.96-9.73) and among current smokers (OR = 3.16; 95% CI: 2.56-5.78, interaction FHPC*smoking P-value = 0.04). PC cumulative risk by age 75 was 2.2% among FDRs of cases and 0.7% in those of controls [hazard ratio (HR) = 2.42; 95% CI: 2.16-2.71]. PC risk was significantly associated with FH of cancer (OR = 1.30; 95% CI: 1.13-1.54) and diabetes (OR = 1.24; 95% CI: 1.01-1.52), but not with FH of other diseases.
Conclusions
The concordant findings using both approaches strengthen the notion that FH of cancer, PC or diabetes confers a higher PC risk. Smoking notably increases PC risk associated with FH of PC. Further evaluation of these associations should be undertaken to guide PC prevention strategies.
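For readers unfamiliar with the odds ratios and confidence intervals reported above, the sketch below shows how an unadjusted odds ratio and a Wald-type 95% confidence interval are obtained from a 2x2 case-control table. This is a simplification: the study itself used logistic regression, which also adjusts for other factors. The function name `odds_ratio_ci` and its arguments are assumptions for illustration, and no counts from the study are used.

```python
import math


def odds_ratio_ci(exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    Hypothetical helper; z=1.96 gives an approximate 95% interval.
    All four counts must be positive.
    """
    a, b = exposed_cases, unexposed_cases
    c, d = exposed_controls, unexposed_controls
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper
```

An interval that excludes 1 corresponds to an association that is statistically significant at roughly the 5% level, which is how statements such as "but not with FH of other diseases" are usually read.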