154 research outputs found

    MissForest - nonparametric missing value imputation for mixed-type data

    Modern data acquisition based on high-throughput technology often faces the problem of missing data. Algorithms commonly used in the analysis of such large-scale data often depend on a complete data set. Missing value imputation offers a solution to this problem. However, the majority of available imputation methods are restricted to one type of variable only: continuous or categorical. For mixed-type data the different types are usually handled separately. Therefore, these methods ignore possible relations between variable types. We propose a nonparametric method which can cope with different types of variables simultaneously. We compare several state-of-the-art methods for the imputation of missing values. We propose and evaluate an iterative imputation method (missForest) based on a random forest. By averaging over many unpruned classification or regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forest, we are able to estimate the imputation error without the need of a test set. Evaluation is performed on multiple data sets coming from a diverse selection of biological fields with artificially introduced missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values, particularly in data sets including different types of variables. In our comparative study, missForest outperforms other methods of imputation, especially in data settings where complex interactions and nonlinear relations are suspected. The out-of-bag imputation error estimates of missForest prove to be adequate in all settings. Additionally, missForest exhibits attractive computational efficiency and can cope with high-dimensional data. Comment: Submitted to Oxford Journal's Bioinformatics on 3rd of May 201

    A Spatio-Temporal Data Imputation Model for Supporting Analytics at the Edge

    Current applications developed for the Internet of Things (IoT) usually involve the processing of collected data for delivering analytics and supporting efficient decision making. The basis for any processing mechanism is data analysis, usually having as an outcome responses to various analytics queries defined by end users or applications. However, as already noted in the respective literature, data analysis cannot be efficient when missing values are present. The research community has already proposed various missing data imputation methods, paying more attention to the statistical aspect of the problem. In this paper, we study the problem and propose a method that combines machine learning and a consensus scheme. We focus on the clustering of the IoT devices, assuming they observe the same phenomenon and report the collected data to the edge infrastructure. Through a sliding window approach, we try to detect IoT nodes that report similar contextual values to edge nodes and rely on them to deliver the replacement value for missing data. We provide the description of our model together with results retrieved by an extensive set of simulations on top of real data. Our aim is to reveal the potential of the proposed scheme and place it in the respective literature
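
    The sliding-window idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' model: the function name `impute_from_peers`, the `window` and `tol` parameters, the mean-absolute-difference similarity test, and the sample readings are all assumptions made for illustration. A missing value is replaced by the mean of the values reported by peers that agreed with the target device over the preceding window.

```python
from statistics import mean


def impute_from_peers(readings, target, t, window=3, tol=1.0):
    """Estimate readings[target][t] from peers that recently agreed with target.

    readings maps a device id to a list of floats; None marks a missing value.
    window and tol are illustrative parameters, not values from the paper.
    """
    lo = max(0, t - window)
    similar = []
    for dev, series in readings.items():
        if dev == target or series[t] is None:
            continue
        # Compare both devices over the window preceding t, skipping gaps.
        diffs = [
            abs(a - b)
            for a, b in zip(readings[target][lo:t], series[lo:t])
            if a is not None and b is not None
        ]
        if diffs and mean(diffs) <= tol:
            similar.append(series[t])
    # Simple consensus: mean of the similar peers' reports, or None if no peer qualifies.
    return mean(similar) if similar else None


# Hypothetical readings from four devices; device "a" misses its last report.
readings = {
    "a": [20.0, 20.5, 21.0, None],
    "b": [20.1, 20.4, 21.2, 21.6],
    "c": [19.9, 20.6, 20.9, 21.4],
    "d": [30.0, 31.0, 29.5, 30.2],  # observes a different phenomenon
}
estimate = impute_from_peers(readings, "a", t=3)
```

    Here devices "b" and "c" track "a" closely over the window, so only their reports contribute to the estimate, while "d" is excluded. A real scheme would likely cluster devices and weight the consensus rather than use a fixed tolerance.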

    Machine Learning Approach for Prescriptive Plant Breeding

    We explored the capability of fusing high dimensional phenotypic trait (phenomic) data with a machine learning (ML) approach to provide plant breeders with the tools to do both in-season seed yield (SY) prediction and prescriptive cultivar development for targeted agro-management practices (e.g., row spacing and seeding density). We phenotyped 32 SoyNAM parent genotypes in two independent studies, each with contrasting agro-management treatments (two row spacings, three seeding densities). Phenotypic trait data (canopy temperature, chlorophyll content, hyperspectral reflectance, leaf area index, and light interception) were generated using an array of sensors at three growth stages during the growing season, and SY was determined by machine harvest. Random forest (RF) was used to train models for SY prediction using phenotypic traits (predictor variables) to identify the optimal temporal combination of variables to maximize accuracy and resource allocation. RF models were trained using data from both experiments and individually for each agro-management treatment. We report the most important traits agnostic of agro-management practices. Several predictor variables showed conditional importance dependent on the agro-management system. We assembled predictive models to enable in-season SY prediction, enabling the development of a framework to integrate phenomics information with powerful ML for prediction-enabled prescriptive plant breeding

    Risk of pancreatic cancer associated with family history of cancer and other medical conditions by accounting for smoking among relatives

    Background: Family history (FH) of pancreatic cancer (PC) has been associated with an increased risk of PC, but little is known regarding the role of inherited/environmental factors or that of FH of other comorbidities in PC risk. We aimed to address these issues using multiple methodological approaches. Methods: Case-control study including 1431 PC cases and 1090 controls and a reconstructed-cohort study (N = 16 747) made up of their first-degree relatives (FDR). Logistic regression was used to evaluate PC risk associated with FH of cancer, diabetes, allergies, asthma, cystic fibrosis and chronic pancreatitis by relative type and number of affected relatives, by smoking status and other potential effect modifiers, and by tumour stage and location. Familial aggregation of cancer was assessed within the cohort using Cox proportional hazard regression. Results: FH of PC was associated with an increased PC risk [odds ratio (OR) = 2.68; 95% confidence interval (CI): 2.27-4.06] when compared with cancer-free FH, the risk being greater when ≥ 2 FDRs suffered PC (OR = 3.88; 95% CI: 2.96-9.73) and among current smokers (OR = 3.16; 95% CI: 2.56-5.78, interaction FHPC*smoking P-value = 0.04). PC cumulative risk by age 75 was 2.2% among FDRs of cases and 0.7% in those of controls [hazard ratio (HR) = 2.42; 95% CI: 2.16-2.71]. PC risk was significantly associated with FH of cancer (OR = 1.30; 95% CI: 1.13-1.54) and diabetes (OR = 1.24; 95% CI: 1.01-1.52), but not with FH of other diseases. Conclusions: The concordant findings using both approaches strengthen the notion that FH of cancer, PC or diabetes confers a higher PC risk. Smoking notably increases PC risk associated with FH of PC. Further evaluation of these associations should be undertaken to guide PC prevention strategies
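
    For readers unfamiliar with how an odds ratio and its 95% confidence interval are formed, the following is a minimal sketch of the unadjusted calculation from a 2x2 table using the standard Wald interval on the log scale. It is not the study's analysis, which used logistic regression with effect modifiers; the function name `odds_ratio_ci` and the counts below are hypothetical and are not taken from the paper.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: exposed cases, b: unexposed cases,
    c: exposed controls, d: unexposed controls.
    z = 1.96 gives an approximate 95% interval.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts, chosen only to illustrate the arithmetic.
or_, lower, upper = odds_ratio_ci(a=40, b=160, c=20, d=180)
```

    With these made-up counts the odds ratio is 2.25, and the interval excludes 1, which is the usual criterion for calling the association statistically significant at the 5% level.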

    An Infestation of Adult Lernaeocera
