178 research outputs found

    MissForest - nonparametric missing value imputation for mixed-type data

    Modern data acquisition based on high-throughput technology is often facing the problem of missing data. Algorithms commonly used in the analysis of such large-scale data often depend on a complete set. Missing value imputation offers a solution to this problem. However, the majority of available imputation methods are restricted to one type of variable only: continuous or categorical. For mixed-type data the different types are usually handled separately. Therefore, these methods ignore possible relations between variable types. We propose a nonparametric method which can cope with different types of variables simultaneously. We compare several state-of-the-art methods for the imputation of missing values. We propose and evaluate an iterative imputation method (missForest) based on a random forest. By averaging over many unpruned classification or regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forest, we are able to estimate the imputation error without the need of a test set. Evaluation is performed on multiple data sets coming from a diverse selection of biological fields with artificially introduced missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values, particularly in data sets including different types of variables. In our comparative study, missForest outperforms other methods of imputation, especially in data settings where complex interactions and nonlinear relations are suspected. The out-of-bag imputation error estimates of missForest prove to be adequate in all settings. Additionally, missForest exhibits attractive computational efficiency and can cope with high-dimensional data. Comment: Submitted to Oxford Journal's Bioinformatics on 3rd of May 201
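
    The iterative scheme described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it handles continuous columns only, assumes NumPy and scikit-learn are installed, and assumes every column has at least one observed value. The function name, the mean-value starting guess, and the stopping rule (a fixed iteration cap plus a small-change tolerance) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def iterative_forest_impute(X, max_iter=5, n_trees=50, tol=1e-6, seed=0):
    """Fill NaNs in a numeric 2-D array column by column with random forests.

    Hypothetical helper: a simplified, continuous-only illustration of
    random-forest-based iterative imputation.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Assumed starting guess: replace each NaN with its column mean.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    # Visit columns from the fewest to the most missing entries.
    order = np.argsort(missing.sum(axis=0))

    for _ in range(max_iter):
        previous = filled.copy()
        for j in order:
            rows_missing = missing[:, j]
            if not rows_missing.any():
                continue
            rows_observed = ~rows_missing
            # Predict column j from all other (currently filled) columns.
            others = np.delete(filled, j, axis=1)
            forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
            forest.fit(others[rows_observed], filled[rows_observed, j])
            filled[rows_missing, j] = forest.predict(others[rows_missing])

        # Assumed stopping rule: stop once the imputed values barely change.
        change = np.sum((filled[missing] - previous[missing]) ** 2)
        if change < tol:
            break

    return filled
```

    Observed entries are never overwritten; only the positions that were NaN in the input are updated on each pass. Handling categorical columns would need a classifier in place of the regressor, which this sketch leaves out.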

    A Spatio-Temporal Data Imputation Model for Supporting Analytics at the Edge

    Current applications developed for the Internet of Things (IoT) usually involve processing collected data to deliver analytics and support efficient decision making. The basis for any processing mechanism is data analysis, whose outcome is usually a set of responses to analytics queries defined by end users or applications. However, as already noted in the respective literature, data analysis cannot be efficient when missing values are present. The research community has already proposed various missing data imputation methods, paying more attention to the statistical aspect of the problem. In this paper, we study the problem and propose a method that combines machine learning and a consensus scheme. We focus on the clustering of the IoT devices, assuming they observe the same phenomenon and report the collected data to the edge infrastructure. Through a sliding window approach, we try to detect IoT nodes that report similar contextual values to edge nodes and rely on them to deliver the replacement value for missing data. We provide the description of our model together with results retrieved by an extensive set of simulations on top of real data. Our aim is to reveal the potential of the proposed scheme and place it in the respective literature.
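
    The sliding-window idea in this abstract can be sketched as follows. This is a simplified illustration, not the paper's model: it swaps the clustering and consensus scheme for a plain nearest-peers average, and the function name, the `window` and `k` parameters, and the mean-absolute-difference similarity are all hypothetical choices.

```python
from statistics import mean


def impute_from_similar_nodes(history, current, target, window=5, k=2):
    """Estimate a missing reading for `target` from its most similar peers.

    history: dict mapping node id -> list of past readings (assumed gap-free)
    current: dict mapping node id -> latest reading, or None if missing
    Returns None when no peer with a current reading is available.
    """
    recent_target = history[target][-window:]
    distances = []
    for node, readings in history.items():
        # Skip the target itself and peers whose current value is also missing.
        if node == target or current.get(node) is None:
            continue
        recent = readings[-window:]
        n = min(len(recent), len(recent_target))
        if n == 0:
            continue
        # Assumed similarity: mean absolute difference over the shared window.
        d = mean(abs(a - b) for a, b in zip(recent[-n:], recent_target[-n:]))
        distances.append((d, node))

    if not distances:
        return None

    # Average the current readings of the k closest peers.
    nearest = sorted(distances)[:k]
    return mean(current[node] for _, node in nearest)
```

    In this sketch a node whose recent readings diverge from the target's is simply ranked last and ignored once `k` closer peers exist; a real deployment would need to handle peers with gaps in their own history.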

    Social Inequalities of Functioning and Perceived Health in Switzerland–A Representative Cross-Sectional Analysis

    Many people worldwide live with a disability, i.e. limitations in functioning. The prevalence is expected to increase due to demographic change and the growing importance of non-communicable disease and injury. To date, many epidemiological studies have used simple dichotomous measures of disability, even though the WHO's International Classification of Functioning, Disability, and Health (ICF) provides a multi-dimensional framework of functioning. We aimed to examine associations of socio-economic status (SES) and social integration in 3 core domains of functioning (impairment, pain, limitations in activity and participation) and perceived health. We conducted a secondary analysis of representative cross-sectional data of the Swiss Health Survey 2007 including 10,336 female and 8,424 male Swiss residents aged 15 or more. Guided by a theoretical ICF-based model, 4 mixed effects Poisson regressions were fitted in order to explain functioning and perceived health by indicators of SES and social integration. Analyses were stratified by age groups (15–30, 31–54, ≥55 years). In all age groups, SES and social integration were significantly associated with functional and perceived health. Among the functional domains, impairment and pain were closely related, and both were associated with limitations in activity and participation. SES, social integration and functioning were related to perceived health. We found pronounced social inequalities in functioning and perceived health, supporting our theoretical model. Social factors play a significant role in the experience of health, even in a wealthy country such as Switzerland. These findings await confirmation in other, particularly lower resourced settings.

    Machine Learning Approach for Prescriptive Plant Breeding

    We explored the capability of fusing high-dimensional phenotypic trait (phenomic) data with a machine learning (ML) approach to provide plant breeders the tools to do both in-season seed yield (SY) prediction and prescriptive cultivar development for targeted agro-management practices (e.g., row spacing and seeding density). We phenotyped 32 SoyNAM parent genotypes in two independent studies, each with contrasting agro-management treatments (two row spacings, three seeding densities). Phenotypic trait data (canopy temperature, chlorophyll content, hyperspectral reflectance, leaf area index, and light interception) were generated using an array of sensors at three growth stages during the growing season, and SY was determined by machine harvest. Random forest (RF) was used to train models for SY prediction using phenotypic traits (predictor variables) to identify the optimal temporal combination of variables to maximize accuracy and resource allocation. RF models were trained using data from both experiments and individually for each agro-management treatment. We report the most important traits agnostic of agro-management practices. Several predictor variables showed conditional importance dependent on the agro-management system. We assembled predictive models to enable in-season SY prediction, enabling the development of a framework to integrate phenomics information with powerful ML for prediction-enabled prescriptive plant breeding.

    Risk of pancreatic cancer associated with family history of cancer and other medical conditions by accounting for smoking among relatives

    Background: Family history (FH) of pancreatic cancer (PC) has been associated with an increased risk of PC, but little is known regarding the role of inherited/environmental factors or that of FH of other comorbidities in PC risk. We aimed to address these issues using multiple methodological approaches. Methods: Case-control study including 1431 PC cases and 1090 controls and a reconstructed-cohort study (N = 16 747) made up of their first-degree relatives (FDR). Logistic regression was used to evaluate PC risk associated with FH of cancer, diabetes, allergies, asthma, cystic fibrosis and chronic pancreatitis by relative type and number of affected relatives, by smoking status and other potential effect modifiers, and by tumour stage and location. Familial aggregation of cancer was assessed within the cohort using Cox proportional hazard regression. Results: FH of PC was associated with an increased PC risk [odds ratio (OR) = 2.68; 95% confidence interval (CI): 2.27-4.06] when compared with cancer-free FH, the risk being greater when ≥ 2 FDRs suffered PC (OR = 3.88; 95% CI: 2.96-9.73) and among current smokers (OR = 3.16; 95% CI: 2.56-5.78, interaction FHPC*smoking P-value = 0.04). PC cumulative risk by age 75 was 2.2% among FDRs of cases and 0.7% in those of controls [hazard ratio (HR) = 2.42; 95% CI: 2.16-2.71]. PC risk was significantly associated with FH of cancer (OR = 1.30; 95% CI: 1.13-1.54) and diabetes (OR = 1.24; 95% CI: 1.01-1.52), but not with FH of other diseases. Conclusions: The concordant findings using both approaches strengthen the notion that FH of cancer, PC or diabetes confers a higher PC risk. Smoking notably increases PC risk associated with FH of PC. Further evaluation of these associations should be undertaken to guide PC prevention strategies.

    On the Infective Larva of Ostertagia circumcincta
