
    MissForest—non-parametric missing value imputation for mixed-type data

    Motivation: Modern data acquisition based on high-throughput technology often faces the problem of missing data. Algorithms commonly used in the analysis of such large-scale data often depend on a complete set. Missing value imputation offers a solution to this problem. However, the majority of available imputation methods are restricted to one type of variable only: continuous or categorical. For mixed-type data, the different types are usually handled separately. Therefore, these methods ignore possible relations between variable types. We propose a non-parametric method which can cope with different types of variables simultaneously. Results: We compare several state-of-the-art methods for the imputation of missing values. We propose and evaluate an iterative imputation method (missForest) based on a random forest. By averaging over many unpruned classification or regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forest, we are able to estimate the imputation error without the need of a test set. Evaluation is performed on multiple datasets coming from a diverse selection of biological fields with artificially introduced missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values, particularly in datasets including different types of variables. In our comparative study, missForest outperforms other methods of imputation, especially in data settings where complex interactions and non-linear relations are suspected. The out-of-bag imputation error estimates of missForest prove to be adequate in all settings. Additionally, missForest exhibits attractive computational efficiency and can cope with high-dimensional data. Availability: The R package missForest is freely available from http://stat.ethz.ch/CRAN/. Contact: [email protected]; [email protected]
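The iterative scheme described in the abstract above can be illustrated with a minimal sketch. This is not the missForest implementation: to stay dependency-free it swaps the random forest for a k-nearest-neighbor average, handles numeric columns only, and has no out-of-bag error estimate. The names `knn_predict` and `iterative_impute`, the column ordering, and the stopping rule are assumptions made for illustration.

```python
import math


def knn_predict(train_X, train_y, query, k=2):
    """Stand-in learner (NOT a random forest): mean target of the k nearest rows."""
    ranked = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], query))
    nearest = [y for _, y in ranked[:k]]
    return sum(nearest) / len(nearest)


def drop_column(row, j):
    """Return the row without column j (the predictors for column j)."""
    return [v for c, v in enumerate(row) if c != j]


def iterative_impute(rows, max_iter=10, k=2):
    """Impute None entries in numeric rows, one column at a time.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    missing = [[v is None for v in row] for row in rows]
    data = [list(row) for row in rows]

    # Initial guess: the mean of the observed values in each column.
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        mean = sum(observed) / len(observed)
        for i, row in enumerate(data):
            if missing[i][j]:
                row[j] = mean

    # Assumption of this sketch: visit columns with fewer missing values first.
    order = sorted(range(n_cols), key=lambda j: sum(m[j] for m in missing))

    prev_change = math.inf
    for _ in range(max_iter):
        new = [list(row) for row in data]
        for j in order:
            obs = [i for i in range(len(new)) if not missing[i][j]]
            mis = [i for i in range(len(new)) if missing[i][j]]
            if not mis:
                continue
            # Fit on rows where column j was observed, predict where it was not.
            train_X = [drop_column(new[i], j) for i in obs]
            train_y = [new[i][j] for i in obs]
            for i in mis:
                new[i][j] = knn_predict(train_X, train_y, drop_column(new[i], j), k)

        # Squared change of the imputed entries between successive passes.
        change = sum(
            (new[i][j] - data[i][j]) ** 2
            for i in range(len(data))
            for j in range(n_cols)
            if missing[i][j]
        )
        if change >= prev_change:
            break  # no further improvement: keep the previous imputation
        data, prev_change = new, change
    return data


if __name__ == "__main__":
    table = [[1, 2], [2, 4], [3, None], [4, 8], [10, 20]]
    print(iterative_impute(table))
```

Replacing `knn_predict` with a random forest regressor (and a classifier for categorical columns) would bring the sketch closer to the method the abstract describes.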

    Causal Stability Ranking

    Genotypic causes of a phenotypic trait are typically determined via randomized controlled intervention experiments. Such experiments are often prohibitive with respect to durations and costs. We therefore consider inferring stable rankings of genes, according to their causal effects on a phenotype, from observational data only. Our method allows for efficient design and prioritization of future experiments, and due to its generality it is usable for a broad spectrum of applications.

    Causal stability ranking

    Genotypic causes of a phenotypic trait are typically determined via randomized controlled intervention experiments. Such experiments are often prohibitive with respect to durations and costs, and informative prioritization of experiments is desirable. We therefore consider predicting stable rankings of genes (covariates), according to their total causal effects on a phenotype (response), from observational data. Since causal effects are generally non-identifiable from observational data only, we use a method that can infer lower bounds for the total causal effect under some assumptions. We validated our method, which we call Causal Stability Ranking (CStaR), in two situations. First, we performed knock-out experiments with Arabidopsis thaliana according to a predicted ranking based on observational gene expression data, using flowering time as phenotype of interest. Besides several known regulators of flowering time, we found almost half of the tested top ranking mutants to have a significantly changed flowering time. Second, we compared CStaR to established regression-based methods on a gene expression dataset of Saccharomyces cerevisiae. We found that CStaR outperforms these established methods. Our method allows for efficient design and prioritization of future intervention experiments, and due to its generality it can be used for a broad spectrum of applications. Availability: The full table of ranked genes, all raw data and an example R script for CStaR are available from the Bioinformatics website. Contact: [email protected] Supplementary Information: Supplementary data are available at Bioinformatics online.

    NGS-pipe: a flexible, easily extendable, and highly configurable framework for NGS analysis

    Next-generation sequencing is now an established method in genomics, and massive amounts of sequencing data are being generated on a regular basis. Analysis of the sequencing data is typically performed by lab-specific in-house solutions, but the agreement of results from different facilities is often small. General standards for quality control, reproducibility, and documentation are missing. We developed NGS-pipe, a flexible, transparent, and easy-to-use framework for the design of pipelines to analyze whole-exome, whole-genome, and transcriptome sequencing data. NGS-pipe facilitates the harmonization of genomic data analysis by supporting quality control, documentation, reproducibility, parallelization, and easy adaptation to other NGS experiments. https://github.com/cbg-ethz/NGS-pipe [email protected]

    Genomic variant annotation workflow for clinical applications [version 2; referees: 2 approved]

    Annotation and interpretation of DNA aberrations identified through next-generation sequencing is becoming an increasingly important task, especially in data analysis pipelines for medical applications, where genomic aberrations are associated with phenotypic and clinical features. Here we describe a workflow to identify potential gene targets in aberrated genes or pathways and their corresponding drugs. To this end, we provide the R/Bioconductor package rDGIdb, an R wrapper to query the drug-gene interaction database (DGIdb). DGIdb accumulates drug-gene interaction data from 15 different resources and allows filtering on different levels. The rDGIdb package makes these resources and tools available to R users. Moreover, rDGIdb queries can be automated by incorporating the rDGIdb package into NGS pipelines.

    Proteome-wide identification of predominant subcellular protein localizations in a bacterial model organism

    Proteomics data provide unique insights into biological systems, including the predominant subcellular localization (SCL) of proteins, which can reveal important clues about their functions. Here we analyzed data of a complete prokaryotic proteome expressed under two conditions mimicking interaction of the emerging pathogen Bartonella henselae with its mammalian host. Normalized spectral count data from cytoplasmic, total membrane, inner and outer membrane fractions allowed us to identify the predominant SCL for 82% of the identified proteins. The spectral count proportion of total membrane versus cytoplasmic fractions indicated the propensity of cytoplasmic proteins to co-fractionate with the inner membrane, and enabled us to distinguish cytoplasmic, peripheral inner membrane and bona fide inner membrane proteins. Principal component analysis and k-nearest neighbor classification, trained on selected marker proteins or predominantly localized proteins, allowed us to determine an extensive catalog of at least 74 expressed outer membrane proteins, and to extend the SCL assignment to 94% of the identified proteins, including 18% where in silico methods gave no prediction. Suitable experimental proteomics data combined with straightforward computational approaches can thus identify the predominant SCL on a proteome-wide scale. Finally, we present a conceptual approach to identify proteins potentially changing their SCL in a condition-dependent fashion.
    The work presented here describes the first prokaryotic proteome-wide subcellular localization (SCL) dataset for the emerging pathogen B. henselae (Bhen). The study indicates that suitable subcellular fractionation experiments, combined with straightforward computational analysis of the proportion of spectral counts observed in different subcellular fractions, are powerful for determining the predominant SCL of a large percentage of the experimentally observed proteins. This includes numerous cases where in silico prediction methods do not provide any prediction. When treatment with harsh conditions is avoided, cytoplasmic proteins tend to co-fractionate with proteins of the inner membrane fraction, indicative of close functional interactions. The spectral count proportion (SCP) of total membrane versus cytoplasmic fractions gave a good indication of the relative proximity of individual protein complex members to the inner membrane. Using principal component analysis and k-nearest neighbor approaches, we were able to extend the percentage of proteins with a predominant experimental localization to over 90% of all expressed proteins and identified a set of at least 74 outer membrane (OM) proteins. In general, OM proteins represent a rich source of candidates for the development of urgently needed new therapeutics to combat the resurgence of infectious disease and multi-drug resistant bacteria. Finally, by comparing the data from two conditions relevant to infection biology, we conceptually explore methods to identify and visualize potential candidates that may partially change their SCL between these conditions. The data are made available to researchers as an SCL compendium for Bhen and as an aid in further improving in silico SCL prediction algorithms.

    Impact of bodyweight-adjusted antimicrobial prophylaxis on surgical-site infection rates

    Background: Antimicrobial prophylaxis (AMP) adjustment according to bodyweight to prevent surgical-site infections (SSI) is controversial. The impact of weight-adjusted AMP dosing on SSI rates was investigated here. Methods: Results from a first study of patients undergoing visceral, vascular or trauma operations, and receiving standard AMP, enabled retrospective evaluation of the impact of bodyweight and BMI on SSI rates, and identification of patients eligible for weight-adjusted AMP. In a subsequent observational prospective study, patients weighing at least 80 kg were assigned to receive double-dose AMP. Risk factors for SSI, including ASA classification, duration and type of surgery, wound class, diabetes, weight in kilograms, BMI, age, and AMP dose, were evaluated in multivariable analysis. Results: In the first study (3508 patients), bodyweight and BMI significantly correlated with higher rates of all SSI subclasses (both P < 0.001). An 80-kg cut-off identified patients receiving single-dose AMP who were at higher risk of SSI. In the prospective study (2161 patients), 546 patients weighing 80 kg or more who received only single-dose AMP had higher rates of all SSI types than a group of 1615 who received double-dose AMP (odds ratio (OR) 4.40, 95 per cent c.i. 3.18 to 6.23; P < 0.001). In multivariable analysis including 5021 patients from both cohorts, bodyweight (OR 1.01, 1.00 to 1.02; P = 0.008), BMI (OR 1.01, 1.00 to 1.02; P = 0.007) and double-dose AMP (OR 0.33, 0.23 to 0.46; P < 0.001) among other variables were independently associated with SSI rates. Conclusion: Double-dose AMP decreases SSI rates in patients weighing 80 kg or more.