100 research outputs found

    SignS: a parallelized, open-source, freely available, web-based tool for gene selection and molecular signatures for survival and censored data

    BACKGROUND: Censored data are increasingly common in many microarray studies that attempt to relate gene expression to patient survival. Several new methods have been proposed in the last two years. Most of these methods, however, are not available to biomedical researchers, leading to many re-implementations from scratch of ad hoc, and suboptimal, approaches with survival data. RESULTS: We have developed SignS (Signatures for Survival data), an open-source, freely available, web-based tool and R package for gene selection, building molecular signatures, and prediction with survival data. SignS implements four methods which, according to existing reviews, perform well and, by being of a very different nature, offer complementary approaches. We use parallel computing via MPI, leading to large decreases in user waiting time. Cross-validation is used to assess predictive performance and stability of solutions, the latter an issue of increasing concern given that there are often several solutions with similar predictive performance. Biological interpretation of results is enhanced because genes and signatures in models can be sent to other freely available online tools for examination of PubMed references, GO terms, and KEGG and Reactome pathways of selected genes. CONCLUSION: SignS is the first web-based tool for survival analysis of expression data, and one of the very few with biomedical researchers as target users. SignS is also one of the few bioinformatics web-based applications to extensively use parallelization, including fault tolerance and crash recovery. Because of its combination of methods implemented, usage of parallel computing, code availability, and links to additional databases, SignS is a unique tool, and will be of immediate relevance to biomedical researchers, biostatisticians and bioinformaticians.

    Conditional variable importance for random forests

    Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables. We identify two mechanisms responsible for this finding: (i) a preference for the selection of correlated predictors in the tree building process and (ii) an additional advantage for correlated predictor variables induced by the unconditional permutation scheme that is employed in the computation of the variable importance measure. Based on these considerations we develop a new, conditional permutation scheme for the computation of the variable importance measure. The resulting conditional variable importance is shown to reflect the true impact of each predictor variable more reliably than the original marginal approach
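
The difference between the unconditional (marginal) and conditional permutation schemes can be illustrated with a minimal sketch. This is not the authors' implementation: the data, the stand-in `model` (a fixed linear function in place of a fitted forest), and the use of quantile bins of the correlated predictor as the conditioning grid are all assumptions made for illustration.

```python
import random

def mse(model, rows, ys):
    """Mean squared error of `model` on the given rows."""
    return sum((model(r) - y) ** 2 for r, y in zip(rows, ys)) / len(ys)

def permute_column(rows, col, groups, rng):
    """Shuffle column `col` separately within each group of row indices."""
    out = [list(r) for r in rows]
    for idx in groups:
        vals = [rows[i][col] for i in idx]
        rng.shuffle(vals)
        for i, v in zip(idx, vals):
            out[i][col] = v
    return out

def importance(model, rows, ys, col, groups, rng, repeats=20):
    """Average increase in error after permuting `col` within `groups`."""
    base = mse(model, rows, ys)
    total = 0.0
    for _ in range(repeats):
        total += mse(model, permute_column(rows, col, groups, rng), ys) - base
    return total / repeats

rng = random.Random(0)
n = 400
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [a + rng.gauss(0, 0.3) for a in x1]  # x2 is correlated with x1
ys = [a + rng.gauss(0, 0.1) for a in x1]  # the response depends on x1 only
rows = list(zip(x1, x2))

# Hypothetical stand-in for a fitted model that leans on both predictors.
model = lambda r: 0.5 * r[0] + 0.5 * r[1]

# Marginal scheme: permute x2 across all observations at once.
marginal = importance(model, rows, ys, 1, [list(range(n))], rng)

# Conditional scheme (assumed grid): permute x2 only within deciles of x1.
order = sorted(range(n), key=lambda i: x1[i])
bins = [order[k:k + n // 10] for k in range(0, n, n // 10)]
conditional = importance(model, rows, ys, 1, bins, rng)

print(f"marginal importance of x2:    {marginal:.3f}")
print(f"conditional importance of x2: {conditional:.3f}")
```

In this toy setting the marginal scheme credits `x2` with substantial importance even though the response depends only on `x1`, whereas permuting within bins of `x1` largely preserves the correlation structure and yields a much smaller value.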

    Epigenetic mechanisms and metabolic reprogramming in fibrogenesis: dual targeting of G9a and DNMT1 for the inhibition of liver fibrosis

    OBJECTIVE: Hepatic stellate cell (HSC) transdifferentiation into myofibroblasts is central to fibrogenesis. Epigenetic mechanisms, including histone and DNA methylation, play a key role in this process. Concerted action between histone and DNA methyltransferases like G9a and DNMT1 is a common theme in gene expression regulation. We aimed to study the efficacy of CM272, a first-in-class dual and reversible G9a/DNMT1 inhibitor, in halting fibrogenesis. DESIGN: G9a and DNMT1 were analysed in cirrhotic human livers, mouse models of liver fibrosis and cultured mouse HSC. G9a and DNMT1 expression was knocked down or inhibited with CM272 in human HSC (hHSC), and transcriptomic responses to transforming growth factor-β1 (TGFβ1) were examined. Glycolytic metabolism and mitochondrial function were analysed with Seahorse-XF technology. Gene expression regulation was analysed by chromatin immunoprecipitation and methylation-specific PCR. Antifibrogenic activity and safety of CM272 were studied in mouse chronic CCl4 administration and bile duct ligation (BDL), and in human precision-cut liver slices (PCLSs) in a new bioreactor technology. RESULTS: G9a and DNMT1 were detected in stromal cells in areas of active fibrosis in human and mouse livers. G9a and DNMT1 expression was induced during mouse HSC activation, and TGFβ1 triggered their chromatin recruitment in hHSC. G9a/DNMT1 knockdown and CM272 inhibited TGFβ1 fibrogenic responses in hHSC. TGFβ1-mediated profibrogenic metabolic reprogramming was abrogated by CM272, which restored gluconeogenic gene expression and mitochondrial function through on-target epigenetic effects. CM272 inhibited fibrogenesis in mice and PCLSs without toxicity. CONCLUSIONS: Dual G9a/DNMT1 inhibition by compounds like CM272 may be a novel therapeutic strategy for treating liver fibrosis

    A random forest approach to the detection of epistatic interactions in case-control studies

    BACKGROUND: The key roles of epistatic interactions between multiple genetic variants in the pathogenesis of complex diseases notwithstanding, the detection of such interactions remains a great challenge in genome-wide association studies. Although some existing multi-locus approaches have shown their successes in small-scale case-control data, the "combination explosion" curse prohibits their application to genome-wide analysis. It is therefore indispensable to develop new methods that are able to reduce the search space for epistatic interactions from an astronomic number of all possible combinations of genetic variants to a manageable set of candidates. RESULTS: We studied case-control data from the viewpoint of binary classification. More precisely, we treated single nucleotide polymorphism (SNP) markers as categorical features and adopted the random forest to discriminate cases against controls. On the basis of the Gini importance given by the random forest, we designed a sliding window sequential forward feature selection (SWSFS) algorithm to select a small set of candidate SNPs that could minimize the classification error and then statistically tested up to three-way interactions of the candidates. We compared this approach with three existing methods on three simulated disease models and showed that our approach is comparable to, sometimes more powerful than, the other methods. We applied our approach to a genome-wide case-control dataset for Age-related Macular Degeneration (AMD) and successfully identified two SNPs that were reported to be associated with this disease. CONCLUSION: Besides existing pure statistical approaches, we demonstrated the feasibility of incorporating machine learning methods into genome-wide case-control studies. The Gini importance offers yet another measure for the associations between SNPs and complex diseases, thereby complementing existing statistical measures to facilitate the identification of epistatic interactions and the understanding of epistasis in the pathogenesis of complex diseases.

    Growth Strategies of Tropical Tree Species: Disentangling Light and Size Effects

    An understanding of the drivers of tree growth at the species level is required to predict likely changes of carbon stocks and biodiversity when environmental conditions change. Especially in species-rich tropical forests, it is largely unknown how species differ in their response of growth to resource availability and individual size. We use a hierarchical Bayesian approach to quantify the impact of light availability and tree diameter on growth of 274 woody species in a 50-ha long-term forest census plot in Barro Colorado Island, Panama. Light reaching each individual tree was estimated from yearly vertical censuses of canopy density. The hierarchical Bayesian approach allowed accounting for different sources of error, such as negative growth observations, and including rare species correctly weighted by their abundance. All species grew faster at higher light. Exponents of a power function relating growth to light were mostly between 0 and 1. This indicates that nearly all species exhibit a decelerating increase of growth with light. In contrast, estimated growth rates at standardized conditions (5 cm dbh, 5% light) varied over a 9-fold range and reflect strong growth-strategy differentiation between the species. As a consequence, growth rankings of the species at low (2%) and high light (20%) were highly correlated. Rare species tended to grow faster and showed a greater sensitivity to light than abundant species. Overall, tree size was less important for growth than light and about half the species were predicted to grow faster in diameter when bigger or smaller, respectively. Together light availability and tree diameter only explained on average 12% of the variation in growth rates. Thus, other factors such as soil characteristics, herbivory, or pathogens may contribute considerably to shaping tree growth in the tropics
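
To see why a light exponent between 0 and 1 implies a decelerating increase of growth with light, consider a minimal sketch. It assumes the power function has the form growth = a · light^b; the function name `growth` and the values `a = 1.0` and `b = 0.5` are hypothetical and are not estimates from the study.

```python
def growth(light, a=1.0, b=0.5):
    """Power-function growth model: growth = a * light ** b (assumed form)."""
    return a * light ** b

# Light levels in percent; 2% and 20% match the low and high light levels
# mentioned in the abstract, the intermediate values are illustrative.
lights = [2, 5, 10, 20]
rates = [growth(light) for light in lights]

# Average gain in growth per additional percent of light on each interval.
slopes = [
    (rates[i + 1] - rates[i]) / (lights[i + 1] - lights[i])
    for i in range(len(lights) - 1)
]

for light, rate in zip(lights, rates):
    print(f"light {light:>2}%: growth {rate:.3f}")
print("gain per unit light:", [round(s, 3) for s in slopes])
```

Growth keeps rising with light, but each additional unit of light adds less than the previous one, which is what an exponent between 0 and 1 means.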

    Adipose Gene Expression Prior to Weight Loss Can Differentiate and Weakly Predict Dietary Responders

    BACKGROUND: The ability to identify obese individuals who will successfully lose weight in response to dietary intervention will revolutionize disease management. Therefore, we asked whether it is possible to identify subjects who will lose weight during dietary intervention using only a single gene expression snapshot. METHODOLOGY/PRINCIPAL FINDINGS: The present study involved 54 female subjects from the Nutrient-Gene Interactions in Human Obesity-Implications for Dietary Guidelines (NUGENOB) trial to determine whether subcutaneous adipose tissue gene expression could be used to predict weight loss prior to the 10-week consumption of a low-fat hypocaloric diet. Several statistical tests revealed that the gene expression profiles of responders (8-12 kg weight loss) could always be differentiated from those of non-responders (<4 kg weight loss). We also assessed whether this differentiation was sufficient for prediction. Using a bottom-up (i.e. black-box) approach, standard class prediction algorithms were able to predict dietary responders with up to 61.1%+/-8.1% accuracy. Using a top-down approach (i.e. using differentially expressed genes to build a classifier) improved prediction accuracy to 80.9%+/-2.2%. CONCLUSION: Adipose gene expression profiling prior to the consumption of a low-fat diet is able to differentiate responders from non-responders as well as serve as a weak predictor of subjects destined to lose weight. While the degree of prediction accuracy currently achieved with a gene expression snapshot is perhaps insufficient for clinical use, this work reveals that the comprehensive molecular signature of adipose tissue paves the way for the future of personalized nutrition

    Machine Learning Approach for Prescriptive Plant Breeding

    We explored the capability of fusing high-dimensional phenotypic trait (phenomic) data with a machine learning (ML) approach to provide plant breeders the tools to do both in-season seed yield (SY) prediction and prescriptive cultivar development for targeted agro-management practices (e.g., row spacing and seeding density). We phenotyped 32 SoyNAM parent genotypes in two independent studies, each with contrasting agro-management treatments (two row spacings, three seeding densities). Phenotypic trait data (canopy temperature, chlorophyll content, hyperspectral reflectance, leaf area index, and light interception) were generated using an array of sensors at three growth stages during the growing season, and SY was determined by machine harvest. Random forest (RF) was used to train models for SY prediction using phenotypic traits (predictor variables) to identify the optimal temporal combination of variables to maximize accuracy and resource allocation. RF models were trained using data from both experiments and individually for each agro-management treatment. We report the most important traits agnostic of agro-management practices. Several predictor variables showed conditional importance dependent on the agro-management system. We assembled predictive models to enable in-season SY prediction, enabling the development of a framework to integrate phenomics information with powerful ML for prediction-enabled prescriptive plant breeding

    Individualized markers optimize class prediction of microarray data

    BACKGROUND: Identification of molecular markers for the classification of microarray data is a challenging task. Despite the evident dissimilarity in various characteristics of biological samples belonging to the same category, most of the marker-selection and classification methods do not consider this variability. In general, feature selection methods aim at identifying a common set of genes whose combined expression profiles can accurately predict the category of all samples. Here, we argue that this simplified approach is often unable to capture the complexity of a disease phenotype and we propose an alternative method that takes into account the individuality of each patient-sample. RESULTS: Instead of using the same features for the classification of all samples, the proposed technique starts by creating a pool of informative gene-features. For each sample, the method selects a subset of these features whose expression profiles are most likely to accurately predict the sample's category. Different subsets are utilized for different samples and the outcomes are combined in a hierarchical framework for the classification of all samples. Moreover, this approach can innately identify subgroups of samples within a given class which share common feature sets, thus highlighting the effect of individuality on gene expression. CONCLUSION: In addition to high classification accuracy, the proposed method offers a more individualized approach for the identification of biological markers, which may help in better understanding the molecular background of a disease and emphasize the need for more flexible medical interventions

    Transcription Initiation Activity Sets Replication Origin Efficiency in Mammalian Cells

    Genomic mapping of DNA replication origins (ORIs) in mammals provides a powerful means for understanding the regulatory complexity of our genome. Here we combine a genome-wide approach to identify preferential sites of DNA replication initiation at 0.4% of the mouse genome with detailed molecular analysis at distinct classes of ORIs according to their location relative to the genes. Our study reveals that 85% of the replication initiation sites in mouse embryonic stem (ES) cells are associated with transcriptional units. Nearly half of the identified ORIs map at promoter regions and, interestingly, ORI density strongly correlates with promoter density, reflecting the coordinated organisation of replication and transcription in the mouse genome. Detailed analysis of ORI activity showed that CpG island promoter-ORIs are the most efficient ORIs in ES cells and both ORI specification and firing efficiency are maintained across cell types. Remarkably, the distribution of replication initiation sites at promoter-ORIs exactly parallels that of transcription start sites (TSS), suggesting a co-evolution of the regulatory regions driving replication and transcription. Moreover, we found that promoter-ORIs are significantly enriched in CAGE tags derived from early embryos relative to all promoters. This association implies that transcription initiation early in development sets the probability of ORI activation, unveiling a new hallmark in ORI efficiency regulation in mammalian cells