
    Robust Detection of Hierarchical Communities from Escherichia coli Gene Expression Data

    Determining the functional structure of biological networks is a central goal of systems biology. One approach is to analyze gene expression data to infer a network of gene interactions on the basis of their correlated responses to environmental and genetic perturbations. The inferred network can then be analyzed to identify functional communities. However, commonly used algorithms can yield unreliable results due to experimental noise, algorithmic stochasticity, and the influence of arbitrarily chosen parameter values. Furthermore, the results typically provide only a simplistic view of the network, partitioned into disjoint communities, and give no information on the relationships between communities. Here, we present methods to robustly detect coregulated and functionally enriched gene communities and demonstrate their application and validity for Escherichia coli gene expression data. Applying a recently developed community detection algorithm to the network of interactions identified with the context likelihood of relatedness (CLR) method, we show that a hierarchy of network communities can be identified. These communities are significantly enriched for gene ontology (GO) terms, consistent with their representing biologically meaningful groups. Further, analysis of the most significantly enriched communities identified several candidate new regulatory interactions. The robustness of our methods is demonstrated by showing that a core set of functional communities is reliably found when artificial noise, modeling experimental noise, is added to the data. We find that noise mainly acts conservatively, increasing the relatedness required for a network link to be reliably assigned and decreasing the size of the core communities, rather than causing association of genes into new communities.
    Comment: Due to appear in PLoS Computational Biology. Supplementary Figure S1 was not uploaded but is available by contacting the author. 27 pages, 5 figures, 15 supplementary files.
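
The CLR step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the commonly described form of CLR, in which each pairwise mutual-information value is converted to a z-score against the background of each of the two genes involved and the two z-scores are combined. The function name `clr_scores`, the toy matrix, and the choice to include the diagonal in the background statistics are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def clr_scores(mi):
    """Sketch of CLR-style scoring for a symmetric mutual-information matrix.

    Assumption: the background for gene i is the whole of row i (diagonal
    included), which is a simplification chosen for brevity.
    """
    mi = np.asarray(mi, dtype=float)
    mean = mi.mean(axis=1, keepdims=True)
    std = mi.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # avoid division by zero for constant rows
    # z[i, j]: how unusual MI(i, j) is relative to gene i's background
    z = np.maximum(0.0, (mi - mean) / std)
    # combine the views from gene i and gene j into one symmetric score
    return np.sqrt(z**2 + z.T**2)

# Hypothetical toy matrix: genes 0 and 1 share far more information
# with each other than with genes 2 and 3.
toy_mi = np.array([
    [0.00, 0.90, 0.10, 0.10],
    [0.90, 0.00, 0.10, 0.20],
    [0.10, 0.10, 0.00, 0.15],
    [0.10, 0.20, 0.15, 0.00],
])
scores = clr_scores(toy_mi)
```

Thresholding `scores` would give the edges of a network on which a community detection algorithm could then be run; the threshold and the algorithm used in the paper are not specified in this listing.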

    Multicentric validation of proteomic biomarkers in urine specific for diabetic nephropathy

    Background: Urine proteome analysis is rapidly emerging as a tool for diagnosis and prognosis in disease states. For diagnosis of diabetic nephropathy (DN), urinary proteome analysis was successfully applied in a pilot study. The validity of the previously established proteomic biomarkers with respect to their diagnostic and prognostic potential was assessed on a separate set of patients recruited at three different European centers. In this case-control study of 148 Caucasian patients with diabetes mellitus type 2 and duration >= 5 years, cases of DN were defined as albuminuria >300 mg/d and diabetic retinopathy (n = 66). Controls were matched for gender and diabetes duration (n = 82). Methodology/Principal Findings: Proteome analysis was performed blinded using high-resolution capillary electrophoresis coupled with mass spectrometry (CE-MS). Data were evaluated employing the previously developed model for DN. Upon unblinding, the model for DN showed 93.8% sensitivity and 91.4% specificity, with an AUC of 0.948 (95% CI 0.898-0.978). Of 65 previously identified peptides, 60 were significantly different between cases and controls of this study. In <10% of cases and controls, classification by proteome analysis did not fully match the expected clinical outcome. Analysis of the patients' subsequent clinical course revealed later progression to DN in some of the control patients who had been classified as false positives. Conclusions: These data provide the first independent confirmation that profiling of the urinary proteome by CE-MS can adequately identify subjects with DN, supporting the generalizability of this approach. The data further establish urinary collagen fragments as biomarkers for diabetes-induced renal damage that may serve as earlier and more specific biomarkers than the currently used urinary albumin.

    Clustering-based approaches to SAGE data mining

    Serial analysis of gene expression (SAGE) is one of the most powerful tools for global gene expression profiling. It has led to several biological discoveries and biomedical applications, such as the prediction of new gene functions and the identification of biomarkers in human cancer research. Clustering techniques have become fundamental approaches in these applications. This paper reviews relevant clustering techniques specifically designed for this type of data. It places an emphasis on current limitations and opportunities in this area for supporting biologically-meaningful data mining and visualisation.

    Relative impact of key sources of systematic noise in Affymetrix and Illumina gene-expression microarray experiments

    Background: Systematic processing noise, which includes batch effects, is very common in microarray experiments but is often ignored despite its potential to confound or compromise experimental results. Compromised results are most likely when re-analysing or integrating datasets from public repositories, because of the different conditions under which each dataset is generated. To better understand the relative noise contributions of various factors in experimental design, we assessed several Illumina and Affymetrix datasets for technical variation between replicate hybridisations of Universal Human Reference (UHRR) and individual or pooled breast-tumour RNA. Results: A varying degree of systematic noise was observed in each of the datasets; however, in all cases the relative amount of variation between standard control RNA replicates was found to be greatest at earlier points in the sample-preparation workflow. For example, 40.6% of the total variation in reported expression was attributed to replicate extractions, compared to 13.9% due to amplification/labelling and 10.8% between replicate hybridisations. Deliberate probe-wise batch-correction methods were effective in reducing the magnitude of this variation, although the level of improvement was dependent on the sources of noise included in the model. Systematic noise introduced at the chip, run, and experiment levels of a combined Illumina dataset was found to be highly dependent upon the experimental design. Both UHRR and pools of RNA, which were derived from the samples of interest, modelled technical variation well, although the pools were significantly better correlated (4% average improvement) and better emulated the effects of systematic noise, over all probes, than the UHRRs. The effect of this noise was not uniform over all probes, with low GC-content probes found to be more vulnerable to batch variation than probes with a higher GC-content. Conclusions: The magnitude of systematic processing noise in a microarray experiment is variable across probes and experiments; however, it is generally the case that procedures earlier in the sample-preparation workflow are liable to introduce the most noise. Careful experimental design is important to protect against noise, detailed meta-data should always be provided, and diagnostic procedures should be routinely performed prior to downstream analyses for the detection of bias in microarray studies.

    Changes in the serum proteome associated with the development of hepatocellular carcinoma in hepatitis C-related cirrhosis

    Early diagnosis of hepatocellular carcinoma (HCC) is the key to the delivery of effective therapies. The conventional serological diagnostic test, estimation of serum alpha-fetoprotein (AFP), lacks both sensitivity and specificity as a screening tool, and improved tests are needed to complement ultrasound scanning, the major modality for surveillance of groups at high risk of HCC. We have analysed the serum proteome of 182 patients with hepatitis C-induced liver cirrhosis (77 with HCC) by surface-enhanced laser desorption/ionisation time-of-flight mass spectrometry (SELDI). The patients were split into a training set (84 non-HCC, 60 HCC) and a 'blind' test set (21 non-HCC, 17 HCC). Neural networks developed on the training set were able to classify the blind test set with 94% sensitivity (95% CI 73–99%) and 86% specificity (95% CI 65–95%). Two of the SELDI peaks (23/23.5 kDa) were elevated by an average of 50% in the serum of HCC patients (P<0.001) and were identified as κ and λ immunoglobulin light chains. This approach may permit identification of several individual proteins, which, in combination, may offer a novel way to diagnose HCC.
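
The sensitivity and specificity figures in this abstract can be related to the test-set sizes with a short worked sketch. The abstract does not report the underlying counts; the values below (16 of 17 HCC and 18 of 21 non-HCC classified correctly) are assumptions chosen because they are consistent with the reported 94% and 86%. The use of a Wilson score interval is likewise an assumption, since the abstract does not say how the confidence intervals were computed.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Assumed confusion-matrix counts for the blind test set (not in the abstract).
tp, fn = 16, 1   # HCC patients: correctly / incorrectly classified
tn, fp = 18, 3   # non-HCC patients: correctly / incorrectly classified

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
sens_ci = wilson_interval(tp, tp + fn)
spec_ci = wilson_interval(tn, tn + fp)

print(f"sensitivity {sensitivity:.0%}, CI {sens_ci[0]:.0%} to {sens_ci[1]:.0%}")
print(f"specificity {specificity:.0%}, CI {spec_ci[0]:.0%} to {spec_ci[1]:.0%}")
```

Under these assumed counts the intervals come out at roughly 73–99% and 65–95%, in line with the values quoted in the abstract, which illustrates how wide the uncertainty is for a test set of 38 patients.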

    A machine learning pipeline for quantitative phenotype prediction from genotype data

    Background: Quantitative phenotypes emerge everywhere in systems biology and biomedicine due to a direct interest in quantitative traits, or to high individual variability that makes it hard or impossible to classify samples into distinct categories, as is often the case with complex common diseases. Machine learning approaches to genotype-phenotype mapping may significantly improve Genome-Wide Association Studies (GWAS) results by explicitly focusing on predictivity and optimal feature selection in a multivariate setting. It is however essential that stringent and well documented Data Analysis Protocols (DAP) are used to control sources of variability and ensure reproducibility of results. We present a genome-to-phenotype pipeline of machine learning modules for quantitative phenotype prediction. The pipeline can be applied for the direct use of whole-genome information in functional studies. As a realistic example, the problem of fitting complex phenotypic traits in heterogeneous stock mice from single nucleotide polymorphisms (SNPs) is considered here. Methods: The core element in the pipeline is the L1L2 regularization method based on the naïve elastic net. The method gives at the same time a regression model and a dimensionality reduction procedure suitable for correlated features. Model and SNP markers are selected through a DAP originally developed in the MAQC-II collaborative initiative of the U.S. FDA for the identification of clinical biomarkers from microarray data. The L1L2 approach is compared with standard Support Vector Regression (SVR) and with Recursive Jump Monte Carlo Markov Chain (MCMC). Algebraic indicators of stability of partial lists are used for model selection; the final panel of markers is obtained by a procedure at the chromosome scale, termed 'saturation', to recover SNPs in Linkage Disequilibrium with those selected. Results: With respect to both MCMC and SVR, comparable accuracies are obtained by the L1L2 pipeline. Good agreement is also found between SNPs selected by the L1L2 algorithms and candidate loci previously identified by a standard GWAS. The combination of L1L2-based feature selection with a saturation procedure tackles the issue of neglecting highly correlated features that affects many feature selection algorithms. Conclusions: The L1L2 pipeline has proven effective in terms of marker selection and prediction accuracy. This study indicates that machine learning techniques may support quantitative phenotype prediction, provided that adequate DAPs are employed to control bias in model selection.
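
The idea behind combined L1 and L2 regularization can be sketched as follows. This is a minimal illustration of a naïve elastic net objective, (1/2n)·||y − Xβ||² + λ1·||β||₁ + (λ2/2)·||β||², minimised by proximal gradient descent. The solver, the scaling of the penalties, the function names and the synthetic data are all assumptions made for illustration; the abstract does not describe the pipeline's actual implementation.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink each entry towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def naive_elastic_net(X, y, lam1, lam2, n_iter=500):
    """Proximal-gradient sketch for the naive elastic net (assumed scaling)."""
    n, p = X.shape
    beta = np.zeros(p)
    # Lipschitz constant of the smooth part (least squares + ridge term)
    lipschitz = np.linalg.norm(X, 2) ** 2 / n + lam2
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n + lam2 * beta
        # gradient step on the smooth part, then L1 shrinkage
        beta = soft_threshold(beta - step * grad, step * lam1)
    return beta

# Synthetic stand-in for genotype data: only the first two features matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
true_beta = np.zeros(10)
true_beta[:2] = [3.0, -2.0]
y = X @ true_beta + 0.1 * rng.standard_normal(200)

beta = naive_elastic_net(X, y, lam1=0.5, lam2=0.1)
selected = np.flatnonzero(beta)  # indices of features kept by the model
```

The L1 term drives irrelevant coefficients to zero, which is what makes the fit double as a feature selection step, while the L2 term stabilises the solution when features are correlated, as SNPs in linkage disequilibrium are.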

    Integrated genomic analyses of ovarian carcinoma

    A catalogue of molecular aberrations that cause ovarian cancer is critical for developing and deploying therapies that will improve patients’ lives. The Cancer Genome Atlas project has analysed messenger RNA expression, microRNA expression, promoter methylation and DNA copy number in 489 high-grade serous ovarian adenocarcinomas and the DNA sequences of exons from coding genes in 316 of these tumours. Here we report that high-grade serous ovarian cancer is characterized by TP53 mutations in almost all tumours (96%); low prevalence but statistically recurrent somatic mutations in nine further genes including NF1, BRCA1, BRCA2, RB1 and CDK12; 113 significant focal DNA copy number aberrations; and promoter methylation events involving 168 genes. Analyses delineated four ovarian cancer transcriptional subtypes, three microRNA subtypes, four promoter methylation subtypes and a transcriptional signature associated with survival duration, and shed new light on the impact that tumours with BRCA1/2 (BRCA1 or BRCA2) and CCNE1 aberrations have on survival. Pathway analyses suggested that homologous recombination is defective in about half of the tumours analysed, and that NOTCH and FOXM1 signalling are involved in serous ovarian cancer pathophysiology.
    Funding: National Institutes of Health (U.S.) (Grants U54HG003067, U54HG003273, U54HG003079, U24CA126543, U24CA126544, U24CA126546, U24CA126551, U24CA126554, U24CA126561, U24CA126563, U24CA143882, U24CA143731, U24CA143835, U24CA143845, U24CA143858, U24CA144025, U24CA143866, U24CA143867, U24CA143848, U24CA143843, R21CA135877)

    Transcriptomic changes in human breast cancer progression as determined by serial analysis of gene expression

    INTRODUCTION: Genomic and transcriptomic alterations affecting key cellular processes such as cell proliferation, differentiation and genomic stability are considered crucial for the development and progression of cancer. Most invasive breast carcinomas are known to derive from precursor in situ lesions. It is proposed that major global expression abnormalities occur in the transition from normal to premalignant stages and in further progression to invasive stages. Serial analysis of gene expression (SAGE) was employed to generate a comprehensive global gene expression profile of the major changes occurring during breast cancer malignant evolution. METHODS: In the present study we combined various normal and tumor SAGE libraries available in the public domain with sets of breast cancer SAGE libraries recently generated and sequenced in our laboratory. A recently developed modified t test was used to detect differentially expressed genes. RESULTS: We accumulated a total of approximately 1.7 million breast tissue-specific SAGE tags and monitored the behavior of more than 25,157 genes during early breast carcinogenesis. We detected 52 transcripts commonly deregulated across the board when comparing normal tissue with ductal carcinoma in situ, and 149 transcripts when comparing ductal carcinoma in situ with invasive ductal carcinoma (P < 0.01). CONCLUSION: A major novelty of our study was the use of a statistical method that correctly accounts for the intra-SAGE and inter-SAGE library sources of variation. The most useful result of applying this modified t statistic (beta-binomial) test is the identification of genes and gene families commonly deregulated across samples within each specific stage in the transition from normal to preinvasive and invasive stages of breast cancer development. Most of the gene expression abnormalities detected at the in situ stage were related to specific genes that regulate the homeostasis between cell death and cell proliferation. The comparison of in situ lesions with fully invasive lesions, a much more heterogeneous group, clearly identified the most strongly deregulated group of transcripts as those encoding various families of proteins responsible for extracellular matrix remodeling, invasion and cell motility.

    Biomarkers for cystic fibrosis lung disease: Application of SELDI-TOF mass spectrometry to BAL fluid

    Background: For cystic fibrosis (CF) patients there is a lack of good assays of disease activity and response to new therapeutic interventions, including gene therapy. Current measures of airways inflammation severity are insensitive or non-specific. Methods: Bronchoalveolar lavage fluid from 39 CF children and 38 respiratory disease controls was obtained at bronchoscopy and analysed by surface enhanced laser desorption ionisation time of flight (SELDI-TOF) mass spectrometry. Recognized proteins were assessed for CF disease specificity. Individual protein identification of specific peaks was performed. Results: 1277 proteins/peptides, >4 kDa, were detected using 12 different surfaces and binding conditions. 202 proteins/peptides were differentially expressed in the CF samples (p<0.001), 167 up-regulated and 35 down-regulated. The most discriminatory biomarker had a mass of 5.163 kDa. The most abundant, with a mass of 10.6 kDa, was identified as s100 A8 (calgranulin A). Conclusions: The application of SELDI-TOF mass spectrometry allows evaluation of proteins in BAL fluid avoiding the limitations of only analysing predetermined proteins and potentially identifying proteins not previously appreciated as biomarkers. Its application to cystic fibrosis should enable appropriate evaluation of evolving illness, of gene therapy and other new therapies.