124 research outputs found

    Statistical analysis and significance testing of serial analysis of gene expression data using a Poisson mixture model

    Background: Serial analysis of gene expression (SAGE) is used to obtain quantitative snapshots of the transcriptome. These profiles are count-based and are assumed to follow a binomial or Poisson distribution. However, tag counts observed across multiple libraries (for example, one or more groups of biological replicates) show additional variance that cannot be accommodated by this assumption alone. Several models have been proposed to account for this effect, all of which use a continuous prior distribution to explain the excess variance. Here, a Poisson mixture model, which assumes that excess variability arises from sampling a mixture of distinct components, is proposed, and the merits of this model are discussed and evaluated. Results: The goodness of fit of the Poisson mixture model on 15 sets of biological SAGE replicates is compared to the previously proposed hierarchical gamma-Poisson (negative binomial) model, and a substantial improvement is seen. Two further observations support the mixture model: 1) more mixture components are needed to fit the expression of tags representing more than one transcript; and 2) components tend to cluster libraries into the same groups. A confidence score is presented that can identify tags that are differentially expressed between groups of SAGE libraries, and several examples where this test outperforms previously proposed tests are highlighted. Conclusion: The Poisson mixture model performs well as a) a method to represent SAGE data from biological replicates, and b) a basis for assigning significance when testing for differential expression between multiple groups of replicates. Code for the R statistical software package is included to assist investigators in applying this model to their own data.
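    The abstract refers to accompanying R code but does not reproduce it here. As a rough, hedged sketch of the underlying idea, the Python fragment below fits a two-component Poisson mixture to a vector of tag counts from replicate libraries by EM and compares its log-likelihood with a moment-based negative binomial fit. The two-component choice, the toy counts, and the function names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' R code): fit a K-component Poisson
# mixture to tag counts observed across replicate SAGE libraries by EM,
# then compare its log-likelihood with a crude negative binomial fit.
import numpy as np
from scipy.stats import poisson, nbinom

def fit_poisson_mixture(counts, k=2, n_iter=200, seed=0):
    """EM for a K-component Poisson mixture on a vector of tag counts."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    lam = rng.uniform(counts.min() + 0.5, counts.max() + 0.5, size=k)  # component means
    pi = np.full(k, 1.0 / k)                                           # mixing weights
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each count
        log_r = np.log(pi) + poisson.logpmf(counts[:, None], lam[None, :])
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights and component means
        nk = r.sum(axis=0)
        pi = nk / len(counts)
        lam = (r * counts[:, None]).sum(axis=0) / nk
    loglik = np.log(np.exp(poisson.logpmf(counts[:, None], lam[None, :])) @ pi).sum()
    return lam, pi, loglik

# Toy data: one tag observed in 15 replicate libraries (counts per library).
tag_counts = np.array([3, 5, 4, 6, 2, 30, 28, 35, 4, 5, 3, 33, 29, 6, 31])

lam, pi, ll_mix = fit_poisson_mixture(tag_counts, k=2)

# Negative binomial fit by the method of moments, for comparison only.
m, v = tag_counts.mean(), tag_counts.var(ddof=1)
p = m / v if v > m else 0.99
n = m * p / (1 - p)
ll_nb = nbinom.logpmf(tag_counts, n, p).sum()

print("mixture means:", lam, "weights:", pi)
print("log-likelihood, mixture vs NB:", ll_mix, ll_nb)
```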

    Bias correction and Bayesian analysis of aggregate counts in SAGE libraries

    Background: Tag-based techniques, such as SAGE, are commonly used to sample the mRNA pool of an organism's transcriptome. Incomplete digestion during the tag formation process may allow multiple tags to be generated from a given mRNA transcript, and the probability of forming a tag varies with its relative location. As a result, the observed tag counts represent a biased sample of the actual transcript pool. In SAGE this bias can be avoided by ignoring all but the 3'-most tag, but doing so discards a large fraction of the observed data. Taking the bias into account should allow more of the available data to be used, leading to increased statistical power. Results: Three new hierarchical models, which directly embed a model for the variation in tag formation probability, are proposed and their associated Bayesian inference algorithms are developed. These models may be applied to libraries at both the tag and the aggregate level. Simulation experiments and analysis of real data are used to contrast the accuracy of the various methods. The consequences of tag formation bias are discussed in the context of testing differential expression, and a description is given of how these algorithms can be applied in that context. Conclusions: Several Bayesian inference algorithms that account for tag formation effects are compared, with the DPB algorithm providing clear evidence of superior performance. The accuracy of inferences when using a particular non-informative prior is found to depend on the expression level of a given gene. The multivariate nature of the approach easily allows both univariate and joint tests of differential expression. Calculations demonstrate the potential for false positive and false negative findings due to variation in tag formation probabilities across samples when testing for differential expression.
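    As a hedged illustration of the bias described above (not the paper's hierarchical models or the DPB algorithm), the simulation below assumes each anchoring-enzyme site is cleaved independently with probability p, so that the i-th site from the 3' end yields a tag with probability roughly p(1-p)^(i-1). The digestion probability, number of sites, and variable names are made up for demonstration.

```python
# Hedged illustration: simulate how incomplete digestion biases observed
# SAGE tag counts toward the 3'-most anchoring-enzyme site.
import numpy as np

rng = np.random.default_rng(1)

p_cut = 0.7              # assumed probability that any given site is cleaved
n_sites = 4              # sites in one transcript, ordered from the 3' end
n_transcripts = 100_000  # copies of the transcript sampled from the mRNA pool

# For each transcript copy, the tag forms at the first cleaved site from the 3' end.
cuts = rng.random((n_transcripts, n_sites)) < p_cut
first_cut = np.where(cuts.any(axis=1), cuts.argmax(axis=1), -1)  # -1: no tag formed

observed = np.bincount(first_cut[first_cut >= 0], minlength=n_sites)
expected = p_cut * (1 - p_cut) ** np.arange(n_sites) * n_transcripts

print("observed tag counts per site (3' -> 5'):", observed)
print("expected under p(1-p)^(i-1):            ", expected.round())
print("fraction of data discarded if only the 3'-most tag is kept:",
      1 - observed[0] / observed.sum())
```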

    Comparing the old and new generation SELDI-TOF MS: implications for serum protein profiling

    Background: Although the PBS-IIc SELDI-TOF MS apparatus has been extensively used in the search for better biomarkers, issues have been raised concerning the semi-quantitative nature of the technique and its reproducibility. To overcome these limitations, a new SELDI-TOF MS instrument has been introduced: the PCS 4000 series. Changes in this apparatus compared to the older one include an increased dynamic range of the detector, an adjusted configuration of the detector sensitivity, a raster scan that ensures more complete desorption coverage, and an improved detector attenuation mechanism. In the current study, we evaluated the performance of the old PBS-IIc and the new PCS 4000 series SELDI-TOF MS apparatus. Methods: To this end, two different sample sets were profiled, after which the same ProteinChip arrays were analysed successively by both instruments. Generated spectra were analysed by the associated software packages. The performance of both instruments was evaluated by assessing the number of peaks detected in the two sample sets, the biomarker potential and reproducibility of the generated peak clusters, and the number of peaks detected following serum fractionation. Results: We could not confirm the claimed improved performance of the new PCS 4000 instrument, as assessed by the number of peaks detected, the biomarker potential and the reproducibility. However, the PCS 4000 instrument did prove superior in peak detection following profiling of serum fractions. Conclusion: As serum fractionation facilitates detection of low-abundance proteins through reduction of the dynamic range of serum proteins, it is now increasingly applied in the search for new potential biomarkers. Hence, although the new PCS 4000 instrument did not differ from the old PBS-IIc apparatus in the analysis of crude serum, its superior performance after serum fractionation does hold promise for improved biomarker detection and identification.

    Biasogram: visualization of confounding technical bias in gene expression data.

    Gene expression profiles of clinical cohorts can be used to identify genes that are correlated with a clinical variable of interest such as patient outcome or response to a particular drug. However, expression measurements are susceptible to technical bias caused by variation in extraneous factors such as RNA quality and array hybridization conditions. If such technical bias is correlated with the clinical variable of interest, the likelihood of identifying false positive genes is increased. Here we describe a method to visualize an expression matrix as a projection of all genes onto a plane defined by a clinical variable and a technical nuisance variable. The resulting plot indicates the extent to which each gene is correlated with the clinical variable or the technical variable. We demonstrate this method by applying it to three clinical trial microarray data sets; in one of these, the analysis identified genes that may have been driven by a confounding technical variable. This approach can be used as a quality control step to identify data sets that are likely to yield false positive results.
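    A minimal sketch of the projection described above, assuming gene-wise Pearson correlations on simulated data; it is not the authors' published implementation. Each gene is plotted by its correlation with the clinical variable and with the technical nuisance variable, so genes far from the vertical axis but close to the horizontal axis are the ones most likely to reflect genuine clinical signal rather than technical bias.

```python
# Minimal sketch of the biasogram idea: project every gene onto the plane
# spanned by its correlation with a clinical variable and a technical
# nuisance variable. Assumed implementation, not the authors' package.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)

n_genes, n_samples = 2000, 60
expr = rng.normal(size=(n_genes, n_samples))   # genes x samples expression matrix
clinical = rng.normal(size=n_samples)          # e.g. outcome score
technical = rng.normal(size=n_samples)         # e.g. RNA quality measure

def gene_correlations(expr, variable):
    """Pearson correlation of each gene (row) with a sample-level variable."""
    x = expr - expr.mean(axis=1, keepdims=True)
    y = variable - variable.mean()
    return (x @ y) / (np.sqrt((x ** 2).sum(axis=1)) * np.sqrt((y ** 2).sum()))

r_clin = gene_correlations(expr, clinical)
r_tech = gene_correlations(expr, technical)

plt.scatter(r_clin, r_tech, s=4, alpha=0.4)
plt.axhline(0, lw=0.5); plt.axvline(0, lw=0.5)
plt.xlabel("correlation with clinical variable")
plt.ylabel("correlation with technical variable")
plt.title("Biasogram-style projection (simulated data)")
plt.show()
```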

    Clustering-based approaches to SAGE data mining

    Serial analysis of gene expression (SAGE) is one of the most powerful tools for global gene expression profiling. It has led to several biological discoveries and biomedical applications, such as the prediction of new gene functions and the identification of biomarkers in human cancer research. Clustering techniques have become fundamental approaches in these applications. This paper reviews relevant clustering techniques specifically designed for this type of data. It places an emphasis on current limitations and opportunities in this area for supporting biologically meaningful data mining and visualisation.

    Methodological Deficits in Diagnostic Research Using ‘-Omics’ Technologies: Evaluation of the QUADOMICS Tool and Quality of Recently Published Studies

    Background: QUADOMICS is an adaptation of QUADAS (a quality assessment tool for use in systematic reviews of diagnostic accuracy studies) that takes into account the particular challenges presented by '-omics'-based technologies. Our primary objective was to evaluate the applicability and consistency of QUADOMICS. Subsequently we evaluated and describe the methodological quality of a sample of recently published studies using the tool. Methodology/Principal Findings: 45 '-omics'-based diagnostic studies were identified by systematic search of PubMed using suitable MeSH terms ('Genomics', 'Sensitivity and specificity', 'Diagnosis'). Three investigators independently assessed the quality of the articles using QUADOMICS and met to compare observations and generate a consensus. Consistency and applicability were assessed by comparing each reviewer's original rating with the consensus. Methodological quality was described using the consensus rating. Agreement was above 80% for all three reviewers. Four items presented difficulties with application, mostly due to the lack of a clearly defined gold standard. Methodological quality of our sample was poor; studies met roughly half of the applied criteria (mean ± sd, 54.7 ± 18.4%). Few studies were carried out in a population that mirrored the clinical situation in which the test would be used in practice (6, 13.3%); none described patient recruitment sufficiently; and less than half described clinical and physiological factors that might influence the biomarker profile (20, 44.4%). Conclusions: The QUADOMICS tool can be applied consistently to diagnostic '-omics' studies presently published in biomedical journals. A substantial proportion of reports in this research field fail to address design issues that are fundamental to making inferences relevant for patient care. © 2010 Parker et al. This work was supported by the Spanish Agency for Health Technology Assessment (Exp PI06/90311), Instituto de Salud Carlos III, and CIBER en Epidemiología y Salud Pública (CIBERESP), Spain. Peer reviewed.

    Comparison of normalisation methods for surface-enhanced laser desorption and ionisation (SELDI) time-of-flight (TOF) mass spectrometry data

    Background: Mass spectrometry for biological data analysis is an active field of research, providing an efficient way of high-throughput proteome screening. A popular variant of mass spectrometry is SELDI, which is often used to measure sample populations with the goal of developing (clinical) classifiers. Unfortunately, not only is the data resulting from such measurements quite noisy, but variance between replicate measurements of the same sample can be high as well. Normalisation of spectra can greatly reduce the effect of this technical variance and further improve the quality and interpretability of the data. However, it is unclear which normalisation method yields the most informative result. Results: In this paper, we describe the first systematic comparison of a wide range of normalisation methods, using two objectives that should be met by a good method: minimisation of inter-spectra variance and maximisation of signal with respect to class separation. The former is assessed using an estimation of the coefficient of variation, the latter using the classification performance of three types of classifiers on real-world datasets representing two-class diagnostic problems. To obtain a maximally robust evaluation of a normalisation method, both objectives are evaluated over multiple datasets and multiple configurations of baseline correction and peak detection methods. Results are assessed for statistical significance and visualised to reveal the performance of each normalisation method, in particular with respect to using no normalisation. The normalisation methods described have been implemented in the freely available MASDA R-package. Conclusion: In the general case, normalisation of mass spectra is beneficial to the quality of data. The majority of methods we compared performed significantly better than the case in which no normalisation was used. We have shown that normalisation methods that scale spectra by a factor based on the dispersion (e.g., standard deviation) of the data clearly outperform those where a factor based on the central location (e.g., mean) is used. Additional improvements in performance are obtained when these factors are estimated locally, using a sliding window within spectra, instead of globally, over full spectra. The underperforming category of methods using a globally estimated factor based on the central location of the data includes the method used by the majority of SELDI users.
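    As a hedged sketch of the central-location versus dispersion distinction drawn in the conclusion, the code below scales simulated replicate spectra either by their mean or by their standard deviation and reports the remaining median coefficient of variation. The locally estimated (sliding-window) variants and the MASDA package itself are not reproduced; the simulated data and numbers are illustrative only.

```python
# Hedged sketch: compare central-location vs dispersion-based normalisation
# of replicate spectra by the residual coefficient of variation (CV).
# Mirrors the abstract's comparison in spirit only; not the MASDA code.
import numpy as np

rng = np.random.default_rng(3)

n_spectra, n_points = 20, 5000
base = np.abs(np.sin(np.linspace(0, 20, n_points))) + 0.1   # shared "true" signal
scale = rng.uniform(0.5, 2.0, size=n_spectra)               # per-spectrum technical scaling
spectra = scale[:, None] * base[None, :] + rng.normal(0, 0.02, (n_spectra, n_points))

def normalise(spectra, factor):
    """Divide each spectrum by its per-spectrum scaling factor."""
    return spectra / factor[:, None]

by_mean = normalise(spectra, spectra.mean(axis=1))   # central-location factor
by_std = normalise(spectra, spectra.std(axis=1))     # dispersion factor

def median_cv(spectra):
    """Median across m/z points of the between-spectrum coefficient of variation."""
    return np.median(spectra.std(axis=0) / spectra.mean(axis=0))

print("median CV, raw:        ", round(median_cv(spectra), 4))
print("median CV, mean-scaled:", round(median_cv(by_mean), 4))
print("median CV, std-scaled: ", round(median_cv(by_std), 4))
```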

    Multicentric validation of proteomic biomarkers in urine specific for diabetic nephropathy

    Background: Urine proteome analysis is rapidly emerging as a tool for diagnosis and prognosis in disease states. For diagnosis of diabetic nephropathy (DN), urinary proteome analysis was successfully applied in a pilot study. The validity of the previously established proteomic biomarkers with respect to their diagnostic and prognostic potential was assessed on a separate set of patients recruited at three different European centers. In this case-control study of 148 Caucasian patients with type 2 diabetes mellitus and a diabetes duration of ≥5 years, cases of DN were defined by albuminuria >300 mg/d and diabetic retinopathy (n = 66). Controls were matched for gender and diabetes duration (n = 82). Methodology/Principal Findings: Proteome analysis was performed blinded using high-resolution capillary electrophoresis coupled with mass spectrometry (CE-MS). Data were evaluated employing the previously developed model for DN. Upon unblinding, the model for DN showed 93.8% sensitivity and 91.4% specificity, with an AUC of 0.948 (95% CI 0.898-0.978). Of 65 previously identified peptides, 60 were significantly different between cases and controls of this study. In fewer than 10% of cases and controls, classification by proteome analysis did not fully agree with the clinical classification. Analysis of the patients' subsequent clinical course revealed later progression to DN in some of the control patients classified as false positives. Conclusions: These data provide the first independent confirmation that profiling of the urinary proteome by CE-MS can adequately identify subjects with DN, supporting the generalizability of this approach. The data further establish urinary collagen fragments as biomarkers for diabetes-induced renal damage that may serve as earlier and more specific biomarkers than the currently used urinary albumin.
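    For readers who want to compute the same kind of validation summary (sensitivity, specificity, AUC with a confidence interval) on their own blinded classifier scores, the generic sketch below shows one way to do so with scikit-learn. The simulated scores, the decision threshold, and the bootstrap interval are illustrative assumptions, not the study's CE-MS model or its reported numbers.

```python
# Generic sketch of blinded-validation metrics (sensitivity, specificity,
# AUC with a bootstrap CI); the scores below are simulated, not the
# study's CE-MS classifier output.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

rng = np.random.default_rng(4)

# Simulated blinded validation set: 66 cases (DN) and 82 matched controls.
y_true = np.concatenate([np.ones(66, int), np.zeros(82, int)])
scores = np.concatenate([rng.normal(1.5, 1.0, 66), rng.normal(-1.0, 1.0, 82)])

y_pred = (scores > 0).astype(int)                 # assumed decision threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
auc = roc_auc_score(y_true, scores)

# Simple bootstrap 95% CI for the AUC.
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), len(y_true))
    if len(np.unique(y_true[idx])) == 2:          # need both classes in the resample
        boot.append(roc_auc_score(y_true[idx], scores[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"sensitivity {sensitivity:.3f}, specificity {specificity:.3f}")
print(f"AUC {auc:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```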

    An integrative multi-platform analysis for discovering biomarkers of osteosarcoma

    Background: SELDI-TOF-MS (Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometry) has become an attractive approach for cancer biomarker discovery due to its ability to resolve low-mass proteins and its high-throughput capability. However, the analytes from mass spectrometry are described only by their mass-to-charge ratio (m/z) values, without further identification and annotation. To discover potential biomarkers for early diagnosis of osteosarcoma, we designed an integrative workflow combining data sets from both SELDI-TOF-MS and gene microarray analysis. Methods: After extracting the information for potential biomarkers from the SELDI data and the microarray analysis, their associations were further inferred by link-test to identify biomarkers that could likely be used for diagnosis. Immunoblot analysis was then performed to examine whether the expression of the putative biomarkers was indeed altered in serum from patients with osteosarcoma. Results: Six differentially expressed protein peaks with strong statistical significance were detected by SELDI-TOF-MS; four of the proteins were up-regulated and two were down-regulated. Microarray analysis showed that, compared with an osteoblastic cell line, the expression of 653 genes was changed more than 2-fold in three osteosarcoma cell lines: expression of 310 genes was increased and expression of the other 343 genes was decreased. The two sets of biomarker candidates were combined by the link-test statistics, indicating that 13 genes were potential biomarkers for early diagnosis of osteosarcoma. Among these genes, cytochrome c1 (CYC-1) was selected for further experimental validation. Conclusion: Link-test on datasets from both SELDI-TOF-MS and microarray high-throughput analysis can accelerate the identification of tumor biomarkers. The result confirmed that CYC-1 may be a promising biomarker for early diagnosis of osteosarcoma.
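    The link-test statistic itself is not described in the abstract; the fragment below is only a naive, assumed illustration of the integrative step, matching SELDI peak masses against the approximate protein masses of differentially expressed genes and requiring a consistent direction of change. The mass tolerance, the hypothetical gene entries, and the peak values are invented for demonstration.

```python
# Naive illustration of the integrative step (not the link-test statistic):
# match SELDI peak masses against approximate protein masses of
# differentially expressed genes and keep overlapping candidates.
# All data below are made up for demonstration.

# SELDI peaks: (m/z value, direction of change in osteosarcoma serum)
seldi_peaks = [(11700.0, "up"), (13800.0, "down"), (27900.0, "up")]

# Microarray candidates: gene -> (approx. protein mass in Da, fold change)
array_candidates = {
    "CYC1": (27800.0, 2.6),     # cytochrome c1, up-regulated (from the abstract)
    "GENE_A": (11650.0, -3.1),  # hypothetical down-regulated gene
    "GENE_B": (45000.0, 2.2),   # hypothetical gene with no matching peak
}

MASS_TOLERANCE = 150.0  # assumed matching window in Da

linked = []
for gene, (mass, fold) in array_candidates.items():
    for mz, direction in seldi_peaks:
        same_direction = (fold > 0) == (direction == "up")
        if abs(mass - mz) <= MASS_TOLERANCE and same_direction:
            linked.append((gene, mz, fold))

print("candidate biomarkers supported by both platforms:", linked)
```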