
    Shrinkage Estimation of Expression Fold Change As an Alternative to Testing Hypotheses of Equivalent Expression

    Research on analyzing microarray data has focused on the problem of identifying differentially expressed genes to the neglect of the problem of how to integrate evidence that a gene is differentially expressed with information on the extent of its differential expression. Consequently, researchers currently prioritize genes for further study either on the basis of volcano plots or, more commonly, according to simple estimates of the fold change after filtering the genes with an arbitrary statistical significance threshold. While the subjective and informal nature of the former practice precludes quantification of its reliability, the latter practice is equivalent to using a hard-threshold estimator of the expression ratio that is not known to perform well in terms of mean-squared error, the sum of estimator variance and squared estimator bias. On the basis of two distinct simulation studies and data from different microarray studies, we systematically compared the performance of several estimators representing both current practice and shrinkage. We find that the threshold-based estimators usually perform worse than the maximum-likelihood estimator (MLE) and they often perform far worse as quantified by estimated mean-squared risk. By contrast, the shrinkage estimators tend to perform as well as or better than the MLE and never much worse than the MLE, as expected from what is known about shrinkage. However, a Bayesian measure of performance based on the prior information that few genes are differentially expressed indicates that hard-threshold estimators perform about as well as the local false discovery rate (FDR), the best of the shrinkage estimators studied. 
Based on the ability of the latter to leverage information across genes, we conclude that using the local-FDR estimator of the fold change, instead of informal or threshold-based combinations of statistical tests and non-shrinkage estimators, can be expected to substantially improve the reliability of gene prioritization at very little risk of doing so less reliably.
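The mean-squared-error comparison at the heart of this abstract can be illustrated with a small simulation. The sketch below is illustrative only: it uses a simple normal model for log fold changes, a z-cutoff hard threshold, and a toy linear shrinkage factor as stand-ins for the estimators actually studied; the mixture proportions and variances are assumed values, not the study's simulation design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate log fold changes: most genes null, a few truly changed
# (assumed illustrative parameters, not the paper's setup).
n = 10_000
theta = np.where(rng.random(n) < 0.05, rng.normal(0, 2, n), 0.0)
x = theta + rng.normal(0, 1, n)  # observed per-gene estimate (the MLE)

def hard_threshold(x, t=1.96):
    """Report the observed fold change only for 'significant' genes,
    setting all others to zero -- the post-filtering practice."""
    return np.where(np.abs(x) > t, x, 0.0)

def shrink(x, c=0.5):
    """A toy linear shrinkage toward zero, standing in for the
    empirical-Bayes shrinkage estimators of the study."""
    return c * x

def mse(est):
    """Mean-squared error = variance plus squared bias, averaged over genes."""
    return float(np.mean((est - theta) ** 2))

print(f"MLE            MSE: {mse(x):.3f}")
print(f"hard threshold MSE: {mse(hard_threshold(x)):.3f}")
print(f"shrinkage      MSE: {mse(shrink(x)):.3f}")
```

Under this mostly-null setup, both the hard-threshold rule and shrinkage beat the raw MLE, and their relative ranking depends on the assumed proportion of differentially expressed genes, mirroring the abstract's Bayesian-measure caveat.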

    Methods for peptide identification by spectral comparison

    BACKGROUND: Tandem mass spectrometry followed by database search is currently the predominant technology for peptide sequencing in shotgun proteomics experiments. Most methods compare experimentally observed spectra to the theoretical spectra predicted from the sequences in protein databases. There is a growing interest, however, in comparing unknown experimental spectra to a library of previously identified spectra. This approach has the advantage of taking into account instrument-dependent factors and peptide-specific differences in fragmentation probabilities. It is also computationally more efficient for high-throughput proteomics studies. RESULTS: This paper investigates computational issues related to this spectral comparison approach. Different methods were empirically evaluated over several large sets of spectra. First, we illustrate that the peak intensities follow a Poisson distribution. This implies that applying a square root transform will optimally stabilize the peak intensity variance. Our results show that the square root did indeed outperform other transforms, resulting in improved accuracy of spectral matching. Second, different measures of spectral similarity were compared, and the results illustrated that the correlation coefficient was most robust. Finally, we examine how to assemble multiple spectra associated with the same peptide to generate a synthetic reference spectrum. Ensemble averaging is shown to provide the best combination of accuracy and efficiency. CONCLUSION: Our results demonstrate that, when combined, these methods can boost the sensitivity and specificity of spectral comparison. They are therefore capable of enhancing and complementing existing tools for consistent and accurate peptide identification.
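The three steps evaluated in this abstract can be sketched together on toy data. Everything below is an assumption for illustration: the binned spectra, intensity values, and replicate counts are invented, and the functions are minimal stand-ins for the paper's methods.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binned spectra: replicate measurements of the same peptide, with
# Poisson-distributed peak intensities (invented illustrative values).
true_intensity = np.array([0, 40, 0, 9, 0, 100, 25, 0, 4, 0], dtype=float)
replicates = rng.poisson(true_intensity, size=(5, true_intensity.size))

# 1) Square-root transform: for Poisson counts, Var(sqrt(X)) is roughly
#    constant (about 1/4), which stabilizes peak-intensity variance.
transformed = np.sqrt(replicates)

# 2) Synthetic reference spectrum via ensemble averaging of the replicates.
reference = transformed.mean(axis=0)

# 3) Spectral similarity as a Pearson correlation coefficient.
def similarity(a, b):
    return float(np.corrcoef(a, b)[0, 1])

query = np.sqrt(rng.poisson(true_intensity))        # same peptide, new scan
decoy = np.sqrt(rng.poisson(true_intensity[::-1]))  # a different peak pattern

print(similarity(reference, query))  # typically high
print(similarity(reference, decoy))  # typically much lower
```

The design choice to average transformed replicates, rather than raw counts, keeps the reference spectrum on the same variance-stabilized scale as the query spectra being matched.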

    Host Protein Biomarkers Identify Active Tuberculosis in HIV Uninfected and Co-infected Individuals

    Biomarkers for active tuberculosis (TB) are urgently needed to improve rapid TB diagnosis. The objective of this study was to identify serum protein expression changes associated with TB but not with latent Mycobacterium tuberculosis infection (LTBI), uninfected states, or respiratory diseases other than TB (ORD). Serum samples from 209 HIV uninfected (HIV−) and co-infected (HIV+) individuals were studied. In the discovery phase, samples were analyzed via liquid chromatography and mass spectrometry, and in the verification phase, biologically independent samples were analyzed via a multiplex multiple reaction monitoring mass spectrometry (MRM-MS) assay. Compared to LTBI and ORD, host proteins were significantly differentially expressed in TB and were involved in the immune response, tissue repair, and lipid metabolism. Biomarker panels, whose composition differed according to HIV status and consisted of 8 host proteins in HIV− individuals (CD14, SEPP1, SELL, TNXB, LUM, PEPD, QSOX1, COMP, APOC1) or 10 host proteins in HIV+ individuals (CD14, SEPP1, PGLYRP2, PFN1, VASN, CPN2, TAGLN2, IGFBP6), distinguished TB from ORD with excellent accuracy (AUC = 0.96 for HIV− TB, 0.95 for HIV+ TB). These results warrant validation in larger studies but suggest that host protein biomarkers could be the basis for a rapid, blood-based test for TB.

    Validation of differential gene expression algorithms: Application comparing fold-change estimation to hypothesis testing

    <p>Abstract</p> <p>Background</p> <p>Sustained research on the problem of determining which genes are differentially expressed on the basis of microarray data has yielded a plethora of statistical algorithms, each justified by theory, simulation, or ad hoc validation and yet differing in practical results from equally justified algorithms. Recently, a concordance method that measures agreement among gene lists has been introduced to assess various aspects of differential gene expression detection. This method has the advantage of basing its assessment solely on the results of real data analyses, but as it requires examining gene lists of given sizes, it may be unstable.</p> <p>Results</p> <p>Two methodologies for assessing predictive error are described: a cross-validation method and a posterior predictive method. As a nonparametric method of estimating prediction error from observed expression levels, cross validation provides an empirical approach to assessing algorithms for detecting differential gene expression that is fully justified for large numbers of biological replicates. Because it leverages the knowledge that only a small portion of genes are differentially expressed, the posterior predictive method is expected to provide more reliable estimates of algorithm performance, allaying concerns about limited biological replication. In practice, the posterior predictive method can assess when its approximations are valid and when they are inaccurate. Under conditions in which its approximations are valid, it corroborates the results of cross validation. Both comparison methodologies are applicable to both single-channel and dual-channel microarrays. For the data sets considered, estimating prediction error by cross validation demonstrates that empirical Bayes methods based on hierarchical models tend to outperform algorithms based on selecting genes by their fold changes or by non-hierarchical model-selection criteria.
(The latter two approaches have comparable performance.) The posterior predictive assessment corroborates these findings.</p> <p>Conclusions</p> <p>Algorithms for detecting differential gene expression may be compared by estimating each algorithm's error in predicting expression ratios, whether such ratios are defined across microarray channels or between two independent groups.</p> <p>According to two distinct estimators of prediction error, algorithms using hierarchical models outperform the other algorithms of the study. The fact that fold-change shrinkage performed as well as conventional model selection criteria calls for investigating algorithms that combine the strengths of significance testing and fold-change estimation.</p>
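The cross-validation methodology described above can be sketched concretely: fit each fold-change estimator on a subset of biological replicates, then score it by how well it predicts the held-out replicates. The data generation, the linear shrinkage stand-in for a hierarchical-model estimate, and all parameter values below are assumptions for illustration, not the study's algorithms or data sets.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: log expression ratios for n genes over m biological replicates
# (assumed mostly-null mixture; invented parameter values).
n_genes, n_reps = 2000, 8
true_lfc = np.where(rng.random(n_genes) < 0.1,
                    rng.normal(0, 1.5, n_genes), 0.0)
data = true_lfc[:, None] + rng.normal(0, 1, (n_genes, n_reps))

def fold_change(train):
    """Per-gene mean log ratio (the plain MLE-style estimate)."""
    return train.mean(axis=1)

def shrunken(train, c=0.3):
    """Toy linear shrinkage, a stand-in for an empirical-Bayes estimate."""
    return c * train.mean(axis=1)

def cv_prediction_error(estimator, data, k=2):
    """k-fold cross validation over replicates: estimate fold changes from
    the training replicates, then measure squared error in predicting the
    held-out replicates' expression ratios."""
    folds = np.array_split(np.arange(data.shape[1]), k)
    errs = []
    for held_out in folds:
        train = np.delete(data, held_out, axis=1)
        pred = estimator(train)
        errs.append(np.mean((data[:, held_out] - pred[:, None]) ** 2))
    return float(np.mean(errs))

print(cv_prediction_error(fold_change, data))
print(cv_prediction_error(shrunken, data))
```

Because most genes are null in this setup, the shrunken estimate predicts held-out replicates better than the raw fold change, which is the qualitative pattern the abstract reports for hierarchical-model methods.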

    A Bayesian approach to peptide identification by accurate mass and time tags in proteomics experiments

    The Accurate Mass and Time (AMT) tag approach to high-throughput proteomics uses reversed-phase liquid chromatography (RPLC) coupled to high-accuracy mass spectrometry to measure both the masses and chromatographic retention times of tryptic peptides in complex mixtures. These measurements are matched to the masses and predicted retention times of peptides in a library to identify the associated sequences. This dissertation comprises two journal article manuscripts, a conference paper, and a third manuscript, describing a sequence of Bayesian statistical models addressing key aspects of AMT tag matching. The first manuscript described a statistical model that matched accurate mass measurements to the masses of peptides in a library constructed from partial knowledge of the composition of the sample under analysis. Although no individual match was assigned with high confidence, in aggregate the matches enabled the detection and correction of calibration errors in the mass spectral data. In RPLC, a peptide's relative affinity for the solid and liquid phases, termed "hydrophobicity", is the physical property that determines its retention time. The second manuscript described a statistical model that used a large data set of measured retention times of identified peptides to estimate their hydrophobicity. The conference paper described a model for predicting a peptide's hydrophobicity from its sequence. The parameters of the model were fit using the results of the second manuscript and tested using an independent data set of measured retention times of identified peptides.
Together, the models of the second manuscript and the conference paper provided estimates of hydrophobicity for arbitrary peptide sequences. The third manuscript described a statistical model integrating a retention-time-matching component (made possible by the availability of estimated peptide hydrophobicities) into the first model, yielding probabilities of correctness for matches between AMT tags and the predicted masses and retention times of peptides in a library. The probabilities were validated by comparison with a set of "gold standard" peptide identifications acquired by MS/MS. The accuracy of the model was verified by demonstrating that its predicted receiver operating characteristic (ROC) curve matched the ROC curve generated by the gold standard data set.
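The core idea of combining mass and retention-time evidence into a match probability can be sketched as a naive Bayes calculation. The library entries, peptide names, tolerances, and priors below are all invented for illustration; the dissertation's full model additionally handles calibration error and incorrect-match hypotheses, which this toy omits.

```python
import math

# Hypothetical library: (peptide, monoisotopic mass in Da, predicted
# normalized retention time). All values are invented for illustration.
library = [
    ("ALPHAPEPK", 955.48, 22.0),
    ("BETAPEPR",  955.49, 41.5),  # nearly isobaric with ALPHAPEPK
    ("GAMMAPEPK", 760.39, 22.3),
]

def gauss(x, mu, sigma):
    """Gaussian density, used as the error model for both measurements."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def match_probabilities(obs_mass, obs_rt, sigma_mass=0.01, sigma_rt=1.0):
    """Posterior probability that an observed (mass, RT) feature matches
    each library peptide, assuming independent Gaussian mass and RT errors
    and a uniform prior over the library (toy assumptions)."""
    likes = [gauss(obs_mass, m, sigma_mass) * gauss(obs_rt, t, sigma_rt)
             for _, m, t in library]
    total = sum(likes)
    return {pep: l / total for (pep, _, _), l in zip(library, likes)}

# Mass alone cannot separate the two near-isobaric peptides, but the
# retention time resolves the ambiguity.
probs = match_probabilities(obs_mass=955.485, obs_rt=21.8)
print(probs)
```

This illustrates why adding the retention-time component to a mass-only model sharpens match probabilities: two peptides within mass tolerance of each other can still be far apart in predicted retention time.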

    Shrinkage Estimation of Effect Sizes as an Alternative to Hypothesis Testing Followed by Estimation in High-Dimensional Biology: Applications to Differential Gene Expression

    Research on analyzing microarray data has focused on the problem of identifying differentially expressed genes to the neglect of the problem of how to integrate evidence that a gene is differentially expressed with information on the extent of its differential expression. Consequently, researchers currently prioritize genes for further study either on the basis of volcano plots or, more commonly, according to simple estimates of the fold change after filtering the genes with an arbitrary statistical significance threshold. While the subjective and informal nature of the former practice precludes quantification of its reliability, the latter practice is equivalent to using a hard-threshold estimator of the expression ratio that is not known to perform well in terms of mean-squared error, the sum of estimator variance and squared estimator bias. On the basis of two distinct simulation studies and data from different microarray studies, we systematically compared the performance of several estimators representing both current practice and shrinkage. We find that the threshold-based estimators usually perform worse than the maximum-likelihood estimator (MLE) and they often perform far worse as quantified by estimated mean-squared risk. By contrast, the shrinkage estimators tend to perform as well as or better than the MLE and never much worse than the MLE, as expected from what is known about shrinkage. However, a Bayesian measure of performance based on the prior information that few genes are differentially expressed indicates that hard-threshold estimators perform about as well as the local false discovery rate (FDR), the best of the shrinkage estimators studied. 
Based on the ability of the latter to leverage information across genes, we conclude that the use of the local-FDR estimator of the fold change instead of informal or threshold-based combinations of statistical tests and non-shrinkage estimators can be expected to substantially improve the reliability of gene prioritization at very little risk of doing so less reliably. Since the proposed replacement of post-selection estimates with shrunken estimates applies as well to other types of high-dimensional data, it could also improve the analysis of SNP data from genome-wide association studies.
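The local-FDR estimator favored above can be sketched under a simple two-groups model: each observed log fold change is null with prior probability p0, and the estimate down-weights the observation by its probability of being null. The prior proportion, variances, and closed-form densities below are assumptions for illustration, not the tuned estimator of the study.

```python
import numpy as np

def normal_pdf(x, sd):
    return np.exp(-0.5 * (x / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def lfdr_shrink(x, p0=0.95, tau2=4.0):
    """Toy two-groups local-FDR shrinkage of observed log fold changes.
    Model (assumed): theta = 0 with prior p0, else theta ~ N(0, tau2),
    and x | theta ~ N(theta, 1). Marginally, x ~ N(0, 1) under the null
    and x ~ N(0, 1 + tau2) under the alternative. The estimate is the
    posterior mean: zero with weight lfdr(x), a linearly shrunken x
    otherwise."""
    f0 = normal_pdf(x, 1.0)                 # null marginal density
    f1 = normal_pdf(x, np.sqrt(1.0 + tau2)) # alternative marginal density
    lfdr = p0 * f0 / (p0 * f0 + (1 - p0) * f1)
    shrink_factor = tau2 / (tau2 + 1.0)     # E[theta | x, non-null] = this * x
    return (1.0 - lfdr) * shrink_factor * x, lfdr

est, lfdr = lfdr_shrink(np.array([0.5, 2.0, 4.0]))
print(est)   # small fold changes are shrunk nearly to zero
print(lfdr)  # and carry high local FDR; large ones are mostly retained
```

The appeal noted in the abstract is visible here: the same quantity that ranks genes for significance (the local FDR) also scales the fold-change estimate, so testing and estimation are combined rather than applied as a hard filter.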

    Competition between Heterochromatic Loci Allows the Abundance of the Silencing Protein, Sir4, to Regulate de novo Assembly of Heterochromatin

    Changes in the locations and boundaries of heterochromatin are critical during development, and de novo assembly of silent chromatin in budding yeast is a well-studied model for how new sites of heterochromatin assemble. De novo assembly cannot occur in the G1 phase of the cell cycle, and one to two divisions are needed for complete silent chromatin assembly and transcriptional repression. Mutations of DOT1, the histone H3 lysine 79 (K79) methyltransferase, and SET1, the histone H3 lysine 4 (K4) methyltransferase, speed de novo assembly. These observations have led to the model that regulated demethylation of histones may be a mechanism by which cells control the establishment of heterochromatin. We find that the abundance of Sir4, a protein required for the assembly of silent chromatin, decreases dramatically during a G1 arrest, and we therefore tested whether changing the levels of Sir4 would alter the speed of de novo establishment. Halving the level of Sir4 slows heterochromatin establishment, while increasing Sir4 speeds establishment. yku70Δ and ubp10Δ cells also speed de novo assembly and, like dot1Δ cells, have defects in subtelomeric silencing, suggesting that these mutants may indirectly speed de novo establishment by liberating Sir4 from telomeres. Deleting RIF1 and RIF2, which suppresses the subtelomeric silencing defects in these mutants, rescues the advanced de novo establishment in yku70Δ and ubp10Δ cells, but not in dot1Δ cells, suggesting that YKU70 and UBP10 regulate Sir4 availability by modulating subtelomeric silencing, while DOT1 functions directly to regulate establishment. Our data support a model whereby the demethylation of histone H3 K79 and changes in Sir4 abundance and availability define two rate-limiting steps that regulate de novo assembly of heterochromatin.