5 research outputs found

    Décrypter les données omiques : importance du contrôle qualité. Application au cancer de l'ovaire

    Deciphering omics data: on the importance of quality control. Application to ovarian cancer. Over the past 10 years, the size and complexity of biological data have exploded, and quality control is critical to interpreting them correctly. Indeed, omics data (high-throughput genomic and post-genomic data) are often incomplete and contain biases and errors that can easily be misinterpreted as biologically interesting findings. In this work, we show that literature-curated and high-throughput protein-protein interaction data, usually considered independent, are in fact significantly correlated. We examine the yeast interactome from a new perspective by taking into account how thoroughly proteins have been studied, and our results show that this bias can be corrected for by focusing on well-studied proteins. We thus propose a simple and reliable method to estimate the size of an interactome, combining literature-curated data involving well-studied proteins with high-throughput data. It yields an estimate of at least 37,600 direct physical protein-protein interactions in S. cerevisiae, a significant increase over previous estimates. We then focus on next-generation DNA sequencing (NGS) data. An analysis of the bias between short reads aligned on each strand of the genome allows us to highlight numerous systematic errors. Furthermore, we observe many positions where between 20 and 40% of reads carry the variant allele: these cannot be genotyped correctly. We then propose a method to overcome these biases and reliably call genotypes from NGS data. Finally, we apply our method to exome-seq data produced by the TCGA for tumor and matched normal samples from 520 ovarian cancer patients. We detect on average 30,632 germline variants per patient. Through an integrative approach, we then identify those that are likely to increase cancer risk: in particular, we focus on variants inducing a loss of function of the encoded protein, and select those that are significantly more frequent in the patients than in the general population. We find 44 SNVs per patient on average, impacting 334 genes overall in the cohort. Among these genes, 42 have been previously reported as involved in carcinogenesis, confirming that our list is highly enriched in ovarian cancer susceptibility genes. In particular, our results confirm the tumor suppressor role of the MAP3K8 protein, recently identified in other types of cancer.

    New insights into protein-protein interaction data lead to increased estimates of the S. cerevisiae interactome size

    Background: As protein interactions mediate most cellular mechanisms, protein-protein interaction networks are essential in the study of cellular processes. Consequently, several large-scale interactome mapping projects have been undertaken, and protein-protein interactions are being distilled into databases through literature curation; yet protein-protein interaction data are still far from comprehensive, even in the model organism Saccharomyces cerevisiae. Estimating the interactome size is important for evaluating the completeness of current datasets, in order to measure the remaining effort required. Results: We examined the yeast interactome from a new perspective, by taking into account how thoroughly proteins have been studied. We discovered that the set of literature-curated protein-protein interactions is qualitatively different when restricted to proteins that have received extensive attention from the scientific community. In particular, these interactions are less often supported by yeast two-hybrid, and more often by more complex experiments such as biochemical activity assays. Our analysis showed that high-throughput and literature-curated interactome datasets are more correlated than commonly assumed, but that this bias can be corrected for by focusing on well-studied proteins. We thus propose a simple and reliable method to estimate the size of an interactome, combining literature-curated data involving well-studied proteins with high-throughput data. It yields an estimate of at least 37,600 direct physical protein-protein interactions in S. cerevisiae. Conclusions: Our method leads to higher and more accurate estimates of the interactome size, as it accounts for interactions that are genuine yet difficult to detect with commonly used experimental assays. This shows that we are even further from completing the yeast interactome map than previously expected.

    Décrypter les données omiques : importance du contrôle qualité. Application au cancer de l'ovaire

    Deciphering omics data: on the importance of quality control. Application to ovarian cancer. Over the past 10 years, the size and complexity of biological data have exploded, and quality control is critical to interpreting them correctly. Indeed, omics data (high-throughput genomic and post-genomic data) are often incomplete and contain biases and errors that can easily be misinterpreted as biologically interesting findings. In this work, we show that literature-curated and high-throughput protein-protein interaction data, usually considered independent, are in fact significantly correlated. We examine the yeast interactome from a new perspective by taking into account how thoroughly proteins have been studied, and our results show that this bias can be corrected for by focusing on well-studied proteins. We thus propose a simple and reliable method to estimate the size of an interactome, combining literature-curated data involving well-studied proteins with high-throughput data. It yields an estimate of at least 37,600 direct physical protein-protein interactions in S. cerevisiae, a significant increase over previous estimates. We then focus on next-generation DNA sequencing (NGS) data. An analysis of the bias between short reads aligned on each strand of the genome allows us to highlight numerous systematic errors. Furthermore, we observe many positions where between 20 and 40% of reads carry the variant allele: these cannot be genotyped correctly. We then propose a method to overcome these biases and reliably call genotypes from NGS data. Finally, we apply our method to exome-seq data produced by the TCGA for tumor and matched normal samples from 520 ovarian cancer patients. We detect on average 30,632 germline variants per patient. Through an integrative approach, we then identify those that are likely to increase cancer risk: in particular, we focus on variants inducing a loss of function of the encoded protein, and select those that are significantly more frequent in the patients than in the general population. We find 44 SNVs per patient on average, impacting 334 genes overall in the cohort. Among these genes, 42 have been previously reported as involved in carcinogenesis, confirming that our list is highly enriched in ovarian cancer susceptibility genes. In particular, our results confirm the tumor suppressor role of the MAP3K8 protein, recently identified in other types of cancer.

    Decipher omics data, on the importance of quality control.

    Deciphering omics data: on the importance of quality control. Application to ovarian cancer. Over the past 10 years, the size and complexity of biological data have exploded, and quality control is critical to interpreting them correctly. Indeed, omics data (high-throughput genomic and post-genomic data) are often incomplete and contain biases and errors that can easily be misinterpreted as biologically interesting findings. In this work, we show that literature-curated and high-throughput protein-protein interaction data, usually considered independent, are in fact significantly correlated. We examine the yeast interactome from a new perspective by taking into account how thoroughly proteins have been studied, and our results show that this bias can be corrected for by focusing on well-studied proteins. We thus propose a simple and reliable method to estimate the size of an interactome, combining literature-curated data involving well-studied proteins with high-throughput data. It yields an estimate of at least 37,600 direct physical protein-protein interactions in S. cerevisiae, a significant increase over previous estimates. We then focus on next-generation DNA sequencing (NGS) data. An analysis of the bias between short reads aligned on each strand of the genome allows us to highlight numerous systematic errors. Furthermore, we observe many positions where between 20 and 40% of reads carry the variant allele: these cannot be genotyped correctly. We then propose a method to overcome these biases and reliably call genotypes from NGS data. Finally, we apply our method to exome-seq data produced by the TCGA for tumor and matched normal samples from 520 ovarian cancer patients. We detect on average 30,632 germline variants per patient. Through an integrative approach, we then identify those that are likely to increase cancer risk: in particular, we focus on variants inducing a loss of function of the encoded protein, and select those that are significantly more frequent in the patients than in the general population. We find 44 SNVs per patient on average, impacting 334 genes overall in the cohort. Among these genes, 42 have been previously reported as involved in carcinogenesis, confirming that our list is highly enriched in ovarian cancer susceptibility genes. In particular, our results confirm the tumor suppressor role of the MAP3K8 protein, recently identified in other types of cancer.

    A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium

    We present primary results from the Sequencing Quality Control (SEQC) project, coordinated by the US Food and Drug Administration. Examining Illumina HiSeq, Life Technologies SOLiD and Roche 454 platforms at multiple laboratory sites using reference RNA samples with built-in controls, we assess RNA sequencing (RNA-seq) performance for junction discovery and differential expression profiling and compare it to microarray and quantitative PCR (qPCR) data using complementary metrics. At all sequencing depths, we discover unannotated exon-exon junctions, with >80% validated by qPCR. We find that measurements of relative expression are accurate and reproducible across sites and platforms if specific filters are used. In contrast, RNA-seq and microarrays do not provide accurate absolute measurements, and gene-specific biases are observed for all examined platforms, including qPCR. Measurement performance depends on the platform and data analysis pipeline, and variation is large for transcript-level profiling. The complete SEQC data sets, comprising >100 billion reads (10Tb), provide unique resources for evaluating RNA-seq analyses for clinical and regulatory settings