11 research outputs found

    Knowledge Discovery in Biological Databases for Revealing Candidate Genes Linked to Complex Phenotypes

    Genetics and “omics” studies designed to uncover genotype-to-phenotype relationships often identify large numbers of potential candidate genes, among which the causal genes are hidden. Scientists generally lack the time and technical expertise to review all the relevant information available from the literature, from key model species, and from a potentially wide range of related biological databases in a variety of data formats with variable quality and coverage. Computational tools are needed for the integration and evaluation of heterogeneous information in order to prioritise candidate genes and components of interaction networks that, if perturbed through potential interventions, have a positive impact on the biological outcome in the whole organism without producing negative side effects. Here we review several bioinformatics tools and databases that play an important role in biological knowledge discovery and candidate gene prioritization. We conclude with several key challenges that need to be addressed in order to facilitate biological knowledge discovery in the future.
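One common pattern behind the prioritization tools reviewed above is to combine normalised evidence scores from heterogeneous sources into a single ranking. A minimal sketch of that idea follows; the gene names, evidence sources, and averaging scheme are all illustrative, not taken from any specific tool in the review.

```python
# Hypothetical sketch: rank candidate genes by averaging normalised
# evidence scores (each in [0, 1]) from several databases.
def prioritise(genes):
    """genes: dict mapping gene name -> dict of evidence-source scores."""
    ranked = sorted(
        genes.items(),
        key=lambda item: sum(item[1].values()) / len(item[1]),
        reverse=True,  # highest combined evidence first
    )
    return [gene for gene, _ in ranked]

# Illustrative candidates with made-up scores.
candidates = {
    "GENE_A": {"literature": 0.9, "coexpression": 0.7, "orthology": 0.8},
    "GENE_B": {"literature": 0.2, "coexpression": 0.4, "orthology": 0.3},
    "GENE_C": {"literature": 0.6, "coexpression": 0.9, "orthology": 0.5},
}
print(prioritise(candidates))  # ['GENE_A', 'GENE_C', 'GENE_B']
```

Real tools weight sources by reliability and coverage rather than averaging uniformly, but the ranking structure is the same.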

    Mutations in DYNC2LI1 disrupt cilia function and cause short rib polydactyly syndrome.

    The short rib polydactyly syndromes (SRPSs) are a heterogeneous group of autosomal recessive, perinatally lethal skeletal disorders characterized primarily by short, horizontal ribs, short limbs and polydactyly. Mutations in several genes affecting intraflagellar transport (IFT) cause SRPS, but they do not account for all cases. Here we identify an additional SRPS gene and further unravel the functional basis of IFT. We perform whole-exome sequencing and identify mutations in a new disease-producing gene, cytoplasmic dynein-2 light intermediate chain 1, DYNC2LI1, segregating with disease in three families. Using primary fibroblasts, we show that DYNC2LI1 is essential for dynein-2 complex stability and that mutations in DYNC2LI1 result in cilia of variable length, including hyperelongated cilia, as well as Hedgehog pathway impairment and ciliary IFT accumulations. The findings in this study expand our understanding of SRPS locus heterogeneity and demonstrate the importance of DYNC2LI1 in dynein-2 complex stability, cilium function, Hedgehog regulation and skeletogenesis.

    Deciphering omics data: on the importance of quality control. Application to ovarian cancer

    Over the past 10 years, the size and complexity of biological data have exploded, and quality control is critical for interpreting them correctly. Indeed, omics data (high-throughput genomic and post-genomic data) are often incomplete and contain biases and errors that can easily be misinterpreted as biologically interesting findings. In this work, we show that literature-curated and high-throughput protein-protein interaction data, usually considered independent, are in fact significantly correlated. We examine the yeast interactome from a new perspective by taking into account how thoroughly proteins have been studied, and our results show that this bias can be corrected for by focusing on well-studied proteins. We thus propose a simple and reliable method to estimate the size of an interactome, combining literature-curated data involving well-studied proteins with high-throughput data. It yields an estimate of at least 37,600 direct physical protein-protein interactions in S. cerevisiae, a significant increase over previous estimates. We then focus on next-generation DNA sequencing data. An analysis of the bias existing between short reads aligned on each strand of the genome allows us to highlight numerous systematic errors. Furthermore, we observe many positions that exhibit between 20 and 40% of reads carrying the variant allele: these cannot be genotyped correctly. We then propose a method to overcome these biases and reliably call genotypes from NGS data. Finally, we apply our method to exome-seq data produced by the TCGA for tumor and matched normal samples from 520 ovarian cancer patients. We detect on average 30,632 germline variants per patient.
    Through an integrative approach, we then identify those which are likely to increase cancer risk: in particular, we focus on variants inducing a loss of function of the encoded protein, and select those that are significantly more frequent in the patients than in the general population. We find 44 SNVs per patient on average, impacting 334 genes overall in the cohort. Among these genes, 42 have previously been reported as involved in carcinogenesis, confirming that our list is highly enriched in ovarian cancer susceptibility genes. In particular, our results confirm the tumor suppressor role of the MAP3K8 protein, recently identified in other types of cancer.
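The strand-bias check described in this abstract can be illustrated with a simple sketch: flag positions where the reads supporting the variant allele are distributed very unevenly between the forward and reverse strands, using an exact two-sided binomial test against an expected 50/50 split. This is a generic illustration of the idea, not the thesis's actual method, and the depth and significance thresholds are made up.

```python
from math import comb

def binom_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p-value for k successes out of n trials."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    # Sum the probabilities of all outcomes at most as likely as the observed one.
    return sum(x for x in pmf if x <= pmf[k] + 1e-12)

def strand_biased(fwd, rev, alpha=0.01):
    """True if variant-supporting reads deviate significantly from a 50/50
    forward/reverse split (illustrative alpha threshold)."""
    n = fwd + rev
    return n > 0 and binom_two_sided(fwd, n) < alpha

print(strand_biased(18, 2))   # strongly skewed to the forward strand -> True
print(strand_biased(9, 11))   # balanced -> False
```

Production variant callers typically combine a strand-bias statistic like this with base-quality and mapping-quality filters before genotyping a site.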

    Development and application of methodologies and infrastructures for cancer genome analysis within Personalized Medicine

    Next-generation sequencing (NGS) has revolutionized the biomedical sciences, especially in the area of cancer. It has nourished genomic research with extensive collections of sequenced genomes that are investigated to untangle the molecular bases of disease, as well as to identify potential targets for the design of new treatments. To exploit all this information, several initiatives have emerged worldwide, among which the Pan-Cancer project of the ICGC (International Cancer Genome Consortium) stands out. This project has jointly analyzed thousands of tumor genomes of different cancer types in order to elucidate the molecular bases of the origin and progression of cancer. To accomplish this task, new emerging technologies, including virtualization systems such as virtual machines and software containers, were used and had to be adapted to various computing centers. The porting of this system to the supercomputing infrastructure of the BSC (Barcelona Supercomputing Center) was carried out during the first phase of the thesis. In parallel, other projects promote the application of genomics discoveries in the clinic. This is the case of MedPerCan, a national initiative to design a pilot project for the implementation of personalized medicine in oncology in Catalonia. In this context, we have centered our efforts on the methodological side, focusing on the detection and characterization of somatic variants in tumors. This step is challenging, owing to the heterogeneity of the available methods, and essential, as it forms the basis of all downstream analyses. Beyond the methodological section of the thesis, we turned to the biological interpretation of the results to study the evolution of chronic lymphocytic leukemia (CLL), in close collaboration with the group of Dr. Elías Campo from the Hospital Clínic/IDIBAPS.
    In the first study, we focused on the Richter transformation (RT), a transformation of CLL into a high-grade lymphoma that leads to a very poor prognosis and presents unmet clinical needs. We found that RT has greater genomic, epigenomic and transcriptomic complexity than CLL. Its genome may reflect the imprint of therapies that the patients received prior to RT, indicating the presence of cells exposed to these mutagenic treatments which later expand, giving rise to the clinical manifestation of the disease. Multiple NGS-based techniques, including whole-genome sequencing and single-cell DNA and RNA sequencing, among others, confirmed the pre-existence of cells with RT characteristics years before their manifestation, as far back as the time of CLL diagnosis. The transcriptomic profile of RT is remarkably different from that of CLL. Of particular importance is the overexpression of the OXPHOS pathway, which could be exploited as a therapeutic vulnerability. Finally, in a second study, the analysis of a case of CLL in a young adult, based on whole-genome and single-cell sequencing at different times of the disease, revealed that the founder clone of CLL did not present any somatic driver mutations and was characterized by germline variants in ATM, suggesting its role in the origin of the disease and highlighting the possible contribution of germline variants or other non-genetic mechanisms to the initiation of CLL.

    Modern Systems for Large-scale Genomics Data Analysis in the Cloud

    Genomics researchers increasingly turn to cloud computing as a means of accomplishing large-scale analyses efficiently and cost-effectively. Successful operation in the cloud requires careful instrumentation and management to avoid common pitfalls, such as resource bottlenecks and low utilisation, which can both drive up costs and extend the timeline of a scientific project. We developed the Butler framework for large-scale scientific workflow management in the cloud to meet these challenges. The cornerstones of Butler's design are: the ability to support multiple clouds, declarative infrastructure configuration management, scalable fault-tolerant operation, comprehensive resource monitoring, and automated error detection and recovery. Butler relies on industry-strength open-source components in order to deliver a framework that is robust and scalable to thousands of compute cores and millions of workflow executions. Butler's error detection and self-healing capabilities are unique among scientific workflow frameworks and ensure that analyses are carried out with minimal human intervention. Butler has been used to analyse over 725 TB of DNA sequencing data on the cloud, using 1,500 CPU cores and 6 TB of RAM, delivering results with 43% greater efficiency compared to other tools. The flexible design of this framework allows easy adoption within other fields of the life sciences and ensures that it will scale together with the demand for scientific analysis in the cloud for years to come. Because many bioinformatics tools were developed in the context of small sample sizes, they often struggle to keep up with the demands of the large-scale data processing required for modern research and clinical sequencing projects, due to limitations in their design. The Rheos software system is designed specifically with these large data sets in mind.
    Utilising the elastic compute capacity of modern academic and commercial clouds, Rheos takes a service-oriented, containerised approach to the implementation of modern bioinformatics algorithms, which allows the software to achieve the scalability and ease of use required to succeed under the increased operational load of the massive data sets generated by projects like the International Cancer Genome Consortium (ICGC) ARGO project and the All of Us initiative. Rheos algorithms are based on an innovative stream-based approach to processing genomic data, which enables Rheos to make faster decisions about the presence of genomic mutations that drive diseases such as cancer, thereby improving the tools' efficacy and relevance to clinical sequencing applications. Our testing of the novel germline single nucleotide polymorphism (SNP) and deletion variant calling algorithms developed within Rheos indicates that Rheos achieves ~98% accuracy in SNP calling and ~85% accuracy in deletion calling, which is comparable with other leading tools such as the Genome Analysis Toolkit (GATK), freebayes, and Delly. The two frameworks that we developed provide important contributions to solving the ever-growing need for large-scale genomic data analysis on the cloud: by enabling more effective use of existing tools, in the case of Butler, and by providing a new, more dynamic and real-time approach to genomic analysis, in the case of Rheos.
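The advantage of a stream-based caller, as described above, is that it can emit a decision as soon as enough evidence has accumulated at a position rather than waiting for a complete batch. The following toy sketch shows that pattern; it is not Rheos's actual algorithm, and the depth and allele-fraction thresholds are illustrative only.

```python
# Toy stream-based genotyper: consume bases covering one genomic position
# and decide as soon as a minimum depth is reached (illustrative thresholds).
def stream_genotype(base_stream, ref, min_depth=10, min_frac=0.2):
    counts = {}
    for depth, base in enumerate(base_stream, start=1):
        counts[base] = counts.get(base, 0) + 1
        if depth >= min_depth:
            # Most-supported non-reference allele seen so far.
            alt, n_alt = max(
                ((b, c) for b, c in counts.items() if b != ref),
                key=lambda bc: bc[1],
                default=(None, 0),
            )
            frac = n_alt / depth
            if frac >= 1 - min_frac:
                return "hom_alt", alt
            if frac >= min_frac:
                return "het", alt
            return "hom_ref", ref
    return "no_call", None  # stream ended before reaching min_depth

print(stream_genotype(iter("AAAAACCCCC"), ref="A"))  # ('het', 'C')
```

A production system would weigh base and mapping qualities and use genotype likelihoods instead of raw allele fractions, but the early-decision structure is what enables the latency gains the abstract describes.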