19 research outputs found

    The haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar reveal novel pan-genome and allele-specific transcriptome features

    Background: Cassava (Manihot esculenta) is an important clonally propagated food crop in tropical and subtropical regions worldwide. Genetic gain by molecular breeding has been limited, partially because cassava is a highly heterozygous crop with a repetitive and difficult-to-assemble genome. Findings: Here we demonstrate that Pacific Biosciences high-fidelity (HiFi) sequencing reads, in combination with the assembler hifiasm, produced genome assemblies at near complete haplotype resolution with higher continuity and accuracy compared to conventional long sequencing reads. We present 2 chromosome-scale haploid genomes phased with Hi-C technology for the diploid African cassava variety TME204. With consensus accuracy >QV46, contig N50 >18 Mb, BUSCO completeness of 99%, and 35k phased gene loci, it is the most accurate, continuous, complete, and haplotype-resolved cassava genome assembly so far. Ab initio gene prediction with RNA-seq data and Iso-Seq transcripts identified abundant novel gene loci, with enriched functionality related to chromatin organization, meristem development, and cell responses. During tissue development, differentially expressed transcripts of different haplotype origins were enriched for different functionality. In each tissue, 20-30% of transcripts showed allele-specific expression (ASE) differences. ASE bias was often tissue specific and inconsistent across different tissues. Direction-shifting was observed in <2% of the ASE transcripts. Despite high gene synteny, the HiFi genome assembly revealed extensive chromosome rearrangements and abundant intra-genomic and inter-genomic divergent sequences, with large structural variations mostly related to LTR retrotransposons. We use the reference-quality assemblies to build a cassava pan-genome and demonstrate its importance in representing the genetic diversity of cassava for downstream reference-guided omics analysis and breeding. 
Conclusions: The phased and annotated chromosome pairs allow a systematic view of the heterozygous diploid genome organization in cassava with improved accuracy, completeness, and haplotype resolution. They will be a valuable resource for cassava breeding and research. Our study may also provide insights into developing cost-effective and efficient strategies for resolving complex genomes with high resolution, accuracy, and continuity.

    Network-driven strategies to integrate and exploit biomedical data

    [eng] In the quest to understand complex biological systems, the scientific community has been delving into protein, chemical and disease biology, populating biomedical databases with a wealth of data and knowledge. The field of biomedicine has now entered a Big Data era, in which computation-driven research can benefit greatly from existing knowledge to better understand and characterize biological and chemical entities. Yet the heterogeneity and complexity of biomedical data call for proper integration and representation of this knowledge, so that it can be exploited effectively and efficiently. In this thesis, we aim to develop new strategies to leverage current biomedical knowledge, so that meaningful information can be extracted and fused into downstream applications. To this end, we have capitalized on network analysis algorithms to integrate and exploit biomedical data in a wide variety of scenarios, providing a better understanding of pharmacoomics experiments while helping to accelerate the drug discovery process. More specifically, we have (i) devised an approach to identify functional gene sets associated with drug response mechanisms of action, (ii) created a resource of biomedical descriptors able to anticipate cellular drug response and identify new drug repurposing opportunities, (iii) designed a tool to annotate biomedical support for a given set of experimental observations, and (iv) reviewed different chemical and biological descriptors relevant for drug discovery, illustrating how they can be used to provide solutions to current challenges in biomedicine.
[cat] In the search for a better understanding of complex biological systems, the scientific community has been delving into the biology of proteins, drugs and diseases, populating biomedical databases with a large volume of data and knowledge. Today, the field of biomedicine is in an era of "Big Data", in which computer-driven research can benefit from this knowledge to better understand and characterize chemical and biological entities. However, the heterogeneity and complexity of biomedical data require that they be integrated and represented appropriately, so that this information can be exploited effectively and efficiently. The aim of this doctoral thesis is to develop new strategies that make it possible to exploit current biomedical knowledge and thereby extract relevant information for future biomedical applications. To this end, we have used network algorithms to integrate and exploit biomedical knowledge in different tasks, providing a better understanding of pharmacoomics experiments in order to help accelerate the drug discovery process. As a result, in this thesis we have (i) designed a strategy to identify functional groups of genes associated with the response of cell lines to drugs, (ii) created a collection of biomedical descriptors able, among other things, to anticipate how cells respond to drugs or to find new uses for existing drugs, (iii) developed a tool to discover which biological contexts correspond to an experimentally observed biological association and, finally, (iv) explored different chemical and biological descriptors relevant to the drug discovery process, showing how they can be used to find solutions to current challenges in the field of biomedicine.

    Genetics of Hearing Impairment

    The inner ear is a complex machinery at the cellular and molecular levels. Many different genes and proteins play roles in the development and maintenance of its structure and function by participating in diverse molecular networks. A defect in any of these components can result in the loss of hearing. Consequently, hearing impairment encompasses a wide variety of disorders that are clinically and genetically heterogeneous. Understanding their genetic causes and their pathophysiological mechanisms, and characterizing the resulting phenotypes, are essential for developing novel therapies that target the specific defects. The articles and reviews in this book are representative of the many research lines that are currently active in the field, including recent advances in the genes and mutations involved in hearing impairment, the mechanisms through which mutations result in different syndromic or non-syndromic disorders, and the description of the associated phenotypes in humans and in animal models.

    Development and application of novel bioinformatics tools for protein function prediction

    Pearson Correlation Coefficient and provides a value between -1 and 1, where -1 is a total negative correlation, 0 is no correlation and 1 is a total positive correlation, based on the observed and predicted ligand-binding site residues. Scores of 0.40 to 0.69 are strong positive relationships and scores of 0.70 and higher are very strong positive relationships. The downside of MCC is that it does not take into consideration the overall 3D structure of the protein model. Therefore, BDT will also be utilised, as this score, which also ranges from -1 to 1, takes the 3D structure into consideration. Both MCC and BDT can only be calculated when an observed (actual) structure with bound ligands is available to compare against the predicted structure, which is why MCC and BDT are objective measures of ligand-binding site prediction. The average MCC and BDT scores from CASP11 were 0.42 and 0.51, respectively. CASP12 saw the prediction of ligands for low annotation level proteins with no known ligands, demonstrating the potential use of FunFOLD3 in novel protein prediction. The average MCC and BDT scores from CASP13 were 0.47 and 0.53. CAFA3 showed FunFOLDQ can be used in the prediction of GO terms; however, further refinements are needed to increase the specificity of the term predictions. The development option this thesis has explored is the use of docking (preferred orientation of interacting partners) with AutoDock Vina to improve the accuracy of ligand-binding residues predicted by FunFOLD3, as the problem with TBM methods can be that predicted ligand(s) from a similar template will be forced to fit within the ligand-binding pocket. With docking, however, the aim is to predict the preferred orientation of the ligand within the ligand-binding space. Utilisation of docking has also added to the novelty of this research, as different grid box calculations around the ligand-binding space were explored, with varying degrees of success for each grid box calculation.
Two CASP targets illustrate the improvement in MCC and BDT scores following docking. For CASP11 target T0783 (2-C-methyl-D-erythritol 4-phosphate cytidylyltransferase), the MCC and BDT scores by FunFOLD3 were 0.17 and 0.21, respectively; following docking, they increased to 0.63 and 0.45, respectively. CASP13 target T1016 (alpha-ribazole-5'-P phosphatase) had MCC and BDT scores of 0.556 and 0.646 by FunFOLD3, respectively; following docking, they increased to 0.85 and 0.91, respectively. Lastly, CASP_Commons, a community-wide experiment to find consensus structures, explored the role of FunFOLD3 in predicting ligands and ligand-binding sites for the novel proteins and protein domains of SARS-CoV-2. The protein domains were non-structural proteins 2, 4 and 6, open reading frames 3a, 6, 7b, 8 and 10, membrane protein and papain-like protease. FunFOLD3 predicted ligands for ten of the protein domains, for which there were a total of 32 targets due to domains being split into smaller residues and subsequent rounds of 3D modelling improvement. Increased understanding of protein structures can provide further insight into a protein's function, particularly if ligands are bound and identified; an example in this thesis is the prediction of chlorophyll A for non-structural protein 4 (nsp4). Chlorophyll A, like haemoglobin, is a porphyrin ring, and templates related to nsp4 show a role in blood clotting. Therefore, whilst chlorophyll A might not be the exact ligand, similarities between haemoglobin and chlorophyll A can clearly be determined and assist in understanding the role of nsp4 in the pathology of COVID-19. Identification of GO terms can provide a more detailed understanding of the function or functions of proteins and, in proteins with limited annotation information, this can assist with comprehending their role.
This thesis has focused on improving and developing a function prediction method, FunFOLD3, to better understand the role and function of proteins. The new FunFOLD3 method, which utilises docking, will be integrated into the McGuffin group prediction servers and will be benchmarked in subsequent CASP competitions to critically assess the performance of the developed method.
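
The MCC score discussed in the abstract above can be illustrated with a minimal sketch. It assumes MCC here is the Matthews correlation coefficient computed from a per-residue confusion matrix, with each residue labelled as binding or non-binding. The function name `binding_site_mcc`, the residue numbers and the protein length are hypothetical and chosen only for illustration; this is not the FunFOLD3 or CASP scoring code.

```python
import math


def binding_site_mcc(observed, predicted, n_residues):
    """Matthews correlation coefficient for a binding-site prediction.

    observed, predicted: iterables of residue indices labelled as binding.
    n_residues: total number of residues in the protein (assumed known).
    """
    observed, predicted = set(observed), set(predicted)
    tp = len(observed & predicted)   # binding residues correctly predicted
    fp = len(predicted - observed)   # predicted as binding, not observed
    fn = len(observed - predicted)   # observed as binding, not predicted
    tn = n_residues - tp - fp - fn   # correctly left as non-binding
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention assumed here: return 0 when a row or column of the
        # confusion matrix is empty and the coefficient is undefined.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical example: a 100-residue protein with four observed
# binding residues, two of which are recovered by the prediction.
score = binding_site_mcc({10, 11, 12, 45}, {10, 11, 50, 51}, 100)
print(round(score, 3))
```

A perfect prediction gives 1, and a prediction that inverts every label gives -1, matching the range described in the abstract.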

    Univariate and multivariate statistical approaches for the analyses of omics data: sample classification and two-block integration.

    The wealth of information generated by high-throughput omics technologies in the context of large-scale epidemiological studies has made a significant contribution to the identification of factors influencing the onset and progression of common diseases. Advanced computational and statistical modelling techniques are required to manipulate and extract meaningful biological information from these omics data, as several layers of complexity are associated with them. Recent research efforts have concentrated on the development of novel statistical and bioinformatic tools; however, studies thoroughly investigating the applicability and suitability of these novel methods in real data have often fallen behind. This thesis focuses on the analysis of proteomics and transcriptomics data from the EnviroGenoMarker project with the purpose of addressing two main research objectives: i) to critically appraise established and recently developed statistical approaches in their ability to appropriately accommodate the inherently complex nature of real-world omics data and ii) to improve the current understanding of a prevalent condition by identifying biological markers predictive of disease as well as possible biological mechanisms leading to its onset. The specific disease endpoint of interest is B-cell lymphoma, a common haematological malignancy for which many questions related to its aetiology remain unanswered. The seven chapters comprising this thesis are structured in the following manner: the first two are introductory chapters in which I describe the main omics technologies and the statistical methods employed for their analysis. The third chapter provides a description of the epidemiological project giving rise to the study population and the disease outcome of interest.
These are followed by three results chapters that address the research aims described above by applying univariate and multivariate statistical approaches for sample classification and data integration purposes. A summary of findings, concluding general remarks and discussion of open problems offering potential avenues for future research are presented in the final chapter.

    Development of computational techniques for genomic data analysis and visualisation in model and non-model organisms

    This thesis describes the work undertaken by the author between 2011 and 2018. With technological development, genome sequencing became affordable and accessible to the scientific community. This led to the generation of an enormous amount of genomic data and of bioinformatics tools to analyse and visualise these data. However, most of the public resources are designed for model organisms and gold-standard curated genomes. These tools are designed to run in a specifically configured environment and depend on specific data formats. Chapter 1 of my thesis introduces the state of the field, the existing tools, their functionalities, and the limitations that prompted the software developments presented in the following chapters. In chapter 2, I discuss the TGAC Browser, an open-source genome browser, and wigExplorer, a BioJS plugin to visualise expression data. In chapter 3, I move towards finding gene families using GeneSeqToFamily, a Galaxy workflow based on the EnsemblCompara GeneTree pipeline. In chapter 4, I focus on tools developed for the visualisation of gene families: Aequatus, an open-source homology browser, and ViCTreeView, a plugin developed as a part of the ViCTree project to visualise and explore phylogenetic trees. In chapter 5, I discuss the availability and accessibility of these tools. All the tools and workflows I have developed are open-source, under a free licence, and are available on GitHub and/or the Galaxy ToolShed. I also discuss the impact that these tools have made on various research projects and the possibilities for their future development.

    Evolutionary genomics : statistical and computational methods

    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward.

    Plant Genetics and Molecular Biology

    This book reviews the latest advances in multiple fields of plant biotechnology and the opportunities that plant genetics, genomics and molecular biology have offered for agricultural improvement. Advanced technologies can dramatically enhance our capacity to understand the molecular basis of traits and to use the available resources for accelerated development of high-yielding, nutritious, input-use-efficient and climate-smart crop varieties. In this book, readers will discover the significant advances in plant genetics, structural and functional genomics, trait and gene discovery, transcriptomics, proteomics, metabolomics, epigenomics, nanotechnology and analytical and decision support tools in breeding. This book appeals to researchers, academics and other stakeholders of global agriculture.

    Genomic introgression mapping of field-derived multiple-anthelmintic resistance in Teladorsagia circumcincta

    Preventive chemotherapy has long been practiced against nematode parasites of livestock, leading to widespread drug resistance, and is increasingly being adopted for eradication of human parasitic nematodes even though it is similarly likely to lead to drug resistance. Given that the genetic architecture of resistance is poorly understood for any nematode, we have analyzed multidrug-resistant Teladorsagia circumcincta, a major parasite of sheep, as a model for analysis of resistance selection. We introgressed a field-derived multiresistant genotype into a partially inbred susceptible genetic background (through repeated backcrossing and drug selection) and performed genome-wide scans in the backcross progeny and drug-selected F2 populations to identify the major genes responsible for the multidrug resistance. We identified variation linking candidate resistance genes to each drug class. Putative mechanisms included target site polymorphism, changes in likely regulatory regions and copy number variation in efflux transporters. This work elucidates the genetic architecture of multiple anthelmintic resistance in a parasitic nematode for the first time and establishes a framework for future studies of anthelmintic resistance in nematode parasites of humans.