319 research outputs found

    SNiPlay: a web-based tool for detection, management and analysis of SNPs. Application to grapevine diversity projects

    Background: High-throughput re-sequencing, new genotyping technologies and the availability of reference genomes allow the extensive characterization of Single Nucleotide Polymorphisms (SNPs) and insertion/deletion events (indels) in many plant species. The rapidly increasing amount of re-sequencing and genotyping data generated by large-scale genetic diversity projects requires the development of integrated bioinformatics tools able to efficiently manage, analyze, and combine these genetic data with genome structure and external data. Results: In this context, we developed SNiPlay, a flexible, user-friendly and integrative web-based tool dedicated to polymorphism discovery and analysis. It integrates: 1) a pipeline, freely accessible through the internet, combining existing software with new tools to detect SNPs and to compute different types of statistical indices and graphical layouts for SNP data. From standard sequence alignments, genotyping data or Sanger sequencing traces given as input, SNiPlay detects SNPs and indel events and outputs submission files for the design of Illumina's SNP chips. Subsequently, it sends sequences and genotyping data into a series of modules in charge of various processes: physical mapping to a reference genome, annotation (genomic position, intron/exon location, synonymous/non-synonymous substitutions), SNP frequency determination in user-defined groups, haplotype reconstruction and network, linkage disequilibrium evaluation, and diversity analysis (Pi, Watterson's Theta, Tajima's D). Furthermore, the pipeline allows the use of external data (such as phenotype, geographic origin, taxa, stratification) to define groups and compare statistical indices. 2) a database storing polymorphisms, genotyping data and grapevine sequences released by public and private projects. It allows the user to retrieve SNPs using various filters (such as genomic position, missing data, polymorphism type, allele frequency), to compare SNP patterns between populations, and to export genotyping data or sequences in various formats. Conclusions: Our experiments on grapevine genetic projects showed that SNiPlay allows geneticists to rapidly obtain advanced results in several key research areas of plant genetic diversity. Both the management and treatment of large amounts of SNP data are rendered considerably easier for end-users through automation and integration. Current developments are taking into account new advances in high-throughput technologies. SNiPlay is available at: http://sniplay.cirad.fr/.
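    The diversity indices named in this abstract (Pi, Watterson's Theta, Tajima's D) can be illustrated with a minimal sketch. It is not SNiPlay's code: the function name `diversity_stats` is hypothetical, and the sketch assumes gap-free, equal-length aligned sequences with no missing data. Pi and Theta are reported per alignment rather than per site.

```python
from itertools import combinations
from math import sqrt


def diversity_stats(seqs):
    """Return (S, pi, theta_w, tajima_d) for aligned, equal-length sequences.

    Hypothetical helper for illustration only; gaps and missing data
    are not handled.
    """
    n = len(seqs)
    if n < 2 or len({len(s) for s in seqs}) != 1:
        raise ValueError("need at least two aligned sequences of equal length")

    # S: number of segregating (polymorphic) alignment columns.
    seg_sites = sum(1 for column in zip(*seqs) if len(set(column)) > 1)

    # Pi: mean number of pairwise differences between sequences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Watterson's theta: S scaled by the harmonic number a1.
    a1 = sum(1 / i for i in range(1, n))
    theta_w = seg_sites / a1

    # Tajima's D has zero variance for n < 4 and is undefined when S == 0.
    if n < 4 or seg_sites == 0:
        return seg_sites, pi, theta_w, None

    # Variance constants from Tajima (1989).
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    tajima_d = (pi - theta_w) / sqrt(variance)
    return seg_sites, pi, theta_w, tajima_d


if __name__ == "__main__":
    # Toy alignment, invented for the example.
    print(diversity_stats(["ACGT", "ACGA", "ACGT", "TCGT"]))
```

    A negative Tajima's D, as in the toy alignment, means Pi is smaller than Watterson's Theta, which is the pattern expected under an excess of rare variants.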

    Emerging model species driven by transcriptomics

    This work is focused on 'emerging model species', i.e. question-driven model species which have sufficient molecular resources to investigate a specific phenomenon in molecular biology, developmental biology, molecular ecology and evolution or related molecular fields. This thesis shows how transcriptomic data can be generated, analyzed, and used to investigate such phenomena of interest even in species lacking a reference genome. The initial ButterflyBase resource has proven to be useful to researchers of species without a reference genome but is limited to the Lepidoptera and supports only the older Sanger sequencing technologies. Thanks to Next Generation Sequencing, transcriptome sequencing is more cost-effective, but the bottleneck of transcriptomic projects is now the bioinformatic analysis and data mining/dissemination. Therefore, this work continues by presenting novel approaches which effectively overcome this bottleneck. The est2assembly software produces deeply annotated reference transcriptomes stored in the Chado database. The Drupal Bioinformatic Server Framework and genes4all provide a species-neutral and innovative approach to building standardized online databases and associated web services. All public insect mRNA data were analyzed with est2assembly and genes4all to produce InsectaCentral. With InsectaCentral, a powerful resource is now available to assist molecular biology in any question-driven model insect species. The software presented here was developed according to specifications of the General Model Organism Database (GMOD) community. All software specifications are species-neutral and can be seamlessly deployed to assist any research community. Furthermore, a case-studies chapter shows that the transcriptomic approach is more cost-effective than a genomic approach and that sequence-driven evolutionary biology will therefore benefit sooner from it.

    The iPlant Collaborative: Cyberinfrastructure for Plant Biology

    The iPlant Collaborative (iPlant) is a United States National Science Foundation (NSF) funded project that aims to create an innovative, comprehensive, and foundational cyberinfrastructure in support of plant biology research (PSCIC, 2006). iPlant is developing cyberinfrastructure that uniquely enables scientists throughout the diverse fields that comprise plant biology to address Grand Challenges in new ways, to stimulate and facilitate cross-disciplinary research, to promote biology and computer science research interactions, and to train the next generation of scientists on the use of cyberinfrastructure in research and education. Meeting humanity's projected demands for agricultural and forest products and the expectation that natural ecosystems be managed sustainably will require synergies from the application of information technologies. The iPlant cyberinfrastructure design is based on an unprecedented period of research community input, and leverages developments in high-performance computing, data storage, and cyberinfrastructure for the physical sciences. iPlant is an open-source project with application programming interfaces that allow the community to extend the infrastructure to meet its needs. iPlant is sponsoring community-driven workshops addressing specific scientific questions via analysis tool integration and hypothesis testing. These workshops teach researchers how to add bioinformatics tools and/or datasets into the iPlant cyberinfrastructure enabling plant scientists to perform complex analyses on large datasets without the need to master the command-line or high-performance computational services

    Towards a new online species-information system for legumes

    The need for scientists to exchange, share and organise data has resulted in a proliferation of biodiversity research-data portals over recent decades. These cyber-infrastructures have had a major impact on taxonomy and helped the discipline by allowing faster access to bibliographic information, biological and nomenclatural data, and specimen information. Several specialised portals aggregate particular data types for a large number of species, including legumes. Here, we argue that, despite access to such data-aggregation portals, a taxon-focused portal, curated by a community of researchers specialising on a particular taxonomic group and who have the interest, commitment, existing collaborative links, and knowledge necessary to ensure data quality, would be a useful resource in itself and make important contributions to more general data providers. Such an online species-information system focused on Leguminosae (Fabaceae) would serve useful functions in parallel to and different from international data-aggregation portals. We explore best practices for developing a legume-focused portal that would support data sharing, provide a better understanding of what data are available, missing, or erroneous, and, ultimately, facilitate cross-analyses and direct development of novel research. We present a history of legume-focused portals, survey existing data portals to evaluate what is available and which features are of most interest, and discuss how a legume-focused portal might be developed to respond to the needs of the legume-systematics research community and beyond. We propose taking full advantage of existing data sources, informatics tools and protocols to develop a scalable and interactive portal that will be used, contributed to, and fully supported by the legume-systematics community in the easiest manner possible

    A FAIR approach to genomics

    The aim of this thesis was to increase our understanding of how genome information leads to function and phenotype. To this end, I developed a semantic systems biology framework capable of extracting knowledge, biological concepts and emergent system properties from a vast array of publicly available genome information. In chapter 2, Empusa is described as an infrastructure that bridges the gap between the intended and actual content of a database. This infrastructure was used in chapters 3 and 4 to develop the framework. Chapter 3 describes the development of the Genome Biology Ontology Language (GBOL) and the GBOL stack of supporting tools enforcing consistency within and between the GBOL definitions in the ontology (OWL) and the Shape Expressions (ShEx) language describing the graph structure. A practical implementation of a semantic systems biology framework for FAIR (de novo) genome annotation is provided in chapter 4. The semantic framework and genome annotation tool described in this chapter have been used throughout this thesis to consistently, structurally and functionally annotate and mine the microbial genomes used in chapters 5-10. In chapter 5, we introduced how the concept of protein domains and corresponding architectures can be used in comparative functional genomics to provide a fast, efficient and scalable alternative to sequence-based methods. This allowed us to effectively compare and identify functional variations between hundreds to thousands of genomes. In chapter 6, we used 432 available complete Pseudomonas genomes to study the relationship between domain essentiality and persistence. In this chapter the focus was mainly on domains involved in metabolic functions. The metabolic domain space was explored for domain essentiality and persistence through the integration of heterogeneous data sources including six published metabolic models, a vast gene expression repository and transposon data. 
In chapter 7, the correlation between the expected and observed genotypes was explored using 16S-rRNA phylogeny and protein domain class content as input. In this chapter it was shown that domain class content yields a higher resolution than 16S-rRNA when analysing evolutionary distances. Using protein domain classes, we were also able to identify signifying domains, which may have important roles in shaping a species. To demonstrate the use of semantic systems biology workflows in a biotechnological setting, we expanded the resource with more than 80,000 bacterial genomes. The genomic information of this resource was mined using a top-down approach to identify strains having the trait for 1,3-propanediol production. This resulted in the molecular identification of 49 new species. In addition, we experimentally verified that 4 species were capable of producing 1,3-propanediol. As discussed in chapter 10, the semantic systems biology workflows developed here were successfully applied in the discovery of key elements in symbiotic relationships, to improve functional genome annotation and in comparative genomics studies. Wet/dry-lab collaboration was often at the basis of the obtained results. The success of the collaboration between the wet and dry fields prompted me to develop an undergraduate course in which the concept of the “Moist” workflow was introduced (Chapter 9).

    Computational functional annotation of crop genomics using hierarchical orthologous groups

    Improving agronomically important traits, such as yield, is important in order to meet the ever-growing demands of increased crop production. Knowledge of the genes that have an effect on a given trait can be used to enhance genomic selection by prediction of biologically interesting loci. Candidate genes that are strongly linked to a desired trait can then be targeted by transformation or genome editing. This prioritisation of genetic material can accelerate crop improvement. However, its application is currently limited by the lack of accurate annotations and of methods to integrate experimental data with evolutionary relationships. Hierarchical orthologous groups (HOGs) provide nested groups of genes that enable the comparison of highly diverged and similar species in a consistent manner. Over 2,250 species are included in the OMA project, resulting in over 600,000 HOGs. This thesis provides the required methodology and a tool to exploit this rich source of information, in the HOGPROP algorithm. The potential of this approach is then demonstrated by mining crop genome data from metabolic QTL studies, using Gene Ontology (GO) annotations as well as ChEBI (Chemical Entities of Biological Interest) terms to prioritise candidate causal genes. Gauging the performance of the tool is also important. When considering GO annotations, the CAFA series of community experiments has provided the most extensive benchmarking to date. However, this has not fully taken into account the incomplete knowledge of protein function, known as the open world assumption (OWA). This will require extra negative annotations, for which one such source has been identified based on expertly curated gene phylogenies. These negative annotations are then utilised in the proposed, OWA-compliant, improved framework for benchmarking. 
The results show that current benchmarks tend to focus on the general terms, which means that conclusions are not merely uninformative, but misleading

    NGS-QCbox and Raspberry for Parallel, Automated and Rapid Quality Control Analysis of Large-Scale Next Generation Sequencing (Illumina) Data

    The rapid popularity and adoption of next generation sequencing (NGS) approaches have generated huge volumes of data. High-throughput platforms like Illumina HiSeq produce terabytes of raw data that require quick processing. Quality control of the data is an important step prior to downstream analyses. To address these issues, we have developed a quality control pipeline, NGS-QCbox, that scales up to process hundreds or thousands of samples. Raspberry is an in-house tool, developed in C using HTSlib (v1.2.1) (http://htslib.org), for computing read- and base-level statistics. It can be used as a stand-alone application and can process both compressed and uncompressed FASTQ files. NGS-QCbox integrates Raspberry with other open-source tools for alignment (Bowtie2), SNP calling (SAMtools) and other utilities (bedtools) to analyze raw NGS data efficiently and in a high-throughput manner. The pipeline implements parallel batch processing of jobs using Bpipe (https://github.com/ssadedin/bpipe) and, internally, fine-grained task parallelization using OpenMP. It reports read and base statistics along with genome coverage and variants in a user-friendly format. The pipeline presents a simple menu-driven interface and can be used in either quick or complete mode. In addition, the pipeline in quick mode is faster than other comparable existing QC pipelines and tools. The NGS-QCbox pipeline, the Raspberry tool and associated scripts are available at https://github.com/CEG-ICRISAT/NGS-QCbox and https://github.com/CEG-ICRISAT/Raspberry for rapid quality control analysis of large-scale next generation sequencing (Illumina) data.
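    The kind of read- and base-level statistics described above can be sketched in a few lines. This is not Raspberry's implementation (which the abstract says is written in C on HTSlib): the function name `fastq_stats`, the returned keys, and the assumptions of four-line FASTQ records and a Phred+33 quality encoding are all illustrative choices.

```python
import gzip


def fastq_stats(path, phred_offset=33):
    """Compute simple read/base statistics for a FASTQ file.

    Hypothetical helper for illustration: assumes strict four-line records
    and picks gzip or plain text from the file extension.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    reads = bases = gc = quality_sum = 0
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:          # end of file
                break
            if not header.strip():  # tolerate blank lines between records
                continue
            seq = handle.readline().strip().upper()
            handle.readline()       # '+' separator line, ignored
            qual = handle.readline().strip()
            reads += 1
            bases += len(seq)
            gc += seq.count("G") + seq.count("C")
            quality_sum += sum(ord(ch) - phred_offset for ch in qual)
    return {
        "reads": reads,
        "bases": bases,
        "gc_fraction": gc / bases if bases else 0.0,
        "mean_quality": quality_sum / bases if bases else 0.0,
    }
```

    Calling `fastq_stats("sample.fastq.gz")` (a placeholder file name) returns a dictionary of totals. A production tool would also validate record structure and report per-position quality, which this sketch omits.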

    Advancing systems biology of yeast through machine learning and comparative genomics

    Synthetic biology has played a pivotal role in accomplishing the production of high-value commodities, pharmaceuticals, and bulk chemicals. Fueled by breakthroughs in synthetic biology and metabolic engineering, Saccharomyces cerevisiae and various other yeasts (such as Yarrowia lipolytica and Pichia pastoris) have proven to be promising microbial cell factories and are frequently used in scientific studies. However, the cellular metabolism and physiological properties of most yeast species have not been characterized in detail. To address these knowledge gaps, this thesis aims to leverage the large amounts of data available for yeast species and use state-of-the-art machine learning techniques and comparative genomic analysis to gain a deeper insight into yeast traits and metabolism. In this thesis, machine learning was applied to various unresolved biological problems in yeasts, i.e., gene essentiality, enzyme turnover number (kcat), and protein production. In the first part of the work, machine learning approaches were employed to predict gene essentiality based on sequence features and evolutionary features. It was demonstrated that essential gene prediction could be substantially improved by integrating evolution-based features. Secondly, a high-quality deep learning model, DLKcat, was developed to predict kcat values by combining a graph neural network for substrates and a convolutional neural network for proteins. By predicting kcat profiles for 343 yeast/fungi species, enzyme-constrained models were reconstructed and used to further elucidate cellular metabolism on a large scale. Lastly, a random forest algorithm was used to analyse feature importance for protein production; it was found that post-translational modifications (PTMs) have a relatively higher impact on protein production than amino acid composition. 
In comparative genomics, a comprehensive toolbox HGTphyloDetect was developed to facilitate the identification of horizontal gene transfer (HGT) events. Case studies on some yeast species demonstrated the ability of HGTphyloDetect to identify horizontally acquired genes with high accuracy. In addition, through systematic evolution analysis (e.g., HGT, gene family expansion) and genome-scale metabolic model simulation, the underlying mechanisms for substrate utilization were further probed across large-scale yeast species

    On the diversification of highly host-specific symbionts: the case of feather mites.

    One of the most relevant and poorly understood topics in Evolutionary Ecology is symbiont evolutionary diversification. Since Fahrenholz's rule (1913), the idea of symbionts speciating following host speciation (i.e., cospeciating) has been pervasive. Recent studies, however, have shown that host-shift speciation (speciation after switching to a new host) is almost as relevant as cospeciation in explaining symbiont diversification. Also, these studies have revealed that methodological biases have favored cospeciation. Nonetheless, most symbiont groups, especially those highly host-specific and specialized in which cospeciation is expected to be the rule, such as the feather mites of birds, were yet to be studied. Symbionts are the most abundant and diverse organisms on Earth, and thus essential components of ecosystems. However, symbionts have historically attracted less attention than other organisms and their study entails numerous methodological challenges, so surprisingly little is understood about the basic biology and ecology of many symbiont groups, especially the non-parasitic ones. By studying vane-dwelling feather mites living permanently on the surface of flight feathers of birds (Acariformes: Astigmata: Analgoidea and Pterolichoidea), this thesis contributes to filling this gap. This thesis is divided into three parts: 1) First, resources and molecular tools enabling large-scale studies of feather mites are developed. 2) Then, these and other tools are used to investigate eco-evolutionary aspects relevant to understanding feather mite diversification, such as their mode of transmission and the type of interaction they have with their hosts. 3) Finally, feather mite diversification at a macro- and microevolutionary scale is investigated. The first part compiles a global database of bird-feather mite associations. Also, it evaluates and adjusts DNA barcoding and metabarcoding to be suitable methodologies for studying feather mites. 
The second part reveals feather mites as highly specialized, host-specific symbionts whose main mode of transmission is vertical. Analyses of feather mite diet reveal them as trophic generalists that maintain a commensalistic-mutualistic relationship with birds. Finally, the last part of the thesis shows host-shift speciation to be the primary process driving the diversification of feather mites. It also highlights that major host switching, despite being infrequent, is highly relevant for the diversification of this group. Lastly, analyses of straggling reveal a high rate of preferential straggling governed by ecological filters. Overall, although feather mites are revealed as highly specialized and host-specific symbionts, the coevolutionary scenario is highly dynamic. Straggling and host switching are prevalent processes that allow highly specialized and host-specific symbionts to colonize new hosts. Accordingly, coevolution and codiversification do not operate in isolated host-symbiont interactions but more likely in a manner compatible with the geographic mosaic of coevolution. Finally, ecological fitting and interspecific competition are most likely the main factors governing the (co)eco-evolutionary dynamics. The evolutionary diversification of symbionts is one of the most relevant yet least understood topics in Evolutionary Ecology. Since Fahrenholz's rule (1913), the idea that symbionts speciate in step with their hosts (i.e., cospeciate) has been extremely popular. However, recent studies have found that host-shift speciation (speciation that occurs when symbionts speciate as a consequence of switching hosts) is almost as relevant as cospeciation. These studies have also found that methodological problems favored finding evidence of cospeciation where there was none. 
In any case, the evolutionary diversification processes of most symbiont groups have never been investigated, especially those that are highly specialized and host-specific, in which cospeciation is expected to be most relevant, such as the feather mites of birds. Symbiotic organisms are the most abundant and diverse group on Earth and are therefore essential components of ecosystems. Historically, however, symbionts have attracted less attention from researchers, partly because their study entails numerous methodological challenges. As a result, much of their basic biology and ecology remains unknown, especially for non-parasitic symbionts. This thesis aims to fill this knowledge gap by studying the feather mites of birds. The thesis is divided into three parts: 1) In the first part, resources and molecular tools were generated for large-scale studies of this group of symbionts. 2) Next, these and other tools were used to investigate eco-evolutionary aspects relevant to understanding evolutionary diversification, such as the mode of transmission and the type of interaction the mites maintain with their hosts. 3) Finally, evolutionary diversification was studied at macro- and microevolutionary scales. The first part of the thesis presents a global database of mite-bird associations, the result of an extensive compilation of data already present in the literature. It also evaluates and adjusts DNA barcoding and metabarcoding methodologies for the study of feather mites. The second part reveals feather mites as highly host-specialized symbionts whose main mode of transmission is vertical. 
In addition, analysis of the mites' diet places them as commensal-mutualistic symbionts of birds. Finally, the last part of the thesis shows that host-shift speciation is the main diversification process in this group of symbionts. It also shows that long-distance host shifts, despite being very rare, are highly relevant for the diversification of this group. Lastly, analyses of symbionts found on unexpected hosts ("stragglers") reveal that this process is more prevalent than previously thought and follows a pattern consistent with modulation by ecological filters. Although feather mites are revealed as highly specialized and host-specific, their coevolutionary scenario is very dynamic. Straggling and host switching are prevalent processes that allow new hosts to be colonized. Accordingly, coevolution and codiversification in these organisms do not operate in isolation for each host-symbiont pair, but in a manner similar to a geographic mosaic of coevolution. Finally, ecological fitting and intraspecific competition are identified as potentially the most relevant factors in the (co)eco-evolutionary dynamics