330 research outputs found

    Bioinformatics in microbial biotechnology – a mini review


    Updates in metabolomics tools and resources: 2014-2015

    Data processing and interpretation represent the most challenging and time-consuming steps in high-throughput metabolomic experiments, regardless of the analytical platform (MS- or NMR spectroscopy-based) used for data acquisition. Improved instrumentation in metabolomics generates increasingly complex datasets, creating the need for more and better processing and analysis software and in silico approaches to understand the resulting data. However, a comprehensive source of information describing the utility of the most recently developed and released metabolomics resources (tools, software, and databases) is currently lacking. Thus, here we provide an overview of freely available, open-source tools, algorithms, and frameworks to make both new and established metabolomics researchers aware of recent developments, in an attempt to advance and facilitate data processing workflows in their metabolomics research. The major topics include tools and resources for data processing, data annotation, and data visualization in MS- and NMR-based metabolomics. Most of the tools described in this review are dedicated to untargeted metabolomics workflows; however, some more specialist tools are described as well. All tools and resources described, including their analytical and computational platform dependencies, are summarized in an overview table.
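    As a concrete illustration of the kind of processing step such tools automate, the sketch below matches detected LC-MS features against a small reference table to annotate them. The feature values, reference masses, and ppm tolerance are placeholders for illustration, not taken from any specific tool covered in the review.

```python
# Minimal sketch of a generic untargeted LC-MS annotation step: match detected
# features (m/z) against a small reference table within a ppm tolerance.
# Values and reference entries are illustrative only; real workflows query
# databases such as HMDB or MassBank.
import pandas as pd

# Hypothetical peak table produced by an upstream peak-picking tool
features = pd.DataFrame({
    "mz": [181.0712, 147.0761, 132.1015],
    "rt_sec": [312.4, 201.7, 455.1],
    "intensity": [2.1e6, 8.4e5, 3.3e5],
})

# Tiny illustrative reference of [M+H]+ masses
reference = pd.DataFrame({
    "name": ["glucose", "glutamine", "isoleucine"],
    "mz_ref": [181.0707, 147.0764, 132.1019],
})

PPM_TOL = 10.0  # mass tolerance in parts per million

def annotate(features, reference, ppm_tol=PPM_TOL):
    """Assign the closest reference metabolite within the ppm tolerance."""
    rows = []
    for _, f in features.iterrows():
        delta_ppm = (reference["mz_ref"] - f["mz"]).abs() / f["mz"] * 1e6
        best = delta_ppm.idxmin()
        name = reference.loc[best, "name"] if delta_ppm[best] <= ppm_tol else None
        rows.append({**f, "annotation": name, "error_ppm": round(delta_ppm[best], 2)})
    return pd.DataFrame(rows)

print(annotate(features, reference))
```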

    Developing computational tools for studying cancer metabolism and genomics

    The interplay between different genomic and epigenomic alterations leads to different prognoses in cancer patients. Advances in high-throughput technologies, such as gene expression profiling, next-generation sequencing, proteomics, and fluxomics, have enabled detailed molecular characterization of various tumors, yet studying this interplay is a complex computational problem. Here we set out to develop computational approaches to identify and study emerging challenges in cancer metabolism and genomics. We focus on three research questions, each addressed by a different computational approach: (1) What is the set of metabolic interactions in cancer metabolism? To this end, we generated a computational framework that quantitatively predicts synthetic dosage lethal (SDL) interactions in human metabolism by developing a new algorithmic-modeling approach. SDLs offer a promising way to selectively kill cancer cells by targeting the SDL partners of activated oncogenes in tumors, which are often difficult to target directly. (2) What is the landscape of metabolic regulation in breast cancer? To this end, we established a new framework that utilizes different data types to perform multi-omics data integration and flux prediction by incorporating machine learning techniques with Genome-Scale Metabolic Modeling (GSMM). This enabled us to study the regulation of a breast cancer cell line under different growth conditions from multiple omics data. (3) What is the power of somatic mutations derived from RNA in estimating the tumor mutational burden? Here we developed a new tool to detect somatic mutations from RNA sequencing data without a matched normal sample. To this end, we developed a machine learning pipeline that takes as input a list of single nucleotide variants and classifies them as either somatic or germline, based on read-level features as well as position-specific variant statistics and common germline databases. We showed that detecting somatic mutations directly from RNA enables the identification of expressed mutations and therefore represents a more relevant metric for estimating the tumor mutational burden, which is significantly associated with patient survival. In sum, my work has focused on developing computational methods to tackle different research questions in cancer metabolism and genomics, utilizing various types of omics data and a variety of computational approaches. These methods provide new solutions to important computational challenges, and their applications help generate promising leads for cancer research and can be utilized in many future applications analyzing novel and existing datasets.
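    For the third research question, the sketch below shows the general shape of a somatic-versus-germline variant classifier trained on read-level features; the feature set, synthetic data, and random-forest choice are illustrative assumptions, not the published pipeline.

```python
# Hedged sketch: label single nucleotide variants called from RNA-Seq as
# somatic or germline from read-level features plus membership in a germline
# database. Features, training data, and labels are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
# Illustrative features: variant allele fraction, read depth, mean base
# quality, and a flag for presence in a germline resource (e.g. dbSNP/gnomAD).
X = np.column_stack([
    rng.uniform(0.05, 1.0, n),   # variant allele fraction
    rng.integers(10, 500, n),    # read depth at the site
    rng.uniform(20, 40, n),      # mean base quality of alt reads
    rng.integers(0, 2, n),       # seen in germline database (0/1)
])
# Synthetic labels: 1 = somatic, 0 = germline (for demonstration only).
y = ((X[:, 3] == 0) & (X[:, 0] < 0.4)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("AUC on held-out variants:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```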

    A FAIR approach to genomics

    The aim of this thesis was to increase our understanding of how genome information leads to function and phenotype. To address these questions, I developed a semantic systems biology framework capable of extracting knowledge, biological concepts and emergent system properties from a vast array of publicly available genome information. In chapter 2, Empusa is described as an infrastructure that bridges the gap between the intended and actual content of a database. This infrastructure was used in chapters 3 and 4 to develop the framework. Chapter 3 describes the development of the Genome Biology Ontology Language and the GBOL stack of supporting tools, which enforce consistency within and between the GBOL definitions in the ontology (OWL) and the Shape Expressions (ShEx) language describing the graph structure. A practical implementation of a semantic systems biology framework for FAIR (de novo) genome annotation is provided in chapter 4. The semantic framework and genome annotation tool described in this chapter have been used throughout this thesis to consistently, structurally and functionally annotate and mine the microbial genomes used in chapters 5-10. In chapter 5, we introduced how the concept of protein domains and corresponding architectures can be used in comparative functional genomics to provide a fast, efficient and scalable alternative to sequence-based methods. This allowed us to effectively compare and identify functional variation between hundreds to thousands of genomes. In chapter 6, we used 432 available complete Pseudomonas genomes to study the relationship between domain essentiality and persistence; the focus was mainly on domains involved in metabolic functions. The metabolic domain space was explored for domain essentiality and persistence through the integration of heterogeneous data sources, including six published metabolic models, a vast gene expression repository and transposon data. In chapter 7, the correlation between the expected and observed genotypes was explored using 16S rRNA phylogeny and protein domain class content as input. In this chapter it was shown that domain class content yields a higher resolution than 16S rRNA when analysing evolutionary distances. Using protein domain classes, we were also able to identify signifying domains, which may have important roles in shaping a species. To demonstrate the use of semantic systems biology workflows in a biotechnological setting, we expanded the resource with more than 80,000 bacterial genomes. The genomic information of this resource was mined using a top-down approach to identify strains having the trait for 1,3-propanediol production. This resulted in the molecular identification of 49 new species, and we experimentally verified that 4 species were capable of producing 1,3-propanediol. As discussed in chapter 10, the semantic systems biology workflows developed here were successfully applied in the discovery of key elements in symbiotic relationships, in improving functional genome annotation and in comparative genomics studies. Wet/dry-lab collaboration was often at the basis of the obtained results. The success of the collaboration between the wet and dry fields prompted me to develop an undergraduate course in which the concept of the "Moist" workflow was introduced (chapter 9).
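    The domain-content comparison idea used in chapters 5 and 7 can be sketched simply: each genome is reduced to its set of protein domain classes and genomes are compared with a set-based distance. The Pfam accessions, strain names, and choice of Jaccard distance below are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch of comparative genomics via protein domain class content:
# reduce each genome to a set of domain accessions and compute pairwise
# Jaccard distances. Accessions and strain names are placeholders.
from itertools import combinations

genome_domains = {
    "strain_A": {"PF00001", "PF00069", "PF07690", "PF00528"},
    "strain_B": {"PF00001", "PF00069", "PF00528", "PF01546"},
    "strain_C": {"PF02518", "PF00072", "PF00486"},
}

def jaccard_distance(a, b):
    """1 - |intersection| / |union| of two domain-class sets."""
    return 1.0 - len(a & b) / len(a | b)

for g1, g2 in combinations(genome_domains, 2):
    d = jaccard_distance(genome_domains[g1], genome_domains[g2])
    print(f"{g1} vs {g2}: distance = {d:.2f}")
```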

    Metabolomics Data Processing and Data Analysis—Current Best Practices

    Metabolomics data analysis strategies are central to transforming raw metabolomics data files into meaningful biochemical interpretations that answer biological questions or generate novel hypotheses. This book contains a variety of papers from a Special Issue around the theme "Best Practices in Metabolomics Data Analysis". Reviews and strategies for the whole metabolomics pipeline are included, while key areas such as metabolite annotation and identification, compound and spectral databases and repositories, and statistical analysis are highlighted in various papers. Altogether, this book contains valuable information for researchers just starting their metabolomics careers as well as those who are more experienced and are looking for additional knowledge and best practices to complement key parts of their metabolomics workflows.

    Annotation and comparative analysis of fungal genomes: a hitchhiker's guide to genomics

    This thesis describes several genome-sequencing projects, such as those of the fungi Laccaria bicolor S238N-H82, Glomus intraradices DAOM 197198, Melampsora larici-populina 98AG31, Puccinia graminis, Pichia pastoris GS115 and Candida bombicola, as well as that of the haptophyte Emiliania huxleyi CCMP1516. These species are important organisms in many respects: L. bicolor and G. intraradices are symbiotic fungi that grow in association with trees and occupy important ecological niches promoting tree growth; M. larici-populina and P. graminis are two devastating fungi threatening plants; the tiny yeast P. pastoris is a major protein-production platform in the pharmaceutical industry; the biosurfactant-producing yeast C. bombicola is likely to provide a low-ecotoxicity detergent; and E. huxleyi occupies a unique phylogenetic position among the chromalveolates and contributes to the global carbon cycle. The completion of the genome sequences and the subsequent functional studies broaden our understanding of these complex biological systems and promote the species as possible model organisms. However, it is commonly observed that genome sequencing projects are launched with great enthusiasm but are often frustratingly difficult to finish. Part of the reason is the ever-increasing expectations regarding quality of delivery (both with respect to data and analyses). The Introductory Chapter aims to provide an overview of how best to conduct a genome sequencing project. It explains the importance of understanding the basic biology and genetics of the target organism. It also discusses the latest developments in next-generation high-throughput sequencing (HTS) technologies, how to handle the data, and their applications. The emergence of the new HTS technologies brings biological research to a new frontier. For instance, with the help of the new sequencing technologies, we were able to sequence the genome of our interest, namely Pichia pastoris. This tiny yeast, the analysis of which forms the bulk of this thesis, is an important heterologous production platform because its methanol assimilation properties make it ideally suited for large-scale industrial production. The unique protein assembly pathway of P. pastoris also attracts much basic research interest. We used the new HTS method to sequence and assemble the GS115 genome into four chromosomes and made it publicly available to the research community (Chapters 2 and 3). The public release of GS115 brought broader interest in the comparison of GS115 and its parental strains. By sequencing the parental strain of GS115 with different new sequencing platforms, we identified several point mutations in coding genes that likely contribute to the higher protein translocation efficiency in GS115. The sequence divergence and copy number variation of rDNA between strains also explain the difference in protein production efficiency (Chapter 4). Before 2008, the Sanger sequencing method was the only technology for obtaining high-quality complete genomes of eukaryotes. Because of the high cost of the Sanger method, for the other genome projects discussed in this thesis it was necessary to team up with many other partners and to rely on the U.S. Department of Energy Joint Genome Institute (DOE-JGI) and the Broad Institute to generate the genome sequences. The M. larici-populina strain 98AG31 and the Puccinia graminis f. sp. tritici strain CRL 75-36-700-3 are two devastating basidiomycete 'rusts' that infect poplar and wheat, respectively. Lineage-specific gene family expansions in these two rusts highlight their possible role in the obligate biotrophic lifestyle. Two large sets of effector-like small secreted proteins with different primary sequence structures were identified in each organism. In planta-induced transcriptomic data showed upregulation of these lineage-specific genes, which are likely involved in establishing the rust-host interaction. An additional immunolocalization study on M. larici-populina confirmed the accumulation of some candidate effectors in the haustoria and infection hyphae, as described in Chapter 5.
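    The sketch below illustrates one simple filter often applied when nominating effector-like small secreted protein (SSP) candidates: keep short, cysteine-rich sequences. It is a hedged toy example, not the screening procedure used in these genome projects; real analyses additionally require a predicted signal peptide (e.g. from SignalP) and expression support, and the sequences and thresholds here are placeholders.

```python
# Toy filter for candidate effector-like small secreted proteins (SSPs):
# retain short, cysteine-rich sequences. Thresholds and sequences are
# illustrative; secretion prediction is deliberately omitted.
MAX_LENGTH = 300      # amino acids
MIN_CYSTEINES = 4

proteins = {
    "prot_001": "MKFLVLAACLLTSALACGCSGSNCPSCSGQGACT",
    "prot_002": "MSTEQKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSY" * 10,
}

candidates = {
    name: seq for name, seq in proteins.items()
    if len(seq) <= MAX_LENGTH and seq.count("C") >= MIN_CYSTEINES
}
print("candidate SSPs:", sorted(candidates))
```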

    Mass Spectrometry-Based Proteomics for Studying Microbial Physiology from Isolates to Communities

    With the advent of whole genome sequencing, a new era of biology was ushered in, allowing for "systems-biology" approaches to characterizing microbial systems. The field of systems biology aims to catalogue and understand all of the biological components, their functions, and all of their interactions in a living system, as well as in communities of living systems. Systems biology can be considered an attempt to measure all of the components of a living system and then produce a data-driven model of the system. This model can then be used to generate hypotheses about how the system will respond to perturbations, which can be tested experimentally. The first step in the process is the determination of a microbial genome. This process has, to a large extent, been fully developed, with hundreds of microbial genome sequences completed and hundreds more being characterized at a breathtaking pace. Technologies to use this information and to further probe the functional components of microbes at a global level are currently being developed. The field of gene expression analysis at the transcript level is one example; it is now possible to simultaneously measure and compare the expression of thousands of mRNA products in a single experiment. The natural extension of these experiments is to simultaneously measure and compare the expression of all the proteins present in a microbial system. This is the field of proteomics. With the development of electrospray ionization, rapid tandem mass spectrometry and database-searching algorithms, mass spectrometry (MS) has become the leading approach in attempts to decipher proteomes. This research effort is very young and many challenges still exist. The goal of the work described here was to build a state-of-the-art, robust MS-based proteomics platform for the characterization of microbial proteomes from isolates to communities. The research presented here describes the successes and challenges of this objective. Proteome analyses of the metal-reducing bacterium Shewanella oneidensis and the metabolically versatile bacterium Rhodopseudomonas palustris are given as examples of the power of this technology to elucidate proteins important to different metabolic states at a global level. The analysis of microbial proteomes from isolates is only the first step of the challenge. In nature, microbial species do not act alone but are always found in mixtures with other species, where their intricate interactions are critical for survival. These studies conclude with some of the first efforts to develop methodologies for measuring the proteomes of simple controlled mixtures of microbial species, and then present the first attempt at measuring the proteome of a natural microbial community, a biofilm from an acid mine drainage system. This microbial system illustrates life at the extreme of nature, where life not only exists but flourishes in very acidic conditions with high metal concentrations and high temperatures. The technologies developed through these studies were applied to the first deep characterization of a microbial community proteome, the deciphering of the expressed proteome of the acid mine drainage biofilm. The research presented here has led to the development of a state-of-the-art, robust proteome pipeline, which can now be applied to the proteome analysis of any microbial isolate from a sequenced species. The first steps have also been made toward developing methodologies to characterize microbial proteomes in their natural environments. These developments are key to integrating proteome technologies with genome and transcriptome technologies for global characterizations of microbial species at the systems level. This will lead to an understanding of microbial physiology from a global view where, instead of analyzing one gene or protein at a time, hundreds of genes and proteins will be interrogated in microbial species as they adapt and survive in the environment.
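    To illustrate how shotgun MS results can be compared between metabolic states, the sketch below computes normalised spectral abundance factors (NSAF) from spectral counts, a common label-free ranking metric. The protein names, counts, and lengths are invented, and this is a generic formulation rather than necessarily the quantification applied in the studies summarised above.

```python
# Hedged sketch: compare protein abundance between two growth conditions using
# spectral counts normalised to protein length (NSAF). All numbers are
# invented for illustration.
def nsaf(spectral_counts, lengths):
    """Normalised spectral abundance factor per protein."""
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    return {p: v / total for p, v in saf.items()}

lengths = {"flagellin": 500, "cytochrome_c": 120, "fumarate_reductase": 600}
aerobic = {"flagellin": 40, "cytochrome_c": 15, "fumarate_reductase": 5}
anaerobic = {"flagellin": 10, "cytochrome_c": 30, "fumarate_reductase": 55}

n_aer, n_ana = nsaf(aerobic, lengths), nsaf(anaerobic, lengths)
for protein in lengths:
    ratio = n_ana[protein] / n_aer[protein]
    print(f"{protein}: anaerobic/aerobic NSAF ratio = {ratio:.2f}")
```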

    An approach to improved microbial eukaryotic genome annotation

    New sequencing technologies have considerably accelerated the rate at which genomic data are generated. One ongoing challenge is the accurate structural annotation of these novel genomes once sequenced and assembled, in particular if the organism does not have close relatives with well-annotated genomes. Whole-transcriptome sequencing (RNA-Seq) and assembly, both of which share similarities with whole-genome sequencing and assembly, respectively, have been shown to dramatically increase the accuracy of gene annotation: read coverage, inferred splice junctions and assembled transcripts provide valuable information about gene structure. Several annotation pipelines have been developed to automate structural annotation by incorporating information from RNA-Seq, as well as protein sequence similarity data, with the goal of approaching the accuracy of an expert curator. Annotation pipelines follow a similar workflow. The first step is to identify repetitive regions to prevent misinformed sequence alignments and gene predictions. The next step is to construct a database of evidence from experimental data, such as RNA-Seq mapping and assembly and protein sequence alignments, which is used to inform the generalised Hidden Markov Models of gene prediction software. The final step is to consolidate sequence alignments and gene predictions into a high-confidence consensus set. Automated pipelines are thus complex and therefore susceptible to incomplete and erroneous use of information, which can poison gene predictions and consensus model building. Here, we present an improved approach to microbial eukaryotic genome annotation, conceived by identifying and mitigating potential sources of error and bias present in available pipelines. Our approach has two main aspects. The first is to create a more complete and diverse set of extrinsic evidence to better inform gene predictions. The second is to use extrinsic evidence in tandem with predictions such that the influence of their respective biases on the consensus gene models is reduced. We benchmarked our new tool against three known pipelines, showing significant gains in gene, transcript, exon and intron sensitivity and specificity in the genome annotation of microbial eukaryotes.
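    A minimal sketch of the consensus step described above follows: each candidate gene model at a locus is scored by how well its introns are supported by RNA-Seq splice junctions and protein alignments, and the best-supported model is kept. The coordinates, weights, and scoring rule are illustrative assumptions, not the actual algorithm of the tool presented in this work.

```python
# Toy consensus builder: score predicted gene models by extrinsic support for
# their introns and keep the best-supported model per locus. All coordinates
# and weights are placeholders.
rnaseq_junctions = {(1200, 1500), (1800, 2100)}   # splice junctions from RNA-Seq
protein_junctions = {(1200, 1500)}                # introns implied by protein alignments

predictions = {
    "locus1": {
        "modelA": [(1200, 1500), (1800, 2100)],   # introns of each gene model
        "modelB": [(1250, 1500)],
    },
}

def support(introns, w_rna=1.0, w_prot=0.5):
    """Average evidence weight per intron of a gene model."""
    if not introns:
        return 0.0
    score = sum(w_rna * (i in rnaseq_junctions) + w_prot * (i in protein_junctions)
                for i in introns)
    return score / len(introns)

for locus, models in predictions.items():
    best = max(models, key=lambda m: support(models[m]))
    print(f"{locus}: keep {best} (score {support(models[best]):.2f})")
```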

    From gene to function: using new technologies for solving old problems.

    Recent advances in DNA sequencing have changed the field of genomics as well as that of proteomics, making it possible to generate gigabases of genome and transcriptome sequence data at a substantially lower cost than was possible just ten years ago. In recent years, many high-throughput technologies have been developed to interrogate various aspects of cellular processes, including sequence and structural variation and the transcriptome, epigenome, proteome and interactome. These Next Generation Sequencing (NGS) experimental technologies are more mature and accessible than the computational tools available for individual researchers to move, store, analyse and present data in a user-friendly and reproducible fashion. My research work is placed in this scenario and focuses on the analysis of data produced by NGS technologies, as well as on the development of new tools aimed at solving the different problems that arise during NGS data analysis. To achieve this aim, my group and I have dealt with several open biomedical problems in collaboration with different research groups of the Sapienza University. Some of these experiments have already given interesting results, but mostly they have represented the occasion and starting point for the development of new tools able to improve crucial steps of the analyses, solve problems deriving from the complexity of the systems and make the results easier for researchers to understand. Examples are IsomirT, a tool for small RNA-Seq analysis and isomiR identification; Phagotto, a tool for analysing deep sequencing data derived from phage-displayed libraries; and FIDEA, a web server for the functional interpretation of differential expression analysis. Recent reports have demonstrated that individual microRNAs can be heterogeneous in length and/or sequence, producing multiple mature variants that have been dubbed isomiRs. IsomirT is a useful tool to improve and simplify the search for isomiRs starting directly from the results of a miRNA-sequencing experiment. By using it, we observed the behaviour of isomiRs in different cell types and in different biological replicates. Our results indicate that the distribution of the microRNA variants is similar among replicates and different among cells and tissues, suggesting that isomiRs have a functional role in the cell. The use of NGS technologies for the analysis of antibody-selected sequences, both using phage display libraries and in vitro selection processes, is becoming increasingly popular. By using these technologies, the experimental group headed by Prof. Felici has introduced a new experimental pipeline, named PROFILER, aimed at significantly empowering the analysis of antigen-specific libraries. A key step in exploiting this idea has been the development of a new tool, Phagotto, for processing and analysing the sequencing data. PROFILER, in combination with Phagotto, seems ideally suited to streamline and guide rational antigen design, adjuvant selection, and quality control of newly produced vaccines. The publicly available web server FIDEA allows experimentalists to obtain a functional interpretation of the results derived from differential expression analysis and to test their hypotheses quickly and easily. The tool performs an enrichment analysis, i.e. an analysis of specific properties that are distributed in a non-random fashion among the up-regulated and down-regulated genes, taken both together and separately. It has been shown to be very useful and is being heavily used by scientists all over the world; more than 1500 requests for analysis have been submitted to the server in six months. Furthermore, during the course of the PhD I implemented pipelines for speeding up and optimizing protocols for NGS data analysis and applied them to biomedical projects. Of course, not all proteins have a complete functional annotation, and consequently the issue arises of predicting the function of proteins with partial or no functional annotation. This can be done both by exploiting the 3D structure of the protein and by inferring the function directly from the sequence. A real challenge, however, is the assessment of the accuracy of existing methods, and in this context the help that critical assessment experiments can give is essential. We have had the opportunity to be involved, as assessors, in the worldwide experiment CASP (Critical Assessment of protein Structure Prediction). In particular, we are involved in the assessment of residue-residue contacts, in which the participating groups provide a list of predicted contacts between residues that can hopefully be used as constraints to fold the protein. We proposed and implemented new methodologies to understand which methods work better and where future efforts should be focused.
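    As an illustration of the enrichment analysis such a server performs, the sketch below applies a hypergeometric test to ask whether a functional category is over-represented among differentially expressed genes. The counts are invented and this is a generic formulation of the test, not necessarily FIDEA's exact implementation.

```python
# Hedged sketch of a category over-representation test with the hypergeometric
# distribution. All counts are placeholders for illustration.
from scipy.stats import hypergeom

M = 20000   # genes in the background (genome)
n = 400     # background genes annotated with the category of interest
N = 300     # differentially expressed genes submitted by the user
k = 25      # submitted genes that carry the annotation

# Probability of observing >= k annotated genes by chance
p_value = hypergeom.sf(k - 1, M, n, N)
print(f"enrichment p-value: {p_value:.2e}")
```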