54 research outputs found

    EDGAR 2.0: an enhanced software platform for comparative gene content analyses.

    The rapidly increasing availability of microbial genome sequences has led to a growing demand for bioinformatics software tools that support functional analysis based on the comparison of closely related genomes. Comparative approaches at the gene level make it possible to gain insights into the core genes, which represent the set of shared features for a set of organisms under study. Conversely, singleton genes can be identified to elucidate the specific properties of an individual genome. Since its initial publication, the EDGAR platform has become one of the most established software tools in the field of comparative genomics. Over recent years, the software has been continuously improved and a large number of new analysis features have been added. For the new version, EDGAR 2.0, the gene orthology estimation approach was newly designed and completely re-implemented. Among other new features, EDGAR 2.0 provides extended phylogenetic analysis features such as AAI (Average Amino Acid Identity) and ANI (Average Nucleotide Identity) matrices, genome set size statistics, and modernized visualizations such as interactive synteny plots and Venn diagrams. The software thereby supports a quick and user-friendly survey of evolutionary relationships between microbial genomes and simplifies the process of obtaining new biological insights into their differential gene content. All features are offered to the scientific community via a web-based and therefore platform-independent user interface, which allows easy browsing of precomputed datasets. The web server is accessible at http://edgar.computational.bio
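
The core-gene and singleton concepts described in this abstract can be illustrated with plain set operations. This is a minimal sketch, not EDGAR's implementation: it assumes orthology has already been resolved, so each genome is given as a set of ortholog-group IDs. The function name `core_and_singletons` and the genome and group names are hypothetical.

```python
from typing import Dict, Set, Tuple


def core_and_singletons(
    gene_sets: Dict[str, Set[str]]
) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Compute core genes and per-genome singletons.

    gene_sets maps a genome name to the set of ortholog-group IDs
    found in that genome (assumed to be precomputed elsewhere).
    """
    genomes = list(gene_sets)
    # Core: ortholog groups present in every genome.
    core = set.intersection(*(gene_sets[g] for g in genomes))
    # Singletons: groups present in exactly one genome.
    singletons = {}
    for g in genomes:
        others = set().union(*(gene_sets[o] for o in genomes if o != g))
        singletons[g] = gene_sets[g] - others
    return core, singletons


# Hypothetical toy input: three genomes, five ortholog groups.
example = {
    "genome_A": {"og1", "og2", "og3"},
    "genome_B": {"og1", "og2", "og4"},
    "genome_C": {"og1", "og5"},
}
core, singletons = core_and_singletons(example)
```

In this toy input, `og2` is shared by two of the three genomes, so it is neither a core gene nor a singleton.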

    Comparative genomics and transcriptomics of lineages I, II, and III strains of Listeria monocytogenes

    BACKGROUND: Listeria monocytogenes is a food-borne pathogen that causes infections with a high mortality rate and has served as an invaluable model for intracellular parasitism. Here, we report complete genome sequences for two L. monocytogenes strains belonging to serotype 4a (L99) and 4b (CLIP80459), and transcriptomes of representative strains from lineages I, II, and III, thereby permitting in-depth comparison of genome- and transcriptome-based data from three lineages of L. monocytogenes. Lineage III, represented by the 4a L99 genome, is known to contain strains less virulent for humans. RESULTS: The genome analysis of the weakly pathogenic L99 serotype 4a provides extensive evidence of virulence gene decay, including loss of several important surface proteins. The 4b CLIP80459 genome, unlike the previously sequenced 4b F2365 genome, harbours an intact inlB invasion gene. These lineage I strains are characterized by the lack of prophage genes, as they share only a single prophage locus with other L. monocytogenes genomes 1/2a EGD-e and 4a L99. Comparative transcriptome analysis during intracellular growth uncovered adaptive expression level differences in lineages I, II and III of Listeria, notable amongst which was a strong intracellular induction of flagellar genes in strain 4a L99 compared to the other lineages. Furthermore, extensive differences between strains are manifest at levels of metabolic flux control and phosphorylated sugar uptake. Intriguingly, prophage gene expression was found to be a hallmark of intracellular gene expression. Deletion mutants in the single shared prophage locus of lineage II strain EGD-e 1/2a, the lma operon, revealed severe attenuation of virulence in a murine infection model. CONCLUSION: Comparative genomics and transcriptome analysis of L. monocytogenes strains from three lineages implicate prophage genes in intracellular adaptation and indicate that gene loss and decay may have led to the emergence of attenuated lineages

    FdeC, a Novel Broadly Conserved Escherichia coli Adhesin Eliciting Protection against Urinary Tract Infections

    The increasing antibiotic resistance of pathogenic Escherichia coli species and the absence of a pan-protective vaccine pose major health concerns. We recently identified, by subtractive reverse vaccinology, nine Escherichia coli antigens that protect mice from sepsis. In this study, we characterized one of them, ECOK1_0290, named FdeC (factor adherence E. coli) for its ability to mediate E. coli adhesion to mammalian cells and extracellular matrix. This adhesive propensity was consistent with the X-ray structure of one of the FdeC domains that shows a striking structural homology to Yersinia pseudotuberculosis invasin and enteropathogenic E. coli intimin. Confocal imaging analysis revealed that expression of FdeC on the bacterial surface is triggered by interaction of E. coli with host cells. This phenotype was also observed in bladder tissue sections derived from mice infected with an extraintestinal strain. Indeed, we observed that FdeC contributes to colonization of the bladder and kidney, with the wild-type strain outcompeting the fdeC mutant in cochallenge experiments. Finally, intranasal mucosal immunization with recombinant FdeC significantly reduced kidney colonization in mice challenged transurethrally with uropathogenic E. coli, supporting a role for FdeC in urinary tract infections

    Modelos de compressão e ferramentas para dados ómicos

    The rapid development of high-throughput sequencing technologies and the consequent generation of huge volumes of data have revolutionized biological research and discovery. Motivated by that, we investigate in this thesis methods capable of providing an efficient representation of omics data in a compressed or encrypted manner, and then we employ them to analyze omics data. First and foremost, we describe a number of measures for quantifying information in and between omics sequences. Then, we present finite-context models (FCMs), substitution-tolerant Markov models (STMMs) and a combination of the two, which are specialized in modeling biological data, for data compression and analysis. To ease the storage of the aforementioned data deluge, we design two lossless data compressors for genomic data and one for proteomic data. The methods work on the basis of (a) a combination of FCMs and STMMs or (b) the mentioned combination along with repeat models and a competitive prediction model. Tests on various synthetic and real datasets showed that they outperform previously proposed methods in terms of compression ratio. Privacy of genomic data is a topic that has recently come into focus with developments in the field of personalized medicine. We propose a tool that is able to represent genomic data in a securely encrypted fashion and, at the same time, is able to compact FASTA and FASTQ sequences by a factor of three. It employs AES encryption accompanied by a shuffling mechanism for improving data security. The results show it is faster than general-purpose and special-purpose algorithms. Compression techniques can be employed for analysis of omics data. With this in mind, we investigate the identification of unique regions in a species with respect to close species, which can give us an insight into evolutionary traits. 
For this purpose, we design two alignment-free tools that can accurately find and visualize distinct regions among two collections of DNA or protein sequences. Testing on modern humans with respect to Neanderthals, we found a number of regions absent in Neanderthals that may express new functionalities associated with the evolution of modern humans. Finally, we investigate the identification of genomic rearrangements, which have important roles in genetic disorders and cancer, by employing a compression technique. For this purpose, we design a tool that is able to accurately localize and visualize small- and large-scale rearrangements between two genomic sequences. The results of applying the proposed tool to several synthetic and real datasets were consistent with the results partially reported by wet-laboratory approaches, e.g., FISH analysis. Doctoral Programme in Informatics Engineering
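
The finite-context models (FCMs) named in this abstract can be illustrated with a small adaptive order-k model that estimates how many bits per base a sequence would cost to encode. This is a minimal sketch under stated assumptions, not the thesis's compressor: the function name `fcm_bits_per_base`, the additive smoothing parameter `alpha`, and the fixed `ACGT` alphabet are all choices made here for illustration, and no STMMs, model mixing, or arithmetic coding are included.

```python
import math
from collections import defaultdict


def fcm_bits_per_base(seq: str, k: int = 2, alpha: float = 1.0,
                      alphabet: str = "ACGT") -> float:
    """Average code length (bits/symbol) under an adaptive order-k model.

    Each symbol's probability is estimated from counts of what followed
    the same k-symbol context earlier in the sequence, with additive
    smoothing (alpha) so unseen symbols never get probability zero.
    """
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    n = 0
    for i in range(k, len(seq)):
        ctx, sym = seq[i - k:i], seq[i]
        ctx_counts = counts[ctx]
        ctx_total = sum(ctx_counts.values())
        p = (ctx_counts[sym] + alpha) / (ctx_total + alpha * len(alphabet))
        total_bits += -math.log2(p)
        ctx_counts[sym] += 1  # update the model only after "encoding" sym
        n += 1
    return total_bits / n if n else 0.0


# A highly repetitive sequence is cheap to encode; the model learns
# quickly that each 2-base context has a single continuation.
repetitive_cost = fcm_bits_per_base("ACGT" * 200, k=2)
```

A value well below 2 bits per base indicates that the model has found structure; an unpredictable sequence over four symbols stays close to 2.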

    A detailed view of the intracellular transcriptome of Listeria monocytogenes in murine macrophages using RNA-seq

    Listeria monocytogenes is a bacterial pathogen and the causative agent of the foodborne infection listeriosis, which is mainly a threat to pregnant, elderly or immunocompromised individuals. Due to its ability to invade and colonize diverse eukaryotic cell types, including cells from invertebrates, L. monocytogenes has become a well-established model organism for intracellular growth. Almost ten years ago, we and others presented the first whole-genome microarray-based intracellular transcriptome of L. monocytogenes. With the advent of newer technologies addressing transcriptomes in greater detail, we revisit this work and analyze the intracellular transcriptome of L. monocytogenes during growth in murine macrophages using a deep-sequencing-based approach. We detected 656 differentially expressed genes, of which 367 were upregulated during intracellular growth in macrophages compared to extracellular growth in BHI. This study confirmed ~64% of all regulated genes previously identified by microarray analysis. Many of the regulated genes that were detected in the current study involve transporters for various metals and ions as well as complex sugars such as mannose. We also report changes in antisense transcription, especially upregulation during intracellular bacterial survival. A notable finding was the detection of regulatory changes for a subset of temperate A118-like prophage genes, thereby shedding light on the transcriptional profile of this bacteriophage during intracellular growth. In total, our study provides an updated genome-wide view of the transcriptional landscape of L. monocytogenes during intracellular growth and represents a rich resource for future detailed analysis

    Universal Stress Proteins Are Important for Oxidative and Acid Stress Resistance and Growth of Listeria monocytogenes EGD-e In Vitro and In Vivo

    Background: Pathogenic bacteria maintain a multifaceted apparatus to resist damage caused by external stimuli. As part of this, the universal stress protein A (UspA) and its homologues, initially discovered in Escherichia coli K-12, were shown to possess an important role in stress resistance and growth in several bacterial species. Methods and Findings: We conducted a study to assess the role of three homologous proteins containing the UspA domain in the facultative intracellular human pathogen Listeria monocytogenes under different stress conditions. The growth properties of three UspA deletion mutants (Δlmo0515, Δlmo1580 and Δlmo2673) were examined either following challenge with a sublethal concentration of hydrogen peroxide or under acidic conditions. We also examined their ability for intracellular survival within murine macrophages. Virulence and growth of usp mutants were further characterized in invertebrate and vertebrate infection models. Tolerance to acidic stress was clearly reduced in Δlmo1580 and Δlmo0515, while oxidative stress dramatically diminished growth in all mutants. Survival within macrophages was significantly decreased in Δlmo1580 and Δlmo2673 as compared to the wild-type strain. Viability of infected Galleria mellonella larvae was markedly higher when injected with Δlmo1580 or Δlmo2673 as compared to wild-type strain inoculation, indicating impaired virulence of bacteria lacking these usp genes. Finally, we observed severely restricted growth of all chromosomal deletion mutants in mouse livers and spleens as compared to the load of wild-type bacteria following infection. Conclusion: This work provides distinct evidence that universal stress proteins are strongly involved in listerial stress response and survival under both in vitro and in vivo growth conditions

    Compressão e análise de dados genómicos

    Doctorate in Informatics. Genomic sequences are large codified messages describing most of the structure of all known living organisms. Since the presentation of the first genomic sequence, a huge amount of genomics data has been generated, with diversified characteristics, rendering the data deluge phenomenon a serious problem in most genomics centers. As such, most of the data are discarded (when possible), while others are compressed using general-purpose algorithms, often attaining modest data reduction results. Several specific algorithms have been proposed for the compression of genomic data, but unfortunately only a few of them have been made available as usable and reliable compression tools. Of those, most have been developed for some specific purpose. In this thesis, we propose a compressor for genomic sequences of multiple natures, able to function in a reference or reference-free mode. Besides, it is very flexible and can cope with diverse hardware specifications. It uses a mixture of finite-context models (FCMs) and eXtended FCMs. The results show improvements over state-of-the-art compressors. Since the compressor can be seen as an unsupervised alignment-free method to estimate the algorithmic complexity of genomic sequences, it is the ideal candidate to perform analysis of and between sequences. Accordingly, we define a way to approximate directly the Normalized Information Distance, aiming to identify evolutionary similarities within and between species. Moreover, we introduce a new concept, the Normalized Relative Compression, that is able to quantify and infer new characteristics of the data, previously undetected by other methods. We also investigate local measures, being able to locate specific events, using complexity profiles. Furthermore, we present and explore a method based on complexity profiles to detect and visualize genomic rearrangements between sequences, identifying several insights of the genomic evolution of humans. 
Finally, we introduce the concept of relative uniqueness and apply it to the Ebolavirus, identifying three regions that appear in all the virus sequences from the outbreak but nowhere in the human genome. In fact, we show that these sequences are sufficient to classify different sub-species. Also, we identify regions in human chromosomes that are absent from the DNA of closely related primates, specifying novel traits in human uniqueness.
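
The idea of using a compressor to approximate the Normalized Information Distance can be illustrated with the widely used Normalized Compression Distance (NCD). This is a minimal sketch and not the thesis's method: it swaps in the general-purpose `zlib` compressor for the FCM-based compressor described in the abstract, and the helper names `compressed_size` and `ncd` as well as the random test sequences are assumptions made for illustration.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for C(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    Values near 0 suggest similar inputs; values near 1 suggest unrelated ones.
    """
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical data: two independent random DNA strings of 2000 bases.
rng = random.Random(42)
seq_a = "".join(rng.choices("ACGT", k=2000)).encode()
seq_b = "".join(rng.choices("ACGT", k=2000)).encode()

same = ncd(seq_a, seq_a)       # a sequence compared with itself
different = ncd(seq_a, seq_b)  # two unrelated sequences
```

A general-purpose compressor such as zlib captures far less of the structure of DNA than a specialized model, so the distances it yields are only a rough proxy.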

    Accurate and highly interpretable prediction of gene expression from histone modifications

    Histone Mark Modifications (HMs) are crucial actors in gene regulation, as they actively remodel chromatin to modulate transcriptional activity: aberrant combinatorial patterns of HMs have been connected with several diseases, including cancer. HMs are, however, reversible modifications: understanding their role in disease would allow the design of 'epigenetic drugs' for specific, non-invasive treatments. Standard statistical techniques were not entirely successful in extracting representative features from raw HM signals over gene locations. On the other hand, deep learning approaches allow for effective automatic feature extraction, but at the expense of model interpretation

    Gene function finding through cross-organism ensemble learning

    Background: Structured biological information about genes and proteins is a valuable resource to improve discovery and understanding of complex biological processes via machine learning algorithms. Gene Ontology (GO) controlled annotations describe, in a structured form, features and functions of genes and proteins of many organisms. However, such valuable annotations are not always reliable and sometimes are incomplete, especially for rarely studied organisms. Here, we present GeFF (Gene Function Finder), a novel cross-organism ensemble learning method able to reliably predict new GO annotations of a target organism from GO annotations of another, evolutionarily related and better studied, source organism. Results: Using a supervised method, GeFF predicts unknown annotations from random perturbations of existing annotations. The perturbation consists of randomly deleting a fraction of known annotations in order to produce a reduced annotation set. The key idea is to train a supervised machine learning algorithm with the reduced annotation set to predict, namely to rebuild, the original annotations. The resulting prediction model, in addition to accurately rebuilding the original known annotations for an organism from their perturbed version, also effectively predicts new unknown annotations for the organism. Moreover, the prediction model is also able to discover new unknown annotations in different target organisms without retraining. We combined our novel method with different ensemble learning approaches and compared them to each other and to an equivalent single-model technique. We tested the method with five different organisms using their GO annotations: Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum. 
The outcomes demonstrate the effectiveness of the cross-organism ensemble approach, which can be customized with a trade-off between the desired number of predicted new annotations and their precision. A Web application to browse both the input annotations used and the predicted ones, choosing the ensemble prediction method to use, is publicly available at http://tiny.cc/geff/. Conclusions: Our novel cross-organism ensemble learning method provides reliable predicted novel gene annotations, i.e., functions, ranked according to an associated likelihood value. They are very valuable both to speed up annotation curation, focusing it on the prioritized new annotations predicted, and to complement the known annotations available
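
The perturbation step described in this abstract, randomly deleting a fraction of known annotations to produce a reduced training set, can be sketched as follows. This is an illustrative assumption about the data layout, not GeFF's code: annotations are represented here as (gene ID, GO term ID) pairs, and the function name `perturb`, the `seed` parameter, and the placeholder IDs are hypothetical.

```python
import random
from typing import Set, Tuple

Annotation = Tuple[str, str]  # (gene ID, GO term ID)


def perturb(annotations: Set[Annotation], fraction: float,
            seed: int = 0) -> Set[Annotation]:
    """Return a reduced annotation set with `fraction` of entries removed.

    A model would then be trained to rebuild the original set from this
    reduced one; that training step is outside the scope of this sketch.
    """
    rng = random.Random(seed)
    n_delete = int(round(fraction * len(annotations)))
    # Sort first so the sampled subset is reproducible for a given seed.
    to_delete = set(rng.sample(sorted(annotations), n_delete))
    return annotations - to_delete


# Placeholder annotations: 10 hypothetical gene / GO-term pairs.
known = {(f"gene{i}", f"GO:{i:07d}") for i in range(10)}
reduced = perturb(known, fraction=0.3)
```

The deleted pairs serve as the targets the trained model must recover, which is what lets the same model later propose annotations that were never recorded.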

    QUANTITATIVE METHODS FOR GENOMICS AND LINEAGE TRACING DATA

    The thesis consists of two parts. The first part discusses a new method for quantifying functional conservation of DNA elements. Evolutionary conservation is an important tool for identifying functional DNA elements in genomes and provides a foundation for studying human diseases using animal models. Conservation in DNA sequences does not necessarily imply conservation in dynamic functional activities. Quantifying functional conservation, however, has been constrained by limited availability of functional genomic data from well-matched samples across species. Here we present FUNCODE, a solution to scoring functional conservation of DNA elements by integrating data across species without requiring manually or exactly matched samples. By using in silico sample matching, FUNCODE more accurately scores functional conservation and offers scalability to new samples and ability to score different data modalities. Applying it to the Encyclopedia of DNA Elements (ENCODE), we systematically scored human-mouse conservation of DNA regulatory elements based on chromatin accessibility. We further demonstrate utility of FUNCODE in finding new cis-regulatory elements, identifying discoveries translatable across species, and cross-species single-cell genomic data integration. The second part of the thesis studies inference of cell state dynamics using lineage barcode data. Natural and induced somatic mutations that accumulate in the genome during development record the phylogenetic relationships of cells; however, whether these lineage barcodes can capture the dynamics of complex progenitor fields remains unclear. Here, we introduce quantitative fate mapping, an approach to simultaneously map the fate and quantify the commitment time, commitment bias, and population size of multiple progenitor groups during development based on a time-scaled phylogeny of their descendants. 
To reconstruct time-scaled phylogenies from lineage barcodes, we introduce Phylotime, a scalable maximum likelihood clustering approach based on a generalizable barcoding mutagenesis model. We validate these approaches using realistically-simulated barcoding results as well as experimental results from a barcoding stem cell line. We further establish criteria for the minimum number of cells that must be analyzed for robust quantitative fate mapping. Overall, this work demonstrates how lineage barcodes, natural or synthetic, can be used to obtain quantitative fate maps, thus enabling analysis of progenitor dynamics long after embryonic development in any organism