1,308 research outputs found

    Integration of Alignment and Phylogeny in the Whole-Genome Era

    With the development of new sequencing techniques, whole genomes of many species have become available. This huge amount of data gives rise to new opportunities and challenges. These new sequences provide valuable information on relationships among species, e.g. genome recombination and conservation. One of the principal ways to investigate such information is multiple sequence alignment (MSA). Currently, there is a large amount of MSA data on the internet, such as the UCSC genome database, but how to use this information effectively to solve classical and new problems remains largely unexplored. In this thesis, we explored how to use this information in four problems: sequence orthology search, multiple alignment improvement, short read mapping, and genome rearrangement inference. For the first problem, we developed an EM algorithm to iteratively align a query with a multiple alignment database, using information from a phylogeny relating the query species and the species in the multiple alignment. We also infer the query's location in the phylogeny. We showed that by doing alignment and phylogeny inference together, we can improve the accuracy of both. For the second problem, we developed an optimization algorithm to iteratively refine the quality of a multiple alignment. Experimental results showed that our algorithm is very stable in terms of the resulting alignments and that our method is more accurate than existing methods, i.e. Mafft, Clustal-O, and Mavid, on test data from three sets of species from the UCSC genome database. For the third problem, we developed a model, PhyMap, to align a read to a multiple alignment allowing mismatches and indels. PhyMap computes local alignments of a query sequence against a fixed multiple-genome alignment of closely related species. 
PhyMap uses a known phylogenetic tree on the species in the multiple alignment to improve the quality of its computed alignments while also estimating the placement of the query on this tree. Both theoretical computation and experimental results show that our model can differentiate between orthologous and paralogous alignments better than other popular short read mapping tools (BWA, BOWTIE and BLAST). For the fourth problem, we gave a simple genome recombination model which can express insertions, deletions, inversions, translocations and inverted translocations on aligned genome segments. We also developed an MCMC algorithm to infer the order of the query segments. We proved that using any Euclidean metric to measure the distance between two sequence orders in the tree optimization objective function leads to a degenerate solution in which the inferred order is the order of one of the leaf nodes. We also gave a graph-based formulation of the problem which can represent the probability distribution of the order of the query sequences.

    Cross-species network and transcript transfer

    Metabolic processes, signal transduction, gene regulation, as well as gene and protein expression are largely controlled by biological networks. High-throughput experiments allow the measurement of a wide range of cellular states and interactions. However, networks are often not known in detail for specific biological systems and conditions. Gene and protein annotations are often transferred from model organisms to the species of interest. Therefore, the question arises whether biological networks can be transferred between species or whether they are specific to individual contexts. In this thesis, the following aspects are investigated: (i) the conservation and (ii) the cross-species transfer of eukaryotic protein-interaction and gene regulatory (transcription factor-target) networks, as well as (iii) the conservation of alternatively spliced variants. In the simplest case, interactions can be transferred between species based solely on the sequence similarity of the orthologous genes. However, such a transfer often results either in the transfer of only a few interactions (medium/high sequence similarity threshold) or in the transfer of many speculative interactions (low sequence similarity threshold). Thus, advanced network transfer approaches also consider the annotations of orthologous genes involved in the interaction transfer, as well as features derived from the network structure, in order to enable a reliable interaction transfer, even between phylogenetically very distant species. In this work, such an approach for the transfer of protein interactions is presented (COIN). COIN uses a machine-learning model to label transferred interactions as either correctly transferred (conserved) or incorrectly transferred (not conserved). 
The comparison and cross-species transfer of regulatory networks are more difficult than the transfer of protein interaction networks, as a large fraction of the known regulations is described only in the (not machine-readable) scientific literature. In addition, compared to protein interactions, only a few conserved regulations are known, and regulatory elements appear to be strongly context-specific. In this work, the cross-species analysis of regulatory interaction networks is enabled with software tools and databases for global (ConReg) and thousands of context-specific (CroCo) regulatory interactions that are derived and integrated from the scientific literature, binding site predictions and experimental data. Genes and their protein products are the main players in biological networks. However, the fact that a gene can encode different proteins has so far been neglected. These alternative proteins can differ strongly from each other with respect to their molecular structure, function and role in networks. The identification of conserved and species-specific splice variants and the integration of variants in network models will allow a more complete cross-species transfer and comparison of biological networks. With ISAR we support the cross-species transfer and comparison of alternative variants by introducing a gene-structure aware (i.e. exon-intron structure aware) multiple sequence alignment approach for variants from orthologous and paralogous genes. The methods presented here and the accompanying databases allow the cross-species transfer of biological networks, the comparison of thousands of context-specific networks, and the cross-species comparison of alternatively spliced variants. 
Thus, they can be used as a starting point for the understanding of regulatory and signaling mechanisms in many biological systems.

    Statistical methods for biological sequence analysis for DNA binding motifs and protein contacts

    Over the last decades a revolution in novel measurement techniques has permeated the biological sciences, filling the databases with unprecedented amounts of data ranging from genomics, transcriptomics, proteomics and metabolomics to structural and ecological data. In order to extract insights from this vast quantity of data, computational and statistical methods are nowadays crucial tools in the toolbox of every biological researcher. In this thesis I summarize my contributions in two data-rich fields of the biological sciences: transcription factor binding to DNA and protein structure prediction from protein sequences with shared evolutionary ancestry. In the first part of my thesis I introduce our work towards a web server for analysing transcription factor binding data with Bayesian Markov Models. In contrast to classical PWM or di-nucleotide models, Bayesian Markov models can capture complex inter-nucleotide dependencies that can arise from shape-readout and alternative binding modes. In addition to giving access to our methods in an easy-to-use, intuitive web interface, we provide our users with novel tools and visualizations to better evaluate the biological relevance of the inferred binding motifs. We hope that our tools will prove useful for investigating weak and complex transcription factor binding motifs which cannot be predicted accurately with existing tools. The second part discusses a statistical attempt to correct for the phylogenetic bias arising in co-evolution methods applied to the contact prediction problem. Co-evolution methods revolutionized the protein-structure prediction field more than 10 years ago and, until very recently, retained their importance as crucial input features to deep neural networks. 
As the co-evolution information is extracted from evolutionarily related sequences, we investigated whether the phylogenetic bias in the signal can be corrected for in a principled way, using a variation of Felsenstein's tree-pruning algorithm in combination with an independent-pair assumption to derive pairwise amino acid counts that are corrected for the evolutionary history. Unfortunately, the contact prediction derived from our corrected pairwise amino acid counts did not yield a competitive performance. 2021-09-2

    Upcoming challenges for multiple sequence alignment methods in the high-throughput era

    This review focuses on recent trends in multiple sequence alignment tools. It describes the latest algorithmic improvements including the extension of consistency-based methods to the problem of template-based multiple sequence alignments. Some results are presented suggesting that template-based methods are significantly more accurate than simpler alternative methods. The validation of existing methods is also discussed at length with the detailed description of recent results and some suggestions for future validation strategies. The last part of the review addresses future challenges for multiple sequence alignment methods in the genomic era, most notably the need to cope with very large sequences, the need to integrate large amounts of experimental data, the need to accurately align non-coding and non-transcribed sequences and finally, the need to integrate many alternative methods and approaches

    Probabilistic Phylogenetic Inference with Insertions and Deletions

    A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models only account for residue substitution events. We describe a probabilistic model of a multiple sequence alignment that accounts for insertion and deletion events in addition to substitutions, given a phylogenetic tree, using a rate matrix augmented by the gap character. Starting from a continuous Markov process, we construct a non-reversible generative (birth–death) evolutionary model for insertions and deletions. The model assumes that insertion and deletion events occur one residue at a time. We apply this model to phylogenetic tree inference by extending the program dnaml in phylip. Using standard benchmarking methods on simulated data and a new “concordance test” benchmark on real ribosomal RNA alignments, we show that the extended program dnamlΔ improves accuracy relative to the usual approach of ignoring gaps, while retaining the computational efficiency of the Felsenstein peeling algorithm
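The core computation mentioned in this abstract, the probability of one alignment column given a tree, can be sketched with Felsenstein's peeling (pruning) recursion over a five-letter alphabet in which the gap is an ordinary state. This is only an illustration under strong assumptions: the rate matrix here is a hypothetical equal-rates (Jukes-Cantor-like) matrix with a uniform root distribution, not the non-reversible birth-death model of the paper, and the names `transition_prob`, `partial_likelihoods` and `column_likelihood` are invented for the example.

```python
import math
from itertools import product

STATES = "ACGT-"  # four nucleotides plus the gap treated as a fifth state


def transition_prob(i, j, t, mu=1.0):
    """P(state i -> state j) after time t under an equal-rates model.

    Assumption: every off-diagonal rate equals mu, so the matrix
    exponential has the closed form used below.
    """
    k = len(STATES)
    decay = math.exp(-k * mu * t)
    if i == j:
        return 1.0 / k + (1.0 - 1.0 / k) * decay
    return (1.0 - decay) / k


def partial_likelihoods(node, column):
    """Peeling recursion.

    A leaf is a species name; an internal node is a tuple
    (left_child, left_branch_length, right_child, right_branch_length).
    Returns, for each state at this node, the probability of the
    observed characters below it.
    """
    if isinstance(node, str):
        return [1.0 if s == column[node] else 0.0 for s in STATES]
    left, t_left, right, t_right = node
    below_left = partial_likelihoods(left, column)
    below_right = partial_likelihoods(right, column)
    n = len(STATES)
    result = []
    for i in range(n):
        via_left = sum(transition_prob(i, j, t_left) * below_left[j] for j in range(n))
        via_right = sum(transition_prob(i, j, t_right) * below_right[j] for j in range(n))
        result.append(via_left * via_right)
    return result


def column_likelihood(tree, column):
    """Likelihood of one alignment column, with a uniform root distribution."""
    return sum(partial_likelihoods(tree, column)) / len(STATES)


# Hypothetical three-species tree and one column containing a gap.
tree = (("human", 0.1, "mouse", 0.2), 0.05, "chicken", 0.3)
print(column_likelihood(tree, {"human": "A", "mouse": "A", "chicken": "-"}))

# Sanity check: the likelihoods of all possible columns sum to one.
total = sum(
    column_likelihood(tree, dict(zip(("human", "mouse", "chicken"), col)))
    for col in product(STATES, repeat=3)
)
print(total)
```

The recursion visits each node once, which is why treating the gap as an extra state keeps the cost of the standard peeling algorithm; the likelihood of a whole alignment under site independence would be the product of such column likelihoods.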

    Current methods for automated filtering of multiple sequence alignments frequently worsen single-gene phylogenetic inference

    Phylogenetic inference is generally performed on the basis of multiple sequence alignments (MSA). Because errors in an alignment can lead to errors in tree estimation, there is a strong interest in identifying and removing unreliable parts of the alignment. In recent years several automated filtering approaches have been proposed, but despite their popularity, a systematic and comprehensive comparison of different alignment filtering methods on real data has been lacking. Here, we extend and apply recently introduced phylogenetic tests of alignment accuracy on a large number of gene families and contrast the performance of unfiltered versus filtered alignments in the context of single-gene phylogeny reconstruction. Based on multiple genome-wide empirical and simulated data sets, we show that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs. Furthermore, alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong. We confirm that our findings hold for a wide range of parameters and methods. Although our results suggest that light filtering (up to 20% of alignment positions) has little impact on tree accuracy and may save some computation time, contrary to widespread practice, we do not generally recommend the use of current alignment filtering methods for phylogenetic inference. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms
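To make "light filtering" concrete, the following sketch removes the gappiest alignment columns but never more than 20% of all positions. The gap-fraction criterion, the function name `filter_gappy_columns` and both parameters are assumptions chosen for illustration; the abstract does not specify which filtering methods were evaluated, and this is not one of them.

```python
def filter_gappy_columns(alignment, gap_threshold=0.5, max_removed_fraction=0.2):
    """Drop columns whose gap fraction exceeds gap_threshold.

    alignment maps sequence names to equal-length aligned strings.
    At most max_removed_fraction of the columns are removed, gappiest first.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")

    # Fraction of gap characters in each column.
    gap_fraction = [
        sum(alignment[name][col] == "-" for name in names) / len(names)
        for col in range(length)
    ]

    # Candidate columns, gappiest first (ties keep left-to-right order).
    candidates = sorted(
        (col for col in range(length) if gap_fraction[col] > gap_threshold),
        key=lambda col: gap_fraction[col],
        reverse=True,
    )

    # Cap the number of removed columns to keep the filtering light.
    budget = int(max_removed_fraction * length)
    removed = set(candidates[:budget])
    kept = [col for col in range(length) if col not in removed]

    return {name: "".join(alignment[name][col] for col in kept) for name in names}


# Hypothetical toy alignment with three gappy columns.
toy = {
    "seq_a": "A-CG-TAC-G",
    "seq_b": "A-CGTTAC-G",
    "seq_c": "ATCG-TACAG",
}
for name, row in filter_gappy_columns(toy).items():
    print(name, row)
```

Capping the removed fraction mirrors the abstract's observation that filtering up to 20% of positions has little impact on tree accuracy, while heavier filtering is the regime it cautions against.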

    Light Organ Photosensitivity in Deep-Sea Shrimp May Suggest a Novel Role in Counterillumination

    Extraocular photoreception, the ability to detect and respond to light outside of the eye, has not been previously described in deep-sea invertebrates. Here, we investigate photosensitivity in the bioluminescent light organs (photophores) of deep-sea shrimp, an autogenic system in which the organism possesses the substrates and enzymes to produce light. Through the integration of transcriptomics, in situ hybridization and immunohistochemistry we find evidence for the expression of opsins and phototransduction genes known to play a role in light detection in most animals. Subsequent shipboard light exposure experiments showed ultrastructural changes in the photophore similar to those seen in crustacean eyes, providing further evidence that photophores are light sensitive. In many deep-sea species, it has long been documented that photophores emit light to aid in counterillumination – a dynamic form of camouflage that requires adjusting the organ’s light intensity to “hide” their silhouettes from predators below. However, it remains a mystery how animals fine-tune their photophore luminescence to match the intensity of downwelling light. Photophore photosensitivity allows us to reconsider the organ’s role in counterillumination - not only in light emission but also light detection and regulation

    New Halonotius Species Provide Genomics-Based Insights Into Cobalamin Synthesis in Haloarchaea

    Hypersaline aquatic and terrestrial ecosystems display a cosmopolitan distribution. These environments teem with microbes and harbor a plethora of prokaryotic lineages that evaded ecological characterization due to the prior inability to cultivate them or to access their genomic information. In order to close the current knowledge gap, we performed two sampling and isolation campaigns in the saline soils of the Odiel Saltmarshes and the salterns of Isla Cristina (Huelva, Spain). From the isolated haloarchaeal strains subjected to high-throughput phylogenetic screening, two were chosen (F15BT and F9-27T) for physiological and genomic characterization because of their relatedness to the genus Halonotius. Comparative genomic analyses were carried out between the isolated strains and the genomes of previously described species Halonotius pteroides CECT 7525T, Halonotius aquaticus F13-13T and environmentally recovered metagenome-assembled representatives of the genus Halonotius. The topology of the phylogenomic tree showed agreement with the phylogenetic ones based on 16S rRNA and rpoB′ genes, and together with average amino acid and nucleotide identities suggested the two strains as novel species within the genus. We propose the names Halonotius terrestris sp. nov. (type strain F15BT = CECT 9688T = CCM 8954T) and Halonotius roseus sp. nov. (type strain F9-27T = CECT 9745T = CCM 8956T) for these strains. Comparative genomic analyses within the genus highlighted a typical salt-in signature, characterized by acidic proteomes with low isoelectric points, and indicated heterotrophic aerobic lifestyles. Genome-scale metabolic reconstructions revealed that the newly proposed species encode all the necessary enzymatic reactions involved in cobalamin (vitamin B12) biosynthesis. 
Based on the worldwide distribution of the genus and its abundance in hypersaline habitats we postulate that its members perform a critical function by being able to provide “expensive” commodities (i.e., vitamin B12) to the halophilic microbial communities at large. Funding: Spain, MINECO Project CGL2017-83385-P; Spain, Junta de Andalucía BIO213; Spain, Ministerio de Educación, Cultura y Deporte FEMS-GO-2018-139; Spain, FEDER FPU14/0512

    Computational analyses of the plant-associated microbiota

    Plants harbor phylogenetically diverse microbes on the exterior and interior of all organs, and they form intimate relationships with the colonizing microbiota. Multi-omics dramatically facilitates and expands our knowledge of plant-microbiota interactions and associations. To establish causalities, manipulation of the microbiota populating plants under strictly controlled conditions is a necessity, which has driven the development of reductionist approaches for studying plant-microbiota interactions, including the process of deconstruction and reconstruction of the plant microbiota. Deconstruction of the plant microbiota requires the establishment of genome-indexed microbial culture collections representing the plant microbiota of interest. The reconstruction step is to design synthetic microbial communities (SynComs) by mixing strains from the culture collections and inoculating them onto the plants. In this dissertation, I introduce Rbec, a software tool developed specifically to characterize microbial composition accurately in SynComs subjected to amplicon sequencing, by both correcting PCR/sequencing errors and identifying marker gene paralogues within the same strain. Rbec also provides a novel feature for contamination identification in SynCom experiments, which has been overlooked in previous studies but is necessary to verify the robustness of the readouts from SynCom experiments. Further, with the established pipelines for analyzing amplicon sequencing data from either natural or synthetic communities, I analyzed the microbial compositions from different studies, including the study of the host preference of Arabidopsis thaliana and Lotus japonicus commensals, the phycosphere microbiota, the effects of plant metabolites on soil microbiota and how bacterial antibiotics shape root microbiota. Genome-indexed microbial culture collections allow us to study the functional capacities of microbiota. 
We systematically analyzed the biosynthetic gene clusters and the spread of antimicrobial 2,4-diacetylphloroglucinol synthetic gene clusters in Pseudomonas in established culture collections. Moreover, I studied recent horizontal gene transfer (HGT) in bacteria from different culture collections assembled from different host plants and sites. This provides an atlas of the active taxa involved in HGT and the frequently transferred functional orthologues in plant-associated niches. In addition, it reveals the selection forces exerted on different taxa in the relevant environments. In summary, our work aimed to advance reductionist approaches with respect to computational analyses. We not only introduced a new computational method for accurately profiling microbial compositions in SynComs, but also dug deeper into the genome-indexed culture collections by making full use of genome sequences. The integrated genome information of the plant microbiota provides the opportunity to study functional diversity, evolutionary trajectories and genomic content related to host adaptation. However, with the increased volume of available genomes, novel methodology will be required to process large datasets quickly and in a computationally efficient way.