201 research outputs found

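    Computational methods for haplotype inference and for assessing the significance of local alignments (Laskennallisia menetelmiä haplotyypien ennustamiseen ja paikallisten rinnastusten merkittävyyden arviointiin)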

    This thesis, which consists of an introduction and four peer-reviewed original publications, studies the problems of haplotype inference (haplotyping) and local alignment significance. The problems studied here belong to the broad area of bioinformatics and computational biology. The presented solutions are computationally fast and accurate, which makes them practical in high-throughput sequence data analysis. Haplotype inference is a computational problem where the goal is to estimate haplotypes from a sample of genotypes as accurately as possible. This problem is important because the direct measurement of haplotypes is difficult, whereas genotypes are easier to quantify. Haplotypes are key players when studying, for example, the genetic causes of diseases. In this thesis, three methods are presented for the haplotype inference problem: HaploParser, HIT, and BACH. HaploParser is based on a combinatorial mosaic model and hierarchical parsing that together mimic recombinations and point mutations in a biologically plausible way. In this mosaic model, the current population is assumed to have evolved from a small founder population. Thus, the haplotypes of the current population are recombinations of the (implicit) founder haplotypes with some point mutations. HIT (Haplotype Inference Technique) uses a hidden Markov model for haplotypes, and efficient algorithms are presented for learning this model from genotype data. The model structure of HIT is analogous to the mosaic model of HaploParser with founder haplotypes. Therefore, it can be seen as a probabilistic model of recombinations and point mutations. BACH (Bayesian Context-based Haplotyping) utilizes a context tree weighting algorithm to efficiently sum over all variable-length Markov chains to evaluate the posterior probability of a haplotype configuration. Algorithms are presented that find haplotype configurations with high posterior probability.
BACH is the most accurate method presented in this thesis and has comparable performance to the best available software for haplotype inference. Local alignment significance is a computational problem where one asks whether the local similarities in two sequences arise because the sequences are related or just by chance. Similarity of sequences is measured by their best local alignment score, and from that score a p-value is computed. This p-value is the probability that two sequences drawn from the null model have a best local alignment score at least as good. Local alignment significance is used routinely, for example, in homology searches. In this thesis, a general framework is sketched that allows one to compute a tight upper bound for the p-value of a local pairwise alignment score. Unlike previous methods, the presented framework is not affected by so-called edge effects and can handle gaps (deletions and insertions) without troublesome sampling and curve fitting.
This thesis presents new, accurate, and efficient computational methods for inferring the haplotypes of a population from genotypes and for assessing the significance of local sequence alignments. The methods are based on, among other things, dynamic programming, in which the smallest subproblems are solved first and their solutions are assembled into solutions of larger subproblems. The genome of an organism is usually encoded inside the cell in DNA, put simply in a string (sequence) of the bases A, C, G, and T. The genome is organized into chromosomes, which contain variations, or markers, occurring at particular positions. The chromosomes (autosomes) of a diploid organism, such as a human, occur in pairs. An individual inherits one chromosome of each pair from the father and the other from the mother. A haplotype is the sequence of markers occurring at particular positions on one chromosome of an individual's chromosome pair. Measuring haplotypes directly is difficult, whereas genotypes are easier to measure.
Genotypes tell which two markers occur at the corresponding positions of the two chromosomes. Haplotype data are commonly used, for example, to study genetic diseases. For this reason, computational inference of haplotypes from genotypes is an important research problem. The input to the problem is thus a sample of genotypes from a population, from which the haplotypes of every individual in the sample should be inferred. Inferring haplotypes from genotypes is possible because haplotypes are similar across individuals. This similarity arises from evolutionary processes such as inheritance, natural selection, migration, and isolation. This thesis presents three methods for haplotype inference. The most accurate of these, called BACH, uses a variable-order Markov model and Bayesian statistics to infer haplotypes from genotype data. Its model can accurately capture genetic linkage, that is, the dependence between markers located physically close to each other. This linkage shows up as local similarity between haplotype sequences. Local alignment is used, for example, to search the genome sequences of different organisms for similar regions, such as corresponding genes. Local alignment search algorithms find only the most similar region, but do not tell whether the finding is statistically significant. A common way to assess the statistical significance of an alignment is to compute a p-value for its quality (score). The method presented in the thesis for local alignment significance computes an expected value for the local alignment of the sequences, which gives a tight upper bound for the commonly used p-value. Although the model is simple, in empirical tests a simple derivative of this expected value turns out to be a quite accurate estimate of the p-value.
An advantage of the approach is that it allows alignment gaps (deletions and insertions) to be modeled in a straightforward way.
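To make the p-value definition in this abstract concrete, here is a minimal Python sketch. It is not the thesis's framework, which computes an upper bound and avoids sampling. It shows the baseline that the abstract contrasts against: a best local alignment score computed by dynamic programming (Smith-Waterman with a linear gap penalty), followed by a sampling-based p-value estimate. The scoring parameters, the uniform i.i.d. null model over A, C, G, T, and all function names are assumptions chosen for illustration.

```python
import random


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    The scoring values are illustrative assumptions, not taken from the thesis.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Smaller subproblems (shorter prefixes) are solved first;
            # 0 lets a local alignment start at any position.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def empirical_p_value(a, b, trials=200, seed=0):
    """Monte Carlo estimate of P(null score >= observed score).

    Assumed null model: independent, uniformly random DNA sequences
    of the same lengths as the inputs.
    """
    rng = random.Random(seed)
    observed = local_alignment_score(a, b)
    hits = 0
    for _ in range(trials):
        ra = "".join(rng.choice("ACGT") for _ in a)
        rb = "".join(rng.choice("ACGT") for _ in b)
        if local_alignment_score(ra, rb) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)
```

The sampling step is what makes this baseline expensive for small p-values, since resolving a p-value near 1/N needs on the order of N random sequence pairs.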

    Detecting ancestral junctions in inbred populations


    Examining recombination and intra-genomic conflict dynamics in the evolution of anti-microbial resistant bacteria

    The spread of antimicrobial resistance (AMR) among pathogenic bacterial species threatens to undercut much of the progress made in treating infectious diseases. AMR genes can disseminate between and within populations via horizontal gene transfer (HGT). Selfish mobile genetic elements (MGEs) can encode resistance and spread between host cells. Through HGT, homologous recombination with DNA from resistant donors can also alter the core genes of pathogens. MGEs may be cured from host genomes through transformation. Hence, MGEs may be able to avoid deletion by disrupting transformation. This work aims to understand how the dynamics of these processes affect the epidemiology of AMR pathogens. To understand these dynamics, I co-developed a new version of the popular recombination detection tool Gubbins. Through simulation studies, I find this new version to be both accurate in reconstructing the relationships between isolates and efficient in its use of computational resources. I then apply Gubbins to both AMR lineages and species-wide datasets of the pathogen Streptococcus pneumoniae. I find that recombination frequently occurs around core genes involved in both drug resistance and the host immune response. Additionally, an MGE was able to spread successfully within a population by disrupting the transformation machinery, preventing its loss from the host. Finally, I investigate two recent examples of MGEs disrupting transformation in the Gram-negative species Acinetobacter baumannii and Legionella pneumophila. I find that while these insertions may decrease the efficiency of transformation within cells, the observed recombination rates largely reflect the selection pressures on isolates, with MGEs only partially able to inhibit these observable transformation events. These results show how selection pressures from clinical interventions shape pathogen genomes through diverse, often interspecies, recombination events.
The spread of MGEs can also be favoured by both these selection pressures and their ability to disrupt host cell machinery.

    An integrative functional genomics approach towards quantitative trait gene nomination in existing and emerging mouse genetic reference populations

    An approach that has been widely applied for the genetic dissection of complex traits is Quantitative Trait Locus (QTL) mapping. QTL mapping identifies genomic regions that harbor polymorphisms responsible for the observed variation in a complex trait. If these polymorphisms are located within a gene, the gene is called a Quantitative Trait Gene (QTG). Prior to advancements in QTL mapping populations, QTL mapping resolution was often poor, resulting in large QTL intervals. Therefore, after mapping a QTL, fine mapping was initiated to further reduce the QTL interval and to identify the QTG. While successful, fine mapping using genetic approaches has been extremely time and resource intensive, making it the rate-limiting step in QTG discovery. Thus far, only a few QTGs have been successfully identified and validated. The disproportionate ratio of QTLs mapped to QTGs identified has been a cause of concern. Successful QTG discovery relies on the power and resolution with which QTLs are mapped and on the genetic architecture of the underlying QTL mapping population. Here, QTL mapping performance is assessed in two recently developed QTL mapping populations, namely the expanded BXD Recombinant Inbred (RI) strain panel and the Collaborative Cross (CC). Results indicate that while both the expanded BXD RI strain panel and the CC improve QTL mapping resolution, the CC achieves greater precision and resolution in QTL mapping. However, neither the BXD RI nor the CC facilitates gene-level resolution in QTL mapping. Recent studies have used the integration and convergence of evidence among functional genomics studies as a successful strategy towards the efficient and rapid nomination of QTGs. Here, the complementary in silico approach of integrative functional genomics, using GeneWeaver (www.geneweaver.org), is applied towards the reduction of two cocaine-induced locomotor activation QTLs mapped in the expanded BXD RI strain panel.
Integrative functional genomic analyses of these QTLs led to the nomination of Rab3b as a putative QTG. Functional assessment of Rab3b using Rab3bcd knockout mice reveals its role in acute habituation mediated cocaine response, serving as evidence of the efficiency and utility of integrative functional genomics for the identification of highly relevant QTGs.

    Bayesian Statistical Methods for Genetic Association Studies with Case-Control and Cohort Design

    Large-scale genetic association studies are carried out with the hope of discovering single nucleotide polymorphisms involved in the etiology of complex diseases. We propose a coalescent-based model for association mapping which potentially increases the power to detect disease-susceptibility variants in genetic association studies with case-control and cohort design. The approach uses Bayesian partition modelling to cluster haplotypes with similar disease risks by exploiting evolutionary information. We focus on candidate gene regions and split the chromosomal region of interest into sub-regions, or windows, of high linkage disequilibrium (LD), assuming a perfect phylogeny within each. The haplotype space is then partitioned into disjoint clusters within which the phenotype-haplotype association is assumed to be the same. The novelty of our approach is that the distance used for clustering haplotypes has an evolutionary interpretation, as haplotypes are clustered according to the time to their most recent common mutation. Our approach is fully Bayesian, and we develop Markov chain Monte Carlo algorithms to sample efficiently over the space of possible partitions. We have also developed a Bayesian survival regression model for high-dimensional, small-sample-size settings. We provide a Bayesian variable selection procedure and shrinkage tool by imposing shrinkage priors on the regression coefficients. We have developed a computationally efficient optimization algorithm to explore the posterior surface and find the maximum a posteriori estimates of the regression coefficients. In simulation studies and on real datasets, we compare the proposed methods with both single-marker analyses and recently proposed multi-marker methods, and show that our methods perform similarly in localizing the causal allele while yielding lower false positive rates. Moreover, our methods offer computational advantages over other multi-marker approaches.

    Inference of transitions to self-fertilization using haplotype genomic variation

    Mating systems play an essential role in the evolution of natural populations. The reproductive mode of a population affects the evolutionary forces and recombination. Shifts in mating systems change major evolutionary traits of natural populations and affect the life-history cycle on many different levels. Among all transitions of mating schemes, a shift from outcrossing to selfing is one of the major shifts in plants. Such shifts have repeatedly occurred on the phylogenetic level. Despite their importance, there were no published tools to estimate such transitions in natural populations using genetic data on a genome-wide level. Existing estimates rely on estimating the loss-of-function mutations of causal loci. However, such estimates require knowledge of the underlying genetic mechanism that induces the shift from outcrossing to selfing. Thus, such estimates are restricted to very few species. In this study, we investigated the genetic consequences of shifts from outcrossing to selfing (Chapter 1). We used extensive simulations of the forward-in-time Wright-Fisher model and the backward-in-time coalescent model. We found the previously described theoretical work on implementing partial selfing in the coalescent to suffice in simulating transitions to selfing. We developed an Approximate Bayesian Computation approach (tsABC) to identify and estimate the date of transitions from outcrossing to selfing using a pairwise comparison of genomes (Chapter 2). Finally, in collaboration with Thibaut Sellinger, we introduced the modified PSMC’ (teSMC) to estimate piecewise-constant selfing rates through time jointly with piecewise-constant population sizes for single-population demographies and analyzed its accuracy (Chapter 3). Taken together, we provide not only an approximate Bayesian but also a maximum likelihood approach to identify and estimate transitions to selfing for single populations.
We found tsABC to be a versatile tool to identify and estimate transitions to selfing. Under carefully made assumptions for the proposed models, transitions to selfing can be detected under a broad range of scenarios. Moreover, the assumed model in the teSMC method improved the estimates of demography and detected transitions to selfing at least as powerfully as tsABC. The automated parametrization of teSMC allows users with little expertise in scripting to use it. We used both methods to estimate the transition from outcrossing to selfing for three genetic clusters of Arabidopsis thaliana. Our results were consistent with each other and with existing estimates from the literature. With our study, we not only contributed to the understanding of evolutionary processes that formed the genetic diversity of natural populations but also provided two powerful tools to investigate the demographic history of natural populations in the context of transitions to selfing. Recombination provides a molecular clock on a separate time scale compared to mutation that interacts with all four evolutionary forces at various levels. Eventually, that will contribute to understanding the functions of genes and their relationship and interaction with the bearing individual, the population, and the environment. Taken together, selfing as a breeding scheme or reproductive strategy is a crucial trait that interferes with and connects evolutionary forces, adaptive potential, and life-history traits of natural populations.

    Genomic analyses of Streptococcus mitis and its role as an oral commensal, pathogen, and reservoir of antimicrobial resistance for Streptococcus pneumoniae through horizontal gene transfer

    Streptococcus mitis is a neglected oral commensal and opportunistic pathogen that is closely related to the global pathogen S. pneumoniae. However, S. mitis has been shown to be a source of AMR for the pneumococcus through horizontal gene transfer (HGT). It is unclear to what extent S. mitis acts as a source of AMR for emerging resistant pneumococcal lineages in the context of selective pressure from antibiotics and vaccines. This thesis summarises genomic findings from global S. mitis datasets obtained from carriage and invasive disease. A comprehensive HGT analysis of global S. mitis and pneumococcal carriage isolates revealed that, through HGT, S. mitis contributes to reduced β-lactam susceptibility among commonly carried pneumococcal serotypes that are associated with long carriage duration. S. mitis was shown to be abundant in the oral cavity of young Malawian children, and co-carriage with the pneumococcus was high in the region, demonstrating the opportunities for interspecies contact. S. mitis isolates found in carriage and expired respiratory secretions of healthy individuals were shown to harbour AMR genes, and potential S. mitis transmission was identified. Overall, distinct S. mitis populations have been identified that differ with respect to the presence of pneumococcal virulence genes; however, S. mitis isolates that caused infective endocarditis in the UK between 2001 and 2016 were from diverse populations. Together, this snapshot of S. mitis has revealed that the species is more than a commensal: it is a highly diverse and complex species that is abundant, has a high prevalence of AMR genes, can potentially be transmitted through respiratory shedding, and is an important source of AMR for the pneumococcus. Therefore, in regions such as Malawi where S. mitis and S. pneumoniae co-carriage is high, it will be important to monitor the influence of AMR HGT from S. mitis and other commensal streptococci.

    Multiscale Genomic Analysis of the Corticolimbic System: Uncovering the Molecular and Anatomic Substrates of Anxiety-Related Behavior

    Genetic diversity generates variation at multiple phenotypic levels, ranging from the most basic molecular to higher-order cognitive and behavioral traits. The far-reaching impact that genes have on higher traits is apparent in several neuropsychiatric conditions such as stress and anxiety disorders. Like most, if not all, neural phenotypes, stress, anxiety, and other emotion-related traits are extremely complex and are defined by the interplay of multiple genetic, environmental, experiential, and epigenetic factors. The work presented in this dissertation is a multi-scalar, integrative analysis of the molecular and neuroanatomic substrates that underlie emotion-related behavior. The amygdala is a principal component of the limbic system that controls emotionality. Using BXD recombinant inbred (RI) mice as model organisms, the anatomy and cellular architecture of the amygdala, specifically the basolateral amygdala (BLA), were examined to assess the level of structural variation in this brain region. Quantitative trait locus (QTL) analysis was done to identify genetic loci that modulate the neuroanatomical traits of the BLA. The BXD RI mice were also tested using a variety of behavioral assays, and this showed a significant association between BLA size and emotion-related behavior. The effect of chronic stress on subsequent behavior and endocrine response was also examined in several genetically diverse inbred mice. Finally, to explore the molecular mediators of stress and anxiety, microarrays were used to assay gene expression in three key corticolimbic brain regions: the prefrontal cortex, amygdala, and hippocampus. Several large transcriptome data sets were also analyzed. These expression data sets brought focus on an interval on mouse distal chromosome 1 that modulates diverse neural and behavioral traits and also controls the expression of a plethora of genes.
This QTL-rich region on mouse distal chromosome 1 (Qrr1) provides insights into how the information in the DNA sequence is conveyed by networks of co-regulated genes that may in turn modulate networks of inter-related phenotypes.