
    The EM Algorithm and the Rise of Computational Biology

    In the past decade computational biology has grown from a cottage industry with a handful of researchers to an attractive interdisciplinary field, catching the attention and imagination of many quantitatively-minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology problems surrounding the "central dogma" of molecular biology: from DNA to RNA and then to proteins. Topics of this article include sequence motif discovery, protein sequence alignment, population genetics, evolutionary models and mRNA expression microarray data analysis. Comment: Published at http://dx.doi.org/10.1214/09-STS312 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org).

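    Laskennallisia menetelmiä haplotyypien ennustamiseen ja paikallisten rinnastusten merkittävyyden arviointiin (Computational methods for predicting haplotypes and assessing the significance of local alignments)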

    This thesis, which consists of an introduction and four peer-reviewed original publications, studies the problems of haplotype inference (haplotyping) and local alignment significance. The problems studied here belong to the broad area of bioinformatics and computational biology. The presented solutions are computationally fast and accurate, which makes them practical in high-throughput sequence data analysis. Haplotype inference is a computational problem where the goal is to estimate haplotypes from a sample of genotypes as accurately as possible. This problem is important because the direct measurement of haplotypes is difficult, whereas genotypes are easier to quantify. Haplotypes are key players when studying, for example, the genetic causes of diseases. In this thesis, three methods are presented for the haplotype inference problem, referred to as HaploParser, HIT, and BACH. HaploParser is based on a combinatorial mosaic model and hierarchical parsing that together mimic recombinations and point mutations in a biologically plausible way. In this mosaic model, the current population is assumed to have evolved from a small founder population. Thus, the haplotypes of the current population are recombinations of the (implicit) founder haplotypes with some point mutations. HIT (Haplotype Inference Technique) uses a hidden Markov model for haplotypes, and efficient algorithms are presented to learn this model from genotype data. The model structure of HIT is analogous to the mosaic model of HaploParser with founder haplotypes. Therefore, it can be seen as a probabilistic model of recombinations and point mutations. BACH (Bayesian Context-based Haplotyping) utilizes a context tree weighting algorithm to efficiently sum over all variable-length Markov chains to evaluate the posterior probability of a haplotype configuration. Algorithms are presented that find haplotype configurations with high posterior probability. 
BACH is the most accurate method presented in this thesis and has comparable performance to the best available software for haplotype inference. Local alignment significance is a computational problem where one is interested in whether the local similarities in two sequences arise because the sequences are related or just by chance. Similarity of sequences is measured by their best local alignment score, and from that a p-value is computed. This p-value is the probability of picking two sequences from the null model whose best local alignment score is as good or better. Local alignment significance is used routinely, for example, in homology searches. In this thesis, a general framework is sketched that allows one to compute a tight upper bound for the p-value of a local pairwise alignment score. Unlike previous methods, the presented framework is not affected by so-called edge effects and can handle gaps (deletions and insertions) without troublesome sampling and curve fitting.
This thesis presents new, accurate, and efficient computational methods for predicting the haplotypes of a population from genotypes and for assessing the significance of local sequence alignments. The methods are based on, among other things, dynamic programming, in which the smallest subproblems are solved first and solutions to larger subproblems are assembled from these partial solutions. An organism's genome is usually encoded inside the cell in DNA, put simply as a string (sequence) of the bases A, C, G, and T. The genome is organized into chromosomes, which contain variations occurring at specific positions, called markers. In a diploid organism such as a human, the chromosomes (autosomes) occur in pairs. An individual inherits one chromosome of each pair from the father and the other from the mother. A haplotype is the sequence of markers occurring at specific positions on one particular chromosome of a chromosome pair. Measuring haplotypes directly is difficult, but genotypes are easier to measure. 
Genotypes tell which two markers occur at corresponding positions of the chromosome pair. Haplotype data are widely used, for example, in studying genetic diseases. For this reason, computational prediction of haplotypes from genotypes is an important research problem. The input to the problem is thus a sample of genotypes from a given population, from which haplotypes should be predicted for every individual in the sample. Predicting haplotypes from genotypes is possible because haplotypes are similar across individuals. This similarity results from evolutionary processes such as inheritance, natural selection, migration, and isolation. This thesis presents three methods for haplotype determination. The most accurate of these, called BACH, uses a variable-order Markov model and Bayesian statistics to predict haplotypes from genotype data. The method's model can accurately capture genetic linkage, that is, the dependence between markers located physically close to one another. This linkage shows up as local similarity between haplotype sequences. Local alignment is used, for example, when searching the genome sequences of different organisms for similar regions, such as corresponding genes. Local alignment search algorithms find only the most similar region; they do not tell whether the finding is statistically significant. A common way to determine the statistical significance of an alignment is to compute a p-value for its goodness (score). The thesis's method for computing the significance of local alignments calculates an expected value for the local alignment of the sequences, which gives a tight upper bound on the commonly used p-value. Although the model is simple, in empirical tests a simple derivative of the expected value produced by the method proves to be a quite accurate estimate of the p-value. 
An advantage of the approach is that it allows alignment gaps (deletions and insertions) to be modeled in a straightforward way.
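
    To make the p-value definition above concrete, the following is a minimal sketch of the sampling-based baseline that the thesis contrasts itself with, not the thesis's own upper-bound framework. It scores the best local alignment by dynamic programming (Smith-Waterman with a linear gap penalty) and estimates a p-value by shuffling one sequence. The scoring parameters, the shuffle-based null model, and the function names are all assumptions chosen for illustration.

```python
import random


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    The scoring values are illustrative assumptions, not taken from the thesis.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def empirical_p_value(a, b, n_shuffles=200, seed=0):
    """Estimate P(null score >= observed score) by shuffling b.

    Assumed null model: sequences with the same letter composition as b.
    """
    rng = random.Random(seed)
    observed = local_alignment_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if local_alignment_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_shuffles + 1)
```

    Calling `empirical_p_value("GGACGTGG", "TTACGTTT")` returns a value in (0, 1]. Each estimate costs `n_shuffles` full alignments, which illustrates why a bound that needs no sampling is attractive.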

    Tripping over emerging pathogens around the world: A phylogeographical approach for determining the epidemiology of Porcine circovirus-2 (PCV-2), considering global trading

    Porcine circovirus-2 (PCV-2) is an emerging virus associated with a number of different syndromes in pigs known as Porcine Circovirus Associated Diseases (PCVAD). Since its identification and characterization in the early 1990s, PCV-2 has achieved a worldwide distribution, becoming endemic in most pig-producing countries, and is currently considered the main cause of losses on pig farms. In this study, we analyzed the main routes of the spread of PCV-2 between pig-producing countries using phylogenetic and phylogeographical approaches. A search for PCV-2 genome sequences in GenBank was performed, and the 420 PCV-2 sequences obtained were grouped into haplotypes (groups of sequences showing 100% identity), based on the infinite sites model of genome evolution. A phylogenetic hypothesis was inferred by Bayesian Inference for the classification of viral strains, and a haplotype network was constructed by Median Joining to predict the geographical distribution of, and genealogical relationships between, haplotypes. In order to establish an epidemiological and economic context for these analyses, we considered all information about PCV-2 sequences available in GenBank, including papers published on viral isolation, and live pig trading statistics available on the UN Comtrade database (http://comtrade.un.org/). In these analyses, we identified a strong correlation between the means of PCV-2 dispersal predicted by the haplotype network and the statistics on the international trading of live pigs. This correlation provides a new perspective on the epidemiology of PCV-2, highlighting the importance of the movement of animals around the world in the emergence of new pathogens, and showing the need for effective sanitary barriers when trading live animals.
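
    The grouping step described above, collapsing sequences that show 100% identity into haplotypes, can be sketched as follows. This is a hypothetical illustration rather than the authors' pipeline: it assumes the sequences are already aligned to equal length, and the record identifiers and the function name are invented for the example.

```python
from collections import defaultdict


def group_into_haplotypes(sequences):
    """Map each distinct sequence to the record IDs that carry it.

    `sequences` is a dict of record ID -> aligned sequence (assumed input
    shape). Two records fall into the same haplotype only when their
    sequences are identical, ignoring letter case.
    """
    groups = defaultdict(list)
    for record_id, seq in sequences.items():
        groups[seq.upper()].append(record_id)
    return dict(groups)


# Toy records with made-up identifiers.
records = {
    "rec1": "ACGTAC",
    "rec2": "acgtac",  # identical to rec1 apart from case
    "rec3": "ACGTAA",
}
haplotypes = group_into_haplotypes(records)
```

    Here `haplotypes` has two entries, one shared by `rec1` and `rec2` and one for `rec3`. Real GenBank records would first need alignment and a decision on how to treat ambiguous bases.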

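    Genetska struktura i demografska prošlost populacija pišmolja Merlangius merlangus (Linnaeus, 1758) s područja Turske određene na temelju varijacija mitohondrijskih DNA sekvenci (Genetic structure and demographic history of whiting Merlangius merlangus (Linnaeus, 1758) populations from Turkey, determined from mitochondrial DNA sequence variation)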

    The genetic diversity, structure, and demographic history of the economically important and overfished Gadidae species Merlangius merlangus were investigated using the non-coding mitochondrial Control Region (CR) from five different sites in the Sea of Marmara and the Black Sea in Turkey. The populations of M. merlangus were found to be genetically diverse, with 14 haplotypes and 15 polymorphic regions. The overall haplotype diversity was 0.910 ± 0.024, and the nucleotide diversity was 0.003 ± 0.0003. Genetic distances between populations varied between 0.13% and 8.02%, while genetic distances within M. merlangus populations varied between 0.09% and 0.42%. Principal Coordinates Analysis (PCoA) showed that the Marmara, Black Sea, and Karadeniz Ereğli populations were clearly separated. Pairwise FST values varied from 0.12 to 0.69, highlighting high genetic variation among populations. The Black Sea and Marmara lineages of M. merlangus diverged from the North Sea lineage 1.65 (1.08-2.29) mya, whereas the separation from the Atlantic lineage occurred about 0.84 (0.51-1.2) mya. The recent expansion of the whiting population was identified through neutrality tests and mismatch distribution analyses. This study provides important insight into the genetic structure, conservation, and management of this species.
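
    As an illustration of the kind of diversity statistic reported above, the sketch below computes haplotype (gene) diversity with the standard unbiased estimator h = n/(n-1) * (1 - sum of squared haplotype frequencies). It is an assumption that the study used this estimator; the abstract does not say which software or formula produced its values, and the toy sample is invented.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Unbiased haplotype diversity for a sample of haplotype labels.

    h = n / (n - 1) * (1 - sum(p_i ** 2)), where p_i is the sample
    frequency of haplotype i and n is the sample size.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sampled haplotypes")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


# Toy sample: four individuals carrying two haplotypes at equal frequency.
h = haplotype_diversity(["H1", "H1", "H2", "H2"])
```

    The statistic is the probability that two haplotypes drawn at random from the sample differ, so it is 0 when all sampled individuals share one haplotype and 1 when every individual carries a distinct one.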

    Inference of transitions to self-fertilization using haplotype genomic variation

    Mating systems play an essential role in the evolution of natural populations. The reproductive mode of a population affects the evolutionary forces and recombination. Shifts in mating systems change major evolutionary traits of natural populations and affect the life-history cycle on many different levels. Among all transitions of mating schemes, a shift from outcrossing to selfing is one of the major shifts in plants. Such shifts have repeatedly occurred on the phylogenetic level. Despite their importance, there were no published tools to estimate such transitions in natural populations using genetic data on a genome-wide level. Existing estimates rely on estimating the loss-of-function mutations of causal loci. However, such estimates rely on knowledge of the underlying genetic mechanism that induces the shift from outcrossing to selfing. Thus, such estimates are restricted to very few species. In this study, we investigated the genetic consequences of shifts from outcrossing to selfing (Chapter 1). We used extensive simulations of the forward-in-time Wright-Fisher model and the backward-in-time coalescent model. We found the previously described theoretical work on implementing partial selfing in the coalescent to suffice in simulating transitions to selfing. We developed an Approximate Bayesian Computation approach (tsABC) to identify and estimate the date of transitions from outcrossing to selfing using a pairwise comparison of genomes (Chapter 2). Finally, in collaboration with Thibaut Sellinger, we introduced the modified PSMC’ (teSMC) to estimate piecewise-constant selfing rates through time jointly with piecewise-constant population sizes for single-population demographies and analyzed its accuracy (Chapter 3). Taken together, we provide not only an approximate Bayesian but also a maximum likelihood approach to identify and estimate transitions to selfing for single populations. 
We found tsABC to be a versatile tool to identify and estimate transitions to selfing. Under carefully made assumptions for the proposed models, transitions to selfing can be detected under a broad range of scenarios. Moreover, the assumed model in the teSMC method improved the estimates of demography and detected transitions to selfing with at least as much power as tsABC. The automated parametrization of teSMC allows users with little expertise in scripting to use it. We used both methods to estimate the transition from outcrossing to selfing for three genetic clusters of Arabidopsis thaliana. Our results were consistent with each other and with existing estimates from the literature. With our study, we not only contributed to the understanding of evolutionary processes that formed the genetic diversity of natural populations but also provided two powerful tools to investigate the demographic history of natural populations in the context of transitions to selfing. Recombination provides a molecular clock on a separate time scale compared to mutation that interacts with all four evolutionary forces at various levels. Eventually, that will contribute to understanding the functions of genes and their relationship and interaction with the bearing individual, the population, and the environment. Taken together, selfing as a breeding scheme or reproductive strategy is a crucial trait that interferes with and connects evolutionary forces, adaptive potential, and life-history traits of natural populations.

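    Revisão taxonômica e filogenia de Cheirodontinae (Characiformes: Characidae): integrando evidência morfológica e molecular (Taxonomic revision and phylogeny of Cheirodontinae (Characiformes: Characidae): integrating morphological and molecular evidence)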

    The neotropical freshwater fish fauna is taxonomically the most diverse on the planet; however, its diversity is still largely underestimated. 
Evidence for this underestimation comes from the accumulation of morphologically distinct but undescribed forms deposited in museum collections, and from the results of DNA-based studies (e.g., species delimitation) that consistently identify a larger number of divergent lineages than currently accepted, even within well-studied groups. To investigate the diversity of lineages within the Cheirodontinae, we sequenced the mitochondrial cytochrome c oxidase subunit I (COI) and delineated lineages using 8 different single-locus species delimitation methods. The results provide strong and consistent evidence for additional and undescribed diversity in Cheirodontinae. Within the Cheirodontinae, we find divergent hypotheses (morphological vs. molecular) about phylogenetic relationships within the subfamily. We test these two hypotheses and, using integrated data (morphological and molecular: COI, 16S, 12S, RAG1, RAG2, Myh6), propose a new hypothesis of phylogenetic relationships within the subfamily. We subdivide the subfamily into 8 clades, which are corroborated by the different methodologies (parsimony and models). With this morphological and molecular information, we estimated divergence times within the subfamily using total-evidence FBD dating, which led us to determine that the Cheirodontinae diverged from the Aphyocharacinae approximately 28 Ma. With these results we reconstruct the biogeographical history of the Cheirodontinae in the neotropical region and test the influence of size and insemination capacity on its diversification.

    Divergent Selection and Primary Gene Flow Shape Incipient Speciation of a Riparian Tree on Hawaii Island

    A long-standing goal of evolutionary biology is to understand the mechanisms underlying the formation of species. Of particular interest is whether or not speciation can occur in the presence of gene flow and without a period of physical isolation. Here, we investigated this process within Hawaiian Metrosideros, a hypervariable and highly dispersible woody species complex that dominates the Hawaiian Islands in continuous stands. Specifically, we investigated the origin of Metrosideros polymorpha var. newellii (newellii), a riparian ecotype endemic to Hawaii Island that is purportedly derived from the archipelago-wide M. polymorpha var. glaberrima (glaberrima). Disruptive selection across a sharp forest-riparian ecotone contributes to the isolation of these varieties and is a likely driver of newellii’s origin. We examined genome-wide variation of 42 trees from Hawaii Island and older islands. Results revealed a split between glaberrima and newellii within the past 0.3–1.2 My. Admixture was extensive between lineages within Hawaii Island and between islands, but introgression from populations on older islands (i.e., secondary gene flow) did not appear to contribute to the emergence of newellii. In contrast, recurrent gene flow (i.e., primary gene flow) between glaberrima and newellii contributed to the formation of genomic islands of elevated absolute and relative divergence. These regions were enriched for genes with regulatory functions as well as for signals of positive selection, especially in newellii, consistent with divergent selection underlying their formation. In sum, our results support riparian newellii as a rare case of incipient ecological speciation with primary gene flow in trees.

    Mini-Workshop: Recent Developments in Statistical Methods with Applications to Genetics and Genomics

    Recent progress in high-throughput genomic technologies has revolutionized the field of human genetics and promises to lead to important scientific advances. With new improvements in massively parallel biotechnologies, it is becoming increasingly efficient to generate vast amounts of information at the genomic, transcriptomic, proteomic, metabolomic, and other levels, opening up as yet unexplored opportunities in the search for the genetic causes of complex traits. Despite this tremendous progress in data generation, it remains very challenging to analyze, integrate and interpret these data. The resulting data are high-dimensional and very sparse, and efficient statistical methods are critical in order to extract the rich information contained in these data. The major focus of the mini-workshop, entitled “Recent Developments in Statistical Methods with Applications to Genetics and Genomics”, has been on integrative methods. Relevant research questions included the optimal study design for integrative genomic analyses; appropriate handling and pre-processing of different types of omics data; statistical methods for integration of multiple types of omics data; adjustment for confounding due to latent factors such as cell or tissue heterogeneity; the optimal use of omics data to enhance or make sense of results identified through genetic studies; and statistical and computational strategies for analysis of multiple types of high-dimensional data.

    Molecular Approaches to Identify Cryptic Species and Polymorphic Species within a Complex Community of Fig Wasps

    Cryptic and polymorphic species can complicate traditional taxonomic research, and both of these concerns are common in fig wasp communities. Species identification is very difficult, despite great effort and the ecological importance of fig wasps. Herein, we try to identify all chalcidoid wasp species hosted by one species of fig, using both morphological and molecular methods. We compare the efficiency of four different DNA regions and find that ITS2 is highly effective for species identification, while mitochondrial COI and Cytb regions appear less reliable, possibly due to interfering signals from either nuclear copies of mtDNA (NUMTs) or the effects of Wolbachia infections. The analyses suggest that combining multiple markers is the best choice for inferring species identifications, as any one marker may be unsuitable in a given case.