Search CORE

46 research outputs found

Computational recovery of enzyme haplotypes from a metagenome

Author: Nicholls Samuel
Publication venue
Publication date: 01/01/2018
Field of study

Aberystwyth Research Portal

University of Birmingham Research Portal

Recommended from our members

Computational methods for understanding genetic variations from next generation sequencing data

Author: Ahn Soyeon, Ph. D.
Publication venue
Publication date: 13/09/2018
Field of study

Studies of human genetic variation reveal critical information about genetic and complex diseases such as cancer, diabetes and heart disease, ultimately leading towards improvements in health and quality of life. Moreover, understanding genetic variations in viral population is of utmost importance to virologists and helps in search for vaccines. Next-generation sequencing technology is capable of acquiring massive amounts of data that can provide insight into the structure of diverse sets of genomic sequences. However, reconstructing heterogeneous sequences is computationally challenging due to the large dimension of the problem and limitations of the sequencing technology.This dissertation is focused on algorithms and analysis for two problems in which we seek to characterize genetic variations: (1) haplotype reconstruction for a single individual, so-called single individual haplotyping (SIH) or haplotype assembly problem, and (2) reconstruction of viral population, the so-called quasispecies reconstruction (QSR) problem. For the SIH problem, we have developed a method that relies on a probabilistic model of the data and employs the sequential Monte Carlo (SMC) algorithm to jointly determine type of variation (i.e., perform genotype calling) and assemble haplotypes. For the QSR problem, we have developed two algorithms. The first algorithm combines agglomerative hierarchical clustering and Bayesian inference to reconstruct quasispecies characterized by low diversity. The second algorithm utilizes tensor factorization framework with successive data removal to reconstruct quasispecies characterized by highly uneven frequencies of its components. Both algorithms outperform existing methods in both benchmarking tests and real data.Electrical and Computer Engineerin

Texas ScholarWorks

De novo approaches to haplotype-aware genome assembly

Author: Baaijens J.A. (Jasmijn)
Publication venue
Publication date: 25/09/2019
Field of study

CWI's Institutional Repository

Phasage d’haplotypes par ASP à partir de longues lectures : une approche d’optimisation flexible

Author: Delahaye Clara
Publication venue: HAL CCSD
Publication date: 15/12/2022
Field of study

Version non corrigée. Une nouvelle version sera disponible d'ici mars 2023.Each chromosome of a di- or polyploid organism has several haplotypes, which are highly similar but diverge on a certain number of positions. However, most of the reference genomes only provide a single sequence for each chromosome, and therefore do not reflect the biological reality.Yet, it is crucial to have access to this information, which is useful in medicine, agronomy and population studies. The recent development of third generation technologies, especially PacBio and Oxford Nanopore Technologies sequencers, has allowed for the production of long reads that facilitate haplotype sequence reconstruction.Bioinformatics methods exist for this task, but they provide only a single solution. This thesis introduces an approach for haplotype phasing based on the search of connected components in a read similarity graph to identify haplotypes. This method uses Answer Set Programming to work on the set ofoptimal solutions. This phasing algorithm has been used to reconstruct haplotypes of the diploid rotifer Adineta vaga.Chaque chromosome d’organisme di- ou polyploïde présente plusieurs haplotypes, qui sont fortement similaires mais divergent sur un certain nombre de positions. Cependant, la majorité des génomes de référence ne renseignent qu’une seule séquence pour chaque chromosome, et ne reflètent donc pas la réalité biologique. Or, il est crucial d’avoir accès à ces informations, qui sont utiles en médecine, en agronomie ou encore dans l’étude des populations. Le récent développement des technologies de troisième génération, notamment des séquenceurs PacBio et Oxford NanoporeTechnologies, a permis la production de lectures longues facilitant la reconstruction des séquences d’haplotypes. Il existe pour cela des méthodes bioinformatiques, mais elles ne fournissent qu’une unique solution. Cette thèse propose une méthode de phasage d’haplotype basée sur la recherchede composantes connexes dans un graph de similarité des lectures pour identifier les haplotypes. Cette méthode utilise l’Answer Set Programming pour travailler sur l’ensemble des solutions optimales. L’algorithme de phasage a permis de reconstruire les haplotypes du rotifère diploïde Adineta vaga

INRIA a CCSD electronic archive server

Models and algorithms for haplotype phasing and variant calling

Author: Li Yanbo
Publication venue
Publication date: 01/01/2022
Field of study

Haplotypes have been increasingly used in genetic studies. Analysis of variations among haplotypes has many applications in population genetics and biomedical research. However, the current DNA sequencing technologies can only read short fragments (called reads) randomly drawn from complete haplotypes, and reads from two haplotypes are mixed together for diploid species like humans. Therefore, recovering the complete haplotype sequences becomes a fundamental problem in computational genomics, which requires calling variants between two haplotypes from reads. With the emerging third-generation sequencing technologies, existing reference-based approaches for haplotype phasing suffer from performance issues to handle long and error-prone reads. At the same time, reference-free methods for variant calling and haplotype phasing meet challenges when applied to large and complex genomes. This thesis addresses a number of critical challenges in haplotype phasing and variant calling in computational genomics. Firstly, we introduce DCHap, a fast reference-based haplotype phasing algorithm that applies a divide-and-conquer strategy. This divide-and-conquer strategy improves both scalability and accuracy for handling third-generation sequencing data in haplotype phasing. Secondly, we propose an algorithm, Kmer2SNP, for reference-free variant calling using graph matching. The introduction of the heterozygous k-mer graph helps to recover SNPs between two haplotypes through the maximum weight matching. Thirdly, we describe Kmer2Haplotype, a pipeline of reference-free haplotype phasing by incorporating both short and long reads from the same individual. The problem of haplotype phasing is modeled as finding the minimum bi-partition of the haplotype-specific k-mer graph. We further design and conduct extensive experiments, and the benchmarking results show the effectiveness and efficiency of the above-proposed approaches compared to the state-of-art baselines. We finally conclude this thesis by discussing future research directions

The Australian National University

Laskennallisia menetelmiä haplotyypien ennustamiseen ja paikallisten rinnastusten merkittävyyden arviointiin

Author: Rastas Pasi
Publication venue: 'University of Helsinki Libraries'
Publication date: 20/11/2009
Field of study

This thesis which consists of an introduction and four peer-reviewed original publications studies the problems of haplotype inference (haplotyping) and local alignment significance. The problems studied here belong to the broad area of bioinformatics and computational biology. The presented solutions are computationally fast and accurate, which makes them practical in high-throughput sequence data analysis. Haplotype inference is a computational problem where the goal is to estimate haplotypes from a sample of genotypes as accurately as possible. This problem is important as the direct measurement of haplotypes is difficult, whereas the genotypes are easier to quantify. Haplotypes are the key-players when studying for example the genetic causes of diseases. In this thesis, three methods are presented for the haplotype inference problem referred to as HaploParser, HIT, and BACH. HaploParser is based on a combinatorial mosaic model and hierarchical parsing that together mimic recombinations and point-mutations in a biologically plausible way. In this mosaic model, the current population is assumed to be evolved from a small founder population. Thus, the haplotypes of the current population are recombinations of the (implicit) founder haplotypes with some point--mutations. HIT (Haplotype Inference Technique) uses a hidden Markov model for haplotypes and efficient algorithms are presented to learn this model from genotype data. The model structure of HIT is analogous to the mosaic model of HaploParser with founder haplotypes. Therefore, it can be seen as a probabilistic model of recombinations and point-mutations. BACH (Bayesian Context-based Haplotyping) utilizes a context tree weighting algorithm to efficiently sum over all variable-length Markov chains to evaluate the posterior probability of a haplotype configuration. Algorithms are presented that find haplotype configurations with high posterior probability. BACH is the most accurate method presented in this thesis and has comparable performance to the best available software for haplotype inference. Local alignment significance is a computational problem where one is interested in whether the local similarities in two sequences are due to the fact that the sequences are related or just by chance. Similarity of sequences is measured by their best local alignment score and from that, a p-value is computed. This p-value is the probability of picking two sequences from the null model that have as good or better best local alignment score. Local alignment significance is used routinely for example in homology searches. In this thesis, a general framework is sketched that allows one to compute a tight upper bound for the p-value of a local pairwise alignment score. Unlike the previous methods, the presented framework is not affeced by so-called edge-effects and can handle gaps (deletions and insertions) without troublesome sampling and curve fitting.Tässä väitöskirjassa esitetään uusia, tarkkoja ja tehokkaita laskennallisia menetelmiä populaation haplotyyppien ennustamiseen genotyypeistä sekä sekvenssien paikallisten rinnastusten merkittävyyden arviointiin. Käytetyt menetelmät perustuvat mm. dynaamiseen ohjelmointiin, jossa pienimmät osaongelmat ratkaistaan ensin ja näistä pienistä ratkaisuosista kootaan suurempien osaongelmien ratkaisuja. Organismin genomi on yleensä koodattu solun sisään DNA:han, yksinkertaistaen jonoon (sekvenssiin) emäksiä A, C, G ja T. Genomi on jäsentynyt kromosomeihin, jotka sisältävät tietyissä paikoissa esiintyviä muutoksia, merkkijaksoja. Diploidin organismin, kuten ihmisen, kromosomit (autosomit) esiintyvät pareittain. Yksilö perii parin toisen kromosomin isältään ja toisen äidiltään. Haplotyyppi on yksilön tietyissä paikoissa esiintyvien merkkijaksojen jono tietyssä kromosomiparin kromosomissa. Haplotyyppien mittaaminen suoraan on vaikeaa, mutta genotyypit ovat helpommin mitattavia. Genotyypit kertovat, mitkä kaksi merkkijaksoa kromosomin vastaavissa kohdissa esiintyy. Haplotyyppiaineistoja käytetään yleisesti esimerkiksi genettisten tautien tutkimiseen. Tämän vuoksi haplotyyppien laskennallinen ennustaminen genotyypeistä on tärkeä tutkimusongelma. Syötteenä ongelmassa on siis näyte tietyn populaation genotyypeistä, joista tulisi ennustaa haplotyypit jokaiselle näytteen yksilölle. Haplotyyppien ennustaminen genotyypeistä on mahdollista, koska haplotyypit ovat samankaltaisia yksilöiden välillä. Samankaltaisuus johtuu evoluution prosesseista, kuten periytymisestä, luonnonvalinnasta, migraatiosta ja isolaatiosta. Tässä väitöskirjassa esitetään kolme menetelmää haplotyypien määritykseen. Näistä tarkin menetelmä, nimeltään BACH, käyttää vaihtuva-asteista Markov-mallia ja bayesilaista tilastotiedettä haplotyyppien ennnustamiseen genotyyppiaineistosta. Menetelmän malli pystyy mallintamaan tarkasti geneettistä kytkentää eli fyysisesti lähekkäin sijaitsevien merkkijaksojen riippuvuutta. Tämä kytkentä näkyy haplotyyppijonojen lokaalina samankaltaisuutena. Paikallista rinnastusta käytetään esimerkiksi etsittäessä eri organismien genomien sekvensseistä samankaltaisia kohtia, esimerkiksi vastaavia geenejä. Paikallisen rinnastuksen hakualgoritmit löytävät vain samankaltaisimman kohdan, mutta eivät kerro, onko löydös tilastollisesti merkittävä. Yleinen tapa määrittää rinnastuksen tilastollista merkittävyyttä on laskea rinnastuksen hyvyydelle (pisteluvulle) p-arvo, joka kertoo rinnastuksen tilastollisen merkittävyyden. Väitöskirjan menetelmä paikallisten rinnastusten merkittävyyden laskemiseksi laskee sekvenssien paikalliselle rinnastukselle odotusarvon, joka antaa yleisesti käytettävälle p‐arvolle tiukan ylärajan. Vaikka malli on yksinkertainen, empiirisissä testeissä menetelmän antaman odotusarvon yksinkertainen johdannainen osoittautuu sangen tarkaksi p‐arvon estimaatiksi. Lähestymistavan etuna on, että sen avulla rinnastuksen aukot (poistot ja lisäykset) voidaan mallintaa suoraviivaisella tavalla

Helsingin yliopiston digitaalinen arkisto

Computational pan-genomics: status, promises and challenges

Author: Abeel Thomas
Alkan Can
Baaijens Jasmijn
Bakker Paul
Boeva Valentina
Bonnal Raoul
Chiaromonte Francesca
Chikhi Rayan
Ciccarelli Francesca
Cijvat Robin
Datema Erwin
Dijkstra Louis
Duijn Cornelia
Dutilh Bas
Eichler Evan
El-Kebir Mohammed
Ernst Corinna
Eskin Eleazar
Garrison Erik
Ghaffaari Ali
Guryev Victor
Kersey Paul
Klau Gunnar
Kloosterman Wigard
Korbel Jan
Lameijer Eric-Wubbo
Langmead Benjamin
Marschall Tobias
Martin Marcel
Marz Manja
Medvedev Paul
Mu John
Mäkinen Veli
Neerincx Pieter
Novak Adam
Ouwens Klaasjan
Paten Benedict
Peterlongo Pierre
Pisanti Nadia
Porubsky David
Rahmann Sven
Raphael Benjamin
Reinert Knut
Ridder Dick
Ridder Jeroen
Rivals Eric
Sanders Ashley
Schlesner Matthias
Schulz-Trieglaff Ole
Schönhuth Alexander
Sheikhizadeh Siavash
Shneider Carl
Smit Sandra
The Computational Pan-Genomics Consortium
Valenzuela Daniel
Vandin Fabio
Wang Jiayin
Wessels Lodewyk
Ye Kai
Zhang Ying
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2018
Field of study

International audienceMany disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains

INRIA a CCSD electronic archive server

Archivio della Ricerca - Università di Pisa

EUR Research Repository

HAL-MINES ParisTech

Archivio della ricerca della Scuola Superiore Sant'Anna

Radboud Repository

HAL-Rennes 1

The EM Algorithm and the Rise of Computational Biology

Author: Citable Link
Jun S. Liu
Xiaodan Fan
Yuan Yuan
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2010
Field of study

In the past decade computational biology has grown from a cottage industry with a handful of researchers to an attractive interdisciplinary field, catching the attention and imagination of many quantitatively-minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology problems surrounding the "central dogma"; of molecular biology: from DNA to RNA and then to proteins. Topics of this article include sequence motif discovery, protein sequence alignment, population genetics, evolutionary models and mRNA expression microarray data analysis.Comment: Published in at http://dx.doi.org/10.1214/09-STS312 the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org

arXiv.org e-Print Archive

CiteSeerX

Crossref

Optimal algorithms for haplotype assembly from whole-genome sequence data

Author: A. Choi
A. Darwiche
Bansal
Browning
D. He
E. Eskin
Frazer
K. Pipatsrisawat
Levy
Lippert
Marchini
Stephens
Wheeler
Publication venue: Oxford University Press
Publication date: 01/06/2010
Field of study

Motivation: Haplotype inference is an important step for many types of analyses of genetic variation in the human genome. Traditional approaches for obtaining haplotypes involve collecting genotype information from a population of individuals and then applying a haplotype inference algorithm. The development of high-throughput sequencing technologies allows for an alternative strategy to obtain haplotypes by combining sequence fragments. The problem of ‘haplotype assembly’ is the problem of assembling the two haplotypes for a chromosome given the collection of such fragments, or reads, and their locations in the haplotypes, which are pre-determined by mapping the reads to a reference genome. Errors in reads significantly increase the difficulty of the problem and it has been shown that the problem is NP-hard even for reads of length 2. Existing greedy and stochastic algorithms are not guaranteed to find the optimal solutions for the haplotype assembly problem

CiteSeerX

Crossref

PubMed Central

eScholarship - University of California

Simulation and graph mining tools for improving gene mapping efficiency

Author: Hintsanen Petteri
Publication venue: 'University of Helsinki Libraries'
Publication date: 30/09/2011
Field of study

Gene mapping is a systematic search for genes that affect observable characteristics of an organism. In this thesis we offer computational tools to improve the efficiency of (disease) gene-mapping efforts. In the first part of the thesis we propose an efficient simulation procedure for generating realistic genetical data from isolated populations. Simulated data is useful for evaluating hypothesised gene-mapping study designs and computational analysis tools. As an example of such evaluation, we demonstrate how a population-based study design can be a powerful alternative to traditional family-based designs in association-based gene-mapping projects. In the second part of the thesis we consider a prioritisation of a (typically large) set of putative disease-associated genes acquired from an initial gene-mapping analysis. Prioritisation is necessary to be able to focus on the most promising candidates. We show how to harness the current biomedical knowledge for the prioritisation task by integrating various publicly available biological databases into a weighted biological graph. We then demonstrate how to find and evaluate connections between entities, such as genes and diseases, from this unified schema by graph mining techniques. Finally, in the last part of the thesis, we define the concept of reliable subgraph and the corresponding subgraph extraction problem. Reliable subgraphs concisely describe strong and independent connections between two given vertices in a random graph, and hence they are especially useful for visualising such connections. We propose novel algorithms for extracting reliable subgraphs from large random graphs. The efficiency and scalability of the proposed graph mining methods are backed by extensive experiments on real data. While our application focus is in genetics, the concepts and algorithms can be applied to other domains as well. We demonstrate this generality by considering coauthor graphs in addition to biological graphs in the experiments.Geenikartoitus on organismin havaittaviin piirteisiin vaikuttavien geenien järjestelmällistä etsintää perimästä. Väitöskirjassa esitetään uusia menetelmiä, joilla voidaan tehostaa sairauksille altistavien geenien kartoitusta. Väitöskirjan alussa tarkastellaan perimän simulointia (tyypillisesti maantieteellisesti) eristäytyneissä populaatioissa ja esitetään tarkoitukseen soveltuva uusi simulaattoriohjelmisto. Simuloidut aineistot ovat hyödyllisiä tutkimussuunnittelussa, jolloin niillä voidaan arvioida suunniteltujen aineistojen tilastollisia ominaisuuksia sekä käytettävien analysointimenetelmien toimintaa. Esimerkkinä tällaisesta tutkimuksesta työssä käydään läpi esitetyllä ohjelmistolla tehty laajahko simulaatiotutkimus. Tulosten perusteella väestöpohjainen tapaus-verrokkitutkimusasetelma vaikuttaa olevan tilastollisesti voimakas vaihtoehto kalliimmille perhe- ja sukupuupohjaisille asetelmille. Toinen osa väitöskirjaa käsittelee mahdollisesti sairauksille altistavien ns. ehdokasgeenien pisteytystä sen mukaan, kuinka vahvat yhteydet niillä on tutkittavaan sairauteen. Pisteytys on tärkeää, koska alustavat aineiston tarkastelut tuottavat tyypillisesti runsaasti ehdokasgeenejä, joiden kaikkien läpikäynti olisi liian työlästä. Pisteytyksellä jatkotutkimukset voidaan kohdistaa lupaavimpiin ehdokkaisiin. Työssä esitetään kuinka tällä hetkellä erillissä tietokannoissa oleva biologinen tieto voidaan esittää yhteinäisessä verkkomuodossa. Lisäksi näytetään kuinka tällaisesta aineistosta voidaan etsiä ehdokasgeenien ja tutkittavan sairauden välisiä yhteyksiä ja pisteyttää niitä verkonlouhinta-algoritmien avulla. Lopuksi työssä esitetään luotettavan aliverkon eristämisongelma ja algoritmeja sen ratkaisemiseen. Ongelmassa tavoitteena on poimia suuresta verkosta suhteellisen pieni aliverkko, joka sisältää vahvoja ja toisistaan riippumattomia yhteyksiä kahden annetun verkon solmun välillä. Siten luotettavat aliverkot soveltuvat erityisen hyvin löydettyjen yhteyksien kuvalliseen esittämiseen. Luotettavia aliverkkoja voidaan soveltaa perinnöllisyystieteen lisäksi myös muilla aloilla, kuten sosiaalisten verkkojen analyysissä

Helsingin yliopiston digitaalinen arkisto