
    Computational methods for haplotype inference and for assessing the significance of local alignments

    This thesis, which consists of an introduction and four peer-reviewed original publications, studies the problems of haplotype inference (haplotyping) and local alignment significance. The problems studied here belong to the broad area of bioinformatics and computational biology. The presented solutions are computationally fast and accurate, which makes them practical in high-throughput sequence data analysis. Haplotype inference is a computational problem where the goal is to estimate haplotypes from a sample of genotypes as accurately as possible. This problem is important because the direct measurement of haplotypes is difficult, whereas genotypes are easier to quantify. Haplotypes are the key players when studying, for example, the genetic causes of diseases. In this thesis, three methods are presented for the haplotype inference problem, referred to as HaploParser, HIT, and BACH. HaploParser is based on a combinatorial mosaic model and hierarchical parsing that together mimic recombinations and point mutations in a biologically plausible way. In this mosaic model, the current population is assumed to have evolved from a small founder population. Thus, the haplotypes of the current population are recombinations of the (implicit) founder haplotypes with some point mutations. HIT (Haplotype Inference Technique) uses a hidden Markov model for haplotypes, and efficient algorithms are presented to learn this model from genotype data. The model structure of HIT is analogous to the mosaic model of HaploParser with founder haplotypes. Therefore, it can be seen as a probabilistic model of recombinations and point mutations. BACH (Bayesian Context-based Haplotyping) utilizes a context tree weighting algorithm to efficiently sum over all variable-length Markov chains to evaluate the posterior probability of a haplotype configuration. Algorithms are presented that find haplotype configurations with high posterior probability.
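The combinatorial core of the problem described above is that every heterozygous genotype site leaves its phase ambiguous, so the number of candidate haplotype pairs grows exponentially with the number of heterozygous sites. A minimal sketch of that ambiguity (an illustration only, not any of the thesis methods; the 0/1/2 genotype coding is an assumption of this sketch):

```python
from itertools import product

def consistent_haplotype_pairs(genotype):
    """Enumerate all ordered haplotype pairs (h1, h2) consistent with a genotype.

    Genotype coding per SNP site (an assumed convention): 0 = homozygous 0/0,
    2 = homozygous 1/1, 1 = heterozygous (phase unknown). Each heterozygous
    site doubles the number of possible phasings, which is why statistical
    models such as HMMs are needed in practice.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    for assignment in product([0, 1], repeat=len(het_sites)):
        it = iter(assignment)
        h1, h2 = [], []
        for g in genotype:
            if g == 0:
                h1.append(0); h2.append(0)
            elif g == 2:
                h1.append(1); h2.append(1)
            else:  # heterozygous: choose which haplotype carries the 1
                a = next(it)
                h1.append(a); h2.append(1 - a)
        pairs.add((tuple(h1), tuple(h2)))
    return pairs

pairs = consistent_haplotype_pairs([1, 0, 1, 2])
# two heterozygous sites -> 2^2 = 4 ordered phasings
```

A real inference method scores these candidates (e.g. by posterior probability under a population model) rather than enumerating them, since enumeration is infeasible for realistic numbers of sites.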
BACH is the most accurate method presented in this thesis and has performance comparable to the best available software for haplotype inference. Local alignment significance is a computational problem where one is interested in whether the local similarities in two sequences are due to the sequences being related or merely to chance. Similarity of sequences is measured by their best local alignment score, from which a p-value is computed. This p-value is the probability of picking two sequences from the null model that have an equally good or better best local alignment score. Local alignment significance is used routinely, for example, in homology searches. In this thesis, a general framework is sketched that allows one to compute a tight upper bound for the p-value of a local pairwise alignment score. Unlike previous methods, the presented framework is not affected by so-called edge effects and can handle gaps (deletions and insertions) without troublesome sampling and curve fitting.

This dissertation presents new, accurate, and efficient computational methods for inferring the haplotypes of a population from genotypes and for assessing the significance of local alignments of sequences. The methods are based, among other techniques, on dynamic programming, in which the smallest subproblems are solved first and the solutions of larger subproblems are assembled from these smaller parts. The genome of an organism is usually encoded in the DNA inside the cell: simplifying, as a sequence of the bases A, C, G, and T. The genome is organized into chromosomes, which contain variable markers occurring at specific positions. In a diploid organism, such as a human, the chromosomes (autosomes) occur in pairs. An individual inherits one chromosome of each pair from the father and the other from the mother. A haplotype is the sequence of markers occurring at specific positions on one chromosome of a chromosome pair. Measuring haplotypes directly is difficult, whereas genotypes are easier to measure.

Genotypes indicate which two markers occur at corresponding positions of a chromosome pair. Haplotype data are commonly used, for example, in studying genetic diseases. Computational inference of haplotypes from genotypes is therefore an important research problem. The input to the problem is a sample of genotypes from a given population, from which the haplotypes should be inferred for each individual in the sample. Inferring haplotypes from genotypes is possible because haplotypes are similar across individuals. This similarity stems from evolutionary processes such as inheritance, natural selection, migration, and isolation. This dissertation presents three methods for haplotype inference. The most accurate of them, called BACH, uses a variable-order Markov model and Bayesian statistics to infer haplotypes from genotype data. The model can accurately capture genetic linkage, that is, the dependence between markers located physically close to each other. This linkage shows up as local similarity of haplotype sequences. Local alignment is used, for example, when searching the genome sequences of different organisms for similar regions, such as corresponding genes. Local alignment search algorithms find only the most similar region but do not tell whether the finding is statistically significant. A common way to assess the statistical significance of an alignment is to compute a p-value for the alignment score. The dissertation's method for assessing local alignment significance computes an expected value for the local alignment of the sequences, which gives a tight upper bound on the commonly used p-value. Although the model is simple, in empirical tests a simple derivative of the expectation produced by the method turns out to be quite an accurate estimate of the p-value. An advantage of the approach is that alignment gaps (deletions and insertions) can be modeled in a straightforward way.
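The p-value definition used above can be made concrete: compute the best local alignment score with Smith-Waterman dynamic programming, then estimate by sampling the probability that two sequences drawn from an i.i.d. null model score at least as well. This Monte Carlo sketch illustrates the definition only; it is not the thesis's upper-bound framework, and the unit scoring parameters are illustrative assumptions:

```python
import random

def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score via Smith-Waterman dynamic programming.

    Cells are clamped at 0, so the best score over any pair of substrings
    is the maximum cell value. Scoring parameters are illustrative.
    """
    best = 0
    prev = [0] * (len(b) + 1)  # previous DP row
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            s = max(0,
                    prev[j - 1] + (match if x == y else mismatch),  # (mis)match
                    prev[j] + gap,      # gap in b
                    cur[j - 1] + gap)   # gap in a
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best

def empirical_p_value(a, b, trials=1000, alphabet="ACGT", seed=0):
    """P(random pair scores >= observed) under an i.i.d. uniform null model."""
    rng = random.Random(seed)
    observed = smith_waterman_score(a, b)
    hits = sum(
        smith_waterman_score(
            "".join(rng.choice(alphabet) for _ in a),
            "".join(rng.choice(alphabet) for _ in b),
        ) >= observed
        for _ in range(trials)
    )
    return hits / trials
```

Such sampling is exactly the expensive step that analytic bounds like the one sketched in the thesis aim to avoid, especially for small p-values where few or no random pairs ever reach the observed score.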

    A variant of the tandem duplication - random loss model of genome rearrangement

    In Soda'06, Chaudhuri, Chen, Mihaescu and Rao study algorithmic properties of the tandem duplication - random loss model of genome rearrangement, well-known in evolutionary biology. In their model, the cost of one step of duplication-loss of width k is α^k, for α = 1 or α ≥ 2. In this paper, we study a variant of this model, where the cost of one step of width k is 1 if k ≤ K and ∞ if k > K, for any value of the parameter K ∈ ℕ. We first show that permutations obtained after p steps of width K define classes of pattern-avoiding permutations. We also compute the numbers of duplication-loss steps of width K necessary and sufficient to obtain any permutation of S_n, in the worst case and on average. In this second part, we may also consider the case K = K(n), a function of the size n of the permutation on which the duplication-loss operations are performed.
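One duplication-loss step of width k duplicates a window of k consecutive elements in tandem and then loses one copy of each element; the surviving permutation is equivalent to splitting the window into two order-preserving subsequences and concatenating them. A sketch of a single step under that reading (the representation of the random losses as a "kept in the first copy" set is an assumption of this illustration):

```python
def duplication_loss_step(perm, start, k, keep_first):
    """Apply one tandem duplication - random loss step of width k.

    The window perm[start:start+k] is duplicated in tandem; then one copy of
    each element is lost. The result is the window split into two
    order-preserving subsequences: elements surviving in the first copy,
    followed by elements surviving in the second copy.

    keep_first: the set of window elements that survive in the first copy.
    """
    window = perm[start:start + k]
    first = [x for x in window if x in keep_first]
    second = [x for x in window if x not in keep_first]
    return perm[:start] + first + second + perm[start + k:]

# one step of width 4 on the identity permutation:
# window [1, 2, 3, 4] splits into [2, 4] and [1, 3]
result = duplication_loss_step([1, 2, 3, 4, 5], 0, 4, {2, 4})
# -> [2, 4, 1, 3, 5]
```

Iterating such steps from the identity permutation generates exactly the permutations whose cost is counted in the model; which permutations are reachable in p steps is what connects the model to pattern-avoiding classes.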

    Lightweight Massively Parallel Suffix Array Construction

    The suffix array is an array of sorted suffixes in lexicographic order, where each sorted suffix is represented by its starting position in the input string. It is a fundamental data structure that finds various applications in areas such as string processing, text indexing, data compression, computational biology, and many more. Over the last three decades, researchers have proposed a broad spectrum of suffix array construction algorithms (SACAs). However, the majority of SACAs were implemented using sequential and CPU-parallel programming models. The maturity of GPU programming opened doors to the development of massively parallel GPU SACAs that outperform the fastest suffix sorting algorithms optimized for parallel CPU computing. Over the last five years, several GPU SACA approaches were proposed and implemented. They prioritized running time over lightweight design. In this thesis, we design and implement a lightweight massively parallel SACA on the GPU using the prefix-doubling technique. Our prefix-doubling implementation is memory-efficient and can successfully construct the suffix array for input strings as large as 640 megabytes (MB) on a Tesla P100 GPU. On large datasets, our implementation achieves a speedup of 7-16x over the fastest, highly optimized, OpenMP-accelerated suffix array constructor, libdivsufsort, which leverages CPU shared-memory parallelism. The performance of our algorithm relies on several high-performance parallel primitives such as radix sort, conditional filtering, inclusive prefix sum, random memory scattering, and segmented sort. We evaluate the performance of our implementation over a variety of real-world datasets with respect to its runtime, throughput, memory usage, and scalability. We compare our results against libdivsufsort, which we run on a Haswell compute node equipped with 24 cores. Our GPU SACA is simple and compact, consisting of less than 300 lines of readable and effective source code.
Additionally, we design and implement a fast and lightweight algorithm for checking the correctness of the suffix array.
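The prefix-doubling technique named above sorts suffixes by ranking their length-2^t prefixes and combining two such ranks to rank length-2^(t+1) prefixes, doubling until all ranks are distinct. A minimal single-threaded sketch (the thesis's contribution is a massively parallel GPU version built from primitives like radix sort and prefix sums; this illustration only shows the underlying technique):

```python
def suffix_array_prefix_doubling(s):
    """Suffix array by prefix doubling (Manber-Myers style).

    rank[i] is the rank of the length-k prefix of suffix i; each round sorts
    suffixes by the pair (rank[i], rank[i+k]) and re-ranks, doubling k,
    until all ranks are distinct.
    """
    n = len(s)
    rank = [ord(c) for c in s]
    sa = list(range(n))
    k = 1
    while True:
        # sort key: rank of suffix, then rank of the suffix k positions later
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: fully sorted
            break
        k *= 2
    return sa

sa = suffix_array_prefix_doubling("banana")
# sorted suffixes: a, ana, anana, banana, na, nana -> [5, 3, 1, 0, 4, 2]
```

Each round is a sort plus a scan over adjacent keys, which is precisely why the technique maps well onto GPU primitives such as radix sort and prefix sum.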

    Graphical Model approaches for Biclustering

    In many scientific areas, it is crucial to group (cluster) a set of objects based on a set of observed features. Such an operation is widely known as Clustering, and it has been exploited in the most diverse scenarios, ranging from Economics to Biology passing through Psychology. Taking a step forward, there exist contexts where it is crucial to group objects and simultaneously identify the features that allow one to distinguish such objects from the others. In gene expression analysis, for instance, the identification of subsets of genes showing a coherent pattern of expression in subsets of objects/samples can provide crucial information about active biological processes. Such information, which cannot be retrieved by classical clustering approaches, can be extracted with so-called Biclustering, a class of approaches which aim at simultaneously clustering both rows and columns of a given data matrix (where each row corresponds to a different object/sample and each column to a different feature). The problem of biclustering, also known as co-clustering, has recently been exploited in a wide range of scenarios such as Bioinformatics, market segmentation, data mining, text analysis and recommender systems. Many approaches have been proposed to address the biclustering problem, each one characterized by different properties such as interpretability, effectiveness or computational complexity. A recent trend involves the exploitation of sophisticated computational models (Graphical Models) to face the intrinsic complexity of biclustering, and to retrieve very accurate solutions. Graphical Models represent the decomposition of a global objective function into a set of smaller/local functions defined over a subset of variables.
The advantage of using Graphical Models lies in the fact that the graphical representation can highlight useful hidden properties of the considered objective function; moreover, the analysis of smaller local problems requires less computational effort. Due to the difficulties in obtaining a representative and solvable model, and since biclustering is a complex and challenging problem, there exist few promising approaches in the literature based on Graphical Models facing biclustering. This thesis fits into the above-mentioned scenario, and it investigates the exploitation of Graphical Models to face the biclustering problem. We explored different types of Graphical Models, in particular Factor Graphs and Bayesian Networks. We present three novel algorithms (with extensions) and evaluate such techniques using available benchmark datasets. All the models have been compared with state-of-the-art competitors, and the results show that Factor Graph approaches lead to solid and efficient solutions for datasets of moderate size, whereas Bayesian Networks can manage huge datasets, with the drawback that setting the parameters can be non-trivial. As another contribution of the thesis, we widen the range of biclustering applications by studying the suitability of these approaches in some Computer Vision problems where biclustering has never been adopted before. Summarizing, with this thesis we provide evidence that Graphical Model techniques can have a significant impact in the biclustering scenario. Moreover, we demonstrate that biclustering techniques are flexible and can produce effective solutions in the most diverse fields of application.
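As a concrete instance of what a biclustering objective can look like, the classic mean squared residue of Cheng and Church scores how coherent a selected submatrix is: it is 0 for a perfectly additive pattern across the chosen rows and columns. This measure is an illustrative choice only, not one of the thesis's Graphical Model approaches:

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue (Cheng & Church) of a candidate bicluster.

    For the submatrix given by `rows` x `cols`, each entry is compared to
    row mean + column mean - overall mean; a residue of 0 means a perfectly
    coherent additive pattern.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    R, C = len(rows), len(cols)
    row_mean = [sum(r) / C for r in sub]
    col_mean = [sum(sub[i][j] for i in range(R)) / R for j in range(C)]
    total = sum(row_mean) / R
    return sum(
        (sub[i][j] - row_mean[i] - col_mean[j] + total) ** 2
        for i in range(R) for j in range(C)
    ) / (R * C)

data = [
    [1, 2, 3, 9],
    [2, 3, 4, 0],
    [3, 4, 5, 7],
    [8, 1, 6, 2],
]
# rows 0-2 x columns 0-2 form an additive pattern, so the residue is 0
score = mean_squared_residue(data, [0, 1, 2], [0, 1, 2])
```

A biclustering algorithm then searches over row/column subsets to minimize such a criterion; Graphical Model approaches instead encode the search as inference over a decomposed objective function.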

    Computational Integrative Models for Cellular Conversion: Application to Cellular Reprogramming and Disease Modeling

    The groundbreaking identification of only four transcription factors that are able to induce pluripotency in any somatic cell upon perturbation stimulated the discovery of copious amounts of instructive factors triggering different cellular conversions. Such conversions are highly significant to regenerative medicine, with its ultimate goal of replacing or regenerating damaged and lost cells. Precise directed conversion of damaged cells into healthy cells offers the tantalizing prospect of promoting regeneration in situ. With the advent of high-throughput sequencing technologies, the distinct transcriptional and accessible chromatin landscapes of several cell types have been characterized. This characterization provided clear evidence for the existence of cell type specific gene regulatory networks, determined by their distinct epigenetic landscapes, that control cellular phenotypes. Further, these networks are known to change dynamically during the ectopic expression of genes initiating cellular conversions and to stabilize again to represent the desired phenotype. Over the years, several computational approaches have been developed to leverage the large amounts of high-throughput datasets for a systematic prediction of instructive factors that can potentially induce desired cellular conversions. To date, the most promising approaches rely on the reconstruction of gene regulatory networks for a panel of well-studied cell types, relying predominantly on transcriptional data alone. Though useful, these methods are not designed for newly identified cell types, as their frameworks are restricted to the panel of cell types originally incorporated. More importantly, these approaches rely mainly on gene expression data and cannot account for the cell type specific regulation modulated by the interplay of the transcriptional and epigenetic landscapes.
In this thesis, a computational method for reconstructing cell type specific gene regulatory networks is proposed that aims at addressing the aforementioned limitations of current approaches. This method integrates transcriptomics, chromatin accessibility assays and available prior knowledge about gene regulatory interactions for predicting instructive factors that can potentially induce desired cellular conversions. Its application to the prioritization of drugs for reverting pathologic phenotypes and the identification of instructive factors for inducing the cellular conversion of adipocytes into osteoblasts underlines the potential to assist in the discovery of novel therapeutic interventions.
