59 research outputs found

    Quantification and Visualization of LD Patterns and Identification of Haplotype Blocks

    Classical measures of linkage disequilibrium (LD) between two loci, based only on the joint distribution of alleles at these loci, present noisy patterns. In this paper, we propose a new distance-based LD measure, R, which takes into account multilocus haplotypes around the two loci in order to exploit information from neighboring loci. The LD measure R yields a matrix of pairwise distances between markers, based on the correlation between the lengths of shared haplotypes among chromosomes around these markers. Data analysis demonstrates that visualization of LD patterns through the R matrix reveals more deterministic patterns, with much less noise, than using classical LD measures. Moreover, the patterns are highly compatible with recently suggested models of haplotype block structure. We propose to apply the new LD measure to define haplotype blocks through cluster analysis. Specifically, we present a distance-based clustering algorithm, DHPBlocker, which performs hierarchical partitioning of an ordered sequence of markers into disjoint and adjacent blocks with a hierarchical structure. The proposed method integrates information on the two main existing criteria in defining haplotype blocks, namely, LD and haplotype diversity, through the use of silhouette width and description length as cluster validity measures, respectively. The new LD measure and clustering procedure are applied to single nucleotide polymorphism (SNP) datasets from the human 5q31 region (Daly et al. 2001) and the class II region of the human major histocompatibility complex (Jeffreys et al. 2001). Our results are in good agreement with published results. In addition, analyses performed on different subsets of markers indicate that the method is robust with regard to the allele frequency and density of the genotyped markers. Unlike previously proposed methods, our new cluster-based method can uncover hierarchical relationships among blocks and can be applied to polymorphic DNA markers or amino acid sequence data.
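For orientation, the "classical" pairwise measures that the abstract contrasts with R can be sketched as below. This is only an illustration of the standard two-locus statistics D, |D'| and r², not the R measure or DHPBlocker. The function name `pairwise_ld` and the 0/1 haplotype coding are assumptions made for the example.

```python
def pairwise_ld(haplotypes, i, j):
    """Return (D, |D'|, r^2) for biallelic loci i and j.

    haplotypes: list of phased haplotypes, each a sequence of 0/1 alleles
    (hypothetical input format chosen for this sketch).
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n   # frequency of allele 1 at locus i
    p_b = sum(h[j] for h in haplotypes) / n   # frequency of allele 1 at locus j
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    d = p_ab - p_a * p_b                      # raw disequilibrium coefficient

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return d, 0.0, 0.0                    # a monomorphic locus carries no LD signal
    r2 = d * d / denom

    # D' rescales D by its maximum attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return d, d_prime, r2
```

Because these statistics use only the two loci in question, small samples or rare alleles can make neighboring entries of the LD matrix fluctuate, which is the noise the abstract refers to.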

    A model-based approach to selection of tag SNPs

    BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are the most common type of polymorphism found in the human genome. Effective genetic association studies require the identification of sets of tag SNPs that capture as much haplotype information as possible. Tag SNP selection is analogous to the problem of data compression in information theory. According to Shannon's framework, the optimal tag set maximizes the entropy of the tag SNPs subject to constraints on the number of SNPs. This approach requires an appropriate probabilistic model. Compared to simple measures of Linkage Disequilibrium (LD), a good model of haplotype sequences can more accurately account for LD structure. It also provides machinery for predicting tagged SNPs, and thereby for assessing the performance of tag sets through their ability to predict larger SNP sets. RESULTS: Here, we compute the description code-lengths of SNP data for an array of models and we develop tag SNP selection methods based on these models and the strategy of entropy maximization. Using data sets from the HapMap and ENCODE projects, we show that the hidden Markov model introduced by Li and Stephens outperforms the other models in several aspects: description code-length of SNP data, information content of tag sets, and prediction of tagged SNPs. This is the first use of this model in the context of tag SNP selection. CONCLUSION: Our study provides strong evidence that the tag sets selected by our best method, based on the Li and Stephens model, outperform those chosen by several existing methods. The results also suggest that information content evaluated with a good model is more sensitive for assessing the quality of a tagging set than the correct prediction rate of tagged SNPs. In addition, we show that haplotype phase uncertainty has an almost negligible impact on the ability of good tag sets to predict tagged SNPs. This justifies the selection of tag SNPs on the basis of haplotype informativeness, although genotyping studies do not directly assess haplotypes. Software that implements our approach is available.
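The entropy-maximization strategy described above can be illustrated with a small sketch. It departs from the paper in two labeled ways: it uses the empirical haplotype distribution rather than the Li and Stephens hidden Markov model, and it uses a greedy search, which is a heuristic assumed here and not taken from the abstract. The names `joint_entropy` and `greedy_tag_snps` are hypothetical.

```python
import math
from collections import Counter

def joint_entropy(haplotypes, columns):
    """Empirical Shannon entropy (in bits) of the haplotypes restricted to `columns`."""
    n = len(haplotypes)
    counts = Counter(tuple(h[c] for c in columns) for h in haplotypes)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def greedy_tag_snps(haplotypes, n_tags):
    """Greedily add the SNP that most increases the joint entropy of the tag set.

    Assumption: empirical frequencies stand in for a fitted haplotype model.
    Ties are broken in favor of the lowest SNP index.
    """
    n_snps = len(haplotypes[0])
    selected = []
    for _ in range(min(n_tags, n_snps)):
        remaining = [c for c in range(n_snps) if c not in selected]
        best = max(remaining, key=lambda c: joint_entropy(haplotypes, selected + [c]))
        selected.append(best)
    return selected
```

In this sketch a SNP that is perfectly correlated with an already selected tag adds no entropy, so the search prefers SNPs that carry new haplotype information.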

    Parsimony-based genetic algorithm for haplotype resolution and block partitioning

    This dissertation proposes a new algorithm for performing simultaneous haplotype resolution and block partitioning. The algorithm is based on a genetic algorithm approach and the parsimony principle. The multilocus LD measure (Normalized Entropy Difference) is used as a block identification criterion. The proposed algorithm incorporates missing data as a part of the model and allows blocks of arbitrary length. In addition, the algorithm provides scores for the block boundaries, which measure the strength of the boundaries at specific positions. The performance of the proposed algorithm was validated by running it on several publicly available data sets, including the HapMap data, and comparing the results to those of existing state-of-the-art algorithms. The results show that the accuracy of haplotype decomposition achieved by the proposed genetic algorithm falls within the range achieved by the other algorithms. The block structure output by our algorithm in general agrees with the block structure for the same data provided by the other algorithms. Thus, the proposed algorithm can be successfully used for block partitioning and haplotype phasing while providing some new valuable features, such as scores for block boundaries and fully incorporated treatment of missing data. In addition, the proposed algorithm for haplotyping and block partitioning is used in the development of a new clustering algorithm for two-population mixed genotype samples. The proposed clustering algorithm extracts from the given genotype sample two clusters with substantially different block structures and finds the haplotype resolution and block partitioning for each cluster.

    The Minimum Description Length Principle for Pattern Mining: A Survey

    This is about the Minimum Description Length (MDL) principle applied to pattern mining. The length of this description is kept to the minimum. Mining patterns is a core task in data analysis and, beyond issues of efficient enumeration, the selection of patterns constitutes a major challenge. The MDL principle, a model selection method grounded in information theory, has been applied to pattern mining with the aim to obtain compact high-quality sets of patterns. After giving an outline of relevant concepts from information theory and coding, as well as of work on the theory behind the MDL and similar principles, we review MDL-based methods for mining various types of data and patterns. Finally, we open a discussion on some issues regarding these methods, and highlight currently active related data analysis problems
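As a rough illustration of the principle the survey builds on, the sketch below compares two descriptions of a binary sequence by total code length and prefers the shorter one. It is a crude two-part code, assumed here for illustration and not drawn from the survey: the parameter cost of (1/2) log2 n bits is a common approximation, and the function names are hypothetical.

```python
import math

def bernoulli_two_part_length(bits):
    """Two-part code length in bits: L(model) + L(data | model) for a fitted Bernoulli.

    Assumption: the single parameter costs 0.5 * log2(n) bits, a standard crude
    approximation; refined MDL codes differ in the details.
    """
    n = len(bits)
    k = sum(bits)
    model_bits = 0.5 * math.log2(n)
    if k in (0, n):
        data_bits = 0.0                       # a constant sequence needs no further bits
    else:
        p = k / n
        data_bits = -(k * math.log2(p) + (n - k) * math.log2(1 - p))
    return model_bits + data_bits

def fair_coin_length(bits):
    """Code length under a parameter-free fair coin: one bit per symbol."""
    return float(len(bits))
```

For a heavily skewed sequence the fitted model pays for its parameter and still compresses better than the fair coin; for a balanced sequence the extra parameter is wasted and the simpler description is shorter. MDL-based pattern mining applies the same trade-off to sets of patterns instead of a single parameter.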

    This is just a phase : the impact of population structure on haplotype phasing and linkage disequilibrium measures at functional genetic sites.

    The block-like structure of the human genome has been the subject of many scientific papers and is of practical significance in large-scale genome-wide association studies. How stringent haplotype block boundaries are within and between populations has been the subject of ongoing debate within human population genetics. This thesis will contribute to the description of universal and population-specific haplotype blocks at functional sites, namely across the IL-10 gene family (including IL-10, IL-19, IL-20 and IL-24), which is involved in a number of immune system processes, and MAPKAP-K2, an adjacent and functionally significant kinase gene. Beyond the description of blocks across these sites in different populations, this thesis will also measure the impact of the haplotype phasing process on downstream applications of linkage disequilibrium analysis, which underlies much of the research on human haplotype blocks. The five genes in this analysis span just over 200kb on the q arm of chromosome 1. A total of 80 samples from the Coriell Institute of Medical Research are used in this analysis and represent Andean, Basque, Chinese, Iberian, Indo-Pakistani, Middle Eastern, Russian, South African and North African populations. Some haplotype block boundaries were concordant with gene boundaries with most populations showing a consistent boundary between IL-20 and IL-24 and at least half of the study populations showing consistent boundaries between MAPKAP-K2, IL-10 and IL-20. The only gene boundary lacking a persistent haplotype block boundary was between IL-19 and IL-20. The haplotype phasing programs PHASE and Beagle shared 13 of 15 haplotype block boundaries in common while MDBlocks and Beagle only shared 2 haplotype block boundaries and PHASE and MDBlocks only shared 1 block boundary. These data indicate that there are indeed population-specific differences in the distribution of LD across these five sites. 
Despite these differences, there is a general trend of high LD across each gene, with a breakdown of LD at gene boundaries across all populations.

    Haplotype block partitioning as a tool for dimensionality reduction in SNP association studies

    BACKGROUND: Identification of disease-related genes in association studies is challenged by the large number of SNPs typed. To address the dilution of power caused by high dimensionality, and to generate results that are biologically interpretable, it is critical to take into consideration the spatial correlation of SNPs along the genome. With the goal of identifying true genetic associations, partitioning the genome according to spatial correlation can be a powerful and meaningful way to address this dimensionality problem. RESULTS: We developed and validated an MCMC Algorithm To Identify blocks of Linkage DisEquilibrium (MATILDE) for clustering contiguous SNPs, and a statistical testing framework to detect association using partitions as units of analysis. We compared its ability to detect true SNP associations to that of the most commonly used algorithm for block partitioning, as implemented in the Haploview and HapBlock software. Simulations were based on artificially assigning phenotypes to individuals with SNPs corresponding to region 14q11 of the HapMap database. When block partitioning is performed using MATILDE, the ability to correctly identify a disease SNP is higher, especially for small effects, than it is with the alternatives considered. The advantages concern both the number of true positive findings and the limitation of false discoveries. Finer partitions provided by LD-based methods or by marker-by-marker analysis are efficient only for detecting large effects, or in the presence of large sample sizes. The probabilistic approach we propose offers several additional advantages, including: a) adapting the estimation of blocks to the population, technology, and sample size of the study; b) probabilistic assessment of uncertainty about block boundaries and about whether any two SNPs are in the same block; c) user selection of the probability threshold for assigning SNPs to the same block. CONCLUSION: We demonstrate that, in realistic scenarios, our adaptive, study-specific block partitioning approach is as efficient as, or more efficient than, currently available LD-based approaches in guiding the search for disease loci.
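Points b) and c) above can be pictured with a short sketch: given block partitions sampled by some MCMC procedure, estimate for each SNP pair the fraction of samples in which the two SNPs share a block, then apply a user-chosen threshold. This is not the MATILDE implementation. The input format (one list of block labels per sample) and the names `coassignment_probabilities` and `same_block_pairs` are assumptions made for the example.

```python
def coassignment_probabilities(samples):
    """Estimate P(SNP a and SNP b are in the same block) from sampled partitions.

    samples: list of label sequences, one per posterior draw, where labels[k]
    is the block id of SNP k (hypothetical format for this sketch).
    """
    n_snps = len(samples[0])
    counts = [[0] * n_snps for _ in range(n_snps)]
    for labels in samples:
        for a in range(n_snps):
            for b in range(n_snps):
                if labels[a] == labels[b]:
                    counts[a][b] += 1
    n_samples = len(samples)
    return [[c / n_samples for c in row] for row in counts]

def same_block_pairs(prob, threshold):
    """Return SNP pairs (a, b), a < b, whose co-assignment probability meets the threshold."""
    n_snps = len(prob)
    return [(a, b)
            for a in range(n_snps)
            for b in range(a + 1, n_snps)
            if prob[a][b] >= threshold]
```

Raising the threshold yields fewer, more certain pairings; lowering it groups SNPs more liberally. The abstract leaves that choice to the user.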

    Identification of rheumatoid arthritis biomarkers based on single nucleotide polymorphisms and haplotype blocks: A systematic review and meta-analysis

    The genetics of autoimmune diseases is a rapidly progressing domain that is yielding notable biomarker results. The exact cause of Rheumatoid Arthritis (RA) is unknown, but it is thought to have both genetic and environmental bases. Genetic biomarkers are capable of changing the supervision of RA by allowing not only the detection of susceptible individuals, but also early diagnosis, evaluation of disease severity, selection of therapy, and monitoring of response to therapy. This review is concerned with not only the genetic biomarkers of RA but also the methods of identifying them. Many of the known genetic biomarkers of RA were identified in populations of European and Asian ancestries. The study of additional human populations may yield novel results. Most researchers working on the identification of RA biomarkers use single nucleotide polymorphism (SNP) approaches to express the significance of their results. Haplotype block methods, however, are expected to play a complementary role in the future of that field.

    Computational methods for augmenting association-based gene mapping

    The context and motivation for this thesis is gene mapping, the discovery of genetic variants that affect susceptibility to disease. The goals of gene mapping research include understanding of disease mechanisms, evaluating individual disease risks and ultimately developing new medicines and treatments. Traditional genetic association mapping methods test each measured genetic variant independently for association with the disease. One way to improve the power of detecting disease-affecting variants is to base the tests on haplotypes, strings of adjacent variants that are inherited together, instead of individual variants. To enable haplotype analyses in large-scale association studies, this thesis introduces two novel statistical models and gives an efficient algorithm for haplotype reconstruction, jointly called HaploRec. HaploRec is based on modeling local regularities of variable length in the haplotypes of the studied population and using the obtained model to statistically reconstruct the most probable haplotypes for each studied individual. Our experiments demonstrate that HaploRec is especially well suited to data sets with a large number of markers and subjects, such as those typically used in currently popular genome-wide association studies. Public biological databases contain large amounts of data that can help in determining the relevance of putative associations. In this thesis, we introduce Biomine, a database and search engine that integrates data from several such databases under a uniform graph representation. The graph database is used to derive a general proximity measure for biological entities represented as graph nodes, based on a novel scheme of weighting individual graph edges based on their informativeness and type. The resulting proximity measure can be used as a basis for various data analysis tasks, such as ranking putative disease genes and visualization of gene relationships. 
Our experiments show that relevant disease genes can be identified from among the putative ones with reasonable accuracy using Biomine. The best accuracy is obtained when a previously known reference set of disease genes is available, but experiments using a novel clustering-based method demonstrate that putative disease genes can also be ranked without a reference set under suitable conditions. An important complementary use of Biomine is the search and visualization of indirect relationships between graph nodes, which can be used, for example, to characterize the relationship of putative disease genes to already known disease genes. We provide two methods for selecting subgraphs to be visualized: one based on weights of the edges on the paths connecting query nodes, and one based on using context-free grammars to define the types of paths to be displayed. Both of these query interfaces to Biomine are available online.
[Translated from the Finnish abstract] The subject area of this dissertation is gene mapping, the localization of hereditary variants that affect disease susceptibility. The practical goals of gene mapping are understanding disease mechanisms, assessing individual disease risks, and developing new medications. This work develops computational methods that can be used to improve the power of existing gene mapping methods and to analyze the preliminary results they produce. Gene mapping methods are based on so-called markers, which are sites in the genome that vary between individuals. Commonly used methods measure the association between the disease and the variants occurring at each marker separately, without taking other markers into account. Mapping accuracy can be improved by using haplotypes as the unit of testing instead of individual markers; haplotypes are regular sequences, formed by the variants at nearby markers, that are inherited together. Laboratory methods do not directly reveal how the variants measured from an individual's genome are distributed between the haplotypes inherited from that individual's two parents. The first part of this dissertation presents a computational method with which haplotypes can be reconstructed statistically on the basis of their local regularities. The method is computationally efficient and is particularly suited to genome-wide studies, in which the numbers of both studied individuals and markers are large and the markers lie reasonably far apart. The effects of individual variants on diseases are often relatively weak, and when a large set of markers is tested, the results usually also include, by chance, variants with no real effect on the disease. Public biological databases contain a great deal of information that can help in assessing the relevance of preliminary gene mapping results. The second part of the work introduces Biomine, a database that integrates information from a number of such databases using a weighted graph model that describes, among other things, known relationships between genes, proteins, and diseases. A new method is presented for measuring the strength of indirect connections between nodes of the graph. This method can be used, for example, to prioritize candidate genes found by gene mapping, by measuring the strength of the connections between candidate genes and previously known disease genes, or the strength of the connections among the candidate genes themselves. The work also presents methods for visualizing indirect connections between nodes of the graph database, based on extracting from the database a small subgraph that best describes the connection between the nodes of interest. Two computationally efficient methods are presented for selecting the subgraph: one based on estimating the strength of connections, and one based on the types of links contained in the connecting paths. These visualization methods are also available in a public web service in which the Biomine database can be queried.