
    Parsimony-based genetic algorithm for haplotype resolution and block partitioning

    This dissertation proposes a new algorithm for simultaneous haplotype resolution and block partitioning. The algorithm is based on a genetic algorithm approach and the parsimony principle. A multilocus LD measure (Normalized Entropy Difference) is used as the block identification criterion. The proposed algorithm incorporates missing data as part of the model and allows blocks of arbitrary length. In addition, the algorithm provides scores for the block boundaries, which measure the strength of the boundaries at specific positions. The performance of the proposed algorithm was validated by running it on several publicly available data sets, including the HapMap data, and comparing the results to those of existing state-of-the-art algorithms. The results show that the accuracy of haplotype decomposition achieved by the proposed genetic algorithm falls within the range achieved by the other algorithms. The block structure output by the algorithm generally agrees with the block structure produced for the same data by the other algorithms. Thus, the proposed algorithm can be used for block partitioning and haplotype phasing while providing new features such as scores for block boundaries and fully integrated treatment of missing data. In addition, the proposed algorithm for haplotyping and block partitioning is used to develop a new clustering algorithm for two-population mixed genotype samples. The proposed clustering algorithm extracts from a given genotype sample two clusters with substantially different block structures and finds a haplotype resolution and block partitioning for each cluster.
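The parsimony principle mentioned in this abstract can be illustrated with a small brute-force sketch. This is not the dissertation's genetic algorithm: it exhaustively searches every resolution and is only feasible for tiny inputs. The genotype encoding (0 and 1 for homozygous sites, 2 for heterozygous sites) and the function names are assumptions made for illustration.

```python
from itertools import product

def compatible_pairs(genotype):
    """Yield every unordered haplotype pair consistent with a genotype.

    Encoding (an assumption of this sketch): 0 and 1 are homozygous sites,
    2 is a heterozygous site.
    """
    het = [i for i, g in enumerate(genotype) if g == 2]
    base = [g if g != 2 else 0 for g in genotype]
    if not het:
        h = tuple(base)
        yield (h, h)
        return
    # Fix the first heterozygous site to 0 on h1 so each unordered pair appears once.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(base), list(base)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        yield (tuple(h1), tuple(h2))

def most_parsimonious(genotypes):
    """Exhaustively find a resolution that uses the fewest distinct haplotypes."""
    best, best_count = None, None
    options = [list(compatible_pairs(g)) for g in genotypes]
    for resolution in product(*options):
        distinct = {h for pair in resolution for h in pair}
        if best_count is None or len(distinct) < best_count:
            best, best_count = resolution, len(distinct)
    return best, best_count

if __name__ == "__main__":
    resolution, count = most_parsimonious([(0, 0), (2, 2), (1, 1)])
    print(resolution, count)
```

A genotype with k heterozygous sites has 2^(k-1) candidate pairs, so the search space grows exponentially. That growth is why heuristic searches such as genetic algorithms are used in practice.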

    Algorithms For Haplotype Inference And Block Partitioning

    The completion of the Human Genome Project in 2003 paved the way for studies to better understand and catalog variation in the human genome. The International HapMap Project was started in 2002 with the aim of identifying genetic variation in the human genome and studying the distribution of genetic variation across populations of individuals. The information collected by the HapMap project will enable researchers to associate genetic variations with phenotypic variations. Single Nucleotide Polymorphisms (SNPs) are loci in the genome where two individuals differ in a single base. It is estimated that there are approximately ten million SNPs in the human genome. These ten million SNPs are not completely independent of each other: blocks (contiguous regions) of neighboring SNPs on the same chromosome are inherited together. The pattern of SNPs on a block of the chromosome is called a haplotype. Each block might contain a large number of SNPs, but a small subset of these SNPs is sufficient to uniquely identify each haplotype in the block. The haplotype map, or HapMap, is a map of these haplotype blocks. Haplotypes, rather than individual SNP alleles, are expected to affect a disease phenotype. The human genome is diploid, meaning that in each cell there are two copies of each chromosome, i.e., each individual has two haplotypes in any region of the chromosome. With current technology, the cost of empirically collecting haplotype data is prohibitive. Therefore, unordered bi-allelic genotype data is collected experimentally. The genotype data gives the two alleles at each SNP locus in an individual, but does not give information about which allele is on which copy of the chromosome. This necessitates computational techniques for inferring haplotypes from genotype data. This computational problem is called the haplotype inference problem. Many statistical approaches have been developed for the haplotype inference problem. 
Some of these statistical methods have been shown to be reasonably accurate on real genotype data. However, these techniques are very computation-intensive. With the International HapMap Project collecting information from nearly 10 million SNPs, and with association studies involving thousands of individuals being undertaken, there is a need for more efficient methods for haplotype inference. This dissertation is an effort to develop efficient perfect phylogeny based combinatorial algorithms for haplotype inference. The perfect phylogeny haplotyping (PPH) problem is to derive a set of haplotypes for a given set of genotypes with the condition that the haplotypes describe a perfect phylogeny. The perfect phylogeny approach to haplotype inference is applicable to the human genome due to the block structure of the human genome. An important contribution of this dissertation is an optimal O(nm) time algorithm for the PPH problem, where n is the number of genotypes and m is the number of SNPs involved. The complexity of the earlier algorithms for this problem was O(nm^2). The O(nm) complexity was achieved by applying some transformations to the input data and by making use of the FlexTree data structure, developed as part of this dissertation work, which represents all the possible PPH solutions for a given set of genotypes. Real genotype data does not always admit a perfect phylogeny, even within a block of the human genome. Therefore, it is necessary to extend the perfect phylogeny approach to accommodate deviations from perfect phylogeny. Deviations from perfect phylogeny might occur because of recombination events and repeated or back mutations (also referred to as homoplasy events). Another contribution of this dissertation is a set of fixed-parameter tractable algorithms for constructing near-perfect phylogenies with homoplasy events. 
For the problem of constructing a near-perfect phylogeny with q homoplasy events, the algorithm presented here takes O(nm^2+m^(n+m)) time. Empirical analysis on simulated data shows that this algorithm produces more accurate results than PHASE (a popular haplotype inference program), while being approximately 1000 times faster than PHASE. Another important problem when dealing with real genotype or haplotype data is the presence of missing entries. The Incomplete Perfect Phylogeny (IPP) problem is to construct a perfect phylogeny on a set of haplotypes with missing entries. The Incomplete Perfect Phylogeny Haplotyping (IPPH) problem is to construct a perfect phylogeny on a set of genotypes with missing entries. Both the IPP and IPPH problems have been shown to be NP-hard. The earlier approaches to both of these problems dealt with restricted versions of the problem, where the root is either available or can be trivially reconstructed from the data, or where certain assumptions were made about the data. We make some novel observations about these problems and present efficient algorithms for unrestricted versions of these problems. The algorithms have worst-case exponential time complexity, but have been shown to be very fast on practical instances of the problem.
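The perfect phylogeny condition on a set of binary haplotypes can be checked with the classical four-gamete test: an unrooted perfect phylogeny exists exactly when no pair of SNP columns exhibits all four allele combinations. The sketch below applies that test to haplotypes that are already phased. It is not the O(nm) PPH algorithm or the FlexTree structure described in the abstract, and the function name and 0/1 allele encoding are assumptions.

```python
from itertools import combinations

def admits_perfect_phylogeny(haplotypes):
    """Return True if no pair of SNP columns shows all four gametes.

    `haplotypes` is a list of equal-length tuples of 0/1 alleles
    (an encoding assumed for this sketch).
    """
    if not haplotypes:
        return True
    num_sites = len(haplotypes[0])
    for i, j in combinations(range(num_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            return False  # 00, 01, 10 and 11 all present: conflict
    return True

if __name__ == "__main__":
    compatible = [(0, 0, 1), (0, 1, 1), (1, 0, 0)]
    print(admits_perfect_phylogeny(compatible))
    print(admits_perfect_phylogeny(compatible + [(1, 1, 0)]))
```

This pairwise check takes O(nm^2) time for n haplotypes and m SNPs. It tests a given phasing only; the PPH problem additionally has to choose the phasing.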

    A highly polymorphic insertion in the Y-chromosome amelogenin gene can be used for evolutionary biology, population genetics and sexing in Cetacea and Artiodactyla

    Background: The early radiation of the Cetartiodactyla is complex, and unambiguous molecular characters are needed to clarify the positions of hippopotamuses, camels and pigs relative to the remaining taxa (Cetacea and Ruminantia). There is also a need for informative genealogic markers for Y-chromosome population genetics as well as a sexing method applicable to all species from this group. We therefore studied the sequence variation of a partial sequence of the evolutionarily conserved amelogenin gene to assess its potential use in each of these fields. Results and discussion: We report a large interstitial insertion in the Y amelogenin locus in most of the Cetartiodactyla lineages (cetaceans and ruminants). This sex-linked size polymorphism is the result of a 460–465 bp inserted element in intron 4 of the amelogenin gene of ruminants and cetaceans. Therefore, this polymorphism can easily be used in a sexing assay for these species. When taking into account this shared character in addition to nucleotide sequence, gene genealogy follows sex-chromosome divergence in Cetartiodactyla, whereas it is more congruent with zoological history when ignoring these characters. This could be related to a loss of homology between chromosomal copies given the old age of the insertion. The 1 kbp Amel-Y amplified fragment is also characterized by high nucleotide diversity (64 polymorphic sites spanning over 1 kbp in seven haplotypes), which is greater than for other Y-chromosome sequence markers studied so far but less than for the mitochondrial control region. Conclusion: The gender-dependent polymorphism we have identified is relevant not only for phylogenetic inference within the Cetartiodactyla but also for Y-chromosome based population genetics and gender determination in cetaceans and ruminants. One single protocol can therefore be used for studies in population and evolutionary genetics, reproductive biotechnologies, and forensic science.

    A High Density SNP Array for the Domestic Horse and Extant Perissodactyla: Utility for Association Mapping, Genetic Diversity, and Phylogeny Studies

    An equine SNP genotyping array was developed and evaluated on a panel of samples representing 14 domestic horse breeds and 18 evolutionarily related species. More than 54,000 polymorphic SNPs provided an average inter-SNP spacing of ∼43 kb. The mean minor allele frequency across domestic horse breeds was 0.23, and the number of polymorphic SNPs within breeds ranged from 43,287 to 52,085. Genome-wide linkage disequilibrium (LD) in most breeds declined rapidly over the first 50–100 kb and reached background levels within 1–2 Mb. The extent of LD and the level of inbreeding were highest in the Thoroughbred and lowest in the Mongolian and Quarter Horse. Multidimensional scaling (MDS) analyses demonstrated the tight grouping of individuals within most breeds, close proximity of related breeds, and less tight grouping in admixed breeds. The close relationship between the Przewalski's Horse and the domestic horse was demonstrated by pair-wise genetic distance and MDS. Genotyping of other Perissodactyla (zebras, asses, tapirs, and rhinoceros) was variably successful, with call rates and the number of polymorphic loci varying across taxa. Parsimony analysis placed the modern horse as the sister taxon to Equus przewalski. The utility of the SNP array in genome-wide association was confirmed by mapping the known recessive chestnut coat color locus (MC1R) and defining a conserved haplotype of ∼750 kb across all breeds. These results demonstrate the high quality of this SNP genotyping resource, its usefulness in diverse genome analyses of the horse, and its potential use in related species.

    Computational methods for augmenting association-based gene mapping

    The context and motivation for this thesis is gene mapping, the discovery of genetic variants that affect susceptibility to disease. The goals of gene mapping research include understanding disease mechanisms, evaluating individual disease risks and ultimately developing new medicines and treatments. Traditional genetic association mapping methods test each measured genetic variant independently for association with the disease. One way to improve the power of detecting disease-affecting variants is to base the tests on haplotypes, strings of adjacent variants that are inherited together, instead of individual variants. To enable haplotype analyses in large-scale association studies, this thesis introduces two novel statistical models and gives an efficient algorithm for haplotype reconstruction, jointly called HaploRec. HaploRec is based on modeling local regularities of variable length in the haplotypes of the studied population and using the obtained model to statistically reconstruct the most probable haplotypes for each studied individual. Our experiments demonstrate that HaploRec is especially well suited to data sets with a large number of markers and subjects, such as those typically used in currently popular genome-wide association studies. Public biological databases contain large amounts of data that can help in determining the relevance of putative associations. In this thesis, we introduce Biomine, a database and search engine that integrates data from several such databases under a uniform graph representation. The graph database is used to derive a general proximity measure for biological entities represented as graph nodes, based on a novel scheme of weighting individual graph edges by their informativeness and type. The resulting proximity measure can be used as a basis for various data analysis tasks, such as ranking putative disease genes and visualizing gene relationships. 
Our experiments show that relevant disease genes can be identified from among the putative ones with reasonable accuracy using Biomine. The best accuracy is obtained when a pre-known reference set of disease genes is available, but experiments using a novel clustering-based method demonstrate that putative disease genes can also be ranked without a reference set under suitable conditions. An important complementary use of Biomine is the search and visualization of indirect relationships between graph nodes, which can be used, for example, to characterize the relationship of putative disease genes to already known disease genes. We provide two methods for selecting subgraphs to be visualized: one based on weights of the edges on the paths connecting query nodes, and one based on using context-free grammars to define the types of paths to be displayed. Both of these query interfaces to Biomine are available online.
The subject area of this dissertation is gene mapping, the localization of hereditary variants that affect disease susceptibility. The practical goals of gene mapping are understanding disease mechanisms, evaluating individual disease risks, and developing new medications. This work develops computational methods that can be used to improve the power of existing gene mapping methods and to analyze the preliminary results they produce. Gene mapping methods are based on so-called markers, which are positions in the genome that contain individual variation. The methods typically used measure the association between the disease and the variants occurring at each marker separately, without taking other markers into account. Mapping accuracy can be improved by using haplotypes as the unit of testing instead of individual markers; haplotypes are regular sequences, formed by the variants at nearby markers, that are inherited together. 
Laboratory methods do not directly yield information on how the variants measured from each individual's genome are divided between the haplotypes inherited from that individual's two parents. The first part of this dissertation presents a computational method with which haplotypes can be reconstructed statistically, based on their local regularities. The method is computationally efficient and is particularly suited to genome-wide studies, in which the numbers of both studied individuals and markers are large and the markers lie reasonably far apart. The effects of individual variants on diseases are often relatively weak, and when a large set of markers is tested, the results usually also include, by chance, variants that have no real effect on the disease. Public biological databases contain a great deal of information that can help in assessing the significance of preliminary gene mapping results. The second part of the work introduces Biomine, a database that combines information from a number of such databases using a weighted graph model that describes, among other things, known relationships between genes, proteins, and diseases. A new method is presented for measuring the strength of indirect connections between nodes of the graph. This method can be used, for example, to prioritize candidate genes found by gene mapping, by measuring the strength of the connections between the candidate genes and previously known disease genes, or the strength of the connections among the candidate genes themselves. The work also presents methods for visualizing indirect connections between nodes of the graph database, based on extracting from the database a small subgraph that best describes the connection between the nodes of interest. Two computationally efficient methods are presented for selecting the subgraph: one based on estimating the strength of connections, and the other based on the types of links contained in the connecting paths. 
These visualization methods are also available in a public web service in which queries can be made to the Biomine database.
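One simple way to turn edge weights into a node-to-node proximity score is to take the best path between two nodes, scoring a path by the product of its edge weights. The sketch below does this with Dijkstra's algorithm on negative log weights. It is an illustrative assumption, not Biomine's actual measure or edge-weighting scheme, and the node names and weights in the example are hypothetical.

```python
import heapq
import math

def best_path_proximity(edges, source, target):
    """Largest product of edge weights over any path from source to target.

    `edges` maps an undirected node pair (u, v) to a weight in (0, 1].
    Returns 0.0 when the two nodes are not connected.
    """
    graph = {}
    for (u, v), w in edges.items():
        cost = -math.log(w)  # maximizing a product == minimizing summed costs
        graph.setdefault(u, []).append((v, cost))
        graph.setdefault(v, []).append((u, cost))
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return math.exp(-d)
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for nxt, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, math.inf):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return 0.0

if __name__ == "__main__":
    # Hypothetical nodes and weights, for illustration only.
    edges = {
        ("geneA", "proteinX"): 0.9,
        ("proteinX", "diseaseD"): 0.5,
        ("geneA", "diseaseD"): 0.3,
    }
    print(best_path_proximity(edges, "geneA", "diseaseD"))
```

In this example the indirect route through `proteinX` scores higher than the direct edge. Indirect relationships of this kind are what the abstract proposes to search and visualize.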

    Haplotype-Based Association Studies: Approaches to Current Challenges

    Haplotype-based association studies have greatly aided researchers in their attempts to map genes. However, current designs of haplotype-based association studies lead to several challenges from a statistical perspective. To reduce the number of variants, some researchers have employed hierarchical clustering. This thesis starts by addressing the multiple testing problem that results from applying a hierarchical clustering procedure to haplotypes and then performing a statistical test for association at each of the steps in the resulting hierarchy. Applying our method to a haplotype case-control dataset, we find a global p-value. Relative to the minimum p-value over all steps in the hierarchy, the global p-value is markedly inflated. The second challenge involves the inherent errors present when prediction programs are employed to assign haplotype pairs for each individual in a haplotype-based association study. We examined the effect of these misclassification errors on the false positive rate and power for two association tests: the standard likelihood ratio test (LRTstd) and a likelihood ratio test that allows for the misclassification inherent in the haplotype inference procedure (LRTae). Our simulations indicate that 1) for each statistic, permutation methods maintain the correct type I error; 2) specific multilocus genotypes that are misclassified as the incorrect haplotype pair are consistently misclassified throughout each entire dataset; and 3) a significant power gain exists for the LRTae over the LRTstd for a subset of the parameter settings. The LRTae showed the greatest benefit over the LRTstd when the cost of phenotyping was very high relative to the cost of genotyping. This situation is likely to occur in a replication study as opposed to a whole genome association study. 
The third challenge addressed by this thesis involves the uncertainty regarding the exact distribution of the likelihood ratio test (LRT) statistic for haplotype-based association tests in which many of the haplotype frequency estimates are zero or very small. By simulating datasets with known haplotype frequencies and comparing the empirical distribution with various theoretical distributions, we characterized the distribution of the LRT statistic as a χ² distribution where the degrees of freedom are related to the number of haplotypes with nonzero frequency estimates. Awareness of the potential pitfalls and the strategies to address them will increase the effectiveness of haplotype-based association as a gene-mapping tool.
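A likelihood ratio statistic for a case-control comparison of haplotype frequencies can be sketched as follows. The sketch assumes the haplotype counts are known exactly, so it ignores the phase uncertainty and misclassification that the thesis studies, and it is not the LRTae. The abstract says only that the degrees of freedom are related to the number of haplotypes with nonzero frequency estimates; using that number minus one is an assumption made here.

```python
import math

def haplotype_lrt(case_counts, control_counts):
    """Likelihood ratio statistic comparing haplotype frequencies in two groups.

    Both arguments list haplotype counts in the same haplotype order.
    Returns (statistic, df); df counts only haplotypes that were observed,
    minus one (an assumption of this sketch).
    """
    n_case, n_ctrl = sum(case_counts), sum(control_counts)
    total = n_case + n_ctrl
    stat = 0.0
    observed = 0
    for a, b in zip(case_counts, control_counts):
        pooled = a + b
        if pooled == 0:
            continue  # zero frequency estimate: contributes nothing
        observed += 1
        for count, group_total in ((a, n_case), (b, n_ctrl)):
            if count > 0:
                expected = group_total * pooled / total
                stat += 2.0 * count * math.log(count / expected)
    return stat, max(observed - 1, 0)

if __name__ == "__main__":
    stat, df = haplotype_lrt([30, 10, 0], [10, 30, 0])
    print(stat, df)
```

The statistic would then be compared with a χ² distribution with `df` degrees of freedom. Dropping unobserved haplotypes from `df` reflects the zero-frequency issue the abstract raises.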

    Genetic Characterization of the Pee Dee Cotton Breeding Program

    The history of cotton breeding in the southeastern United States is multifaceted and complex. Public and private breeding programs have driven cotton’s genetic development over the past two centuries. The Pee Dee breeding program in Florence, South Carolina, has had a substantial role in the development of well-adapted cotton cultivars with improved fiber strength, fiber length, and performance in farmers’ fields. Despite the historic importance of the cotton germplasm lines and varieties from the Pee Dee program, little has been done to characterize the population structure and genetic architecture of key traits in this closed breeding program. Here, I first provide an in-depth exploration of the rich history of cotton breeding and genetics over the past century to provide some context for the remainder of this thesis. Then, I discuss the interface of breeding goals, population genetics, and historical implications of a representative sample across 85+ years of cotton breeding in the Pee Dee program. Once the family structure had been evaluated, I applied modern statistical methodology to find gene haplotypes that are associated with improved fiber quality or field performance and attempted to trace the origin of some beneficial alleles. Lastly, I discuss the implications of this work and how it may influence future breeding efforts to utilize the germplasm from this diverse cotton collection.