106 research outputs found

    Efficient Haplotype Block Matching in Bi-Directional PBWT

    Efficient haplotype matching search is of great interest now that large genotyped cohorts are becoming available. The Positional Burrows-Wheeler Transform (PBWT) enables efficient searching for blocks of haplotype matches. However, existing efficient PBWT algorithms sweep across the haplotype panel from left to right, capturing all exact matches. As a result, PBWT does not account for mismatches. It is also not easy to investigate the patterns of changes between the matching blocks. Here, we present an extension to PBWT, called bi-directional PBWT, that makes the information about the blocks of matches available on both sides of each site. We also present a set of algorithms to efficiently merge the matching blocks or examine the patterns of changes on both sides of each site. The time complexity of the algorithms to find and merge matching blocks using bi-directional PBWT is linear in the input size. Using real data from the UK Biobank, we demonstrate the run time and memory efficiency of our algorithms. More importantly, our algorithms can identify more blocks by tolerating mismatches. Moreover, by using mutual information (MI) between the forward and the reverse PBWT matching block sets as a measure of haplotype consistency, we found that the MI derived from European samples in the 1000 Genomes Project is highly correlated (Spearman correlation r=0.87) with the deCODE recombination map
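
    For readers unfamiliar with the underlying structure, the following is a minimal sketch of the standard single-direction PBWT sweep (positional prefix array and divergence array) over a binary panel. It is not the paper's bi-directional data structure or its merging algorithms. The function name `pbwt` and the idea of obtaining a reverse pass by simply running the same routine on reversed haplotypes are assumptions made here for illustration.

```python
def pbwt(panel):
    """Sweep a binary haplotype panel left to right.

    panel: list of equal-length lists of 0/1 alleles (one list per haplotype).
    Returns (a, d) after the last site:
      a[i] = index of the haplotype at rank i when haplotypes are ordered
             by their reversed prefixes;
      d[i] = start position of the longest match, ending at the last site,
             between haplotype a[i] and its predecessor a[i-1].
    """
    m = len(panel)
    n = len(panel[0]) if m else 0
    a = list(range(m))   # initial order: input order
    d = [0] * m          # initial divergence values
    for k in range(n):
        a0, a1, d0, d1 = [], [], [], []
        p = q = k + 1    # match starts for the next 0-allele / 1-allele item
        for i, hap in enumerate(a):
            if d[i] > p:
                p = d[i]
            if d[i] > q:
                q = d[i]
            if panel[hap][k] == 0:
                a0.append(hap)
                d0.append(p)
                p = 0
            else:
                a1.append(hap)
                d1.append(q)
                q = 0
        # Stable partition: haplotypes carrying 0 at site k come first.
        a, d = a0 + a1, d0 + d1
    return a, d


# Illustrative forward and reverse passes on a toy panel (an assumption:
# the reverse pass is just the same sweep on reversed haplotypes).
toy_panel = [[0, 1, 0, 1], [1, 1, 0, 1], [0, 1, 1, 0], [0, 1, 0, 1]]
forward_a, forward_d = pbwt(toy_panel)
reverse_a, reverse_d = pbwt([hap[::-1] for hap in toy_panel])
```

    Each site costs one pass over the haplotypes, so the sweep is linear in the size of the panel, which is the property the abstract relies on.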

    Haplotype Threading Using the Positional Burrows-Wheeler Transform

    In the classic model of population genetics, one haplotype (the query) is considered a mosaic copy of segments from a number of haplotypes in a panel, which amounts to threading the haplotype through the panel. The Li and Stephens model parameterized this problem using a hidden Markov model (HMM). However, HMM algorithms are linear in the sample size and can be very expensive for biobank-scale panels. Here, we formulate the haplotype threading problem as the Minimal Positional Substring Cover problem, where a query is represented by a mosaic of a minimal number of substring matches from the panel. We show that this problem can be solved by a sequential set of greedy set-maximal matches. Moreover, the solution space can be bounded by the left-most and the right-most solutions obtained by the greedy approach. Based on these results, we formulate and solve several variations of this problem. Although our results are yet to be generalized to cases with mismatches, they offer a theoretical framework for designing methods for genotype imputation and haplotype phasing
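
    The greedy idea can be illustrated with a brute-force sketch: starting at the left end of the query, repeatedly take the longest exact positional match against any panel haplotype. This is only an illustration of the left-to-right greedy cover, assuming exact matches and a small panel; it does not use the PBWT and is not the paper's algorithm. The function name `greedy_positional_cover` and the tuple layout are hypothetical.

```python
def greedy_positional_cover(query, panel):
    """Cover `query` with as few exact positional matches as possible.

    query: list of alleles; panel: list of haplotypes of the same length.
    Returns a list of (start, end, haplotype_index) half-open segments,
    or None if some site of the query matches no panel haplotype.
    """
    n = len(query)
    cover = []
    start = 0
    while start < n:
        best_end, best_hap = start, None
        for h, hap in enumerate(panel):
            end = start
            # Extend the match at the same positions in query and haplotype.
            while end < n and hap[end] == query[end]:
                end += 1
            if end > best_end:
                best_end, best_hap = end, h
        if best_hap is None:
            return None  # no haplotype matches the query at this site
        cover.append((start, best_end, best_hap))
        start = best_end
    return cover
```

    Because any sub-interval of a positional match is itself a match, always extending as far right as possible never uses more segments than another cover would. This brute-force version scans every haplotype at every step, so it is meant for intuition only.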

    Genotype imputation using the Positional Burrows Wheeler Transform

    Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years, reference panels have increased in size by more than 100-fold. Increasing reference panel size improves the accuracy of markers with low minor allele frequencies but poses ever-increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method continues to refine the observation made in the IMPUTE2 method, that accuracy is optimized via use of a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best-matching haplotypes and long identical-by-state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ∼65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both of these methods. Using simulated reference panels, we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size, IMPUTE5 is able to utilize a smaller number of reference haplotypes, thus reducing computational cost

    Towards Genetic Identification with Male-specific Mutations

    The identification and application of new genetic variants for differentiating patrilineally related male relatives using Y-chromosomal short tandem repeats (Y-STRs)

    Identification of breed contributions in crossbred dogs

    There has been strong public interest recently in the interrogation of canine ancestries using direct-to-consumer (DTC) genetic ancestry inference tools. Our goal is to improve the accuracy of the associated computational tools by developing superior algorithms for identifying the breed composition of mixed-breed dogs. Genetic test data, based on SNP markers, has been provided by Mars Veterinary. We approach this ancestry inference problem from two main directions. The first approach is optimized for datasets composed of a small number of ancestry informative markers (AIMs). We first compute haplotype frequencies from purebred ancestral panels; these characterize genetic variation within breeds and are used to predict breed compositions. Because of the large number of possible breed combinations in admixed dogs, we approximately sample this search space with a Metropolis-Hastings algorithm. As the proposal density, we either uniformly sample new breeds for the lineage, or we bias the Markov chain so that breeds in the lineage are more likely to be replaced by similar breeds. The second direction we explore is dominated by HMM approaches, which view genotypes as realizations of latent variable sequences corresponding to breeds. In this approach, an admixed canine sample is viewed as a linear combination of segments from dogs in the ancestral panel. Results were evaluated using two different performance measures. Firstly, we looked at a generalization of binary ROC curves to multi-class classification problems. Secondly, to judge breed contribution approximations more accurately, we computed the difference between expected and predicted breed contributions. Experimental results on a synthetic, admixed test dataset using AIMs showed that the MCMC approach successfully predicts breed proportions for a variety of lineage complexities. Furthermore, due to exploration in the MCMC algorithm, true breed contributions are underestimated. The HMM approach performed less well, presumably because it uses less of the information in the dataset
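
    The uniform-proposal variant of the sampler described above can be sketched as follows. This is a toy illustration under stated assumptions, not the thesis's implementation: a lineage is modeled as a fixed number of breed slots, `log_score` is a hypothetical stand-in for the haplotype-frequency likelihood, and the breed names are invented for the example.

```python
import math
import random


def metropolis_hastings(log_score, breeds, n_slots, n_iter, rng):
    """Sample breed lineages with a symmetric, uniform single-slot proposal.

    log_score: hypothetical function mapping a lineage (list of breeds)
               to an unnormalized log-probability.
    Returns the best-scoring lineage visited and its score.
    """
    state = [rng.choice(breeds) for _ in range(n_slots)]
    current = log_score(state)
    best, best_score = list(state), current
    for _ in range(n_iter):
        # Propose: replace the breed in one uniformly chosen slot.
        proposal = list(state)
        proposal[rng.randrange(n_slots)] = rng.choice(breeds)
        candidate = log_score(proposal)
        # Symmetric proposal, so accept with probability
        # min(1, exp(candidate - current)).
        if candidate >= current or rng.random() < math.exp(candidate - current):
            state, current = proposal, candidate
            if current > best_score:
                best, best_score = list(state), current
    return best, best_score


# Toy usage: the score penalizes each slot that disagrees with a made-up
# "true" lineage; a real score would come from haplotype frequencies.
toy_breeds = ["beagle", "poodle", "collie", "boxer"]
toy_truth = ["beagle", "beagle", "poodle", "collie"]


def toy_log_score(lineage):
    return -5.0 * sum(x != y for x, y in zip(lineage, toy_truth))


toy_best, toy_best_score = metropolis_hastings(
    toy_log_score, toy_breeds, n_slots=4, n_iter=2000, rng=random.Random(0)
)
```

    The biased variant mentioned in the abstract would replace the uniform breed choice with a similarity-weighted one; that makes the proposal asymmetric in general, so the acceptance ratio would then need the corresponding Hastings correction.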

    The Laplace transform in population genetics: from theory to efficient algorithms

    Extracting information on the selective and demographic past of populations contained in samples of genome sequences requires a description of the distribution of the underlying genealogies. Using the Laplace transform, this distribution can be generated with a simple recursive procedure, regardless of model complexity. Assuming an infinite-sites mutation model, the probability of observing specific configurations of linked variants within small haplotype blocks can be recovered from the Laplace transform of the joint distribution of branch lengths. However, the repeated differentiation required to compute these probabilities has proven to be a serious computational bottleneck in earlier implementations. In this thesis, I extend existing work on this theoretical framework in three ways. First, I incorporate a description of the impact of hard sweeps on the genealogies of nearby neutral sites. Secondly, the recursive nature of this approach not only makes the theory easily extendable, but also implies the possibility of graph-based algorithms to query the joint distribution of branch lengths. I devise algorithms that drastically reduce the computational cost of deriving mutation configuration probabilities. This work has been implemented in an open-source Python module, agemo. Finally, the efficient library is used to develop a fully fledged demographic inference tool for fitting models of isolation with migration (IM) to genomic data. Fitting these models to smaller chunks of sequence allows us to also infer both background selection and barriers to gene flow. The software is designed to be modular and user-friendly. It facilitates the entire model fitting workflow, from parsing variants to a simulation-based bootstrap on the model estimates

    New insights in the paternal genetic landscape of Southwestern Europe: dissection of haplogroup R1b-M269 and forensic applications

    325 p. The Y chromosome determines male sex and is haploid in nature, escaping recombination; as a result, it is present only in males and is passed virtually unchanged from father to son, establishing paternal lineages. The study of paternal lineages through single nucleotide polymorphisms (SNPs) on the Y chromosome (Y-SNPs) allows the evolutionary history of human paternal lineages to be reconstructed. Genetic markers on the Y chromosome are of two types: the aforementioned Y-SNPs and the microsatellites known as Y-STRs (short tandem repeats, STRs). Studying them reveals which paternal lineages are characteristic of, or present in, each human population, making it possible to distinguish one population from another. For this reason, their study has in recent years become relevant in Forensic Genetics and in related fields such as population genetics, genetic genealogy, and evolutionary genetics. The study of Y-SNPs has revealed that groupings of paternal lineages, also called haplogroups, are distributed in specific geographic areas throughout the world, at both continental and regional scales. In western Europe, the most common haplogroup is R1b-M269. The present-day genetic composition of Europe has been the subject of much controversy centered on the origin of M269, since the age estimates obtained in genetic studies by different authors placed the origin of this paternal lineage both in the Paleolithic and in more recent times, in the Neolithic. The main objective of this study is to reconstruct the most likely evolutionary scenario of the main European paternal lineage, M269, in the Iberian Peninsula and southwestern Europe by dissecting it into its subhaplogroups, which will make it possible to characterize in detail the distribution of the paternal lineages present in the Iberian Peninsula and to infer the role of this region in the evolutionary history of Europe. The results have made it possible to characterize the paternal genetic landscape of southwestern Europe, revealing, on the one hand, that the most likely origin of M269 is eastern Europe and, on the other, that one of the main sublineages of M269, R1b-S116, shows a distribution pattern different from that previously proposed by other authors. In addition, the S116 paragroup, S116*, was almost entirely resolved by the presence of the sublineage R1b-DF27, which proved to be nearly specific to the Iberian Peninsula, making it a marker of potential forensic interest for determining paternal biogeographic ancestry. Age estimates for DF27 indicate that it originated 4,000-4,200 years ago, during the transition between the Neolithic and the Bronze Age, with the northeastern region of the Iberian Peninsula as its most likely place of origin. This study has also produced two multiplex panels of Y-SNPs and Y-STRs for use in forensic and population genetics. In conclusion, this doctoral thesis has provided, on the one hand, new clues about the evolutionary history of the present-day European gene pool and, on the other, two new tools for forensic use in the multiplex analysis of Y-SNPs and Y-STRs