    Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time

    Given a threshold L and a set R = {R_1, ..., R_m} of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1,n] into a set P of disjoint segments such that each segment [a,b] in P has length at least L and the number d(a,b) = |{R_i[a,b] : 1 <= i <= m}| of distinct substrings at segment [a,b] is minimized over [a,b] in P. The distinct substrings in the segments represent founder blocks that can be concatenated to form max{d(a,b) : [a,b] in P} founder sequences representing the original R such that crossovers happen only at segment boundaries. We give an optimal O(mn) time algorithm to solve the problem, improving over the earlier O(mn^2) bound. This improvement makes it possible to apply the algorithm in a pan-genomic setting where the input strings are aligned haplotype sequences of complete human chromosomes, with the goal of finding a representative set of references that can be indexed for read alignment and variant calling. We implemented the new algorithm and give experimental evidence of the practicality of the approach in this pan-genomic setting.
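
    For intuition, the recurrence behind this formulation can be written as opt(j) = min over valid segment starts a of max(opt(a-1), d(a,j)), where opt(j) is the best achievable maximum number of distinct segment substrings over columns 1..j. The sketch below is a deliberately naive dynamic program over that recurrence (the function name and the brute-force computation of d(a,b) are illustrative assumptions); it is not the O(mn) algorithm of the paper.

```python
# Naive illustration of the minimum segmentation recurrence (not the
# paper's linear-time algorithm). opt[j] is the best achievable maximum
# d(a, b) over any segmentation of columns 1..j into segments of length >= L.
def min_segmentation(R, L):
    m, n = len(R), len(R[0])
    INF = float("inf")
    opt = [INF] * (n + 1)
    opt[0] = 0                      # empty prefix needs no segments
    for j in range(L, n + 1):
        # try every legal last segment [a, j] with length j - a + 1 >= L
        for a in range(1, j - L + 2):
            if opt[a - 1] == INF:
                continue
            d = len({r[a - 1:j] for r in R})   # distinct substrings in columns a..j
            opt[j] = min(opt[j], max(opt[a - 1], d))
    return opt[n] if opt[n] < INF else None    # None: no valid segmentation (n < L)
```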

    Linear time minimum segmentation enables scalable founder reconstruction

    Background: We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold L and a set R = {R_1, ..., R_m} of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1,n] into a set P of disjoint segments such that each segment [a,b] in P has length at least L and the number d(a,b) = |{R_i[a,b] : 1 <= i <= m}| of distinct substrings at segment [a,b] is minimized over [a,b] in P. The distinct substrings in the segments represent founder blocks that can be concatenated to form max{d(a,b) : [a,b] in P} founder sequences representing the original R such that crossovers happen only at segment boundaries. Results: We give an O(mn) time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier O(mn^2) algorithm. Conclusions: Our improvement makes it possible to apply the formulation to an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence of its practicality. The implementation is available at https://github.com/tsnorri/founder-sequences.
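
    To make the role of the founder blocks concrete, the hedged sketch below concatenates one distinct block per segment of a given segmentation P into max{d(a,b)} founder sequences. The function name and the cyclic reuse of blocks are illustrative assumptions, not the paper's construction, which additionally chooses how blocks are joined across segment boundaries.

```python
# Illustrative founder assembly from a given segmentation P of 1-based
# segments [a, b] over aligned haplotypes R. The paper's construction also
# decides how blocks are matched across segment boundaries; here the join
# order is arbitrary.
def build_founders(R, P):
    blocks_per_segment = [sorted({r[a - 1:b] for r in R}) for a, b in P]
    k = max(len(blocks) for blocks in blocks_per_segment)   # number of founders
    return [
        "".join(blocks[i % len(blocks)] for blocks in blocks_per_segment)
        for i in range(k)
    ]
```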

    VariScan Software

    Linux / Mac OS X: The package includes executables for Linux (variscan) and Mac OS X (variscan). For other Unix-based platforms you will have to compile it from the source files included in the VariScan package. Windows: The package includes the source code (src directory), together with project (variscan.dev) and makefile (variscan.win) files to be used, for instance, with Dev-C++ (a free integrated development environment for the C/C++ programming language). The related article is available at http://hdl.handle.net/2445/7384 and the software development page at http://www.ub.edu/softevol/. VariScan is a software package for the analysis of DNA sequence polymorphisms at the whole-genome scale. Among other features, the software: (1) can conduct many population genetic analyses; (2) incorporates a multiresolution wavelet transform-based method that allows capturing relevant information from DNA polymorphism data; and (3) facilitates the visualization of the results in the most commonly used genome browsers.

    Genome-wide DNA polymorphism analyses using VariScan

    BACKGROUND: DNA sequence polymorphism analysis can provide valuable information on the evolutionary forces shaping nucleotide variation, and an insight into the functional significance of genomic regions. Recent and ongoing genome projects will radically improve our capabilities to detect specific genomic regions shaped by natural selection. Currently available methods and software, however, are unsatisfactory for such genome-wide analysis. RESULTS: We have developed methods for the analysis of DNA sequence polymorphisms at the genome-wide scale. These methods, which have been tested on coalescent-simulated and actual data files from mouse and human, have been implemented in the VariScan software package version 2.0. Additionally, we have incorporated a graphical user interface. The main features of this software are: i) exhaustive population-genetic analyses, including those based on coalescent theory; ii) analysis adapted to the shallow data generated by high-throughput genome projects; iii) use of genome annotations to conduct comprehensive analyses separately for different functional regions; iv) identification of relevant genomic regions by sliding-window and wavelet-multiresolution approaches; v) visualization of the results, integrated with current genome annotations, in commonly available genome browsers. CONCLUSION: VariScan is a powerful and flexible suite of software for the analysis of DNA polymorphisms. The current version implements new algorithms, methods, and capabilities, providing an important tool for an exhaustive exploratory analysis of genome-wide DNA polymorphism data.
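
    As a rough illustration of the sliding-window scans mentioned above (this is not VariScan code; the function name and window parameters are assumptions for the example), nucleotide diversity (pi) can be computed per window as the average number of pairwise differences per site:

```python
from itertools import combinations

# Illustrative sliding-window nucleotide diversity (pi): average pairwise
# differences per site within each window of aligned, equal-length sequences.
def sliding_window_pi(seqs, window, step):
    n = len(seqs[0])
    n_pairs = len(seqs) * (len(seqs) - 1) / 2
    for start in range(0, n - window + 1, step):
        diffs = 0
        for s1, s2 in combinations(seqs, 2):
            diffs += sum(a != b for a, b in zip(s1[start:start + window],
                                                s2[start:start + window]))
        yield start, diffs / (n_pairs * window)
```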

    Applying the Positional Burrows–Wheeler Transform to All-Pairs Hamming distance

    Crochemore et al. gave in WABI 2017 an algorithm that, from a set of input strings, finds all pairs of strings that have Hamming distance at most a given threshold. The proposed algorithm first finds all long enough exact matches between the strings, and sorts these into pairs whose coordinates also match. Then the remaining pairs are verified for the Hamming distance threshold. The algorithm was shown to work in average linear time, under some constraints and assumptions. We show that one can use the Positional Burrows-Wheeler Transform (PBWT) by Durbin (Bioinformatics, 2014) to directly find all exact matches whose coordinates also match. The same structure also extends to verifying the pairs for the Hamming distance threshold. The same analysis as for the algorithm of Crochemore et al. applies. As a side result, we show how to extend the PBWT to non-binary alphabets. The new operations provided by the PBWT find other applications in tasks similar to those considered here. (C) 2019 The Authors. Published by Elsevier B.V.
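
    For orientation on the underlying data structure, the sketch below builds the positional prefix arrays of Durbin's PBWT for a binary haplotype panel. This is a minimal textbook rendering, not the extension or matching machinery of the paper above; the divergence arrays and match reporting are omitted.

```python
# Positional prefix arrays of the PBWT (binary alphabet): a[k] lists the
# haplotype indices sorted by their reversed prefixes X[i][:k]. Each site
# is one stable partition pass, so the whole build is O(M * N) for M
# haplotypes of length N.
def pbwt_prefix_arrays(X):
    M, N = len(X), len(X[0])
    a = list(range(M))              # a_0: original input order
    arrays = [a[:]]
    for k in range(N):
        zeros, ones = [], []
        for i in a:                 # stable partition by the allele at site k
            (zeros if int(X[i][k]) == 0 else ones).append(i)
        a = zeros + ones
        arrays.append(a[:])
    return arrays
```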

    The Pattern of Polymorphism on Human Chromosome 21

    Polymorphism data from 20 partially resequenced copies of human chromosome 21 (more than 20,000 polymorphic sites) were analyzed. The allele-frequency distribution shows no deviation from the simplest population genetic model with a constant population size (although we show that our analysis has no power to detect population growth). The average rate of recombination per site is estimated to be roughly one-half of the rate of mutation per site, again in agreement with simple model predictions. However, sliding-window analyses of the amount of polymorphism and the extent of linkage disequilibrium (LD) show significant deviations from standard models. This could be due to the history of selection or demographic change, but it is impossible to draw strong conclusions without much better knowledge of variation in the relationship between genetic and physical distance along the chromosome.

    How many came home? Evaluating ex‐situ conservation of green turtles in the Cayman Islands

    Ex-situ management is an important conservation tool that allows the preservation of biological diversity outside natural habitats while supporting survival in the wild. Captive breeding followed by reintroduction is a possible approach for endangered species conservation and preservation of genetic variability. The Cayman Turtle Centre Ltd was established in 1968 to market green turtle (Chelonia mydas) meat and other products and to replenish wild populations, thought to be locally extirpated, through captive breeding. We evaluated the effects of this reintroduction program using molecular markers (13 microsatellites, 800 bp D-loop and STR mtDNA sequences) from captive breeders (N=257) and wild nesting females (N=57) (sampling period: 2013-2015). We divided the captive breeders into three groups: founders (from the original stock), and two subdivisions of F1 individuals corresponding to two different management strategies, cohort 1995 ("C1995") and multicohort F1 ("MCF1"). Loss of genetic variability and increased relatedness were observed in the captive stock over time. We found no significant differences in diversity among captive and wild groups, and similar or higher levels of haplotype variability when compared to other natural populations. Using parentage and sibship assignment, we determined that 90% of the wild individuals were related to the captive stock. Our results suggest a strong impact of the reintroduction program on the present recovery of the wild green turtle population nesting in the Cayman Islands. Moreover, genetic relatedness analyses of captive populations are necessary to improve future management actions, maintain genetic diversity in the long term, and avoid inbreeding depression.

    Assemblathon 1: A competitive assessment of de novo short read assembly methods

    Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort, teams were asked to assemble a simulated Illumina HiSeq data set of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype-aware assessments of coverage, contiguity, structure, base calling, and copy number were made. We establish that within this benchmark: (1) it is possible to assemble the genome to a high level of coverage and accuracy, and (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies, and the code that was used to evaluate the assemblies, is now public and freely available from http://www.assemblathon.org/.
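
    As a small concrete example of a contiguity statistic of the kind used in such assessments (N50 is not named in the abstract, so treat this as an illustrative assumption rather than the paper's exact metric set):

```python
# N50: the length of the shortest contig such that contigs at least that
# long cover half or more of the total assembly length.
def n50(contig_lengths):
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Example: n50([100, 80, 60, 40, 20]) == 80
```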