Indexing Finite Language Representation of Population Genotypes
With the recent advances in DNA sequencing, it is now possible to have
complete genomes of individuals sequenced and assembled. This rich and focused
genotype information can be used for population-wide studies, now for the
first time directly at the whole-genome level. We propose a way to index population
genotype information together with the complete genome sequence, so that one
can use the index to efficiently align a given sequence to the genome with all
plausible genotype recombinations taken into account. This is achieved through
converting a multiple alignment of individual genomes into a finite automaton
recognizing all strings that can be read from the alignment by switching the
sequence at any time. The finite automaton is indexed with an extension of
the Burrows-Wheeler transform to allow pattern search inside the plausible
recombinant sequences. The size of the index stays limited because of the high
similarity of individual genomes. The index finds applications in variation
calling and in primer design. In a variation calling experiment, we found about
1.0% of matches to novel recombinants just with exact matching, and up to 2.4%
with approximate matching.
Comment: This is the full version of the paper that was presented at WABI
2011. The implementation is available at
http://www.cs.helsinki.fi/group/suds/gcsa
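The automaton construction described in this abstract can be illustrated with a minimal Python sketch. It takes the literal reading of "switching the sequence at any time": after any column, a path may continue in any row of the alignment. The function names `build_automaton` and `occurs`, the gap symbol `-`, and the toy alignment are assumptions made for illustration and are not taken from the GCSA implementation. Matching is done by naive traversal of the automaton, not with the Burrows-Wheeler-based index.

```python
def build_automaton(rows, gap="-"):
    """Build (nodes, edges) from an alignment given as equal-length strings.

    Assumption: one node per distinct non-gap character in each column,
    and after any column a path may continue in any row.
    """
    width = len(rows[0])
    assert all(len(r) == width for r in rows), "rows must have equal length"

    # A node is a (column, character) pair.
    nodes = {(i, r[i]) for r in rows for i in range(width) if r[i] != gap}
    edges = {n: set() for n in nodes}

    for r in rows:
        # succ_after[i] = first non-gap node of row r strictly after column i.
        succ_after = [None] * width
        nxt = None
        for i in range(width - 1, -1, -1):
            succ_after[i] = nxt
            if r[i] != gap:
                nxt = (i, r[i])
        # Every node in column i may switch to row r and continue there.
        for (i, c) in nodes:
            if succ_after[i] is not None:
                edges[(i, c)].add(succ_after[i])
    return nodes, edges


def occurs(pattern, nodes, edges):
    """Return True if some path in the automaton spells `pattern`."""
    if not pattern:
        return True
    frontier = {n for n in nodes if n[1] == pattern[0]}
    for ch in pattern[1:]:
        frontier = {m for n in frontier for m in edges[n] if m[1] == ch}
        if not frontier:
            return False
    return True


# Toy alignment of two hypothetical individuals.
nodes, edges = build_automaton(["ACGT", "AGGA"])
print(occurs("CGA", nodes, edges))   # recombinant: row 1 for "CG", then row 2
print(occurs("ACCT", nodes, edges))  # not readable from the alignment
```

In this toy case, "CGA" appears in neither input row but is accepted as a recombinant, whereas "ACCT" is rejected because no row has C in the third column.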
Towards Better Understanding of Artifacts in Variant Calling from High-Coverage Samples
Motivation: Whole-genome high-coverage sequencing has been widely used for
personal and cancer genomics as well as in various research areas. However, in
the absence of an unbiased whole-genome truth set, the global error rate of
variant calls and the leading causal artifacts remain unclear, despite
considerable effort in evaluating variant calling methods.
Results: We made ten SNP and INDEL call sets with two read mappers and five
variant callers, on both a haploid human genome and a diploid genome at a
similar coverage. By investigating false heterozygous calls in the haploid
genome, we identified erroneous realignment in low-complexity regions and
the incompleteness of the reference genome with respect to the sample as the
two major sources of errors, which call for continued improvements in these two areas.
We estimated that the error rate of raw genotype calls is as high as 1 in
10-15kb, but the error rate of post-filtered calls is reduced to 1 in 100-200kb
without significantly compromising sensitivity.
Availability: BWA-MEM alignment: http://bit.ly/1g8XqRt; Scripts:
https://github.com/lh3/varcmp; Additional data:
https://figshare.com/articles/Towards_better_understanding_of_artifacts_in_variating_calling_from_high_coverage_samples/981073
Comment: Published versio
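To put the reported error rates in perspective, the following Python sketch converts "1 error in N kb" into an expected number of erroneous calls per genome. The genome length of 3,000,000,000 bp is a round figure assumed here for illustration and does not come from the abstract; the function name `expected_errors` is likewise hypothetical.

```python
# Assumption: round figure for a human genome length, not stated in the abstract.
GENOME_BP = 3_000_000_000


def expected_errors(kb_per_error, genome_bp=GENOME_BP):
    """Expected erroneous calls at a rate of one error per `kb_per_error` kb."""
    return genome_bp / (kb_per_error * 1000)


# Rates quoted in the abstract: raw 1 in 10-15kb, post-filtered 1 in 100-200kb.
rates = {"raw": (10, 15), "post-filtered": (100, 200)}
for label, (low_kb, high_kb) in rates.items():
    low = round(expected_errors(high_kb))
    high = round(expected_errors(low_kb))
    print(f"{label}: about {low} to {high} erroneous calls")
```

Under that assumed genome length, the raw rate corresponds to roughly 200,000 to 300,000 erroneous calls and the post-filtered rate to roughly 15,000 to 30,000, about a tenfold reduction.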