A method for finding single-nucleotide polymorphisms with allele frequencies in sequences of deep coverage
BACKGROUND: The allele frequencies of single-nucleotide polymorphisms (SNPs) are needed to select an optimal subset of common SNPs for use in association studies. Sequence-based methods for finding SNPs with allele frequencies may need to handle thousands of sequences from the same genome location (sequences of deep coverage). RESULTS: We describe a computational method for finding common SNPs with allele frequencies in single-pass sequences of deep coverage. The method enhances a widely used program named PolyBayes in several aspects. We present results from our method and PolyBayes on eighteen data sets of human expressed sequence tags (ESTs) with deep coverage. The results indicate that our method used almost all single-pass sequences in computation of the allele frequencies of SNPs. CONCLUSION: The new method is able to handle single-pass sequences of deep coverage efficiently. Our work shows that it is possible to analyze sequences of deep coverage by using pairwise alignments of the sequences with the finished genome sequence, instead of multiple sequence alignments
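The core idea of the abstract above, estimating allele frequencies from pairwise alignments to a finished reference rather than from a multiple sequence alignment, can be illustrated with a minimal sketch. This is not the authors' method or PolyBayes: it assumes gapless alignments given as hypothetical `(start, sequence)` pairs, ignores base quality and sequencing error, and the function name and `min_depth` parameter are invented for illustration.

```python
from collections import Counter


def allele_frequencies(reference, aligned_reads, min_depth=2):
    """Per-position allele frequencies from reads aligned to a reference.

    aligned_reads is a list of (start, sequence) pairs: each read is assumed
    to align without gaps, beginning at 0-based reference position `start`.
    Only positions with at least `min_depth` reads and more than one observed
    allele are reported.
    """
    # One counter per reference position; each read is tallied independently,
    # so no multiple sequence alignment is ever built.
    columns = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                columns[pos][base] += 1

    result = {}
    for pos, counts in enumerate(columns):
        depth = sum(counts.values())
        if depth < min_depth or len(counts) < 2:
            continue  # too shallow, or no variation observed
        result[pos] = {base: n / depth for base, n in counts.items()}
    return result


# Toy data (invented): four reads over an 8-base reference.
reference = "ACGTACGT"
reads = [(0, "ACGTA"), (2, "GTACGT"), (1, "CGAACG"), (3, "AACGT")]
freqs = allele_frequencies(reference, reads)
print(freqs)
```

Because every read contributes to the per-position counters on its own, the work grows linearly with the number of reads, which is what makes deep coverage tractable in this simplified setting.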
Worldwide genetic variation of the IGHV and TRBV immune receptor gene families in humans.
The immunoglobulin heavy variable (IGHV) and T cell beta variable (TRBV) loci are among the most complex and variable regions in the human genome. Generated through a process of gene duplication/deletion and diversification, these loci can vary extensively between individuals in copy number and contain genes that are highly similar, making their analysis technically challenging. Here, we present a comprehensive study of the functional gene segments in the IGHV and TRBV loci, quantifying their copy number and single-nucleotide variation in a globally diverse sample of 109 (IGHV) and 286 (TRBV) humans from over 100 populations. We find that the IGHV and TRBV gene families exhibit starkly different patterns of variation. In addition to providing insight into the different evolutionary paths of the IGHV and TRBV loci, our results are also important to the adaptive immune repertoire sequencing community, where the lack of frequencies of common alleles and copy number variants is hampering existing analytical pipelines.
Systematic genetic analysis of the MHC region reveals mechanistic underpinnings of HLA type associations with disease.
The MHC region is highly associated with autoimmune and infectious diseases. Here we conduct an in-depth interrogation of associations between genetic variation, gene expression and disease. We create a comprehensive map of regulatory variation in the MHC region using WGS from 419 individuals to call eight-digit HLA types and RNA-seq data from matched iPSCs. Building on this regulatory map, we explored GWAS signals for 4083 traits, detecting colocalization for 180 disease loci with eQTLs. We show that eQTL analyses taking HLA type haplotypes into account have substantially greater power compared with only using single variants. We examined the association between the 8.1 ancestral haplotype and delayed colonization in Cystic Fibrosis, postulating that downregulation of RNF5 expression is the likely causal mechanism. Our study provides insights into the genetic architecture of the MHC region and pinpoints disease associations that are due to differential expression of HLA genes and non-HLA genes
Inferring Genomic Sequences
Recent advances in next generation sequencing have provided unprecedented opportunities for high-throughput genomic research, inexpensively producing millions of genomic sequences in a single run. Analysis of massive volumes of data results in a more accurate picture of the genome complexity and requires adequate bioinformatics support. We explore computational challenges of applying next generation sequencing to particular applications, focusing on the problem of reconstructing the viral quasispecies spectrum from pyrosequencing shotgun reads and the problem of inferring informative single nucleotide polymorphisms (SNPs) that statistically cover the genetic variation of a genome region in genome-wide association studies.
The genomic diversity of viral quasispecies is a subject of great interest, particularly for chronic infections, since it can lead to resistance to existing therapies. High-throughput sequencing is a promising approach to characterizing viral diversity, but standard assembly software cannot be used to simultaneously assemble and estimate the abundance of multiple closely related (but non-identical) quasispecies sequences. Here, we introduce a new Viral Spectrum Assembler (ViSpA) for inferring the quasispecies spectrum and compare it with the state-of-the-art ShoRAH tool on both synthetic and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. While ShoRAH has an advanced error correction algorithm, ViSpA is better at quasispecies assembly, producing a more accurate reconstruction of a viral population. We also foresee applying ViSpA to the analysis of high-throughput sequencing data from bacterial metagenomic samples and ecological samples of eukaryote populations.
Due to the large data volume in genome-wide association studies, it is desirable to find a small subset of SNPs (tags) that covers the genetic variation of the entire set. We explore the trade-off between the number of tags used per non-tagged SNP and possible overfitting and propose an efficient 2LR-Tagging heuristic
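To make the tagging problem concrete, here is a minimal sketch of a common greedy baseline that picks tag SNPs by pairwise r². It is explicitly not the 2LR-Tagging heuristic mentioned above, whose details are not given in this abstract; the function names, the 0.8 threshold, and the toy haplotypes are all assumptions for illustration.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two 0/1 allele vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a monomorphic SNP carries no correlation signal
    return cov * cov / (vx * vy)


def greedy_tags(haplotypes, threshold=0.8):
    """Greedily choose tag SNPs until every SNP is covered.

    haplotypes is a list of rows, one per haplotype, with 0/1 alleles per SNP.
    A SNP is covered by a tag if it is the tag itself or their r^2 is at
    least `threshold`.
    """
    n_snps = len(haplotypes[0])
    columns = [[h[j] for h in haplotypes] for j in range(n_snps)]
    covers = {
        j: {
            k
            for k in range(n_snps)
            if k == j or r_squared(columns[j], columns[k]) >= threshold
        }
        for j in range(n_snps)
    }
    uncovered = set(range(n_snps))
    tags = []
    while uncovered:
        # Pick the SNP covering the most still-uncovered SNPs;
        # ties go to the lowest index.
        best = max(range(n_snps), key=lambda j: (len(covers[j] & uncovered), -j))
        tags.append(best)
        uncovered -= covers[best]
    return tags


# Toy data (invented): SNPs 0, 1 and 2 are in perfect LD; SNP 3 is independent.
haplotypes = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
]
tags = greedy_tags(haplotypes)
print(tags)
```

This baseline uses exactly one tag per non-tagged SNP. The trade-off raised in the abstract concerns allowing more tags per predicted SNP, which can reduce the tag count but increases the risk of overfitting.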
Single-nucleotide polymorphism discovery by high-throughput sequencing in sorghum
BACKGROUND: Eight diverse sorghum (Sorghum bicolor L. Moench) accessions were subjected to short-read genome sequencing to characterize the distribution of single-nucleotide polymorphisms (SNPs). Two strategies were used for DNA library preparation. Missing SNP genotype data were imputed by local haplotype comparison. The effect of library type and genomic diversity on SNP discovery and imputation are evaluated. RESULTS: Alignment of eight genome equivalents (6 Gb) to the public reference genome revealed 283,000 SNPs at ≥82% confirmation probability. Sequencing from libraries constructed to limit sequencing to start at defined restriction sites led to genotyping 10-fold more SNPs in all 8 accessions, and correctly imputing 11% more missing data, than from semirandom libraries. The SNP yield advantage of the reduced-representation method was less than expected, since up to one fifth of reads started at noncanonical restriction sites and up to one third of restriction sites predicted in silico to yield unique alignments were not sampled at near-saturation. For imputation accuracy, the availability of a genomically similar accession in the germplasm panel was more important than panel size or sequencing coverage. CONCLUSIONS: A sequence quantity of 3 million 50-base reads per accession using a BsrFI library would conservatively provide satisfactory genotyping of 96,000 sorghum SNPs. For most reliable SNP-genotype imputation in shallowly sequenced genomes, germplasm panels should consist of pairs or groups of genomically similar entries. These results may help in designing strategies for economical genotyping-by-sequencing of large numbers of plant accessions.
Characterization of duplicate gene evolution in the recent natural allopolyploid Tragopogon miscellus by next-generation sequencing and Sequenom iPLEX MassARRAY genotyping
The definitive version is available at www.blackwell-synergy.co
Computational Tools for Immune Repertoire Characterization and Primer Set Design
The enormous decrease in the cost of genomic sequencing over the past two decades has enabled researchers to revisit previously unaddressable questions in sequence analysis. However, this boom of genomic information has introduced new sets of problems that often demand computationally efficient methods. In this work, we describe computational tools for two such settings involving large-scale genomic data: 1) estimating copy number and allelic variation in two highly complex gene families, and 2) selective sequencing of a target genome in a complex DNA sample.
We first describe a method that takes short reads from high-throughput sequencing and characterizes both copy number and allelic variation in the IGHV and TRBV loci. These two loci can vary extensively between individuals in copy number and contain genes that are highly similar, making their analysis technically challenging. Additionally, we have conducted the first study of a globally diverse sample of hundreds of individuals in these two loci from over a hundred populations. In addition to providing insight into the different evolutionary paths of the IGHV and TRBV loci, our results are also important to the adaptive immune repertoire sequencing community, where the lack of frequencies of common alleles and copy number variants is hampering existing analytical pipelines.
In our second problem setting, we describe SOAPswga, an optimized and parallelized pipeline for primer design in the context of selective amplification. Unlike previous heuristic-based methods, SOAPswga uses machine learning methods to evaluate both individual primers and primer sets. Additionally, rather than brute force search for primer sets, such as in predecessor methods, SOAPswga uses branch-and-bound principles to pursue only the most promising sets. These optimizations, including the parallelization of each step, allow for a huge decrease in runtime from the order of weeks to minutes.
We also discuss the results of our pipeline applied to the selective amplification of Mycobacterium tuberculosis in a sample of human blood. Lastly, we expand on the importance of this work, and in general, its potential usefulness to any setting consisting of targeted sequencing
Heterogeneity of Human Neutrophil CD177 Expression Results from CD177P1 Pseudogene Conversion
Most humans harbor both CD177neg and CD177pos neutrophils but 1–10% of people are CD177null, placing them at risk for formation of anti-neutrophil antibodies that can cause transfusion-related acute lung injury and neonatal alloimmune neutropenia. By deep sequencing the CD177 locus, we catalogued CD177 single nucleotide variants and identified a novel stop codon in CD177null individuals arising from a single base substitution in exon 7. This is not a mutation in CD177 itself; rather, the CD177null phenotype arises when exon 7 of CD177 is supplied entirely by the CD177 pseudogene (CD177P1), which appears to have resulted from allelic gene conversion. In CD177-expressing individuals the CD177 locus contains both CD177P1 and CD177 sequences. The proportion of CD177hi neutrophils in the blood is a heritable trait. Abundance of CD177hi neutrophils correlates with homozygosity for the CD177 reference allele, while heterozygosity for ectopic CD177P1 gene conversion correlates with increased CD177neg neutrophils, in which both the allele partially incorporating CD177P1 and the paired intact CD177 allele are transcribed. Human neutrophil heterogeneity for CD177 expression arises by ectopic allelic conversion. Resolution of the genetic basis of the CD177null phenotype identifies a method for screening for individuals at risk of CD177 isoimmunisation.