Reference Based Genome Compression
DNA sequencing technology has advanced to a point where storage is becoming
the central bottleneck in the acquisition and mining of more data. Large
amounts of data are vital for genomics research, and generic compression tools,
while viable, cannot offer the same savings as approaches tuned to inherent
biological properties. We propose an algorithm to compress a target genome
given a known reference genome. The proposed algorithm first generates a
mapping from the reference to the target genome, and then compresses this
mapping with an entropy coder. As an illustration of the performance: applying
our algorithm to James Watson's genome with hg18 as a reference, we are able to
reduce the 2991 megabyte (MB) genome down to 6.99 MB, while Gzip compresses it
to 834.8 MB. Comment: 5 pages; Submitted to the IEEE Information Theory Workshop (ITW) 201
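The two-stage idea in this abstract (build a reference-to-target mapping, then compress the mapping) can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes equal-length sequences that differ only by substitutions, and it uses `zlib` (DEFLATE, which combines LZ77 with Huffman coding) as a stand-in for the entropy coder. All function names are hypothetical.

```python
import zlib


def build_mapping(reference: str, target: str) -> list[tuple[int, str]]:
    """Record (position, base) wherever the target differs from the reference.

    Assumption: equal lengths and substitutions only (no insertions/deletions).
    """
    if len(reference) != len(target):
        raise ValueError("this sketch handles equal-length sequences only")
    return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]


def encode_mapping(mapping: list[tuple[int, str]]) -> bytes:
    """Delta-encode positions, serialize as text, and compress with zlib."""
    parts = []
    prev = 0
    for pos, base in mapping:
        parts.append(f"{pos - prev}{base}")  # gap to previous difference + base
        prev = pos
    return zlib.compress(",".join(parts).encode("ascii"), 9)


def decode_mapping(blob: bytes) -> list[tuple[int, str]]:
    """Invert encode_mapping."""
    text = zlib.decompress(blob).decode("ascii")
    mapping = []
    prev = 0
    if text:
        for token in text.split(","):
            pos = prev + int(token[:-1])
            mapping.append((pos, token[-1]))
            prev = pos
    return mapping


def reconstruct(reference: str, mapping: list[tuple[int, str]]) -> str:
    """Apply the substitutions to the reference to recover the target."""
    seq = list(reference)
    for pos, base in mapping:
        seq[pos] = base
    return "".join(seq)


# Compression ratios implied by the sizes quoted in the abstract.
ratio_proposed = 2991 / 6.99   # reference-based method
ratio_gzip = 2991 / 834.8      # Gzip
```

The sizes quoted in the abstract correspond to roughly a 428-fold reduction for the proposed method versus about 3.6-fold for Gzip.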
A Novel Genome-Wide Association Study Approach Using Genotyping by Exome Sequencing Leads to the Identification of a Primary Open Angle Glaucoma Associated Inversion Disrupting ADAMTS17
Closed breeding populations in the dog in conjunction with advances in gene mapping and sequencing techniques facilitate mapping of autosomal recessive diseases and identification of novel disease-causing variants, often using unorthodox experimental designs. In our investigation we demonstrate successful mapping of the locus for primary open angle glaucoma in the Petit Basset Griffon Vendéen dog breed with 12 cases and 12 controls, using a novel genotyping by exome sequencing approach. The resulting genome-wide association signal was followed up by genome sequencing of an individual case, leading to the identification of an inversion with a breakpoint disrupting the ADAMTS17 gene. Genotyping of additional controls and expression analysis provide strong evidence that the inversion is disease causing. Evidence of cryptic splicing resulting in novel exon transcription as a consequence of the inversion in ADAMTS17 is identified through RNAseq experiments. This investigation demonstrates how a novel genotyping by exome sequencing approach can be used to map an autosomal recessive disorder in the dog, with the use of genome sequencing to facilitate identification of a disease-associated variant
Distribution of label spacings for genome mapping in nanochannels
In genome mapping experiments, long DNA molecules are stretched by confining
them to very narrow channels, so that the locations of sequence-specific
fluorescent labels along the channel axis provide large-scale genomic
information. It is difficult, however, to make the channels narrow enough so
that the DNA molecule is fully stretched. In practice its conformations may
form hairpins that change the spacings between internal segments of the DNA
molecule, and thus the label locations along the channel axis. Here we describe
a theory for the distribution of label spacings that explains the heavy tails
observed in distributions of label spacings in genome mapping experiments. Comment: 18 pages, 4 figures, 1 table
Single-molecule real-time sequencing combined with optical mapping yields completely finished fungal genome
Next-generation sequencing (NGS) technologies have increased the scalability, speed, and resolution of genomic sequencing and, thus, have revolutionized genomic studies. However, eukaryotic genome sequencing initiatives typically yield considerably fragmented genome assemblies. Here, we assessed various state-of-the-art sequencing and assembly strategies in order to produce a contiguous and complete eukaryotic genome assembly, focusing on the filamentous fungus Verticillium dahliae. Compared with Illumina-based assemblies of the V. dahliae genome, hybrid assemblies that also include PacBio-generated long reads establish superior contiguity. Intriguingly, provided that sufficient sequence depth is reached, assemblies solely based on PacBio reads outperform hybrid assemblies and even result in fully assembled chromosomes. Furthermore, the addition of optical map data allowed us to produce a gapless and complete V. dahliae genome assembly of the expected eight chromosomes from telomere to telomere. Consequently, we can now study genomic regions that were previously not assembled or poorly assembled, including regions that are populated by repetitive sequences, such as transposons, allowing us to fully appreciate an organism’s biological complexity. Our data show that a combination of PacBio-generated long reads and optical mapping can be used to generate complete and gapless assemblies of fungal genomes. IMPORTANCE Studying whole-genome sequences has become an important aspect of biological research. The advent of next-generation sequencing (NGS) technologies has now brought genomic science within reach of most research laboratories, including those that study nonmodel organisms. However, most genome sequencing initiatives typically yield (highly) fragmented genome assemblies. Nevertheless, considerable relevant information related to genome structure and evolution is likely hidden in those nonassembled regions.
Here, we investigated a diverse set of strategies to obtain gapless genome assemblies, using the genome of a typical ascomycete fungus as the template. Eventually, we were able to show that a combination of PacBio-generated long reads and optical mapping yields a gapless telomere-to-telomere genome assembly, allowing in-depth genome analyses to facilitate functional studies into an organism’s biology
Genome maps across 26 human populations reveal population-specific patterns of structural variation.
Large structural variants (SVs) in the human genome are difficult to detect and study by conventional sequencing technologies. With long-range genome analysis platforms, such as optical mapping, one can identify large SVs (>2 kb) across the genome in one experiment. Analyzing optical genome maps of 154 individuals from the 26 populations sequenced in the 1000 Genomes Project, we find that phylogenetic population patterns of large SVs are similar to those of single nucleotide variations in 86% of the human genome, while ~2% of the genome has high structural complexity. We are able to characterize SVs in many intractable regions of the genome, including segmental duplications and subtelomeric, pericentromeric, and acrocentric areas. In addition, we discover ~60 Mb of non-redundant genome content missing in the reference genome sequence assembly. Our results highlight the need for a comprehensive set of alternate haplotypes from different populations to represent SV patterns in the genome
Simultaneous mapping of multiple gene loci with pooled segregants
The analysis of polygenic, phenotypic characteristics such as quantitative traits or inheritable diseases remains an important challenge. It requires reliable scoring of many genetic markers covering the entire genome. The advent of high-throughput sequencing technologies provides a new way to evaluate large numbers of single nucleotide polymorphisms (SNPs) as genetic markers. Combining the technologies with pooling of segregants, as performed in bulked segregant analysis (BSA), should, in principle, allow the simultaneous mapping of multiple genetic loci present throughout the genome. The gene mapping process applied here consists of three steps: First, a controlled crossing of parents with and without a trait. Second, selection based on phenotypic screening of the offspring, followed by the mapping of short offspring sequences against the parental reference. The final step aims at detecting genetic markers such as SNPs, insertions and deletions with next generation sequencing (NGS). Markers in close proximity to genomic loci that are associated with the trait have a higher probability of being inherited together. Hence, these markers are very useful for discovering the loci and the genetic mechanism underlying the characteristic of interest. Within this context, NGS produces binomial counts along the genome, i.e., the number of sequenced reads that match the SNP of the parental reference strain, which is a proxy for the number of individuals in the offspring that share the SNP with the parent. Genomic loci associated with the trait can thus be discovered by analyzing trends in the counts along the genome. We exploit the link between smoothing splines and generalized mixed models for estimating the underlying structure present in the SNP scatterplots
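To make the notion of "trends in binomial counts along the genome" concrete, here is a rough sketch. It deliberately swaps in a simple sliding-window pooled estimate for the smoothing-spline and mixed-model machinery the abstract refers to. The simulated data, the peak shape, and every function name are assumptions made for illustration only.

```python
import math
import random


def simulate_counts(n_snps, depth, locus, rng):
    """Draw binomial match counts whose success probability peaks at `locus`.

    Hypothetical model: frequency 0.5 far from the locus, rising to 0.9 at it.
    """
    matches = []
    for i in range(n_snps):
        p = 0.5 + 0.4 * math.exp(-((i - locus) / 15.0) ** 2)
        matches.append(sum(rng.random() < p for _ in range(depth)))
    return matches, [depth] * n_snps


def windowed_frequency(matches, depths, half_width):
    """Pooled binomial estimate sum(matches) / sum(depths) in a sliding window."""
    if len(matches) != len(depths):
        raise ValueError("matches and depths must have the same length")
    n = len(matches)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        total_depth = sum(depths[lo:hi])
        total_match = sum(matches[lo:hi])
        smoothed.append(total_match / total_depth if total_depth else float("nan"))
    return smoothed


def peak_index(values):
    """Index of the largest non-NaN value (first one on ties)."""
    candidates = [i for i, v in enumerate(values) if v == v]  # v == v drops NaN
    return max(candidates, key=lambda i: values[i])


rng = random.Random(0)
matches, depths = simulate_counts(n_snps=200, depth=40, locus=120, rng=rng)
smoothed = windowed_frequency(matches, depths, half_width=10)
print("estimated peak near SNP index", peak_index(smoothed))
```

A fixed window is the crudest possible smoother. A penalized spline fitted within a binomial mixed model, as the abstract describes, would choose the amount of smoothing from the data instead of from a hand-picked `half_width`.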
High-Density Genotypes of Inbred Mouse Strains: Improved Power and Precision of Association Mapping.
Human genome-wide association studies have identified thousands of loci associated with disease phenotypes. Genome-wide association studies also have become feasible using rodent models and these have some important advantages over human studies, including controlled environment, access to tissues for molecular profiling, reproducible genotypes, and a wide array of techniques for experimental validation. Association mapping with common mouse inbred strains generally requires 100 or more strains to achieve sufficient power and mapping resolution; in contrast, sample sizes for human studies typically are one or more orders of magnitude greater than this. To enable well-powered studies in mice, we have generated high-density genotypes for ∼175 inbred strains of mice using the Mouse Diversity Array. These new data increase marker density by 1.9-fold, have reduced missing data rates, and provide more accurate identification of heterozygous regions compared with previous genotype data. We report the discovery of new loci from previously reported association mapping studies using the new genotype data. The data are freely available for download, and Web-based tools provide easy access for association mapping and viewing of the underlying intensity data for individual loci
MaxSSmap: A GPU program for mapping divergent short reads to genomes with the maximum scoring subsequence
Programs based on hash tables and Burrows-Wheeler are very fast for mapping
short reads to genomes but have low accuracy in the presence of mismatches and
gaps. Such reads can be aligned accurately with the Smith-Waterman algorithm
but it can take hours or days to map millions of reads, even for bacterial
genomes. We introduce a GPU program called MaxSSmap with the aim of achieving
comparable accuracy to Smith-Waterman but with faster runtimes. Similar to most
programs, MaxSSmap identifies a local region of the genome followed by exact
alignment. Instead of using hash tables or Burrows-Wheeler in the first part,
MaxSSmap calculates the maximum scoring subsequence score between the read and
disjoint fragments of the genome in parallel on a GPU and selects the highest
scoring fragment for exact alignment. We evaluate MaxSSmap's accuracy and
runtime when mapping simulated Illumina E. coli and human chromosome one reads
of different lengths and 10% to 30% mismatches with gaps to the E. coli genome
and human chromosome one. We also demonstrate applications on real data by
mapping ancient horse DNA reads to modern genomes and unmapped paired reads
from NA12878 in 1000 genomes. We show that MaxSSmap attains comparable high
accuracy and low error to fast Smith-Waterman programs yet has much lower
runtimes. We show that MaxSSmap can map reads rejected by BWA and NextGenMap
with high accuracy and low error much faster than if Smith-Waterman were used.
On short read lengths of 36 and 51, both MaxSSmap and Smith-Waterman have lower
accuracy than at longer read lengths. On real data, MaxSSmap produces many
alignments with high score and mapping quality that are not given by NextGenMap
and BWA. The MaxSSmap source code is freely available from
http://www.cs.njit.edu/usman/MaxSSmap
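The fragment-selection step described in this abstract can be sketched on the CPU as follows. This is a sequential, illustrative stand-in for the GPU program, not the MaxSSmap source. The +1/-1 scoring, the ungapped sliding of the read, the extension of each fragment by `len(read) - 1` bases to cover boundary-spanning reads, and all function names are assumptions.

```python
def max_scoring_subsequence(scores):
    """Kadane's algorithm: largest sum over any contiguous run (empty run = 0)."""
    best = current = 0
    for s in scores:
        current = max(0, current + s)
        best = max(best, current)
    return best


def score_fragment(read, fragment, match=1, mismatch=-1):
    """Best maximum-scoring-subsequence score over all ungapped read offsets."""
    best = 0
    for offset in range(len(fragment) - len(read) + 1):
        window = fragment[offset:offset + len(read)]
        scores = [match if a == b else mismatch for a, b in zip(read, window)]
        best = max(best, max_scoring_subsequence(scores))
    return best


def best_fragment(read, genome, fragment_len):
    """Return (fragment_start, score) for the highest-scoring fragment.

    Each fragment is scored independently, which is why this step can run
    in parallel; here the fragments are simply visited in a loop.
    """
    overlap = len(read) - 1  # assumption: lets reads span fragment boundaries
    best_start, best_score = 0, -1
    for start in range(0, len(genome), fragment_len):
        fragment = genome[start:start + fragment_len + overlap]
        score = score_fragment(read, fragment)
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


genome = "AAAAAAAAAA" + "CGTCGGCTAG" + "TTTTTTTTTT"
exact_read = "CGTCGGCTAG"
mismatched_read = "CGTCGACTAG"  # one substitution relative to the genome
print(best_fragment(exact_read, genome, fragment_len=10))
print(best_fragment(mismatched_read, genome, fragment_len=10))
```

The selected fragment would then be passed to an exact aligner such as Smith-Waterman. Because the subsequence score tolerates scattered mismatches, a divergent read can still single out the correct fragment.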
Polymorphism identification and improved genome annotation of Brassica rapa through Deep RNA sequencing.
The mapping and functional analysis of quantitative traits in Brassica rapa can be greatly improved with the availability of physically positioned, gene-based genetic markers and accurate genome annotation. In this study, deep transcriptome RNA sequencing (RNA-Seq) of Brassica rapa was undertaken with two objectives: SNP detection and improved transcriptome annotation. We performed SNP detection on two varieties that are parents of a mapping population to aid in development of a marker system for this population and subsequent development of a high-resolution genetic map. An improved Brassica rapa transcriptome was constructed to detect novel transcripts and to improve the current genome annotation. This is useful for accurate mRNA abundance and detection of expression QTL (eQTLs) in mapping populations. Deep RNA-Seq of two Brassica rapa genotypes, R500 (var. trilocularis, Yellow Sarson) and IMB211 (a rapid cycling variety), using eight different tissues (root, internode, leaf, petiole, apical meristem, floral meristem, silique, and seedling) grown across three different environments (growth chamber, greenhouse and field) and under two different treatments (simulated sun and simulated shade) generated 2.3 billion high-quality Illumina reads. A total of 330,995 SNPs were identified in transcribed regions between the two genotypes with an average frequency of one SNP in every 200 bases. The deep RNA-Seq reassembled Brassica rapa transcriptome identified 44,239 protein-coding genes. Compared with current gene models of B. rapa, we detected 3537 novel transcripts, 23,754 gene models had structural modifications, and 3655 annotated proteins changed. Gaps in the current genome assembly of B. rapa are highlighted by our identification of 780 unmapped transcripts. All the SNPs, annotations, and predicted transcripts can be viewed at http://phytonetworks.ucdavis.edu/
