    Reference Based Genome Compression

    DNA sequencing technology has advanced to a point where storage is becoming the central bottleneck in the acquisition and mining of more data. Large amounts of data are vital for genomics research, and generic compression tools, while viable, cannot offer the same savings as approaches tuned to inherent biological properties. We propose an algorithm to compress a target genome given a known reference genome. The proposed algorithm first generates a mapping from the reference to the target genome, and then compresses this mapping with an entropy coder. As an illustration of the performance: applying our algorithm to James Watson's genome with hg18 as a reference, we are able to reduce the 2991 megabyte (MB) genome down to 6.99 MB, while Gzip compresses it to 834.8 MB. Comment: 5 pages; Submitted to the IEEE Information Theory Workshop (ITW) 201
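
    The two-stage idea in this abstract (build a mapping from reference to target, then compress the mapping) can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes equal-length sequences that differ only by substitutions, and it swaps in the general-purpose zlib compressor for the entropy coder. The function names build_mapping, compress_mapping and decompress_target are hypothetical.

```python
import zlib


def build_mapping(reference: str, target: str) -> list[tuple[int, str]]:
    """List (position, base) pairs where the target differs from the reference.

    Assumption: equal-length sequences, substitutions only. A real mapping
    would also have to encode insertions and deletions.
    """
    if len(reference) != len(target):
        raise ValueError("this sketch handles substitutions only")
    return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]


def compress_mapping(mapping: list[tuple[int, str]]) -> bytes:
    """Delta-encode the positions and compress the result with zlib.

    zlib stands in for the entropy coder mentioned in the abstract.
    """
    parts = []
    prev = 0
    for pos, base in mapping:
        parts.append(f"{pos - prev}{base}")  # gap since previous edit, then new base
        prev = pos
    return zlib.compress(",".join(parts).encode("ascii"))


def decompress_target(reference: str, blob: bytes) -> str:
    """Rebuild the target from the reference and the compressed mapping."""
    text = zlib.decompress(blob).decode("ascii")
    bases = list(reference)
    pos = 0
    if text:
        for item in text.split(","):
            pos += int(item[:-1])   # undo the delta encoding
            bases[pos] = item[-1]   # apply the substitution
    return "".join(bases)


reference = "ACGTACGTACGT"
target = "ACGAACGTACTT"
blob = compress_mapping(build_mapping(reference, target))
restored = decompress_target(reference, blob)
```

    Only the differences are stored, so the cost depends on how far the target diverges from the reference rather than on the genome length.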

    A Novel Genome-Wide Association Study Approach Using Genotyping by Exome Sequencing Leads to the Identification of a Primary Open Angle Glaucoma Associated Inversion Disrupting ADAMTS17

    Closed breeding populations in the dog in conjunction with advances in gene mapping and sequencing techniques facilitate mapping of autosomal recessive diseases and identification of novel disease-causing variants, often using unorthodox experimental designs. In our investigation we demonstrate successful mapping of the locus for primary open angle glaucoma in the Petit Basset Griffon Vendéen dog breed with 12 cases and 12 controls, using a novel genotyping by exome sequencing approach. The resulting genome-wide association signal was followed up by genome sequencing of an individual case, leading to the identification of an inversion with a breakpoint disrupting the ADAMTS17 gene. Genotyping of additional controls and expression analysis provide strong evidence that the inversion is disease causing. Evidence of cryptic splicing resulting in novel exon transcription as a consequence of the inversion in ADAMTS17 is identified through RNAseq experiments. This investigation demonstrates how a novel genotyping by exome sequencing approach can be used to map an autosomal recessive disorder in the dog, with the use of genome sequencing to facilitate identification of a disease-associated variant

    Distribution of label spacings for genome mapping in nanochannels

    In genome mapping experiments, long DNA molecules are stretched by confining them to very narrow channels, so that the locations of sequence-specific fluorescent labels along the channel axis provide large-scale genomic information. It is difficult, however, to make the channels narrow enough so that the DNA molecule is fully stretched. In practice its conformations may form hairpins that change the spacings between internal segments of the DNA molecule, and thus the label locations along the channel axis. Here we describe a theory for the distribution of label spacings that explains the heavy tails observed in distributions of label spacings in genome mapping experiments. Comment: 18 pages, 4 figures, 1 table

    Single-molecule real-time sequencing combined with optical mapping yields completely finished fungal genome

    Next-generation sequencing (NGS) technologies have increased the scalability, speed, and resolution of genomic sequencing and, thus, have revolutionized genomic studies. However, eukaryotic genome sequencing initiatives typically yield considerably fragmented genome assemblies. Here, we assessed various state-of-the-art sequencing and assembly strategies in order to produce a contiguous and complete eukaryotic genome assembly, focusing on the filamentous fungus Verticillium dahliae. Compared with Illumina-based assemblies of the V. dahliae genome, hybrid assemblies that also include PacBio-generated long reads establish superior contiguity. Intriguingly, provided that sufficient sequence depth is reached, assemblies solely based on PacBio reads outperform hybrid assemblies and even result in fully assembled chromosomes. Furthermore, the addition of optical map data allowed us to produce a gapless and complete V. dahliae genome assembly of the expected eight chromosomes from telomere to telomere. Consequently, we can now study genomic regions that were previously not assembled or poorly assembled, including regions that are populated by repetitive sequences, such as transposons, allowing us to fully appreciate an organism’s biological complexity. Our data show that a combination of PacBio-generated long reads and optical mapping can be used to generate complete and gapless assemblies of fungal genomes. IMPORTANCE Studying whole-genome sequences has become an important aspect of biological research. The advent of next-generation sequencing (NGS) technologies has nowadays brought genomic science within reach of most research laboratories, including those that study nonmodel organisms. However, most genome sequencing initiatives typically yield (highly) fragmented genome assemblies. Nevertheless, considerable relevant information related to genome structure and evolution is likely hidden in those nonassembled regions. Here, we investigated a diverse set of strategies to obtain gapless genome assemblies, using the genome of a typical ascomycete fungus as the template. Eventually, we were able to show that a combination of PacBio-generated long reads and optical mapping yields a gapless telomere-to-telomere genome assembly, allowing in-depth genome analyses to facilitate functional studies into an organism’s biology

    Genome maps across 26 human populations reveal population-specific patterns of structural variation.

    Large structural variants (SVs) in the human genome are difficult to detect and study by conventional sequencing technologies. With long-range genome analysis platforms, such as optical mapping, one can identify large SVs (>2 kb) across the genome in one experiment. Analyzing optical genome maps of 154 individuals from the 26 populations sequenced in the 1000 Genomes Project, we find that phylogenetic population patterns of large SVs are similar to those of single nucleotide variations in 86% of the human genome, while ~2% of the genome has high structural complexity. We are able to characterize SVs in many intractable regions of the genome, including segmental duplications and subtelomeric, pericentromeric, and acrocentric areas. In addition, we discover ~60 Mb of non-redundant genome content missing in the reference genome sequence assembly. Our results highlight the need for a comprehensive set of alternate haplotypes from different populations to represent SV patterns in the genome

    Simultaneous mapping of multiple gene loci with pooled segregants

    The analysis of polygenic, phenotypic characteristics such as quantitative traits or inheritable diseases remains an important challenge. It requires reliable scoring of many genetic markers covering the entire genome. The advent of high-throughput sequencing technologies provides a new way to evaluate large numbers of single nucleotide polymorphisms (SNPs) as genetic markers. Combining these technologies with pooling of segregants, as performed in bulked segregant analysis (BSA), should, in principle, allow the simultaneous mapping of multiple genetic loci present throughout the genome. The gene mapping process applied here consists of three steps. First, parents with and without a trait are crossed under controlled conditions. Second, offspring are selected by phenotypic screening, and short offspring sequences are mapped against the parental reference. Finally, genetic markers such as SNPs, insertions and deletions are detected with next-generation sequencing (NGS). Markers in close proximity to genomic loci that are associated with the trait are more likely to be inherited together with those loci. Hence, these markers are very useful for discovering the loci and the genetic mechanism underlying the characteristic of interest. Within this context, NGS produces binomial counts along the genome, i.e., the number of sequenced reads that match the SNP of the parental reference strain, which is a proxy for the number of individuals in the offspring that share the SNP with the parent. Genomic loci associated with the trait can thus be discovered by analyzing trends in the counts along the genome. We exploit the link between smoothing splines and generalized mixed models for estimating the underlying structure present in the SNP scatterplots
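
    The idea of reading a trend out of binomial counts along the genome can be sketched as follows. This is only an illustration: it uses a Gaussian-kernel weighted proportion in place of the smoothing splines and generalized mixed models named in the abstract, and the simulated frequency curve, read depth, bandwidth and function names are all assumptions.

```python
import math
import random


def simulate_counts(positions, freq_fn, depth, rng):
    """Draw, for each SNP, how many of `depth` reads match the parental allele."""
    return [sum(rng.random() < freq_fn(p) for _ in range(depth)) for p in positions]


def kernel_smooth(positions, counts, depth, bandwidth):
    """Gaussian-kernel estimate of the matching-read proportion at each SNP.

    A simple stand-in for the spline-based smoother; every SNP is assumed
    to have the same read depth.
    """
    smoothed = []
    for x in positions:
        weights = [math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in positions]
        matching = sum(w * c for w, c in zip(weights, counts))
        total = sum(w * depth for w in weights)
        smoothed.append(matching / total)
    return smoothed


def true_freq(p):
    """Hypothetical allele frequency: 0.5 in the background, raised near 50,000 bp."""
    return 0.5 + 0.45 * math.exp(-((p - 50_000) / 10_000) ** 2)


rng = random.Random(0)                      # fixed seed for reproducibility
positions = list(range(0, 100_000, 1_000))  # one SNP every 1 kb (assumed)
counts = simulate_counts(positions, true_freq, depth=50, rng=rng)
trend = kernel_smooth(positions, counts, depth=50, bandwidth=5_000)
peak = positions[trend.index(max(trend))]   # candidate trait-associated region
```

    The smoothed proportion rises above 0.5 where the pooled offspring share the parental allele more often than expected by chance, which is the kind of trend the abstract proposes to detect.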

    High-Density Genotypes of Inbred Mouse Strains: Improved Power and Precision of Association Mapping.

    Human genome-wide association studies have identified thousands of loci associated with disease phenotypes. Genome-wide association studies also have become feasible using rodent models and these have some important advantages over human studies, including controlled environment, access to tissues for molecular profiling, reproducible genotypes, and a wide array of techniques for experimental validation. Association mapping with common mouse inbred strains generally requires 100 or more strains to achieve sufficient power and mapping resolution; in contrast, sample sizes for human studies typically are one or more orders of magnitude greater than this. To enable well-powered studies in mice, we have generated high-density genotypes for ∼175 inbred strains of mice using the Mouse Diversity Array. These new data increase marker density by 1.9-fold, have reduced missing data rates, and provide more accurate identification of heterozygous regions compared with previous genotype data. We report the discovery of new loci from previously reported association mapping studies using the new genotype data. The data are freely available for download, and Web-based tools provide easy access for association mapping and viewing of the underlying intensity data for individual loci

    MaxSSmap: A GPU program for mapping divergent short reads to genomes with the maximum scoring subsequence

    Programs based on hash tables and Burrows-Wheeler are very fast for mapping short reads to genomes but have low accuracy in the presence of mismatches and gaps. Such reads can be aligned accurately with the Smith-Waterman algorithm, but it can take hours or days to map millions of reads, even for bacterial genomes. We introduce a GPU program called MaxSSmap with the aim of achieving accuracy comparable to Smith-Waterman but with faster runtimes. Like most programs, MaxSSmap identifies a local region of the genome and then performs exact alignment. Instead of using hash tables or Burrows-Wheeler in the first part, MaxSSmap calculates the maximum scoring subsequence score between the read and disjoint fragments of the genome in parallel on a GPU and selects the highest scoring fragment for exact alignment. We evaluate MaxSSmap's accuracy and runtime when mapping simulated Illumina E. coli and human chromosome one reads of different lengths and 10% to 30% mismatches with gaps to the E. coli genome and human chromosome one. We also demonstrate applications on real data by mapping ancient horse DNA reads to modern genomes and unmapped paired reads from NA12878 in 1000 genomes. We show that MaxSSmap attains comparably high accuracy and low error to fast Smith-Waterman programs yet has much lower runtimes. We show that MaxSSmap can map reads rejected by BWA and NextGenMap with high accuracy and low error much faster than if Smith-Waterman were used. At short read lengths of 36 and 51, both MaxSSmap and Smith-Waterman have lower accuracy than at longer read lengths. On real data MaxSSmap produces many alignments with high score and mapping quality that are not given by NextGenMap and BWA. The MaxSSmap source code is freely available from http://www.cs.njit.edu/usman/MaxSSmap
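
    The fragment-selection step described in this abstract can be sketched on a CPU as follows. This is a serial illustration, not the MaxSSmap GPU code: the +1/-1 match and mismatch scores, the ungapped sliding comparison, and the function names are assumptions, and reads that straddle a fragment boundary are not handled.

```python
def max_scoring_subsequence(scores):
    """Kadane's algorithm: largest sum over any contiguous run of scores."""
    best = current = 0
    for s in scores:
        current = max(0, current + s)  # drop the run once it goes negative
        best = max(best, current)
    return best


def fragment_score(read, fragment, match=1, mismatch=-1):
    """Best ungapped maximum-scoring-subsequence score of `read` in `fragment`."""
    best = 0
    for offset in range(len(fragment) - len(read) + 1):
        window = fragment[offset:offset + len(read)]
        scores = [match if a == b else mismatch for a, b in zip(read, window)]
        best = max(best, max_scoring_subsequence(scores))
    return best


def best_fragment(read, genome, fragment_len):
    """Score every disjoint fragment and return (start, score) of the best one.

    Each fragment is scored independently, which is what makes the step
    easy to run in parallel; here the loop is serial.
    """
    best_start, best_score = 0, -1
    for start in range(0, len(genome), fragment_len):
        score = fragment_score(read, genome[start:start + fragment_len])
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


genome = "AAAAAAAAAA" + "ACGTACGTAC" + "TTTTTTTTTT"
start, score = best_fragment("CGTACG", genome, fragment_len=10)
```

    The selected fragment would then be passed to an exact aligner such as Smith-Waterman, so the expensive alignment runs on one short region instead of the whole genome.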

    Polymorphism identification and improved genome annotation of Brassica rapa through Deep RNA sequencing.

    The mapping and functional analysis of quantitative traits in Brassica rapa can be greatly improved with the availability of physically positioned, gene-based genetic markers and accurate genome annotation. In this study, deep transcriptome RNA sequencing (RNA-Seq) of Brassica rapa was undertaken with two objectives: SNP detection and improved transcriptome annotation. We performed SNP detection on two varieties that are parents of a mapping population to aid in development of a marker system for this population and subsequent development of a high-resolution genetic map. An improved Brassica rapa transcriptome was constructed to detect novel transcripts and to improve the current genome annotation. This is useful for accurate estimation of mRNA abundance and detection of expression QTL (eQTLs) in mapping populations. Deep RNA-Seq of two Brassica rapa genotypes, R500 (var. trilocularis, Yellow Sarson) and IMB211 (a rapid cycling variety), using eight different tissues (root, internode, leaf, petiole, apical meristem, floral meristem, silique, and seedling) grown across three different environments (growth chamber, greenhouse and field) and under two different treatments (simulated sun and simulated shade) generated 2.3 billion high-quality Illumina reads. A total of 330,995 SNPs were identified in transcribed regions between the two genotypes with an average frequency of one SNP in every 200 bases. The deep RNA-Seq reassembled Brassica rapa transcriptome identified 44,239 protein-coding genes. Compared with current gene models of B. rapa, we detected 3537 novel transcripts, 23,754 gene models had structural modifications, and 3655 annotated proteins changed. Gaps in the current genome assembly of B. rapa are highlighted by our identification of 780 unmapped transcripts. All the SNPs, annotations, and predicted transcripts can be viewed at http://phytonetworks.ucdavis.edu/