12,051 research outputs found

    Linking de novo assembly results with long DNA reads by dnaasm-link application

    Full text link
    Currently, third-generation sequencing techniques, which allow to obtain much longer DNA reads compared to the next-generation sequencing technologies, are becoming more and more popular. There are many possibilities to combine data from next-generation and third-generation sequencing. Herein, we present a new application called dnaasm-link for linking contigs, a result of \textit{de novo} assembly of second-generation sequencing data, with long DNA reads. Our tool includes an integrated module to fill gaps with a suitable fragment of appropriate long DNA read, which improves the consistency of the resulting DNA sequences. This feature is very important, in particular for complex DNA regions, as presented in the paper. Finally, our implementation outperforms other state-of-the-art tools in terms of speed and memory requirements, which may enable the usage of the presented application for organisms with a large genome, which is not possible in~existing applications. The presented application has many advantages as (i) significant memory optimization and reduction of computation time (ii) filling the gaps through the appropriate fragment of a specified long DNA read (iii) reducing number of spanned and unspanned gaps in the existing genome drafts. The application is freely available to all users under GNU Library or Lesser General Public License version 3.0 (LGPLv3). The demo application, docker image and source code are available at http://dnaasm.sourceforge.net.Comment: 16 pages, 5 figure

    Of bits and bugs

    Get PDF
    Pur-α is a nucleic acid-binding protein involved in cell cycle control, transcription, and neuronal function. Initially no prediction of the three-dimensional structure of Pur-α was possible. However, recently we solved the X-ray structure of Pur-α from the fruitfly Drosophila melanogaster and showed that it contains a so-called PUR domain. Here we explain how we exploited bioinformatics tools in combination with X-ray structure determination of a bacterial homolog to obtain diffracting crystals and the high-resolution structure of Drosophila Pur-α. First, we used sensitive methods for remote-homology detection to find three repetitive regions in Pur-α. We realized that our lack of understanding how these repeats interact to form a globular domain was a major problem for crystallization and structure determination. With our information on the repeat motifs we then identified a distant bacterial homolog that contains only one repeat. We determined the bacterial crystal structure and found that two of the repeats interact to form a globular domain. Based on this bacterial structure, we calculated a computational model of the eukaryotic protein. The model allowed us to design a crystallizable fragment and to determine the structure of Drosophila Pur-α. Key for success was the fact that single repeats of the bacterial protein self-assembled into a globular domain, instructing us on the number and boundaries of repeats to be included for crystallization trials with the eukaryotic protein. This study demonstrates that the simpler structural domain arrangement of a distant prokaryotic protein can guide the design of eukaryotic crystallization constructs. Since many eukaryotic proteins contain multiple repeats or repeating domains, this approach might be instructive for structural studies of a range of proteins

    RIME: Repeat Identification

    Get PDF
    We present an algorithm for detecting long similar fragments occurring at least twice in a set of biological sequences. The problem becomes computationally challenging when the frequency of a repeat is allowed to increase and when a non-negligible number of insertions, deletions and substitutions are allowed. We introduce in this paper an algorithm, Rime1 1 Rime is also a reference to Coleridge's poem "The Rime of an Ancient Mariner" which contains many repetitions as a poetic device. (for Repeat Identification: long, Multiple, and with Edits) that performs this task, and manages instances whose size and combination of parameters cannot be handled by other currently existing methods. This is achieved by using a filter as a preprocessing step, and by then exploiting the information gathered by the filter in the following actual repeat inference step. To the best of our knowledge, Rime is the first algorithm that can accurately deal with very long repeats (up to a few thousands), occurring possibly several times, and with a rate of differences (substitutions and indels) allowed among copies of a same repeat of 10-15% or even more

    Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly

    Full text link
    Motivation: Eugene Myers in his string graph paper (Myers, 2005) suggested that in a string graph or equivalently a unitig graph, any path spells a valid assembly. As a string/unitig graph also encodes every valid assembly of reads, such a graph, provided that it can be constructed correctly, is in fact a lossless representation of reads. In principle, every analysis based on whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion (INDEL) calling, can also be achieved with unitigs. Results: To explore the feasibility of using de novo assembly in the context of resequencing, we developed a de novo assembler, fermi, that assembles Illumina short reads into unitigs while preserving most of information of the input reads. SNPs and INDELs can be called by mapping the unitigs against a reference genome. By applying the method on 35-fold human resequencing data, we showed that in comparison to the standard pipeline, our approach yields similar accuracy for SNP calling and better results for INDEL calling. It has higher sensitivity than other de novo assembly based methods for variant calling. Our work suggests that variant calling with de novo assembly be a beneficial complement to the standard variant calling pipeline for whole-genome resequencing. In the methodological aspects, we proposed FMD-index for forward-backward extension of DNA sequences, a fast algorithm for finding all super-maximal exact matches and one-pass construction of unitigs from an FMD-index. Availability: http://github.com/lh3/fermi Contact: [email protected]: Rev2: submitted version with minor improvements; 7 page

    Targeted genome modifications in soybean with CRISPR/Cas9

    Get PDF
    Background: The ability to selectively alter genomic DNA sequences in vivo is a powerful tool for basic and applied research. The CRISPR/Cas9 system precisely mutates DNA sequences in a number of organisms. Here, the CRISPR/Cas9 system is shown to be effective in soybean by knocking-out a green fluorescent protein (GFP) transgene and modifying nine endogenous loci. Results: Targeted DNA mutations were detected in 95% of 88 hairy-root transgenic events analyzed. Bi-allelic mutations were detected in events transformed with eight of the nine targeting vectors. Small deletions were the most common type of mutation produced, although SNPs and short insertions were also observed. Homoeologous genes were successfully targeted singly and together, demonstrating that CRISPR/Cas9 can both selectively, and generally, target members of gene families. Somatic embryo cultures were also modified to enable the production of plants with heritable mutations, with the frequency of DNA modifications increasing with culture time. A novel cloning strategy and vector system based on In-Fusion (R) cloning was developed to simplify the production of CRISPR/Cas9 targeting vectors, which should be applicable for targeting any gene in any organism. Conclusions: The CRISPR/Cas9 is a simple, efficient, and highly specific genome editing tool in soybean. Although some vectors are more efficient than others, it is possible to edit duplicated genes relatively easily. The vectors and methods developed here will be useful for the application of CRISPR/Cas9 to soybean and other plant species

    Chromosome analysis by non-isotopic in situ hybridization.

    Get PDF

    The CACTA transposon Bot1 played a major role in Brassica genome divergence and gene proliferation

    Get PDF
    We isolated and characterized a Brassica C genome-specific CACTA element, which was designated Bot1 (Brassica oleracea transposon 1). After analysing phylogenetic relationships, copy numbers and sequence similarity of Bot1 and Bot1 analogues in B. oleracea (C genome) versus Brassica rapa (A genome), we concluded that Bot1 has encountered several rounds of amplification in the oleracea genome only, and has played a major role in the recent rapa and oleracea genome divergence. We performed in silico analyses of the genomic organization and internal structure of Bot1, and established which segment of Bot1 is C-genome specific. Our work reports a fully characterized Brassica repetitive sequence that can distinguish the Brassica A and C chromosomes in the allotetraploid Brassica napus, by fluorescent in situ hybridization. We demonstrated that Bot1 carries a host S locus-associated SLL3 gene copy. We speculate that Bot1 was involved in the proliferation of SLL3 around the Brassica genome. The present study reinforces the assumption that transposons are a major driver of genome and gene evolution in higher plants

    From Pine Cones to Read Clouds: Rescaffolding the Megagenome of Sugar Pine (Pinus lambertiana).

    Get PDF
    We investigate the utility and scalability of new read cloud technologies to improve the draft genome assemblies of the colossal, and largely repetitive, genomes of conifers. Synthetic long read technologies have existed in various forms as a means of reducing complexity and resolving repeats since the outset of genome assembly. Recently, technologies that combine subhaploid pools of high molecular weight DNA with barcoding on a massive scale have brought new efficiencies to sample preparation and data generation. When combined with inexpensive light shotgun sequencing, the resulting data can be used to scaffold large genomes. The protocol is efficient enough to consider routinely for even the largest genomes. Conifers represent the largest reference genome projects executed to date. The largest of these is that of the conifer Pinus lambertiana (sugar pine), with a genome size of 31 billion bp. In this paper, we report on the molecular and computational protocols for scaffolding the P. lambertiana genome using the library technology from 10Ă— Genomics. At 247,000 bp, the NG50 of the existing reference sequence is the highest scaffold contiguity among the currently published conifer assemblies; this new assembly's NG50 is 1.94 million bp, an eightfold increase
    • …
    corecore