Linking de novo assembly results with long DNA reads by dnaasm-link application
Third-generation sequencing techniques, which yield much
longer DNA reads than next-generation sequencing technologies, are
becoming increasingly popular. There are many ways to combine data
from next-generation and third-generation sequencing.
Herein, we present a new application called dnaasm-link for linking contigs,
a result of de novo assembly of second-generation sequencing data,
with long DNA reads. Our tool includes an integrated module to fill gaps with a
suitable fragment of appropriate long DNA read, which improves the consistency
of the resulting DNA sequences. This feature is very important, in particular
for complex DNA regions, as presented in the paper. Finally, our implementation
outperforms other state-of-the-art tools in terms of speed and memory
requirements, which may enable the use of the presented application for
organisms with large genomes, which is not possible with existing applications.
The presented application offers several advantages: (i) significant memory
optimization and reduced computation time; (ii) gap filling with the
appropriate fragment of a specified long DNA read; and (iii) a reduced number
of spanned and unspanned gaps in existing genome drafts.
The application is freely available to all users under the GNU Library or Lesser
General Public License version 3.0 (LGPLv3). The demo application, Docker image,
and source code are available at http://dnaasm.sourceforge.net.
Comment: 16 pages, 5 figures
Of bits and bugs
Pur-α is a nucleic acid-binding protein involved in cell cycle control, transcription, and neuronal function. Initially no prediction of the three-dimensional structure of Pur-α was possible. However, recently we solved the X-ray structure of Pur-α from the fruit fly Drosophila melanogaster and showed that it contains a so-called PUR domain. Here we explain how we exploited bioinformatics tools in combination with X-ray structure determination of a bacterial homolog to obtain diffracting crystals and the high-resolution structure of Drosophila Pur-α. First, we used sensitive methods for remote-homology detection to find three repetitive regions in Pur-α. We realized that our lack of understanding of how these repeats interact to form a globular domain was a major problem for crystallization and structure determination. With our information on the repeat motifs we then identified a distant bacterial homolog that contains only one repeat. We determined the bacterial crystal structure and found that two of the repeats interact to form a globular domain. Based on this bacterial structure, we calculated a computational model of the eukaryotic protein. The model allowed us to design a crystallizable fragment and to determine the structure of Drosophila Pur-α. Key to success was the fact that single repeats of the bacterial protein self-assembled into a globular domain, instructing us on the number and boundaries of repeats to be included for crystallization trials with the eukaryotic protein. This study demonstrates that the simpler structural domain arrangement of a distant prokaryotic protein can guide the design of eukaryotic crystallization constructs. Since many eukaryotic proteins contain multiple repeats or repeating domains, this approach might be instructive for structural studies of a range of proteins.
RIME: Repeat Identification
We present an algorithm for detecting long similar fragments occurring at least twice in a set of biological sequences. The problem becomes computationally challenging when the frequency of a repeat is allowed to increase and when a non-negligible number of insertions, deletions and substitutions are allowed. In this paper we introduce an algorithm, Rime (for Repeat Identification: long, Multiple, and with Edits), that performs this task and handles instances whose size and combination of parameters cannot be handled by other existing methods. (Rime is also a reference to Coleridge's poem "The Rime of the Ancient Mariner", which contains many repetitions as a poetic device.) This is achieved by using a filter as a preprocessing step and then exploiting the information gathered by the filter in the subsequent repeat inference step. To the best of our knowledge, Rime is the first algorithm that can accurately deal with very long repeats (up to a few thousand), possibly occurring several times, with a rate of differences (substitutions and indels) among copies of the same repeat of 10-15% or even more.
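The abstract does not say how the filter works, so the following is only a minimal sketch of the general idea of filtering before repeat inference. It assumes a simple exact k-mer filter: positions not covered by any k-mer that occurs at least twice are discarded before a more expensive step would run. The function names `repeated_kmer_positions` and `candidate_mask` are hypothetical, and this is not Rime's actual filter, which must tolerate substitutions and indels.

```python
from collections import defaultdict


def repeated_kmer_positions(seq, k):
    """Map each k-mer occurring at least twice in seq to its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    # Keep only k-mers seen two or more times (hypothetical filter criterion).
    return {kmer: pos for kmer, pos in positions.items() if len(pos) >= 2}


def candidate_mask(seq, k):
    """Mark every position covered by some k-mer that occurs at least twice."""
    keep = [False] * len(seq)
    for starts in repeated_kmer_positions(seq, k).values():
        for start in starts:
            for j in range(start, start + k):
                keep[j] = True
    return keep


# Toy sequence, made up for illustration.
seq = "ACGTACGTTTGA"
print(repeated_kmer_positions(seq, 4))
print(candidate_mask(seq, 4))
```

Only the positions marked `True` would be passed on to the inference step. An exact-match filter like this one loses sensitivity as the allowed rate of differences grows, which is one reason a practical filter for this problem has to be more permissive.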
Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly
Motivation: Eugene Myers in his string graph paper (Myers, 2005) suggested
that in a string graph or equivalently a unitig graph, any path spells a valid
assembly. As a string/unitig graph also encodes every valid assembly of reads,
such a graph, provided that it can be constructed correctly, is in fact a
lossless representation of reads. In principle, every analysis based on
whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion
(INDEL) calling, can also be achieved with unitigs.
Results: To explore the feasibility of using de novo assembly in the context
of resequencing, we developed a de novo assembler, fermi, that assembles
Illumina short reads into unitigs while preserving most of the information in the
input reads. SNPs and INDELs can be called by mapping the unitigs against a
reference genome. By applying the method on 35-fold human resequencing data, we
showed that in comparison to the standard pipeline, our approach yields similar
accuracy for SNP calling and better results for INDEL calling. It has higher
sensitivity than other de novo assembly based methods for variant calling. Our
work suggests that variant calling with de novo assembly can be a beneficial
complement to the standard variant calling pipeline for whole-genome
resequencing. On the methodological side, we proposed the FMD-index for
forward-backward extension of DNA sequences, a fast algorithm for finding all
super-maximal exact matches, and one-pass construction of unitigs from an
FMD-index.
Availability: http://github.com/lh3/fermi
Contact: [email protected]
Comment: Rev2: submitted version with minor improvements; 7 pages
Targeted genome modifications in soybean with CRISPR/Cas9
Background: The ability to selectively alter genomic DNA sequences in vivo is a powerful tool for basic and applied research. The CRISPR/Cas9 system precisely mutates DNA sequences in a number of organisms. Here, the CRISPR/Cas9 system is shown to be effective in soybean by knocking out a green fluorescent protein (GFP) transgene and modifying nine endogenous loci.
Results: Targeted DNA mutations were detected in 95% of 88 hairy-root transgenic events analyzed. Bi-allelic mutations were detected in events transformed with eight of the nine targeting vectors. Small deletions were the most common type of mutation produced, although SNPs and short insertions were also observed. Homoeologous genes were successfully targeted singly and together, demonstrating that CRISPR/Cas9 can target members of gene families both selectively and generally. Somatic embryo cultures were also modified to enable the production of plants with heritable mutations, with the frequency of DNA modifications increasing with culture time. A novel cloning strategy and vector system based on In-Fusion® cloning was developed to simplify the production of CRISPR/Cas9 targeting vectors, which should be applicable for targeting any gene in any organism.
Conclusions: CRISPR/Cas9 is a simple, efficient, and highly specific genome editing tool in soybean. Although some vectors are more efficient than others, it is possible to edit duplicated genes relatively easily. The vectors and methods developed here will be useful for the application of CRISPR/Cas9 to soybean and other plant species.
The CACTA transposon Bot1 played a major role in Brassica genome divergence and gene proliferation
We isolated and characterized a Brassica C genome-specific CACTA element, which was designated Bot1 (Brassica oleracea transposon 1). After analysing phylogenetic relationships, copy numbers and sequence similarity of Bot1 and Bot1 analogues in B. oleracea (C genome) versus Brassica rapa (A genome), we concluded that Bot1 has undergone several rounds of amplification in the B. oleracea genome only, and has played a major role in the recent divergence of the B. rapa and B. oleracea genomes. We performed in silico analyses of the genomic organization and internal structure of Bot1, and established which segment of Bot1 is C-genome specific. Our work reports a fully characterized Brassica repetitive sequence that can distinguish the Brassica A and C chromosomes in the allotetraploid Brassica napus by fluorescence in situ hybridization. We demonstrated that Bot1 carries a host S locus-associated SLL3 gene copy. We speculate that Bot1 was involved in the proliferation of SLL3 around the Brassica genome. The present study reinforces the assumption that transposons are a major driver of genome and gene evolution in higher plants.
From Pine Cones to Read Clouds: Rescaffolding the Megagenome of Sugar Pine (Pinus lambertiana).
We investigate the utility and scalability of new read cloud technologies to improve the draft genome assemblies of the colossal, and largely repetitive, genomes of conifers. Synthetic long read technologies have existed in various forms as a means of reducing complexity and resolving repeats since the outset of genome assembly. Recently, technologies that combine subhaploid pools of high molecular weight DNA with barcoding on a massive scale have brought new efficiencies to sample preparation and data generation. When combined with inexpensive light shotgun sequencing, the resulting data can be used to scaffold large genomes. The protocol is efficient enough to consider routinely for even the largest genomes. Conifers represent the largest reference genome projects executed to date. The largest of these is that of the conifer Pinus lambertiana (sugar pine), with a genome size of 31 billion bp. In this paper, we report on the molecular and computational protocols for scaffolding the P. lambertiana genome using the library technology from 10× Genomics. At 247,000 bp, the NG50 of the existing reference sequence is the highest scaffold contiguity among the currently published conifer assemblies; this new assembly's NG50 is 1.94 million bp, an eightfold increase.
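For readers unfamiliar with the metric, NG50 is commonly defined as the length L such that scaffolds of length L or longer together cover at least half of the estimated genome size. The sketch below computes it under that standard definition. The function name `ng50` and the toy scaffold lengths are illustrative assumptions, not data from the paper. The last line checks the fold change using only the two NG50 values quoted in the abstract.

```python
def ng50(lengths, genome_size):
    """Return the NG50: the length L such that sequences of length >= L
    together cover at least half of genome_size (0 if they never do)."""
    half = genome_size / 2
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return 0  # the assembly covers less than half of the genome


# Toy scaffold lengths, made up for illustration (not from the paper).
scaffolds = [50, 40, 30, 20, 10]
print(ng50(scaffolds, genome_size=200))  # 50 + 40 + 30 = 120 >= 100, so 30

# Fold change between the two NG50 values quoted in the abstract.
print(round(1_940_000 / 247_000, 2))  # 7.85, i.e. roughly eightfold
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses the genome size, so two assemblies of the same genome can be compared directly.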