Linking de novo assembly results with long DNA reads by dnaasm-link application
Currently, third-generation sequencing techniques, which yield much longer
DNA reads than next-generation sequencing technologies, are becoming
increasingly popular. There are many ways to combine data from
next-generation and third-generation sequencing.
Herein, we present a new application called dnaasm-link for linking contigs,
the result of de novo assembly of second-generation sequencing data,
with long DNA reads. Our tool includes an integrated module that fills gaps
with a suitable fragment of an appropriate long DNA read, which improves the
contiguity of the resulting DNA sequences. This feature is particularly
important for complex DNA regions, as presented in the paper. Finally, our
implementation outperforms other state-of-the-art tools in terms of speed and
memory requirements, which may enable its use for organisms with large
genomes, something not possible with existing applications.
The presented application has several advantages: (i) significant memory
optimization and reduced computation time, (ii) filling of gaps with the
appropriate fragment of a specified long DNA read, and (iii) a reduced number
of spanned and unspanned gaps in existing genome drafts.
The application is freely available to all users under the GNU Lesser
General Public License version 3.0 (LGPLv3). The demo application, Docker
image, and source code are available at http://dnaasm.sourceforge.net.
Comment: 16 pages, 5 figures
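The gap-filling idea described above can be sketched in a few lines: anchor the end of one contig and the start of the next on the same long read, then splice in the read fragment between the anchors. This is a minimal illustration only; the function name, the fixed-length exact-match anchors, and the toy sequences are our own simplifications, not dnaasm-link's actual algorithm or API (which tolerates errors in the long reads).

```python
def fill_gap(left_contig, right_contig, long_read, anchor=10):
    # Locate the last `anchor` bases of the left contig and the
    # first `anchor` bases of the right contig on the long read;
    # the read fragment between the two anchors fills the gap.
    left_anchor = left_contig[-anchor:]
    right_anchor = right_contig[:anchor]
    i = long_read.find(left_anchor)
    if i < 0:
        return None                      # read does not cover the left contig
    j = long_read.find(right_anchor, i + anchor)
    if j < 0:
        return None                      # read does not reach the right contig
    gap = long_read[i + anchor:j]        # fragment spanning the gap
    return left_contig + gap + right_contig

left = "ACGTTGCACCTAGGATCTAA"
right = "CCATGGTACGGATTACCGGT"
read = left + "GGGG" + right             # long read covering both contigs
print(fill_gap(left, right, read))       # contigs joined, gap filled with GGGG
```

In practice the anchors would be found by alignment rather than exact string search, since long reads have high per-base error rates.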
MACSE: Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons
Until now, the most efficient solution for aligning nucleotide sequences containing open reading frames has been to use indirect procedures that align the amino acid translation and then report the inferred gap positions at the codon level. There are two important pitfalls with this approach. First, any premature stop codon precludes such a strategy. Second, each sequence is translated with the same reading frame from beginning to end, so the presence of a single additional nucleotide leads to both aberrant translation and alignment.
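The indirect procedure described above can be made concrete: translate each sequence, align the amino acids, then re-insert every protein-level gap as a codon-level gap of three nucleotides. The sketch below (function name and sequences are illustrative, not MACSE's API) shows only the back-mapping step; note that it assumes a single clean reading frame, which is exactly the assumption that frameshifts and premature stop codons break.

```python
def backtranslate_alignment(nt_seq, aligned_aa):
    # Re-insert gaps from a protein-level alignment into the
    # nucleotide sequence: one residue = one codon (3 bases),
    # one protein gap = one codon gap ("---").
    out, i = [], 0
    for aa in aligned_aa:
        if aa == "-":
            out.append("---")
        else:
            out.append(nt_seq[i:i + 3])
            i += 3
    return "".join(out)

# Protein MKV aligned against a partner with one extra residue:
print(backtranslate_alignment("ATGAAAGTT", "M-KV"))  # -> ATG---AAAGTT
```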
Computational methods to improve genome assembly and gene prediction
DNA sequencing is used to read the nucleotides composing the genetic material that forms individual organisms. As 2nd generation sequencing technologies offering high throughput at a feasible cost have matured, sequencing has permeated nearly all areas of biological research. By a combination of large-scale projects led by consortiums and smaller endeavors led by individual labs, the flood of sequencing data will continue, which should provide major insights into how genomes produce physical characteristics, including disease, and evolve. To realize this potential, computer science is required to develop the bioinformatics pipelines to efficiently and accurately process and analyze the data from large and noisy datasets. Here, I focus on two crucial bioinformatics applications: the assembly of a genome from sequencing reads and protein-coding gene prediction.
In genome assembly, we form large contiguous genomic sequences from the short sequence fragments generated by current machines. Starting from the raw sequences, we developed software called Quake that corrects sequencing errors more accurately than previous programs by using coverage of k-mers and probabilistic modeling of sequencing errors. My experiments show correcting errors with Quake improves genome assembly and leads to the detection of more polymorphisms in re-sequencing studies. For post-assembly analysis, we designed a method to detect a particular type of mis-assembly where the two copies of each chromosome in diploid genomes diverge. We found thousands of examples in each of the chimpanzee, cow, and chicken public genome assemblies that created false segmental duplications.
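The k-mer coverage signal mentioned above can be illustrated simply: k-mers seen many times across reads are likely genuine, while rare k-mers likely contain an error. The sketch below uses a hard count cutoff with illustrative names and toy parameters; Quake itself replaces the cutoff with probabilistic modeling of sequencing errors.

```python
from collections import Counter

def trusted_kmers(reads, k=4, min_cov=2):
    # Count every k-mer across all reads; k-mers seen at least
    # `min_cov` times are considered "trusted" (likely error-free).
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return {km for km, c in counts.items() if c >= min_cov}

def has_untrusted(read, trusted, k=4):
    # A read containing any untrusted k-mer likely has an error.
    return any(read[i:i + k] not in trusted
               for i in range(len(read) - k + 1))

reads = ["ACGTACGT", "ACGTACGT", "ACGTACCT"]   # last read has one error
trusted = trusted_kmers(reads)
print(has_untrusted("ACGTACCT", trusted))       # True: flagged for correction
print(has_untrusted("ACGTACGT", trusted))       # False
```

Correction then amounts to finding the minimal set of base edits that turns all of a read's k-mers into trusted ones.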
Shotgun sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to both discover unknown microbes and explore complex environments. We developed software called Scimm that clusters metagenomic sequences based on composition in an unsupervised fashion more accurately than previous approaches. Finally, we extended an approach for predicting protein-coding genes on whole genomes to metagenomic sequences by adding new discriminative features and augmenting the task with taxonomic classification and clustering of the sequences. The program, called Glimmer-MG, predicts genes more accurately than all previous methods. By adding a model for sequencing errors, which also allows the program to predict insertions and deletions, accuracy improves significantly on error-prone sequences.
ALGORITHMS FOR CORRECTING NEXT GENERATION SEQUENCING ERRORS
The advent of next-generation sequencing (NGS) technologies generated a revolution in biological research. However, in order to use the data they produce, new computational tools are needed. Due to the significantly shorter read lengths and higher per-base error rates, more complicated approaches are employed, and critical problems, such as genome assembly, are still not satisfactorily solved. We therefore focus our attention on improving the quality of NGS data; more precisely, we address the error-correction issue. Current methods for correcting errors are not very accurate, and they do not adapt to the data. We propose a novel tool, HiTEC, to correct errors in NGS data. HiTEC is based on the suffix array data structure accompanied by a statistical analysis. HiTEC's accuracy is significantly higher than that of all previous methods, and it is the only tool able to adjust to the given data set. In addition, HiTEC is time and space efficient.
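The statistical signal behind such context-based correction can be illustrated without the suffix array: for a fixed context string, tally which base follows it across all reads; a base much rarer than the consensus is a candidate error. The sketch below scans reads naively with illustrative names; HiTEC instead uses the suffix array to enumerate all contexts efficiently and a statistical analysis to set the thresholds.

```python
from collections import Counter

def following_chars(reads, context):
    # Tally which base follows `context` at every occurrence in
    # every read; a rare follower relative to the consensus base
    # is a candidate sequencing error.
    counts = Counter()
    for r in reads:
        start = r.find(context)
        while start != -1:
            nxt = start + len(context)
            if nxt < len(r):
                counts[r[nxt]] += 1
            start = r.find(context, start + 1)
    return counts

reads = ["GATTACA", "GATTACA", "GATTAGA"]
print(following_chars(reads, "GATTA"))   # Counter({'C': 2, 'G': 1})
```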
Designing Efficient Spaced Seeds for SOLiD Read Mapping
The advent of high-throughput sequencing technologies constituted
a major advance in genomic studies, offering new prospects in a
wide range of applications. We propose a rigorous and flexible algorithmic
solution to mapping SOLiD color-space reads to a reference genome. The
solution relies, on the one hand, on an advanced method of seed design that
uses a faithful probabilistic model of read matches and, on the other hand,
on a novel seeding principle especially adapted to read mapping. Our method
can handle both lossy and lossless frameworks and is able to distinguish, at
the level of seed design, between SNPs and reading errors. We illustrate
our approach with several seed designs and demonstrate their efficiency.
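A spaced seed replaces a contiguous match requirement with a pattern of "must match" (1) and "don't care" (0) positions, so a hit survives a mismatch (e.g., a SNP or a reading error) at a wildcard position. The sketch below indexes a reference in ordinary base space with illustrative names; it ignores the SOLiD color-space encoding and the seed-design optimization that the paper actually addresses.

```python
def seed_key(seq, pos, pattern):
    # Keep only the characters at '1' positions of the seed;
    # '0' positions are wildcards that tolerate mismatches.
    return "".join(seq[pos + i] for i, p in enumerate(pattern) if p == "1")

def index_reference(ref, pattern):
    # Hash every reference window by its spaced-seed key.
    idx = {}
    for pos in range(len(ref) - len(pattern) + 1):
        idx.setdefault(seed_key(ref, pos, pattern), []).append(pos)
    return idx

pattern = "1101"                 # positions 0, 1, 3 must match; 2 is free
idx = index_reference("ACGTACCT", pattern)
# A read window mismatching only at the wildcard position still hits:
print(idx[seed_key("ACTT", 0, pattern)])   # [0, 4]
```

Seed design then means choosing the 0/1 pattern that maximizes the probability of hitting true matches under a model of where mismatches occur.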
Fast and accurate correction of optical mapping data via spaced seeds
Motivation: Optical mapping data is used in many core genomics applications, including structural variation detection, scaffolding of assembled contigs, and mis-assembly detection. However, the pervasiveness of spurious and deleted cut sites in the raw data, called Rmaps, makes their assembly and alignment challenging. Although another method, cOMet, exists to error-correct Rmap data, it is unable to scale to even moderately large genomes. The challenge in error correction lies in determining pairs of Rmaps that originate from the same region of the same genome. Results: We create an efficient method for determining pairs of Rmaps that contain significant overlaps. Our method relies on a novel and nontrivial adaptation of spaced seeds to the context of optical mapping, which allows spurious and deleted cut sites to be accounted for. We apply our method to detecting and correcting these errors. The resulting error-correction method, referred to as Elmeri, improves upon the results of state-of-the-art correction methods in a fraction of the time. More specifically, cOMet required 9.9 CPU days to error-correct Rmap data generated from the human genome, whereas Elmeri required less than 15 CPU hours and improved the quality of the Rmaps more than fourfold compared to cOMet.
Causality, Information and Biological Computation: An algorithmic software approach to life, disease and the immune system
Biology has taken strong steps towards becoming a computer science aiming at
reprogramming nature after the realisation that nature herself has reprogrammed
organisms by harnessing the power of natural selection and the digital
prescriptive nature of replicating DNA. Here we further unpack ideas related to
computability, algorithmic information theory and software engineering, in the
context of the extent to which biology can be (re)programmed, and with how we
may go about doing so in a more systematic way with all the tools and concepts
offered by theoretical computer science in a translation exercise from
computing to molecular biology and back. These concepts provide a means to a
hierarchical organization, thereby blurring previously clear-cut lines between
concepts like matter and life, or between tumour types that are otherwise taken
as different but may not, however, have different causes. This does not diminish
the properties of life or make its components and functions less interesting.
On the contrary, this approach makes for a more encompassing and integrated
view of nature, one that subsumes observer and observed within the same system,
and can generate new perspectives and tools with which to view complex diseases
like cancer, approaching them afresh from a software-engineering viewpoint that
casts evolution in the role of programmer, cells as computing machines, DNA and
genes as instructions and computer programs, viruses as hacking devices, the
immune system as a software debugging tool, and diseases as an
information-theoretic battlefield where all these forces deploy. We show how
information theory and algorithmic programming may explain fundamental
mechanisms of life and death.
Comment: 30 pages, 8 figures. Invited chapter contribution to Information and
Causality: From Matter to Life. Sara I. Walker, Paul C.W. Davies and George
Ellis (eds.), Cambridge University Press.
Emerging Approaches to DNA Data Storage: Challenges and Prospects
With the total amount of worldwide data skyrocketing, the global data storage demand is predicted to grow to 1.75 × 10^14 GB by 2025. Traditional storage methods have difficulty keeping pace, given that current storage media have a maximum density of 10^3 GB/mm^3. As such, data production will far exceed the capacity of currently available storage methods. The costs of maintaining and transferring data, as well as the limited lifespans and significant data losses associated with current technologies, also demand advanced solutions for information storage. Nature offers a powerful alternative through the storage of the information that defines living organisms in unique orders of four bases (A, T, C, G) in molecules called deoxyribonucleic acid (DNA). DNA molecules as information carriers have many advantages over traditional storage media. Their high storage density, potentially low maintenance cost, and ease of synthesis and chemical modification make them an ideal alternative for information storage. To this end, rapid progress has been made over the past decade by exploiting user-defined DNA materials to encode information. In this review, we discuss the most recent advances in DNA-based data storage, with a major focus on the challenges that remain in this promising field, including the intrinsically low speed of data writing and reading and the high cost per byte stored. Alternatively, data storage relying on DNA nanostructures (as opposed to DNA sequences), as well as on other combinations of nanomaterials and biomolecules, is proposed, with promising technological and economic advantages. In summarizing the advances that have been made and underlining the challenges that remain, we provide a roadmap for ongoing research in this rapidly growing field, which will enable the development of technological solutions to the global demand for superior storage methodologies.
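The basic writing step, mapping bits to the four bases, can be sketched with the simplest possible code: two bits per base. Real DNA storage systems layer constraints (GC balance, homopolymer limits) and error-correcting codes on top of this; the functions below are an illustrative toy, not any system described in the review.

```python
BASES = "ACGT"   # 2 bits per base: A=00, C=01, G=10, T=11

def bytes_to_dna(data):
    # Encode each byte as four bases, most significant bits first.
    out = []
    for b in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(b >> shift) & 0b11])
    return "".join(out)

def dna_to_bytes(seq):
    # Decode each group of four bases back into one byte.
    out = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        for ch in seq[i:i + 4]:
            b = (b << 2) | BASES.index(ch)
        out.append(b)
    return bytes(out)

strand = bytes_to_dna(b"Hi")
print(strand)                        # CAGACGGC
assert dna_to_bytes(strand) == b"Hi"
```

This naive scheme would happily emit long homopolymer runs (e.g., encoding 0x00 as AAAA), which real synthesis and sequencing chemistry handles poorly; hence the constrained codes used in practice.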
Efficiently Supporting Hierarchy and Data Updates in DNA Storage
We propose a novel and flexible DNA-storage architecture that provides the
notion of hierarchy among the objects tagged with the same primer pair and
enables efficient data updates. In contrast to prior work, in our architecture
a pair of PCR primers of length 20 does not define a single object, but an
independent storage partition, which is internally managed in an independent
way with its own index structure. We make the observation that, while the
number of mutually compatible primer pairs is limited, the internal address
space available to any pair of primers (i.e., partition) is virtually
unlimited. We expose and leverage the flexibility with which this address space
can be managed to provide rich and functional storage semantics, such as
hierarchical data organization and efficient and flexible implementations of
data updates. Furthermore, to leverage the full power of the prefix-based
nature of PCR addressing, we define a methodology for transforming an arbitrary
indexing scheme into a PCR-compatible equivalent. This allows us to run PCR
with primers that can be variably extended to include a desired part of the
index, and thus narrow down the scope of the reaction to retrieve a specific
object (e.g., file or directory) within the partition with high precision. Our
wetlab evaluation demonstrates the practicality of the proposed ideas and shows
a 140x reduction in sequencing cost for the retrieval of smaller objects within
the partition.
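The prefix-based addressing idea can be modeled in a few lines: every strand in a partition starts with the partition's primer followed by an internal index, and extending the PCR primer with a prefix of that index narrows retrieval to matching strands. The sequences and names below are illustrative placeholders (real primers must satisfy biochemical constraints), simulating PCR selection as string prefix matching.

```python
def make_strand(primer, index, payload):
    # Every strand in a partition begins with the partition's
    # primer, followed by its internal index and the payload.
    return primer + index + payload

def retrieve(pool, primer, index_prefix=""):
    # Extending the primer with part of the index narrows the
    # simulated PCR reaction to strands matching that prefix.
    probe = primer + index_prefix
    return [s for s in pool if s.startswith(probe)]

primer = "ACGTACGTACGTACGTACGT"            # toy 20-nt partition primer
pool = [
    make_strand(primer, "AA", "TTTT"),     # object 1, block 1
    make_strand(primer, "AC", "GGGG"),     # object 1, block 2
    make_strand(primer, "GA", "CCCC"),     # object 2
]
print(len(retrieve(pool, primer)))         # 3: the whole partition
print(len(retrieve(pool, primer, "A")))    # 2: only object 1's blocks
```

Grouping related objects under a shared index prefix is what yields the hierarchical semantics: one variably extended primer selects a directory, a file, or a single block.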