De Novo Assembly of Nucleotide Sequences in a Compressed Feature Space
Sequencing technologies allow for an in-depth analysis
of biological species, but the size of the generated datasets
introduces a number of analytical challenges. Recently, we
demonstrated the application of numerical sequence representations
and data transformations for the alignment of short
reads to a reference genome. Here, we extend our approach
to de novo assembly of short reads. Our results demonstrate
that highly compressed data can encapsulate the signal
sufficiently to accurately assemble reads into large contigs or
complete genomes.
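The idea of working in a compressed feature space can be illustrated with a minimal sketch. The abstract does not name the numerical representation or the transformation, so everything here is an assumption: the integer mapping `ENCODING` is hypothetical, and piecewise aggregate approximation (PAA) is used only as a stand-in compression step.

```python
import math

# Hypothetical integer encoding of nucleotides (an assumption, not from the abstract).
ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode_read(read):
    """Turn a DNA read into a list of numbers."""
    return [ENCODING[base] for base in read.upper()]


def paa(signal, n_segments):
    """Piecewise aggregate approximation: average the signal over n_segments chunks."""
    n = len(signal)
    if not 0 < n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(signal)")
    compressed = []
    for s in range(n_segments):
        chunk = signal[s * n // n_segments:(s + 1) * n // n_segments]
        compressed.append(sum(chunk) / len(chunk))
    return compressed


def compress(read, n_segments):
    """Encode a read numerically, then compress it with PAA."""
    return paa(encode_read(read), n_segments)


def distance(features_a, features_b):
    """Euclidean distance between two compressed reads of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(features_a, features_b)))
```

For example, `compress("AACC", 2)` averages the encoded values `[1, 1, 2, 2]` in two chunks, so reads can be compared on two numbers instead of four. An assembler built on this idea would compare candidate overlaps in the compressed space; that step is not shown here.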
Local Binary Patterns as a Feature Descriptor in Alignment-free Visualisation of Metagenomic Data
Shotgun sequencing has facilitated the analysis of complex microbial communities. However, clustering and visualising these communities without prior taxonomic information is a major challenge. Feature descriptor methods can be utilised to extract these taxonomic relations from the data. Here, we present a novel approach consisting of local binary patterns (LBP) coupled with randomised singular value decomposition (RSVD) and Barnes-Hut t-distributed stochastic neighbor embedding (BH-tSNE) to highlight the underlying taxonomic structure of the metagenomic data. The effectiveness of our approach is demonstrated using several simulated datasets and a real metagenomic dataset.
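A one-dimensional local binary pattern descriptor can be sketched as follows. This is only an illustration of the LBP step under stated assumptions: the integer mapping `ENCODING` and the `radius` parameter are hypothetical choices, and the RSVD and BH-tSNE stages mentioned in the abstract are not shown.

```python
# Hypothetical integer encoding of nucleotides (an assumption, not from the abstract).
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def lbp_codes(signal, radius=1):
    """Compute a 1D LBP code for every position that has `radius` neighbours on each side.

    Each neighbour contributes one bit: 1 if it is >= the centre value, else 0.
    Bits are read left to right, most significant first.
    """
    codes = []
    for i in range(radius, len(signal) - radius):
        neighbours = signal[i - radius:i] + signal[i + 1:i + radius + 1]
        code = 0
        for value in neighbours:
            code = (code << 1) | (1 if value >= signal[i] else 0)
        codes.append(code)
    return codes


def lbp_histogram(sequence, radius=1):
    """Normalised histogram of LBP codes, usable as a fixed-length feature vector."""
    signal = [ENCODING[base] for base in sequence.upper()]
    codes = lbp_codes(signal, radius)
    bins = [0] * (1 << (2 * radius))
    for code in codes:
        bins[code] += 1
    total = len(codes)
    return [count / total for count in bins] if total else [0.0] * len(bins)
```

Every sequence, whatever its length, maps to a histogram with `4 ** radius` bins, which is what makes the descriptor alignment-free: the resulting vectors could then be passed to a dimensionality-reduction step.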
A perceptual hash function to store and retrieve large scale DNA sequences
This paper proposes a novel approach for storing and retrieving massive DNA
sequences. The method is based on a perceptual hash function, commonly used to
determine the similarity between digital images, that we adapted for DNA
sequences. The perceptual hash function presented here is based on a Discrete
Cosine Transform Sign Only (DCT-SO). Each nucleotide is encoded as a fixed gray
level intensity pixel and the hash is calculated from its significant frequency
characteristics. This results in a drastic data reduction between the sequence
and the perceptual hash. Unlike cryptographic hash functions, perceptual hashes
are not affected by the "avalanche effect" and thus can be compared. The similarity
distance between two hashes is estimated with the Hamming distance, which is
used to retrieve DNA sequences. Experiments that we conducted show that our
approach is relevant for storing massive DNA sequences and retrieving them.
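The pipeline described above (gray-level encoding, DCT, sign-only hash, Hamming comparison) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the gray levels in `GRAY`, the number of hash bits, and the choice to skip the DC coefficient are all assumptions made for the example.

```python
import math

# Hypothetical gray-level intensities per nucleotide (an assumption).
GRAY = {"A": 0, "C": 85, "G": 170, "T": 255}


def dct_ii(signal):
    """Naive O(n^2) type-II discrete cosine transform."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]


def perceptual_hash(sequence, bits=16):
    """Keep only the signs of the first `bits` low-frequency DCT coefficients.

    The DC term (k = 0) is skipped because it is never negative for
    non-negative pixel values and so carries no sign information.
    """
    if len(sequence) <= bits:
        raise ValueError("sequence must be longer than the number of hash bits")
    pixels = [GRAY[base] for base in sequence.upper()]
    coefficients = dct_ii(pixels)
    return [1 if c >= 0 else 0 for c in coefficients[1:bits + 1]]


def hamming(hash_a, hash_b):
    """Number of positions at which two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))
```

Retrieval would then amount to hashing a query sequence and returning the stored sequences whose hashes lie within a chosen Hamming distance. A production version would use a fast DCT rather than the quadratic loop shown here.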
Bacterial Community Reconstruction Using A Single Sequencing Reaction
Bacteria are the unseen majority on our planet, with millions of species and
comprising most of the living protoplasm. While current methods enable in-depth
study of a small number of communities, a simple tool for breadth studies of
bacterial population composition in a large number of samples is lacking. We
propose a novel approach for reconstruction of the composition of an unknown
mixture of bacteria using a single Sanger-sequencing reaction of the mixture.
This method is based on compressive sensing theory, which deals with
reconstruction of a sparse signal using a small number of measurements.
Utilizing the fact that in many cases each bacterial community is comprised of
a small subset of the known bacterial species, we show the feasibility of this
approach for determining the composition of a bacterial mixture. Using
simulations, we show that sequencing a few hundred base-pairs of the 16S rRNA
gene sequence may provide enough information for reconstruction of mixtures
containing tens of species, out of tens of thousands, even in the presence of
realistic measurement noise. Finally, we show initial promising results when
applying our method to the reconstruction of a toy experimental mixture with
five species. Our approach may offer a practical and efficient way to identify
bacterial species compositions in biological samples.
Comment: 28 pages, 12 figures
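The sparse-recovery idea can be illustrated with a toy sketch. It assumes an idealised, noise-free measurement in which the mixture signal is a weighted sum of one-hot encoded sequences, and it uses orthogonal matching pursuit (OMP) as a simple stand-in for a sparse solver; the abstract does not say which solver or measurement model the authors use. The function names and the tiny database are hypothetical.

```python
def one_hot(seq):
    """Flatten a DNA string into a 4 * len(seq) indicator vector (A, C, G, T per position)."""
    vec = []
    for base in seq:
        vec.extend(1.0 if base == b else 0.0 for b in "ACGT")
    return vec


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def omp(columns, y, max_species, tol=1e-9):
    """Greedy sparse recovery: returns {column index: estimated abundance}."""
    support, coeffs = [], []
    residual = y[:]
    while len(support) < max_species and dot(residual, residual) > tol:
        # Pick the unused species whose profile correlates best with the residual.
        scores = [0.0 if j in support else abs(dot(col, residual))
                  for j, col in enumerate(columns)]
        support.append(max(range(len(columns)), key=scores.__getitem__))
        # Least-squares fit on the selected species via the normal equations.
        gram = [[dot(columns[i], columns[j]) for j in support] for i in support]
        coeffs = solve(gram, [dot(columns[i], y) for i in support])
        residual = [yi - sum(c * columns[j][k] for c, j in zip(coeffs, support))
                    for k, yi in enumerate(y)]
    return dict(zip(support, coeffs))


# Toy database of five hypothetical sequences; the mixture contains only the first two.
DATABASE = ["ACGTACGT", "TTGGCCAA", "GGGGAAAA", "CATGCATG", "ACGTTGCA"]
COLUMNS = [one_hot(s) for s in DATABASE]
MIXTURE = [0.7 * a + 0.3 * b for a, b in zip(COLUMNS[0], COLUMNS[1])]
ESTIMATE = omp(COLUMNS, MIXTURE, max_species=3)
```

In this toy case `ESTIMATE` contains only the two species actually present, with abundances close to 0.7 and 0.3. Real chromatogram data would add noise and a far larger database, which is where the sparsity assumption does the real work.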
Identifying DNA motifs based on match and mismatch alignment information
The conventional way of identifying DNA motifs, solely based on match
alignment information, is susceptible to a high number of spurious sites. A
novel scoring system is introduced that takes both match and mismatch
alignment information into account. The mismatch alignment information is
useful for removing spurious sites encountered in DNA motif searching. As an
example, a correct TATA box site in the Homo sapiens H4/g gene has successfully
been identified based on match and mismatch alignment information.