Encoding DNA sequences by integer chaos game representation
DNA sequences are fundamental for encoding genetic information. The genetic
information may not only be understood by symbolic sequences but also from the
hidden signals inside the sequences. The symbolic sequences need to be
transformed into numerical sequences so the hidden signals can be revealed by
signal processing techniques. All current transformation methods encode DNA
sequences into numerical values of the same length. These representations have
limitations in the applications of genomic signal compression, encryption, and
steganography. We propose an integer chaos game representation (iCGR) of DNA
sequences and a lossless encoding method for DNA sequences based on the iCGR. In the iCGR
method, a DNA sequence is represented by the iterated function of the
nucleotides and their positions in the sequence. Then the DNA sequence can be
uniquely encoded and recovered using three integers from iCGR. One integer is
the sequence length and the other two integers represent the accumulated
distributions of nucleotides in the sequence. The integer encoding scheme can
compress a DNA sequence by 2 bits per nucleotide. The integer representation of
DNA sequences provides a prospective tool for sequence compression, encryption,
and steganography. The Python programs in this study are freely available to
the public at https://github.com/cyinbox/iCG
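The three-integer encoding described above can be sketched as follows. This is a minimal illustration, not the authors' released program: the corner assignment in CORNERS and the function names icgr_encode and icgr_decode are assumptions made for this example. Each nucleotide at 0-based position i moves the point by 2^i towards its corner, so the sign of the final coordinates reveals the last nucleotide and the sequence can be unwound backwards.

```python
# Assumed corner assignment for the four nucleotides (hypothetical choice).
CORNERS = {"A": (1, 1), "T": (-1, 1), "C": (-1, -1), "G": (1, -1)}
INVERSE = {corner: base for base, corner in CORNERS.items()}


def icgr_encode(seq):
    """Return (length, x, y) for a DNA string over the alphabet A, C, G, T."""
    x = y = 0
    for i, base in enumerate(seq):
        cx, cy = CORNERS[base]
        # Position i contributes with weight 2**i, so later bases dominate.
        x += cx * (1 << i)
        y += cy * (1 << i)
    return len(seq), x, y


def icgr_decode(n, x, y):
    """Recover the DNA string from the three integers produced by icgr_encode."""
    bases = []
    for i in range(n - 1, -1, -1):
        # The largest remaining weight fixes the sign of each coordinate.
        cx = 1 if x > 0 else -1
        cy = 1 if y > 0 else -1
        bases.append(INVERSE[(cx, cy)])
        x -= cx * (1 << i)
        y -= cy * (1 << i)
    return "".join(reversed(bases))


if __name__ == "__main__":
    encoded = icgr_encode("ACGT")
    print(encoded)
    print(icgr_decode(*encoded))
```

Python integers have arbitrary precision, so the two coordinates can grow with the sequence length without overflow; each needs roughly one bit per nucleotide.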
De Novo Assembly of Nucleotide Sequences in a Compressed Feature Space
Sequencing technologies allow for an in-depth analysis
of biological species, but the size of the generated datasets
introduces a number of analytical challenges. Recently, we
demonstrated the application of numerical sequence representations
and data transformations for the alignment of short
reads to a reference genome. Here, we extend our approach
to de novo assembly of short reads. Our results demonstrate
that highly compressed data can encapsulate the signal sufficiently
to accurately assemble reads into large contigs or complete
genomes.
Local Binary Patterns as a Feature Descriptor in Alignment-free Visualisation of Metagenomic Data
Shotgun sequencing has facilitated the analysis of complex microbial communities. However, clustering and visualising these communities without prior taxonomic information is a major challenge. Feature descriptor methods can be utilised to extract these taxonomic relations from the data. Here, we present a novel approach consisting of local binary patterns (LBP) coupled with randomised singular value decomposition (RSVD) and Barnes-Hut t-stochastic neighbor embedding (BH-tSNE) to highlight the underlying taxonomic structure of the metagenomic data. The effectiveness of our approach is demonstrated using several simulated metagenomic datasets and one real metagenomic dataset.
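To make the LBP step concrete, here is a rough sketch of a one-dimensional local binary pattern histogram computed over a numerically encoded sequence. The abstract does not specify the numerical mapping, the neighbourhood radius, or the comparison rule, so NUMERIC, radius, and the function names below are assumptions for illustration only; the RSVD and BH-tSNE stages are not shown.

```python
from collections import Counter

# Assumed numerical mapping of nucleotides (hypothetical choice).
NUMERIC = {"A": 0, "C": 1, "G": 2, "T": 3}


def lbp_codes(signal, radius=2):
    """Compute a 1D LBP code for every position with a full neighbourhood."""
    codes = []
    for i in range(radius, len(signal) - radius):
        center = signal[i]
        neighbours = signal[i - radius:i] + signal[i + 1:i + radius + 1]
        code = 0
        for value in neighbours:
            # Each neighbour contributes one bit: 1 if it is >= the centre.
            code = (code << 1) | (1 if value >= center else 0)
        codes.append(code)
    return codes


def lbp_histogram(signal, radius=2):
    """Normalised histogram of LBP codes, usable as a fixed-length feature vector."""
    codes = lbp_codes(signal, radius)
    counts = Counter(codes)
    total = len(codes)
    return [counts[c] / total for c in range(2 ** (2 * radius))]


if __name__ == "__main__":
    read = "ACGTTGCAACGT"
    features = lbp_histogram([NUMERIC[b] for b in read], radius=2)
    print(len(features))
```

The histogram has a fixed length of 2^(2·radius) regardless of the read length, which is what makes it convenient as input to a dimensionality-reduction step.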
A perceptual hash function to store and retrieve large scale DNA sequences
This paper proposes a novel approach for storing and retrieving massive DNA
sequences. The method is based on a perceptual hash function, commonly used to
determine the similarity between digital images, that we adapted for DNA
sequences. The perceptual hash function presented here is based on a Discrete
Cosine Transform Sign Only (DCT-SO). Each nucleotide is encoded as a fixed gray
level intensity pixel and the hash is calculated from its significant frequency
characteristics. This results in a drastic data reduction between the sequence
and the perceptual hash. Unlike cryptographic hash functions, perceptual hashes
are not affected by the "avalanche effect" and thus can be compared. The similarity
distance between two hashes is estimated with the Hamming distance, which is
used to retrieve DNA sequences. Experiments that we conducted show that our
approach is relevant for storing massive DNA sequences and retrieving them.
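A simplified sketch of the idea is given below: nucleotides become gray levels, a DCT is applied, only the signs of the low-frequency coefficients are kept as hash bits, and hashes are compared with the Hamming distance. The gray-level values in GRAY, the use of a one-dimensional DCT-II, the number of retained bits, and all function names are assumptions for illustration; the abstract does not give the authors' exact parameters.

```python
import math

# Assumed gray-level intensity per nucleotide (hypothetical values).
GRAY = {"A": 0, "C": 85, "G": 170, "T": 255}


def dct_ii(signal):
    """Unnormalised 1D DCT-II, written out directly to avoid dependencies."""
    n = len(signal)
    return [
        sum(v * math.cos(math.pi * (t + 0.5) * k / n) for t, v in enumerate(signal))
        for k in range(n)
    ]


def perceptual_hash(seq, bits=16):
    """Keep only the sign of the first `bits` non-DC DCT coefficients."""
    coeffs = dct_ii([GRAY[b] for b in seq])
    # Coefficient 0 (the DC term) is skipped: it is never negative here.
    return [1 if c >= 0 else 0 for c in coeffs[1:bits + 1]]


def hamming(h1, h2):
    """Number of positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(h1, h2))


if __name__ == "__main__":
    a = perceptual_hash("ACGTACGTTGCAACGTACGTGGCATTAC")
    b = perceptual_hash("ACGTACGTTGCAACGTACGTGGCATTAG")
    print(hamming(a, b))
```

Because only coefficient signs are stored, the hash length is fixed by `bits` and is independent of the sequence length, which is where the data reduction comes from.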
Bacterial Community Reconstruction Using A Single Sequencing Reaction
Bacteria are the unseen majority on our planet, with millions of species and
comprising most of the living protoplasm. While current methods enable in-depth
study of a small number of communities, a simple tool for breadth studies of
bacterial population composition in a large number of samples is lacking. We
propose a novel approach for reconstruction of the composition of an unknown
mixture of bacteria using a single Sanger-sequencing reaction of the mixture.
This method is based on compressive sensing theory, which deals with
reconstruction of a sparse signal using a small number of measurements.
Utilizing the fact that in many cases each bacterial community is comprised of
a small subset of the known bacterial species, we show the feasibility of this
approach for determining the composition of a bacterial mixture. Using
simulations, we show that sequencing a few hundred base-pairs of the 16S rRNA
gene sequence may provide enough information for reconstruction of mixtures
containing tens of species, out of tens of thousands, even in the presence of
realistic measurement noise. Finally, we show initial promising results when
applying our method for the reconstruction of a toy experimental mixture with
five species. Our approach may offer a practical and efficient
way of identifying bacterial species compositions in biological samples.
Comment: 28 pages, 12 figures
Identifying DNA motifs based on match and mismatch alignment information
The conventional way of identifying DNA motifs, solely based on match
alignment information, is susceptible to a high number of spurious sites. A
novel scoring system has been introduced by taking both match and mismatch
alignment information into account. The mismatch alignment information is
useful to remove spurious sites encountered in DNA motif searching. As an
example, a correct TATA box site in the Homo sapiens H4/g gene has successfully
been identified based on match and mismatch alignment information.