Peptide vocabulary analysis reveals ultra-conservation and homonymity in protein sequences
A new algorithm is presented for vocabulary analysis (word detection) in texts of human origin. It performs at 60%–70% overall accuracy, greater than 80% accuracy for longer words, and approximately 85% sensitivity on Alice in Wonderland, a considerable improvement on previous methods. When applied to protein sequences, it detects short sequences analogous to words in human texts, i.e. intolerant to changes in spelling (mutation) and relatively context-independent in their meaning (function). Some of these are homonyms of up to 7 amino acids, which can assume different structures in different proteins. Others are ultra-conserved stretches of up to 18 amino acids within proteins of less than 40% overall identity, reflecting extreme constraint or convergent evolution. Different species are found to have qualitatively different major peptide vocabularies, e.g. some are dominated by large gene families, while others are rich in simple repeats or dominated by internally repetitive proteins. This suggests the possibility of a peptide vocabulary signature, analogous to genome signatures in DNA. Homonyms may be useful in detecting convergent evolution and positive selection in protein evolution. Ultra-conserved words may be useful in identifying structures intolerant to substitution over long periods of evolutionary time.
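The notion of a peptide "word" shared across otherwise dissimilar proteins can be illustrated with a naive k-mer scan. This is only a sketch of the underlying idea, not the statistical word-detection algorithm the abstract describes; the protein names and sequences below are invented for illustration.

```python
from collections import defaultdict

def shared_kmers(proteins, k=7):
    """Find length-k peptide 'words' that occur in more than one protein.

    A naive illustration of detecting shared subsequences; the paper's
    vocabulary-analysis algorithm is statistical, not a plain k-mer scan.
    """
    occurrences = defaultdict(set)
    for name, seq in proteins.items():
        for i in range(len(seq) - k + 1):
            occurrences[seq[i:i + k]].add(name)
    # Keep only k-mers seen in at least two distinct proteins.
    return {kmer: names for kmer, names in occurrences.items() if len(names) >= 2}

# Toy (hypothetical) sequences sharing the 9-residue stretch KQRQISFVK.
proteins = {
    "protA": "MKTAYIAKQRQISFVKSHFSRQ",
    "protB": "GGKQRQISFVKWWTERLLNDAA",
}
shared = shared_kmers(proteins, k=7)
print(sorted(shared))
```

A real analysis would additionally test whether each shared word is more conserved than expected by chance, since homonyms and ultra-conserved stretches are defined against a background model.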
A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data
Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified
single-cell genomes, and metagenomes has enabled investigation of a wide range
of organisms and ecosystems. However, sampling variation in short-read data
sets and high sequencing error rates of modern sequencers present many new
computational challenges in data interpretation. These challenges have led to
the development of new classes of mapping tools and {\em de novo} assemblers.
These algorithms are challenged by the continued improvement in sequencing
throughput. We here describe digital normalization, a single-pass computational
algorithm that systematizes coverage in shotgun sequencing data sets, thereby
decreasing sampling variation, discarding redundant data, and removing the
majority of errors. Digital normalization substantially reduces the size of
shotgun data sets and decreases the memory and time requirements for {\em de
novo} sequence assembly, all without significantly impacting content of the
generated contigs. We apply digital normalization to the assembly of microbial
genomic data, amplified single-cell genomic data, and transcriptomic data. Our
implementation is freely available for use and modification.
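The single-pass control flow of digital normalization can be sketched as follows. This simplified version uses an exact Counter where the published implementation uses a memory-efficient probabilistic counting structure, and the k and coverage-cutoff values here are purely illustrative.

```python
from collections import Counter

def digital_normalization(reads, k=4, cutoff=3):
    """Single-pass digital normalization (simplified sketch).

    Keep a read only if its median k-mer count, as observed so far, is
    below `cutoff`; otherwise the read is redundant and is discarded.
    Only kept reads contribute to the running k-mer counts, so coverage
    saturates at roughly the cutoff.
    """
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:                      # read shorter than k
            continue
        median = sorted(counts[km] for km in kmers)[len(kmers) // 2]
        if median < cutoff:
            kept.append(read)
            counts.update(kmers)           # only kept reads add coverage
    return kept

reads = ["ACGTACGT"] * 10                  # ten identical reads: oversampled
print(len(digital_normalization(reads, k=4, cutoff=3)))  # prints 3
```

Because a read is judged against counts accumulated so far, the pass is streaming: reads never need to be revisited, which is what makes the memory and time savings for downstream assembly possible.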
SOAP3-dp: Fast, Accurate and Sensitive GPU-based Short Read Aligner
To tackle the exponentially increasing throughput of Next-Generation
Sequencing (NGS), most of the existing short-read aligners can be configured to
favor speed at the expense of accuracy and sensitivity. SOAP3-dp, through leveraging
the computational power of both CPU and GPU with optimized algorithms, delivers
high speed and sensitivity simultaneously. Compared with widely adopted
aligners including BWA, Bowtie2, SeqAlto, GEM and GPU-based aligners including
BarraCUDA and CUSHAW, SOAP3-dp is two to tens of times faster, while
maintaining the highest sensitivity and lowest false discovery rate (FDR) on
Illumina reads with different lengths. Transcending its predecessor SOAP3,
which does not allow gapped alignment, SOAP3-dp by default tolerates alignment
similarity as low as 60 percent. Real data evaluation using human genome
demonstrates SOAP3-dp's power to enable more authentic variants and longer
Indels to be discovered. Fosmid sequencing shows a 9.1 percent FDR on newly
discovered deletions. SOAP3-dp natively supports BAM file format and provides a
scoring scheme identical to BWA's, which enables it to be integrated into existing
analysis pipelines. SOAP3-dp has been deployed on Amazon-EC2, NIH-Biowulf and
Tianhe-1A.
Comment: 21 pages, 6 figures, submitted to PLoS ONE, additional files
available at "https://www.dropbox.com/sh/bhclhxpoiubh371/O5CO_CkXQE".
Comments most welcome.
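The gapped alignment that distinguishes SOAP3-dp from its predecessor is, at bottom, dynamic programming over a scoring matrix (the "dp" in the name). The textbook Smith-Waterman recurrence below is a minimal single-threaded sketch of that idea only; SOAP3-dp's actual engine, scoring parameters, and GPU parallelization are far more elaborate, and the score values here are invented for illustration.

```python
def smith_waterman(a, b, match=1, mismatch=-2, gap=-1):
    """Local gapped alignment score via Smith-Waterman dynamic programming.

    H[i][j] holds the best score of any local alignment ending at a[i-1],
    b[j-1]; the max(0, ...) resets poor alignments, making the score local.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# A one-base deletion in the second sequence still aligns well with a gap.
print(smith_waterman("ACGTTGCA", "ACGTGCA"))  # prints 6: 7 matches, 1 gap
```

Tolerating gaps in the recurrence is what lets an aligner of this family recover indels and alignments of low overall similarity that an ungapped aligner would miss.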
Two-Stage Code Acquisition Employing Search Space Reduction and Iterative Detection in the DS-UWB Downlink
In this paper we propose and investigate an iterative code acquisition scheme assisted by both search space reduction and iterative Message Passing (MP), which was designed for the Direct Sequence-Ultra WideBand (DS-UWB) DownLink (DL). The performance of this iterative code acquisition scheme is analysed in terms of both the correct detection probability and the achievable Mean Acquisition Time (MAT). We propose an improved criterion for designing the iterative MP based two-stage acquisition regime. Our proposed scheme is capable of reducing the MAT by several orders of magnitude compared to the benchmark scenarios, when considering the employment of long Pseudo-Noise (PN) codes suitable for a variety of applications.
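For context, the baseline that such two-stage schemes improve on is the serial search: slide the local PN replica over the received chips and test each code-phase hypothesis in turn. The sketch below uses an invented length-7 m-sequence, a noiseless channel, and an arbitrary threshold; the paper's contribution (search-space reduction plus iterative message passing over long PN codes) replaces exactly this exhaustive sweep, whose mean acquisition time grows with the code length.

```python
def serial_search_acquisition(received, pn, threshold=0.8):
    """Serial-search PN code acquisition via sliding cyclic correlation.

    Correlates the received chip stream against every cyclic shift of the
    local PN replica and declares acquisition at the first offset whose
    normalized correlation exceeds `threshold`. Returns the accepted
    code-phase offset, or None if no hypothesis passes.
    """
    n = len(pn)
    for offset in range(n):
        corr = sum(received[(offset + i) % n] * pn[i] for i in range(n)) / n
        if corr >= threshold:
            return offset
    return None

pn = [1, 1, 1, -1, -1, 1, -1]        # length-7 m-sequence, +/-1 chips
received = pn[-3:] + pn[:-3]         # received signal delayed by 3 chips
print(serial_search_acquisition(received, pn))  # prints 3
```

The m-sequence's sharp cyclic autocorrelation (peak 1, off-peak -1/7) is what makes the threshold test reliable; with noise, each hypothesis would instead be tested over multiple dwell periods.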
Linking de novo assembly results with long DNA reads by dnaasm-link application
Currently, third-generation sequencing techniques, which allow much longer
DNA reads to be obtained than next-generation sequencing technologies, are
becoming increasingly popular. There are many possibilities to combine data
from next-generation and third-generation sequencing.
Herein, we present a new application called dnaasm-link for linking contigs,
a result of \textit{de novo} assembly of second-generation sequencing data,
with long DNA reads. Our tool includes an integrated module to fill gaps with a
suitable fragment of an appropriate long DNA read, which improves the consistency
of the resulting DNA sequences. This feature is very important, in particular
for complex DNA regions, as presented in the paper. Finally, our implementation
outperforms other state-of-the-art tools in terms of speed and memory
requirements, which may enable the use of the presented application for
organisms with large genomes, which is not possible with existing applications.
The presented application has many advantages, such as (i) significant memory
optimization and reduction of computation time, (ii) filling gaps with an
appropriate fragment of a specified long DNA read, and (iii) reducing the number of
spanned and unspanned gaps in the existing genome drafts.
The application is freely available to all users under GNU Library or Lesser
General Public License version 3.0 (LGPLv3). The demo application, docker image
and source code are available at http://dnaasm.sourceforge.net.
Comment: 16 pages, 5 figures.
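The gap-filling idea can be illustrated with a deliberately simplified, exact-match sketch: anchor the ends of the two contigs inside a long read that spans both, then splice in the intervening read fragment. All sequences and the `anchor` length below are invented; dnaasm-link itself must cope with the high error rate of real long reads, which plain exact matching cannot.

```python
def fill_gap(left_contig, right_contig, long_read, anchor=10):
    """Fill the gap between two contigs using a spanning long read.

    Locates the last `anchor` bases of the left contig and the first
    `anchor` bases of the right contig inside the long read, and returns
    the merged sequence with the read fragment between the two anchors
    filling the gap. Returns None if the read does not span both contigs.
    """
    left_anchor = left_contig[-anchor:]
    right_anchor = right_contig[:anchor]
    i = long_read.find(left_anchor)
    j = long_read.find(right_anchor, i + anchor) if i >= 0 else -1
    if i < 0 or j < 0:
        return None
    gap = long_read[i + anchor:j]          # read bases between the anchors
    return left_contig + gap + right_contig

left = "AAAACCCCGGGG"
right = "TTTTAAAACCCC"
long_read = "TT" + left[-10:] + "ACGT" + right[:10] + "GG"  # spans both
print(fill_gap(left, right, long_read))
```

Turning an unspanned gap into concrete sequence this way is what the abstract means by improving the consistency of the resulting drafts, since scaffolders without gap filling would only report the estimated gap length.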