16 research outputs found

    Local alignment of two-base encoded DNA sequence

    Background: DNA sequence comparison is based on optimal local alignment of two sequences using a similarity score. However, some new DNA sequencing technologies do not directly measure the base sequence, but rather an encoded form, such as the two-base encoding considered here. In order to compare such data to a reference sequence, the data must be decoded into sequence. The decoding is deterministic, but the possibility of measurement errors requires searching among all possible error modes and resulting alignments to achieve an optimal balance of fewer errors versus greater sequence similarity. Results: We present an extension of the standard dynamic programming method for local alignment, which simultaneously decodes the data and performs the alignment, maximizing a similarity score based on a weighted combination of errors and edits, and allowing an affine gap penalty. We also present simulations that demonstrate the performance characteristics of our two-base encoded alignment method and contrast those with standard DNA sequence alignment under the same conditions. Conclusion: The new local alignment algorithm for two-base encoded data has substantial power to properly detect and correct measurement errors while identifying underlying sequence variants, facilitating genome re-sequencing efforts based on this form of sequence data.

    ParMap, an Algorithm for the Identification of Complex Genomic Variations in Nextgen Sequencing Data

    Next-generation sequencing produces high-throughput data, albeit with greater error and shorter reads than traditional Sanger sequencing methods. This complicates the detection of genomic variations, especially small insertions and deletions. Here we describe ParMap, a statistical algorithm for the identification of complex genetic variants using partially mapped reads in nextgen sequencing data. We also report ParMap’s successful application to the mutation analysis of chromosome X exome-captured leukemia DNA samples.

    Detection of microRNAs in color space

    Motivation: Deep sequencing provides inexpensive opportunities to characterize the transcriptional diversity of known genomes. The AB SOLiD technology generates millions of short sequencing reads in color space; that is, the raw data is a sequence of colors, where each color represents 2 nt and each nucleotide is represented by two consecutive colors. This strategy is purported to have several advantages, including increased ability to distinguish sequencing errors from polymorphisms. Several programs have been developed to map short reads to genomes in color space. However, a number of previously unexplored technical issues arise when using SOLiD technology to characterize microRNAs. Results: Here we explore these technical difficulties. First, since the sequenced reads are longer than the biological sequences, every read is expected to contain linker fragments. The color-calling error rate increases toward the 3′ end of the read, such that recognizing the linker sequence for removal becomes problematic. Second, mapping in color space may lead to the loss of the first nucleotide of each read. We propose a sequential trimming and mapping approach to map small RNAs. Using our strategy, we reanalyze three published insect small RNA deep sequencing datasets and characterize 22 new microRNAs. Availability and implementation: A bash shell script to perform the sequential trimming and mapping procedure, called SeqTrimMap, is available at: http://www.mirbase.org/tools/seqtrimmap/. Contact: [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.
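
The color-space representation described above can be illustrated with a minimal sketch. It assumes the commonly documented SOLiD convention in which bases are numbered A=0, C=1, G=2, T=3 and each color is the XOR of two adjacent base codes; the function names `encode` and `decode` are hypothetical and are not taken from any of the tools listed here.

```python
# Sketch of two-base (color-space) encoding, assuming A=0, C=1, G=2, T=3
# and color = XOR of the codes of two adjacent bases.
BASES = "ACGT"
CODE = {base: index for index, base in enumerate(BASES)}


def encode(seq):
    """Return the list of colors (0-3), one per pair of adjacent bases."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


def decode(first_base, colors):
    """Rebuild the base sequence from a known first base and its colors."""
    seq = [first_base]
    for color in colors:
        # XOR is its own inverse, so the next base follows from the previous one.
        seq.append(BASES[CODE[seq[-1]] ^ color])
    return "".join(seq)


if __name__ == "__main__":
    reference = "ACGT"
    print(encode(reference))               # colors of the reference
    print(encode("ATGT"))                  # a SNP at position 2 changes two adjacent colors
    print(decode("A", encode(reference)))  # round trip back to bases
    print(decode("A", [1, 2, 1]))          # one miscalled color corrupts every later base
```

Under this convention a true single-base variant alters two consecutive colors, whereas a single miscalled color changes every decoded base downstream of it, which is why naive decoding before alignment is fragile.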

    ParMap, an algorithm for the identification of small genomic insertions and deletions in nextgen sequencing data

    Background: Next-generation sequencing produces high-throughput data, albeit with greater error and shorter reads than traditional Sanger sequencing methods. This complicates the detection of genomic variations, especially small insertions and deletions. Findings: Here we describe ParMap, a statistical algorithm for the identification of complex genetic variants, such as small insertions and deletions, using partially mapped reads in nextgen sequencing data. Conclusions: We report ParMap's successful application to the mutation analysis of chromosome X exome-captured leukemia DNA samples.

    Whole Methylome Analysis by Ultra-Deep Sequencing Using Two-Base Encoding

    Methylation, the addition of methyl groups to cytosine (C), plays an important role in the regulation of gene expression in both normal and dysfunctional cells. During bisulfite conversion and subsequent PCR amplification, unmethylated Cs are converted into thymine (T), while methylated Cs remain unconverted. Sequencing of this bisulfite-treated DNA permits the detection of methylation at specific sites. With the introduction of next-generation sequencing (NGS) technologies, simultaneous analysis of methylation motifs in multiple regions provides the opportunity for hypothesis-free study of the entire methylome. Here we present a whole methylome sequencing study that compares two different bisulfite conversion methods (in solution versus in gel), utilizing the high throughput of the SOLiD™ System. Advantages and disadvantages of the two different bisulfite conversion methods for constructing sequencing libraries are discussed. Furthermore, the application of SOLiD™ bisulfite sequencing to larger and more complex genomes is shown with preliminary bisulfite-converted reads created in silico.
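
The conversion rule stated above (unmethylated C read as T, methylated C unchanged) can be sketched in a few lines. This is an illustrative simplification that handles only the forward strand; the function names and the 0-based position set are assumptions and do not describe the study's actual pipeline.

```python
# Sketch of in silico bisulfite conversion on the forward strand only.
# Positions are 0-based; names are hypothetical.

def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Convert every unmethylated C to T; methylated Cs are left as C."""
    return "".join(
        "T" if base == "C" and index not in methylated_positions else base
        for index, base in enumerate(seq)
    )


def call_methylation(reference, read):
    """Return 0-based reference C positions where the aligned read still shows C."""
    return [
        index
        for index, (ref_base, read_base) in enumerate(zip(reference, read))
        if ref_base == "C" and read_base == "C"
    ]


if __name__ == "__main__":
    reference = "ACGCCT"
    read = bisulfite_convert(reference, methylated_positions={3})
    print(read)                               # converted read
    print(call_methylation(reference, read))  # positions inferred as methylated
```

The second function assumes the read is already aligned base-for-base to the reference, which a real analysis would have to establish first.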

    Transcriptomics of an extended phenotype: Parasite manipulation of wasp social behaviour shifts expression of caste-related genes

    Parasites can manipulate host behaviour to increase their own transmission and fitness, but the genomic mechanisms by which parasites manipulate hosts are not well understood. We investigated the relationship between the social paper wasp, Polistes dominula, and its parasite, Xenos vesparum (Insecta: Strepsiptera) to understand the effects of an obligate endoparasitoid on its host’s brain transcriptome. Previous research suggests that X. vesparum shifts aspects of host social caste-related behaviour and physiology in ways that benefit the parasitoid. We hypothesized that X. vesparum-infested (stylopized) females would show a shift in caste-related brain gene expression. Specifically, we predicted stylopized females, who would normally be workers, would show gene expression patterns resembling pre-overwintering queens (gynes), reflecting gyne-like changes in behaviour. We used RNA-sequencing data to characterize patterns of brain gene expression in stylopized females, and compared these to those of unstylopized workers and gynes. In support of our hypothesis, we found that stylopized females, despite sharing numerous physiological and life history characteristics with members of the worker caste, show gyne-shifted brain expression patterns. These data suggest the parasitoid affects its host by exploiting phenotypic plasticity related to social caste, thus shifting naturally occurring social behaviour in a way that is beneficial to the parasitoid

    Improved variant discovery through local re-alignment of short-read next-generation sequencing data using SRMA

    A primary component of next-generation sequencing analysis is to align short reads to a reference genome, with each read aligned independently. However, reads that observe the same non-reference DNA sequence are highly correlated and can be used to better model the true variation in the target genome. A novel short-read micro re-aligner, SRMA, that leverages this correlation to better resolve a consensus of the underlying DNA sequence of the targeted genome is described here

    Local alignment of generalized k-base encoded DNA sequence

    Background: DNA sequence comparison is a well-studied problem, in which two DNA sequences are compared using a weighted edit distance. Recent DNA sequencing technologies, however, observe an encoded form of the sequence, rather than each DNA base individually. The encoded DNA sequence may contain technical errors, and therefore encoded sequencing errors must be incorporated when comparing an encoded DNA sequence to a reference DNA sequence. Results: Although two-base encoding is currently used in practice, many other encoding schemes are possible, whereby two or more bases are encoded at a time. A generalized k-base encoding scheme is presented, whereby feasible higher order encodings are better able to differentiate errors in the encoded sequence from true DNA sequence variants. A generalized version of the previous two-base encoding DNA sequence comparison algorithm is used to compare a k-base encoded sequence to a DNA reference sequence. Finally, simulations are performed to evaluate the power, the false positive and false negative SNP discovery rates, and the performance time of k-base encoding compared to previous methods as well as to the standard DNA sequence comparison algorithm. Conclusions: The novel generalized k-base encoding scheme and resulting local alignment algorithm permit the development of higher fidelity ligation-based next generation sequencing technology. This bioinformatic solution affords greater robustness to errors, as well as lower false SNP discovery rates, at the cost only of computational time.

    BFAST: An Alignment Tool for Large Scale Genome Resequencing

    BACKGROUND: The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, results in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25-100 base range, in the presence of errors and true biological variation. METHODOLOGY: We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of the current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels. CONCLUSIONS: We compare BFAST to a selection of large-scale alignment tools (BLAT, MAQ, SHRiMP, and SOAP) in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at http://bfast.sourceforge.net.
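
The gapped Smith-Waterman step mentioned above can be illustrated with a textbook sketch. It is not BFAST's implementation: it uses a simple linear gap penalty, returns only the best local score, and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are assumptions chosen for the example.

```python
# Textbook Smith-Waterman local alignment score with a linear gap penalty.
# Scoring parameters are illustrative assumptions, not BFAST defaults.

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap    # gap in b (a[i-1] aligned to nothing)
            left = H[i][j - 1] + gap  # gap in a (b[j-1] aligned to nothing)
            # Flooring at 0 lets an alignment restart anywhere, making it local.
            H[i][j] = max(0, diagonal, up, left)
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman("ACGT", "ACGT"))         # identical sequences
    print(smith_waterman("ACGTACGT", "ACGACGT"))  # read with a one-base deletion
    print(smith_waterman("AAAA", "TTTT"))         # nothing in common
```

Allowing the `up` and `left` moves is what lets a read with a small insertion or deletion still align across the indel instead of being truncated at it.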