    Optimizing a Massive Parallel Sequencing Workflow for Quantitative miRNA Expression Analysis

    BACKGROUND: Massive Parallel Sequencing (MPS) methods can extend and improve the knowledge obtained by conventional microarray technology, both for mRNAs and for short non-coding RNAs such as miRNAs. The processing methods used to extract and interpret the information are an important aspect of dealing with the vast amounts of data generated by short read sequencing. Although the number of computational tools for MPS data analysis is constantly growing, their strengths and weaknesses as parts of a complex analytical pipeline have not yet been well investigated. PRIMARY FINDINGS: A benchmark MPS miRNA dataset, resembling a situation in which miRNAs are spiked into biological replication experiments, was assembled by merging a publicly available MPS spike-in miRNA data set with MPS data derived from healthy donor peripheral blood mononuclear cells. Using this data set, we observed that short read counts are strongly underestimated for duplicated miRNAs if the whole genome is used as the reference. Furthermore, the sensitivity of miRNA detection depends strongly on the primary tool used in the analysis. Among the six aligners tested, specifically devoted to miRNA detection, SHRiMP and MicroRazerS show the highest sensitivity. Differential expression estimation is quite efficient: among the five tools investigated, two (DESeq, baySeq) show very good specificity and sensitivity in the detection of differential expression. CONCLUSIONS: The results of our analysis allow the definition of a clear, simple and optimized analytical workflow for digital quantitative miRNA analysis.
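
    To make the duplicated-miRNA undercounting concrete, here is a minimal Python sketch (illustrative names only, not the workflow's actual code): a read aligned against the whole genome can hit every identical copy of a duplicated miRNA, and a counter that discards such multi-mapped reads underestimates exactly those miRNAs.

```python
from collections import Counter

def count_mirnas(alignments, multimap="drop"):
    """Toy read counter illustrating the duplicated-miRNA pitfall.

    alignments: list of (read_id, [miRNA hits]) -- a read aligned against
    the whole genome can hit every identical copy of a duplicated miRNA.
    Dropping such multi-mappers ("drop") underestimates those miRNAs;
    splitting the count across hits ("fraction") avoids the bias.
    """
    counts = Counter()
    for read_id, hits in alignments:
        if len(hits) == 1:
            counts[hits[0]] += 1.0
        elif multimap == "fraction":
            for h in hits:
                counts[h] += 1.0 / len(hits)
    return counts

# mir-a has two identical genomic copies; every mir-a read hits both.
alignments = [("r1", ["mir-a/1", "mir-a/2"]),
              ("r2", ["mir-a/1", "mir-a/2"]),
              ("r3", ["mir-b"])]
print(count_mirnas(alignments, "drop"))      # mir-a copies get 0 reads
print(count_mirnas(alignments, "fraction"))  # 0.5 + 0.5 per mir-a read
```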

    Circular sequence comparison: algorithms and applications

    Background: Sequence comparison is a fundamental step in many important tasks in bioinformatics, from phylogenetic reconstruction to the reconstruction of genomes. Traditional algorithms for measuring approximation in sequence comparison are based on the notions of distance or similarity, and are generally computed through sequence alignment techniques. As circular molecular structure is a common phenomenon in nature, alignment techniques have been adapted to circular sequence comparison; a caveat is that they are computationally expensive, requiring super-quadratic to cubic time in the length of the sequences. Results: In this paper, we introduce a new distance measure based on q-grams, and show how it can be applied effectively and computed efficiently for circular sequence comparison. Experimental results, using real DNA, RNA, and protein sequences as well as synthetic data, demonstrate orders-of-magnitude superiority of our approach in terms of efficiency, while maintaining accuracy very competitive with the state of the art.
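
    As a minimal illustration of the q-gram idea (a sketch, not the paper's algorithm or implementation): the q-gram profile of a circular sequence of length n can be read off a single linearization s + s[:q-1], which makes the profile invariant under rotation, so a simple L1 distance between profiles compares two circular sequences without aligning over all rotations.

```python
from collections import Counter

def circular_qgram_profile(s: str, q: int) -> Counter:
    """Count all n q-grams of a circular sequence by wrapping q-1 characters."""
    t = s + s[:q - 1]                      # linearize the circle once
    return Counter(t[i:i + q] for i in range(len(s)))

def circular_qgram_distance(a: str, b: str, q: int) -> int:
    """L1 distance between the circular q-gram profiles of a and b.

    Rotating either sequence leaves its profile unchanged, so no alignment
    over all rotations is needed -- this runs in O(|a| + |b|) for fixed q.
    """
    pa, pb = circular_qgram_profile(a, q), circular_qgram_profile(b, q)
    return sum(abs(pa[g] - pb[g]) for g in set(pa) | set(pb))

# Rotations of the same circular sequence have distance 0:
assert circular_qgram_distance("ACGTAC", "TACACG", 3) == 0
```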

    Longest Common Prefixes with k-Errors and Applications

    Although real-world text datasets, such as DNA sequences, are far from being uniformly random, average-case string searching algorithms perform significantly better than worst-case ones in most applications of interest. In this paper, we study the problem of computing the longest prefix of each suffix of a given string of length $n$ over a constant-sized alphabet that occurs elsewhere in the string with $k$ errors. This problem has already been studied under the Hamming distance model. Our first result is an improvement upon the state-of-the-art average-case time complexity for non-constant $k$, using only linear space, under the Hamming distance model. Notably, we show that our technique can be extended to the edit distance model with the same time and space complexities. Specifically, our algorithms run in $\mathcal{O}(n \log^k n \log \log n)$ time on average using $\mathcal{O}(n)$ space. We show that our technique is applicable to several algorithmic problems in computational biology and elsewhere.
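
    For intuition, here is a naive cubic-time Python baseline for the Hamming-distance variant of the problem (a correctness reference only; the paper's contribution is the far faster average-case algorithm above).

```python
def longest_prefix_with_k_errors(s: str, k: int) -> list[int]:
    """For each suffix s[i:], length of its longest prefix that occurs
    elsewhere in s (starting at some j != i) with at most k mismatches.

    Naive O(n^3) baseline under the Hamming distance model -- a correctness
    reference, not the paper's O(n log^k n log log n) average-case method.
    """
    n = len(s)
    ans = [0] * n
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            mismatches, length = 0, 0
            while i + length < n and j + length < n:
                if s[i + length] != s[j + length]:
                    mismatches += 1
                    if mismatches > k:
                        break
                length += 1
            ans[i] = max(ans[i], length)
    return ans

print(longest_prefix_with_k_errors("abaab", 1))
```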

    The Full Landscape of Robust Mean Testing: Sharp Separations between Oblivious and Adaptive Contamination

    We consider the question of Gaussian mean testing, a fundamental task in high-dimensional distribution testing and signal processing, subject to adversarial corruptions of the samples. We focus on the relative power of different adversaries, and show that, in contrast to the common wisdom in robust statistics, there exists a strict separation between adaptive adversaries (strong contamination) and oblivious ones (weak contamination) for this task. Specifically, we resolve both the information-theoretic and computational landscapes for robust mean testing. In the exponential-time setting, we establish the tight sample complexity of testing $\mathcal{N}(0,I)$ against $\mathcal{N}(\alpha v, I)$, where $\|v\|_2 = 1$, with an $\varepsilon$-fraction of adversarial corruptions, to be $\tilde{\Theta}\!\left(\max\left(\frac{\sqrt{d}}{\alpha^2}, \frac{d\varepsilon^3}{\alpha^4}, \min\left(\frac{d^{2/3}\varepsilon^{2/3}}{\alpha^{8/3}}, \frac{d\varepsilon}{\alpha^2}\right)\right)\right)$, while the complexity against adaptive adversaries is $\tilde{\Theta}\!\left(\max\left(\frac{\sqrt{d}}{\alpha^2}, \frac{d\varepsilon^2}{\alpha^4}\right)\right)$, which is strictly worse for a large range of vanishing $\varepsilon, \alpha$. To the best of our knowledge, ours is the first separation in sample complexity between the strong and weak contamination models. In the polynomial-time setting, we close a gap in the literature by providing a polynomial-time algorithm against adaptive adversaries achieving the above sample complexity $\tilde{\Theta}(\max(\sqrt{d}/\alpha^2, d\varepsilon^2/\alpha^4))$, and a low-degree lower bound (which complements an existing reduction from planted clique) suggesting that all efficient algorithms require this many samples, even in the oblivious-adversary setting. Comment: To appear in FOCS 202
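
    For intuition on the uncorrupted $\sqrt{d}/\alpha^2$ term common to both bounds, here is a toy, non-robust Python sketch of Gaussian mean testing via the squared norm of the sample mean (illustrative names and thresholds; this is not the paper's algorithm, and it fails under even mild contamination).

```python
import numpy as np

def mean_test(samples: np.ndarray, alpha: float) -> bool:
    """Toy Gaussian mean tester (no robustness): reject H0: mu = 0 when
    ||sample mean||^2 exceeds its null expectation d/m by a margin.

    Under H1 with ||mu|| = alpha, E||xbar||^2 = alpha^2 + d/m, so the two
    hypotheses separate once m >> sqrt(d)/alpha^2 -- the first term of the
    sample-complexity bounds above. A single adversarially placed sample
    can already flip this test, hence the need for robust procedures.
    """
    m, d = samples.shape
    stat = np.sum(samples.mean(axis=0) ** 2)
    return stat > d / m + alpha ** 2 / 2    # reject = "mean is nonzero"

rng = np.random.default_rng(0)
d, alpha = 1000, 0.5
m = int(8 * np.sqrt(d) / alpha ** 2)        # ~ sqrt(d)/alpha^2 samples
null = rng.standard_normal((m, d))
v = np.zeros(d); v[0] = 1.0                 # unit direction v
alt = rng.standard_normal((m, d)) + alpha * v
print(mean_test(null, alpha), mean_test(alt, alpha))   # expect False, True
```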

    Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS

    Motivation: The reliable detection of genomic variation in resequencing data is still a major challenge, especially for variants larger than a few base pairs. Sequencing reads crossing boundaries of structural variation carry the potential for their identification, but are difficult to map. Results: Here we present a method for ‘split’ read mapping, where prefix and suffix match of a read may be interrupted by a longer gap in the read-to-reference alignment. We use this method to accurately detect medium-sized insertions and long deletions with precise breakpoints in genomic resequencing data. Compared with alternative split mapping methods, SplazerS significantly improves sensitivity for detecting large indel events, especially in variant-rich regions. Our method is robust in the presence of sequencing errors as well as alignment errors due to genomic mutations/divergence, and can be used on reads of variable lengths. Our analysis shows that SplazerS is a versatile tool applicable to unanchored or single-end as well as anchored paired-end reads. In addition, application of SplazerS to targeted resequencing data led to the interesting discovery of a complete, possibly functional gene retrocopy variant. Availability: SplazerS is available from http://www.seqan.de/projects/splazers.
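
    To illustrate the split-read idea, here is a toy Python sketch (not SplazerS itself, which additionally tolerates errors in the matched parts and scores candidate splits): anchor a read's prefix and suffix exactly in the reference, and read off the intervening gap as a deletion with an exact breakpoint.

```python
def split_map(read: str, ref: str, min_part: int = 5):
    """Toy split-read mapper: anchor the read's prefix and suffix exactly
    in ref and interpret the gap between the two anchors as a deletion.

    Illustrates the split-mapping idea only: both anchors must be at least
    min_part long, and the first (longest-prefix) split found is returned.
    """
    n = len(read)
    for plen in range(n - min_part, min_part - 1, -1):   # longest prefix first
        p_pos = ref.find(read[:plen])
        if p_pos < 0:
            continue
        s_pos = ref.find(read[plen:], p_pos + plen)      # suffix right of prefix
        if s_pos >= 0:
            deletion = s_pos - (p_pos + plen)
            return {"prefix_at": p_pos, "suffix_at": s_pos, "deleted_bp": deletion}
    return None

ref  = "ACGTACGT" + "T" * 9 + "GGCCGGCC"
read = "ACGTACGT" + "GGCCGGCC"        # read spans a 9 bp deletion of the T-run
print(split_map(read, ref))           # {'prefix_at': 0, 'suffix_at': 17, 'deleted_bp': 9}
```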

    Phylogenetic comparative assembly

    Husemann P, Stoye J. Phylogenetic Comparative Assembly. Algorithms for Molecular Biology. 2010;5(1):3. BACKGROUND: Recent high-throughput sequencing technologies are capable of generating a huge amount of data for bacterial genome sequencing projects. Although current sequence assemblers successfully merge the overlapping reads, often several contigs remain which cannot be assembled any further. It is still costly and time consuming to close all the gaps in order to acquire the whole genomic sequence. RESULTS: Here we propose an algorithm that takes several related genomes and their phylogenetic relationships into account to create a graph that contains the likelihood for each pair of contigs to be adjacent. Subsequently, this graph can be used to compute a layout graph that shows the most promising contig adjacencies in order to aid biologists in finishing the complete genomic sequence. The layout graph shows unique contig orderings where possible, and the best alternatives where necessary. CONCLUSIONS: Our new algorithm for contig ordering uses sequence similarity as well as phylogenetic information to estimate adjacencies of contigs. An evaluation of our implementation shows that it performs better than recent approaches while being much faster at the same time.
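
    A toy Python sketch of the underlying idea (hypothetical names and weights; not the authors' exact likelihood model): each related genome votes for the contig pairs it places adjacently, votes are weighted by phylogenetic closeness, and the highest-scoring edges form the layout graph.

```python
from collections import defaultdict

def adjacency_graph(evidence, phylo_weight):
    """Toy phylogeny-weighted contig ordering.

    evidence: {genome: [(contigA, contigB), ...]} adjacencies suggested by
              mapping contigs onto that related genome.
    phylo_weight: {genome: w} with larger w for closer relatives.
    Returns a weighted adjacency graph over unordered contig pairs.
    """
    graph = defaultdict(float)
    for genome, pairs in evidence.items():
        for a, b in pairs:
            graph[frozenset((a, b))] += phylo_weight[genome]
    return graph

evidence = {
    "close_relative":   [("c1", "c2"), ("c2", "c3")],
    "distant_relative": [("c1", "c3")],
}
weights = {"close_relative": 0.9, "distant_relative": 0.2}
g = adjacency_graph(evidence, weights)
for pair, score in sorted(g.items(), key=lambda kv: -kv[1]):
    print(sorted(pair), score)   # highest-scoring adjacencies first
```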

    Parallel Natural Language Parsing: From Analysis to Speedup

    Electrical Engineering, Mathematics and Computer Science

    Effective Instance Matching for Heterogeneous Structured Data

    A main obstacle to the effective usage of structured data is instance matching, where the goal is to find instance representations that refer to the same real-world entity. In this book we investigate how to effectively match heterogeneous structured data. We evaluate our approaches against the latest baselines, and the results show advances beyond the state of the art.
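
    As a minimal sketch of one common instance-matching baseline (illustrative only; the book's methods are more sophisticated): compare instances by the Jaccard similarity of their value tokens, ignoring attribute names so that heterogeneous schemas can still be matched.

```python
def tokens(record: dict) -> set[str]:
    """Bag of lower-cased value tokens, ignoring which attribute they came
    from -- a deliberately schema-agnostic view for heterogeneous data."""
    return {tok for value in record.values() for tok in str(value).lower().split()}

def match(r1: dict, r2: dict, threshold: float = 0.5) -> bool:
    """Declare two instance representations as the same real-world thing
    when the Jaccard similarity of their value tokens passes a threshold."""
    t1, t2 = tokens(r1), tokens(r2)
    return len(t1 & t2) / len(t1 | t2) >= threshold

a = {"name": "J. S. Bach", "born": "1685 Eisenach"}
b = {"label": "Johann Sebastian Bach", "birthplace": "Eisenach"}
print(match(a, b, threshold=0.25))   # True: 2 shared tokens out of 7
```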