6,157 research outputs found

    Highly Scalable Algorithms for Robust String Barcoding

    Full text link
    String barcoding is a recently introduced technique for genomic-based identification of microorganisms. In this paper we describe the engineering of highly scalable algorithms for robust string barcoding. Our methods enable distinguisher selection based on whole genomic sequences of hundreds of microorganisms of up to bacterial size on a well-equipped workstation, and can be easily parallelized to further extend the applicability range to thousands of bacterial size genomes. Experimental results on both randomly generated and NCBI genomic data show that whole-genome based selection results in a number of distinguishers nearly matching the information theoretic lower bounds for the problem

    CAD Tools for DNA Micro-Array Design, Manufacture and Application

    Get PDF
    Motivation: As the human genome project progresses and some microbial and eukaryotic genomes are recognized, numerous biotechnological processes have attracted increasing number of biologists, bioengineers and computer scientists recently. Biotechnological processes profoundly involve production and analysis of highthroughput experimental data. Numerous sequence libraries of DNA and protein structures of a large number of micro-organisms and a variety of other databases related to biology and chemistry are available. For example, microarray technology, a novel biotechnology, promises to monitor the whole genome at once, so that researchers can study the whole genome on the global level and have a better picture of the expressions among millions of genes simultaneously. Today, it is widely used in many fields- disease diagnosis, gene classification, gene regulatory network, and drug discovery. For example, designing organism specific microarray and analysis of experimental data require combining heterogeneous computational tools that usually differ in the data format; such as, GeneMark for ORF extraction, Promide for DNA probe selection, Chip for probe placement on microarray chip, BLAST to compare sequences, MEGA for phylogenetic analysis, and ClustalX for multiple alignments. Solution: Surprisingly enough, despite huge research efforts invested in DNA array applications, very few works are devoted to computer-aided optimization of DNA array design and manufacturing. Current design practices are dominated by ad-hoc heuristics incorporated in proprietary tools with unknown suboptimality. This will soon become a bottleneck for the new generation of high-density arrays, such as the ones currently being designed at Perlegen [109]. The goal of the already accomplished research was to develop highly scalable tools, with predictable runtime and quality, for cost-effective, computer-aided design and manufacturing of DNA probe arrays. We illustrate the utility of our approach by taking a concrete example of combining the design tools of microarray technology for Harpes B virus DNA data

    A New Quartet Tree Heuristic for Hierarchical Clustering

    Get PDF
    We consider the problem of constructing an an optimal-weight tree from the 3*(n choose 4) weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologiesis optimal (so it can be the case that the optimal tree embeds all quartets as non-optimal topologies). We present a heuristic for reconstructing the optimal-weight tree, and a canonical manner to derive the quartet-topology weights from a given distance matrix. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. This contrasts to other heuristic search methods from biological phylogeny, like DNAML or quartet puzzling, which, repeatedly, incrementally construct a solution from a random order of objects, and subsequently add agreement values.Comment: 22 pages, 14 figure

    Sparse approaches for the exact distribution of patterns in long state sequences generated by a Markov source

    Get PDF
    We present two novel approaches for the computation of the exact distribution of a pattern in a long sequence. Both approaches take into account the sparse structure of the problem and are two-part algorithms. The first approach relies on a partial recursion after a fast computation of the second largest eigenvalue of the transition matrix of a Markov chain embedding. The second approach uses fast Taylor expansions of an exact bivariate rational reconstruction of the distribution. We illustrate the interest of both approaches on a simple toy-example and two biological applications: the transcription factors of the Human Chromosome 5 and the PROSITE signatures of functional motifs in proteins. On these example our methods demonstrate their complementarity and their hability to extend the domain of feasibility for exact computations in pattern problems to a new level

    A Tutorial on Clique Problems in Communications and Signal Processing

    Full text link
    Since its first use by Euler on the problem of the seven bridges of K\"onigsberg, graph theory has shown excellent abilities in solving and unveiling the properties of multiple discrete optimization problems. The study of the structure of some integer programs reveals equivalence with graph theory problems making a large body of the literature readily available for solving and characterizing the complexity of these problems. This tutorial presents a framework for utilizing a particular graph theory problem, known as the clique problem, for solving communications and signal processing problems. In particular, the paper aims to illustrate the structural properties of integer programs that can be formulated as clique problems through multiple examples in communications and signal processing. To that end, the first part of the tutorial provides various optimal and heuristic solutions for the maximum clique, maximum weight clique, and kk-clique problems. The tutorial, further, illustrates the use of the clique formulation through numerous contemporary examples in communications and signal processing, mainly in maximum access for non-orthogonal multiple access networks, throughput maximization using index and instantly decodable network coding, collision-free radio frequency identification networks, and resource allocation in cloud-radio access networks. Finally, the tutorial sheds light on the recent advances of such applications, and provides technical insights on ways of dealing with mixed discrete-continuous optimization problems

    Inferring Genomic Sequences

    Get PDF
    Recent advances in next generation sequencing have provided unprecedented opportunities for high-throughput genomic research, inexpensively producing millions of genomic sequences in a single run. Analysis of massive volumes of data results in a more accurate picture of the genome complexity and requires adequate bioinformatics support. We explore computational challenges of applying next generation sequencing to particular applications, focusing on the problem of reconstructing viral quasispecies spectrum from pyrosequencing shotgun reads and problem of inferring informative single nucleotide polymorphisms (SNPs), statistically covering genetic variation of a genome region in genome-wide association studies. The genomic diversity of viral quasispecies is a subject of a great interest, particularly for chronic infections, since it can lead to resistance to existing therapies. High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software cannot be used to simultaneously assemble and estimate the abundance of multiple closely related (but non-identical) quasispecies sequences. Here, we introduce a new Viral Spectrum Assembler (ViSpA) for inferring quasispecies spectrum and compare it with the state-of-the-art ShoRAH tool on both synthetic and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. While ShoRAH has an advanced error correction algorithm, ViSpA is better at quasispecies assembling, producing more accurate reconstruction of a viral population. We also foresee ViSpA application to the analysis of high-throughput sequencing data from bacterial metagenomic samples and ecological samples of eukaryote populations. Due to the large data volume in genome-wide association studies, it is desirable to find a small subset of SNPs (tags) that covers the genetic variation of the entire set. We explore the trade-off between the number of tags used per non-tagged SNP and possible overfitting and propose an efficient 2LR-Tagging heuristic
    • …
    corecore