
    SEED: efficient clustering of next-generation sequences.

    Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem for studying the population sizes of DNA/RNA molecules and for reducing the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.
    Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with linear time and memory performance. When used as a preprocessing tool on genome/transcriptome assembly data, SEED reduced the time and memory requirements of the Velvet/Oases assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than those from non-preprocessed data, as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results, with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.
    Availability: The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/
    Contact: [email protected]
    Supplementary information: Supplementary data are available at Bioinformatics online.
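    The spaced-seed hashing idea can be made concrete with a toy sketch. The code below is a simplified illustration, not SEED's actual implementation: the seed pattern, the choice of the first bucket member as the center, and the plain Hamming check are assumptions made for exposition; real block spaced seeds and overhang handling are more involved.

    from collections import defaultdict

    # Hypothetical "care" positions of a spaced seed: only these positions
    # contribute to the hash key, so reads differing elsewhere still collide.
    SEED_PATTERN = [0, 1, 3, 4, 6, 8, 9, 11]

    def seed_key(read):
        return "".join(read[i] for i in SEED_PATTERN)

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def cluster(reads, max_mismatches=3):
        buckets = defaultdict(list)
        for read in reads:
            buckets[seed_key(read)].append(read)
        clusters = []
        for members in buckets.values():
            center = members[0]  # stand-in for SEED's virtual center sequence
            close = [m for m in members if hamming(center, m) <= max_mismatches]
            clusters.append((center, close))
        return clusters

    print(cluster(["ACGTACGTACGT", "ACGTACGAACGT", "TTTTGGGGCCCC"]))

    In practice, spaced-seed methods typically combine several seed patterns so that every allowed placement of mismatches is caught by at least one of them; a single pattern, as above, misses mismatches that fall on "care" positions.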

    The Parallelism Motifs of Genomic Data Analysis

    Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some genomic data analysis problems require large-scale computational platforms to meet both their memory and computational requirements. These applications differ from the scientific simulations that dominate the workload on high-end parallel systems today, and they place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns, such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering, and assembly, for both single genomes and metagenomes. We identify some of the common computational patterns, or motifs, that help inform parallelization strategies, and we compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing.
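    The hashing motif named above is easy to make concrete: k-mer counting is the shared kernel behind profiling, clustering, and assembly. The sketch below is illustrative only; the function and parameter names are not from the paper, and a parallel version would issue exactly the kind of asynchronous updates to a shared table that the abstract describes.

    from collections import Counter

    def count_kmers(sequence, k=4):
        # Hash every length-k substring into a table; in a parallel setting,
        # each worker would apply asynchronous updates to one shared table.
        return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

    print(count_kmers("ACGTACGTGACG").most_common(3))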

    Normalized Information Distance

    The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and is thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, especially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation. (Chapter in: Information Theory and Statistical Learning, M. Dehmer and F. Emmert-Streib, eds., Springer-Verlag, New York; 33 pages, 12 figures.)
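    The compression-based realization mentioned first is the normalized compression distance (NCD): the uncomputable Kolmogorov complexity C(x) is replaced by the output size of a real compressor, giving NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). A minimal sketch, using zlib as the stand-in compressor:

    import zlib

    def c(data: bytes) -> int:
        # Compressed length approximates the Kolmogorov complexity of `data`.
        return len(zlib.compress(data, 9))

    def ncd(x: bytes, y: bytes) -> float:
        cx, cy, cxy = c(x), c(y), c(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    a = b"the quick brown fox jumps over the lazy dog " * 40
    b = a[:-1] + b"!"                                    # nearly identical to a
    r = bytes([(i * 37) % 256 for i in range(len(a))])   # unrelated byte pattern
    print(ncd(a, b))  # small: a and b compress well together
    print(ncd(a, r))  # closer to 1: little shared structure

    The choice of compressor matters in practice: dictionary and block-sorting compressors capture different regularities, so NCD values are only comparable when computed with the same compressor.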

    Experimental and computational applications of microarray technology for malaria eradication in Africa

    Various mutation-assisted drug resistances that have evolved in Plasmodium falciparum strains, together with insecticide resistance in the female Anopheles mosquito, account for major biomedical catastrophes standing against all efforts to eradicate malaria in Sub-Saharan Africa. Malaria is endemic in more than 100 countries and is by far the most costly disease in terms of human health, causing major losses among many African nations, including Nigeria. The fight against malaria is failing, and DNA microarray analysis needs to keep up the pace in order to unravel the evolving parasite's gene expression profile, which is a pointer to monitoring the genes involved in malaria's infective metabolic pathway. Huge volumes of data are generated, and biologists face the challenge of extracting useful information from them. Expression levels for tens of thousands of genes can be simultaneously measured in a single hybridization experiment and are collectively called a "gene expression profile". Gene expression profiles can also be used in studying the various states of malaria development, in which expression profiles of different disease states at different time points are collected and compared to each other to establish a classifying scheme for purposes such as diagnosis and treatment with adequate drugs. This paper examines microarray technology and its application, as supported by appropriate software tools, from experimental set-up to the level of data analysis. The level of microarray technology in Africa, its availability, and the techniques required for malaria eradication and effective healthcare in Nigeria and Africa in general are also assessed.
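    To make the profile-comparison step concrete, here is a purely illustrative sketch (the data and names are invented, not from the paper): two expression profiles sampled at the same time points are compared with a correlation distance, a common first step toward the classifying schemes mentioned above.

    import numpy as np

    def correlation_distance(profile_a, profile_b):
        # 1 - Pearson correlation: 0 for perfectly co-varying profiles.
        return 1.0 - np.corrcoef(profile_a, profile_b)[0, 1]

    # Hypothetical expression levels of one gene set across five time points.
    infected   = np.array([2.1, 3.8, 5.0, 4.2, 3.9])
    uninfected = np.array([2.0, 2.1, 1.9, 2.2, 2.0])
    print(correlation_distance(infected, uninfected))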

    Pattern vectors from algebraic graph theory

    Graph structures have proven computationally cumbersome for pattern analysis. The reason for this is that, before graphs can be converted to pattern vectors, correspondences must be established between the nodes of structures which are potentially of different size. To overcome this problem, in this paper, we turn to the spectral decomposition of the Laplacian matrix. We show how the elements of the spectral matrix for the Laplacian can be used to construct symmetric polynomials that are permutation invariants. The coefficients of these polynomials can be used as graph features which can be encoded in a vectorial manner. We extend this representation to graphs in which there are unary attributes on the nodes and binary attributes on the edges by using the spectral decomposition of a Hermitian property matrix that can be viewed as a complex analogue of the Laplacian. To embed the graphs in a pattern space, we explore whether the vectors of invariants can be embedded in a low-dimensional space using a number of alternative strategies, including principal components analysis (PCA), multidimensional scaling (MDS), and locality preserving projection (LPP). Experimentally, we demonstrate that the embeddings result in well-defined graph clusters. Our experiments with the spectral representation involve both synthetic and real-world data. The experiments with synthetic data demonstrate that the distances between spectral feature vectors can be used to discriminate between graphs on the basis of their structure. The real-world experiments show that the method can be used to locate clusters of graphs.
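    A rough sketch of the flavor of this construction, under simplifying assumptions (unattributed graphs, power sums on absolute eigenvector components rather than the paper's full symmetric-polynomial machinery, and no special handling of eigenvalue degeneracy):

    import numpy as np

    def laplacian(adj):
        return np.diag(adj.sum(axis=1)) - adj

    def spectral_features(adj, num_powers=3):
        # Eigendecompose the Laplacian; its spectrum is intrinsic to the graph,
        # so features built from it need no node-to-node correspondences.
        _, vecs = np.linalg.eigh(laplacian(adj))
        feats = []
        for col in vecs.T:
            # Power sums of absolute components are permutation invariant and
            # sidestep the eigenvector sign ambiguity; they generate the same
            # invariant ring as the elementary symmetric polynomials.
            feats.extend(np.sum(np.abs(col) ** r) for r in range(1, num_powers + 1))
        return np.array(feats)

    # Two node orderings of the same 3-node path give identical features...
    p1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
    p2 = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
    # ...while a structurally different graph (the triangle) does not.
    tri = np.ones((3, 3)) - np.eye(3)
    print(np.allclose(spectral_features(p1), spectral_features(p2)))   # True
    print(np.allclose(spectral_features(p1), spectral_features(tri)))  # False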