2,830 research outputs found

    Entropy-scaling search of massive biological data

    Get PDF
    Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the dataset is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology.Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo

    IMPROVING BWA-MEM WITH GPU PARALLEL COMPUTING

    Get PDF
    Due to the many advances made in designing algorithms, especially the ones used in bioinformatics, it is becoming harder and harder to improve their efficiencies. Therefore, hardware acceleration using General-Purpose computing on Graphics Processing Unit has become a popular choice. BWA-MEM is an important part of the BWA software package for sequence mapping. Because of its high speed and accuracy, we choose to parallelize the popular short DNA sequence mapper. BWA has been a prevalent single node tool in genome alignment, and it has been widely studied for acceleration for a long time since the first version of the BWA package came out. This thesis presents the Big Data GPGPU distributed BWA-MEM, a tool that combines GPGPU acceleration and distributed computing. The four hardware parallelization techniques used are CPU multi-threading, GPU paralleled, CPU distributed, and GPU distributed. The GPGPU distributed software typically outperforms other parallelization versions. The alignment is performed on a distributed network, and each node in the network executes a separate GPGPU paralleled version of the software. We parallelize the chain2aln function in three levels. In Level 1, the function ksw\_extend2, an algorithm based on Smith-Waterman, is parallelized to handle extension on one side of the seed. In Level 2, the function chain2aln is parallelized to handle chain extension, where all seeds within the same chain are extended. In Level 3, part of the function mem\_align1\_core is parallelized for extending multiple chains. Due to the program's complexity, the parallelization work was limited at the GPU version of ksw\_extend2 parallelization Level 3. However, we have successfully combined Spark with BWA-MEM and ksw\_extend2 at parallelization Level 1, which has shown that the proposed framework is possible. The paralleled Level 3 GPU version of ksw\_extend2 demonstrated noticeable speed improvement with the test data set

    Searching, clustering and evaluating biological sequences

    Get PDF
    The latest generation of biological sequencing technologies have made it possible to generate sequence data faster and cheaper than ever before. The growth of sequence data has been exponential, and so far, has outpaced the rate of improvement of computer speed and capacity. This rate of growth, however, makes analysis of new datasets increasingly difficult, and highlights the need for efficient, scalable and modular software tools. Fortunately most types of analysis of sequence data involve a few fundamental operations. Here we study three such problems, namely searching for local alignments between two sets of sequences, clustering sequences, and evaluating the assemblies made from sequence fragments. We present simple and efficient heuristic algorithms for these problems, as well as open source software tools which implement these algorithms. First, we present approximate seeds; a new type of seed for local alignment search. Approximate seeds are a generalization of exact seeds and spaced seeds, in that they allow for insertions and deletions within the seed. We prove that approximate seeds are completely sensitive. We also show how to efficiently find approximate seeds using a suffix array index of the sequences. Next, we present DNACLUST; a tool for clustering millions of DNA sequence fragments. Although DNACLUST has been primarily made for clustering 16S ribosomal RNA sequences, it can be used for other tasks, such as removing duplicate or near duplicate sequences from a dataset. Finally, we present a framework for comparing (two or more) assemblies built from the same set of reads. Our evaluation requires the set of reads and the assemblies only, and does not require the true genome sequence. Therefore our method can be used in de novo assembly projects, where the true genome is not known. Our score is based on probability theory, and the true genome is expected to obtain the maximum score

    Software redundancy: what, where, how

    Get PDF
    Software systems have become pervasive in everyday life and are the core component of many crucial activities. An inadequate level of reliability may determine the commercial failure of a software product. Still, despite the commitment and the rigorous verification processes employed by developers, software is deployed with faults. To increase the reliability of software systems, researchers have investigated the use of various form of redundancy. Informally, a software system is redundant when it performs the same functionality through the execution of different elements. Redundancy has been extensively exploited in many software engineering techniques, for example for fault-tolerance and reliability engineering, and in self-adaptive and self- healing programs. Despite the many uses, though, there is no formalization or study of software redundancy to support a proper and effective design of software. Our intuition is that a systematic and formal investigation of software redundancy will lead to more, and more effective uses of redundancy. This thesis develops this intuition and proposes a set of ways to characterize qualitatively as well as quantitatively redundancy. We first formalize the intuitive notion of redundancy whereby two code fragments are considered redundant when they perform the same functionality through different executions. On the basis of this abstract and general notion, we then develop a practical method to obtain a measure of software redundancy. We prove the effectiveness of our measure by showing that it distinguishes between shallow differences, where apparently different code fragments reduce to the same underlying code, and deep code differences, where the algorithmic nature of the computations differs. We also demonstrate that our measure is useful for developers, since it is a good predictor of the effectiveness of techniques that exploit redundancy. Besides formalizing the notion of redundancy, we investigate the pervasiveness of redundancy intrinsically found in modern software systems. Intrinsic redundancy is a form of redundancy that occurs as a by-product of modern design and development practices. We have observed that intrinsic redundancy is indeed present in software systems, and that it can be successfully exploited for good purposes. This thesis proposes a technique to automatically identify equivalent method sequences in software systems to help developers assess the presence of intrinsic redundancy. We demonstrate the effectiveness of the technique by showing that it identifies the majority of equivalent method sequences in a system with good precision and performance

    Transform Based And Search Aware Text Compression Schemes And Compressed Domain Text Retrieval

    Get PDF
    In recent times, we have witnessed an unprecedented growth of textual information via the Internet, digital libraries and archival text in many applications. While a good fraction of this information is of transient interest, useful information of archival value will continue to accumulate. We need ways to manage, organize and transport this data from one point to the other on data communications links with limited bandwidth. We must also have means to speedily find the information we need from this huge mass of data. Sometimes, a single site may also contain large collections of data such as a library database, thereby requiring an efficient search mechanism even to search within the local data. To facilitate the information retrieval, an emerging ad hoc standard for uncompressed text is XML which preprocesses the text by putting additional user defined metadata such as DTD or hyperlinks to enable searching with better efficiency and effectiveness. This increases the file size considerably, underscoring the importance of applying text compression. On account of efficiency (in terms of both space and time), there is a need to keep the data in compressed form for as much as possible. Text compression is concerned with techniques for representing the digital text data in alternate representations that takes less space. Not only does it help conserve the storage space for archival and online data, it also helps system performance by requiring less number of secondary storage (disk or CD Rom) accesses and improves the network transmission bandwidth utilization by reducing the transmission time. Unlike static images or video, there is no international standard for text compression, although compressed formats like .zip, .gz, .Z files are increasingly being used. In general, data compression methods are classified as lossless or lossy. Lossless compression allows the original data to be recovered exactly. Although used primarily for text data, lossless compression algorithms are useful in special classes of images such as medical imaging, finger print data, astronomical images and data bases containing mostly vital numerical data, tables and text information. Many lossy algorithms use lossless methods at the final stage of the encoding stage underscoring the importance of lossless methods for both lossy and lossless compression applications. In order to be able to effectively utilize the full potential of compression techniques for the future retrieval systems, we need efficient information retrieval in the compressed domain. This means that techniques must be developed to search the compressed text without decompression or only with partial decompression independent of whether the search is done on the text or on some inversion table corresponding to a set of key words for the text. In this dissertation, we make the following contributions: (1) Star family compression algorithms: We have proposed an approach to develop a reversible transformation that can be applied to a source text that improves existing algorithm\u27s ability to compress. We use a static dictionary to convert the English words into predefined symbol sequences. These transformed sequences create additional context information that is superior to the original text. Thus we achieve some compression at the preprocessing stage. We have a series of transforms which improve the performance. Star transform requires a static dictionary for a certain size. To avoid the considerable complexity of conversion, we employ the ternary tree data structure that efficiently converts the words in the text to the words in the star dictionary in linear time. (2) Exact and approximate pattern matching in Burrows-Wheeler transformed (BWT) files: We proposed a method to extract the useful context information in linear time from the BWT transformed text. The auxiliary arrays obtained from BWT inverse transform brings logarithm search time. Meanwhile, approximate pattern matching can be performed based on the results of exact pattern matching to extract the possible candidate for the approximate pattern matching. Then fast verifying algorithm can be applied to those candidates which could be just small parts of the original text. We present algorithms for both k-mismatch and k-approximate pattern matching in BWT compressed text. A typical compression system based on BWT has Move-to-Front and Huffman coding stages after the transformation. We propose a novel approach to replace the Move-to-Front stage in order to extend compressed domain search capability all the way to the entropy coding stage. A modification to the Move-to-Front makes it possible to randomly access any part of the compressed text without referring to the part before the access point. (3) Modified LZW algorithm that allows random access and partial decoding for the compressed text retrieval: Although many compression algorithms provide good compression ratio and/or time complexity, LZW is the first one studied for the compressed pattern matching because of its simplicity and efficiency. Modifications on LZW algorithm provide the extra advantage for fast random access and partial decoding ability that is especially useful for text retrieval systems. Based on this algorithm, we can provide a dynamic hierarchical semantic structure for the text, so that the text search can be performed on the expected level of granularity. For example, user can choose to retrieve a single line, a paragraph, or a file, etc. that contains the keywords. More importantly, we will show that parallel encoding and decoding algorithm is trivial with the modified LZW. Both encoding and decoding can be performed with multiple processors easily and encoding and decoding process are independent with respect to the number of processors

    Technology dictates algorithms: Recent developments in read alignment

    Full text link
    Massively parallel sequencing techniques have revolutionized biological and medical sciences by providing unprecedented insight into the genomes of humans, animals, and microbes. Modern sequencing platforms generate enormous amounts of genomic data in the form of nucleotide sequences or reads. Aligning reads onto reference genomes enables the identification of individual-specific genetic variants and is an essential step of the majority of genomic analysis pipelines. Aligned reads are essential for answering important biological questions, such as detecting mutations driving various human diseases and complex traits as well as identifying species present in metagenomic samples. The read alignment problem is extremely challenging due to the large size of analyzed datasets and numerous technological limitations of sequencing platforms, and researchers have developed novel bioinformatics algorithms to tackle these difficulties. Importantly, computational algorithms have evolved and diversified in accordance with technological advances, leading to todays diverse array of bioinformatics tools. Our review provides a survey of algorithmic foundations and methodologies across 107 alignment methods published between 1988 and 2020, for both short and long reads. We provide rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read aligners. We separately discuss how longer read lengths produce unique advantages and limitations to read alignment techniques. We also discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology, including whole transcriptome, adaptive immune repertoire, and human microbiome studies

    BiGGEsTS: integrated environment for biclustering analysis of time series gene expression data

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The ability to monitor changes in expression patterns over time, and to observe the emergence of coherent temporal responses using expression time series, is critical to advance our understanding of complex biological processes. Biclustering has been recognized as an effective method for discovering local temporal expression patterns and unraveling potential regulatory mechanisms. The general biclustering problem is NP-hard. In the case of time series this problem is tractable, and efficient algorithms can be used. However, there is still a need for specialized applications able to take advantage of the temporal properties inherent to expression time series, both from a computational and a biological perspective.</p> <p>Findings</p> <p>BiGGEsTS makes available state-of-the-art biclustering algorithms for analyzing expression time series. Gene Ontology (GO) annotations are used to assess the biological relevance of the biclusters. Methods for preprocessing expression time series and post-processing results are also included. The analysis is additionally supported by a visualization module capable of displaying informative representations of the data, including heatmaps, dendrograms, expression charts and graphs of enriched GO terms.</p> <p>Conclusion</p> <p>BiGGEsTS is a free open source graphical software tool for revealing local coexpression of genes in specific intervals of time, while integrating meaningful information on gene annotations. It is freely available at: <url>http://kdbio.inesc-id.pt/software/biggests</url>. We present a case study on the discovery of transcriptional regulatory modules in the response of <it>Saccharomyces cerevisiae </it>to heat stress.</p

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Get PDF
    Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long reads of DNA sequence from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred of published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm and only little attention has been paid to non-standard mapping approaches. Here, we propound the so-called dynamic mapping that we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping is based on exploiting the information from previously computed alignments, helping to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program collecting alignment statistics and guiding updates of the reference in the online fashion. We provide Ococo, the first online consensus caller that implements a smart statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk. Metagenomic classification of NGS reads is another major topic studied in the thesis. Having a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge amount of NGS reads to tree nodes, and possibly estimate the relative abundance of involved species. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, obtaining a much smaller and more informative index compared to Kraken. We provide a modified version of BWA that improves the BWT-index for a quick k-mer look-up

    Gammapy: A Python package for gamma-ray astronomy

    Full text link
    In this article, we present Gammapy, an open-source Python package for the analysis of astronomical Îł\gamma-ray data, and illustrate the functionalities of its first long-term-support release, version 1.0. Built on the modern Python scientific ecosystem, Gammapy provides a uniform platform for reducing and modeling data from different Îł\gamma-ray instruments for many analysis scenarios. Gammapy complies with several well-established data conventions in high-energy astrophysics, providing serialized data products that are interoperable with other software packages. Starting from event lists and instrument response functions, Gammapy provides functionalities to reduce these data by binning them in energy and sky coordinates. Several techniques for background estimation are implemented in the package to handle the residual hadronic background affecting Îł\gamma-ray instruments. After the data are binned, the flux and morphology of one or more Îł\gamma-ray sources can be estimated using Poisson maximum likelihood fitting and assuming a variety of spectral, temporal, and spatial models. Estimation of flux points, likelihood profiles, and light curves is also supported. After describing the structure of the package, we show, using publicly available Îł\gamma-ray data, the capabilities of Gammapy in multiple traditional and novel Îł\gamma-ray analysis scenarios, such as spectral and spectro-morphological modeling and estimations of a spectral energy distribution and a light curve. Its flexibility and power are displayed in a final multi-instrument example, where datasets from different instruments, at different stages of data reduction, are simultaneously fitted with an astrophysical flux model.Comment: 26 pages, 16 figure

    Dealing with clones in software : a practical approach from detection towards management

    Get PDF
    Despite the fact that duplicated fragments of code also called code clones are considered one of the prominent code smells that may exist in software, cloning is widely practiced in industrial development. The larger the system, the more people involved in its development and the more parts developed by different teams result in an increased possibility of having cloned code in the system. While there are particular benefits of code cloning in software development, research shows that it might be a source of various troubles in evolving software. Therefore, investigating and understanding clones in a software system is important to manage the clones efficiently. However, when the system is fairly large, it is challenging to identify and manage those clones properly. Among the various types of clones that may exist in software, research shows detection of near-miss clones where there might be minor to significant differences (e.g., renaming of identifiers and additions/deletions/modifications of statements) among the cloned fragments is costly in terms of time and memory. Thus, there is a great demand of state-of-the-art technologies in dealing with clones in software. Over the years, several tools have been developed to detect and visualize exact and similar clones. However, usually the tools are standalone and do not integrate well with a software developer's workflow. In this thesis, first, a study is presented on the effectiveness of a fingerprint based data similarity measurement technique named 'simhash' in detecting clones in large scale code-base. Based on the positive outcome of the study, a time efficient detection approach is proposed to find exact and near-miss clones in software, especially in large scale software systems. The novel detection approach has been made available as a highly configurable and fully fledged standalone clone detection tool named 'SimCad', which can be configured for detection of clones in both source code and non-source code based data. Second, we show a robust use of the clone detection approach studied earlier by assembling its detection service as a portable library named 'SimLib'. This library can provide tightly coupled (integrated) clone detection functionality to other applications as opposed to loosely coupled service provided by a typical standalone tool. Because of being highly configurable and easily extensible, this library allows the user to customize its clone detection process for detecting clones in data having diverse characteristics. We performed a user study to get some feedback on installation and use of the 'SimLib' API (Application Programming Interface) and to uncover its potential use as a third-party clone detection library. Third, we investigated on what tools and techniques are currently in use to detect and manage clones and understand their evolution. The goal was to find how those tools and techniques can be made available to a developer's own software development platform for convenient identification, tracking and management of clones in the software. Based on that, we developed a clone-aware software development platform named 'SimEclipse' to promote the practical use of code clone research and to provide better support for clone management in software. Finally, we evaluated 'SimEclipse' by conducting a user study on its effectiveness, usability and information management. We believe that both researchers and developers would enjoy and utilize the benefit of using these tools in different aspect of code clone research and manage cloned code in software systems
    • 

    corecore