
    Rust-Bio - a fast and safe bioinformatics library

    We present Rust-Bio, the first general purpose bioinformatics library for the innovative Rust programming language. Rust-Bio leverages the unique combination of speed, memory safety and high-level syntax offered by Rust to provide a fast and safe set of bioinformatics algorithms and data structures with a focus on sequence analysis

    Simulated single molecule microscopy with SMeagol

    SMeagol is a software tool to simulate highly realistic microscopy data based on spatial systems biology models, in order to facilitate development, validation, and optimization of advanced analysis methods for live cell single molecule microscopy data. Availability and Implementation: SMeagol runs on Matlab R2014 and later, and uses compiled binaries in C for reaction-diffusion simulations. Documentation, source code, and binaries for recent versions of Mac OS, Windows, and Ubuntu Linux can be downloaded from http://smeagol.sourceforge.net. Comment: v2: 14 pages including supplementary text. Pre-copyedited, author-produced version of an application note published in Bioinformatics following peer review. The version of record and additional supplementary material are available online at: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btw10

    Bioinformatics Databases: State of the Art and Research Perspectives

    Bioinformatics or computational biology, i.e. the application of mathematical and computer science methods to solving problems in molecular biology that require large-scale data, computation, and analysis, is a research area currently receiving considerable attention. Databases play an essential role in molecular biology and consequently in bioinformatics. Molecular biology data are often relatively cheap to produce, leading to a proliferation of databases: the number of bioinformatics databases accessible worldwide probably lies between 500 and 1,000. Not only molecular biology data, but also molecular biology literature and literature references are stored in databases. Bioinformatics databases are often very large (e.g. the sequence database GenBank contains more than 4 × 10^6 nucleotide sequences) and in general grow rapidly (e.g. about 8000 abstracts are added every month to the literature database PubMed). Bioinformatics databases are heterogeneous in their data, in their data modeling paradigms, in their management systems, and in the data analysis tools they support. Furthermore, bioinformatics databases are often implemented, queried, updated, and managed using methods rarely applied to other databases. This presentation aims at introducing current bioinformatics databases, stressing the aspects in which they depart from conventional databases. A more detailed survey can be found in [1] upon which thi

    Overview of Random Forest Methodology and Practical Guidance with Emphasis on Computational Biology and Bioinformatics

    The Random Forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables and returns measures of variable importance. This paper synthesizes ten years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is given to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics as well as some representative examples of RF applications in this context and possible directions for future research
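The variable importance measures mentioned in the abstract above are often permutation based: shuffle one variable and record how much prediction accuracy drops. The following is a minimal, model-agnostic sketch of that idea in Python, not the paper's procedure. The names `permutation_importance`, `toy_model`, and the toy data are hypothetical, and a real random forest would typically compute this per tree on out-of-bag samples.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the correct label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_features, seed=0):
    """Accuracy drop after shuffling each feature column in turn.

    Simplified, model-agnostic sketch (an assumption for illustration):
    it scores the model on the data passed in rather than on
    out-of-bag samples of individual trees.
    """
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the label
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(model, permuted, labels))
    return importances


# Toy data: feature 0 determines the label, feature 1 is unrelated noise.
rows = [[i % 2, (i * 7) % 5] for i in range(20)]
labels = [r[0] for r in rows]


def toy_model(row):
    """Hypothetical stand-in for a fitted classifier; it reads feature 0 only."""
    return row[0]


importances = permutation_importance(toy_model, rows, labels, n_features=2)
```

Because `toy_model` ignores feature 1, shuffling that column leaves accuracy unchanged and its importance is exactly zero, while shuffling feature 0 lowers accuracy.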

    Computational Strategies for Scalable Genomics Analysis.

    The revolution in next-generation DNA sequencing technologies is leading to explosive data growth in genomics, posing a significant challenge to the computing infrastructure and software algorithms for genomics analysis. Various big data technologies have been explored to scale up/out current bioinformatics solutions to mine the big genomics data. In this review, we survey some of these exciting developments in the applications of parallel distributed computing and special hardware to genomics. We comment on the pros and cons of each strategy in the context of ease of development, robustness, scalability, and efficiency. Although this review is written for an audience from the genomics and bioinformatics fields, it may also be informative for the audience of computer science with interests in genomics applications

    PCA and K-Means decipher genome

    In this paper, we aim to give a tutorial for undergraduate students studying statistical methods and/or bioinformatics. The students will learn how data visualization can help in genomic sequence analysis. Students start with a fragment of genetic text of a bacterial genome and analyze its structure. By means of principal component analysis they "discover" that the information in the genome is encoded by non-overlapping triplets. Next, they learn how to find gene positions. This exercise on PCA and K-Means clustering enables active study of the basic bioinformatics notions. Appendix 1 contains program listings that go along with this exercise. Appendix 2 includes 2D PCA plots of triplet usage in moving frame for a series of bacterial genomes from GC-poor to GC-rich ones. Animated 3D PCA plots are attached as separate gif files. Topology (cluster structure) and geometry (mutual positions of clusters) of these plots depend clearly on GC-content. Comment: 18 pages, with program listings for MatLab, PCA analysis of genomes and additional animated 3D PCA plot
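The exercise described above starts by turning genome fragments into triplet-usage vectors, which PCA and K-Means then operate on. The sketch below shows only that first step in Python and is not the paper's MatLab listing. The function names, the 300-character window, and the step size are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import product

# All 64 possible triplets over the DNA alphabet, in a fixed order.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]


def triplet_frequencies(fragment, frame=0):
    """Return a 64-dimensional vector of non-overlapping triplet frequencies.

    Reading starts at offset `frame` (0, 1, or 2). Triplets containing
    characters other than A, C, G, T are ignored.
    """
    counts = Counter(
        fragment[i:i + 3]
        for i in range(frame, len(fragment) - 2, 3)
    )
    total = sum(counts[t] for t in TRIPLETS)
    if total == 0:
        return [0.0] * len(TRIPLETS)
    return [counts[t] / total for t in TRIPLETS]


def window_vectors(sequence, window=300, step=300, frame=0):
    """Cut the sequence into windows and compute one vector per window.

    The window and step sizes are illustrative assumptions.
    """
    return [
        triplet_frequencies(sequence[start:start + window], frame)
        for start in range(0, len(sequence) - window + 1, step)
    ]


# Toy usage: a periodic sequence read in two different frames.
in_frame = triplet_frequencies("ATGATGATG", frame=0)
shifted = triplet_frequencies("ATGATGATG", frame=1)
vectors = window_vectors("ATG" * 200)
```

Each window yields one point in 64-dimensional space. Projecting those points with a PCA routine and clustering them with K-Means is the part of the exercise that this sketch leaves out.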