1,889 research outputs found

    Visual and computational analysis of structure-activity relationships in high-throughput screening data

    Get PDF
    Novel analytic methods are required to assimilate the large volumes of structural and bioassay data generated by combinatorial chemistry and high-throughput screening programmes in the pharmaceutical and agrochemical industries. This paper reviews recent work in visualisation and data mining that can be used to develop structure-activity relationships from such chemical/biological datasets

    Information visualization for DNA microarray data analysis: A critical review

    Get PDF
    Graphical representation may provide effective means of making sense of the complexity and sheer volume of data produced by DNA microarray experiments that monitor the expression patterns of thousands of genes simultaneously. The ability to use ldquoabstractrdquo graphical representation to draw attention to areas of interest, and more in-depth visualizations to answer focused questions, would enable biologists to move from a large amount of data to particular records they are interested in, and therefore, gain deeper insights in understanding the microarray experiment results. This paper starts by providing some background knowledge of microarray experiments, and then, explains how graphical representation can be applied in general to this problem domain, followed by exploring the role of visualization in gene expression data analysis. Having set the problem scene, the paper then examines various multivariate data visualization techniques that have been applied to microarray data analysis. These techniques are critically reviewed so that the strengths and weaknesses of each technique can be tabulated. Finally, several key problem areas as well as possible solutions to them are discussed as being a source for future work

    Generating concept trees from dynamic self-organizing map

    Get PDF
    Self-organizing map (SOM) provides both clustering and visualization capabilities in mining data. Dynamic self-organizing maps such as Growing Self-organizing Map (GSOM) has been developed to overcome the problem of fixed structure in SOM to enable better representation of the discovered patterns. However, in mining large datasets or historical data the hierarchical structure of the data is also useful to view the cluster formation at different levels of abstraction. In this paper, we present a technique to generate concept trees from the GSOM. The formation of tree from different spread factor values of GSOM is also investigated and the quality of the trees analyzed. The results show that concept trees can be generated from GSOM, thus, eliminating the need for re-clustering of the data from scratch to obtain a hierarchical view of the data under study

    AutoSOME: a clustering method for identifying gene expression modules without prior knowledge of cluster number

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Clustering the information content of large high-dimensional gene expression datasets has widespread application in "omics" biology. Unfortunately, the underlying structure of these natural datasets is often fuzzy, and the computational identification of data clusters generally requires knowledge about cluster number and geometry.</p> <p>Results</p> <p>We integrated strategies from machine learning, cartography, and graph theory into a new informatics method for automatically clustering self-organizing map ensembles of high-dimensional data. Our new method, called AutoSOME, readily identifies discrete and fuzzy data clusters without prior knowledge of cluster number or structure in diverse datasets including whole genome microarray data. Visualization of AutoSOME output using network diagrams and differential heat maps reveals unexpected variation among well-characterized cancer cell lines. Co-expression analysis of data from human embryonic and induced pluripotent stem cells using AutoSOME identifies >3400 up-regulated genes associated with pluripotency, and indicates that a recently identified protein-protein interaction network characterizing pluripotency was underestimated by a factor of four.</p> <p>Conclusions</p> <p>By effectively extracting important information from high-dimensional microarray data without prior knowledge or the need for data filtration, AutoSOME can yield systems-level insights from whole genome microarray expression studies. Due to its generality, this new method should also have practical utility for a variety of data-intensive applications, including the results of deep sequencing experiments. AutoSOME is available for download at <url>http://jimcooperlab.mcdb.ucsb.edu/autosome</url>.</p

    An Introduction to Programming for Bioscientists: A Python-based Primer

    Full text link
    Computing has revolutionized the biological sciences over the past several decades, such that virtually all contemporary research in the biosciences utilizes computer programs. The computational advances have come on many fronts, spurred by fundamental developments in hardware, software, and algorithms. These advances have influenced, and even engendered, a phenomenal array of bioscience fields, including molecular evolution and bioinformatics; genome-, proteome-, transcriptome- and metabolome-wide experimental studies; structural genomics; and atomistic simulations of cellular-scale molecular assemblies as large as ribosomes and intact viruses. In short, much of post-genomic biology is increasingly becoming a form of computational biology. The ability to design and write computer programs is among the most indispensable skills that a modern researcher can cultivate. Python has become a popular programming language in the biosciences, largely because (i) its straightforward semantics and clean syntax make it a readily accessible first language; (ii) it is expressive and well-suited to object-oriented programming, as well as other modern paradigms; and (iii) the many available libraries and third-party toolkits extend the functionality of the core language into virtually every biological domain (sequence and structure analyses, phylogenomics, workflow management systems, etc.). This primer offers a basic introduction to coding, via Python, and it includes concrete examples and exercises to illustrate the language's usage and capabilities; the main text culminates with a final project in structural bioinformatics. A suite of Supplemental Chapters is also provided. Starting with basic concepts, such as that of a 'variable', the Chapters methodically advance the reader to the point of writing a graphical user interface to compute the Hamming distance between two DNA sequences.Comment: 65 pages total, including 45 pages text, 3 figures, 4 tables, numerous exercises, and 19 pages of Supporting Information; currently in press at PLOS Computational Biolog

    Mining, visualizing and comparing multidimensional biomolecular data using the Genomics Data Miner (GMine) web-server

    Get PDF
    Genomics Data Miner (GMine) is a user-friendly online software that allows non-experts to mine, cluster and compare multidimensional biomolecular datasets. Various powerful visualization techniques are provided, generating high quality figures that can be directly incorporated into scientific publications. Robust and comprehensive analyses are provided via a broad range of data-mining techniques, including univariate and multivariate statistical analysis, supervised learning, correlation networks, clustering and multivariable regression. The software has a focus on multivariate techniques, which can attribute variance in the measurements to multiple explanatory variables and confounders. Various normalization methods are provided. Extensive help pages and a tutorial are available via a wiki server. Using GMine we reanalyzed proteome microarray data of host antibody response against Plasmodium falciparum. Our results support the hypothesis that immunity to malaria is a higher-order phenomenon related to a pattern of responses and not attributable to any single antigen. We also analyzed gene expression across resting and activated T cells, identifying many immune-related genes with differential expression. This highlights both the plasticity of T cells and the operation of a hardwired activation program. These application examples demonstrate that GMine facilitates an accurate and in-depth analysis of complex molecular datasets, including genomics, transcriptomics and proteomics data

    A computational pipeline for the diagnosis of CVID patients

    Get PDF
    Common variable immunodeficiency (CVID) is one of the most frequently diagnosed primary antibody deficiencies (PADs), a group of disorders characterized by a decrease in one or more immunoglobulin (sub) classes and/or impaired antibody responses caused by inborn defects in B cells in the absence of other major immune defects. CVID patients suffer from recurrent infections and disease-related, non-infectious, complications such as autoimmune manifestations, lymphoproliferation, and malignancies. A timely diagnosis is essential for optimal follow-up and treatment. However, CVID is by definition a diagnosis of exclusion, thereby covering a heterogeneous patient population and making it difficult to establish a definite diagnosis. To aid the diagnosis of CVID patients, and distinguish them from other PADs, we developed an automated machine learning pipeline which performs automated diagnosis based on flow cytometric immunophenotyping. Using this pipeline, we analyzed the immunophenotypic profile in a pediatric and adult cohort of 28 patients with CVID, 23 patients with idiopathic primary hypogammaglobulinemia, 21 patients with IgG subclass deficiency, six patients with isolated IgA deficiency, one patient with isolated IgM deficiency, and 100 unrelated healthy controls. Flow cytometry analysis is traditionally done by manual identification of the cell populations of interest. Yet, this approach has severe limitations including subjectivity of the manual gating and bias toward known populations. To overcome these limitations, we here propose an automated computational flow cytometry pipeline that successfully distinguishes CVID phenotypes from other PADs and healthy controls. Compared to the traditional, manual analysis, our pipeline is fully automated, performing automated quality control and data pre-processing, automated population identification (gating) and deriving features from these populations to build a machine learning classifier to distinguish CVID from other PADs and healthy controls. This results in a more reproducible flow cytometry analysis, and improves the diagnosis compared to manual analysis: our pipelines achieve on average a balanced accuracy score of 0.93 (+/- 0.07), whereas using the manually extracted populations, an averaged balanced accuracy score of 0.72 (+/- 0.23) is achieved
    corecore