
    Biclustering of Gene Expression Data by Correlation-Based Scatter Search

    BACKGROUND: The analysis of data generated by microarray technology is very useful for understanding how genetic information becomes functional gene products. Biclustering algorithms can determine a group of genes that are co-expressed under a set of experimental conditions. Recently, new biclustering methods based on metaheuristics have been proposed. Most of them use the Mean Squared Residue as the merit function, but this measure may fail to detect patterns that are interesting and relevant from a biological point of view, such as shifting and scaling patterns. Discovering such patterns is important because genes commonly behave similarly even though their expression levels vary over different ranges or magnitudes. METHODS: Scatter Search is an evolutionary technique based on the evolution of a small set of solutions chosen according to quality and diversity criteria. This paper presents a Scatter Search algorithm for finding biclusters in gene expression data. The proposed fitness function is based on the linear correlation among genes, so that it can detect shifting and scaling patterns, and an improvement method is included to select only positively correlated genes. RESULTS: The proposed algorithm has been tested on three real data sets (the Yeast Cell Cycle dataset, the human B-cells lymphoma dataset and the Yeast Stress dataset), finding a remarkable number of biclusters with shifting and scaling patterns. In addition, the performance of the proposed method and fitness function is compared to that of CC, OPSM, ISA, BiMax, xMotifs and Samba using the Gene Ontology database.
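    The contrast this abstract draws between the Mean Squared Residue and a correlation-based measure can be illustrated with a small sketch. This is not the paper's fitness function: the helper names (`pearson`, `mean_pairwise_correlation`, `mean_squared_residue`) and the toy expression values are assumptions made for illustration only. It shows that rows related by shifting and scaling are perfectly correlated, while the residue is zero only for pure shifts.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_pairwise_correlation(rows):
    """Average Pearson correlation over all pairs of rows (genes)."""
    pairs = list(combinations(rows, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def mean_squared_residue(m):
    """Mean squared residue of a matrix (rows = genes, columns = conditions)."""
    n_rows, n_cols = len(m), len(m[0])
    row_means = [sum(r) / n_cols for r in m]
    col_means = [sum(m[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum((m[i][j] - row_means[i] - col_means[j] + overall) ** 2
                for i in range(n_rows) for j in range(n_cols))
    return total / (n_rows * n_cols)


# Hypothetical expression profile and transformed copies of it.
base = [1.0, 3.0, 2.0, 5.0]
shift_only = [base, [v + 10.0 for v in base]]
shift_and_scale = [base,
                   [v + 10.0 for v in base],       # shifted
                   [3.0 * v for v in base],        # scaled
                   [2.0 * v + 4.0 for v in base]]  # shifted and scaled

corr_mixed = mean_pairwise_correlation(shift_and_scale)
msr_shift = mean_squared_residue(shift_only)
msr_mixed = mean_squared_residue(shift_and_scale)

print("mean pairwise correlation:", corr_mixed)
print("MSR, shift only:", msr_shift)
print("MSR, shift and scale:", msr_mixed)
```

    In this toy case the correlation treats all four rows as one coherent pattern, whereas the residue penalizes the scaled rows, which is the limitation the abstract attributes to the Mean Squared Residue.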

    Pairwise gene GO-based measures for biclustering of high-dimensional expression data

    Background: Biclustering algorithms search gene expression data for groups of genes that share the same behavior under a subset of samples. Nowadays, the biological knowledge available in public repositories can be used to drive these algorithms towards biclusters composed of functionally coherent groups of genes. In particular, a distance among genes can be defined from the information stored about them in Gene Ontology (GO). Gene pairwise GO semantic similarity measures report, for each pair of genes, a value that establishes their functional similarity. This paper studies a scatter search-based algorithm that optimizes a merit function integrating GO information. The merit function includes a term that incorporates this information through a GO measure. Results: The effect of two different gene pairwise GO measures on the performance of the algorithm is analyzed. Firstly, three well-known yeast datasets with approximately one thousand genes are studied. Secondly, a group of human datasets related to clinical cancer data is also explored by the algorithm. Most of these are high-dimensional datasets composed of a huge number of genes. The resulting biclusters reveal groups of genes linked by the same functionality when the search procedure is driven by one of the proposed GO measures. Furthermore, a qualitative biological study of a group of biclusters shows their relevance from a cancer disease perspective. Conclusions: It can be concluded that the integration of biological information improves the performance of the biclustering process. Both GO measures studied improve the results obtained for the yeast datasets. However, if datasets are composed of a huge number of genes, only one of them really improves the algorithm performance. This second case constitutes a clear option to explore interesting datasets from a clinical point of view. Ministerio de Economía y Competitividad TIN2014-55894-C2-

    Filling the gap between biology and computer science

    This editorial introduces BioData Mining, a new journal which publishes research articles related to advances in computational methods and techniques for the extraction of useful knowledge from heterogeneous biological data. We outline the aims and scope of the journal, introduce the publishing model and describe the open peer review policy, which fosters interaction within the research community

    A Normalized Tree Index for identification of correlated clinical parameters in microarray experiments

    Martin C, Tauchen A, Becker A, Nattkemper TW. A Normalized Tree Index for identification of correlated clinical parameters in microarray data. BioData Mining. 2011;4(1): 2. BACKGROUND: Gene-level measurements are widely used to gain new insights into complex diseases, e.g. cancer. A promising approach to understanding basic biological mechanisms is to combine gene expression profiles and classical clinical parameters. However, the computation of a correlation coefficient between high-dimensional data and such parameters is not covered by traditional statistical methods. METHODS: We propose a novel index, the Normalized Tree Index (NTI), to compute a correlation coefficient between the clustering result of high-dimensional microarray data and nominal clinical parameters. The NTI detects correlations between hierarchically clustered microarray data and nominal clinical parameters (labels) and gives a measurement of significance in terms of an empirical p-value of the identified correlations. To this end, the microarray data is first clustered by hierarchical agglomerative clustering using standard settings. In a second step, the computed cluster tree is evaluated. For each label, an NTI is computed measuring the correlation between that label and the clustered microarray data. RESULTS: The NTI successfully identifies correlated clinical parameters at different levels of significance when applied to two real-world microarray breast cancer data sets. Some of the identified highly correlated labels confirm the current state of knowledge, whereas others help to identify new risk factors and provide a good basis for formulating new hypotheses. CONCLUSIONS: The NTI is a valuable tool in the domain of biomedical data analysis. It allows the identification of correlations between high-dimensional data and nominal labels, while at the same time a p-value measures the level of significance of the detected correlations.

    A biclustering algorithm based on a Bicluster Enumeration Tree: application to DNA microarray data

    Background: In a number of domains, like in DNA microarray data analysis, we need to cluster simultaneously rows (genes) and columns (conditions) of a data matrix to identify groups of rows coherent with groups of columns. This kind of clustering is called biclustering. Biclustering algorithms are extensively used in DNA microarray data analysis. More effective biclustering algorithms are highly desirable and needed. Methods: We introduce BiMine, a new enumeration algorithm for biclustering of DNA microarray data. The proposed algorithm is based on three original features. First, BiMine relies on a new evaluation function called Average Spearman's rho (ASR). Second, BiMine uses a new tree structure, called Bicluster Enumeration Tree (BET), to represent the different biclusters discovered during the enumeration process. Third, to avoid the combinatorial explosion of the search tree, BiMine introduces a parametric rule that allows the enumeration process to cut tree branches that cannot lead to good biclusters. Results: The performance of the proposed algorithm is assessed using both synthetic and real DNA microarray data. The experimental results show that BiMine competes well with several other biclustering methods. Moreover, we test the biological significance using a gene annotation web-tool to show that our proposed method is able to produce biologically relevant biclusters. The software is available upon request from the authors to academic users.
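    The idea behind an evaluation function built on Spearman's rho can be sketched as follows. This is a simplified illustration, not the ASR as defined by the BiMine authors: it averages rho over pairs of rows only, and the published definition may combine row and column terms differently. The function names and the toy values are assumptions.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def average_row_spearman(rows):
    """Mean Spearman's rho over all pairs of rows (simplified, rows only)."""
    pairs = list(combinations(rows, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


# Hypothetical bicluster: all rows rise together, at very different scales.
coherent = [[1.0, 2.0, 3.0, 4.0],
            [10.0, 20.0, 25.0, 90.0],
            [2.0, 4.0, 8.0, 16.0]]
score_coherent = average_row_spearman(coherent)
score_opposed = spearman([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0])

print("coherent rows:", score_coherent)
print("opposed rows:", score_opposed)
```

    Because the measure depends only on ranks, rows that rise and fall together score highly even when their relationship is not linear, which is one plausible motivation for a rank-based evaluation function.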

    Network-Based Analysis of Genetic Variants Associated with Hippocampal Volume in Alzheimer’s Disease: A Study of ADNI Cohorts

    Background: Alzheimer’s disease (AD) is a neurodegenerative disease that causes dementia. While the molecular basis of AD is not fully understood, genetic factors are expected to participate in the development and progression of the disease. Our goal was to uncover novel genetic underpinnings of Alzheimer’s disease with a bioinformatics approach that accounts for tissue specificity. Findings: We performed genome-wide association studies (GWAS) for hippocampal volume in two Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohorts. We used these GWAS in a subsequent tissue-specific network-wide association study (NetWAS), which applied nominally significant associations from the initial GWAS to identify disease-relevant patterns in a functional network for the hippocampus. We compared prioritized gene lists from NetWAS and GWAS with literature-curated AD-associated genes from the Online Mendelian Inheritance in Man (OMIM) database. In the ADNI-1 GWAS, where we also observed an enrichment of low p-values, NetWAS prioritized disease-gene associations in accordance with OMIM annotations. This was not observed in the ADNI-2 dataset. We provide source code to replicate these analyses as well as complete results under permissive licenses. Conclusions: We performed the first analysis of hippocampal volume using NetWAS, which applies machine learning algorithms to tissue-specific functional interaction networks to prioritize GWAS results. Our findings support the idea that tissue-specific networks may provide helpful context for understanding the etiology of common human diseases, and they reveal challenges that network-based approaches encounter in some datasets. Our source code and intermediate results files can facilitate the development of methods to address these challenges.

    Representing and querying disease networks using graph databases

    BACKGROUND: Systems biology experiments generate large volumes of data of multiple modalities, and integrating this information is challenging because it combines complexity with rich semantics. Here, we describe how graph databases provide a powerful framework for storage, querying and envisioning of biological data. RESULTS: We show how graph databases are well suited for the representation of biological information, which is typically highly connected, semi-structured and unpredictable. We outline an application case that uses the Neo4j graph database for building and querying a prototype network to provide biological context to asthma-related genes. CONCLUSIONS: Our study suggests that graph databases provide a flexible solution for the integration of multiple types of biological data and facilitate exploratory data mining to support hypothesis generation. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13040-016-0102-8) contains supplementary material, which is available to authorized users.

    DAPPER: a data-mining resource for protein-protein interactions.

    BACKGROUND: The identification of interaction networks between proteins and complexes holds the promise of offering novel insights into the molecular mechanisms that regulate many biological processes. With increasing volumes of such datasets, especially in model organisms such as Drosophila melanogaster, there exists a pressing need for specialised tools that can seamlessly collect, integrate and analyse these data. Here we describe a database coupled with a mining tool for protein-protein interactions (DAPPER), developed as a rich resource for studying multi-protein complexes in Drosophila melanogaster. RESULTS: This proteomics database is compiled through mass spectrometric analyses of many protein complexes affinity purified from Drosophila tissues and cultured cells. Web access to DAPPER is provided via an accelerated version of BioMart software, enabling data mining through customised querying and output formats. The protein-protein interaction dataset is annotated with FlyBase identifiers and further linked to the Ensembl database using BioMart's data-federation model, thereby enabling complex multi-dataset queries. DAPPER is open source, with all its contents and source code freely available. CONCLUSIONS: DAPPER offers an easy-to-navigate and extensible platform for real-time integration of diverse resources containing new and existing protein-protein interaction datasets of Drosophila melanogaster. This work was supported financially by grants from Cancer Research UK (CRUK), the Biotechnology and Biological Sciences Research Council and the Medical Research Council to DMG (C3/A11431, BB/I013938/1, G1001696), by a Cancer Research UK Career Development Fellowship to YK (C40697/A12874), and by Cancer Research UK grants to PPD (C12296/A8039 and C12296/A12541). ZL is on leave from the Biological Research Centre of the Hungarian Academy of Sciences (Institute of Biochemistry, Szeged, Hungary) and was supported by a Long-Term Fellowship of the Federation of European Biochemical Societies (FEBS).