171 research outputs found
Biclustering of Gene Expression Data by Correlation-Based Scatter Search
BACKGROUND: The analysis of data generated by microarray technology is very useful to understand how genetic information becomes functional gene products. Biclustering algorithms can determine a group of genes which are co-expressed under a set of experimental conditions. Recently, new biclustering methods based on metaheuristics have been proposed. Most of them use the Mean Squared Residue as merit function, but patterns that are interesting and relevant from a biological point of view, such as shifting and scaling patterns, may not be detected using this measure. However, it is important to discover this type of pattern, since genes commonly present a similar behavior although their expression levels vary in different ranges or magnitudes. METHODS: Scatter Search is an evolutionary technique that is based on the evolution of a small set of solutions which are chosen according to quality and diversity criteria. This paper presents a Scatter Search with the aim of finding biclusters from gene expression data. In this algorithm the proposed fitness function is based on the linear correlation among genes to detect shifting and scaling patterns from genes, and an improvement method is included in order to select just positively correlated genes. RESULTS: The proposed algorithm has been tested with three real data sets (the Yeast Cell Cycle dataset, the human B-cell lymphoma dataset and the Yeast Stress dataset), finding a remarkable number of biclusters with shifting and scaling patterns. In addition, the performance of the proposed method and fitness function are compared to that of CC, OPSM, ISA, BiMax, xMotifs and Samba using the Gene Ontology database.
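The contrast drawn in this abstract between the Mean Squared Residue and a correlation-based measure can be illustrated with a small sketch. It is not the paper's fitness function: the helper names and the toy expression values below are assumptions chosen for illustration, and the residue is computed with the usual row-mean, column-mean and overall-mean definition.

```python
from math import sqrt


def mean_squared_residue(rows):
    """Mean squared residue of a bicluster given as a list of gene rows."""
    n, m = len(rows), len(rows[0])
    row_means = [sum(r) / m for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(m)]
    overall = sum(row_means) / n
    total = sum(
        (rows[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n)
        for j in range(m)
    )
    return total / (n * m)


def pearson(x, y):
    """Pearson linear correlation between two expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical toy profiles: one base gene, a shifted copy and a scaled copy.
base = [1.0, 2.0, 4.0, 3.0]
shifted = [v + 5.0 for v in base]   # shifting pattern
scaled = [3.0 * v for v in base]    # scaling pattern

msr_shift = mean_squared_residue([base, shifted])
msr_scale = mean_squared_residue([base, scaled])
corr_shift = pearson(base, shifted)
corr_scale = pearson(base, scaled)
```

In this toy case the residue is zero for the shifted pair but positive for the scaled pair, while the correlation is 1 for both, which is the kind of pattern the abstract says a residue-based merit function can miss.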
Pairwise gene GO-based measures for biclustering of high-dimensional expression data
Background: Biclustering algorithms search for groups of genes that share the same
behavior under a subset of samples in gene expression data. Nowadays, the biological
knowledge available in public repositories can be used to drive these algorithms to
find biclusters composed of functionally coherent groups of genes. On the other hand,
a distance among genes can be defined according to their information stored in Gene
Ontology (GO). Gene pairwise GO semantic similarity measures report a value for each
pair of genes which establishes their functional similarity. A scatter search-based
algorithm that optimizes a merit function that integrates GO information is studied in
this paper. This merit function uses a term that addresses the information through a GO
measure.
Results: The effect of two possible different gene pairwise GO measures on the
performance of the algorithm is analyzed. Firstly, three well known yeast datasets with
approximately one thousand genes are studied. Secondly, a group of human
datasets related to clinical data of cancer is also explored by the algorithm. Most of
these data are high-dimensional datasets composed of a huge number of genes. The
resulting biclusters reveal groups of genes linked by the same functionality when the
search procedure is driven by one of the proposed GO measures. Furthermore, a
qualitative biological study of a group of biclusters shows their relevance from a cancer
disease perspective.
Conclusions: It can be concluded that the integration of biological information
improves the performance of the biclustering process. The two different GO measures
studied show an improvement in the results obtained for the yeast dataset. However, if
datasets are composed of a huge number of genes, only one of them really improves
the algorithm performance. This second case constitutes a clear option to explore
interesting datasets from a clinical point of view.
Ministerio de Economía y Competitividad TIN2014-55894-C2-
Biclustering of Gene Expression Data Based on SimUI Semantic Similarity Measure
Biclustering is an unsupervised machine learning technique
that simultaneously clusters genes and conditions in gene expression
data. Gene Ontology (GO) is usually used in this context to validate
the biological relevance of the results. However, although the integration
of biological information from different sources is one of the research
directions in Bioinformatics, GO is not used in biclustering as input
data. A scatter search-based algorithm that integrates GO information
during the biclustering search process is presented in this paper. SimUI
is a GO semantic similarity measure that defines a distance between two
genes. The algorithm optimizes a fitness function that uses SimUI to
integrate the biological information stored in GO. Experimental results
analyze the effect of integration of the biological information through
this measure. A SimUI fitness function configuration is experimentally
studied in a scatter search-based biclustering algorithm.
Ministerio de Ciencia e Innovación TIN2011-28956-C02-02; Ministerio de Ciencia e Innovación TIN2014-55894-C2-R; Junta de Andalucía P12-TIC-1728; Universidad Pablo de Olavide APPB81309
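As a rough illustration of the kind of measure this abstract refers to, SimUI is commonly described as a union-intersection ratio: the number of GO terms shared by the induced ancestor graphs of two genes, divided by the number of terms in their union. The sketch below follows that description on a made-up ontology. The term names, the `PARENTS` mapping and the function names are hypothetical, and the code is not taken from the paper.

```python
# Hypothetical toy ontology: each term maps to the set of its parent terms.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A"},
    "D": {"A", "B"},
}


def induced_terms(terms):
    """Return the given terms together with all of their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS[term])
    return seen


def sim_ui(terms_gene1, terms_gene2):
    """Union-intersection similarity between two genes' annotation sets."""
    a = induced_terms(terms_gene1)
    b = induced_terms(terms_gene2)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Gene 1 is annotated with C, gene 2 with D (assumed annotations).
similarity = sim_ui({"C"}, {"D"})
# One simple way to turn the similarity into a distance (an assumption).
distance = 1.0 - similarity
```

Here the induced sets are {C, A, root} and {D, A, B, root}, so two of the five terms in the union are shared. How such a value enters the fitness function is specific to the paper and is not reproduced here.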
Correlation–Based Scatter Search for Discovering Biclusters from Gene Expression Data
Scatter Search is an evolutionary method that combines existing solutions to create new offspring, as the well–known genetic algorithms do. This paper presents a Scatter Search with the aim of finding
biclusters from gene expression data. However, biclusters with certain
patterns are more interesting from a biological point of view. Therefore,
the proposed Scatter Search uses a measure based on linear correlations
among genes to evaluate the quality of biclusters. As is usual in Scatter
Search methodology, an improvement method is included which avoids
finding biclusters with negatively correlated genes. Experimental results
from yeast cell cycle and human B-cell lymphoma datasets are reported
showing a remarkable performance of the proposed method and measure.
Ministerio de Ciencia y Tecnología TIN2007-68084-C00; Junta de Andalucía P07-TIC-0261
Evolutionary Metaheuristic for Biclustering based on Linear Correlations among Genes
A new measure to evaluate the quality of a bicluster is proposed in
this paper. This measure is based on correlations among genes.
Moreover, a new evolutionary metaheuristic based on Scatter
Search, which uses this measure as the fitness function, is presented
to obtain biclusters that contain groups of highly correlated genes.
Later, an analysis of the correlation matrix of these biclusters is
made to select these groups of genes that define new biclusters with
shifting and scaling patterns. Experimental results from human B cell lymphoma are presented.
Ministerio de Ciencia e Innovación TIN2007-68084-C02; Junta de Andalucía P07-TIC-0261
Collective analysis of multiple high-throughput gene expression datasets
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University London.
Modern technologies have resulted in the production of numerous high-throughput biological datasets. However, the development of capable computational methods has not kept pace with the generation of new high-throughput datasets. Amongst the most popular biological high-throughput datasets are gene expression datasets (e.g. microarray datasets). This work targets this aspect by proposing a suite of computational methods which can analyse multiple gene expression datasets collectively. The focal method in this suite is the unification of clustering results from multiple datasets using external specifications (UNCLES). This method applies clustering to multiple heterogeneous datasets which measure the expression of the same set of genes separately and then combines the resulting partitions in accordance with one of two types of external specifications: type A identifies the subsets of genes that are consistently co-expressed in all of the given datasets, while type B identifies the subsets of genes that are consistently co-expressed in a subset of datasets while being poorly co-expressed in another subset of datasets. This contributes to the types of questions which can be addressed by computational methods, because existing clustering, consensus clustering, and biclustering methods are inapplicable to the aforementioned objectives. Moreover, in order to assist in setting some of the parameters required by UNCLES, the M-N scatter plots technique is proposed. These methods, and less mature versions of them, have been validated and applied to numerous real datasets from the biological contexts of budding yeast, bacteria, human red blood cells, and malaria. While collaborating with biologists, these applications have led to various biological insights.
In yeast, the role of the poorly-understood gene CMR1 in the yeast cell-cycle has been further elucidated. Also, a novel subset of poorly understood yeast genes has been discovered with an expression profile consistently negatively correlated with the well-known ribosome biogenesis genes. Bacterial data analysis has identified two clusters of negatively correlated genes. Analysis of data from human red blood cells has produced some hypotheses regarding the regulation of the pathways producing such cells. On the other hand, malarial data analysis is still at a preliminary stage. Taken together, this thesis provides an original integrative suite of computational methods which scrutinise multiple gene expression datasets collectively to address previously unresolved questions, and provides the results and findings of many applications of these methods to real biological datasets from multiple contexts.
National Institute for Health Research (NIHR) and the Brunel College of Engineering, Design and Physical Science
Data Mining Using the Crossing Minimization Paradigm
Our ability and capacity to generate, record and store multi-dimensional, apparently
unstructured data is increasing rapidly, while the cost of data storage is going down. The data recorded is not perfect, as noise gets introduced in it from different sources. Some of the basic forms of noise are incorrect recording of values and missing values. The formal study of discovering useful hidden information in the data is called Data Mining.
Because of the size, and complexity of the problem, practical data mining problems are
best attempted using automatic means.
Data Mining can be categorized into two types i.e. supervised learning or classification and unsupervised learning or clustering. Clustering only the records in a database (or data matrix) gives a global view of the data and is called one-way clustering. For a detailed analysis or a local view, biclustering or co-clustering or two-way clustering is required involving the simultaneous clustering of the records and the attributes.
In this dissertation, a novel fast and white noise tolerant data mining solution is
proposed based on the Crossing Minimization (CM) paradigm; the solution works for
one-way as well as two-way clustering for discovering overlapping biclusters. For
decades the CM paradigm has traditionally been used for graph drawing and VLSI
(Very Large Scale Integration) circuit design for reducing wire length and congestion. The utility of the proposed technique is demonstrated by comparing it with other biclustering techniques using simulated noisy, as well as real data from Agriculture, Biology and other domains.
Two other interesting and hard problems also addressed in this dissertation are (i) the
Minimum Attribute Subset Selection (MASS) problem and (ii) Bandwidth
Minimization (BWM) problem of sparse matrices. The proposed CM technique is
demonstrated to provide very convincing results while attempting to solve the said
problems using real public domain data.
Pakistan is the fourth largest supplier of cotton in the world. An apparent anomaly has
been observed during 1989-97 between cotton yield and pesticide consumption in
Pakistan showing unexpected periods of negative correlation. By applying the
indigenous CM technique for one-way clustering to real Agro-Met data (2001-2002), a possible explanation of the anomaly has been presented in this thesis.
Leveraging expression and network data for protein function prediction
2012 Summer. Includes bibliographical references.
Protein function prediction is one of the prominent problems in bioinformatics today. Protein annotation is slowly falling behind as more and more genomes are being sequenced. Experimental methods are expensive and time consuming, which leaves computational methods to fill the gap. While computational methods are still not accurate enough to be used without human supervision, this is the goal. The Gene Ontology (GO) is a collection of terms that are the standard for protein function annotations. Because of the structure of GO, protein function prediction is a hierarchical multi-label classification problem. The classification method used in this thesis is GOstruct, which performs structured predictions that take into account all GO terms. GOstruct has been shown to work well, but there are still improvements to be made. In this thesis, I work to improve predictions by building new kernels from the data that are used by GOstruct. To do this, I find key representations of the data that help define what kernels perform best on the variety of data types. I apply this methodology to function prediction in two model organisms, Saccharomyces cerevisiae and Mus musculus, and find better methods for interpreting the data.
TriSig: Assessing the statistical significance of triclusters
Tensor data analysis allows researchers to uncover novel patterns and
relationships that cannot be obtained from matrix data alone. The information
inferred from the patterns provides valuable insights into disease progression,
bioproduction processes, weather fluctuations, and group dynamics. However,
spurious and redundant patterns hamper this process. This work aims at
proposing a statistical frame to assess the probability of patterns in tensor
data to deviate from null expectations, extending well-established principles
for assessing the statistical significance of patterns in matrix data. A
comprehensive discussion on binomial testing for false positive discoveries is
provided in light of: variable dependencies, temporal dependencies and
misalignments, and p-value corrections under the Benjamini-Hochberg
procedure. Results gathered from the application of state-of-the-art
triclustering algorithms over distinct real-world case studies in biochemical
and biotechnological domains confer validity to the proposed statistical frame
while revealing vulnerabilities of some triclustering searches. The proposed
assessment can be incorporated into existing triclustering algorithms to
mitigate false positive/spurious discoveries and further prune the search
space, reducing their computational complexity.
Availability: The code is freely available at
https://github.com/JupitersMight/TriSig under the MIT license
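The Benjamini-Hochberg correction mentioned in this abstract can be sketched in a few lines. This is a generic step-up adjustment written for illustration, not code from the TriSig repository; the function name and the example p-values are assumptions.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four candidate patterns.
pvals = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(pvals)
# Patterns kept at a 5% false discovery rate (threshold is an assumption).
significant = [i for i, p in enumerate(adjusted) if p <= 0.05]
```

Each raw p-value is multiplied by the number of tests and divided by its rank, and the backward pass keeps the adjusted values non-decreasing in the rank order.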
Finding large average submatrices in high dimensional data
The search for sample-variable associations is an important problem in the
exploratory analysis of high dimensional data. Biclustering methods search for
sample-variable associations in the form of distinguished submatrices of the
data matrix. (The rows and columns of a submatrix need not be contiguous.) In
this paper we propose and evaluate a statistically motivated biclustering
procedure (LAS) that finds large average submatrices within a given real-valued
data matrix. The procedure operates in an iterative-residual fashion, and is
driven by a Bonferroni-based significance score that effectively trades off
between submatrix size and average value. We examine the performance and
potential utility of LAS, and compare it with a number of existing methods,
through an extensive three-part validation study using two gene expression
datasets. The validation study examines quantitative properties of biclusters,
biological and clinical assessments using auxiliary information, and
classification of disease subtypes using bicluster membership. In addition, we
carry out a simulation study to assess the effectiveness and noise sensitivity
of the LAS search procedure. These results suggest that LAS is an effective
exploratory tool for the discovery of biologically relevant structures in high
dimensional data. Software is available at https://genome.unc.edu/las/.
Comment: Published at http://dx.doi.org/10.1214/09-AOAS239 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
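The trade-off between submatrix size and average value described in the LAS abstract can be sketched with a Bonferroni-style score. The form used below, the negative log of the number of k-by-l submatrices times a Gaussian tail probability, is an assumption about how such a score can be written for standardized data; it should be checked against the paper before being relied on. The function and parameter names are hypothetical.

```python
from math import comb, erfc, log, sqrt


def las_style_score(m, n, k, l, avg):
    """Bonferroni-style score for a k-by-l submatrix of an m-by-n matrix.

    Assumes entries are standard normal under the null, so the submatrix
    average `avg` has standard deviation 1 / sqrt(k * l).
    """
    # Upper Gaussian tail: P(Z >= avg * sqrt(k * l)).
    tail = 0.5 * erfc(avg * sqrt(k * l) / sqrt(2.0))
    # Note: for very large arguments `tail` underflows to 0 and log() fails.
    return -(log(comb(m, k)) + log(comb(n, l)) + log(tail))


# Hypothetical 100-by-50 data matrix and two candidate 5-by-5 submatrices.
low = las_style_score(100, 50, 5, 5, avg=1.0)
high = las_style_score(100, 50, 5, 5, avg=2.0)
```

The two combinatorial terms penalize the number of submatrices of that shape, while the tail term rewards a large average, so a higher average at the same size gives a higher score.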