195 research outputs found
Mining Order-Preserving Submatrices from Data with Repeated Measurements
On Solving Selected Nonlinear Integer Programming Problems in Data Mining, Computational Biology, and Sustainability
This thesis consists of three essays concerning the use of optimization techniques to solve four problems in the fields of data mining, computational biology, and sustainable energy devices. To the best of our knowledge, the particular problems we discuss have not been previously addressed using optimization, which is a specific contribution of this dissertation. In particular, we analyze each of the problems to capture their underlying essence, subsequently demonstrating that each problem can be modeled as a nonlinear (mixed) integer program. We then discuss the design and implementation of solution techniques to locate optimal solutions to the aforementioned problems. Running throughout this dissertation is the theme of using mixed-integer programming techniques in conjunction with context-dependent algorithms to identify optimal and previously undiscovered underlying structure
DNA Microarray Data Analysis: A New Survey on Biclustering
Some subsets of genes behave similarly under certain subsets of conditions, in which case we say they coexpress, yet behave independently under other subsets of conditions. Discovering such coexpression can help uncover genomic knowledge such as gene networks or gene interactions. It is therefore important to cluster genes and conditions simultaneously, identifying clusters of genes that are coexpressed under clusters of conditions. This type of clustering is called biclustering. Biclustering is an NP-hard problem. Consequently, heuristic algorithms are typically used to approximate it by finding suboptimal solutions. In this paper, we present a new survey on biclustering of gene expression data, also called microarray data
Extracting regulatory modules from gene expression data by sequential pattern mining
Background: Identifying a regulatory module (RM), a bi-set of co-regulated genes and co-regulating conditions (or samples), has been an important challenge in functional genomics and bioinformatics. Given a microarray gene-expression matrix, biclustering has been the most common method for extracting RMs. Among biclustering methods, order-preserving biclustering by a sequential pattern mining technique has a native advantage over conventional biclustering approaches, since it preserves the order of genes (or conditions) according to the magnitude of the expression value. However, previous sequential pattern mining-based biclustering has several weak points: it can easily become computationally intractable on real-size microarray data, and it is sensitive to inherent noise in the expression value. Results: In this paper, we propose a novel sequential pattern mining algorithm that is scalable in the size of microarray data and robust with respect to noise. When applied to the microarray data of yeast, the proposed algorithm successfully found long order-preserving patterns, which are biologically significant but cannot be found in randomly shuffled data. The resulting patterns are well enriched for known annotations and are consistent with known biological knowledge. Furthermore, RMs as well as inter-module relations were inferred from the biologically significant patterns. Conclusions: Our approach for identifying RMs could be valuable for systematically revealing the mechanism of gene regulation at a genome-wide level.
QUBIC: a qualitative biclustering algorithm for analyses of gene expression data
Biclustering extends the traditional clustering techniques by attempting to find (all) subgroups of genes with similar expression patterns under to-be-identified subsets of experimental conditions when applied to gene expression data. Still the real power of this clustering strategy is yet to be fully realized due to the lack of effective and efficient algorithms for reliably solving the general biclustering problem. We report a QUalitative BIClustering algorithm (QUBIC) that can solve the biclustering problem in a more general form, compared to existing algorithms, through employing a combination of qualitative (or semi-quantitative) measures of gene expression data and a combinatorial optimization technique. One key unique feature of the QUBIC algorithm is that it can identify all statistically significant biclusters including biclusters with the so-called ‘scaling patterns’, a problem considered to be rather challenging; another key unique feature is that the algorithm solves such general biclustering problems very efficiently, capable of solving biclustering problems with tens of thousands of genes under up to thousands of conditions in a few minutes of the CPU time on a desktop computer. We have demonstrated a considerably improved biclustering performance by our algorithm compared to the existing algorithms on various benchmark sets and data sets of our own. QUBIC was written in ANSI C and tested using GCC (version 4.1.2) on Linux. Its source code is available at: http://csbl.bmb.uga.edu/∼maqin/bicluster. A server version of QUBIC is also available upon request
A New Approach for Mining Order-Preserving Submatrices Based on All Common Subsequences
Order-preserving submatrices (OPSMs) have been applied in many fields, such as DNA microarray data analysis, automatic recommendation systems, and target marketing systems, as an important unsupervised learning model. Unfortunately, because the problem is NP-complete, most existing methods are heuristic algorithms that cannot reveal all OPSMs. In particular, deep OPSMs, corresponding to long patterns with few supporting sequences, incur explosive computational costs and are completely pruned by most popular methods. In this paper, we propose an exact method to discover all OPSMs based on frequent sequential pattern mining. First, an existing algorithm was adjusted to find all common subsequences (ACS) between every pair of row sequences, so that no deep OPSMs are missed. Then, an improved prefix-tree data structure was used to store and traverse the ACS, and the Apriori principle was employed to mine frequent sequential patterns efficiently. Finally, experiments were run on gene and synthetic datasets. The results demonstrated the effectiveness and efficiency of this method
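To make the OPSM notion concrete, the following is a minimal sketch of a check for whether a chosen set of rows and columns forms an order-preserving submatrix. It is not the mining method of the abstract above. It assumes the usual definition (some ordering of the selected columns makes every selected row strictly increasing) and treats ties as violations; the function name `is_order_preserving` is hypothetical.

```python
from typing import List, Sequence


def is_order_preserving(matrix: Sequence[Sequence[float]],
                        rows: List[int],
                        cols: List[int]) -> bool:
    """Return True if the submatrix (rows x cols) is order-preserving.

    Assumption: a submatrix is order-preserving when one ordering of the
    selected columns makes the values of every selected row strictly
    increasing. Ties are treated as violations.
    """
    if not rows or not cols:
        return True  # an empty selection is trivially order-preserving

    # The only candidate ordering is the one induced by the first row.
    first = matrix[rows[0]]
    order = sorted(cols, key=lambda c: first[c])

    # Every selected row (including the first) must increase along it.
    for r in rows:
        values = [matrix[r][c] for c in order]
        if any(a >= b for a, b in zip(values, values[1:])):
            return False
    return True


if __name__ == "__main__":
    data = [
        [1, 5, 3],
        [10, 50, 30],
        [2, 9, 4],
        [7, 1, 8],
    ]
    print(is_order_preserving(data, [0, 1, 2], [0, 1, 2]))
    print(is_order_preserving(data, [0, 1, 2, 3], [0, 1, 2]))
```

Verifying one candidate is cheap; the difficulty described in the abstract lies in searching over all row and column subsets.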
Faster optimal univariate microaggregation
Microaggregation is a method to coarsen a dataset, by optimally clustering
data points in groups of at least k points, thereby providing a k-anonymity
type disclosure guarantee for each point in the dataset. Previous algorithms
for univariate microaggregation had a time complexity. By rephrasing
microaggregation as an instance of the concave least weight subsequence
problem, in this work we provide improved algorithms that provide an optimal
univariate microaggregation on sorted data in time and space. We further
show that our algorithms work not only for sum of squares cost functions, as
typically considered, but seamlessly extend to many other cost functions used
for univariate microaggregation tasks. In experiments we show that the
presented algorithms lead to real world performance improvements
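As an illustration of the problem being solved, here is a minimal sketch of a straightforward dynamic program for univariate microaggregation on sorted data with a sum-of-squares cost. It is a baseline, not the faster algorithm of the abstract above. It relies on the known property that, for this cost, an optimal solution exists whose groups are contiguous and contain between k and 2k-1 points; the function name `microaggregate` and its return format are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def microaggregate(sorted_values: Sequence[float],
                   k: int) -> Tuple[float, List[Tuple[int, int]]]:
    """Optimal sum-of-squares microaggregation of already sorted data.

    Returns (total_cost, groups), where each group is a half-open index
    range (start, end) into sorted_values with at least k points.
    Baseline dynamic program; group sizes are limited to k..2k-1.
    """
    n = len(sorted_values)
    if k < 1 or n < k:
        raise ValueError("need k >= 1 and at least k data points")

    # Prefix sums of values and squared values give O(1) group costs.
    s1, s2 = [0.0], [0.0]
    for v in sorted_values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def sse(i: int, j: int) -> float:
        """Sum of squared deviations from the mean for values[i:j]."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    best = [math.inf] * (n + 1)  # best[j]: optimal cost of the first j points
    prev = [-1] * (n + 1)        # prev[j]: start index of the last group
    best[0] = 0.0
    for j in range(k, n + 1):
        for i in range(max(0, j - 2 * k + 1), j - k + 1):
            cost = best[i] + sse(i, j)
            if cost < best[j]:
                best[j], prev[j] = cost, i

    # Walk back through prev to recover the group boundaries.
    groups, j = [], n
    while j > 0:
        groups.append((prev[j], j))
        j = prev[j]
    groups.reverse()
    return best[n], groups


if __name__ == "__main__":
    print(microaggregate([1, 2, 3, 10, 11, 12], k=3))
```

Each position considers at most k candidate group starts, so the cost grows with both the number of points and k.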
Mining biological information from 3D short time-series gene expression data: the OPTricluster algorithm
Background: Nowadays, it is possible to collect expression levels of a set of genes from a set of biological samples during a series of time points. Such data have three dimensions: gene-sample-time (GST). Thus they are called 3D microarray gene expression data. To take advantage of the 3D data collected, and to fully understand the biological knowledge hidden in the GST data, novel subspace clustering algorithms have to be developed to effectively address the biological problem in the corresponding space. Results: We developed a subspace clustering algorithm called Order Preserving Triclustering (OPTricluster), for 3D short time-series data mining. OPTricluster is able to identify 3D clusters with coherent evolution from a given 3D dataset using a combinatorial approach on the sample dimension, and the order preserving (OP) concept on the time dimension. The fusion of the two methodologies allows one to study similarities and differences between samples in terms of their temporal expression profile. OPTricluster has been successfully applied to four case studies: immune response in mice infected by malaria (Plasmodium chabaudi), systemic acquired resistance in Arabidopsis thaliana, similarities and differences between inner and outer cotyledon in Brassica napus during seed development, and Brassica napus whole seed development. These studies showed that OPTricluster is robust to noise and is able to detect the similarities and differences between biological samples. Conclusions: Our analysis showed that OPTricluster generally outperforms other well known clustering algorithms such as the TRICLUSTER, gTRICLUSTER and K-means; it is robust to noise and can effectively mine the biological knowledge hidden in the 3D short time-series gene expression data.
Data Mining Using the Crossing Minimization Paradigm
Our ability and capacity to generate, record and store multi-dimensional, apparently
unstructured data is increasing rapidly, while the cost of data storage is going down. The data recorded is not perfect, as noise gets introduced in it from different sources. Some of the basic forms of noise are incorrect recording of values and missing values. The formal study of discovering useful hidden information in the data is called Data Mining.
Because of the size, and complexity of the problem, practical data mining problems are
best attempted using automatic means.
Data Mining can be categorized into two types: supervised learning (classification) and unsupervised learning (clustering). Clustering only the records in a database (or data matrix) gives a global view of the data and is called one-way clustering. For a detailed analysis or a local view, biclustering (also called co-clustering or two-way clustering) is required, which clusters the records and the attributes simultaneously.
In this dissertation, a novel, fast, and white-noise-tolerant data mining solution is
proposed based on the Crossing Minimization (CM) paradigm; the solution works for
one-way as well as two-way clustering for discovering overlapping biclusters. For
decades the CM paradigm has traditionally been used for graph drawing and VLSI
(Very Large Scale Integration) circuit design for reducing wire length and congestion. The utility of the proposed technique is demonstrated by comparing it with other biclustering techniques using simulated noisy, as well as real data from Agriculture, Biology and other domains.
Two other interesting and hard problems also addressed in this dissertation are (i) the
Minimum Attribute Subset Selection (MASS) problem and (ii) Bandwidth
Minimization (BWM) problem of sparse matrices. The proposed CM technique is
demonstrated to provide very convincing results while attempting to solve the said
problems using real public domain data.
Pakistan is the fourth largest supplier of cotton in the world. An apparent anomaly has
been observed during 1989-97 between cotton yield and pesticide consumption in
Pakistan showing unexpected periods of negative correlation. By applying the
indigenous CM technique for one-way clustering to real Agro-Met data (2001-2002), a possible explanation of the anomaly has been presented in this thesis
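The idea of using crossing minimization for two-way clustering can be sketched as follows. The abstract does not say which crossing-minimization heuristic the dissertation uses, so this example assumes the barycenter heuristic, a standard one from graph drawing: a binary data matrix is read as a bipartite graph between records and attributes, and rows and columns are repeatedly reordered by the average position of their neighbours. The function name `barycenter_reorder` is hypothetical.

```python
from typing import List, Sequence, Tuple


def barycenter_reorder(matrix: Sequence[Sequence[int]],
                       iterations: int = 10) -> Tuple[List[int], List[int]]:
    """Reorder rows and columns of a binary matrix with the barycenter heuristic.

    Assumption: the matrix is the biadjacency matrix of a bipartite graph
    (rows = records, columns = attributes). Returns (row_order, col_order).
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    row_order = list(range(n_rows))
    col_order = list(range(n_cols))

    def mean_position(positions: List[int], default: float) -> float:
        # Vertices without neighbours are pushed to the end.
        return sum(positions) / len(positions) if positions else default

    for _ in range(iterations):
        # Place each row at the mean position of the columns it touches.
        col_pos = {c: p for p, c in enumerate(col_order)}
        new_rows = sorted(row_order, key=lambda r: mean_position(
            [col_pos[c] for c in range(n_cols) if matrix[r][c]], float(n_cols)))

        # Place each column at the mean position of the rows it touches.
        row_pos = {r: p for p, r in enumerate(new_rows)}
        new_cols = sorted(col_order, key=lambda c: mean_position(
            [row_pos[r] for r in range(n_rows) if matrix[r][c]], float(n_rows)))

        if new_rows == row_order and new_cols == col_order:
            break  # the ordering has stabilised
        row_order, col_order = new_rows, new_cols

    return row_order, col_order


if __name__ == "__main__":
    data = [
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
    ]
    rows, cols = barycenter_reorder(data)
    for r in rows:
        print([data[r][c] for c in cols])
```

After reordering, rows and columns that share many entries tend to sit next to each other, so dense blocks in the permuted matrix can be read off as candidate biclusters.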