Search CORE

195 research outputs found

Mining Order-Preserving Submatrices from Data with Repeated Measurements

Author: Cheung DWL
Chui CK
Kao BCM
Lee SD
Yip KY
Zhu X
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2013


HKU Scholars Hub

On Solving Selected Nonlinear Integer Programming Problems in Data Mining, Computational Biology, and Sustainability

Author: Trapp Andrew Christopher
Publication date: 27/06/2011

This thesis consists of three essays concerning the use of optimization techniques to solve four problems in the fields of data mining, computational biology, and sustainable energy devices. To the best of our knowledge, the particular problems we discuss have not been previously addressed using optimization, which is a specific contribution of this dissertation. In particular, we analyze each of the problems to capture their underlying essence, subsequently demonstrating that each problem can be modeled as a nonlinear (mixed) integer program. We then discuss the design and implementation of solution techniques to locate optimal solutions to the aforementioned problems. Running throughout this dissertation is the theme of using mixed-integer programming techniques in conjunction with context-dependent algorithms to identify optimal and previously undiscovered underlying structure

D-Scholarship@Pitt

DNA Microarray Data Analysis: A New Survey on Biclustering

Author: Ben Saber Haifa
ELLOUMI Mourad
Publication venue: International Journal for Computational Biology (IJCB)
Publication date: 03/04/2015

There are subsets of genes that behave similarly under some subsets of conditions, in which case we say that they are coexpressed, but behave independently under other subsets of conditions. Discovering such coexpression can help uncover genomic knowledge such as gene networks or gene interactions. It is therefore of utmost importance to cluster genes and conditions simultaneously, in order to identify clusters of genes that are coexpressed under clusters of conditions. This type of clustering is called biclustering. Biclustering is an NP-hard problem. Consequently, heuristic algorithms are typically used to approximate this problem by finding suboptimal solutions. In this paper, we make a new survey on biclustering of gene expression data, also called microarray data

International Journal for Computational Biology (IJCB)

Extracting regulatory modules from gene expression data by sequential pattern mining

Author: Hyunjung Shin
Je-Gun Joung
Ju Han Kim
Mingoo Kim
Tae Su Chung
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011

Background: Identifying a regulatory module (RM), a bi-set of co-regulated genes and co-regulating conditions (or samples), has been an important challenge in functional genomics and bioinformatics. Given a microarray gene-expression matrix, biclustering has been the most common method for extracting RMs. Among biclustering methods, order-preserving biclustering by a sequential pattern mining technique has a native advantage over conventional biclustering approaches, since it preserves the order of genes (or conditions) according to the magnitude of the expression value. However, previous sequential pattern mining-based biclustering methods have several weak points: they can easily become computationally intractable on real-size microarray data, and they are sensitive to inherent noise in the expression values. Results: In this paper, we propose a novel sequential pattern mining algorithm that is scalable in the size of microarray data and robust with respect to noise. When applied to the microarray data of yeast, the proposed algorithm successfully found long order-preserving patterns, which are biologically significant but cannot be found in randomly shuffled data. The resulting patterns are well enriched for known annotations and are consistent with known biological knowledge. Furthermore, RMs as well as inter-module relations were inferred from the biologically significant patterns. Conclusions: Our approach for identifying RMs could be valuable for systematically revealing the mechanism of gene regulation at a genome-wide level.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

QUBIC: a qualitative biclustering algorithm for analyses of gene expression data

Author: Aguilar-Ruiz
Andrew H. Paterson
Armstrong
Ashburner
Barkow
Ben-dor
Bryan
Carmona-Saez
Castillo-Davis
Cheng
Eisen
Faith
Gasch
Getz
Golub
Guojun Li
Haibao Tang
Hartigan
Huttenhower
Ihmels
Kanehisa
Keseler
Kluger
Kung
Li
Liu
Madeira
McLachlan
Morgan
Murali
Prelic
Qin Ma
Reiss
Ruepp
Shamir
Tanay
Xu
Yeung
Ying Xu
Publication venue: Oxford University Press
Publication date: 01/01/2009

Biclustering extends the traditional clustering techniques by attempting to find (all) subgroups of genes with similar expression patterns under to-be-identified subsets of experimental conditions when applied to gene expression data. Still, the real power of this clustering strategy is yet to be fully realized due to the lack of effective and efficient algorithms for reliably solving the general biclustering problem. We report a QUalitative BIClustering algorithm (QUBIC) that can solve the biclustering problem in a more general form, compared to existing algorithms, by combining qualitative (or semi-quantitative) measures of gene expression data with a combinatorial optimization technique. One key unique feature of the QUBIC algorithm is that it can identify all statistically significant biclusters, including biclusters with the so-called ‘scaling patterns’, a problem considered to be rather challenging; another key unique feature is that the algorithm solves such general biclustering problems very efficiently, handling problems with tens of thousands of genes under up to thousands of conditions in a few minutes of CPU time on a desktop computer. We have demonstrated a considerably improved biclustering performance by our algorithm compared to the existing algorithms on various benchmark sets and data sets of our own. QUBIC was written in ANSI C and tested using GCC (version 4.1.2) on Linux. Its source code is available at: http://csbl.bmb.uga.edu/~maqin/bicluster. A server version of QUBIC is also available upon request

CiteSeerX

Crossref

PubMed Central

A New Approach for Mining Order-Preserving Submatrices Based on All Common Subsequences

Author: Jie Luo
Meihang Li
Qiuhua Kuang
Tiechen Li
Xiaohui Hu
Yun Xue
Zhengling Liao
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2015

Order-preserving submatrices (OPSMs) have been applied in many fields, such as DNA microarray data analysis, automatic recommendation systems, and target marketing systems, as an important unsupervised learning model. Unfortunately, most existing methods are heuristic algorithms, which cannot reveal all OPSMs because the problem is NP-complete. In particular, deep OPSMs, corresponding to long patterns with few supporting sequences, incur explosive computational costs and are completely pruned by most popular methods. In this paper, we propose an exact method to discover all OPSMs based on frequent sequential pattern mining. First, an existing algorithm was adjusted to find all common subsequences (ACS) between every two row sequences, so that no deep OPSM is missed. Then, an improved prefix-tree data structure was used to store and traverse the ACS, and the Apriori principle was employed to mine frequent sequential patterns efficiently. Finally, experiments were carried out on gene and synthetic datasets. The results demonstrated the effectiveness and efficiency of this method
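
As a rough illustration of the "row sequence" view used in sequential-pattern approaches to OPSM mining, the sketch below converts each matrix row into the sequence of column indices sorted by value, and counts the rows in which a candidate column pattern appears as a subsequence. It is not the ACS algorithm from this paper. The function names (`row_to_sequence`, `is_subsequence`, `supporting_rows`) and the toy matrix are hypothetical, and ties between values are assumed not to occur.

```python
from typing import List, Sequence


def row_to_sequence(row: Sequence[float]) -> List[int]:
    """Return the column indices of `row` ordered by increasing value."""
    return sorted(range(len(row)), key=lambda j: row[j])


def is_subsequence(pattern: Sequence[int], sequence: Sequence[int]) -> bool:
    """True if `pattern` occurs in `sequence` in order (not necessarily contiguously)."""
    it = iter(sequence)
    # `col in it` advances the iterator, so each match must come after the previous one.
    return all(col in it for col in pattern)


def supporting_rows(matrix: Sequence[Sequence[float]], pattern: Sequence[int]) -> List[int]:
    """Indices of rows whose values strictly increase along the columns in `pattern`."""
    return [
        i for i, row in enumerate(matrix)
        if is_subsequence(pattern, row_to_sequence(row))
    ]


if __name__ == "__main__":
    toy = [
        [1.0, 5.0, 3.0],  # column order by value: 0, 2, 1
        [2.0, 9.0, 4.0],  # column order by value: 0, 2, 1
        [7.0, 1.0, 3.0],  # column order by value: 1, 2, 0
    ]
    print(supporting_rows(toy, [0, 2, 1]))
    print(supporting_rows(toy, [2, 0]))
```

In this view, the rows returned for a pattern, together with the pattern's columns, form one candidate OPSM. A "deep" OPSM in the abstract's sense would be a long pattern supported by only a few rows.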

Crossref

Directory of Open Access Journals

Faster optimal univariate microaggregation

Author: Schaub Michael T.
Stamm Felix I.
Publication date: 04/01/2024

Microaggregation is a method to coarsen a dataset, by optimally clustering data points in groups of at least $k$ points, thereby providing a $k$-anonymity type disclosure guarantee for each point in the dataset. Previous algorithms for univariate microaggregation had a $O(kn)$ time complexity. By rephrasing microaggregation as an instance of the concave least weight subsequence problem, in this work we provide improved algorithms that provide an optimal univariate microaggregation on sorted data in $O(n)$ time and space. We further show that our algorithms work not only for sum of squares cost functions, as typically considered, but seamlessly extend to many other cost functions used for univariate microaggregation tasks. In experiments we show that the presented algorithms lead to real world performance improvements
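
For orientation, here is a minimal sketch of the simpler dynamic program that the "previous algorithms" baseline refers to. It is not the faster algorithm proposed in this paper. It assumes the sum-of-squares cost and uses the standard observation that, for this cost, an optimal solution exists in which every group has between k and 2k-1 points, since a larger group can be split without increasing the cost. The function name `microaggregate` and its return format are hypothetical.

```python
from typing import List, Tuple


def microaggregate(values: List[float], k: int) -> Tuple[float, List[Tuple[int, int]]]:
    """Optimal univariate microaggregation under the sum-of-squares cost.

    Returns (total_cost, groups), where each group is a half-open index
    range (start, end) into the sorted values. Runs in O(k * n) time
    after sorting.
    """
    xs = sorted(values)
    n = len(xs)
    if k < 1 or n < k:
        raise ValueError("need k >= 1 and at least k points")

    # Prefix sums of x and x^2 give the cost of any group in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def cost(i: int, j: int) -> float:
        """Sum of squared deviations from the mean for xs[i:j]."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    best = [inf] * (n + 1)  # best[j]: minimal cost of grouping xs[:j]
    prev = [-1] * (n + 1)   # prev[j]: start index of the last group in that solution
    best[0] = 0.0
    for j in range(k, n + 1):
        # The last group xs[i:j] must have between k and 2k-1 points.
        for i in range(max(0, j - 2 * k + 1), j - k + 1):
            if best[i] == inf:
                continue
            candidate = best[i] + cost(i, j)
            if candidate < best[j]:
                best[j] = candidate
                prev[j] = i

    # Walk the back-pointers to recover the groups.
    groups: List[Tuple[int, int]] = []
    j = n
    while j > 0:
        groups.append((prev[j], j))
        j = prev[j]
    groups.reverse()
    return best[n], groups


if __name__ == "__main__":
    total, groups = microaggregate([12, 1, 10, 3, 2, 11], k=3)
    print(total, groups)
```

The inner loop over at most k candidate start indices is what gives this baseline its O(kn) cost. The abstract's contribution is to replace that step by treating the recurrence as a concave least weight subsequence problem.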

arXiv.org e-Print Archive

Mining biological information from 3D short time-series gene expression data: the OPTricluster algorithm

Author: A Ben-Dor
A Cernetich
A Tanay
AB Tchagang
Adrian Cutler
AH Tewfik
Alain B Tchagang
Daiqing Huang
E Eisenberg
Fazel Famili
H Jiang
Heather Shearer
IP Androulakis
J Ernst
J Mu
JB Ohlrogge
JD Jones
Jitao Zou
L Wang
L Zhao
M Shah
MB Eisen
MN Arbeitman
P Tamayo
Pierre Fobert
PT Spellman
S Baud
S Bleuler
S Tavazoie
SA Braybrook
SC Madeira
Sieu Phan
T Syeda-Mahmood
TS Gardner
Yi Huang
Youlian Pan
Ziying Liu
Publication venue: BioMed Central
Publication date: 01/01/2012

Background: Nowadays, it is possible to collect expression levels of a set of genes from a set of biological samples during a series of time points. Such data have three dimensions: gene-sample-time (GST). Thus they are called 3D microarray gene expression data. To take advantage of the 3D data collected, and to fully understand the biological knowledge hidden in the GST data, novel subspace clustering algorithms have to be developed to effectively address the biological problem in the corresponding space. Results: We developed a subspace clustering algorithm called Order Preserving Triclustering (OPTricluster), for 3D short time-series data mining. OPTricluster is able to identify 3D clusters with coherent evolution from a given 3D dataset using a combinatorial approach on the sample dimension, and the order preserving (OP) concept on the time dimension. The fusion of the two methodologies allows one to study similarities and differences between samples in terms of their temporal expression profile. OPTricluster has been successfully applied to four case studies: immune response in mice infected by malaria (Plasmodium chabaudi), systemic acquired resistance in Arabidopsis thaliana, similarities and differences between inner and outer cotyledon in Brassica napus during seed development, and to Brassica napus whole seed development. These studies showed that OPTricluster is robust to noise and is able to detect the similarities and differences between biological samples. Conclusions: Our analysis showed that OPTricluster generally outperforms other well known clustering algorithms such as the TRICLUSTER, gTRICLUSTER and K-means; it is robust to noise and can effectively mine the biological knowledge hidden in the 3D short time-series gene expression data.

NRC Publications Archive

Crossref

Springer

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Data Mining Using the Crossing Minimization Paradigm

Author: Abdullah Ahsan
Publication venue: University of Stirling
Publication date: 01/01/2007

Our ability and capacity to generate, record and store multi-dimensional, apparently unstructured data is increasing rapidly, while the cost of data storage is going down. The data recorded is not perfect, as noise gets introduced in it from different sources. Some of the basic forms of noise are incorrect recording of values and missing values. The formal study of discovering useful hidden information in the data is called Data Mining. Because of the size, and complexity of the problem, practical data mining problems are best attempted using automatic means. Data Mining can be categorized into two types i.e. supervised learning or classification and unsupervised learning or clustering. Clustering only the records in a database (or data matrix) gives a global view of the data and is called one-way clustering. For a detailed analysis or a local view, biclustering or co-clustering or two-way clustering is required involving the simultaneous clustering of the records and the attributes. In this dissertation, a novel fast and white noise tolerant data mining solution is proposed based on the Crossing Minimization (CM) paradigm; the solution works for one-way as well as two-way clustering for discovering overlapping biclusters. For decades the CM paradigm has traditionally been used for graph drawing and VLSI (Very Large Scale Integration) circuit design for reducing wire length and congestion. The utility of the proposed technique is demonstrated by comparing it with other biclustering techniques using simulated noisy, as well as real data from Agriculture, Biology and other domains. Two other interesting and hard problems also addressed in this dissertation are (i) the Minimum Attribute Subset Selection (MASS) problem and (ii) Bandwidth Minimization (BWM) problem of sparse matrices. The proposed CM technique is demonstrated to provide very convincing results while attempting to solve the said problems using real public domain data. 
Pakistan is the fourth largest supplier of cotton in the world. An apparent anomaly has been observed during 1989-97 between cotton yield and pesticide consumption in Pakistan showing unexpected periods of negative correlation. By applying the indigenous CM technique for one-way clustering to real Agro-Met data (2001-2002), a possible explanation of the anomaly has been presented in this thesis

CiteSeerX

Stirling Online Research Repository