17 research outputs found

    Discovering significant OPSM subspace clusters in massive gene expression data

    Order-preserving submatrices (OPSMs) have been accepted as a biologically meaningful subspace cluster model, capturing the general tendency of gene expression across a subset of conditions. In an OPSM, the expression levels of all genes induce the same linear ordering of the conditions. OPSM mining is reducible to a special case of the sequential pattern mining problem, in which a pattern and its supporting sequences uniquely specify an OPSM cluster. Small twig clusters, specified by long patterns with naturally low support, incur explosive computational costs and would be completely pruned away by most existing methods on massive datasets containing thousands of conditions and hundreds of thousands of genes, which are common in today's gene expression analysis. However, it is of particular interest to biologists to reveal such small groups of genes that are tightly coregulated under many conditions, and some pathways or processes might require only two genes to act in concert. In this paper, we introduce the KiWi mining framework for massive datasets, which exploits two parameters, k and w, to provide biased testing of a bounded number of candidates, substantially reducing the search space and problem scale while targeting highly promising seeds that lead to significant clusters and twig clusters. Extensive biological and computational evaluations on real datasets demonstrate that KiWi can effectively mine biologically meaningful OPSM subspace clusters with good efficiency and scalability.
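
    As a rough illustration of the OPSM model described above, the sketch below checks whether a set of rows of an expression matrix forms an order-preserving submatrix on a chosen subset of columns, by testing whether every row induces the same linear ordering of those columns. The function name and toy data are illustrative assumptions; this is not the KiWi algorithm, which additionally has to search for such row and column subsets efficiently.

        import numpy as np

        def is_opsm(matrix, rows, cols):
            """True if the selected rows form an OPSM on the selected columns,
            i.e. every row induces the same linear ordering of the columns
            (assumes strictly distinct values, so no ties)."""
            sub = np.asarray(matrix)[np.ix_(rows, cols)]
            reference = np.argsort(sub[0])  # column order induced by the first row
            return all(np.array_equal(np.argsort(row), reference) for row in sub[1:])

        # Toy expression matrix: 3 genes x 4 conditions.
        expr = [[0.2, 1.5, 0.9, 2.1],
                [3.0, 7.0, 5.5, 9.0],
                [1.0, 0.5, 2.0, 0.1]]

        print(is_opsm(expr, rows=[0, 1], cols=[0, 1, 2, 3]))  # True: both rows order the columns 0 < 2 < 1 < 3
        print(is_opsm(expr, rows=[0, 2], cols=[0, 1, 2, 3]))  # False: row 2 induces a different ordering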

    A New Approach for Mining Order-Preserving Submatrices Based on All Common Subsequences

    Order-preserving submatrices (OPSMs) have been applied as an important unsupervised learning model in many fields, such as DNA microarray data analysis, automatic recommendation systems, and target marketing systems. Unfortunately, most existing methods are heuristic algorithms that cannot reveal all OPSMs, since the underlying problem is NP-complete. In particular, deep OPSMs, corresponding to long patterns with few supporting sequences, incur explosive computational costs and are completely pruned by most popular methods. In this paper, we propose an exact method to discover all OPSMs based on frequent sequential pattern mining. First, an existing algorithm was adjusted to find all common subsequences (ACS) between every pair of row sequences, so that no deep OPSM is missed. Then, an improved prefix-tree data structure was used to store and traverse the ACS, and the Apriori principle was employed to efficiently mine frequent sequential patterns. Finally, experiments were carried out on gene and synthetic datasets; the results demonstrate the effectiveness and efficiency of this method.
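
    To make the row-sequence view concrete, the brute-force sketch below represents each row as the sequence of column indices sorted by value and enumerates all common subsequences of two such sequences. This is only a small illustration of the ACS idea on short sequences; the paper's method relies on an adjusted ACS algorithm, a prefix tree, and Apriori-style pruning rather than exhaustive enumeration.

        from itertools import combinations

        def row_to_sequence(row):
            """Represent a row as the sequence of column indices in increasing
            order of value (assumes no ties)."""
            return tuple(sorted(range(len(row)), key=lambda j: row[j]))

        def is_subsequence(pattern, seq):
            it = iter(seq)
            return all(x in it for x in pattern)

        def all_common_subsequences(seq_a, seq_b, min_len=2):
            """Brute-force enumeration of the common subsequences of two short sequences."""
            acs = set()
            for length in range(min_len, len(seq_a) + 1):
                for cand in combinations(seq_a, length):  # subsequences of seq_a, order preserved
                    if is_subsequence(cand, seq_b):
                        acs.add(cand)
            return acs

        row1 = [0.2, 1.5, 0.9, 2.1, 0.4]  # column order: 0, 4, 2, 1, 3
        row2 = [3.0, 9.0, 5.5, 7.0, 4.0]  # column order: 0, 4, 2, 3, 1
        print(sorted(all_common_subsequences(row_to_sequence(row1), row_to_sequence(row2))))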

    Mining Order-Preserving Submatrices from Data with Repeated Measurements


    A Fixed Parameter Tractable Integer Program for Finding the Maximum Order Preserving Submatrix

    Order-preserving submatrices are an important tool for the analysis of gene expression data. As finding large order-preserving submatrices is a computationally hard problem, previous work has investigated both exact but exponential-time and polynomial-time but inexact algorithms for finding large order-preserving submatrices. In this paper, we propose a novel exact algorithm to find maximum order-preserving submatrices that is fixed-parameter tractable with respect to the number of columns of the provided gene expression data. In particular, our algorithm is based on solving a sequence of mixed integer linear programs, and it exhibits better guarantees as well as better runtime performance than state-of-the-art exact algorithms. Our empirical study on benchmark datasets shows large improvements in computational speed.
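
    The fixed-parameter flavour of the problem can be seen in the brute-force baseline below, which is exact and whose running time is exponential only in the number of columns (and linear in the number of rows): it enumerates every ordered subset of columns and counts the rows that strictly increase along that order. This sketch is not the paper's MILP-based algorithm; it only illustrates why parameterizing by the column count makes an exact search feasible when the matrix has few columns.

        from itertools import permutations

        def max_opsm_bruteforce(matrix, min_cols=2):
            """Exact maximum-area OPSM by enumerating ordered column subsets.
            Exponential in the number of columns, linear in the number of rows."""
            n_cols = len(matrix[0])
            best = (0, [], ())  # (area, row indices, column order)
            for k in range(min_cols, n_cols + 1):
                for order in permutations(range(n_cols), k):
                    rows = [i for i, row in enumerate(matrix)
                            if all(row[a] < row[b] for a, b in zip(order, order[1:]))]
                    area = len(rows) * k
                    if area > best[0]:
                        best = (area, rows, order)
            return best

        expr = [[0.2, 1.5, 0.9, 2.1],
                [3.0, 7.0, 5.5, 9.0],
                [1.0, 0.5, 2.0, 0.1]]
        print(max_opsm_bruteforce(expr))  # (8, [0, 1], (0, 2, 1, 3)): 2 rows x 4 columns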

    Application Oriented Analysis of Large Scale Datasets

    Diverse application areas, such as social networks, epidemiology, and software engineering, consist of systems of objects and their relationships. Such systems are generally modeled as graphs. Graphs consist of vertices that represent the objects and edges that represent the relationships between them. These systems are data intensive, and it is important to analyze the data correctly to obtain meaningful information. Combinatorial metrics can provide useful insights for analyzing these systems. In this thesis, we use graph-based metrics such as betweenness centrality, clustering coefficient, and articulation points for analyzing instances of large change in evolving networks (software engineering) and identifying points of similarity (gene expression data). Computing combinatorial properties is expensive, and most real-world networks are not static; as a network evolves, these properties have to be recomputed. In the last part of the thesis, we develop a fast algorithm that avoids redundant recomputation of communities in dynamic networks.
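
    As a small illustration of the graph metrics mentioned above, the snippet below computes betweenness centrality, clustering coefficients, and articulation points with the networkx library on a toy graph (two triangles joined by an edge); the graph is made up for the example and is not data from the thesis.

        import networkx as nx

        # Two triangles (0-1-2 and 3-4-5) connected by the edge 2-3.
        G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 3)])

        betweenness = nx.betweenness_centrality(G)         # fraction of shortest paths through each vertex
        clustering = nx.clustering(G)                      # local clustering coefficient of each vertex
        cut_vertices = sorted(nx.articulation_points(G))   # vertices whose removal disconnects the graph

        print(betweenness)
        print(clustering)
        print(cut_vertices)  # [2, 3]: removing either one separates the two triangles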

    On Solving Selected Nonlinear Integer Programming Problems in Data Mining, Computational Biology, and Sustainability

    This thesis consists of three essays concerning the use of optimization techniques to solve four problems in the fields of data mining, computational biology, and sustainable energy devices. To the best of our knowledge, the particular problems we discuss have not been previously addressed using optimization, which is a specific contribution of this dissertation. In particular, we analyze each of the problems to capture its underlying essence, subsequently demonstrating that each problem can be modeled as a nonlinear (mixed) integer program. We then discuss the design and implementation of solution techniques to locate optimal solutions to the aforementioned problems. Running throughout this dissertation is the theme of using mixed-integer programming techniques in conjunction with context-dependent algorithms to identify optimal and previously undiscovered underlying structure.

    New approaches for clustering high dimensional data

    Clustering is one of the most effective methods for analyzing datasets that contain a large number of objects with numerous attributes. Clustering seeks to identify groups, or clusters, of similar objects. In low-dimensional space, the similarity between objects is often evaluated by summing the differences across all of their attributes. High-dimensional data, however, may contain irrelevant attributes which mask the existence of clusters. Discovering groups of objects that are highly similar within some subset of relevant attributes therefore becomes an important but challenging task. My thesis focuses on various models and algorithms for this task. We first present a flexible clustering model, namely OP-Cluster (Order Preserving Cluster). Under this model, two objects are similar on a subset of attributes if the values of these two objects induce the same relative ordering of those attributes. The OP-Clustering algorithm has been shown to be useful for identifying co-regulated genes in gene expression data. We also propose a semi-supervised approach to discover biologically meaningful OP-Clusters by incorporating existing gene function classifications into the clustering process; this semi-supervised algorithm yields only OP-Clusters that are significantly enriched with genes from specific functional categories. Real datasets are often noisy, so we propose a noise-tolerant clustering algorithm for mining frequently occurring itemsets, called approximate frequent itemsets (AFI). Both theoretical and experimental results demonstrate that our AFI mining algorithm has higher recoverability of real clusters than existing itemset mining approaches. Pair-wise dissimilarities are often derived from the original data to reduce the complexity of high-dimensional data. Traditional clustering algorithms that take pair-wise dissimilarities as input often generate disjoint clusters, and it is well known that the classification model represented by disjoint clusters is inconsistent with many real classifications, such as gene function classifications. We therefore develop the PoClustering algorithm, which generates overlapping clusters from pair-wise dissimilarities. We prove that, by allowing overlapping clusters, PoClustering fully preserves the information of any dissimilarity matrix, while traditional partitioning algorithms may cause significant information loss.
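
    The error-tolerance idea behind AFI can be sketched as follows: on a 0/1 transaction matrix, a block of transactions and items qualifies if every row and every column of the block contains at most a given fraction of zeros. The function name, thresholds, and toy data below are illustrative assumptions, not the thesis's exact definition or mining algorithm (which searches for such blocks rather than merely verifying one).

        import numpy as np

        def is_approximate_frequent_itemset(data, rows, items, eps_row=0.34, eps_col=0.34):
            """Check an AFI-style error-tolerance condition on a binary matrix:
            each selected row and each selected item may have at most the given
            fraction of zeros within the block (here roughly one zero in three)."""
            block = np.asarray(data)[np.ix_(rows, items)]
            row_ok = (1 - block.mean(axis=1) <= eps_row).all()  # zero fraction per row
            col_ok = (1 - block.mean(axis=0) <= eps_col).all()  # zero fraction per column
            return bool(row_ok and col_ok)

        # Toy 0/1 transaction data: 4 transactions x 5 items.
        data = [[1, 1, 1, 0, 0],
                [1, 1, 1, 1, 0],
                [1, 0, 1, 1, 0],
                [0, 1, 1, 1, 1]]

        print(is_approximate_frequent_itemset(data, rows=[0, 1, 2], items=[0, 1, 2, 3]))  # True: few zeros
        print(is_approximate_frequent_itemset(data, rows=[0, 1, 2, 3], items=[3, 4]))     # False: too many zeros (item 4 mostly absent)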

    An approach to clustering biological phenotypes /

    Recently emerging approaches to high-throughput phenotyping have become important tools in unraveling the biological basis of agronomically and medically important phenotypes. These experiments produce very large sets of either low- or high-dimensional data. Finding clusters in the entire space of high-dimensional data (HDD) is a challenging task, because the relative distances between any two objects converge to zero with increasing dimensionality. Additionally, real data may not be mathematically well behaved. Finally, many clusters are expected on biological grounds to be "natural", that is, to have irregular, overlapping boundaries in different subsets of the dimensions. More precisely, the natural clusters of the data may differ in shape, size, density, and dimensionality, and they might not be disjoint. In principle, clustering such data could be done by dimension reduction methods. However, these methods convert many dimensions to a smaller set of dimensions, which makes the clustering results difficult to interpret and may also lead to a significant loss of information. Another possible approach is to find subspaces (subsets of dimensions) in the entire data space of the HDD. However, existing subspace methods do not discover natural clusters. Therefore, in this dissertation I propose a novel data preprocessing method, demonstrating that groups of phenotypes are interdependent, and a novel density-based subspace clustering algorithm for high-dimensional data, called Dynamic Locally Density Adaptive Scalable Subspace Clustering (DynaDASC). This algorithm is locally density adaptive, scalable, dynamic, and nonmetric in nature, and it discovers natural clusters.
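
    The distance-concentration effect mentioned above (relative distances between objects converging as dimensionality grows) can be illustrated with a quick simulation; the snippet below is only a numerical illustration of that motivation, not part of DynaDASC.

        import numpy as np

        rng = np.random.default_rng(0)

        # For increasing dimensionality, compare the nearest and farthest distances
        # from a random query to 1000 uniformly random points: the relative gap
        # (contrast) shrinks as the dimension grows.
        for dim in (2, 10, 100, 1000):
            points = rng.uniform(size=(1000, dim))
            query = rng.uniform(size=dim)
            dists = np.linalg.norm(points - query, axis=1)
            contrast = (dists.max() - dists.min()) / dists.min()
            print(f"dim={dim:5d}  relative contrast={contrast:.3f}")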