A Constraint Programming Approach for Mining Sequential Patterns in a Sequence Database
Constraint-based pattern discovery is at the core of numerous data mining
tasks. Patterns are extracted with respect to a given set of constraints
(frequency, closedness, size, etc.). In the context of sequential pattern
mining, a large number of dedicated techniques have been developed for solving
particular classes of constraints. The aim of this paper is to investigate the
use of Constraint Programming (CP) to model and mine sequential patterns in a
sequence database. Our CP approach offers a natural way to combine, within a
single framework, a large set of constraints of various origins. Experiments
show the feasibility and the interest of our approach
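To make the constraints named in this abstract concrete, here is a minimal sketch of checking a frequency constraint and a size constraint on one candidate pattern. It is plain Python rather than a CP model, and the names `is_subsequence`, `support`, `satisfies`, and the toy database are assumptions made for illustration, not taken from the paper.

```python
from typing import List, Sequence


def is_subsequence(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so order is respected.
    return all(item in it for item in pattern)


def support(pattern: Sequence[str], database: List[Sequence[str]]) -> int:
    """Number of database sequences that contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def satisfies(pattern: Sequence[str], database: List[Sequence[str]],
              min_support: int, max_size: int) -> bool:
    """Conjunction of a frequency constraint and a size constraint."""
    return len(pattern) <= max_size and support(pattern, database) >= min_support


# Toy sequence database (illustrative data only).
db = [list("abcd"), list("acbd"), list("abd"), list("cd")]
print(support(list("abd"), db))
print(satisfies(list("abd"), db, min_support=3, max_size=3))
```

A CP formulation would state the same conditions declaratively and leave the search to a solver; the sketch only shows what each constraint tests.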
Context-aware visual exploration of molecular databases
Facilitating the visual exploration of scientific data has
received increasing attention in the past decade or so. Especially
in life science related application areas the amount
of available data has grown at a breathtaking pace. In this
paper we describe an approach that allows for visual inspection
of large collections of molecular compounds. In
contrast to classical visualizations of such spaces we incorporate
a specific focus of analysis, for example the outcome
of a biological experiment such as high-throughput
screening results. The presented method uses this experimental
data to select molecular fragments of the underlying
molecules that have interesting properties and uses the
resulting space to generate a two-dimensional map based
on a singular value decomposition algorithm and a self-organizing
map. Experiments on real datasets show that
the resulting visual landscape groups molecules of similar
chemical properties in densely connected regions
Prefix-Projection Global Constraint for Sequential Pattern Mining
Sequential pattern mining under constraints is a challenging data mining
task. Many efficient ad hoc methods have been developed for mining sequential
patterns, but they all suffer from a lack of genericity. Recent works
have investigated Constraint Programming (CP) methods, but they are still not
effective because of their encoding. In this paper, we propose a global
constraint based on the projected databases principle which remedies this
drawback. Experiments show that our approach clearly outperforms CP approaches
and competes well with ad hoc methods on large datasets
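The projected databases principle mentioned in this abstract can be sketched as a small recursive miner in the style of PrefixSpan. This is only an illustration of the principle on sequences of single items; it is not the global constraint or its filtering algorithm, and the names `project`, `mine`, and the toy database are assumptions.

```python
from typing import Dict, List, Tuple


def project(database: List[List[str]], item: str) -> List[List[str]]:
    """For each sequence containing `item`, keep the suffix after its first occurrence."""
    return [seq[seq.index(item) + 1:] for seq in database if item in seq]


def mine(database: List[List[str]], min_support: int,
         prefix: Tuple[str, ...] = ()) -> Dict[Tuple[str, ...], int]:
    """Enumerate frequent sequential patterns by recursive prefix projection."""
    # Count each item at most once per sequence.
    counts: Dict[str, int] = {}
    for seq in database:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1

    results: Dict[Tuple[str, ...], int] = {}
    for item in sorted(counts):
        if counts[item] >= min_support:
            pattern = prefix + (item,)
            results[pattern] = counts[item]
            # Only the projected database is scanned when extending the prefix.
            results.update(mine(project(database, item), min_support, pattern))
    return results


# Toy sequence database (illustrative data only).
db = [list("abcd"), list("acbd"), list("abd"), list("cd")]
patterns = mine(db, min_support=3)
for pattern, count in sorted(patterns.items()):
    print("".join(pattern), count)
```

Each recursive call scans only the suffixes that can still extend the current prefix, which is what keeps the search space small compared with re-scanning the whole database for every candidate.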
Mining Biclusters of Similar Values with Triadic Concept Analysis
Biclustering numerical data became a popular data-mining task in the
early 2000s, especially for analysing gene expression data. A bicluster
reflects a strong association between a subset of objects and a subset of
attributes in a numerical object/attribute data table. So-called biclusters of
similar values can be thought of as maximal sub-tables with close values. Only
a few methods address a complete, correct and non-redundant enumeration of such
patterns, which is a well-known intractable problem, while no formal framework
exists. In this paper, we introduce important links between biclustering and
formal concept analysis. More specifically, we show for the first time that
Triadic Concept Analysis (TCA) provides a nice mathematical framework for
biclustering. Interestingly, existing TCA algorithms, which usually apply to
binary data, can be used (directly or with slight modifications) after a
preprocessing step for extracting maximal biclusters of similar values. Comment: Concept Lattices and their Applications (CLA) (2011
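The notion of a maximal sub-table with close values can be illustrated with a short check. The sketch assumes that "close" means all values in the sub-table lie within a tolerance `theta` of each other; the abstract does not give the exact criterion, so that definition, the function names, and the table are assumptions. It does not use Triadic Concept Analysis.

```python
from typing import List, Sequence


def is_similar(table: List[List[float]], rows: Sequence[int],
               cols: Sequence[int], theta: float) -> bool:
    """True if all values of the sub-table differ by at most `theta`."""
    values = [table[r][c] for r in rows for c in cols]
    return bool(values) and max(values) - min(values) <= theta


def is_maximal(table: List[List[float]], rows: Sequence[int],
               cols: Sequence[int], theta: float) -> bool:
    """True if the sub-table is similar and no row or column can be added."""
    if not is_similar(table, rows, cols, theta):
        return False
    for r in range(len(table)):
        if r not in rows and is_similar(table, list(rows) + [r], cols, theta):
            return False
    for c in range(len(table[0])):
        if c not in cols and is_similar(table, rows, list(cols) + [c], theta):
            return False
    return True


# Illustrative object/attribute table (rows are objects, columns are attributes).
table = [
    [1, 2, 2, 1, 6],
    [2, 1, 1, 0, 6],
    [2, 2, 1, 7, 6],
    [8, 9, 2, 6, 7],
]
print(is_maximal(table, [0, 1, 2], [0, 1, 2], theta=1))
print(is_maximal(table, [0, 1], [0, 1], theta=1))
```

The second call reports a similar but non-maximal sub-table, since a third row can still be added without exceeding the tolerance.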
Mining frequent patterns for AMP-activated protein kinase regulation on skeletal muscle
BACKGROUND: AMP-activated protein kinase (AMPK) has emerged as a significant signaling intermediary that regulates metabolism in response to energy demand and supply. An investigation into the degree of activation and deactivation of AMPK subunits under exercise can provide valuable data for understanding AMPK. In particular, the effect of AMPK on muscle cellular energy status makes this protein a promising pharmacological target for disease treatment. As more AMPK regulation data are accumulated, data mining techniques can play an important role in identifying frequent patterns in the data. Association rule mining, which is commonly used in market basket analysis, can be applied to AMPK regulation. RESULTS: This paper proposes a framework that can identify the potential correlation, either between the state of isoforms of α, β and γ subunits of AMPK, or between stimulus factors and the state of isoforms. Our approach is to apply item constraints in the closed interpretation to the itemset generation so that a threshold is specified in terms of the amount of results, rather than a fixed threshold value for all itemsets of all sizes. The rules derived from the experiments are given a preliminary analysis. It is found that most of the extracted association rules have biological meaning and some of them were previously unknown. They indicate directions for further research. CONCLUSION: Our findings indicate that AMPK has a great impact on most metabolic actions that are related to energy demand and supply. Those actions are adjusted via its subunit isoforms under specific physical training. Thus, there are strong correlations between AMPK subunit isoforms and exercises. Furthermore, the subunit isoforms are correlated with each other in some cases. The methods developed here could be used when predicting these essential relationships and enable an understanding of the functions and metabolic pathways regarding AMPK
Strong compound-risk factors: Efficient discovery through emerging patterns and contrast sets
Odds ratio (OR), relative risk (RR) (risk ratio), and absolute risk reduction (ARR) (risk difference) are biostatistics measurements that are widely used for identifying significant risk factors in dichotomous groups of subjects. In the past, they have often been used to assess simple risk factors. In this paper, we introduce the concept of compound-risk factors to broaden the applicability of these statistical tests for assessing factor interplays. We observe that compound-risk factors with a high risk ratio or a large risk difference have a one-to-one correspondence to strong emerging patterns or strong contrast sets, two types of patterns that have been extensively studied in the data mining field. Such a relationship has been unknown to researchers in the past, and efficient algorithms for discovering strong compound-risk factors have been lacking. In this paper, we propose a theoretical framework and a new algorithm that unify the discovery of compound-risk factors that have a strong OR, risk ratio, or risk difference. Our method guarantees that all patterns meeting a certain test threshold can be efficiently discovered. Our contribution thus represents the first of its kind in linking the risk ratios and ORs to pattern mining algorithms, making it possible to find compound-risk factors in large-scale data sets. In addition, we show that using compound-risk factors can improve classification accuracy in probabilistic learning algorithms on several disease data sets, because these compound-risk factors capture the interdependency between important data attributes. © 2007 IEEE
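The three measures named in this abstract can be computed for a compound factor from a 2x2 contingency table. The sketch below treats a subject as exposed when it carries every factor in the compound, and takes the risk difference as exposed risk minus unexposed risk; that sign convention, the function names, and the data are assumptions for illustration, and the paper's discovery algorithm is not reproduced. All four cells are assumed to be non-zero.

```python
from typing import FrozenSet, List, Tuple

Subject = Tuple[FrozenSet[str], bool]  # (factors present, has the disease)


def contingency(pattern: FrozenSet[str],
                subjects: List[Subject]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d): exposed/diseased, exposed/healthy,
    unexposed/diseased, unexposed/healthy."""
    a = b = c = d = 0
    for factors, diseased in subjects:
        exposed = pattern <= factors  # subject carries every factor in the compound
        if exposed and diseased:
            a += 1
        elif exposed:
            b += 1
        elif diseased:
            c += 1
        else:
            d += 1
    return a, b, c, d


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    return (a * d) / (b * c)


def relative_risk(a: int, b: int, c: int, d: int) -> float:
    return (a / (a + b)) / (c / (c + d))


def risk_difference(a: int, b: int, c: int, d: int) -> float:
    return a / (a + b) - c / (c + d)


# Illustrative subjects; the factor names are hypothetical.
compound = frozenset({"smoker", "obese"})
subjects: List[Subject] = (
    [(frozenset({"smoker", "obese"}), True)] * 3
    + [(frozenset({"smoker", "obese", "other"}), False)]
    + [(frozenset({"smoker"}), True)]
    + [(frozenset({"obese"}), False)] * 3
)
cells = contingency(compound, subjects)
print(cells, odds_ratio(*cells), relative_risk(*cells), risk_difference(*cells))
```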
Online Spectral Clustering on Network Streams
Graphs are an extremely useful representation of a wide variety of practical systems in data analysis. Recently, with the fast accumulation of stream data from various types of networks, significant research interest has arisen in spectral clustering for network streams (or evolving networks). Compared with the general spectral clustering problem, the data analysis of this new type of problem may have additional requirements, such as short processing time, scalability in distributed computing environments, and temporal variation tracking. However, designing a spectral clustering method that satisfies these requirements takes non-trivial effort. There are three major challenges for the new algorithm design. The first challenge is online clustering computation. Most of the existing spectral methods on evolving networks are off-line methods, using standard eigensystem solvers such as the Lanczos method. They need to recompute solutions from scratch at each time point. The second challenge is the parallelization of algorithms. Parallelizing such algorithms is non-trivial since standard eigensolvers are iterative algorithms and the number of iterations cannot be predetermined. The third challenge is the very limited existing work. In addition, there exist multiple limitations in the existing method, such as computational inefficiency on large similarity changes, the lack of a sound theoretical basis, and the lack of an effective way to handle accumulated approximation errors and large data variations over time. In this thesis, we propose a new online spectral graph clustering approach with a family of three novel spectrum approximation algorithms. Our algorithms incrementally update the eigenpairs in an online manner to improve the computational performance. Our approaches outperform the existing method in computational efficiency and scalability while retaining competitive or even better clustering accuracy.
We derived our spectrum approximation techniques GEPT and EEPT through formal theoretical analysis. The well-established matrix perturbation theory forms a solid theoretical foundation for our online clustering method. We equipped our clustering method with a new metric to track accumulated approximation errors and measure the short-term temporal variation. The metric not only provides a balance between computational efficiency and clustering accuracy, but also offers a useful tool to adapt the online algorithm to unexpected drastic noise. In addition, we discussed our preliminary work on approximate graph mining with evolutionary process, non-stationary Bayesian Network structure learning from non-stationary time series data, and Bayesian Network structure learning with text priors imposed by non-parametric hierarchical topic modeling
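For orientation, the standard first-order result of matrix perturbation theory that incremental eigenpair updates of this kind typically build on is stated below. It assumes a symmetric matrix $A$ with distinct eigenvalues $\lambda_i$ and orthonormal eigenvectors $x_i$, perturbed by a small symmetric $\Delta A$. The abstract does not spell out the GEPT and EEPT formulas, so this is textbook background rather than the thesis's own derivation.

```latex
% First-order perturbation of the eigenpairs of a symmetric matrix A,
% with A x_i = \lambda_i x_i, x_i^\top x_j = \delta_{ij}, and distinct \lambda_i.
\begin{align}
  \tilde{\lambda}_i &\approx \lambda_i + x_i^\top \, \Delta A \, x_i, \\
  \tilde{x}_i &\approx x_i + \sum_{j \neq i}
      \frac{x_j^\top \, \Delta A \, x_i}{\lambda_i - \lambda_j} \, x_j .
\end{align}
```

Because both expressions drop second-order terms, the approximation error accumulates over successive updates, which is the kind of drift an error-tracking metric is meant to detect.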
The Bases of Association Rules of High Confidence
We develop a new approach for distributed computing of the association rules
of high confidence in a binary table. It is derived from the D-basis algorithm
in K. Adaricheva and J.B. Nation (TCS 2017), which is performed on multiple
sub-tables of a table given by removing several rows at a time. The set of
rules is then aggregated using the same approach by which the D-basis is
retrieved from a larger set of implications. This makes it possible to obtain a
basis of association rules of high confidence, which can be used for ranking
all attributes of the table with respect to a given fixed attribute using the
relevance parameter introduced in K. Adaricheva et al. (Proceedings of
ICFCA-2015). This paper focuses on the technical implementation of the new
algorithm. Tests are performed on transaction data and medical data. Comment: Presented at DTMN, Sydney, Australia, July 28, 201
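As a reminder of what "high confidence" means for a rule in a binary table, the sketch below computes the confidence of a rule and lists the single-attribute rules that reach a threshold for a fixed target attribute. It is not the D-basis algorithm, and ranking by confidence is a stand-in for the relevance parameter, which the abstract does not define; all names and the table are assumptions.

```python
from typing import List, Set, Tuple


def confidence(rows: List[Set[str]], premise: Set[str], conclusion: str) -> float:
    """Fraction of rows containing `premise` that also contain `conclusion`."""
    matching = [row for row in rows if premise <= row]
    if not matching:
        return 0.0
    return sum(conclusion in row for row in matching) / len(matching)


def high_confidence_rules(rows: List[Set[str]], target: str,
                          min_conf: float) -> List[Tuple[str, float]]:
    """Rules `a -> target` with a single-attribute premise and confidence >= min_conf,
    sorted by decreasing confidence and then by attribute name."""
    attributes = set().union(*rows) - {target}
    rules = [(a, confidence(rows, {a}, target)) for a in attributes]
    kept = [(a, c) for a, c in rules if c >= min_conf]
    return sorted(kept, key=lambda rule: (-rule[1], rule[0]))


# Illustrative binary table: each row is the set of attributes equal to 1.
rows = [
    {"a", "b", "t"},
    {"a", "t"},
    {"a", "b"},
    {"b", "t"},
    {"c"},
    {"d", "t"},
]
print(confidence(rows, {"a", "b"}, "t"))
print(high_confidence_rules(rows, "t", min_conf=0.6))
```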