FP-tree and COFI Based Approach for Mining of Multiple Level Association Rules in Large Databases
In recent years, discovery of association rules among itemsets in a large
database has been described as an important database-mining problem. The
problem of discovering association rules has received considerable research
attention and several algorithms for mining frequent itemsets have been
developed. Many algorithms have been proposed to discover rules at a single
concept level. However, mining association rules at multiple concept levels may
lead to the discovery of more specific and concrete knowledge from data. The
discovery of multiple level association rules is very useful in many
applications. In most of the studies on multiple level association rule
mining, the database is scanned repeatedly, which affects the efficiency of the
mining process. In this research paper, a new method for discovering multilevel
association rules is proposed. It is based on the FP-tree structure and uses a
co-occurrence frequent item tree to find frequent items in a multilevel concept
hierarchy.

Comment: Pages IEEE format, International Journal of Computer Science and
Information Security (IJCSIS), Vol. 7 No. 2, February 2010, USA. ISSN 1947-5500,
http://sites.google.com/site/ijcsis
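The multilevel setting can be illustrated with a minimal sketch (this is not the paper's FP-tree/COFI algorithm): each transaction item is extended with its ancestors in a concept hierarchy, and support is counted at every level, so that general concepts can be frequent even when their specializations are not. The item names and taxonomy are illustrative assumptions.

```python
from collections import Counter

# Toy concept hierarchy: item -> parent (None at the top level).
# Names are illustrative, not taken from the paper.
taxonomy = {
    "skim milk": "milk", "2% milk": "milk", "milk": "food",
    "white bread": "bread", "wheat bread": "bread", "bread": "food",
}

def ancestors(item):
    """Yield the item and all of its ancestors in the hierarchy."""
    while item is not None:
        yield item
        item = taxonomy.get(item)

transactions = [
    {"skim milk", "white bread"},
    {"2% milk", "white bread"},
    {"skim milk", "wheat bread"},
]

# Count the support of every item at every concept level.
support = Counter()
for t in transactions:
    extended = set()
    for item in t:
        extended.update(ancestors(item))
    support.update(extended)

min_support = 2
frequent = {i for i, c in support.items() if c >= min_support}
# "milk" reaches min_support (3 transactions) even though
# "2% milk" alone does not (1 transaction).
```

An FP-tree-based method would now build a compact tree over these extended transactions instead of rescanning the database for each level, which is the efficiency gain the abstract refers to.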
Pattern Detection with Rare Item-set Mining
The discovery of new and interesting patterns in large datasets, known as
data mining, draws more and more interest as the quantities of available data
are exploding. Data mining techniques may be applied to various domains and
fields such as computer science, the health sector, insurance, homeland
security, banking and finance. In this paper, we are interested in the
discovery of a specific category of patterns, known as rare and non-present
patterns. We
present a novel approach towards the discovery of non-present patterns using
rare item-set mining.

Comment: 17 pages, 5 figures, International Journal on Soft Computing,
Artificial Intelligence and Applications (IJSCAI), Vol. 1, No. 1, August 201
A Tight Upper Bound on the Number of Candidate Patterns
In the context of mining for frequent patterns using the standard levelwise
algorithm, the following question arises: given the current level and the
current set of frequent patterns, what is the maximal number of candidate
patterns that can be generated on the next level? We answer this question by
providing a tight upper bound, derived from a combinatorial result from the
sixties by Kruskal and Katona. Our result is useful for reducing the number of
database scans.
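A bound of this kind can be evaluated from the cascade (binomial) representation of the number of frequent patterns. The sketch below assumes the standard Kruskal-Katona-style formulation; the paper's exact statement may differ in detail, so treat this as an illustration rather than the paper's result.

```python
from math import comb

def cascade(m, k):
    """Greedy cascade (k-binomial) representation of m:
    m = C(a_k, k) + C(a_{k-1}, k-1) + ... with a_k > a_{k-1} > ..."""
    rep = []
    while m > 0 and k > 0:
        a = k
        while comb(a + 1, k) <= m:
            a += 1
        rep.append((a, k))
        m -= comb(a, k)
        k -= 1
    return rep

def max_candidates(m, k):
    """Upper bound on the number of (k+1)-candidates given
    m frequent k-itemsets, via the Kruskal-Katona-style formula."""
    return sum(comb(a, j + 1) for a, j in cascade(m, k))

# 6 frequent 2-itemsets = all pairs over 4 items (a K4 clique):
# at most C(4, 3) = 4 candidate triples can be generated.
print(max_candidates(6, 2))  # -> 4
```

Knowing this bound before generating the next level lets a levelwise miner decide, for instance, whether the remaining candidates fit in memory and a further database scan can be avoided.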
Mining Generalized Patterns from Large Databases using Ontologies
Formal Concept Analysis (FCA) is a mathematical theory based on the
formalization of the notions of concept and concept hierarchies. It has been
successfully applied to several Computer Science fields such as data
mining, software engineering, and knowledge engineering, and in many domains
like medicine, psychology, linguistics and ecology. For instance, it has been
exploited for the design, mapping and refinement of ontologies. In this paper,
we show how FCA can benefit from a given domain ontology by analyzing the
impact of a taxonomy (on objects and/or attributes) on the resulting concept
lattice. We will mainly concentrate on the use of a taxonomy to extract
generalized patterns (i.e., knowledge generated from data when elements of a
given domain ontology are used) in the form of concepts and rules, and improve
navigation through these patterns. To that end, we analyze three generalization
cases and show their impact on the size of the generalized pattern set.
Different scenarios of simultaneous generalizations on both objects and
attributes are also discussed.
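The FCA machinery the abstract builds on can be shown in a toy form: every formal concept pairs a maximal set of objects with the attributes they all share, and the concepts can be enumerated by closing object sets. The context below is an invented example, not data from the paper.

```python
from itertools import combinations

# Toy formal context: object -> set of attributes (illustrative only).
context = {
    "duck":    {"flies", "swims", "animal"},
    "swan":    {"flies", "swims", "animal"},
    "penguin": {"swims", "animal"},
}
attributes = set().union(*context.values())

def extent(attrs):
    """Objects possessing all attributes in attrs (the prime operator)."""
    return {o for o, oa in context.items() if attrs <= oa}

def intent(objs):
    """Attributes common to all objects in objs."""
    if not objs:
        return set(attributes)
    return set.intersection(*(context[o] for o in objs))

# Enumerate concepts as (extent, intent) pairs, keeping only
# closed object sets (those with extent(intent(X)) == X).
concepts = set()
for r in range(len(context) + 1):
    for objs in combinations(context, r):
        e = extent(intent(set(objs)))
        concepts.add((frozenset(e), frozenset(intent(e))))
```

Generalizing objects or attributes through a taxonomy (e.g. replacing "duck" and "swan" by a "waterfowl" class) merges rows or columns of the context, which is exactly the effect on lattice size that the paper analyzes.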
Temporal data mining for root-cause analysis of machine faults in automotive assembly lines
Engine assembly is a complex and heavily automated distributed-control
process, with large amounts of fault data logged every day. We describe an
application of temporal data mining for analyzing fault logs in an engine
assembly plant. The frequent episode discovery framework is a model-free method
that can be used to deduce (temporal) correlations among events from the logs
in an efficient manner. In addition to being theoretically elegant and
computationally efficient, frequent episodes are also easy to interpret in the
form of actionable recommendations. Incorporation of domain-specific information
is critical to successful application of the method for analyzing fault logs in
the manufacturing domain. We show how domain-specific knowledge can be
incorporated using heuristic rules that act as pre-filters and post-filters to
frequent episode discovery. The system described here is currently being used
in one of the engine assembly plants of General Motors and is planned for
adaptation in other plants. To the best of our knowledge, this paper presents
the first real, large-scale application of temporal data mining in the
manufacturing domain. We believe that the ideas presented in this paper can
help practitioners engineer tools for analysis in other similar or related
application domains as well.
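The core frequency measure used in frequent episode discovery can be sketched in a few lines: count non-overlapped occurrences of a serial episode (an ordered tuple of event types) in the event sequence. This is a minimal illustration under that standard definition; the event codes are invented and this is not the deployed plant system.

```python
def count_nonoverlapped(sequence, episode):
    """Count non-overlapped occurrences of a serial episode
    (an ordered tuple of event types) in an event sequence."""
    count, i = 0, 0
    for event in sequence:
        if event == episode[i]:
            i += 1
            if i == len(episode):  # one full occurrence completed
                count += 1
                i = 0               # restart after the occurrence ends
    return count

# A fault log as a sequence of event types (codes are illustrative).
log = ["A", "B", "A", "C", "B", "C", "A", "B", "C"]
print(count_nonoverlapped(log, ("A", "B", "C")))  # -> 2
```

In the levelwise framework the paper describes, candidate episodes are grown one event at a time and only those whose count clears a frequency threshold are kept; the domain-specific pre-filters and post-filters would restrict which event types and episodes are considered at all.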
Abstract Representations and Frequent Pattern Discovery
We discuss the frequent pattern mining problem in a general setting. From an
analysis of abstract representations, summarization and frequent pattern
mining, we arrive at a generalization of the problem. Then, we show how the
problem can be cast into the powerful language of algorithmic information
theory. This allows us to formulate a simple algorithm to mine for all frequent
patterns.
Intelligent Search of Correlated Alarms from Database containing Noise Data
Alarm correlation plays an important role in improving service and
reliability in modern telecommunications networks. Most previous research on
alarm correlation did not consider the effect of noise data in the database.
This paper focuses on a method for discovering alarm correlation rules from a
database containing noise data. We first define two parameters, Win_freq and
Win_add, as measures of noise data, and then present the Robust_search
algorithm to solve the problem. Experiments with alarm data containing noise,
at different sizes of Win_freq and Win_add, show that the Robust_search
algorithm discovers more rules as the size of Win_add increases. We also
experimentally compare two different interestingness measures: confidence and
correlation.

Comment: 15 pages, 4 figures
ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"
This paper documents the release of the ELKI data mining framework, version
0.7.5.
ELKI is open-source (AGPLv3) data mining software written in Java. The
focus of ELKI is research in algorithms, with an emphasis on unsupervised
methods in cluster analysis and outlier detection. In order to achieve high
performance and scalability, ELKI offers data index structures such as the
R*-tree that can provide major performance gains. ELKI is designed to be easy
to extend for researchers and students in this domain, and welcomes
contributions of additional methods. ELKI aims at providing a large collection
of highly parameterizable algorithms, in order to allow easy and fair
evaluation and benchmarking of algorithms.
We first outline the motivation for this release and our plans for the
future, and then give a brief overview of the new functionality in this
version. We also include an appendix presenting an overview of the overall
implemented functionality.
Efficient Web Log Mining using Doubly Linked Tree
World Wide Web is a huge data repository and is growing at an explosive
rate of about 1 million pages a day. As the information available on the World
Wide Web grows, the usage of web sites also grows. Web logs record each
access to a web page, and the number of entries in the web logs is increasing
rapidly. These web logs, when mined properly, can provide useful information for
decision-making. Web site designers, analysts, and management
executives are interested in extracting this hidden information from web logs
for decision-making. Web access patterns, i.e., frequently used sequences
of accesses, are among the important pieces of information that can be mined from the web
logs. This information can be used to gather business intelligence to improve
sales and advertising, to personalize the site for a user, to analyze system
performance, and to improve the web site organization. Many techniques exist
to mine access patterns from web logs. This paper describes a
powerful algorithm that mines web logs efficiently. The proposed algorithm
first converts the available web access data into a special doubly linked tree.
Each access is called an event. This tree keeps the critical mining-related
information in a very compressed form based on the frequent event count. The proposed
recursive algorithm uses this tree to efficiently find all access patterns that
satisfy user-specified criteria. To show that our algorithm is more efficient
than other GSP (Generalized Sequential Pattern) algorithms, we have conducted
experimental studies on sample data.

Comment: 5 pages, International Journal of Computer Science and Information
Security, ISSN 1947-5500, Impact Factor 0.42
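The count-compressed tree idea can be illustrated with a plain prefix tree over sessions, where each node stores how often its path was accessed (this is a simplification, not the paper's doubly linked tree; session data is invented).

```python
class Node:
    """Prefix-tree node: one web-page event plus an access count."""
    def __init__(self):
        self.count = 0
        self.children = {}

def build_tree(sessions):
    """Insert each session as a path, incrementing counts along it."""
    root = Node()
    for session in sessions:
        node = root
        for page in session:
            node = node.children.setdefault(page, Node())
            node.count += 1
    return root

def frequent_paths(node, min_count, prefix=()):
    """Yield every access path whose count meets the threshold."""
    for page, child in node.children.items():
        if child.count >= min_count:
            path = prefix + (page,)
            yield path, child.count
            yield from frequent_paths(child, min_count, path)

sessions = [
    ["home", "products", "cart"],
    ["home", "products", "cart"],
    ["home", "about"],
]
patterns = dict(frequent_paths(build_tree(sessions), min_count=2))
# ("home", "products", "cart") survives with count 2;
# ("home", "about") is pruned at min_count=2.
```

Note this toy counts only paths starting at the session's first event; the paper's recursive algorithm mines general access patterns from its tree, and its doubly linked structure supports that traversal.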
A framework for redescription set construction
Redescription mining is a field of knowledge discovery that aims at finding
different descriptions of similar subsets of instances in the data. These
descriptions are represented as rules inferred from one or more disjoint sets
of attributes, called views. As such, they support knowledge discovery process
and help domain experts in formulating new hypotheses or constructing new
knowledge bases and decision support systems. In contrast to previous
approaches that typically create one smaller set of redescriptions satisfying a
pre-defined set of constraints, we introduce a framework that creates a large
and heterogeneous redescription set from which a user or domain expert can
extract compact sets with differing properties, according to their own
preferences. Construction of this large and heterogeneous redescription set
relies on the CLUS-RM algorithm and a novel conjunctive refinement procedure
that facilitates the generation of larger and more accurate redescription sets.
The work also addresses the variability of redescription accuracy when missing
values are present in the data, which significantly extends the applicability
of the method. A crucial part of the framework is redescription set extraction
based on a heuristic multi-objective optimization procedure that allows the
user to assign importance levels to one or more redescription quality criteria.
We provide both theoretical and empirical comparisons of the novel framework
against current state-of-the-art redescription mining algorithms and show that
it represents a more efficient and versatile approach for mining redescriptions
from data.
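The basic object of redescription mining can be sketched concretely: a redescription pairs two queries over disjoint attribute views that describe nearly the same instances, and its accuracy is commonly scored by the Jaccard similarity of the two support sets. The dataset and attribute names below are invented, and this is a toy scoring step, not the CLUS-RM framework.

```python
# Toy dataset: one row per instance, two disjoint attribute views.
view1 = {  # e.g. climate attributes (illustrative)
    "r1": {"warm": True},  "r2": {"warm": True},
    "r3": {"warm": False}, "r4": {"warm": True},
}
view2 = {  # e.g. species attributes (illustrative)
    "r1": {"palms": True},  "r2": {"palms": True},
    "r3": {"palms": False}, "r4": {"palms": False},
}

def support(view, attr, value):
    """Instances on which the query `attr == value` holds."""
    return {r for r, attrs in view.items() if attrs.get(attr) == value}

def jaccard(a, b):
    """Jaccard similarity of two support sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# One candidate redescription: "warm" (view 1) vs. "palms" (view 2).
s1 = support(view1, "warm", True)   # {r1, r2, r4}
s2 = support(view2, "palms", True)  # {r1, r2}
accuracy = jaccard(s1, s2)          # 2 shared of 3 total instances
```

A redescription set construction framework like the one above describes would generate many such query pairs and then optimize the extracted subset against several quality criteria (accuracy, size, coverage) at once.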