486 research outputs found

Mining Frequent Itemsets (MFI) over Data Streams: Variable Window Size (VWS) by Context Variation Analysis (CVA) of the Streaming Transactions

Author: Govardhan Dr. A.
Rao Dr. T. V.
Reddy V. Sidda
Publication date: 01/07/2014

This paper addresses the challenges of mining frequent itemsets over data streams with a variable window size and limited memory. To detect the point at which the context of the streaming transactions changes, we developed a two-level window structure that fixes the window size on the fly, limits heterogeneity, and ensures homogeneity among the transactions added to the window. To reduce memory use and computational cost and to improve scalability, the design allows the coverage, or support, to be fixed at the window level. We introduce an incremental method for mining frequent itemsets from the window together with a context variation analysis approach. The complete technique is named Mining Frequent Item-sets using Variable Window Size fixed by Context Variation Analysis (MFI-VWS-CVA). There are clear boundaries between frequent and infrequent itemsets within specific itemsets. In this design, a change in window size represents concept drift in the data stream; in other words, whenever the window size is not set effectively, the itemset will be infrequent. The experiments we carried out and documented show that the proposed algorithm is much more efficient than that of existing
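
The abstract does not say how context variation is measured, so the following is only a minimal sketch of the general idea of closing a window when the context changes. It assumes Jaccard similarity between an incoming transaction and the items already in the window; the function name `split_by_context` and the parameter `min_similarity` are hypothetical and are not taken from the paper.

```python
def split_by_context(stream, min_similarity=0.2):
    """Split a transaction stream into variable-size windows.

    Assumption (not from the paper): a context change is detected when the
    Jaccard similarity between the incoming transaction and the set of items
    already in the current window falls below min_similarity.
    """
    windows = []
    current = []   # transactions in the open window
    seen = set()   # items observed in the open window
    for transaction in stream:
        items = set(transaction)
        if current:
            union = seen | items
            similarity = len(seen & items) / len(union) if union else 1.0
            if similarity < min_similarity:
                # Context changed: close the window and start a new one.
                windows.append(current)
                current, seen = [], set()
        current.append(transaction)
        seen |= items
    if current:
        windows.append(current)
    return windows


stream = [["a", "b"], ["a", "c"], ["x", "y"], ["x", "z"]]
windows = split_by_context(stream)
```

With this toy stream the third transaction shares no item with the first two, so the stream is split into two windows of two transactions each. Frequent itemsets could then be mined per window with a window-level support threshold.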

arXiv.org e-Print Archive

Directory of Open Access Journals

Towards an incremental maintenance of cyclic association rules

Author: Ahmed Eya ben
Gouider Mohamed Salah
Publication date: 26/09/2010

Cyclic association rules were recently introduced to discover rules from items characterized by their regular variation over time. In real-life situations, temporal databases are often appended to or updated. Rescanning the whole database every time is highly expensive, whereas incremental mining techniques can solve this problem efficiently. In this paper, we propose an incremental algorithm for maintaining cyclic association rules. The experiments carried out on our proposal highlight its efficiency and performance
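
To make the notion of a cyclic rule concrete, here is a small sketch under the usual definition: time is cut into segments, and a rule has a cycle (length, offset) if it holds in every segment whose index is offset, offset + length, offset + 2 * length, and so on. The function name `find_cycles` and the boolean-list input are illustrative assumptions; this is not the incremental algorithm proposed in the paper.

```python
def find_cycles(holds, max_length):
    """Return all (length, offset) cycles of a rule.

    holds[i] is True when the rule satisfies the support and confidence
    thresholds in time segment i. A cycle (length, offset) requires the rule
    to hold in every segment offset, offset + length, offset + 2*length, ...
    """
    cycles = []
    for length in range(1, max_length + 1):
        for offset in range(length):
            segment_ids = range(offset, len(holds), length)
            # Skip offsets beyond the data; otherwise every segment must hold.
            if len(segment_ids) > 0 and all(holds[i] for i in segment_ids):
                cycles.append((length, offset))
    return cycles


# The rule holds in every second segment, starting with segment 0.
holds = [True, False, True, False, True, False]
cycles = find_cycles(holds, max_length=3)
```

When new segments are appended to `holds`, only cycles that were still valid need to be rechecked against the new segments, which is the intuition behind maintaining such rules incrementally instead of rescanning everything.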

arXiv.org e-Print Archive

Approximate-Closed-Itemset Mining for Streaming Data Under Resource Constraint

Author: Iwanuma Koji
Tabei Yasuo
Yamamoto Yoshitaka
Publication date: 07/01/2019

Here, we present a novel algorithm for frequent itemset mining for streaming data (FIM-SD). For the past decade, various FIM-SD methods in one-pass approximation settings have been developed to approximate the frequency of each itemset. These approaches can be categorized into two approximation types: parameter-constrained (PC) mining and resource-constrained (RC) mining. PC methods control the maximum error that can be included in the frequency based on a pre-defined parameter. In contrast, RC methods limit the maximum memory consumption based on resource constraints. However, the existing PC methods can exponentially increase the memory consumption, while the existing RC methods can rapidly increase the maximum error. In this study, we address this problem by introducing the notion of a condensed representation, called a $\Delta$-covered set, to the RC approximation. This notion is regarded as an extension of the closedness compression and when $\Delta = 0$, the solution corresponds to an ordinary closed itemset. The algorithm searches for such approximate closed itemsets that can restore the frequent itemsets and their frequencies under resource constraint while the maximum error is bounded by an integer, $\Delta$. We first propose a one-pass approximation algorithm to find the condensed solution. Then, we improve the basic algorithm by introducing a unified PC-RC approximation approach. Finally, we empirically demonstrate that the proposed algorithm significantly outperforms the state-of-the-art PC and RC methods for FIM-SD. Comment: 14 pages, 16 figures, submitted to VLDB201
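
The abstract states that the case with the error bound equal to zero corresponds to ordinary closed itemsets. The sketch below illustrates only that exact case on a small batch of transactions: an itemset is closed if no proper superset has the same support, and the support of any frequent itemset can be restored as the largest support among its closed supersets. It is a brute-force illustration with hypothetical function names, not the one-pass streaming algorithm of the paper.

```python
from itertools import combinations


def support_counts(transactions):
    """Count the support of every non-empty itemset (exponential; toy data only)."""
    counts = {}
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return counts


def closed_itemsets(transactions, min_support):
    """Frequent itemsets that have no proper superset with the same support."""
    counts = support_counts(transactions)
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    closed = {}
    for itemset, count in frequent.items():
        # A superset with equal support is itself frequent, so it is in `frequent`.
        covered = any(itemset < other and count == other_count
                      for other, other_count in frequent.items())
        if not covered:
            closed[itemset] = count
    return closed


def restore_support(itemset, closed):
    """Support of a frequent itemset, recovered from the closed itemsets alone."""
    target = frozenset(itemset)
    return max((c for s, c in closed.items() if target <= s), default=0)


transactions = [["a", "b"], ["a", "b", "c"], ["a"]]
closed = closed_itemsets(transactions, min_support=2)
```

Here `{b}` is frequent but not closed, because `{a, b}` has the same support of 2; `restore_support(["b"], closed)` still recovers that support from the condensed result.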

arXiv.org e-Print Archive

Sequential Mining: Patterns and Algorithms Analysis

Author: Lazzez Amor
Slimani Thabet
Publication date: 02/11/2013

This paper presents and analyzes the common existing sequential pattern mining algorithms. It classifies sequential pattern mining algorithms into five broad classes: first, Apriori-based algorithms; second, Breadth First Search-based strategies; third, Depth First Search strategies; fourth, sequential closed-pattern algorithms; and fifth, incremental pattern mining algorithms. Finally, a comparative analysis is carried out on the basis of the key features supported by the various algorithms. This study improves the understanding of the approaches to sequential pattern mining. Comment: 10 page

arXiv.org e-Print Archive

Big Data Analytics in Bioinformatics: A Machine Learning Perspective

Author: Ahmed Hasin Afzal
Bhattacharyya Dhruba Kumar
Hoque Nazrul
Kashyap Hirak
Roy Swarup
Publication date: 15/06/2015

Bioinformatics research is characterized by voluminous and incremental datasets and complex data analytics methods. The machine learning methods used in bioinformatics are iterative and parallel. These methods can be scaled to handle big data using distributed and parallel computing technologies. Big data tools usually perform computation in batch mode and are not optimized for iterative processing and high data dependency among operations. In recent years, parallel, incremental, and multi-view machine learning algorithms have been proposed. Similarly, graph-based architectures and in-memory big data tools have been developed to minimize I/O cost and optimize iterative processing. However, standard big data architectures and tools are lacking for many important bioinformatics problems, such as fast construction of co-expression and regulatory networks and salient module identification, detection of complexes over growing protein-protein interaction data, fast analysis of massive DNA, RNA, and protein sequence data, and fast querying on incremental and heterogeneous disease networks. This paper addresses the issues and challenges posed by several big data problems in bioinformatics, and gives an overview of the state of the art and the future research opportunities. Comment: 20 pages survey paper on Big data analytics in Bioinformatic

arXiv.org e-Print Archive

Fast Utility Mining on Complex Sequences

Author: Chao Han-Chieh
Fournier-Viger Philippe
Gan Wensheng
Lin Jerry Chun-Wei
Yu Philip S.
Zhang Jiexiong
Publication date: 27/04/2019

High-utility sequential pattern mining is an emerging topic in the field of Knowledge Discovery in Databases. It consists of discovering subsequences that have a high utility (importance) in sequences, referred to as high-utility sequential patterns (HUSPs). HUSPs can be applied to many real-life applications, such as market basket analysis, e-commerce recommendation, click-stream analysis, and scenic route planning. For example, in economics and targeted marketing, understanding the economic behavior of consumers, such as finding credible and reliable information on product profitability, is quite challenging. Several algorithms have been proposed to address this problem by efficiently mining useful utility-based sequential patterns. Nevertheless, the performance of these algorithms can be unsatisfactory in terms of runtime and memory usage because of the combinatorial explosion of the search space for low utility thresholds and large databases. Hence, this paper proposes a more efficient algorithm for high-utility sequential pattern mining, called HUSP-ULL. It uses a lexicographic sequence (LS)-tree and a utility-linked (UL)-list structure to discover HUSPs quickly. Furthermore, two pruning strategies are introduced in HUSP-ULL to obtain tight upper bounds on the utility of candidate sequences and to reduce the search space by pruning unpromising candidates early. Substantial experiments on both real-life and synthetic datasets show that the proposed algorithm can effectively and efficiently discover the complete set of HUSPs and outperforms the state-of-the-art algorithms. Comment: Under review in IEEE TKDE, 15 page

arXiv.org e-Print Archive

Utility Mining Across Multi-Dimensional Sequences

Author: Chao Han-Chieh
Fournier-Viger Philippe
Gan Wensheng
Lin Jerry Chun-Wei
Yin Hongzhi
Yu Philip S.
Zhang Jiexiong
Publication date: 25/02/2019

Knowledge extraction from databases is a fundamental task in the database and data mining community and has been applied to a wide range of real-world applications and situations. Unlike support-based mining models, the utility-oriented mining framework integrates utility theory to provide more informative and useful patterns. Time-dependent sequence data is commonly seen in real life. Sequence data has been widely used in many applications, such as analyzing sequential user behavior on the Web, influence maximization, route planning, and targeted marketing. Unfortunately, all the existing algorithms lose sight of the fact that the processed data not only contain rich features (e.g., occurrence quantity, risk, profit, etc.), but may also be associated with multi-dimensional auxiliary information; for example, a transaction sequence can be associated with purchaser profile information. In this paper, we first formulate the problem of utility mining across multi-dimensional sequences, and propose a novel framework named MDUS to extract Multi-Dimensional Utility-oriented Sequential useful patterns. Two algorithms, named MDUS_EM and MDUS_SD respectively, are presented to address the formulated problem. The former is based on database transformation, and the latter performs pattern joins and a searching method to identify the desired patterns across multi-dimensional sequences. Extensive experiments are carried out on five real-life datasets and one synthetic dataset to show that the proposed algorithms can effectively and efficiently discover useful knowledge from multi-dimensional sequential databases. Moreover, the MDUS framework can provide better insight, and it is more adaptable to real-life situations than the existing models. Comment: Under review in IEEE TKDE, 14 page

arXiv.org e-Print Archive

A Guided FP-growth algorithm for multitude-targeted mining of big data

Author: Dattner Itai
Shabtay Lior
Yaari Rami
Publication date: 04/07/2018

In this paper we present the GFP-growth (Guided FP-growth) algorithm, a novel method for multitude-targeted mining: finding the count of a given large list of itemsets in large data. The GFP-growth algorithm is designed to focus on the specific multitude itemsets of interest and optimizes the time and memory costs. We prove that the GFP-growth algorithm yields the exact frequency-counts for the required itemsets. We show that for a number of different problems, a solution can be devised which takes advantage of the efficient implementation of multitude-targeted mining for boosting the performance. In particular, we study in detail the problem of generating the minority-class rules from imbalanced data, a scenario that appears in many real-life domains such as medical applications, failure prediction, network and cyber security, and maintenance. We develop the Minority-Report Algorithm that uses the GFP-growth for boosting performance. We prove some theoretical properties of the Minority-Report Algorithm and demonstrate its performance gain using simulations and real data
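
The task described here, counting a given list of itemsets in a dataset, can be stated in a few lines. The sketch below is a naive baseline that scans every transaction for every target; it is shown only to pin down the expected output and is not the GFP-growth algorithm. The function name `count_targets` is a hypothetical choice.

```python
def count_targets(transactions, targets):
    """Count how many transactions contain each target itemset.

    Naive baseline: one subset test per (transaction, target) pair.
    """
    counts = {frozenset(target): 0 for target in targets}
    for transaction in transactions:
        items = set(transaction)
        for target in counts:
            if target <= items:
                counts[target] += 1
    return counts


transactions = [["a", "b", "c"], ["a", "c"], ["b"]]
targets = [["a", "c"], ["b"]]
counts = count_targets(transactions, targets)
```

The cost of this baseline grows with the product of the number of transactions and the number of targets, which is the kind of cost a guided tree-based method aims to reduce when the target list is large.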

arXiv.org e-Print Archive

An Algorithm for Mining High Utility Closed Itemsets and Generators

Author: Das Ashok Kumar
Goswami A.
Sahoo Jayakrushna
Publication date: 11/10/2014

Traditional association rule mining based on the support-confidence framework provides an objective measure of the rules that are of interest to users. However, it does not reflect the utility of the rules. Frequent closed itemsets and their generators play an important role in extracting non-redundant association rules in the support-confidence framework. To extract non-redundant association rules among high utility itemsets, high utility closed itemsets (HUCIs) and their generators should be extracted so that the traditional support-confidence framework can be applied. However, no efficient method exists at present for mining HUCIs with their generators. This paper addresses this issue. A post-processing algorithm, called the HUCI-Miner, is proposed to mine HUCIs with their generators. The proposed algorithm is implemented using both synthetic and real datasets

arXiv.org e-Print Archive

A Survey of Utility-Oriented Pattern Mining

Author: Chao Han-Chieh
Fournier-Viger Philippe
Gan Wensheng
Lin Jerry Chun-Wei
Tseng Vincent S.
Yu Philip S.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 16/09/2019

The main purpose of data mining and analytics is to find novel, potentially useful patterns that can be utilized in real-world applications to derive beneficial knowledge. For identifying and evaluating the usefulness of different kinds of patterns, many techniques and constraints have been proposed, such as support, confidence, sequence order, and utility parameters (e.g., weight, price, profit, quantity, satisfaction, etc.). In recent years, there has been an increasing demand for utility-oriented pattern mining (UPM, also called utility mining). UPM is a vital task, with numerous high-impact applications, including cross-marketing, e-commerce, finance, medical, and biomedical applications. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods of UPM. First, we provide an in-depth introduction to UPM, including concepts, examples, and comparisons with related concepts. A taxonomy of the most common and state-of-the-art approaches for mining different kinds of high-utility patterns is presented in detail, including Apriori-based, tree-based, projection-based, vertical-/horizontal-data-format-based, and other hybrid approaches. A comprehensive review of advanced topics of existing high-utility pattern mining techniques is offered, with a discussion of their pros and cons. Finally, we present several well-known open-source software packages for UPM. We conclude our survey with a discussion on open and practical challenges in this field. Comment: Survey paper, accepted by IEEE TKDE, 20 page
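
As a concrete reading of the utility parameters mentioned above, the following sketch uses the common definition in which the utility of an itemset is the sum, over the transactions that contain it, of purchased quantity times unit profit. The data layout (one dictionary of quantities per transaction) and the names `itemset_utility` and `unit_profit` are assumptions made for illustration.

```python
def itemset_utility(itemset, transactions, unit_profit):
    """Utility of an itemset: quantity * unit profit, summed over the
    transactions that contain every item of the itemset."""
    wanted = set(itemset)
    total = 0
    for quantities in transactions:
        # quantities maps item -> purchased quantity in this transaction
        if wanted.issubset(quantities):
            total += sum(quantities[item] * unit_profit[item] for item in wanted)
    return total


transactions = [
    {"a": 2, "b": 1},
    {"a": 1, "c": 3},
    {"b": 4},
]
unit_profit = {"a": 5, "b": 2, "c": 1}

utility_a = itemset_utility(["a"], transactions, unit_profit)
utility_ab = itemset_utility(["a", "b"], transactions, unit_profit)
```

An itemset is then called high-utility when this value reaches a user-chosen threshold. Unlike support, utility is not anti-monotone: a superset can have a higher utility than one of its subsets, which is why utility mining relies on upper bounds for pruning.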

arXiv.org e-Print Archive