
    Sensing the Web for Induction of Association Rules and their Composition through Ensemble Techniques

    Starting from geophysical data collected from heterogeneous sources, such as meteorological stations and information gathered from the web, we seek unknown connections between the sampled values through the extraction of association rules. These rules imply the co-occurrence of two or more symbols in the same representation, and the rule confidence may vary according to the collected data. Starting from traditional algorithms such as FP-Growth and Apriori, we propose the creation of complex association rules through boosting of simpler ones. The composition yields rules that are robust and lets a larger number of interesting rules emerge.
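To make the notions of co-occurrence and rule confidence concrete, here is a minimal sketch of the standard support and confidence measures used by Apriori-style miners. It is not the authors' pipeline; the weather symbols and the function names are invented for illustration.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Transaction]) -> float:
    """support(antecedent and consequent together) / support(antecedent)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    sup_a = support(a, transactions)
    return support(both, transactions) / sup_a if sup_a > 0 else 0.0


# Hypothetical symbolic observations (assumption, not data from the paper).
observations = [
    frozenset({"rain", "low_pressure", "high_humidity"}),
    frozenset({"rain", "low_pressure"}),
    frozenset({"clear", "high_pressure"}),
    frozenset({"low_pressure", "high_humidity"}),
]

# Rule under test: low_pressure -> rain
rule_support = support({"low_pressure", "rain"}, observations)
rule_confidence = confidence({"low_pressure"}, {"rain"}, observations)
```

Adding or removing observations changes both values, which is the sense in which confidence varies with the collected data.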

    Partial rule match for filtering rules in associative classification

    In this study, we propose a new method to enhance the accuracy of the Modified Multi-class Classification based on Association Rule (MMCAR) classifier. We introduce a Partial Rule Match Filtering (PRMF) method that allows a minimal match of the items in the rule's body in order for the rule to be added to a classifier. Experiments on the Reuters-21578 data sets are performed to evaluate the effectiveness of PRMF in MMCAR. Results show that the MMCAR classifier performs better than the chosen competitors.
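The difference between a full and a partial match of a rule body can be sketched as follows. The abstract does not state how PRMF defines a "minimal match", so the `min_match` threshold, the `Rule` type and the sample terms are assumptions made purely for illustration.

```python
from typing import FrozenSet, NamedTuple


class Rule(NamedTuple):
    body: FrozenSet[str]   # terms that make up the rule's body
    label: str             # class predicted by the rule


def fully_matches(rule: Rule, doc_terms: FrozenSet[str]) -> bool:
    """Classic matching: every body item must occur in the document."""
    return rule.body <= doc_terms


def partially_matches(rule: Rule, doc_terms: FrozenSet[str],
                      min_match: int = 1) -> bool:
    """Partial matching: at least `min_match` body items occur in the document.

    `min_match` is a hypothetical parameter, not taken from the paper.
    """
    return len(rule.body & doc_terms) >= min_match


# Invented example rule and document.
rule = Rule(frozenset({"oil", "barrel", "opec"}), "crude")
doc = frozenset({"oil", "price", "barrel"})

full = fully_matches(rule, doc)
partial = partially_matches(rule, doc, min_match=2)
```

Under full matching this rule would not cover the document, while a two-item partial match lets it through.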

    A modified multi-class association rule for text mining

    Classification and association rule mining are significant tasks in data mining. Integrating association rule discovery and classification yields an approach known as associative classification. One common shortcoming of existing associative classifiers is the huge number of rules produced in order to obtain high classification accuracy. This study proposes a Modified Multi-class Association Rule Mining (mMCAR) method that consists of three procedures: rule discovery, rule pruning and group-based class assignment. The rule discovery and rule pruning procedures are designed to reduce the number of classification rules, while the group-based class assignment procedure contributes to improving the classification accuracy. Experiments on structured and unstructured text datasets obtained from the UCI and Reuters repositories are performed to evaluate the proposed associative classifier. The proposed mMCAR classifier is benchmarked against traditional classifiers and existing associative classifiers. Experimental results indicate that mMCAR produces high accuracy with a smaller number of classification rules. For the structured dataset, mMCAR achieves an average accuracy of 84.24%, compared with 84.23% for MCAR. Even though the difference in classification accuracy is small, mMCAR uses only 50 rules for the classification while its benchmark method uses 60 rules. On the unstructured dataset, mMCAR is on par with MCAR: both classifiers produce 89% accuracy, but mMCAR uses fewer rules for the classification. This study contributes to the text mining domain, as automatic classification of huge and widely distributed textual data could facilitate text representation and retrieval.

    LC an effective classification based association rule mining algorithm

    Classification using association rules is a research field in data mining that primarily uses association rule discovery techniques in classification benchmarks. Many research studies in the literature have confirmed that classification using association tends to generate more predictive classification systems than traditional classification techniques such as probabilistic, statistical and decision tree methods. In this thesis, we introduce a novel data mining algorithm based on classification using association called “Looking at the Class” (LC), which can be used for mining a range of classification data sets. Known algorithms in classification using association, such as the Classification based on Association rule (CBA) system and the Classification based on Predictive Association (CPAR) system, merge disjoint items in the rule learning step without considering class label similarity. In contrast, the proposed algorithm merges only items with identical class labels. This avoids many unnecessary item combinations during the rule learning step and consequently results in large savings in computational time and memory. Furthermore, the LC algorithm uses a novel prediction procedure that employs multiple rules, instead of a single rule, to make the prediction decision. The proposed algorithm has been evaluated thoroughly on real-world security data sets collected using an automated tool developed at Huddersfield University. The security application considered in this thesis is categorizing websites as legitimate or fake based on their features, which is a typical binary classification problem. Experiments have also been conducted on a number of UCI data sets, and the measures used for evaluation are classification accuracy, memory usage, and others. The results show that the LC algorithm outperformed traditional classification algorithms such as C4.5, PART and Naïve Bayes, as well as known classification based association algorithms like CBA, with respect to classification accuracy, memory usage, and execution time on most of the data sets we consider.

    MapReduce network enabled algorithms for classification based on association rules

    There is growing evidence that integrating classification and association rule mining can produce more efficient and accurate classifiers than traditional techniques. This thesis introduces a new MapReduce-based association rule miner for extracting strong rules from large datasets. This miner is later used to develop a new large-scale classifier. A new MapReduce simulator was also developed to evaluate the scalability of the proposed algorithms on MapReduce clusters. The developed association rule miner inherits MapReduce's scalability to huge datasets and to thousands of processing nodes. For finding frequent itemsets, it uses a hybrid approach between miners that use counting methods on horizontal datasets and miners that use set intersections on datasets in vertical format. The new miner generates the same rules that are usually generated by Apriori-like algorithms because it uses the same definitions of the confidence and support thresholds. In the last few years, a number of associative classification algorithms have been proposed, e.g. CPAR, CMAR, MCAR, MMAC and others. This thesis also introduces a new MapReduce classifier that is based on MapReduce association rule mining. This algorithm employs different approaches in its rule discovery, rule ranking, rule pruning, rule prediction and rule evaluation methods. The new classifier works on multi-class datasets and is able to produce multi-label predictions with probabilities for each predicted label. To evaluate the classifier, 20 different datasets from the UCI data collection were used. Results show that the proposed approach is an accurate and effective classification technique, highly competitive and scalable compared with other traditional and associative classification approaches. The MapReduce simulator measures the scalability of MapReduce-based applications easily and quickly and captures the behaviour of algorithms in cluster environments. This also allows optimizing the configuration of MapReduce clusters to get better execution times and hardware utilization.
    EThOS - Electronic Theses Online Service, United Kingdom
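The two counting styles mentioned in the abstract can be illustrated on a single machine. The sketch below scans the horizontal dataset once to build a vertical layout (item to transaction-ID set) and then obtains pair supports by set intersection. It involves no MapReduce and is not the thesis's algorithm; the function names and sample transactions are assumptions.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def vertical_format(transactions: List[Set[str]]) -> Dict[str, Set[int]]:
    """One pass over the horizontal data: map each item to its transaction IDs."""
    tidsets: Dict[str, Set[int]] = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets[item].add(tid)
    return dict(tidsets)


def frequent_pairs(transactions: List[Set[str]],
                   min_support_count: int) -> Dict[Tuple[str, str], int]:
    """Support counts of frequent 2-itemsets via tidset intersection."""
    tidsets = vertical_format(transactions)
    # Counting step: keep only items that are frequent on their own.
    frequent_items = sorted(
        item for item, tids in tidsets.items() if len(tids) >= min_support_count
    )
    result: Dict[Tuple[str, str], int] = {}
    for a, b in combinations(frequent_items, 2):
        common = tidsets[a] & tidsets[b]  # transactions containing both items
        if len(common) >= min_support_count:
            result[(a, b)] = len(common)
    return result


# Invented toy dataset.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
pairs = frequent_pairs(transactions, min_support_count=3)
```

Because the support threshold is applied exactly as in Apriori, the frequent itemsets found this way are the same ones a purely horizontal counting pass would find.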

    Pertanika Journal of Science & Technology


    Large-scale Text Categorization Based on Boosting Association Rules

    In associative classification, high-order rules can represent the association between a pattern and a class label more exactly. However, mining high-order rules is very time consuming because the number of generated rules grows exponentially. Thus, most associative classifiers in previous studies have decreased the number of generated rules by reducing the number of features or lowering the order of the rules, and have selected a small number of high-confidence rules for the final hypotheses for test instances. We propose an alternative approach in which the training of the classifier starts with as many rules as possible, including those that have low confidence values but are better than random guessing. Controlled by our new boosting algorithm, a smaller number of rules are selected from that large set of generated rules to make up the final classifier. The resulting classifier shows greatly improved performance in both training error and generalization error. To maximize classifier performance, our approach mines a huge number of association rules by lowering the minimum support and minimum confidence thresholds, which helps to raise the coverage of test instances. We also propose two new algorithms to enhance computational efficiency during rule generation and boosting. In thorough experiments using well-known benchmark databases and large-scale text corpora, our method achieves outstanding classification accuracy as well as computational efficiency.
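The idea of selecting a few rules from a large pool of weak, low-confidence ones can be sketched with textbook discrete AdaBoost, treating each rule as a weak classifier. This is a stand-in technique: the abstract does not describe the thesis's own boosting algorithm, and the documents, labels and candidate rules below are invented.

```python
import math
from typing import FrozenSet, List, Tuple

Rule = FrozenSet[str]


def fires(rule: Rule, doc: FrozenSet[str]) -> int:
    """Weak classifier: +1 if all rule terms occur in the document, else -1."""
    return 1 if rule <= doc else -1


def adaboost_rules(docs: List[FrozenSet[str]], labels: List[int],
                   rules: List[Rule], rounds: int) -> List[Tuple[Rule, float]]:
    """Pick up to `rounds` rules by discrete AdaBoost; labels are +1 or -1."""
    n = len(docs)
    weights = [1.0 / n] * n
    ensemble: List[Tuple[Rule, float]] = []
    for _ in range(rounds):
        # Choose the rule with the lowest weighted training error.
        best_rule, best_err = None, None
        for rule in rules:
            err = sum(w for w, d, y in zip(weights, docs, labels)
                      if fires(rule, d) != y)
            if best_err is None or err < best_err:
                best_rule, best_err = rule, err
        if best_rule is None or best_err >= 0.5:
            break  # no remaining rule beats random guessing
        eps = max(best_err, 1e-10)  # guard against log of infinity
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((best_rule, alpha))
        if best_err == 0:
            break  # a perfect rule needs no further rounds
        # Increase the weight of misclassified documents, then renormalise.
        weights = [w * math.exp(-alpha * y * fires(best_rule, d))
                   for w, d, y in zip(weights, docs, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def classify(ensemble: List[Tuple[Rule, float]], doc: FrozenSet[str]) -> int:
    """Weighted vote of the selected rules."""
    score = sum(alpha * fires(rule, doc) for rule, alpha in ensemble)
    return 1 if score >= 0 else -1


# Invented toy corpus: +1 = sports, -1 = finance.
docs = [frozenset({"goal", "match", "team"}), frozenset({"team", "coach"}),
        frozenset({"match", "score"}), frozenset({"market", "stock"}),
        frozenset({"team", "market"}), frozenset({"stock", "price"})]
labels = [1, 1, 1, -1, -1, -1]
rules = [frozenset({"team"}), frozenset({"match"}),
         frozenset({"coach"}), frozenset({"goal", "team"})]

ensemble = adaboost_rules(docs, labels, rules, rounds=3)
```

Each selected rule comes with a vote weight, so rules that are individually only slightly better than chance can still contribute to the final classifier.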