2,864 research outputs found

    A review of associative classification mining

    Associative classification mining is a promising approach in data mining that uses association rule discovery techniques to construct classification systems, also known as associative classifiers. In the last few years, a number of associative classification algorithms have been proposed, e.g. CPAR, CMAR, MCAR, MMAC and others. These algorithms employ several different rule discovery, rule ranking, rule pruning, rule prediction and rule evaluation methods. This paper surveys and compares the state-of-the-art associative classification techniques with regard to the above criteria. Finally, future directions in associative classification, such as incremental learning and mining low-quality data sets, are also highlighted.
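
    The general idea of rule discovery, rule ranking and rule prediction can be illustrated with a minimal sketch. It is not an implementation of CPAR, CMAR, MCAR or MMAC; it assumes a plain support/confidence threshold, a confidence-then-support ranking and first-match prediction, and the function names and toy weather data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_rules(rows, labels, min_support=0.3, min_confidence=0.6, max_len=2):
    """Return class association rules as (itemset, label, support, confidence)."""
    n = len(rows)
    body_counts = Counter()  # how often each itemset occurs at all
    rule_counts = Counter()  # how often each itemset occurs with each label
    for items, label in zip(rows, labels):
        for k in range(1, max_len + 1):
            for itemset in combinations(sorted(items), k):
                body_counts[itemset] += 1
                rule_counts[(itemset, label)] += 1
    rules = []
    for (itemset, label), count in rule_counts.items():
        support = count / n
        confidence = count / body_counts[itemset]
        if support >= min_support and confidence >= min_confidence:
            rules.append((itemset, label, support, confidence))
    # Assumed ranking: higher confidence, then higher support, then shorter body.
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0]), r[0], r[1]))
    return rules


def predict(rules, items, default=None):
    """Classify with the highest-ranked rule whose body is contained in `items`."""
    items = set(items)
    for itemset, label, _, _ in rules:
        if items.issuperset(itemset):
            return label
    return default


# Hypothetical toy data: each row is a set of attribute=value items.
rows = [
    {"outlook=sunny", "windy=no"},
    {"outlook=sunny", "windy=yes"},
    {"outlook=rain", "windy=yes"},
    {"outlook=rain", "windy=yes"},
    {"outlook=rain", "windy=no"},
    {"outlook=sunny", "windy=no"},
]
labels = ["play", "play", "stay", "stay", "play", "play"]
rules = mine_rules(rows, labels)
```

    Exhaustive enumeration of itemsets is only workable for tiny inputs; the surveyed algorithms differ precisely in how they avoid it and in how they rank and prune the resulting rules.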

    Multi-score Learning for Affect Recognition: the Case of Body Postures

    An important challenge in building automatic affective state recognition systems is establishing the ground truth. When the ground truth is not available, observers are often used to label training and testing sets. Unfortunately, inter-rater reliability between observers tends to vary from fair to moderate when dealing with naturalistic expressions. Nevertheless, the most common approach is to label each expression with the label most frequently assigned to it by the observers. In this paper, we propose a general pattern recognition framework for automatic affect recognition that takes the variability between observers into account. This leads to what we term a multi-score learning problem, in which a single expression is associated with multiple values representing the scores of each available emotion label. We also propose several performance measurements and pattern recognition methods for this framework, and report the experimental results obtained when testing and comparing these methods on two affective posture datasets.
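
    The difference between a single most-frequent label and a multi-score target can be sketched as follows. The abstract does not say how scores are computed, so this sketch assumes a score is simply the fraction of observers who chose each label; the label set and votes are hypothetical.

```python
from collections import Counter

LABELS = ["angry", "happy", "sad"]  # hypothetical emotion label set


def score_vector(votes, labels=LABELS):
    """One score per label: the fraction of observers who chose it (assumed definition)."""
    counts = Counter(votes)
    return [counts[label] / len(votes) for label in labels]


def most_frequent_label(votes, labels=LABELS):
    """The conventional single-label target: the label chosen most often."""
    counts = Counter(votes)
    return max(labels, key=lambda label: counts[label])


# Five hypothetical observers rate one posture.
votes = ["sad", "sad", "angry", "sad", "happy"]
scores = score_vector(votes)
majority = most_frequent_label(votes)
```

    The majority label keeps only "sad", while the score vector also records that two of the five observers disagreed, which is the variability the framework is designed to retain.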

    Learning Interpretable Rules for Multi-label Classification

    Multi-label classification (MLC) is a supervised learning problem in which, contrary to standard multiclass classification, an instance can be associated with several class labels simultaneously. In this chapter, we advocate a rule-based approach to multi-label classification. Rule learning algorithms are often employed when one is not only interested in accurate predictions, but also requires an interpretable theory that can be understood, analyzed, and qualitatively evaluated by domain experts. Ideally, by revealing patterns and regularities contained in the data, a rule-based theory yields new insights into the application domain. Recently, several authors have started to investigate how rule-based models can be used for modeling multi-label data. Discussing this task in detail, we highlight some of the problems that make rule learning considerably more challenging for MLC than for conventional classification. While mainly focusing on our own previous work, we also provide a short overview of related work in this area. Comment: Preprint version. To appear in: Explainable and Interpretable Models in Computer Vision and Machine Learning. The Springer Series on Challenges in Machine Learning. Springer (2018). See http://www.ke.tu-darmstadt.de/bibtex/publications/show/3077 for further information.

    Speech Recognition by Composition of Weighted Finite Automata

    We present a general framework based on weighted finite automata and weighted finite-state transducers for describing and implementing speech recognizers. The framework allows us to represent uniformly the information sources and data structures used in recognition, including context-dependent units, pronunciation dictionaries, language models and lattices. Furthermore, general but efficient algorithms can be used for combining information sources in actual recognizers and for optimizing their application. In particular, a single composition algorithm is used both to combine in advance information sources such as language models and dictionaries, and to combine acoustic observations and information sources dynamically during recognition. Comment: 24 pages, uses psfig.st
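
    Composition of weighted transducers can be sketched in a simplified form. The sketch below is not the paper's algorithm: it assumes epsilon-free transducers and the tropical semiring (weights are added along a path), and the dictionary-based representation, the function name and the two tiny transducers are hypothetical.

```python
from collections import deque


def compose(t1, t2):
    """Compose two epsilon-free weighted transducers (tropical semiring assumed).

    A transducer is a dict with:
      "start":  the start state,
      "finals": {state: final weight},
      "arcs":   {state: [(input, output, weight, next_state), ...]}.
    """
    start = (t1["start"], t2["start"])
    arcs, finals = {}, {}
    queue, seen = deque([start]), {start}
    while queue:
        state = queue.popleft()
        q1, q2 = state
        arcs[state] = []
        # A pair state is final only if both components are final.
        if q1 in t1["finals"] and q2 in t2["finals"]:
            finals[state] = t1["finals"][q1] + t2["finals"][q2]
        for in1, out1, w1, next1 in t1["arcs"].get(q1, []):
            for in2, out2, w2, next2 in t2["arcs"].get(q2, []):
                if out1 == in2:  # output of t1 must match input of t2
                    nxt = (next1, next2)
                    arcs[state].append((in1, out2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}


# Hypothetical one-arc "lexicon" (symbol a -> x) and two-arc "grammar".
lexicon = {"start": 0, "finals": {1: 0.0}, "arcs": {0: [("a", "x", 1.0, 1)]}}
grammar = {
    "start": 0,
    "finals": {1: 0.25},
    "arcs": {0: [("x", "X", 0.5, 1), ("y", "Y", 0.1, 1)]},
}
combined = compose(lexicon, grammar)
```

    Only state pairs reachable from the start pair are built, so the "y" arc of the second transducer, which nothing in the first can feed, never appears in the result.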

    I-prune: Item selection for associative classification

    Associative classification is characterized by accurate models and high model generation time. Most of the time is spent extracting and postprocessing a large set of irrelevant rules, which are eventually pruned. We propose I-prune, an item-pruning approach that identifies uninteresting items by means of an interestingness measure and prunes them as soon as they are detected. Thus, the number of extracted rules is reduced and model generation time decreases correspondingly. A wide set of experiments on real and synthetic data sets has been performed to evaluate I-prune and select the appropriate interestingness measure. The experimental results show that I-prune allows a significant reduction in model generation time, while increasing (or at worst preserving) model accuracy. Experimental evaluation also points to the chi-square measure as the most effective interestingness measure for item pruning.
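
    How a chi-square score can separate interesting from uninteresting items can be sketched as follows. This is not I-prune's actual procedure, which the abstract does not detail; it assumes a 2x2 item/class contingency table per item and the standard 3.841 critical value (one degree of freedom, 5% level) as the pruning threshold. The item names and counts are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table.

    a: records with the item and the class     b: with the item, other classes
    c: records without the item, in the class  d: without the item, other classes
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no evidence of association
    return n * (a * d - b * c) ** 2 / denom


def keep_interesting(tables, threshold=3.841):
    """Return the items whose chi-square score reaches the threshold, sorted by name."""
    return sorted(
        item for item, table in tables.items() if chi_square_2x2(*table) >= threshold
    )


# Hypothetical counts for two items against one target class.
tables = {
    "fever=yes": (30, 10, 10, 30),   # strongly associated with the class
    "weekday=mon": (20, 20, 20, 20),  # independent of the class
}
kept = keep_interesting(tables)
```

    Dropping "weekday=mon" before rule extraction means no rule containing it is ever generated, which is where the saving in model generation time would come from.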

    A modified multi-class association rule for text mining

    Classification and association rule mining are significant tasks in data mining. Integrating association rule discovery and classification yields an approach known as associative classification. One common shortcoming of existing associative classifiers is the huge number of rules produced in order to obtain high classification accuracy. This study proposes a Modified Multi-class Association Rule Mining (mMCAR) method that consists of three procedures: rule discovery, rule pruning and group-based class assignment. The rule discovery and rule pruning procedures are designed to reduce the number of classification rules, while the group-based class assignment procedure contributes to improving the classification accuracy. Experiments on the structured and unstructured text datasets obtained from the UCI and Reuters repositories are performed to evaluate the proposed associative classifier. The proposed mMCAR classifier is benchmarked against traditional classifiers and existing associative classifiers. Experimental results indicate that mMCAR produces high accuracy with a smaller number of classification rules. For the structured dataset, mMCAR produces an average accuracy of 84.24%, compared with 84.23% for MCAR. Even though the difference in classification accuracy is small, mMCAR uses only 50 rules for the classification while its benchmark method uses 60 rules. mMCAR is on par with MCAR when the unstructured dataset is used: both classifiers produce 89% accuracy, but mMCAR uses fewer rules for the classification. This study contributes to the text mining domain, as automatic classification of huge and widely distributed textual data could facilitate the text representation and retrieval processes.

    An associative classification based approach for detecting SNP-SNP interactions in high dimensional genome

    Many studies have depicted genotype-phenotype relationships by identifying genetic variants associated with a specific disease. Researchers focus more attention on interactions between SNPs that are strongly associated with disease in the absence of a main effect. In this context, a number of machine learning and data mining tools have been applied to identify combinations of multi-locus SNPs in higher-order data. However, none of the current models can identify useful SNP-SNP interactions for high-dimensional genome data. Detecting these interactions is challenging due to bio-molecular complexities and computational limitations. The goal of this research was to implement associative classification and study its effectiveness for detecting epistasis in balanced and imbalanced datasets. The proposed approach was evaluated for two-locus epistasis interactions using simulated data. The datasets were generated for 5 different penetrance functions by varying heritability, minor allele frequency and sample size. In total, 23,400 datasets were generated and several experiments were conducted to identify the disease-causal SNP interactions. The classification accuracy of the proposed approach was compared with that of previous approaches. Though associative classification showed only a relatively small improvement in accuracy for balanced datasets, it outperformed existing approaches in higher-order multi-locus interactions in imbalanced datasets.

    Discriminative Probabilistic Pattern Mining using Graph for Electronic Health Records

    Thesis (Master's), Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, August 2019. Kim Sun. Electronic health records (EHR) contain plenty of useful information about patients' medical histories. However, EHR data are highly unstructured and their volume is growing continuously, so a reliable data mining technique is needed to group and categorize clinical notes. Although many existing data mining techniques for group classification use frequent patterns generated from keyword frequencies, these patterns do not possess distinguishing characteristics strong enough to classify complex data such as clinical notes in EHR. These techniques also encounter scalability and computational cost problems when used on large EHR datasets. To address these issues, we introduce a discriminative probabilistic pattern mining algorithm that uses a graph (DPPMG) to generate subgraphs of frequent patterns for classification in electronic health records. We use co-occurrence, a combination of binary features that is more discriminative than individual keywords, to construct a discriminative probabilistic frequent pattern graph for clinical note classification. Each co-occurrence is weighted by a log-odds score that reflects its discriminative power. The graph, which reflects the essence of the clinical notes, is searched to find discriminative probabilistic frequent subgraphs: we start from a hub node in the graph and use dynamic programming to find a path. The discriminative probabilistic frequent subgraphs discovered by this approach are then used to classify the clinical notes of electronic health records. Contents: Chapter 1, Introduction and Motivation; Chapter 2, Background (2.1 Frequent Pattern Based Classification; 2.2 Discriminative Pattern Mining; 2.3 Electronic Health Records); Chapter 3, Related Work; Chapter 4, Overview and Design; Chapter 5, Implementation (5.1 Dataset; 5.2 Keyword Extraction and Filtering; 5.3 Co-occurrence Generation and Graph Construction; 5.4 Dynamic Programming to Discover Optimal Path); Chapter 6, Results and Evaluation (6.1 Choosing Starting Hub Node; 6.2 Qualitative Analysis; 6.3 Discriminative Power of the Probabilistic Frequent Patterns); Chapter 7, Conclusion; Bibliography; Abstract (in Korean).
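
    Weighting keyword co-occurrences by a log-odds score can be sketched as follows. The abstract does not give the thesis's formula, so this sketch assumes one plausible definition: the difference between the smoothed log-odds of a keyword pair appearing in notes of one class and in notes of the other. The function name, the smoothing constant and the toy notes are hypothetical.

```python
import math
from itertools import combinations


def cooccurrence_log_odds(pos_notes, neg_notes, smoothing=0.5):
    """Weight each keyword pair by an assumed log-odds ratio between two classes.

    Positive weights mark pairs that are more typical of `pos_notes`,
    negative weights mark pairs that are more typical of `neg_notes`.
    """

    def pair_counts(notes):
        counts = {}
        for keywords in notes:
            for pair in combinations(sorted(set(keywords)), 2):
                counts[pair] = counts.get(pair, 0) + 1
        return counts

    pos, neg = pair_counts(pos_notes), pair_counts(neg_notes)
    weights = {}
    for pair in set(pos) | set(neg):
        # Additive smoothing keeps both probabilities strictly between 0 and 1.
        p = (pos.get(pair, 0) + smoothing) / (len(pos_notes) + 2 * smoothing)
        q = (neg.get(pair, 0) + smoothing) / (len(neg_notes) + 2 * smoothing)
        weights[pair] = math.log(p / (1 - p)) - math.log(q / (1 - q))
    return weights


# Hypothetical keyword sets extracted from clinical notes of two classes.
pos_notes = [{"chest", "pain", "ecg"}, {"chest", "pain"}, {"pain", "ecg"}]
neg_notes = [{"cough", "fever"}, {"chest", "cough"}, {"pain", "fever"}]
weights = cooccurrence_log_odds(pos_notes, neg_notes)
```

    In a graph built from such weights, keywords would be nodes and weighted pairs would be edges; the hub-node search and dynamic programming step described in the abstract are not reproduced here.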