22,813 research outputs found

    Effective pattern discovery for text mining

    Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopt term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase) based approaches should perform better than term-based ones, but many experiments have not supported this hypothesis. This paper presents an innovative technique, effective pattern discovery, which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Substantial experiments on the RCV1 data collection and TREC topics demonstrate that the proposed solution achieves encouraging performance.
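    As a minimal sketch of the pattern-deploying idea, assuming discovered patterns are termsets with supports (the equal-share weighting below is illustrative, not the paper's exact formula), each pattern can be deployed onto its constituent terms so that documents are scored in a single term-weight space:

        from collections import defaultdict

        def deploy_patterns(patterns):
            """Map discovered patterns onto individual terms.

            `patterns` is a list of (termset, support) pairs; each pattern's
            support is shared equally among its terms, making patterns of
            different lengths comparable in one term-weight space.
            """
            weights = defaultdict(float)
            for terms, support in patterns:
                for term in terms:
                    weights[term] += support / len(terms)
            return dict(weights)

        def score(document_terms, weights):
            """Rank a document by the summed weights of the terms it contains."""
            return sum(weights.get(t, 0.0) for t in document_terms)

        patterns = [({"data", "mining"}, 0.4), ({"text", "mining", "pattern"}, 0.3)]
        print(score({"text", "mining"}, deploy_patterns(patterns)))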

    A review of associative classification mining

    Associative classification mining is a promising approach in data mining that utilizes association rule discovery techniques to construct classification systems, also known as associative classifiers. In the last few years, a number of associative classification algorithms have been proposed, e.g. CPAR, CMAR, MCAR, and MMAC. These algorithms employ several different rule discovery, rule ranking, rule pruning, rule prediction, and rule evaluation methods. This paper focuses on surveying and comparing the state-of-the-art associative classification techniques with regard to the above criteria. Finally, future directions in associative classification, such as incremental learning and mining low-quality data sets, are also highlighted in this paper.
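    To illustrate the rule-ranking and prediction steps these algorithms share, here is a CBA-style sketch that orders class association rules by confidence, then support, then antecedent length, and predicts with the first matching rule (conventions vary across the surveyed algorithms; this is one common choice):

        from dataclasses import dataclass

        @dataclass
        class Rule:
            antecedent: frozenset   # itemset on the rule's left-hand side
            label: str              # predicted class
            support: float
            confidence: float

        def rank(rules):
            """Higher confidence first; ties broken by higher support,
            then by shorter (more general) antecedent."""
            return sorted(rules, key=lambda r: (-r.confidence, -r.support, len(r.antecedent)))

        def predict(rules, transaction, default="unknown"):
            """Classify with the first ranked rule whose antecedent is
            contained in the transaction."""
            for r in rank(rules):
                if r.antecedent <= transaction:
                    return r.label
            return default

        rules = [Rule(frozenset({"a"}), "x", 0.5, 0.8),
                 Rule(frozenset({"a", "b"}), "y", 0.3, 0.9)]
        print(predict(rules, {"a", "b"}))  # -> "y": higher confidence wins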

    Interpretable multiclass classification by MDL-based rule lists

    Interpretable classifiers have recently witnessed an increase in attention from the data mining community because they are inherently easier to understand and explain than their more complex counterparts. Examples of interpretable classification models include decision trees, rule sets, and rule lists. Learning such models often involves optimizing hyperparameters, which typically requires substantial amounts of data and may result in relatively large models. In this paper, we consider the problem of learning compact yet accurate probabilistic rule lists for multiclass classification. Specifically, we propose a novel formalization based on probabilistic rule lists and the minimum description length (MDL) principle. This results in virtually parameter-free model selection that naturally allows trading off model complexity against goodness of fit, by which overfitting and the need for hyperparameter tuning are effectively avoided. Finally, we introduce the Classy algorithm, which greedily finds rule lists according to the proposed criterion. We empirically demonstrate that Classy selects small probabilistic rule lists that outperform state-of-the-art classifiers when it comes to the combination of predictive performance and interpretability. We show that Classy is insensitive to its only parameter, i.e., the candidate set, and that compression on the training set correlates with classification performance, validating our MDL-based selection criterion.
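    As a rough sketch of the MDL trade-off being optimized (the per-condition model cost below is a placeholder, not Classy's actual encoding), the selection criterion minimizes L(M) + L(D|M): the bits needed to encode the rule list plus the bits needed to encode the labels given the rule list:

        import math

        def data_cost(rule_list, instances):
            """L(D|M): negative log-likelihood, in bits, of each label under
            the class distribution of the first matching rule."""
            bits = 0.0
            for features, label in instances:
                for condition, class_probs in rule_list:  # last rule is the empty default
                    if condition <= features:
                        bits += -math.log2(class_probs[label])
                        break
            return bits

        def model_cost(rule_list):
            """L(M): placeholder encoding charging a fixed price per condition."""
            return 8.0 * sum(len(cond) for cond, _ in rule_list)

        def mdl_score(rule_list, instances):
            """Two-part MDL score: smaller is better, penalizing misfit and size."""
            return model_cost(rule_list) + data_cost(rule_list, instances)

        rules = [(frozenset({"f1"}), {"pos": 0.9, "neg": 0.1}),
                 (frozenset(), {"pos": 0.4, "neg": 0.6})]   # default rule
        data = [({"f1"}, "pos"), (set(), "neg"), ({"f1"}, "pos")]
        print(mdl_score(rules, data))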

    Quantitative Redundancy in Partial Implications

    We survey the different properties of an intuitive notion of redundancy, as a function of the precise semantics given to the notion of partial implication. The final version of this survey will appear in the Proceedings of the Int. Conf. Formal Concept Analysis, 2015.
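    For reference, the standard confidence semantics of a partial implication X → Y over a transaction set, under which such redundancy questions are posed (this is the field's usual definition, not something specific to this survey):

        \[
          \operatorname{conf}(X \to Y) = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)},
          \qquad
          X \to Y \text{ holds at threshold } \gamma \iff \operatorname{conf}(X \to Y) \ge \gamma.
        \]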

    Mining Traversal Patterns from Weighted Traversals and Graph

    ์‹ค์„ธ๊ณ„์˜ ๋งŽ์€ ๋ฌธ์ œ๋“ค์€ ๊ทธ๋ž˜ํ”„์™€ ๊ทธ ๊ทธ๋ž˜ํ”„๋ฅผ ์ˆœํšŒํ•˜๋Š” ํŠธ๋žœ์žญ์…˜์œผ๋กœ ๋ชจ๋ธ๋ง๋  ์ˆ˜ ์žˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค๋ฉด, ์›น ํŽ˜์ด์ง€์˜ ์—ฐ๊ฒฐ๊ตฌ์กฐ๋Š” ๊ทธ๋ž˜ํ”„๋กœ ํ‘œํ˜„๋  ์ˆ˜ ์žˆ๊ณ , ์‚ฌ์šฉ์ž์˜ ์›น ํŽ˜์ด์ง€ ๋ฐฉ๋ฌธ๊ฒฝ๋กœ๋Š” ๊ทธ ๊ทธ๋ž˜ํ”„๋ฅผ ์ˆœํšŒํ•˜๋Š” ํŠธ๋žœ์žญ์…˜์œผ๋กœ ๋ชจ๋ธ๋ง๋  ์ˆ˜ ์žˆ๋‹ค. ์ด์™€ ๊ฐ™์ด ๊ทธ๋ž˜ํ”„๋ฅผ ์ˆœํšŒํ•˜๋Š” ํŠธ๋žœ์žญ์…˜์œผ๋กœ๋ถ€ํ„ฐ ์ค‘์š”ํ•˜๊ณ  ๊ฐ€์น˜ ์žˆ๋Š” ํŒจํ„ด์„ ์ฐพ์•„๋‚ด๋Š” ๊ฒƒ์€ ์˜๋ฏธ ์žˆ๋Š” ์ผ์ด๋‹ค. ์ด๋Ÿฌํ•œ ํŒจํ„ด์„ ์ฐพ๊ธฐ ์œ„ํ•œ ์ง€๊ธˆ๊นŒ์ง€์˜ ์—ฐ๊ตฌ์—์„œ๋Š” ์ˆœํšŒ๋‚˜ ๊ทธ๋ž˜ํ”„์˜ ๊ฐ€์ค‘์น˜๋ฅผ ๊ณ ๋ คํ•˜์ง€ ์•Š๊ณ  ๋‹จ์ˆœํžˆ ๋นˆ๋ฐœํ•˜๋Š” ํŒจํ„ด๋งŒ์„ ์ฐพ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ด๋Ÿฌํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ•œ๊ณ„๋Š” ๋ณด๋‹ค ์‹ ๋ขฐ์„ฑ ์žˆ๊ณ  ์ •ํ™•ํ•œ ํŒจํ„ด์„ ํƒ์‚ฌํ•˜๋Š” ๋ฐ ์–ด๋ ค์›€์ด ์žˆ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ˆœํšŒ๋‚˜ ๊ทธ๋ž˜ํ”„์˜ ์ •์ ์— ๋ถ€์—ฌ๋œ ๊ฐ€์ค‘์น˜๋ฅผ ๊ณ ๋ คํ•˜์—ฌ ํŒจํ„ด์„ ํƒ์‚ฌํ•˜๋Š” ๋‘ ๊ฐ€์ง€ ๋ฐฉ๋ฒ•๋“ค์„ ์ œ์•ˆํ•œ๋‹ค. ์ฒซ ๋ฒˆ์งธ ๋ฐฉ๋ฒ•์€ ๊ทธ๋ž˜ํ”„๋ฅผ ์ˆœํšŒํ•˜๋Š” ์ •๋ณด์— ๊ฐ€์ค‘์น˜๊ฐ€ ์กด์žฌํ•˜๋Š” ๊ฒฝ์šฐ์— ๋นˆ๋ฐœ ์ˆœํšŒ ํŒจํ„ด์„ ํƒ์‚ฌํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๊ทธ๋ž˜ํ”„ ์ˆœํšŒ์— ๋ถ€์—ฌ๋  ์ˆ˜ ์žˆ๋Š” ๊ฐ€์ค‘์น˜๋กœ๋Š” ๋‘ ๋„์‹œ๊ฐ„์˜ ์ด๋™ ์‹œ๊ฐ„์ด๋‚˜ ์›น ์‚ฌ์ดํŠธ๋ฅผ ๋ฐฉ๋ฌธํ•  ๋•Œ ํ•œ ํŽ˜์ด์ง€์—์„œ ๋‹ค๋ฅธ ํŽ˜์ด์ง€๋กœ ์ด๋™ํ•˜๋Š” ์‹œ๊ฐ„ ๋“ฑ์ด ๋  ์ˆ˜ ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ข€ ๋” ์ •ํ™•ํ•œ ์ˆœํšŒ ํŒจํ„ด์„ ๋งˆ์ด๋‹ํ•˜๊ธฐ ์œ„ํ•ด ํ†ต๊ณ„ํ•™์˜ ์‹ ๋ขฐ ๊ตฌ๊ฐ„์„ ์ด์šฉํ•œ๋‹ค. ์ฆ‰, ์ „์ฒด ์ˆœํšŒ์˜ ๊ฐ ๊ฐ„์„ ์— ๋ถ€์—ฌ๋œ ๊ฐ€์ค‘์น˜๋กœ๋ถ€ํ„ฐ ์‹ ๋ขฐ ๊ตฌ๊ฐ„์„ ๊ตฌํ•œ ํ›„ ์‹ ๋ขฐ ๊ตฌ๊ฐ„์˜ ๋‚ด์— ์žˆ๋Š” ์ˆœํšŒ๋งŒ์„ ์œ ํšจํ•œ ๊ฒƒ์œผ๋กœ ์ธ์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฉ๋ฒ•์„ ์ ์šฉํ•จ์œผ๋กœ์จ ๋”์šฑ ์‹ ๋ขฐ์„ฑ ์žˆ๋Š” ์ˆœํšŒ ํŒจํ„ด์„ ๋งˆ์ด๋‹ํ•  ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ ์ด๋ ‡๊ฒŒ ๊ตฌํ•œ ํŒจํ„ด๊ณผ ๊ทธ๋ž˜ํ”„ ์ •๋ณด๋ฅผ ์ด์šฉํ•˜์—ฌ ํŒจํ„ด ๊ฐ„์˜ ์šฐ์„ ์ˆœ์œ„๋ฅผ ๊ฒฐ์ •ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•๊ณผ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์œ„ํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋„ ์ œ์‹œํ•œ๋‹ค. ๋‘ ๋ฒˆ์งธ ๋ฐฉ๋ฒ•์€ ๊ทธ๋ž˜ํ”„์˜ ์ •์ ์— ๊ฐ€์ค‘์น˜๊ฐ€ ๋ถ€์—ฌ๋œ ๊ฒฝ์šฐ์— ๊ฐ€์ค‘์น˜๊ฐ€ ๊ณ ๋ ค๋œ ๋นˆ๋ฐœ ์ˆœํšŒ ํŒจํ„ด์„ ํƒ์‚ฌํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ๊ทธ๋ž˜ํ”„์˜ ์ •์ ์— ๋ถ€์—ฌ๋  ์ˆ˜ ์žˆ๋Š” ๊ฐ€์ค‘์น˜๋กœ๋Š” ์›น ์‚ฌ์ดํŠธ ๋‚ด์˜ ๊ฐ ๋ฌธ์„œ์˜ ์ •๋ณด๋Ÿ‰์ด๋‚˜ ์ค‘์š”๋„ ๋“ฑ์ด ๋  ์ˆ˜ ์žˆ๋‹ค. ์ด ๋ฌธ์ œ์—์„œ๋Š” ๋นˆ๋ฐœ ์ˆœํšŒ ํŒจํ„ด์„ ๊ฒฐ์ •ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ํŒจํ„ด์˜ ๋ฐœ์ƒ ๋นˆ๋„๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๋ฐฉ๋ฌธํ•œ ์ •์ ์˜ ๊ฐ€์ค‘์น˜๋ฅผ ๋™์‹œ์— ๊ณ ๋ คํ•˜์—ฌ์•ผ ํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ •์ ์˜ ๊ฐ€์ค‘์น˜๋ฅผ ์ด์šฉํ•˜์—ฌ ํ–ฅํ›„์— ๋นˆ๋ฐœ ํŒจํ„ด์ด ๋  ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ๋Š” ํ›„๋ณด ํŒจํ„ด์€ ๊ฐ ๋งˆ์ด๋‹ ๋‹จ๊ณ„์—์„œ ์ œ๊ฑฐํ•˜์ง€ ์•Š๊ณ  ์œ ์ง€ํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ œ์•ˆํ•œ๋‹ค. ๋˜ํ•œ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์œ„ํ•ด ํ›„๋ณด ํŒจํ„ด์˜ ์ˆ˜๋ฅผ ๊ฐ์†Œ์‹œํ‚ค๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜๋„ ์ œ์•ˆํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆํ•œ ๋‘ ๊ฐ€์ง€ ๋ฐฉ๋ฒ•์— ๋Œ€ํ•˜์—ฌ ๋‹ค์–‘ํ•œ ์‹คํ—˜์„ ํ†ตํ•˜์—ฌ ์ˆ˜ํ–‰ ์‹œ๊ฐ„ ๋ฐ ์ƒ์„ฑ๋˜๋Š” ํŒจํ„ด์˜ ์ˆ˜ ๋“ฑ์„ ๋น„๊ต ๋ถ„์„ํ•˜์˜€๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ˆœํšŒ์— ๊ฐ€์ค‘์น˜๊ฐ€ ์žˆ๋Š” ๊ฒฝ์šฐ์™€ ๊ทธ๋ž˜ํ”„์˜ ์ •์ ์— ๊ฐ€์ค‘์น˜๊ฐ€ ์žˆ๋Š” ๊ฒฝ์šฐ์— ๋นˆ๋ฐœ ์ˆœํšŒ ํŒจํ„ด์„ ํƒ์‚ฌํ•˜๋Š” ์ƒˆ๋กœ์šด ๋ฐฉ๋ฒ•๋“ค์„ ์ œ์•ˆํ•˜์˜€๋‹ค. 
์ œ์•ˆํ•œ ๋ฐฉ๋ฒ•๋“ค์„ ์›น ๋งˆ์ด๋‹๊ณผ ๊ฐ™์€ ๋ถ„์•ผ์— ์ ์šฉํ•จ์œผ๋กœ์จ ์›น ๊ตฌ์กฐ์˜ ํšจ์œจ์ ์ธ ๋ณ€๊ฒฝ์ด๋‚˜ ์›น ๋ฌธ์„œ์˜ ์ ‘๊ทผ ์†๋„ ํ–ฅ์ƒ, ์‚ฌ์šฉ์ž๋ณ„ ๊ฐœ์ธํ™”๋œ ์›น ๋ฌธ์„œ ๊ตฌ์ถ• ๋“ฑ์ด ๊ฐ€๋Šฅํ•  ๊ฒƒ์ด๋‹ค.Abstract โ…ถ Chapter 1 Introduction 1.1 Overview 1.2 Motivations 1.3 Approach 1.4 Organization of Thesis Chapter 2 Related Works 2.1 Itemset Mining 2.2 Weighted Itemset Mining 2.3 Traversal Mining 2.4 Graph Traversal Mining Chapter 3 Mining Patterns from Weighted Traversals on Unweighted Graph 3.1 Definitions and Problem Statements 3.2 Mining Frequent Patterns 3.2.1 Augmentation of Base Graph 3.2.2 In-Mining Algorithm 3.2.3 Pre-Mining Algorithm 3.2.4 Priority of Patterns 3.3 Experimental Results Chapter 4 Mining Patterns from Unweighted Traversals on Weighted Graph 4.1 Definitions and Problem Statements 4.2 Mining Weighted Frequent Patterns 4.2.1 Pruning by Support Bounds 4.2.2 Candidate Generation 4.2.3 Mining Algorithm 4.3 Estimation of Support Bounds 4.3.1 Estimation by All Vertices 4.3.2 Estimation by Reachable Vertices 4.4 Experimental Results Chapter 5 Conclusions and Further Works Reference

    Learning from Ontology Streams with Semantic Concept Drift

    Data stream learning has been largely studied for extracting knowledge structures from continuous and rapid data records. In the semantic Web, data is interpreted in ontologies and its ordered sequence is represented as an ontology stream. Our work exploits the semantics of such streams to tackle the problem of concept drift, i.e., unexpected changes in data distribution, which cause most models to become less accurate as time passes. To this end, we revisit (i) semantic inference in the context of supervised stream learning, and (ii) models with semantic embeddings. The experiments show accurate prediction with data from Dublin and Beijing.
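    As a generic illustration of the drift phenomenon the paper targets (a purely statistical signal, not the paper's semantic approach), drift can be flagged when a model's accuracy over consecutive windows of the stream drops sharply:

        def detect_drift(stream, model, window=100, drop=0.15):
            """Return the window indices where accuracy falls by more than
            `drop` relative to the previous window."""
            prev_acc, correct, drifts = None, 0, []
            for i, (x, y) in enumerate(stream, 1):
                correct += model(x) == y
                if i % window == 0:
                    acc = correct / window
                    if prev_acc is not None and prev_acc - acc > drop:
                        drifts.append(i // window)
                    prev_acc, correct = acc, 0
            return drifts

        # toy stream: labels flip halfway, so a fixed model degrades
        stream = [(x, 0) for x in range(200)] + [(x, 1) for x in range(200)]
        print(detect_drift(stream, model=lambda x: 0))  # -> [3]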
    • โ€ฆ
    corecore