418 research outputs found

    Privacy Preserving Utility Mining: A Survey

    In the big data era, collected data usually contain rich information and hidden knowledge. Utility-oriented pattern mining and analytics have shown a powerful ability to explore these ubiquitous data, which may be collected from various fields and applications, such as market basket analysis, retail, click-stream analysis, medical analysis, and bioinformatics. However, analyzing data that contain sensitive private information raises privacy concerns. To achieve a better trade-off between maximizing utility and preserving privacy, Privacy-Preserving Utility Mining (PPUM) has become a critical issue in recent years. In this paper, we provide a comprehensive overview of PPUM. We first present the background of utility mining, privacy-preserving data mining, and PPUM, then introduce the related preliminaries and problem formulation of PPUM, as well as some key evaluation criteria for PPUM. In particular, we present and discuss the current state-of-the-art PPUM algorithms, as well as their advantages and deficiencies, in detail. Finally, we highlight and discuss some technical challenges and open directions for future research on PPUM.
    Comment: 2018 IEEE International Conference on Big Data, 10 page

    Constraint-based sequence mining using constraint programming

    The goal of constraint-based sequence mining is to find sequences of symbols that are included in a large number of input sequences and that satisfy some constraints specified by the user. Many constraints have been proposed in the literature, but a general framework is still missing. We investigate the use of constraint programming as a general framework for this task. We first identify four categories of constraints that are applicable to sequence mining. We then propose two constraint programming formulations. The first formulation introduces a new global constraint called exists-embedding. This formulation is the most efficient but does not support one type of constraint. To support such constraints, we develop a second formulation that is more general but incurs more overhead. Both formulations can use the projected database technique used in specialised algorithms. Experiments demonstrate the flexibility towards constraint-based settings and compare the approach to existing methods.
    Comment: In Integration of AI and OR Techniques in Constraint Programming (CPAIOR), 201
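The notion of a pattern being "included in" an input sequence can be illustrated with a plain subsequence test. The sketch below is not the paper's constraint programming formulation or its exists-embedding constraint; it is a minimal illustration, assuming a pattern may match with gaps, and the function names `is_embedded` and `support` are hypothetical.

```python
from typing import Hashable, Iterable, Sequence


def is_embedded(pattern: Sequence[Hashable], sequence: Iterable[Hashable]) -> bool:
    """Return True if `pattern` occurs in `sequence` in order, gaps allowed."""
    remaining = iter(sequence)
    # `symbol in remaining` advances the iterator past the first match,
    # so later symbols can only match later positions.
    return all(symbol in remaining for symbol in pattern)


def support(pattern: Sequence[Hashable], database: Iterable[Sequence[Hashable]]) -> int:
    """Count the input sequences that contain `pattern` (hypothetical helper)."""
    return sum(1 for sequence in database if is_embedded(pattern, sequence))


database = ["abcd", "acbd", "bd", "dcba"]
print(support("ad", database))
print(support("bd", database))
```

A minimum-support constraint then amounts to keeping only patterns whose `support` reaches a user-chosen threshold; other constraint types would further restrict which embeddings count.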

    Discriminative Probabilistic Pattern Mining using Graph for Electronic Health Records

    Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2019. 8. Kim Sun.
    Electronic Health Records (EHR) contain plenty of useful information about patients' medical histories. However, the clinical notes in EHR are highly unstructured and their volume is growing continuously, so a reliable data mining technique is needed to group and categorize them. Many existing data mining techniques for group classification use frequent patterns generated from keyword frequencies, but these patterns do not have characteristics distinctive enough to separate datasets when classifying complex data such as clinical notes in EHR. These techniques also run into scalability and computational cost problems when applied to large EHR datasets. To address these issues, we introduce a discriminative probabilistic pattern mining algorithm that uses a graph (DPPMG) to generate subgraphs of frequent patterns for classification in electronic health records. Instead of individual keywords, we use co-occurrences, which are combinations of binary features and are more discriminative, to construct a discriminative probabilistic frequent pattern graph for clinical note classification. Each co-occurrence carries a log-odds score as its weight, reflecting its discriminative power. The graph, which reflects the essence of the clinical notes, is searched to find discriminative probabilistic frequent subgraphs: we start from a hub node in the graph and use dynamic programming to find a path. The discriminative probabilistic frequent subgraphs discovered by this approach are then used to classify the clinical notes of electronic health records.
    Contents: Chapter 1 Introduction and Motivation; Chapter 2 Background (2.1 Frequent Pattern Based Classification, 2.2 Discriminative Pattern Mining, 2.3 Electronic Health Records); Chapter 3 Related Work; Chapter 4 Overview and Design; Chapter 5 Implementation (5.1 Dataset, 5.2 Keyword Extraction and Filtering, 5.3 Co-occurrence Generation and Graph Construction, 5.4 Dynamic Programming to Discover Optimal Path); Chapter 6 Results and Evaluation (6.1 Choosing Starting Hub Node, 6.2 Qualitative Analysis, 6.3 Discriminative Power of the Probabilistic Frequent Patterns); Chapter 7 Conclusion; Bibliography; Abstract (Korean)
    Maste
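The abstract does not give the formula behind the log-odds weight, so the following is only a sketch of one plausible reading: each keyword pair is scored by the difference between its smoothed log-odds of appearing in one class of notes and in the other. The function name `cooccurrence_log_odds`, the two-class setup, and the additive `smoothing` parameter are all assumptions made for illustration, not the thesis's definition.

```python
import math
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def pair_counts(notes: Iterable[Set[str]]) -> Dict[Pair, int]:
    """Count, per keyword pair, how many notes contain both keywords."""
    counts: Dict[Pair, int] = {}
    for keywords in notes:
        for pair in combinations(sorted(keywords), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def cooccurrence_log_odds(
    pos_notes: List[Set[str]], neg_notes: List[Set[str]], smoothing: float = 1.0
) -> Dict[Pair, float]:
    """Weight each co-occurrence by a smoothed log-odds difference (assumed form)."""
    pos, neg = pair_counts(pos_notes), pair_counts(neg_notes)
    weights: Dict[Pair, float] = {}
    for pair in set(pos) | set(neg):
        # Additive smoothing keeps both probabilities strictly inside (0, 1).
        p = (pos.get(pair, 0) + smoothing) / (len(pos_notes) + 2 * smoothing)
        q = (neg.get(pair, 0) + smoothing) / (len(neg_notes) + 2 * smoothing)
        weights[pair] = math.log(p / (1 - p)) - math.log(q / (1 - q))
    return weights


pos_notes = [{"fever", "cough"}, {"fever", "cough"}]
neg_notes = [{"fever", "rash"}, {"cough", "rash"}]
for pair, weight in sorted(cooccurrence_log_odds(pos_notes, neg_notes).items()):
    print(pair, round(weight, 3))
```

Under this reading, a positive weight marks a pair that is more characteristic of the first class and a negative weight a pair more characteristic of the second; such weights could then label the edges of a keyword graph.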

    Diverse Rule Sets

    While machine-learning models are flourishing and transforming many aspects of everyday life, the inability of humans to understand complex models makes it difficult for these models to be fully trusted and embraced. Thus, interpretability of models has been recognized as a quality as important as their predictive power. In particular, rule-based systems are experiencing a renaissance owing to their intuitive if-then representation. However, simply being rule-based does not ensure interpretability. For example, overlapping rules spawn ambiguity and hinder interpretation. Here we propose a novel approach to inferring diverse rule sets, by optimizing for small overlap among decision rules with a 2-approximation guarantee under the framework of Max-Sum diversification. We formulate the problem as maximizing a weighted sum of discriminative quality and diversity of a rule set. In order to overcome an exponential-size search space of association rules, we investigate several natural options for a small candidate set of high-quality rules, including frequent and accurate rules, and examine their hardness. Leveraging the special structure in our formulation, we then devise an efficient randomized algorithm, which samples rules that are highly discriminative and have small overlap. The proposed sampling algorithm analytically targets a distribution of rules that is tailored to our objective. We demonstrate the superior predictive power and interpretability of our model with a comprehensive empirical study against strong baselines.

    SeqScout: Using a Bandit Model to Discover Interesting Subgroups in Labeled Sequences

    It is extremely useful to exploit labeled datasets not only to learn models but also to improve our understanding of a domain and its available targeted classes. The so-called subgroup discovery task has been considered for a long time. It concerns the discovery of patterns or descriptions whose sets of supporting objects have interesting properties, e.g., they characterize or discriminate a given target class. Though many subgroup discovery algorithms have been proposed for transactional data, discovering subgroups within labeled sequential data, and thus searching for descriptions as sequential patterns, has been much less studied. In that context, exhaustive exploration strategies cannot be used for real-life applications and we have to look for heuristic approaches. We propose the algorithm SeqScout to discover interesting subgroups (w.r.t. a chosen quality measure) from labeled sequences of itemsets. This is a new sampling algorithm that mines discriminant sequential patterns using a multi-armed bandit model. It is an anytime algorithm that, for a given budget, finds a collection of local optima in the search space of descriptions and thus subgroups. It requires only light configuration and is independent of the quality measure used for pattern scoring. Furthermore, it is fairly simple to implement. We provide qualitative and quantitative experiments on several datasets to illustrate its added value.
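The abstract does not say which bandit policy SeqScout uses, so the sketch below shows the standard UCB1 selection rule purely as an example of a multi-armed bandit choosing where to spend the next unit of budget. Treating each arm as a candidate starting point for pattern sampling, and the function name `ucb1_select`, are assumptions made for illustration.

```python
import math
from typing import Sequence


def ucb1_select(
    total_reward: Sequence[float],
    pulls: Sequence[int],
    exploration: float = math.sqrt(2),
) -> int:
    """Pick the next arm with the UCB1 rule: mean reward plus a confidence bonus."""
    # Try every arm once before comparing confidence bounds.
    for arm, count in enumerate(pulls):
        if count == 0:
            return arm
    total_pulls = sum(pulls)
    scores = [
        total_reward[arm] / pulls[arm]
        + exploration * math.sqrt(math.log(total_pulls) / pulls[arm])
        for arm in range(len(pulls))
    ]
    return max(range(len(pulls)), key=scores.__getitem__)


# Arm 0 has the higher mean reward after equal exploration, so it is chosen.
print(ucb1_select(total_reward=[9.0, 1.0], pulls=[10, 10]))
```

In a budgeted, anytime loop, the reward fed back for an arm would be the quality measure of the pattern found from it, which is why a rule of this kind does not depend on the particular measure chosen.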