Search CORE

418 research outputs found

Privacy Preserving Utility Mining: A Survey

Author: Chao Han-Chieh
Gan Wensheng
Lin Jerry Chun-Wei
Wang Shyue-Liang
Yu Philip S.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 18/11/2018

In the big data era, collected data usually contains rich information and hidden knowledge. Utility-oriented pattern mining and analytics have shown a powerful ability to explore these ubiquitous data, which may be collected from various fields and applications, such as market basket analysis, retail, click-stream analysis, medical analysis, and bioinformatics. However, analyzing data that contains sensitive private information raises privacy concerns. To achieve a better trade-off between utility maximization and privacy preservation, Privacy-Preserving Utility Mining (PPUM) has become a critical issue in recent years. In this paper, we provide a comprehensive overview of PPUM. We first present the background of utility mining, privacy-preserving data mining, and PPUM, then introduce the related preliminaries and problem formulation of PPUM, as well as some key evaluation criteria for PPUM. In particular, we present and discuss the current state-of-the-art PPUM algorithms, as well as their advantages and deficiencies, in detail. Finally, we highlight and discuss some technical challenges and open directions for future research on PPUM.

Comment: 2018 IEEE International Conference on Big Data, 10 page
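
The abstract above does not define "utility", so the following is only a minimal sketch of the usual notion in utility mining: the utility of an itemset is the sum of quantity times unit profit over the transactions that contain it. The toy transactions, the profit table, and the function name `itemset_utility` are hypothetical and are not taken from the survey.

```python
from typing import Dict, List, Set

# Hypothetical toy database: each transaction maps item -> purchased quantity.
transactions: List[Dict[str, int]] = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"a": 1, "b": 1, "c": 3},
]

# Hypothetical unit profit of each item.
unit_profit: Dict[str, int] = {"a": 5, "b": 2, "c": 1}


def itemset_utility(itemset: Set[str],
                    db: List[Dict[str, int]],
                    profit: Dict[str, int]) -> int:
    """Sum quantity * unit profit over every transaction containing the whole itemset."""
    total = 0
    for transaction in db:
        if itemset.issubset(transaction.keys()):
            total += sum(transaction[item] * profit[item] for item in itemset)
    return total
```

A privacy-preserving method would typically try to hide itemsets judged sensitive while changing such utility values as little as possible; how that is done differs between the algorithms the survey covers.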

arXiv.org e-Print Archive

Crossref

Constraint-based sequence mining using constraint programming

Author: H Mannila
K Ye
MJ Zaki
T Fannes
T Guns
W Ugarte Rojas
Publication date: 25/02/2015

The goal of constraint-based sequence mining is to find sequences of symbols that are included in a large number of input sequences and that satisfy some constraints specified by the user. Many constraints have been proposed in the literature, but a general framework is still missing. We investigate the use of constraint programming as a general framework for this task. We first identify four categories of constraints that are applicable to sequence mining. We then propose two constraint programming formulations. The first formulation introduces a new global constraint called exists-embedding. This formulation is the most efficient but does not support one type of constraint. To support such constraints, we develop a second formulation that is more general but incurs more overhead. Both formulations can use the projected database technique used in specialised algorithms. Experiments demonstrate the flexibility towards constraint-based settings and compare the approach to existing methods.

Comment: In Integration of AI and OR Techniques in Constraint Programming (CPAIOR), 201
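
As a rough illustration of the task described in the abstract above, the sketch below checks whether a pattern is embedded in a sequence as a subsequence, counts how many input sequences contain it, and applies two simple user constraints. It is a plain Python check, not the paper's constraint programming formulation or its exists-embedding propagator; the names `has_embedding`, `support`, and `satisfies`, and the choice of a minimum-support and a maximum-length constraint, are assumptions made for the example.

```python
from typing import Sequence


def has_embedding(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """True if pattern occurs in sequence as a (not necessarily contiguous) subsequence."""
    remaining = iter(sequence)
    # `symbol in remaining` consumes the iterator up to the first match,
    # so the symbols of the pattern are matched strictly left to right.
    return all(symbol in remaining for symbol in pattern)


def support(pattern: Sequence[str], database: Sequence[Sequence[str]]) -> int:
    """Number of input sequences that contain the pattern."""
    return sum(has_embedding(pattern, sequence) for sequence in database)


def satisfies(pattern: Sequence[str],
              database: Sequence[Sequence[str]],
              min_support: int,
              max_length: int) -> bool:
    """Hypothetical constraint pair: a frequency threshold and a length limit."""
    return len(pattern) <= max_length and support(pattern, database) >= min_support
```

For instance, with the input sequences `"abcd"`, `"acbd"`, `"bda"`, and `"xad"`, the pattern `"ad"` is embedded in three of them; `"bda"` does not count because no `d` follows the `a`.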

arXiv.org e-Print Archive

Crossref

Discriminative Probabilistic Pattern Mining using Graph for Electronic Health Records

Author: Evgenii Li
Publication venue: Seoul National University Graduate School (서울대학교 대학원)
Publication date: 01/08/2019

Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2019. 8. 김선.

Electronic Health Records (EHR) contain plenty of useful information about patients' medical histories. However, EHR data is highly unstructured and its volume is growing continuously, so a reliable data mining technique is needed to group and categorize clinical notes. Although many existing data mining techniques for group classification use frequent patterns generated from keyword frequencies, these patterns do not have characteristics distinctive enough to show the differences between datasets and to classify complex data such as clinical notes in EHR. These techniques also encounter scalability and computational cost problems when applied to large EHR datasets. To address these issues, we introduce a discriminative probabilistic pattern mining algorithm that uses a graph (DPPMG) to generate subgraphs of frequent patterns for classification in electronic health records. We use co-occurrence, a combination of binary features that is more discriminative than individual keywords, to construct a discriminative probabilistic frequent pattern graph for clinical note classification. Each co-occurrence is weighted by a log-odds score that reflects its discriminative power. The graph, which reflects the essence of the clinical notes, is searched to find discriminative probabilistic frequent subgraphs. To discover these subgraphs, we start from a hub node in the graph and use dynamic programming to find a path. The subgraphs discovered by this approach are then used to classify clinical notes of electronic health records.

Contents
Chapter 1 Introduction and Motivation
Chapter 2 Background
2.1 Frequent Pattern Based Classification
2.2 Discriminative Pattern Mining
2.3 Electronic Health Records
Chapter 3 Related Work
Chapter 4 Overview and Design
Chapter 5 Implementation
5.1 Dataset
5.2 Keyword Extraction and Filtering
5.3 Co-occurrence Generation and Graph Construction
5.4 Dynamic Programming to Discover Optimal Path
Chapter 6 Results and Evaluation
6.1 Choosing Starting Hub Node
6.2 Qualitative Analysis
6.3 Discriminative Power of the Probabilistic Frequent Patterns
Chapter 7 Conclusion
Bibliography
Abstract (in Korean)

Maste
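
The thesis abstract above says each keyword co-occurrence is weighted by a log-odds score but does not give the formula. The sketch below shows one plausible reading, offered only as an assumption: the log-odds of a keyword pair appearing in notes of one class minus its log-odds in notes of the other class, with additive smoothing. The function name `cooccurrence_log_odds`, the two-class setup, and the smoothing constant `alpha` are all hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def cooccurrence_log_odds(pos_notes: List[Set[str]],
                          neg_notes: List[Set[str]],
                          alpha: float = 1.0) -> Dict[Pair, float]:
    """Score each keyword pair by the difference of its smoothed log-odds in two classes."""

    def pair_counts(notes: List[Set[str]]) -> Dict[Pair, int]:
        # Count, per class, the number of notes in which both keywords appear.
        counts: Dict[Pair, int] = {}
        for keywords in notes:
            for pair in combinations(sorted(keywords), 2):
                counts[pair] = counts.get(pair, 0) + 1
        return counts

    pos, neg = pair_counts(pos_notes), pair_counts(neg_notes)
    scores: Dict[Pair, float] = {}
    for pair in set(pos) | set(neg):
        # Smoothing (an assumption) keeps both probabilities strictly inside (0, 1).
        p = (pos.get(pair, 0) + alpha) / (len(pos_notes) + 2 * alpha)
        q = (neg.get(pair, 0) + alpha) / (len(neg_notes) + 2 * alpha)
        scores[pair] = math.log(p / (1 - p)) - math.log(q / (1 - q))
    return scores
```

Under this reading, a pair that appears mostly in one class gets a large positive or negative weight, while a pair that is equally common in both classes scores near zero. Such weights could then label the edges of the keyword graph that the abstract describes.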

SNU Open Repository and Archive

Diverse Rule Sets

Author: Agrawal Rakesh
Chaoji Vineet
Cohen William W
Freitas Alex A
Hannu Toivonen
Hasan Mohammad Al
Liu Bing
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 17/06/2020

While machine-learning models are flourishing and transforming many aspects of everyday life, the inability of humans to understand complex models makes it difficult for these models to be fully trusted and embraced. Thus, interpretability of models has been recognized as a quality as important as their predictive power. In particular, rule-based systems are experiencing a renaissance owing to their intuitive if-then representation. However, simply being rule-based does not ensure interpretability. For example, overlapping rules create ambiguity and hinder interpretation. Here we propose a novel approach to inferring diverse rule sets, by optimizing for small overlap among decision rules with a 2-approximation guarantee under the framework of Max-Sum diversification. We formulate the problem as maximizing a weighted sum of the discriminative quality and the diversity of a rule set. To overcome the exponential-size search space of association rules, we investigate several natural options for a small candidate set of high-quality rules, including frequent and accurate rules, and examine their hardness. Leveraging the special structure in our formulation, we then devise an efficient randomized algorithm, which samples rules that are highly discriminative and have small overlap. The proposed sampling algorithm analytically targets a distribution of rules that is tailored to our objective. We demonstrate the superior predictive power and interpretability of our model with a comprehensive empirical study against strong baselines.
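
To make the "weighted sum of discriminative quality and diversity" in the abstract above concrete, here is a minimal sketch of a Max-Sum style objective together with a simple greedy selection. It is not the paper's randomized sampling algorithm and carries no approximation guarantee as written. The use of Jaccard distance between the sets of records that two rules cover, the weight `lam`, and the names `objective` and `greedy_select` are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def jaccard_distance(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """1 - |a & b| / |a | b|; treated as 0 when both covers are empty."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def objective(rules: List[str],
              quality: Dict[str, float],
              cover: Dict[str, FrozenSet[int]],
              lam: float) -> float:
    """Sum of rule qualities plus lam times the sum of pairwise cover distances."""
    total_quality = sum(quality[r] for r in rules)
    total_diversity = sum(jaccard_distance(cover[a], cover[b])
                          for a, b in combinations(rules, 2))
    return total_quality + lam * total_diversity


def greedy_select(candidates: List[str],
                  quality: Dict[str, float],
                  cover: Dict[str, FrozenSet[int]],
                  lam: float,
                  k: int) -> List[str]:
    """Repeatedly add the candidate rule that most increases the objective."""
    chosen: List[str] = []
    remaining = list(candidates)
    while remaining and len(chosen) < k:
        best = max(remaining,
                   key=lambda r: objective(chosen + [r], quality, cover, lam))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

With `lam = 0` the selection reduces to picking the highest-quality rules, even if they cover the same records; a larger `lam` favors rules whose covers overlap little.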

arXiv.org e-Print Archive

Crossref

SeqScout: Using a Bandit Model to Discover Interesting Subgroups in Labeled Sequences

Author: Boulicaut Jean-François
Kaytoue Mehdi
Mathonat Romain
Nurbakova Diana
Publication venue: HAL CCSD
Publication date: 05/10/2019

It is extremely useful to exploit labeled datasets not only to learn models but also to improve our understanding of a domain and its available targeted classes. The so-called subgroup discovery task has been considered for a long time. It concerns the discovery of patterns or descriptions whose sets of supporting objects have interesting properties, e.g., they characterize or discriminate a given target class. Though many subgroup discovery algorithms have been proposed for transactional data, discovering subgroups within labeled sequential data, and thus searching for descriptions as sequential patterns, has been much less studied. In that context, exhaustive exploration strategies cannot be used for real-life applications and we have to look for heuristic approaches. We propose the algorithm SeqScout to discover interesting subgroups (w.r.t. a chosen quality measure) from labeled sequences of itemsets. This is a new sampling algorithm that mines discriminant sequential patterns using a multi-armed bandit model. It is an anytime algorithm that, for a given budget, finds a collection of local optima in the search space of descriptions and thus subgroups. It requires only light configuration and is independent of the quality measure used for pattern scoring. Furthermore, it is fairly simple to implement. We provide qualitative and quantitative experiments on several datasets to illustrate its added value.
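
The abstract above mentions a multi-armed bandit model without spelling out the selection rule. As a generic illustration, the sketch below implements the standard UCB1 strategy under a fixed budget. It is not SeqScout itself: treating each arm as a candidate to explore, and the callback `pull` that returns a quality score for the chosen arm, are assumptions, and the name `ucb1_run` is hypothetical.

```python
import math
from typing import Callable, List


def ucb1_run(n_arms: int, pull: Callable[[int], float], budget: int) -> List[int]:
    """Spend `budget` pulls with UCB1 and return how often each arm was chosen."""
    counts = [0] * n_arms   # number of times each arm was pulled
    means = [0.0] * n_arms  # running mean reward of each arm
    for t in range(1, budget + 1):
        if t <= n_arms:
            arm = t - 1  # initialisation: try every arm once
        else:
            # Pick the arm with the highest mean plus exploration bonus.
            arm = max(range(n_arms),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        reward = pull(arm)  # e.g. the quality of a pattern derived from this arm
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts
```

Because the loop can be stopped after any number of pulls and only needs a numeric reward, a scheme of this kind fits the anytime behavior and the independence from the quality measure that the abstract describes.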

Crossref

HAL

Hal-Diderot