2,537 research outputs found

Discriminative Probabilistic Pattern Mining using Graph for Electronic Health Records

Author: Evgenii Li
Publication venue: Seoul National University Graduate School (서울대학교 대학원)
Publication date: 01/08/2019

Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2019. 8. Sun Kim (김선).

Clinical notes in Electronic Health Records (EHR) contain a great deal of useful information about patients' medical histories. However, clinical notes are highly unstructured and their volume grows continuously, so a reliable data mining technique is needed to group and categorize them. Many existing data mining techniques for classification use frequent patterns generated from keyword frequencies, but these patterns are not distinctive enough to classify complex data such as clinical notes in EHR. These techniques also run into scalability and computational cost problems on large EHR datasets. To address these issues, we introduce a discriminative probabilistic pattern mining algorithm that uses a graph (DPPMG) to generate subgraphs of frequent patterns for classification in electronic health records.

Instead of individual keywords, we use co-occurrences, which are combinations of binary features and are more discriminative, to construct a discriminative probabilistic frequent pattern graph for classifying clinical notes. Each co-occurrence is weighted by a log-odds score that reflects its discriminative power. The graph, which reflects the essence of the clinical notes, is searched for discriminative probabilistic frequent subgraphs: the search starts from a hub node in the graph and uses dynamic programming to find a path. The subgraphs discovered this way are then used to classify the clinical notes of electronic health records.

Contents: Chapter 1 Introduction and Motivation. Chapter 2 Background: 2.1 Frequent Pattern Based Classification; 2.2 Discriminative Pattern Mining; 2.3 Electronic Health Records. Chapter 3 Related Work. Chapter 4 Overview and Design. Chapter 5 Implementation: 5.1 Dataset; 5.2 Keyword Extraction and Filtering; 5.3 Co-occurrence Generation and Graph Construction; 5.4 Dynamic Programming to Discover Optimal Path. Chapter 6 Results and Evaluation: 6.1 Choosing Starting Hub Node; 6.2 Qualitative Analysis; 6.3 Discriminative Power of the Probabilistic Frequent Patterns. Chapter 7 Conclusion. Bibliography. Abstract in Korean (요약). Maste
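
The pipeline this abstract describes (log-odds weighted co-occurrences, a hub node, a dynamic-programming path search) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the add-0.5 smoothing, the choice of hub as the node with the largest total absolute edge weight, and the path objective (maximum summed log-odds over a simple path of bounded length) are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from functools import lru_cache
from itertools import combinations


def log_odds_edges(notes, labels):
    """Weight each keyword pair by a smoothed log-odds ratio between two classes.

    notes  : list of keyword collections, one per clinical note (hypothetical input format)
    labels : list of 0/1 class labels, aligned with notes
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    counts = defaultdict(lambda: [0, 0])  # pair -> [count in class 0, count in class 1]
    for keywords, label in zip(notes, labels):
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair][label] += 1
    edges = {}
    for pair, (neg, pos) in counts.items():
        # Add-0.5 smoothing keeps both probabilities strictly inside (0, 1) (assumption).
        p_pos = (pos + 0.5) / (n_pos + 1.0)
        p_neg = (neg + 0.5) / (n_neg + 1.0)
        edges[pair] = math.log(p_pos / (1 - p_pos)) - math.log(p_neg / (1 - p_neg))
    return edges


def hub_node(edges):
    """Pick the node with the largest total absolute edge weight (assumed hub criterion)."""
    strength = defaultdict(float)
    for (u, v), w in edges.items():
        strength[u] += abs(w)
        strength[v] += abs(w)
    return max(sorted(strength), key=strength.get)


def best_path(edges, start, max_edges):
    """Memoised search for the simple path from start with the highest summed weight."""
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w

    @lru_cache(maxsize=None)
    def solve(node, visited, remaining):
        best = (0.0, (node,))  # stopping here is always allowed
        if remaining == 0:
            return best
        for nxt, w in adj[node].items():
            if nxt in visited:
                continue
            score, tail = solve(nxt, visited | frozenset([nxt]), remaining - 1)
            if w + score > best[0]:
                best = (w + score, (node,) + tail)
        return best

    return solve(start, frozenset([start]), max_edges)
```

The memoisation key includes the visited set, so this only scales to small graphs; a real system would need a cheaper state or a different objective.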

SNU Open Repository and Archive

Efficient mining of discriminative molecular fragments

Author: Berthold Michael R.
Di Fatta Giuseppe
Publication date: 01/01/2005

Frequent pattern discovery in structured data is receiving increasing attention in many scientific application areas. However, the computational complexity and the large amount of data to be explored often make sequential algorithms unsuitable. In this context, high-performance distributed computing becomes a very interesting and promising approach. In this paper we present a parallel formulation of the frequent subgraph mining problem to discover interesting patterns in molecular compounds. The application is characterized by a highly irregular tree-structured computation. No estimate is available for task workloads, which follow a power-law distribution over a wide range. The proposed approach allows dynamic resource aggregation and provides fault and latency tolerance. These features make the distributed application suitable for multi-domain heterogeneous environments, such as computational Grids. The distributed application has been evaluated on the well-known National Cancer Institute's HIV-screening dataset.

KOPS - The Institutional Repository of the University of Konstanz

Central Archive at the University of Reading

GTRACE-RS: Efficient Graph Sequence Mining using Reverse Search

Author: Ikuta Hiroaki
Inokuchi Akihiro
Washio Takashi
Publication venue: 'Institute of Electronics, Information and Communications Engineers (IEICE)'
Publication date: 18/10/2011

The mining of frequent subgraphs from labeled graph data has been studied extensively. Furthermore, much attention has recently been paid to frequent pattern mining from graph sequences. A method called GTRACE has been proposed to mine frequent patterns from graph sequences under the assumption that changes in graphs are gradual. Although GTRACE mines the frequent patterns efficiently, it still needs substantial computation time to mine patterns from graph sequences containing large graphs and long sequences. In this paper, we propose a new version of GTRACE that enables efficient mining of frequent patterns based on the principle of reverse search. Reverse search is a general scheme for designing efficient algorithms for hard enumeration problems. Our performance study shows that the proposed method is efficient and scalable for mining both long and large graph sequence patterns and is several orders of magnitude faster than the original GTRACE.
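
The reverse-search principle mentioned in this abstract can be shown on a much simpler enumeration problem. The sketch below is not GTRACE-RS and does not handle graph sequences; it swaps in frequent itemsets as a stand-in. Each itemset is given a unique parent (the itemset minus its largest item), so the search visits every frequent itemset exactly once without keeping a table of patterns already seen. The function name and input format are hypothetical.

```python
def frequent_itemsets(transactions, min_support):
    """Enumerate frequent itemsets by traversing a parent/child tree (reverse search).

    parent(S) = S minus its largest item, so the children of S are exactly the
    sets S + {x} with x larger than every item in S. Because each itemset has one
    parent, no duplicate detection is needed.
    """
    items = sorted({item for t in transactions for item in t})
    tsets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for t in tsets if itemset <= t)

    def expand(itemset, last_index):
        for idx in range(last_index + 1, len(items)):
            child = itemset | {items[idx]}
            # Support is anti-monotone: an infrequent child has no frequent descendants.
            if support(child) >= min_support:
                yield tuple(sorted(child))
                yield from expand(child, idx)

    yield from expand(frozenset(), -1)
```

Memory use stays proportional to the depth of the recursion rather than to the number of patterns found, which is the usual reason for choosing this scheme on large enumeration problems.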

arXiv.org e-Print Archive

Crossref

Dynamic load balancing for the distributed mining of molecular structures

Author: Berthold M.R.
Di Fatta Giuseppe
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2006

In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention in recent years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, for which no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute's HIV-screening data set, where we were able to show close to linear speedup in a network of workstations. The proposed approach also allows for dynamic resource aggregation in a non-dedicated computational environment. These features make it suitable for large-scale, multi-domain, heterogeneous environments, such as computational grids.
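
Receiver-initiated load balancing on an irregular search tree can be sketched as a single-process simulation. This is an illustration under stated assumptions, not the paper's algorithm: the peer-to-peer messaging is replaced by direct access to the other workers' stacks, the donor is simply the most loaded worker, and the donor gives away its oldest pending node. The names `run` and `children` are hypothetical.

```python
from collections import deque


def run(root, children, n_workers):
    """Simulate workers exploring a search tree in lock-step rounds.

    children(node) returns the child nodes of a search-tree node.
    Returns the number of nodes each worker processed.
    """
    stacks = [deque() for _ in range(n_workers)]
    stacks[0].append(root)
    processed = [0] * n_workers
    while any(stacks):
        for w, stack in enumerate(stacks):
            if not stack:
                # Receiver-initiated: the idle worker asks for work, here from
                # the most loaded peer (assumed donor-selection policy).
                donor = max(range(n_workers), key=lambda i: len(stacks[i]))
                if len(stacks[donor]) > 1:
                    # The donor hands over its oldest pending node, which sits
                    # nearest the root and so tends to head a larger subtree.
                    stack.append(stacks[donor].popleft())
            if stack:
                node = stack.pop()  # depth-first on the local stack
                processed[w] += 1
                stack.extend(children(node))
    return processed
```

Because subtree sizes are unknown in advance, work moves only when a worker actually runs dry, which is the point of letting the receiver initiate the transfer.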

KOPS - The Institutional Repository of the University of Konstanz

Central Archive at the University of Reading

Crossref

Exploring the Evolution of Node Neighborhoods in Dynamic Networks

Author: Labatut Vincent
Naskali Ahmet Teoman
Orman Günce Keziban
Publication venue: 'Elsevier BV'
Publication date: 05/05/2017

Dynamic networks are a popular way of modeling and studying the behavior of evolving systems. However, their analysis constitutes a relatively recent subfield of Network Science, and the number of available tools is consequently much smaller than for static networks. In this work, we propose a method specifically designed to take advantage of the longitudinal nature of dynamic networks. It characterizes each individual node by studying the evolution of its direct neighborhood, based on the assumption that the way this neighborhood changes reflects the role and position of the node in the whole network. For this purpose, we define the concept of neighborhood event, which corresponds to the various transformations such groups of nodes can undergo, and describe an algorithm for detecting such events. We demonstrate the value of our method on three real-world networks: DBLP, LastFM and Enron. We apply frequent pattern mining to extract meaningful information from temporal sequences of neighborhood events. This results in the identification of behavioral trends emerging in the whole network, as well as the individual characterization of specific nodes. We also perform a cluster analysis, which reveals that, in all three networks, one can distinguish two types of nodes exhibiting different behaviors: a very small group of active nodes, whose neighborhoods undergo diverse and frequent events, and a very large group of stable nodes.
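
To make the idea of tracking a node's direct neighborhood over time concrete, here is a deliberately simplified sketch. The abstract does not list the event types or the detection algorithm, so the six labels below ("stable", "birth", "death", "expansion", "contraction", "turnover") and the snapshot format are assumptions for illustration only; the paper's events concern groups of nodes within the neighborhood and are likely richer than this.

```python
def neighborhood_events(snapshots, node):
    """Label how the direct neighborhood of `node` changes between consecutive snapshots.

    snapshots : list of dicts mapping a node to an iterable of its neighbors,
                one dict per time slice (hypothetical input format)
    """
    events = []
    previous = set(snapshots[0].get(node, ()))
    for graph in snapshots[1:]:
        current = set(graph.get(node, ()))
        if current == previous:
            label = "stable"
        elif not previous:
            label = "birth"        # neighborhood appears
        elif not current:
            label = "death"        # neighborhood disappears
        elif previous < current:
            label = "expansion"    # only gained neighbors
        elif current < previous:
            label = "contraction"  # only lost neighbors
        else:
            label = "turnover"     # gained some neighbors and lost others
        events.append(label)
        previous = current
    return events
```

The resulting per-node label sequences are the kind of input a sequential pattern miner could then scan for recurring behavior.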

arXiv.org e-Print Archive

HAL Descartes

Hal-Diderot

High performance subgraph mining in molecular compounds

Author: M.J. Zaki
O. Weislow
R. Finkel
T. Washio
Y. Chung
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2005

Structured data represented in the form of graphs arises in several fields of science, and the growing amount of available data makes distributed graph mining techniques particularly relevant. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. The problem is characterized by a highly irregular search tree, for which no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute's HIV-screening dataset, where the approach attains close to linear speedup in a network of workstations.

KOPS - The Institutional Repository of the University of Konstanz

Central Archive at the University of Reading

Crossref