2,537 research outputs found
Discriminative Probabilistic Pattern Mining using Graph for Electronic Health Records
Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2019. 8. Kim Sun (김선).
Electronic Health Records (EHR) contain plenty of useful information about patients' medical history. However, EHR data is highly unstructured and its volume is growing continuously, so a reliable data mining technique is needed to group and categorize clinical notes. Although many existing data mining techniques for group classification use frequent patterns generated from keyword frequencies, these patterns are not distinctive enough to classify complex data such as clinical notes in EHR. These techniques also run into scalability and computational cost problems when applied to large EHR datasets. To address these issues, we introduce a discriminative probabilistic pattern mining algorithm that uses a graph (DPPMG) to generate subgraphs of frequent patterns for classification in electronic health records.
We use co-occurrence, a combination of binary features that is more discriminative than individual keywords, to construct a discriminative probabilistic frequent pattern graph for clinical notes classification. Each co-occurrence is weighted by a log-odds score associated with its discriminative power. The graph, which reflects the essence of the clinical notes, is searched to find discriminative probabilistic frequent subgraphs. To discover these subgraphs, we start from a hub node in the graph and use dynamic programming to find a path. The discriminative probabilistic frequent subgraphs discovered by this approach are later used to classify clinical notes of electronic health records.
Chapter 1 Introduction and Motivation
Chapter 2 Background
2.1 Frequent Pattern Based Classification
2.2 Discriminative Pattern Mining
2.3 Electronic Health Records
Chapter 3 Related Work
Chapter 4 Overview and Design
Chapter 5 Implementation
5.1 Dataset
5.2 Keyword Extraction and Filtering
5.3 Co-occurrence Generation and Graph Construction
5.4 Dynamic Programming to Discover Optimal Path
Chapter 6 Results and Evaluation
6.1 Choosing Starting Hub Node
6.2 Qualitative Analysis
6.3 Discriminative Power of the Probabilistic Frequent Patterns
Chapter 7 Conclusion
Bibliography
Summary (in Korean)
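The pipeline described in this abstract (co-occurrences weighted by log-odds, a hub node, a dynamic-programming path search) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the additive smoothing constant `alpha`, the choice of the highest-degree node as the hub, and the rule of keeping only the best-scoring path per end node are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_log_odds(pos_notes, neg_notes, alpha=1.0):
    """Weight each keyword pair by a smoothed log-odds of occurring in
    positive vs. negative notes (alpha is an assumed smoothing constant)."""
    def pair_counts(notes):
        counts = Counter()
        for keywords in notes:
            # Count each unordered keyword pair at most once per note.
            counts.update(combinations(sorted(set(keywords)), 2))
        return counts

    pos, neg = pair_counts(pos_notes), pair_counts(neg_notes)
    weights = {}
    for pair in set(pos) | set(neg):
        p = (pos[pair] + alpha) / (len(pos_notes) + 2 * alpha)
        q = (neg[pair] + alpha) / (len(neg_notes) + 2 * alpha)
        weights[pair] = math.log(p / (1 - p)) - math.log(q / (1 - q))
    return weights


def hub_node(weights):
    """Assumed hub criterion: the node with the most incident edges
    (ties broken alphabetically)."""
    degree = Counter()
    for a, b in weights:
        degree[a] += 1
        degree[b] += 1
    return max(sorted(degree), key=degree.get)


def best_path(weights, start, max_edges):
    """Length-indexed dynamic programming: after k edges, keep only the
    best-scoring simple path ending at each node (a heuristic, not exact)."""
    adj = defaultdict(dict)
    for (a, b), w in weights.items():
        adj[a][b] = w
        adj[b][a] = w
    best = {start: (0.0, [start])}
    overall = (0.0, [start])
    for _ in range(max_edges):
        nxt = {}
        for u, (score, path) in best.items():
            for v, w in adj[u].items():
                if v in path:  # do not revisit nodes
                    continue
                cand = (score + w, path + [v])
                if v not in nxt or cand[0] > nxt[v][0]:
                    nxt[v] = cand
        best = nxt
        for cand in best.values():
            if cand[0] > overall[0]:
                overall = cand
    return overall
```

Given keyword lists for two classes of notes, `cooccurrence_log_odds` yields the edge weights, `hub_node` picks a starting point, and `best_path` returns a `(score, path)` pair whose path could serve as a candidate discriminative pattern.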
Efficient mining of discriminative molecular fragments
Frequent pattern discovery in structured data is receiving increasing attention in many application areas of science. However, the computational complexity and the large amount of data to be explored often make sequential algorithms unsuitable. In this context, high-performance distributed computing becomes a very interesting and promising approach. In this paper we present a parallel formulation of the frequent subgraph mining problem to discover interesting patterns in molecular compounds. The application is characterized by a highly irregular tree-structured computation. No estimation is available for task workloads, which show a power-law distribution over a wide range. The proposed approach allows dynamic resource aggregation and provides fault and latency tolerance. These features make the distributed application suitable for multi-domain heterogeneous environments, such as computational Grids. The distributed application has been evaluated on the well-known National Cancer Institute's HIV-screening dataset
GTRACE-RS: Efficient Graph Sequence Mining using Reverse Search
The mining of frequent subgraphs from labeled graph data has been studied
extensively. Furthermore, much attention has recently been paid to frequent
pattern mining from graph sequences. A method, called GTRACE, has been proposed
to mine frequent patterns from graph sequences under the assumption that
changes in graphs are gradual. Although GTRACE mines the frequent patterns
efficiently, it still needs substantial computation time to mine the patterns
from graph sequences containing large graphs and long sequences. In this paper,
we propose a new version of GTRACE that enables efficient mining of frequent
patterns based on the principle of a reverse search. The underlying concept of
the reverse search is a general scheme for designing efficient algorithms for
hard enumeration problems. Our performance study shows that the proposed method
is efficient and scalable for mining both long and large graph sequence
patterns and is several orders of magnitude faster than the original GTRACE
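The reverse search principle mentioned in this abstract can be illustrated on a much simpler problem than graph sequences. The sketch below is an assumption-laden toy, not GTRACE-RS: it enumerates frequent itemsets, defining the parent of a pattern as the pattern without its largest item, so that every pattern is generated from exactly one parent and no duplicate check is needed. The function name and data layout are hypothetical.

```python
def reverse_search_itemsets(transactions, min_support):
    """Enumerate frequent itemsets by traversing the tree induced by a
    parent function (parent = pattern minus its largest item)."""
    items = sorted({item for t in transactions for item in t})
    tsets = [set(t) for t in transactions]

    def support(pattern):
        return sum(1 for t in tsets if t.issuperset(pattern))

    def parent(pattern):
        return pattern[:-1]

    def children(pattern):
        last = pattern[-1] if pattern else None
        for item in items:
            if last is None or item > last:
                child = pattern + (item,)
                # Reverse-search invariant: the child's unique parent is
                # the pattern we are expanding.
                assert parent(child) == pattern
                # Support is anti-monotone, so an infrequent child has no
                # frequent descendants and its subtree can be skipped.
                if support(child) >= min_support:
                    yield child

    found = []
    stack = [()]  # the empty pattern is the root of the enumeration tree
    while stack:
        pattern = stack.pop()
        if pattern:
            found.append(pattern)
        stack.extend(children(pattern))
    return found
```

Because the traversal only needs the current pattern and its children, memory use stays proportional to the depth of the enumeration tree rather than to the number of patterns found so far.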
Dynamic load balancing for the distributed mining of molecular structures
In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of
methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the
past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially
render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to
discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, whereby no
reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely, a dynamic
partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated
load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer
Institute's HIV-screening data set, where we were able to show close to linear speedup in a network of workstations. The proposed
approach also allows for dynamic resource aggregation in a non-dedicated computational environment. These features make it suitable
for large-scale, multi-domain, heterogeneous environments, such as computational grids
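Receiver-initiated load balancing over an irregular search tree can be sketched as a single-process simulation. This is only an illustration under stated assumptions, not the paper's peer-to-peer algorithm: workers are simulated round-robin in one loop, an idle worker asks the currently most loaded peer for work, and the donor hands over half of its queue. The function `simulate` and the callback `expand` are hypothetical names.

```python
from collections import deque


def simulate(num_workers, root_tasks, expand):
    """Simulate workers exploring a search tree; idle workers (receivers)
    initiate the transfer of work from a loaded peer (donor)."""
    queues = [deque() for _ in range(num_workers)]
    queues[0].extend(root_tasks)  # all work starts on worker 0
    processed = [0] * num_workers

    while any(queues):
        for w, queue in enumerate(queues):
            if not queue:
                # Receiver-initiated: the idle worker picks the most
                # loaded peer and takes half of its pending tasks.
                donor = max(range(num_workers), key=lambda i: len(queues[i]))
                for _ in range(len(queues[donor]) // 2):
                    queue.append(queues[donor].popleft())
                continue
            task = queue.pop()  # depth-first on the local queue
            processed[w] += 1
            # Children are unknown until the task is expanded, which is
            # why workload cannot be predicted in advance.
            queue.extend(expand(task))
    return processed


def uneven_children(n):
    """Toy irregular tree: node n has children 0..n-1 (2**n nodes total)."""
    return list(range(n))
```

Calling `simulate(4, [5], uneven_children)` returns the number of tasks each simulated worker processed; the donor gives away tasks from the old end of its queue, which in a depth-first search tend to be the roots of larger unexplored subtrees.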
Exploring the Evolution of Node Neighborhoods in Dynamic Networks
Dynamic Networks are a popular way of modeling and studying the behavior of
evolving systems. However, their analysis constitutes a relatively recent
subfield of Network Science, and the number of available tools is consequently
much smaller than for static networks. In this work, we propose a method
specifically designed to take advantage of the longitudinal nature of dynamic
networks. It characterizes each individual node by studying the evolution of
its direct neighborhood, based on the assumption that the way this neighborhood
changes reflects the role and position of the node in the whole network. For
this purpose, we define the concept of neighborhood event, which
corresponds to the various transformations such groups of nodes can undergo,
and describe an algorithm for detecting such events. We demonstrate the
value of our method on three real-world networks: DBLP, LastFM and Enron. We
apply frequent pattern mining to extract meaningful information from temporal
sequences of neighborhood events. This results in the identification of
behavioral trends emerging in the whole network, as well as the individual
characterization of specific nodes. We also perform a cluster analysis, which
reveals that, in all three networks, one can distinguish two types of nodes
exhibiting different behaviors: a very small group of active nodes, whose
neighborhoods undergo diverse and frequent events, and a very large group of
stable nodes
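The idea of describing a node by how its direct neighborhood changes between time slices can be sketched as follows. The abstract does not list the event types, so the labels used here ("birth", "death", "growth", "contraction", "stable", "change") are hypothetical simplifications chosen for illustration, as is the function name; the resulting label sequence is the kind of input a frequent pattern miner could consume.

```python
def neighborhood_events(snapshots, node):
    """Label how the direct neighborhood of `node` changes between
    consecutive time slices. `snapshots` is a list of edge lists."""
    def neighbors(edges):
        return {v if u == node else u
                for u, v in edges
                if node in (u, v) and u != v}

    events = []
    prev = neighbors(snapshots[0])
    for edges in snapshots[1:]:
        cur = neighbors(edges)
        if cur == prev:
            label = "stable"
        elif not prev:
            label = "birth"        # neighborhood appears
        elif not cur:
            label = "death"        # neighborhood disappears
        elif cur > prev:
            label = "growth"       # strictly more neighbors, none lost
        elif cur < prev:
            label = "contraction"  # strictly fewer neighbors, none gained
        else:
            label = "change"       # some neighbors gained and some lost
        events.append(label)
        prev = cur
    return events
```

For a node tracked over several slices, the returned list has one label per transition, so nodes with mostly "stable" entries would fall into the large stable group the abstract describes.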
High performance subgraph mining in molecular compounds
Structured data represented in the form of graphs arises in several fields of science, and the growing amount of available data makes distributed graph mining techniques particularly relevant. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. The problem is characterized by a highly irregular search tree, whereby no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute's HIV-screening dataset, where the approach attains close to linear speedup in a network of workstations
- …