    Probabilistic Random Walk Models for Comparative Network Analysis

    Graph-based systems and data analysis methods have become critical tools in many fields as they can provide an intuitive way of representing and analyzing interactions between variables. Due to the advances in measurement techniques, a massive amount of labeled data that can be represented as nodes on a graph (or network) have been archived in databases. Additionally, novel data without label information have been gradually generated and archived. Labeling and identifying characteristics of novel data is an important first step in utilizing the valuable data in an effective and meaningful way. Comparative network analysis is an effective computational means to identify and predict the properties of the unlabeled data by comparing the similarities and differences between well-studied and less-studied networks. Comparative network analysis aims to identify the matching nodes and conserved subnetworks across multiple networks to enable a prediction of the properties of the nodes in the less-studied networks based on the properties of the matching nodes in the well-studied networks (i.e., transferring knowledge between networks). One of the fundamental and important questions in comparative network analysis is how to accurately estimate node-to-node correspondence as it can be a critical clue in analyzing the similarities and differences between networks. Node correspondence is a comprehensive similarity that integrates various types of similarity measurements in a balanced manner. However, there are several challenges in accurately estimating the node correspondence for large-scale networks. First, the scale of the networks is a critical issue. As networks generally include a large number of nodes, we have to examine an extremely large space and it can pose a computational challenge due to the combinatorial nature of the problem. 
Furthermore, although different networks contain matching nodes and conserved subnetworks, structural variations such as node insertions and deletions make it difficult to integrate topological similarity. In this dissertation, novel probabilistic random walk models are proposed to accurately estimate node-to-node correspondence between networks. First, we propose a context-sensitive random walk (CSRW) model. In the CSRW model, the random walker analyzes the context of its current position and switches between a simultaneous walk on both networks and an individual walk on one of the networks. The context-sensitive nature of the random walker enables the method to effectively integrate different types of similarities by dealing with structural variations. Second, we propose the CUFID (Comparative network analysis Using the steady-state network Flow to IDentify orthologous proteins) model. In the CUFID model, we construct an integrated network by inserting pseudo edges between potential matching nodes in different networks. Then, we design the random walk protocol so that the walker transits more frequently between potential matching nodes as their node similarity increases and as they have more matching neighboring nodes. We apply the proposed random walk models to two comparative network analysis problems: global network alignment and network querying. Through extensive performance evaluations, we demonstrate that the proposed random walk models can accurately estimate node correspondence, which leads to improved and more reliable network comparison results.
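
The integrated-network idea in the abstract above can be illustrated with a toy sketch. It is not the CUFID transition rule, which the abstract does not spell out. The function name `pseudo_edge_flows`, the unit weight on within-network edges, and the weighting rule `similarity * (1 + number of matching neighbor pairs)` are all assumptions made for illustration. The sketch scores each candidate pair by the steady-state flow of a simple random walk across its pseudo edge.

```python
from collections import defaultdict


def pseudo_edge_flows(edges_a, edges_b, similarity):
    """Score candidate node pairs by steady-state flow across pseudo edges.

    edges_a, edges_b: undirected edge lists of the two networks.
    similarity: dict {(node_in_a, node_in_b): positive score} listing the
        candidate matches (the pseudo edges of the integrated network).
    The weighting rule below is hypothetical, not taken from the source.
    """
    nbrs_a, nbrs_b = defaultdict(set), defaultdict(set)
    for u, v in edges_a:
        nbrs_a[u].add(v)
        nbrs_a[v].add(u)
    for u, v in edges_b:
        nbrs_b[u].add(v)
        nbrs_b[v].add(u)

    # Assumed rule: a pseudo edge gets heavier when the pair is similar and
    # when more of its neighbor pairs are candidate matches themselves.
    weights = {}
    for (a, b), score in similarity.items():
        support = sum(
            1 for x in nbrs_a[a] for y in nbrs_b[b] if (x, y) in similarity
        )
        weights[(a, b)] = score * (1 + support)

    # Within-network edges are given weight 1 (an assumption).
    total = len(edges_a) + len(edges_b) + sum(weights.values())

    # For a random walk on an undirected weighted graph, the stationary
    # probability of node u is strength(u) / (2 * total) and the step
    # probability u -> v is w_uv / strength(u). The flow across an edge,
    # counting both directions, is therefore w_uv / total.
    return {pair: w / total for pair, w in weights.items()}


if __name__ == "__main__":
    edges_a = [("a1", "a2"), ("a2", "a3")]
    edges_b = [("b1", "b2"), ("b2", "b3")]
    candidates = {("a1", "b1"): 1.0, ("a2", "b2"): 1.0,
                  ("a3", "b3"): 1.0, ("a1", "b3"): 1.0}
    for pair, flow in sorted(pseudo_edge_flows(edges_a, edges_b, candidates).items()):
        print(pair, round(flow, 3))
```

Because this walk is reversible, the flow is proportional to the pseudo-edge weight, so all of the modelling effort sits in how the weights (or, in a non-reversible design, the transition probabilities) are chosen.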

    Fast Computation Techniques for Personalized PageRank on Large Graphs

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2020. 8. 이상구 (Sang-goo Lee).
    Computation of Personalized PageRank (PPR) in graphs is an important function that is widely utilized in myriad application domains such as search, recommendation, and knowledge discovery. Because the computation of PPR is an expensive process, a good number of innovative and efficient algorithms for computing PPR have been developed. However, efficient computation of PPR within very large graphs with millions of nodes or more is still an open problem. Moreover, previously proposed algorithms cannot handle updates efficiently, thus severely limiting their capability of handling dynamic graphs. In this paper, we present a fast converging algorithm that guarantees high and controlled precision. We improve the convergence rate of the traditional Power Iteration method by adopting successive over-relaxation and initial guess revision, a vector reuse strategy. The proposed method vastly improves on the traditional Power Iteration in terms of convergence rate and computation time, while retaining its simplicity and strictness. Since it can reuse the previously computed vectors for refreshing PPR vectors, its update performance is also greatly enhanced. Also, since the algorithm halts as soon as it reaches a given error threshold, we can flexibly control the trade-off between accuracy and time, a feature lacking in both sampling-based approximation methods and fully exact methods. Experiments show that the proposed algorithm is at least 20 times faster than the Power Iteration and outperforms other state-of-the-art algorithms.

    Contents:
    1 Introduction
    2 Preliminaries: Personalized PageRank
      2.1 Random Walk, PageRank, and Personalized PageRank
        2.1.1 Basics on Random Walk
        2.1.2 PageRank
        2.1.3 Personalized PageRank
      2.2 Characteristics of Personalized PageRank
      2.3 Applications of Personalized PageRank
      2.4 Previous Work on Personalized PageRank Computation
        2.4.1 Basic Algorithms
        2.4.2 Enhanced Power Iteration
        2.4.3 Bookmark Coloring Algorithm
        2.4.4 Dynamic Programming
        2.4.5 Monte-Carlo Sampling
        2.4.6 Enhanced Direct Solving
      2.5 Summary
    3 Personalized PageRank Computation with Initial Guess Revision
      3.1 Initial Guess Revision and Relaxation
      3.2 Finding Optimal Weight of Successive Over Relaxation for PPR
      3.3 Initial Guess Construction Algorithm for Personalized PageRank
    4 Fully Personalized PageRank Algorithm with Initial Guess Revision
      4.1 FPPR with IGR
      4.2 Optimization
      4.3 Experiments
    5 Personalized PageRank Query Processing with Initial Guess Revision
      5.1 PPR Query Processing with IGR
      5.2 Optimization
      5.3 Experiments
    6 Conclusion
    Bibliography
    Appendix
    Abstract (In Korean)
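
Two ideas from the abstract above, stopping as soon as an error threshold is met and reusing a previously computed vector as the initial guess, can be sketched with plain power iteration. This is a minimal illustration, not the thesis's algorithm: successive over-relaxation is left out, and the function name, the adjacency-list input, the restart probability of 0.15, and the L1 stopping rule are assumptions. Every node is assumed to have at least one out-link.

```python
def personalized_pagerank(out_links, source, restart=0.15, tol=1e-10,
                          x0=None, max_iter=10_000):
    """Power iteration for the PPR vector of `source`.

    out_links: list where out_links[u] is the non-empty list of nodes that
        u links to (dangling nodes are not handled in this sketch).
    x0: optional initial guess, e.g. a previously computed PPR vector.
    Returns (vector, number_of_iterations_used).
    """
    n = len(out_links)
    q = [0.0] * n
    q[source] = 1.0  # restart distribution concentrated on the source node
    x = list(x0) if x0 is not None else q[:]

    for iteration in range(1, max_iter + 1):
        # One step of x <- restart * q + (1 - restart) * W x, where W spreads
        # each node's mass evenly over its out-links.
        nxt = [restart * q[i] for i in range(n)]
        for u, targets in enumerate(out_links):
            share = (1.0 - restart) * x[u] / len(targets)
            for v in targets:
                nxt[v] += share
        # Halt once the L1 change between iterations drops below tol.
        error = sum(abs(a - b) for a, b in zip(nxt, x))
        x = nxt
        if error < tol:
            return x, iteration
    raise RuntimeError("power iteration did not reach the error threshold")


if __name__ == "__main__":
    graph = [[1, 2], [2], [0], [0, 2]]
    cold, cold_iters = personalized_pagerank(graph, source=0)
    # Reusing the converged vector as the initial guess finishes immediately.
    warm, warm_iters = personalized_pagerank(graph, source=0, x0=cold)
    print(cold_iters, warm_iters)
```

Loosening `tol` trades accuracy for time. After a small change to the graph, passing the old vector as `x0` starts the iteration near the new solution instead of from scratch.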

    Graph based Anomaly Detection and Description: A Survey

    Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, with graph data becoming ubiquitous, techniques for structured graph data have recently come into focus. As objects in graphs have long-range correlations, a suite of novel technology has been developed for anomaly detection in graph data. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs. As a key contribution, we give a general framework for the algorithms categorized under various settings: unsupervised vs. (semi-)supervised approaches, for static vs. dynamic graphs, for attributed vs. plain graphs. We highlight the effectiveness, scalability, generality, and robustness aspects of the methods. What is more, we stress the importance of anomaly attribution and highlight the major techniques that facilitate digging out the root cause, or the ‘why’, of the detected anomalies for further analysis and sense-making. Finally, we present several real-world applications of graph-based anomaly detection in diverse domains, including financial, auction, computer traffic, and social networks. We conclude our survey with a discussion on open theoretical and practical challenges in the field.

    Accurate multiple network alignment through context-sensitive random walk

    BACKGROUND: Comparative network analysis can provide an effective means of analyzing large-scale biological networks and gaining novel insights into their structure and organization. Global network alignment aims to predict the best overall mapping between a given set of biological networks, thereby identifying important similarities as well as differences among the networks. It has been shown that network alignment methods can be used to detect pathways or network modules that are conserved across different networks. Until now, a number of network alignment algorithms have been proposed based on different formulations and approaches, many of them focusing on pairwise alignment. RESULTS: In this work, we propose a novel multiple network alignment algorithm based on a context-sensitive random walk model. The random walker employed in the proposed algorithm switches between two different modes, namely, an individual walk on a single network and a simultaneous walk on two networks. The switching decision is made in a context-sensitive manner by examining the current neighborhood, which is effective for quantitatively estimating the degree of correspondence between nodes that belong to different networks, in a manner that sensibly integrates node similarity and topological similarity. The resulting node correspondence scores are then used to predict the maximum expected accuracy (MEA) alignment of the given networks. CONCLUSIONS: Performance evaluation based on synthetic networks as well as real protein-protein interaction networks shows that the proposed algorithm can construct more accurate multiple network alignments compared to other leading methods.

    Bayesian matching of unlabeled marked point sets using random fields, with an application to molecular alignment

    Statistical methodology is proposed for comparing unlabeled marked point sets, with an application to aligning steroid molecules in chemoinformatics. Methods from statistical shape analysis are combined with techniques for predicting random fields in spatial statistics in order to define a suitable measure of similarity between two marked point sets. Bayesian modeling of the predicted field overlap between pairs of point sets is proposed, and posterior inference of the alignment is carried out using Markov chain Monte Carlo simulation. By representing the fields in reproducing kernel Hilbert spaces, the degree of overlap can be computed without expensive numerical integration. Superimposing entire fields rather than the configuration matrices of point coordinates thereby avoids the problem that there is usually no clear one-to-one correspondence between the points. In addition, mask parameters are introduced in the model, so that partial matching of the marked point sets can be carried out. We also propose an adaptation of the generalized Procrustes analysis algorithm for the simultaneous alignment of multiple point sets. The methodology is illustrated with a simulation study and then applied to a data set of 31 steroid molecules, where the relationship between shape and binding activity to the corticosteroid binding globulin receptor is explored. Comment: Published at http://dx.doi.org/10.1214/11-AOAS486 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).

    Clustering Service Networks with Entity, Attribute, and Link Heterogeneity

    Many popular web service networks are content-rich in terms of heterogeneous types of entities and links, associated with incomplete attributes. Clustering such heterogeneous service networks demands new clustering techniques that can handle two heterogeneity challenges: (1) multiple types of entities co-exist in the same service network with multiple attributes, and (2) links between entities have diverse types and carry different semantics. Existing heterogeneous graph clustering techniques tend to pick initial centroids uniformly at random, specify the number k of clusters in advance, and fix k during the clustering process. In this paper, we propose Service Cluster, a novel heterogeneous service network clustering algorithm with four unique features. First, we incorporate various types of entity, attribute and link information into a unified distance measure. Second, we design a Discrete Steepest Descent method to naturally produce initial k and initial centroids simultaneously. Third, we propose a dynamic learning method to automatically adjust the link weights towards clustering convergence. Fourth, we develop an effective optimization strategy to identify new suitable k and k well-chosen centroids at each clustering iteration. Extensive evaluation on real datasets demonstrates that Service Cluster outperforms existing representative methods in terms of both effectiveness and efficiency.

    Correspondence driven saliency transfer

    In this paper, we show that large annotated data sets have great potential to provide strong priors for saliency estimation rather than merely serving for benchmark evaluations. To this end, we present a novel image saliency detection method called saliency transfer. Given an input image, we first retrieve a support set of best matches from the large database of saliency annotated images. Then, we assign the transitional saliency scores by warping the support set annotations onto the input image according to computed dense correspondences. To incorporate context, we employ two complementary correspondence strategies: a global matching scheme based on scene-level analysis and a local matching scheme based on patch-level inference. We then introduce two refinement measures to further refine the saliency maps and apply the random-walk-with-restart by exploring the global saliency structure to estimate the affinity between foreground and background assignments. Extensive experimental results on four publicly available benchmark data sets demonstrate that the proposed saliency algorithm consistently outperforms the current state-of-the-art methods.

    Simple and Efficient Local Codes for Distributed Stable Network Construction

    In this work, we study protocols so that populations of distributed processes can construct networks. In order to highlight the basic principles of distributed network construction we keep the model minimal in all respects. In particular, we assume finite-state processes that all begin from the same initial state and all execute the same protocol (i.e. the system is homogeneous). Moreover, we assume pairwise interactions between the processes that are scheduled by an adversary. The only constraint on the adversary scheduler is that it must be fair. In order to allow processes to construct networks, we let them activate and deactivate their pairwise connections. When two processes interact, the protocol takes as input the states of the processes and the state of their connection and updates all of them. Initially all connections are inactive and the goal is for the processes, after interacting and activating/deactivating connections for a while, to end up with a desired stable network. We give protocols (optimal in some cases) and lower bounds for several basic network construction problems such as spanning line, spanning ring, spanning star, and regular network. We provide proofs of correctness for all of our protocols and analyze the expected time to convergence of most of them under a uniform random scheduler that selects the next pair of interacting processes uniformly at random from all such pairs. Finally, we prove several universality results by presenting generic protocols that are capable of simulating a Turing Machine (TM) and exploiting it in order to construct a large class of networks. Comment: 43 pages, 7 figures.
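
The model in the abstract above (identical finite-state processes, pairwise interactions, connections that can be switched on and off) can be made concrete with a small simulation. The abstract does not list any protocol, so the three rules below are an illustrative two-state spanning-star protocol assumed for this sketch, and the function name `simulate_spanning_star` is hypothetical. The scheduler picks the next pair uniformly at random, as in the abstract's time analysis.

```python
import random
from itertools import combinations


def simulate_spanning_star(n, seed=0, max_steps=1_000_000):
    """Run an assumed two-state star protocol under a uniform random scheduler.

    States: "c" (candidate center) and "p" (peripheral); all start as "c"
    with every connection inactive. Assumed rules:
        (c, c, inactive) -> (c, p, active)
        (p, p, active)   -> (p, p, inactive)
        (c, p, inactive) -> (c, p, active)
    Returns (states, active_edges, interactions_used).
    """
    rng = random.Random(seed)
    state = ["c"] * n
    active = set()  # active connections stored as frozenset({u, v})
    pairs = list(combinations(range(n), 2))

    def is_stable_star():
        centers = [u for u in range(n) if state[u] == "c"]
        if len(centers) != 1:
            return False
        hub = centers[0]
        return active == {frozenset((hub, v)) for v in range(n) if v != hub}

    for step in range(max_steps):
        if is_stable_star():
            return state, active, step
        u, v = rng.choice(pairs)  # uniform random scheduler
        edge = frozenset((u, v))
        on = edge in active
        if state[u] == "c" and state[v] == "c" and not on:
            state[v] = "p"        # one candidate center is eliminated
            active.add(edge)
        elif state[u] == "p" and state[v] == "p" and on:
            active.discard(edge)  # peripherals drop links among themselves
        elif {state[u], state[v]} == {"c", "p"} and not on:
            active.add(edge)      # a center attaches to a peripheral
    raise RuntimeError("no stable star within max_steps")


if __name__ == "__main__":
    states, edges, steps = simulate_spanning_star(8, seed=1)
    print(states.count("c"), len(edges), steps)
```

Once a single `c` remains and it is linked to every `p`, no rule applies any more, so the star is stable. The global stability check exists only in the simulator; the processes themselves never learn that construction has finished.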