3 research outputs found

    Post Event Investigation of Multi-stream Video Data Utilizing Hadoop Cluster

    Get PDF
    Rapid advancement in technology and in-expensive camera has raised the necessity of monitoring systems for surveillance applications. As a result data acquired from numerous cameras deployed for surveillance is tremendous. When an event is triggered then, manually investigating such a massive data is a complex task. Thus it is essential to explore an approach that, can store massive multi-stream video data as well as, process them to find useful information. To address the challenge of storing and processing multi-stream video data, we have used Hadoop, which has grown into a leading computing model for data intensive applications. In this paper we propose a novel technique for performing post event investigation on stored surveillance video data. Our algorithm stores video data in HDFS in such a way that it efficiently identifies the location of data from HDFS based on the time of occurrence of event and perform further processing. To prove efficiency of our proposed work, we have performed event detection in the video based on the time period provided by the user. In order to estimate the performance of our approach, we evaluated the storage and processing of video data by varying (i) pixel resolution of video frame (ii) size of video data (iii) number of reducers (workers) executing the task (iv) the number of nodes in the cluster. The proposed framework efficiently achieve speed up of 5.9 for large files of 1024X1024 pixel resolution video frames thus makes it appropriate for the feasible practical deployment in any applications

    λŒ€μš©λŸ‰ μ˜μƒλ¬Όν•™ λ§ν¬λ“œ 데이터λ₯Ό μœ„ν•œ κ·Έλž˜ν”„ 경둜 탐색

    Get PDF
    ν•™μœ„λ…Όλ¬Έ (박사)-- μ„œμšΈλŒ€ν•™κ΅ λŒ€ν•™μ› : μΉ˜μ˜κ³Όν•™κ³Ό, 2017. 2. 김홍기.A drug could give rise to an adverse effect when combined with another particular drug. Addressing the underlying causes of the adverse effects is crucial for researchers to develop new drugs and for clinicians to prescribe medicine. Most existing approaches attempt to identify a set of target genes for which drugs are most effective, which provides insufficient information regarding these causes in terms of biological dynamics. Drugs should instead be considered as participants in activating a sequence of pathways that lead to some effects. I believe that the causes can better be understood by such linked pathways. Therefore, the purpose of this thesis is to develop algorithms and tools that can be used to discover a sequence of pathways that is activated by a particular drug combination. Furthermore, these algorithms are required to be scalable to manage massive biomedical Linked Data because up-to-date results of biomedical research are increasingly available in Linked Data. My hypothesis is that for a drug combination, when a drug up-regulates particular pathways in one direction and another drug down-regulates the same pathways in an opposite direction, adverse effects may occur by the drug combination. In this regard, the problem of revealing the causes of adverse effects of drug combinations is cast into the problem of discovering paths of a sequence of linked pathways that begins and ends at the genes that the given drugs target. Therefore, the scalable graph path discovery and matching algorithms are devised such that they work with a distributed computing environment. A pathway graph model is defined to integrate diverse biomedical datasets and a visualization tool is implemented to provide biomedical researchers and clinicians with intuitive interfaces for revealing the causes of the adverse effects. An algorithm for the shortest graph path discovery is proposed. An existing relational database approach is adapted to address the shortest graph path discovery in a distributed computing framework, in particular, Spark. The 2-hop reachability index is exploited to prune non-reachable paths during discovery computation. A vertex re-labeling technique is proposed to reduce the size of the 2-hop reachability index. Experimental results show that the proposed approach can successfully manage a large graph, which previous studies have failed to do. The discovered shortest graph path can be transformed into a graph path query to find another similar graph path. To achieve this, a MapReduce algorithm for graph path matching, based on multi-way joins, is proposed. A signature encoding technique is devised to prune intermediate data that is not relevant to the given query. Experiments against RDF (Resource Description Framework) datasets show that SPARQL query processing is faster than the state-of-the-art approaches. To adapt these algorithms into the problem of drug combinations causing adverse effects, a novel pathway graph model is proposed. In particular, a pathway relationship model is describeddirected links between pathways are established using protein–protein interactions and up/down regulations between genes. A prototype system based on a visualization framework is implemented and applied to a pathway graph that is built on the basis of several biomedical Linked Data (e.g. Reactome, KEGG, BioGrid, STRING and etc). A list of candidate drug combinations is obtained using the proposed system, which is compared with known drug-drug combinations available in DrugBank. A scalable graph path discovery solution is proposed in this thesis. Distributed computing frameworks and several index structures are exploited to efficiently handle massive graphs. A pathway graph model is defined and a prototype system for biomedical researchers is implemented to apply the algorithms to the problem of drug combinations causing adverse effects. In future works, the solution will be generalized to address the temporal organization of signaling pathways, thereby enabling the causes of adverse effects of drug combination to be better understood.I. Introduction 1 1.1 Background and Motivation 1 1.2 Contributions 4 1.2.1 Shortest Graph Path Discovery based on Reachability Index 4 1.2.2 Graph Path Matching based on Signature Encoding 5 1.2.3 Application to Biomedical Linked Data 6 1.3 Thesis Organization 6 II. Preliminaries and RelatedWork 9 2.1 Graph 9 2.2 Graph Path 10 2.3 Acyclic Transformation 11 2.4 Reachability 11 2.5 Distributed Computing Frameworks 12 2.6 RDF & SPARQL 12 2.7 SPARQL Processing Engines 14 III. Shortest Graph Path Discovery based on Reachability Index 17 3.1 Introduction 17 3.2 Space Reduction of Reachability Index 18 3.2.1 Introduction 18 3.2.2 Related Work 21 3.2.3 The Proposed Approach 24 3.2.4 Theoretical Analysis 25 3.2.5 Experimental Results 31 3.2.6 Conclusion and Future Work 33 3.3 Shortest Path Discovery 40 3.3.1 Introduction 40 3.3.2 FEM 41 3.3.3 FEM-SR 42 3.3.4 Theoretical Analysis 46 3.3.5 Experimental Results 51 3.3.6 Federated Shortest Path Discovery 53 3.4 Conclusion 55 IV. Graph Path Matching based on Signature Encoding 61 4.1 Introduction 61 4.2 Related Work 67 4.3 Limitations of MapReduce-based SPARQL engines 68 4.4 SigMR 69 4.5 Index Structure 70 4.5.1 Encoding Joined Triples 72 4.6 Index Building 76 4.7 Query Processing 83 4.8 Theoretical Analysis 88 4.8.1 Cost Model 89 4.8.2 Correctness 92 4.9 Experiments 94 4.9.1 Index Building Time and Space Requirements 95 4.9.2 Query Execution Time 98 4.9.3 Effect of Signature Encoding 100 4.9.4 Effect of the Size of Join Matrix 100 4.10 Conclusion 102 V. Application to Biomedical Linked Data 105 5.1 Introduction 105 5.2 Related Work 106 5.3 Data Model 108 5.4 CyHadoop 116 5.5 Scenario 119 5.6 Preliminary Results 120 5.7 Future Directions 121 VI. Conclusion 129 References 131 Appendix 141 초둝 153Docto

    λ§΅λ¦¬λ“€μŠ€μ—μ„œμ˜ 병렬 쑰인을 μœ„ν•œ 닀차원 λ²”μœ„ λΆ„ν•  기법

    Get PDF
    ν•™μœ„λ…Όλ¬Έ (박사)-- μ„œμšΈλŒ€ν•™κ΅ λŒ€ν•™μ› : 전기·컴퓨터곡학뢀, 2014. 8. 이상ꡬ.Joins are fundamental operations for many data analysis tasks, but are not directly supported by the MapReduce framework. This is because 1) the framework is basically designed to process a single input data set, and 2) MapReduce's key-equality based data grouping method makes it difficult to support complex join conditions. As a result, a large number of MapReduce-based join algorithms have been proposed. As in traditional shared-nothing systems, one of the major issues in join algorithms using MapReduce is handling of data skew. We propose a new skew handling method, called Multi-Dimensional Range Partitioning (MDRP), and show that the proposed method outperforms traditional skew handling methods: range-based and randomized methods. Specifically, the proposed method has the following advantages: 1) Compared to the range-based method, it considers the number of output tuples at each machine, which leads better handling of join product skew. 2) Compared with the randomized method, it exploits given join conditions before the actual join begins, so that unnecessary input duplication can be reduced. The MDRP method can be used to support advanced join operations such as theta-joins and multi-way joins. With extensive experiments using real and synthetic data sets, we evaluate the effectiveness of the proposed algorithm.Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i I. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 II. Backgrounds and RelatedWork . . . . . . . . . . . . . . . . 8 2.1 MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Join Algorithms in MapReduce . . . . . . . . . . . . . . . . 11 2.2.1 Two-Way Join Algorithms . . . . . . . . . . . . . . 11 2.2.2 Multi-Way Join Algorithms . . . . . . . . . . . . . 17 2.3 Data Skew in Join Algorithms . . . . . . . . . . . . . . . . 18 2.4 Skew Handling Approaches in MapReduce . . . . . . . . . 22 2.4.1 Hash-Based Approach . . . . . . . . . . . . . . . . 22 2.4.2 Range-Based Approach . . . . . . . . . . . . . . . 24 2.4.3 Randomized Approach . . . . . . . . . . . . . . . . 26 III. Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.1 Multi-Dimensional Range Partitioning . . . . . . . . . . . . 29 3.1.1 Creation of a Partitioning Matrix . . . . . . . . . . . 29 3.1.2 Identifying and Chopping of Heavy Cells . . . . . . 31 3.1.3 Assigning Cells to Reducers . . . . . . . . . . . . . 33 3.1.4 Join Processing using the Partitioning Matrix . . . . 35 3.2 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . 39 3.3 Complex Join Conditions . . . . . . . . . . . . . . . . . . . 41 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.4.1 Scalar Skew Experiments . . . . . . . . . . . . . . . 44 3.4.2 Zipfs Distribution . . . . . . . . . . . . . . . . . . 49 3.4.3 Non-Equijoin Experiments . . . . . . . . . . . . . . 50 3.4.4 Scalability Experiments . . . . . . . . . . . . . . . 52 3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.5.1 Sampling . . . . . . . . . . . . . . . . . . . . . . . 55 3.5.2 Memory-Awareness . . . . . . . . . . . . . . . . . 58 3.5.3 Handling of Heavy Cells . . . . . . . . . . . . . . . 59 3.5.4 Existing Histograms . . . . . . . . . . . . . . . . . 60 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 IV. Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.1 Joining Multiple Relations in a MapReduce Job . . . . . . . 65 4.1.1 Example: SPARQL Basic Graph Pattern . . . . . . . 65 4.1.2 Example: Matrix Chain Multiplication . . . . . . . . 67 4.1.3 Single-Key Join and Multiple-Key Join Queries . . . 69 4.2 Skew Handling for Multi-Way Joins . . . . . . . . . . . . . 71 4.2.1 Skew Handling for SK-Join Queries . . . . . . . . . 71 4.2.2 Skew Handling for MK-Join Queires . . . . . . . . 72 4.3 Combinations of SK-Join and MK-Join . . . . . . . . . . . 74 4.3.1 Complex Queries . . . . . . . . . . . . . . . . . . . 74 4.3.2 Iteration-Based Algorithms . . . . . . . . . . . . . . 75 4.3.3 Replication-Based Algorithms . . . . . . . . . . . . 77 4.3.4 Iteration-Based vs. Replication-Based . . . . . . . . 78 4.4 Join-Key Selection Algorithms for Complex Queries . . . . 83 4.4.1 Greedy Key Selection . . . . . . . . . . . . . . . . 84 4.4.2 Multiple Key Selection . . . . . . . . . . . . . . . . 85 4.4.3 Hybrid Key Selection . . . . . . . . . . . . . . . . . 86 4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.5.1 SK-Join Experiments . . . . . . . . . . . . . . . . . 87 4.5.2 MK-Join Experiments . . . . . . . . . . . . . . . . 89 4.5.3 Analysis of TV Watching Logs . . . . . . . . . . . . 90 4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 V. Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.1 Algorithms for SPARQL Basic Graph Pattern . . . . . . . . 94 5.1.1 MR-Selection . . . . . . . . . . . . . . . . . . . . . 95 5.1.2 MR-Join . . . . . . . . . . . . . . . . . . . . . . . 98 5.1.3 Performance Evaluation . . . . . . . . . . . . . . . 101 5.1.4 Discussion . . . . . . . . . . . . . . . . . . . . . . 105 5.2 Algorithms for Matrix Chain Multiplication . . . . . . . . . 107 5.2.1 Serial Two-Way Join (S2) . . . . . . . . . . . . . . 109 5.2.2 Parallel M-Way Join (P2, PM) . . . . . . . . . . . . 111 5.2.3 Serial Two-Way vs. Parallel M-Way . . . . . . . . . 115 5.2.4 Performance Evaluation . . . . . . . . . . . . . . . 116 5.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . 119 5.2.6 Extension: Embedded MapReduce . . . . . . . . . . 119 VI. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 초둝 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133Docto
    corecore