4 research outputs found

    Graph Path Discovery for Large-Scale Biomedical Linked Data

    Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Dental Science, 2017. 2. Hong-Gee Kim.
    A drug can cause an adverse effect when combined with another particular drug. Identifying the underlying causes of such adverse effects is crucial for researchers who develop new drugs and for clinicians who prescribe medicine. Most existing approaches attempt to identify a set of target genes for which drugs are most effective, which provides insufficient information about these causes in terms of biological dynamics. Drugs should instead be considered as participants in activating a sequence of pathways that leads to some effect. I believe that the causes can be better understood through such linked pathways. Therefore, the purpose of this thesis is to develop algorithms and tools that can be used to discover the sequence of pathways activated by a particular drug combination. Furthermore, these algorithms must scale to massive biomedical Linked Data, because up-to-date results of biomedical research are increasingly published as Linked Data. My hypothesis is that, for a drug combination, adverse effects may occur when one drug up-regulates particular pathways and another drug down-regulates the same pathways. In this regard, the problem of revealing the causes of adverse effects of drug combinations is cast as the problem of discovering paths through a sequence of linked pathways that begin and end at the genes targeted by the given drugs. Scalable graph path discovery and matching algorithms are therefore devised to run in a distributed computing environment. A pathway graph model is defined to integrate diverse biomedical datasets, and a visualization tool is implemented to give biomedical researchers and clinicians an intuitive interface for revealing the causes of the adverse effects. An algorithm for shortest graph path discovery is proposed. 
An existing relational database approach is adapted to address shortest graph path discovery in a distributed computing framework, specifically Spark. The 2-hop reachability index is exploited to prune non-reachable paths during discovery. A vertex re-labeling technique is proposed to reduce the size of the 2-hop reachability index. Experimental results show that the proposed approach can successfully manage a large graph that previous studies failed to handle. The discovered shortest graph path can be transformed into a graph path query to find other, similar graph paths. To achieve this, a MapReduce algorithm for graph path matching, based on multi-way joins, is proposed. A signature encoding technique is devised to prune intermediate data that is not relevant to the given query. Experiments on RDF (Resource Description Framework) datasets show that the proposed SPARQL query processing is faster than state-of-the-art approaches. To apply these algorithms to the problem of drug combinations causing adverse effects, a novel pathway graph model is proposed. In particular, a pathway relationship model is described: directed links between pathways are established using protein-protein interactions and up/down regulations between genes. A prototype system based on a visualization framework is implemented and applied to a pathway graph built from several biomedical Linked Data sources (e.g., Reactome, KEGG, BioGRID, and STRING). A list of candidate drug combinations is obtained using the proposed system and compared with known drug-drug combinations available in DrugBank. In summary, this thesis proposes a scalable graph path discovery solution. Distributed computing frameworks and several index structures are exploited to handle massive graphs efficiently. A pathway graph model is defined, and a prototype system for biomedical researchers is implemented to apply the algorithms to the problem of drug combinations causing adverse effects. 
In future work, the solution will be generalized to address the temporal organization of signaling pathways, so that the causes of adverse effects of drug combinations can be better understood.

    Table of contents:
    I. Introduction
      1.1 Background and Motivation
      1.2 Contributions
        1.2.1 Shortest Graph Path Discovery based on Reachability Index
        1.2.2 Graph Path Matching based on Signature Encoding
        1.2.3 Application to Biomedical Linked Data
      1.3 Thesis Organization
    II. Preliminaries and Related Work
      2.1 Graph
      2.2 Graph Path
      2.3 Acyclic Transformation
      2.4 Reachability
      2.5 Distributed Computing Frameworks
      2.6 RDF & SPARQL
      2.7 SPARQL Processing Engines
    III. Shortest Graph Path Discovery based on Reachability Index
      3.1 Introduction
      3.2 Space Reduction of Reachability Index
        3.2.1 Introduction
        3.2.2 Related Work
        3.2.3 The Proposed Approach
        3.2.4 Theoretical Analysis
        3.2.5 Experimental Results
        3.2.6 Conclusion and Future Work
      3.3 Shortest Path Discovery
        3.3.1 Introduction
        3.3.2 FEM
        3.3.3 FEM-SR
        3.3.4 Theoretical Analysis
        3.3.5 Experimental Results
        3.3.6 Federated Shortest Path Discovery
      3.4 Conclusion
    IV. Graph Path Matching based on Signature Encoding
      4.1 Introduction
      4.2 Related Work
      4.3 Limitations of MapReduce-based SPARQL engines
      4.4 SigMR
      4.5 Index Structure
        4.5.1 Encoding Joined Triples
      4.6 Index Building
      4.7 Query Processing
      4.8 Theoretical Analysis
        4.8.1 Cost Model
        4.8.2 Correctness
      4.9 Experiments
        4.9.1 Index Building Time and Space Requirements
        4.9.2 Query Execution Time
        4.9.3 Effect of Signature Encoding
        4.9.4 Effect of the Size of Join Matrix
      4.10 Conclusion
    V. Application to Biomedical Linked Data
      5.1 Introduction
      5.2 Related Work
      5.3 Data Model
      5.4 CyHadoop
      5.5 Scenario
      5.6 Preliminary Results
      5.7 Future Directions
    VI. Conclusion
    References
    Appendix
    Abstract (in Korean)
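    The abstract of this entry mentions using a 2-hop reachability index to prune non-reachable paths during shortest path discovery. The sketch below illustrates that general idea on a single machine; it is not the thesis's FEM-SR algorithm, its Spark implementation, or its vertex re-labeling technique. The index is built with a generic pruned-labeling construction, and the function names, the degree-based hub order, and the toy graph are all assumptions made for illustration.

```python
from collections import deque


def build_two_hop_index(graph):
    """Build 2-hop labels for a directed graph {vertex: [successors]}.

    Returns (l_out, l_in). For u != v, u reaches v exactly when
    l_out[u] and l_in[v] share a hub vertex.
    """
    vertices = set(graph)
    for succs in graph.values():
        vertices.update(succs)
    reverse = {v: [] for v in vertices}
    for u, succs in graph.items():
        for v in succs:
            reverse[v].append(u)

    l_out = {v: set() for v in vertices}
    l_in = {v: set() for v in vertices}

    def degree(v):
        return len(graph.get(v, [])) + len(reverse[v])

    # Hypothetical heuristic: process high-degree vertices first as hubs.
    for hub in sorted(vertices, key=lambda v: (-degree(v), v)):
        # Forward pass: record hub in l_in[x] for every x that hub reaches,
        # unless an earlier hub already certifies "hub reaches x".
        queue, seen = deque([hub]), {hub}
        while queue:
            x = queue.popleft()
            if not l_out[hub].isdisjoint(l_in[x]):
                continue  # already covered, prune this branch
            l_in[x].add(hub)
            for y in graph.get(x, []):
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        # Backward pass: record hub in l_out[x] for every x that reaches hub.
        queue, seen = deque([hub]), {hub}
        while queue:
            x = queue.popleft()
            if not l_out[x].isdisjoint(l_in[hub]):
                continue
            l_out[x].add(hub)
            for y in reverse[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
    return l_out, l_in


def reachable(index, u, v):
    """Answer a reachability query from the labels alone."""
    l_out, l_in = index
    return u == v or not l_out[u].isdisjoint(l_in[v])


def shortest_path(graph, index, source, target):
    """BFS shortest path that skips vertices unable to reach the target."""
    if not reachable(index, source, target):
        return None
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in graph.get(u, []):
            # Pruning step: never expand a vertex that cannot reach target.
            if v not in parent and reachable(index, v, target):
                parent[v] = u
                queue.append(v)
    return None


# Toy graph (hypothetical): the branch through "c" and "e" is a dead end.
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": ["t"], "e": [], "t": []}
toy_index = build_two_hop_index(toy_graph)
toy_path = shortest_path(toy_graph, toy_index, "a", "t")
```

    In this toy graph the search never expands "c" or "e", because the index shows that neither can reach "t". The saving grows with the share of the graph that cannot reach the target.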

    Labelling Dynamic XML Documents: A GroupBased Approach

    Documents that comply with the XML standard are characterised by inherent ordering and are usually modelled as a tree. Applications now generate massive amounts of XML data, which calls for accurate and efficient queryable XML database systems. XML querying depends on XML labelling in much the same way as relational databases rely on indexes. Labelling schemes encode document order and structural information, so queries can use them without having to access the original XML document. Dynamic XML data, that is, data that changes, complicates labelling. As much research has shown, it is difficult to allocate unique labels to nodes in a dynamic XML tree so that all structural relationships between the nodes are encoded by the labels. Static XML documents are generally managed with labelling schemes that use simple labels. By contrast, dynamic labelling schemes accept extra labelling costs and lower query performance in order to allow arbitrary updates, irrespective of how often the document is actually updated. Given that static and dynamic XML documents are often not clearly distinguished, a labelling scheme whose efficiency does not depend on update frequency would be useful. The GroupBased labelling scheme proposed in this thesis is compatible with static as well as dynamic XML documents. In particular, the scheme performs well when processing updates to dynamic XML data. What differentiates it from other dynamic labelling schemes is its uniform behaviour irrespective of whether the document is static or dynamic, its ability to determine all structural relationships between nodes, and its improved query performance for both types of document. The experimental results highlight the advantages of the GroupBased scheme in comparison to earlier schemes.
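    To make concrete how labels can encode document order and structural relationships without access to the original document, here is a minimal sketch using Dewey (prefix) labels, a classic static scheme. It is not the GroupBased scheme, whose details the abstract does not give. The helper names and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def dewey_labels(root):
    """Assign each element a Dewey label: the parent's label plus a sibling position."""
    labels = {}

    def visit(node, label):
        labels[label] = node.tag
        for position, child in enumerate(node, start=1):
            visit(child, label + (position,))

    visit(root, (1,))
    return labels


def is_ancestor(a, b):
    """a is a proper ancestor of b when a is a proper prefix of b."""
    return len(a) < len(b) and b[:len(a)] == a


def is_parent(a, b):
    """a is the parent of b when b extends a by exactly one component."""
    return len(b) == len(a) + 1 and b[:len(a)] == a


def is_sibling(a, b):
    """Distinct labels that share the same parent prefix."""
    return a != b and len(a) == len(b) and a[:-1] == b[:-1]


def precedes(a, b):
    """Tuple comparison matches document (preorder) order for Dewey labels."""
    return a < b


# Hypothetical sample document.
sample = ET.fromstring(
    "<book><title/><chapter><section/><section/></chapter><chapter/></book>"
)
sample_labels = dewey_labels(sample)
# sample_labels maps (1,) to "book", (1, 2) to the first "chapter",
# (1, 2, 1) to its first "section", and so on.
```

    With such a static scheme, inserting a new first child under (1, 2) would force the existing sections to be relabelled. This relabelling cost is what dynamic labelling schemes aim to avoid.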

    Compressing Labels of Dynamic XML Data using Base-9 Scheme and Fibonacci Encoding

    The flexibility and self-describing nature of XML have made it the most common mark-up language for data representation on the Web. XML data is naturally modelled as a tree, and the structural information of the tree can be encoded into labels by an XML labelling scheme so that queries can be answered without accessing the original XML files. As the transmission of XML data over the Internet has grown, it has also become necessary to have an XML labelling scheme that supports dynamic XML data. For a large-scale and frequently updated XML document, existing dynamic XML labelling schemes still suffer from high growth rates in label size, which can result in overflow problems and/or ambiguous data/query retrievals. This thesis considers the compression of XML labels. A novel XML labelling scheme, named "Base-9", has been developed to generate labels that are as compact as possible and yet provide efficient support for queries over both static and dynamic XML data. A Fibonacci prefix-encoding method has been used for the first time to store Base-9's XML labels in a compressed format, with the intention of minimising storage space without degrading XML querying performance. The thesis also investigates the compression of XML labels using various existing prefix-encoding methods. This investigation has resulted in the proposal of a novel prefix-encoding method named "Elias-Fibonacci of order 3", which achieved the fastest encoding time of all prefix-encoding methods studied in this thesis, whereas Fibonacci encoding was found to require the least storage. Unlike current XML labelling schemes, the new Base-9 labelling scheme ensures the generation of short labels even after large, frequent, skewed insertions. 
The experimental results reported herein support the advantages of the short labels generated by combining the Base-9 scheme with Fibonacci encoding in terms of storing, updating, retrieving and querying XML data.
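    As background for the Fibonacci prefix encoding mentioned in this abstract, the sketch below implements standard Fibonacci coding of positive integers: the Zeckendorf representation followed by a terminating 1. How Base-9 labels are mapped to integers is not described in the abstract, so the label components encoded at the end are hypothetical, and the function names are illustrative.

```python
def fib_encode(n):
    """Return the Fibonacci code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Fibonacci coding is defined for positive integers only")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    # Greedy Zeckendorf decomposition, largest Fibonacci number first.
    bits = []
    for f in reversed(fibs):
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    # Emit least-significant position first, drop unused high positions,
    # then append the terminating "1" so every code word ends in "11".
    return "".join(reversed(bits)).rstrip("0") + "1"


def fib_decode_stream(bits):
    """Decode a concatenation of Fibonacci code words into a list of integers."""
    values = []
    total, current, following, previous = 0, 1, 2, "0"
    for bit in bits:
        if bit == "1" and previous == "1":
            # Two consecutive ones mark the end of a code word.
            values.append(total)
            total, current, following, previous = 0, 1, 2, "0"
            continue
        if bit == "1":
            total += current
        current, following = following, current + following
        previous = bit
    return values


# Hypothetical label components, stored back to back without separators.
components = [4, 1, 2]
packed = "".join(fib_encode(c) for c in components)  # "1011" + "11" + "011"
unpacked = fib_decode_stream(packed)                  # [4, 1, 2]
```

    Because a Zeckendorf representation never contains two adjacent ones, the pair "11" can only appear at the end of a code word. This is what lets variable-length codes be concatenated and still be split unambiguously.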