Parallelisation for data-intensive applications over peer-to-peer networks

Abstract

In Data Intensive Computing, properties of the data that are the input for an application decide running performance in most cases. Those properties include the size of the data, the relationships inside data, and so forth. There is a class of data intensive applications (BLAST, SETI@home, Folding@Home and so on so forth) whose performances solely depend on the amount of input data. Another important characteristic of those applications is that the input data can be split into units and these units are not related to each other during the runs of the applications. This characteristic helps this class of data intensive applications to be parallelised in the way where the input data is split into units and application runs on different computer nodes for certain portion of the units. SETI@home and Folding@Home have been successfully parallelised over peer-to-peer networks. However, they suffer from the problems of single point of failure and poor scalability. In order to solve these problems, we choose BLAST as our example data intensive applications and parallelise BLAST over a fully distributed peer-to-peer network. BLAST is a popular bioinformatics toolset which can be used to compare two DNA sequences. The major usage of BLAST is searching a query of sequences inside a database for their similarities so as to identify whether they are new. When comparing single pair of sequences, BLAST is efficient. However, due to growing size of the databases, executing BLAST jobs locally produces prohibitively poor performance. Thus, methods for parallelising BLAST are sought. Traditional BLAST parallelisation approaches are all based on clusters. Clusters employ a number of computing nodes and high bandwidth interlinks between nodes. Cluster-based BLAST exhibits higher performance; nevertheless, clusters suffer from limited resources and scalability problems. Clusters are expensive, prohibitively so when the growth of the sequence database are taken into account. It involves high cost and complication when increasing the number of nodes to adapt to the growth of BLAST databases. Hence a Peer-to-Peer-based BLAST service is required. This thesis demonstrates our parallelisation of BLAST over Peer-to-Peer networks (termed ppBLAST), which utilises the free storage and computing resources in the Peer-to-Peer networks to complete BLAST jobs in parallel. In order to achieve the goal, we build three layers in ppBLAST each of which is responsible for particular functions. The bottom layer is a DHT infrastructure with the support of range queries. It provides efficient range-based lookup service and storage for BLAST tasks. The middle layer is the BitTorrent-based database distribution. The upper layer is the core of ppBLAST which schedules and dispatches task to peers. For each layer, we conduct comprehensive research and the achievements are presented in this thesis. For the DHT layer, we design and implement our DAST-DHT. We analyse balancing, maximum number of children and the accuracy of the range query. We also compare the DAST with other range query methodology and state that if the number of children is adjusted to more two, the performance of DAST overcomes others. For the BitTorrent-like database distribution layer, we investigate the relationship between the seeding strategies and the selfish leechers (freeriders and exploiters). We conclude that OSS works better than TSS in a normal situation

Similar works

Full text

thumbnail-image

Warwick Research Archives Portal Repository

redirect
Last time updated on 28/06/2012

This paper was published in Warwick Research Archives Portal Repository.

Having an issue?

Is data on this page outdated, violates copyrights or anything else? Report the problem now and we will take corresponding actions after reviewing your request.