In Data Intensive Computing, the properties of an application's input data determine its running performance in most cases. Those properties include the size of the data, the relationships within the data, and so forth. There is a class of data intensive applications (BLAST, SETI@home, Folding@Home and so on) whose performance depends solely on the amount of input data. Another important characteristic of those applications is that the input data can be split into units that are independent of each other while the application runs. This characteristic allows this class of applications to be parallelised by splitting the input data into units and running the application on different computer nodes, each handling a portion of the units. SETI@home and Folding@Home have been successfully parallelised over peer-to-peer networks. However, they suffer from a single point of failure and poor scalability. In order to solve these problems, we choose BLAST as our example data intensive application and parallelise BLAST over a fully distributed peer-to-peer network.

BLAST is a popular bioinformatics toolset which can be used to compare two DNA sequences. Its major usage is searching a database for sequences similar to a set of query sequences, so as to identify whether the query sequences are new. When comparing a single pair of sequences, BLAST is efficient. However, due to the growing size of the databases, executing BLAST jobs locally produces prohibitively poor performance. Thus, methods for parallelising BLAST are sought.

Traditional BLAST parallelisation approaches are all based on clusters. Clusters employ a number of computing nodes and high-bandwidth interlinks between nodes. Cluster-based BLAST exhibits higher performance; nevertheless, clusters suffer from limited resources and scalability problems.
Clusters are expensive, prohibitively so when the growth of the sequence databases is taken into account. Increasing the number of nodes to keep up with the growth of BLAST databases involves high cost and complication. Hence a Peer-to-Peer-based BLAST service is required.

This thesis demonstrates our parallelisation of BLAST over Peer-to-Peer networks (termed ppBLAST), which utilises the free storage and computing resources in Peer-to-Peer networks to complete BLAST jobs in parallel. To achieve this goal, we build three layers in ppBLAST, each of which is responsible for particular functions. The bottom layer is a DHT infrastructure with support for range queries. It provides an efficient range-based lookup service and storage for BLAST tasks. The middle layer is the BitTorrent-based database distribution. The upper layer is the core of ppBLAST, which schedules and dispatches tasks to peers. For each layer, we conduct comprehensive research, and the achievements are presented in this thesis.

For the DHT layer, we design and implement our DAST-DHT. We analyse balancing, the maximum number of children and the accuracy of the range query. We also compare DAST with other range query methods and show that if the number of children is set to more than two, DAST outperforms the others. For the BitTorrent-like database distribution layer, we investigate the relationship between the seeding strategies and the selfish leechers (freeriders and exploiters). We conclude that OSS works better than TSS in a normal situation