Parallelisation for data-intensive applications over peer-to-peer networks

By Xinuo Chen

Abstract

In data-intensive computing, the properties of an application's input data decide its running performance in most cases. Those properties include the size of the data, the relationships inside the data, and so forth. There is a class of data-intensive applications (BLAST, SETI@home, Folding@Home and so on) whose performance depends solely on the amount of input data. Another important characteristic of those applications is that the input data can be split into units, and these units are not related to each other while the applications run. This characteristic allows this class of data-intensive applications to be parallelised by splitting the input data into units and running the application on different computer nodes, each handling a portion of the units. SETI@home and Folding@Home have been successfully parallelised over peer-to-peer networks. However, they suffer from a single point of failure and poor scalability. To solve these problems, we choose BLAST as our example data-intensive application and parallelise it over a fully distributed peer-to-peer network.

BLAST is a popular bioinformatics toolset which can be used to compare two DNA sequences. Its major use is searching a database for sequences similar to a set of query sequences, so as to identify whether the queries are new. When comparing a single pair of sequences, BLAST is efficient. However, due to the growing size of the databases, executing BLAST jobs locally produces prohibitively poor performance. Thus, methods for parallelising BLAST are sought.

Traditional BLAST parallelisation approaches are all based on clusters. Clusters employ a number of computing nodes and high-bandwidth interlinks between nodes. Cluster-based BLAST exhibits higher performance; nevertheless, clusters suffer from limited resources and scalability problems. Clusters are expensive, prohibitively so when the growth of the sequence databases is taken into account. Increasing the number of nodes to keep pace with the growth of BLAST databases is costly and complicated. Hence a peer-to-peer-based BLAST service is required.

This thesis demonstrates our parallelisation of BLAST over peer-to-peer networks (termed ppBLAST), which utilises the free storage and computing resources in peer-to-peer networks to complete BLAST jobs in parallel. To achieve this goal, we build three layers in ppBLAST, each of which is responsible for particular functions. The bottom layer is a DHT infrastructure with support for range queries. It provides an efficient range-based lookup service and storage for BLAST tasks. The middle layer is the BitTorrent-based database distribution. The upper layer is the core of ppBLAST, which schedules and dispatches tasks to peers. For each layer, we conduct comprehensive research, and the achievements are presented in this thesis.

For the DHT layer, we design and implement our DAST-DHT. We analyse its balancing, the maximum number of children and the accuracy of the range query. We also compare DAST with other range query methodologies and show that if the number of children is set to more than two, DAST outperforms the others. For the BitTorrent-like database distribution layer, we investigate the relationship between the seeding strategies and the selfish leechers (freeriders and exploiters). We conclude that OSS works better than TSS in a normal situation
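
The unit-based parallelisation described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the k-mer overlap score is a toy stand-in for a real BLAST comparison, and worker threads on one machine stand in for peers. It is not the ppBLAST implementation and makes no performance claim; it only shows that independent units can be processed separately and their results merged.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_units(records, unit_size):
    """Split the input into fixed-size units that share no state."""
    return [records[i:i + unit_size] for i in range(0, len(records), unit_size)]


def shared_kmers(a, b, k=3):
    """Toy similarity score (not BLAST): distinct k-mers common to both sequences."""
    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(a) & kmers(b))


def process_unit(unit, query):
    """Score every (name, sequence) pair in one unit against the query."""
    return [(name, shared_kmers(seq, query)) for name, seq in unit]


def parallel_search(records, query, unit_size=2, workers=4):
    """Process each unit on its own worker, then merge and rank the hits."""
    units = split_into_units(records, unit_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves unit order; list() collects results before shutdown.
        per_unit = list(pool.map(lambda u: process_unit(u, query), units))
    hits = [hit for unit_hits in per_unit for hit in unit_hits]
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Because no unit depends on another, the merged result is the same as scoring the whole input sequentially; only the placement of the work changes.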

Topics: QA76
OAI identifier: oai:wrap.warwick.ac.uk:3640

