240 research outputs found

    When Two Choices Are not Enough: Balancing at Scale in Distributed Stream Processing

    Full text link
    Carefully balancing load in distributed stream processing systems has a fundamental impact on execution latency and throughput. Load balancing is challenging because real-world workloads are skewed: some tuples in the stream are associated to keys which are significantly more frequent than others. Skew is remarkably more problematic in large deployments: more workers implies fewer keys per worker, so it becomes harder to "average out" the cost of hot keys with cold keys. We propose a novel load balancing technique that uses a heaving hitter algorithm to efficiently identify the hottest keys in the stream. These hot keys are assigned to d≥2d \geq 2 choices to ensure a balanced load, where dd is tuned automatically to minimize the memory and computation cost of operator replication. The technique works online and does not require the use of routing tables. Our extensive evaluation shows that our technique can balance real-world workloads on large deployments, and improve throughput and latency by 150%\mathbf{150\%} and 60%\mathbf{60\%} respectively over the previous state-of-the-art when deployed on Apache Storm.Comment: 12 pages, 14 Figures, this paper is accepted and will be published at ICDE 201

    Content-based addressing in hierarchical distributed hash tables

    Get PDF
    Peer-to-peer networks have drawn their strength from their ability to operate functionally without the use of a central agent. In recent years the development of the structured peer-to-peer network has further increased the distributed nature of p2p systems. These networks take advantage of an underlying distributed data structure, a common one is the distributed hash table (DHT). These peers use this structure to act as equals in a network, sharing the same responsibilities of maintaining and contributing. But herein lays the problem, not all peers are equal in terms of resources and power. And with no central agent to monitor and balance load , the heterogeneous nature of peers can cause many distribution or bottleneck issues on the network and peer levels. This is due to the way in which addresses are allocated in these DHTs. Often this function is carried out by a consistent hashing function. These functions although powerful in their simplicity and effectiveness are the stem of a crucial flaw. This flaw causes the random nature in which addresses are assigned both when considering peer identification and allocating resource ownership. This work proposes a solution to mitigate the random nature of address assignment in DHTs, leveraging two methodologies called hierarchical DHTs and content based addressing. Combining these methods would enable peers to work in cooperative groups of like interested peers in order to dynamically share the load between group members. Group formation and utilization relies on the actual resources a peer willingly shares and is able to contribute rather than a function of the random hash employed by traditional DHT p2p structures

    Towards a Framework for DHT Distributed Computing

    Get PDF
    Distributed Hash Tables (DHTs) are protocols and frameworks used by peer-to-peer (P2P) systems. They are used as the organizational backbone for many P2P file-sharing systems due to their scalability, fault-tolerance, and load-balancing properties. These same properties are highly desirable in a distributed computing environment, especially one that wants to use heterogeneous components. We show that DHTs can be used not only as the framework to build a P2P file-sharing service, but as a P2P distributed computing platform. We propose creating a P2P distributed computing framework using distributed hash tables, based on our prototype system ChordReduce. This framework would make it simple and efficient for developers to create their own distributed computing applications. Unlike Hadoop and similar MapReduce frameworks, our framework can be used both in both the context of a datacenter or as part of a P2P computing platform. This opens up new possibilities for building platforms to distributed computing problems. One advantage our system will have is an autonomous load-balancing mechanism. Nodes will be able to independently acquire work from other nodes in the network, rather than sitting idle. More powerful nodes in the network will be able use the mechanism to acquire more work, exploiting the heterogeneity of the network. By utilizing the load-balancing algorithm, a datacenter could easily leverage additional P2P resources at runtime on an as needed basis. Our framework will allow MapReduce-like or distributed machine learning platforms to be easily deployed in a greater variety of contexts
    • …
    corecore