
    Geo-distributed Edge and Cloud Resource Management for Low-latency Stream Processing

    The proliferation of Internet-of-Things (IoT) devices is rapidly increasing the demand for efficient processing of low-latency stream data generated close to the edge of the network. Edge computing provides a layer of infrastructure to fill latency gaps between the IoT devices and the back-end cloud computing infrastructure. A large number of IoT applications require continuous processing of data streams in real time. Edge-computing-based stream processing techniques that carefully consider the heterogeneity of the computing and network resources available in the geo-distributed infrastructure provide significant benefits in optimizing the throughput and end-to-end latency of the data streams. Managing geo-distributed resources operated by individual service providers raises new challenges in terms of effective global resource sharing and achieving global efficiency in the resource allocation process. In this dissertation, we present a distributed stream processing framework that optimizes the performance of stream processing applications through careful allocation of the computing and network resources available at the edge of the network. The proposed approach differentiates itself from the state of the art through its careful consideration of data locality and resource constraints during physical plan generation and operator placement for the stream queries. Additionally, it considers the co-flow dependencies that exist between the data streams to optimize network resource allocation through an application-level rate control mechanism. The proposed framework incorporates resilience through a cost-aware partial active replication strategy that minimizes the recovery cost when applications incur failures. The framework employs a reinforcement learning-based online learning model to dynamically determine the level of parallelism and adapt to changing workload conditions. The second dimension of this dissertation proposes a novel model for allocating computing resources in edge and cloud computing environments. In edge computing environments, it allows service providers to establish resource sharing contracts with infrastructure providers a priori in a latency-aware manner. In geo-distributed cloud environments, it allows cloud service providers to establish resource sharing contracts with individual datacenters a priori for defined time intervals in a cost-aware manner. Based on these mechanisms, we develop a decentralized implementation of the contract-based resource allocation model for geo-distributed resources using Smart Contracts in Ethereum.
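
    A minimal sketch of the reinforcement-learning flavor of parallelism adaptation described above, using a simple epsilon-greedy bandit as a stand-in (the dissertation's actual online learning model is not specified here); the class name, reward definition, and candidate levels are all illustrative assumptions:

        import random

        # Hypothetical epsilon-greedy agent: pick a parallelism level, observe a
        # reward (e.g., negative end-to-end latency over the last window), repeat.
        class ParallelismAgent:
            def __init__(self, levels, epsilon=0.1):
                self.levels = levels                    # e.g., [1, 2, 4, 8]
                self.epsilon = epsilon                  # exploration probability
                self.totals = {l: 0.0 for l in levels}  # cumulative reward per level
                self.counts = {l: 0 for l in levels}    # pulls per level

            def choose(self):
                # Explore occasionally; otherwise exploit the best average so far
                # (untried levels score +inf, so each is tried at least once).
                if random.random() < self.epsilon:
                    return random.choice(self.levels)
                return max(self.levels,
                           key=lambda l: self.totals[l] / self.counts[l]
                           if self.counts[l] else float("inf"))

            def observe(self, level, reward):
                self.totals[level] += reward
                self.counts[level] += 1

        agent = ParallelismAgent([1, 2, 4, 8])
        level = agent.choose()
        # ... run the operator at `level` for one window, measure latency ...
        agent.observe(level, reward=-42.0)  # e.g., negative of measured latency in ms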

    Network flow optimization for distributed clouds

    Internet applications, which rely on large-scale networked environments such as data centers for their back-end support, are often geo-distributed and typically have stringent performance constraints. The interconnecting networks, within and across data centers, are critical in determining these applications' performance. Data centers can be viewed as composed of three layers: physical infrastructure consisting of servers, switches, and links; control platforms that manage the underlying resources; and applications that run on the infrastructure. This dissertation shows that network flow optimization can improve the performance of distributed applications in the cloud through high-throughput schemes spanning all three layers. At the physical infrastructure layer, we devise a framework for measuring and understanding the throughput of network topologies. We develop a heuristic for estimating the worst-case performance of any topology and propose a systematic methodology for comparing the performance of networks built with different equipment. At the control layer, we put forward a source-routed data center fabric that can achieve near-optimal throughput by leveraging a large number of available paths while using limited memory in switches. At the application layer, we show that current Application Network Interfaces (ANIs), abstractions that translate an application's performance goals into actionable network objectives, fail to capture the requirements of many emerging applications. We put forward a novel ANI that captures application intent more effectively and quantify the performance gains achievable with it. We also tackle resource optimization in the inter-data center context of cellular providers. In this emerging environment, large amounts of resources are geographically fragmented across thousands of micro data centers, each with a limited share of resources, necessitating cross-application optimization to satisfy diverse performance requirements and improve network and server utilization. Our solution, Patronus, employs hierarchical optimization to handle multiple performance requirements and temporally partitioned scheduling for scalability.
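
    As a toy illustration of why topology throughput is cut-limited: any set of links whose removal separates two hosts caps the bandwidth between them, so the smallest pairwise min-cut gives a crude upper bound on the most constrained pair's throughput. The sketch below is an assumption-laden stand-in for the dissertation's actual heuristic, using networkx and an invented three-node topology:

        import networkx as nx

        # Crude cut-based bound: the smallest s-t min-cut over all host pairs
        # upper-bounds the bandwidth available to the most constrained pair.
        def cut_throughput_bound(G, hosts):
            bound = float("inf")
            for i, s in enumerate(hosts):
                for t in hosts[i + 1:]:
                    cut_value, _ = nx.minimum_cut(G, s, t, capacity="capacity")
                    bound = min(bound, cut_value)
            return bound

        # Toy line topology a <-> b <-> c with 10 units per direction.
        G = nx.DiGraph()
        for u, v in [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]:
            G.add_edge(u, v, capacity=10.0)
        print(cut_throughput_bound(G, ["a", "c"]))  # 10.0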

    On the Importance of Infrastructure-Awareness in Large-Scale Distributed Storage Systems

    Big data applications put significant latency and throughput demands on distributed storage systems. Meeting these demands requires storage systems to use a significant amount of infrastructure resources, such as network capacity and storage devices. Resource demands largely depend on the workloads and can vary significantly over time. Moreover, demand hotspots can move rapidly between different infrastructure locations. Existing storage systems are largely infrastructure-oblivious, as they are designed to support a broad range of hardware and deployment scenarios. Most use only basic configuration information about the infrastructure to make important placement and routing decisions. In the case of cloud-based storage systems, cloud services have their own infrastructure-specific limitations, such as minimum request sizes and maximum numbers of concurrent requests. By ignoring infrastructure-specific details, these storage systems are unable to react to resource demand changes and may incur additional inefficiencies from performing redundant network operations. As a result, provisioning enough resources for these systems to address all possible workloads and scenarios would be cost-prohibitive. This thesis studies the performance problems of commonly used distributed storage systems and introduces novel infrastructure-aware design methods to improve their performance. First, it addresses the problem of slow reads due to network congestion induced by disjoint replica and path selection. Selecting a read replica separately from the network path can perform poorly if all paths to the pre-selected endpoints are congested. Second, this thesis looks at the scalability limitations of consensus protocols that are commonly used in geo-distributed key-value stores and distributed ledgers. Due to their network-oblivious designs, existing protocols communicate redundantly over highly oversubscribed WAN links, which poorly utilizes network resources and limits consistent replication at large scale. Finally, this thesis addresses the need for a cloud-specific real-time storage system for capital-market use cases. Public cloud infrastructures provide feature-rich and cost-effective storage services. However, existing real-time time-series databases are not built to take advantage of cloud storage services, so they do not effectively utilize them to provide high performance while minimizing deployment cost. This thesis presents three systems that address these problems using infrastructure-aware design methods. Our performance evaluation shows that infrastructure-aware design is highly effective in improving the performance of large-scale distributed storage systems.
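
    The first problem, disjoint replica and path selection, is easy to picture with a joint selector that scores every (replica, path) pair by its bottleneck headroom instead of picking the replica first and a path second. The data structures below are invented for illustration and are not the thesis's actual interfaces:

        # Joint selection: choose the (replica, path) pair whose most loaded
        # link still has the largest remaining capacity (bottleneck headroom).
        def select_replica_and_path(replicas, paths, link_load, link_capacity):
            best, best_headroom = None, -1.0
            for replica in replicas:
                for path in paths[replica]:  # a path is a list of link ids
                    headroom = min(link_capacity[l] - link_load[l] for l in path)
                    if headroom > best_headroom:
                        best_headroom, best = headroom, (replica, path)
            return best

        replicas = ["r1", "r2"]
        paths = {"r1": [["l1", "l2"]], "r2": [["l3"], ["l4", "l5"]]}
        load = {"l1": 8, "l2": 1, "l3": 9, "l4": 2, "l5": 3}
        cap = {l: 10 for l in load}
        # r1's only path is bottlenecked at l1; r2 via l4,l5 has the most headroom.
        print(select_replica_and_path(replicas, paths, load, cap))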

    BDS+: An Inter-Datacenter Data Replication System With Dynamic Bandwidth Separation

    Many important cloud services require replicating massive data from one datacenter (DC) to multiple DCs. While the performance of pair-wise inter-DC data transfers has been much improved, prior solutions are insufficient for optimizing bulk-data multicast, as they fail to exploit the rich inter-DC overlay paths that exist in geo-distributed DCs, as well as the bandwidth reserved for online traffic but left unused under a fixed bandwidth separation scheme. To take advantage of these opportunities, we present BDS+, a near-optimal network system for large-scale inter-DC data replication. BDS+ is an application-level multicast overlay network with a fully centralized architecture, allowing a central controller to maintain an up-to-date global view of the data delivery status of intermediate servers in order to fully utilize the available overlay paths. Furthermore, on each overlay path, it leverages dynamic bandwidth separation to make use of the remaining bandwidth reserved for online traffic. By constantly estimating online traffic demand and rescheduling bulk-data transfers accordingly, BDS+ can further speed up massive data multicast. Through a pilot deployment in one of the largest online service providers and large-scale real-trace simulations, we show that BDS+ achieves a 3-5× speedup over the provider's existing system and several well-known overlay routing baselines that use static bandwidth separation. Moreover, dynamic bandwidth separation reduces the completion time of bulk-data transfers by a further 1.2-1.3×.
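
    The dynamic bandwidth separation idea reduces to a small control loop: keep an estimate of online-traffic demand on a link, reserve that plus a safety margin, and give bulk replication whatever is left. The sketch below is a hypothetical rendering with an assumed exponentially weighted estimator; the smoothing constant, margin, and function names are not from BDS+:

        # Estimate online demand from recent samples, then hand the leftover
        # link capacity to bulk-data transfers instead of using a static split.
        def bulk_rate(link_capacity, online_samples, prev_estimate=0.0,
                      alpha=0.5, margin=0.1):
            estimate = prev_estimate
            for s in online_samples:            # EWMA over the window
                estimate = alpha * s + (1 - alpha) * estimate
            reserved = estimate * (1 + margin)  # safety margin for bursts
            return max(0.0, link_capacity - reserved), estimate

        rate, est = bulk_rate(link_capacity=10.0, online_samples=[2.0, 3.0, 2.5])
        print(rate, est)  # bulk transfers get what the reservation leaves over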

    Efficient data reconfiguration for today's cloud systems

    The performance of big data systems largely relies on efficient data reconfiguration techniques. Data reconfiguration operations change configuration parameters that affect data layout in a system. They can be user-initiated, such as changing the shard key or block size in NoSQL databases, or system-initiated, such as changing replication in a distributed interactive analytics engine. Current data reconfiguration schemes are heuristics at best and often do not scale well as data volume grows; as a result, system performance suffers. In this thesis, we show that data reconfiguration can be performed in the background by coupling new optimal or near-optimal algorithms with performant system designs. We explore four data reconfiguration operations affecting three popular types of systems: storage, real-time analytics, and batch analytics. In NoSQL databases (storage), we explore new strategies for changing table-level configuration and for compaction, as these improve read/write latencies. In distributed interactive analytics engines, a good replication algorithm can save costs by using just enough memory to provide the highest throughput and low latency for queries. Finally, in batch processing systems, we explore prefetching and caching strategies that increase the number of production jobs meeting their SLOs. All these operations happen in the background without affecting the fast path. Our contributions to each problem are two-fold: 1) we model the problem and design algorithms inspired by well-known theoretical abstractions, and 2) we design and build a system on top of popular open-source systems used in companies today. Finally, using real-life workloads, we evaluate the efficacy of our solutions. Morphus and Parqua provide several nines of availability while changing table-level configuration parameters in databases. By halving memory usage in a distributed interactive analytics engine, Getafix reduces the cost of deploying the system by 10 million dollars annually and improves query throughput. We are the first to model the compaction problem and provide formal bounds on its runtime. Finally, NetCachier helps 30% more production jobs meet their SLOs compared to the existing state of the art.
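
    One way to see why reconfiguration invites optimal algorithms: a shard-key change can be cast as an assignment problem in which placing new shard j on server i costs the bytes that must move, and total movement is minimized. The brute-force matcher below is purely illustrative (a real system would use the Hungarian algorithm, e.g. scipy.optimize.linear_sum_assignment, instead of enumerating permutations); the cost matrix is made up:

        from itertools import permutations

        # move_cost[i][j]: data (GB) moved if server i hosts new shard j.
        def min_movement_assignment(move_cost):
            n = len(move_cost)
            best_perm, best_cost = None, float("inf")
            for perm in permutations(range(n)):  # server i gets shard perm[i]
                cost = sum(move_cost[i][perm[i]] for i in range(n))
                if cost < best_cost:
                    best_cost, best_perm = cost, perm
            return best_perm, best_cost

        cost = [[0, 5, 9],
                [4, 1, 7],
                [8, 6, 2]]
        print(min_movement_assignment(cost))  # identity assignment moves only 3 GB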