144 research outputs found

    Mitigating Load Imbalance in Distributed Data Serving with Rack-Scale Memory Pooling

    Get PDF
    To provide low-latency and high-throughput guarantees, most large key-value stores keep the data in the memory of many servers. Despite the natural parallelism across lookups, the load imbalance, introduced by heavy skew in the popularity distribution of keys, limits performance. To avoid violating tail latency service-level objectives, systems tend to keep server utilization low and organize the data in micro-shards, which provides units of migration and replication for the purpose of load balancing. These techniques reduce the skew but incur additional monitoring, data replication, and consistency maintenance overheads. In this work, we introduce RackOut, a memory pooling technique that leverages the one-sided remote read primitive of emerging rack-scale systems to mitigate load imbalance while respecting service-level objectives. In RackOut, the data are aggregated at rack-scale granularity, with all of the participating servers in the rack jointly servicing all of the rack’s micro-shards. We develop a queuing model to evaluate the impact of RackOut at the datacenter scale. In addition, we implement a RackOut proof-of-concept key value store, evaluate it on two experimental platforms based on RDMA and Scale-Out NUMA, and use these results to validate the model. We devise two distinct approaches to load balancing within a RackOut unit, one based on random selection of nodes—RackOut_static—and another one based on an adaptive load balancing mechanism— RackOut_adaptive. Our results show that RackOut_static increases throughput by up to 6× for RDMA and 8.6× for Scale-Out NUMA compared to a scale-out deployment, while respecting tight tail latency service-level objectives. RackOut_adaptive improves the throughput by 30% for workloads with 20% of writes over RackOut_static

    Rack-Scale Memory Pooling for Datacenters

    Get PDF
    The rise of web-scale services has led to a staggering growth in user data on the Internet. To transform such a vast raw data into valuable information for the user and provide quality assurances, it is important to minimize access latency and enable in-memory processing. For more than a decade, the only practical way to accommodate for ever-growing data in memory has been to scale out server resources, which has led to the emergence of large-scale datacenters and distributed non-relational databases (NoSQL). Such horizontal scaling of resources translates to an increasing number of servers that participate in processing individual user requests. Typically, each user request results in hundreds of independent queries targeting different NoSQL nodes - servers, and the larger the number of servers involved, the higher the fan-out. To complete a single user request, all of the queries associated with that request have to complete first, and thus, the slowest query determines the completion time. Because of skewed popularity distributions and resource contention, the more servers we have, the harder it is to achieve high throughput and facilitate server utilization, without violating service level objectives. This thesis proposes rack-scale memory pooling (RSMP), a new scaling technique for future datacenters that reduces networking overheads and improves the performance of core datacenter software. RSMP is an approach to building larger, rack-scale capacity units for datacenters through specialized fabric interconnects with support for one-sided operations, and using them, in lieu of conventional servers (e.g. 1U), to scale out. We define an RSMP unit to be a server rack connecting 10s to 100s of servers to a secondary network enabling direct, low-latency access to the global memory of the rack. We, then, propose a new RSMP design - Scale-Out NUMA that leverages integration and a NUMA fabric to bridge the gap between local and remote memory to only 5× difference in access latency. Finally, we show how RSMP impacts NoSQL data serving, a key datacenter service used by most web-scale applications today. We show that using fewer larger data shards leads to less load imbalance and higher effective throughput, without violating applications¿ service level objectives. For example, by using Scale-Out NUMA, RSMP improves the throughput of a key-value store up to 8.2× over a traditional scale-out deployment

    The Case for RackOut: Scalable Data Serving Using Rack-Scale Systems

    Get PDF
    To provide low latency and high throughput guarantees, most large key-value stores keep the data in the memory of many servers. Despite the natural parallelism across lookups, the load imbalance, introduced by heavy skew in the popularity distribution of keys, limits performance. To avoid violating tail latency service-level objectives, systems tend to keep server utilization low and organize the data in micro-shards, which provides units of migration and replication for the purpose of load balancing. These techniques reduce the skew, but incur additional monitoring, data replication and consistency maintenance overheads. In this work, we introduce RackOut, a memory pooling technique that leverages the one-sided remote read primitive of emerging rack-scale systems to mitigate load imbalance while respecting service-level objectives. In RackOut, the data is aggregated at rack-scale granularity, with all of the participating servers in the rack jointly servicing all of the rack’s micro-shards. We develop a queuing model to evaluate the impact of RackOut at the datacenter scale. In addition, we implement a RackOut proof-of-concept key-value store, evaluate it on two experimental platforms based on RDMA and Scale-Out NUMA, and use these results to validate the model. Our results show that RackOut can increase throughput up to 6× for RDMA and 8.6× for Scale-Out NUMA compared to a scale-out deployment, while respecting tight tail latency service-level objectives

    Flashpoint: A Low-latency Serverless Platform for Deep Learning Inference Serving

    Get PDF
    Recent breakthroughs in Deep Learning (DL) have led to high demand for executing inferences in interactive services such as ChatGPT and GitHub Copilot. However, these interactive services require low-latency inferences, which can only be met with GPUs and result in exorbitant operating costs. For instance, ChatGPT reportedly requires millions of U.S. dollars in cloud GPUs to serve its 1+ million users. A potential solution to meet low-latency requirements with acceptable costs is to use serverless platforms. These platforms automatically scale resources to meet user demands. However, current serverless systems have long cold starts which worsen with larger DL models and lead to poor performance during bursts of requests. Meanwhile, the demand for larger and larger DL models make it more challenging to deliver an acceptable user experience cost-effectively. While current systems over-provision GPUs to address this issue, they incur high costs in idle resources which greatly reduces the benefit of using a serverless platform. In this thesis, we introduce Flashpoint, a GPU-based serverless platform that serves DL inferences with low latencies. Flashpoint achieves this by reducing cold start durations, especially for large DL models, making serverless computing feasible for latency-sensitive DL workloads. To reduce cold start durations, Flashpoint reduces download times by sourcing the DL model data from within the compute cluster rather than slow cloud storage. Additionally, Flashpoint minimizes in-cluster network congestion from redundant packet transfers of the same DL model to multiple machines with multicasting. Finally, Flashpoint also reduces cold start durations by automatically partitioning models and deploying them in parallel on multiple machines. The reduced cold start durations achieved by Flashpoint enable the platform to scale resource allocations elastically and complete requests with low latencies without over-provisioning expensive GPU resources. We perform large-scale data center simulations that were parameterized with measurements our prototype implementations. We evaluate the system using six state-of-the-art DL models ranging from 499 MB to 11 GB in size. We also measure the performance of the system in representative real-world traces from Twitter and Microsoft Azure. Our results in the full-scale simulations show that Flashpoint achieves an arithmetic mean of 93.51% shorter average cold start durations, leading to 75.42% and 66.90% respective reductions in average and 99th percentile end-to-end request latencies across the DL models with the same amount of resources. These results show that Flashpoint boosts the performance of serving DL inferences on a serverless platform without increasing costs

    MACHS: Mitigating the Achilles Heel of the Cloud through High Availability and Performance-aware Solutions

    Get PDF
    Cloud computing is continuously growing as a business model for hosting information and communication technology applications. However, many concerns arise regarding the quality of service (QoS) offered by the cloud. One major challenge is the high availability (HA) of cloud-based applications. The key to achieving availability requirements is to develop an approach that is immune to cloud failures while minimizing the service level agreement (SLA) violations. To this end, this thesis addresses the HA of cloud-based applications from different perspectives. First, the thesis proposes a component’s HA-ware scheduler (CHASE) to manage the deployments of carrier-grade cloud applications while maximizing their HA and satisfying the QoS requirements. Second, a Stochastic Petri Net (SPN) model is proposed to capture the stochastic characteristics of cloud services and quantify the expected availability offered by an application deployment. The SPN model is then associated with an extensible policy-driven cloud scoring system that integrates other cloud challenges (i.e. green and cost concerns) with HA objectives. The proposed HA-aware solutions are extended to include a live virtual machine migration model that provides a trade-off between the migration time and the downtime while maintaining HA objective. Furthermore, the thesis proposes a generic input template for cloud simulators, GITS, to facilitate the creation of cloud scenarios while ensuring reusability, simplicity, and portability. Finally, an availability-aware CloudSim extension, ACE, is proposed. ACE extends CloudSim simulator with failure injection, computational paths, repair, failover, load balancing, and other availability-based modules

    Datacenter Architectures for the Microservices Era

    Full text link
    Modern internet services are shifting away from single-binary, monolithic services into numerous loosely-coupled microservices that interact via Remote Procedure Calls (RPCs), to improve programmability, reliability, manageability, and scalability of cloud services. Computer system designers are faced with many new challenges with microservice-based architectures, as individual RPCs/tasks are only a few microseconds in most microservices. In this dissertation, I seek to address the most notable challenges that arise due to the dissimilarities of the modern microservice based and classic monolithic cloud services, and design novel server architectures and runtime systems that enable efficient execution of µs-scale microservices on modern hardware. In the first part of my dissertation, I seek to address the problem of Killer Microseconds, which refers to µs-scale “holes” in CPU schedules caused by stalls to access fast I/O devices or brief idle times between requests in high throughput µs-scale microservices. Whereas modern computing platforms can efficiently hide ns-scale and ms-scale stalls through micro-architectural techniques and OS context switching, they lack efficient support to hide the latency of µs-scale stalls. In chapter II, I propose Duplexity, a heterogeneous server architecture that employs aggressive multithreading to hide the latency of killer microseconds, without sacrificing the Quality-of-Service (QoS) of latency-sensitive microservices. Duplexity is able to achieve 1.9× higher core utilization and 2.7× lower iso-throughput 99th-percentile tail latency over an SMT-based server design, on average. In chapters III-IV, I comprehensively investigate the problem of tail latency in the context of microservices and address multiple aspects of it. First, in chapter III, I characterize the tail latency behavior of microservices and provide general guidelines for optimizing computer systems from a queuing perspective to minimize tail latency. Queuing is a major contributor to end-to-end tail latency, wherein nominal tasks are enqueued behind rare, long ones, due to Head-of-Line (HoL) blocking. Next, in chapter IV, I introduce Q-Zilla, a scheduling framework to tackle tail latency from a queuing perspective, and CoreZilla, a microarchitectural instantiation of the framework. Q-Zilla is composed of the ServerQueue Decoupled Size-Interval Task Assignment (SQD-SITA) scheduling algorithm and the Express-lane Simultaneous Multithreading (ESMT) microarchitecture, which together seek to address HoL blocking by providing an “express-lane” for short tasks, protecting them from queuing behind rare, long ones. By combining the ESMT microarchitecture and the SQD-SITA scheduling algorithm, CoreZilla is able to improves tail latency over a conventional SMT core with 2, 4, and 8 contexts by 2.25×, 3.23×, and 4.38×, on average, respectively, and outperform a theoretical 32-core scale-up organization by 12%, on average, with 8 contexts. Finally, in chapters V-VI, I investigate the tail latency problem of microservices from a cluster, rather than server-level, perspective. Whereas Service Level Objectives (SLOs) define end-to-end latency targets for the entire service to ensure user satisfaction, with microservice-based applications, it is unclear how to scale individual microservices when end-to-end SLOs are violated or underutilized. I introduce Parslo as an analytical framework for partial SLO allocation in virtualized cloud microservices. Parslo takes a microservice graph as an input and employs a Gradient Descent-based approach to allocate “partial SLOs” to different microservice nodes, enabling independent auto-scaling of individual microservices. Parslo achieves the optimal solution, minimizing the total cost for the entire service deployment, and is applicable to general microservice graphs.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/167978/1/miramir_1.pd

    Software-defined middlebox networking

    Get PDF
    [no abstract

    Data-centric serverless cloud architecture

    Get PDF
    Serverless has become a new dominant cloud architecture thanks to its high scalability and flexible, pay-as-you-go billing model. In serverless, developers compose their cloud services as a set of functions while providers take responsibility for scaling each function’s resources according to traffic changes. Hence, the provider needs to timely spawn, or tear down, function instances (i.e., HTTP servers with user-provider handles), which cannot hold state across function invocations. Performance of a modern serverless cloud is bound by data movement. Serverless architecture separates compute resources and data management to allow function instances to run on any node in a cloud datacenter. This flexibility comes at the cost of the necessity to move function initialization state across the entire datacenter when spawning new instances on demand. Furthermore, to facilitate scaling, cloud providers restrict the serverless programming model to stateless functions (which cannot hold or share state across different functions), which lack efficient support for cross-function communication. This thesis consists of four following research contributions that pave the way for a data-centric serverless cloud architecture. First, we introduce STeLLAR, an opensource serverless benchmarking framework, which enables an accurate performance characterization of serverless deployments. Using STeLLAR, we study three leading serverless clouds and identify that all of them follow the same conceptual architecture that comprises three essential subsystems, namely the worker fleet, the scheduler, and the storage. Our analysis quantifies the aspect of the data movement problem that is related to moving state from the storage to workers when spawning function instances (“cold-start” delays). Also, we study two state-of-the-art production methods of crossfunction communication that involve either the storage or the scheduler subsystems, if the data is transmitted as part of invocation HTTP requests (i.e., inline). Second, we introduce vHive, an open-source ecosystem for serverless benchmarking and experimentation, with the goal of enabling researchers to study and innovate across the entire serverless stack. In contrast to the incomplete academic prototypes and proprietary infrastructure of the leading commercial clouds, vHive is representative of the leading clouds and comprises only fully open-source production-grade components, such as Kubernetes orchestrator and AWS Firecracker hypervisor technologies. To demonstrate vHive’s utility, we analyze the cold-start delays, revealing that the high cold-start latency of function instances is attributable to frequent page faults as the function’s state is brought from disk into guest memory one page at a time. Our analysis further reveals that serverless functions operate over stable working sets - even across function invocations. Third, to reduce the cold-start delays of serverless functions, we introduce a novel snapshotting mechanism that records and prefetches their memory working sets. This mechanism, called REAP, is implemented in userspace and consists of two phases. During the first invocation of a function, all accessed memory pages are recorded and their contents are stored compactly as a part of the function snapshot. Starting from the second cold invocation, the contents of the recorded pages are retrieved from storage and installed in the guest memory before the new function instance starts to process the invocation, allowing to avoid the majority of page faults, hence significantly accelerating the function’s cold starts. Finally, to accelerate the cross-function data communication, we propose Expedited Data Transfers (XDT), an API-preserving high-performance data communication method for serverless. In production clouds, function transmit intermediate data to other functions either inline or through a third-party storage service. The former approach is restricted to small transfer sizes, the latter supports arbitrary transfers but suffers from performance and cost overheads. XDT enables direct function-to-function transfers in a way that is fully compatible with the existing autoscaling infrastructure. With XDT, a trusted component of the sender function buffers the payload in its memory and sends a secure reference to the receiver, which is picked by the load balancer and autoscaler based on the current load. Using the reference, the receiver instance pulls the transmitted data directly from sender’s memory, obviating the need for intermediary storage

    Co-designing reliability and performance for datacenter memory

    Get PDF
    Memory is one of the key components that affects reliability and performance of datacenter servers. Memory in today’s servers is organized and shared in several ways to provide the most performant and efficient access to data. For example, cache hierarchy in multi-core chips to reduce access latency, non-uniform memory access (NUMA) in multi-socket servers to improve scalability, disaggregation to increase memory capacity. In all these organizations, hardware coherence protocols are used to maintain memory consistency of this shared memory and implicitly move data to the requesting cores. This thesis aims to provide fault-tolerance against newer models of failure in the organization of memory in datacenter servers. While designing for improved reliability, this thesis explores solutions that can also enhance performance of applications. The solutions build over modern coherence protocols to achieve these properties. First, we observe that DRAM memory system failure rates have increased, demanding stronger forms of memory reliability. To combat this, the thesis proposes Dvé, a hardware driven replication mechanism where data blocks are replicated across two different memory controllers in a cache-coherent NUMA system. Data blocks are accompanied by a code with strong error detection capabilities so that when an error is detected, correction is performed using the replica. Dvé’s organization offers two independent points of access to data which enables: (a) strong error correction that can recover from a range of faults affecting any of the components in the memory and (b) higher performance by providing another nearer point of memory access. Dvé’s coherent replication keeps the replicas in sync for reliability and also provides coherent access to read replicas during fault-free operation for improved performance. Dvé can flexibly provide these benefits on-demand at runtime. Next, we observe that the coherence protocol itself requires to be hardened against failures. Memory in datacenter servers is being disaggregated from the compute servers into dedicated memory servers, driven by standards like CXL. CXL specifies the coherence protocol semantics for compute servers to access and cache data from a shared region in the disaggregated memory. However, the CXL specification lacks the requisite level of fault-tolerance necessary to operate at an inter-server scale within the datacenter. Compute servers can fail or be unresponsive in the datacenter and therefore, it is important that the coherence protocol remain available in the presence of such failures. The thesis proposes Āpta, a CXL-based, shared disaggregated memory system for keeping the cached data consistent without compromising availability in the face of compute server failures. Āpta architects a high-performance fault-tolerant object-granular memory server that significantly improves performance for stateless function-as-a-service (FaaS) datacenter applications

    Development and management of collective network and cloud computing infrastructures

    Get PDF
    In the search and development of more participatory models for infrastructure development and management, in this dissertation, we investigate models for the financing, deployment, and operation of network and cloud computing infrastructures. Our main concern is to overcome the inherent exclusion in participation in the processes of development and management and in the right of use in the current dominant models. Our work starts by studying in detail the model of Guifi.net, a successful bottom-up initiative for building network infrastructure, generally referred to as a community networks. We pay special attention to its governance system and economic organisation because we argue that these are the key components of the success of this initiative. Then, we generalise our findings for any community network, aiming at becoming sustainable and scalable, and we explore the suitability of the Guifi.net model to the cloud computing infrastructure. As a result of both, we coin the attribute extensible to refer to infrastructure that is relatively easy to expand and maintain in contrast to those naturally limited or hard to expand, such as natural resources or highly complex or advanced artificial systems. We conclude proposing a generic model which, in our opinion, is suitable, at least, for managing extensible infrastructure. The Guifi.net model is deeply rooted in the commons; thus, the research in this field, in general, and Elinor Ostrom’s work, in particular, have left a profound imprint in our work. Our results show that the \guifinet model meets almost entirely the principles of long-enduring commons identified by E. Ostrom. This work has been developed as an industrial doctorate. As such, it combines academic research with elements of practice and pursues an effective knowledge transfer between academia and the private sector. Given that the private sector’s partner is a not-for-profit organisation, the effort to create social value has prevailed over the ambition to advance the development of a specific industrial product or particular technology.En la recerca i desenvolupament de models més participatius per al desenvolupament i gestió d'infraestructura, en aquesta tesi investiguem sobre models per al finançament, desplegament i operació d'infraestructures de xarxa i de computació al núvol. La nostra preocupació principal és fer front a l’exclusió inherent dels models dominants actualment pel que fa a la participació en els processos de desenvolupament i gestió i, també, als drets d’us. El nostre treball comença amb un estudi detallat del model de Guifi.net, un cas d'èxit d'iniciativa ciutadana en la construcció d'infraestructura de xarxa, iniciatives que es coneixen com a xarxes comunitàries. En fer-ho, parem una atenció especial al sistema de governança i a l’organització econòmica perquè pensem que són els dos elements claus de l'èxit d'aquesta iniciativa. Tot seguit passem a analitzar d'altres xarxes comunitàries per abundar en la comprensió dels factors determinants per a la seva sostenibilitat i escalabilitat. Després ampliem el nostre estudi analitzant la capacitat i el comportament del model de Guifi.net en el camp de les infraestructures de computació al núvol. A resultes d'aquests estudis, proposem l'atribut extensible per a descriure aquelles infraestructures que són relativament fàcil d'ampliar i gestionar, en contraposició a les que o bé estan limitades de forma natural o be són difícils d'ampliar, com ara els recursos naturals o els sistemes artificials avançats o complexos. Finalitzem aquest treball fent una proposta de model genèric que pensem que és d'aplicabilitat, com a mínim, a tot tipus d'infraestructura extensible. El model de Guifi.net està fortament vinculat als bens comuns. És per això que la recerca en aquest àmbit, en general, i els treballs de Elinor Ostrom en particular, han deixat una forta empremta en el nostre treball. Els resultats que hem obtingut mostren que el model Guifi.net s'ajusta molt bé als principis que segons Ostrom han de complir els béns comuns per ser sostenibles. Aquest treball s'ha desenvolupat com a doctorat industrial. Com a tal, combina la investigació acadèmica amb elements de practica i persegueix una transferència efectiva de coneixement entre l'àmbit acadèmic i el sector privat. Ates que el soci del sector privat és una organització sense ànim de lucre, l’esforç per crear valor social ha prevalgut en l’ambició d’avançar en el desenvolupament d'un producte industrial específic o d'una tecnologia particula
    corecore