1,043 research outputs found
Composable architecture for rack scale big data computing
The rapid growth of cloud computing, both in terms of the spectrum and volume of cloud workloads, necessitate re-visiting the traditional rack-mountable servers based datacenter design. Next generation datacenters need to offer enhanced support for: (i) fast changing system configuration requirements due to workload constraints, (ii) timely adoption of emerging hardware technologies, and (iii) maximal sharing of systems and subsystems in order to lower costs. Disaggregated datacenters, constructed as a collection of individual resources such as CPU, memory, disks etc., and composed into workload execution units on demand, are an interesting new trend that can address the above challenges. In this paper, we demonstrated the feasibility of composable systems through building a rack scale composable system prototype using PCIe switch. Through empirical approaches, we develop assessment of the opportunities and challenges for leveraging the composable architecture for rack scale cloud datacenters with a focus on big data and NoSQL workloads. In particular, we compare and contrast the programming models that can be used to access the composable resources, and developed the implications for the network and resource provisioning and management for rack scale architecture
dReDBox: Materializing a full-stack rack-scale system prototype of a next-generation disaggregated datacenter
Current datacenters are based on server machines, whose mainboard and hardware components form the baseline, monolithic building block that the rest of the system software, middleware and application stack are built upon. This leads to the following limitations: (a) resource proportionality of a multi-tray system is bounded by the basic building block (mainboard), (b) resource allocation to processes or virtual machines (VMs) is bounded by the available resources within the boundary of the mainboard, leading to spare resource fragmentation and inefficiencies, and (c) upgrades must be applied to each and every server even when only a specific component needs to be upgraded. The dRedBox project (Disaggregated Recursive Datacentre-in-a-Box) addresses the above limitations, and proposes the next generation, low-power, across form-factor datacenters, departing from the paradigm of the mainboard-as-a-unit and enabling the creation of function-block-as-a-unit. Hardware-level disaggregation and software-defined wiring of resources is supported by a full-fledged Type-1 hypervisor that can execute commodity virtual machines, which communicate over a low-latency and high-throughput software-defined optical network. To evaluate its novel approach, dRedBox will demonstrate application execution in the domains of network functions virtualization, infrastructure analytics, and real-time video surveillance.This work has been supported in part by EU H2020 ICTproject dRedBox, contract #687632.Peer ReviewedPostprint (author's final draft
Quantifying the latency benefits of near-edge and in-network FPGA acceleration
Transmitting data to cloud datacenters in distributed IoT applications introduces significant communication latency, but is often the only feasible solution when source nodes are computationally limited. To address latency concerns, cloudlets, in-network computing, and more capable edge nodes are all being explored as a way of moving processing capability towards the edge of the network. Hardware acceleration using Field Programmable Gate Arrays (FPGAs) is also seeing increased interest due to reduced computation latency and improved efficiency. This paper evaluates the the implications of these offloading approaches using a case study neural network based image classification application, quantifying both the computation and communication latency resulting from different platform choices. We consider communication latency including the ingestion of packets for processing on the target platform, showing that this varies significantly with the choice of platform. We demonstrate that emerging in-network accelerator approaches offer much improved and predictable performance as well as better scaling to support multiple data sources
OS Scheduling Algorithms for Memory Intensive Workloads in Multi-socket Multi-core servers
Major chip manufacturers have all introduced multicore microprocessors.
Multi-socket systems built from these processors are routinely used for running
various server applications. Depending on the application that is run on the
system, remote memory accesses can impact overall performance. This paper
presents a new operating system (OS) scheduling optimization to reduce the
impact of such remote memory accesses. By observing the pattern of local and
remote DRAM accesses for every thread in each scheduling quantum and applying
different algorithms, we come up with a new schedule of threads for the next
quantum. This new schedule potentially cuts down remote DRAM accesses for the
next scheduling quantum and improves overall performance. We present three such
new algorithms of varying complexity followed by an algorithm which is an
adaptation of Hungarian algorithm. We used three different synthetic workloads
to evaluate the algorithm. We also performed sensitivity analysis with respect
to varying DRAM latency. We show that these algorithms can cut down DRAM access
latency by up to 55% depending on the algorithm used. The benefit gained from
the algorithms is dependent upon their complexity. In general higher the
complexity higher is the benefit. Hungarian algorithm results in an optimal
solution. We find that two out of four algorithms provide a good trade-off
between performance and complexity for the workloads we studied
Recommended from our members
Design Space Exploration of Accelerators for Warehouse Scale Computing
With Moore’s law grinding to a halt, accelerators are one of the ways that new silicon can improve performance, and they are already a key component in modern datacenters. Accelerators are integrated circuits that implement parts of an application with the objective of higher energy efficiency compared to execution on a standard general purpose CPU. Many accelerators can target any particular workload, generally with a wide range of performance, and costs such as area or power. Exploring these design choices, called Design Space Exploration (DSE), is a crucial step in trying to find the most efficient accelerator design, the one that produces the largest reduction of the total cost of ownership.
This work aims to improve this design space exploration phase for accelerators and to avoid pitfalls in the process. This dissertation supports the thesis that early design choices – including the level of specialization – are critical for accelerator development and therefore require benchmarks reflective of production workloads. We present three studies that support this thesis. First, we show how to benchmark datacenter applications by creating a benchmark for large video sharing infrastructures. Then, we present two studies focused on accelerators for analytical query processing. The first is an analysis on the impact of Network on Chip specialization while the second analyses the impact of the level of specialization.
The first part of this dissertation introduces vbench: a video transcoding benchmark tailored to the growing video-as-a-service market. Video transcoding is not accurately represented in current computer architecture benchmarks such as SPEC or PARSEC. Despite posing a big computational burden for cloud video providers, such as YouTube and Facebook, it is not included in cloud benchmarks such as CloudSuite. Using vbench, we found that the microarchitectural profile of video transcoding is highly dependent on the input video, that SIMD extensions provide limited benefits, and that commercial hardware transcoders impose tradeoffs that are not ideal for cloud video providers. Our benchmark should spur architectural innovations for this critical workload. This work shows how to benchmark a real world warehouse scale application and the possible pitfalls in case of a mischaracterization.
When considering accelerators for the different, but no less important, application of analytical query processing, design space exploration plays a critical role. We analyzed the Q100, a class of accelerators for this application domain, using TPC-H as the reference benchmark. We found that the hardware computational blocks have to be tailored to the requirements of the application, but also the Network on Chip (NoC) can be specialized. We developed an algorithm capable of producing more effective Q100 designs by tailoring the NoC to the communication requirements of the system. Our algorithm is capable of producing designs that are Pareto optimal compared to standard NoC topologies. This shows how NoC specialization is highly effective for accelerators and it should be an integral part of design space exploration for large accelerators’ designs.
The third part of this dissertation analyzes the impact of the level of specialization, e.g. using an ASIC or Coarse Grain Reconfigurable Architecture (CGRA) implementation, on an accelerator performance. We developed a CGRA architecture capable of executing SQL query plans. We compare this architecture against Q100, an ASIC that targets the same class of workloads. Despite being less specialized, this programmable architecture shows comparable performance to the Q100 given an area and power budget. Resource usage explains this counterintuitive result, since a well programmed, homogeneous array of resources is able to more effectively harness silicon for the workload at hand. This suggests that a balanced accelerator research portfolio must include alternative programmable architectures – and their software stacks
Software Defined Application Delivery Networking
In this thesis we present the architecture, design, and prototype implementation details of AppFabric. AppFabric is a next generation application delivery platform for easily creating, managing and controlling massively distributed and very dynamic application deployments that may span multiple datacenters.
Over the last few years, the need for more flexibility, finer control, and automatic management of large (and messy) datacenters has stimulated technologies for virtualizing the infrastructure components and placing them under software-based management and control; generically called Software-defined Infrastructure (SDI). However, current applications are not designed to leverage this dynamism and flexibility offered by SDI and they mostly depend on a mix of different techniques including manual configuration, specialized appliances (middleboxes), and (mostly) proprietary middleware solutions together with a team of extremely conscientious and talented system engineers to get their applications deployed and running. AppFabric, 1) automates the whole control and management stack of application deployment and delivery, 2) allows application architects to define logical workflows consisting of application servers, message-level middleboxes, packet-level middleboxes and network services (both, local and wide-area) composed over application-level routing policies, and 3) provides the abstraction of an application cloud that allows the application to dynamically (and automatically) expand and shrink its distributed footprint across multiple geographically distributed datacenters operated by different cloud providers. The architecture consists of a hierarchical control plane system called Lighthouse and a fully distributed data plane design (with no special hardware components such as service orchestrators, load balancers, message brokers, etc.) called OpenADN . The current implementation (under active development) consists of ~10000 lines of python and C code.
AppFabric will allow applications to fully leverage the opportunities provided by modern virtualized Software-Defined Infrastructures. It will serve as the platform for deploying massively distributed, and extremely dynamic next generation application use-cases, including:
Internet-of-Things/Cyber-Physical Systems: Through support for managing distributed gather-aggregate topologies common to most Internet-of-Things(IoT) and Cyber-Physical Systems(CPS) use-cases. By their very nature, IoT and CPS use cases are massively distributed and have different levels of computation and storage requirements at different locations. Also, they have variable latency requirements for their different distributed sites. Some services, such as device controllers, in an Iot/CPS application workflow may need to gather, process and forward data under near-real time constraints and hence need to be as close to the device as possible. Other services may need more computation to process aggregated data to drive long term business intelligence functions. AppFabric has been designed to provide support for such very dynamic, highly diversified and massively distributed application use-cases.
Network Function Virtualization: Through support for heterogeneous workflows, application-aware networking, and network-aware application deployments, AppFabric will enable new partnerships between Application Service Providers (ASPs) and Network Service Providers (NSPs). An application workflow in AppFabric may comprise of application services, packet and message-level middleboxes, and network transport services chained together over an application-level routing substrate. The Application-level routing substrate allows policy-based service chaining where the application may specify policies for routing their application traffic over different services based on application-level content or context.
Virtual worlds/multiplayer games: Through support for creating, managing and controlling dynamic and distributed application clouds needed by these applications. AppFabric allows the application to easily specify policies to dynamically grow and shrink the application\u27s footprint over different geographical sites, on-demand.
Mobile Apps: Through support for extremely diversified and very dynamic application contexts typical of such applications. Also, AppFabric provides support for automatically managing massively distributed service deployment and controlling application traffic based on application-level policies. This allows mobile applications to provide the best Quality-of-Experience to its users without
This thesis is the first to handle and provide a complete solution for such a complex and relevant architectural problem that is expected to touch each of our lives by enabling exciting new application use-cases that are not possible today. Also, AppFabric is a non-proprietary platform that is expected to spawn lots of innovations both in the design of the platform itself and the features it provides to applications. AppFabric still needs many iterations, both in terms of design and implementation maturity. This thesis is not the end of journey for AppFabric but rather just the beginning
Bankrupt Covert Channel: Turning Network Predictability into Vulnerability
Recent years have seen a surge in the number of data leaks despite aggressive
information-containment measures deployed by cloud providers. When attackers
acquire sensitive data in a secure cloud environment, covert communication
channels are a key tool to exfiltrate the data to the outside world. While the
bulk of prior work focused on covert channels within a single CPU, they require
the spy (transmitter) and the receiver to share the CPU, which might be
difficult to achieve in a cloud environment with hundreds or thousands of
machines.
This work presents Bankrupt, a high-rate highly clandestine channel that
enables covert communication between the spy and the receiver running on
different nodes in an RDMA network. In Bankrupt, the spy communicates with the
receiver by issuing RDMA network packets to a private memory region allocated
to it on a different machine (an intermediary). The receiver similarly
allocates a separate memory region on the same intermediary, also accessed via
RDMA. By steering RDMA packets to a specific set of remote memory addresses,
the spy causes deep queuing at one memory bank, which is the finest addressable
internal unit of main memory. This exposes a timing channel that the receiver
can listen on by issuing probe packets to addresses mapped to the same bank but
in its own private memory region. Bankrupt channel delivers 74Kb/s throughput
in CloudLab's public cloud while remaining undetectable to the existing
monitoring capabilities, such as CPU and NIC performance counters.Comment: Published in WOOT 2020 co-located with USENIX Security 202
Optimizing datacenter power with memory system levers for guaranteed quality-of-service
pre-printCo-location of applications is a proven technique to improve hardware utilization. Recent advances in virtualization have made co-location of independent applications on shared hardware a common scenario in datacenters. Colocation, while maintaining Quality-of-Service (QoS) for each application is a complex problem that is fast gaining relevance for these datacenters. The problem is exacerbated by the need for effective resource utilization at datacenter scales. In this work, we show that the memory system is a primary bottleneck in many workloads and is a more effective focal point when enforcing QoS. We examine four different memory system levers to enforce QoS: two that have been previously proposed, and two novel levers. We compare the effectiveness of each lever in minimizing power and resource needs, while enforcing QoS guarantees. We also evaluate the effectiveness of combining various levers and show that this combined approach can yield power reductions of up to 28%
- …