Scalable and Reliable Middlebox Deployment
Middleboxes are pervasive in modern computer networks, providing functionality beyond mere packet forwarding. Load balancers, intrusion detection systems, and network address translators are typical examples. Despite their benefits, middleboxes pose significant scalability and reliability challenges.
The goal of this thesis is to devise middlebox deployment solutions that are cost-effective, scalable, and fault-tolerant. The thesis makes three main contributions: first, distributed service function chaining, in which multiple instances of a middlebox are deployed on different physical servers to optimize resource usage; second, Constellation, a geo-distributed middlebox framework that enables a middlebox application to operate with high performance across wide-area networks; and third, a fault-tolerant service function chaining system.
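The abstract does not detail the steering mechanism, but a minimal sketch may help illustrate distributed service function chaining. The chain order, instance names, and hash-based flow steering below are assumptions for illustration, not the thesis's actual design:

```python
# Illustrative sketch (not the thesis's system): a service function chain
# where each middlebox type has several instances on different servers, and
# flows are steered to instances by hashing the flow key so that packets of
# one flow always reach the same instance (needed for stateful middleboxes).
import hashlib

CHAIN = ["firewall", "ids", "nat"]            # hypothetical chain order
INSTANCES = {                                  # middlebox -> server instances
    "firewall": ["srv-a", "srv-b"],
    "ids":      ["srv-c", "srv-d", "srv-e"],
    "nat":      ["srv-f"],
}

def pick_instance(nf: str, flow_key: str) -> str:
    """Stable flow-to-instance mapping via hashing."""
    servers = INSTANCES[nf]
    digest = hashlib.sha256(f"{nf}:{flow_key}".encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]

def route(flow_key: str) -> list[str]:
    """Return the sequence of servers a flow traverses along the chain."""
    return [pick_instance(nf, flow_key) for nf in CHAIN]

print(route("10.0.0.1:1234->10.0.0.2:80/tcp"))
```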
Near Data Acceleration with Concurrent Host Access
Near-data accelerators (NDAs) that are integrated with main memory have the potential for significant power and performance benefits. Fully realizing these benefits requires the large available memory capacity to be shared between the host and the NDAs in a way that permits both regular memory access by some applications and accelerating others with an NDA, avoids copying data, enables collaborative processing, and simultaneously offers high performance for both host and NDA. We identify and solve new challenges in this context: mitigating row-locality interference from host to NDAs, reducing read/write-turnaround overhead caused by fine-grain interleaving of host and NDA requests, architecting a memory layout that supports the locality required for NDAs and sophisticated address interleaving for host performance, and supporting both packetized and traditional memory interfaces. We demonstrate our approach in a simulated system that consists of a multi-core CPU and NDA-enabled DDR4 memory modules. We show that our mechanisms enable effective and efficient concurrent access using a set of microbenchmarks, and then demonstrate the potential of the system for the important stochastic variance-reduced gradient (SVRG) algorithm.
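Since the abstract names SVRG as the demonstration workload, a minimal sketch of the algorithm itself may be useful. This is the textbook SVRG update applied to a least-squares objective in plain NumPy, not the paper's NDA implementation:

```python
# SVRG: an outer loop takes a full gradient at a snapshot w_snap, and each
# inner step corrects a stochastic gradient with the snapshot's gradient,
# shrinking the variance of the update as w approaches w_snap.
import numpy as np

def svrg(A, b, lr=0.05, epochs=20, inner=None, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    inner = inner or n
    w = np.zeros(d)
    for _ in range(epochs):
        w_snap = w.copy()
        full_grad = A.T @ (A @ w_snap - b) / n       # full gradient at snapshot
        for _ in range(inner):
            i = rng.integers(n)
            gi = A[i] * (A[i] @ w - b[i])            # stochastic grad at w
            gi_snap = A[i] * (A[i] @ w_snap - b[i])  # same sample at snapshot
            w -= lr * (gi - gi_snap + full_grad)     # variance-reduced step
    return w

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
print(np.linalg.norm(svrg(A, b) - x_true))  # should shrink toward zero
```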
Hybrid Designs for Caches and Cores
Processor power constraints have come to the forefront over the last decade, heralded by the stagnation of clock frequency scaling. High-performance core and cache designs often utilize power-hungry techniques to increase parallelism. Conversely, the most energy-efficient designs opt for serial execution to avoid unnecessary overheads. While both of these extremes constitute one-size-fits-all approaches, a judicious mix of parallel and serial execution has the potential to achieve the best of both high-performing and energy-efficient designs. This dissertation examines such hybrid designs for cores and caches. Firstly, we introduce a novel hybrid out-of-order/in-order core microarchitecture. Instructions that are steered towards in-order execution skip register allocation, reordering, and dynamic scheduling. At the same time, these instructions can interleave on an instruction-by-instruction basis with instructions that continue to benefit from these conventional out-of-order mechanisms. Secondly, this dissertation revisits a hybrid technique introduced for L1 caches, way-prediction, in the context of last-level caches, which are larger, have higher associativity, and experience less locality.
PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113484/1/sleimanf_1.pd
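A rough sketch of the way-prediction idea the dissertation revisits, assuming a simple MRU-style predictor; the structure and replacement policy here are illustrative simplifications, not the dissertation's design:

```python
# Way-prediction: instead of probing all ways of a set in parallel, a
# predictor picks one way to probe first; a correct guess saves energy,
# a wrong one costs extra serial probes.
class WayPredictedCache:
    def __init__(self, sets=4, ways=8):
        self.ways = ways
        self.tags = [[None] * ways for _ in range(sets)]
        self.pred = [0] * sets               # predicted (MRU) way per set
        self.probes = 0                      # serial probes performed

    def access(self, set_idx, tag):
        guess = self.pred[set_idx]
        self.probes += 1                     # first, probe the predicted way
        if self.tags[set_idx][guess] == tag:
            return "hit-predicted"
        self.probes += self.ways - 1         # fall back: probe remaining ways
        for w in range(self.ways):
            if w != guess and self.tags[set_idx][w] == tag:
                self.pred[set_idx] = w       # retrain toward the hitting way
                return "hit-mispredicted"
        self.tags[set_idx][guess] = tag      # miss: fill (naive replacement)
        self.pred[set_idx] = guess
        return "miss"

c = WayPredictedCache()
print(c.access(0, 0x1), c.access(0, 0x1), c.probes)
```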
Domain-Specific Modelling for Coordination Engineering
Multi-core processors offer increased speed and efficiency on various devices, from desktop computers to smartphones. But the challenge is not only how to gain the utmost performance, but also how to support portability, continuity with prevalent technologies, and the dissemination of existing principles of parallel software design. This thesis shows how model-driven software development can help engineer parallel systems. Rather than simply offering yet another programming approach for concurrency, it proposes using an explicit coordination model as the first development artefact. Key topics include: basic foundations of parallel software design, coordination models and languages, and model-driven software development; how Coordination Engineering eases parallel software design by separating concerns and activities across roles; how the Space-Coordinated Processes (SCOPE) coordination model combines coarse-grained choreography of parallel processes with fine-grained parallelism within these processes; and extensive experimental evaluation of SCOPE implementations and the application of Coordination Engineering.
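A minimal sketch of space-based coordination in the spirit of SCOPE, using a hypothetical queue-as-space setup rather than SCOPE's actual constructs. The point it illustrates is that processes interact only through a shared space, keeping coordination separate from computation:

```python
# Illustrative space-based coordination (hypothetical API, not SCOPE's):
# workers take task tuples from a shared space and never talk to each other
# directly, so the choreography stays explicit and coarse-grained.
import queue
import threading

space = queue.Queue()                 # the shared "space" holding task tuples

def worker(results):
    while True:
        task = space.get()            # take a tuple from the space
        if task is None:              # poison pill ends the process
            break
        results.append(task ** 2)     # fine-grained work inside the process

results = []
threads = [threading.Thread(target=worker, args=(results,)) for _ in range(4)]
for t in threads:
    t.start()
for n in range(10):                   # coarse-grained choreography: emit tasks
    space.put(n)
for _ in threads:
    space.put(None)
for t in threads:
    t.join()
print(sorted(results))
```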
A Paradigm for Scalable, Transactional, and Efficient Spatial Indexes
With large volumes of geo-tagged data collected in various applications, spatial query processing becomes essential. Query engines depend on efficient indexes to expedite processing. There are three main challenges: scaling out to accommodate large volumes of spatial data, supporting transactional primitives for strong consistency guarantees, and adapting to highly dynamic workloads. This thesis proposes a paradigm for scalable, transactional, and efficient spatial indexes to significantly reduce the development effort of designing and comparing multiple spatial indexes.

The thesis first introduces a distributed and transactional key-value store called DTranx to persist the spatial indexes. DTranx follows the SEDA architecture to exploit high concurrency in multi-core environments, and it adopts a hybrid of optimistic concurrency control and two-phase commit protocols to shorten the critical sections of distributed locking during transaction commits. Moreover, DTranx integrates a persistent-memory-based write-ahead log to reduce durability overhead and adds a garbage collection mechanism that does not disrupt normal transactions. To maintain high throughput for search workloads while databases are constantly updated, snapshot transactions are introduced.

Then, a paradigm is presented with a set of intuitive APIs and a Mempool runtime to reduce development effort. Mempool transparently synchronizes local states of data structures with DTranx and handles two critical tasks: address translation and transparent server synchronization, of which the latter includes transaction construction and data synchronization. Furthermore, a dynamic partitioning strategy is integrated into DTranx to generate partitioning and replication plans that reduce inter-server communication and balance resource usage.

Lastly, the single-threaded BTree and RTree data structures were converted into distributed versions within two weeks. The BTree and RTree achieve 253.07 kops/sec and 77.83 kops/sec throughput, respectively, for pure search operations in a 25-server cluster.
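A minimal sketch of the optimistic concurrency control that DTranx's commit path builds on; the store layout and version scheme below are illustrative assumptions, not DTranx's actual protocol:

```python
# Optimistic concurrency control: a transaction records the versions it
# read, and commit validates them before applying writes. Validation plus
# write application is the only critical section, which is what keeps
# locking short when combined with two-phase commit across servers.
class Store:
    def __init__(self):
        self.data = {}                    # key -> (value, version)

    def read(self, key):
        return self.data.get(key, (None, 0))

    def commit(self, read_set, write_set):
        # Validation: abort if any key we read has since changed.
        for key, seen_version in read_set.items():
            if self.read(key)[1] != seen_version:
                return False
        # Apply writes, bumping versions.
        for key, value in write_set.items():
            _, v = self.read(key)
            self.data[key] = (value, v + 1)
        return True

s = Store()
val, ver = s.read("x")
ok = s.commit({"x": ver}, {"x": 42})      # succeeds: nothing changed
stale = s.commit({"x": ver}, {"x": 7})    # aborts: version moved on
print(ok, stale, s.read("x"))
```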
Utilizing Runtime Information for Accurate Root Cause Identification in Performance Diagnosis
This dissertation highlights that existing performance diagnostic tools often lose effectiveness on modern software because of their inherent inaccuracies. To overcome these inaccuracies and effectively identify the root causes of performance issues, supplementary runtime information must be incorporated into these tools. Within this context, the dissertation integrates specific runtime information into two typical kinds of performance diagnostic tools: profilers and causal tracing tools.
The integration yields a substantial enhancement in the effectiveness of performance diagnosis. Among profilers, gprof stands out as a representative tool for performance diagnosis. Nonetheless, its effectiveness diminishes because time costs calculated from CPU sampling fail to accurately and adequately pinpoint the root causes of performance issues in complex software. To tackle this challenge, the dissertation introduces a methodology called value-assisted cost profiling (vProf), which incorporates variable values observed at runtime into the profiling process.
By continuously sampling variable values from both normal and problematic executions, vProf refines function cost estimates, identifies anomalies in value distributions, and highlights potentially problematic code areas that could be the actual sources of performance issues. The effectiveness of vProf is validated through the diagnosis of 18 real-world performance issues in four widely used applications. Remarkably, vProf outperforms other state-of-the-art tools, successfully diagnosing all issues, including three that had remained unresolved for over four years.
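A hedged sketch of the core idea behind value-assisted profiling, not vProf's implementation: sample a variable's values under normal and problematic runs, then rank variables by how far the two empirical distributions diverge. The variable names and data are invented for illustration:

```python
# Rank variables by the divergence between their value distributions in a
# normal run versus a problematic run; a large shift flags a suspect.
from collections import Counter

def distribution(samples, bins=10, lo=0, hi=100):
    """Bin samples into a normalized histogram."""
    counts = Counter(min(bins - 1, max(0, int((s - lo) * bins / (hi - lo))))
                     for s in samples)
    total = sum(counts.values())
    return [counts.get(b, 0) / total for b in range(bins)]

def divergence(p, q):
    """L1 distance between two binned distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))

normal_run = {"queue_len": [3, 4, 5, 4, 3], "retries": [0, 0, 1, 0, 0]}
buggy_run  = {"queue_len": [3, 5, 4, 4, 3], "retries": [9, 12, 88, 95, 90]}

suspects = sorted(
    normal_run,
    key=lambda v: divergence(distribution(normal_run[v]),
                             distribution(buggy_run[v])),
    reverse=True)
print(suspects)   # 'retries' ranks first: its value distribution shifted
```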
Causal tracing tools reveal the root causes of performance issues in complex software by generating tracing graphs. However, these graphs often suffer from inherent inaccuracies, characterized by superfluous (over-connected) and missed (under-connected) edges. These inaccuracies arise from the diversity of programming paradigms. To mitigate the inaccuracies, the dissertation proposes an approach to derive strong and weak edges in tracing graphs based on the vertices’ semantics collected during runtime. By leveraging these edge types, a beam-search-based diagnostic algorithm is employed to identify the most probable causal paths. Causal paths from normal and buggy executions are differentiated to provide key insights into the root causes of performance issues. To validate this approach, a causal tracing tool named Argus is developed and tested across multiple versions of macOS. It is evaluated on 12 well-known spinning pinwheel issues in popular macOS applications. Notably, Argus successfully diagnoses the root causes of all identified issues, including 10 issues that had remained unresolved for several years.
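A small sketch of beam search over a tracing graph with strong and weak edges; the graph, weights, and vertex names here are invented, since Argus derives edge strength from runtime vertex semantics that this toy example only stands in for:

```python
# Beam search over a causal tracing graph: keep the k highest-probability
# partial paths while walking back from the symptom vertex toward candidate
# root causes, preferring strong edges over weak ones.
STRONG, WEAK = 0.9, 0.3
graph = {                          # vertex -> [(predecessor, edge weight)]
    "spin":      [("lock_wait", STRONG), ("ui_redraw", WEAK)],
    "lock_wait": [("disk_io", STRONG)],
    "ui_redraw": [("timer", WEAK)],
    "disk_io":   [],
    "timer":     [],
}

def beam_search(start, k=2):
    beams = [([start], 1.0)]
    done = []
    while beams:
        nxt = []
        for path, p in beams:
            preds = graph[path[-1]]
            if not preds:
                done.append((path, p))     # reached a candidate root cause
            for v, w in preds:
                nxt.append((path + [v], p * w))
        beams = sorted(nxt, key=lambda b: -b[1])[:k]  # keep top-k paths
    return max(done, key=lambda b: b[1])

path, prob = beam_search("spin")
print(" <- ".join(path), f"(score {prob:.2f})")
```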
The results from both tools exemplify a substantial enhancement of performance diagnostic tools achieved by harnessing runtime information. The integration can effectively mitigate inherent inaccuracies, lend support to inaccuracy-tolerant diagnostic algorithms, and provide key insights to pinpoint the root causes of performance issues.
Middleware support for locality-aware wide area replication
Technical report. Coherent wide-area data caching can improve the scalability and responsiveness of distributed services such as wide-area file access, database and directory services, and content distribution. However, distributed services differ widely in the frequency of read/write sharing, the amount of contention between clients for the same data, and their ability to make tradeoffs between consistency and availability. Aggressive replication enhances the scalability and availability of services with read-mostly data or data that need not be kept strongly consistent; however, for applications that require strong consistency of write-shared data, replication must be throttled to achieve reasonable performance. We have developed a middleware data store called Swarm designed to support the wide-area data sharing needs of distributed services. To support the needs of diverse distributed services, Swarm provides: (i) a failure-resilient, proximity-aware data replication mechanism that adjusts the replication hierarchy based on observed network characteristics and node availability, (ii) a customizable consistency mechanism that allows applications to specify allowable consistency-availability tradeoffs, and (iii) a contention-aware caching mechanism that monitors contention between replicas and adjusts its replication policies accordingly. On a 240-node P2P file sharing system, Swarm's proximity-aware caching and replica hierarchy maintenance mechanisms improve latency by 80%, reduce WAN bandwidth consumption by 80%, and limit the impact of high node churn (5 node deaths/sec) to roughly one-fifth that of random replication. In addition, Swarm's contention-aware caching mechanism outperforms RPCs and static caching mechanisms at all levels of contention on an enterprise service workload.
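A minimal sketch of the proximity-aware attachment idea, assuming invented latency measurements; Swarm's actual hierarchy maintenance also reacts to node availability and churn, which this toy omits:

```python
# Proximity-aware replica hierarchy: each new replica attaches to the
# existing replica with the lowest observed latency, so the hierarchy
# tracks network proximity instead of being random.
latency_ms = {                      # observed pairwise latencies (assumed)
    ("root", "eu-1"): 90, ("root", "us-1"): 10, ("root", "us-2"): 12,
    ("us-1", "eu-1"): 85, ("us-1", "us-2"): 3,  ("eu-1", "us-2"): 88,
}

def rtt(a, b):
    return latency_ms.get((a, b)) or latency_ms.get((b, a))

def attach(new_node, hierarchy):
    """Pick the closest existing replica as the new node's parent."""
    parent = min(hierarchy, key=lambda n: rtt(n, new_node))
    hierarchy[new_node] = parent
    return parent

hierarchy = {"root": None}
for node in ["us-1", "eu-1", "us-2"]:
    print(node, "->", attach(node, hierarchy))
```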