2,532 research outputs found
On Characterizing the Data Access Complexity of Programs
Technology trends will cause data movement to account for the majority of
energy expenditure and execution time on emerging computers. Therefore,
computational complexity will no longer be a sufficient metric for comparing
algorithms, and a fundamental characterization of data access complexity will
be increasingly important. The problem of developing lower bounds for data
access complexity has been modeled using the formalism of Hong & Kung's
red/blue pebble game for computational directed acyclic graphs (CDAGs).
However, previously developed approaches to lower bounds analysis for the
red/blue pebble game are very limited in effectiveness when applied to CDAGs of
real programs, with computations comprised of multiple sub-computations with
differing DAG structure. We address this problem by developing an approach for
effectively composing lower bounds based on graph decomposition. We also
develop a static analysis algorithm to derive the asymptotic data-access lower
bounds of programs, as a function of the problem size and cache size
GOSSIPKIT: A Unified Component Framework for Gossip
International audienceAlthough the principles of gossip protocols are relatively easy to grasp, their variety can make their design and evaluation highly time consuming. This problem is compounded by the lack of a unified programming framework for gossip, which means developers cannot easily reuse, compose, or adapt existing solutions to fit their needs, and have limited opportunities to share knowledge and ideas. In this paper, we consider how component frameworks, which have been widely applied to implement middleware solutions, can facilitate the development of gossip-based systems in a way that is both generic and simple. We show how such an approach can maximise code reuse, simplify the implementation of gossip protocols, and facilitate dynamic evolution and re-deployment
Effective Cache Apportioning for Performance Isolation Under Compiler Guidance
With a growing number of cores in modern high-performance servers, effective
sharing of the last level cache (LLC) is more critical than ever. The primary
agenda of such systems is to maximize performance by efficiently supporting
multi-tenancy of diverse workloads. However, this could be particularly
challenging to achieve in practice, because modern workloads exhibit dynamic
phase behaviour, which causes their cache requirements & sensitivities to vary
at finer granularities during execution. Unfortunately, existing systems are
oblivious to the application phase behavior, and are unable to detect and react
quickly enough to these rapidly changing cache requirements, often incurring
significant performance degradation. In this paper, we propose Com-CAS, a new
apportioning system that provides dynamic cache allocations for co-executing
applications. Com-CAS differs from the existing cache partitioning systems by
adapting to the dynamic cache requirements of applications just-in-time, as
opposed to reacting, without any hardware modifications. The front-end of
Com-CAS consists of compiler-analysis equipped with machine learning mechanisms
to predict cache requirements, while the back-end consists of proactive
scheduler that dynamically apportions LLC amongst co-executing applications
leveraging Intel Cache Allocation Technology (CAT). Com-CAS's partitioning
scheme utilizes the compiler-generated information across finer granularities
to predict the rapidly changing dynamic application behaviors, while
simultaneously maintaining data locality. Our experiments show that Com-CAS
improves average weighted throughput by 15% over unpartitioned cache system,
and outperforms state-of-the-art partitioning system KPart by 20%, while
maintaining the worst individual application completion time degradation to
meet various Service-Level Agreement (SLA) requirements
Analytical Query Execution Optimized for all Layers of Modern Hardware
Analytical database queries are at the core of business intelligence and decision support. To analyze the vast amounts of data available today, query execution needs to be orders of magnitude faster. Hardware advances have made a profound impact on database design and implementation. The large main memory capacity allows queries to execute exclusively in memory and shifts the bottleneck from disk access to memory bandwidth. In the new setting, to optimize query performance, databases must be aware of an unprecedented multitude of complicated hardware features. This thesis focuses on the design and implementation of highly efficient database systems by optimizing analytical query execution for all layers of modern hardware. The hardware layers include the network across multiple machines, main memory and the NUMA interconnection across multiple processors, the multiple levels of caches across multiple processor cores, and the execution pipeline within each core. For the network layer, we introduce a distributed join algorithm that minimizes the network traffic. For the memory hierarchy, we describe partitioning variants aware to the dynamics of the CPU caches and the NUMA interconnection. To improve the memory access rate of linear scans, we optimize lightweight compression variants and evaluate their trade-offs. To accelerate query execution within the core pipeline, we introduce advanced SIMD vectorization techniques generalizable across multiple operators. We evaluate our algorithms and techniques on both mainstream hardware and on many-integrated-core platforms, and combine our techniques in a new query engine design that can better utilize the features of many-core CPUs. In the era of hardware becoming increasingly parallel and datasets consistently growing in size, this thesis can serve as a compass for developing hardware-conscious databases with truly high-performance analytical query execution
- …