A Closer Look at Lightweight Graph Reordering
Graph analytics power a range of applications in areas as diverse as finance,
networking and business logistics. A common property of graphs used in the
domain of graph analytics is a power-law distribution of vertex connectivity,
wherein a small number of vertices are responsible for a high fraction of all
connections in the graph. These richly-connected (hot) vertices inherently
exhibit high reuse. However, their sparse distribution in memory leads to a
severe underutilization of on-chip cache capacity. Prior works have proposed
lightweight skew-aware vertex reordering that places hot vertices adjacent to
each other in memory, reducing the cache footprint of hot vertices. However, in
doing so, they may inadvertently destroy the inherent community structure
within the graph, which may negate the performance gains achieved from the
reduced footprint of hot vertices.
In this work, we study existing reordering techniques and demonstrate the
inherent tension between reducing the cache footprint of hot vertices and
preserving original graph structure. We quantify the potential performance loss
due to disruption in graph structure for different graph datasets. We further
show that reordering techniques that employ fine-grain reordering significantly
increase misses in the higher level caches, even when they reduce misses in the
last-level cache.
To overcome the limitations of existing reordering techniques, we propose
Degree-Based Grouping (DBG), a novel lightweight reordering technique that
employs a coarse-grain reordering to largely preserve graph structure while
reducing the cache footprint of hot vertices. Our evaluation on 40 combinations
of various graph applications and datasets shows that, compared to a baseline
with no reordering, DBG yields an average application speed-up of 16.8% vs
11.6% for the best-performing existing lightweight technique.
Comment: Fixed ill-formatted page 6 from the earlier version. No content change.
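The coarse-grain grouping idea behind DBG can be illustrated at trace level. The following is a minimal sketch, assuming an arbitrary set of descending degree thresholds and a plain Python representation; the paper derives its actual group boundaries differently (e.g., relative to average degree), so treat the thresholds and function name as hypothetical:

```python
from collections import defaultdict

def dbg_reorder(degrees, thresholds=(32, 8, 2)):
    """Coarse-grain degree-based grouping: assign each vertex to a degree
    group (hottest group first) and concatenate the groups, preserving the
    original vertex order inside each group so community structure within
    a group is largely retained. Returns the new vertex order."""
    groups = defaultdict(list)
    for v, d in enumerate(degrees):
        # first threshold the degree meets; lower index = hotter group
        g = next((i for i, t in enumerate(thresholds) if d >= t),
                 len(thresholds))
        groups[g].append(v)  # original relative order preserved
    new_order = []
    for g in range(len(thresholds) + 1):
        new_order.extend(groups[g])
    return new_order
```

Because vertices move only between groups, not within them, hot vertices become contiguous in memory while most of the original ordering survives, which is the tension the paper identifies with fine-grain schemes.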
Domain-Specialized Cache Management for Graph Analytics
Graph analytics power a range of applications in areas as diverse as finance,
networking and business logistics. A common property of graphs used in the
domain of graph analytics is a power-law distribution of vertex connectivity,
wherein a small number of vertices are responsible for a high fraction of all
connections in the graph. These richly-connected (hot) vertices inherently
exhibit high reuse. However, this work finds that state-of-the-art hardware
cache management schemes struggle to capitalize on that reuse due to the
highly irregular access patterns of graph analytics.
In response, we propose GRASP, domain-specialized cache management at the
last-level cache for graph analytics. GRASP augments existing cache policies to
maximize reuse of hot vertices by protecting them against cache thrashing,
while maintaining sufficient flexibility to capture the reuse of other vertices
as needed. GRASP keeps hardware cost negligible by leveraging lightweight
software support to pinpoint hot vertices, thus eliding the need for
storage-intensive prediction mechanisms employed by state-of-the-art cache
management schemes. On a set of diverse graph-analytic applications with large
high-skew graph datasets, GRASP outperforms prior domain-agnostic schemes on
all datapoints, yielding an average speed-up of 4.2% (max 9.4%) over the
best-performing prior scheme. GRASP remains robust on low-/no-skew datasets,
whereas prior schemes consistently cause a slowdown.
Comment: No content changes from the previous version.
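The protection idea in GRASP can be illustrated with a toy cache model. This is a deliberately simplified sketch: it uses plain LRU with a prefer-cold eviction rule over a software-identified hot region, whereas the actual GRASP design augments existing insertion/promotion policies and retains more flexibility. The function and trace below are illustrative assumptions, not the paper's mechanism:

```python
def simulate(trace, capacity, is_hot):
    """Tiny fully-associative LRU cache with GRASP-style protection:
    on a miss, prefer evicting a cold (non-hot) line; a hot line is
    evicted only if the cache holds nothing else. Returns the hit count."""
    cache = []  # LRU order: front = least recently used
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.remove(addr)   # promote to MRU on hit
            cache.append(addr)
            continue
        if len(cache) >= capacity:
            # protection: pick the LRU-most cold line if one exists
            victim = next((a for a in cache if not is_hot(a)), cache[0])
            cache.remove(victim)
        cache.append(addr)
    return hits
```

Under a thrashing trace whose streaming working set exceeds capacity, the protected hot lines keep hitting while an unprotected (domain-agnostic) baseline loses them before reuse.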
Addressing variability in reuse prediction for last-level caches
The Last-Level Cache (LLC) represents the bulk of a modern CPU processor's transistor budget and is essential for application performance, as the LLC enables fast access to data in contrast to the much slower main memory. Problematically, technology constraints make it infeasible to scale LLC capacity to meet the ever-increasing working set sizes of applications. Thus, future processors will rely on effective cache management mechanisms and policies to get more performance out of the scarce LLC capacity.
Applications with large working set size often exhibit streaming and/or thrashing access patterns at LLC. As a result, a large fraction of the LLC capacity is occupied by dead blocks that will not be referenced again, leading to inefficient utilization of the LLC capacity. To improve cache efficiency, the state-of-the-art cache management techniques employ prediction mechanisms that learn from the past access patterns with an aim to accurately identify as many dead blocks as possible. Once identified, dead blocks are evicted from LLC to make space for potentially high reuse cache blocks.
In this thesis, we identify variability in the reuse behavior of cache blocks as the key limiting factor in maximizing cache efficiency for state-of-the-art predictive techniques. Variability in reuse prediction is inevitable due to numerous factors that are outside the control of LLC. The sources of variability include control-flow variation, speculative execution and contention from cores sharing the cache, among others. Variability in reuse prediction challenges existing techniques in reliably identifying the end of a block's useful lifetime, thus causing lower prediction accuracy, coverage, or both. To address this challenge, this thesis aims to design robust cache management mechanisms and policies for LLC in the face of variability in reuse prediction to minimize cache misses, while keeping the cost and complexity of the hardware implementation low. To that end, we propose two cache management techniques, one domain-agnostic and one domain-specialized, to improve cache efficiency by addressing variability in reuse prediction.
In the first part of the thesis, we consider domain-agnostic cache management, a conventional approach to cache management, in which the LLC is managed fully in hardware, and thus the cache management is transparent to the software. In this context, we propose Leeway, a novel domain-agnostic cache management technique. Leeway introduces a new metric, Live Distance, that captures the largest interval of temporal reuse for a cache block, providing a conservative estimate of a cache block's useful lifetime. Leeway implements a robust prediction mechanism that identifies dead blocks based on their past Live Distance values. Leeway monitors the change in Live Distance values at runtime and dynamically adapts its reuse-aware policies to maximize cache efficiency in the face of variability.
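The Live Distance metric can be illustrated at trace level. The sketch below assumes Live Distance is approximated by the largest LRU stack distance observed across a block's reuses; the thesis measures it inside the cache with its own bookkeeping and adapts policies to runtime variability, so this trace-level function is only an illustrative stand-in:

```python
def live_distances(trace):
    """For each block in an access trace, record the largest LRU stack
    distance at which it was re-referenced -- a conservative estimate of
    how long the block stays useful (its "leeway"). Blocks that are never
    reused get distance -1."""
    stack = []   # LRU stack: end = most recently used
    live = {}
    for addr in trace:
        if addr in stack:
            # depth 0 means the block was the MRU entry when reused
            depth = len(stack) - 1 - stack.index(addr)
            live[addr] = max(live.get(addr, -1), depth)
            stack.remove(addr)
        else:
            live.setdefault(addr, -1)
        stack.append(addr)
    return live
```

A block whose largest observed reuse interval has passed can then be predicted dead conservatively: underestimating the lifetime evicts live blocks, overestimating wastes capacity, which is why the maximum is taken.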
In the second part of the thesis, we identify applications for which existing domain-agnostic cache management techniques struggle to exploit the high reuse due to variability arising from certain fundamental application characteristics. Specifically, applications from the domain of graph analytics inherently exhibit high reuse when processing natural graphs. However, the reuse pattern is highly irregular and dependent on graph topology: a small fraction of vertices, the hot vertices, exhibit high reuse, whereas a large fraction of vertices exhibit low or no reuse. Moreover, the hot vertices are sparsely distributed in the memory space. Data-dependent irregular access patterns, combined with the sparse distribution of hot vertices, make it difficult for existing domain-agnostic predictive techniques to reliably identify, and, in turn, retain hot vertices in the cache, causing severe underutilization of the LLC capacity.
In this thesis, we observe that the software is aware of the application's reuse characteristics, which, if passed to the hardware efficiently, can help the hardware reliably identify the most useful working set even amidst irregular access patterns. To that end, we propose a holistic approach of software-hardware co-design to effectively manage the LLC for the domain of graph analytics. Our software component implements a novel lightweight software technique, called Degree-Based Grouping (DBG), that applies a coarse-grain graph reordering to segregate hot vertices in a contiguous memory region to improve spatial locality. Meanwhile, our hardware component implements a novel domain-specialized cache management technique, called Graph Specialized Cache Management (GRASP). GRASP augments existing cache policies to maximize reuse of hot vertices by protecting them against cache thrashing, while maintaining sufficient flexibility to capture the reuse of other vertices as needed. To reliably identify hot vertices amidst irregular access patterns, GRASP leverages the DBG-enabled contiguity of hot vertices. Our domain-specialized cache management not only outperforms the state-of-the-art domain-agnostic predictive techniques, but also eliminates the need for any storage-intensive prediction mechanisms.
Optimistic Prediction of Synchronization-Reversal Data Races
Dynamic data race detection has emerged as a key technique for ensuring
reliability of concurrent software in practice. However, dynamic approaches can
often miss data races owing to nondeterminism in the thread scheduler.
Predictive race detection techniques cater to this shortcoming by inferring
alternate executions that may expose data races without re-executing the
underlying program. More formally, the dynamic data race prediction problem
asks, given a trace \sigma of an execution of a concurrent program, can \sigma
be correctly reordered to expose a data race? Existing state-of-the-art
techniques for data race prediction either do not scale to executions arising
from real world concurrent software, or only expose a limited class of data
races, such as those that can be exposed without reversing the order of
synchronization operations.
In general, exposing data races by reasoning about synchronization reversals
is an intractable problem. In this work, we identify a class of data races,
called Optimistic Sync(hronization)-Reversal races that can be detected in a
tractable manner and often include non-trivial data races that cannot be
exposed by prior tractable techniques. We also propose a sound algorithm OSR
for detecting all optimistic sync-reversal data races in overall quadratic
time, and show that the algorithm is optimal by establishing a matching lower
bound. Our experiments demonstrate the effectiveness of OSR on our extensive
suite of benchmarks: OSR reports the largest number of data races and scales
well to large execution traces.
Comment: ICSE'2
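For context, the classical non-predictive baseline that sync-reversal techniques like OSR generalize is vector-clock happens-before detection over a single observed trace. The sketch below is that baseline, not OSR itself: it orders events through lock release/acquire edges and therefore misses exactly the races that would require reversing synchronization order. Event format and names are illustrative assumptions:

```python
def hb_races(trace):
    """trace: list of (thread, op, target) with op in {"rd","wr","acq","rel"}.
    Returns index pairs of conflicting accesses unordered by happens-before."""
    def leq(a, b):  # vector clock a happens-before-or-equals b
        return all(a.get(t, 0) <= b.get(t, 0) for t in a)

    clock = {}      # thread -> current vector clock
    lock_vc = {}    # lock -> clock snapshot at last release
    last_wr = {}    # var -> (clock snapshot, event index)
    last_rds = {}   # var -> list of (clock snapshot, event index)
    races = []
    for i, (t, op, x) in enumerate(trace):
        vc = clock.setdefault(t, {t: 1})
        if op == "acq":          # join in the releasing thread's knowledge
            for k, v in lock_vc.get(x, {}).items():
                vc[k] = max(vc.get(k, 0), v)
        elif op == "rel":
            lock_vc[x] = dict(vc)
        else:                    # rd or wr: check conflicts with prior accesses
            w = last_wr.get(x)
            if w and not leq(w[0], vc):
                races.append((w[1], i))
            if op == "wr":
                for r in last_rds.get(x, []):
                    if not leq(r[0], vc):
                        races.append((r[1], i))
                last_wr[x] = (dict(vc), i)
                last_rds[x] = []
            else:
                last_rds.setdefault(x, []).append((dict(vc), i))
        vc[t] = vc.get(t, 0) + 1
    return races
```

Because every release-to-acquire edge is kept as an ordering constraint, a race hidden behind a reversible lock ordering is never reported here; reasoning soundly and tractably about which of those edges can be dropped is the contribution the abstract describes.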
Metamorphic Code Generation from LLVM IR Bytecode
Metamorphic software changes its internal structure across generations with its functionality remaining unchanged. Metamorphism has been employed by malware writers as a means of evading signature detection and other advanced detection strategies. However, code morphing also has potential security benefits, since it increases the "genetic diversity" of software. In this research, we have created a metamorphic code generator within the LLVM compiler framework. LLVM is a three-phase compiler that supports multiple source languages and target architectures. It uses a common intermediate representation (IR) bytecode in its optimizer. Consequently, any supported high-level programming language can be transformed to this IR bytecode as part of the LLVM compilation process. Our metamorphic generator functions at the IR bytecode level, which provides many advantages over previously developed metamorphic generators. The morphing techniques that we employ include dead code insertion, where the dead code is actually executed within the morphed code, and subroutine permutation. We have tested the effectiveness of our code morphing using hidden Markov model analysis.
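The executed-dead-code transformation can be sketched on a toy textual IR. This is only an illustration of the idea: the actual generator operates on real LLVM IR inside the compiler, and the instruction strings, probability, and function name below are assumptions for the sketch:

```python
import random

def morph(ir, seed):
    """Insert executed-but-ineffective instructions at random points in a
    toy list-of-strings IR. Each inserted instruction runs in the morphed
    program but its result is never used, so functionality is unchanged
    while the instruction sequence (the signature) differs per generation."""
    rng = random.Random(seed)
    out = []
    for ins in ir:
        if rng.random() < 0.5:
            tmp = f"%dead{rng.randrange(1000)}"
            out.append(f"{tmp} = add i32 0, 0")  # executed dead code
        out.append(ins)
    return out
```

Different seeds yield structurally different generations of the same program, which is what defeats static signature matching; subroutine permutation would additionally reorder self-contained function bodies.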
ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation
Web archives are a valuable resource for researchers of various disciplines.
However, to use them as a scholarly source, researchers require a tool that
provides efficient access to Web archive data for extraction and derivation of
smaller datasets. Besides efficient access we identify five other objectives
based on practical researcher needs such as ease of use, extensibility and
reusability.
Towards these objectives we propose ArchiveSpark, a framework for efficient,
distributed Web archive processing that builds a research corpus by working on
existing and standardized data formats commonly held by Web archiving
institutions. Performance optimizations in ArchiveSpark, facilitated by the use
of a widely available metadata index, result in significant speed-ups of data
processing. Our benchmarks show that ArchiveSpark is faster than alternative
approaches without depending on any additional data stores while improving
usability by seamlessly integrating queries and derivations with external
tools.
Comment: JCDL 2016, Newark, NJ, US
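The metadata-index optimization the abstract mentions amounts to filtering on lightweight index records first and touching the heavy archive payloads only for survivors. A minimal sketch, assuming a simplified CDX-like field layout (the real CDX format and ArchiveSpark's API differ):

```python
def select_records(cdx_lines, predicate):
    """Filter on the small CDX metadata index first; only the surviving
    pointers (warc file, offset, length) would then be used to read the
    full WARC payloads, avoiding a scan of the raw archive."""
    selected = []
    for line in cdx_lines:
        url, timestamp, mime, status, offset, length, warcfile = line.split()
        if predicate({"url": url, "timestamp": timestamp,
                      "mime": mime, "status": status}):
            selected.append((warcfile, int(offset), int(length)))
    return selected
```

Because the index holds a few fields per record while payloads can be arbitrarily large, pushing selection into the index is where the bulk of the reported speed-up over payload-scanning approaches comes from.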