June 14th, 2017
Cache replacement and branch prediction are two important microarchitectural prediction techniques for improving performance. We propose a data-driven approach to designing microarchitectural predictors. Through simulation, we collect traces giving detailed control-flow and memory behavior. We then use stochastic search techniques, such as genetic algorithms, to find points in a large design space of predictors that yield good accuracy on the traces, and finally evaluate the predictors on held-out data.
This talk will present two techniques resulting from this methodology. In Multiperspective Branch Prediction, many features and their parameters are tuned with a genetic algorithm to yield a very accurate perceptron-based branch predictor. Multiperspective Reuse Prediction applies the same idea to cache management, combining many features of memory accesses to predict the reuse of a given memory access; the features and their parameters are chosen by a stochastic search, yielding a very accurate predictor. This predictor drives a placement, replacement, and bypass optimization that outperforms the state of the art.
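The search loop behind this methodology can be illustrated with a toy sketch. Everything here is an invented stand-in, not the Multiperspective infrastructure: the feature pool, the "trace statistics", and the fitness function (a real system would simulate the candidate predictor over the collected traces and return its accuracy).

```python
import random

FEATURE_POOL = ["pc", "global_history", "path_history", "local_history",
                "address_bits", "recency", "insertion_pc"]

def fitness(config, traces):
    """Score a feature subset: a toy proxy that rewards informative
    features and penalizes configuration size. A real system would
    simulate the predictor over the traces and return its accuracy."""
    return sum(traces.get(f, 0.0) for f in config) - 0.05 * len(config)

def mutate(config):
    """Flip one randomly chosen feature in or out of the configuration."""
    c = set(config)
    f = random.choice(FEATURE_POOL)
    c.symmetric_difference_update({f})
    return frozenset(c) or frozenset({f})   # never return an empty config

def genetic_search(traces, pop_size=20, generations=50, seed=0):
    random.seed(seed)
    pop = [frozenset(random.sample(FEATURE_POOL, 3)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, traces), reverse=True)
        survivors = pop[: pop_size // 2]                     # selection
        children = [mutate(random.choice(survivors)) for _ in survivors]
        pop = survivors + children                           # next generation
    return max(pop, key=lambda c: fitness(c, traces))

# Toy "trace statistics": a per-feature usefulness score.
traces = {"pc": 0.9, "global_history": 0.8, "path_history": 0.3,
          "local_history": 0.1, "address_bits": 0.05,
          "recency": 0.4, "insertion_pc": 0.2}
best = genetic_search(traces)
```

Because the best configuration always survives selection, the search never regresses; the held-out evaluation mentioned above would then score `best` on traces the search never saw.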
A general guide to applying machine learning to computer architecture
The resurgence of machine learning since the late 1990s has been enabled by significant advances in computing performance and the growth of big data. The ability of these algorithms to detect complex patterns in data, patterns that are extremely difficult to identify manually, helps produce effective predictive models. Whilst computer architects have been accelerating the performance of machine learning algorithms with GPUs and custom hardware, there have been few implementations leveraging these algorithms to improve computer system performance. The work that has been conducted, however, has produced notably promising results.
The purpose of this paper is to serve as a foundational base and guide for future computer architecture research seeking to use machine learning models to improve system efficiency. We describe a method that highlights when, why, and how to utilize machine learning models to improve system performance, and provide a relevant example showcasing the effectiveness of applying machine learning in computer architecture. We describe a process of data generation at every execution quantum and of parameter engineering. This is followed by a survey of a set of popular machine learning models. We discuss their strengths and weaknesses and evaluate implementations for the purpose of creating a workload performance predictor for the different core types in an x86 processor. The predictions can then be exploited by a scheduler for heterogeneous processors to improve system throughput. The algorithms of focus are stochastic gradient descent based linear regression, decision trees, random forests, artificial neural networks, and k-nearest neighbors.
This work has been supported by the European Research Council (ERC) Advanced Grant RoMoL (Grant Agreement 321253) and by the Spanish Ministry of Science and Innovation (contract TIN 2015-65316P).
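One of the surveyed models, SGD-based linear regression, is simple enough to sketch end to end. The per-quantum features and speedup targets below are invented toy data, not the paper's dataset; the point is the shape of the pipeline: collect counters each quantum, fit a model, then query it for scheduling decisions.

```python
def sgd_linear_regression(samples, targets, lr=0.05, epochs=1000):
    """Fit y ~ w.x + b by stochastic gradient descent on squared error."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y_true in zip(samples, targets):
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = pred - y_true
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Invented per-quantum features: [IPC on the small core, LLC miss ratio,
# branch MPKI / 10] -- scaled to similar magnitudes, as one would in practice.
X = [[0.8, 0.10, 0.50], [1.2, 0.02, 0.10], [0.5, 0.30, 0.90], [1.0, 0.05, 0.20]]
# Invented target: speedup observed when the workload runs on the big core.
y = [1.5, 2.2, 1.1, 1.9]

w, b = sgd_linear_regression(X, y)
predict = lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b
```

A heterogeneous scheduler could then rank runnable threads by `predict(features)` and migrate the one with the largest predicted big-core speedup.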
Exploring Alternate Cache Indexing Techniques
Cache memory is a bridging component that covers the increasing gap between the speed of a processor and that of main memory. Excellent cache performance is crucial to improving system performance. Conflict misses are one of the critical factors limiting cache performance: many blocks map to the same set, causing frequent evictions there, while other sets receive few mappings, so the available space is not efficiently utilized. A direct way to reduce conflict misses is to increase associativity, but this comes at the cost of increased hit time. Another way is to change the cache-indexing scheme and distribute accesses across all sets.
This thesis focuses on the second approach and aims to evaluate the impact of a matrix-based indexing scheme on cache performance against the traditional modulus-based indexing scheme. A correlation between the proposed indexing scheme and different cache replacement policies is also observed.
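The contrast between the two indexing schemes can be sketched concretely. In matrix-based indexing, the set index is the product of the block-address bits with a fixed binary matrix over GF(2), i.e., each index bit is the XOR (parity) of a chosen subset of address bits. The particular matrix below is an invented example, not the one evaluated in the thesis:

```python
def modulus_index(block_addr, num_sets):
    """Traditional indexing: the low-order bits of the block address."""
    return block_addr % num_sets

def matrix_index(block_addr, rows):
    """Matrix-based indexing over GF(2): each row is a bitmask of address
    bits, and the corresponding index bit is the parity of the masked bits.
    len(rows) index bits address 2**len(rows) sets."""
    index = 0
    for i, mask in enumerate(rows):
        parity = bin(block_addr & mask).count("1") & 1
        index |= parity << i
    return index

NUM_SETS = 8          # 3 index bits
# Illustrative GF(2) matrix: each index bit mixes low and high address bits,
# so strided addresses that collide under modulus spread across sets.
ROWS = [0b000100101, 0b001001010, 0b010010100]

stride = 8  # power-of-two stride: pathological for modulus indexing
addrs = [base * stride for base in range(16)]
mod_sets = {modulus_index(a, NUM_SETS) for a in addrs}
mat_sets = {matrix_index(a, ROWS) for a in addrs}
```

Under modulus indexing, every address in this stride-8 stream lands in set 0, while the matrix scheme spreads the same stream across all eight sets, which is exactly the conflict-miss behavior the thesis targets.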
The matrix-based indexing scheme yields a geometric-mean speedup of 1.2% on the SPEC CPU2017 benchmarks in single-core simulations when applied to a direct-mapped last-level cache. In this case, improvements of at least 1.5% and 4% are observed for eighteen and seven of the SPEC CPU2017 applications, respectively. It also yields a 2% performance improvement across sixteen SPEC CPU2006 benchmarks. The new indexing scheme correlates well with multiperspective reuse prediction, and LRU is observed to benefit a machine-learning benchmark with a performance improvement of 5.1%. For multicore simulations, the new indexing scheme does not improve performance significantly; however, it also does not impact application performance negatively.
Domain-Specialized Cache Management for Graph Analytics
Graph analytics power a range of applications in areas as diverse as finance,
networking and business logistics. A common property of graphs used in the
domain of graph analytics is a power-law distribution of vertex connectivity,
wherein a small number of vertices are responsible for a high fraction of all
connections in the graph. These richly-connected, hot, vertices inherently
exhibit high reuse. However, this work finds that state-of-the-art hardware
cache management schemes struggle in capitalizing on their reuse due to highly
irregular access patterns of graph analytics.
In response, we propose GRASP, domain-specialized cache management at the
last-level cache for graph analytics. GRASP augments existing cache policies to
maximize reuse of hot vertices by protecting them against cache thrashing,
while maintaining sufficient flexibility to capture the reuse of other vertices
as needed. GRASP keeps hardware cost negligible by leveraging lightweight
software support to pinpoint hot vertices, thus eliding the need for
storage-intensive prediction mechanisms employed by state-of-the-art cache
management schemes. On a set of diverse graph-analytic applications with large
high-skew graph datasets, GRASP outperforms prior domain-agnostic schemes on
all datapoints, yielding an average speed-up of 4.2% (max 9.4%) over the
best-performing prior scheme. GRASP remains robust on low-/no-skew datasets,
whereas prior schemes consistently cause a slowdown.
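The core mechanism can be sketched as a small variation on an RRIP-family policy: software pins down the address range holding hot-vertex data, and the cache inserts lines from that range with a protective re-reference value while inserting everything else as distant. The RRPV values, range interface, and single-set model below are illustrative assumptions, not GRASP's actual specification.

```python
MAX_RRPV = 3  # re-reference prediction value, as in RRIP-family policies

class GraspLikeSet:
    def __init__(self, ways, hot_range):
        self.lines = {}                 # tag -> RRPV
        self.ways = ways
        self.lo, self.hi = hot_range    # software-provided hot-vertex range

    def _is_hot(self, addr):
        return self.lo <= addr < self.hi

    def access(self, addr):
        if addr in self.lines:          # hit: promote to most-protected
            self.lines[addr] = 0
            return True
        while len(self.lines) >= self.ways:          # miss: make room
            victim = next((t for t, r in self.lines.items()
                           if r == MAX_RRPV), None)
            if victim is not None:
                del self.lines[victim]
            else:
                for t in self.lines:                 # no distant line: age all
                    self.lines[t] += 1
        # Insert: hot-vertex lines get a protective RRPV; other lines are
        # inserted as distant, so they cannot thrash the hot lines.
        self.lines[addr] = 1 if self._is_hot(addr) else MAX_RRPV
        return False

cache = GraspLikeSet(ways=4, hot_range=(0, 4))
for _ in range(3):                      # thrashing stream around 2 hot lines
    for a in [0, 1] + list(range(100, 110)):
        cache.access(a)
hot_hits = cache.access(0) and cache.access(1)
```

Even though the ten-address streaming pattern thrashes the cold ways, the two hot-vertex lines survive it, which is the protection behavior described above.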
Perceptron Learning in Cache Management and Prediction Techniques
Hardware prefetching is an effective technique for hiding cache miss latencies in modern processor designs. An efficient prefetcher should identify complex memory access patterns during program execution. This ability enables the prefetcher to read a block ahead of its demand access, potentially preventing a cache miss. Accurately identifying the right blocks to prefetch is essential to achieving high performance from the prefetcher.
Prefetcher performance can be characterized by two main metrics that are generally at odds with one another: coverage, the fraction of baseline cache misses which the prefetcher brings into the cache; and accuracy, the fraction of prefetches which are ultimately used. An overly aggressive prefetcher may improve coverage at the cost of reduced accuracy. Thus, performance may be harmed by this over-aggressiveness because many resources are wasted, including cache capacity and bandwidth. An ideal prefetcher would have both high coverage and accuracy.
In this thesis, I propose Perceptron-based Prefetch Filtering (PPF) as a way to increase the coverage of the prefetches generated by a baseline prefetcher without negatively impacting accuracy. PPF enables more aggressive tuning of a given baseline prefetcher, leading to increased coverage by filtering out the growing numbers of inaccurate prefetches such an aggressive tuning implies. I also explore a range of features to use to train PPF’s perceptron layer to identify inaccurate prefetches. PPF improves performance on a memory-intensive subset of the SPEC CPU 2017 benchmarks by 3.78% for a single-core configuration, and by 11.4% for a 4-core configuration, compared to the baseline prefetcher alone.
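The filtering idea can be sketched in a few lines: hashed features of a candidate prefetch index small weight tables, the selected weights are summed, and the prefetch is issued only if the sum clears a threshold; the weights are trained up or down depending on whether the prefetch turned out to be useful. The feature choices, table sizes, and threshold below are illustrative assumptions, not PPF's tuned configuration.

```python
TABLE_SIZE = 256
THRESHOLD = 0
WEIGHT_MAX = 16  # clamp weights, as perceptron predictors typically do

class PerceptronFilter:
    def __init__(self, num_features=3):
        self.tables = [[0] * TABLE_SIZE for _ in range(num_features)]

    def _indices(self, pc, addr):
        # Illustrative features: PC, cache-block address, and their XOR.
        block = addr >> 6
        return [pc % TABLE_SIZE, block % TABLE_SIZE,
                (pc ^ block) % TABLE_SIZE]

    def predict(self, pc, addr):
        """True -> let the candidate prefetch through to memory."""
        s = sum(t[i] for t, i in zip(self.tables, self._indices(pc, addr)))
        return s >= THRESHOLD

    def train(self, pc, addr, useful):
        """Reinforce used prefetches, penalize wasted ones."""
        delta = 1 if useful else -1
        for t, i in zip(self.tables, self._indices(pc, addr)):
            t[i] = max(-WEIGHT_MAX, min(WEIGHT_MAX, t[i] + delta))

f = PerceptronFilter()
# Hypothetical feedback: prefetches from pc=0x40 are useful, pc=0x80 wasted.
for _ in range(8):
    f.train(0x40, 0x1000, useful=True)
    f.train(0x80, 0x2000, useful=False)
```

After training, the filter passes candidates resembling the useful stream and rejects those resembling the wasted one, which is how it lets the baseline prefetcher run aggressively without flooding the cache with inaccurate prefetches.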
TCOR: a tile cache with optimal replacement
Cache replacement policies are known to have an important impact on hit rates. The OPT replacement policy [27] has been formally proven optimal for minimizing misses. Because it needs to look far ahead at future memory accesses, it is usually reduced to a yardstick for measuring the efficacy of practical caches. In this paper, we bring OPT to life in architectures for mobile GPUs, for which energy efficiency is of great consequence, and we also mold other factors in the memory hierarchy to enhance its impact. The end results are a 13.8% decrease in memory hierarchy energy consumption and increased throughput in the Tiling Engine. We also observe a 5.5% decrease in total GPU energy and a 3.7% increase in frames per second (FPS).
This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00, the ICREA Academia program and the AGAUR grant 2020-FISDU-00287. We would also like to thank the anonymous reviewers for their valuable comments.
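For intuition, the OPT (Belady) policy the abstract refers to is simple to state offline, when the full future trace is known: on a miss with a full cache, evict the block whose next use lies farthest in the future (or never comes). This sketch is only the offline yardstick; the paper's contribution is making it practical in hardware for tile caches.

```python
def opt_misses(trace, capacity):
    """Count misses under Belady's optimal (OPT/MIN) replacement."""
    cache, misses = set(), 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(b):
                """Position of b's next access, or infinity if never reused."""
                for j in range(i + 1, len(trace)):
                    if trace[j] == b:
                        return j
                return float("inf")
            # Evict the cached block reused farthest in the future.
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses

# Tiny example trace: OPT takes 5 misses here, while LRU would take 6.
trace = ["A", "B", "C", "A", "B", "D", "A", "B", "C", "D"]
```

The linear scan inside `next_use` is what makes OPT impractical online: the policy needs the very future accesses the hardware has not seen yet.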
SemEHR: A general-purpose semantic search system to surface semantic data from clinical notes for tailored care, trial recruitment, and clinical research
OBJECTIVE: Unlocking the data contained within both structured and unstructured components of electronic health records (EHRs) has the potential to provide a step change in data available for secondary research use, generation of actionable medical insights, hospital management, and trial recruitment. To achieve this, we implemented SemEHR, an open source semantic search and analytics tool for EHRs. METHODS: SemEHR implements a generic information extraction (IE) and retrieval infrastructure by identifying contextualized mentions of a wide range of biomedical concepts within EHRs. Natural language processing annotations are further assembled at the patient level and extended with EHR-specific knowledge to generate a timeline for each patient. The semantic data are serviced via ontology-based search and analytics interfaces. RESULTS: SemEHR has been deployed at a number of UK hospitals, including the Clinical Record Interactive Search, an anonymized replica of the EHR of the UK South London and Maudsley National Health Service Foundation Trust, one of Europe's largest providers of mental health services. In 2 Clinical Record Interactive Search-based studies, SemEHR achieved 93% (hepatitis C) and 99% (HIV) F-measure results in identifying true positive patients. At King's College Hospital in London, as part of the CogStack program (github.com/cogstack), SemEHR is being used to recruit patients into the UK Department of Health 100 000 Genomes Project (genomicsengland.co.uk). The validation study suggests that the tool can validate previously recruited cases and is very fast at searching phenotypes; time for recruitment criteria checking was reduced from days to minutes. Validated on open intensive care EHR data, Medical Information Mart for Intensive Care III, the vital signs extracted by SemEHR can achieve around 97% accuracy. 
CONCLUSION: Results from the multiple case studies demonstrate SemEHR's efficiency: weeks or months of work can be done within hours or minutes in some cases. SemEHR provides a more comprehensive view of patients, bringing in more (and sometimes unexpected) insight compared with study-oriented bespoke IE systems. SemEHR is open source and available at https://github.com/CogStack/SemEHR
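The core IE idea described in the abstract, identifying contextualized concept mentions so that searches return true-positive patients rather than any textual match, can be sketched generically. This is an illustrative toy, not SemEHR's implementation: the concept dictionary, the identifiers, and the naive window-based negation check are all invented stand-ins for real ontology lookup and context modeling.

```python
import re

# Illustrative term -> concept-identifier dictionary (placeholder IDs).
CONCEPTS = {"hepatitis c": "CUI_HEPC", "hiv": "CUI_HIV"}
NEGATION_CUES = ("no ", "denies ", "negative for ")

def extract_mentions(note):
    """Find concept mentions and attach simple context (naive negation)."""
    mentions = []
    lowered = note.lower()
    for term, cui in CONCEPTS.items():
        for m in re.finditer(re.escape(term), lowered):
            # Look for a negation cue in a short window before the mention.
            window = lowered[max(0, m.start() - 20):m.start()]
            negated = any(cue in window for cue in NEGATION_CUES)
            mentions.append({"concept": cui, "term": term, "negated": negated})
    return mentions

note = "Patient is HIV positive. Negative for hepatitis C."
mentions = extract_mentions(note)
```

A downstream semantic search can then count only non-negated mentions per patient, which is what separates the 93-99% F-measure cohort queries above from plain keyword search.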
- …