2,254 research outputs found
Recommended from our members
Accurate modeling of core and memory locality for proxy generation targeting emerging applications and architectures
Designing optimal computer systems for improved performance and energy efficiency requires architects and designers to have a deep understanding of the end-user workloads. However, many end-users (e.g., large corporations, banks, defense organizations, etc.) are apprehensive to share their applications with designers due to the confidential nature of software code and data. In addition, emerging applications pose significant challenges to early design space exploration due to their long-running nature and the highly complex nature of their software stack that cannot be supported on many early performance models.
The above challenges can be overcome by using a proxy benchmark. A miniaturized proxy benchmark can be used as a substitute of the original workload to perform early computer performance evaluation. The process of generating a proxy benchmark consists of extracting a set of key statistics to summarize the behavior of end-user applications through profiling and using the collected statistics to synthesize a representative proxy benchmark. Using such proxy benchmarks can help designers to understand the behavior of end-user’s workloads in a reasonable time without the users having to disclose sensitive information about their workloads.
Prior proxy benchmarking schemes leverage micro-architecture independent metrics, derived from detailed simulation tools, to generate proxy benchmarks. However, many emerging workloads do not work reliably with many profiling or simulation tools, in which case it becomes impossible to apply prior proxy generation techniques to generate proxy benchmarks for such complex applications. Furthermore, these techniques model instruction pipeline-level locality in great detail, but abstract out memory locality modeling using simple stride-based models. This results in poor cloning accuracy especially for emerging applications, which have larger memory footprints and complex access patterns. A few detailed cache and memory locality modeling techniques have also been proposed in literature. However, these techniques either model limited locality metrics and suffer from poor cloning accuracy or are fairly accurate, but at the expense of significant metadata overhead. Finally, none of the prior proxy benchmarking techniques model both core and memory locality with high accuracy. As a result, they are not useful for studying system-level performance behavior. Keeping the above key limitations and shortcomings of prior work in mind, this dissertation presents several techniques that expand the frontiers of workload proxy benchmarking, thereby enabling computer designers to gain a better and faster understanding of end-user application behavior without compromising the privileged nature of software or data.
This dissertation first presents a core-level proxy benchmark generation methodology that leverages performance metrics derived from hardware performance counter measurements to create miniature proxy benchmarks targeting emerging big-data applications. The presented performance counter based characterization and associated extrapolation into generic parameters for proxy generation enables faster analysis (runs almost at native hardware speeds, unlike prior workload cloning proposals) and proxy generation for emerging applications that do not work with simulators or profiling tools. The generated proxy benchmarks are representative of the performance of the real-world big-data applications, including operating system and run-time effects, and yet converge to results quickly without needing any complex software stack support.
Next, to improve upon the accuracy and efficiency of prior memory proxy benchmarking techniques, this dissertation presents a novel memory locality modeling technique that leverages localized pattern detection to create miniature memory proxy benchmarks. The presented technique models memory reference locality by decomposing an application’s memory accesses into a set of independent streams (localized by using address region based localization property), tracking fine-grained patterns within the localized streams and, finally, chaining or interleaving accesses from different localized memory streams to create an ordered proxy memory access sequence. This dissertation further extends the workload cloning approach to Graphics Processing Units (GPUs) and presents a novel proxy generation methodology to model the inherent memory access locality of GPU applications, while also accounting for the GPU’s parallel execution model. The generated memory proxy benchmarks help to enable fast and efficient design space exploration of futuristic memory hierarchies.
Finally, this dissertation presents a novel technique to integrate accurate core and memory locality models to create system-level proxy benchmarks targeting emerging applications. This is a new capability that can facilitate efficient overall system (core, cache and memory subsystem) design-space exploration. This dissertation further presents a novel methodology that exploits the synthetic benchmark generation framework to create hypothetical workloads with performance behavior that does not currently exist. Such proxies can be generated to cover anticipated code trends and can represent futuristic workloads before the workloads even exist.Electrical and Computer Engineerin
Bio-inspired call-stack reconstruction for performance analysis
The correlation of performance bottlenecks and their associated source code has become a cornerstone of performance analysis. It allows understanding why the efficiency of an application falls behind the computer's peak performance and enabling optimizations on the code ultimately. To this end, performance analysis tools collect the processor call-stack and then combine this information with measurements to allow the analyst comprehend the application behavior. Some tools modify the call-stack during run-time to diminish the collection expense but at the cost of resulting in non-portable solutions. In this paper, we present a novel portable approach to associate performance issues with their source code counterpart. To address it, we capture a reduced segment of the call-stack (up to three levels) and then process the segments using an algorithm inspired by multi-sequence alignment techniques. The results of our approach are easily mapped to detailed performance views, enabling the analyst to unveil the application behavior and its corresponding region of code. To demonstrate the usefulness of our approach, we have applied the algorithm to several first-time seen in-production applications to describe them finely, and optimize them by using tiny modifications based on the analyses.We thankfully acknowledge Mathis Bode for giving us access to the Arts CF binaries, and Miguel Castrillo and Kim Serradell for their valuable insight regarding Nemo. We would like to thank Forschungszentrum JĂĽlich for the computation time on their Blue Gene/Q system. This research has been partially funded by the CICYT under contracts No. TIN2012-34557 and TIN2015-65316-P.Peer ReviewedPostprint (author's final draft
Big Data Caching for Networking: Moving from Cloud to Edge
In order to cope with the relentless data tsunami in wireless networks,
current approaches such as acquiring new spectrum, deploying more base stations
(BSs) and increasing nodes in mobile packet core networks are becoming
ineffective in terms of scalability, cost and flexibility. In this regard,
context-aware G networks with edge/cloud computing and exploitation of
\emph{big data} analytics can yield significant gains to mobile operators. In
this article, proactive content caching in G wireless networks is
investigated in which a big data-enabled architecture is proposed. In this
practical architecture, vast amount of data is harnessed for content popularity
estimation and strategic contents are cached at the BSs to achieve higher
users' satisfaction and backhaul offloading. To validate the proposed solution,
we consider a real-world case study where several hours of mobile data traffic
is collected from a major telecom operator in Turkey and a big data-enabled
analysis is carried out leveraging tools from machine learning. Based on the
available information and storage capacity, numerical studies show that several
gains are achieved both in terms of users' satisfaction and backhaul
offloading. For example, in the case of BSs with of content ratings
and Gbyte of storage size ( of total library size), proactive
caching yields of users' satisfaction and offloads of the
backhaul.Comment: accepted for publication in IEEE Communications Magazine, Special
Issue on Communications, Caching, and Computing for Content-Centric Mobile
Network
PRL: standardizing performance monitoring library for high-integrity real-time systems
The use of complex processors is becoming ubiquitous in High-Integrity Systems (HIS). To deal with processor’s increased complexity, Performance Monitoring Counters (PMCs) are increasingly used to reason on software behavior and provide the necessary evidence to support software certification. However, the use of PMCs in HIS is relatively recent and hence far from being standardized. As a result, software engineers are forced to resort to highly-customized, low-level programming of platform-specific PMC control registers, which is both error prone and time consuming. To cover this gap, we propose building on the PAPI library, a standardized performance monitoring solution in the mainstream domain, and develop a PMC Reading Library (PRL) for configuring and collecting traceable events while capturing HIS specific requirements and peculiarities. We instantiate PRL in a reference automotive configuration to show that PRL meets key HIS requirements: negligible footprint, limited and predictable overhead, and accuracy collecting hardware events by filtering out the impact of interrupts and context switches.This work has been supported by the Spanish Ministry of Science and Innovation under grant PID2019-107255GBC21/AEI/10.13039/501100011033, the European Unions Horizon 2020 Framework Programme under grant agreement No. 878752 (MASTECS), and the European Research Council (ERC) grant agreement No. 772773 (SuPerCom).Peer ReviewedPostprint (author's final draft
- …