295 research outputs found

Performance analysis and optimization of the Java memory system

Author: Lebsack Carl Stephen
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2008

Digital Repository @ Iowa State University (ISU)

Dynamic data shapers optimize performance in Dynamic Binary Optimization (DBO) environment

Author: Venkatesan Varun Kumhar
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2015

Processor hardware has been architected on the assumption that most data access patterns are linearly spatial in nature. However, most applications involve algorithms designed for optimal efficiency, which results in non-spatial, multi-dimensional data access. Moreover, this data view, or access pattern, changes dynamically across program phases. The result is a mismatch between the processor hardware's view of data and the algorithmic view of data, leading to significant memory access bottlenecks. This variation in data views is especially pronounced in applications involving large datasets, leading to significantly increased latency and user response times. Previous attempts to tackle this problem were primarily targeted at execution time optimization. We present a dynamic technique, piggybacked on classical dynamic binary optimization (DBO), that shapes the data view differently for each program phase, reducing both program execution time and access energy. Our implementation rearranges non-adjacent data into a contiguous dataview. It uses wrappers to replace irregular data access patterns with a spatially local dataview. HDTrans, a runtime dynamic binary optimization framework, has been used to perform runtime instrumentation and dynamic data optimization to achieve this goal. This scheme not only reduces program execution time, but also lowers energy use. Some of the commonly used benchmarks from the SPEC 2006 suite were profiled to identify irregular data accesses in procedures that contributed heavily to the overall execution time. Wrappers built to replace these accesses with spatially adjacent data led to a significant improvement in total execution time. On average, a 20% reduction in time was achieved along with a 5% reduction in energy.

Digital Repository @ Iowa State University (ISU)
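
The dataview idea described in the abstract above (copying non-adjacent data into a contiguous buffer behind a wrapper, then writing it back) can be illustrated with a short Python sketch. This is not the authors' HDTrans-based implementation, which works on binaries at run time; the row-major matrix layout and the names `gather_column` and `scatter_column` are assumptions made purely for illustration.

```python
from array import array

def gather_column(matrix, n_cols, col):
    """Copy one column of a row-major flat array into a new contiguous array.

    Hypothetical wrapper: the strided elements become adjacent in the copy.
    """
    return matrix[col::n_cols]

def scatter_column(matrix, n_cols, col, view):
    """Write the contiguous view back to its original strided locations."""
    matrix[col::n_cols] = view

# A 4x3 matrix stored row by row: column 1 lives at indices 1, 4, 7, 10.
n_rows, n_cols = 4, 3
matrix = array("d", range(n_rows * n_cols))

# The "program phase" works on the contiguous view instead of the strided data.
view = gather_column(matrix, n_cols, 1)
for i in range(len(view)):
    view[i] *= 2.0

# Results are copied back once the phase ends.
scatter_column(matrix, n_cols, 1, view)
```

The sketch only shows the gather and scatter steps; whether the copy pays off depends on how often the phase reuses the reshaped data, which is what the profiling step in the abstract is meant to establish.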

WCET-Driven Dynamic Data Scratchpad Management With Compiler-Directed Prefetching

Author: Pellizzoni Rodolfo
Soliman Muhammad Refaat
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 29th Euromicro Conference on Real-Time Systems (ECRTS 2017)
Publication date: 01/01/2017

In recent years, the real-time community has produced a variety of approaches targeted at managing on-chip memory (scratchpads and caches) in a predictable way. However, to obtain safe WCET bounds, such techniques generally assume that the processor is stalled while waiting to reload the content of the on-chip memory; hence, they are less effective at hiding main memory latency than speculation-based techniques, such as hardware prefetching, that are widely used in general-purpose systems. In this work, we introduce a novel compiler-directed prefetching scheme for scratchpad memory that effectively hides the latency of main memory accesses by overlapping data transfers with program execution. We implement and test an automated program compilation and optimization flow within the LLVM framework, and we show how to obtain improved WCET bounds through static analysis.

Dagstuhl Research Online Publication Server

Cache-Integrated Network Interfaces: Flexible On-Chip Communication and Synchronization for Large-Scale CMPs

Author: D. Lenoski
D. Wentzlaff
Dimitrios S. Nikolopoulos
H. Shan
I. Schoinas
J. Leverich
J.A. Kahle
J.M. Mellor-Crummey
K. Gharachorloo
M. Wen
M.M.K. Martin
Manolis Katevenis
Michail Zampetakis
P.S. Magnusson
S.L. Scott
S.P. Amarasinghe
S.W. Keckler
Stamatis Kavadias
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2011

Queen's University Belfast Research Portal

Crossref

Springer - Publisher Connector

A memory-centric approach to enable timing-predictability within embedded many-core accelerators

Author: Bertogna Marko
Burgio Paolo
Marongiu Andrea
Valente Paolo
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2015

There is increasing interest among real-time systems architects in multi- and many-core accelerated platforms. The main obstacle to the adoption of such devices in industrial settings is the difficulty of tightly estimating the multiple interferences that may arise among the parallel components of the system. This concerns, in particular, concurrent accesses to shared memory and communication resources. Existing worst-case execution time analyses are extremely pessimistic, especially when adopted for systems composed of hundreds to thousands of cores. This significantly limits the potential for the adoption of these platforms in real-time systems. In this paper, we study how the predictable execution model (PREM), a memory-aware approach to enable timing-predictability in real-time systems, can be successfully adopted on multi- and many-core heterogeneous platforms. Using a state-of-the-art multi-core platform as a testbed, we validate that it is possible to obtain an order-of-magnitude improvement in the WCET bounds of parallel applications if data movements are adequately orchestrated in accordance with PREM. We identify which system parameters most affect the tremendous performance opportunities offered by this approach, both on average and in the worst case, taking a first step towards predictable many-core systems.

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Memory controller for vector processor

Author: Ayguadé Parra Eduard
Cristal Kestelman Adrián
Hussain Tassadaq
Palomar Oscar
Unsal Osman Sabri
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2018

To manage power and memory wall effects, the HPC industry supports FPGA reconfigurable accelerators and vector processing cores for data-intensive scientific applications. FPGA-based vector accelerators are used to increase the performance of high-performance application kernels. Adding more vector lanes does not improve performance if the processor/memory performance gap dominates. In addition, if on/off-chip communication time becomes more critical than computation time, performance degrades. The system incurs multiple delays due to the application's irregular data arrangement and complex scheduling scheme. Therefore, just like generic scalar processors, all classes of vector machines, from vector supercomputers to vector microprocessors, are required to have data management and access units that improve the on/off-chip bandwidth and hide main memory latency. In this work, we propose an Advanced Programmable Vector Memory Controller (PVMC), which boosts noncontiguous vector data accesses by integrating descriptors of memory patterns, a specialized on-chip memory, a memory manager in hardware, and multiple DRAM controllers. We implemented and validated the proposed system on an Altera DE4 FPGA board. The PVMC is also integrated with an ARM Cortex-A9 processor on the Xilinx Zynq All-Programmable System on Chip architecture. We compare the performance of a system with vector and scalar processors without PVMC. When compared with a baseline vector system, the results show that the PVMC system transfers data sets 1.40x to 2.12x faster, achieves speedups of 2.01x to 4.53x for 10 applications, and consumes 2.56 to 4.04 times less energy.

Peer reviewed. Postprint (author's final draft).

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Digital.CSIC

LEAP Scratchpads: Automatic Memory and Cache Management for Reconfigurable Logic [Extended Version]

Author: Adler Michael
Emer Joel
Fleming Kermin E.
Parashar Angshuman
Pellauer Michael
Publication date: 23/11/2010

CORRECTION: The authors for entry [4] in the references should have been "E. S. Chung, J. C. Hoe, and K. Mai".

Developers accelerating applications on FPGAs or other reconfigurable logic have nothing but raw memory devices in their standard toolkits. Each project typically includes tedious development of single-use memory management. Software developers expect a programming environment to include automatic memory management. Virtual memory provides the illusion of very large arrays and processor caches reduce access latency without explicit programmer instructions. LEAP scratchpads for reconfigurable logic dynamically allocate and manage multiple, independent, memory arrays in a large backing store. Scratchpad accesses are cached automatically in multiple levels, ranging from shared on-board, RAM-based, set-associative caches to private caches stored in FPGA RAM blocks. In the LEAP framework, scratchpads share the same interface as on-die RAM blocks and are plug-in replacements. Additional libraries support heap management within a storage set. Like software developers, accelerator authors using scratchpads may focus more on core algorithms and less on memory management. Two uses of FPGA scratchpads are analyzed: buffer management in an H.264 decoder and memory management within a processor microarchitecture timing model.

DSpace@MIT

SPM management using Markov chain based data access prediction

Author: Kandemir M.
Ozturk O.
Srikantaiah S.
Yemliha T.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2008

Leveraging the power of scratchpad memories (SPMs) available in most embedded systems today is crucial to extract maximum performance from application programs. While regular accesses like scalar values and array expressions with affine subscript functions have been tractable for compiler analysis (to be prefetched into SPM), irregular accesses like pointer accesses and indexed array accesses have not been easily amenable to compiler analysis. This paper presents an SPM management technique using Markov chain based data access prediction for such irregular accesses. Our approach takes advantage of inherent but hidden reuse in data accesses made by irregular references. We have implemented our proposed approach using an optimizing compiler. In this paper, we also present a thorough comparison of our different dynamic prediction schemes with other SPM management schemes. SPM management using our approaches produces 12.7% to 28.5% improvements in performance across a range of applications with both regular and irregular access patterns, with an average improvement of 20.8%.

Crossref

Bilkent University Institutional Repository
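
To give a feel for the Markov chain based prediction mentioned in the entry above, here is a minimal first-order predictor in Python. It is a sketch under stated assumptions, not the paper's compiler-integrated scheme: the class name `MarkovPredictor`, the use of integer block identifiers, and the rule of predicting the most frequent successor are all illustrative choices.

```python
from collections import Counter, defaultdict

class MarkovPredictor:
    """First-order Markov model over data block identifiers (illustrative)."""

    def __init__(self):
        # transitions[a][b] counts how often block b was accessed right after a.
        self.transitions = defaultdict(Counter)
        self.last = None

    def observe(self, block):
        """Record one access and update the transition counts."""
        if self.last is not None:
            self.transitions[self.last][block] += 1
        self.last = block

    def predict(self, block):
        """Return the most frequently seen successor of block, or None."""
        successors = self.transitions.get(block)
        if not successors:
            return None
        return successors.most_common(1)[0][0]

# Hypothetical access trace with hidden reuse: block 3 is usually followed by 7.
trace = [3, 7, 3, 7, 3, 9, 3, 7]
predictor = MarkovPredictor()
for b in trace:
    predictor.observe(b)

next_block = predictor.predict(3)  # candidate to bring into the scratchpad next
```

In an SPM setting, the predicted block would be the candidate to copy into the scratchpad ahead of its use; how the prediction is turned into transfers is outside the scope of this sketch.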