1,032 research outputs found

    A hardware-assisted translation cache for dynamic binary translation in embedded systems

    Get PDF
    Approaches to Dynamic Binary Translation (DBT) on resource-constrained embedded systems are not straightforward, which has led to several improvement and acceleration proposals that rely on dedicated hardware. Offloading software to hardware is a common acceleration technique used when software-only approaches do not meet the performance requirements, making it well suited to DBT. This article applies hardware offloading to address some limitations of an in-house DBT engine, DBTOR, regarding its Translation Cache (TCache) management mechanism. The suggested approaches are non-intrusive to the target architecture, which fits the commercial-off-the-shelf (COTS)-driven deployment of DBT on resource-constrained embedded devices. This work proposes a TCache management hardware module that outperforms the linked-list and hash-table software-only approaches, resulting in performance improvements of 25% and 26%, respectively. This work has been supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT - Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2013.
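
    For context, a common software-only TCache organization that the article compares against is a hash table indexed by the guest address of each translated block. The following is a minimal sketch of such a lookup, assuming an illustrative bucket count and entry layout; it is not DBTOR's actual data structure.

```c
/* Minimal sketch of a software-only, hash-table-based Translation Cache
 * (TCache) lookup, one of the baselines a hardware module would replace.
 * Names and sizes are illustrative, not taken from DBTOR. */
#include <stdint.h>
#include <stddef.h>

#define TCACHE_BUCKETS 256u   /* assumed power-of-two table size */

typedef struct tcache_entry {
    uint32_t             guest_pc;   /* source (guest) address of the block */
    void                *host_code;  /* pointer to the translated host code  */
    struct tcache_entry *next;       /* chaining for bucket collisions       */
} tcache_entry_t;

static tcache_entry_t *tcache[TCACHE_BUCKETS];

/* Hash the guest PC into a bucket index. */
static inline uint32_t tcache_hash(uint32_t guest_pc)
{
    return (guest_pc >> 2) & (TCACHE_BUCKETS - 1u);
}

/* Return the translated code for guest_pc, or NULL on a TCache miss,
 * in which case the DBT engine would translate the block and insert it. */
void *tcache_lookup(uint32_t guest_pc)
{
    for (tcache_entry_t *e = tcache[tcache_hash(guest_pc)]; e != NULL; e = e->next) {
        if (e->guest_pc == guest_pc)
            return e->host_code;
    }
    return NULL;
}
```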

    Proactive Aging Mitigation in CGRAs through Utilization-Aware Allocation

    Full text link
    Resource balancing has been effectively used to mitigate the long-term aging effects of Negative Bias Temperature Instability (NBTI) in multi-core and Graphics Processing Unit (GPU) architectures. In this work, we investigate this strategy in Coarse-Grained Reconfigurable Arrays (CGRAs) with a novel application-to-CGRA allocation approach. By introducing important extensions to the reconfiguration logic and the datapath, we enable the dynamic movement of configurations throughout the fabric and allow overutilized Functional Units (FUs) to recover from stress-induced NBTI aging. Implementing the approach in a resource-constrained state-of-the-art CGRA reveals 2.2× lifetime improvement with negligible performance overheads and less than 10% increase in area.
    Comment: Please cite this as: M. Brandalero, B. N. Lignati, A. Carlos Schneider Beck, M. Shafique and M. Hübner, "Proactive Aging Mitigation in CGRAs through Utilization-Aware Allocation," 2020 57th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 2020, pp. 1-6, doi: 10.1109/DAC18072.2020.921858
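
    The abstract describes balancing utilization across FUs so that no unit ages disproportionately. A minimal sketch of such an allocation policy is shown below, with a synthetic fabric model (FU count, active-cycle counters) that is an illustrative assumption, not the paper's reconfiguration mechanism.

```c
/* Illustrative utilization-aware allocation policy: when mapping an operation
 * onto a CGRA, prefer the compatible Functional Unit (FU) with the lowest
 * accumulated active time, so stress (and thus NBTI aging) stays balanced.
 * This is a generic sketch, not the paper's actual allocator. */
#include <stdint.h>

#define NUM_FUS 16               /* assumed fabric size */

typedef struct {
    uint64_t active_cycles;      /* accumulated utilization, a proxy for NBTI stress */
    uint8_t  supports_op;        /* 1 if this FU can execute the requested operation */
} fu_state_t;

/* Pick the compatible FU with the lowest accumulated utilization.
 * Returns the FU index, or -1 if no FU supports the operation. */
int allocate_fu(const fu_state_t fus[NUM_FUS])
{
    int best = -1;
    for (int i = 0; i < NUM_FUS; i++) {
        if (!fus[i].supports_op)
            continue;
        if (best < 0 || fus[i].active_cycles < fus[best].active_cycles)
            best = i;
    }
    return best;
}
```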

    Energy proportional computing with OpenCL on a FPGA-based overlay architecture

    Get PDF

    NASA Automated Rendezvous and Capture Review. Executive summary

    Get PDF
    In support of the Cargo Transfer Vehicle (CTV) Definition Studies in FY-92, the Advanced Program Development division of the Office of Space Flight at NASA Headquarters conducted an evaluation and review of the United States' capabilities and state of the art in Automated Rendezvous and Capture (AR&C). This review was held in Williamsburg, Virginia, on 19-21 Nov. 1991 and included over 120 attendees from U.S. government organizations, industry, and universities. One hundred abstracts were submitted to the organizing committee for consideration, and forty-two were selected for presentation. The review was structured around five technical sessions, and the forty-two papers addressed topics in the five categories below: (1) hardware systems and components; (2) software systems; (3) integrated systems; (4) operations; and (5) supporting infrastructure.

    Efficient memory management for hardware accelerated Java Virtual Machines

    Get PDF
    Application-specific hardware accelerators can significantly improve a system's performance. In a Java-based system, we then have to consider a hybrid architecture that consists of a Java Virtual Machine running on a general-purpose processor connected to the hardware accelerator. In such a hybrid architecture, data communication between the accelerator and the general-purpose processor can incur a significant cost, which may even negate the original performance improvement of adding the accelerator. A careful layout of the data in the memory structure is therefore of major importance to maintain the acceleration benefits. This article addresses the reduction of the communication cost in a distributed shared memory consisting of the main memory of the processor and the accelerator's local memory, which are unified in the Java heap. Since memory access times are highly nonuniform, a suitable allocation of objects in either the main memory or the accelerator's local memory can significantly reduce the communication cost. We propose several techniques for finding the optimal location for each Java object's data, either statically through profiling or dynamically at runtime, and we show how communication cost can be reduced by up to 86% for the SPECjvm and DaCapo benchmarks. We also show that the best strategy is application-dependent and depends on the relative cost of remote versus local accesses: for a relative cost higher than 10, a self-learning dynamic approach often results in the best performance.
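
    A minimal sketch of the static, profile-driven variant of such a placement decision is shown below; the cost model, access counters, and function names are illustrative assumptions, not the article's implementation.

```c
/* Sketch of a profile-driven placement decision for one Java object's data
 * in a heap spanning the processor's main memory and the accelerator's local
 * memory: given how often the CPU and the accelerator touch the object, and
 * the cost of a remote access relative to a local one, place the object where
 * the total access cost is lower. Names and the cost model are illustrative. */
#include <stdint.h>

typedef enum { PLACE_MAIN_MEMORY, PLACE_ACCEL_LOCAL } placement_t;

/* cpu_accesses / accel_accesses: profiled access counts for one object.
 * remote_cost: cost of a remote access relative to a local one (e.g. 10.0). */
placement_t choose_placement(uint64_t cpu_accesses,
                             uint64_t accel_accesses,
                             double   remote_cost)
{
    /* Cost if the object lives in main memory: accelerator accesses are remote. */
    double cost_main  = (double)cpu_accesses + remote_cost * (double)accel_accesses;
    /* Cost if it lives in the accelerator's local memory: CPU accesses are remote. */
    double cost_local = remote_cost * (double)cpu_accesses + (double)accel_accesses;

    return (cost_local < cost_main) ? PLACE_ACCEL_LOCAL : PLACE_MAIN_MEMORY;
}
```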

    On the Feasibility and Limitations of Just-in-Time Instruction Set Extension for FPGA-Based Reconfigurable Processors

    Get PDF
    Reconfigurable instruction set processors provide the possibility of tailoring the instruction set of a CPU to a particular application. While this customization process could be performed at runtime in order to adapt the CPU to the currently executed workload, this use case has hardly been investigated. In this paper, we study the feasibility of moving the customization process to runtime and evaluate the relation between the expected speedups and the associated overheads. To this end, we present a tool flow that is tailored to the requirements of this just-in-time ASIP specialization scenario. We evaluate our methods by targeting our previously introduced Woolcano reconfigurable ASIP architecture for a set of applications from the SPEC2006, SPEC2000, MiBench, and SciMark2 benchmark suites. Our results show that just-in-time ASIP specialization is promising for embedded computing applications, where average speedups of 5x can be achieved by spending 50 minutes on custom instruction identification and hardware generation. These overheads are compensated if the applications execute for more than 2 hours. For the scientific computing benchmarks, the achievable speedup is only 1.2x, which requires significantly longer execution times, on the order of days, to amortize the overheads.
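
    The amortization argument can be made concrete with a back-of-the-envelope model: specialization pays off once the run time saved by the custom instructions exceeds the one-off identification and generation overhead. The sketch below uses an Amdahl-style coverage factor and placeholder numbers, not measurements from the paper.

```c
/* Break-even estimate for just-in-time ASIP specialization. The whole-
 * application speedup is modelled Amdahl-style from the fraction of run time
 * that the custom instructions accelerate; the 0.6 coverage and 8.0 kernel
 * speedup below are placeholder assumptions, only the 50-minute overhead
 * figure comes from the text above. */
#include <stdio.h>

/* Whole-application speedup when a fraction `coverage` of the run time is
 * accelerated by `kernel_speedup`. */
static double amdahl_speedup(double coverage, double kernel_speedup)
{
    return 1.0 / ((1.0 - coverage) + coverage / kernel_speedup);
}

/* Minimum original run time (same unit as `overhead`) after which the
 * overhead is amortized: saved time T * (1 - 1/S) must exceed the overhead. */
static double break_even_runtime(double overhead, double app_speedup)
{
    return overhead * app_speedup / (app_speedup - 1.0);
}

int main(void)
{
    double s = amdahl_speedup(0.6, 8.0);     /* assumed coverage and kernel speedup */
    double t = break_even_runtime(50.0, s);  /* 50-minute specialization overhead   */
    printf("app speedup %.2fx, break-even after %.0f minutes\n", s, t);
    return 0;
}
```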