1,032 research outputs found
A hardware-assisted translation cache for dynamic binary translation in embedded systems
Dynamic Binary Translation (DBT) on resource-constrained embedded systems is not straightforward, which has led to several improvement and acceleration proposals that rely on dedicated hardware. Software-to-hardware offloading is a common acceleration technique when software-only approaches do not meet the performance requirements, which makes it a suitable candidate for DBT. This article uses hardware offloading to address some limitations of an in-house DBT engine, the DBTOR, regarding its Translation Cache (TCache) management mechanism. The suggested approaches are non-intrusive to the target architecture, which suits the commercial-off-the-shelf (COTS)-driven deployment of DBT on resource-constrained embedded devices. This work proposes a TCache management hardware module that outperforms the linked-list and hash-table software-only approaches, resulting in performance improvements of 25% and 26%, respectively. This work has been supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT - Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2013
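As a rough illustration of the two software-only baselines named in the abstract, the sketch below contrasts a TCache that is searched entry by entry (standing in for a linked list) with one backed by a hash table. The class names, the `comparisons` counter, and the string placeholders for translated blocks are assumptions made for this example; they are not taken from the DBTOR engine.

```python
class LinkedListTCache:
    """Hypothetical TCache searched entry by entry, like a linked list."""

    def __init__(self):
        self._entries = []      # (source_pc, translated_block) pairs
        self.comparisons = 0    # number of entries inspected so far

    def insert(self, source_pc, translated_block):
        self._entries.append((source_pc, translated_block))

    def lookup(self, source_pc):
        # Walk the entries in insertion order until the source PC matches.
        for pc, block in self._entries:
            self.comparisons += 1
            if pc == source_pc:
                return block
        return None             # miss: the block must be translated first


class HashTableTCache:
    """Hypothetical TCache indexed by source PC through a hash table."""

    def __init__(self):
        self._table = {}

    def insert(self, source_pc, translated_block):
        self._table[source_pc] = translated_block

    def lookup(self, source_pc):
        # Average-case constant-time lookup; None signals a miss.
        return self._table.get(source_pc)


if __name__ == "__main__":
    linked, hashed = LinkedListTCache(), HashTableTCache()
    for pc in range(0, 4000, 4):            # 1000 made-up source addresses
        linked.insert(pc, f"block@{pc}")
        hashed.insert(pc, f"block@{pc}")
    print(linked.lookup(3996), hashed.lookup(3996), linked.comparisons)
```

The list-based lookup cost grows with the number of cached blocks, whereas the hash-table lookup does not on average. The abstract's hardware module targets this management step, but its internal design is not described here.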
Proactive Aging Mitigation in CGRAs through Utilization-Aware Allocation
Resource balancing has been effectively used to mitigate the long-term aging
effects of Negative Bias Temperature Instability (NBTI) in multi-core and
Graphics Processing Unit (GPU) architectures. In this work, we investigate this
strategy in Coarse-Grained Reconfigurable Arrays (CGRAs) with a novel
application-to-CGRA allocation approach. By introducing important extensions to
the reconfiguration logic and the datapath, we enable the dynamic movement of
configurations throughout the fabric and allow overutilized Functional Units
(FUs) to recover from stress-induced NBTI aging. Implementing the approach in a
resource-constrained state-of-the-art CGRA reveals lifetime
improvement with negligible performance overheads and less than increase
in area. Comment: Please cite this as: M. Brandalero, B. N. Lignati, A. Carlos
Schneider Beck, M. Shafique and M. Hübner, "Proactive Aging Mitigation in
CGRAs through Utilization-Aware Allocation," 2020 57th ACM/IEEE Design
Automation Conference (DAC), San Francisco, CA, USA, 2020, pp. 1-6, doi:
10.1109/DAC18072.2020.921858
NASA Automated Rendezvous and Capture Review. Executive summary
In support of the Cargo Transfer Vehicle (CTV) Definition Studies in FY-92, the Advanced Program Development division of the Office of Space Flight at NASA Headquarters conducted an evaluation and review of the United States capabilities and state-of-the-art in Automated Rendezvous and Capture (AR&C). This review was held in Williamsburg, Virginia on 19-21 Nov. 1991 and included over 120 attendees from U.S. government organizations, industries, and universities. One hundred abstracts were submitted to the organizing committee for consideration. Forty-two were selected for presentation. The review was structured to include five technical sessions. Forty-two papers addressed topics in the five categories below: (1) hardware systems and components; (2) software systems; (3) integrated systems; (4) operations; and (5) supporting infrastructure
Efficient memory management for hardware accelerated Java Virtual Machines
Application-specific hardware accelerators can significantly improve a system's performance. In a Java-based system, we then have to consider a hybrid architecture that consists of a Java Virtual Machine running on a general-purpose processor connected to the hardware accelerator. In such a hybrid architecture, data communication between the accelerator and the general-purpose processor can incur a significant cost, which may even annihilate the original performance improvement of adding the accelerator. A careful layout of the data in the memory structure is therefore of major importance to maintain the acceleration performance benefits.
This article addresses the reduction of the communication cost in a distributed shared memory consisting of the main memory of the processor and the accelerator's local memory, which are unified in the Java heap. Since memory access times are highly nonuniform, a suitable allocation of objects in either main memory or the accelerator's local memory can significantly reduce the communication cost. We propose several techniques for finding the optimal location for each Java object's data, either statically through profiling or dynamically at runtime. We show how we can reduce communication cost by up to 86% for the SPECjvm and DaCapo benchmarks. We also show that the best strategy is application dependent and also depends on the relative cost of remote versus local accesses. For a relative cost higher than 10, a self-learning dynamic approach often results in the best performance
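To make the placement trade-off concrete, here is a minimal sketch of a static decision rule: given access counts for one object from each side and a relative cost for remote accesses, it picks the memory with the lower total cost. The function names, the two-location model, and the access counts are assumptions for illustration only; the article's actual profiling-based and self-learning techniques are not reproduced here.

```python
def access_cost(local_accesses, remote_accesses, remote_cost, local_cost=1.0):
    """Total cost of accessing one object, given who reaches it locally."""
    return local_accesses * local_cost + remote_accesses * remote_cost


def place_object(cpu_accesses, accel_accesses, remote_cost, local_cost=1.0):
    """Return "main" or "accelerator", whichever gives the lower access cost.

    remote_cost is the cost of one remote access relative to local_cost
    (the abstract discusses relative costs above and below 10).
    """
    # Object in main memory: CPU accesses are local, accelerator's are remote.
    cost_in_main = access_cost(cpu_accesses, accel_accesses,
                               remote_cost, local_cost)
    # Object in accelerator memory: the roles are swapped.
    cost_in_accel = access_cost(accel_accesses, cpu_accesses,
                                remote_cost, local_cost)
    return "main" if cost_in_main <= cost_in_accel else "accelerator"


if __name__ == "__main__":
    # Hypothetical profile: (CPU accesses, accelerator accesses) per object.
    profile = {"buffer": (5, 50), "config": (100, 10)}
    for name, (cpu, accel) in profile.items():
        print(name, place_object(cpu, accel, remote_cost=10))
```

A dynamic variant would update the access counts at runtime and migrate objects when the preferred location changes; that migration cost is ignored in this sketch.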
On the Feasibility and Limitations of Just-in-Time Instruction Set Extension for FPGA-Based Reconfigurable Processors
Reconfigurable instruction set processors make it possible to tailor the instruction set of a CPU to a particular application. While this customization process could be performed at runtime in order to adapt the CPU to the currently executed workload, this use case has hardly been investigated. In this paper, we study the feasibility of moving the customization process to runtime and evaluate the relation between the expected speedups and the associated overheads. To this end, we present a tool flow that is tailored to the requirements of this just-in-time ASIP specialization scenario. We evaluate our methods by targeting our previously introduced Woolcano reconfigurable ASIP architecture for a set of applications from the SPEC2006, SPEC2000, MiBench, and SciMark2 benchmark suites. Our results show that just-in-time ASIP specialization is promising for embedded computing applications, where average speedups of 5x can be achieved by spending 50 minutes on custom instruction identification and hardware generation. These overheads will be compensated if the applications execute for more than 2 hours. For the scientific computing benchmarks, the achievable speedup is only 1.2x, which requires significant execution times on the order of days to amortize the overheads
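The amortization argument can be sketched with a simple break-even formula. Assume, as a simplification not made in the abstract, that the whole original runtime T is sped up uniformly by a factor S after a one-time overhead O. Specialization then pays off once T > T/S + O, that is, T > O / (1 - 1/S). The function name and this uniform-speedup model are assumptions; the model is not expected to reproduce the break-even times reported in the abstract, which come from per-benchmark measurements.

```python
def break_even_runtime(overhead, speedup):
    """Original runtime beyond which a one-time overhead is amortized.

    Assumes the entire run is accelerated uniformly by `speedup`.
    The result is in the same time unit as `overhead`.
    """
    if speedup <= 1.0:
        raise ValueError("no break-even point without a speedup above 1x")
    # Solve T = T / speedup + overhead for T.
    return overhead / (1.0 - 1.0 / speedup)


if __name__ == "__main__":
    # Overhead in minutes; speedups taken from the averages in the abstract.
    for s in (5.0, 1.2):
        print(f"speedup {s}x: {break_even_runtime(50.0, s):.1f} min")
```

The formula shows why a small speedup is so costly to amortize: as S approaches 1, the denominator approaches 0 and the required runtime grows without bound.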