Cross-Layer Customization for Low Power and High Performance Embedded Multi-Core Processors
Due to physical limitations and design difficulties, computer processor architecture has shifted to multi-core and even many-core approaches in recent years. Such architectures offer the potential for sustainable performance scaling into future peta-scale and exa-scale computing platforms at an affordable power budget, design complexity, and verification effort. To date, multi-core processors have been replacing uni-core processors in almost every market segment, including embedded systems, general-purpose desktops and laptops, and supercomputers.
However, many issues with multi-core processor architectures still need to be addressed before their potential can be fully realized. Researchers in both academia and industry are still seeking ways to use these processors efficiently and effectively. The issues span hardware architecture trade-offs, system software services, run-time management, and user application design, all of which demand further research.
Given the architectural particularities of multi-core computers, this work proposes a Cross-Layer Customization framework that combines application-specific information and system platform features, along with the necessary operating system support, to achieve exceptional power and performance efficiency on targeted multi-core platforms. Several topics are covered, each with specific optimization goals: the snoop cache coherence protocol, inter-core communication for producer-consumer applications, synchronization mechanisms, and off-chip memory bandwidth limitations.
Benchmark program execution with conventional mechanisms is analyzed to reveal the overheads in power and performance. Specific customizations are proposed to eliminate these overheads with support from hardware, system software, the compiler, and user applications. Experiments show significant improvement in system performance and power efficiency.
Design-Space Exploration of Stream Programs through Semantic-Preserving Transformations
Stream languages explicitly describe fork-join parallelism and pipelines, offering a powerful programming model for many-core Multi-Processor Systems on Chip (MPSoC). In an embedded resource-constrained system, adapting stream programs to fit memory requirements is particularly important. In this paper we present a design-space exploration technique that reduces the minimal memory required to run stream programs on MPSoC; this makes it possible to target memory-constrained systems and, in some cases, to obtain better performance. Using a set of semantics-preserving transformations, we explore a large number of equivalent program variants and select the variant that minimizes a buffer evaluation metric. To cope efficiently with large program instances, we propose and evaluate a heuristic for this method. We demonstrate the value of our method on a panel of ten significant benchmarks. As an illustration, we measure the minimal memory required using multi-core modulo scheduling. Our approach considerably lowers the minimal memory required for seven of the ten benchmarks.
A Memory Scheduling Infrastructure for Multi-Core Systems with Re-Programmable Logic
The sharp increase in demand for performance has prompted an explosion in the complexity of modern multi-core embedded systems. This has led to unprecedented temporal unpredictability concerns in Cyber-Physical Systems (CPS). On-chip integration of programmable logic (PL) alongside a conventional Processing System (PS) in modern Systems-on-Chip (SoC) establishes a genuine compromise between specialization, performance, and reconfigurability. In addition to typical use-cases, it has been shown that the PL can be used to observe, manipulate, and ultimately manage memory traffic generated by a traditional multi-core processor.
This paper explores the possibility of PL-aided memory scheduling by proposing a Scheduler In-the-Middle (SchIM). We demonstrate that the SchIM enables transaction-level control over the main memory traffic generated by a set of embedded cores. Focusing on extensibility and reconfigurability, we put forward a SchIM design covering two main objectives. First, to provide a safe playground to test innovative memory scheduling mechanisms; and second, to establish a transition path from software-based memory regulation to provably correct hardware-enforced memory scheduling. We evaluate our design through a full-system implementation on a commercial PS-PL platform using synthetic and real-world benchmarks
BRISC-V: An Open-Source Architecture Design Space Exploration Toolbox
In this work, we introduce a platform for register-transfer level (RTL)
architecture design space exploration. The platform is an open-source,
parameterized, synthesizable set of RTL modules for designing RISC-V based
single and multi-core architecture systems. The platform is designed with a
high degree of modularity. It provides highly-parameterized, composable RTL
modules for fast and accurate exploration of different RISC-V based core
complexities, multi-level caching and memory organizations, system topologies,
router architectures, and routing schemes. The platform can be used for both
RTL simulation and FPGA based emulation. The hardware modules are implemented
in synthesizable Verilog using no vendor-specific blocks. The platform includes
a RISC-V compiler toolchain to assist in developing software for the cores, a
web-based system configuration graphical user interface (GUI) and a web-based
RISC-V assembly simulator. The platform supports a myriad of RISC-V
architectures, ranging from a simple single cycle processor to a multi-core SoC
with a complex memory hierarchy and a network-on-chip. The modules are designed
to support incremental additions and modifications. The interfaces between
components are particularly designed to allow parts of the processor such as
whole cache modules, cores or individual pipeline stages, to be modified or
replaced without impacting the rest of the system. The platform allows
researchers to quickly instantiate complete working RISC-V multi-core systems
with synthesizable RTL and make targeted modifications to fit their needs. The
complete platform (including Verilog source code) can be downloaded at
https://ascslab.org/research/briscv/explorer/explorer.html.
Comment: In Proceedings of the 2019 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays (FPGA '19)
Multi-core architectures with coarse-grained dynamically reconfigurable processors for broadband wireless access technologies
Broadband Wireless Access technologies have significant market potential, especially the
WiMAX protocol which can deliver data rates of tens of Mbps. Strong demand for high
performance WiMAX solutions is forcing designers to seek help from multi-core processors
that offer competitive advantages in terms of all performance metrics, such as speed, power
and area. Through the provision of a degree of flexibility similar to that of a DSP and
performance and power consumption advantages approaching that of an ASIC,
coarse-grained dynamically reconfigurable processors are proving to be strong candidates
for processing cores used in future high performance multi-core processor systems.
This thesis investigates multi-core architectures with a newly emerging dynamically
reconfigurable processor, RICA, targeting WiMAX physical layer applications. A novel
master-slave multi-core architecture is proposed, using RICA processing cores. A SystemC
based simulator, called MRPSIM, is devised to model this multi-core architecture. This
simulator provides fast simulation speed and timing accuracy, offers flexible architectural
options to configure the multi-core architecture, and enables the analysis and investigation
of multi-core architectures. Meanwhile a profiling-driven mapping methodology is
developed to partition the WiMAX application into multiple tasks as well as schedule and
map these tasks onto the multi-core architecture, aiming to reduce the overall system
execution time. Both the MRPSIM simulator and the mapping methodology are seamlessly
integrated with the existing RICA tool flow.
Based on the proposed master-slave multi-core architecture, a series of diverse
homogeneous and heterogeneous multi-core solutions are designed for different fixed
WiMAX physical layer profiles. Implemented in ANSI C and executed on the MRPSIM
simulator, these multi-core solutions contain different numbers of cores, combine various memory architectures and task partitioning schemes, and deliver high throughputs at
relatively low area costs. Meanwhile a design space exploration methodology is developed
to search the design space for multi-core systems to find suitable solutions under certain
system constraints. Finally, laying a foundation for future multithreading exploration on the
proposed multi-core architecture, this thesis investigates the porting of a real-time operating
system, Micro C/OS-II, to a single RICA processor. A multitasking version of WiMAX is
implemented on a single RICA processor with the operating system support.
Enabling Shared Memory Communication in Networks of MPSoCs
Ongoing transistor scaling and the growing complexity of embedded system designs have led to the rise of MPSoCs (Multi-Processor System-on-Chip), combining multiple hard-core CPUs and accelerators (FPGA, GPU) on the same physical die. These devices are of great interest to the supercomputing community, who are increasingly reliant on heterogeneity to achieve power and performance goals in these closing stages of the race to exascale. In this paper, we present a network interface architecture and networking infrastructure, designed to sit inside the FPGA fabric of a cutting-edge MPSoC device, enabling networks of these devices to communicate within both a distributed and shared memory context, with reduced need for costly software networking system calls. We present our implementation and prototype system and discuss the main design decisions relevant to the use of the Xilinx Zynq Ultrascale+, a state-of-the-art MPSoC, and the challenges to be overcome given the device's limitations and constraints. We demonstrate the working prototype system connecting two MPSoCs, with communication between processor and remote memory region and accelerator. We then discuss the limitations of the current implementation and highlight areas of improvement to make this solution production-ready.
Memory-Aware Scheduling for Fixed Priority Hard Real-Time Computing Systems
As a major component of a computing system, memory has been a key performance and power consumption bottleneck in computer system design. While processor speeds have kept rising dramatically, the performance improvement of the entire system is limited by how fast the memory can feed instructions and data to the processing units (the so-called memory wall problem). Increasing transistor density and surging access demands from a rapidly growing number of processing cores have also significantly elevated the power consumption of the memory system. In addition, interference between memory accesses from different applications and processing cores significantly degrades computation predictability, which is essential to ensuring timing specifications in real-time system design. Recent IC technologies (such as 3D-IC technology) and emerging data-intensive real-time applications (such as Virtual Reality/Augmented Reality, Artificial Intelligence, Internet of Things) further amplify these challenges. We believe that it is not simply desirable but necessary to adopt a joint CPU/Memory resource management framework to deal with these grave challenges.
In this dissertation, we study how to schedule fixed-priority hard real-time tasks with memory impacts taken into consideration. We target the fixed-priority real-time scheduling scheme since it is one of the most commonly used strategies for practical real-time applications. Specifically, we first develop an approach that takes into consideration not only the execution time variations with cache allocations but also the task period relationship, showing a significant improvement in the feasibility of the system. We further study the problem of how to guarantee timing constraints for hard real-time systems under CPU and memory thermal constraints. We first study the problem under an architecture model with a single core and its main memory individually packaged. We develop a thermal model that can capture the thermal interaction between the processor and memory, and incorporate the periodic resource server model into our scheduling framework to guarantee both the timing and thermal constraints. We then extend our research to multi-core architectures with processing cores and memory devices integrated into a single 3D platform. To the best of our knowledge, this is the first research that can guarantee hard deadline constraints for real-time tasks under temperature constraints for both processing cores and memory devices. Extensive simulation results demonstrate that our proposed scheduling can significantly improve the feasibility of hard real-time systems under thermal constraints.
A Scalable Parallel Architecture with FPGA-Based Network Processor for Scientific Computing
This thesis discusses the design and the implementation of an FPGA-Based
Network Processor for scientific computing, such as Lattice Quantum Chromodynamics
(LQCD) and fluid-dynamics applications based on Lattice Boltzmann
Methods (LBM). State-of-the-art programs in these (and other similar)
applications have a large degree of available parallelism, which can be easily
exploited on massively parallel systems, provided the underlying communication
network has not only high bandwidth but also low latency.
I have designed in detail, built and tested in hardware, firmware and
software an implementation of a Network Processor (NWP), tailored for the most
recent families of multi-core processors. The implementation has been developed
on an FPGA device to easily interface the logic of NWP with the CPU
I/O sub-system.
In this work I have assessed several ways to move data between the main
memory of the CPU and the I/O sub-system to achieve high data throughput
and low latency, enabling the use of "Programmed Input Output" (PIO),
"Direct Memory Access" (DMA) and "Write Combining" memory settings.
On the software side, I developed and tested a device driver for the Linux
operating system to access the NWP device, as well as a system library to
efficiently access the network device from user applications.
This thesis demonstrates the feasibility of a network infrastructure that
saturates the maximum bandwidth of the I/O sub-systems available on recent
CPUs, and reduces communication latencies to values very close to those
needed by the processor to move data across the chip boundary
Video Processing Acceleration using Reconfigurable Logic and Graphics Processors
A vexing question is 'which architecture will prevail as the core feature of the next state of
the art video processing system?' This thesis examines the substitutive and collaborative
use of the two alternatives of the reconfigurable logic and graphics processor architectures.
A structured approach to executing architecture comparison is presented; this includes a
proposed 'Three Axes of Algorithm Characterisation' scheme and a formulation of
performance drivers. The approach is an appealing platform for clearly defining the problem,
assumptions and results of a comparison. In this work it is used to resolve the
advantageous factors of the graphics processor and reconfigurable logic for video processing, and
the conditions determining which one is superior. The comparison results prompt the
exploration of the customisable options for the graphics processor architecture. To clearly
define the architectural design space, the graphics processor is first identified as part of
a wider scope of homogeneous multi-processing element (HoMPE) architectures. A novel
exploration tool is described which is suited to the investigation of the customisable
options of HoMPE architectures. The tool adopts a systematic exploration approach and a
high-level parameterisable system model, and is used to explore pre- and post-fabrication
customisable options for the graphics processor. A positive result of the exploration is the
proposal of a reconfigurable engine for data access (REDA) to optimise graphics processor
performance for video processing-specific memory access patterns. REDA demonstrates
the viability of the use of reconfigurable logic as collaborative 'glue logic' in the graphics
processor architecture.