performance increase on synthetic benchmarks, x5 on a parallel Mandelbrot implementation, 66% better on a threaded JPEG implementation, 79% better on an edge-detection benchmark and ~13% improvement on DES compared to the Leon3MP CMP. In the range of 2 to 8 cores VThreads demonstrates a post-route (statistical) power reduction of between 65% and 57% at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly-configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility, hardware support for low-latency PThreads, power efficiency and area makes the processor an attractive proposition for low-power, deeply-embedded applications requiring minimum OS support.
Introduction and motivation
State-of-the-art silicon technology nodes have empowered VLSI designers to integrate complex systems on a single chip, with such advanced Systems-on-Chip (SoC) incorporating multiple processing engines. These are, as a minimum, scalar 32-bit central processing units (CPUs) and digital signal processors (DSPs) augmented either by data-level parallel (DLP) standalone coprocessors [1], instruction set architecture (ISA) extensions or combinations thereof [2], connected via numerous high-bandwidth, point-to-point buses and, more recently, packet-switched networks [3]. These units are supplied by many local memory blocks under the control of Direct Memory Access (DMA) engines. This ever-increasing circuit complexity needs to be delivered in the face of very tight design deadlines, mandated by Time-to-Market (TTM) imperatives, resulting in the final SoC being almost always over-engineered (hence, sub-optimal). Over-engineering is the effect of an increasing productivity gap where the chip complexity that can be handled (and verified) by design teams falls well short of the potential offered by these advanced silicon process nodes. As a result, the investigation of appropriate architectures and micro-architectures and the exploration of the implementation space are minimal in most designs.
At the same time the industry is witnessing a revolution in the capability of Field-Programmable Silicon 
System Synthesis and Exploration phase:
During this stage the validated design is partitioned automatically (under DSE control) with objects allocated to CPU cores or directly synthesized to gates via FalconML.
The target embedded system architecture is shown in Fig. 2 and consists of a standard host processor (Leon3 for standard-cell technologies and the Xilinx Microblaze for FPGA targets) and a number of default peripherals. A key component of the flow is the multi-core VLIW engine [16, 17] (Figs. 2b and 2c) which forms the first-level accelerator. Objects that cannot be efficiently executed on the combined Microblaze+LE1 CMP are offloaded to UML-designed accelerators (Hard-wired System Domain) in Fig. 2a. The allocation of software objects to either software classes executing on the LE1 CMP or hard-wired implementations, as depicted, is left to the DSE element of the flow. The whole platform uses a bespoke API (bare-metal on the Microblaze and run-to-completion on the LE1).
Motivation and Contributions
Lessons learned during the ENOSYS effort fell into three areas, namely low-level software (OS), hardware (tightly-coupled accelerators) and technology (FPGA vs standard-cell).
1. OS: The need for a low-level OS executive (a simplified RTOS) to be used by the LE1 VLIW accelerator arose towards the end of that research; the authors identified that the silicon overheads and the relatively low performance of a soft-CPU such as the Microblaze were not always justified, as the algorithm computations were performed by the LE1 CMP and the HW accelerators. It was decided that a new LE1 micro-architecture was necessary to make more efficient use of the 'bare-metal' CPU cores and their close interaction with the hard-wired UML accelerators, limiting the host processor to purely OS and I/O. This is achieved through native (hardware-assisted) PThreads support, as portability between standard PThreads code and code executing on the VThreads CPU was deemed important.
2. Hardware: The LE1 micro-architecture was also the focus of performance optimization efforts, as it was discovered that the LE1 memory subsystem (based on configurable, multi-banked, tightly-coupled RAMs) was more effective for the ENOSYS co-design tasks than the host system DDR2 solution. This called for improvements in a number of areas of the existing LE1 design, namely the Instruction Fetch Engine (IFE), tight integration of streaming accelerators, Multi-input/Multi-output (MIMO) Instruction Set Extensions (ISEs), branch prediction, extensive instrumentation and extended use of customization.
3. Target technology: Finally, it was decided that the LE1 optimizations should be such that the new version of the processor is fully technology-agnostic in order to support both FPGA and ASIC flows. This called for modifications in the embedded SRAM blocks and the system clocking.
The outcome of this evaluation is the next-generation LE1 system, known as VThreads, which is the focus of this paper. The processor addresses (1) by providing minimalist, low-level, low-latency, hardware-assisted PThreads support; (2) with a more efficient hardware organization; and (3) via a re-design in a technology-independent way.
An overview of VThreads is given in the next section.
VThreads Overview
VThreads is a configurable CMP designed to be used in deeply-embedded co-design applications on FPGA and ASIC technologies. It implements a hybrid, explicit [18] parallelism model supporting both shared-memory semantics in hardware (hardware-accelerated PThreads) and a small set of Message-Passing primitives, facilitated by the host system. VThreads is designed to be attached to a larger system which includes a host CPU running the OS, high-level data scheduling and interfacing. A programmer's view of the VThreads system architecture (full deployment) is depicted in Figure 3a. At the highest level (Galaxy), VThreads consists of a configurable number of shared-memory multiprocessors (System(X)) and the Galaxy-level interconnect 3. The host can initiate remote memory access operations across these multiprocessors using its own DMA facilities. Each such System is a shared-memory multiprocessor. VThreads supports a single local data memory per System (DRAM on Fig. 3a), accessible from all the Contexts/HCs in that System, and a configurable instruction RAM (IRAM). Both memories are parametric and multi-banked.

VThreads relies on an ISA-agnostic pipelined micro-architecture with partial support for fully-predicated (EPIC) [19] and full support for partially-predicated (Multiflow-type) statically-scheduled, long instruction word architectures [20]. These are known as perspectives, with the default perspective (VT32PP) being the 32-bit partially-predicated VLIW architecture. This is loosely based on the Multiflow architecture with architectural support for Single-Instruction, Multiple-Data (SIMD) processing, multiple register files, MIMO ISEs and control state and hardware support for POSIX Threads. There are also 512 control/configuration registers forming the CTRL_SPACE, accessible by the host system and the HCs via the DBG_IF (Section 3.6) and the RDCTRL and WRCTRL instructions respectively. Finally, there are 256 32-bit peripheral registers (PERIPH_SPACE) per Context used to map the control registers of attached streaming peripherals. They are accessed via the DBG_IF and by the HCs with the RDPERIPH and WRPERIPH instructions.

VT32PP is a true Harvard architecture; instructions have their own 32-bit private address space which is only accessed by the host system with single or streaming IRAM read/write transactions. Long instruction words (LIWs) consist of a variable number (up to the architectural width of the processor) of RISC-type operations known as syllables or RISCops (used interchangeably in the text). Instruction accesses are byte-aligned (to allow for future implementations where variable instruction lengths are supported) and control transfer instructions target the first syllable of the target LIW. In its current form, VThreads supports blocked (vertical) hyper-threading with micro-architectural hooks in place to support Simultaneous (SMT) [21] and Cluster-Simultaneous (CSMT) [22] multi-threading.

3 VThreads does not bus-master and the Galaxy-Level (Inter-System) Interconnect is implemented on the host system, typically using the AXI4 Interconnect IP on Xilinx 7 Series. Inter-System transfers are scheduled by the host DMA.

4 The 256 Systems x 256 Contexts define the architectural extrema of VThreads. The 64K Contexts, each with a maximum of 16 HCs, are encoded in 20 bits which are returned by the VT32PP CPUID instruction. Whilst Figs. 3 and 4 depict a single AHB/AXI4 slave port per System, the authors have considered the use of an internal (Galaxy-level) NoC, instead of the external AXI4 system, to limit the number of such slave channels. This is not elaborated further as it is not part of the existing VThreads RTL database. All performance experiments have been conducted with a single-System Galaxy configuration as discussed in Section 5.
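The 20-bit figure follows from the architectural extrema: 8 bits select one of 256 Systems, 8 bits one of 256 Contexts and 4 bits one of 16 HCs. The C sketch below packs and unpacks such an identifier. The field order (System in the top bits, HC in the low bits) and the names vt_id_t, vt_id_pack and vt_id_unpack are assumptions made for illustration; the text does not give the actual layout returned by CPUID.

```c
#include <stdint.h>

/* Hypothetical field layout (an assumption, not taken from the VThreads RTL):
 *   bits 19..12 : System index  (0..255)
 *   bits 11..4  : Context index (0..255)
 *   bits  3..0  : HC index      (0..15)
 */
typedef struct {
    uint8_t system;
    uint8_t context;
    uint8_t hc;
} vt_id_t;

/* Pack the three indices into a 20-bit value held in a 32-bit word. */
uint32_t vt_id_pack(vt_id_t id)
{
    return ((uint32_t)id.system << 12) |
           ((uint32_t)id.context << 4) |
           ((uint32_t)id.hc & 0xFu);
}

/* Recover the three indices from a 20-bit identifier. */
vt_id_t vt_id_unpack(uint32_t raw)
{
    vt_id_t id;
    id.system  = (uint8_t)((raw >> 12) & 0xFFu);
    id.context = (uint8_t)((raw >> 4) & 0xFFu);
    id.hc      = (uint8_t)(raw & 0xFu);
    return id;
}
```

With all fields at their maximum the packed value is 0xFFFFF, which confirms that 20 bits are sufficient for the 256 x 256 x 16 extrema.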
Whilst there are numerous research and commercial machines in the VLIW domain such as CoreVA [23], PACDSP [24] and Kalray [25], the VT32PP ISA was chosen for the VThreads organization due to its similarity to the Multiflow and the LX [26] machines, the maturity of the compiler and the knowledge the designers had acquired during the development of the precursor LE1 CPU. A particularly important point and a major differentiator of our work is the architectural support for MIMO ISEs and embedded streaming accelerators, which are not found in most VLIW implementations. The following section goes into the details of the VThreads organization.
VThreads organization
From Fig. 4, the Galaxy is connected to a host system (Microblaze/Nios2/Leon3) which provides a number of memory-mapped ports (AXI4MM/AvalonMM/AHB) equal to the number of instantiated Systems in the Galaxy, and a single non-pipelined interface (AXI4LITE/APB) for debug purposes. Each System includes a configurable number of Contexts, a peripheral wrap component and the multi-ported, multi-banked Data Memory. That memory is host-mapped and accessed via host DMA transactions. The DBG_IF logic is distributed within VThreads to all Systems/Contexts/periph_wrap instances and follows a simple Req/Ack protocol between the DBG_IF Finite State Machine (FSM), the per-context controller and associated processor and peripheral state.
The expert reader will notice that Inter-System communications and synchronization are supported only via host intervention (host-side DMA and use of MUTEX-type IP), as VThreads is not, in its present form, designed to bus-master the host system. At the same time, multiple VThreads instances can be instantiated at a higher level of hierarchy (full system); however, sufficient configurability and extensibility is built within the Galaxy to ensure there is no need to do that.
The following sections provide additional details on all the blocks of Fig. 4 with emphasis on the Context (Instruction Fetch Engine (IFE), mid-pipe section (MIDPIPE), execution data-paths (Cluster-Architecture) and cluster-level Load-Store unit (CL_LSU)), followed by a discussion of the periph_wrap, the System-Level Memory (DRAM) and the DBG_IF which supports hardware-accelerated PThreads.
Instruction Fetch Engine (IFE)

Mid-Pipe
The Mid-Pipe (Fig. 6) includes the decode logic, register files, bypass logic and issue queues per HC. Following from the IFE, one of the non-blocked HCs is selected by the Thread_Select_Logic (bottom of Fig. 5) for access to the array of instruction decoders. These are combinatorial blocks producing a large array of control fields which schedule the downstream pipeline resources. The output from Declogic includes the register source/destination information which specifies the number of register sources (1-7), destinations (up to 2), type
Cluster Architecture
Clusters are autonomous execution units including pipelined integer (SCORE), floating-point (FPCORE), MIMO ISE data-paths (CCORE, not shown) and associated processor state. A high level view of the cluster hierarchy is depicted in Figure 7a .
The SCORE encapsulates the data-path pipelines (IALU, IMULT) and the branch logic (BRU, not shown). The IALU and IMULT are the pipelined execution units and support arithmetic, logical, shift and multiplication operations. The IALU consists primarily of single-cycle data-paths and the IMULT instantiates a configurable DesignWare component with an explicit latency annotation. Use of configurable latency for the IALU (Fig. 7b) and the IMULT (Fig. 7c) ensures that a high Fmax is achieved (via retiming) despite the series Thread_Select_Logic in Fig. 6. This series dependency can also be broken with the explicit specification (via an HDL constant) of a Thread_Pick_Stage (TPICK), albeit at the expense of an extra clock in Branch/Jump resolution which penalizes the IPC of the machine. Finally, the BRU is responsible for all updates to the per-HC PC and validates the branch prediction, in the process re-steering the former to the correct address if a mis-predicted path was followed.
The FPCORE includes an iterative divider unit (FDIV) and two 4-stage pipelined generic floating point data-paths capable of performing single-precision addition/subtraction and multiplication. In addition, the generic FP data-path includes format conversion from signed 32-bit integers to single-precision FP. FPCORE also makes use of DesignWare IP.
Finally, the cluster includes a multiplicity of register files, one per HC, each with a statically-defined number of R/W ports, and the Cluster Load/Store Unit (discussed in Section 3.5).
Cluster organization
VThreads specifies up to 16 cluster templates, with each template specifying the SCORE, FPCORE and CCORE units. The designer selects, via HDL constants, the templates to be instantiated (number of clusters).
Custom Core (CCORE) and Peripheral Wrapper
VThreads includes architectural support for custom MIMO ISEs with syllables extended to 64 bits to allow for the encoding of a maximum of 7 register specifiers. There are four categories (CASM4, CASM5, CASM6, CASM7) supporting 4, 5, 6 and 7 such specifiers respectively. CCORE incorporates the data-paths that implement all these extensions and has direct access to the multi-ported RF.
To satisfy the deeply-embedded peripheral requirement of the Hardware aims of Section 1.2, VThreads was architected to tightly integrate such streaming peripherals, operating in parallel to the instantiated Context. These peripherals are typically stream accelerators, designed with ESL flows [28] or directly in RTL, and access the common system memory via pipelined ports. Fig. 8 depicts the schematic of the block. VThreads peripherals include a programmer's interface consisting of both mandatory and user-architected registers which are read from/written to by individual HCs as well as from the host. Peripherals are instantiated at the System level as shown in Figure 4; relevant parameters identify the name of the instantiated peripheral, the number of LSU ports required and secondary channel arbitration. As it stands, the micro-architecture fully supports the accelerators designed with ROCCC 2.0 [29]. The close integration of pipelined accelerators to the processor core is a key characteristic of this design and a differentiator to the previous LE1 effort.
Cluster Load/Store Unit (LSU) and System-Level Data RAM
The Load/Store Unit (Fig. 9) is the interface to the system memory at the Cluster level. A Cluster can include an instance of the LSU, which supports a number of direct channels (ports) to the shared system DRAM. Active HCs

The macro thread-affinity observed when executing software on VThreads is based on the simple Context affinity (C-Affinity) and HC-Affinity hardware algorithms. These rely on two hidden state vectors (System-wide registers C_Affin and HC_Affin) which maintain a bit-mask associating Contexts and HCs, per Context, used in vthread_create operations for that Context. From Fig. 12, a biased ff1 block resolves one out of all the available Contexts. This is clocked in the newCptr register, updates the cPtr of the issuing Context and is further used to select the first available HC (using the same biased ff1 block). The final outputs are the vectors newCPTR and newHCPtr, which are read by the FSM which subsequently loads the PC, LR and SP of the chosen Context and HC and drives the Galaxy-level pipeline signals for the latter to commence execution in MIMD mode. Though not studied in this research, we are investigating other algorithms in which closely-related threads (SPMD paradigm) are grouped on the same Context to maintain close tracking. This is not elaborated further as it is not implemented in the current RTL.
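The two-step selection (first a Context, then an HC within it) can be modelled in software as two passes of a find-first-one over bit-masks. The C sketch below is a behavioural illustration only. It assumes that a set bit means "available", that "biased" means the scan starts at a rotating offset and wraps around, and that the bias advances past the Context just chosen; none of these details are stated in the text, and the type and function names (vt_affinity_t, biased_ff1, vt_pick) are hypothetical.

```c
#include <stdint.h>

#define NUM_CONTEXTS 8u   /* illustrative sizes, not architectural limits */
#define NUM_HCS      4u

/* Biased find-first-one: scan `width` bits of `mask`, starting at bit `bias`
 * and wrapping around. Returns the index of the first set bit, or -1. */
int biased_ff1(uint32_t mask, unsigned bias, unsigned width)
{
    for (unsigned i = 0; i < width; i++) {
        unsigned bit = (bias + i) % width;
        if (mask & (UINT32_C(1) << bit))
            return (int)bit;
    }
    return -1;
}

typedef struct {
    uint32_t c_affin;                 /* assumed: bit c set = Context c has a free HC */
    uint32_t hc_affin[NUM_CONTEXTS];  /* assumed: bit h set = HC h of Context c is free */
    unsigned c_ptr;                   /* assumed: scan bias for the Context search */
} vt_affinity_t;

/* Pick a Context and then an HC for a new thread.
 * Returns 0 and fills the outputs on success, -1 if nothing is free. */
int vt_pick(vt_affinity_t *s, unsigned *new_cptr, unsigned *new_hcptr)
{
    int c = biased_ff1(s->c_affin, s->c_ptr, NUM_CONTEXTS);
    if (c < 0)
        return -1;

    int h = biased_ff1(s->hc_affin[c], 0, NUM_HCS);
    if (h < 0)
        return -1;

    /* Mark the HC as taken; retire the Context once it has no free HC left. */
    s->hc_affin[c] &= ~(UINT32_C(1) << h);
    if (s->hc_affin[c] == 0)
        s->c_affin &= ~(UINT32_C(1) << c);

    s->c_ptr = ((unsigned)c + 1u) % NUM_CONTEXTS;  /* assumed bias update */
    *new_cptr = (unsigned)c;
    *new_hcptr = (unsigned)h;
    return 0;
}
```

In hardware both searches would be combinational priority encoders rather than loops; the loop form is used here only to make the wrap-around order explicit.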
Processor Customization and Tool-chain
VThreads is a highly configurable, extensible processor architecture implemented in RTL VHDL and includes a low-level C-based API, with the whole Sw/Hw flow orchestrated by the LE1 Tool-chain. This section provides more details on the customization capabilities of VThreads, discusses the developed tool-chain and demonstrates user-related software aspects when programming the system.
Processor Customization
Galaxy-level HDL customization is achieved via an XML file as shown in Fig. 13, which is parsed to extract a number of parameters. The example diagram (RHS) specifies four enumerated types (GALAXY_T, DARCH_T, ISA_PRSPCTV_T, IARCH_T) which are used to configure the Galaxy (Systems and type); Systems (Contexts, availability of scalar processor at system-level, Data memory architecture/block size/size/banks); and Contexts (VLIW width, ISA perspective, IFE architecture and Fetch width, Cluster Templates, HCs, IRAM banks/size/block size). The data-path components are described in ClusterTemplate. The extracted XML parameters are used to populate a top-level RTL VHDL configuration file (mastercfg_pkg.vhd), precisely specifying the hardware architecture, and are communicated to the processor hierarchy via VHDL generics.
VThreads Tool-chain
The tool-chain developed for VThreads consists of the HP Labs VEX research compiler 5 [26] along with scripting infrastructure to process the generated assembly. The final output of the flow is a set of header files, incorporated in the host application, with the initialized IRAM and DRAM images. The low-level API is used to communicate between a master authority (VHDL simulator/Host processor/x86 workstation) and a slave authority (Insizzle/FPGA silicon) in a seamless way. The host driver and VThreads application are cross-compiled for the host with gcc (Leon3/Microblaze/ARM target) resulting in the app.elf file. This is loaded via the Xilinx xmd utility to the final FPGA silicon for real-time execution. The flow is depicted in Fig. 14.
1. xml2mm: The xml2mm Perl utility parses the system.xml file (Fig. 13), recovers all configuration parameters and auto-generates a single mastercfg_pkg.vhd file which customizes the RTL database. It also generates the machine description file (MM) which is input to the VEX compiler along with the application sources (App.c). The latter produces a number of assembly files (App.s).
Compiled Simulator Support:
This is the first level of execution of the user application (App.c). It consists of a collection of Perl utilities which use headers from VEX and auto-generate a number of C files (App.cs.c). When all these files are combined together the resulting x86 executable faithfully represents the behaviour of the user application, as it would when running on silicon. This mode of operation supports a single Context/HC; however, it is particularly useful for providing a Dynamic Trace (the delta of the processor state affected by every dynamically-executing instruction) which is used, via the Foreign Language Interface (FLI) mechanism below, to communicate with the VThreads RTL in Modelsim Co-simulation of the application and the processor RTL. is the second source of architectural Traces, used for RTL co-simulation.
VThreads API:
The VTAPI is a C library statically incorporated in the host-resident application code. It is used to identify, initialize and test features of the attached VThreads slave. In addition, it provides a simplified interface to load the IRAM and DRAM of the latter, begin execution and poll the VThreads state for completion. It provides a simplified DMA interface to the host. The API is designed to be used in architectural, cycle-accurate (single and multi-threaded) and RTL simulation modes and on the FPGA prototypes, while it allows an external host (Ethernet proxy) to be used for stdio at runtime. A subset of the VTAPI calls and relevant categories are listed in Fig. 15.
Fig. 15: Subset of VTAPI calls (columns: Category, API Call, Description).
Application Environment
The VThreads tool-chain allows the programmer to seamlessly develop, compile and execute the target application in a Unix environment. Fig. 16 depicts an extract from the DES benchmark (inner loop creating multiple threads, Section 5.3.4) which was run on the benchmarked processors (Leon3+FSU, Leon3MP+Custom PThreads and VThreads, Section 5.3) with the only modification being the initialization of the FSU library (pthread_init) for Leon3. The automation afforded by our tool-chain takes care of all low-level details resulting in a final executable which can be run on Insizzle or FPGA silicon.
Performance Evaluation
This section discusses the performance of VThreads compared to a number of other CPU cores for A) a number of micro-benchmarks (Section 5.2) and B) four larger threaded applications (Section 5.3). Four processors are used in the study. The MicroBlaze and Leon3 are two scalar (single-issue) soft CPUs designed for FPGAs (both) and standard-cell implementation (Leon3). Leon3 was included in this study as it was originally considered for the ENOSYS CMP architecture (CPU on Fig. 2) and dismissed due to its scalar issue and lack of sufficient RF bandwidth for MIMO ISEs. The MicroBlaze is included in the Xilinx tools (PlanAhead and Vivado for the latest FPGA families). It is AXI4-based, highly configurable and can be optimized for either area
Leon3MP and the Custom PThreads Library
To provide a realistic benchmarking target for VThreads, a custom PThreads implementation was devised for the Leon3MP processor. The latter allows up to 16 AHB masters (a total of 12 CPUs) to be instantiated within a single-tier 32-bit AHB system; our custom PThreads library relies on CPU0 being the master processor with all others activated on demand through the use of an interrupt register. A custom boot-loader was developed to offset the initial CPU stack pointers (defined in crt0.s) for each Leon3 core and override memory resetting when any CPU other than CPU0 is started. Additionally, methods for performing thread creation/termination/synchronization were devised using global arrays and data structures to make the state of each CPU globally available. pthread_create is stalled until another core is made available using the same mechanism. Though not ideal from an energy-efficiency point of view, this mechanism does provide a very fast PThreads implementation on Leon3MP; a second generation of this library is being considered to identify and take into account low-power modes of the Leon3 processor, thus dispensing with the use of tight polling.
Micro-benchmarking
This section evaluates the performance of basic PThreads operations across the processors and PThreads API implementations of Table 1. The following test cases are identified:
• Test 1: Create: Measures the time elapsed between the master thread issuing a pthread_create and the slave thread beginning execution. A timer is started prior to pthread_create and stopped within the slave thread. In the case of VThreads, internal instrumentation is used to obtain a precise figure of the number of elapsed clocks.
• Test 2: Join: Measures the elapsed time from the slave thread completing to the master thread being aware of the termination of the slave thread through a pthread_join operation. A timer is started and the slave thread exits; the master thread performs a pthread_join and once it is made aware of the termination of the slave thread the timer is stopped from the master thread.
• 
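For readers wishing to reproduce Tests 1 and 2 on a standard POSIX host, the two measurements can be sketched in C as below. This is not the instrumentation used in the paper (VThreads relies on internal cycle counters); it assumes clock_gettime with CLOCK_MONOTONIC as the timer, and the names probe_t, slave and measure_create_join are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef struct {
    struct timespec slave_start;  /* stamped as the slave begins execution */
    struct timespec slave_exit;   /* stamped just before the slave returns */
} probe_t;

/* Nanoseconds elapsed from time stamp a to time stamp b. */
int64_t elapsed_ns(struct timespec a, struct timespec b)
{
    return (int64_t)(b.tv_sec - a.tv_sec) * 1000000000LL +
           (int64_t)(b.tv_nsec - a.tv_nsec);
}

/* Slave thread: records when it starts and when it is about to exit. */
static void *slave(void *arg)
{
    probe_t *p = (probe_t *)arg;
    clock_gettime(CLOCK_MONOTONIC, &p->slave_start);
    clock_gettime(CLOCK_MONOTONIC, &p->slave_exit);
    return NULL;
}

/* Test 1 (create): master timer start -> slave begins execution.
 * Test 2 (join):   slave about to exit -> master returns from pthread_join.
 * Returns 0 on success, -1 if a PThreads call fails. */
int measure_create_join(int64_t *create_ns, int64_t *join_ns)
{
    probe_t p;
    pthread_t tid;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (pthread_create(&tid, NULL, slave, &p) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Reading p after pthread_join is safe: join orders the slave's writes. */
    *create_ns = elapsed_ns(t0, p.slave_start);
    *join_ns   = elapsed_ns(p.slave_exit, t1);
    return 0;
}
```

On a hosted OS the figures include scheduler latency, so they are indicative only; repeating the measurement and taking the minimum gives a more stable estimate.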
C benchmarks
This section presents more substantial application benchmarks threaded at fine and coarse levels. In coarse-level threading, a single pthread_create operation is executed per available core at the beginning (initiated by core 0). Each thread then performs the relevant computation and finally pthread_join is used to synchronize all active threads. In this case the number of pthread_create and pthread_join operations used is equal to N-1, where N is the number of CPUs. Fine-grained threaded benchmarks deliberately make as much use of the PThreads support as possible. Each benchmark also takes into account the physical cores available, with the coarse-grained benchmarks always creating enough threads to fully utilize all cores and each thread computing an equal, if possible, fraction of the workload. Fine-grained benchmarks differ in that at certain points within the code multiple threads are created to saturate the architecture. These threads compute their fraction of work and are then synchronized to the main thread. Results were extracted from FPGA silicon for both the IMT (Leon3, MicroBlaze) and CMP (Leon3MP, VThreads) architectures.
• IMT Architectures: An issue experienced in some benchmarks in the coming sections related to the onboard timers of the IMT Leon3 and MB. When clocked at 75 MHz the 32-bit MB timer overflowed after 57 seconds, which was simply not long enough for the successful execution of all instances of the benchmark. A separate process, running as a thread, was investigated to track this overflow and increment a second counter; however, introducing this housekeeping thread added overheads in thread interleaving. Similarly, for the IMT Leon3 the timers returned odd numbers when executing on silicon. It was discovered that the timer only incremented for CPU0 while it was executing (disabled during context switch), thus skewing the results. As a result of these issues IMT systems were not studied further in the following sections.
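The 57-second figure follows directly from the counter width and clock rate: 2^32 ticks at 75 MHz is 4,294,967,296 / 75,000,000, or approximately 57.3 s. The C sketch below reproduces that arithmetic and illustrates the general idea of extending a wrapping 32-bit count with a software-maintained upper word. The helper names are hypothetical, and the extension scheme is a generic technique rather than the housekeeping thread that was actually investigated.

```c
#include <stdint.h>

/* Seconds until a free-running counter of `bits` width (bits < 64)
 * wraps around when clocked at `hz`. */
double timer_wrap_seconds(unsigned bits, double hz)
{
    return (double)(UINT64_C(1) << bits) / hz;
}

/* Software extension of a wrapping 32-bit tick counter to 64 bits. */
typedef struct {
    uint32_t last;  /* last raw value observed */
    uint64_t high;  /* accumulated wraps, already shifted into bits 63..32 */
} tick64_t;

/* Feed the current raw 32-bit count; returns the extended 64-bit count.
 * Must be called at least once per wrap period or a wrap is missed. */
uint64_t tick64_update(tick64_t *t, uint32_t now)
{
    if (now < t->last)            /* raw counter went backwards: it wrapped */
        t->high += UINT64_C(1) << 32;
    t->last = now;
    return t->high | now;
}
```

The polling requirement in tick64_update is the source of the overhead noted above: something has to sample the counter at least once every 57 seconds, and on an IMT system that work competes with the benchmark threads.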
Mandelbrot Set (Multi-threaded)
The Mandelbrot Set [30] is a mathematical set of points whose boundary is a distinctive 2D fractal shape, named after the mathematician Benoît B. Mandelbrot. This benchmark was selected as it is highly parallel and truly data-independent. Each point can be calculated in parallel based on a predefined magnification setting and origin coordinates. A second set of co-ordinates is then passed to calculate the pixel value at a specific position.
In this study a 160 by 480 Mandelbrot fractal is computed using a 10-colour palette and a maximum iteration value of 100. Originally the benchmark was split into contiguous sections (slices) of the image, which resulted in a large computational load imbalance. Subsequently, computations were split on a row basis, resulting in an interleaved output with the load more evenly balanced across threads. In coarse-grained threading a master thread generates a new thread on each available core, tasked to compute a section of the output image. The fine-grained threading uses multiple PThreads operations. Their number is $(\mathit{Size} / \mathit{NumberOfCores}) \times (\mathit{NumberOfCores} - 1)$, with $\mathit{Size}$ equal to the number of output pixels (76,800 based on a 160x480 image). The master thread iterates through all output pixels and creates new threads for each. It then computes a pixel value itself and synchronizes with all other threads. This is performed in a loop until all output pixels have been generated.
This results in 0 PThreads operation pairs on a single core and up to 67,200 PThreads operation pairs on 8 cores. Table 3 shows the real time taken for the execution of the benchmarks on the Virtex6LX240T device and the extrapolated Leon3MP ASIC results.
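The operation-pair counts quoted above can be checked with a few lines of C. The function below simply evaluates (Size / cores) x (cores - 1) with integer division; its name is hypothetical and the guard for a zero core count is an addition for robustness.

```c
#include <stdint.h>

/* Number of create/join pairs issued by the fine-grained scheme:
 * the work is processed in groups of `cores` items, and for each group
 * the master creates (cores - 1) threads and computes one item itself. */
uint32_t fine_grained_op_pairs(uint32_t size, uint32_t cores)
{
    if (cores == 0)
        return 0;  /* guard added for this sketch */
    return (size / cores) * (cores - 1);
}
```

For the 76,800-pixel image this gives 0 pairs on one core and 67,200 on eight. The same expression also reproduces the 7,168 pairs reported for the 8,192-block DES benchmark on eight cores.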
JPEG Decode (Multi-threaded/Multi-programmed)
This JPEG decoder is based on a small C implementation called NanoJPEG 10, modified to remove dynamic memory allocation and file I/O in order to run on embedded VLSI processors. A 64x64 JPEG image (Lena) was used as the input data set. Due to the block-based nature of JPEG there are data dependencies between macro blocks (MBs), with pixels computed possibly relying on other pixels in the same or adjacent MBs. As a result, the coarse-threaded version uses eight instances of the JPEG decoder with all instances decoding a separate image. For fine-grained threading it was noted that at the end of MB decoding an inverse discrete cosine transform (IDCT) is performed on each column of pixels within that MB (colIDCT), called eight times, once per column. This loop was modified to split the work over all available cores and then use a unique identifier along with the total number of parallel threads to specify whether or not to perform the computation. Fine-grained threading is performed on a 64x64 image resulting in 96 MBs and up to 672 PThreads operation pairs in the 8-core configuration. As the amount of computation performed within that function is small, it serves as a good example to demonstrate the low-latency threading support in Leon3MP and VThreads. Real-time results from executing the JPEG benchmark are shown in Table 4:
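The identifier-gated loop described above can be illustrated as follows. Every thread runs the same eight-iteration loop and uses its identifier and the thread count to decide which columns it owns. The modulo ownership rule is an assumption (the text does not state how columns are assigned), and col_op is a trivial stand-in for the real colIDCT so that the sketch stays self-contained.

```c
#include <stdint.h>

#define MB_DIM 8  /* an 8x8 block: eight columns of eight samples */

/* Stand-in for the column IDCT (NOT the NanoJPEG colIDCT):
 * doubles every sample in one column so the effect is easy to verify. */
static void col_op(int32_t *blk, unsigned col)
{
    for (unsigned row = 0; row < MB_DIM; row++)
        blk[row * MB_DIM + col] *= 2;
}

/* Executed by every thread on the same block. A thread processes column
 * `col` only when col % nthreads == tid (assumed ownership rule).
 * Returns the number of columns this thread processed. */
int process_columns(int32_t *blk, unsigned tid, unsigned nthreads)
{
    int done = 0;
    if (nthreads == 0)
        return 0;
    for (unsigned col = 0; col < MB_DIM; col++) {
        if (col % nthreads != tid)
            continue;  /* another thread owns this column */
        col_op(blk, col);
        done++;
    }
    return done;
}
```

With eight threads each one handles exactly one column, so a master plus seven created threads per block gives 96 x 7 = 672 operation pairs, matching the figure quoted for the 8-core configuration.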
Sobel Filter (Multi-threaded)
The Sobel Filter algorithm is typically used as a first processing stage in feature detection, in which a source image results in an output grey-scale image displaying "edges". Each pixel in the source image is used in calculations with its surrounding pixels, and two masks are used to find horizontal and vertical transitions in order to detect edges and corners. The algorithm is computationally intensive; however, each output pixel can be calculated independently. Inputs to the algorithm are a 640x480 image and two 3x3 mask arrays. Coarse-grained threading results in between 0 and 7 PThreads operation pairs. This example is split in an interleaved fashion similar to the Mandelbrot Set, where each thread (core) processes a full row of the input image and then moves down a set number of rows. The real-time results recorded from FPGA silicon are shown in Table 5:
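A minimal C sketch of the interleaved row split is given below. Thread tid of nthreads processes rows tid, tid + nthreads, tid + 2*nthreads and so on. The masks are the standard Sobel kernels; combining the two gradients as |gx| + |gy| clamped to 255, and leaving border pixels untouched, are assumptions of this sketch rather than details taken from the benchmark source.

```c
#include <stdint.h>
#include <stdlib.h>

/* Standard 3x3 Sobel masks for horizontal and vertical transitions. */
static const int GX[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
static const int GY[3][3] = { {-1, -2, -1}, {0, 0, 0}, {1, 2, 1} };

/* Work function for one thread: rows are assigned in an interleaved
 * fashion, so thread `tid` handles rows tid, tid + nthreads, ...
 * `src` and `dst` are w*h grey-scale images in row-major order. */
void sobel_rows(const uint8_t *src, uint8_t *dst, int w, int h,
                int tid, int nthreads)
{
    if (nthreads <= 0)
        return;
    for (int y = tid; y < h; y += nthreads) {
        if (y == 0 || y == h - 1)
            continue;                      /* border rows are skipped */
        for (int x = 1; x < w - 1; x++) {
            int gx = 0, gy = 0;
            for (int j = -1; j <= 1; j++) {
                for (int i = -1; i <= 1; i++) {
                    int p = src[(y + j) * w + (x + i)];
                    gx += GX[j + 1][i + 1] * p;
                    gy += GY[j + 1][i + 1] * p;
                }
            }
            int mag = abs(gx) + abs(gy);   /* assumed magnitude estimate */
            dst[y * w + x] = (uint8_t)(mag > 255 ? 255 : mag);
        }
    }
}
```

Because every output pixel depends only on the read-only source image, the threads never write to the same location and no locking is needed; a single join per thread at the end is the only synchronization.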
Data Encryption Standard (Multi-threaded)
The Data Encryption Standard (DES) was developed in the early 1970s and published as an official standard in 1977. It has since been superseded by other such standards, as it is considered insecure due to the key size being small enough to be susceptible to brute-force attacks. The algorithm uses a 64-bit key to generate a set of 16 48-bit sub-keys used to encrypt 64-bit blocks of plain-text, through a 16-stage Feistel Network, into 64-bit blocks of cipher-text. This benchmark can be threaded with multiple threads processing 64-bit plain-text and cipher-text pairs, as they are data-independent. Similarly to previous benchmarks, DES is parallelized at both coarse and fine-grained levels. 64KB of data are processed, resulting in 8,192 blocks of 64 bits to encrypt. In coarse-grained threading a thread is instantiated within each available core and then works across the input data using its knowledge of the total number of threads and the size of the data to be encrypted, whereas in fine-grained threading the maximum number of PThreads library operation pairs executed is 7,168, based on 8,192 blocks. The real-time results extracted from FPGA silicon and the scaled ASIC results are shown in Table 6:

...strates a speed-up of between x3.6 and x7.23, for create and join, compared to the very tight custom PThreads implementation of the dual-core 500 MHz Leon3. As the Leon3MP data are extrapolated, it is expected that results will improve somewhat for Leon3MP when using the very same memory subsystem as VThreads.
Discussion of results
Micro-benchmarking quantifies the benefit of very fine-grained, low-latency PThreads support compared to state-of-the-art commercial and research processors and API implementations. Overall, VThreads with its hardware-assisted PThreads support demonstrates better performance than the Leon3MP system, particularly for highly-parallel workloads such as Mandelbrot and the Sobel filter. The pattern on more complex benchmarks such as JPEG decode and DES is slightly different, with the Leon3MP system demonstrating better performance at higher core counts (8). This can be attributed to the very fast custom PThreads implementation, in which all cores are active in a very tight polling loop in a shared-memory system. In VThreads, by contrast, the DBG_IF PThreads mechanism of Section 3.6 is a natural synchronization point, which can be further optimized if implemented in a pipelined fashion such as the MPI coprocessors of [31].
Silicon Implementation Results
An important mandate for this work was the re-architecture of the legacy LE1 CPU into a technology-independent form. This section discusses three implementations that were carried out, targeting mature 40 nm FPGA, 28 nm SoC-FPGA and 65 nm ASIC technologies. In the latter case, an 8-way Leon3 CMP system was implemented with the same number of CPU cores as VThreads contexts and utilizing the same memory subsystem (256KB, 4 banks). The VThreads DRAM subsystem was used to level the field and provide a more accurate comparison across the ASIC implementations of the CMPs.
FPGA implementation
Data were collected for a 40 nm device (Xilinx V6LX240T-FG1156) and a 28 nm SoC FPGA (z7045 device on the Xilinx ZC706 development board). The 40 nm target was synthesized with the older (deprecated) xst-based flow, whereas the z7045 target used the more advanced Vivado environment (2014.2). It should be noted that the LX240T-FG1156 results did not include the re-timing option, which further increases the performance of the wide-multiplexer-heavy VThreads design. In addition, the z7045 results do not include data for the Leon3 CPU, as the host processor in this case is a high-speed (800 MHz) dual-core ARM Cortex-A9 CPU. Tables 7 and 8 summarize the FPGA implementation results:
Related work
The authors in [32] discuss Hthreads, a mechanism to abstract the CPU/FPGA interface and seamlessly execute threaded applications on either using the PThreads programming model. They propose an abstract hardware/thread interface (HWTI) and support the compilation of data-parallel sections of code onto hardwired accelerators. In addition, the authors in [33] discuss distributed, hardware-based micro-kernels, based on the PThreads programming model, as a framework for parallel software development on heterogeneous, FPGA-based MPSoCs. They propose a system where each CPU core includes a Hardware Abstraction Layer (HAL) library and all cores make use of modified Hthreads hardware micro-kernel cores. They performed their investigation on a heterogeneous MPSoC system (Xilinx Virtex5 series) consisting of a hard PPC and multiple soft MicroBlaze processors. Whilst sharing the idea of using the PThreads model to express application parallelism, we note that VThreads is a highly-customizable, extensible and technology-agnostic architecture and, as shown, can be targeted to both field-programmable silicon (section 6.1) and standard-cell technologies (section 6.2).
At the same time, VThreads provides a full software-based environment and its performance can be further enhanced via custom MIMO ISEs and pipelined accelerators to closely match the processing requirements of the application. Similar to this work is [34], which makes use of the task flow graph of an application with Kahn Process Networks (KPNs) to efficiently design MPSoCs with hardware accelerators (Hardware Threads, HWTs) for partially-reconfigurable systems and streaming applications. ReconOS [35] is an extension to the eCos RTOS, making integrated use of software threads and hardware cores (hardware threads) to provide POSIX-compliant services on FPGAs. We note that the authors make use of a single synchronization point (where hardware threads interact with software threads), similar to the VThreads DBG_IF FSM being the synchronization point for all pthread operations. The authors proposed the use of distinct calls (e.g., pthread_create() and rthread_create() for Sw and Hw threads respectively). The latter is not an issue with VThreads, as only Sw threads can be created on the unallocated HCs. Along the same lines, the Berkeley Operating system for Re-programmable Hardware (BORPH) [36] provides a unified Unix-like interface based on inter-process 
Conclusions
This paper discussed VThreads, a novel VLIW CMP designed to provide lightweight OS-like services to deeply-embedded applications requiring very fast thread management capabilities, typically not provided by software implementations. VThreads, building on its predecessor, demonstrated better performance and power efficiency than the Leon3MP CMP in a variety of workloads, whilst providing much improved application optimization opportunities via its unique configurability. In the course of this work a number of micro-architectural issues were identified, and these are noted here as suggestions for future research to improve the performance of the processor. In particular, VThreads relies heavily on single-port memories, as the primary target was standard-cell technologies; modern 28 nm+ FPGAs provide high-speed (250 MHz operation on the Kintex7 fabric) dual-port RAM blocks which can be time-multiplexed to provide 4 independent R/W ports.
A third-generation micro-architecture will address this and allow for such multi-ported configurations. This would permit fetching from multiple HCs per clock, thus improving the performance of the IFE and permitting true SMT implementations instead of the current VMT approach. Further performance degradation was noted due to the use of a shallow (1-stage deep) pipeline between the clients addressing the banked DRAM.
This proved to be a performance bottleneck (particularly on FPGA targets) and will also be addressed. It is noted that the IFE FSM, responsible for the packing of LIWs (and the separation of embedded 32-bit constants), is perhaps overly complicated and limits the frequency of the VThreads design, particularly on 4-wide configurations (Fig 19). On the software side, we plan to provide more support for PThreads primitives and to integrate the Trimaran environment to allow the use of the predicated ISA. A video of a dual-context, dual-issue VThreads system executing the Mandelbrot benchmark in the context of the ENOSYS FP7 project can be seen at https://www.youtube.com/watch?feature=player_embedded&v=Ltp4xWcEqr0.
Acknowledgements
at Loughborough University and conducts research in CPU architecture and micro-architecture, and ESL methodologies amongst others. He was previously a Senior R&D Engineer/CPU architect for ARC International, working on various hardware projects in the embedded CPU domain. Prior to ARC he worked as an ASIC design engineer for a telecommunications organization. He is the Loughborough University Investigator and technical director in the ENOSYS FP7 project in which he contributed the tool-chain, automation and the FPGA System-on-Programmable-Chip (SoPC). He is the architect and designer of the LE1 and VThreads VLIW machines and a founder of Axilica Ltd, a Loughborough University spin-out company commercializing disruptive R&D into high-level (UML) behavioural synthesis. 
David Stevens
