48 research outputs found
Exploring the sequence length bottleneck in the Transformer for Image Captioning
Most recent state-of-the-art architectures rely on combinations and
variations of three approaches: convolutional, recurrent, and self-attentive
methods. Our work attempts to lay the groundwork for a new research direction
in sequence modeling based on the idea of modifying the sequence length. To
that end, we propose a new method called the "Expansion Mechanism", which
transforms the input sequence, either dynamically or statically, into a new
one featuring a different sequence length. Furthermore, we introduce a novel
architecture that exploits this method and achieves competitive performance on
the MS-COCO 2014 data set, yielding 134.6 and 131.4 CIDEr-D on the Karpathy
test split in the ensemble and single-model configurations, respectively, and
130 CIDEr-D on the official online evaluation server, despite being neither
recurrent nor fully attentive. At the same time, we address efficiency in our
design and introduce a convenient training strategy suitable for most
computational resources, in contrast to the standard one. Source code is
available at https://github.com/jchenghu/explorin
Parallel programming models and tools for many-core platforms
The trade-off between power consumption, performance, programmability, and portability drives all computing industry designs, particularly in the mobile and embedded systems domains.
Two design paradigms have proven particularly promising in this context: architectural heterogeneity and many-core processors.
Parallel programming models are key to effectively harnessing the computational power of heterogeneous many-core SoCs.
This thesis presents a set of techniques and HW/SW extensions that enable performance improvements and that simplify programmability for heterogeneous many-core platforms.
The thesis contributions cover vertically the entire software stack for many-core platforms, from hardware abstraction layers running on top of bare-metal, to programming models; from hardware extensions for efficient parallelism support to middleware that enables optimized resource management within many-core platforms.
First, we present mechanisms to decrease parallelism overheads on parallel programming runtimes for many-core platforms, targeting fine-grain parallelism.
Second, we present programming model support that enables the offload of computational kernels within heterogeneous many-core systems.
Third, we present a novel approach to dynamically sharing and managing many-core platforms when multiple applications coded with different programming models execute concurrently.
All these contributions were validated on STMicroelectronics STHORM, a real embodiment of a state-of-the-art many-core system. Hardware extensions and architectural explorations were carried out using VirtualSoC, a SystemC-based cycle-accurate simulator of many-core platforms.
A RISC-V-based FPGA Overlay to Simplify Embedded Accelerator Deployment
Modern cyber-physical systems (CPS) are increasingly adopting heterogeneous systems-on-chip (HeSoCs) as a computing platform to satisfy the demands of their sophisticated workloads. FPGA-based HeSoCs can reach high performance and energy efficiency at the cost of increased design complexity. High-Level Synthesis (HLS) can ease IP design, but automated tools still lack the maturity to efficiently and easily tackle system-level integration of the many hardware and software blocks included in a modern CPS. We present an innovative hardware overlay offering plug-and-play integration of HLS-compiled or handcrafted acceleration IPs thanks to a customizable wrapper attached to the overlay interconnect and providing shared-memory communication to the overlay cores. The latter are based on the open RISC-V ISA and offer simplified software management of the acceleration IP. Deploying the proposed overlay on a Xilinx ZU9EG shows ≈ 20% LUT usage and ≈ 4× speedup compared to program execution on the ARM host core.
On the Effectiveness of OpenMP teams for Programming Embedded Manycore Accelerators
With the introduction of more powerful and massively parallel embedded processors, embedded systems are becoming HPC capable. In particular, heterogeneous systems-on-chip (SoCs) that couple a general-purpose host processor to a many-core accelerator are becoming more and more widespread, and provide tremendous peak performance/watt, well suited to execute HPC-class programs. The increased computation potential is however traded off against ease of programming. Application developers are indeed required to manually outline code parts suitable for acceleration, parallelize them efficiently over the many available cores, and orchestrate data transfers to/from the accelerator. In addition, since most manycores are organized as a collection of clusters, featuring fast local communication but slow remote communication (i.e., to another cluster's local memory), the programmer should also take care of properly mapping the parallel computation so as to avoid poor data locality. OpenMP v4.0 introduces new constructs for computation offloading, as well as directives to deploy parallel computation in a cluster-aware manner. In this paper we assess the effectiveness of OpenMP v4.0 at exploiting the massive parallelism available in embedded heterogeneous SoCs, comparing it to standard parallel loops over several computation-intensive applications from the linear algebra and image processing domains.
Exploiting Robot Redundancy for Online Learning and Control
Accurate trajectory tracking in the task space is critical in many robotics applications. Model-based robot controllers are able to ensure very good tracking but lose effectiveness in the presence of model uncertainties. On the other hand, online learning-based control laws can handle poor dynamic modeling, as long as prediction errors are kept small and decrease over time. However, in the case of redundant robots directly controlled in the task space, this condition is not usually met. We present an online learning-based control framework that exploits robot redundancy so as to increase the overall performance and shorten the learning transient. The validity of the proposed approach is shown through a comparative study conducted in simulation on a KUKA LWR4+ robot.
The Importance of Worst-Case Memory Contention Analysis for Heterogeneous SoCs
Memory interference may heavily inflate task execution times in Heterogeneous
Systems-on-Chips (HeSoCs). Knowing worst-case interference is consequently
fundamental for supporting the correct execution of time-sensitive
applications. In most of the literature, worst-case interference is assumed to
be generated by, and is therefore estimated through, read-intensive synthetic
workloads with no caching. Yet these workloads do not always generate
worst-case interference; this is the central result reported in this work. By
testing on multiple architectures, we determined that the traffic pattern
generating the highest interference is actually hardware dependent, and that
making such assumptions can lead to a severe underestimation of the worst case
(in our case, by more than 9x). Comment: Accepted for presentation at the CPS
workshop 2023 (http://www.cpsschool.eu/cps-workshop).
HULK-V: a Heterogeneous Ultra-low-power Linux capable RISC-V SoC
IoT applications span a wide range in performance and memory footprint, under
tight cost and power constraints. High-end applications rely on power-hungry
Systems-on-Chip (SoCs) featuring powerful processors, large LPDDR/DDR3/4/5
memories, and supporting full-fledged Operating Systems (OS). On the contrary,
low-end applications typically rely on ultra-low-power microcontrollers with a
"close to metal" software environment and simple micro-kernel-based runtimes.
Emerging applications and trends of IoT require the "best of both worlds":
cheap and low-power SoC systems with a well-known and agile software
environment based on full-fledged OS (e.g., Linux), coupled with extreme energy
efficiency and parallel digital signal processing capabilities. We present
HULK-V: an open-source Heterogeneous Linux-capable RISC-V-based SoC coupling a
64-bit RISC-V processor with an 8-core Programmable Multi-Core Accelerator
(PMCA), delivering up to 13.8 GOps, up to 157 GOps/W and accelerating the
execution of complex DSP and ML tasks by up to 112x over the host processor.
HULK-V leverages a lightweight, fully digital memory hierarchy based on
HyperRAM IoT DRAM that exposes up to 512 MB of DRAM memory to the host CPU.
Thanks to HyperRAM, HULK-V doubles the energy efficiency without significant
performance loss compared to designs featuring power-hungry LPDDR memories,
which require expensive and large mixed-signal PHYs. HULK-V, implemented in Global Foundries
22nm FDX technology, is a fully digital ultra-low-cost SoC running a 64-bit
Linux software stack with OpenMP host-to-PMCA offload within a power envelope
of just 250 mW. Comment: This paper has been accepted as a full paper at DATE23
(https://www.date-conference.com/date-2023-accepted-papers#Regular-Paper).