12 research outputs found
Estimating the Potential Speedup of Computer Vision Applications on Embedded Multiprocessors
Computer vision applications constitute one of the key drivers for embedded
multicore architectures. Although the number of available cores is increasing
in new architectures, designing an application to maximize the utilization of
the platform is still a challenge. In this sense, parallel performance
prediction tools can aid developers in understanding the characteristics of an
application and finding the most adequate parallelization strategy. In this
work, we present a method for early parallel performance estimation on embedded
multiprocessors from sequential application traces. We describe its
implementation in Parana, a fast trace-driven simulator targeting OpenMP
applications on the STMicroelectronics' STxP70 Application-Specific
Multiprocessor (ASMP). Results for the FAST key point detector application show
an error margin of less than 10% compared to the reference cycle-approximate
simulator, with lower modeling effort and up to 20x faster execution time.Comment: Presented at DATE Friday Workshop on Heterogeneous Architectures and
Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241
Opportunistic acceleration of array-centric Python computation in heterogeneous environments
Dynamic scripting languages, like Python, are growing in popularity and increasingly used by non-expert programmers. These languages provide high level abstractions such as safe memory management, dynamic type handling and array bounds checking. The reduction in boilerplate code enables the concise expression of computation compared to statically typed and compiled languages. This improves programmer productivity. Increasingly, scripting languages are used by domain experts to write numerically intensive code in a variety of domains (e.g. Economics, Zoology, Archaeology and Physics). These programs are often used not just for prototyping but also in deployment. However, such managed program execution comes with a significant performance penalty arising from the interpreter having to decode and dispatch based on dynamic type checking.
Modern computer systems are increasingly equipped with accelerators such as GPUs. However, the massive speedups that can be achieved by GPU accelerators come at the cost of program complexity. Directly programming a GPU requires a deep understanding of the computational model of the underlying hardware architecture. While the complexity of such devices is abstracted by programming languages specialised for heterogeneous devices such as CUDA and OpenCL, these are dialects of the low-level C systems programming language used primarily by expert programmers.
This thesis presents the design and implementation of ALPyNA, a loop parallelisation and GPU code generation framework. A novel staged parallelisation approach is used to aggressively parallelise each execution instance of a loop nest. Loop dependence relationships that cannot be inferred statically are deferred for runtime analysis. At runtime, these dependences are augmented with runtime information obtained by introspection and the loop nest is parallelised. Parallel GPU kernels are customised to the runtime dependence graph, JIT compiled and executed.
A systematic analysis of the execution speed of loop nests is performed using 12 standard loop intensive benchmarks. The evaluation is performed on two CPU–GPU machines. One is a server grade machine while the other is a typical desktop. ALPyNA’s GPU kernels achieve orders of magnitude speedup over the baseline interpreter execution time (up to 16435x) and large speedups (up to 179.55x) over JIT compiled CPU code.
The varied performance of JIT compiled GPU code motivates the need for a sophisticated cost model to select the device providing the best speedups at runtime for varying domain sizes. This thesis describes a novel lightweight analytical cost model to determine the fastest device to execute a loop nest at runtime. The ALPyNA Cost Model (ACM) adapts to runtime dependence analysis and is parameterised on the hardware characteristics of the underlying target CPU or GPU. The cost model also takes into account the relative rate at which the interpreter is able to supply the GPU with computational work. ACM is re-targetable to other accelerator devices and only requires minimal install time profiling
Recommended from our members
Logical partitioning of parallel system simulations
Simulation has been a fundamental tool to prototype, hypothesize, and evaluate
new ideas to continue improving system performance. However, increasing levels
of processor parallelism and heterogeneity have introduced additional
constraints when evaluating new designs. The work embodied in this dissertation
explores how to leverage novel ideas in simulator partitioning to improve
simulator speed and flexibility for simulating these new types of systems.
The contribution of this work includes the introduction of optimistic
partitioned simulation to improve parallelization, and the introduction of
warped partitioned simulation for improved flexibility. These ideas are refined
and demonstrated through the use of prototypes to demonstrate their benefits
compared to state-of-the-art approaches. By leveraging partitioning in a
structured manner, it is possible to design simulators that better address the
open challenges of parallel and heterogeneous systems design.Electrical and Computer Engineerin
Design of efficient Java communications for high performance computing
[Abstract] There is an increasing interest to adopt Java as the parallel programming language for the multi-core
era. Although Java offers important advantages, such as built-in multithreading and networking support,
productivity and portability, the lack of efficient communication middleware is an important drawback
for its uptake in High Performance Computing (HPC). This PhD Thesis presents the design, implementation
and evaluation of several solutions to improve this situation: (1) a high performance Java sockets
implementation (JFS, Java Fast Sockets) on high-speed networks (e.g., Myrinet, InfiniBand) and shared
memory (e.g., multi-core) machines; (2) a low-level messaging device, iodev, which efficiently overlaps
communication and computation; and (3) a more scalable Java message-passing library, Fast MPJ (F-MPJ).
Furthermore, new Java parallel benchmarks have been implemented and used for the performance evaluation
of the developed middleware. The final and main conclusion is that the use of Java for HPC is feasible
and even advisable when looking for productive development, provided that efficient communication
middleware is made available, such as the projects presented in this Thesis.[Resumen] La tesis doctoral "Design of Efficient Java Communications for High Performance Computing"
parte de la hipótesis inicial de que es posible desarrollar aplicaciones Java en computación
de altas prestaciones, un ámbito en el que el rendimiento es crucial, siempre que esté
disponible un middleware de comunicación eficiente. Así, se han diseñado, desarrollado y
evaluado diferentes bibliotecas de comunicación en Java, desde el nivel de sockets al de
paso de mensajes, obteniendo notables incrementos de eficiencia, confirmando que la hipótesis
inicial es factible
On Information-centric Resiliency and System-level Security in Constrained, Wireless Communication
The Internet of Things (IoT) interconnects many heterogeneous embedded devices either locally between each other, or globally with the Internet. These things are resource-constrained, e.g., powered by battery, and typically communicate via low-power and lossy wireless links. Communication needs to be secured and relies on crypto-operations that are often resource-intensive and in conflict with the device constraints. These challenging operational conditions on the cheapest hardware possible, the unreliable wireless transmission, and the need for protection against common threats of the inter-network, impose severe challenges to IoT networks. In this thesis, we advance the current state of the art in two dimensions.
Part I assesses Information-centric networking (ICN) for the IoT, a network paradigm that promises enhanced reliability for data retrieval in constrained edge networks. ICN lacks a lower layer definition, which, however, is the key to enable device sleep cycles and exclusive wireless media access. This part of the thesis designs and evaluates an effective media access strategy for ICN to reduce the energy consumption and wireless interference on constrained IoT nodes.
Part II examines the performance of hardware and software crypto-operations, executed on off-the-shelf IoT platforms. A novel system design enables the accessibility and auto-configuration of crypto-hardware through an operating system. One main focus is the generation of random numbers in the IoT. This part of the thesis further designs and evaluates Physical Unclonable Functions (PUFs) to provide novel randomness sources that generate highly unpredictable secrets, on low-cost devices that lack hardware-based security features.
This thesis takes a practical view on the constrained IoT and is accompanied by real-world implementations and measurements. We contribute open source software, automation tools, a simulator, and reproducible measurement results from real IoT deployments using off-the-shelf hardware. The large-scale experiments in an open access testbed provide a direct starting point for future research
Doctor of Philosophy
dissertationNetwork emulation has become an indispensable tool for the conduct of research in networking and distributed systems. It offers more realism than simulation and more control and repeatability than experimentation on a live network. However, emulation testbeds face a number of challenges, most prominently realism and scale. Because emulation allows the creation of arbitrary networks exhibiting a wide range of conditions, there is no guarantee that emulated topologies reflect real networks; the burden of selecting parameters to create a realistic environment is on the experimenter. While there are a number of techniques for measuring the end-to-end properties of real networks, directly importing such properties into an emulation has been a challenge. Similarly, while there exist numerous models for creating realistic network topologies, the lack of addresses on these generated topologies has been a barrier to using them in emulators. Once an experimenter obtains a suitable topology, that topology must be mapped onto the physical resources of the testbed so that it can be instantiated. A number of restrictions make this an interesting problem: testbeds typically have heterogeneous hardware, scarce resources which must be conserved, and bottlenecks that must not be overused. User requests for particular types of nodes or links must also be met. In light of these constraints, the network testbed mapping problem is NP-hard. Though the complexity of the problem increases rapidly with the size of the experimenter's topology and the size of the physical network, the runtime of the mapper must not; long mapping times can hinder the usability of the testbed. This dissertation makes three contributions towards improving realism and scale in emulation testbeds. First, it meets the need for realistic network conditions by creating Flexlab, a hybrid environment that couples an emulation testbed with a live-network testbed, inheriting strengths from each. Second, it attends to the need for realistic topologies by presenting a set of algorithms for automatically annotating generated topologies with realistic IP addresses. Third, it presents a mapper, assign, that is capable of assigning experimenters' requested topologies to testbeds' physical resources in a manner that scales well enough to handle large environments
Design and Evaluation of Low-Latency Communication Middleware on High Performance Computing Systems
[Resumen]El interés en Java para computación paralela está motivado por sus interesantes
características, tales como su soporte multithread, portabilidad, facilidad de aprendizaje,alta productividad y el aumento significativo en su rendimiento omputacional.
No obstante, las aplicaciones paralelas en Java carecen generalmente de mecanismos
de comunicación eficientes, los cuales utilizan a menudo protocolos basados
en sockets incapaces de obtener el máximo provecho de las redes de baja latencia,
obstaculizando la adopción de Java en computación de altas prestaciones (High Per-
formance Computing, HPC). Esta Tesis Doctoral presenta el diseño, implementación
y evaluación de soluciones de comunicación en Java que superan esta limitación. En
consecuencia, se desarrollaron múltiples dispositivos de comunicación a bajo nivel
para paso de mensajes en Java (Message-Passing in Java, MPJ) que aprovechan al
máximo el hardware de red subyacente mediante operaciones de acceso directo a memoria remota que proporcionan comunicaciones de baja latencia. También se incluye una biblioteca de paso de mensajes en Java totalmente funcional, FastMPJ, en la
cual se integraron los dispositivos de comunicación. La evaluación experimental ha
mostrado que las primitivas de comunicación de FastMPJ son competitivas en comparación con bibliotecas nativas, aumentando significativamente la escalabilidad de
aplicaciones MPJ. Por otro lado, esta Tesis analiza el potencial de la computación en
la nube (cloud computing) para HPC, donde el modelo de distribución de infraestructura
como servicio (Infrastructure as a Service, IaaS) emerge como una alternativa
viable a los sistemas HPC tradicionales. La evaluación del rendimiento de recursos
cloud específicos para HPC del proveedor líder, Amazon EC2, ha puesto de manifiesto el impacto significativo que la virtualización impone en la red, impidiendo
mover las aplicaciones intensivas en comunicaciones a la nube. La clave reside en un soporte de virtualización apropiado, como el acceso directo al hardware de red, junto
con las directrices para la optimización del rendimiento sugeridas en esta Tesis.[Resumo]O interese en Java para computación paralela está motivado polas súas interesantes características, tales como o seu apoio multithread, portabilidade, facilidade de aprendizaxe, alta produtividade e o aumento signi cativo no seu rendemento computacional. No entanto, as aplicacións paralelas en Java carecen xeralmente de mecanismos de comunicación e cientes, os cales adoitan usar protocolos baseados en sockets que son incapaces de obter o máximo proveito das redes de baixa latencia, obstaculizando a adopción de Java na computación de altas prestacións (High
Performance Computing, HPC). Esta Tese de Doutoramento presenta o deseño, implementaci
ón e avaliación de solucións de comunicación en Java que superan esta limitación. En consecuencia, desenvolvéronse múltiples dispositivos de comunicación a baixo nivel para paso de mensaxes en Java (Message-Passing in Java, MPJ) que aproveitan ao máaximo o hardware de rede subxacente mediante operacións de acceso
directo a memoria remota que proporcionan comunicacións de baixa latencia.
Tamén se inclúe unha biblioteca de paso de mensaxes en Java totalmente funcional,
FastMPJ, na cal foron integrados os dispositivos de comunicación. A avaliación experimental amosou que as primitivas de comunicación de FastMPJ son competitivas
en comparación con bibliotecas nativas, aumentando signi cativamente a escalabilidade
de aplicacións MPJ. Por outra banda, esta Tese analiza o potencial da computación na nube (cloud computing) para HPC, onde o modelo de distribución de infraestrutura como servizo (Infrastructure as a Service, IaaS) xorde como unha alternativa viable aos sistemas HPC tradicionais. A ampla avaliación do rendemento de recursos cloud específi cos para HPC do proveedor líder, Amazon EC2, puxo de manifesto o impacto signi ficativo que a virtualización impón na rede, impedindo mover as aplicacións intensivas en comunicacións á nube. A clave atópase no soporte de virtualización apropiado, como o acceso directo ao hardware de rede, xunto coas directrices para a optimización do rendemento suxeridas nesta Tese.[Abstract]The use of Java for parallel computing is becoming more promising owing to
its appealing features, particularly its multithreading support, portability, easy-tolearn properties, high programming productivity and the noticeable improvement in its computational performance. However, parallel Java applications generally su er
from inefficient communication middleware, most of which use socket-based protocols
that are unable to take full advantage of high-speed networks, hindering the
adoption of Java in the High Performance Computing (HPC) area. This PhD Thesis
presents the design, development and evaluation of scalable Java communication
solutions that overcome these constraints. Hence, we have implemented several lowlevel
message-passing devices that fully exploit the underlying network hardware while taking advantage of Remote Direct Memory Access (RDMA) operations to provide low-latency communications. Moreover, we have developed a productionquality Java message-passing middleware, FastMPJ, in which the devices have been integrated seamlessly, thus allowing the productive development of Message-Passing in Java (MPJ) applications. The performance evaluation has shown that FastMPJ communication primitives are competitive with native message-passing libraries, improving signi cantly the scalability of MPJ applications. Furthermore, this Thesis
has analyzed the potential of cloud computing towards spreading the outreach of
HPC, where Infrastructure as a Service (IaaS) o erings have emerged as a feasible
alternative to traditional HPC systems. Several cloud resources from the leading
IaaS provider, Amazon EC2, which speci cally target HPC workloads, have been
thoroughly assessed. The experimental results have shown the signi cant impact
that virtualized environments still have on network performance, which hampers
porting communication-intensive codes to the cloud. The key is the availability of
the proper virtualization support, such as the direct access to the network hardware,
along with the guidelines for performance optimization suggested in this Thesis
Feasibility Study of High-Level Synthesis : Implementation of a Real-Time HEVC Intra Encoder on FPGA
High-Level Synthesis (HLS) on automatisoitu suunnitteluprosessi, joka pyrkii parantamaan tuottavuutta perinteisiin suunnittelumenetelmiin verrattuna, nostamalla suunnittelun abstraktiota rekisterisiirtotasolta (RTL) käyttäytymistasolle. Erilaisia kaupallisia HLS-työkaluja on ollut markkinoilla aina 1990-luvulta lähtien, mutta vasta äskettäin ne ovat alkaneet saada hyväksyntää teollisuudessa sekä akateemisessa maailmassa. Hidas käyttöönottoaste on johtunut pääasiassa huonommasta tulosten laadusta (QoR) kuin mitä on ollut mahdollista tavanomaisilla laitteistokuvauskielillä (HDL). Uusimmat HLS-työkalusukupolvet ovat kuitenkin kaventaneet QoR-aukkoa huomattavasti.
Tämä väitöskirja tutkii HLS:n soveltuvuutta videokoodekkien kehittämiseen. Se esittelee useita HLS-toteutuksia High Efficiency Video Coding (HEVC) -koodaukselle, joka on keskeinen mahdollistava tekniikka lukuisille nykyaikaisille mediasovelluksille. HEVC kaksinkertaistaa koodaustehokkuuden edeltäjäänsä Advanced Video Coding (AVC) -standardiin verrattuna, saavuttaen silti saman subjektiivisen visuaalisen laadun. Tämä tyypillisesti saavutetaan huomattavalla laskennallisella lisäkustannuksella. Siksi reaaliaikainen HEVC vaatii automatisoituja suunnittelumenetelmiä, joita voidaan käyttää rautatoteutus- (HW ) ja varmennustyön minimoimiseen.
Tässä väitöskirjassa ehdotetaan HLS:n käyttöä koko enkooderin suunnitteluprosessissa. Dataintensiivisistä koodaustyökaluista, kuten intra-ennustus ja diskreetit muunnokset, myös enemmän kontrollia vaativiin kokonaisuuksiin, kuten entropiakoodaukseen. Avoimen lähdekoodin Kvazaar HEVC -enkooderin C-lähdekoodia hyödynnetään tässä työssä referenssinä HLS-suunnittelulle sekä toteutuksen varmentamisessa. Suorituskykytulokset saadaan ja raportoidaan ohjelmoitavalla porttimatriisilla (FPGA).
Tämän väitöskirjan tärkein tuotos on HEVC intra enkooderin prototyyppi. Prototyyppi koostuu Nokia AirFrame Cloud Server palvelimesta, varustettuna kahdella 2.4 GHz:n 14-ytiminen Intel Xeon prosessorilla, sekä kahdesta Intel Arria 10 GX FPGA kiihdytinkortista, jotka voidaan kytkeä serveriin käyttäen joko peripheral component interconnect express (PCIe) liitäntää tai 40 gigabitin Ethernettiä. Prototyyppijärjestelmä saavuttaa reaaliaikaisen 4K enkoodausnopeuden, jopa 120 kuvaa sekunnissa. Lisäksi järjestelmän suorituskykyä on helppo skaalata paremmaksi lisäämällä järjestelmään käytännössä minkä tahansa määrän verkkoon kytkettäviä FPGA-kortteja.
Monimutkaisen HEVC:n tehokas mallinnus ja sen monipuolisten ominaisuuksien mukauttaminen reaaliaikaiselle HW HEVC enkooderille ei ole triviaali tehtävä, koska HW-toteutukset ovat perinteisesti erittäin aikaa vieviä. Tämä väitöskirja osoittaa, että HLS:n avulla pystytään nopeuttamaan kehitysaikaa, tarjoamaan ennen näkemätöntä suunnittelun skaalautuvuutta, ja silti osoittamaan kilpailukykyisiä QoR-arvoja ja absoluuttista suorituskykyä verrattuna olemassa oleviin toteutuksiin.High-Level Synthesis (HLS) is an automated design process that seeks to improve productivity over traditional design methods by increasing design abstraction from register transfer level (RTL) to behavioural level. Various commercial HLS tools have been available on the market since the 1990s, but only recently they have started to gain adoption across industry and academia. The slow adoption rate has mainly stemmed from lower quality of results (QoR) than obtained with conventional hardware description languages (HDLs). However, the latest HLS tool generations have substantially narrowed the QoR gap.
This thesis studies the feasibility of HLS in video codec development. It introduces several HLS implementations for High Efficiency Video Coding (HEVC) , that is the key enabling technology for numerous modern media applications. HEVC doubles the coding efficiency over its predecessor Advanced Video Coding (AVC) standard for the same subjective visual quality, but typically at the cost of considerably higher computational complexity. Therefore, real-time HEVC calls for automated design methodologies that can be used to minimize the HW implementation and verification effort.
This thesis proposes to use HLS throughout the whole encoder design process. From data-intensive coding tools, like intra prediction and discrete transforms, to more control-oriented tools, such as entropy coding. The C source code of the open-source Kvazaar HEVC encoder serves as a design entry point for the HLS flow, and it is also utilized in design verification. The performance results are gathered with and reported for field programmable gate array (FPGA) .
The main contribution of this thesis is an HEVC intra encoder prototype that is built on a Nokia AirFrame Cloud Server equipped with 2.4 GHz dual 14-core Intel Xeon processors and two Intel Arria 10 GX FPGA Development Kits, that can be connected to the server via peripheral component interconnect express (PCIe) generation 3 or 40 Gigabit Ethernet. The proof-of-concept system achieves real-time.
4K coding speed up to 120 fps, which can be further scaled up by adding practically any number of network-connected FPGA cards.
Overcoming the complexity of HEVC and customizing its rich features for a real-time HEVC encoder implementation on hardware is not a trivial task, as hardware development has traditionally turned out to be very time-consuming. This thesis shows that HLS is able to boost the development time, provide previously unseen design scalability, and still result in competitive performance and QoR over state-of-the-art hardware implementations