17 research outputs found

    Object tracking using a many-core embedded system

    Object localization and tracking is essential for many practical applications, such as human-computer interaction, security and surveillance, robot competitions, and Industry 4.0. Because of the large amount of data present in an image, and the algorithmic complexity involved, this task can be computationally demanding, mainly for traditional embedded systems, due to their processing and storage limitations. This calls for investigation and experimentation with new approaches, such as emerging heterogeneous embedded systems, which promise higher performance without compromising energy efficiency. This work explores several real-time color-based object tracking techniques, applied to images supplied by an RGB-D sensor attached to different embedded platforms. The main motivation was to explore a heterogeneous Parallella board with a 16-core Epiphany coprocessor, in order to reduce image processing time. Another goal was to compare this platform with more conventional embedded systems, namely the popular Raspberry Pi family. To this end, several processing options were pursued, from low-level implementations specially tailored to the Parallella to higher-level multi-platform approaches. The results lead to the conclusion that the programming effort required to use the Epiphany coprocessor efficiently is considerable. Moreover, for the selected case study, the performance attained was below that offered by simpler approaches running on quad-core Raspberry Pi boards.
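    The abstract above does not say which color-based tracking techniques were implemented. As a rough illustration of the general idea only, the following sketch thresholds each pixel against a per-channel RGB range and returns the centroid of the matching pixels. The function name `track_color`, the nested-list image layout, and the threshold values are assumptions made for this example; they are not taken from the thesis.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255 (assumed layout)


def track_color(
    image: List[List[Pixel]], lower: Pixel, upper: Pixel
) -> Optional[Tuple[float, float]]:
    """Return the (row, col) centroid of pixels inside [lower, upper], or None.

    Hypothetical helper: a pixel matches when every channel lies within the
    inclusive range given by the corresponding channels of lower and upper.
    """
    row_sum = 0
    col_sum = 0
    count = 0
    for r, row in enumerate(image):
        for c, pixel in enumerate(row):
            if all(lo <= v <= hi for v, lo, hi in zip(pixel, lower, upper)):
                row_sum += r
                col_sum += c
                count += 1
    if count == 0:
        return None  # no pixel of the target color in this frame
    return (row_sum / count, col_sum / count)
```

    Each pixel is tested independently of the others, so the image can be split into tiles and processed by several cores, with only the partial sums and counts combined at the end. That property is what makes this kind of workload a candidate for many-core boards.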

    An OpenMP Programming Environment on Mobile Devices


    Piattaforme multicore e integrazione tri-dimensionale: analisi architetturale e ottimizzazione

    Modern embedded systems embrace many-core shared-memory designs. Due to constrained power and area budgets, most of them feature software-managed scratchpad memories instead of data caches to increase data locality. It is therefore the programmer's responsibility to manage memory transfers explicitly, which makes programming these platforms cumbersome. Moreover, complex modern applications must be adequately parallelized before they can turn the parallel potential of the platform into actual performance. To support this, programming languages have been proposed that work at a high level of abstraction and rely on a runtime whose cost hinders performance, especially in embedded systems, where resources and power budget are constrained. This dissertation explores the applicability of the shared-memory paradigm to modern many-core systems, focusing on ease of programming. It focuses on OpenMP, the de facto standard for shared-memory programming. In the first part, the costs of algorithms for synchronization and data partitioning are analyzed, and the algorithms are adapted to modern embedded many-cores. Then, the original design of an OpenMP runtime library is presented, which supports complex forms of parallelism such as multi-level and irregular parallelism. In the second part of the thesis, the focus is on heterogeneous systems, where hardware accelerators are coupled to (many-)cores to implement key functional kernels with orders-of-magnitude gains in speed and energy efficiency compared to the "pure software" version. However, three main issues arise, namely i) platform design complexity, ii) architectural scalability, and iii) programmability. To tackle them, a template for a generic hardware processing unit (HWPU) is proposed, which shares the memory banks with the cores, and a template for a scalable architecture is shown, which integrates the HWPUs through the shared-memory system. Then, a full software stack and toolchain are developed to support platform design and to let programmers exploit the accelerators of the platform. The OpenMP frontend is extended to interact with it.
    Modern embedded systems are many-core architectures in which the memory space is often shared among the processors. To reduce power consumption, many of these architectures replace data caches with software-managed scratchpad memories, to maximize data locality to the CPUs and increase performance. This means that data must be moved manually by the programmer. Moreover, turning the enormous potential parallelism of many-core platforms into performance is not simple. To support programming, several programming models have been proposed; because they work at a high level of abstraction, they rely on runtime libraries that provide basic services such as synchronization, memory allocation, and threading. These libraries have a cost, which in embedded systems is too high and hinders reaching full performance. This thesis analyzes how a high-level programming model, OpenMP, can be supported efficiently if its software stack is adapted to make the best use of the underlying platform. In the first part, I study several synchronization and communication mechanisms between parallel threads, ported to many-core platforms. I then use them to write an OpenMP support runtime that is as efficient and lightweight as possible and that supports multi-level and irregular parallelism, which are often found in modern applications. The second part of the thesis explores heterogeneous architectures, that is, architectures with hardware accelerators. These architectures suffer from problems of i) platform design, ii) scalability of the platform itself (growth in the number of accelerators and processors), and iii) programmability. The thesis proposes solutions to all three problems. The programming language used is OpenMP, both for its great semantic expressiveness and because it is the de facto standard for programming shared-memory systems.

    Improving MPI Threading Support for Current Hardware Architectures

    Threading support for the Message Passing Interface (MPI) has been defined in the MPI standard for more than twenty years. While many standard-compliant MPI implementations fully support multithreading, threading support in MPI still cannot deliver performance on par with the non-threaded environment. This performance disparity leads to a low adoption rate among applications and, eventually, to less interest in optimizing MPI threading support. However, with current advances in computing hardware, the number of CPU cores per package is growing drastically, and shared-memory MPI communication has become more costly. MPI threading without local communication is one alternative, and some interest is shifting back toward threading in MPI. In this work, we investigate different approaches to leveraging the power of thread parallelism, and tools to help raise multi-threaded MPI performance to a reasonable level. We propose a novel multi-threaded MPI benchmark with multiple communication patterns to stress multiple points of the MPI implementation, with the ability to switch between MPI processes and threads for quick comparison between the two modes. This enables us, and other MPI developers, to stress-test implementation designs. We address the interoperability between MPI implementations and threading frameworks by introducing the thread-synchronization object, an object that gives the MPI implementation more control over user-level threads, allowing for better thread utilization in MPI. In our implementation, the synchronization object relieves lock contention on the internal progress engine and achieves up to 7x the performance of the original implementation. Moving forward, we explore the possibility of harnessing true thread concurrency and propose several strategies to address the bottlenecks in the MPI implementation. In our evaluation, our novel threading optimization achieves up to 22x the performance of the legacy MPI designs.

    Modelli e strumenti di programmazione parallela per piattaforme many-core

    The trade-off between power consumption, performance, programmability, and portability drives all computing industry designs, in particular in the mobile and embedded systems domains. Two design paradigms have proven particularly promising in this context: architectural heterogeneity and many-core processors. Parallel programming models are key to effectively harnessing the computational power of heterogeneous many-core SoCs. This thesis presents a set of techniques and HW/SW extensions that enable performance improvements and simplify programmability for heterogeneous many-core platforms. The contributions cover the entire software stack for many-core platforms vertically: from hardware abstraction layers running on bare metal to programming models, and from hardware extensions for efficient parallelism support to middleware that enables optimized resource management within many-core platforms. First, we present mechanisms to decrease parallelism overheads in parallel programming runtimes for many-core platforms, targeting fine-grain parallelism. Second, we present programming model support that enables the offload of computational kernels within heterogeneous many-core systems. Third, we present a novel approach to dynamically sharing and managing many-core platforms when multiple applications coded with different programming models execute concurrently. All these contributions were validated using STMicroelectronics STHORM, a real embodiment of a state-of-the-art many-core system. Hardware extensions and architectural explorations were carried out using VirtualSoC, a SystemC-based cycle-accurate simulator of many-core platforms.

    Providing Insight into the Performance of Distributed Applications Through Low-Level Metrics

    The field of high-performance computing (HPC) has always dealt with the bleeding edge of computational hardware and software to achieve the maximum possible performance for a wide variety of workloads. When dealing with brand-new technologies, it can be difficult to understand how these technologies work and why they work the way they do. One of the more prevalent approaches to providing insight into modern hardware and software is to provide tools that allow developers to access low-level metrics about their performance. The modern HPC ecosystem supports a wide array of technologies, but in this work I focus on two particularly influential ones: the Message Passing Interface (MPI) and Graphics Processing Units (GPUs). For many years, MPI has been the dominant programming paradigm in HPC. Indeed, over 90% of the applications that are part of the U.S. Exascale Computing Project plan to use MPI in some fashion. The MPI Standard provides programmers with a wide variety of methods to communicate between processes, along with several other capabilities. The high-level MPI Profiling Interface has been the primary method for profiling MPI applications since the inception of the MPI Standard, and more recently the low-level MPI Tool Information Interface was introduced. Accelerators like GPUs have been increasingly adopted as the primary computational workhorse of modern supercomputers. GPUs provide more parallelism than traditional CPUs through a hierarchical grid of lightweight processing cores. NVIDIA provides profiling tools for its GPUs that give access to low-level hardware metrics. In this work, I propose research in applying low-level metrics to both the MPI and GPU paradigms, in the form of an implementation of low-level metrics for MPI and a new method for analyzing GPU load imbalance with a synthetic efficiency metric. I introduce Software-based Performance Counters (SPCs) to expose internal metrics of the Open MPI implementation, along with a new interface for exposing these counters to users and tool developers. I also analyze a modified load imbalance formula for GPU-based applications that uses low-level hardware metrics provided through nvprof in a hierarchical approach to take the internal load imbalance of the GPU into account.

    Toward Reliable and Efficient Message Passing Software for HPC Systems: Fault Tolerance and Vector Extension

    As the scale of high-performance computing (HPC) systems continues to grow, researchers devote themselves to achieving the best performance when running long computing jobs on these systems. My research focuses on the reliability and efficiency of HPC software. First, as systems become larger, their mean time to failure (MTTF) is negatively impacted and tends to decrease, so handling system failures becomes a prime challenge. My research presents a general design and implementation of an efficient runtime-level failure detection and propagation strategy targeting large-scale, dynamic systems, able to detect both node and process failures. It uses multiple overlapping topologies to optimize detection and propagation, minimizing the incurred overhead and guaranteeing the scalability of the entire framework. Results from different machines and benchmarks, compared to related work, show that my design and implementation significantly outperforms non-HPC solutions and is competitive with specialized HPC solutions that can manage only MPI applications. Second, I explore instruction-level parallelism to achieve optimal performance. Novel processors support long vector extensions, which enable researchers to exploit the potential peak performance of target architectures. Intel introduced the Advanced Vector Extensions (AVX2 and AVX-512) for the x86 Instruction Set Architecture (ISA). Arm introduced the Scalable Vector Extension (SVE) with a new set of A64 instructions. Both enable greater parallelism. My research uses long vector reduction instructions to improve the performance of MPI reduction operations. I also use the gather and scatter features to speed up the packing and unpacking operations in MPI. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architectures.

    Parallel cryptanalysis

    Most of today’s cryptographic primitives are based on computations that are hard to perform for a potential attacker but easy to perform for somebody who is in possession of some secret information, the key, that opens a back door in these hard computations and allows them to be solved in a small amount of time. To estimate the strength of a cryptographic primitive, it is important to know how hard it is to perform the computation without knowledge of the secret back door and to understand how much money or time the attacker has to spend. Usually a cryptographic primitive allows the cryptographer to choose parameters that make an attack harder at the cost of making the computations using the secret key harder as well. Designing a cryptographic primitive therefore poses the dilemma of choosing the parameters strong enough to resist an attack up to a certain cost while choosing them small enough to allow usage of the primitive in the real world, e.g. on small computing devices like smartphones. This thesis investigates three different attacks on particular cryptographic systems: Wagner’s generalized birthday attack is applied to the compression function of the hash function FSB. Pollard’s rho algorithm is used for attacking Certicom’s ECC Challenge ECC2K-130. The implementation of the XL algorithm has not been specialized for an attack on a specific cryptographic primitive but can be used for attacking some cryptographic primitives by solving multivariate quadratic systems. All three attacks are general attacks, i.e. they apply to various cryptographic systems; the implementations of Wagner’s generalized birthday attack and Pollard’s rho algorithm can be adapted for attacking primitives other than those given in this thesis. The three attacks have been implemented on different parallel architectures. XL has been parallelized using the Block Wiedemann algorithm on a NUMA system using OpenMP and on an InfiniBand cluster using MPI. Wagner’s attack was performed on a distributed system of 8 multi-core nodes connected by an Ethernet network. The work on Pollard’s rho algorithm is part of a large research collaboration with several research groups; the computations are embarrassingly parallel and are executed in a distributed fashion in several facilities with almost negligible communication cost. This dissertation presents implementations of the iteration function of Pollard’s rho algorithm on Graphics Processing Units and on the Cell Broadband Engine.
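    For readers unfamiliar with Pollard's rho method, the following is a minimal single-walk sketch of the discrete-logarithm variant. It deliberately swaps in a simpler setting than the thesis: a prime-order subgroup of the integers modulo a prime rather than the ECC2K-130 elliptic curve, and Floyd cycle detection rather than a parallel, distributed search. The function name `pollard_rho_dlog`, the three-way partition of the walk, and the toy parameters in the usage note are assumptions chosen for illustration.

```python
import random


def pollard_rho_dlog(g: int, h: int, p: int, n: int, seed: int = 0) -> int:
    """Find x with pow(g, x, p) == h, assuming g has prime order n modulo p.

    Illustrative sketch only: every walk state (y, a, b) keeps the
    invariant y == g**a * h**b (mod p).
    """
    rng = random.Random(seed)

    def step(y, a, b):
        # The branch depends only on y, so two walks that reach the same y
        # follow the same path afterwards (this is what creates the "rho").
        branch = y % 3
        if branch == 0:
            return (y * y) % p, (2 * a) % n, (2 * b) % n
        if branch == 1:
            return (y * g) % p, (a + 1) % n, b
        return (y * h) % p, a, (b + 1) % n

    while True:
        a0, b0 = rng.randrange(n), rng.randrange(n)
        y0 = (pow(g, a0, p) * pow(h, b0, p)) % p
        tortoise = hare = (y0, a0, b0)
        while True:
            # Floyd cycle detection: the hare moves two steps per tortoise step.
            tortoise = step(*tortoise)
            hare = step(*step(*hare))
            if tortoise[0] == hare[0]:
                break
        _, a1, b1 = tortoise
        _, a2, b2 = hare
        db = (b2 - b1) % n
        if db == 0:
            continue  # degenerate collision; retry from a new random start
        # g**a1 * h**b1 == g**a2 * h**b2  =>  x == (a1 - a2) / (b2 - b1) mod n
        return ((a1 - a2) * pow(db, -1, n)) % n
```

    With the toy parameters p = 1019, n = 509 and g = 4 (an element of order 509), calling `pollard_rho_dlog(4, pow(4, 123, 1019), 1019, 509)` should recover the exponent 123. The three-argument `pow` with a negative exponent requires Python 3.8 or later.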