78 research outputs found
Efficient multicore implementation of NAS benchmarks with FastFlow
The thesis describes an efficient implementation of a subset of the NPB algorithms for multicore architectures with the FastFlow framework. The NPB is a specification of numeric benchmarks used to compare different environments and implementations. FastFlow is a framework, targeted at shared-memory systems, that supports parallel algorithms based on structured parallel programming. Starting from the NPB specification, the thesis selects a subset of the NPB algorithms and derives efficient sequential and parallel implementations, the latter through FastFlow. Finally, experiments on state-of-the-art multicore architectures compare the derived code with the reference implementation provided by the NPB authors
Support for Independent Octave Computations on Shared-Memory Multiprocessors
The objective of this project is to develop a library for Octave capable of parallelizing
functions on shared memory systems. In this case, the main problem that the library is
addressing is optimization algorithms for functions without derivatives.
Octave is open source, which means anyone can create a library for any problem they
may have, which is why there are several, including libraries that run Octave functions
in parallel, such as the 'parallel' package. However, that package was mainly developed
to parallelize using different machines in distributed systems, resulting in a very simple
shared memory parallelization system where processes are created, the function is run
once, and the results are returned.
Therefore, for this project, I created a library from scratch. It differs from the existing
one because it works more like a worker pool: it creates the required number of
workers, which wait indefinitely for new jobs. In the context of this project, a job
is a new set of input values that a worker receives in order to run the function. The library supports
functions implemented in Octave (interpreted functions) and functions implemented in
C/C++ (compiled functions) compiled into dynamic libraries, through two different interfaces
that are used in the same way.
To implement the library, I used octfiles, which are pieces of C++ code compiled against
the Octave API that can be used as regular functions in Octave. In addition, I used
standard C++ facilities and Linux system services for the parallelization: C++11 threads
and the Linux fork system call. Threads are used in the interface for compiled functions,
and processes in the interface for interpreted functions.
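The worker-pool mechanism for compiled functions can be sketched with C++11 threads, a job queue, and a condition variable. The class and names below are an illustrative sketch under those assumptions, not the library's actual interface:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical worker-pool sketch: workers block on a shared queue and
// wake when a new job (one set of input values to evaluate) is submitted.
class WorkerPool {
public:
    using Job = std::function<void()>;

    explicit WorkerPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }

    // Drains any queued jobs, then joins all workers.
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lk(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }

    // Enqueue one evaluation and wake a waiting worker.
    void submit(Job job) {
        {
            std::lock_guard<std::mutex> lk(m_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            Job job;
            {
                std::unique_lock<std::mutex> lk(m_);
                // Wait indefinitely until a job arrives or shutdown begins.
                cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                if (done_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // evaluate the objective function on one input set
        }
    }

    std::vector<std::thread> workers_;
    std::queue<Job> jobs_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};
```

The key property matching the description above is that workers are created once and then wait for jobs, rather than being spawned per call.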
To verify the effectiveness and correctness of the library, I modified an optimization
algorithm named BoostDMS so that the step that evaluates the function on the various
generated input values, previously sequential, runs concurrently through my library. I ran
this modified algorithm on machines with multicore processors and obtained positive
results: the computation time of the algorithm tends to decrease as more workers are
used, with the gain peaking once the number of workers equals or exceeds the number
of generated values to evaluate
Software for Exascale Computing - SPPEXA 2016-2019
This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG), presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer's series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA's first funding phase, and provides an overview of SPPEXA's contributions towards exascale computing in today's supercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest
A data dependency recovery system for a heterogeneous multicore processor
Multicore processors often increase the performance of applications; however, with their ever deeper pipelining, processors have proven increasingly difficult to improve. In an attempt to deliver enhanced performance at lower power requirements, semiconductor microprocessor manufacturers have progressively adopted chip-multicore processors. Existing research has utilised a very common technique known as thread-level speculation, which attempts to compute results before the actual result is known. However, thread-level speculation impacts operation latency and circuit timing, and confounds data cache behaviour and code generation in the compiler. We describe a software framework codenamed Lyuba that handles low-level data hazards and automatically recovers the application from data hazards, without programmer intervention or speculation, on an asymmetric chip-multicore processor. Determining correct execution of multiple threads when data hazards occur on conventional symmetrical chip-multicore processors is a significant and ongoing challenge, yet there has been very little focus on the use of asymmetrical (heterogeneous) processors with applications that have complex data dependencies. The purpose of this thesis is to: (i) describe the development of a software framework for an asymmetric (heterogeneous) chip-multicore processor; (ii) present an optimal software control of hardware for distributed processing and recovery from violations; and (iii) provide performance results for five applications using three datasets. Applications with a small dataset showed an improvement of 17% and a larger dataset showed an improvement of 16%, giving an overall 11% improvement in performance
Optimization of Pattern Matching Algorithms for Multi- and Many-Core Platforms
Image and video compression play a major role in the world today, allowing the
storage and transmission of large multimedia content volumes. However, the processing
of this information requires high computational resources, so improving the
computational performance of these compression algorithms is very important.
The Multidimensional Multiscale Parser (MMP) is a pattern-matching-based compression
algorithm for multimedia contents, namely images, that achieves high compression
ratios while maintaining good image quality (Rodrigues et al. [2008]). However, compared
with other existing algorithms, it has a long execution time. Therefore,
two parallel implementations for GPUs were proposed by Ribeiro [2016] and Silva
[2015], in CUDA and OpenCL-GPU, respectively. In this dissertation, to complement
that work, we propose two parallel versions that run the MMP algorithm on the
CPU: one resorting to OpenMP and another that converts the existing OpenCL-GPU
implementation into OpenCL-CPU. The proposed solutions improve the computational
performance of MMP by 3× and 2.7×, respectively.
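As a rough illustration of the OpenMP route (not MMP's actual code), a pattern-matching search is naturally parallel: each block is matched against the dictionary independently, so the outer loop parallelizes directly. The names `sad` and `best_match` are hypothetical:

```cpp
#include <cstdlib>
#include <vector>

// Sum of absolute differences between a block and a candidate pattern
// (a common matching cost in pattern-matching compression).
static int sad(const std::vector<int>& block, const std::vector<int>& pat) {
    int d = 0;
    for (std::size_t i = 0; i < block.size(); ++i)
        d += std::abs(block[i] - pat[i]);
    return d;
}

// For each block, return the index of the closest dictionary pattern.
// Iterations are independent, so the loop is safe to parallelize.
std::vector<int> best_match(const std::vector<std::vector<int>>& blocks,
                            const std::vector<std::vector<int>>& dict) {
    std::vector<int> best(blocks.size(), -1);
    #pragma omp parallel for
    for (long b = 0; b < static_cast<long>(blocks.size()); ++b) {
        int best_d = -1;
        for (std::size_t p = 0; p < dict.size(); ++p) {
            int d = sad(blocks[b], dict[p]);
            if (best_d < 0 || d < best_d) {
                best_d = d;
                best[b] = static_cast<int>(p);
            }
        }
    }
    return best;
}
```

Because each iteration writes only its own `best[b]`, no synchronization is needed, which is why such search loops tend to scale well on multicore CPUs.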
The High Efficiency Video Coding (HEVC/H.265) standard is the most recent standard
for image and video compression. Its impressive compression performance makes it
a target for many adaptations, particularly for holoscopic (or light field) image/video
processing. Some of the proposed modifications to encode this new multimedia content
are based on geometry-based disparity compensation (SS), developed by Conti et al.
[2014], and a Geometric Transformations (GT) module, proposed by Monteiro et al.
[2015]. These HEVC-based compression algorithms for holoscopic images implement a
specific search for similar micro-images that is more efficient than the one performed
by HEVC, but their implementation is considerably slower than HEVC. To enable
better execution times, we chose the OpenCL API as the GPU enabling language to
increase the module's performance. With its most costly setting, we are able to reduce
the GT module's execution time from 6.9 days to less than 4 hours, effectively attaining
a speedup of 45×
Concurrency Platforms for Real-Time and Cyber-Physical Systems
Parallel processing is an important way to satisfy the increasingly demanding computational needs of modern real-time and cyber-physical systems, but existing parallel computing technologies primarily emphasize high-throughput and average-case performance metrics, which are largely unsuitable for direct application to real-time, safety-critical contexts. This work contrasts two concurrency platforms designed to achieve predictable worst case parallel performance for soft real-time workloads with millisecond periods and higher. One of these is then the basis for the CyberMech platform, which enables parallel real-time computing for a novel yet representative application called Real-Time Hybrid Simulation (RTHS). RTHS combines demanding parallel real-time computation with real-time simulation and control in an earthquake engineering laboratory environment, and results concerning RTHS characterize a reasonably comprehensive survey of parallel real-time computing in the static context, where the size, shape, timing constraints, and computational requirements of workloads are fixed prior to system runtime. Collectively, these contributions constitute the first published implementations and evaluations of general-purpose concurrency platforms for real-time and cyber-physical systems, explore two fundamentally different design spaces for such systems, and successfully demonstrate the utility and tradeoffs of parallel computing for statically determined real-time and cyber-physical systems
Development of a PC-Based Object-Oriented Real-Time Robotics Controller
The industrial world of robotics requires leading-edge controllers to match the speed of new manipulators. At the University of Waterloo, a three degree-of-freedom ultra high-speed cable-based robot called Deltabot was created. In order to improve the performance of the Deltabot, a new controller called the QNX Multi-Axis Robotic Controller (QMARC) was developed. QMARC is a PC-based controller built to replace the existing commercial controller called PMAC, manufactured by Delta Tau Data Systems. Although the PMAC has its own real-time processor, its rigid and complex internal structure makes it difficult to apply advanced control algorithms and interpolation methods. Adding unconventional hardware to PMAC, such as a camera and vision system, is also quite challenging. With the development of QMARC, the flexibility issue of the controller is resolved. QMARC's open-source object-oriented software structure allows the addition of new control and interpolation techniques as required. In addition, the software structure of the main Controller process is decoupled from the hardware, so that any hardware change does not affect the main controller, just the hardware drivers. QMARC is also equipped with a user-friendly graphical user interface and many safety protocols, making it a safe and easy-to-use system. Experimental tests have proven QMARC to be a safe and reliable controller. The stable software foundation created by QMARC will allow for future development of the controller as research on the Deltabot progresses
Embedded System Design
A unique feature of this open access textbook is that it provides a comprehensive introduction to the fundamental knowledge in embedded systems, with applications in cyber-physical systems and the Internet of things. It starts with an introduction to the field and a survey of specification models and languages for embedded and cyber-physical systems. It provides a brief overview of hardware devices used for such systems and presents the essentials of system software for embedded systems, including real-time operating systems. The author also discusses evaluation and validation techniques for embedded systems and provides an overview of techniques for mapping applications to execution platforms, including multi-core platforms. Embedded systems have to operate under tight constraints and, hence, the book also contains a selected set of optimization techniques, including software optimization techniques. The book closes with a brief survey on testing. This fourth edition has been updated and revised to reflect new trends and technologies, such as the importance of cyber-physical systems (CPS) and the Internet of things (IoT), the evolution of single-core processors to multi-core processors, and the increased importance of energy efficiency and thermal issues
- …