153 research outputs found
Parallel Solving Tasks of Digital Image Processing
This paper presents methods for organizing parallel computation across several computing threads for image spectrum processing based on the two-dimensional FFT, DCT and Walsh-Hadamard basis systems. Parallel spectral analysis algorithms are built using OpenMP, Intel TBB and Intel Cilk Plus, and the capabilities of their libraries are provided
Parallel fast Fourier transform in SPMD style of Cilk
Copyright © 2019 Inderscience Enterprises Ltd. In this paper, we propose a parallel one-dimensional non-recursive fast Fourier transform (FFT) program based on the conventional Cooley-Tukey algorithm, written in C using Cilk in single program multiple data (SPMD) style. This highly compact code is compared with a highly tuned parallel recursive FFT written in Cilk, which is included in version 5.4.6 of the Cilk package. Both algorithms are executed on multicore servers, and experimental results show that the performance of the SPMD-style Cilk FFT code is highly competitive and promising
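As a rough illustration of the non-recursive Cooley-Tukey structure this abstract refers to, the following is a minimal Python sketch, not the authors' C/Cilk code. The function name `fft_iterative` and the radix-2, power-of-two restriction are assumptions made for illustration. Within each stage, the blocks visited by the `start` loop do not share data, which is the kind of independence an SPMD-style decomposition could plausibly exploit.

```python
import cmath


def fft_iterative(x):
    """Non-recursive radix-2 Cooley-Tukey FFT (illustrative sketch).

    The input length must be a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = [complex(v) for v in x]

    # Bit-reversal permutation: reorder the input so that the
    # butterflies below can work in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

    # log2(n) stages of butterflies; block size doubles at each stage.
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        # Blocks starting at different `start` values are independent.
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w
                a[start + k] = u + v
                a[start + k + half] = u - v
                w *= w_step
        size *= 2
    return a


if __name__ == "__main__":
    print(fft_iterative([1, 2, 3, 4, 0, -1, -2, 5]))
```

The loop over stages replaces the recursion of the textbook formulation; only the bit-reversal step and the stage loop are inherently sequential in this sketch.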
Optimization of the coherence function estimation for multi-core central processing unit
This paper considers parallel processing on multi-core central processing units to optimize evaluation of the coherence function used in digital signal processing. The coherence function, along with other spectral analysis methods, is commonly used for vibration diagnosis of rotating machinery and its individual components. An algorithm is given for evaluating the function for signals represented as digital samples, and its software implementation and computational problems are analyzed. Optimization measures are described, including algorithmic, architectural and compiler optimizations, and their results are assessed on multi-core processors from different manufacturers. The speedup of parallel execution relative to sequential execution was studied, and results are presented for Intel Core i7-4720HQ and AMD FX-9590 processors. The results show comparatively high efficiency of the optimization measures taken. In particular, speedup figures and average CPU utilization improved significantly, indicating a high degree of parallelism in the implemented computational functions. The developed software has undergone state registration and will be used as part of a software and hardware solution for rotating machinery fault diagnosis and pipeline leak location using the acoustic correlation method
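For reference, the magnitude-squared coherence commonly used in spectral analysis is usually written as below. The abstract does not state which estimator the authors use, so the symbols here are assumptions: $S_{xy}$ denotes the cross-spectral density of signals $x$ and $y$, and $S_{xx}$, $S_{yy}$ their auto-spectral densities.

```latex
\gamma_{xy}^{2}(f) = \frac{\lvert S_{xy}(f) \rvert^{2}}{S_{xx}(f)\, S_{yy}(f)},
\qquad 0 \le \gamma_{xy}^{2}(f) \le 1
```

In practice the spectral densities are typically estimated by averaging over signal segments, and the per-segment transforms are independent of one another, which is one plausible source of the parallelism the abstract mentions.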
A perceptual hash function to store and retrieve large scale DNA sequences
This paper proposes a novel approach for storing and retrieving massive DNA
sequences. The method is based on a perceptual hash function, commonly used to
determine the similarity between digital images, that we adapted for DNA
sequences. The perceptual hash function presented here is based on a Discrete
Cosine Transform Sign Only (DCT-SO). Each nucleotide is encoded as a fixed gray
level intensity pixel and the hash is calculated from its significant frequency
characteristics. This results in a drastic data reduction between the sequence
and the perceptual hash. Unlike cryptographic hash functions, perceptual hashes
are not affected by the "avalanche effect" and thus can be compared. The similarity
distance between two hashes is estimated with the Hamming Distance, which is
used to retrieve DNA sequences. Experiments that we conducted show that our
approach is relevant for storing massive DNA sequences, and retrieving them
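The idea described in this abstract can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the authors' method: the gray levels in `GRAY`, the one-dimensional DCT-II, the choice to skip the DC coefficient, and the 16-bit hash length are all hypothetical, since the abstract gives none of these details.

```python
import math

# Hypothetical gray-level encoding; the abstract does not give the real one.
GRAY = {"A": 0, "C": 85, "G": 170, "T": 255}


def dct2(signal):
    """Unnormalized one-dimensional DCT-II, computed directly in O(n^2)."""
    n = len(signal)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(signal))
        for k in range(n)
    ]


def sign_hash(seq, bits=16):
    """Keep only the signs of the first `bits` non-DC DCT coefficients."""
    coeffs = dct2([GRAY[c] for c in seq.upper()])
    # Coefficient 0 (DC) is skipped: with non-negative inputs it is never
    # negative, so its sign carries no information (an assumption here).
    return [1 if c >= 0 else 0 for c in coeffs[1:bits + 1]]


def hamming(h1, h2):
    """Number of positions at which two equal-length hashes differ."""
    if len(h1) != len(h2):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(h1, h2))


if __name__ == "__main__":
    s1 = "ACGTTGCAACGTACGTTTGACCAGTACGATCG"
    s2 = "ACGTTGCAACGTACGTTTGACCAGTACGATCC"  # one substitution at the end
    print(hamming(sign_hash(s1), sign_hash(s2)))
```

A small Hamming distance between two hashes is then read as high similarity between the underlying sequences; sequences shorter than `bits + 1` symbols yield a shorter hash in this sketch.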
Efficient algorithms for the fast computation of space charge effects caused by charged particles in particle accelerators
In this dissertation, a Poisson solver is improved in three respects: an efficient integrated Green's function; the discrete cosine transform of the efficient integrated Green's function values; and an implicitly zero-padded fast Fourier transform for the charge density. In addition, high-performance computing technologies are used to further improve efficiency, namely OpenMP, OpenMP+CUDA, MPI, and MPI+OpenMP parallelizations. The examples and simulation results are checked against the results of the commonly used Poisson solver to demonstrate the accuracy
Towards Comprehensive Parametric Code Generation Targeting Graphics Processing Units in Support of Scientific Computation
The most popular multithreaded languages based on the fork-join concurrency model (CilkPlus, OpenMP) are currently being extended to support other forms of parallelism (vectorization, pipelining and single-instruction-multiple-data (SIMD)). In the SIMD case, the objective is to execute the corresponding code on a many-core device, like a GPGPU, for which the CUDA language is a natural choice. Since the programming concepts of CilkPlus and OpenMP are very different from those of CUDA, it is desirable to automatically generate optimized CUDA-like code from CilkPlus or OpenMP.
In this thesis, we propose an accelerator model for annotated C/C++ code together with an implementation that allows the automatic generation of CUDA code. One of the key features of this CUDA code generator is that it supports the generation of CUDA kernel code where program parameters (like number of threads per block) and machine parameters (like shared memory size) are treated as unknown symbols. Hence, these parameters need not be known at code-generation time: machine parameters and program parameters can be determined when the generated code is installed on the target machine. In addition, we show how these parametric CUDA programs can be optimized at compile time in the form of a case discussion, where cases depend on the values of machine parameters (e.g. hardware resource limits) and program parameters (e.g. dimension sizes of thread-blocks). This generation of parametric CUDA kernels requires dealing with non-linear polynomial expressions during the dependence analysis and tiling phase. To carry out these algebraic calculations, we take advantage of techniques from computer algebra, in particular the RegularChains library of Maple. Various illustrative examples are provided together with performance evaluation
Scheduling (ir)regular applications on heterogeneous platforms
Master's dissertation in Informatics Engineering. Current computational platforms have become increasingly heterogeneous and parallel in recent years, as a consequence of incorporating accelerators whose architectures are parallel and different from the CPU. As a result, several frameworks were developed to help program these platforms, mainly targeting better productivity. In this context, the GAMA framework is being developed by the research group involved in this work, targeting
both regular and irregular algorithms so that they run efficiently on heterogeneous platforms.
Scheduling is a key issue for GAMA-like frameworks. State-of-the-art scheduling solutions for heterogeneous platforms are efficient for regular applications but lack adequate mechanisms for irregular ones. The scheduling of irregular applications is particularly complex due to the unpredictability of, and the differences in, the execution times of their constituent computational tasks.
This dissertation comprises the design and validation of a dynamic scheduler model and its implementation, addressing both regular and irregular algorithms. The devised scheduling mechanism is validated within the GAMA framework by running relevant scientific algorithms, which include SAXPY, the Fast Fourier Transform and two n-Body solvers. The proposed mechanism is validated regarding its efficiency in finding good scheduling decisions and the efficiency and scalability of GAMA when using it.
The results show that the model of the devised dynamic scheduler is capable of working in heterogeneous systems with high efficiency and of finding good scheduling decisions in most of the tested cases. It not only reaches the scheduling decision that reflects the real capacity of the devices in the platform, but also enables GAMA to achieve more than 100% efficiency, as defined in [3], when running a relevant irregular scientific algorithm.
Under the designed scheduling model, GAMA also outperformed efficient CPU and GPU libraries for SAXPY, an important scientific algorithm. GAMA's scalability under the devised dynamic scheduler was also demonstrated: the scheduler properly leveraged the platform's computational resources in trials with one quad-core CPU chip and two GPU accelerators
Towards efficient exploitation of GPUs: a methodology for mapping index-digit algorithms
GPU computing was a major step forward, bringing high-performance computing
to commodity hardware. Feature-rich parallel languages like CUDA and
OpenCL reduced the programming complexity. However, to fully take advantage of
their computing power, specialized parallel algorithms are required. Moreover, the
complex GPU memory hierarchy and highly threaded architecture make programming
a difficult task even for experienced programmers. Due to the novelty of GPU
programming, common general purpose libraries are scarce and parallel versions of
the algorithms are not always readily available.
Instead of focusing on the parallelization of particular algorithms, in this thesis
we propose a general methodology applicable to most divide-and-conquer problems
with a butterfly structure which can be formulated through the Index-Digit
representation. First, we analyze the different performance factors of the GPU architecture.
Next, we study several optimization techniques and design a series of
modular and reusable building blocks, which will be used to create the different
algorithms. Finally, we study the optimal resource balance, and through a mapping
vector representation and operator algebra, we tune the algorithms for the desired
configurations. Despite the focus on programmability and flexibility, the resulting
implementations offer very competitive performance, being able to surpass other
well-known state-of-the-art libraries
Enhancement and optimization of a multi-command-based brain-computer interface
Brain-computer interfaces (BCIs) help disabled people control appliances without any physical interaction (e.g., pressing a button). Steady-state visual evoked potentials (SSVEPs) are brain activity elicited by visual stimuli. This dissertation addresses problems that limit the usability of BCI systems, optimizing and enhancing performance through a particular design. The main contribution of this work is improving the brain's response using focal approaches