
    Kulla, a container-centric construction model for building infrastructure-agnostic distributed and parallel applications

    This paper presents the design, development, and implementation of Kulla, a virtual container-centric construction model that mixes loosely coupled structures with a parallel programming model for building infrastructure-agnostic distributed and parallel applications. In Kulla, applications, dependencies, and environment settings are mapped to construction units called Kulla-Blocks. A parallel programming model enables developers to couple these interoperable structures into constructive structures named Kulla-Bricks. In these structures, continuous dataflow and parallel patterns can be created without modifying the code of applications. Methods such as Divide&Containerize (data parallelism), Pipe&Blocks (streaming), and Manager/Block (task parallelism) were developed to create Kulla-Bricks. Recursive combinations of Kulla instances can be grouped into deployment structures called Kulla-Boxes, which are encapsulated into virtual containers (VCs) to create infrastructure-agnostic parallel and/or distributed applications. Deployment strategies were created for Kulla-Boxes to improve the profitability of IT resources. To show the feasibility and flexibility of this model, solutions combining real-world applications were implemented by using Kulla instances to compose parallel and/or distributed systems deployed on different IT infrastructures. An experimental evaluation based on use cases solving satellite and medical image processing problems revealed the efficiency of the Kulla model in comparison with traditional state-of-the-art solutions. This work has been partially supported by the EU project "ASPIDE: Exascale Programing Models for Extreme Data Processing" under grant 801091 and the project "CABAHLA-CM: Convergencia Big data-Hpc: de los sensores a las Aplicaciones" S2018/TCS-4423 from the Madrid Regional Government.
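
    The abstract does not show Kulla's interface, but the Pipe&Blocks streaming pattern it names can be sketched generically: independent stages connected by queues, so data flows through them in parallel without modifying any stage's code. The Python sketch below illustrates only that pattern; the function and block names are hypothetical, not Kulla's API.

        # Minimal sketch of a Pipe&Blocks-style streaming pipeline.
        # NOT the Kulla API: all names here are hypothetical.
        from multiprocessing import Process, Queue

        SENTINEL = None  # marks end-of-stream

        def double(x):
            return x * 2

        def increment(x):
            return x + 1

        def block(stage_fn, inbox, outbox):
            """One 'block': consume items, apply its stage, forward results."""
            while (item := inbox.get()) is not SENTINEL:
                outbox.put(stage_fn(item))
            outbox.put(SENTINEL)  # propagate end-of-stream downstream

        def pipe_and_blocks(stages, items):
            """Chain stages with queues so items stream through all of them."""
            queues = [Queue() for _ in range(len(stages) + 1)]
            workers = [Process(target=block, args=(fn, qin, qout))
                       for fn, qin, qout in zip(stages, queues, queues[1:])]
            for w in workers:
                w.start()
            for item in items:
                queues[0].put(item)
            queues[0].put(SENTINEL)
            results = []
            while (out := queues[-1].get()) is not SENTINEL:
                results.append(out)
            for w in workers:
                w.join()
            return results

        if __name__ == "__main__":
            print(pipe_and_blocks([double, increment], range(5)))  # [1, 3, 5, 7, 9]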

    Dimension and shape invariant programming: the implementation and application

    This thesis implements a model for shape- and dimension-invariant programming based on the notation of the Mathematics of Arrays (MOA) algebra. It focuses on the implementation of dimension and shape invariance and its effect on parallel computing. A new design for the MOA notation is implemented that eliminates the need for another PSI-compiler or a language extension to functional programming languages. The MOA notation is designed as a library of Application Programming Interfaces (APIs) containing object-oriented classes implemented in C++. The library executes array operations correctly and is expected to enhance performance while remaining invariant of dimension and shape. To implement these APIs, the mathematical equations of the original notation were analyzed and sometimes simplified to make them more comprehensible from a programming point of view, and some further operations were added. The APIs reduce the erroneous loop starts, strides, and stops used by programmers in the traditional handling of multi-dimensional arrays. The library defines the dimension and shape of the arrays at runtime and gives the source code of the problem at hand better chances to be automatically parallelized. The MOA library testing tool developed and implemented in this thesis can be used by mathematicians and computer arithmetic researchers to translate high-level arithmetic functions in applications such as image processing, video processing, fluid dynamics, etc. into the MOA notation, utilizing its benefits. An image-processing tool is implemented using this new MOA library, proving the correctness of the design on a 2D-array application, where image operations are expressed concisely in the source code and easily manipulated on the conceptual level. Image-processing transformations, filtering, and detections are implemented. Video-processing operations, such as transformations on AVI frames after decomposing them and a motion detection scheme, are implemented using the MOA library to prove the correctness of the library on a 3D-array application. The parallelisation factors inherent in the MOA library design are also discussed in terms of shape polymorphism, the MOA parallel architecture, data redistribution, and tiling algorithms, in relation to the MOA notation. Furthermore, pipelining with MOA has been investigated. In addition to the above experiments, a hardware implementation of the MOA APIs was written in VHDL on Renoir as a package and simulated using ModelSim. Performance analysis is conducted in terms of the general benefits of programming invariant of shape and dimension as designed in this thesis, which is open to further analysis based on the application domain.
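
    The core of shape- and dimension-invariant indexing can be shown compactly. The Python sketch below is an illustration in the spirit of MOA's gamma function (flat offset from a multi-index and a shape) and its psi indexing (a partial index selects a whole subarray); it is not the thesis's C++ API. The same code serves 2D images and 3D video alike, since shapes are runtime values.

        # Dimension-invariant indexing in the spirit of MOA; an
        # illustration, not the thesis's C++ library.
        from functools import reduce

        def gamma(index, shape):
            """Row-major flat offset of `index` into an array of `shape`,
            valid for any number of dimensions."""
            assert len(index) == len(shape)
            offset = 0
            for i, s in zip(index, shape):
                assert 0 <= i < s, "index out of bounds"
                offset = offset * s + i
            return offset

        def psi(index, shape, data):
            """MOA-style psi: a partial index selects a whole subarray.
            Returns (subshape, flat slice of data)."""
            subshape = shape[len(index):]
            size = reduce(lambda a, b: a * b, subshape, 1)
            start = gamma(list(index) + [0] * len(subshape), shape)
            return subshape, data[start:start + size]

        # the same functions work for 2D images and 3D video frames
        shape = [2, 3, 4]                   # e.g. 2 frames of 3x4 pixels
        data = list(range(24))
        print(gamma([1, 2, 3], shape))      # 23, the last element
        print(psi([1], shape, data))        # ([3, 4], elements 12..23)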

    Accelerating Scientific Computing Models Using GPU Processing

    GPGPUs offer significant computational power for programmers to leverage. This power is especially useful for accelerating scientific models. This thesis analyzes the use of GPGPU programming to accelerate scientific computing models. First, the construction of hardware for visualization and computation of scientific models is discussed; several factors in the construction of the machines focus on the performance impacts related to scientific modeling. Image processing is an embarrassingly parallel problem well suited to GPGPU acceleration. An image processing library was developed to show the process of recognizing embarrassingly parallel problems, and it serves as an example of converting a serial CPU implementation to a GPU-accelerated one. Genetic algorithms are biologically inspired heuristic search algorithms based on natural selection. The Tetris genetic algorithm with A* pathfinding illustrates memory-bound limitations that can prevent direct algorithm conversions from the CPU to the GPU. An analysis of an existing landscape evolution model, CHILD, for GPU acceleration shows that even when a model looks promising for GPU acceleration, the underlying data structures can have a significant impact on the ability to move to a GPU implementation. CHILD also offers an example of creating tighter MATLAB integration with existing models. Lastly, a parallel spatial sorting algorithm is discussed as a possible replacement for the spatial sorting algorithms currently implemented in models such as smoothed particle hydrodynamics.
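
    A per-pixel operation is embarrassingly parallel because each output pixel depends only on its own input pixel, so the loop body can become one GPU thread per pixel. The small NumPy sketch below shows the serial-to-parallel conversion; the vectorized form stands in for the GPU kernel, and the Rec. 601 grayscale weights are a common choice assumed here, not necessarily those of the thesis's library.

        # Per-pixel grayscale conversion: no cross-pixel dependency,
        # so the serial loop maps directly to a parallel kernel.
        import numpy as np

        def grayscale_serial(img):
            """CPU reference: explicit loop over pixels (slow)."""
            h, w, _ = img.shape
            out = np.empty((h, w), dtype=np.float32)
            for y in range(h):
                for x in range(w):
                    r, g, b = img[y, x]
                    out[y, x] = 0.299 * r + 0.587 * g + 0.114 * b
            return out

        def grayscale_parallel(img):
            """Same math, data-parallel: one 'thread' per pixel."""
            return img @ np.array([0.299, 0.587, 0.114], dtype=np.float32)

        img = np.random.rand(64, 64, 3).astype(np.float32)
        assert np.allclose(grayscale_serial(img), grayscale_parallel(img), atol=1e-5)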

    PARALLEL IMAGE PROCESSING FROM CLOUD USING CUDA AND HADOOP ARCHITECTURE: A NOVEL APPROACH

    There is an increasingly large quantity of super-resolution-quality data, and hence an increased demand for high-quality image data. This requirement creates a disk-space challenge on a single PC or computer. A primary solution for storing large quantities of high-quality imagery is cloud computing. The proposed approach uses a Hadoop-based remote sensing image processing system (HBRSIPS) of the kind used in big data analysis, particularly text analysis. This approach enables remote sensing image data to be processed from a single cloud storage by using Java constructs with the MapReduce framework. The proposed system saves remote sensing images into Hadoop's distributed storage environment and uses either a cloud-based algorithm environment or the GPU, a massively parallel platform offering greater efficiency at lower cost, through the CUDA architecture, a C-based programming model proposed by NVIDIA for leveraging parallel computing capabilities; this results in a 25% improvement in data processing throughput. Using CUDA with POSIX thread technologies, the remote image can be processed from the cloud architecture.
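
    The MapReduce structure the approach relies on can be sketched independently of Hadoop: mappers process image tiles in isolation and a reducer merges the partial results. The sketch below is a plain-Python illustration of that shape, not code from HBRSIPS; the histogram task is an assumed example workload.

        # MapReduce shape for image workloads: independent mappers,
        # one merging reducer. Illustration only, not HBRSIPS code.
        from multiprocessing import Pool
        import numpy as np

        def map_tile(tile):
            """Mapper: per-tile intensity histogram (tiles are independent)."""
            hist, _ = np.histogram(tile, bins=8, range=(0, 256))
            return hist

        def reduce_histograms(partials):
            """Reducer: merge partial histograms by summing matching bins."""
            return np.sum(partials, axis=0)

        if __name__ == "__main__":
            image = np.random.randint(0, 256, size=(512, 512), dtype=np.uint8)
            # split into row-band tiles, as a distributed store would
            tiles = [image[y:y + 128] for y in range(0, 512, 128)]
            with Pool() as pool:          # stands in for the worker cluster
                partials = pool.map(map_tile, tiles)
            print(reduce_histograms(partials))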

    High-performance computing and communication models for solving the complex interdisciplinary problems on DPCS

    The paper presents advanced high-performance computing (HPC) and parallel computing (PC) methodologies for solving large, complex problems that integrate different research areas. About eight interdisciplinary problems are accurately solved on multiple computers communicating over a local area network. The mathematical modeling and large sparse simulation of this interdisciplinary effort involve science, engineering, biomedicine, nanotechnology, software engineering, agriculture, image processing, and urban planning. The specific parallel-computing methodologies under consideration include PVM, MPI, LUNA, MDC, OpenMP, CUDA, and LINDA, integrated with COMSOL and C++/C. There are different communication models of parallel programming, so definitions of parallel processing, distributed processing, and memory types are given to clarify the main contribution of this paper. The match between a parallel-computing methodology and a large sparse application depends on the domain of the solution, the dimension of the targeted area, the computational and communication patterns, the architecture of the distributed parallel computing system (DPCS), the structure of the computational complexity, and the communication cost. The originality of this paper lies in obtaining a complex numerical model dealing with a large-scale partial differential equation (PDE), its discretization by finite difference (FDM) or finite element (FEM) methods, numerical simulation, high-performance simulation, and performance measurement. The simulation of the PDE is performed by sequential and parallel algorithms to visualize the complex model in high resolution. In the context of a mathematical model, various independent and dependent parameters represent the complex, real phenomena of the interdisciplinary application. As a model executes, these parameters can be manipulated and changed; as a result, some chemical or mechanical properties can be predicted from the observed parameter changes. The methodologies of the parallel programs build on the client-server, master-slave, and fragmented models. The HPC communication models for solving the interdisciplinary problems above are analyzed using algorithm flow, numerical analysis, and a comparison of parallel performance evaluations. In conclusion, the integration of HPC, communication models, parallel-computing software, and performance and numerical analysis proves to be an important approach for fulfilling the matching requirement and optimizing the solution of complex interdisciplinary problems.
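
    As a concrete instance of the PDE-plus-FDM workflow described above, the sketch below solves the 1D heat equation u_t = alpha * u_xx with an explicit finite-difference scheme. Grid size and coefficients are illustrative assumptions, not values from the paper; in a distributed run, each node would own a slab of the grid and exchange boundary (halo) points with its neighbors every step.

        # Explicit FDM for the 1D heat equation; illustrative
        # parameters, not taken from the paper.
        import numpy as np

        def heat_1d(u0, alpha, dx, dt, steps):
            """Update rule: u[i] += r * (u[i-1] - 2*u[i] + u[i+1]),
            with r = alpha*dt/dx^2."""
            r = alpha * dt / dx**2
            assert r <= 0.5, "explicit scheme is unstable for r > 1/2"
            u = u0.copy()
            for _ in range(steps):
                # interior points only; boundaries stay fixed (Dirichlet)
                u[1:-1] += r * (u[:-2] - 2 * u[1:-1] + u[2:])
            return u

        n = 101
        u0 = np.zeros(n)
        u0[n // 2] = 1.0                  # a heat spike in the middle
        print(heat_1d(u0, alpha=1.0, dx=0.01, dt=4e-5, steps=100).max())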

    Design and Programming Methods for Reconfigurable Multi-Core Architectures using a Network-on-Chip-Centric Approach

    A current trend in the semiconductor industry is the use of Multi-Processor Systems-on-Chip (MPSoCs) for a wide variety of applications such as image processing, automotive, multimedia, and robotic systems. Most applications gain performance advantages by executing parallel tasks on multiple processors due to their inherent parallelism. Moreover, heterogeneous structures provide high performance and energy efficiency, since application-specific processing elements (PEs) can be exploited. The increasing number of heterogeneous PEs leads to challenging communication requirements. To overcome this challenge, Networks-on-Chip (NoCs) have emerged as a scalable on-chip interconnect. Nevertheless, NoCs have to deal with many design parameters such as virtual channels, routing algorithms, and buffering techniques to fulfill the system requirements. This thesis contributes substantially to the state-of-the-art of FPGA-based MPSoCs and NoCs; in the following, its three major contributions are introduced. As the first major contribution, a novel router concept is presented that efficiently utilizes communication time by performing sequences of arithmetic operations on the data being transferred. The internal input buffers of the routers are replaced with processing units capable of executing operations. Two architectures of such processing units are presented. The first provides multiply-and-accumulate operations, which are often used in signal processing applications. The second, introduced as Application-Specific Instruction Set Routers (ASIRs), contains a processing unit capable of executing any operation and is hence not limited to multiply-and-accumulate. The internal processing core of an ASIR can be developed in C/C++ using high-level synthesis. The second major contribution comprises application and performance explorations of the novel router concept. Models that approximate the achievable speedup and the end-to-end latency of ASIRs are derived and discussed to show the benefits in terms of performance. Furthermore, two applications using an ASIR-based MPSoC are implemented and evaluated on a Xilinx Zynq SoC. The first application is an image processing algorithm consisting of a Sobel filter, an RGB-to-grayscale conversion, and a threshold operation. The second is a system that helps visually impaired people navigate unknown indoor environments: a Light Detection and Ranging (LIDAR) sensor scans the environment, while Inertial Measurement Units (IMUs) measure the orientation of the user to generate an audio signal that makes the distance as well as the orientation of obstacles audible. This application consists of multiple parallel tasks that are mapped to an ASIR-based MPSoC. Both applications show the performance advantages of ASIRs compared to a conventional NoC-based MPSoC. Furthermore, dynamic partial reconfiguration is investigated in terms of relocation and security aspects. The third major contribution refers to development and programming methodologies for NoC-based MPSoCs. A software-defined approach is presented that combines the design and programming of heterogeneous MPSoCs. In addition, a Kahn Process Network (KPN)-based model is designed to describe parallel applications for MPSoCs using ASIRs. The KPN-based model is extended to support not only the mapping of tasks to NoC-based MPSoCs but also the mapping to ASIR-based MPSoCs. A static mapping methodology is presented that assigns tasks to ASIRs and processors for a given KPN model. The impact of external hardware components such as sensors, actuators, and accelerators connected to the processors is also discussed, which makes the approach of high interest for embedded systems.
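
    The KPN model used to describe applications can be illustrated with a toy network: processes that communicate only through FIFO channels with blocking reads, which is what makes the mapping of processes onto PEs or ASIR stages explicit. The Python sketch below is illustrative; the process names and the computation are not taken from the thesis.

        # A minimal Kahn Process Network: deterministic processes
        # connected by FIFO channels, blocking reads only.
        from queue import Queue
        from threading import Thread

        def producer(out_ch, n):
            """KPN process: emits a stream of tokens."""
            for i in range(n):
                out_ch.put(i)             # writes never block (unbounded FIFO)

        def scaler(in_ch, out_ch, n):
            """KPN process: one token in, one token out."""
            for _ in range(n):
                out_ch.put(in_ch.get() * 2)   # get() blocks until a token arrives

        def consumer(in_ch, n):
            for _ in range(n):
                print(in_ch.get())

        n = 5
        a, b = Queue(), Queue()           # the FIFO channels of the network
        procs = [Thread(target=producer, args=(a, n)),
                 Thread(target=scaler, args=(a, b, n)),
                 Thread(target=consumer, args=(b, n))]
        for p in procs:
            p.start()
        for p in procs:
            p.join()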

    Content-aware image resizing in OpenCL

    The purpose of this thesis was to test whether the algorithm for content-aware image resizing runs faster on a graphics processing unit than on a central processing unit. For that we chose the content-aware image resizing algorithm called seam carving. With seam carving we can change image dimensions by finding the optimal seam, which we can carve out or insert, depending on whether we want to shrink or enlarge the image. A seam is a connected path from one side of the image to the other that carries the least important information in the image. In our testing we found that this algorithm works best on images with a monotone background. Because the algorithm itself was not the purpose of this thesis, we did not try to improve it. For the implementation of this algorithm on the graphics processing unit we used the heterogeneous programming framework OpenCL. OpenCL is a standard for heterogeneous parallel computing on cross-vendor and cross-platform hardware. The OpenCL architecture can be described by its platform model, execution model, memory model, and programming model, each of which is described in detail in chapter three. In chapter four we look at our implementation of the seam carving algorithm. We had two approaches. The first carves one seam at a time, which means recalculating the energy and its cumulative map every time we carve out a seam. The second carves multiple seams at a time: we try to find several seams that can be carved out from one calculation of the energy and cumulative map, and we repeat the process until we reach the desired image dimensions. Our testing showed that choosing the right work-group size is very important, as is the implementation of the kernels. With the wrong approach, execution can slow down considerably, which is evident from the results of the second approach: there, execution of the algorithm on the central processing unit was faster than on the graphics processing unit. We were more successful with the first approach, which runs faster on the graphics processing unit than on the central processing unit.
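
    The core recurrence behind seam carving is the cumulative-energy dynamic program M[i][j] = e[i][j] + min(M[i-1][j-1], M[i-1][j], M[i-1][j+1]), followed by backtracking the cheapest seam. The sketch below implements it in plain NumPy: rows must be processed in order, but all columns within a row are independent, which is the parallelism an OpenCL kernel can exploit with one work-item per column. The gradient-magnitude energy function is a common choice assumed here, not necessarily the one used in the thesis.

        # Seam carving core: energy map, cumulative-energy DP, and
        # removal of one minimal vertical seam. CPU illustration.
        import numpy as np

        def energy(gray):
            """Gradient-magnitude energy of a grayscale image."""
            gy, gx = np.gradient(gray.astype(np.float64))
            return np.abs(gx) + np.abs(gy)

        def cumulative(e):
            """M[i, j] = e[i, j] + min of the three neighbors in row i-1."""
            m = e.copy()
            for i in range(1, m.shape[0]):        # rows are sequential...
                left = np.roll(m[i - 1], 1);  left[0] = np.inf
                right = np.roll(m[i - 1], -1); right[-1] = np.inf
                # ...but every column of a row is independent (GPU-friendly)
                m[i] += np.minimum(np.minimum(left, m[i - 1]), right)
            return m

        def remove_seam(img, m):
            """Backtrack the minimal seam and drop one pixel per row."""
            h, w = m.shape
            out = np.empty((h, w - 1))
            j = int(np.argmin(m[-1]))             # seam end in the last row
            for i in range(h - 1, -1, -1):
                out[i] = np.delete(img[i], j)
                if i:
                    lo = max(j - 1, 0)
                    j = lo + int(np.argmin(m[i - 1, lo:j + 2]))
            return out

        img = np.random.rand(6, 8)
        smaller = remove_seam(img, cumulative(energy(img)))
        print(img.shape, "->", smaller.shape)     # (6, 8) -> (6, 7)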