
    Hardware Design, Prototyping and Studies of the Explicit Multi-Threading (XMT) Paradigm

    With the end of exponential performance improvements in sequential computers, parallel computers, dubbed "chip multiprocessors", "multicores", or "manycores", have been introduced. Unfortunately, programming current parallel computers tends to be far more difficult than programming sequential computers. The Parallel Random Access Machine (PRAM) is known to be an easy-to-program parallel computer model and has been widely used by theorists to develop parallel algorithms, because it abstracts away architectural details and allows algorithm designers to focus on the critical issues. The eXplicit Multi-Threading (XMT) PRAM-On-Chip project seeks to build an easy-to-program on-chip parallel processor by supporting a PRAM-like programming (and performance) model. This dissertation focuses on the design and study of the micro-architecture of the XMT processor, as well as on its performance optimization. The main contributions are: (1) presented a scalable micro-architecture of XMT based on a high-level description of the architecture; (2) designed a synthesizable Verilog HDL (hardware description language) description of XMT, which led to the first commitment of the XMT processor to silicon, a 75 MHz XMT FPGA computer; with the same design, we expect to see the first XMT ASIC processor using IBM 90 nm technology; (3) proposed and implemented several architectural upgrades to XMT: (i) value broadcasting, (ii) hardware/software co-managed prefetch buffers, and (iii) hardware/software co-managed read-only buffers; (4) quantitatively studied the performance of XMT using non-trivial application kernels on the 75 MHz XMT FPGA computer; in addition, the performance of an 800 MHz XMT processor is projected; (5) studied the choice of not having local private caches in the XMT architecture by comparing the current architecture with an alternative that includes conventional coherent private caches.
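
    To give a flavour of the PRAM-like programming model that XMT targets, the sketch below shows a classic array-compaction kernel written in a PRAM style. XMT's own XMTC language expresses this with spawn/join blocks and a prefix-sum primitive; since those constructs are not standard C, OpenMP and an atomic fetch-and-add stand in for them here. This is a minimal illustration, not code from the dissertation.

```c
/* Array compaction in a PRAM-like style: copy the nonzero elements of
 * src[] into dst[] in arbitrary order. In XMTC this would be a spawn/join
 * block using a prefix-sum primitive; here OpenMP and an atomic
 * fetch-and-add stand in for those constructs. Illustrative sketch only. */
#include <stdio.h>
#include <omp.h>

#define N 16

int main(void) {
    int src[N] = {0, 3, 0, 7, 1, 0, 0, 4, 9, 0, 2, 0, 0, 5, 8, 0};
    int dst[N];
    int count = 0;  /* shared counter, playing the role of the prefix-sum base */

    /* Each virtual thread handles one index; the order of writes is
     * irrelevant, mirroring the PRAM independence-of-order semantics. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        if (src[i] != 0) {
            int pos;
            #pragma omp atomic capture
            pos = count++;        /* fetch-and-add, akin to XMTC's prefix-sum */
            dst[pos] = src[i];
        }
    }

    for (int i = 0; i < count; i++)
        printf("%d ", dst[i]);
    printf("\n");
    return 0;
}
```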

    Defining Asymptotic Parallel Time Complexity of Data-dependent Algorithms

    The scientific research community has reached a stage of maturity where its strong need for high-performance computing has diffused into everyday engineering and industry algorithms as well. To satisfy this need, parallel computers provide an efficient and economical way to solve large-scale and/or time-constrained problems. As a consequence, the end-users of these systems have a vested interest in defining the asymptotic time complexity of parallel algorithms in order to predict their performance on a particular parallel computer. The asymptotic parallel time complexity of data-dependent algorithms depends on the number of processors, the data size, and other parameters; discovering these other parameters is a challenging problem and the key to obtaining a good estimate of the performance order. Good examples of such applications are sorting algorithms, searching algorithms, and solvers of the traveling salesman problem (TSP). This article covers the knowledge-discovery aspects of the problem of defining the asymptotic parallel time complexity of data-dependent algorithms. The knowledge-discovery methodology begins by designing a considerable number of experiments and measuring their execution times. Then, an interactive and iterative process explores the data in search of patterns and/or relationships, detecting some parameters that affect performance. Once the key parameters that characterise the time complexity are known, it becomes possible to form a hypothesis, restart the process, and produce an improved time complexity model. Finally, the methodology predicts the performance order for new data sets on a particular parallel computer by numerical substitution into the model. As a case study, a global pruning traveling salesman problem implementation (GP-TSP) was chosen to analyze the influence of indeterminism on the performance prediction of data-dependent parallel algorithms, and to demonstrate the usefulness of the proposed knowledge-discovery methodology. The hypotheses generated to define the asymptotic parallel time complexity of the TSP were corroborated one by one. The experimental results confirm the capability of the proposed methodology: the predicted performance orders agreed well with real execution times (on the order of 85% accuracy).
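
    The core of such a methodology is hypothesising a complexity model from measured execution times and then predicting new runs by substituting values into it. The sketch below shows the simplest instance of that fitting step: estimating the exponent a in a hypothesis T(n) ≈ c·n^a by least squares on log-transformed data. The timings are fabricated for illustration; the GP-TSP-specific parameters discussed in the article are not modelled here.

```c
/* Estimate the exponent a in the hypothesis T(n) ~ c * n^a by ordinary
 * least squares on (log n, log T). The sample timings below are fabricated
 * for illustration only. Compile with -lm. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double n[] = {1000, 2000, 4000, 8000, 16000};
    double t[] = {0.41, 1.68, 6.9, 27.5, 110.0};   /* measured seconds (made up) */
    int m = 5;

    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < m; i++) {
        double x = log(n[i]), y = log(t[i]);
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double a = (m * sxy - sx * sy) / (m * sxx - sx * sx);  /* slope = exponent */
    double logc = (sy - a * sx) / m;

    printf("hypothesis: T(n) ~ %.3g * n^%.2f\n", exp(logc), a);

    /* predict the time order for a new data size under the fitted model */
    double n_new = 32000;
    printf("predicted T(%g) = %.1f s\n", n_new, exp(logc) * pow(n_new, a));
    return 0;
}
```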

    High Performance Computing of Complex Electromagnetic Algorithms Based on GPU/CPU Heterogeneous Platform and Its Applications to EM Scattering and Multilayered Medium Structure

    Fast and accurate numerical analysis of large-scale objects and complex structures is essential to electromagnetic simulation and design. Compared with the exploration of EM algorithms from a mathematical point of view, their programming realization is equally significant and must keep up with the development of hardware architectures. Unlike previous parallel algorithms, implemented by means of parallel programming on multicore CPUs with OpenMP or on clusters of computers with MPI, the new type of large-scale parallel processor based on the graphics processing unit (GPU) has shown impressive ability in many supercomputing scenarios, and its application to computational electromagnetics is especially promising. This paper introduces our recent work on high-performance computing based on a GPU/CPU heterogeneous platform and its application to EM scattering problems and planar multilayered medium structures, including a novel realization of OpenMP-CUDA-MLFMM, a newly developed ACA method, and a deeply optimized CG-FFT method. The numerical examples and their clear efficiency gains make a convincing case for continued in-depth investigation of computer hardware and its operating mechanisms.
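
    As a rough illustration of where the GPU fits into an iterative solver such as CG-FFT, the sketch below shows the host-side skeleton of a conjugate-gradient solve. The matrix-vector product, the step a heterogeneous implementation would offload to the GPU (for CG-FFT, as an FFT-based convolution in CUDA), is here a plain dense multiply parallelized with OpenMP. The system and sizes are invented; this is not the paper's code.

```c
/* Skeleton of a conjugate-gradient solve of A x = b. In a GPU/CPU
 * heterogeneous code such as CG-FFT, matvec() is where the work would be
 * offloaded to the GPU; here it is a plain dense multiply with OpenMP,
 * for illustration only. Compile with -fopenmp -lm. */
#include <stdio.h>
#include <math.h>
#include <omp.h>

#define N 4

static void matvec(const double A[N][N], const double x[N], double y[N]) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int j = 0; j < N; j++) s += A[i][j] * x[j];
        y[i] = s;
    }
}

static double dot(const double a[N], const double b[N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += a[i] * b[i];
    return s;
}

int main(void) {
    /* small symmetric positive definite test system (made up) */
    double A[N][N] = {{4,1,0,0},{1,3,1,0},{0,1,3,1},{0,0,1,2}};
    double b[N] = {1, 2, 0, 1};
    double x[N] = {0}, r[N], p[N], Ap[N];

    matvec(A, x, Ap);
    for (int i = 0; i < N; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
    double rr = dot(r, r);

    for (int k = 0; k < 100 && sqrt(rr) > 1e-10; k++) {
        matvec(A, p, Ap);                 /* the GPU-offload candidate */
        double alpha = rr / dot(p, Ap);
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        for (int i = 0; i < N; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }

    for (int i = 0; i < N; i++) printf("x[%d] = %f\n", i, x[i]);
    return 0;
}
```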

    Comparative study of different approaches to solve batch process scheduling and optimisation problems

    Effective approaches are important for batch process scheduling problems, especially those with complex constraints. However, most research focuses on improving optimisation techniques, and studies that compare the differences between them are inadequate. This study develops an optimisation model of batch process scheduling problems with complex constraints and investigates the performance of different optimisation techniques, such as Genetic Algorithms (GA) and Constraint Programming (CP). It finds that CP has a better capacity to handle batch process problems with complex constraints, but takes longer to run.
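
    For orientation only, the sketch below shows the bare bones of an evolutionary search of the kind a GA-based scheduler builds on: a permutation of job indices evolved by tournament selection and swap mutation against a total-tardiness objective (crossover is omitted for brevity). The jobs, durations, and due dates are invented and do not reflect the study's model or constraints.

```c
/* Toy evolutionary search over job permutations for a single-machine batch
 * schedule, minimizing total tardiness. A bare-bones stand-in for a GA:
 * swap mutation only, crossover omitted. All data are invented. */
#include <stdio.h>
#include <stdlib.h>

#define JOBS 6
#define POP  20
#define GENS 200

static const int dur[JOBS] = {3, 7, 2, 5, 4, 6};
static const int due[JOBS] = {6, 20, 4, 12, 9, 25};

/* total tardiness of processing the jobs in the given order */
static int tardiness(const int perm[JOBS]) {
    int t = 0, total = 0;
    for (int i = 0; i < JOBS; i++) {
        t += dur[perm[i]];
        if (t > due[perm[i]]) total += t - due[perm[i]];
    }
    return total;
}

int main(void) {
    int pop[POP][JOBS];
    srand(42);

    /* random initial permutations (Fisher-Yates shuffle) */
    for (int p = 0; p < POP; p++) {
        for (int i = 0; i < JOBS; i++) pop[p][i] = i;
        for (int i = JOBS - 1; i > 0; i--) {
            int j = rand() % (i + 1), tmp = pop[p][i];
            pop[p][i] = pop[p][j]; pop[p][j] = tmp;
        }
    }

    for (int g = 0; g < GENS; g++) {
        /* tournament: copy the better parent over the worse, then mutate */
        int a = rand() % POP, b = rand() % POP;
        int winner = tardiness(pop[a]) <= tardiness(pop[b]) ? a : b;
        int loser  = (winner == a) ? b : a;
        for (int i = 0; i < JOBS; i++) pop[loser][i] = pop[winner][i];
        int i = rand() % JOBS, j = rand() % JOBS;   /* swap mutation */
        int tmp = pop[loser][i]; pop[loser][i] = pop[loser][j]; pop[loser][j] = tmp;
    }

    int best = 0;
    for (int p = 1; p < POP; p++)
        if (tardiness(pop[p]) < tardiness(pop[best])) best = p;
    printf("best tardiness %d, order:", tardiness(pop[best]));
    for (int i = 0; i < JOBS; i++) printf(" %d", pop[best][i]);
    printf("\n");
    return 0;
}
```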

    Performance optimization of checkpointing schemes with task duplication

    In checkpointing schemes with task duplication, checkpointing serves two purposes: detecting faults by comparing the processors' states at checkpoints, and reducing fault recovery time by supplying a safe point to roll back to. In this paper, we show that, by tuning the checkpointing scheme to a given architecture, a significant reduction in execution time can be achieved. The main idea is to use two types of checkpoints: compare-checkpoints (comparing the states of the redundant processes to detect faults) and store-checkpoints (storing the states to reduce recovery time). With two types of checkpoints, we can use both the comparison and storage operations efficiently and improve the performance of checkpointing schemes. Our results show that, in some cases, using compare and store checkpoints can reduce the overhead of DMR checkpointing schemes by as much as 30 percent.
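
    A minimal way to picture the two checkpoint types is the toy simulation below: a duplicated "task" runs step by step, replica states are compared at every compare-checkpoint, and a verified state is saved at every store-checkpoint, so that a detected fault triggers rollback to the last saved state. The state, intervals, and injected fault are all invented for illustration and are not the paper's model.

```c
/* Toy simulation of duplicated execution with two checkpoint types:
 * compare-checkpoints (detect faults by comparing replica states) and
 * store-checkpoints (save a verified state to bound rollback distance).
 * The "task" is a trivial accumulator and the bit-flip fault is injected
 * artificially; everything here is illustrative only. */
#include <stdio.h>

#define STEPS     100
#define CMP_INT   5    /* compare replicas every 5 steps    */
#define STORE_INT 10   /* store a safe state every 10 steps */

int main(void) {
    long a = 0, b = 0;          /* states of the two replicas  */
    long saved = 0;             /* last store-checkpoint state */
    int saved_step = 0;
    int fault_injected = 0;

    for (int step = 1; step <= STEPS; step++) {
        a += step; b += step;                /* one unit of duplicated work */
        if (step == 37 && !fault_injected) { /* inject a transient fault    */
            b ^= 1; fault_injected = 1;
        }

        if (step % CMP_INT == 0 && a != b) {      /* compare-checkpoint */
            printf("fault detected at step %d, rolling back to step %d\n",
                   step, saved_step);
            a = b = saved;
            step = saved_step;     /* redo the work from the safe point */
            continue;
        }
        /* store-checkpoint: STORE_INT is a multiple of CMP_INT, so a
         * comparison has already verified this state in this iteration */
        if (step % STORE_INT == 0) {
            saved = a;
            saved_step = step;
        }
    }
    printf("final state %ld after %d steps\n", a, STEPS);
    return 0;
}
```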