
    Shape-based cost analysis of skeletal parallel programs

    This work presents an automatic cost-analysis system for an implicitly parallel skeletal programming language. Although deducing interesting dynamic characteristics of parallel programs (and in particular, run time) is well known to be an intractable problem in the general case, it can be alleviated by placing restrictions upon the programs which can be expressed. By combining two research threads, the “skeletal” and “shapely” paradigms which take this route, we produce a completely automated, computation and communication sensitive cost analysis system. This builds on earlier work in the area by quantifying communication as well as computation costs, with the former being derived for the Bulk Synchronous Parallel (BSP) model. We present details of our shapely skeletal language and its BSP implementation strategy together with an account of the analysis mechanism by which program behaviour information (such as shape and cost) is statically deduced. This information can be used at compile-time to optimise a BSP implementation and to analyse computation and communication costs. The analysis has been implemented in Haskell. We consider different algorithms expressed in our language for some example problems and illustrate each BSP implementation, contrasting the analysis of their efficiency by traditional, intuitive methods with that achieved by our cost calculator. The accuracy of cost predictions by our cost calculator against the run time of real parallel programs is tested experimentally. Previous shape-based cost analysis required all elements of a vector (our nestable bulk data structure) to have the same shape. We partially relax this strict requirement on data structure regularity by introducing new shape expressions in our analysis framework. We demonstrate that this allows us to achieve the first automated analysis of a complete derivation, the well-known maximum segment sum algorithm of Skillicorn and Cai.
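
    As a rough, hedged illustration of the kind of expression a shape-and-cost analysis of this sort can produce, the Haskell sketch below estimates the BSP cost of a map skeleton over a vector whose shape (here simply its length) is known statically. The type names, the cost formula and the BSP parameters are simplified assumptions for exposition; they are not taken from the system described above.

        -- Hedged sketch: a simplified BSP cost estimate for a map skeleton over a
        -- shapely vector. All names and the cost formula are illustrative
        -- assumptions, not the analysis system described in the abstract.

        -- BSP machine parameters: p processors, g (cost per word communicated)
        -- and l (barrier synchronisation cost), all in the same time unit.
        data BspParams = BspParams { p :: Int, g :: Double, l :: Double }

        -- A "shape" here is just the vector length; the real analysis tracks
        -- richer, nested shape information.
        type Shape = Int

        -- Cost of one superstep that maps a function of known per-element cost
        -- over a vector of the given shape, block-distributed over p processors.
        mapCost :: BspParams -> Double -> Shape -> Double
        mapCost (BspParams procs gc lc) elemCost n =
          let blk  = fromIntegral ((n + procs - 1) `div` procs) :: Double
              comp = blk * elemCost   -- local computation on the largest block
              comm = blk * gc         -- h-relation: redistributing one block
          in  comp + comm + lc        -- BSP superstep cost: w + h*g + l

        main :: IO ()
        main = print (mapCost (BspParams 16 4.0 100.0) 2.5 100000)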

    Financial Reform and Development in the Philippines, 1980-1997: Imperatives, Performance and Challenges

    This paper discusses financial reform and development in the Philippines during the past one and a half decades and the challenges facing the financial sector in the light of greater international financial integration. Issues on prudential regulation and how it can be improved and strengthened are presented. Key lessons from the Philippine and Southeast Asian experience are also discussed. Keywords: Asian financial crisis, financial sector, financial liberalization, financial integration.

    Toward optimised skeletons for heterogeneous parallel architecture with performance cost model

    High performance architectures are increasingly heterogeneous with shared and distributed memory components, and accelerators like GPUs. Programming such architectures is complicated and performance portability is a major issue as the architectures evolve. This thesis explores the potential for algorithmic skeletons integrating a dynamically parametrised static cost model, to deliver portable performance for mostly regular data parallel programs on heterogeneous architectures. The first contribution of this thesis is to address the challenges of programming heterogeneous architectures by providing two skeleton-based programming libraries: HWSkel for heterogeneous multicore clusters and GPU-HWSkel, which enables GPUs to be exploited as general purpose multi-processor devices. Both libraries provide heterogeneous data parallel algorithmic skeletons including hMap, hMapAll, hReduce, hMapReduce, and hMapReduceAll. The second contribution is the development of cost models for workload distribution. First, we construct an architectural cost model (CM1) to optimise overall processing time for HWSkel heterogeneous skeletons on a heterogeneous system composed of networks of arbitrary numbers of nodes, each with an arbitrary number of cores sharing arbitrary amounts of memory. The cost model characterises the components of the architecture by the number of cores, clock speed, and crucially the size of the L2 cache. Second, we extend the HWSkel cost model (CM1) to account for GPU performance. The extended cost model (CM2) is used in the GPU-HWSkel library to automatically find a good distribution for both a single heterogeneous multicore/GPU node, and clusters of heterogeneous multicore/GPU nodes. Experiments are carried out on three heterogeneous multicore clusters, four heterogeneous multicore/GPU clusters, and three single heterogeneous multicore/GPU nodes. The results of experimental evaluations for four data parallel benchmarks, i.e. sumEuler, Image matching, Fibonacci, and Matrix Multiplication, show that our combined heterogeneous skeletons and cost models can make good use of resources in heterogeneous systems. Moreover, using cores together with a GPU in the same host can deliver good performance either on a single node or on multiple node architectures.
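
    To make the idea of cost-model-driven workload distribution concrete, the hedged Haskell sketch below splits a data-parallel workload across nodes in proportion to a toy capability score built from core count, clock speed and L2 cache size. The scoring formula and all names are illustrative assumptions; they are not the CM1 or CM2 cost models themselves.

        -- Hedged sketch: proportional workload distribution driven by a toy
        -- architectural cost model. The weighting is an illustrative assumption,
        -- not the CM1/CM2 cost models described in the abstract.

        data Node = Node
          { nodeName  :: String
          , cores     :: Int
          , clockGHz  :: Double
          , l2CacheMB :: Double
          }

        -- A crude capability score: more cores, a faster clock and a larger L2
        -- cache all increase the share of work a node receives.
        score :: Node -> Double
        score nd = fromIntegral (cores nd) * clockGHz nd * (1 + l2CacheMB nd / 10)

        -- Split n work items across the nodes in proportion to their scores;
        -- any rounding slack is handed to the last node for simplicity.
        distribute :: Int -> [Node] -> [(String, Int)]
        distribute n nodes =
          let total  = sum (map score nodes)
              shares = [ floor (fromIntegral n * score nd / total) | nd <- nodes ]
              slack  = n - sum shares
              padded = zipWith (+) shares (replicate (length shares - 1) 0 ++ [slack])
          in  zip (map nodeName nodes) padded

        main :: IO ()
        main = mapM_ print $
          distribute 1000000
            [ Node "cpu-node" 16 2.6 20
            , Node "gpu-node" 8  3.2 16
            ]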

    Design of testbed and emulation tools

    The research summarized was concerned with the design of testbed and emulation tools suitable to assist in projecting, with reasonable accuracy, the expected performance of highly concurrent computing systems on large, complete applications. Such testbed and emulation tools are intended for the eventual use of those exploring new concurrent system architectures and organizations, either as users or as designers of such systems. While a range of alternatives was considered, a software-based set of hierarchical tools was chosen to provide maximum flexibility, to ease moving to new computers as technology improves, and to take advantage of the inherent reliability and availability of commercially available computing systems.

    Fault Tolerant Distributed Computing Framework for Scientific Algorithms

    The physical limitations of computing hardware have put a stop to the increase of a single processor core's computing power. However, Moore's law is still maintained through the ever increasing parallelism of computing architectures. At the same time the demand for computational power has been unrelentingly growing, forcing people to adapt the algorithms they use to these parallel architectures. One of the many downsides to parallel architectures is that with the rise in the number of components, the chance of failure of one of these components increases. When it comes to embarrassingly parallel data-intensive algorithms, MapReduce has gone a long way in ensuring users can easily utilize large amounts of distributed computing resources without the fear of losing work. However, this does not apply to iterative communication-intensive algorithms common in the scientific computing domain. In this work a new BSP-inspired (Bulk Synchronous Parallel) programming model is proposed, which adopts an approach similar to continuation passing for implementing parallel algorithms and facilitates the fault tolerance inherent in the BSP program structure. The distributed computing framework NEWT, which is based on the proposed model, is described and used to validate the approach. The framework retains most of the advantages that MapReduce provides, yet efficiently supports a larger assortment of algorithms, such as the aforementioned iterative ones.
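
    The programming model can be pictured as a chain of supersteps in which each step returns the messages to exchange, the new local state, and the continuation to resume with after the barrier; because state and continuation are explicit at every barrier, they form a natural checkpoint. The Haskell sketch below is a hedged, single-process illustration of that shape only; it is not the NEWT framework's API, and message routing and stable-storage checkpointing are stubbed out.

        -- Hedged sketch: a BSP-style superstep chain in continuation-passing
        -- form. Names and types are illustrative, not the NEWT API; real message
        -- exchange and checkpointing (MPI, stable storage) are stubbed out.

        type Msg = Int

        -- A superstep consumes the local state plus the messages delivered at
        -- the previous barrier and yields outgoing messages, a new state, and
        -- either the next superstep (the continuation) or Nothing to stop.
        newtype Superstep s = Superstep
          { runStep :: s -> [Msg] -> ([Msg], s, Maybe (Superstep s)) }

        -- Drive the chain: after each superstep, the (state, continuation) pair
        -- is the natural checkpoint; here "checkpointing" is just printing.
        runBsp :: Show s => s -> [Msg] -> Superstep s -> IO s
        runBsp st inbox step = do
          let (outMsgs, st', next) = runStep step st inbox
          putStrLn ("checkpoint: state = " ++ show st')  -- stand-in for a real checkpoint
          case next of
            Nothing   -> return st'
            -- A real runtime would route outMsgs to their destination processes
            -- and deliver the incoming messages at the next barrier.
            Just cont -> runBsp st' outMsgs cont

        -- Toy example: accumulate message sums over k supersteps.
        example :: Int -> Superstep Int
        example k = Superstep $ \acc msgs ->
          let acc' = acc + sum msgs
          in  ([acc', acc' + 1], acc', if k <= 1 then Nothing else Just (example (k - 1)))

        main :: IO ()
        main = runBsp 0 [1, 2, 3] (example 3) >>= print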

    Profiling large-scale lazy functional programs

    The LOLITA natural language processing system is an example of one of the ever increasing number of large-scale systems written entirely in a functional programming language. The system consists of over 50,000 lines of Haskell code and is able to perform a number of tasks such as semantic and pragmatic analysis of text, context scanning and query analysis. Such a system is more useful if the results are calculated in real time, so the efficiency of the system is paramount. For the past three years we have used the profiling tools supplied with the Haskell compilers GHC and HBC to analyse and reason about our programming solutions and have achieved good results; however, our experience has shown that the profiling life-cycle is often too long to make a detailed analysis of a large system possible, and the profiling results are often misleading. A profiling system is developed which allows three types of functionality not previously found in a profiler for lazy functional programs. Firstly, the profiler is able to produce results based on an accurate method of cost inheritance. We have found that this reduces the possibility of the programmer obtaining misleading profiling results. Secondly, the programmer is able to explore the results after the execution of the program. This is done by selecting and deselecting parts of the program using a post-processor. This greatly reduces the analysis time as no further compilation, execution or profiling of the program is needed. Finally, the new profiling system allows the user to examine aspects of the run-time call structure of the program. This is useful in the analysis of the run-time behaviour of the program. Previous attempts at extending the results produced by a profiler in such a way have failed due to exceptionally high overheads. Exploration of the overheads produced by the new profiling scheme shows that typical overheads in profiling the LOLITA system are a 10% increase in compilation time, a 7% increase in executable size and a 70% run-time overhead. These overheads mean a considerable saving of time in the detailed profiling analysis of a large, lazy functional program.
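
    For readers unfamiliar with cost-centre profiling of Haskell programs, the short hedged sketch below shows the standard GHC-style SCC annotations on which this kind of profiling workflow rests. It only illustrates the general mechanism; the extended profiler developed in this work (accurate cost inheritance, post-mortem exploration, call-structure views) is not reproduced here.

        -- Hedged sketch: ordinary GHC cost-centre annotations, shown only to
        -- illustrate the profiling workflow discussed above; this is not the
        -- extended profiler developed in the thesis.
        --
        -- Build and run, for example:
        --   ghc -prof -fprof-auto -rtsopts Main.hs
        --   ./Main +RTS -p     -- writes Main.prof with per-cost-centre figures

        module Main where

        -- Explicit cost centres let the profile attribute time and allocation
        -- to these expressions rather than to their callers.
        slowSum :: Int -> Int
        slowSum n = {-# SCC "slowSum" #-} sum [1 .. n]

        slowProduct :: Int -> Integer
        slowProduct n = {-# SCC "slowProduct" #-} product [1 .. fromIntegral n]

        main :: IO ()
        main = do
          print (slowSum 1000000)
          print (slowProduct 2000)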

    A BSP algorithm for on-the-fly checking CTL* formulas on security protocols

    This paper presents a distributed (Bulk Synchronous Parallel, or BSP) algorithm to compute on-the-fly whether a structured model of a security protocol satisfies a CTL* formula. Using the structured nature of security protocols allows us to design a simple method for distributing the state space under consideration in a need-driven fashion. Based on this distribution of the states, the algorithm for the logical checking of an LTL formula can be simplified and optimised, allowing, with a few modifications, the design of an efficient algorithm for CTL* checking. Prototype implementations have been developed, allowing us to run benchmarks that investigate the parallel behaviour of our algorithms.
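
    As a hedged illustration of what distributing a state space in an owner-computes fashion can look like, the sketch below assigns protocol states to BSP processes by hashing only an abstracted, structured part of each state, so that related states land on the same process. The state representation and the partition function are assumptions made for exposition; they are not the distribution strategy of the paper.

        -- Hedged sketch: owner-computes partitioning of a state space over BSP
        -- processes. The state type and the choice of hashing only the
        -- "structured" part are illustrative assumptions, not the paper's method.

        import Data.Char (ord)

        -- A protocol state, split into a structured part (say, the session and
        -- agent information that drives exploration) and the remaining data.
        data State = State
          { structuredPart :: String
          , restOfState    :: String
          } deriving (Eq, Show)

        -- A small stand-in hash over the structured part only, so that states
        -- from the same slice of the protocol are owned by the same process and
        -- successor computation stays mostly local.
        owner :: Int -> State -> Int
        owner nProcs s =
          foldl (\h c -> (h * 31 + ord c) `mod` nProcs) 0 (structuredPart s)

        -- Partition a frontier of newly generated states into per-process
        -- outboxes for the next BSP superstep.
        partition :: Int -> [State] -> [[State]]
        partition nProcs states =
          [ [ s | s <- states, owner nProcs s == pid ] | pid <- [0 .. nProcs - 1] ]

        main :: IO ()
        main = mapM_ print $
          partition 4
            [ State "session-1" "nonce-a"
            , State "session-2" "nonce-b"
            , State "session-1" "nonce-c"
            ]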

    HeAT -- a Distributed and GPU-accelerated Tensor Framework for Data Analytics

    To cope with the rapid growth in available data, the efficiency of data analysis and machine learning libraries has recently received increased attention. Although great advancements have been made in traditional array-based computations, most are limited by the resources available on a single computation node. Consequently, novel approaches are required to exploit distributed resources, e.g. distributed memory architectures. To this end, we introduce HeAT, an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API. HeAT utilizes PyTorch as a node-local eager execution engine and distributes the workload on arbitrarily large high-performance computing systems via MPI. It provides both low-level array computations as well as assorted higher-level algorithms. With HeAT, it is possible for a NumPy user to take full advantage of their available resources, significantly lowering the barrier to distributed data analysis. When compared to similar frameworks, HeAT achieves speedups of up to two orders of magnitude.