141 research outputs found

    A performance study on dynamic load balancing algorithms.

    by Sau-ming Lau. Thesis (M.Phil.)--Chinese University of Hong Kong, 1995. Includes bibliographical references (leaves 131-134).
    Contents: Abstract; Acknowledgement; List of Tables; List of Figures.
    Chapter 1: Introduction.
    Chapter 2: Basic Concepts and Related Work. 2.1 Components of Dynamic Load Balancing Algorithms; 2.2 Classification of Load Balancing Algorithms (2.2.1 Casavant and Kuhl's Taxonomy).
    Chapter 3: System Model and Assumptions. 3.1 The System Model and Assumptions; 3.2 Survey on Cost Models (3.2.1 Eager, Lazowska, and Zahorjan's Model; 3.2.2 Shivaratri, Krueger, and Singhal's Model); 3.3 Our Cost Model (3.3.1 Design Philosophy; 3.3.2 Polling Query Cost Model; 3.3.3 Load State Broadcasting Cost Model; 3.3.4 Task Assignment Cost Model; 3.3.5 Task Migration Cost Model; 3.3.6 Execution Priority; 3.3.7 Simulation Parameter Values); 3.4 Performance Metrics.
    Chapter 4: A Performance Study on Load Information Dissemination Strategies. 4.1 Algorithm Descriptions (4.1.1 Transfer Policy; 4.1.2 Information Policy; 4.1.3 Location Policy; 4.1.4 Categorization of the Algorithms); 4.2 Simulations and Analysis of Results (4.2.1 Performance Comparisons; 4.2.2 Effect of Imbalance Factor on AWLT Algorithms; 4.2.3 Comparison of Average Performance; 4.2.4 Raw Simulation Results); 4.3 Discussions.
    Chapter 5: Resolving Processor Thrashing with Batch Assignment. 5.1 The GR.batch Algorithm (5.1.1 The Guarantee and Reservation Protocol; 5.1.2 The Location Policy; 5.1.3 Batch Size Determination; 5.1.4 The Complete GR.batch Description); 5.2 Additional Performance Metrics; 5.3 Simulations and Analysis of Results; 5.4 Discussions.
    Chapter 6: Applying Batch Assignment to Systems with Bursty Task Arrival Patterns. 6.1 Bursty Workload Pattern Characterization Model; 6.2 Algorithm Descriptions (6.2.1 The GR.batch Algorithm; 6.2.2 The SK.single Algorithm; 6.2.3 Summary of Algorithm Properties); 6.3 Analysis of Simulation Results (6.3.1 Performance Comparison; 6.3.2 Time Trace); 6.4 Discussions.
    Chapter 7: A Preliminary Study on Task Assignment Augmented with Migration. 7.1 Algorithm Descriptions (7.1.1 Information Policy; 7.1.2 Location Policy; 7.1.3 Transfer Policy; 7.1.4 The Three Load Balancing Algorithms); 7.2 Simulations and Analysis of Results (7.2.1 Even Task Service Time; 7.2.2 Uneven Task Service Time); 7.3 Discussions.
    Chapter 8: Assignment Augmented with Migration Revisited. 8.1 Algorithm Descriptions (8.1.1 The GR.BATCH.A Algorithm; 8.1.2 The SK.SINGLE.AM Algorithm; 8.1.3 Summary of Algorithm Properties); 8.2 Simulations and Analysis of Results (8.2.1 Performance Comparisons; 8.2.2 Effect of Workload Imbalance); 8.3 Discussions.
    Chapter 9: Applying Batch Transfer to Heterogeneous Systems with Many Task Classes. 9.1 Heterogeneous System Model (9.1.1 Processing Node Specification; 9.1.2 Task Type Specification; 9.1.3 Workload State Measurement; 9.1.4 Task Selection Candidates); 9.2 Algorithm Descriptions (9.2.1 First Category: The SK.single Variations; 9.2.2 Second Category: The GR.batch Variation Modeled with SSP); 9.3 Analysis of Simulation Results.
    Chapter 10: Conclusions and Future Work.
    Bibliography.
    Appendix A: System Model Notations and Definitions (A.1 Processing Node Model; A.2 Cost Models; A.3 Load Measurement; A.4 Batch Size Determination Rules; A.5 Bursty Arrivals Modeling; A.6 Heterogeneous Systems Modeling).
    Appendix B: Shivaratri and Krueger's Location Policy.
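    The chapter headings above name the standard policy components of a dynamic load balancer: a transfer policy (when to move work), an information policy (here, polling queries), and a location policy (where to send it). As a rough, hypothetical illustration only (the node model, thresholds, and names are mine, not the thesis's), a sender-initiated scheme with a polling location policy can be sketched as:

```python
import random

# Illustrative sketch, not the thesis's algorithms: THRESHOLD and
# POLL_LIMIT are hypothetical parameters.
THRESHOLD = 4   # queue length above which a node tries to offload
POLL_LIMIT = 3  # max peers polled per attempt (bounds query cost)

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.queue = []   # pending tasks; queue length is the load index

    def load(self):
        return len(self.queue)

def transfer_policy(node):
    """Sender-initiated: offload only when local load exceeds THRESHOLD."""
    return node.load() > THRESHOLD

def location_policy(sender, nodes):
    """Poll up to POLL_LIMIT random peers; accept the first under-loaded one."""
    peers = [n for n in nodes if n is not sender]
    for peer in random.sample(peers, min(POLL_LIMIT, len(peers))):
        if peer.load() < THRESHOLD:   # the polling query itself
            return peer
    return None   # every polled peer was busy; keep the task local

def balance(nodes):
    """One balancing round: each overloaded node attempts one transfer."""
    for node in nodes:
        if transfer_policy(node):
            peer = location_policy(node, nodes)
            if peer is not None:
                peer.queue.append(node.queue.pop())

# One overloaded node sheds a task to an idle peer.
nodes = [Node(0), Node(1), Node(2)]
nodes[0].queue = list(range(6))   # load 6 > THRESHOLD
balance(nodes)
```

    Batch assignment, as studied in Chapters 5-9, generalizes the single-task transfer in `balance` to negotiate a whole batch of tasks per polling exchange, which is what suppresses the processor thrashing the thesis targets.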

    Efficient caching algorithms for memory management in computer systems

    As disk performance continues to lag behind that of memory systems and processors, fully utilizing memory to reduce disk accesses is a highly effective way to improve overall system performance. Furthermore, to serve the applications running on a computer in a distributed system, not only the local memory but also the memory on remote servers must be managed effectively to minimize I/O operations. The critical challenges in effective memory cache management include: (1) insightfully understanding and quantifying the locality inherent in memory access requests; (2) effectively utilizing the locality information in replacement algorithms; (3) intelligently placing and replacing data in the multi-level caches of a distributed system; and (4) ensuring that the overheads of the proposed schemes are acceptable.
    This dissertation provides solutions and makes novel contributions in application locality quantification, general replacement algorithms, low-cost replacement policies, thrashing protection, and multi-level cache management in a distributed system. First, the dissertation proposes a new method to quantify locality strength and to accurately identify data with strong locality. It also provides a new replacement algorithm that significantly outperforms existing algorithms. Second, considering the extremely low cost required of replacement policies in virtual memory management, the dissertation proposes a policy that meets those requirements while considerably exceeding the performance of existing policies. Third, the dissertation provides an effective scheme to protect the system from thrashing when running memory-intensive applications. Finally, the dissertation provides a multi-level block placement and replacement protocol for a distributed client-server environment that exploits the non-uniform locality strengths in I/O access requests.
    The methodology used in this study includes careful application behavior characterization, system requirement analysis, algorithm design, trace-driven simulation, and system implementation. A main conclusion of the work is that there is still much room for innovation and significant performance improvement in the seemingly mature and stable policies that are broadly used in current operating system design.
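    As a point of reference for the replacement algorithms discussed above, here is the classic LRU baseline that locality-aware work of this kind aims to beat; the dissertation's own algorithms are not reproduced here, and the cyclic-trace example is a standard textbook worst case, not one of its traces:

```python
from collections import OrderedDict

# Minimal LRU cache: the conventional baseline that locality-quantifying
# replacement algorithms improve upon.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # ordered oldest -> most recent

    def access(self, block):
        """Return True on a hit; on a miss, evict the least-recent block."""
        if block in self.blocks:
            self.blocks.move_to_end(block)   # refresh recency
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the LRU victim
        self.blocks[block] = None
        return False

# A cyclic pattern one block larger than the cache is LRU's worst case:
# every access misses even though the trace loops with perfect regularity.
# This is exactly the kind of behavior locality quantification exposes.
cache = LRUCache(capacity=3)
hits = sum(cache.access(b) for b in [1, 2, 3, 4] * 3)
```
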

    Cactus Framework: Black Holes to Gamma Ray Bursts

    Gamma Ray Bursts (GRBs) are intense narrowly-beamed flashes of gamma-rays of cosmological origin. They are among the most scientifically interesting astrophysical systems, and the riddle concerning their central engines and emission mechanisms is one of the most complex and challenging problems of astrophysics today. In this article we outline our petascale approach to the GRB problem and discuss the computational toolkits and numerical codes that are currently in use and that will be scaled up to run on emerging petaflop-scale computing platforms in the near future. Petascale computing will require additional ingredients beyond conventional parallelism. We consider some of the challenges that future petascale architectures will pose, and discuss our plans for the future development of the Cactus framework and its applications to meet these challenges and benefit from these new architectures.

    Automatic contention detection and amelioration for data-intensive operations


    GPU PERFORMANCE MODELLING AND OPTIMIZATION

    Ph.D. (NUS-TU/E joint Ph.D.)

    Memory hierarchies for future HPC architectures

    Efficiently managing the memory subsystem of modern multi/manycore architectures is increasingly becoming a challenge as systems grow in complexity and heterogeneity. In the field of high performance computing (HPC) in particular, where massively parallel architectures are used and input sets of several terabytes are common, careful management of the memory hierarchy is crucial to exploit the full computing power of these systems. The goal of this thesis is to provide computer architects with valuable information to guide the design of future systems, and in particular of those more widely used in the field of HPC, i.e., symmetric multicore processors (SMPs) and GPUs. With that aim, we present an analysis of some of the inefficiencies and shortcomings of current memory management techniques and propose two novel schemes leveraging the opportunities that arise from the use of new and emerging programming models and computing paradigms. The first contribution of this thesis is a block prefetching mechanism for task-based programming models. Using a task-based programming model simplifies parallel programming and allows for better resource utilization in the supercomputers used in the field of HPC, while enabling sophisticated memory management techniques. The scheme proposed relies on a memory-aware runtime system to guide prefetching while avoiding the main drawbacks of traditional prefetching mechanisms, i.e., cache pollution and lack of timeliness. It leverages the information provided by the user about tasks' input data to prefetch contiguous blocks of memory that are certain to be useful. The proposed scheme targets SMPs with large cache hierarchies and uses heuristics to dynamically decide the best cache level to prefetch into without evicting useful data. The focus of this thesis then turns to heterogeneous architectures combining GPUs and traditional multicore processors.
The current trend towards tighter coupling of GPU and CPU enables new collaborative computations that tax the memory subsystem in a different manner than previous heterogeneous computations did, and requires careful analysis to understand the trade-offs that are to be expected when designing future memory organizations. The second contribution is an in-depth analysis of the impact of sharing the last-level cache between GPU and CPU cores on a system where the GPU is integrated on the same die as the CPU. The analysis focuses on the effect that a shared cache can have on collaborative computations where GPU and CPU threads concurrently work on a problem and share data at fine granularities. The results presented here show that sharing the last-level cache is largely beneficial as it allows for better resource utilization. In addition, the evaluation shows that collaborative computations benefit significantly from the faster CPU-GPU communication and higher cache hit rates that a shared cache level provides. The final contribution of this thesis analyzes the inefficiencies and drawbacks of demand paging as currently implemented in discrete GPUs by NVIDIA. Then, it proposes a novel memory organization and dynamic migration scheme that allows for efficient data sharing between GPU and CPU, especially when executing collaborative computations where data is migrated back and forth between the two separate memories. This scheme migrates data at cache-line granularity transparently to the user and operating system, avoiding false sharing and the unnecessary data transfers that occur with demand paging. The results show that the proposed scheme is able to outperform the baseline system by reducing the migration latency of data that is copied multiple times between the two memories.
In addition, analysis of different interconnect latencies shows that fine-grained data sharing between GPU and CPU is feasible as long as future interconnect technologies achieve four to five times lower round-trip times than PCI-Express 3.0.
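The block-prefetching contribution described above can be illustrated with a toy sketch. Everything here (the level capacities, the fit heuristic, and all names) is a hypothetical reconstruction of the stated idea, not the thesis's implementation: because a task-based runtime knows each task's input blocks before the task runs, it can prefetch whole blocks and pick the deepest cache level with room, rather than guessing addresses as a hardware prefetcher must.

```python
# Illustrative cache hierarchy; capacities are in blocks and hypothetical.
CACHE_LEVELS = [
    {"name": "L1", "capacity": 4,  "resident": set()},
    {"name": "L2", "capacity": 16, "resident": set()},
    {"name": "L3", "capacity": 64, "resident": set()},
]

def choose_prefetch_level(task_blocks):
    """Return the closest level that can hold the task's inputs without
    evicting blocks that are already resident (and presumed useful)."""
    need = set(task_blocks)
    for level in CACHE_LEVELS:
        # Space held by unrelated resident blocks is left untouched.
        free = level["capacity"] - len(level["resident"] - need)
        if free >= len(need):
            return level["name"]
    return None   # inputs too large for any level: skip prefetching

def prefetch(task_blocks):
    """Prefetch the task's input blocks into the chosen level, if any."""
    name = choose_prefetch_level(task_blocks)
    if name is not None:
        level = next(l for l in CACHE_LEVELS if l["name"] == name)
        level["resident"].update(task_blocks)
    return name

# A 3-block task fits in L1; a later 40-block task only fits in L3.
small = prefetch(["a", "b", "c"])
large = prefetch([f"blk{i}" for i in range(40)])
```

The key property this sketch captures is timeliness without pollution: the runtime only prefetches data it knows a scheduled task will touch, and only into a level where it will not displace useful resident blocks.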

    Integrated shared-memory and message-passing communication in the Alewife multiprocessor

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998. Includes bibliographical references (p. 237-246) and index. By John David Kubiatowicz.