3,748 research outputs found

    Scheduling with Communication Delays

    Get PDF
    International audiencehanbook on ordonnancemen

    Scheduling Under Non-Uniform Job and Machine Delays

    Get PDF

    Instruction replication for clustered microarchitectures

    Get PDF
    This work presents a new compilation technique that uses instruction replication in order to reduce the number of communications executed on a clustered microarchitecture. For such architectures, the need to communicate values between clusters can result in a significant performance loss. Inter-cluster communications can be reduced by selectively replicating an appropriate set of instructions. However, instruction replication must be done carefully since it may also degrade performance due to the increased contention it can place on processor resources. The proposed scheme is built on top of a previously proposed state-of-the-art modulo scheduling algorithm that effectively reduces communications. Results show that the number of communications can decrease using replication, which results in significant speed-ups. IPC is increased by 25% on average for a 4-cluster microarchitecture and by as mush as 70% for selected programs.Peer ReviewedPostprint (published version

    Grain-size optimization and scheduling for distributed memory architectures

    Get PDF
    The problem of scheduling parallel programs for execution on distributed memory parallel architectures has become the subject of intense research in recent, years. Because of the high inter-processor communication overhead in existing parallel machines, a crucial step in scheduling is task clustering, the process of coalescing heavily communicating fine grain tasks into coarser ones in order to reduce the communication overhead so that the overall execution time is minimized. The thesis of this research is that the task of exposing the parallelism in a given application should be left to the algorithm designer. On the other hand, the task of limiting the parallelism in a chosen parallel algorithm is best handled by the compiler or operating system for the target parallel machine. Toward this end, we have developed CASS (for Clustering And Scheduling System), a. task management system that provides facilities for automatic granularity optimization and task scheduling of parallel programs on distributed memory parallel architectures. In CASS, a task graph generated by a profiler is used by the clustering module to find the best granularity al which to execute the program so that the overall execution time is minimized. The scheduling module maps the clusters onto a. fixed number of processors and determines the order of execution of tasks in each processor. The output of scheduling module is then used by a code generator to generate machine instructions. CASS employs two efficient heuristic algorithms for clustering static task graphs: CASS-I for clustering with task duplication, and CASS-II for clustering without task duplication. It is shown that the clustering algorithms used by CASS outperform the best known algorithms reported in the literature. For the scheduling module in CASS, a heuristic algorithm based on load balancing is used to merge clusters such that the number of clusters matches the number of available physical processors. We also investigate task clustering algorithms for dynamic task graphs and show that it is inherently more difficult than the static case

    Improving latency tolerance of multithreading through decoupling

    Get PDF
    The increasing hardware complexity of dynamically scheduled superscalar processors may compromise the scalability of this organization to make an efficient use of future increases in transistor budget. SMT processors, designed over a superscalar core, are therefore directly concerned by this problem. The article presents and evaluates a novel processor microarchitecture which combines two paradigms: simultaneous multithreading and access/execute decoupling. Since its decoupled units issue instructions in order, this architecture is significantly less complex, in terms of critical path delays, than a centralized out-of-order design, and it is more effective for future growth in issue-width and clock speed. We investigate how both techniques complement each other. Since decoupling features an excellent memory latency hiding efficiency, the large amount of parallelism exploited by multithreading may be used to hide the latency of functional units and keep them fully utilized. The study shows that, by adding decoupling to a multithreaded architecture, fewer threads are needed to achieve maximum throughput. Therefore, in addition to the obvious hardware complexity reduction, it places lower demands on the memory system. The study also reveals that multithreading by itself exhibits little memory latency tolerance. Results suggest that most of the latency hiding effectiveness of SMT architectures comes from the dynamic scheduling. On the other hand, decoupling is very effective at hiding memory latency. An increase in the cache miss penalty from 1 to 32 cycles reduces the performance of a 4-context multithreaded decoupled processor by less than 2 percent. For the nondecoupled multithreaded processor, the loss of performance is about 23 percent.Peer ReviewedPostprint (published version

    Instruction scheduling in micronet-based asynchronous ILP processors

    Get PDF
    • …
    corecore