3 research outputs found

    Design space exploration strategies for FPGA implementation of signal processing systems using CAL dataflow program

    This paper presents some strategies for design space exploration of FPGA-based signal processing systems that are specified using the CAL dataflow language. The actor-oriented, high level of abstraction provided by CAL allows flexible exploration and consequently results in a wide range of feasible design implementations. We have applied and extended the existing techniques for refactoring and pipelining actors and actions by means of critical path analysis, and introduced some new buffering techniques based on heuristics. Combinations of these techniques have been applied to the CAL specification of the MPEG-4 video decoder and synthesized to HDL for evaluation in the design implementation space. Results show that, using our configuration for the exploration of 48 design points, a throughput range of roughly 8x has been achieved, with slice, block RAM, frequency, and latency ranges of 1.3x, 2.5x, 2.5x, and 2.9x respectively.
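
    The key quantity behind the refactoring and pipelining decisions described above is the length of the longest dependency chain through the actors and actions. As a rough illustration only (the DAG, latencies, and values below are invented for the example, not taken from the paper), the following sketch computes that critical-path length over a small hardcoded dependency graph:

```c
/* Sketch: critical-path (longest-path) length of a small action dependency
 * DAG. Node latencies and edges are made-up illustrative values. */
#include <stdio.h>

#define N 5
#define E 5

static const int latency[N] = {3, 2, 4, 1, 2};                 /* per-action delay  */
static const int edge[E][2] = {{0,1},{0,2},{1,3},{2,3},{3,4}}; /* dependency u -> v */

int main(void)
{
    int finish[N]; /* longest finish time of a chain ending at each node */

    for (int v = 0; v < N; v++)
        finish[v] = latency[v];

    /* Nodes are numbered in topological order and edges are sorted by source,
     * so one relaxation pass over the edge list is sufficient. */
    for (int e = 0; e < E; e++) {
        int u = edge[e][0], v = edge[e][1];
        if (finish[u] + latency[v] > finish[v])
            finish[v] = finish[u] + latency[v];
    }

    int critical = 0;
    for (int v = 0; v < N; v++)
        if (finish[v] > critical)
            critical = finish[v];

    printf("critical path length: %d\n", critical); /* 10 here: 0 -> 2 -> 3 -> 4 */
    return 0;
}
```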

    Extracting Critical Path Graphs from MPI Applications

    The critical path is one of the fundamental runtime characteristics of a parallel program. It identifies the longest execution sequence without wait delays. In other words, the critical path is the global execution path that inflicts wait operations on other nodes without itself being stalled. Hence, it dictates the overall runtime, and knowing it is important for understanding an application's runtime and message behavior and for targeting optimizations. We have developed a toolset that identifies the critical path of MPI applications, extracts it, and then produces a graphical representation of the corresponding program execution graph to visualize it. To implement this, we intercept all MPI library calls, use the information to build the relevant subset of the execution graph, and then extract the critical path from there. We have applied our technique to several scientific benchmarks and successfully produced critical path diagrams for applications running on up to 128 processors.
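
    As a minimal sketch of the interception idea (assuming the standard PMPI profiling interface, not the authors' toolset), a wrapper library can redefine each MPI call, record the event, and forward to the real implementation; the recorded sends and receives then become the edges of the execution graph:

```c
/* Sketch: intercepting MPI_Send via the PMPI profiling interface. A real
 * tool would wrap every MPI call and turn the events into graph edges. */
#include <mpi.h>
#include <stdio.h>

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm); /* forward */
    double t1 = MPI_Wtime();

    /* Here the tool would append a send edge (this rank -> dest) to its
     * local slice of the execution graph; printing stands in for that. */
    fprintf(stderr, "[rank %d] MPI_Send to %d, tag %d, %.6f s\n",
            rank, dest, tag, t1 - t0);
    return rc;
}
```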

    Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework

    The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs. This thesis presents a framework for systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g. due to load imbalances between parallel execution streams. As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable portable data acquisition and subsequent analysis for programs with offload directives. At present, these interfaces are already part of the latest OpenACC and OpenMP API specifications. The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process, which can be applied across programming models. Based on the detection of wait or idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root cause as well as the critical-path share of each program region. Thus, it determines the influence of program regions on the load balancing between execution streams and on the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace enhanced with information about wait states, their cause, and the critical path. In addition, a ranking based on the amount of waiting time a program region caused on the critical path highlights program regions that are relevant for program optimization. The scalability of the proposed performance analysis and its implementation is demonstrated using High-Performance Linpack (HPL), while the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated in order to show the applicability of the analysis.
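
    As a rough, assumption-based illustration of the kind of portable data acquisition such tool interfaces enable (not the implementation from this thesis), the following sketch uses the OpenMP tool interface (OMPT) to register a callback that fires at the begin and end of every target-offload region:

```c
/* Sketch: a first-party OMPT tool that observes target (offload) regions. */
#include <omp-tools.h>
#include <stdio.h>

static void on_target(ompt_target_t kind, ompt_scope_endpoint_t endpoint,
                      int device_num, ompt_data_t *task_data,
                      ompt_id_t target_id, const void *codeptr_ra)
{
    /* A performance tool would timestamp the event here; we just report it. */
    printf("target region %s on device %d\n",
           endpoint == ompt_scope_begin ? "begin" : "end", device_num);
}

static int tool_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                           ompt_data_t *tool_data)
{
    ompt_set_callback_t set_callback =
        (ompt_set_callback_t) lookup("ompt_set_callback");
    set_callback(ompt_callback_target, (ompt_callback_t) on_target);
    return 1; /* non-zero keeps the tool active */
}

static void tool_finalize(ompt_data_t *tool_data) { }

/* The OpenMP runtime looks for this symbol when the tool is linked or
 * preloaded, and calls it before the first OpenMP construct executes. */
ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                          const char *runtime_version)
{
    static ompt_start_tool_result_t result = {
        &tool_initialize, &tool_finalize, {.value = 0}
    };
    return &result;
}
```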