2,488 research outputs found

    A cost-effective clustered architecture

    In current superscalar processors, all floating-point resources are idle during the execution of integer programs. As previous works show, this problem can be alleviated if the floating-point cluster is extended to execute simple integer instructions. With minor hardware modifications to a conventional superscalar processor, the issue width can potentially be doubled without increasing the hardware complexity. In fact, the result is a clustered architecture with two heterogeneous clusters. We propose to extend this architecture with a dynamic steering logic that sends the instructions to either cluster. The performance of clustered architectures depends on the inter-cluster communication overhead and the workload balance. We present a scheme that uses run-time information to optimise the trade-off between these figures. The evaluation shows that this scheme can achieve an average speed-up of 35% over a conventional 8-way issue (4 int+4 fp) machine and that it outperforms the previously proposed one. Peer Reviewed. Postprint (published version).

    Empowering a helper cluster through data-width aware instruction selection policies

    Narrow values, which can be represented with fewer bits than the full machine width, occur very frequently in programs. On the other hand, clustering mechanisms enable cost- and performance-effective scaling of processor back-end features. Those attributes can be combined synergistically to design special clusters operating on narrow values (a.k.a. helper clusters), potentially providing performance benefits. We complement a 32-bit monolithic processor with a low-complexity 8-bit helper cluster. Then, as our main focus, we propose various ideas for selecting suitable instructions to execute in the data-width based clusters. We add data-width information as another instruction steering decision metric and introduce new data-width based selection algorithms which also consider dependency, inter-cluster communication and load imbalance. Utilizing those techniques, the performance of a wide range of workloads is substantially increased; the helper cluster achieves an average speedup of 11% for a wide range of 412 apps. When focusing on integer applications, the speedup can be as high as 22% on average. Peer Reviewed. Postprint (published version).
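The notion of a narrow value in the abstract above can be illustrated with a minimal sketch. It assumes two's-complement values, the 32-bit machine width and 8-bit helper width mentioned in the abstract, and a deliberately simplified steering rule; the function names `is_narrow` and `steer` are hypothetical, and the paper's actual selection algorithms (which also weigh dependency, communication and load imbalance) are not reproduced here.

```python
def is_narrow(value: int, narrow_bits: int = 8, machine_bits: int = 32) -> bool:
    """Return True if a machine_bits-wide two's-complement value fits in narrow_bits."""
    # Keep only the low machine_bits bits, as a register would.
    value &= (1 << machine_bits) - 1
    # Reinterpret the bit pattern as a signed integer.
    if value >= 1 << (machine_bits - 1):
        value -= 1 << machine_bits
    lowest = -(1 << (narrow_bits - 1))
    highest = (1 << (narrow_bits - 1)) - 1
    return lowest <= value <= highest


def steer(operands, result) -> str:
    """Hypothetical rule: use the helper cluster only if every source operand
    and the result are narrow; otherwise fall back to the main cluster."""
    if all(is_narrow(v) for v in (*operands, result)):
        return "helper"
    return "main"
```

Under this rule, `steer((3, 4), 7)` selects the helper cluster, while `steer((100, 100), 200)` goes to the main cluster because the result no longer fits in a signed 8-bit value.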

    A new hardware prefetching scheme based on dynamic interpretation of the instruction stream

    It is well known that memory latency is a major deterrent to achieving the maximum possible performance of today's high-speed RISC processors. Techniques to reduce or tolerate large memory latencies become essential for achieving high processor utilization. Many methods, ranging from software to hardware solutions, have been studied with varying amounts of success. Most techniques have concentrated on data prefetching. However, our simulations show that the CPU is stalled up to 50% of the time waiting for instructions. The instruction memory latency reduction technique typically used in CPU designs today is the one block look-ahead (OBL) method. In this thesis, I present a new hardware prefetching scheme based on dynamic interpretation of the instruction stream. This is done by adding a small pipeline to the cache that scans forward in the instruction stream, interpreting each instruction and predicting the future execution path. It then prefetches what it predicts the CPU will be executing in the near future. The pipelined prefetching engine has been shown to be a very effective technique for decreasing the instruction stall cycles in typical on-chip cache memories used today. It performs well, yielding reductions in stall cycles of up to 30% or more for both scientific and general purpose programs, and has been shown to reduce the number of instruction stall cycles compared to the OBL technique as well. The idea of sub-line prefetching was also studied and presented. It was thought that prefetching full cache lines might present too much overhead in terms of bus bandwidth, so prefetches should only fill partial cache lines instead. However, it was determined that prefetching partial cache lines does not show any benefit when dealing with cache lines smaller than 128 bytes.
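The one block look-ahead (OBL) baseline named in the abstract above can be sketched as a toy simulation. It assumes a fully associative LRU cache indexed by block number, the "prefetch on every access" variant of OBL, and prefetches that arrive instantly, so it counts misses rather than stall cycles. The name `count_misses` is hypothetical, and the thesis's pipelined prefetch engine is not modelled.

```python
from collections import OrderedDict


def count_misses(block_trace, capacity, obl=False):
    """Count instruction-cache misses for a trace of block numbers.

    With obl=True, every access to block b also brings in block b + 1
    (one block look-ahead). Timing is ignored in this simplified model.
    """
    cache = OrderedDict()  # keys are resident blocks, ordered from LRU to MRU
    misses = 0

    def touch(block):
        # Insert the block, or mark it most recently used if already present.
        if block in cache:
            cache.move_to_end(block)
        else:
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block

    for block in block_trace:
        if block not in cache:
            misses += 1
        touch(block)
        if obl:
            touch(block + 1)  # look-ahead prefetch of the next sequential block
    return misses
```

On a purely sequential trace such as `range(8)` with a four-block cache, the model reports eight misses without OBL and a single miss with it. Traces with taken branches benefit less, which is the gap a path-predicting prefetcher aims to close.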

    Empirical and Statistical Application Modeling Using On-Chip Performance Monitors

    To analyze the performance of applications and architectures, both programmers and architects desire formal methods to explain anomalous behavior. To this end, we present various methods that utilize non-intrusive, performance-monitoring hardware only recently available on microprocessors to provide further explanations of observed behavior. All the methods attempt to characterize and explain the instruction-level parallelism achieved by codes on different architectures. We also present a prototype tool automating the analysis process to exploit the advantages of the empirical and statistical methods proposed. The empirical, statistical and hybrid methods are discussed and explained with case study results provided. The given methods add to the wealth of tools available to programmers and architects for generally understanding the performance of scientific applications. Specifically, the models and tools presented provide new methods for evaluating and categorizing application performance. The empirical memory model serves to quantify the hierarchical memory performance of applications by inferring the incurred latencies of codes after the effects of latency hiding techniques are realized. The instruction-level model and its extensions model on-chip performance analytically, giving insight into inherent performance bottlenecks in superscalar architectures. The statistical model and its hybrid extension provide other methods of categorizing codes via their statistical variations. The PTERA performance tool automates the use of performance counters for use by these methods across platforms, making the modeling process easier still. These unique methods provide previously unavailable alternatives for performance modeling and categorization, in an attempt to utilize the inherent modeling capabilities of performance monitors on commodity processors for scientific applications.

    CAREER: Automated software understanding for retargeting embedded image processing software for data parallel execution

    Issued as final report. National Science Foundation (U.S.)