
    Briques de base pour la réalisation d'architectures parallèles spécialisées (Building Blocks for Implementing Specialized Parallel Architectures)

    Research and development work on image-processing applications, and in particular on digital video signal processing, requires flexible components that allow algorithms to be implemented at full scale so that they can be validated in a real-time context. The aim of this report is to present the new programmable application-specific circuits that serve as "building blocks" for implementing specialized parallel architectures. The signal-processing (DSP) approaches of Texas Instruments and Motorola, the image- and signal-processing approach proposed by Intel (iWARP), and the video-processing circuits from Philips, NEC, and ITT are considered.

    Architecture and Compiler Tradeoffs for a Long Instruction Word Microprocessor

    A very long instruction word (VLIW) processor exploits parallelism by controlling multiple operations in a single instruction word. This paper describes the architecture and compiler tradeoffs in the design of iWarp, a VLIW single-chip microprocessor developed in a joint project with Intel Corp. The iWarp processor is capable of specifying up to nine operations in an instruction word and has a peak performance of 20 million floating-point operations and 20 million integer operations per second. An optimizing compiler has been constructed and used as a tool to evaluate the different architectural proposals in the development of iWarp. We present here the analysis and compiler optimizations for those architectural features that address two key issues in the design of a VLIW microprocessor: code density and a streamlined execution cycle. We support the results of our analysis with performance data for the Livermore Loops and a selection of programs from the LINPACK library.

    Exploiting Fine-Grain Concurrency: Analytical Insights in Superscalar Processor Design

    This dissertation develops analytical models to provide insight into various design issues associated with superscalar-type processors, i.e., the processors capable of executing multiple instructions per cycle. A survey of the existing machines and literature has been completed with a proposed classification of various approaches for exploiting fine-grain concurrency. Optimization of a single pipeline is discussed based on an analytical model. The model-predicted performance curves are found to be in close proximity to published results using simulation techniques. A model is also developed for comparing different branch strategies for single-pipeline processors in terms of their effectiveness in reducing branch delay. The additional instruction fetch traffic generated by certain branch strategies is also studied and is shown to be a useful criterion for choosing between equally well performing strategies. Next, processors with multiple pipelines are modelled to study the tradeoffs associated with deeper pipelines versus multiple pipelines. The model developed can reveal the cause of performance bottleneck: insufficient resources to exploit discovered parallelism, insufficient instruction stream parallelism, or insufficient scope of concurrency detection. The cost associated with speculative (i.e., beyond basic block) execution is examined via probability distributions that characterize the inherent parallelism in the instruction stream. The throughput prediction of the analytic model is shown, using a variety of benchmarks, to be close to the measured static throughput of the compiler output, under resource and scope constraints. Further experiments provide misprediction delay estimates for these benchmarks under scope constraints, assuming beyond-basic-block, out-of-order execution and run-time scheduling. These results were derived using traces generated by the Multiflow TRACE SCHEDULING™(*) compacting C and FORTRAN 77 compilers.
    A simplified extension to the model to include multiprocessors is also proposed. The extended model is used to analyze combined systems, such as superpipelined multiprocessors and superscalar multiprocessors, both with shared memory. It is shown that the number of pipelines (or processors) at which the maximum throughput is obtained is increasingly sensitive to the ratio of memory access time to network access delay, as memory access time increases. Further, as a function of inter-iteration dependency distance, optimum throughput is shown to vary nonlinearly, whereas the corresponding optimum number of processors varies linearly. The predictions from the analytical model agree with published results based on simulations. (*)TRACE SCHEDULING is a trademark of Multiflow Computer, Inc.