21 research outputs found

    The Effect of Code Expanding Optimizations on Instruction Cache Design

    Coordinated Science Laboratory was formerly known as Control Systems Laboratory. Funding: National Science Foundation / MIP-8809478; NCR; AMD 29K Advanced Processor Development Division; National Aeronautics and Space Administration / NASA NAG 1-613; N00014-91-J-128

    Advancements in Compiler Design and Optimization Techniques

    The modern period has seen advances in compiler design, optimization techniques, and software system efficiency. This paper provides a thorough analysis of how the most recent developments in compiler design and optimization techniques influence program execution speed, memory utilization, and overall software quality, and of how compilers now produce efficient, well-structured, high-performance code without manual intervention.

    Exploring boundaries in game processing


    Profiling and optimizing K-means algorithms in a beowulf cluster environment

    The K-means algorithm is a well-known statistical agglomeration algorithm used to sort a database of unlabeled items into K groups. As part of the fitness function of an Evolutionary Algorithm (EA), the optimization of the K-means algorithm has become a point of great interest. Although many approaches have been proposed for its parallelization and optimization, very few address the question of scalability and efficiency. In most cases, the description of the execution environment remains opaque and precise profiles of the program are mostly absent. Performance and efficiency issues are quickly relegated to communication issues. We address these deficiencies by presenting a detailed description of two parallel environments: Beowulf-style clusters and Symmetric Multi-Processor (SMP) parallel machines. A mixture of theoretical and empirical models was used to characterize these environments and set baseline expectations pertaining to the K-means algorithm. Because multidisciplinary expertise is required, a detailed use of the Tuning and Analysis Utilities (TAU) is provided to ease the parallel performance profiling task. Coupled with the high-precision counter interface provided by the Performance Application Programming Interface (PAPI), we present a grey-box method by which a parallel master-slave implementation of K-means is evolved into a highly efficient island version of itself. Communication and computational optimizations were guided by prior theoretical and empirical models of the parallel execution environment. Our work has revealed that there is much more to parallel processing than the simple balance between computation and communication. We have brought forth the negative impact of using mathematical libraries for specific problems and identified performance issues specific to some versions of the same series of Message Passing Interface (MPI) libraries. High-precision profiling has shown that data representation and processing can be a more significant source of scalability bottlenecks than computation and communication put together.
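
    A minimal sketch of the data-parallel K-means step discussed above, assuming an MPI environment in which each rank owns a slice of the points and partial per-cluster sums are combined with MPI_Allreduce; the constants, data, and function names are illustrative, and this is not the thesis' master-slave or island implementation.

```c
/* Minimal data-parallel K-means iteration sketch (not the thesis' code).
 * Each MPI rank owns a slice of the points; per-cluster partial sums are
 * combined with MPI_Allreduce, so every rank ends with identical centroids. */
#include <mpi.h>
#include <stdlib.h>
#include <float.h>

#define K    4      /* number of clusters (illustrative) */
#define DIM  2      /* point dimensionality (illustrative) */

static void kmeans_step(const double *pts, int nlocal, double centroids[K][DIM])
{
    double sums[K][DIM] = {{0}};
    double counts[K] = {0};

    for (int i = 0; i < nlocal; i++) {
        int best = 0;
        double best_d = DBL_MAX;
        for (int c = 0; c < K; c++) {            /* find nearest centroid */
            double d = 0.0;
            for (int j = 0; j < DIM; j++) {
                double diff = pts[i * DIM + j] - centroids[c][j];
                d += diff * diff;
            }
            if (d < best_d) { best_d = d; best = c; }
        }
        for (int j = 0; j < DIM; j++)
            sums[best][j] += pts[i * DIM + j];
        counts[best] += 1.0;
    }

    /* Combine partial sums and counts across all ranks. */
    MPI_Allreduce(MPI_IN_PLACE, sums,   K * DIM, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, counts, K,       MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    for (int c = 0; c < K; c++)
        if (counts[c] > 0.0)
            for (int j = 0; j < DIM; j++)
                centroids[c][j] = sums[c][j] / counts[c];
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int nlocal = 1000;                           /* points owned by this rank */
    double *pts = malloc(nlocal * DIM * sizeof *pts);
    for (int i = 0; i < nlocal * DIM; i++) pts[i] = rand() / (double)RAND_MAX;
    double centroids[K][DIM] = {{0.1,0.1},{0.9,0.9},{0.1,0.9},{0.9,0.1}};
    for (int iter = 0; iter < 10; iter++)
        kmeans_step(pts, nlocal, centroids);
    free(pts);
    MPI_Finalize();
    return 0;
}
```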

    Emulation of a Complex Instruction Set Computer with a Reduced Instruction Set Computer

    This paper analyzes some of the difficulties of emulating a Complex Instruction Set Computer (CISC) with a Reduced Instruction Set Computer (RISC). It will be shown that, although the speed advantage of a RISC is sacrificed, a CISC can be emulated with the exception of software constructs that support nonstandard hardware interfaces. Some concrete examples will be used to help illustrate the execution-time bottlenecks as well as to discuss possible solutions from an architectural point of view for both silicon and gallium arsenide (GaAs). In addition, it will be shown that the most efficient method of emulation involves debugging compiled High-Level Language (HLL) source code on a CISC, and then recompiling the HLL code with a compiler that targets the RISC architecture.
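
    The execution-time bottlenecks mentioned above typically arise in the fetch-decode-dispatch loop of a software emulator. The following sketch shows such a loop for an invented miniature CISC-style instruction set, purely to illustrate where per-instruction overhead accumulates; the opcodes and handlers are hypothetical, not drawn from the paper.

```c
/* Hypothetical fetch-decode-dispatch loop for emulating a CISC instruction
 * stream on a RISC host; the per-instruction decode and branch is where
 * emulation overhead accumulates. Opcodes and encoding are invented. */
#include <stdint.h>
#include <stdio.h>

enum { OP_HALT = 0, OP_LOAD_IMM, OP_ADD, OP_STORE };   /* illustrative opcodes */

typedef struct {
    uint32_t regs[8];
    uint8_t  mem[256];
} cisc_state;

static void emulate(cisc_state *s, const uint8_t *code)
{
    size_t pc = 0;
    for (;;) {
        uint8_t op = code[pc++];                       /* fetch */
        switch (op) {                                  /* decode + dispatch */
        case OP_LOAD_IMM: {                            /* reg <- imm8 */
            uint8_t r = code[pc++], imm = code[pc++];
            s->regs[r] = imm;
            break;
        }
        case OP_ADD: {                                 /* rd <- rd + rs */
            uint8_t rd = code[pc++], rs = code[pc++];
            s->regs[rd] += s->regs[rs];
            break;
        }
        case OP_STORE: {                               /* mem[addr8] <- reg */
            uint8_t r = code[pc++], addr = code[pc++];
            s->mem[addr] = (uint8_t)s->regs[r];
            break;
        }
        case OP_HALT:
        default:
            return;
        }
    }
}

int main(void)
{
    cisc_state s = {0};
    const uint8_t prog[] = { OP_LOAD_IMM, 0, 2, OP_LOAD_IMM, 1, 3,
                             OP_ADD, 0, 1, OP_STORE, 0, 16, OP_HALT };
    emulate(&s, prog);
    printf("mem[16] = %u\n", s.mem[16]);               /* prints 5 */
    return 0;
}
```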

    Compiling Programs for Nonshared Memory Machines

    Nonshared-memory parallel computers promise scalable performance for scientific computing needs. Unfortunately, these machines are currently difficult to program because the message-passing languages available for them do not reflect the computational models used in designing algorithms. This introduces a semantic gap in the programming process which is difficult for the programmer to fill. The purpose of this research is to show how nonshared-memory machines can be programmed at a higher level than is currently possible. We do this by developing techniques for compiling shared-memory programs for execution on those architectures. The heart of the compilation process is translating references to shared memory into explicit messages between processors. To do this, we first define a formal model for distributing data structures across processor memories. Several abstract results describing the messages needed to execute a program are immediately derived from this formalism. We then develop two distinct forms of analysis to translate these formulas into actual programs. Compile-time analysis is used when enough information is available to the compiler to completely characterize the data sent in the messages. This allows excellent code to be generated for a program. Run-time analysis produces code to examine data references while the program is running. This allows dynamic generation of messages and a correct implementation of the program. While the overhead of the run-time approach is higher than that of the compile-time approach, run-time analysis is applicable to any program. Performance data from an initial implementation show that both approaches are practical and produce code with acceptable efficiency.
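
    As an illustration of translating a shared-memory reference into explicit messages, the sketch below assumes a block distribution of a one-dimensional array over MPI ranks; owner() and local_index() are stand-ins for the formal distribution functions and are not the dissertation's notation.

```c
/* Illustrative translation of a shared-memory read into messages for a
 * block-distributed array. owner() and local_index() stand in for the formal
 * distribution functions; assumes nprocs divides N_GLOBAL evenly. */
#include <mpi.h>
#include <stdio.h>

#define N_GLOBAL 64
#define BLOCK(nprocs) (N_GLOBAL / (nprocs))

static int owner(int g, int nprocs)        { return g / BLOCK(nprocs); }
static int local_index(int g, int nprocs)  { return g % BLOCK(nprocs); }

/* "x = a[g]" in the shared-memory source becomes: the owner sends the
 * element, the reader receives it (unless the reference is already local).
 * Only the reader rank gets a meaningful return value. */
static double read_shared(const double *a_local, int g, int reader,
                          int rank, int nprocs)
{
    double x = 0.0;
    int own = owner(g, nprocs);
    if (own == reader) {
        if (rank == reader) x = a_local[local_index(g, nprocs)];
    } else if (rank == own) {
        MPI_Send(&a_local[local_index(g, nprocs)], 1, MPI_DOUBLE,
                 reader, 0, MPI_COMM_WORLD);
    } else if (rank == reader) {
        MPI_Recv(&x, 1, MPI_DOUBLE, own, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    return x;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double a_local[N_GLOBAL];                       /* this rank's block */
    for (int i = 0; i < BLOCK(nprocs); i++)
        a_local[i] = rank * BLOCK(nprocs) + i;

    /* Rank 0 reads global element 40, wherever it lives. */
    double x = read_shared(a_local, 40, 0, rank, nprocs);
    if (rank == 0) printf("a[40] = %f\n", x);

    MPI_Finalize();
    return 0;
}
```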

    The exokernel operating system architecture

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999. Includes bibliographical references (p. 115-120). This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. On traditional operating systems only trusted software such as privileged servers or the kernel can manage resources. This thesis proposes a new approach, the exokernel architecture, which makes resource management unprivileged but safe by separating management from protection: an exokernel protects resources, while untrusted application-level software manages them. As a result, in an exokernel system, untrusted software (e.g., library operating systems) can implement abstractions such as virtual memory, file systems, and networking. The main thrusts of this thesis are: (1) how to build an exokernel system; (2) whether it is possible to build a real one; and (3) whether doing so is a good idea. Our results, drawn from two exokernel systems [25, 48], show that the approach yields dramatic benefits. For example, Xok, an exokernel, runs a web server an order of magnitude faster than the closest equivalent on the same hardware, common unaltered Unix applications up to three times faster, and improves global system performance up to a factor of five. The thesis also discusses some of the new techniques we have used to remove the overhead of protection. The most unusual technique, untrusted deterministic functions, enables an exokernel to verify that applications correctly track the resources they own, eliminating the need for it to do so. Additionally, the thesis reflects on the subtle issues in using downloaded code for extensibility and the sometimes painful lessons learned in building three exokernel-based systems. By Dawson R. Engler. Ph.D.

    The exploitation of parallelism on shared memory multiprocessors

    PhD Thesis. With the arrival of many general purpose shared memory multiple processor (multiprocessor) computers into the commercial arena during the mid-1980s, a rift has opened between the raw processing power offered by the emerging hardware and the relative inability of its operating software to effectively deliver this power to potential users. This rift stems from the fact that, currently, no computational model with the capability to elegantly express parallel activity is mature enough to be universally accepted and used as the basis for programming languages to exploit the parallelism that multiprocessors offer. To add to this, there is a lack of software tools to assist programmers in the processes of designing and debugging parallel programs. Although much research has been done in the field of programming languages, no undisputed candidate for the most appropriate language for programming shared memory multiprocessors has yet been found. This thesis examines why this state of affairs has arisen and proposes programming language constructs, together with a programming methodology and environment, to close the ever-widening hardware-to-software gap. The novel programming constructs described in this thesis are intended for use in imperative languages even though they make use of the synchronisation inherent in the dataflow model by using the semantics of single assignment when operating on shared data, so giving rise to the term shared values. As there are several distinct parallel programming paradigms, matching flavours of shared value are developed to permit the concise expression of these paradigms. Funding: The Science and Engineering Research Council
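
    One way to read the shared-value idea is as a write-once cell whose readers block until the single assignment has occurred, giving dataflow-style synchronisation inside an imperative program. The sketch below is an interpretation using C and pthreads, not the thesis' actual language constructs.

```c
/* A write-once "shared value" sketch: readers block until the single
 * assignment has been made. This is an interpretation using pthreads,
 * not the thesis' proposed language constructs. */
#include <pthread.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  set;
    int             assigned;   /* 0 until the single assignment happens */
    int             value;
} shared_value;

static void sv_init(shared_value *sv)
{
    pthread_mutex_init(&sv->lock, NULL);
    pthread_cond_init(&sv->set, NULL);
    sv->assigned = 0;
}

static void sv_write(shared_value *sv, int v)      /* must happen exactly once */
{
    pthread_mutex_lock(&sv->lock);
    sv->value = v;
    sv->assigned = 1;
    pthread_cond_broadcast(&sv->set);
    pthread_mutex_unlock(&sv->lock);
}

static int sv_read(shared_value *sv)               /* blocks until written */
{
    pthread_mutex_lock(&sv->lock);
    while (!sv->assigned)
        pthread_cond_wait(&sv->set, &sv->lock);
    int v = sv->value;
    pthread_mutex_unlock(&sv->lock);
    return v;
}

static shared_value result;

static void *producer(void *arg)
{
    (void)arg;
    sv_write(&result, 42);                         /* the single assignment */
    return NULL;
}

int main(void)
{
    sv_init(&result);
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    printf("shared value = %d\n", sv_read(&result)); /* waits if necessary */
    pthread_join(t, NULL);
    return 0;
}
```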

    Compilation and Binary Editing for Performance and Security

    Traditionally, execution of a program follows a straight and inflexible path starting from source code, extending through a compiled executable file on disk, and culminating in an executable image in memory. This dissertation enables more flexible programs through new compilation mechanisms and binary editing techniques. To assist analysis of functions in binaries, a new compilation mechanism generates data representing the control flow graph of each function. These data allow binary analysis tools to identify the boundaries of basic blocks and the types of edges between them without examining individual instructions. A similar compilation mechanism is used to create individually relocatable basic blocks that can be relocated anywhere in memory at runtime to simplify runtime instrumentation. The concept of generating relocatable program components is also applied at function-level granularity. Through link-time function relocation, unused functions in shared libraries are moved to a section that is not loaded into memory at runtime, reducing the memory footprint of these shared libraries. Moreover, function relocation is extended to runtime, where functions are continuously moved to random addresses to thwart system intrusion attacks. The techniques presented above result in a 74% reduction in binary parsing times as well as an 85% reduction in the memory footprint of the code segment of shared libraries, while simplifying instrumentation of binary code. The techniques also provide a way to make return-oriented programming attacks virtually impossible to succeed.
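
    A sketch of what per-function control-flow metadata of the kind described above might look like: a table of basic-block extents plus typed edges that a binary analysis tool can read without decoding instructions. The structure layout, edge kinds, and example function are hypothetical, not the dissertation's actual format.

```c
/* Hypothetical per-function control-flow metadata that a compiler could emit
 * alongside the code, letting a binary analysis tool recover basic blocks and
 * edge kinds without decoding instructions. Names and layout are invented. */
#include <stdint.h>
#include <stdio.h>

typedef enum { EDGE_FALLTHROUGH, EDGE_BRANCH_TAKEN, EDGE_CALL, EDGE_RETURN } edge_kind;

typedef struct {
    uint32_t  src_block;     /* index into the blocks array */
    uint32_t  dst_block;
    edge_kind kind;
} cfg_edge;

typedef struct {
    uint64_t start_offset;   /* block start, relative to function entry */
    uint64_t end_offset;     /* one past the last byte of the block */
} cfg_block;

typedef struct {
    const char      *name;
    uint32_t         n_blocks, n_edges;
    const cfg_block *blocks;
    const cfg_edge  *edges;
} cfg_function;

/* Example metadata for a two-way branch that rejoins: 0 -> {1,2} -> 3. */
static const cfg_block blocks[] = {
    {0x00, 0x10}, {0x10, 0x24}, {0x24, 0x30}, {0x30, 0x40}
};
static const cfg_edge edges[] = {
    {0, 1, EDGE_FALLTHROUGH}, {0, 2, EDGE_BRANCH_TAKEN},
    {1, 3, EDGE_BRANCH_TAKEN}, {2, 3, EDGE_FALLTHROUGH}
};
static const cfg_function example = { "foo", 4, 4, blocks, edges };

int main(void)
{
    printf("function %s: %u blocks, %u edges\n",
           example.name, example.n_blocks, example.n_edges);
    for (uint32_t i = 0; i < example.n_edges; i++)
        printf("  block %u -> block %u (kind %d)\n",
               example.edges[i].src_block, example.edges[i].dst_block,
               example.edges[i].kind);
    return 0;
}
```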

    Adaptable register file organization for vector processors

    Today there are two main trends in vector processor design. On the one hand, there are vector processors designed for long vectors, such as the SX-Aurora TSUBASA, which implements vector lengths of 256 elements (16,384 bits). On the other hand, there are vector processors designed for short vectors, such as the Fujitsu A64FX, which implements ARM SVE vector lengths of 8 elements (512 bits). However, short vector designs are the most widely adopted in modern chips. This is because, to achieve high performance with very high efficiency, applications executed on long vector designs must feature abundant DLP, which limits the range of applications. In contrast, short vector designs are compatible with a larger range of applications. In fact, in the beginning, long vector length implementations were focused on the HPC market, while short vector length implementations were conceived to improve performance in multimedia tasks. However, those short vector length extensions have evolved to better fit the needs of modern applications. In that sense, we believe that this compatibility with a large range of applications featuring high, medium, and low DLP is one of the main reasons behind the trend of building parallel machines with short vectors. Short vector designs are area efficient and are "compatible" with applications having long vectors; however, these short vector architectures are not as efficient as longer vector designs when executing high-DLP code. In this thesis, we propose a novel vector architecture that combines the area and resource efficiency characterizing short vector processors with the ability to handle large-DLP applications, as allowed in long vector architectures. In this context, we present AVA, an Adaptable Vector Architecture designed for short vectors (MVL = 16 elements), capable of reconfiguring the MVL when executing applications with abundant DLP, achieving performance comparable to designs for long vectors. The design is based on three complementary concepts. First, a two-stage renaming unit based on a new type of register termed Virtual Vector Registers (VVRs), which are an intermediate mapping between the conventional logical registers and the physical and memory registers. In the first stage, logical registers are renamed to VVRs, while in the second stage, VVRs are renamed to physical registers. Second, a two-level VRF that supports 64 VVRs whose MVL can be configured from 16 to 128 elements. The first level corresponds to the VVRs mapped to the physical registers held in the 8 KB Physical Vector Register File (P-VRF), while the second level represents the VVRs mapped to memory registers held in the Memory Vector Register File (M-VRF). While the baseline configuration (MVL = 16 elements) holds all the VVRs in the P-VRF, larger MVL configurations hold a subset of the total VVRs in the P-VRF and map the remaining part to the M-VRF. Third, we propose a novel two-stage vector issue unit. In the first stage, the second level of mapping between the VVRs and physical registers is performed, while issue to execution is managed in the second stage. This thesis also presents a set of tools for designing and evaluating vector architectures: first, a parameterizable vector architecture model implemented on the gem5 simulator to evaluate novel ideas on vector architectures; second, a vector architecture model implemented on the McPAT framework to evaluate power and area metrics; and finally, the RiVEC benchmark suite, a collection of ten vectorized applications from different domains focused on benchmarking vector microarchitectures.
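
    The effect of reconfiguring the MVL can be pictured with a strip-mined loop, where the MVL sets how many elements each vector iteration covers; the sketch below models this in plain C (no real vector intrinsics and none of AVA's microarchitecture), so an MVL of 128 needs one eighth of the strips that an MVL of 16 does.

```c
/* Strip-mined vector loop sketch: the maximum vector length (MVL) determines
 * how many elements each vector iteration covers, so reconfiguring MVL from
 * 16 to 128 elements cuts the strip count by 8x. Plain C stands in for the
 * vector instructions; no real ISA intrinsics are used. */
#include <stdio.h>

static void vaxpy(int n, int mvl, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i += mvl) {           /* one strip per vector op */
        int vl = (n - i < mvl) ? n - i : mvl;    /* tail strip may be shorter */
        for (int j = 0; j < vl; j++)             /* models one vector instruction */
            y[i + j] += a * x[i + j];
    }
}

int main(void)
{
    enum { N = 1024 };
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Same computation; the strip count is N / MVL vector iterations. */
    vaxpy(N, 16,  0.5f, x, y);   /* short-vector configuration: 64 strips */
    vaxpy(N, 128, 0.5f, x, y);   /* reconfigured long MVL:        8 strips */

    printf("y[0] = %f\n", y[0]);
    return 0;
}
```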