713 research outputs found

    Empowering a helper cluster through data-width aware instruction selection policies

    Get PDF
    Narrow values that can be represented by less number of bits than the full machine width occur very frequently in programs. On the other hand, clustering mechanisms enable cost- and performance-effective scaling of processor back-end features. Those attributes can be combined synergistically to design special clusters operating on narrow values (a.k.a. helper cluster), potentially providing performance benefits. We complement a 32-bit monolithic processor with a low-complexity 8-bit helper cluster. Then, in our main focus, we propose various ideas to select suitable instructions to execute in the data-width based clusters. We add data-width information as another instruction steering decision metric and introduce new data-width based selection algorithms which also consider dependency, inter-cluster communication and load imbalance. Utilizing those techniques, the performance of a wide range of workloads are substantially increased; helper cluster achieves an average speedup of 11% for a wide range of 412 apps. When focusing on integer applications, the speedup can be as high as 22% on averagePeer ReviewedPostprint (published version

    High performance extendable instruction set computing

    Get PDF
    In this paper, a new architecture called the extendable instruction set computer (EISC) is introduced that addresses the issues of memory size and performance in embedded microprocessor systems. The architecture exhibits an efficient fixed length 16-bit instruction set with short length offset and immediate operands. The offset and immediate operands can be extended to 32 bits via the operation of an extension flag. The code density of the EISC instruction set and its memory transfer erformance is shown to be significantly higher than current architectures making it a suitable candidate for the next generation of embedded computer systems. The compact EISC instruction set introduces data dependencies that seemingly limit deep pipeline and superscalar implementations. This paper suggests a mechanism by which these dependencies might be removed in hardware

    Direct instruction wakeup for out-of-order processors

    Get PDF
    Instruction queues consume a significant amount of power in high-performance processors, primarily due to instruction wakeup logic access to the queue structures. The wakeup logic delay is also a critical timing parameter. This paper proposes a new queue organization using a small number of successor pointers plus a small number of dynamically allocated full successor bit vectors for cases with a larger number of successors. The details of the new organization are described and it is shown to achieve the performance of CAM-based or full dependency matrix organizations using just one pointer per instruction plus eight full bit vectors. Only two full bit vectors are needed when two successor pointers are stored per instruction. Finally, a design and pre-layout of all critical structures in 70 nm technology was performed for the proposed organization as well as for a CAM-based baseline. The new design is shown to use 1/2 to 1/5th of the baseline instruction queue power, depending on queue size. It is also shown to use significantly less power than the full dependency matrix based design.Peer ReviewedPostprint (published version

    An efficient multi-core SIMD implementation for H.264/AVC encoder

    Get PDF
    The optimization process of a H.264/AVC encoder on three different architectures is presented. The architectures are multi- and singlecore and SIMD instruction sets have different vector registers size. The need of code optimization is fundamental when addressing HD resolutions with real-time constraints. The encoder is subdivided in functional modules in order to better understand where the optimization is a key factor and to evaluate in details the performance improvement. Common issues in both partitioning a video encoder into parallel architectures and SIMD optimization are described, and author solutions are presented for all the architectures. Besides showing efficient video encoder implementations, one of the main purposes of this paper is to discuss how the characteristics of different architectures and different set of SIMD instructions can impact on the target application performance. Results about the achieved speedup are provided in order to compare the different implementations and evaluate the more suitable solutions for present and next generation video-coding algorithms

    Formal Verification of an Iterative Low-Power x86 Floating-Point Multiplier with Redundant Feedback

    Full text link
    We present the formal verification of a low-power x86 floating-point multiplier. The multiplier operates iteratively and feeds back intermediate results in redundant representation. It supports x87 and SSE instructions in various precisions and can block the issuing of new instructions. The design has been optimized for low-power operation and has not been constrained by the formal verification effort. Additional improvements for the implementation were identified through formal verification. The formal verification of the design also incorporates the implementation of clock-gating and control logic. The core of the verification effort was based on ACL2 theorem proving. Additionally, model checking has been used to verify some properties of the floating-point scheduler that are relevant for the correct operation of the unit.Comment: In Proceedings ACL2 2011, arXiv:1110.447

    Memory system for a relational database processor

    Get PDF
    An associative memory for a relational database management system, with content addressing capability, is studied and analyzed. The system utilizes one level of indexing and the database is clustered. The logic-per-track approach is used for parallel processing of the data in a cylinder. The attributes and the tuples are allowed to have an arbitrary length and no encoding algorithm is used. The performance of the system is analyzed and it is demonstrated to have superior performance in comparison to software-based systems. The cost effectiveness of the system is also shown
    • …
    corecore