8,096 research outputs found

    Compiler and Architecture Design for Coarse-Grained Programmable Accelerators

    The holy grail of computer hardware across all market segments has been to sustain performance improvement at the same pace as silicon technology scales. As the technology scales and the size of transistors shrinks, the power consumption and energy usage per transistor decrease, while the transistor density increases significantly. Due to technology factors, the reduction in power consumption per transistor is not sufficient to offset the increase in power consumption per unit area. Therefore, to improve performance, energy efficiency must be addressed at all design levels, from the circuit level to the application and algorithm levels. At the architectural level, one promising approach is to populate the system with hardware accelerators, each optimized for a specific task. One drawback of hardware accelerators is that they are not programmable, so their utilization can be low, as each performs only one specific function. Software-programmable accelerators are an alternative approach to achieving both high energy efficiency and programmability; due to their intrinsic characteristics, they can exploit both instruction-level parallelism and data-level parallelism. A Coarse-Grained Reconfigurable Architecture (CGRA) is a software-programmable accelerator consisting of a number of word-level functional units. Motivated by these promising characteristics, the potential of CGRAs in future computing platforms is studied and an end-to-end CGRA research framework is developed. The framework covers three aspects: CGRA architectural design, integration into a computing system, and the CGRA compiler. First, the design and implementation of a CGRA and its instruction set are presented. This design is then modeled in a cycle-accurate system simulator, which enables the investigation of several problems that arise when a CGRA is deployed as an accelerator in a computing system. Next, the problem of mapping a compute-intensive region of a program onto a CGRA is formulated, and from this formulation several efficient algorithms are developed that use the CGRA's scarce resources effectively to minimize the running time of input applications. Finally, these mapping algorithms are integrated into a compiler framework to construct a compiler for CGRAs. (Doctoral dissertation, Computer Science.)
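
    To make the mapping problem concrete, here is a toy greedy placement of a dataflow graph onto a grid of functional units. The function name, the adjacency heuristic, and the greedy strategy are illustrative assumptions of ours, not the dissertation's actual algorithms:

```python
# Toy illustration of the CGRA mapping problem: place a topologically
# ordered dataflow graph onto a small grid of functional units so that
# dependent operations land on adjacent FUs. Greedy and illustrative only.
def map_to_cgra(ops, deps, rows, cols):
    """ops: op names in topological order; deps: {op: [predecessor ops]}.
    Returns {op: (row, col)} or None if the grid runs out of FUs."""
    placement = {}
    free = {(r, c) for r in range(rows) for c in range(cols)}

    def neighbors(cell):
        r, c = cell
        return {(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)}

    for op in ops:
        candidates = free.copy()
        for pred in deps.get(op, []):
            candidates &= neighbors(placement[pred])  # stay adjacent to producers
        if not candidates:
            candidates = free   # fall back: non-adjacent slot, extra routing cost
        if not candidates:
            return None         # no functional unit left
        slot = min(candidates)  # deterministic greedy choice
        placement[op] = slot
        free.discard(slot)
    return placement
```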

    Survey of the Itanium architecture from a programmer's perspective

    The Itanium family of processors represents Intel's foray into the world of Explicitly Parallel Instruction Computing and 64-bit system design. This survey contains an introduction to the Itanium architecture and instruction set, as well as some of the available implementations. Taking a programmer's perspective, we have attempted to distill the relevant information from a variety of sources, including the Intel Itanium architecture documentation.

    A Quantum Algorithm To Locate Unknown Hashes For Known N-Grams Within A Large Malware Corpus

    Quantum computing has evolved quickly in recent years and is showing significant benefits in a variety of fields. Malware analysis is one of those fields that could also take advantage of quantum computing. The combination of software used to locate the most frequent hashes and n-grams between benign and malicious software (KiloGram) and a quantum search algorithm could be beneficial: by loading the table of hashes and n-grams into a quantum computer, the process of mapping n-grams to their hashes could be sped up. The first phase will be to use KiloGram to find the top-k hashes and n-grams for a large malware corpus. The resulting hash table is then loaded into a quantum machine, and a quantum search algorithm is used to search among every permutation of the entangled key-value pairs to find the desired hash value. This prevents one from having to re-compute hashes for a set of n-grams, which can take O(MN) time on average, whereas the quantum algorithm could take O(√N) table lookups to find the desired hash values. Comment: IEEE Quantum Week 2020 Conference
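
    A minimal sketch of the classical first phase, counting n-grams and keeping the top-k hashes. The function name, the use of Python's built-in hash, and the in-memory table are our illustrative assumptions, not KiloGram's actual implementation (which notably avoids storing every n-gram):

```python
import heapq
from collections import Counter

def top_k_ngram_hashes(corpus, n, k):
    """Count byte n-grams over every sample and return the k most frequent
    as a hash -> n-gram lookup table (the table the abstract would load
    into the quantum machine)."""
    counts = Counter()
    table = {}
    for sample in corpus:                # sample: raw bytes of one binary
        for i in range(len(sample) - n + 1):
            ngram = sample[i:i + n]
            h = hash(ngram)              # stand-in for KiloGram's hash function
            counts[h] += 1
            table[h] = ngram
    top = heapq.nlargest(k, counts, key=counts.__getitem__)
    return {h: table[h] for h in top}
```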

    Clustered VLIW architecture based on queue register files

    Instruction-level parallelism (ILP) is a set of hardware and software techniques that allow parallel execution of machine operations. Superscalar architectures rely most heavily upon hardware schemes to identify parallelism among operations. Although successful in terms of performance, the hardware complexity involved might limit the scalability of this model. VLIW architectures use a different approach to exploit ILP: all data dependence analyses and scheduling of operations are performed at compile time, resulting in a simpler hardware organization, which allows a larger number of functional units (FUs) to be included on a single chip. In spite of this relative simplification, the scalability of VLIW architectures can be constrained by the size and number of ports of the register file. VLIW machines often use software pipelining techniques to improve the execution of loop structures, which can increase the register pressure. Furthermore, the access time of a register file can be compromised by the number of ports, causing a negative impact on the machine cycle time. For these reasons, the benefits of having parallel FUs may not be fully realized, which has motivated the investigation of alternative machine designs. This thesis presents a scalable VLIW architecture comprising clusters of FUs and private register files. Register files organised as queue structures are used as a mechanism for inter-cluster communication, allowing the enforcement of a fixed latency in the process. This scheme presents better possibilities in terms of scalability, as the size of the individual register files is not determined by the total number of FUs, suggesting that the silicon area may grow only linearly with respect to the total number of FUs. However, the effectiveness of such an organization depends on the efficiency of the code partitioning strategy. We have developed an algorithm for a clustered VLIW architecture integrating both software pipelining and code partitioning in a single procedure. Experimental results show it may allow performance levels close to an unclustered machine without communication restraints. Finally, we have developed silicon area and cycle time models to quantify the scalability of performance and cost for this class of architecture.
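
    As a rough illustration of queue-based inter-cluster communication, here is a toy model of a queue register file with fixed latency. The class and method names are ours, not the thesis's design:

```python
from collections import deque

class QueueRegisterFile:
    """Toy model of a queue-structured register file: a producer cluster
    enqueues results, and the consumer cluster dequeues them in order
    after a fixed communication latency."""
    def __init__(self, latency):
        self.latency = latency
        self.fifo = deque()              # (ready_cycle, value) pairs

    def write(self, value, cycle):
        # Producer side: the value becomes visible `latency` cycles later.
        self.fifo.append((cycle + self.latency, value))

    def read(self, cycle):
        # Consumer side: pop the oldest value once its latency has elapsed;
        # otherwise the consuming cluster must stall this cycle.
        if self.fifo and self.fifo[0][0] <= cycle:
            return self.fifo.popleft()[1]
        return None
```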

    GPU-based fast iterative reconstruction of fully 3-D PET sinograms

    This work presents a graphics processing unit (GPU)-based implementation of a fully 3-D PET iterative reconstruction code, FIRST (Fast Iterative Reconstruction Software for [PET] Tomography), which was developed by our group. We describe the main steps followed to convert the FIRST code (which can run on several CPUs using the message passing interface [MPI] protocol) into a code where the main time-consuming parts of the reconstruction process (forward and backward projection) are massively parallelized on a GPU. Our objective was to obtain a significant acceleration of the reconstruction without compromising the image quality or the flexibility of the CPU implementation. Therefore, we implemented a GPU version using an abstraction layer for the GPU, namely CUDA C. The code reconstructs images from sinogram data using the same system response matrix, obtained from Monte Carlo simulations, as the CPU version. The use of memory was optimized to ensure good performance on the GPU. The code was adapted for the VrPET small-animal PET scanner. The CUDA version is more than 70 times faster than the original code running on a single core of a high-end CPU, with no loss of accuracy. This work was supported in part by the AMIT Project funded by CDTI (CENIT Programme), UCM (Grupos UCM, 910059), CPAN (Consolider-Ingenio 2010, CSPD-2007-00042), the RECAVA-RETIC network, Comunidad de Madrid (ARTEMIS S2009/DPI-1802), Ministerio de Ciencia e Innovación, Spanish Government (ENTEPRASE grant, PSE-300000-2009-5 and TEC2007-64731/TCM), and European Regional funds.
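
    The abstract does not spell out FIRST's update rule; as a rough sketch of why forward and backward projection dominate an iterative reconstruction loop, here is a generic MLEM iteration. The NumPy formulation and names are our assumptions, not the FIRST code:

```python
import numpy as np

def mlem(system_matrix, sinogram, iterations=20):
    """Generic MLEM update: image <- image * A^T(y / (A @ image)) / A^T(1).
    The two matrix products are the forward and backward projections,
    i.e. the steps worth moving to the GPU."""
    image = np.ones(system_matrix.shape[1])
    sensitivity = system_matrix.T @ np.ones(system_matrix.shape[0])
    for _ in range(iterations):
        projection = system_matrix @ image         # forward projection
        ratio = np.divide(sinogram, projection,
                          out=np.zeros_like(sinogram, dtype=float),
                          where=projection > 0)
        image *= (system_matrix.T @ ratio) / np.maximum(sensitivity, 1e-12)  # backprojection
    return image
```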

A Simulator Program for Evaluating and Improving the Nottingham MUSE Architecture

    This paper describes the modelling and simulation of the Nottingham MUSE (MUltiple Stream Evaluator) machine. MUSE is a dataflow machine capable of supporting structured parallel computation. The simulator described in this paper was designed to enable alterations, improvements and additions to be made to the prototype MUSE architecture. The stages through which the model has progressed, and the implementation details of this model as a program, are discussed. The validation experiments are explained, and future plans for alterations and modifications to the basic model are suggested.
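
    As a rough illustration of the dataflow execution model MUSE supports (illustrative only; not the simulator's actual structure), a node fires once all of its input tokens are present:

```python
class DataflowNode:
    """Toy dataflow node: fires once all input tokens are present, then
    forwards its result to downstream (node, input_index) consumers."""
    def __init__(self, op, num_inputs, consumers=None):
        self.op = op
        self.slots = [None] * num_inputs
        self.consumers = consumers or []

    def receive(self, index, token):
        self.slots[index] = token
        if all(s is not None for s in self.slots):   # dataflow firing rule
            result = self.op(*self.slots)
            self.slots = [None] * len(self.slots)    # tokens are consumed
            for node, i in self.consumers:
                node.receive(i, result)
```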

    What Matters in Rural and Microfinance

    Due to the overall failure of donor-driven subsidized directed credit administered by government-owned development finance institutions, the emphasis in development policy has shifted to (rural) financial systems development and the building of self-reliant, sustainable institutions. While savings-based self-help groups and member-managed small cooperatives are more appropriate to remote and marginal areas, there is little else to differentiate rural from urban microfinance. Regardless of ownership, type of institution, and rural or urban sphere of operation, they ultimately all have to:

    - Mobilize their own resources through savings
    - Have their loans repaid
    - Cover their costs from their operational income
    - Finance their expansion from their profits

    Three worlds of finance continue to exist, in which donors may intervene in very different ways:

    - The old world of donor-driven development finance, whose institutions need to be transformed into sustainable ones
    - A new world of development finance, comprising viable formal and semiformal institutions with a commercial orientation, which do not, or do not fully, rely on donor support for expansion
    - Informal financial institutions of ancient or recent origin, based on principles of self-reliance and viability, with their potential for innovation and mainstreaming, to which donors may contribute

    There are numerous notable new developments in rural and microfinance, but in the majority of countries there are still major shortcomings that call for country-driven, coordinated interventions. Donors with their projects are found in both worlds, but there is an overall move from the old world of supply-driven development finance to the new world of demand-driven commercial finance.

    Modulo scheduling with reduced register pressure

    Software pipelining is a scheduling technique used by some production compilers to extract more instruction-level parallelism from innermost loops. Modulo scheduling refers to a class of algorithms for software pipelining. Most previous research on modulo scheduling has focused on reducing the number of cycles between the initiation of consecutive iterations (termed the initiation interval, II), but has not considered the effect of the register pressure of the produced schedules. The register pressure increases as the instruction-level parallelism increases, and when the register requirements of a schedule exceed the available number of registers, the loop must be rescheduled, perhaps with a higher II. Therefore, the register pressure has an important impact on the performance of a schedule. This paper presents a novel heuristic modulo scheduling strategy that tries to generate schedules with the lowest II and, from all the possible schedules with that II, to select the one with the lowest register requirements. The proposed method has been implemented in an experimental compiler and has been tested on the Perfect Club benchmarks. The results show that the proposed method achieves an optimal II for at least 97.5 percent of the loops and that its compilation time is comparable to a conventional top-down approach, whereas the register requirements are lower. In addition, the proposed method is compared with some other existing methods. The results indicate that the proposed method performs better than other heuristic methods and almost as well as linear programming methods, which obtain optimal solutions but are impractical for production compilers because their computing cost grows exponentially with the number of operations in the loop body.
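
    For context on II, the textbook lower bound most modulo schedulers start from combines a resource bound (ResMII) and a recurrence bound (RecMII). The sketch below shows that standard formulation, not necessarily this paper's exact one:

```python
from math import ceil

def min_initiation_interval(op_resource_use, resource_counts, cycles):
    """Lower bound on II: the larger of the resource-constrained bound
    (ResMII) and the recurrence-constrained bound (RecMII).

    op_resource_use: {resource: operations needing it per iteration}
    resource_counts: {resource: number of units available}
    cycles: (total_latency, iteration_distance) for each dependence
            cycle in the loop's dependence graph.
    """
    res_mii = max(ceil(uses / resource_counts[r])
                  for r, uses in op_resource_use.items())
    rec_mii = max((ceil(lat / dist) for lat, dist in cycles), default=0)
    return max(res_mii, rec_mii)

# Example: 6 adds on 2 adders, 4 multiplies on 1 multiplier, and one
# recurrence carrying latency 5 across 2 iterations.
ii = min_initiation_interval({"add": 6, "mul": 4}, {"add": 2, "mul": 1},
                             [(5, 2)])
print(ii)  # -> 4 (ResMII = max(3, 4) = 4; RecMII = ceil(5/2) = 3)
```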