30 research outputs found
Preliminary basic performance analysis of the Cedar multiprocessor memory system
Some preliminary results on the performance of the Cedar multiprocessor memory system are presented. Empirical measurements are used to calibrate a memory system simulator, which is then used to discuss the scalability of the system.
Direct instruction wakeup for out-of-order processors
Instruction queues consume a significant amount of power in high-performance processors, primarily due to instruction wakeup logic access to the queue structures. The wakeup logic delay is also a critical timing parameter. This paper proposes a new queue organization using a small number of successor pointers plus a small number of dynamically allocated full successor bit vectors for cases with a larger number of successors. The details of the new organization are described and it is shown to achieve the performance of CAM-based or full dependency matrix organizations using just one pointer per instruction plus eight full bit vectors. Only two full bit vectors are needed when two successor pointers are stored per instruction. Finally, a design and pre-layout of all critical structures in 70 nm technology was performed for the proposed organization as well as for a CAM-based baseline. The new design is shown to use 1/2 to 1/5 of the baseline instruction queue power, depending on queue size. It is also shown to use significantly less power than the full dependency matrix based design.
Decoupled Access DRAM Architecture
This paper discusses an approach to reducing memory latency in future systems. It focuses on systems where a single-chip DRAM/processor will not be feasible even in 10 years, e.g. systems requiring a large memory and/or many CPUs. In such systems a solution must be found for DRAM latency and bandwidth as well as for inter-chip communication. Utilizing the projected advances in chip I/O bandwidth, we propose to implement a decoupled access-execute processor where the access processor is placed in memory. A program is compiled to run as a computational process and several access processes, with the latter executing in the DRAM processors. Instruction set extensions are discussed to support this paradigm. Using multi-level branch prediction, the access processor stays ahead of the execute processor and keeps the latter supplied with data. The system reduces latency by moving address computation to memory, so that the computational processor avoids sending addresses to memory. This and the fetch-ahead capabilities of the access processor are combined with multiple DRAM "streaming" to improve performance. DRAM caching is assumed to be used to assist in this as well.
A Simple Low-Energy Instruction Wakeup Mechanism
Instruction issue consumes a large amount of energy in out-of-order processors, largely in the wakeup logic. Proposed solutions to the problem require prediction or additional hardware complexity to reduce energy consumption and, in some cases, may have a negative impact on processor performance. This paper proposes a mechanism for instruction wakeup that uses a multi-block instruction queue design. The blocks are turned off until the mechanism determines which blocks to access on wakeup using a simple successor tracking mechanism. The proposed approach is shown to require as little as 1.5 comparisons per committed instruction for SPEC2000 benchmarks.
Towards an achievable performance for the loop nests
Numerous code optimization techniques, including loop nest optimizations, have been developed over the last four decades. Loop optimization techniques transform loop nests to improve the performance of the code on a target architecture, including exposing parallelism. Finding and evaluating an optimal, semantic-preserving sequence of transformations is a complex problem. The sequence is guided using heuristics and/or analytical models, and there is no way of knowing how close it gets to optimal performance or if there is any headroom for improvement. This paper makes two contributions. First, it uses a comparative analysis of loop optimizations/transformations across multiple compilers to determine how much headroom may exist for each compiler. Second, it presents an approach to characterize the loop nests based on their hardware performance counter values and a Machine Learning approach that predicts which compiler will generate the fastest code for a loop nest. The prediction is made for both auto-vectorized, serial compilation and for auto-parallelization. The results show that the headroom for state-of-the-art compilers ranges from 1.10x to 1.42x for the serial code and from 1.30x to 1.71x for the auto-parallelized code. These results are based on the Machine Learning predictions.
A radiative transfer module for calculating photolysis rates and solar heating in climate models: Solar-J v7.5
Solar-J is a comprehensive radiative transfer model for the solar spectrum that addresses the needs of both solar heating and photochemistry in Earth system models. Solar-J is a spectral extension of Cloud-J, a standard in many chemical models that calculates photolysis rates in the 0.18-0.8 μm region. The Cloud-J core consists of an eight-stream scattering, plane-parallel radiative transfer solver with corrections for sphericity. Cloud-J uses cloud quadrature to accurately average over correlated cloud layers. It uses the scattering phase function of aerosols and clouds expanded to eighth order and thus avoids isotropic-equivalent approximations prevalent in most solar heating codes. The spectral extension from 0.8 to 12 μm enables calculation of both scattered and absorbed sunlight and thus aerosol direct radiative effects and heating rates throughout the Earth's atmosphere. The Solar-J extension adopts the correlated-k gas absorption bins, primarily water vapor, from the shortwave Rapid Radiative Transfer Model for general circulation model (GCM) applications (RRTMG-SW). Solar-J successfully matches RRTMG-SW's tropospheric heating profile in a clear-sky, aerosol-free, tropical atmosphere. We compare both codes in cloudy atmospheres with a liquid-water stratus cloud and an ice-crystal cirrus cloud. For the stratus cloud, both models use the same physical properties, and we find a systematic low bias of about 3 % in planetary albedo across all solar zenith angles caused by RRTMG-SW's two-stream scattering. Discrepancies with the cirrus cloud using any of RRTMG-SW's three different parameterizations are as large as about 20-40 % depending on the solar zenith angles and occur throughout the atmosphere. 
Effectively, Solar-J has combined the best components of RRTMG-SW and Cloud-J to build a high-fidelity module for the scattering and absorption of sunlight in the Earth's atmosphere, for which the three major components - wavelength integration, scattering, and averaging over cloud fields - all have comparably small errors. More accurate solutions with Solar-J come with increased computational costs, about 5 times that of RRTMG-SW for a single atmosphere. There are options for reduced costs or computational acceleration that would bring costs down while maintaining improved fidelity and balanced errors.