Ultra low power cooperative branch prediction
Branch prediction is a key task in the operation of a high-performance processor. An
inaccurate branch predictor results in increased program run-time and a rise in energy
consumption. The drive towards processors with limited die-space and tighter energy
requirements will continue to intensify over the coming years, as will the shift towards
increasingly multicore processors. Both trends make it increasingly important and
increasingly difficult to find effective and efficient branch predictor designs.
This thesis presents savings in energy and die-space via more efficient
cooperative branch predictors, achieved through novel branch prediction designs.
The first contribution is a new take on the problem of a hybrid dynamic-static branch
predictor allocating branches to be predicted by one of its sub-predictors. A new bias
parameter is introduced as a mechanism for trading off a small amount of performance
for savings in die-space and energy. This is achieved by predicting more branches
with the static predictor, ensuring that only the branches that will most benefit from
the dynamic predictor's resources are predicted dynamically. This reduces pressure on
the dynamic predictor's resources, allowing for a smaller predictor to achieve very high
accuracy. An improvement in run-time of 7-8% over the baseline BTFN predictor is
observed with a branch predictor storage budget of well under 1KB.
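As a rough illustration of the bias mechanism (the profile format, thresholds, and all names here are invented for this sketch, not taken from the thesis), a profiling pass might allocate each branch to the static or dynamic sub-predictor like so:

```python
# Hypothetical sketch of bias-based branch allocation: the `bias` parameter
# shifts borderline branches toward the static predictor, trading a little
# accuracy for dynamic-table space.

def allocate_branches(profile, bias=0.05):
    """profile: {branch_pc: (taken_rate, dynamic_accuracy)}.
    Returns the set of PCs that keep using the dynamic predictor."""
    dynamic = set()
    for pc, (taken_rate, dyn_acc) in profile.items():
        # Accuracy of a static most-likely-direction prediction.
        static_acc = max(taken_rate, 1.0 - taken_rate)
        # Use the dynamic predictor only if it beats the static one
        # by more than the bias margin.
        if dyn_acc - static_acc > bias:
            dynamic.add(pc)
    return dynamic

profile = {
    0x400a10: (0.98, 0.99),  # strongly biased branch: static is enough
    0x400b24: (0.55, 0.90),  # hard branch: worth dynamic resources
}
```

Raising `bias` pushes more branches onto the static predictor, shrinking the dynamic table at a small accuracy cost, which is the trade-off the thesis quantifies.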
Next, a novel approach to branch prediction for multicore data-parallel applications
is presented. The Peloton branch prediction scheme uses a pack of cyclists as an
illustration of how a group of processors running similar tasks can share branch predictions
to improve accuracy and reduce runtime. The results show that sharing updates
for conditional branches across the existing interconnect for I-cache and D-cache updates
results in a reduction of mispredictions of up to 25% and a reduction in run-time
of up to 6%. McPAT is used to build an energy model suggesting that these savings are
achieved with little to no increase in energy consumption. The technique is then extended to
architectures where the size of the branch predictors may differ between cores. The
results show that such heterogeneity can dramatically reduce the die-space required
for an accurate branch predictor while having little impact on performance, and can
deliver up to 9% energy savings. The approach can be combined with the Peloton branch
prediction scheme for a reduction in branch mispredictions of up to 5%.
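The sharing idea can be sketched as follows, with invented details: each core keeps its own table of 2-bit counters, and every conditional-branch outcome is also applied to the other cores' tables, standing in for the updates broadcast over the existing cache interconnect.

```python
# Toy model of shared branch-predictor updates across cores running
# similar data-parallel tasks (table size and sharing policy are
# illustrative, not the Peloton scheme's actual parameters).

class TwoBitPredictor:
    def __init__(self, size=1024):
        self.table = [2] * size  # saturating counters in 0..3, start weakly taken

    def predict(self, pc):
        return self.table[pc % len(self.table)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.table)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

def run_cores(traces):
    """traces: one (pc, taken) list per core; returns overall accuracy."""
    cores = [TwoBitPredictor() for _ in traces]
    hits = total = 0
    for step in zip(*traces):
        for cid, (pc, taken) in enumerate(step):
            hits += cores[cid].predict(pc) == taken
            total += 1
            for c in cores:  # broadcast the update to every core
                c.update(pc, taken)
    return hits / total
```

With identical traces, a core that has not yet seen a branch still benefits from its neighbours' updates, which is the accuracy mechanism the abstract describes.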
Reducing complexity of processor front ends with static analysis and selective preloading
General purpose processors were once designed with the major goal of maximizing performance. As power consumption has grown, with the advent of multi-core processors and the rising importance of embedded and mobile devices, the importance of designing efficient and low-cost architectures has increased. This dissertation focuses on reducing the complexity of the front end of the processor, mainly branch predictors. Branch predictors, too, have traditionally been designed with a focus on improving prediction accuracy so that performance is maximized. To accomplish this, the predictors proposed in the literature and used in real systems have become increasingly complex and large, a trend that is inconsistent with the anticipated trend of simpler and more numerous cores in future processors. Much of the increased complexity in many recently proposed predictors is used to select the part of history most correlated to a branch. This makes them costly, if not impossible, to implement practically.
We suggest that these complex decisions do not have to be made in hardware at prediction or run time and can instead be moved offline. High accuracy can be achieved by making complex prediction decisions in a one-time profile run instead of using complex hardware. We apply these techniques to Spotlight, our own low-cost, low-complexity branch predictor. A static analysis step determines, for each branch, the history segment yielding the highest accuracy. This information is placed in unused instruction space. Spotlight achieves higher accuracy than other implementation-simple predictors such as Gshare and YAGS, and matches or outperforms the two complex neural predictors that we compare it to. To ensure timely access, we evaluate using a hardware table (called a BIT) to store profile bits after they are extracted from instructions, and the accuracy of using this table. The drawback of a BIT is its size.
We introduce a novel technique, Preloading, which places data for an instruction in prior blocks on the path to the instruction. By doing so, it significantly reduces the size of the BIT needed for good performance. We also discuss applications of Preloading to parts of the front end other than branch predictors.
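The offline selection step could look roughly like this (the segment length, counter scheme, and trace format are assumptions for illustration, not Spotlight's actual design): for one branch, try each contiguous segment of the global history register and keep the segment whose index gives the best accuracy on the profile trace.

```python
# Sketch of a one-time profile run that picks, per branch, the history
# segment most correlated with the branch outcome.

from collections import defaultdict

def best_history_segment(trace, hist_len=16, seg_len=4):
    """trace: list of (history_bits, taken) pairs for a single branch,
    where history_bits holds the last hist_len outcomes as an int.
    Returns (start_bit, accuracy) of the most predictive segment."""
    best = (0, -1.0)
    for start in range(hist_len - seg_len + 1):
        tables = defaultdict(lambda: 2)  # 2-bit counter per segment value
        hits = 0
        for hist, taken in trace:
            idx = (hist >> start) & ((1 << seg_len) - 1)
            hits += (tables[idx] >= 2) == taken
            tables[idx] = min(3, tables[idx] + 1) if taken else max(0, tables[idx] - 1)
        acc = hits / len(trace)
        if acc > best[1]:
            best = (start, acc)
    return best
```

The winning `start_bit` is what the static analysis would encode into the unused instruction space for the hardware predictor to consume.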
Control speculation for energy-efficient next-generation superscalar processors
Conventional front-end designs attempt to maximize the number of "in-flight" instructions in the pipeline. However, branch mispredictions cause the processor to fetch useless instructions that are eventually squashed, increasing front-end energy and issue-queue utilization and thus wasting around 30 percent of the power dissipated by a processor. Furthermore, processor design trends lead to increasing clock frequencies by lengthening the pipeline, which puts more pressure on the branch prediction engine, since branches take longer to be resolved. As next-generation high-performance processors become deeply pipelined, the amount of energy wasted on misspeculated instructions will go up. The aim of this work is to reduce the energy consumption of misspeculated instructions. We propose selective throttling, which triggers different power-aware techniques (fetch throttling, decode throttling, or disabling the selection logic) depending on the branch prediction confidence level. Results show that combining fetch-bandwidth reduction with select-logic disabling provides the best results in terms of overall energy reduction and energy-delay product improvement (14 percent and 10 percent, respectively, for a processor with a 22-stage pipeline, and 16 percent and 13 percent, respectively, for a processor with a 42-stage pipeline).
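A toy sketch of the confidence-gated policy (the thresholds and action names are invented; the paper's confidence estimator is a hardware mechanism, not a floating-point score): lower-confidence branches trigger progressively more aggressive front-end throttling.

```python
# Illustrative mapping from branch-prediction confidence to the three
# power-aware techniques the abstract lists. Thresholds are made up.

def throttle_action(confidence):
    """Map a confidence estimate in [0, 1] to a front-end action."""
    if confidence < 0.3:
        return "disable_select_logic"  # likely wrong path: stop issuing
    if confidence < 0.6:
        return "throttle_decode"       # slow decode until resolution
    if confidence < 0.8:
        return "throttle_fetch"        # mildly reduce fetch bandwidth
    return "full_speed"
```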
Speculative Thread Framework for Transient Management and Bumpless Transfer in Reconfigurable Digital Filters
There are many methods developed to mitigate transients induced when abruptly
changing dynamic algorithms such as those found in digital filters or
controllers. These "bumpless transfer" methods have a computational burden to
them and take time to implement, causing a delay in the desired switching time.
This paper develops a method that automatically reconfigures the computational
resources in order to implement a transient management method without any delay
in switching times. The method spawns a speculative thread when it predicts that
a switch in algorithms is imminent, so that the calculations are done prior to
the switch being made. The software framework is described and experimental
results are shown for switching between filters in a filter bank.
Comment: 6 pages, 7 figures, to be presented at American Controls Conference 201
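The speculative-thread idea can be sketched as follows (the class names, the toy filter model, and the warm-up policy are all invented for this example): when a switch to another filter is predicted to be imminent, a background thread settles the new filter's internal state on recent input, so the actual switch is bumpless and incurs no extra delay.

```python
# Sketch: precompute a filter's transient in a speculative thread so the
# switch itself costs no time.

import threading

class OnePole:
    """Toy one-pole IIR filter standing in for a real filter-bank entry."""
    def __init__(self, a):
        self.a, self.y = a, 0.0

    def step(self, x):
        self.y = self.a * self.y + (1.0 - self.a) * x
        return self.y

class SpeculativeSwitcher:
    def __init__(self):
        self._thread = None
        self._warmed = None

    def predict_switch(self, new_filter, recent_samples):
        # Spawn the speculative thread as soon as a switch looks imminent,
        # running recent input through the new filter in the background.
        def job():
            for x in recent_samples:
                new_filter.step(x)
            self._warmed = new_filter

        self._thread = threading.Thread(target=job)
        self._thread.start()

    def commit_switch(self):
        # By the time the switch is requested, the transient computation
        # is (ideally) already finished; join just synchronizes.
        self._thread.join()
        return self._warmed
```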
Instruction prefetching techniques for ultra low-power multicore architectures
As the gap between processor and memory speeds increases, memory latencies have become a critical bottleneck for computing performance. To reduce this bottleneck, designers have been working on techniques to hide these latencies. On the other hand, the design of embedded processors typically targets low cost and low power consumption, so techniques which can satisfy these constraints are more desirable in embedded domains. While out-of-order execution, aggressive speculation, and complex branch prediction algorithms can help hide the memory access latency in high-performance systems, they cost a heavy power budget and are not suitable for embedded systems. Prefetching is another popular method for hiding the memory access latency, and has been studied thoroughly for high-performance processors. For embedded processors with strict power requirements, however, the application of complex prefetching techniques is greatly limited, and a low-power, low-energy solution is desired in this context.
In this work, we focus on instruction prefetching for ultra-low-power processing architectures and aim to reduce the energy overhead of this operation by proposing a combination of simple, low-cost, and energy-efficient prefetching techniques. We study a wide range of applications from cryptography to computer vision and show that our proposed mechanisms can effectively improve the hit-rate of almost all of them to above 95%, achieving an average performance improvement of more than 2X. Moreover, by synthesizing our designs using state-of-the-art technologies, we show that the prefetchers increase the system's power consumption by less than 15% and total silicon area by less than 1%. Altogether, a total energy reduction of 1.9X is achieved thanks to the proposed schemes, enabling a significantly longer battery life.
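The abstract proposes a combination of simple techniques; the simplest member of that family, a next-line instruction prefetcher, can be sketched as follows (cache geometry and the direct-mapped fill policy are invented for this model):

```python
# Toy direct-mapped instruction cache with next-line prefetching: every
# fetch also pulls the following cache line in, so sequential code streams
# almost always hit.

class ICache:
    def __init__(self, lines=64, line_bytes=16):
        self.lines, self.line_bytes = lines, line_bytes
        self.tags = [None] * lines
        self.hits = self.accesses = 0

    def _touch(self, addr, count=True):
        line = addr // self.line_bytes
        idx = line % self.lines
        if count:  # demand access: record hit or miss
            self.accesses += 1
            self.hits += self.tags[idx] == line
        self.tags[idx] = line  # fill the line (demand miss or prefetch)

    def fetch(self, addr):
        self._touch(addr)
        self._touch(addr + self.line_bytes, count=False)  # prefetch next line

def hit_rate(cache):
    return cache.hits / cache.accesses
```

Sweeping sequential fetch addresses through this model yields a hit rate near 100%, consistent with the above-95% hit rates the work reports for its (more elaborate) combined schemes.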
The accretion rate independence of horizontal branch oscillation in XTE J1701-462
We study the temporal and energy spectral properties of the unique neutron
star low-mass X-ray binary XTE J1701-462. Taking the HB/NB vertex as a
reference position for the accretion rate, we find that the horizontal branch
oscillation (HBO) frequency at the HB/NB vertex is roughly 50 Hz, indicating
that the HBO is independent of the accretion rate or the source intensity.
The spectral analysis yields different parameter values at the HB/NB vertex
and at the NB/FB vertex, which implies that different accretion rates may be
produced at the two vertices. The Comptonization component could be fitted by
a constrained broken power law (CBPL) or nthComp. Unlike in GX 17+2, the
frequencies of the HBO positively correlate with the inner disk radius, which
contradicts the prediction of the Lense-Thirring precession model. XTE
J1701-462, both in the
Cyg-like phase and in the Sco-like phase, follows a positive correlation
between the break frequency of broad band noise and the characteristic
frequency of HBO, which is called the W-K relation. An anticorrelation between
the frequency of HBO and photon energy is observed. Moreover, the rms of HBO
increases with photon energy till ~10 keV. We discuss the possible origin of
the HBO from the corona in XTE J1701-462.
Comment: 45 pages, 18 figures, accepted by Ap
The GRB luminosity function in the internal shock model confronted to observations
We compute the expected luminosity function of GRBs in the context of the
internal shock model. We assume that GRB central engines generate relativistic
outflows characterized by the respective distributions of injected kinetic
power Edot and contrast in Lorentz factor Kappa = Gamma_max/Gamma_min. We find
that if the distribution of contrast extends down to values close to unity
(i.e. if both highly variable and smooth outflows can exist) the luminosity
function has two branches. At high luminosity it follows the distribution of
Edot while at low luminosity it is close to a power law of slope -0.5. We then
examine if existing data can constrain the luminosity function. Using the log N
- log P curve, the Ep distribution of bright BATSE bursts and the XRF/GRB ratio
obtained by HETE2 we show that single and broken power-laws can provide equally
good fits of these data. Present observations are therefore unable to favor one
form over the other. However, when a broken power law is adopted, they clearly
indicate a low luminosity slope ~ -0.6 +- 0.2, compatible with the prediction
of the internal shock model.
Comment: 9 pages, 5 figures
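One way to write the two-branch form described above (the break luminosity L_b and the normalization are not given in the abstract and are left symbolic; the slopes are the quoted fit values):

```latex
% Luminosity function suggested by the internal shock model: below the
% break it follows the slope ~ -0.5 branch, above it the distribution of
% injected kinetic power \dot{E}.
\Phi(L) \propto
\begin{cases}
  \left(L/L_b\right)^{-\delta_1}, & L < L_b,\\[4pt]
  \left(L/L_b\right)^{-\delta_2}, & L \ge L_b,
\end{cases}
\qquad \delta_1 \simeq 0.6 \pm 0.2,
```

with delta_2 set by the distribution of Edot at high luminosity, matching the ~ -0.6 +- 0.2 low-luminosity slope that the broken power-law fits indicate.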