Parallelizing Bisection Root-Finding: A Case for Accelerating Serial
  Algorithms in Multicore Substrates by Bakhshalipour, Mohammad & Sarbazi-Azad, Hamid
Department of Computer Engineering, Sharif University of Technology
School of Computer Science, Institute for Research in Fundamental Sciences (IPM)
Abstract: Multicore architectures dominate today’s processor market. Even though
the number of cores and threads is already high and continues to grow, inherently
serial algorithms do not benefit from this abundance. In this paper, we propose
Runahead Computing, a technique that uses idle threads in a multi-threaded
architecture to reduce the execution time of serial algorithms. In detailed
evaluations of one specific serial algorithm on both CPU and GPU platforms, our
approach reduces the execution latency by up to 9x.
Keywords: Multi-Thread Programming, Single-Thread Performance, Multicore
Processor, GPU, Bisection Root-Finding.
I. INTRODUCTION
Multi-threaded architectures now appear across the whole spectrum of computing
machines, from low-end embedded processors to high-end general-purpose devices.
A Chip Multiprocessor (CMP) [1] is a type of multi-threaded processor in which
cores do not share computational resources. CMPs are implemented in many
commercial systems and are widely used in broad classes of computations. Intel
Nehalem i7 [2], AMD Bulldozer [3], IBM Power5 [4], Sun Niagara T2 [5], and
TILE64 [6] are examples of commercial CMPs. A CMP enhances the performance of a
single program when the program can be split into multiple pieces, each run by
one core in parallel with the others.
On the other hand, Graphics Processing Units (GPUs) are also being adapted for
general-purpose computations. GPUs use an aggregate form of the
single-instruction-multiple-data (SIMD) paradigm [7] with fine-grain
multi-threading. NVIDIA Kepler GK110 [8] and AMD TeraScale [9] are examples of
commercial GPUs. A GPU exploits the data-level parallelism of a single program,
which applies the same operation to multiple data points, to reduce the
execution time of the application [10]–[17].
Even though core counts keep growing in both multicore and manycore systems
[18]–[21], serial programs do not benefit from the increase. Serial programs (or
the serial parts of a program) run on a single thread and cannot be split across
multiple threads. Executing such a program on a manycore platform results in one
core running the whole program while the other cores remain idle. The situation
gets worse when boosting the core count diminishes single-core performance. This
happens for two reasons: (1) Tight physical constraints of the chip (i.e.,
limited power and area budgets) prevent accommodating hundreds of large,
power-hungry cores on a single chip [22], [23]. Increasing the core count
therefore implies replacing high-performance cores with small, simple cores,
which lowers single-thread performance and consequently lengthens the execution
time of serial programs. Some proposals address this problem with asymmetric
architectures [24]. (2) Increasing the core count forces designers to replace
non-scalable crossbars with on-chip networks that use scalable topologies (e.g.,
mesh). The latency of the on-chip interconnect decreases per-core performance
because accesses to shared caches become slower. Boosting the core count
increases the network hop count and results in longer delays [25]–[42]. Some
proposals offset this obstacle with richly connected topologies [43], [44],
which have significant area and energy overheads.
In this paper, we propose Runahead Computing, a technique for accelerating
inherently serial algorithms on multicore and manycore platforms. Runahead
Computing draws on previous research on non-traditional parallelism and targets
the execution latency of serial algorithms. The rest of the paper is organized
as follows: Section II explains Runahead Computing and gives the essential
background on non-traditional parallelism. In Section III, we choose a suitable
serial algorithm as a case study and describe its details. In Section IV, we
illustrate the implementation of Runahead Computing in our case study. Section V
discusses our evaluation methodology. Section VI presents the results of the
evaluation experiments. Finally, Section VII concludes the paper.
II. RUNAHEAD COMPUTING
Runahead Computing refers to using the idle threads in a multi-threaded
architecture to improve the performance of a single-threaded application. It is
a kind of speculation in which idle threads operate a few steps ahead of the
main thread. These speculative threads prepare results at the point where they
run, so that when the main thread reaches that point, it executes faster than
usual.
This is not the first work on using idle thread contexts to increase the
performance of a single-threaded application. Various schemes have been proposed
that use idle thread contexts to provide certain kinds of assistance to the main
thread (this paradigm is sometimes called non-traditional parallelism). Some
approaches use idle thread contexts as a prefetcher for the main thread
[45]–[50] in order to hide long-latency memory accesses [51]–[53]. Several
proposals precompute branch outcomes through a derived variant of the main
thread [50], [54]. There are also approaches that exploit idle thread contexts
to precompute dependent live-in data [55]. However, all of these approaches
require significant, non-trivial changes to the processor hardware, which makes
them challenging to implement in shipping products. In this paper, we target a
method that pushes everything to software and hence is entirely applicable to
commercial off-the-shelf processors.
arXiv:1805.07269v1 [cs.DC] 11 May 2018
[Flowchart: input X; A = F(X); if A is even, B = A + G(X); if A is odd, B = A + H(X); output B.]
Fig. 1: Flowchart of a hypothetical program.
We describe our approach with an example. Figure 1 shows the flowchart of a
hypothetical program. The program takes X as input and returns B as output. It
first computes F(X) and stores the result in variable A. Then, if A is an even
number, the program calculates G(X) and stores the sum of the result and A in
variable B. If A is an odd number, the program instead computes H(X) and puts
the sum of the result and A into variable B. Finally, it returns B as the
output. Because of the dependencies between its parts, the program cannot be
parallelized in the conventional way. Assuming execution times of 10, 5, and 5
seconds for F, G, and H, respectively, and neglecting the latency of other
operations (e.g., store and add), a single thread executes this program within
15 seconds. With Runahead Computing, however, the programmer starts two threads
at the beginning of execution. The first thread (main thread) begins by
calculating F(X), and in parallel, the second thread (helper thread) calculates
G(X) and H(X) and stores their results in variables G and H, respectively. In
this manner, the latency of calculating G(X) and H(X) overlaps with the latency
of computing F(X). When the main thread finishes the calculation of F(X), the
results of G(X) and H(X) are already available in variables G and H. The main
thread then checks the value of A, picks either G or H, adds it to A, stores the
result in B, and returns the output. In this way, the whole program finishes
within 10 seconds. So, in this example, Runahead Computing improves the
performance of the program by 50% (from 15 to 10 seconds, a 1.5x speed-up).
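The example of Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bodies of F, G, and H are hypothetical stand-ins (the text gives only their latencies), time.sleep models their latency at a scale of one hundredth, and a single worker of a thread pool plays the role of the helper thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for F, G, and H (not defined in the text).
# time.sleep models the 10 s / 5 s / 5 s latencies, scaled down by 100.
def F(x):
    time.sleep(0.10)
    return x * 3

def G(x):
    time.sleep(0.05)
    return x + 1

def H(x):
    time.sleep(0.05)
    return x - 1

def serial(x):
    """Baseline: F first, then exactly one of G or H."""
    a = F(x)
    return a + (G(x) if a % 2 == 0 else H(x))

def runahead(x):
    """Helper thread speculatively computes both G and H while F runs."""
    def helper():
        return G(x), H(x)

    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(helper)   # helper thread starts immediately
        a = F(x)                       # main thread
        g, h = future.result()         # both results are ready by now
    return a + (g if a % 2 == 0 else h)

print(serial(4), runahead(4))
```

Both functions return the same value; the runahead version overlaps the helper's work with F, at the cost of computing one of G or H in vain.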
III. BISECTION METHOD
In this section, we choose the Bisection root-finding algorithm [56] as our case
study and describe its baseline form (without Runahead Computing). Bisection is
a serial algorithm, and its step-by-step interval halving makes it a suitable
case for our proposal.
A. Algorithm
Bisection is a root-finding algorithm that operates on continuous functions. The
method is based on the following lemma: if a continuous function f takes values
of opposite signs at a and b, where a < b, then the equation f(x) = 0 has at
least one real root in the interval (a, b). The Bisection algorithm takes a
function and an interval as inputs, repeatedly halves the interval, and each
time picks the subinterval in which a root must lie.
[Plot: f(x) = sin(2x) for x from 2 to 4, with the y axis ranging from -1 to 1.]
Fig. 2: Bisection algorithm on a sample function.
Figure 2 illustrates an example of the operation of this algorithm. The
algorithm takes f(x) = sin(2x) as the input function and (2, 4) as the initial
interval and then tries to find a root of the equation f(x) = 0 in the given
interval. At first, it computes f(2) and f(4) and finds that f(2) is negative
and f(4) is positive. The algorithm then computes the middle point of the
interval and the value of the function at that point (i.e., $c = \frac{2+4}{2} = 3$
and f(3)) and finds that f(3) is negative. Now the interval must be halved and
replaced by either (2, 3) or (3, 4). Because the function takes values of
opposite signs at the edge points of (3, 4), the algorithm continues with the
interval (3, 4). In the next step, the operation repeats similarly and the
interval (3, 3.5) is chosen. The algorithm continues in the same manner for more
iterations (not shown in the figure) until it estimates the root with acceptable
accuracy. Because the interval is halved at each step, achieving an error
smaller than $\epsilon$ requires, in general,
$\lceil \log_2 \frac{b-a}{\epsilon} \rceil$ iterations for an initial interval
of (a, b).
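As a quick check of the iteration bound, the following sketch evaluates it for two intervals used in this paper. The function name iterations_needed is ours and is introduced only for illustration.

```python
import math

def iterations_needed(a, b, eps):
    """Evaluate ceil(log2((b - a) / eps)), the iteration bound stated above."""
    return math.ceil(math.log2((b - a) / eps))

# Interval (2, 4) from Fig. 2 and interval (1, 2) from the evaluation,
# both with a maximum tolerable error of 2^-6.
print(iterations_needed(2, 4, 2 ** -6))
print(iterations_needed(1, 2, 2 ** -6))
```

For the interval (1, 2) and an error of $2^{-6}$, the bound gives six iterations, which is the single-thread iteration count used later in the evaluation.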
The main advantages of the Bisection algorithm are its simplicity and
robustness. Other numerical root-finding methods may converge faster, but they
work only when the function satisfies specific conditions, whereas the Bisection
method works on any continuous function without further requirements. The only
problem with the Bisection method is its high execution time, which we target in
this work. Some approaches use the Bisection method to obtain a rough
approximation of a solution, which is then used as a starting point for more
quickly converging methods [57].
B. Baseline Implementation
For baseline evaluations, we implement the Bisection method as shown in
Algorithm 1. In this implementation, even if the exact root is found in the
middle of the execution, the program does not stop and continues for the
predefined number of iterations.
Data: Function: f, Interval: (a, b), Iterator: iterations
Result: root of f(x) = 0 in (a, b)
while iterations > 0 do
    iterations ← iterations − 1
    root ← (a + b) / 2
    if f(a) × f(root) < 0 then
        b ← root
    else
        a ← root
    end
end
return root
Algorithm 1: Baseline implementation for Bisection algorithm.
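Algorithm 1 translates almost line by line into runnable code. The version below is a minimal sketch, not the evaluated implementation; like Algorithm 1, it always runs the full iteration count and does not treat an exact zero at the midpoint specially.

```python
import math

def bisection(f, a, b, iterations):
    """Baseline bisection following Algorithm 1 (fixed number of iterations)."""
    root = (a + b) / 2
    for _ in range(iterations):
        root = (a + b) / 2
        if f(a) * f(root) < 0:
            b = root      # sign change lies in the left half
        else:
            a = root      # otherwise continue with the right half
    return root

# The example of Fig. 2: f(x) = sin(2x) on (2, 4), whose root is pi.
approx = bisection(lambda x: math.sin(2 * x), 2.0, 4.0, 30)
print(approx)
```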
IV. RUNAHEAD BISECTION
In this section, we propose our approach for accelerating
the Bisection algorithm by leveraging available idle threads.
First, we describe our method in detail and then discuss its
complexity.
A. Algorithm
We begin by considering the same example (i.e., Fig. 2) in an environment with
three threads. One of the threads is the main thread, and the other two are
helper threads. At first, the main thread computes the value of f(3), and in
parallel, the helper threads operate one step ahead of the main thread. One
helper thread predicts that f(3) will be positive and speculatively begins to
compute the value of f(2.5). The other helper thread predicts a negative value
for f(3) and speculatively calculates f(3.5). Each helper thread stores the
result of its computation in a shared variable (say, $f_{3.5}$ and $f_{2.5}$).
When the main thread completes its calculation, it compares the result of f(3)
with the results of the helper threads stored in the two shared variables.
Because f(3) is negative (i.e., the prediction of the second helper thread is
correct) and $f_{3.5}$ is positive, the next interval is (3, 3.5). In the next
step, the main thread computes f(3.25), and the helper threads calculate
f(3.125) and f(3.375). The subsequent steps proceed in the same way. In this
fashion, if we ignore the latency of some operations (e.g., joining threads and
storing the values), the execution latency is reduced by 50%.
We can further reduce the execution latency of the algorithm by devoting more
helper threads to the program. For example, if we have seven threads, we can
perform the computations of the next two steps during the current step. In our
example, the main thread initially computes f(3), two helper threads compute
f(3.5) and f(2.5), and four helper threads compute f(2.25), f(2.75), f(3.25),
and f(3.75). In this way, the execution latency drops to one-third of the
baseline latency.
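The scheme generalizes to $2^k - 1$ threads that together cover k bisection levels per step. The sketch below illustrates the structure under our own assumptions: the function name, the parameter names steps and levels, and the use of a Python thread pool are ours, not the evaluated implementation. Python threads do not run CPU-bound code in parallel, so this shows the control flow rather than the speed-up.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def runahead_bisection(f, a, b, steps, levels):
    """Each step evaluates the 2**levels - 1 interior grid points in parallel,
    narrowing the interval as much as `levels` serial bisection steps."""
    n = 2 ** levels
    neg_a, neg_b = f(a) < 0, f(b) < 0          # edge signs, known up front
    with ThreadPoolExecutor(max_workers=n - 1) as pool:
        for _ in range(steps):
            width = (b - a) / n
            points = [a + i * width for i in range(1, n)]
            values = list(pool.map(f, points))  # one point per thread
            grid = [a] + points + [b]
            neg = [neg_a] + [v < 0 for v in values] + [neg_b]
            # Keep the first pair of adjacent points whose signs differ.
            for i in range(n):
                if neg[i] != neg[i + 1]:
                    a, b = grid[i], grid[i + 1]
                    neg_a, neg_b = neg[i], neg[i + 1]
                    break
    return (a + b) / 2

# Three threads (levels=2) on the example of Fig. 2.
root = runahead_bisection(lambda x: math.sin(2 * x), 2.0, 4.0, steps=15, levels=2)
print(root)
```

With levels=2, the first step evaluates f(2.5), f(3), and f(3.5) and keeps the interval (3, 3.5), as in the walk-through above; 15 such steps narrow the interval as much as 30 baseline iterations.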
To keep our method scalable in the number of threads, we implement the shared
variables as an array. Each thread owns one element of the array. The array also
contains two elements that do not belong to any thread and hold the signs of the
function values at the edge points of the current interval. Each thread fills
its element after its computation finishes: if the result is negative, the
thread writes ‘1’; otherwise, it writes ‘0’. Once all the threads have written
their results into the array, each thread compares the entries that belong to
its two neighbors (a simple XOR). If they differ, one edge of the new interval
is this thread's point, and the other edge is the neighboring point that has a
different value in the array. The new interval is chosen on this basis; then the
main thread starts computing the value at the middle point of the new interval,
the helper threads pick their corresponding points, and the scenario repeats as
before. To prevent false sharing [58], [59], we implement the array as a
two-dimensional structure but use only one dimension, so that the shared
variables map to different coherence units. Figure 3 illustrates the array-based
implementation of Runahead Bisection for our example with three threads. At
first, the main thread computes f(3) and writes ‘1’ to the array, as the result
is negative. In parallel with the main thread, helper thread-1 computes f(2.5)
and helper thread-2 computes f(3.5); then they write their results (‘0’ or ‘1’)
into the array. The array is now complete, because the signs of f(2) and f(4)
are known beforehand. Helper thread-1 compares the signs of the values at its
neighbor points (i.e., f(2) and f(3)). In parallel, the main thread and helper
thread-2 do the same on their corresponding entries. Based on these comparisons,
helper thread-2 finds that its neighbors have values of different signs. So, for
the next step, helper thread-2 sets the interval (a variable shared among all
threads) to (3, 3.5). The signs of the values at the edge points are copied to
the top and bottom of the array, and the other operations repeat as before.
[Diagram: sign arrays for successive steps; in the first step, the points 2, 2.5, 3, 3.5, 4 hold the entries 1, 1, 1, 0, 0.]
Fig. 3: Array-based implementation of Runahead Bisection.
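The neighbor comparison on the sign array can be sketched as follows. This is an illustration only: the function name select_interval is ours, a sequential loop stands in for the threads, and the padding against false sharing is omitted because it has no direct counterpart in Python.

```python
def select_interval(points, signs):
    """points: grid including both edge points.
    signs: 1 if f(point) is negative, 0 otherwise (same order as points).
    Each interior index i plays the role of one thread and XORs the entries
    of its two neighbors."""
    for i in range(1, len(points) - 1):
        if signs[i - 1] ^ signs[i + 1]:
            # The sign change is adjacent to this point: pair it with
            # whichever neighbor holds a different entry.
            j = i - 1 if signs[i - 1] != signs[i] else i + 1
            return tuple(sorted((points[i], points[j])))
    return points[0], points[-1]   # no sign change found: keep the interval

# First step of the three-thread example.
print(select_interval([2, 2.5, 3, 3.5, 4], [1, 1, 1, 0, 0]))
```

In the real scheme, several threads may detect the sign change at once (here, both the main thread and helper thread-2), but they all derive the same interval, so the shared variable ends up with the same value regardless of which one writes it.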
B. Complexity
Because all threads perform similar operations, we first analyze the time
complexity of a single thread. Each thread computes the value of its
corresponding point, writes the sign of the result into the array, and then
waits to synchronize with the other threads. Next, the thread compares the signs
of the results of its neighbor threads and updates the interval if required. All
of these operations take O(1) time, as their execution times are constant and
independent of the demanded accuracy and of the length of the initial interval.
If the baseline algorithm needs n iterations to reach a certain accuracy, the
Runahead version needs $\frac{n}{k}$ iterations per thread, where k depends on
the number of threads. In general, we can reduce the number of iterations per
thread from n to $\frac{n}{k}$ if we have $2^k - 1$ threads. Stated in terms of
the thread count (with k now denoting the number of threads), given k threads,
our approach reduces the total execution time complexity of the program from
$O(n)$ to $O(\frac{n}{\log_2(k+1)})$, where n is the number of iterations
required in the baseline implementation.
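The relation between thread count and per-thread iterations can be checked numerically. The helper below is a sketch with a name of our choosing; it evaluates the idealized count and ignores synchronization overheads.

```python
import math

def runahead_iterations(n, threads):
    """Idealized per-thread iteration count n / log2(threads + 1), rounded up."""
    levels = math.log2(threads + 1)   # bisection levels covered per step
    return math.ceil(n / levels)

# Six baseline iterations with one, three, and seven threads.
for t in (1, 3, 7):
    print(t, runahead_iterations(6, t))
```

With six baseline iterations, three threads need three iterations each and seven threads need two, matching the one-half and one-third latencies discussed in Section IV.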
V. METHODOLOGY
We evaluate our approach on both a CMP and a GPU. Table 1 summarizes the
parameters of our platforms. For the CPU, we compile the program using GCC
without optimization. For the GPU, we use NVCC for compilation, again without
optimization. To eliminate the effects of compulsory cache misses, we run each
program twice and report the results of the second execution. To demonstrate the
effectiveness of our method, we choose a high-latency input function: we use
trigonometric functions and calculate them with Taylor series [60].
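A Taylor-series evaluation with a configurable number of iterations, which serves as the latency knob in our experiments, might look like the sketch below. The function names and the term-by-term recurrence are our assumptions for illustration; they are not taken from the evaluated code.

```python
import math

def taylor_sin(x, terms):
    """Sum the first `terms` terms of the Taylor series of sin around 0."""
    total, term = 0.0, x
    for n in range(terms):
        total += term
        # Next term: multiply by -x^2 / ((2n + 2)(2n + 3)).
        term *= -x * x / ((2 * n + 2) * (2 * n + 3))
    return total

def taylor_cos(x, terms):
    """Sum the first `terms` terms of the Taylor series of cos around 0."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        # Next term: multiply by -x^2 / ((2n + 1)(2n + 2)).
        term *= -x * x / ((2 * n + 1) * (2 * n + 2))
    return total

def f(x, terms=10_000):
    """Input function sin(cos(x)); `terms` controls its latency."""
    return taylor_sin(taylor_cos(x, terms), terms)

# Compare against the library functions for a small number of terms.
print(abs(f(1.5, 20) - math.sin(math.cos(1.5))))
```

Increasing `terms` beyond a few dozen no longer changes the result for arguments of this size, since the later terms underflow to zero; it only lengthens the computation, which is the property the experiments rely on.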
TABLE I: Evaluation parameters.
Parameter | Value
CPU       | x86 architecture, Intel Core i7, 2.4 GHz, eight cores
OS        | Linux, kernel version 4.4.0-34
GPU       | NVIDIA Tesla K20, 732 MHz
Program   | f(x) = sin(cos(x)), Taylor series with $10^4$ iterations; initial interval: (1, 2)
[Plot: normalized execution time (y axis, 0.00 to 1.00) versus number of threads (x axis, 1 to 7).]
Fig. 4: Sensitivity of execution time of CPU program to the
number of threads.
VI. EVALUATION RESULTS
In this section, we report the results of two sensitivity analyses: (1) the
sensitivity of execution time to the number of threads, and (2) the sensitivity
of the speed-up of our method to the latency of the input function.
A. Sensitivity to The Number of Threads
For the specific input program defined in Table 1, we sweep the number of
threads and measure the execution time of the application. For the CPU program,
we set the maximum tolerable error to $2^{-6}$ and sweep the number of threads
from 1 to 7. For the GPU program, we set the maximum tolerable error to
$2^{-2520}$ and sweep the number of threads from 1 to 1023. We choose $2^{-6}$
and $2^{-2520}$ as the maximum tolerable errors because, in these situations,
the number of iterations of the single-threaded program is divisible by
$\log_2(\#\text{Threads}+1)$.
Figure 4 shows the result of the thread sweep for the CPU program. The execution
latency values are normalized to that of the single-threaded program. The
execution time of the CPU program drops to 55% (using three threads) and 38%
(using seven threads) of its baseline serial implementation. As the figure
illustrates, the performance scales nearly ideally with the thread count. The
deviations from ideal scaling come from the latency of operations that exist
only in the multi-threaded program (e.g., creating and synchronizing the
threads, and filling the variables shared among threads). Notably, as the thread
count increases, the latency of these operations (e.g., synchronizing the
threads) also increases and prevents perfect performance scaling.
Figure 5 presents the effect of sweeping the number of threads on the execution
latency of the GPU program. Again, the execution times are normalized to that of
the single-threaded program. The latency of the program drops to 50% (using
three threads) and 10% (using 1023 threads) of its baseline serial
implementation. As the figure confirms, the performance scales perfectly as the
number of threads grows. The low overhead of creating and joining hardware
threads on the GPU platform [61] provides this near-ideal scalability.
[Plot: normalized execution time (y axis, 0.00 to 1.00) versus number of threads (x axis, logarithmic, 1 to 1000).]
Fig. 5: Sensitivity of execution time of GPU program to the number of threads.
[Plot: Runahead speed-up (y axis, -100% to 100%) versus Taylor series iterations (x axis, logarithmic, 10 to 1000000).]
Fig. 6: Sensitivity of speed-up to the execution latency of input function in CPU platform.
B. Sensitivity to The Execution Latency of Input Function
In this section, we investigate the sensitivity of our approach's speed-up to
the computation latency of the input function. To this end, we sweep the number
of iterations of the Taylor series used for computing the trigonometric
functions. As the number of iterations in the Taylor series grows, the latency
of calculating sin(x) and cos(x) rises, and consequently, the total latency of
computing the value of a given point grows. In this experiment, we set the
maximum tolerable error to $2^{-6}$ and restrict the number of threads to three.
In this way, the single-threaded program needs to iterate six times, and the
multi-threaded program requires three iterations. In an ideal situation (i.e.,
neglecting the latencies of creating and joining threads and filling shared
variables), the multi-threaded application should take half the execution time
of the single-threaded program (in other words, multi-threading should raise the
performance by 100%).
Figure 6 shows the result of this study for the CPU program. As the figure
illustrates, when the execution latency of the function is low (below 500
iterations), Runahead Computing does not improve the performance and in fact
decreases it. This occurs because, in this situation, the overhead of creating
and joining threads exceeds the latency of computing the value of a point. When
the computation time of the function goes beyond a threshold, however, Runahead
Computing improves the performance. In our experiment, Runahead Computing
decreases the performance by 86% when the latency of the input function is small
(10 iterations of the Taylor series). As the latency of the function increases,
the speed-up of Runahead Computing also increases. With the number of Taylor
series iterations set to 10000, Runahead Computing improves the performance by
97% and converges to the ideal speed-up value.
Figure 7 presents the result of the same experiment on the GPU platform. As
shown, unlike on the CPU, Runahead Computing on the GPU never loses performance
in our analysis. The reason is that the overhead of creating and joining threads
on the GPU is very low in comparison with the CPU [61]. Even for a small number
of Taylor series iterations, Runahead Computing considerably increases the
performance. For 10 iterations of the Taylor series, Runahead Computing yields a
speed-up of 19%. For more than 500 iterations, Runahead Computing improves the
performance by 99% and reaches its ideal speed-up value.
[Plot: Runahead speed-up (y axis, 0% to 100%) versus Taylor series iterations (x axis, logarithmic, 10 to 1000000).]
Fig. 7: Sensitivity of speed-up to the execution latency of
input function in GPU platform.
VII. CONCLUSION
In this paper, we proposed Runahead Computing, a technique for increasing the
performance of single-threaded applications on multi-threaded architectures by
exploiting available idle threads. In the proposed approach, the programmer
codes the idle threads to work some steps ahead of the main thread. As a case
study, we chose the Bisection root-finding algorithm and accelerated it on a CMP
and a GPU. While we examined our method on a particular algorithm, we believe
that the ideas in this paper can be applied to other similar algorithms (e.g.,
Binary Search).
REFERENCES
[1] Kunle Olukotun et al. The Case for a Single-Chip Multiprocessor. In ASPLOS,
1996.
[2] Martin Dixon et al. The Next-Generation Intel Core Microarchitecture. Intel
Technology Journal, 2010.
[3] Michael Butler et al. Bulldozer: An Approach to Multithreaded Compute
Performance. IEEE Micro, 2011.
[4] Ron Kalla et al. IBM Power5 Chip: A Dual-Core Multithreaded Processor. IEEE
Micro, 2004.
[5] Manish Shah et al. UltraSPARC T2: A Highly-Treaded, Power-Efficient, SPARC
SOC. In ASSCC, 2007.
[6] Shane Bell et al. Tile64-Processor: A 64-Core SoC with Mesh Interconnect. In
ISSCC, 2008.
[7] Michael J Flynn. Some Computer Organizations and Their Effectiveness. IEEE
Transactions on Computers, 1972.
[8] NVIDIA’s Next Generation CUDA Compute Architecture: Kepler GK110. Technical
report, 2012.
[9] M Houston. Anatomy of AMD’s TeraScale Graphics Engine. 2008.
[10] Mohammad Sadrosadati et al. LTRF: Enabling High-Capacity Register Files
for GPUs via Hardware/Software Cooperative Register Prefetching. In ASPLOS,
2018.
[11] Amir Yazdanbakhsh et al. Neural Acceleration for GPU Throughput Processors.
In MICRO, 2015.
[12] Farzad Khorasani et al. RegMutex: Inter-Warp GPU Register Time-Sharing. In
ISCA, 2018.
[13] Mohammad Sadrosadati et al. Effective Cache Bank Placement for GPUs. In
DATE, 2017.
[14] Homa Aghilinasab et al. Reducing Power Consumption of GPGPUs Through
Instruction Reordering. In ISLPED, 2016.
[15] Ali Karami et al. A Statistical Performance Prediction Model for OpenCL Kernels
on NVIDIA GPUs. In CADS, 2013.
[16] Amin Abbasi et al. A Preliminary Study of Incorporating GPUs in the Hadoop
Framework. In CADS, 2012.
[17] Sayyed Ali Mirsoleimani et al. A Parallel Memetic Algorithm on GPU to Solve
the Task Scheduling Problem in Heterogeneous Environments. In GECCO, 2013.
[18] Pejman Lotfi-Kamran et al. Scale-Out Processors. In ISCA, 2012.
[19] Michael Ferdman et al. Cuckoo Directory: A Scalable Directory for Many-Core
Systems. In HPCA, 2011.
[20] Boris Grot et al. Optimizing Data-Center TCO with Scale-Out Processors. IEEE
Micro, 2012.
[21] Pejman Lotfi-Kamran et al. TurboTag: Lookup Filtering to Reduce Coherence
Directory Power. In ISLPED, 2010.
[22] Hadi Esmaeilzadeh et al. Dark Silicon and the End of Multicore Scaling. In
ISCA, 2011.
[23] Nikos Hardavellas et al. Toward Dark Silicon in Servers. IEEE Micro, 2011.
[24] M. Aater Suleman et al. Accelerating Critical Section Execution with Asymmetric
Multi-Core Architectures. In ASPLOS, 2009.
[25] Mohammad Bakhshalipour et al. Fast Data Delivery for Many-Core Processors.
IEEE Transactions on Computers, 2018.
[26] Amirhossein Mirhosseini et al. BiNoCHS: Bimodal Network-on-Chip for CPU-
GPU Heterogeneous Systems. In NOCS, 2017.
[27] Amirhossein Mirhosseini et al. An Energy-Efficient Virtual Channel Power-gating
Mechanism for On-Chip Networks. In DATE, 2015.
[28] Mohammad Sadrosadati et al. An Efficient DVS Scheme for On-Chip Networks
Using Reconfigurable Virtual Channel Allocators. In ISLPED, 2015.
[29] Pejman Lotfi-Kamran et al. NOC-Out: Microarchitecting a Scale-Out Processor.
In MICRO, 2012.
[30] Pejman Lotfi-Kamran et al. Near-Ideal Networks-on-Chip for Servers. In HPCA,
2017.
[31] Pejman Lotfi-Kamran et al. EDXY–A Low Cost Congestion-Aware Routing
Algorithm for Network-on-Chips. Journal of Systems Architecture, 2010.
[32] Pejman Lotfi-Kamran et al. BARP-A Dynamic Routing Protocol for Balanced
Distribution of Traffic in NoCs. In DATE, 2008.
[33] Pejman Lotfi-Kamran et al. An Efficient Hybrid-Switched Network-on-Chip for
Chip Multiprocessors. IEEE Transactions on Computers, 2016.
[34] Pejman Lotfi-Kamran et al. NOC Characteristics of Cloud Applications. In CADS,
2017.
[35] Babak Falsafi et al. Network-on-Chip Using Request and Reply Trees for Low-
Latency Processor-Memory Communication, 2017. US Patent 9,703,707.
[36] Mehdi Modarressi et al. Application-Aware Topology Reconfiguration for On-
Chip Networks. IEEE Transactions on Very Large Scale Integration Systems,
2011.
[37] Mehdi Modarressi et al. Virtual Point-to-Point Connections for NoCs. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2010.
[38] Mehdi Modarressi et al. A Hybrid Packet-Circuit Switched On-Chip Network
Based on SDM. In DATE, 2009.
[39] Mehdi Modarressi et al. Power-Aware Mapping for Reconfigurable NoC
Architectures. In ICCD, 2007.
[40] Mehdi Modarressi et al. An Efficient Dynamically Reconfigurable On-Chip
Network Architecture. In DAC, 2010.
[41] Amirhossein Mirhosseini et al. Quantifying the Difference in Resource Demand
among Classic and Modern NoC Workloads. In ICCD, 2016.
[42] Amirhossein Mirhosseini et al. POSTER: Elastic Reconfiguration for
Heterogeneous NoCs with BiNoCHS. In PACT, 2017.
[43] Boris Grot et al. Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for
Scalability and Service Guarantees. In ISCA, 2011.
[44] John Kim et al. Flattened Butterfly: A Cost-Efficient Topology for High-Radix
Networks. In ISCA, 2007.
[45] Jamison D Collins et al. Dynamic Speculative Precomputation. In MICRO, 2001.
[46] Jamison D. Collins et al. Speculative Precomputation: Long-Range Prefetching
of Delinquent Loads. In ISCA, 2001.
[47] Dongkeun Kim and Donald Yeung. Design and Evaluation of Compiler
Algorithms for Pre-Execution. In ASPLOS, 2002.
[48] Chi-Keung Luk. Tolerating Memory Latency Through Software-Controlled Pre-
Execution in Simultaneous Multithreading Processors. In ISCA, 2001.
[49] Weifeng Zhang et al. An Event-Driven Multithreaded Dynamic Optimization
Framework. In PACT, 2005.
[50] Craig Zilles and Gurindar Sohi. Execution-Based Prediction Using Speculative
Slices. In ISCA, 2001.
[51] Mohammad Bakhshalipour et al. Domino Temporal Data Prefetcher. In HPCA,
2018.
[52] Mohammad Bakhshalipour et al. An Efficient Temporal Data Prefetcher for L1
Caches. IEEE Computer Architecture Letters, 2017.
[53] Armin Vakil-Ghahani et al. Cache Replacement Policy Based on Expected Hit
Count. IEEE Computer Architecture Letters, 2018.
[54] Robert S Chappell et al. Simultaneous Subordinate Microthreading (SSMT). In
ISCA, 1999.
[55] Carlos Madriles et al. Mitosis: A Speculative Multithreaded Processor Based on
Precomputation Slices. IEEE Transactions on Parallel and Distributed Systems,
2008.
[56] George Corliss. Which Root Does The Bisection Algorithm Find? SIAM Review,
1977.
[57] Richard L Burden and J Douglas Faires. The Bisection Algorithm. Numerical
Analysis. Prindle, Weber & Schmidt, 1985.
[58] Josep Torrellas et al. False Sharing and Spatial Locality in Multiprocessor Caches.
IEEE Transactions on Computers, 1994.
[59] Tor E. Jeremiassen and Susan J. Eggers. Reducing False Sharing on Shared
Memory Multiprocessors Through Compile Time Data Transformations. In
PPOPP, 1995.
[60] Lars V Ahlfors. Complex Analysis: An Introduction to The Theory of Analytic
Functions of One Complex Variable. New York, London, 1953.
[61] Alejandro Segovia. Parallel Programming with NVIDIA CUDA. Linux Journal,
2010.
