Hurry-up: Scaling Web Search on Big/Little Multi-core Architectures by Nishtala, Rajiv et al.
Rajiv Nishtala1, Vinicius Petrucci2, Paul Carpenter3, and Xavier Martorell3
1Norwegian University of Science and Technology (rajiv.nishtala@ntnu.no)
2Federal University Bahia, Brazil & University of Pittsburgh, USA (vpetrucci@upitt.edu)
3Barcelona Supercomputing Center, Spain (first.last@bsc.es)
Abstract—Heterogeneous multi-core systems such as big/little
architectures have been introduced as an attractive server design
option with the potential to improve performance under power
constraints in data centres. Since both big high-performing
and little power-efficient cores can run on the same system
sharing the workload processing, thread mapping/scheduling
turns out to be much more challenging. This is particularly
hard when considering the different trade-offs shaped by the
heterogeneous cores on the quality-of-service (expressed as tail
latency) experienced by user-facing applications, such as Web
Search.
In this work, we present Hurry-up, a runtime thread mapping
solution designed to select individual requests to run on the most
appropriate heterogeneous cores to improve tail latency. Hurry-
up accelerates compute-intensive requests on big cores, while
letting less intensive threads execute on little cores. We implement
and deploy Hurry-up on a real 64-bit big/little architecture
(ARM Juno), and show that, compared to a conservative policy
on Linux, Hurry-up reduces the server tail latency by 39.5%
(mean).
I. INTRODUCTION
Online large-scale data-intensive services are becoming
increasingly sensitive to the non-deterministic behaviour of
modern servers. These types of services are typically fan-out
services (i.e., distributed across multiple servers) that need
to act in tandem to ensure user experience. Prior research has
shown that marginal delays (of hundreds of milliseconds) in
user experience can greatly impact advertising revenue [6].
For this reason, user experience is typically expressed in
distributed production systems as percentiles of tail latency,
such as 95th or 99th percentile. Ensuring consistent tail latency
is a hard problem because traditional IPC-based (Instructions
per Cycle) scheduling mechanisms have failed [15].
Given the stochastic nature of the incoming requests to
such services, it is important to serve the incoming requests
based on their heterogeneous computational demands. One
such avenue recently explored in the literature [17], [19], [20],
[24] to exploit such fan-out services has been heterogeneous
multi-core architectures.
Heterogeneous multi-core architectures include types of
cores having different power and performance characteristics,
typically big/high-performance and little/power-efficient cores
sharing the same ISA (Instruction Set Architecture). These
heterogeneous multi-cores have emerged as an architecture
design to keep meeting performance and energy efficiency
requirements for a wide range of applications. The key idea is
that threads with high computational load are more suitable to
run on big cores, while threads requiring low computational
resources can execute more on little cores, leaving space to
run additional threads on the big cores.
Modern data centres typically host services such as web
search or social networks that require strict quality-of-service
(QoS) levels [11], [15]. Web search is a critical and challeng-
ing workload in data centres due to several factors: the service
imposes strict query response time constraints, the arrival time
of a query is hard to predict, and each query may have distinct
lengths and computing requirements. Prior research [20] has
shown that online interactive workloads can take advantage
of heterogeneous cores and can deliver higher throughput in
contrast to homogeneous cores.
In the data centre domain, an important problem with
heterogeneous multi-cores is to decide how best to map the
application workload on the most appropriate core type, given
the different latency requirements [10]. Prior work such as
Hipster [17] and Octopus-Man [19] has shown that heterogeneous
cores are attractive for executing cloud workloads, enabling
mapping the application at runtime on a set of big or little
cores for improved energy efficiency.
In contrast to the most recent prior work [17], [19], which maps the
entire application onto heterogeneous cores, this work presents a
thread/request-level mapping policy called Hurry-up. Hurry-
up can select and execute individual threads on heterogeneous
cores. These threads are typically compute-intensive and are
responsible for processing incoming user requests. We show
that such a fine-grained solution for thread mapping can lead to
improved throughput and tail latency in heterogeneous server
systems.
The main contributions of our work are:
1 We introduce a thread mapping solution, Hurry-up,
that exploits a key insight that user queries translate to
different computing requirements, such as by varying length
of keywords; thus, request threads can be individually mapped
at runtime to the most appropriate heterogeneous core for
improved tail latency.
2 We implement and deploy Hurry-up on an ARM 64-bit
big.LITTLE architecture (Juno R1 [2]). We instrument
Hurry-up to efficiently manage the execution of threads for
a representative Web Search benchmark [5].
arXiv:1912.09844v1 [cs.DC] 20 Dec 2019
[Figure 1 plots. Top panel y-axis: Time Taken (ms); bottom panel y-axis: Energy (J); x-axis: Key word length (1 to 17); series: Big, Little.]
Fig. 1: Time taken for query processing and energy consumed
(in J) when varying the number of keywords while running
Web Search on big and little core. The error bars represent
the standard deviation.
3 We perform real system evaluation comparing Hurry-
up with baseline Linux scheduling (a conservative policy) and
show that Hurry-up improves throughput by XX% (mean) and
reduces tail latency by 39.5% (mean).
II. MOTIVATION
We motivate our work by performing exploratory experi-
ments on a big/little multi-core architecture running a Web
Search benchmark configured to receive different user queries.
Details on the experimental setup are given in Section IV-A.
Figure 1 shows the processing time required (top plot) and
energy consumed (bottom plot, in Joules) for Web Search
as a function of big core or little core allocation and query
compute-intensity in number of keywords submitted by the
users (a total of 1 × 10^5 requests). The error bars represent
the standard deviation for query processing time and energy.
If we consider a QoS target at 500ms, user requests with five
or more keywords (heavy requests) would violate the latency
constraint when running on a little core. On the other hand,
requests with fewer than five keywords (light requests) could
run on a little core without much impact on the latency target.
Note that this is non-trivial since these requests experience a lot
of variability when running on little cores. In addition, observe
that the big core can satisfy requests with up to 17 keywords
without violating the latency target, but consumes more energy. For
all light requests, considering only socket power, we observe
that the little core is much more energy efficient (performance
per watt of power consumed) than the big core.
Figure 2 shows an experiment varying the number of cores
and core types to see the impact on the tail latency distribution
(e.g., 90%-ile). Considering a QoS target of 90%-ile at 500ms,
the system cannot meet this constraint by using a single little
[Figure 2 plots. Four panels: Single big core, Single little core, Two big cores, Two little cores; x-axis: Query Processing Times (ms); y-axis: Cumulative Probability (0.0 to 1.0).]
Fig. 2: Query latency distribution on different number of cores
(1 or 2) and types (big or little).
[Figure 3 plot. x-axis: Core Type (1-L, 2-L, 1-B, 2-B); y-axis: Normalized to a single little core; series: Tail Latency, Socket Power.]
Fig. 3: The tail latency (higher is better) and socket power
(lower is better), normalized to a single little core. B and L
represent big and little cores, respectively.
core, but it can by using two little cores. We note that by using one or
two big cores, we can greatly reduce the tail latency, but at the
expense of using more power. For instance, Figure 3 represents
the socket power consumption and tail latency normalised
to a single little core (1-L). The configurations in the X-
axis, B and L, are read as big and little cores, respectively.
For tail latency, higher is better whereas for socket power
consumption, lower is better. Observe that using a single big
core can reduce tail latency by up to 3.2×, but consumes
7.8× more power than a single little core. This
shows that requests can also experience highly unpredictable
variability depending on the core type they run on.
In real server systems, it is impractical to annotate all
applications so that they pass the scheduler information relevant
to compute time, such as the number of keywords. We
notice that heavy requests stay in the system much longer
than light requests, so we can monitor the progress of
each request by reading runtime statistics at the application
level and infer the query computational intensity at runtime.
This motivates the need for a solution that can determine
at runtime how to best map a search thread (either light or
heavy) on the most suitable core type (either little or big). It
is also important that this can be accomplished with minimal
additional power expense and thread mapping overhead.
III. HURRY-UP THREAD MAPPING DESIGN
We design Hurry-up to influence the OS scheduling by
mapping the threads to run on the core type that can make
best usage of the available resources and deliver improved
tail latency. To make thread mapping decisions, Hurry-up
takes advantage of application-level information regarding the
execution of each thread in the system. There are two major
phases in the Hurry-up design: profiling and mapping.
A. Profiling Phase
Hurry-up works by first identifying specific methods of
the application that are suitable for acceleration on the big
cores; those are referred to as “hot functions”. We select via
profiling the hottest function on the critical path of a
single request execution. We insert monitoring probes via code
instrumentation to record the events of entry and exit for such
a critical method. In this work, we consider a single hottest
function per application in the design.
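The entry and exit probes can be pictured with a minimal sketch. The benchmark used in this work is Elasticsearch, so the Python below is only an illustration of the idea and not the instrumentation used in Hurry-up; the names `probe`, `emit`, `events` and `search_hot_function` are hypothetical, and the in-memory list stands in for whatever channel carries the events to the mapper.

```python
import functools
import threading
import time

events = []  # stand-in for the channel read by the mapper (assumption)


def emit(tid, rid, ts_ms):
    """Record one event in a 'thread;request;timestamp' form."""
    events.append(f"{tid};{rid};{ts_ms}")


def probe(func):
    """Wrap a hot function so that both its entry and its exit are recorded."""
    @functools.wraps(func)
    def wrapper(rid, *args, **kwargs):
        tid = threading.get_native_id()
        emit(tid, rid, int(time.time() * 1000))      # entry event
        try:
            return func(rid, *args, **kwargs)
        finally:
            emit(tid, rid, int(time.time() * 1000))  # exit event
    return wrapper


@probe
def search_hot_function(rid, keywords):
    # Placeholder for the real query processing on the critical path.
    return len(keywords)


result = search_hot_function("r1", ["big", "little"])
```

Emitting the same record shape on entry and on exit keeps the probe cheap: the consumer can treat the first sighting of a request ID as its start and the second as its end.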
B. Mapping Phase
As depicted in Figure 4, a client sends requests to be
processed by the Web Search back-end. Each request is
internally mapped to a search thread selected from a pool of
threads. Once the search thread starts processing the request,
it records a timestamp and a unique ID for that particular
request. When a search thread finishes processing a request,
it records a timestamp for this event as well.
The Hurry-up Mapper reads the statistics of the Web Search
application from a fast communication channel (IPC, Inter-process
Communication): thread and request identifiers and the
associated start/end timestamps of request processing.
Leveraging these runtime observations, collected and updated
periodically, the Hurry-up Mapper makes thread mapping
decisions to improve tail latency, as described next.
C. Hurry-up Mapper
A high-level description of the Hurry-up Mapper is given
in Algorithm 1. Given the incoming user requests, Hurry-up is
responsible for mapping the search thread serving the request
to either a big or little core. Hurry-up works by mapping a
light search thread that can potentially finish its execution on
a little core without much impact on the tail latency, and a
heavy search thread that is more compute-intensive on a big
core to improve tail latency.
The empirically tuned parameters for the algorithm
are: SAMPLING_TIME that controls how frequently
[Figure 4 diagram. Client requests (X, Y, Z, W) enter Web Search and are served by search threads #1 to #K; each thread reports its request unique ID and start/end timestamp to the Hurry-up Mapper, which performs fast migration between the little cluster (four little cores) and the big cluster (two big cores).]
Fig. 4: Hurry-up Thread Mapping for Web Search
we sample runtime statistics from the application and
MIGRATION_THRESHOLD that specifies a time threshold
used to identify a thread as compute-intensive and migrate
the thread to a big core.
Each search thread has a unique ID in the pool and is
responsible for processing a request (also having unique ID);
a thread needs to finish an active request before another
request can be processed [5]. The initial mapping of the search
thread pool is carried out in a round-robin fashion so that the
workload is balanced among all the available cores uniformly.
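The round-robin initial placement can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `round_robin_mapping` is hypothetical, and the core numbering (0 to 3 little, 4 and 5 big) is an assumed layout, since which core IDs belong to which cluster is platform-specific.

```python
def round_robin_mapping(thread_ids, core_ids):
    """Assign the i-th thread to core i mod len(core_ids)."""
    if not core_ids:
        raise ValueError("at least one core is required")
    return {tid: core_ids[i % len(core_ids)]
            for i, tid in enumerate(thread_ids)}


# Assumed core numbering, for illustration only.
LITTLE_CORES = [0, 1, 2, 3]
BIG_CORES = [4, 5]

# Six search threads spread over four little and two big cores.
mapping = round_robin_mapping([75, 76, 77, 78, 79, 80],
                              LITTLE_CORES + BIG_CORES)
```

On Linux, such a mapping could then be applied with `os.sched_setaffinity(tid, {core})`, which accepts a thread ID; whether Hurry-up uses this particular call is not stated in the text.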
In Algorithm 1, Lines 1-2 initialise RequestTable for storing
the runtime data and StartSamplingTime to determine the start
time of a sampling window. Line 4 is responsible for collecting
statistics from the application in the form of timestamps of
each thread-request unique pair. Next, Lines 5-8 check if it is
a new request or a request that has finished its processing. All
finished requests are removed from RequestTable. Lines 9-10
control the loop of runtime data reading.
The ReadStatsFromApp function (line 4) reads the applica-
tion data from a pipe channel (interprocess communication). It
blocks waiting in case there is no available data. An example
of a stream of data obtained after a call of ReadStatsFromApp
is shown below:
75;ixI.;1498060927539
77;1J.D;1498060927953
78;579[;1498060927954
79;Xrt@;1498060928003
80;qc8o;1498060928014
77;1J.D;1498060928023
Analysing the snapshot above at a given sampling interval, the
first line indicates that Thread ID 75 with request ID ixI. started
at timestamp 1498060927539, but it is still in progress because
looking further in the available data there is no event indicating the
end of its execution. Next, Thread ID 77 (request ID 1J.D) started
at timestamp 1498060927953; as shown further in the last line, it
finished at timestamp 1498060928023 and took 70 ms of execution
time (subtraction of 1498060928023 − 1498060927953). With the
given data, we note that all other Thread IDs such as 78, 79 and 80
are still processing their requests.
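The interpretation of this snapshot can be reproduced with a short sketch. It assumes the `thread;request;timestamp` layout shown above and follows the rule of lines 5-8 of Algorithm 1 (first sighting of a request ID is its start, second sighting is its end); the function name `update_request_table` is hypothetical. Because request IDs contain arbitrary characters, the sketch splits off the first and last fields only.

```python
def update_request_table(table, line):
    """Apply one event to the table of in-flight requests.

    Returns the processing time in ms when the event closes a request,
    otherwise None.
    """
    tid_str, rest = line.strip().split(";", 1)
    rid, ts_str = rest.rsplit(";", 1)
    tid, ts = int(tid_str), int(ts_str)
    if rid in table:
        _, start_ts = table.pop(rid)  # second sighting: request finished
        return ts - start_ts
    table[rid] = (tid, ts)            # first sighting: request started
    return None


stream = [
    "75;ixI.;1498060927539",
    "77;1J.D;1498060927953",
    "78;579[;1498060927954",
    "79;Xrt@;1498060928003",
    "80;qc8o;1498060928014",
    "77;1J.D;1498060928023",
]

table = {}
durations = []
for event in stream:
    elapsed = update_request_table(table, event)
    if elapsed is not None:
        durations.append(elapsed)

in_progress = sorted(tid for tid, _ in table.values())
# durations == [70]; in_progress == [75, 78, 79, 80]
```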
Lines 11-16 use the request table to identify any request-thread pair
Algorithm 1 Hurry-up Mapper
 1 RequestTable = {}
 2 StartSamplingTime = GetTimeInMilliSeconds()
 3 while True do
       ▷ Read new request stats: Thread (running) ID, Request (unique) ID, Request (begin/end) Timestamp
 4     TID, RID, RTS = ReadStatsFromApp()
 5     if RID in RequestTable then
 6         delete RequestTable[RID]                  ▷ Request already finished
 7     else
 8         RequestTable[RID] = (TID, RTS)            ▷ Store new request data
 9     if (GetTimeInMilliSeconds() − StartSamplingTime) < SAMPLING_TIME then
10         continue                                  ▷ Restart loop to keep reading more data
11     ThreadsOnLittle = []
12     for (TID, RTS) in RequestTable do
13         TimeElapsed = GetTimeInMilliSeconds() − RTS
14         if (TimeElapsed > MIGRATION_THRESHOLD) then
15             if TID is running on little core type then
16                 Add pair (TID, TimeElapsed) on ThreadsOnLittle
17     Sort ThreadsOnLittle by TimeElapsed in descending order
18     for b = 0 → size(BigCoreList) − 1 do
19         if b >= size(ThreadsOnLittle) then
20             break                                 ▷ No more migration requests on little cores
21         BigCore = BigCoreList[b]
22         ThreadOnBig = GetRunningThread(BigCore)
23         ThreadID, _ = ThreadsOnLittle[b]
24         LittleCore = GetRunningCore(ThreadID)
25         Map ThreadID to BigCore
26         Map ThreadOnBig to LittleCore
27     StartSamplingTime = GetTimeInMilliSeconds()
(to append to the ThreadsOnLittle list) that has been running on a
little core for longer than the migration time threshold (in ms). In Line 17,
the threads running on little cores are sorted in descending
order of their time elapsed in the system. Finally, Lines 18-26 are
responsible for the actual remapping of the threads on little cores.
Each long-running thread on a little core (starting from the longest-running
thread) is selected to run on a big core, until there are no more big
cores or no more threads eligible for migration on little cores. Line 27 resets the
start sampling time and the algorithm loop is resumed.
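The selection and swap step (lines 11-26 of Algorithm 1) can be sketched as a pure function that only plans the swaps. This is an illustration under stated assumptions, not the authors' implementation: the name `plan_swaps`, the dictionary layouts, the core numbering and the one-thread-per-core assumption are all hypothetical, and the snapshot reuses the timestamps from the example stream with an assumed current time.

```python
def plan_swaps(request_table, now_ms, threshold_ms, core_of, big_cores):
    """Plan little-to-big migrations for one sampling window.

    request_table: {request_id: (thread_id, start_timestamp_ms)}
    core_of:       {thread_id: core_id}, assuming one thread per core
    Returns a list of (thread_id, big_core, displaced_thread_id, little_core).
    """
    big = set(big_cores)
    thread_on = {core: tid for tid, core in core_of.items()}

    # Lines 11-16: threads that exceeded the threshold on a little core.
    on_little = []
    for tid, start_ts in request_table.values():
        elapsed = now_ms - start_ts
        if elapsed > threshold_ms and core_of[tid] not in big:
            on_little.append((tid, elapsed))

    # Line 17: longest-running thread first.
    on_little.sort(key=lambda pair: pair[1], reverse=True)

    # Lines 18-26: pair each candidate with a big core and swap occupants.
    # zip stops when either list runs out, which covers the break at line 20.
    swaps = []
    for (tid, _), big_core in zip(on_little, big_cores):
        swaps.append((tid, big_core, thread_on.get(big_core), core_of[tid]))
    return swaps


# Assumed snapshot: cores 0-3 are little, cores 4-5 are big.
table = {"ixI.": (75, 1498060927539), "579[": (78, 1498060927954),
         "Xrt@": (79, 1498060928003), "qc8o": (80, 1498060928014)}
cores = {75: 0, 76: 1, 77: 2, 78: 3, 79: 4, 80: 5}
swaps = plan_swaps(table, now_ms=1498060928050, threshold_ms=50,
                   core_of=cores, big_cores=[4, 5])
# swaps == [(75, 4, 79, 0), (78, 5, 80, 3)]
```

Keeping the planning step free of side effects makes it easy to check against a recorded snapshot; applying each swap (for example with `os.sched_setaffinity` on Linux) would be a separate step.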
We empirically set the sampling interval (in ms) to ensure the
algorithm has enough runtime data to make the thread mapping
decisions. In our case, we found that 50ms worked best for periodically
reading the runtime data with low overhead, while longer
sampling times performed worse. On the other hand, we observe
that the algorithm is very sensitive to the migration threshold time,
because it is responsible for triggering the mapping of threads from
little to big cores. In the next section, we evaluate how the
Hurry-up algorithm performs on a real big/little platform and present
the sensitivity analysis.
IV. EXPERIMENTAL RESULTS
A. Methodology
[Big/Little System]: We perform the evaluation experiments on
an ARM Juno R1 developer board [2] with Linux (kernel 4.3). The
Juno board is a 64-bit ARMv8 big.LITTLE architecture with two
high-performance out-of-order Cortex-A57 (big) cores and four low-
power in-order Cortex-A53 (little) cores. The cores are integrated on
a single chip with off-chip 8GB DRAM. The two big cores form a
cluster with a shared 2MB L2 cache, and the four little cores form
another cluster with a shared 1MB L2 cache. The big and little
cores are set to the highest DVFS state of 1.15GHz and 0.6GHz,
respectively. The cache interconnect (CoreLink CCI-400) provides
full cache coherency among the heterogeneous cores, allowing a
shared memory application to run on both clusters simultaneously.
[Power Measures]: The power consumption of the Juno board
is obtained using four native energy meters [1]. The four energy
Fig. 5: Heterogeneous multi-core platform (ARM Juno R1)
meters are responsible for collecting results from the big cluster, little
cluster, rest of the system (including memory controllers, etc) and the
Mali GPU. The power consumption of the Mali GPU is negligible
because the GPU is disabled in all our experiments. The system
power consumption is reported as an aggregation of the big and little
clusters, and the rest of the system (including memory controllers,
etc). We observe that a single big core is 52% more power-efficient
than a single little core, in terms of IPS (Instructions per Second)
per watt. However, taking into account all cores in a cluster, and assuming
that all cores can be fully utilised, a little cluster is 25% more power-efficient
than a big cluster. This discrepancy arises because the rest of the
system, excluding the core clusters, consumes about the same power
as the big core at full utilisation (0.76W). If we subtract the power
of the rest of the system, a single little core is 2.3× more power-
efficient than a big core. The little cores are attractive to improve
the throughput of sequential workloads, due to their power-efficient
[Figure 6 plot. x-axis: Query Processing Times (ms), 0 to 1200; y-axis: Probability Density Function, 0.000 to 0.008; series: Hurry-up, Linux mapping; annotated points A, B, C.]
Fig. 6: Latency distribution: Hurry-up vs Linux mapping.
characteristics. Big cores are, however, still necessary for lowering the
tail latency, as a result of computationally-intensive single-threaded
requests.
[Benchmark]: We evaluate Hurry-up using a Web Search bench-
mark. Typically, search engines are designed as scale-out workloads
to deliver high request throughput under strict levels of latency
constraints [3], [10]. A single user request is fanned-out to many
leaf node servers to process the query on their shard of the search
index [15]. We experiment with a big/little architecture server running
Elasticsearch [5], an open source implementation of a search engine
used by many companies including Netflix and Facebook. Our
Elasticsearch instance holds an index of the English Wikipedia database that
fits in the server memory. We configure the size of the search thread
pool (six threads) to match the number of cores in the system (two
big cores plus four little cores). Unless otherwise stated, we specify
the tail latency as the 90th percentile response latency. The load
generator (Faban) for Web-Search is adapted from CloudSuite 3.0 [7].
The maximum load is chosen such that the platform can still improve
throughput when running on two big cores and four little cores at
maximum DVFS state without impacting too much on the long-tail
latency. In our experiments, the load generator simulates the load (i.e.,
clients) on another machine: an AMD Opteron 6140 64-bit with eight
cores at 2.6GHz and 32GB DRAM. The load generator machine is
connected via a 1Gbit network to the big/little server machine.
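Since the tail latency reported here is the 90th percentile of the response latency, a small sketch of how such a value can be computed from raw samples may help. The text does not state which percentile estimator was used; the nearest-rank definition below is an assumption, and the function name `tail_latency` and the sample values are hypothetical.

```python
import math


def tail_latency(latencies_ms, percentile=90.0):
    """Nearest-rank percentile of a list of latencies.

    Returns the smallest sample such that at least `percentile` percent
    of all samples are less than or equal to it.
    """
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = math.ceil(percentile * len(ordered) / 100.0)
    return ordered[max(rank, 1) - 1]


samples_ms = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
p90 = tail_latency(samples_ms)          # 900
p50 = tail_latency(samples_ms, 50.0)    # 500
```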
B. Hurry-up Results
We show the effectiveness of Hurry-up by comparing against a
conservative/static Linux mapping policy. The Linux baseline maps
each request to a given core type randomly, and there are no
migrations thereafter. In contrast, Hurry-up works by dynamically
mapping individual queries on a heterogeneous platform to improve
tail latency.
[Tail Latency]: We show that Hurry-up can reduce the query
processing time and maximise the total number of queries served. To
demonstrate this, Figure 6 shows the Probability Density Function
(PDF) from an experiment in which we sampled the
query processing time for each request under a simulated load of 30
QPS using Faban. For Hurry-up, we set the sampling interval and
migration threshold to 25ms and 50ms, respectively. As can be seen
at point A, Hurry-up reduces the worst-case tail latency
from 1200ms to 800ms. At the other extreme, at point
B, Hurry-up shows a higher density than Linux mapping
because Hurry-up aggressively migrates potentially, but not certainly,
long-running requests from little to big cores. At point C, using
Hurry-up, the requests are migrated to a big core much
earlier; in Linux mapping, the requests
continue to execute on little cores, increasing their processing time.
[Tail latency vs. Energy]: Figure 7 shows the trade-off between
tail latency and system energy consumption for policies Hurry-up
[Figure 7 plot. x-axis: Tail Latency (ms); y-axis: System Energy (J); series: Hurry-up, Linux mapping.]
Fig. 7: Trade-off between tail latency and system energy (in
Joules) for Hurry-up and Linux mapping. The size of the data
point represents the load in QPS (5, 10, 20, 30 and 40).
and Linux mapping. The size of the scatter point represents the load
(i.e., the smallest point represents least load, whereas the largest
point represents highest load). For each policy, we conducted an
experiment with load fixed at 5, 10, 20, 30 and 40 QPS. We make
two observations.
1 Hurry-up has a lower tail latency than Linux mapping
while having a slightly higher energy consumption (4.6% mean). This
is because Hurry-up migrates heavy requests from little cores to big
cores after a migration threshold and allocates little cores to lighter
requests. This helps Hurry-up improve the tail latency but increases
energy consumption, as it utilises the big cores for a longer duration.
By contrast, Linux mapping may execute light requests on the big
cores while keeping the big cores idle for extended periods, thereby
consuming less energy but incurring a higher tail latency.
2 Observe that at low load (5 QPS), Hurry-up has a higher tail latency
than at higher loads (10, 20 and 30 QPS) because, at low loads,
a larger percentage of the requests is executed on little cores
than on big cores, whereas the contrary is true at high loads. For
instance, with 5 QPS approximately 33% of the requests
are executed on big cores while 67% are executed on
little cores. On the other hand, with 20 QPS, approximately 58% of
the requests are executed on big cores and the remainder on little
cores. As the number of requests processed on big cores increases, so
does the energy consumption, while the tail latency falls as long as
there is no queuing.
[Impact on tail latency]: Figure 8 shows the tail latency (in ms)
at various loads (in QPS) for policies Hurry-up and Linux mapping.
For Hurry-up, we set the sampling interval and migration threshold to
25ms and 50ms, respectively. As can be seen in the plot, Hurry-up
reduces the worst-case tail latency at all loads compared to Linux
mapping because it migrates heavy requests from little to big cores
after a given migration threshold. This allows Hurry-up to reduce tail
latency by up to 86% at 20 QPS and by 39.5% on
average. On the other hand, at the highest load of 40 QPS, Hurry-up
can reduce tail latency by only 10% due to the high tail latency (and
queuing) experienced by both scheduling policies.
[Parameters sensitivity]: To make the best use of Hurry-up, we
empirically tune the parameters: sampling interval and migration
threshold, and select the parameters that deliver the best balance
between tail latency and energy. Figure 9 shows the distribution of
[Figure 8 plot. x-axis: Queries per Second (QPS): 5, 10, 15, 20, 30, 40; y-axis: Tail latency (ms), 0 to 500; series: Hurry-up, Linux mapping.]
Fig. 8: The tail latency (in ms) at various loads (in QPS) for
policies Hurry-up and Linux mapping.
the tail latency and energy as a function of migration threshold and
the load (in QPS), when the sampling interval is set to 50ms for
policy Hurry-up. We omit the other sampling intervals to
avoid visual clutter. The primary y-axis shows the tail latency (in
ms) and the secondary y-axis shows the system energy consumption (in
joules). Similar to Figure 7, notice that at the lowest load (5 QPS) there
is a high tail latency because fewer requests are completed on
big cores than on little cores; whereas at the highest load (40
QPS), despite a large number of requests being processed
on big cores, the tail latency increases due to queuing. In addition,
observe that with load set to 10 QPS, 15 QPS, 20 QPS and 30 QPS, a
higher migration threshold entails a higher latency and lower energy
consumption. This is because complex requests running on little
cores are migrated to big cores only after a longer migration threshold,
thereby executing longer on little cores and consuming less
power. Conversely, a lower migration threshold entails a lower
latency and higher energy consumption, as both simple and complex
requests are migrated to big cores sooner, thereby consuming
more power.
V. RELATED WORK
Twig [18] introduced a deep reinforcement learning based solution
to manage cores and DVFS control for multiple latency-critical
workloads to improve energy efficiency. Hipster [17] introduced a
hybrid scheme that combines heuristics and reinforcement learning
to manage heterogeneous cores with DVFS control for improved
energy efficiency and resource utilisation. Octopus-Man [19] was
designed for big.LITTLE architectures to map workloads on big
and little cores using a feedback controller in response to changes
in measured latency. Adrenaline [9] uses application level hints to
identify heavy threads that can affect the tail latency, and provides a
scheme for boosting those queries exploiting quick frequency/voltage
scaling. Ren et al. [20] investigate workloads that maximise through-
put on heterogeneous processors, and demonstrate that heteroge-
neous processors deliver up to 50% higher throughput in contrast
to homogeneous cores. GreenGear [24] proposed a heterogeneous
platform-aware power provisioning system for data centres. The
management framework distributes power from both renewable and
non-renewable sources between little and big cores to achieve a
higher energy efficiency while meeting SLO targets. KnightShift [22]
introduces a server architecture that couples commercially available
compute nodes to adapt to changes in system load and improve
energy proportionality (i.e., system power consumption is proportional
to utilisation).
Haque et al. [8] introduce a few-to-many parallelism technique
that dynamically increases request-level parallelism at runtime. Their
system completes simple/less complex requests sequentially to save
resources, and parallelizes larger requests to reduce tail latency.
Kim et al. [12] use machine learning to estimate the latency of each
request, executing requests predicted to be short sequentially to save
resources and parallelising larger requests to reduce latency.
Li et al. [13] proposed an approach to improve service level objectives
(SLO) at request-level. They serialised the complex requests on the
system to reduce the impact of queuing on less complex requests.
Heracles [15] uses a feedback controller that exploits collocation
of latency-critical and batch workloads while increasing the resource
efficiency of CPU, memory and network as long as QoS target is
met. Pegasus [14] achieves high CPU energy proportionality for
low latency workloads using fine-grained DVFS techniques. Time
Trader [21] and Rubik [11] exploit request queuing latency variation
and apply any available slack from queuing delay to throughput-
oriented workloads to improve energy efficiency. Quasar [4] uses
runtime classification to predict interference and collocate workloads
to minimise interference.
Mars et al. [16], [23] detect at runtime the memory pressure and
find the best collocation to avoid negative interference with latency-
critical workloads. They also have a mechanism to detect negative
interference allocations via execution modulation.
VI. CONCLUSION
This paper presented Hurry-up, a thread mapping approach that
optimises for tail latency and improves throughput by accelerating
long-running compute-intensive requests on big cores, while letting light
and less intensive threads run on little cores. Hurry-up recognises
that search queries can have different compute requirements, and that
such knowledge can be inferred at runtime based on application-level
statistics. We show that Hurry-up outperforms a conservative
policy under Linux, reducing tail latency by 39.5% (mean)
while requiring negligible additional energy.
ACKNOWLEDGEMENT
This work was funded by the European Union under grant agreement
No 754337 (EuroEXA) and by the Brazilian federal government under
CNPq grant (Process no. 430188/2018-8).
The experiments were conducted on the Juno board at the
Barcelona Supercomputing Center, Spain.
REFERENCES
[1] ARM. ARM juno power registers, December 2016.
[2] ARM. ARM juno r1, December 2016.
[3] L. A. Barroso, J. Dean, and U. Hölzle. Web search for a planet:
The Google cluster architecture. volume 23, pages 22–28, March
2003.
[4] C. Delimitrou and C. Kozyrakis. Quasar: Resource-efficient
and qos-aware cluster management. In Proceedings of the
19th International Conference on Architectural Support for
Programming Languages and Operating Systems, ASPLOS ’14,
pages 127–144, New York, NY, USA, 2014. ACM.
[5] Elasticsearch. Elasticsearch, June 2017.
[6] E. Schurman and J. Brutlag. The User and Business Impact of Server
Delays, Additional Bytes, and HTTP Chunking in Web Search.
Velocity, 2009.
[7] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee,
D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and
B. Falsafi. Clearing the clouds: A study of emerging scale-out
workloads on modern hardware. SIGPLAN Not., 47(4):37–48,
Mar. 2012.
[Figure 9 plot. Six panels for loads of 5, 10, 15, 20, 30 and 40 QPS; x-axis: Migration Threshold (0ms, 25ms, 50ms, 75ms); primary y-axis: Tail Latency (ms); secondary y-axis: Energy (J).]
Fig. 9: Distribution of tail latency and energy (in J) as a function of migration threshold and the load (in QPS), when sampling
interval is set to 50ms. The primary y-axis shows the tail latency (in ms) and the secondary y-axis shows the energy (in J).
[8] M. E. Haque, Y. h. Eom, Y. He, S. Elnikety, R. Bianchini,
and K. S. McKinley. Few-to-many: Incremental parallelism for
reducing tail latency in interactive services. SIGPLAN Not.,
50(4):161–175, Mar. 2015.
[9] C.-H. Hsu, Y. Zhang, M. A. Laurenzano, D. Meisner,
T. Wenisch, R. G. Dreslinski, J. Mars, and L. Tang. Reining
in long tails in warehouse-scale computers with quick voltage
boosting using Adrenaline. ACM Transactions on Computer
Systems (TOCS), Apr. 2017.
[10] V. Janapa Reddi, B. C. Lee, T. Chilimbi, and K. Vaid. Web
search using mobile cores: Quantifying and mitigating the price
of efficiency. ACM SIGARCH Computer Architecture News,
volume 38, pages 314–325, New York, NY, USA, June 2010. ACM.
[11] H. Kasture, D. B. Bartolini, N. Beckmann, and D. Sanchez.
Rubik: Fast analytical power management for latency-critical
systems. In Proceedings of the 48th International Symposium
on Microarchitecture, MICRO-48, pages 598–610, New York,
NY, USA, 2015. ACM.
[12] S. Kim, Y. He, S.-w. Hwang, S. Elnikety, and S. Choi. Delayed-
dynamic-selective (DDS) prediction for reducing extreme tail
latency in web search. In Proceedings of the Eighth ACM
International Conference on Web Search and Data Mining,
WSDM ’15, pages 7–16, New York, NY, USA, 2015. ACM.
[13] J. Li, K. Agrawal, S. Elnikety, Y. He, I.-T. A. Lee, C. Lu, and
K. S. McKinley. Work stealing for interactive services to meet
target latency. SIGPLAN Not., 51(8):14:1–14:13, Feb. 2016.
[14] D. Lo, L. Cheng, R. Govindaraju, L. A. Barroso, and
C. Kozyrakis. Towards energy proportionality for large-scale
latency-critical workloads. ACM SIGARCH Computer Architecture
News, 42(3):301–312, Oct. 2014.
[15] D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and
C. Kozyrakis. Heracles: Improving resource efficiency at scale.
In Proceedings of the 42nd Annual International Symposium on
Computer Architecture, ISCA ’15, pages 450–462, New York,
NY, USA, 2015. ACM.
[16] J. Mars, L. Tang, R. Hundt, K. Skadron, and M. L. Soffa.
Bubble-Up: Increasing Utilization in Modern Warehouse Scale
Computers via Sensible Co-locations. In Proceedings of the 44th
Annual IEEE/ACM International Symposium on Microarchitec-
ture, MICRO-44, pages 248–259, New York, NY, USA, 2011.
ACM.
[17] R. Nishtala, P. Carpenter, V. Petrucci, and X. Martorell. Hipster:
Hybrid task manager for latency-critical cloud workloads. In
2017 IEEE International Symposium on High Performance
Computer Architecture (HPCA), pages 409–420, Feb 2017.
[18] R. Nishtala, V. Petrucci, P. Carpenter, and M. Själander. Twig:
Multi-agent task management for colocated latency-critical
cloud services. In 2020 IEEE International Symposium on High
Performance Computer Architecture (HPCA), Feb 2020.
[19] V. Petrucci, M. A. Laurenzano, J. Doherty, Y. Zhang, D. Mossé,
J. Mars, and L. Tang. Octopus-Man: QoS-driven task management
for heterogeneous multicores in warehouse-scale computers.
In 2015 IEEE International Symposium on High Performance
Computer Architecture (HPCA), 2015.
[20] S. Ren, Y. He, S. Elnikety, and K. S. McKinley. Exploiting
processor heterogeneity in interactive services. In Proceedings
of the 10th International Conference on Autonomic Computing
(ICAC 13), pages 45–58, San Jose, CA, 2013. USENIX.
[21] B. Vamanan, H. B. Sohail, J. Hasan, and T. N. Vijaykumar.
TimeTrader: Exploiting latency tail to save datacenter energy
for online search. In Proceedings of the 48th International
Symposium on Microarchitecture, MICRO-48, pages 585–597,
New York, NY, USA, 2015. ACM.
[22] D. Wong and M. Annavaram. KnightShift: Scaling the En-
ergy Proportionality Wall through Server-Level Heterogeneity.
In 2012 45th Annual IEEE/ACM International Symposium on
Microarchitecture, pages 119–130. IEEE, Dec. 2012.
[23] H. Yang, A. Breslow, J. Mars, and L. Tang. Bubble-Flux:
Precise online QoS management for increased utilization in
warehouse scale computers. In Proceedings of the 40th Annual
International Symposium on Computer Architecture, ISCA ’13,
pages 607–618, New York, NY, USA, 2013. ACM.
[24] X. Zhou, H. Cai, Q. Cao, H. Jiang, L. Tian, and C. Xie.
GreenGear: Leveraging and Managing Server Heterogeneity for
Improving Energy Efficiency in Green Data Centers. In Proceed-
ings of the 2016 International Conference on Supercomputing,
ICS ’16, pages 12:1–12:14, New York, NY, USA, 2016. ACM.
