Operating system support to an online hardware-software co-design scheduler for heterogeneous multicore architectures by Bueno, Maikon Adiles Fernandez et al.
  Universidade de São Paulo
 
2014-08-20
 
 
 
IEEE International Conference on Embedded and Real-Time Computing Systems and Applications,
20th, 2014, Chongqing.
http://www.producao.usp.br/handle/BDPI/48567
 
Downloaded from: Biblioteca Digital da Produção Intelectual - BDPI, Universidade de São Paulo
Operating System Support to an Online 
Hardware-Software Co-Design Scheduler for 
Heterogeneous Multicore Architectures 
Maikon A. F. Bueno, Jose A. M. de Holanda, Erinaldo Pereira and Eduardo Marques 
USP - University of Sao Paulo 
Institute of Mathematics and Computer Science 
Sao Carlos, Brazil 
{maikon, arnaldo, erinaldo}@usp.br and emarques@icmc.usp.br 
Abstract: This paper presents the design and implementation of a
scheduler model for heterogeneous multiprocessor architectures
based on software and hardware. As a proof of concept, the
scheduler model was applied to the Linux operating system
running on the SPARC Leon3 processor. Performance monitors
were implemented within the processors to identify the demands
of processes in real time. For each process, its demand is
projected onto the other processors in the architecture, and a
balancing step then distributes processes among processors to
maximize the total system performance. The Hungarian
maximization algorithm used by the balancing scheduler was
implemented in hardware, which provides greater parallelism and
performance in the execution of the algorithm. The scheduler
was validated through the parallel execution of several
benchmarks, resulting in decreased execution times compared
to the scheduler without heterogeneity support.
I. INTRODUCTION 
The main objective of heterogeneous multiprocessor
architectures is to extract higher performance from processes
by running them on cores suited to their demands. However,
this gain depends on an efficient scheduling mechanism that
can identify the demands of processes in real time and
designate the most appropriate processor according to its
resources.
One of the problems that developers face when constructing
heterogeneous multicore architectures is how to appropriately
assign the running tasks to the different cores. This assignment
must take into account the process behavior and its demands
during execution, as well as the characteristics of each core.
Also, the demands differ from process to process and may even
vary at run time.
Determining a heuristic that associates the demands of the
processes with the processing power of all cores is not a simple
task. It involves the projection of the process performance
among the cores, obtained by sampling or by calculations
that consider architectural characteristics. Many studies
have been developed using the offline assignment approach,
which assigns processes to processors in advance, even before
they start executing. However, such an approach is restricted
to environments where processes are well known. It is
not applicable to environments where process behaviour is
undefined before the process starts running, or where there is
no prior knowledge of which processes will run.
On the other hand, the online assignment approach is
being adopted by many process schedulers. This approach
aims to characterize the process at run time and does not consider
performance balancing for the system as a whole. It just
selects the processor that best fits the characteristics of a
process. The online method can reduce the effective time of
processor usage, but does not necessarily guarantee that the
execution time will be smaller. This is because the heuristic
adopted may assign several processes to the same processor,
increasing its load and thus the execution time of all
processes running on that processor.
In this context, we developed a scheduler for heterogeneous
multicore architectures. The scheduler follows the
online approach and determines, in real time, the performance
projections of any process running on any core of the
architecture. Through this projection, processes are migrated
to the processors whose resources suit them best, reducing
the actual execution time of each process. Part of the
scheduler is implemented in hardware and is supported by the
operating system to manage the processes running on the
heterogeneous architecture. Benchmarks run on processors with
small architectural differences among them show that the
proposed heuristic reduces execution time by up to 8%.
The rest of this paper is organized as follows. Section II
demonstrates the importance of heterogeneity support in
schedulers by discussing several related studies. Section III
introduces the heuristic proposed in this paper, as well as
the architecture proposed to implement it.
Section IV describes the environment used to implement
the solution. Section V discusses the experimental setup and
results, and Section VI presents our conclusion.
II. RELATED WORK 
One of the first works in the area performs tests using cores
that differ in size, processing capacity and power
consumption [1]. The authors employ the concept of weighted speedup,
which consists of adding the IPCs (Instructions Per Cycle)
from all running processes and dividing them by the CPI
(Cycles Per Instruction) of each process. The work proposed
by [2] has a similar approach, with the IPC being used as
the main metric for processor assignment. Here, there is
no prediction mechanism, but only a thread that runs on all
processors to get the IPC. In order to measure performance
on the different processor architectures, these works
use performance samples, which are obtained by migrating
threads between processors. Besides that, the performance and
power reduction analysis targets each thread individually,
rather than searching for global system optimization.
Another approach, simpler but relatively efficient in some
cases, performs the scheduling by considering prior knowledge
of each task's load. The HASS algorithm (Heterogeneity-Aware
Signature-Supported) assumes that each application has a
signature containing a summary of the architectural resources
that the application needs [3]. Other scheduling approaches
that use prior (offline) knowledge of process behavior can be
found in [4] [5].
Unlike previous approaches, the AMP (Asymmetric
Multiprocessor Scheduler) focuses on the balance of threads
on a heterogeneous architecture. It is based, primarily, on a
processor model with greater computational capacity and on
one or more processors with less computational capacity [6].
The work proposed by [7] presents a model called PIE
(Performance Impact Estimation) for estimating the impact on
performance. This model seeks to predict the performance of a
particular task by assigning the process to a processor that will
offer the greatest estimated performance. The PIE model
collects performance measures, such as CPI stack, MLP (Memory
Level Parallelism) and ILP (Instruction Level Parallelism), and
provides an estimation that assigns the performance impact of
a process to the MLP or ILP.
The Kinship method, recently proposed by [8], assesses 
processes dynamically, according to the types of the most 
requested features and combining the processes with the 
available system resources. The parameters used to calculate 
the metric are obtained either in real time or are previously 
supplied by the user. 
Another approach is to use hardware to assist the
scheduler during the process-to-processor assignment [9] [10].
The scheduler proposed in [10] uses components directly
implemented within the processor. They are responsible for
process monitoring and for extracting performance measures,
such as CPI and MPI (Misses Per Instruction). These measures
are stored and used in future performance predictions for other
processors, so the scheduler can assign the best processor to a
given task [10]. The metric adopted by the control-logic
component is the decomposition of the CPI, measured during
process execution, into base cycles (useful processing) and
cycles spent in memory- and cache-related operations.
Another decomposition of CPI to estimate the performance
of each process on each processor is proposed by [11]. The
CPI is decomposed into internal stalls and external stalls.
Internal stalls are caused by internal latencies of the processor, such
as cache, TLB and mispredicted branches. On the other hand,
external stalls identify the latencies due to accesses to external
resources. In the scheduler proposed by [11], processes that
have a dominant amount of external or internal stalls must
run on the processor with lower performance. Consequently,
processes that have a dominant amount of effective cycles must
run on a processor with higher performance. The metrics are
sampled by hardware monitors.
In [12], the authors propose a scheduler, named CAMP, for
a system containing two processors: one for high-performance
computing and the other with fewer resources. The scheduling
is performed taking into account thread-level parallelism
(TLP) and instruction-level parallelism (ILP). All processes
with a high level of ILP will run on the high-performance
processor, whereas the other processor will run the processes
with high levels of TLP. To this end, the scheduler uses a
projection that considers the process behavior on the current
processor, so sampling all processors is not required to
calculate the metrics.
The CAMP scheduler computes the metrics in real time,
categorizing the processes into TLP dominant or ILP dominant.
The calculation also uses other information, like IPC, cache
misses and stalls related to instruction execution. A similar
idea is presented by [13], assigning processes with a high IPC
rate to the higher performance processors. The authors also
present an optimization model for power consumption that
considers the IPC and LLC (last-level cache) misses. In the same context,
another approach proposed by [14] uses performance monitors
to assign processes to processors according to the power
consumption.
III. SYSTEM 
A. Scheduler Model 
In this section we present the heuristic used for process
scheduling. This heuristic monitors the performance of running
processes in real time, in order to estimate the performance
projection of all processes for all processors. Based on the
projections, the scheduler uses a combinatorial optimization
algorithm for process balancing. The algorithm aims to maximize
system performance by performing a sub-optimal assignment
of processes to processors.
Several approaches have been used to measure the
performance of a process on different types of processors. The results
of this measurement are useful for guiding the scheduler in
making a real-time decision about which processor should
be chosen for a given process. Metrics used for performance
assessment include IPC, CPI, number of cache misses, number
of I/O operations, effective CPU time, decomposition of the
CPI to measure the amount of stalls and useful time, and
memory accesses [12] [7] [10] [11] [9] [13] [8].
In this work, the scheduler's heuristic focuses on the
architectural features of embedded processors that give rise
to heterogeneity in multiprocessors. Features
like cache size, FPU (Floating-Point Unit) type, TLB (Translation
Lookaside Buffer) size and branch prediction are usually part
of the trade-offs in embedded processor designs with regard to
logic area, performance and power. Some studies indicate
that the use of different types of processors, each containing
different features, can result in an architecture with lower power
consumption, less logic area and greater performance when
guided by efficient scheduling [1] [2].
The metric addressed by this work is the CPI stack,
which is the decomposition of the number of cycles spent
executing instructions into stall components [15] [16]. These
components show where the CPU is spending most of its time
when executing a particular process. It is important to note that,
in this project, the CPI is decomposed for each process, i.e., it is
not applied globally, but only to those processes that need to
be monitored. As a consequence, a more accurate measure
for the performance projection is obtained.
The components of stall may vary according to the adopted 
architecture. Therefore, the components chosen to decompose 
the CPI are directly related to the architecture. The most 
common examples of components that can cause stalls in 
processes are the following: 
• cpi_comp_base: This component is not a stall. It
represents the number of cycles effectively spent on
executing instructions;
• cpi_comp_fpu_stall: Number of cycles spent on
executing floating-point instructions;
• cpi_comp_mul_stall: Latency of the multiplication
operation (it depends on how the processor was
implemented);
• cpi_comp_div_stall: A division can take several cycles
to execute and may vary according to the division
method implemented in the processor;
• cpi_comp_load_store_stall: Consists of the delay
related to direct dependencies found in the pipeline. The
dependencies can be minimized through compilation,
as well as at run time by the CPU, specifically by
running instructions out of order;
• cpi_comp_branch_stall: Delay related to the
cancelling and to the loading of instructions in the
pipeline due to the occurrence of jumps;
• cpi_comp_dcache_L(x)_stall: In a system, there may
be different levels of data caches in its processors
and each level can have different sizes. This feature
has a high influence on the execution of most
processes, causing higher or lower latencies. For
this reason, the cache-related delays are also considered
in the decomposition of the CPI;
• cpi_comp_icache_L(x)_stall: Just like the data cache,
the instruction cache may have different levels, and
each level can have different sizes. This feature can
also influence the process performance;
• cpi_comp_dtlb_L(x)_stall: Processes running on
processors with an MMU may have their performance
affected by using virtual addresses to access their data.
This deterioration occurs due to the configuration used
for the data TLB, which can cause several cycles of
latency when the required references are absent;
and
• cpi_comp_itlb_L(x)_stall: On processors that use an
MMU, references from call instructions are also
virtual addresses and, likewise, need the assistance of the
instruction TLB for physical address translation.
The components mentioned above are the basis for
calculating the performance of processes and of processor
architectures, as well as the projected performance of all processes
on all processors used for process balancing. Therefore,
there must be a method to extract the CPI decomposition
from running processes. This task can be carried out by a
component named performance monitor. As one of the goals
of the proposed scheduler is to achieve real-time monitoring,
the performance monitor must be located
within the processor. There, it can receive information
about the execution of the current process, such as control data
from the cache and TLBs. This data allows the decomposition to
produce more accurate results on the number of cycles spent in
each of the aforementioned components. The decomposition of
process stalls into components is called the performance histogram
in the remainder of this article.
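
As a rough illustration of the performance histogram, the sketch below accumulates per-component cycle counts for one process in software. The class name, the per-instruction event format and the sample numbers are assumptions made for illustration only; in the paper the monitor is a hardware block inside the processor, and cache/TLB levels are omitted here for brevity.

```python
# Component names follow the list above (cache/TLB levels omitted).
COMPONENTS = (
    "cpi_comp_base", "cpi_comp_fpu_stall", "cpi_comp_mul_stall",
    "cpi_comp_div_stall", "cpi_comp_load_store_stall",
    "cpi_comp_branch_stall", "cpi_comp_dcache_stall",
    "cpi_comp_icache_stall", "cpi_comp_dtlb_stall", "cpi_comp_itlb_stall",
)


class PerformanceHistogram:
    """Hypothetical software model of one process's CPI decomposition."""

    def __init__(self):
        self.cycles = {name: 0 for name in COMPONENTS}
        self.instructions = 0

    def record(self, cycles_by_component):
        """Account for one executed instruction and the cycles it cost."""
        for name, cycles in cycles_by_component.items():
            if name not in self.cycles:
                raise KeyError(f"unknown component: {name}")
            self.cycles[name] += cycles
        self.instructions += 1

    def cpi_stack(self):
        """Cycles per instruction, split by component."""
        if self.instructions == 0:
            return {name: 0.0 for name in COMPONENTS}
        return {n: self.cycles[n] / self.instructions for n in COMPONENTS}


# Two instructions: one plain, one that also waits 3 cycles on the data cache.
hist = PerformanceHistogram()
hist.record({"cpi_comp_base": 1})
hist.record({"cpi_comp_base": 1, "cpi_comp_dcache_stall": 3})
stack = hist.cpi_stack()
```

Summing the values of `stack` gives the overall CPI of the process, which is why the decomposition is called a stack.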
The same metric of decomposing the CPI into components
is also used to qualify the performance of one processor in
relation to the other processors in the architecture. This is done
through the weights of each CPI component. As such, each
processor has a set of weights. The weight of a component is
the percentage of its performance in relation to the performance
of the same component in the other processors.
Consider $C_{CPU(i)}$ as the set containing the stall decomposition
for CPU(i), where $i \in \{0, \ldots, n-1\}$ and $n$ is the number of
processors in the architecture:

$$C_{CPU(i)} = \{\text{cpi\_comp\_fpu\_stall},\ \text{cpi\_comp\_mul\_stall},\ \text{cpi\_comp\_div\_stall},\ \text{cpi\_comp\_branch\_stall},\ \text{cpi\_comp\_load\_store\_stall},\ \text{cpi\_comp\_dcache\_stall},\ \text{cpi\_comp\_icache\_stall},\ \text{cpi\_comp\_dtlb\_stall},\ \text{cpi\_comp\_itlb\_stall}\} \tag{1}$$

$S_{CPU(i)}$ is the total time spent in stalls on the processor
CPU(i):

$$S_{CPU(i)} = \sum_{c \in C_{CPU(i)}} cpi(c) \tag{2}$$

For each processor CPU(i), $Pstall_{CPU(i)}(c)$ is the
percentage of the stall component $c$ in the total delay $S_{CPU(i)}$
of CPU(i):

$$Pstall_{CPU(i)}(c) = \frac{cpi(c)}{S_{CPU(i)}}, \quad \text{where } c \in C_{CPU(i)} \tag{3}$$

Equation 4 is the final calculation of the weight,
$W(c)_{CPU(i)}$ being the weight of the component $c \in C_{CPU(i)}$
of CPU(i):

$$W(c)_{CPU(i)} = \frac{\min\left(Pstall_{CPU(0..n-1)}(c)\right)}{Pstall_{CPU(i)}(c)} \tag{4}$$
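
To make Equations 2 to 4 concrete, the following sketch computes the stall shares and the resulting weights for two processors. The function names, the shortened component names and the cycle counts are hypothetical, and the code assumes every component has a non-zero share on every processor (otherwise Equation 4 would divide by zero).

```python
def stall_shares(cpi):
    """Equations 2 and 3: share of each stall component in the total stall time."""
    total = sum(cpi.values())                      # S_CPU(i)
    return {c: v / total for c, v in cpi.items()}  # Pstall_CPU(i)(c)


def weights(cpi_per_cpu):
    """Equation 4: smallest share across all CPUs divided by this CPU's share."""
    shares = [stall_shares(cpi) for cpi in cpi_per_cpu]
    components = list(shares[0])
    smallest = {c: min(s[c] for s in shares) for c in components}
    return [{c: smallest[c] / s[c] for c in components} for s in shares]


# Hypothetical stall cycles measured on two CPUs.
cpi_per_cpu = [
    {"fpu": 20, "dcache": 60, "branch": 20},  # CPU(0): e.g. a small data cache
    {"fpu": 40, "dcache": 20, "branch": 40},  # CPU(1): e.g. a larger data cache
]
w = weights(cpi_per_cpu)
```

With these numbers, CPU(1) gets weight 1.0 for the data cache and CPU(0) about 0.33, since the data cache accounts for three times as large a share of the stalls on CPU(0). A weight of 1.0 always marks the processor with the smallest share for that component.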
When a process executes on processors with different
architectures, the distribution of stalls can also change. The
number of cycles spent in each component on a particular
processor depends on the capabilities offered by
that processor: the more resources a processor offers for a
component (cache, for example), the fewer cycles will
be spent on that component. This dependence implies that
the data in the performance histogram of a given process
should be normalized, before being used to project performance,
according to the influence of the architecture on which
the process is running. The normalization proposed in this
paper scales down the stall distribution generated in real
time by each processor's performance monitor according to
the performance weights of the processor that executes the
process.
Equation 5 accomplishes this by scaling each component by the
performance weight of CPU(i), where $N(c)_{CPU(i)}$
is the normalization of the component $c$ of CPU(i):

$$N(c)_{CPU(i)} = W(c)_{CPU(i)} \cdot cpi(c), \quad \text{where } c \in C_{CPU(i)} \tag{5}$$
From the normalized performance histogram, the next step 
consists of calculating the performance projections of all 
running processes for all processors in the architecture. The 
projections indicate the gains or losses that would be achieved 
if a process were executed on other processors. 
Assuming that process p runs on CPU(i), the histogram
of execution is sampled at time t from the output of the
performance monitor located in CPU(i). This histogram
comprises the vector of stall components $cpi(c)$ (where
$c \in C_{CPU(i)}$), which is normalized to create a new
histogram composed of $N(c)_{CPU(i)}$ (where $c \in C_{CPU(i)}$).
The normalized histogram is then used to obtain a performance
projection from CPU(i) to another processor in the
architecture (CPU(i) → CPU(x)). Such projection is given
by Equation 6:

$$P_{CPU(i) \rightarrow CPU(x)} = \sum_{c \in C_{CPU(i)},\ c' \in C_{CPU(x)}} N(c)_{CPU(i)} \cdot W(c')_{CPU(x)} \tag{6}$$
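
The normalization and projection steps can be sketched as follows. This is only one reading of Equation 6: it assumes that c' is the component of CPU(x) that corresponds to c on CPU(i), so the sum runs over matching components. The function names, weights and histogram values are hypothetical.

```python
def normalize(histogram, own_weights):
    """Equation 5: scale each measured component by the weight of the CPU it ran on."""
    return {c: own_weights[c] * cycles for c, cycles in histogram.items()}


def project(histogram, own_weights, target_weights):
    """Equation 6, pairing each component with its counterpart on the target CPU."""
    normalized = normalize(histogram, own_weights)
    return sum(n * target_weights[c] for c, n in normalized.items())


# Hypothetical weights of two CPUs and the histogram of process p sampled on CPU(0).
weights_cpu0 = {"fpu": 1.0, "dcache": 0.5, "branch": 1.0}
weights_cpu1 = {"fpu": 0.5, "dcache": 1.0, "branch": 0.5}
histogram_p_on_cpu0 = {"fpu": 2.0, "dcache": 4.0, "branch": 1.0}

# One row of the projection matrix: process p projected onto CPU(0) and CPU(1).
row = [
    project(histogram_p_on_cpu0, weights_cpu0, weights_cpu0),
    project(histogram_p_on_cpu0, weights_cpu0, weights_cpu1),
]
```

Repeating this for every monitored process yields one row per process, which is the matrix handed to the maximization step.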
Figure 1 illustrates how the information is used to generate
the final results, containing the set of assignments between
processors and processes that maximizes the total performance.
Initially, the flowchart shows the parallelism among the
performance monitors, which are continually decomposing the stalls
of the running processes. The set of stall decompositions is
the input of the normalization block. Its responsibility is to
remove the influence that the processor architecture on which
the processes execute introduces into their stall components.
For this normalization, the calculation considers the
performance vectors of the processors, as described later in this
chapter.
With the normalized performances, the method of
performance projection is executed, also taking into account the
performance weights of the processors. The final result of
the projection consists of a matrix of values containing the
performance projections of all processes for all processors.
From this matrix of projections, the maximization method must
return the set of assignments that maximizes all projections.
[Figure 1: flowchart. The performance monitors of CPU0 ... CPUX feed the projection step, which produces the performance projections of all monitored processes onto all processors (P0 → CPU0, ..., PX → CPUX); the performance maximization step outputs the final result, a set of assigned processes and processors.]
Fig. 1. Heuristic flow
This project proposes the real-time projection of all
processes running on any processor in the architecture. Thus, the
scheduler has performance measures of the currently running
process for all processors. From this point, the scheduler can
choose the processor offering the highest performance for the
next process execution.
As shown in Figure 1, the last step of the proposed heuristic
consists of performance balancing. The goal is to
distribute processes over all CPUs in the architecture, so that
the final assignment maximizes the whole system performance.
Results from the projection step provide a matrix of projections
of all processes for all processors. This matrix is modeled
as follows: columns represent the processors and rows
represent the performance projections of a process for each
processor. The assignment of a process to a processor must
be carried out so that only one matrix element is selected for
each process. Also, the final assignment must yield
the maximum sum of the elements chosen. The result of this
step is a set of assignments between processes and processors
that is balanced and maximized for performance.
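
The balancing step can be illustrated with an exhaustive search over a small, square projection matrix. This brute-force sketch is not the method used in the paper (which adopts the Hungarian algorithm in hardware); it only shows what "maximum sum with one element per process and per processor" means. The function name and the matrix values are hypothetical.

```python
from itertools import permutations


def best_assignment(projections):
    """Try every one-to-one mapping of processes (rows) to CPUs (columns)."""
    n = len(projections)
    best_cols, best_total = None, float("-inf")
    for cols in permutations(range(n)):
        total = sum(projections[row][cols[row]] for row in range(n))
        if total > best_total:
            best_cols, best_total = cols, total
    return list(best_cols), best_total


# Rows are processes P0..P2, columns are CPU0..CPU2 (hypothetical projections).
projections = [
    [9, 8, 3],
    [9, 2, 4],
    [5, 6, 7],
]
assignment, total = best_assignment(projections)
```

Here both P0 and P1 project best onto CPU0, so picking the best processor for each process in isolation would overload it. The balanced solution sends P0 to CPU1, P1 to CPU0 and P2 to CPU2, for a total of 24. Exhaustive search grows factorially with the matrix size, which is why a polynomial-time assignment algorithm is needed in practice.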
The choice of projections that maximize performance is
considered a classic problem in the field of linear
programming, also named the assignment problem. In this context,
this project considers two classical discrete algorithms, the
Auction algorithm and the Hungarian algorithm [17] [18].
According to [19], the Hungarian algorithm is not only superior
in performance compared to the Auction algorithm, but also
occupies less space in memory. In [20], the Auction algorithm
is implemented in FPGA, and occupies a large amount of
resources, while achieving good performance. In this work we
adopted the Hungarian algorithm, mainly because it uses only
integer operations of addition and subtraction, facilitating its
implementation in hardware.
B. Hardware 
The main objective of implementing tasks in hardware is to
process, in real time, the information acquired while
processes are running. Thus, it is possible to outline the current
behavior of a process and monitor changes in its status at
run time. In addition, maximization algorithms can occupy
much of the CPU time. A hardware implementation frees
this time for tasks instead of spending it on the scheduler.
In the proposed model, each processor contains a
component that monitors instruction execution and makes
real-time performance measurements of processes, pointing out
where a process is spending most of its execution cycles.
Those monitors communicate directly with the performance
projection block, which is responsible for measuring how much
a process would gain or lose in performance when running
on another processor. From the performance projections, the
Hungarian algorithm carries out performance maximization,
where processes are uniformly assigned to processors. Figure 2
illustrates the communication between components.
Fig. 2. Communication between monitoring components and performance
projection
The monitor was built for processors with in-order
pipelines. It is present in each processor and continually
provides the stall decomposition of the currently running process.
The operating system is responsible for notifying the
performance monitor at the beginning and at the end of a process
execution. Each process instruction is measured in terms of
cycles that may belong to one or more stall components.
From the continuous availability of the CPI decomposition,
the monitor calculates the normalization and, subsequently, the
projected performance for each processor. These calculations
use the performance weights of each processor.
The weights are loaded by software and remain stored in internal
memories, which are read continuously by the scheduler's
projection block.
As illustrated in Figure 2, all performance monitors (one
for each processor) continuously send CPI decompositions of
each process to the projection component. The result of the
projection calculation is a matrix of the form
$N_{CPUs} \times N_{Procs}$, where $N_{CPUs}$ is the number of
processors in the architecture and $N_{Procs}$ is the number of
processes.
The projection matrix is the input to the maximization 
component. The idea of the Hungarian algorithm is to receive 
a matrix and perform arithmetic operations on all elements, 
between rows and between columns, so the number of zeros 
in the matrix increases until it is possible to choose a set 
of N elements (zeros), such that each element is unique in 
its row and column. This set of N elements is the result of 
the optimal assignment, which maximizes the performance 
distribution between processes and processors. As a result, 
the maximization component returns an assignment vector of
$N_{Procs}$ elements, indicating which processor should be used
to execute each process. The operating system then reads the
vector and performs the necessary migrations. 
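
The zero-producing arithmetic described above can be sketched in software. The code below covers only the first phase of the Hungarian algorithm (row and column reduction); the later phases that cover zeros and adjust the matrix when no complete selection exists are omitted. Converting the maximization problem into a minimization one by subtracting every entry from the largest entry is a standard device assumed here, not a detail taken from the paper. Function names and values are hypothetical.

```python
def to_cost(projections):
    """Turn a maximization matrix into a cost matrix using one subtraction per entry."""
    top = max(max(row) for row in projections)
    return [[top - value for value in row] for row in projections]


def reduce_rows_and_columns(cost):
    """Subtract each row's minimum, then each column's minimum (integers only)."""
    rows = [[value - min(row) for value in row] for row in cost]
    col_min = [min(col) for col in zip(*rows)]
    return [[value - col_min[j] for j, value in enumerate(row)] for row in rows]


# Rows are processes, columns are processors (hypothetical projections).
projections = [
    [9, 8, 3],
    [9, 2, 4],
    [5, 6, 7],
]
cost = to_cost(projections)
reduced = reduce_rows_and_columns(cost)
```

For this matrix the reduction alone leaves zeros at positions (0, 1), (1, 0) and (2, 2), one per row and per column, so the assignment vector would be [1, 0, 2]. In general further iterations are needed before such a set of zeros exists.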
C. Software 
1) Interface with hardware: The operating system
scheduler uses the hardware access layer as an interface to control
the performance projections provided by hardware. Through
this interface, the operating system scheduler controls the
beginning and the end of process monitoring, and also
obtains information about running processes and about process
balancing in order to migrate processes between CPUs.
2) Scope of heterogeneous scheduling: Most of the
processes belonging to the operating system do not present a high
processing load. So, only a subgroup of the running processes
has to be controlled by the scheduler for heterogeneous
processors. This scheduler will work to improve the performance
of the whole subgroup.
The assignment of a process to the subgroup controlled by
the heterogeneous scheduler can be performed by the operating
system, by analyzing the processing load of the
process, or by the user who wants to execute a process with
increased performance.
3) Representation of processes: Each process within the
operating system kernel has its information stored using an
internal description. This description includes: identification
number (PID - Process ID), parent process, context and
statistical information. For the heterogeneous scheduler
implementation, information about process performance is also
inserted. We define the following data as essential for the
process description: identification of the processor that offers
the best performance, execution statistics of the process and
the identification of the group that the process belongs to
(heterogeneous or non-heterogeneous).
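
A minimal sketch of such a process description is shown below. The field names and types are hypothetical and do not correspond to fields of any real kernel structure; they only mirror the data listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ProcessDescriptor:
    """Hypothetical model of the process description extended for the scheduler."""
    pid: int                                  # identification number (PID)
    parent_pid: Optional[int] = None          # parent process
    # Additions for the heterogeneous scheduler:
    best_cpu: Optional[int] = None            # processor offering the best performance
    exec_stats: Dict[str, int] = field(default_factory=dict)  # execution statistics
    heterogeneous: bool = False               # member of the heterogeneous group?


# A process that the user asked to run with increased performance.
proc = ProcessDescriptor(pid=1234, parent_pid=1, heterogeneous=True)
proc.best_cpu = 2  # e.g. filled in from the assignment vector
```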
4) Group of Heterogeneous Scheduling: This group
contains the processes to be monitored by the heterogeneous
scheduling heuristic. Typically, in an embedded environment,
the number of processes that perform heavy processing is
smaller than the number of native operating system processes. By
separating the processes into a specific group, the operating
system scheduler gets a faster response from the maximization
and the projection hardware.
5) Interaction with the OS Scheduler: Different operating
systems use different schedulers that implement different
policies. A scheduler policy chooses which process will run and
for how long it will run. Most schedulers are guided by two
basic principles: timeslice and process priority. Several policies
have been implemented based only on these two principles, for
example: consideration of processor clusters, timeslice
calculation by priority, timeslice calculation guided by processor
usage (CPU-bound vs. I/O-bound), fixed timeslice, preemption
by priority, grouping by process types, and timeslice
calculation according to the system's computational load and
priorities. Since each operating system may implement a different
policy, it is generally difficult to trace the interaction of the
kernel's scheduler with the performance projection produced
by hardware. However, a common element of all operating
systems is the function that chooses a new task and performs
context switching. At this point, this paper proposes some
modifications to include support for heterogeneous scheduling.
Figure 3 illustrates the main execution flow, including the
verification of which processor will offer the best performance
to a process.
Fig. 3. Flowchart of the implementation of heterogeneous support in the 
operating system scheduler 
The first part of the flow corresponds to the first steps
of the default scheduler, which simply chooses the next process
and performs the context switch. As previously defined, monitoring is
only activated for processes belonging to the group of the
heterogeneous scheduler. So, it is necessary to discover the
group to which the process belongs. If the process does not
belong to the group of the heterogeneous scheduler, it will
follow the normal scheduler flow. Otherwise, the monitoring
of the current process will be finalized and its projections will
be read. After that, the scheduler checks if the process is to
be removed from the run queue, i.e., if it was stopped or
is finishing its execution. If the process is being removed
from the run queue, the algorithm will also remove it from
the heterogeneous process group. Then, all processes that
need to be migrated will be verified. Processes that are not
running (and are not subject to other restrictions of the
operating system) can be migrated from the processor. The
next process will be chosen for execution and a context switch
will occur. From that point, a new process will be running and
the scheduler will verify whether the new process belongs to
the heterogeneous scheduling group. If it belongs to the group,
the hardware monitoring will be started for that process.
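
The flow described above can be condensed into a short sketch. Everything here is a simplified software model with hypothetical names (`Task`, `FakeMonitor`, `schedule`); it is not the kernel code, and the stand-in monitor only records the calls that the real scheduler would issue to the hardware.

```python
from dataclasses import dataclass


@dataclass
class Task:
    pid: int
    heterogeneous: bool = False      # member of the heterogeneous group?
    leaving_run_queue: bool = False  # stopped or finishing execution


class FakeMonitor:
    """Stand-in for the hardware performance monitor; records calls only."""

    def __init__(self):
        self.log = []

    def start(self, task):
        self.log.append(("start", task.pid))

    def stop_and_read(self, task):
        self.log.append(("stop", task.pid))


def schedule(prev, pick_next, group, monitor, migrate_pending):
    """One pass through the modified scheduler flow."""
    if prev.heterogeneous:
        monitor.stop_and_read(prev)   # finish monitoring, read projections
        if prev.leaving_run_queue:
            group.discard(prev.pid)   # also leave the heterogeneous group
        migrate_pending(group)        # migrate group members that may be moved
    nxt = pick_next()                 # default choice of next task + context switch
    if nxt.heterogeneous:
        monitor.start(nxt)            # start monitoring the new process
    return nxt


# Task 1 is exiting; task 2 is chosen next. Both belong to the group.
monitor, group, migrations = FakeMonitor(), {1, 2}, []
running = schedule(
    prev=Task(1, heterogeneous=True, leaving_run_queue=True),
    pick_next=lambda: Task(2, heterogeneous=True),
    group=group,
    monitor=monitor,
    migrate_pending=lambda g: migrations.append(sorted(g)),
)
```

Processes outside the group skip both branches, so the default scheduler path is unchanged for them.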
6) Task Migration: Task migration consists of assessing the
processes belonging to the heterogeneous scheduler group and
determining which of them will run on a different processor.
As illustrated in Figure 3, this verification happens before
the context switch of a process that is on the verge of
sleeping. If some processes within the group are running on a
processor other than the one determined, a sequence of steps
for migrating them to the determined CPU is started. This
sequence of steps cannot be described generically, since every
operating system may use a different structure to determine
which process is running on which processor. In this work,
the support for heterogeneous scheduling was implemented on
the Linux operating system, kernel version 2.6.32, where each
processor has its own queue of running processes. In older
versions, there was only one process queue for all processors.
This fact demonstrates the variety of implementation
possibilities for this representation. It is important to note
that migration is not always performed. In general, a migration
may not be performed at a specific time because the process to
be migrated is running, or because some synchronization
mechanism is protecting the structure to be modified for
migration. This impossibility may also vary according to the
operating system implementation.
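The two reasons for postponing a migration can be illustrated with a minimal sketch that assumes one run queue per processor, each guarded by its own lock. The `RunQueue` class and the `try_migrate` function are hypothetical names introduced here, and the non-blocking lock acquisition only stands in for whatever synchronization mechanism the operating system actually uses.

```python
import threading
from collections import deque


class RunQueue:
    """Hypothetical per-processor queue of runnable tasks."""

    def __init__(self, cpu):
        self.cpu = cpu
        self.lock = threading.Lock()
        self.tasks = deque()   # runnable tasks waiting for the processor
        self.current = None    # task currently running on this processor


def try_migrate(task, src, dst):
    """Move task from src to dst; return False if it must be postponed."""
    if src.current == task:
        return False                      # a running task cannot be moved
    if not src.lock.acquire(blocking=False):
        return False                      # source queue is being modified
    try:
        if not dst.lock.acquire(blocking=False):
            return False                  # target queue is being modified
        try:
            if task not in src.tasks:
                return False
            src.tasks.remove(task)
            dst.tasks.append(task)
            return True
        finally:
            dst.lock.release()
    finally:
        src.lock.release()
```

A caller that receives `False` simply retries at a later scheduling point, which is why a requested migration can take effect long after it was decided.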
IV. IMPLEMENTATION 
The proposed hardware was developed using an FPGA
(Field Programmable Gate Array), due to its great flexibility,
which allows prototyping complete systems, from small logic
circuits to complex architectures involving multiple processors,
buses, and many other devices. We also use the GRLIB
framework [21] to implement the architecture. This framework
was developed by Gaisler¹ and contains a VHDL implementation
of the Leon3 processor. Leon3 is a 32-bit soft-core
processor based on the SPARC V8 architecture. Its code is
available under the GPL (General Public License) and is
implemented in VHDL. It can be synthesized and simulated
using various tools available in the market. Leon3 resources
are fully configurable. Devices such as caches, FPU and
MMU can be configured individually to provide more features
or higher performance. Furthermore, the architecture also
supports multiprocessing and has several software tools that
assist in the development of applications. Nucleus,
RTEMS, uCLinux, VxWorks 5.4/6.5, eCos and Linux are
examples of operating systems that support the Leon3. The
GRLIB framework is the tool used to design the architecture
and its processors. However, in its original version, it is not
prepared to accommodate different settings for each individual
processor. For this project, it was necessary to change the
tool so that it could generate architectures containing
processors with different resources.
The Buildroot tool was used in this project to generate
the file system and the kernel image to load Linux on the
FPGA. We used kernel version 2.6.32, mainly because it
already uses the CFS (Completely Fair Scheduler) and because
its widespread use makes it easier to find references for any
problem that could occur.
¹ Aeroflex Gaisler: http://www.gaisler.com
CFS (Completely Fair Scheduler) is a scheduler that seeks
equal execution across all system processes. From the
algorithm's point of view, the ideal processor would be one
that could perform all tasks in parallel at equal speeds. The
algorithm therefore attempts to approximate this ideal
execution, giving all processes an equal portion of processor
time. For this, it uses the concept of virtual execution time,
which is simply the processor time given to each process. The
processes that have the lowest virtual execution times are the
next to make use of the processor [22].
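The selection rule can be illustrated with a toy model that always runs the task with the lowest virtual execution time. This is a simplified sketch under stated assumptions: the function name `simulate_cfs` and the fixed time slice are hypothetical, all tasks have equal weight, and a binary heap replaces the red-black tree that the kernel uses.

```python
import heapq


def simulate_cfs(names, steps, slice_ms=1):
    """Run `steps` scheduling decisions; return CPU time (ms) per task."""
    # Heap entries are (virtual runtime, arrival order, name); the task with
    # the lowest virtual runtime is always at the top of the heap.
    heap = [(0, order, name) for order, name in enumerate(names)]
    heapq.heapify(heap)
    cpu_time = {name: 0 for name in names}
    for _ in range(steps):
        vruntime, order, name = heapq.heappop(heap)
        cpu_time[name] += slice_ms
        # The task that just ran accumulates virtual runtime and goes back.
        heapq.heappush(heap, (vruntime + slice_ms, order, name))
    return cpu_time
```

With three equally weighted tasks and nine decisions, each task receives the same share of the processor, which is the behavior the ideal processor described above would produce.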
The development platform used in this project is the Xilinx 
ML507 board containing a Virtex 5 FPGA [23]. This board 
has high-speed serial interfaces, audio and video interfaces, 
general-purpose input and output pins, external clock inputs, 
Flash interface, PS/2 and USB interfaces, and other devices 
and interfaces for various types of applications. 
V. EXPERIMENTAL EVALUATION 
A. Benchmarks 
The embedded platform used in this project offers fewer 
resources compared to a current personal computer. Many 
benchmarks currently available in the market are designed to 
run in an environment with more resources. Because of the 
resource limitations in the embedded platform and also the low 
operation frequency of the processor and the other devices, the 
benchmarks were selected from different sources, so that they 
were in accordance with the embedded environment used. In 
this context, the following benchmarks were selected: 
• bzip2: Compression algorithm that is part of integer 
benchmark package SPEC-CPU2006; 
• libquantum: Library for the simulation of a quantum 
computer; 
• dhrystone: Performs several numerical integer opera­
tions; 
• stanford: It consists of floating-point and integer op­
erations; 
• fbench: Implements the Ray tracing algorithm from 
computer graphics for rendering three-dimensional 
images; 
• inverse: Calculates the inverse of a matrix (floating­
point operations); 
• sphinx: It implements a voice recognition system and 
performs several floating-point operations; 
• whetstone: This benchmark implements various 
floating-point operations. 
B. Experimental Setup 
In the experiments, we used two distinct time concepts:
effective time and real time. Effective time stands for the
amount of CPU time used by a process to execute. Real time is
the difference between the end time and the start time. These
two measures differ when parallelism is taken into account.
In order to measure the CPU time used by the executing
processes, we used the set of resources reported by the Linux
getrusage() function. Among the main resources are the
ru_utime and ru_stime fields, which are used respectively to
measure the effective time spent by the benchmarks and the
total amount of time spent executing in kernel mode.
It is important to emphasize that when a set of processes is
evaluated using getrusage() and there is more than one
processor in the system, the effective CPU time returned may
be larger than the actual execution time of all processes.
As an example, consider the execution of processes P0, P1, P2
and P3. Assuming an architecture containing two processors,
CPU0 and CPU1, the scheduler can determine that P0 and P1 run
on CPU0, and P2 and P3 run on CPU1. Assume that each of these
processes requires a CPU time T. The time returned by
getrusage() for this set of processes is then 4T. This is the
effective CPU time of these processes. However, considering
the parallelism achieved by splitting the processes between the
two processors, the real execution time can be 2T. Because of
this, it is important to take into consideration the difference
between effective and real processing time.
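The arithmetic of this example can be reproduced with a short sketch. The helper names `effective_time` and `real_time` are hypothetical, and the model assumes that the processes placed on the same processor run back to back with no idle time and no scheduling overhead.

```python
def effective_time(cpu_times):
    """Total CPU time consumed by all processes (what getrusage() adds up)."""
    return sum(cpu_times.values())


def real_time(cpu_times, placement):
    """Elapsed time when each processor runs its processes back to back."""
    per_cpu = {}
    for proc, cpu in placement.items():
        per_cpu[cpu] = per_cpu.get(cpu, 0) + cpu_times[proc]
    # The processors work in parallel, so the busiest one sets the total.
    return max(per_cpu.values())


T = 1.0
cpu_times = {"P0": T, "P1": T, "P2": T, "P3": T}
placement = {"P0": "CPU0", "P1": "CPU0", "P2": "CPU1", "P3": "CPU1"}
print(effective_time(cpu_times), real_time(cpu_times, placement))
```

Placing all four processes on one processor in the same model makes both values equal to 4T, which is why the two measures only diverge when parallelism is present.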
The effective CPU time is used to separately evaluate the
performance projections made by the scheduler, without taking
into account the load balancing among the processors.
Conversely, the real execution time is useful for measuring
the execution times of processes scheduled with heterogeneous
balancing. In this case, balancing means distributing processes
to best-fit processors, in order to maximize performance
through the generated projections.
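Balancing in this sense is an assignment problem: given a matrix of projections, choose one processor slot per process so that the summed projection is as large as possible. The sketch below solves a small instance by exhaustive search. It is not the Hungarian algorithm used in this work (which reaches the same optimum in polynomial time), and the function name, the matrix values and the one-slot-per-process layout are assumptions made for illustration.

```python
from itertools import permutations


def best_assignment(projection):
    """projection[i][j]: projected performance of process i on slot j.

    Returns (slots, total), where slots[i] is the slot given to process i.
    Exhaustive search, practical only for very small matrices.
    """
    n = len(projection)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(projection[i][perm[i]] for i in range(n))
        if best_total is None or total > best_total:
            best_total, best_perm = total, perm
    return list(best_perm), best_total


# Both processes project their best performance on slot 0, but the sum is
# maximized by giving slot 0 to process 1.
slots, total = best_assignment([[9, 8], [8, 2]])
print(slots, total)
```

The example shows why each process cannot simply be sent to its individually preferred processor: the best global distribution may give a process its second choice.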
C. Architecture configurations 
Four different architectures were used for experimental 
evaluation, each containing two processors. Processor archi­
tectures are described in Table I. 
D. Results 
1) Heuristic Results: In this section, the results obtained
from the performance projection of processes are presented.
For executing the experiments, it is necessary to consider
two important points: the way the heterogeneous scheduler
migrates processes and the Linux CFS scheduling policy.
The projection is performed continuously on processes
from the heterogeneous group. When the performance projection
of a process indicates that it should be running on another
processor, the scheduler takes action in order to obtain
higher performance. This means that the scheduler needs to
migrate the process to the process queue of that other
processor. However, a process cannot migrate itself, since
it is running. The migration must be performed by another
process, which must be running on the target processor and
must try to remove the process from its current queue and
insert it into the target processor's queue. This only happens
if the process to be migrated is not running. The migration
model proposed and implemented is the same model used by the
Linux operating system.
Instead of using the concept of timeslice to determine
the amount of time that each process should run, the CFS
scheduling algorithm seeks a fair distribution of portions
of CPU time among processes. In Linux operating systems,
specifically in embedded versions, there is a small number of
system processes. These processes usually do not have a high
CPU utilization, since most of them are designed to handle
operating system or hardware events. When these events occur,
the processor utilization time is low and should not
compromise system performance. If a process that has a high
processor utilization rate starts running while the other
system processes use very little processor time or are in the
wait queue², this new process will occupy the processor at a
much higher ratio than any other. When a different process
starts running, the CFS algorithm readjusts its CPU ratio
taking into consideration the new system load. In this case,
the process that takes much of the processor time receives a
higher processor ratio.

TABLE I. ARCHITECTURE CONFIGURATIONS
                          Architecture 1        Architecture 2
Resource                  CPU0       CPU1       CPU0
Frequency                 80 MHz     80 MHz     80 MHz
Branch Prediction         Yes        No         Yes
Multiplication (cycles)   2          5          2
dCache (sets/way)         4/1 KB     1/1 KB     4/1 KB
dCache Replacement        LRU        No         LRU
iCache (sets/way)         4/1 KB     1/1 KB     4/1 KB
iCache Replacement        LRU        No         LRU
dTLB Entries              8          8          8
iTLB Entries              8          8          32
FPU (full/lite)           LITE       FULL       LITE
Given an architecture with two processors, the heterogeneous
scheduler may indicate that the process should be migrated to
the other processor. Nevertheless, this migration is performed
only when the process is not running. Considering that the
process has the highest CPU ratio among all processes, it
has the lowest rate of context switches. As a consequence, it
spends the shortest time waiting in the run queue while another
process runs in its place on the CPU. In this context, for the
process to migrate to another processor, some other process
running on the target processor must execute the scheduler.
This execution happens due to the occurrence of some event
and must take place at the exact moment when the process to be
migrated has its context saved and is waiting on the run
queue. It can take a long time for this moment to arrive, since
it depends on events and on a series of variables that are very
difficult to measure. Another factor related to process
migration in the developed architecture is the use of a
specific kernel function to decide whether migration should
occur: can_migrate_task(). It verifies whether the process is
cache-hot on its current processor, i.e., it checks whether
the migration of this process can cause performance loss due
to the misses that will occur in the new processor's cache.
Considering the same process from the previous example, if
it takes a while to be migrated (which is very likely due to
the circumstances previously mentioned), can_migrate_task()
may indicate to the migration procedure that the process
should remain on the current processor. So, if a process with
high processor utilization is running on an architecture with
more than one processor and the scheduler indicates the need
for migration, it will only be migrated when all the mentioned
conditions are favorable.
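The conditions discussed above can be condensed into a simplified model of the decision. This sketch is not the kernel's can_migrate_task(): the `Task` fields, the `can_migrate` and `is_cache_hot` names and the threshold value are assumptions made for illustration, with a task treated as cache-hot when it last started executing only a short time ago.

```python
from dataclasses import dataclass

# Illustrative threshold (an assumption of this sketch): a task that started
# executing less than this many nanoseconds ago is treated as cache-hot.
MIGRATION_COST_NS = 500_000


@dataclass
class Task:
    running: bool
    last_exec_start_ns: int
    allowed_cpus: frozenset


def is_cache_hot(task, now_ns, cost_ns=MIGRATION_COST_NS):
    return now_ns - task.last_exec_start_ns < cost_ns


def can_migrate(task, dest_cpu, now_ns, check_cache_hot=True):
    """Return True when the task may be moved to dest_cpu right now."""
    if dest_cpu not in task.allowed_cpus:
        return False          # the task is not allowed on the target CPU
    if task.running:
        return False          # a running task cannot be migrated
    if check_cache_hot and is_cache_hot(task, now_ns):
        return False          # moving now would discard a warm cache
    return True
```

Passing `check_cache_hot=False` models a scheduler that ignores cache warmth, so a process with high processor utilization is no longer held back by the last condition.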
Accordingly, in order to achieve greater accuracy in the
measurement of performance projections and to rely less on
chance events, four copies of the same process are executed
to measure the effective time of each benchmark. This does
not guarantee that migration will occur as expected, but the
existence of four processes with high processor utilization
increases the probability of migration. It is necessary to
emphasize that if the four copies of a process are migrated to
the same processor, the number of misses in the data and
instruction caches may be higher, affecting the measured
performance projection.
² Wait queue: stores references to processes that are waiting
for some event and are not considered runnable, i.e., are in
one of the states TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.

TABLE I. ARCHITECTURE CONFIGURATIONS (continued)
                          Architecture 2   Architecture 3        Architecture 4
Resource                  CPU1             CPU0       CPU1       CPU0       CPU1
Frequency                 80 MHz           80 MHz     80 MHz     80 MHz     80 MHz
Branch Prediction         No               Yes        No         No         No
Multiplication (cycles)   5                2          5          2          5
dCache (sets/way)         4/1 KB           4/1 KB     2/1 KB     4/1 KB     4/1 KB
dCache Replacement        Random           LRU        LRR        LRU        LRU
iCache (sets/way)         1/1 KB           4/1 KB     4/1 KB     4/1 KB     4/1 KB
iCache Replacement        No               LRU        LRR        LRU        LRU
dTLB Entries              32               32         16         8          8
iTLB Entries              8                16         32         8          8
FPU (full/lite)           FULL             LITE       FULL       LITE       FULL
Experiments were conducted by running four copies of 
the same benchmark simultaneously. This scenario is repeated 
ten times using the Linux scheduler and ten times using 
the heterogeneous scheduler on the processes. This procedure 
is repeated for the four architectures presented previously. 
Apart from the simultaneous execution of copies of the same 
benchmarks, three groups of different benchmarks were used. 
Benchmarks for each group are: 
• Group 0: dhry, fbench, stanford and inverse; 
• Group 1: libquantum, whetstone, bzip2 and fbench; 
• Group 2: dhry, stanford, bzip2 and sphinx. 
Figure 4 illustrates the results for all architectures. 
[Figure 4: plot over Architecture1 to Architecture4; axis label: Architectures; legend title: Benchmarks]
Fig. 4. Results for all architectures
Before migrating a process, the scheduler verifies whether
the process is cache-hot on its current processor. If so, the
scheduler does not perform the migration. This approach may be
a disadvantage for heterogeneous multiprocessor systems.
Sometimes the scheduler may prevent a process from running on
a higher performance processor because that processor's cache
would be cold for the process. In many cases, process migration
should occur regardless of cache issues, mainly when the new
processor offers a set of desirable features. Therefore, new
experiments were made following the previous method, but
without cache-hot verification at migration time. Importantly,
the use of this procedure may cause some performance loss due
to the lack of cache verification.
Figure 5 illustrates all the results for the architectures
without cache-hot checking. These results of the heterogeneous
scheduler are compared to the results of the native scheduler.
The scheduler without cache-hot checking achieves a
significant performance increase.
[Figure 5: plot over Architecture1 to Architecture4; axis label: Architectures; legend title: Benchmarks]
Fig. 5. Results for all architectures (without cache-hot checking)
So far, the results reported are related to the heuristic,
i.e., how well the algorithm assigns a process to the most
appropriate processor. The effective time, which is the CPU
time spent by each process, was used as the metric. For the
results of balancing, which combines the projection and the
distribution of processes between processors, the real
processing time was used for each benchmark. To that end, our
experiments use four benchmark groups running in parallel.
Each group has a specific distribution, so that the groups are
balanced by the resource needs of the benchmarks. The
distribution of the groups is as follows:
• Group 0: dhrystone, fbench, stanford and inverse; 
• Group 1: libquantum, whetstone, bzip2 and fbench; 
• Group 2: 2x dhrystone, 2x fbench, 2x inverse, 2x 
libquantum and 2x whetstone; 
• Group 3: 4x fbench and 4x libquantum. 
These groups were run one at a time on Architectures 1
and 4. Architectures 2 and 3 were not used due to the amount
of resources required after adding the Hungarian algorithm in
FPGA logic: the logic required exceeds the capacity of the
chip used. Figure 6 illustrates the results of running each
group 10 times.
[Figure 6: plot over Architecture1 and Architecture4; axis label: Architectures; legend (Benchmarks): group0, group1, group2, group3]
Fig. 6. Scheduling results using balancing (Hungarian algorithm)
As depicted, all groups showed positive results in run time
performance when using the heterogeneous scheduler.
These results occur despite the few architectural differences
between processors. In some cases there was no performance
gain, indicating that the assignment of processes to processors
brought no resource gains to the processes. This may be because
the processes are already balanced between the resources, but
also because of the number of migrations that the algorithm
causes. Because the scheduler is based on the resource needs
of each process, it can initiate several migrations. This could
impair performance or even inhibit any gain, as occurred in
Group 1 and Group 2 for Architecture 4 (which does not have a
significant difference in resources between the two processors).
Despite the positive performance results achieved with
the benchmark groups described in this section, the experiment
does not test the dynamic behavior of processes. Most
benchmarks behave constantly during their execution and do not
require much variation in resources. In order to generate
results with dynamic process behavior, a single process
containing all benchmarks is run as several instances, each of
which runs its benchmarks in a different order. The experiments
were performed with three or four parallel processes, each
running a different order of the following benchmarks:
dhrystone, stanford, floating-point test (performs various
floating-point operations without requiring arrays or
structures that need intensive use of the data cache),
ICACHE-test, whetstone and inverse. Architecture 1 was used
for this experiment because it has the greatest distinction
between processor resources. Over ten runs, an average
performance gain of 5.64% was obtained when running 3
instances and 8.96% when running 4 instances. From these
results, we can infer that the algorithm takes advantage of
the stages of each process to determine on which processor the
process should run, according to the resource demands of each
stage.
One of the main advantages of the heterogeneous scheduler
is the best-fit assignment between processes and processors.
It means that a process takes advantage of the available
hardware resources according to its needs. Nevertheless, this
assignment can cause an increase in the number of process
migrations between processors. This is mainly due to variations
in the phases of the processes and can influence performance
through invalidations of cache lines and TLB entries for data
and instructions. To measure the number of additional
migrations caused by the algorithm, the same benchmark groups
from the previous section (Groups 0, 1, 2 and 3) were executed
ten times on Architecture 4. These runs occurred with and
without the heterogeneous scheduler in the operating system.
During these runs, the number of migrations in each case was
recorded. We observed the following numbers of migrations for
each configuration: without heterogeneous scheduling there
were 1309 migrations, and with heterogeneous scheduling there
were 22215 migrations. The migration rate increased
approximately 17 times with the use of the heterogeneous
scheduler. This result explains some of the negative
performance results obtained using the scheduler.
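The reported factor follows directly from the two migration counts, as the short check below shows. The variable names are chosen here for illustration only.

```python
baseline_migrations = 1309    # without heterogeneous scheduling
hetero_migrations = 22215     # with heterogeneous scheduling

ratio = hetero_migrations / baseline_migrations
extra = hetero_migrations - baseline_migrations
print(f"{ratio:.2f}x, {extra} additional migrations")
```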
VI. CONCLUSION
In order to improve the performance of processes in a
heterogeneous multiprocessor architecture, this paper presented
a heuristic that determines in real time the performance
projections of any running process for all processors of the
architecture. Through this projection, processes are migrated
to the processors that have the most suitable resources for
each process, which reduces the effective execution times of
these processes. By using the presented heuristic, it was
possible to obtain higher performance in the execution of
processes on architectures containing small architectural
differences among the processors. We show that the proposed
heuristic improved performance by more than 8% in the best
case.
In the future, we should take into account the frequencies
of the cores in the heuristic in order to achieve higher
performance with reduced power consumption.
REFERENCES 
[1] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas,
"Single-ISA heterogeneous multi-core architectures for multithreaded
workload performance," SIGARCH Comput. Archit. News, vol. 32, no. 2,
pp. 64-, Mar. 2004.
[2] M. Becchi and P. Crowley, "Dynamic thread assignment on
heterogeneous multiprocessor architectures," in Proceedings of the 3rd
conference on Computing frontiers, ser. CF '06. New York, NY, USA:
ACM, 2006, pp. 29-40.
[3] D. Shelepov, J. C. Saez Alcaide, S. Jeffery, A. Fedorova, N. Perez,
Z. F. Huang, S. Blagodurov, and V. Kumar, "HASS: a scheduler for
heterogeneous multicore systems," SIGOPS Oper. Syst. Rev., vol. 43,
no. 2, pp. 66-75, Apr. 2009.
[4] J. Singh and H. Singh, "Efficient tasks scheduling for heterogeneous
multiprocessor using genetic algorithm with node duplication," Indian
Journal of Computer Science and Engineering, vol. 2, no. 3,
pp. 402-410, Jul. 2011.
[5] T. Sondag, V. Krishnamurthy, and H. Rajan, "Predictive thread-to-core
assignment on a heterogeneous multi-core processor," in Proceedings
of the 4th workshop on Programming languages and operating systems,
ser. PLOS '07. New York, NY, USA: ACM, 2007, pp. 7:1-7:5.
[6] T. Li, D. Baumberger, D. A. Koufaty, and S. Hahn, "Efficient
operating system scheduling for performance-asymmetric multi-core
architectures," in Proceedings of the 2007 ACM/IEEE conference on
Supercomputing, ser. SC '07. New York, NY, USA: ACM, 2007,
pp. 53:1-53:11.
[7] K. Van Craeynest, A. Jaleel, L. Eeckhout, P. Narvaez, and J. Emer,
"Scheduling heterogeneous multi-cores through performance impact
estimation (PIE)," SIGARCH Comput. Archit. News, vol. 40, no. 3,
pp. 213-224, Jun. 2012.
[8] V. Gupta, R. Knauerhase, P. Brett, and K. Schwan, "Kinship: efficient
resource management for performance and functionally asymmetric
platforms," in Proceedings of the ACM International Conference on
Computing Frontiers, ser. CF '13. New York, NY, USA: ACM, 2013,
pp. 16:1-16:10.
[9] S. Srinivasan, L. Zhao, R. Illikkal, and R. Iyer, "Efficient interaction
between OS and architecture in heterogeneous platforms," SIGOPS Oper.
Syst. Rev., vol. 45, no. 1, pp. 62-72, Feb. 2011.
[10] R. Iyer, S. Srinivasan, L. Zhao, and R. Illikkal, "Application
scheduling in heterogeneous multiprocessor computing platforms,"
http://www.google.com/patents/US20120079235, 03 2012, patent
US20120079235.
[11] D. Koufaty, D. Reddy, and S. Hahn, "Bias scheduling in heterogeneous
multi-core architectures," in Proceedings of the 5th European
conference on Computer systems, ser. EuroSys '10. New York, NY, USA:
ACM, 2010, pp. 125-138.
[12] J. C. Saez, A. Fedorova, D. Koufaty, and M. Prieto, "Leveraging
core specialization via OS scheduling to improve performance on
asymmetric multicore systems," ACM Trans. Comput. Syst., vol. 30,
no. 2, pp. 6:1-6:38, Apr. 2012.
[13] V. Petrucci, O. Loques, D. Mosse, R. Melhem, N. A. Gazala, and
S. Gobriel, "Thread assignment optimization with real-time performance
and memory bandwidth guarantees for energy-efficient heterogeneous
multi-core systems," in Proceedings of the 2012 IEEE 18th Real Time and
Embedded Technology and Applications Symposium, ser. RTAS '12.
Washington, DC, USA: IEEE Computer Society, 2012, pp. 263-272.
[14] J. Cong and B. Yuan, "Energy-efficient scheduling on heterogeneous
multi-core architectures," in Proceedings of the 2012 ACM/IEEE
international symposium on Low power electronics and design, ser. ISLPED
'12. New York, NY, USA: ACM, 2012, pp. 345-350.
[15] O. Allam, S. Eyerman, and L. Eeckhout, "An efficient CPI stack counter
architecture for superscalar processors," in Proceedings of the great
lakes symposium on VLSI, ser. GLSVLSI '12. New York, NY, USA:
ACM, 2012, pp. 55-58.
[16] S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, "A top-down
approach to architecting CPI component performance counters," IEEE
Micro, vol. 27, no. 1, pp. 84-93, Jan. 2007.
[17] D. P. Bertsekas, "Auction algorithms for network flow problems: A
tutorial introduction," Computational Optimization and Applications,
vol. 1, pp. 7-66, 1992.
[18] H. W. Kuhn and B. Yaw, "The Hungarian method for the assignment
problem," Naval Res. Logist. Quart, pp. 83-97, 1955.
[19] A. Narayanan, B. B. Nagarathnam, M. Meyyappan, and S. Mongkolsri,
"Experimental comparison of Hungarian and auction algorithms to solve
the assignment problem," http://chalamy.tripod.com/Report.pdf, 2000.
[20] P. Zhu, C. Zhang, H. Li, R. C. C. Cheung, and B. Hu, "An FPGA-based
acceleration platform for auction algorithm," in ISCAS. IEEE, 2012,
pp. 1002-1005.
[21] A. Gaisler, "GRLIB IP library user's manual,"
http://www.gaisler.com/products/grlib/grlib.pdf, 2013.
[22] Kernel, "CFS scheduler,"
https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt,
2013.
[23] Xilinx, "ML505/ML506/ML507 evaluation platform user guide,"
http://www.xilinx.com/support/documentation/boards_and_kits/ug347.pdf,
2011.
