





































Designing new architectures is one of the ways to improve the performance of the
computers we use today. These improvements allow us to increase productivity, to run
physics and chemistry simulations that could not previously be completed in a realistic
amount of time, or even to carry a smartphone that can check whether it will rain
tomorrow.
 
However, designing new architectures implies testing them. One way to do so is to
implement the design and test it. In the case of processors, this implies a huge
monetary cost, so it is not feasible to build every single version of a microprocessor
that is designed.
 
What can we do, then? One answer is to simulate the behavior of the new architecture
using architectural simulators. The main issue with this solution is that these
simulations require a lot of computational power. To reduce this cost, several
approaches to how these simulations should be carried out have been defined. Nowadays,
the most widely used approach is execution-driven simulation, because it achieves a
very high level of detail. Even so, execution-driven simulators are expensive in terms
of computational power. Trace-driven simulators are another, less detailed kind of
simulator.
 
The TaskSim simulator uses both execution-driven and trace-driven approaches in order
to minimize the computational power required for the simulation while maximizing the
level of detail obtained from it. To this end, it offers different models with varying
levels of detail. One of these models is the memory model, which focuses on
representing the memory of the simulated architecture as faithfully as possible.
 
During this project, we will study how precise the TaskSim memory model is when
simulating a real architecture that is already in production. This will allow us to discover the
















So what can we do? One answer would be to simulate the behavior of the new
architecture. The main problem with this solution is that simulations are very costly
in computational terms. To minimize this computational cost, several methods defining
how and what should be simulated have been designed. Today, the most widely used is the
so-called execution-driven approach, since it provides very high fidelity, albeit at
the cost of being computationally very demanding. Another type of simulator, less
detailed, is the so-called trace-driven simulator.

The TaskSim simulator uses a combination of the two types of simulators mentioned above
in order to minimize the computational cost while maximizing the level of detail
obtained from the simulation. To this end, it offers different models that differ in
the amount of detail they report. One of these models is the memory model, which
focuses on simulating the memory of the architecture.

During this project, we will study how precise this memory model is when simulating the
real architecture of a machine in production. This will allow us to discover
































In the course of this project, we will study how precise this memory model is when
simulating the real architecture of a machine in production. This will allow us to





















































































TaskSim is a leading-edge computer architecture simulator, designed for architecture
exploration of future many-core processors and parallel programming models.
TaskSim can simulate parallel applications with multiple levels of abstraction,
modeling the processor pipeline, memory hierarchy or just synchronization of the
parallel application.
 
A major problem with current simulators is the increasing gap in performance
between simulation and real execution on a modern complex many-core architecture.
When the simulator is to be used for architecture exploration, an embarrassingly
parallel problem, there is little benefit from parallelizing the simulator itself.
 
TaskSim takes a different approach, by increasing the level of abstraction of both the
application and the architecture. This approach requires trace-driven simulation. An
existing simulator, Dimemas, operates at the MPI level.
 
One of the TaskSim levels of abstraction models the memory accesses of the CPU
using the Reorder-buffer Occupancy Analysis model. Previous works have evaluated
















































































































































































Some obstacles could surface during the project, including time-related problems,
budget-related problems, or the complexity of the project growing beyond the
scope to which a PFC should be restricted.
 
During the project, some tasks could produce data that differ more than expected
from the anticipated values, slowing the project's pace and ending


























































































Project management  T1  2 days  none  Research assistant, Laptop
Research  T2  5 days  T1  Research assistant, Laptop































Evaluation  T8  3 days  T7  Research assistant, Laptop






































































































































































































































































































































Although this will be a one-person project (carried out by a woman, to be precise),
several roles will be taken on during the realization of the project. Some other
required roles, such as maintenance of the systems, will be





Role  Estimated hours  Est. price/hour  Total est. cost 




Test Runner  160 h  25,00 €  4.000,00 € 
Analyst  50 h  30,00 €  1.500,00 € 
Technical Support  50 h  25,00 €  1.250,00 € 
TOTAL      12.200,00 € 
Table: Estimated costs 
Hardware 
In order to carry out all the tasks of the project, a set of hardware will be needed.
The most important tools will be a laptop and the Mare Nostrum nodes rented from the
BSC: the real hardware on which we will run the tests, to compare them against the
simulated hardware tests.
 
These two Mare Nostrum nodes will be required in order to compile and run the tests
in a semi-concurrent way. One node will be used to compile the





Product  Price  Units  Service life  Estimated residual value  Total estimated amortization
Macbook Pro 13''  1.329,00 €  1  5 years  500,00 €  165,80 €













HDMI Adaptor  6,98 €  1  5 years  1,00 €  1,20 € 
TOTAL  7.393,95 €        777,60 € 
Table: Hardware budget 
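
The two complete rows visible in the hardware table are consistent with straight-line amortization, that is, (price - residual value) / service life. The following C sketch illustrates that calculation. The formula is inferred from the table figures rather than stated in the text, and the function name `yearly_amortization` is hypothetical.

```c
#include <assert.h>

/* Straight-line amortization per year of service life.
 * Assumption: this is the formula behind the "amortization" column;
 * it is inferred from the table rows, not stated in the text. */
double yearly_amortization(double price, double residual_value,
                           double service_life_years) {
    assert(service_life_years > 0.0);
    assert(price >= residual_value);
    return (price - residual_value) / service_life_years;
}
```

With the laptop row (1.329,00 € price, 500,00 € residual value, 5 years), this gives approximately the 165,80 € listed in the table.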
Software 
The software that will be used during this project has no extra cost at the time this
document is written. The software included is Xcode, Nanos++, LaTeX, Linux
distributions, OS X (there is no extra fee, since the Macbook Pro itself is included
in the budget), Extrae, Paraver and GNU gprof.
 




This section includes the estimated indirect costs not considered in the
previous sections, such as electricity and unforeseen expenses.
 
We cannot power only the nodes needed for our project, since the supercomputer must
be fully powered before it can be used; therefore, the cost of running the whole
cluster will be applied.
 












 Unforeseen expenses  300 €  1  300 € 




It will be mandatory to have a mechanism to constantly monitor the project's
budget in order to keep it from skyrocketing.
Our technique will consist of checking and updating the budget after the
completion of each main task.





In this section, we will discuss the sustainability of this project. The
sustainability of a project can be defined as its ability to maintain its
purpose and benefits during its lifetime. In this line of thought, the market




TaskSim is a leading-edge architecture simulator developed by the BSC in order to
satisfy an actual need in the market.










 With more investment, more tools can be used to provide more accurate analysis of                           








































































































In the OmpSs programming model, each parallel region of code is defined as a task. Each task
has its own dependencies, which are declared using the in, out and inout directives. These
directives are set by the user, and their only purpose is to help the runtime construct the
dependency graph. This graph is used later, during execution, to decide whether a task can
be executed by checking whether all of its data dependencies are fulfilled. What each
directive indicates is straightforward: in tells the runtime that the task needs to read
the data indicated in the clause; out, that it needs to write to the memory region where
the data is contained; and inout, that it will both read and write that memory region
during the task's execution. Other directives are also available to indicate, for example,
the priority of a task, so that the runtime will try to execute it as soon as possible.
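
As an illustration of the in, out and inout directives described above, the following C sketch annotates four small tasks. It is an example under stated assumptions, not code from the project: the helper functions (`fill`, `axpy`, `sum`, `run_example`) are hypothetical, and the array-section notation `v[0;n]` (start;length) is the form we assume for OmpSs dependency clauses.

```c
#include <assert.h>
#include <stddef.h>

/* Writes n values into v; the call site declares this with out(). */
static void fill(double *v, size_t n, double value) {
    for (size_t i = 0; i < n; i++) v[i] = value;
}

/* Reads x and updates y in place; in() for x, inout() for y. */
static void axpy(const double *x, double *y, size_t n, double a) {
    for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}

/* Reads v and writes the sum to *result; in() for v, out() for result. */
static void sum(const double *v, size_t n, double *result) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) acc += v[i];
    *result = acc;
}

/* Hypothetical driver: each annotated call becomes a task. The runtime
 * orders the tasks using only the declared dependencies. */
double run_example(double *x, double *y, size_t n) {
    assert(x != NULL && y != NULL && n > 0);
    double total = 0.0;

    #pragma omp task out(x[0;n])
    fill(x, n, 1.0);                 /* independent of the next task */

    #pragma omp task out(y[0;n])
    fill(y, n, 2.0);

    #pragma omp task in(x[0;n]) inout(y[0;n])
    axpy(x, y, n, 3.0);              /* waits for both fill tasks */

    #pragma omp task in(y[0;n]) out(total)
    sum(y, n, &total);               /* waits for axpy */

    #pragma omp taskwait             /* wait before reading total */
    return total;
}
```

A compiler without OmpSs support ignores the pragmas and runs the calls sequentially in program order, which yields the same result; with OmpSs, the two `fill` tasks have no dependency between them and may run concurrently.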
Figure 1 shows how the OmpSs programming model works. Basically, the annotated source
code is compiled with the Mercurium source-to-source compiler, which generates an






















































































































































































































































































































































































































































































































































































































































































































































































































































































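$$\frac{140\,\text{cyc}}{2.6\,\text{cyc/s}} \cdot 10^{-9} \cdot 10^{9}$$

$$\Rightarrow \frac{140\,\text{cyc}}{2.6\,\text{cyc}} \cdot 10^{-9} \cdot \text{s} \cdot 10^{9} \Rightarrow$$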
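
The conversion above can be sketched in C as follows. This is a minimal illustration under two assumptions that the surviving text does not confirm: that 2.6 denotes a 2.6 GHz clock (the factor 10^-9 cancelling the 10^9 cycles per second), and that the target unit is nanoseconds (the trailing factor 10^9). The function name `cycles_to_ns` is hypothetical.

```c
#include <assert.h>

/* Converts a latency expressed in CPU cycles into nanoseconds.
 * Assumption: freq_ghz is the core clock in GHz (10^9 cycles per second). */
double cycles_to_ns(double cycles, double freq_ghz) {
    assert(freq_ghz > 0.0);
    double freq_hz = freq_ghz * 1e9;   /* cycles per second */
    double seconds = cycles / freq_hz; /* latency in seconds */
    return seconds * 1e9;              /* seconds to nanoseconds */
}
```

Under these assumptions, `cycles_to_ns(140.0, 2.6)` evaluates to roughly 53.85 ns.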



























Now that we have fixed the issues with the memory latency and CPU frequency values in
the configuration files, as well as the issue with the reported metric, we wanted to see
whether these fixes removed the odd behavior we had observed throughout our experiments.
 
To do so, more experiments were run with the new configuration. The experiments were
repeated only for the vector operation and sparse matrix-vector multiplication
benchmarks, because they were the only ones that behaved differently in simulation
than in native execution.
 






























































The chart above shows a comparison, for all the benchmarks in the suite, of the
parallel-section speedup of the native execution. In the middle we see the total




As we can observe, there is no significant difference for most of the benchmarks. Only
the sparse-matrix-vector-multiplication benchmark shows a significant difference; vector,
histogram and reduction show a slight difference.
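
The speedups compared here can be derived from cycle counts. The C sketch below shows one way to do it, assuming that speedup is defined as the baseline (single-thread) cycle count divided by the parallel cycle count, and that simulation and native results are compared by relative difference. Both function names are hypothetical, and neither definition is taken from the text.

```c
#include <assert.h>

/* Speedup of a parallel run relative to a baseline run, from cycle counts.
 * Assumption: speedup = baseline cycles / parallel cycles. */
double speedup(double baseline_cycles, double parallel_cycles) {
    assert(baseline_cycles > 0.0 && parallel_cycles > 0.0);
    return baseline_cycles / parallel_cycles;
}

/* Relative difference of a simulated value with respect to the native one.
 * A negative result means the simulation reports a lower value. */
double relative_difference(double simulated, double native) {
    assert(native != 0.0);
    return (simulated - native) / native;
}
```

The same two functions apply whether the cycle counts come from the parallel section only or from the whole simulation; only the inputs change.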
 
In fact, there is little difference between using the parallel-section cycle count
and the total simulation cycle count to compute the speedups. Still, some













A good approach to improving the vector-operation benchmark, which still underperforms
relative to the native executions, is to carry out a bandwidth study, since it is a
bandwidth-dependent benchmark.
 















The study shown in the last chart gives very useful feedback. One of the proposed
parameter configurations gives a behaviour quite similar to the native execution for




Now that we have found a configuration that works well for simulating the vector-operation benchmark, we will































Lastly, what surprised us most was the 3d-stencil benchmark, a benchmark that shows







This project had one main objective: to take the TaskSim memory model for a test drive.
No previous study of this kind had been done for this powerful tool.
A simulator like TaskSim is a large and complicated application that can simulate
complex hardware from just a simple description provided by the user, adding an
abstraction layer between the hardware and the user. Verifying that it behaves as
expected is a complicated task, but an interesting and meaningful one.
 
In order to accomplish this, the main task was to run a benchmark suite containing many
different memory-dependent benchmarks, such as the Mont-Blanc benchmarks.
This task has been carefully completed. We ended up discovering a strange behaviour
related to memory latency: some benchmarks that are not memory-latency dependent ended
up being influenced by the memory latency parameter. This creates the need to study this
phenomenon in a future project, in order to explain it and correct it if necessary.
 
The second objective was to devise and propose modifications so that the simulations
reproduce the benchmarks more faithfully; in other words, to give tips that could help
tune the behaviour of the simulations to match their native counterparts. This objective
was not only accomplished: the proposals were also implemented, tested and analyzed
repeatedly. As a result, one of the benchmarks could be tuned to show nearly the same
behaviour as the native execution.
 
For future projects, as commented in this section, a thorough study of the memory latency
dependencies found during this project would provide a stronger validation and help to
improve the TaskSim simulator.
Another interesting continuation of this project would be to keep searching for the best
parameters for every benchmark in the Mont-Blanc benchmark suite.
Finally, it would be really interesting to try other memory-bound benchmarks, such as the
PARSEC benchmark suite.
 
On a personal level, I would like to add that this project led me to develop a good
methodology for approaching observed problems in order to understand the behaviour that
caused them, and for solving and documenting them accordingly.
  
 
 
 
   
 Bibliography 
[1] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg,
F. Larsson, A. Moestedt, and B. Werner, “Simics: A Full System Simulation Platform,”
IEEE Computer, vol. 35, no. 2, pp. 50–58, 2002.
[2] J. Renau, B. Fraguela, J. Tuck, W. Liu, M. Prvulovic, L. Ceze, S. Sarangi, P. Sack,
K. Strauss, and P. Montesinos, “SESC simulator,” http://sesc.sourceforge.net, 2005.
[3] T. Austin, E. Larson, and D. Ernst, “SimpleScalar: An Infrastructure for Computer
System Modeling,” IEEE Computer, vol. 35, no. 2, pp. 59–67, 2002.
[4] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt,
“The M5 Simulator: Modeling Networked Systems,” IEEE Micro, vol. 26, no. 4, pp. 52–60, 2006.
[5] E. Argollo, A. Falcón, P. Faraboschi, M. Monchiero, and D. Ortega, “COTSon:
infrastructure for full system simulation,” SIGOPS Oper. Syst. Rev., vol. 43, no. 1,
pp. 52–61, 200
[6] N. Rajovic, P. Carpenter, I. Gelado, N. Puzovic, A. Ramirez, and M. Valero,
“Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?,” in SC, 2013.
[7] N. Rajovic, A. Rico, J. Vipond, I. Gelado, N. Puzovic, and A. Ramirez, “Experiences
with mobile processors for energy efficient HPC,” in DATE, 2013, pp. 464–468.
[8] A. Rico, F. Cabarcas, C. Villavieja, M. Pavlovic, A. Vega, Y. Etsion, A. Ramirez, and
M. Valero, “On the simulation of large-scale architectures using multiple application
abstraction levels,” ACM Trans. Architec. Code Optim., vol. 8, no. 4, Article 36, January 2012.
