Application Source Codes Profiling for ASIP Memory Subsystem Design  by Li, Xiaoyang et al.
Procedia Engineering 29 (2012) 3160 – 3164
1877-7058 © 2011 Published by Elsevier Ltd.
doi:10.1016/j.proeng.2012.01.458
Available online at www.sciencedirect.com
2012 International Workshop on Information and Electronics Engineering (IWIEE) 
Application Source Codes Profiling for ASIP Memory 
Subsystem Design 
Xiaoyang Li a, Wenbiao Zhou a,*, Dake Liu a
a Institute of ASIP, School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
Abstract 
Application Specific Instruction-set Processors (ASIPs) are a realistic solution for domain-specific applications. To 
reach optimal system-level performance, memory subsystem design is considered in the pre-architecture design stage, 
which narrows the huge design space down to applications of a specific domain. A source code profiling approach aims 
to understand the characteristics of applications in order to guide ASIP design. The memory profiler proposed in this paper 
uses a dynamic profiling technique to generate memory traces, and the live intervals of memory objects are computed 
from load-store information. A memory requirement diagram is then plotted against instruction counts. The 
minimum memory requirement of the application is read from the diagram and guides the design of the memory 
subsystem. The profiler is tested on a computing kernel, and memory subsystem design suggestions are given 
according to the profiling results. 
© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of Harbin University 
of Science and Technology 
Keywords: ASIP; profiling; memory
1. Introduction 
Compared with general purpose processors (GPPs) and application specific integrated circuits (ASICs), 
application specific instruction-set processors (ASIPs) are a realistic solution that offers both flexibility and high 
performance [1]. However, the ASIP design space is huge, and even experienced designers may not 
handle it properly [2]. Most ASIP design flows rely on an accurate understanding of the application source 
codes. 
* Corresponding author. Tel.: +86-10-6891-8279; fax:+86-10-6891-8279. 
E-mail address: wenbiaozhou2010@gmail.com. 
Open access under CC BY-NC-ND license.
Fig. 1. Proposed ASIP design flow 
Our ASIP design flow is illustrated in Fig. 1. Application source codes are first profiled and 
analyzed. The profiling results are design documents containing instruction-set and memory architecture 
suggestions. Architecture engineers then design the architecture details and describe them in an architecture 
description language (ADL). The ADL description is synthesized using architecture synthesis tools, such as NoGAP [3]. 
Finally, the architecture is tested and benchmarked. The design flow may take several iterations to meet 
timing, area, power and other requirements. 
The “memory wall” is the growing disparity in speed between 
processor and memory. Memory latency was expected to become an overwhelming bottleneck 
in system performance. It has therefore become common to estimate the memory cost and produce a rough design of 
the memory hierarchy in the early stage of architecture design. A common technique for understanding 
source codes is profiling. By inserting extra instrumentation code into the sources, profilers record the execution 
frequencies of functions, basic blocks and lines. 
Some research fills the gap between source codes and the memory subsystem, such as Micro-profiler [4]. 
Micro-profiler instruments the intermediate representation of application source codes and collects execution 
counters of basic blocks. Its memory profiler is able to track the read and write counts of variables and 
presents them in a GUI. 
This work presents a profiling approach that analyses application source codes to determine application 
memory requirements. Firstly, a memory trace generation method is designed and implemented in the Low 
Level Virtual Machine (LLVM). Secondly, the memory requirement along the program execution time 
line is plotted. Architecture designers rely on this information and consider other factors, such as area 
cost and design complexity, to design the actual memory hierarchy. 
This work is based on the LLVM compilation framework and relies on its interpreter execution 
capability to record memory traces. In addition, it aims to provide application memory access 
patterns for memory hierarchy design and tries to estimate the minimum memory requirement of the 
application.  
2. Profiler design 
2.1. Memory trace generation 
Memory traces record the access operations to memory, such as loading a value from memory and storing a 
value to memory. Memory objects are data values that can be loaded from or stored to memory. In this work, 
instruction memory is not considered, so memory objects are equivalent to the variables in the program. The 
memory objects in the program code are divided into three types: 
• Global and static variables 
• Locally declared variables, which are placed on the run-time stack 
• Dynamically allocated variables, which are placed in heap and are established and destroyed explicitly 
using malloc and free in C language. 
To get the memory range of a memory object, its size must be computed. 
Here we define sizeof() as the function that returns the size of a memory object. Suppose the memory 
object is of type T; then: 
• sizeof(T) = sizeof(basic type), if T is a scalar, such as int or double.
• sizeof(T) = n × sizeof(T_element), if T is an array, n is the number of elements, and T_element is the type 
of each element. 
• sizeof(T) = sum(sizeof(T_i)), if T is a struct and T_i are the member types of T.
• sizeof(T) = max(sizeof(T_1), sizeof(T_2), …, sizeof(T_n)), if T is a union and T_i are the member types of the union. 
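The four size rules above can be sketched as a small recursive function. This is only an illustration in Python: the class names Scalar, Array, Struct and UnionType and the function size_of are hypothetical and do not come from the profiler itself. Like the rules as stated, the sketch ignores any alignment padding a real compiler may insert.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Scalar:
    size: int  # size of the basic type in bytes, e.g. 4 for a 32-bit int


@dataclass(frozen=True)
class Array:
    element: object  # type of each element
    count: int       # number of elements n


@dataclass(frozen=True)
class Struct:
    members: Tuple[object, ...]  # member types T_i


@dataclass(frozen=True)
class UnionType:
    members: Tuple[object, ...]  # member types T_i


def size_of(t) -> int:
    """Return the size in bytes of type t, following the four rules above."""
    if isinstance(t, Scalar):
        return t.size
    if isinstance(t, Array):
        return t.count * size_of(t.element)
    if isinstance(t, Struct):
        # Assumption: no padding between members.
        return sum(size_of(m) for m in t.members)
    if isinstance(t, UnionType):
        return max(size_of(m) for m in t.members)
    raise TypeError(f"unsupported type description: {t!r}")


INT = Scalar(4)      # assumed 4-byte int
DOUBLE = Scalar(8)   # assumed 8-byte double
record = Struct((INT, Array(DOUBLE, 3)))
```

With these assumed scalar sizes, size_of(record) evaluates to 4 + 3 × 8 = 28 bytes, and a union of INT and DOUBLE takes the size of its largest member.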
By this basic definition, the sizes of all types in the source codes can be computed recursively. The 
end address of a memory object is the sum of its start address and its type size. Every memory access is 
represented by a load or store instruction in the compiler's intermediate representation. Extra trace code is hooked in 
at the load and store locations. When these intermediate representation instructions are executed, the 
address, access size and instruction count are recorded to a text file. In addition, at the beginning of 
program execution, the global variable allocations are stored in the memory trace file, including variable name, 
start address, end address and instruction count. The instruction counts of global variables are always zero. 
At the beginning of every function execution, the stack variable allocations are stored in the memory trace 
file in the same format as the global variable allocations. After the bitcode finishes execution, the memory 
trace file contains all the allocation information and load-store information, which is sufficient for 
memory trace analysis. 
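To make the trace contents concrete, the following sketch parses a trace into allocation and access records. The paper does not give the file layout, so the one-record-per-line text format used here ("alloc name start end icount", "load address size icount", "store address size icount", with hexadecimal addresses) is purely an assumption, as are the names Alloc, Access and parse_trace.

```python
from typing import Iterable, List, NamedTuple, Union


class Alloc(NamedTuple):
    name: str     # variable name
    start: int    # start address
    end: int      # end address (start address + type size)
    icount: int   # instruction count at allocation (0 for globals)


class Access(NamedTuple):
    kind: str     # "load" or "store"
    address: int  # accessed address
    size: int     # access size in bytes
    icount: int   # instruction count at the access


def parse_trace(lines: Iterable[str]) -> List[Union[Alloc, Access]]:
    """Parse trace lines in the assumed format into records, in file order."""
    records: List[Union[Alloc, Access]] = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        tag = fields[0]
        if tag == "alloc":
            records.append(Alloc(fields[1], int(fields[2], 16),
                                 int(fields[3], 16), int(fields[4])))
        elif tag in ("load", "store"):
            records.append(Access(tag, int(fields[1], 16),
                                  int(fields[2]), int(fields[3])))
        else:
            raise ValueError(f"unknown trace record: {line!r}")
    return records


example = [
    "alloc g 0x1000 0x1004 0",   # global variable, instruction count 0
    "store 0x1000 4 1",
    "load 0x1000 4 7",
]
records = parse_trace(example)
```

Keeping allocations and accesses in one ordered stream preserves the execution order that the later analysis depends on.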
A dynamic symbol look-up table is generated by analysing the trace file. If the address range of a 
memory object overlaps with that of a previous one along the execution path, the previous memory object 
is marked as dead. 
The live time of a memory object is defined as the interval between the first store to the memory object and the 
last load from the memory object. Because of pointer aliasing, saving variable names in the 
trace file is neither clear nor reliable. Instead, only access addresses are saved. The addresses are then compared 
with the start and end addresses in the symbol table to determine the variable accessed. 
The variable table is constructed using only allocation information. If the address range of a new 
memory object overlaps with that of an old one, the old one is marked invalid in the symbol table at the 
instruction count of the allocation of the new memory object. 
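A minimal sketch of such a symbol table is given below. It assumes that address ranges are half-open, [start, end), which follows from the end address being the start address plus the size; the names Symbol, SymbolTable, allocate and resolve are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Symbol:
    name: str
    start: int                  # first byte of the object
    end: int                    # start + size (one past the last byte)
    born: int                   # instruction count of the allocation
    died: Optional[int] = None  # instruction count at which it was invalidated


class SymbolTable:
    def __init__(self) -> None:
        self.symbols: List[Symbol] = []

    def allocate(self, name: str, start: int, end: int, icount: int) -> Symbol:
        """Add a new object and invalidate every live object it overlaps."""
        for old in self.symbols:
            if old.died is None and old.start < end and start < old.end:
                old.died = icount
        sym = Symbol(name, start, end, icount)
        self.symbols.append(sym)
        return sym

    def resolve(self, address: int, icount: int) -> Optional[Symbol]:
        """Find the object that owns `address` at instruction count `icount`."""
        for sym in reversed(self.symbols):
            valid = sym.born <= icount and (sym.died is None or icount < sym.died)
            if valid and sym.start <= address < sym.end:
                return sym
        return None


table = SymbolTable()
table.allocate("a", 100, 108, icount=5)
table.allocate("b", 104, 112, icount=20)  # overlaps "a", so "a" dies at 20
```

Because only addresses are compared, an access made through an aliasing pointer is attributed to the same object as a direct access to the variable.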
2.2. Minimum memory requirement 
A memory object is not in use before its first store instruction or after its last load instruction. Global 
variables are always initialized, that is to say, a store instruction always follows a global variable 
allocation. Thus the access patterns of memory objects are divided into the following categories: 
• If all accesses to the memory object are stores, with no load, the live interval is marked as zero. 
• If the first access to the memory object is a load, this is marked as illegal, and the source code needs to 
be changed. 
• If accesses to the memory object include both stores and loads, and the first access is a store, the live 
interval is the interval between the first store and the last load. 
By these rules, we can compute the live intervals of all the memory objects. A live interval is 
expressed in instruction counts of the intermediate representation. Summing the sizes of all 
memory objects that are live at a particular instruction count gives the memory requirement at that instruction 
count. Iterating over all the instruction counts, the memory requirement diagram can be plotted. 
The algorithm is described in pseudocode in Fig. 2(a). 
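As Fig. 2(a) is not reproduced in this text, the following is only a sketch of how the live-interval rules and the summation could be written in Python, not the authors' algorithm. It assumes each object is given as a size plus a list of (kind, instruction count) accesses sorted by instruction count, and that an object still occupies memory at the instruction count of its last load. Instead of visiting every instruction count, it records only the points where the requirement changes, which yields the same curve. The names live_interval and memory_requirement are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

AccessList = List[Tuple[str, int]]  # (kind, instruction count), sorted by count


def live_interval(accesses: AccessList) -> Optional[Tuple[int, int]]:
    """Return (first store, last load), or None for a zero live interval."""
    if not accesses:
        return None
    if accesses[0][0] == "load":
        # Rule 2: a load before any store is illegal.
        raise ValueError("first access is a load; the source code must be changed")
    loads = [icount for kind, icount in accesses if kind == "load"]
    if not loads:
        return None  # Rule 1: stores only, live interval is zero
    return accesses[0][1], loads[-1]  # Rule 3


def memory_requirement(objects: List[Tuple[int, AccessList]]) -> List[Tuple[int, int]]:
    """Return (instruction count, live bytes) at every point where the sum changes."""
    delta: Dict[int, int] = {}
    for size, accesses in objects:
        interval = live_interval(accesses)
        if interval is None:
            continue
        first_store, last_load = interval
        delta[first_store] = delta.get(first_store, 0) + size
        # Assumption: the object is still live at the last load itself.
        delta[last_load + 1] = delta.get(last_load + 1, 0) - size
    curve, live = [], 0
    for icount in sorted(delta):
        live += delta[icount]
        curve.append((icount, live))
    return curve


objects = [
    (8, [("store", 1), ("load", 10)]),
    (4, [("store", 5), ("load", 7)]),
    (16, [("store", 3)]),  # never loaded, contributes nothing
]
curve = memory_requirement(objects)
peak = max(live for _, live in curve)
```

For these three example objects the curve is [(1, 8), (5, 12), (8, 8), (11, 0)], so the peak, and hence the minimum memory requirement, is 12 bytes.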
3. Implementation 
The memory profiler is implemented in the Low Level Virtual Machine (LLVM). LLVM has an 
interpreter execution engine. Extra instructions are hooked into the interpreter's load and store handlers to 
dump the load and store information to a text file in the order of program execution. The memory 
traces are processed by a Python script to determine the memory requirement. The memory requirement 
diagram is finally plotted in Python. 
4. Simulation and results 
To evaluate the profiling tool, a computing kernel, a 1024-point fast Fourier transform (FFT), is profiled. 
The memory requirement diagram is plotted in Fig. 2(b). The horizontal axis represents instruction 
counts, and the vertical axis represents the memory requirement in bytes at each instruction count. The peak 
of the diagram (6223 bytes) is the minimum memory requirement of the application, so the 
system should have at least 6223 bytes of memory. 
5. Conclusions and future work 
This paper presents a memory profiling tool for ASIP memory design. The profiler is based on the LLVM 
framework and executes the source codes with its interpreter. The allocation and access information of memory 
objects is recorded to a text file. The live intervals of memory objects are calculated and a 
memory usage diagram is plotted. In future work, more complex case studies will be performed to verify 
the usefulness and correctness of the profiler. This work only estimates the requirements of main memory. A 
memory partition algorithm should be proposed for memory hierarchy design, with both scratch-pad 
memory and main memory. 
Fig. 2. (a) Algorithm for memory requirement diagram; (b) Memory requirement diagram of FFT 
References 
[1] D. Liu, Embedded DSP Processor Design: Application Specific Instruction Set Processors (Systems on Silicon), 2005. 
[2] K. Keutzer, S. Malik, A. Newton, From ASIC to ASIP: the next design discontinuity, in: Computer Design: VLSI in Computers 
and Processors, 2002. Proceedings. 2002 IEEE International Conference on, 2002, pp. 84–90. doi:10.1109/ICCD.2002.1106752.
[3] P. Karlström, W. Zhou, D. Liu, Implementation of a floating point adder and subtracter in NoGAP, a comparative case 
study, in: Embedded and Ubiquitous Computing (EUC), 2010 IEEE/IFIP 8th International Conference on, 2010, pp. 68–72. 
doi:10.1109/EUC.2010.20. 
[4] K. Karuri, M. Al Faruque, S. Kraemer, R. Leupers, G. Ascheid, H. Meyr, Fine-grained application source code profiling for 
ASIP design, in: Design Automation Conference, 2005. Proceedings. 42nd, 2005, pp. 329–334. doi:10.1109/DAC.2005.193827. 
