ABSTRACT IoT devices currently face many kinds of software and hardware attacks, especially buffer overflow attacks. This paper presents an architectural-enhanced security hardware design to detect buffer overflow attacks. One part of the design is instruction monitoring and verification, which traces the execution behavior of programs. The other is secure tag validation, which monitors the attributes of every memory segment. Automated extraction tools extract the monitoring model and the secure tag of each memory segment at compile time. At run time, the designed hardware observes the program's dynamic execution trace and checks whether the trace conforms to the permissible behavior; if it does not, the appropriate response mechanisms are triggered. The proposed schemes do not change the compiler or the existing instruction set and impose no restrictions on the software developer. The architectural design is implemented on an actual OR1200-FPGA platform. The experimental analysis shows that the proposed techniques can detect a wide range of buffer overflow attacks with low performance penalties and minimal overheads.
I. INTRODUCTION
With the development of electronic technology and consumer market trends in previous decades, IoT devices are playing an important role in our lives and are becoming increasingly complex, networked, and functionally extensible through software. IoT device security is therefore urgently needed, and software attacks in particular have become the greatest threat. A common theme in software security attacks is that program bugs are exploited to cause unintended behavior. Buffer overflow attacks, for example, overflow the return address or a data structure in the stack and change the original integrity information to execute malicious code, which leads both to the leak of confidential code and to the execution of modified malicious code [3]. In addition, physical attackers can control the address/data bus to steal, tamper with, or replay memory blocks when the application code and data are loaded into the inner register [9]. Using advanced electronic equipment, a sophisticated attacker can control the address or data bus to tamper with the instruction code, interrupt the communication between the processor and main memory, and steer the execution in a desired direction.
Most of the existing approaches tackle the security problems at the software level, such as StackGuard [1], PointGuard [2] and HSDefender [3], but they often incur high performance overheads because of the additional code. A separate hardware unit that detects buffer overflow attacks adds no performance loss to the original software execution and can also achieve rapid detection (a reasonable design can achieve real-time detection), which makes such research worthwhile.
IoT devices evolved from traditional embedded devices, and the differences between the two make the design in this paper more suitable for IoT devices. The first major difference is that IoT devices are connected to the Internet and are therefore more exposed to attackers. Because remote attackers are not in the same physical space as the device, their attacks are limited to software attacks, whose main means is the buffer overflow attack, the main detection target of this paper. Traditional embedded devices may also suffer from physical attacks and other, more diverse means of attack; this paper gives less consideration to those attacks for now. The second difference is that IoT devices are more service-oriented, whereas traditional embedded devices are more often used in industrial equipment. IoT devices therefore have higher requirements for interactivity and, consequently, for attack detection speed. This paper uses hardware to detect attacks, and its detection is much faster than traditional software means. In summary, the design in this paper is more suitable for IoT devices.
This paper presents a novel architectural-enhanced scheme to monitor the security of application execution, which can be summarized as follows. The automated extraction tools extract the monitoring model based on control flow and generate the secure tag of each memory segment at compile time. At run time, the designed hardware observes the program's dynamic execution trace and ensures that the instruction stream does not deviate from its intended behavior.
The rest of the paper is organized as follows. Section II discusses the related work. Section III presents the instruction monitoring and verification architecture in detail. The secure tag validation architecture is described in Section IV. Section V shows the experimental results and security analysis. Finally, conclusions are drawn in Section VI.
In this paper, we present light-weight, low-overhead, design-time security solutions for IoT devices in the form of architectural techniques and design methodologies.
II. RELATED WORK
Software-based approaches used to be the main methods for avoiding these attacks. StackGuard [1] prevents return address corruption resulting from stack overflow by modifying the compiler to generate code that places a random word, or canary, on the program stack adjacent to the return address on each function call, and checks whether the canary is unmodified on return from the call. PointGuard [2] defends against most kinds of buffer overflows by encrypting pointers when they are stored in memory and decrypting them only when they are loaded into CPU registers. Shao et al. [3] proposed the HSDefender technique, which performs protection and checking together by designing secure call instructions. Program shepherding [4] employs an efficient machine code interpreter to prevent execution errors. The secure return address stack (SRAS) [5] uses a hardware stack of return addresses to counter stack smashing attacks. Abadi et al. [6] rewrite the software to implement control-flow checks. BinArmor [7] prevents buffer overflows without recompilation, but its overhead is high. BORG [8] is a testing tool that uses static and dynamic program analysis, taint propagation and symbolic execution to detect buffer overread bugs in real-world programs, but its effectiveness and applicability are limited by its target selection method. Most of these mechanisms rely on additional code to perform control-flow validation, which results in poor performance. IoT devices have limited resources, so such high overhead is unacceptable.
Recently, hardware-based approaches have become increasingly common. AEGIS [9] presents techniques for control-flow protection, privacy, and prevention of code tampering. Xiang et al. [10] present a hardware control flow monitor to enhance embedded system security. CODESSEAL [11] has the compiler insert signatures into the object file, and run-time reconfigurable FPGA logic validates the application execution at the cache block level. REM [12] proposes an architectural mechanism to detect program flow anomalies. Divya et al. [13] present a hardware control flow monitor to enhance embedded system security. Yunsi and Jerry [14] proposed a micro-architectural monitoring module to monitor code integrity. Grasser et al. [15] use the secure tag technique to perform bounds checking to avoid buffer overflows. Xu et al. [16] use architectural support to defend against buffer overflow attacks. Bu et al. [17] present a compiler/hardware-assisted design to protect code. The Masked Program Counter [21] defends against buffer overflows but requires modifying the processor architecture, which could cause existing programs to stop functioning properly.
In this paper, the proposed hardware-enhanced technique, which offers good physical isolation, high operating performance and low resource overhead, is more efficient than previous works.
III. ARCHITECTURAL ENHANCED INSTRUCTIONS MONITORING AND VERIFICATION
This section provides an architectural-enhanced instructions monitoring for intrusion detection to prevent the implementation of wrong control flow.
Architectural-enhanced instructions monitoring for intrusion detection: The paper proposes a novel, hardware-assisted, hierarchical framework for monitoring the control flow of embedded programs at a fine granularity, and a methodology for designing and configuring the hardware monitors. The framework catches all attacks on programs that operate by altering their control flow.
A. THREAT MODEL
We first present the threat model that motivated the proposed architecture.
One of the most common buffer overflow attacks that change the control flow is stack smashing. This method is based on corrupting the return address or the frame pointer (fp) in the memory stack. C/C++ has some exploitable functions, for example strcpy(), that do not perform bounds checking, and an attacker can use them to overflow the previous fp, the return address, etc. By careful design, an attacker can make the return address point to malicious code. After the program returns, the normal control flow is interrupted to execute the malicious code. A variation of this kind of attack overflows the previous fp only. Since the previous fp will point to the stack frame of main() after returning from function(), a similar attack can be activated when the program returns from main(). In addition, a heap overflow corrupts sensitive data structures stored on the program heap, such as function pointers, which can also change the control flow.
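The overflow described above can be illustrated with a toy model. The following is only a sketch under stated assumptions: the frame layout (a 4-word buffer followed by the previous fp and the return address), the addresses, and the function names are hypothetical and are not taken from the paper or from any real ABI.

```python
# Toy model of one stack frame as a list of words (hypothetical layout):
# indices 0..3 hold a 4-word local buffer, index 4 the previous fp,
# index 5 the return address.
BUF_WORDS = 4
PREV_FP, RET_ADDR = 4, 5


def make_frame(prev_fp, ret_addr):
    return [0] * BUF_WORDS + [prev_fp, ret_addr]


def unchecked_copy(frame, data):
    """Mimics strcpy(): copies every input word with no bounds check."""
    for i, word in enumerate(data):
        frame[i] = word


def bounded_copy(frame, data):
    """Copies at most BUF_WORDS words, so fp and return address survive."""
    for i, word in enumerate(data[:BUF_WORDS]):
        frame[i] = word


payload = [0x41] * 5 + [0x2110]  # the sixth word lands on the return address

frame = make_frame(prev_fp=0x7FF0, ret_addr=0x1000)
unchecked_copy(frame, payload)
smashed_ret = frame[RET_ADDR]

safe = make_frame(prev_fp=0x7FF0, ret_addr=0x1000)
bounded_copy(safe, payload)
```

In this model the unchecked copy overwrites both the previous fp and the return address, while the bounded copy leaves them intact.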
B. ARCHITECTURE OVERVIEW
The instruction model architecture we propose is shown in Fig. 2. It has two parts: program off-line behavior analysis (POLBA) and hardware real-time behavior monitoring (HRTBM).
In POLBA, the source code is cross compiled and linked to generate the executable binary code. Then, the source and the object codes are analyzed by the compiler to extract the monitor model. The executable binary code is stored in the CPU main memory while the monitor model information is stored in the on-chip ROM memory.
In HRTBM, the monitor logic compares the CPU execution information with the monitor model and checks whether the execution trace is permitted. Once the run-time execution behavior is not permitted, an interrupt signal is sent to the processor to trigger the response mechanism (e.g., terminating or recovering the program). In this system, the monitoring hardware is implemented on-chip, so it cannot be compromised by many malicious software and physical attacks. Considering the above requirements, the basic block level is chosen to monitor the program execution. We define a flow control instruction, such as a branch or jump, as the end of a basic block; the next instruction to be executed is the beginning of another basic block.
Our monitoring model depends on the control flow graph. It contains three sets: F, B and I.
F: Function calling information of the program. fj ∈ F, fj = {address}. fj is the absolute entry address of the jth function.
B: Basic block jump information of the program. bj ∈ B, bj = {index_f[j], addr_bn[j], type_b[j], TARGET_b[j]}. bj is the jump information of the jth basic block. index_f[j] is the index in F of the function containing the jth basic block; addr_bn[j] is the address of the entry of the jth basic block relative to the entry address of that function; type_b[j] is the jump type of the jth basic block, as listed in TABLE 1; TARGET_b[j] is the set of possible jump target addresses of the jth basic block. Table 1 shows the next possible path of control flow. After this step, the basic blocks of the application have been extracted, so the control flow is known.
I: The application code integrity information.
Based on the F, B and I sets, the instruction model is generated.
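One way to picture the three sets is as a small data structure. The sketch below is illustrative only: the class and method names, the jump-type strings, and all addresses are assumptions made for this example, and the jump types in particular are placeholders for the encoding in TABLE 1.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class BasicBlock:
    index_f: int              # index in F of the function containing the block
    addr_bn: int              # block entry relative to the function entry
    type_b: str               # jump type (placeholder string, not TABLE 1)
    targets: FrozenSet[int]   # TARGET_b: permitted absolute target addresses


@dataclass
class MonitoringModel:
    F: List[int]              # absolute function entry addresses
    B: List[BasicBlock]       # per-block jump information
    I: List[int]              # per-block integrity value (e.g. a checksum)

    def entry(self, j: int) -> int:
        """Absolute entry address of the jth basic block."""
        block = self.B[j]
        return self.F[block.index_f] + block.addr_bn

    def is_permitted(self, j: int, target: int) -> bool:
        """True if jumping from block j to `target` is in the static model."""
        return target in self.B[j].targets


# Hypothetical program: function 0 at 0x100 calls function 1 at 0x200.
model = MonitoringModel(
    F=[0x100, 0x200],
    B=[
        BasicBlock(0, 0x00, "call", frozenset({0x200})),
        BasicBlock(0, 0x10, "jump", frozenset({0x100})),
    ],
    I=[0x1234ABCD, 0x0BADF00D],  # made-up integrity values
)
```

Here `model.entry(1)` combines the function entry from F with the relative offset from B, which mirrors how addr_bn is defined relative to the function entry.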
D. SHADOW CALL STACK
Because the monitoring model describes a finite, static CFG, it cannot be used to calculate the return address of the calling function. To solve this problem, we employ a shadow call stack (SCS) to provide correct return addresses.
In the ID stage, the hardware SCS decodes the instruction. When it finds a push instruction that stores the return address onto the stack, the SCS pushes the current return address onto the top of its LIFO memory. When it finds a pop instruction that loads the return address, the SCS pops the current return address from the top of its LIFO memory. In the MEM stage, the return address is popped from the stack memory. The compare unit then determines whether it equals the return address popped from the SCS and triggers the response mechanism if the two differ.
As illustrated in Fig. 1, a variation of the attack corrupts the previous fp only. To prevent this kind of attack, we push and pop the fp in the same way as the return address. Thus, any attack that corrupts the return address or the fp in the stack memory can be detected efficiently.
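The push, pop and compare behavior of the shadow call stack can be sketched in software as follows. This is a behavioral model only, not the hardware design: the class and method names are hypothetical, the addresses are made up, and the pipeline stages are collapsed into two method calls.

```python
class ShadowCallStack:
    """Behavioral model of an SCS that tracks both return address and fp."""

    def __init__(self):
        self._lifo = []

    def on_call(self, ret_addr, fp):
        # Mirrors the push of the return address (and fp) onto the stack.
        self._lifo.append((ret_addr, fp))

    def on_return(self, ret_addr, fp):
        """Compare the values popped from program memory with the shadow copy.

        Returns True when they match; False means the response mechanism
        should be triggered.
        """
        if not self._lifo:
            return False  # return without a matching call
        return self._lifo.pop() == (ret_addr, fp)


scs = ShadowCallStack()

scs.on_call(ret_addr=0x1000, fp=0x7FF0)
clean_ok = scs.on_return(ret_addr=0x1000, fp=0x7FF0)

scs.on_call(ret_addr=0x1000, fp=0x7FF0)
ret_attack_ok = scs.on_return(ret_addr=0x2110, fp=0x7FF0)  # return address smashed

scs.on_call(ret_addr=0x1000, fp=0x7FF0)
fp_attack_ok = scs.on_return(ret_addr=0x1000, fp=0x4141)   # only fp smashed
```

Storing the fp alongside the return address is what lets the same comparison catch the fp-only variant of the attack.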
We use reverse engineering to obtain the binary code and instructions, as Fig. 4 shows. The left column is the relative address at which the code is stored in memory. The middle and right columns show the binary code and its corresponding instructions. Regular expressions written in Perl are applied to extract the intrusion model. Fig. 5 shows the trace pattern of the software execution according to our model rules.
E. THE MONITORING HARDWARE
In HRTBM, the monitoring hardware compares the run-time execution information from the CPU with the monitoring model derived from the off-line analysis. The hardware contains memories that hold the monitoring model and monitoring logic for validation, as Fig. 6 shows. Four memories are required, for sets F, B and I and for the shadow stack: MEM_f holds set F, MEM_b holds set B, MEM_c holds the code integrity set I, and MEM_s holds the shadow stack.
At run time, the monitoring logic buffers the Program Counter (PC) at the beginning of the current basic block. Based on this value, the monitoring logic fetches the monitoring model of the corresponding basic block (its F, B and I values). Then, the monitoring logic begins to calculate the entry of the next possible basic block. Since the monitoring model describes a finite and static CFG, it cannot be used to calculate the return address of the invoking function. One possible solution is to employ a shadow stack (SS) to keep the return address. The shadow call stack is a last-in, first-out (LIFO) memory. Its input is the current basic block information (BBI), and its output is the previous one. Once an inter-function jump is detected, the current BBI is pushed onto the stack. When a function return is detected, the top entry of the stack is popped to calculate the entry of the next basic block.
Recursive functions need to be handled separately, since the potentially unbounded call depth would require a large memory. The entry of a recursive function is fixed, so we employ the following strategy: first, push the current BBI onto the stack only once and count the call depth; then, when a recursive call returns, decrement the counter by one and use the top entry of the stack as the BBI for entry calculation, until the counter reaches zero. For example, if the call sequence is A → (B → B → . . . B)n (B is a recursive call and n is the call depth), only two BBIs (A and B) are stored and BBI_B is used n times for calculation. When the counter reaches zero, BBI_A is used.
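The recursion strategy can be read as a run-length-compressed stack: a repeated BBI is stored once together with a depth counter. The sketch below is one possible interpretation of that strategy, not the paper's hardware; the class name, method names and the string labels `"BBI_A"` and `"BBI_B"` are hypothetical.

```python
class CompressedShadowStack:
    """Shadow stack that stores a repeated BBI once, with a depth counter."""

    def __init__(self):
        self._entries = []  # each entry is [bbi, count]

    def push(self, bbi):
        # A recursive call pushes the same BBI again: only bump the counter.
        if self._entries and self._entries[-1][0] == bbi:
            self._entries[-1][1] += 1
        else:
            self._entries.append([bbi, 1])

    def pop(self):
        if not self._entries:
            raise IndexError("shadow stack underflow")
        entry = self._entries[-1]
        entry[1] -= 1
        if entry[1] == 0:
            self._entries.pop()  # counter exhausted: expose the entry below
        return entry[0]

    def stored_entries(self):
        return len(self._entries)


# Call sequence A -> (B -> B -> ... B) with recursion depth n.
n = 1000
stack = CompressedShadowStack()
stack.push("BBI_A")
for _ in range(n):
    stack.push("BBI_B")
entries_at_peak = stack.stored_entries()

popped = [stack.pop() for _ in range(n + 1)]
```

With this encoding the memory used depends on the number of distinct consecutive BBIs rather than on the recursion depth, so the example stores two entries regardless of n.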
The comparing unit detects deviations in program execution; if it has not finished its comparison in time, the processor must be stalled to let the comparing unit catch up. When a mismatch is found, the monitor logic sends control signals to trigger the response mechanism.
IV. SECURE TAG VALIDATION
This section presents a secure tag validation scheme that monitors the executable, writable and readable attributes of application instructions and data.
A. THREAT MODEL AND OVERVIEW
All application processes are loaded into three major memory areas: the stack segment, the data segment, and the code segment, as Fig. 7 shows. The stack segment contains the stack and the heap. The stack stores local variables and procedure calls. The heap, like the stack, is a region of virtual memory used by applications. However, unlike the stack, private heap space can be created and freed by programmers. The data segment stores static and dynamic variables and contains Data, BSS, and Rodata. Data stores initialized data, Block Started by Symbol (BSS) stores uninitialized data, and Rodata stores read-only data. The code segment contains Code and Vector. Code stores the program instructions and Vector stores the exception handlers. Each segment has different writable, readable and executable attributes, as shown in Fig. 7.
Many buffer overflow attacks inject malicious codes into the data segment. To avoid this kind of attack, we set out the following segment tagging mechanism.
First, we tag the data segment as non-executable. Similarly, the stack segment is private for each application, and no other application can access the stack segment. Few applications are designed to be executable in the stack. Only under a few conditions, the OS permits executing the code in the stack segment. Therefore, in this paper, we tag the stack segment as non-executable. In conclusion, only the code segment is executable. When the CPU executes instructions in the segments tagged as not executable, the attack will be detected.
On the other hand, the code segment is read-only. Therefore, an attempt to write to this area is a segment violation.
We tag the Data, BSS, stack and heap parts as writable. In addition, we tag the Data, BSS and Rodata parts as readable. If the CPU loads the application data in the segments tagged as unreadable, it will also lead to a violation.
Therefore, a 2-bit value suffices to represent the properties of a segment. We define 10 as executable, non-writable and unreadable; 00 as non-executable, non-writable and readable; and 01 as non-executable, writable and readable, as shown in Fig. 7.
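The 2-bit encoding can be written down as a small lookup table. This is a minimal sketch: the three bit patterns and their attributes follow the definition above, while the enum member names, the access labels and the `allowed` helper are hypothetical. The pattern 11 is not defined in the text, so the sketch rejects it.

```python
from enum import IntEnum


class SecureTag(IntEnum):
    READ_ONLY = 0b00    # non-executable, non-writable, readable
    READ_WRITE = 0b01   # non-executable, writable, readable
    EXEC_ONLY = 0b10    # executable, non-writable, unreadable


PERMISSIONS = {
    SecureTag.READ_ONLY: frozenset({"read"}),
    SecureTag.READ_WRITE: frozenset({"read", "write"}),
    SecureTag.EXEC_ONLY: frozenset({"execute"}),
}


def allowed(tag_bits: int, access: str) -> bool:
    """True if `access` is permitted for a segment carrying `tag_bits`.

    SecureTag(tag_bits) raises ValueError for the undefined pattern 0b11.
    """
    return access in PERMISSIONS[SecureTag(tag_bits)]
```

For example, `allowed(0b01, "execute")` is False, which corresponds to rejecting execution from the writable data, stack and heap regions.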
B. THE ARCHITECTURE DESIGN
At compile time, the compiler generates the executable binary code, and the secure tag is extracted from the binary code according to our definition. In our design, the binary code is stored in the main memory, and the secure tag is loaded into an on-chip memory that cannot be accessed by the attacker. Therefore, the secure tag cannot be tampered with by software or physical attacks.
Data is sent to the CPU at the MEM stage when a Load instruction is executed. When a D-cache miss occurs, the application data are loaded into the D-cache from the main memory. The checker receives the corresponding secure tag and checks whether the data come from a readable region.
Data can be written back to the main memory when a Store instruction is executed at the MEM stage. According to the target write address, the checker determines whether the memory region is writable.
When the checker finds that the execution does not conform to the designed attributes, it sends an interrupt signal to the CPU.
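The checker's decision for fetch, load and store events can be modeled as an address-to-tag lookup followed by a permission test. The sketch below is a software approximation under stated assumptions: the memory map, its addresses, and the names `SEGMENTS`, `tag_of` and `violates` are all hypothetical, and cache behavior is ignored.

```python
import bisect

# Hypothetical memory map: (segment start address, 2-bit secure tag),
# sorted by start address. Each segment extends to the next start.
SEGMENTS = [
    (0x0000, 0b10),  # Vector + Code: executable only
    (0x2000, 0b00),  # Rodata: readable only
    (0x2100, 0b01),  # Data + BSS: readable and writable
    (0x4000, 0b01),  # heap + stack: readable and writable
]
STARTS = [start for start, _ in SEGMENTS]

# Pipeline events permitted for each tag value.
ALLOWED = {
    0b10: {"fetch"},
    0b00: {"load"},
    0b01: {"load", "store"},
}


def tag_of(addr: int) -> int:
    """Return the secure tag of the segment containing `addr`."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i < 0:
        raise ValueError("address below the first segment")
    return SEGMENTS[i][1]


def violates(event: str, addr: int) -> bool:
    """True if the checker should raise an interrupt for this access."""
    return event not in ALLOWED[tag_of(addr)]
```

With this map, `violates("fetch", 0x2110)` is True because the address lies in the data region, while `violates("fetch", 0x0100)` is False.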
V. EXPERIMENTAL RESULTS
The processor adopted in this paper is the OR1200, a 32-bit scalar RISC with Harvard microarchitecture, a 5-stage integer pipeline, virtual memory support (MMU) and basic DSP capabilities. The system on programmable chip (SOPC) platform is built on a Xilinx FPGA. The software development tool for the OR1200 is the popular and free GNU toolchain.
A. SECURITY CHECK
First, we present an example to check the instruction monitoring and verification design. Fig. 9(a) is the source code, which the compiler divides into four blocks B0, B1, B2 and B3, as in Fig. 9(b). By analyzing the source and the binary, the control flow graph (CFG) is obtained and the monitoring model in Fig. 9(c) is generated. The expected control flow is B0 → B2 → B1, where B0 → B2 is an inter-function jump and B2 → B1 is a return: function main() calls function foo(), and foo() should then return to main(). However, foo() causes stack smashing, and the stored return address is tampered with to point to function bar(). The actual control flow path is therefore B0 → B2 → B3 (shown in Fig. 9(d)). When the control flow tries to jump from B2 to B3, the discrepancy between the processor and the monitoring model is detected and the invalid signal is asserted to stop the unexpected execution. Next, we consider the integrity tampering attack. Suppose an attacker changes ret0 = (int *)&ret0 + 2 in the third basic block to ret0 = (int *)&ret0 - 2. This kind of attack can be detected by the integrity validation. The integrity algorithms used in this paper are SHA-1, SHA-256 and CRC. SHA-1 and SHA-256 are both treated here as secure hash algorithms: given a message x and its hash H(x), it is computationally infeasible to find another message y such that y ≠ x and H(y) = H(x). So the HASH value of the normal basic block, integrity[2], differs from that of the tampered one, and this kind of attack can be detected.
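The control-flow part of this example can be replayed in a few lines. The sketch below is a simplification made for illustration: block names stand in for addresses, every block that is not a call site is treated as ending in a return, and the names `CALL_TARGETS`, `RETURN_SITE` and `first_violation` are hypothetical.

```python
# Static model for the example: B0 (in main) calls foo, whose body is B2;
# after the call, execution should resume at B1.
CALL_TARGETS = {"B0": {"B2"}}   # permitted inter-function jumps
RETURN_SITE = {"B0": "B1"}      # block that follows each call site


def first_violation(trace):
    """Return the first (from, to) transition that breaks the model, or None."""
    shadow = []
    for cur, nxt in zip(trace, trace[1:]):
        if cur in CALL_TARGETS:
            if nxt not in CALL_TARGETS[cur]:
                return (cur, nxt)
            shadow.append(RETURN_SITE[cur])
        else:
            # Simplification: any other block is assumed to end in a return.
            if not shadow or shadow.pop() != nxt:
                return (cur, nxt)
    return None


expected = first_violation(["B0", "B2", "B1"])
attacked = first_violation(["B0", "B2", "B3"])
```

The expected trace passes, while the attacked trace is flagged at the B2 → B3 transition, which is where the text says the invalid signal is asserted.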
However, SHA-1 and SHA-256 incur a long delay. CRC32 is an alternative that offers lower security but faster calculation and a shorter width. The probability that randomly injected instructions yield the same SHA-1, SHA-256 or CRC32 value is 1/2^160, 1/2^256 and 1/2^32, respectively. So the CRC algorithm can also be used in this detection system.
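The trade-off between the three integrity algorithms can be made concrete with Python's standard `zlib` and `hashlib` modules. This is only a software illustration of the check, not the hardware implementation: the block bytes are arbitrary placeholders rather than real OR1200 instructions, and the helper name `digests` is hypothetical.

```python
import hashlib
import zlib
from fractions import Fraction


def digests(block: bytes) -> dict:
    """Compute the three candidate integrity values for one basic block."""
    return {
        "crc32": zlib.crc32(block),                    # 32-bit integer
        "sha1": hashlib.sha1(block).hexdigest(),       # 160 bits, 40 hex chars
        "sha256": hashlib.sha256(block).hexdigest(),   # 256 bits, 64 hex chars
    }


# Chance that a random modification matches the stored value, per algorithm.
MATCH_PROBABILITY = {
    "crc32": Fraction(1, 2**32),
    "sha1": Fraction(1, 2**160),
    "sha256": Fraction(1, 2**256),
}

# Arbitrary placeholder bytes standing in for a basic block's machine code.
original = b"\x9c\x21\xff\xf8\xd4\x01\x48\x00"
tampered = bytes([original[0] ^ 0x01]) + original[1:]  # flip a single bit

stored = digests(original)   # computed off-line, kept in on-chip memory
observed = digests(tampered)  # recomputed at run time
crc_detects = stored["crc32"] != observed["crc32"]
```

A single flipped bit always changes the CRC32 value, so `crc_detects` is True here; the weaker guarantee of CRC32 only matters for modifications crafted or lucky enough to collide within its 32-bit space.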
Then, we check the secure tag validation design.
FIGURE 10. Secure tag verification procedure.
Fig. 10(a) shows a typical buffer overflow attack. Function PrStr() uses strcpy() to copy the character string pointed to by str into the local buf[10]. The attacker overflows buf[16] to overwrite the return address with 2110, which is in the data segment. Fig. 10(b) shows its stack structure. After the program returns, the normal control flow is interrupted to execute the malicious code at address 2110. At the IF stage, the designed secure tag checker detects that the instruction comes from the data segment, so it sends the interrupt signal to the CPU. Similarly, it can detect violations of the writable and readable attributes.
B. HARDWARE OVERHEAD
The hardware overhead consists of two parts: memory overhead for the monitoring model and logic overhead for control and comparison. The memory overhead of the monitoring model differs across applications. The widths of sets F, B and SS are assumed to be 32 bits, and the secure tag is 2 bits. In our experiment, CRC32 is used as the integrity algorithm, so the corresponding set width is 32 bits.
Our design is synthesized under Xilinx ISE. The modest area overhead is 330 slices.
C. DETECTING SPEED
The monitoring hardware runs in parallel with the CPU. If the detection speed of the monitoring hardware cannot keep up with the processor, performance suffers. We therefore examine the detection speed of the design next.
First, instruction monitoring and validation incurs little performance overhead. It takes at most 4 clock cycles to calculate the entry of the next possible basic block. The main overhead comes from the integrity validation. A 32-bit CRC is chosen as the substitute to meet the timing requirement. It is implemented as a combinational circuit and takes only a single clock cycle to complete the calculation. Once the run-time state is calculated, the compare unit completes the comparison in one clock cycle. The memory access scheme completes the validation within one clock cycle for instruction fetching and data loading. The overall overhead also depends on the ratio of memory access instructions to all instructions.
Second, we check the secure tag validation. The secure tag data is stored in the on-chip RAM and can be accessed in one clock cycle. The checker completes the validation within two clock cycles for instruction fetching and data loading. When a store instruction is executed, the checker obtains the secure tag within one clock cycle based on the target store address and completes the validation within two clock cycles.
In conclusion, the monitoring hardware can detect a buffer overflow attack within two clock cycles.
VI. CONCLUSIONS
The designed hardware secures application execution in the untrustworthy environments in which IoT devices operate. The two experiments show that it can detect a wide variety of buffer overflow attacks with a simple hardware design and reasonable overhead. The hardware overhead is acceptable. The detection latency of this design is 2 clock cycles, whereas a software-based design requires at least 8, so this design has a clear advantage in detection speed.
In conclusion, this design is suitable for IoT devices with limited resources and demanding security requirements.
