(SoC) architecture with reconfigurable logic and multiple processing elements sharing a common memory, like that shown in Figure 1 , is likely to become quite common in the near future. For example, programmable-logic companies are producing single chips containing millions of reconfigurable logic gate equivalents and full-custom VLSI processors. Designers can customize such chips by reprogramming the reconfigurable logic or the processors. We envision a scenario in which a few highly reconfigurable, reprogrammable SoCs appear in most embedded applications that demand rapid upgrades, fast time to market, and low cost.
RTOSs for SoCs. This article introduces the δ hardware-software RTOS framework, a prototype hardware-software generation tool that aids in customized RTOS-SoC codesign.
Motivation
Commercial RTOSs available for popular embedded processors provide significant reductions in design time. But they typically don't take advantage of hardware to implement any of their functions, probably because processors and custom hardware accelerators have historically resided on separate chips. Therefore, partitioning functionality between hardware and software has increased system speed only when the dramatic speedup provided by hardware compensates for the chipto-chip communication cost, as is the case for graphics coprocessors. Large communication costs, however, have limited fine-grained hardware-software partitioning. The advent of SoCs has greatly reduced this communication barrier to splitting logic between hardware and software on the same chip. Therefore, we developed the δ framework.
architecture can be considered an example of hardware-software codesign. 1, 2 Many of today's SoC designs are single-chip implementations of prior multichip PCB designs. The SoC designers make only marginal changes of the PCB design; for example, the SoC might use exactly the same bus structure as the PCB. It is certainly not common practice to codesign the SoC and the RTOS that will run the application code. In this respect, much of today's SoC design is analogous to early computer architecture design performed with no consideration of the compiler.
Programming model
We focus our research on multiprocessor SoC designs. We assume that these SoCs use current programming models for real-time embedded systems. Specifically, we assume that applications execute with multiple threads of control, using the shared-address-space or message-passing programming models.
Research goal
An important goal of our research is that our RTOS-SoC designs be easy for programmers to use. The 1990s saw the spectacular failure of several large parallel-processing companies such as Thinking Machines and MasPar. Although we cannot point to a single cause of the failures, the general opinion seems to be that these companies' large machines were very difficult to program. To avoid programming difficulty, we deliberately use a few (four to 10) processors with existing compilers such as GCC. Thus, no innovation is required for compiler and processor design in our SoCs. Instead, we leverage processors and compilers available on the market. Specifically, for our simulation engine, we use the Mentor Graphics Seamless Coverification Environment (CVE), which comes with Seamless Processor Support Packages. More than 30 are available, including packages for various Advanced RISC Machine (ARM) processors, IBM PowerPC (PPC) processors, and DSPs (http://www.mentor.com/seamless/).
Atalanta RTOS
As part of the δ framework, we developed the Atalanta multiprocessor RTOS kernel for SoC architectures. Atalanta provides key RTOS features including multitasking capabilities; event-driven, priority-based preemptive scheduling; and intertask communication and synchronization. Atalanta's small, compact, deterministic, modular, library-based architecture is important for SoC applications. Atalanta also supports special features such as priority inheritance and user configurability. 3 Atalanta provides various system objects for intertask communication and synchronization. Event groups, mailboxes, queues, semaphores, and mutexes (mutual exclusion objects) are available for tasks executing on the same processor or different processors in a homogeneous architecture. All these objects support the shared-address-space programming model. The mutex, a binary semaphore that implements priority inheritance, enables programmers to prevent priority inversion, thus providing a critical capability for real-time applications. The basic semaphore does not implement priority inheritance. Atalanta also provides message-passing system calls for communication between tasks on different processors in a heterogeneous architecture.
The Atalanta kernel is currently under evaluation in several hardware-software codesign research projects using both ARM and PPC processors. 3 
Approach
The δ framework is designed to provide automatic hardware-software configurability to support user-directed hardware-software partitioning. Figure 2 illustrates the framework's graphical user interface, which allows the user to select RTOS features. For some features, both hardware and software versions are available. In Figure 2 , the user has specified four PPC CPUs (specifically, PPC 750s). Furthermore, the user has chosen to implement deadlock detection in software and the SoC lock cache (SoCLC) in hardware. (The SoCLC, a hardware lock cache to which lock variables are memory-mapped, improves performance and real-time predictability of lock-variable access.) Finally, for interprocedure calls (IPCs), the user has chosen only semaphores under IPC methods. Figure 3 shows a novel approach to automating the partitioning of a hardware-software RTOS using predesigned hardware and software RTOS components. The δ framework takes the following input: The δ framework outputs the Makefile, User.h, and Verilog header files shown in Figure  3 . These are configuration files that combine the hardware-software RTOS IP library components specified by the user.
Graphical user interface

Partitioning methodology
The RTOS hardware IP available in the δ framework includes the SoCLC, the SoCDDU, and the SoCDMMU. The SoCLC stores lock variables in a separate lock cache outside the memory system, thereby reducing lock latency, lock delay, and bandwidth consumption in a sharedmemory multiprocessor SoC. 5, 6 Because each lock variable requires only 1 bit, the hardware cost is low. For example, in an experimental SoC, we synthesized a SoCLC with 128 lock variables that cost approximately 7,000 gates. The SoCDDU performs parallel hardware deadlock detection by searching for deadlocks on the resource allocation graph in hardware. 7 It provides a fast, low-area implementation of runtime deadlock detection. Compared with software, the SoCDDU reduces deadlock detection time by 99%.
The SoCDMMU implements dynamic memory allocation and management in hardware to improve both average-and worst-case memory allocation time. 8, 9 For example, in a practical example, the SoCDMMU reduced average-case allocation time by 440% compared with software memory allocation (already optimized for speed). 8 
Target SoC
The δ framework targets a SoC consisting of custom logic, reconfigurable logic, and multiple processing elements, all sharing a common memory, as shown in Figure 1 . Figure 4 shows an example of such a SoC, in which four PPC processors communicate with second-level memory and reconfigurable logic over a single bus. However, nothing in the δ framework requires PPC processors or a single bus; the only requirement for additional processors (currently the framework supports only the PPC 750, PPC 755, and ARM9TDMI) is that processor-specific assembly code for the software RTOS must be written for the new processor. All the hardware RTOS components have well-defined interfaces to which any processing element-including a hardware (non-Von Neumann or non-instruction-set) processing element-can connect and thus use the hardware RTOS component's features. In other words, both the custom and reconfigurable logic can contain specialized processing elements that interface to the hardware RTOS components.
Examples and experimental results
To test the δ framework, we simulated various RTOS configurations using the Mentor Graphics Seamless CVE. same chip architecture: four PPC processors with a single bus and shared memory. The first configuration, RTOS1, is for a system implementing a database transaction example with software semaphores and spin locks. 6 For comparison with RTOS1, we generated RTOS2, which uses a SoCLC along with the same database example. Figure 4 shows the SoC architecture for this case.
The third configuration, RTOS3, is a software RTOS with deadlock detection in software. It is designed for the shared-memory SoC already described. RTOS4 is for a system using the SoCDDU hardware. Finally, RTOS5 is for a system using both a SoCLC and a SoCDDU.
We simulated these systems in the Seamless framework, using the Synopsys VCS Verilog simulator and Mentor Graphics Xray for application code debugging. We compiled the software application code and the software RTOS code for execution on the PPC processors, and we instantiated the hardware part of the configured RTOS in a top-level Verilog file that could execute in the Seamless framework on the Synopsys VCS simulator. We interfaced the processors with the shared memory and other hardware RTOS components through an address decoder, an arbiter, and a memory controller.
To verify that the generated RTOS1 and RTOS2 configurations are correct, we used the same database example mentioned previously in connection with the SoCLC. 6 The example includes accesses to both short and long critical sections. Long critical sections are actual database-copying actions, whereas short critical sections are synchronization actions among server tasks and client tasks before a long data transaction begins. Table 1 presents our experimental simulation results for the database example. It compares lock delay, lock latency, and total execution time for RTOS1 (with software locks) and RTOS2 (with SoCLC). Both configurations ran with 40 tasks. RTOS2 achieves a speedup of 27% over RTOS1 on the same architecture.
To estimate the added area cost of RTOS2, we synthesized the SoCLC to a semicustom library and to reconfigurable logic (using a Xilinx 4000 series FPGA as an example for reconfigurable logic area). Table 2 shows the hardware area we used for the SoCLC with support for up to 64 short and 64 long criticalsection locks. Table 2 also shows the SoCDDU hardware area for an example we discuss later.
On the basis of these results, δ framework users can choose between two tradeoffs: Gain a speedup using the SoCLC with chip space overhead, or use software locking and save the hardware for other uses (or simply reduce overall chip size).
To verify that the generated configurations for RTOS3 and RTOS4 are correct, we considered how the SoCDDU would perform in an example based on a Jini lookup service application 10 in which client applications can request services through intermediate layers (lookup, discovery, and join). This system has four clients (four PPC processors) and four services: peripheral component interconnect (PCI), MPEG, fast Fourier transform (FFT), and wireless interface hardware Because the system has multiple requesters and resources, deadlock is possible. Therefore, the system could benefit from the SoCDDU. Figure 6 illustrates the example. The sequence of requests and grants leads to a deadlock caused by event e 5 at time t 5 (a previous article 7 presents the sequence in detail). We did not obtain the processing times shown from an actual industrial product; rather, we estimated them to exemplify an application running setup code in C to dynamically change (at runtime) hardware resource use among the processors. We assumed that if a processor needs two resources, it cannot proceed in its computation until it acquires both resources. This assumption is valid in typical multimedia applications processing streamed data.
Suppose that we start deadlock detection at time t 5 and that the time needed for deadlock detection is ∆. We compared the ∆ values required with two deadlock detection methods: software versus SoCDDU. We compared the two methods' effects on deadlock detection time and total execution time. Table 3 shows the results. Total execution time speeds up 38% in this example. Table 2 shows the hardware area used by the SoCDDU.
Both the SoCLC and the SoCDDU are scalable according to the number of processors, lock variables, or hardware resources. Thus, δ framework users can consider various predefined hardware-software RTOS partitions and SoC architectures.
Incorporating the δ framework in platform-based design
How might a designer use the δ framework in platform-based design? That depends on whether the user is designing a custom platform ASIC (also known as a SoC) for eventual fabrication or using an existing platform ASIC.
In the former case, the δ framework helps in design space exploration of the SoC architecture. For example, if fast locking is important for the kinds of applications to be run, the designer might want a SoCLC in custom hardware. On the basis of this application analysis, the designer would use the framework to narrow down the set of RTOS choices and corresponding SoC architecture choices. Then, the designer can simulate each customized RTOS and SoC architecture in a cycle-accurate simulator such as the Seamless CVE to obtain sample code traces for the applications to be run on the SoC. Seamless CVE also supports noncycle-accurate simulation modes for faster design space exploration, and the δ framework works with these modes as well.
Designers should also consider additional factors such as power consumption. Then, they can make a final choice as to which SoC to fabricate. After fabrication, the processors, the custom hardware RTOS services, and the amount of reconfigurable logic included on the SoC will limit the RTOS choices available. In the postfabrication scenario, a particular real-time application might have a rare deadlock condition, which, if encountered, should trigger system reset as soon as possible. An already fabricated SoC with on-chip reconfigurable logic could implement the hardware SoCDDU in reconfigurable logic for this application. Potentially, any of the hardware RTOS services could be placed in reconfigurable logic. Therefore, to the extent possible (which is limited by area), the δ framework can generate a hardware-software RTOS customized for a particular set of application requirements to be run on a platform ASIC with reconfigurable logic. Even without reconfigurable logic or custom RTOS logic included on the SoC, the framework is useful, but perhaps no more useful than an existing software RTOS available commercially.
Finally, the δ framework can be extended to include additional hardware and software components to generate a wider range of customized RTOSs. For example, a designer might notice that the application software would run much faster and consume less energy if a particular RTOS function were implemented completely or partly in hardware. Although retooling the RTOS generation framework requires effort (much as integrating new instructions into a compiler does), a user with access to and understanding of the source code could add such functionality, or the company providing the framework could make these extensions. In fact, the existing three hardware RTOS IP components resulted from our trying to run application programs in a multiprocessor SoC and then asking, How can we speed up these programs dramatically with a very small hardware addition (less than 20,000 gates) to the RTOS? In short, we consider the δ framework to be an early conceptual prototype of the RTOS side of the equation in RTOS-SoC or RTOSplatform ASIC codesign.
WE BELIEVE that platform-based design must pay more attention to software in the platform definition process. Bearing this out, our results with the δ framework show that SoC architectures can benefit greatly from early codesign with a hardware-software RTOS. In particular, we have shown several examples in which a small hardware area (less than 10,000 gates) results in speedups of 27% or more in likely application scenarios. In our future work, we plan to more fully automate RTOS generation by integrating a wider variety of RTOS hardware components in the δ framework. 
IEEE Micro serves your interests
SEPTEMBER-OCTOBER 2002
Guest Editor Luciano Lavagno of the Politecnico di Torino presents articles on
Systems on a Chip: The Next Electronic Design Frontier
The integration of entire systems on a chip poses a host of problems to both architects and designers; the solutions to such problems are often very different from their board-level counterparts. The SeptemberOctober special SOC issue of IEEE Micro includes a set of articles that explores several interesting aspects of SOC integration.
http:computer.org/micro
