Abstract: This paper presents a HW/SW co-design methodology for the design of resource limited embedded devices. The design methodology takes advantage of system abstraction and attached vertical and horizontal co-design flows. The vertical co-design flow focuses on a system consisting of a standard processor and a custom hardware as co-processor. The horizontal co-design flow implements the system's functionality entirely in software, but the underlying processor is developed at the same time. The focus of the paper is on the detailed description of the two design flows under the aspect of power awareness.
Ulrich Neffe received his MS Degree in Electrical Engineering from the Graz University of Technology, Austria, in 2002 and his PhD Degree in Electrical Engineering from the same university in 2005. Currently, he is a member of the smart card software research and development team of Philips Semiconductors and works in close cooperation with Graz University of Technology. His research interests include low-power software design and HW/SW co-design.
Introduction
HW/SW co-design methodologies were developed for the optimisation of highly complex systems. Such applications, for instance multimedia systems, are modelled abstractly, and the abstract model is then mapped onto different target architectures to evaluate design parameters such as performance, energy dissipation, system size, or system weight. The goal is to find the system design that best fits the design requirements. The optimisation potential of such an approach is often a factor between 10 and 20 for dedicated design parameters. A lot of effort has been spent on design methodologies for these systems, but less work has been done for small embedded systems like smart cards because their complexity can be handled by standard design techniques.
To get the advantages of these methodologies for the design of small, highly optimised embedded systems, a further development of the proposed design methodologies is necessary.
Smart card systems
A smart card (Rankl and Effing, 2002) has no integrated power supply and is therefore a passive device. Passive devices always need an active device to supply their power. Hence, a smart card system comprises not only the smart card but also a host terminal and, optionally, a background system. Such a system is presented in Figure 1. The smart card itself is represented by its hardware, the operating system running on the card, and the applications.

The link between the card and the smart card reader depends on the interface type. Contact cards are directly connected to a reader via a serial link, which also contains two lines for the power supply. A contactless smart card reader generates an RF field, which the card uses both as its power supply and for communication. The reader is connected to the terminal via a standard interface, for example a serial RS-232 connection or a USB interface. The terminal contains the interface hardware, the smart card reader driver and the terminal applications. Depending on the driver, a terminal can manage one or more smart card readers independently of the smart card I/O interface. The terminal application can communicate with a background system.

As discussed before, a smart card is a passive component in terms of power, but it is also passive in terms of its behaviour. During normal operation a smart card only reacts to commands sent to it by the host. Hence, a serial Universal Asynchronous Receiver/Transmitter (UART) is an essential component of a card. This UART is used to receive and send messages. In addition to the serial interface, data have to be stored permanently on the card. Different non-volatile memory technologies such as EEPROM and Flash are used for this purpose. Smart cards are often used to store confidential information. Thus, cryptographic algorithms and secure keys are used to protect confidential data and data transmission. But cryptographic algorithms require a high amount of computation power.
This power is mostly provided by dedicated co-processors. Only high performance 32-bit smart cards can compute them in software.
Power and energy dissipation are important design constraints for smart cards, as mentioned before. Embedded systems powered by batteries are designed for minimised energy dissipation to increase stand-by and active times. Smart cards are often used in battery-powered embedded systems, and therefore they also have to be designed for minimised energy dissipation.
An example is the Subscriber Identity Module (SIM) card embedded in GSM mobile phones. Contactless smart cards, on the other hand, are powered by a Radio Frequency (RF) field with constrained field energy. For instance, the magnetic field strength of the RF field for proximity cards specified by ISO 14443 has to be between 1.5 A/m and 7.5 A/m. The field does not limit energy dissipation, but it does limit power consumption. The second hard constraint is the size of the chip. The upper limit for the chip size is defined by manufacturers to avoid breaking of the chip due to torsional forces caused by customer usage. Moreover, the smart card market is a high-volume market with strong competition, and thus chip size has a high impact on costs.
The document is organised as follows. Section 2 surveys related work and discusses proposed HW/SW co-design methodologies. Section 3 presents the developed HW/SW co-design methodology. Section 4 then presents the vertical and horizontal co-design flows in detail. Section 5 presents the methodology evaluation and Section 6 concludes.
Related work
Research on HW/SW co-design has been done since the early 1990s. During the first decade, formal models and abstract descriptions of system behaviour were developed, as were partitioning algorithms that use cost functions to find the best hardware/software solution. These approaches often target restricted hardware architectures or application areas. For instance, POLIS is a design environment for control-dominated applications and was designed for the automobile industry (Balarin et al., 2003).
With shrinking silicon process technology, design complexity has increased dramatically in recent years. Thus, design reuse and more abstract system representations became necessary to handle complex multi-processor system-on-chip designs, and new design methodologies emerged to close the resulting design gap. The following paragraphs discuss selected design methodologies driven by large organisations.
Reuse-based methodologies were introduced for SoC design at register-transfer level in the late 1990s. The problems of using these Intellectual Property (IP) cores are interface difficulties, process portability, and the customisation of IP cores. The Semiconductor Technology Academic Research Centre (STARC) in Yokohama, Japan, has developed a virtual core based design methodology in the Virtual Core based Design System (VCDS) project (Muraoka et al., 2003, 2004). The authors have proposed Virtual Cores (VCores) to solve these problems. VCores are reusable, configurable design components described at a high level of abstraction. The distinguishing feature of VCores is their variability with regard to functionality and structure, whereas IP components have a very limited variability (e.g., bit width). The VCDS defines three different types of VCores:
• functional VCores used at system-level design of a SoC
• software VCores representing software components
• hardware VCores.
During the HW/SW partitioning step, functionally equivalent software and hardware VCores have to be assigned to functional VCores. For a fast trade-off between hardware and software, the assignment is done automatically.
A different design methodology, called the "Platform-based Design Methodology", was proposed by Sangiovanni-Vincentelli and Ferrari (1999) and Sangiovanni-Vincentelli and Martin (2001). Platform-based design is an approach that separates different aspects of the design space to allow more effective design space exploration. It separates function from architecture and computation from communication. The mapping of function to architecture is an essential step and allows independent exploration of functionality and architecture. In addition to existing hardware platforms, a software platform provides an Application Program Interface (API) to the application software by abstracting hardware details through device drivers and the operating system. The hardware and software platforms together are called the system platform.
Platform-based design distinguishes between functional and structural design, but hardware designs contain a certain amount of structural information. Therefore, Doucet et al. (2002) propose a structured component composition framework based on the fact that a hardware structure can be mapped to an object structure with a one-to-one correspondence.
Only a few HW/SW co-design approaches have been published for designing smart cards. A methodology for smart card security design is proposed by Gressus (2001). This methodology is based on concurrent secure development, whose objectives are to decrease the global development time, to integrate and assure the security requirements and deliverables for evaluation throughout the development cycle, to develop tests at each stage of the design process, and to maintain control throughout the development cycle by using appropriate tools. Concurrent secure development includes a structural and modular architecture with a clear definition of interface communication and an IP building process with design for reuse. The authors start very early in the development process with formal and semi-formal methods in order to comply with the requirements of assurance level five and above. For this purpose they use advanced formal and semi-formal techniques such as UML and B, and dedicated languages such as Java. This approach fulfils all security requirements, but it does not support system-level design space exploration for power, performance and chip size optimisation.
Given an application, the problem of application-specific instruction-set processor (ASIP) design is to derive an appropriate processor architecture that can implement application-specific software. For that purpose, one must adapt compilers, libraries, operating system functions, and the simulation and debugging environment (Ernst, 1998).
The development of ASIPs requires expert knowledge in different domains: application software development tools, processor hardware implementation, and system integration and verification. Hoffmann et al. (2001) present a framework for ASIP design, which is based on machine descriptions in the LISA language. From these descriptions, software development tools can be generated automatically, including an HLL C-compiler, assembler, linker, simulator and debugger front end.
Xtensa is an ASIP processor core designed with ease of integration, customisation, and extension in mind (Gonzalez, 2000). Xtensa lets the system designer select and size only the features required for a given application. The designer can define new system-specific instructions if pre-existing features do not provide the required functionality.
Novel HW/SW co-design methodology
Methodology requirements
Before describing the design methodology in detail the requirements for smart card design should be discussed.
The smart card design process is differentiated from that of other embedded systems by five particular features:
• high performance
• high security
• low power consumption
• low cost
• limited resources.
These features are important due to the following reasons. Performance can be divided into two groups:
• hard real-time constraints
• soft real-time constraints.
Hard real-time constraints have to be fulfilled at lower layers. These are the strict timing constraints of communication protocols at the bit, byte and message layers. Soft real-time constraints are defined at the application level. These timing constraints define application response times, which differ depending on the application domain. A response time comprises the time from the start of a card transaction until the transaction is finished. Short response times are necessary for public transport or payment systems to achieve a high throughput of transactions.
High security is always a topic in smart card design because smart cards are used to store sensitive personal data or other confidential information. The requirement for low power depends on the application area. Power is critical for contactless systems, but energy dissipation is also a requirement for smart cards used in mobile, battery-powered devices like mobile phones. Costs are always an important design constraint in consumer and high-volume markets. This is also true for smart cards. Limited resources are a result of all other design constraints discussed before.
System design abstraction
Intermediate levels of abstraction are necessary between a textual system specification and an implementation to close the design gap and to explore the design space. Between two levels of abstraction a model transformation or synthesis step has to be performed to generate a lower level representation of a higher level model. It is necessary to verify the generated model to reduce the risk of design errors due to synthesis bugs. The synthesis and verification processes lead to a high effort for a model transformation from one level to the next lower level of abstraction. But small steps between subsequent levels can be used to design more efficient systems. Hence, there is a trade-off between design efficiency and design effort.
We have implemented different levels of abstraction and corresponding system models in this approach: functional-level, transaction-level, and cycle-accurate models. The reason for system abstraction is the high potential for power and performance optimisation reported in previous publications. Raghunathan et al. (1998) estimated the gain at system level to be a factor of 5-10 and the gain at behavioural level to be a factor of 2-5.
The approach presented in this work has been designed to get high design efficiency by using a low number of levels of abstraction. This goal is reached by using different platform models at different levels. Platforms cover a broad band of abstraction instead of a single level. The bands of abstractions are depicted in Figure 2 . These bands provide a high flexibility regarding the trade-off between design efficiency and design effort. Thus, a model can be refined or explored manually within a platform if necessary nearly down to the next lower level of abstraction, which results in a small design gap. Otherwise, a model can be mapped directly to the next lower level, which leads to a larger gap. This flexibility and the maximum range of abstraction depend on the definition of a platform. A platform is defined by 
Overall design flow
The design methodology presented is implemented by using the global design flow shown in Figure 3 . The design flow is divided into two different phases:
• design, verification and exploration of the functional platform model
• exploration of different system architectures and the selection of the best solution.
The first phase starts with a specification analysis and the development of a first virtual prototype on the functional platform. The first milestone within phase one is the successful validation of the specification based on the prototype. The validation requires a fully functional testbench, which is also necessary to verify all subsequent system synthesis and refinement steps. Afterwards, functional platform exploration and optimisation can be performed based on the indicators for energy dissipation and performance.
After one or more optimised functional platform models have been selected two different approaches (phase 2) can be considered:
• a vertical co-design process for processor-based target architectures
• a horizontal co-design process for application specific processor design.
Vertical and horizontal HW/SW co-design flows
The proposed design methodology (Neffe, 2005) has been implemented (based on SystemC, Open SystemC™ Initiative, 2005) and evaluated for two different HW/SW co-design strategies. The first strategy comprises hardware architectures with one or more existing instruction-set processors, which are supported by one or more co-processors. The second strategy comprises the development of application-specific instruction-set processors with optimised software structures. These strategies are supported by the vertical and the horizontal HW/SW co-design flow, respectively. Both flows are supported by automated inter-platform model transformations performed by a synthesis system.
Vertical co-design

Vertical HW/SW co-design flow
The vertical co-design flow ( Figure 4 ) is designed for multi-processing element systems. Thus the input for this flow is the functional platform model and an abstract target architecture model. The architecture model represents the system structure consisting of processing elements and communication units.
Figure 4 Vertical co-design flow
The user has to provide a system configuration, which contains the functional object mapping, memory map information, bus configuration and protocol, and synthesis support data. Based on this information the synthesis process can be performed automatically. The resulting transaction-level platform model is again a virtual prototype, which can be verified and used for performance and energy dissipation analysis. Because the transaction-level platform model is generated automatically, different architectures can be evaluated by moving back in the design flow and defining a new system configuration. The design space can thus be explored quickly and efficiently. However, processing elements are only represented as abstract components, so timing information and the impact of code fetching on the system bus are missing. Therefore, software components have to be cross-compiled for the target processor, and instruction-set simulators have to replace the corresponding abstract processing elements. Additionally, all remaining processing elements have to be reconfigured to support the cycle-accurate system bus interface.
However, this configuration has to be provided to the system synthesis unit to generate software automatically. The required heterogeneous simulation environment has to be set up by the user. The resulting accuracy for performance depends on the simulators and the description of the hardware models. The accuracy of the energy estimation also depends on the energy models used. The design space can be explored by reconfiguring software generation, but the considered design space is small compared to the transaction-level platform model exploration. If the selected architecture does not fulfil the design constraints on the cycle-accurate platform, a step back is necessary and a new system configuration should be found that has a higher potential to meet the design constraints.
System synthesis
The design of the synthesis system is shown in Figure 5. The system description is read by a C++ front end (Edison Design Group, 2005). An internal data structure representing the system is built from the intermediate language generated by the front end. The data structure contains the system hierarchy and all interconnects between modules. There are only a few design restrictions: exactly one module (the top module) has to be instantiated in the SystemC main function, all other modules have to be instantiated hierarchically within the top module, and all communication links have to use SystemC ports. This data structure is written to an XML file, which is used by the user front end to visualise the system structure. The user front end also visualises the target architecture description and allows functional modules to be mapped onto processing elements. This mapping information, the synchronisation mechanisms, the bus configurations and the global memory mapping are written to a system synthesis file (ssf). The system synthesis is performed by three modules:
• system synthesis control unit
• interface generator
• parallelisation unit.
The synthesis control unit transforms the system representation into the architecture defined in the ssf. It instantiates functional modules in processing elements, changes the top module description and, if necessary, inserts adapters between functional modules and bus interfaces. The interface generator is used for this purpose. It generates a pair of adapters if two connected functional modules are mapped to different processing elements. The master adapter implements the same interface as the slave-side functional module. It translates a method call into bus transactions and forwards them to the slave adapter. The slave adapter then performs the original method call on the functional module. The third module allows a parallelisation of modules as explained above. It can be used by the synthesis control unit, but also stand-alone for functional system transformation. The SystemC software generator writes the new system into C++ files. These can be compiled and enable a simulation of the transformed system. The C software generator creates C code from the functional modules mapped to a processing unit that represents a processor. This software can be compiled with standard tools supporting the processor used.
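The adapter decision made by the interface generator can be illustrated with a minimal sketch. It is not the authors' implementation: the `Connection` type, the `plan_adapters` function and all module and processing element names are hypothetical, and the sketch is written in Python rather than SystemC for brevity. It only shows the rule stated above, namely that a master/slave adapter pair is needed exactly when the two ends of a connection are mapped to different processing elements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """Point-to-point link from a calling (master) module to a called (slave) module."""
    master: str
    slave: str


def plan_adapters(connections, mapping):
    """Return the connections that need a master/slave adapter pair.

    `mapping` assigns each functional module to a processing element.
    A pair is needed only when both ends sit on different elements.
    """
    pairs = []
    for conn in connections:
        if mapping[conn.master] != mapping[conn.slave]:
            pairs.append(conn)
    return pairs


# Hypothetical example system; all names are illustrative assumptions.
connections = [
    Connection("app", "crypto"),
    Connection("app", "file_system"),
    Connection("file_system", "nvm"),
]
mapping = {"app": "cpu", "file_system": "cpu", "crypto": "coproc", "nvm": "cpu"}

adapter_pairs = plan_adapters(connections, mapping)
```

In this example only the link between `app` and `crypto` crosses a processing element boundary, so it is the only one that receives an adapter pair. Changing the mapping and calling `plan_adapters` again corresponds to evaluating a new system configuration.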
Functional platform
The functional platform (Figure 6) is the highest level of abstraction for design space exploration. The model implemented on this platform consists of user-defined and Application Program Interface (API) SystemC modules communicating over SystemC interfaces. API modules implement standard smart card resources like non-volatile memory or UART functionality for sending messages to and receiving messages from the host. Because physical information is missing at this level of abstraction, different indicators for energy dissipation have to be used. Direct indicators for memory energy dissipation are the number of memory accesses, the number of non-volatile memory programming cycles, memory access patterns, and data-path switching activity. Similar indicators are useful for communication channels, for instance the serial link between a smart card and a terminal. Indicators for data transmissions are bandwidths, the total number of transmitted bytes, and the average size of messages. In addition to the direct indicators discussed before, energy macro models can be used for estimation. Energy macro models are based on a bottom-up energy characterisation process and can be applied to reused objects, which are, in general, API objects. The characterisation process requires an existing implementation of the functional objects on dedicated architectures. Thus, the characterisation is done for a target processor implemented in a dedicated process technology. The energy dissipation of an operating system call can be measured and assigned to a functional object interface call. Of course, there is no differentiation between software and hardware energy dissipation, but this is not relevant at this level of abstraction.
API objects are pre-characterised and support direct energy estimation under the assumption of a selection of a target processor-based solution. Therefore the energy estimation has to be done by the API object. The estimated energy values have to be provided to a global energy estimation unit. This unit is called the Energy Sampling Unit and can read out the energy values of API objects via an energy interface. This interface has to be implemented by all objects which provide energy estimation.
But this model does not affect the internal energy models used by objects. The interface has to provide the following functionality:
• a method to reset the internal energy models
• a method to read the energy value
• a method to clear the accumulated energy value without resetting the internal energy model.
With this interface, energy profiling of a system is possible. For this purpose the sampling unit has to be triggered to fetch energy data. Because timing information is missing at this level, the system has to trigger the sampling unit itself.
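The energy interface and the Energy Sampling Unit described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' SystemC code: the class and method names are hypothetical, the per-operation energy values are placeholders rather than characterised data, and the "internal energy model" is interpreted here as the access counters kept by the object.

```python
from abc import ABC, abstractmethod


class EnergyInterface(ABC):
    """The three operations the text requires of every estimating object."""

    @abstractmethod
    def reset_energy_model(self):
        """Reset the internal energy model."""

    @abstractmethod
    def read_energy(self):
        """Return the energy accumulated since the last clear."""

    @abstractmethod
    def clear_energy(self):
        """Clear the accumulated value, keeping the internal model state."""


class NvmApiObject(EnergyInterface):
    """Hypothetical non-volatile memory API object with a macro model."""

    READ_NJ = 1.0      # placeholder value, not a measured figure
    PROGRAM_NJ = 50.0  # placeholder value, not a measured figure

    def __init__(self):
        self._cells = {}
        self._accumulated_nj = 0.0
        self.reads = 0           # indicator: number of memory accesses
        self.program_cycles = 0  # indicator: number of programming cycles

    def read(self, address):
        self.reads += 1
        self._accumulated_nj += self.READ_NJ
        return self._cells.get(address, 0xFF)

    def program(self, address, value):
        self.program_cycles += 1
        self._accumulated_nj += self.PROGRAM_NJ
        self._cells[address] = value

    def reset_energy_model(self):
        self.reads = 0
        self.program_cycles = 0
        self._accumulated_nj = 0.0

    def read_energy(self):
        return self._accumulated_nj

    def clear_energy(self):
        self._accumulated_nj = 0.0


class EnergySamplingUnit:
    """Collects energy values from all registered objects when triggered."""

    def __init__(self, objects):
        self._objects = list(objects)
        self.samples = []  # list of (label, energy) pairs

    def trigger(self, label):
        total = 0.0
        for obj in self._objects:
            total += obj.read_energy()
            obj.clear_energy()
        self.samples.append((label, total))
        return total


# The system triggers the sampling unit itself, since no time base exists.
nvm = NvmApiObject()
unit = EnergySamplingUnit([nvm])
nvm.program(0x10, 0xAB)
first = unit.trigger("write")
nvm.read(0x10)
second = unit.trigger("read")
```

Because the model is untimed, each sample is labelled by the system event that triggered it rather than by a timestamp. Clearing after each read turns the accumulated values into a per-phase energy profile, while the indicator counters survive until the model is reset.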
Transaction-level platform
The transaction-level platform is designed for system architecture exploration, which is performed by mapping a functional model onto an architecture model. Architecture models consist of processing elements and communication units. Processing elements represent processors and co-processors in an abstract way. Communication units comprise all available types of signals and transaction-level bus models. Bus models for the supported smart card CPUs were developed; configurable models from OCP-IP (Haverinen et al., 2002) are also supported for communication exploration. Functional models are assigned to processing elements by the user. They are automatically instantiated in the scope of the corresponding processing element. If two modules are mapped to the same element, no interface adaptation is necessary. But if they are mapped to different processing elements, interface adapters are required. The general structure of this mapping process is shown in Figure 7. The functional model contains point-to-point interface-based connections. These connections have to be refined to software-software, software-hardware or hardware-hardware interfaces. Two adapters are necessary to forward the method call over the communication system to the slave. The communication sequence is as follows: the method parameters are passed to the slave, the right method is invoked, and, if applicable, the result is read by the master after the slave has finished its work. Additional control bits, e.g., start, run, or error, are necessary and can be realised by special function registers. Again, a memory map is needed to address all method parameters and control registers. In this case the memory map can be optimised to reduce switching activity. When the channel represents a shared bus, communication delays that depend on different arbitration schemes can be explored.
Implementation of these adapters by hand is an error prone and time consuming process; thus, their automated generation is an essential part of the design flow.
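The communication sequence between a master and a slave adapter can be sketched as below. This is a simplified, untimed illustration and not generated code from the synthesis system: the register offsets, the control bit values, the `Adder` module and the `CountingBus` are assumptions introduced for this example only.

```python
# Hypothetical control bits realised as a special function register.
START, DONE, ERROR = 0x1, 0x2, 0x4

# Hypothetical memory map (register offsets) for one remote method.
REG_CONTROL = 0x00
REG_PARAM0 = 0x04
REG_PARAM1 = 0x08
REG_RESULT = 0x0C


class Adder:
    """Stand-in functional module; its interface is illustrative only."""

    def add(self, a, b):
        return a + b


class SlaveAdapter:
    """Turns register accesses back into the original method call."""

    def __init__(self, module):
        self._module = module
        self._regs = {REG_CONTROL: 0, REG_PARAM0: 0, REG_PARAM1: 0, REG_RESULT: 0}

    def bus_write(self, offset, value):
        self._regs[offset] = value
        if offset == REG_CONTROL and value & START:
            self._invoke()

    def bus_read(self, offset):
        return self._regs[offset]

    def _invoke(self):
        try:
            self._regs[REG_RESULT] = self._module.add(
                self._regs[REG_PARAM0], self._regs[REG_PARAM1])
            self._regs[REG_CONTROL] = DONE
        except Exception:
            self._regs[REG_CONTROL] = ERROR


class CountingBus:
    """Minimal bus stand-in that forwards accesses and counts transactions."""

    def __init__(self, slave):
        self._slave = slave
        self.transactions = 0

    def write(self, offset, value):
        self.transactions += 1
        self._slave.bus_write(offset, value)

    def read(self, offset):
        self.transactions += 1
        return self._slave.bus_read(offset)


class MasterAdapter:
    """Offers the same add() interface as the remote functional module."""

    def __init__(self, bus):
        self._bus = bus

    def add(self, a, b):
        self._bus.write(REG_PARAM0, a)       # pass the parameters to the slave
        self._bus.write(REG_PARAM1, b)
        self._bus.write(REG_CONTROL, START)  # invoke the method
        status = self._bus.read(REG_CONTROL)
        if status & ERROR:
            raise RuntimeError("remote call failed")
        return self._bus.read(REG_RESULT)    # read the result


bus = CountingBus(SlaveAdapter(Adder()))
caller_view = MasterAdapter(bus)
result = caller_view.add(2, 3)
```

The caller uses `caller_view` exactly as it would use the `Adder` module directly, which is why the adapters can be inserted without touching the functional modules. The transaction counter hints at how such a model exposes the communication cost of a mapping; a timed bus model would additionally require the master to poll or wait for the done bit.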
The memory generator creates a ROM, RAM, EEPROM, or flash module for every functional module, if necessary. Hence, many small memory objects are present in the functional model. These small memory blocks have to be merged into larger blocks. Whether one large memory block or several smaller blocks are used for each memory technology has a high impact on power consumption. Therefore, temporary adapters are inserted in the transaction-level platform model to trace all memory accesses and to calculate the best memory structure. This information is also used to determine the most power-efficient page size for programming the non-volatile memory.
Cycle-accurate platform
The cycle-accurate platform is used to implement a selected transaction-level platform model. At this level of abstraction, the processing elements of the transaction-level platform are replaced by cycle-accurate models. Cycle-accurate models can be instruction-set simulators representing processors, accurate simulation models of co-processors, or accurate representations of application-specific hardware blocks. Transaction-level communication elements are replaced by dedicated bus systems, but these can again be implemented at transaction level. It is necessary to add memories to hold the cross-compiled object code representing the software.
A cycle-accurate platform model is, in general, a heterogeneous system. The heterogeneous system comprises instruction-set simulators, bus models, co-processor models and possibly analogue parts. Due to the complexity of such a system accompanied by the implementation challenge, the cycle-accurate platform should be represented as long as possible in a homogeneous simulation environment. Figure 8 shows the general concept for this platform. The bus system and all application specific hardware components are modelled within the homogeneous simulation environment. These components can be refined based on the corresponding transaction-level models. A selected processor is represented by its instruction-set simulator, which has to be coupled to the bus model within the homogeneous environment. Simulators normally provide bus interfaces for this reason. Simulation models are normally available for reused components. These are often provided in some type of library, which also requires a bus interface wrapper. 
Horizontal HW/SW co-design approach
Horizontal design flow
The target architecture for the horizontal co-design approach is a single application specific processor. This approach does not focus only on the hardware architecture of the processor but also on the layering of functionality.
Therefore, the hardware and software architecture is developed out of the functional platform model, and functional platform objects are integrated step by step into the system. A co-simulation between the functional platform model and the new processor design is available throughout the entire design flow. The design steps, as depicted in Figure 9, are described in detail below. The horizontal design flow starts with the definition of the supported instruction set. The defined instruction set represents the functionality of objects in the functional model. The functionality defined by the instruction set has to be moved to a separate processing element. Thus, the functional platform model is partitioned into two processing elements: one containing the functionality defined by the instruction set, called the hardware element in the following, and a second element containing the remaining system. The interface between these two modules represents the HW/SW interface defined by the instruction set. After the correct behaviour of the system has been verified, the hardware element can be refined until it represents a basic hardware architecture. Because abstract interfaces provide access to all hardware resources, the entire system can be simulated. Of course, additional test patterns are necessary to verify the hardware element alone.
Afterwards, the next layer has to be moved from the functional platform model to the hardware element. The next layers are the hardware abstraction layer and the operating system layer. The steps are the same as presented before. The interfaces to the hardware and the hardware abstraction layer have to be refined and extended to support the target applications optimally. The hardware abstraction layer and parts of the operating system layer have to be implemented using the instruction set, and the remaining parts can be implemented in a high-level language. Only critical parts of the considered layers should be implemented during this step; all other parts can stay in the functional platform model, but with modified interfaces and with objects reorganised in a layered system. The last step considers the interface between operating system and application. This interface has to be defined and supported at all lower layers. Again, critical parts should be implemented to make sure that the concept works. Afterwards the interfaces between all layers should be clearly defined and tested. The result is a refined model consisting of a hardware model and all software layers, including the target applications. Furthermore, the different layers can be implemented independently of each other, with the co-simulation environment used for verification.
It is not necessary to perform all these steps in the order defined before. Whether a bottom-up, top-down, or meet-in-the-middle approach should be used depends on the application, but the design flow is the same. Of course, an assembler and a high-level compiler are necessary to implement all software layers. Co-simulation capabilities are available during the entire design process, but high-level debugging techniques are necessary as more and more software layers are integrated into the final system.
System layers and optimisation
Looking at the design flow, the typical levels of an ASIP processor can be figured out, as depicted in Figure 10 . These levels are the application level, the operation system level, the assembler level, the system architecture level and instruction set level, as well as the logic level. A first partition of the system can be optimised by improving the implementation on each level and by shifting functionality from one level to another. Depending on the requirements of the system, optimisations can be done concerning chip area, performance and power.
In most cases the resources of the hardware can be accessed more efficiently by the operating system than by the application layer. Thus, shifting functionality from the application layer to the operating system layer can improve the performance of the system. Due to the security requirements of the system, these optimisations are very limited. To avoid context violations, the application layer should, for example, not be allowed to access the memory of the system directly. The operating system can use high-level functions, which are also used by the application layer, but in most cases the use of primitive functions is more efficient. When using such primitive functions, the designer always has to be aware of the security requirements of the system. Some functionality can also be implemented directly on the assembly layer. These functions are written manually and can thus be more efficient than a compiler-generated solution.
Functionality can be shifted from the software layer to the hardware layer by implementing new instructions. The designer has to trade off the higher performance against the increased chip area. Thus, new instructions should only be implemented for instruction sequences that are used frequently. On the other hand, to reduce the chip area, instructions that are seldom used can be shifted into the software layer.
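One way to look for instruction sequences that are used frequently is to count recurring sequences in an execution trace. The following Python sketch illustrates the idea under simplifying assumptions: the trace, the mnemonics, and the function name are hypothetical and are not taken from the tool flow described here.

```python
from collections import Counter


def frequent_sequences(trace, length=2, top=3):
    """Return the `top` most common runs of `length` consecutive instructions."""
    windows = (
        tuple(trace[i:i + length])
        for i in range(len(trace) - length + 1)
    )
    return Counter(windows).most_common(top)


if __name__ == "__main__":
    # Hypothetical executed instruction trace (illustrative mnemonics only).
    trace = ["load", "add", "store", "load", "add", "store", "jump", "load", "add"]
    for sequence, count in frequent_sequences(trace):
        print(count, " ".join(sequence))
```

Sequences that appear near the top of such a ranking are candidates for a dedicated instruction, while instructions that hardly appear in any trace are candidates for a software implementation. The chip area cost still has to be assessed separately.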
The system can be optimised not only by shifting functionality, but also by optimising the implementation of each layer of the system. Many publications deal with such optimisations, mostly concerning only one layer of the system (Glökler et al., 2003; Benini and de Micheli, 2000; Tiwari et al., 1998; Su et al., 1994; Gschwind, 1999).
Optimisation influence parameter
Looking at the different layers of the system, three factors can be identified that describe the effort of an optimisation, depending on the layer on which the optimisation is performed.
• Design effort. The design effort describes the effort which is necessary to implement a certain functionality on one of the system layers. Depending on the layer of the system, the implementation of the same functionality yields different complexity. First of all, the complexity is determined by the abstraction level one can work on. Furthermore, the complexity depends on the design tools and the development environment provided on the different layers. Finally, the effects caused on the other layers also have to be taken into account. Depending on the layer on which an optimisation has been implemented, other system layers may have to be changed as well. The effort of making such changes also depends heavily on the layer on which the actual optimisation has been implemented.
• Verification effort. The verification effort describes the effort which is necessary to verify the correctness of the system after an optimisation. It depends heavily on how well functionality can be visualised for the designer, and thus on the design tools and the development environment provided to the designer. The simulation time on the different layers also contributes to the verification effort.
• Effort of parameter estimation. This factor describes how simply and how accurately the parameters chip area, performance and power can be estimated. This depends on how easily the changed functionality can be separated from the remaining system and how easily indicators for the estimation of the parameters can be found and analysed. The accuracy of the estimation is determined by how strongly the indicators found influence the real parameters.
Optimisation space on system layers
In this section the defined factors will be analysed for each layer of the system:
• Application layer. Changes on the application layer or the shifting of functionality into the application layer cause a rather low design effort due to the high abstraction level. Functionality is easily comprehended by the designer and thus is also easily changed. Due to the sophisticated development environment, the verification effort is also small. Usually it is possible to verify the functionality independently of the other system layers. Debugging is mostly facilitated by features of the development environment such as breakpoints. Indicators such as execution time, memory accesses, or loop counts can be easily found and analysed. The development environment also makes it easy to separate the desired functionality from the rest of the application. Due to the high abstraction level, however, high estimation accuracy can usually not be achieved.
• Operating system layer. Similar to the application layer, the design effort on this layer is rather low due to the high abstraction level, although the access to processor resources already lowers the abstraction level. Verification independent of the other system layers is typically impossible. Nevertheless, the verification effort is quite small, because similar development environments can be used by the designer. The usable indicators are also similar to those of the application layer. The possibility to access different resources of the hardware can increase the accuracy of the indicators.
• Assembly layer. On this layer the design effort is already considerable. Large pieces of functionality are hard for the designer to comprehend. That is why only small pieces of software are implemented directly on this layer. Due to the lack of design tools, the verification effort increases considerably. In most cases only the final result of an operation can be accessed, which complicates debugging considerably. It is also usually difficult to separate functionality from the rest of the system. This implies not only a higher verification effort, but also more difficult parameter estimation. Due to the lower abstraction level, the accuracy can be increased.
• Instruction set layer. Changes in the instruction set cause a high design effort. Not only the underlying hardware has to be adapted, but also all software layers including their tools such as compiler and linker. The verification effort increases dramatically as well, because a change in the instruction set causes the change of the behaviour of the whole system. Due to the necessary changes on all system layers, possible errors are hard to detect. In addition, the simulation time on this layer is considerably higher than on the software layers. Due to the changing behaviour of the system, the parameter estimation is complex.
• Logic layer. Despite the low abstraction level and the high complexity on this layer, the design effort is less than on the instruction set layer. This is mainly due to the different tools which can be used by the designer. This layer is separated from the rest of the system by the instruction set, and thus changes on this layer do not affect other layers. This is also the reason why the verification effort decreases. Logic can be simulated on different levels and many development environments offer debugging features similar to those of software development environments. Different tools also decrease the effort of the parameter estimation and increase the accuracy. However, the simulation time increases due to the low level of abstraction.
This analysis is visualised in Figure 11. In the software layers, the impact of the defined factors increases as the abstraction level becomes lower. Due to the high effort, the number of feasible optimisations decreases. This can also be expressed as a decrease in optimisation space. The instruction set layer offers the smallest optimisation space. In the underlying hardware the space increases again. By defining or eliminating instructions, the space is shifted in a vertical manner.
Evaluation and experimental results
The evaluation of the design methodology has been done in two steps. Firstly, all system components and design steps were evaluated separately. Secondly, the entire design methodology was used for the design of a multi-application smart card operating system, a Java Card™ virtual machine. Both a horizontal and a vertical HW/SW solution were developed to evaluate the usability for both target architectures. For this purpose, a functional platform model was developed, which acted as the reference model for the horizontal and vertical HW/SW co-design processes.
Vertical co-design
This section describes the evaluation of the proposed energy estimation and optimisation techniques at different abstraction layers of the vertical HW/SW co-design process.
Bus switching activity
The bus switching activity estimated on the functional-level and on the transaction-level platform is an example output of the vertical co-design flow. The results presented in Table 1 are intended to demonstrate that optimisations performed at functional level have the same impact on the transaction-level platform. Hence, optimisations can be done at functional level, where the model is less complex, simulation performance is high, and verification can be done quickly and easily. The results in Table 1 show the number of memory accesses for all memory blocks during the execution of several test applets. The results are given for an un-optimised solution (the original Java Card™ model) and an optimised solution. In the optimised solution, the methods to access static fields and the firewall function to check the context have been modified. The target architecture contained a single processor element, an EEPROM block, a RAM block, and a single system bus. t_t gives the number of bus signal switches with regard to ground and t_e the switching activity relative to the adjacent lines. Both parameters are estimated for the address and data lines.
The higher number of memory accesses at transaction level is due to the usage of the transaction-level UART object, which performs data transmission byte by byte. Bus switching activity increases due to the merging of all memory and peripheral interfaces. The increase in total switching activity for the un-optimised and the optimised solution is 33.5% and 27.3%, respectively, based on the assumption that the capacitances C_e and C_t are equal. The optimisation reduces total switching activity by 6.9% at functional level and by 11.1% at transaction level.
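As an illustration of how the two switching parameters could be derived from a recorded sequence of bus values, consider the following Python sketch. It assumes a simple model that is not taken from the tool flow described here: the number of toggles with regard to ground (t_t in the code) counts bit changes of each line between consecutive bus values, and the activity relative to adjacent lines (t_e in the code) counts changes in the relative state of neighbouring lines. The function names and the capacitance weighting are hypothetical.

```python
def self_toggles(prev, curr):
    """Number of bus lines that change value between two consecutive words."""
    return bin(prev ^ curr).count("1")


def coupling_toggles(prev, curr, width):
    """Number of neighbouring line pairs whose relative state changes."""
    mask = (1 << (width - 1)) - 1
    # Bit i of x ^ (x >> 1) is 1 when line i and line i + 1 differ.
    rel_prev = (prev ^ (prev >> 1)) & mask
    rel_curr = (curr ^ (curr >> 1)) & mask
    return bin(rel_prev ^ rel_curr).count("1")


def switching_activity(values, width):
    """Accumulate (t_t, t_e) over a sequence of bus values."""
    t_t = t_e = 0
    for prev, curr in zip(values, values[1:]):
        t_t += self_toggles(prev, curr)
        t_e += coupling_toggles(prev, curr, width)
    return t_t, t_e


def total_activity(t_t, t_e, c_t=1.0, c_e=1.0):
    """Capacitance-weighted total; equal weights model equal capacitances."""
    return c_t * t_t + c_e * t_e


if __name__ == "__main__":
    # Hypothetical 4-bit bus trace.
    t_t, t_e = switching_activity([0b0000, 0b0101, 0b1111], width=4)
    print(t_t, t_e, total_activity(t_t, t_e))
```

With equal weights the total is simply the sum of both counts, which corresponds to the assumption of equal capacitances made for the percentages above.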
Co-processor interface optimisation
The usage of the proposed design methodology to optimise data interfaces can be demonstrated for a Universal Asynchronous Receiver/Transmitter (UART). The API object used has three memory-mapped registers:
• a control register (CTRL)
• a receiver buffer register (RX)
• a transmit buffer register (TX).
Two different address mappings were used. The first map coded the registers TX, RX, and CTRL with binary 0, 1, and 2, respectively. The second map coded TX, RX, and CTRL with 3, 0, and 1, respectively. The two mappings were applied to an 8-bit and a 16-bit interface configuration. The results are shown in Table 2.
Table 2 Exploration of register address assignment for a UART interface (based on the vertical co-design flow)
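The effect of such an address assignment on the address lines can be sketched by counting bit changes between consecutive register addresses. The Python example below uses the two mappings given above; the access sequence (polling CTRL and then reading RX) and the function name are assumptions made for illustration and do not reproduce the measurements of Table 2.

```python
def address_toggles(accesses, mapping):
    """Count address-line bit changes for a sequence of register accesses."""
    addresses = [mapping[name] for name in accesses]
    return sum(bin(a ^ b).count("1") for a, b in zip(addresses, addresses[1:]))


# The two register address mappings described in the text.
MAP_1 = {"TX": 0, "RX": 1, "CTRL": 2}
MAP_2 = {"TX": 3, "RX": 0, "CTRL": 1}

if __name__ == "__main__":
    # Hypothetical receive loop: poll the control register, then read a byte.
    receive = ["CTRL", "RX"] * 4
    print("map 1:", address_toggles(receive, MAP_1))
    print("map 2:", address_toggles(receive, MAP_2))
```

For this particular receive pattern the second mapping places the two alternating registers at addresses that differ in a single bit, so it causes fewer toggles. Other access patterns can favour a different assignment, which is why the mapping is explored rather than fixed in advance.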
Horizontal co-design
This section describes the evaluation of the horizontal HW/SW co-design methodology. For this purpose, the Java Card™ functional platform model was implemented as an application specific instruction-set processor with several software layers. The horizontal co-design approach uses the same initial functional platform model as the vertical co-design approach. Starting from the functional platform model, the first step was to define and implement the hardware architecture and the software interfaces between the different languages and levels, and to implement the required development tools. The remaining hardware and software layers were implemented afterwards. Several transformations were performed on the resulting system to explore the design space. This section discusses the results at different layers. The first subsection describes the optimisation by moving functionality between layers, the second discusses the extension of the instruction set, the third shows hardware optimisation results, and the fourth summarises all results.
Optimisation at different software layers
This optimisation technique was applied to functions which are used to initialise, to clear, or to copy transient arrays. These methods are often used during normal operation of the card. Array accesses always perform a security check, have to check for exceptions, have to calculate the address of the addressed data field in the volatile memory, and have to access the memory. In the original implementation this operation was performed by a simple Java loop, which executed all the operations discussed above.
The first transformation was done by using native functions to access the memory. All the security checks and the address calculations were therefore done only for the first memory access. Afterwards, native functions write the data values by direct memory access. All arrays considered for this optimisation are byte arrays, and therefore word accesses were used to reduce the number of memory accesses. But again, a switch between assembly code and Java is necessary during each native method call. This drawback is eliminated by the fourth solution, which implements the entire functionality in Assembly, as depicted in Figure 12. Then only a single native method call is necessary and the entire loop is executed in Assembly. The presented techniques were evaluated with two test applets and the results are presented in Table 3. The performance figures show the total execution times of the optimised solutions compared to the execution time of the original Java solution.
Table 3 Performance results for moving functionality between layers for array access methods
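The difference between a checked per-element loop and a single checked call that writes word-wise can be illustrated with a small cost model. The following Python sketch is not the Java Card implementation; the `Memory` class, the two-byte word size, the aligned base address, and the way checks are counted are assumptions made for illustration.

```python
class Memory:
    """Toy memory that counts write accesses."""

    def __init__(self, size):
        self.cells = bytearray(size)
        self.accesses = 0

    def write_byte(self, addr, value):
        self.cells[addr] = value
        self.accesses += 1

    def write_word(self, addr, value, word_bytes):
        # One access stores `word_bytes` copies of the fill value.
        self.cells[addr:addr + word_bytes] = bytes([value]) * word_bytes
        self.accesses += 1


def fill_per_element(mem, base, length, value):
    """Loop with a check per element; returns the number of checks."""
    checks = 0
    for i in range(length):
        checks += 1  # security check, exception check, address calculation
        mem.write_byte(base + i, value)
    return checks


def fill_native(mem, base, length, value, word_bytes=2):
    """Single check, then word-wise writes with a byte-wise remainder."""
    checks = 1
    full, rest = divmod(length, word_bytes)
    for w in range(full):
        mem.write_word(base + w * word_bytes, value, word_bytes)
    for i in range(rest):
        mem.write_byte(base + full * word_bytes + i, value)
    return checks


if __name__ == "__main__":
    slow, fast = Memory(16), Memory(16)
    print(fill_per_element(slow, 0, 9, 0xFF), slow.accesses)
    print(fill_native(fast, 0, 9, 0xFF), fast.accesses)
```

Both variants leave the same memory content. The model only makes visible where checks and memory accesses are saved; it says nothing about the cost of switching between Java and native code.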
Instruction set extension
The basic hardware architecture supported only word access to the memory due to the usage of words within all data structures. Data structures were changed to also use bytes to reduce the memory footprint of arrays and some parts of the control data structures. Therefore instructions to access data fields byte-wise were introduced. Byte access was not supported by the memory system. Also instructions were fetched by 32-bit accesses to get all operands with the instruction word. The usage of the byte-access instructions increases the performance because masking of bytes within words is not necessary any more. The total performance of the test applications was increased by about 5%. The memory footprint for byte arrays was cut in half and the additional hardware requirements were insignificant. Other instruction-set extensions were necessary to increase hardware functionality but did not have any impact on performance.
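The masking that becomes unnecessary with byte-access instructions can be sketched as follows. The Python example models a word-only memory as a list of 16-bit words and shows the shift and mask steps that software has to perform to read or write a single byte. The 16-bit word size, the byte order within a word, and the function names are assumptions made for illustration.

```python
WORD_BYTES = 2      # assumed word size in bytes
WORD_MASK = 0xFFFF  # assumed 16-bit words


def load_byte(words, byte_addr):
    """Read one byte from word-only memory by shifting and masking."""
    index, lane = divmod(byte_addr, WORD_BYTES)
    return (words[index] >> (8 * lane)) & 0xFF


def store_byte(words, byte_addr, value):
    """Write one byte: read the word, clear the byte lane, merge, write back."""
    index, lane = divmod(byte_addr, WORD_BYTES)
    shift = 8 * lane
    cleared = words[index] & ~(0xFF << shift) & WORD_MASK
    words[index] = cleared | ((value & 0xFF) << shift)


if __name__ == "__main__":
    memory = [0x1234]
    store_byte(memory, 1, 0xAB)
    print(hex(memory[0]), hex(load_byte(memory, 0)), hex(load_byte(memory, 1)))
```

A dedicated byte-access instruction performs the lane selection in hardware, so the shift, mask, and merge operations no longer appear in the instruction stream.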
Hardware optimisations
The optimisation of the hardware was done to decrease chip size and to reduce power dissipation. The latter typically increases chip size, because additional guarding logic is required. The largest parts of the core were the Memory Management Unit (MMU) with 30% of total chip size, the Arithmetic Logical Unit (ALU) with 29%, the instruction decoder with 19%, and the stack control unit with 8%. The hardware stack is not considered due to its fixed size. The energy is dissipated mainly by the ALU (48%), the MMU (43%), and the instruction decoder (7%).
The MMU contains 16 registers of 16 bits each, of which 12 are used as segment registers for address space extension and as compare registers. The other four registers are general purpose registers used by software. These four registers cannot be changed because they are necessary as temporary storage during stack frame manipulations. The other 12 registers, however, were reduced to their minimal possible size. This reduces chip size but also reduces flexibility. Guarding techniques were used to keep switching activity away from functional units. For this purpose, latches were placed in front of the components; these latches are not transparent as long as the component is not used. This technique costs additional chip area and can therefore even increase total energy dissipation if the blocked functional unit is not large enough. The results for hardware optimisation are summarised in Table 4.
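The trade-off behind guarding can be made explicit with a very coarse energy model. The Python sketch below assumes that an unguarded unit dissipates a fixed energy in every cycle, that a guarded unit dissipates this energy only in the cycles in which it is used, and that the guard latches add a fixed energy in every cycle. The model, the parameter names, and the numbers are hypothetical and are not results from Table 4.

```python
def energy_without_guard(cycles, e_unit):
    """Unguarded unit: its inputs toggle in every cycle."""
    return cycles * e_unit


def energy_with_guard(cycles, used_cycles, e_unit, e_latch):
    """Guarded unit: active only when used, plus latch overhead every cycle."""
    return used_cycles * e_unit + cycles * e_latch


def guarding_pays_off(cycles, used_cycles, e_unit, e_latch):
    """True if guarding lowers the total energy in this simple model."""
    guarded = energy_with_guard(cycles, used_cycles, e_unit, e_latch)
    return guarded < energy_without_guard(cycles, e_unit)


if __name__ == "__main__":
    # Hypothetical energy units per cycle.
    print(guarding_pays_off(cycles=100, used_cycles=20, e_unit=10, e_latch=1))
    print(guarding_pays_off(cycles=100, used_cycles=20, e_unit=1, e_latch=1))
```

In this model guarding only helps when the energy saved in idle cycles exceeds the latch overhead, which reflects the observation that small functional units may not benefit.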
Entire design space
All minima and maxima for performance, energy, and chip size optimisation are combined to obtain the entire design space for the considered Java Card solution. All results are generated with dedicated test applets. The minimum area solution combines the minimum size MMU, the ALU and decoder without guarding techniques, and the optimised stack controller. The minimum energy solution combines all results for energy optimisation. The high performance solution is mainly driven by moving functionality between software layers. Figure 13 depicts the design space. Note that the different optimisations cannot all be achieved at the same time. The designer has to decide which optimisation will be implemented according to the requirements.
Summary
The usage of the proposed design methodology was compared to an equivalent implementation using only standard design processes. The main results are as follows:
• Design abstraction and the usage of abstract models reduce design complexity and lead to more highly optimised systems due to a stronger focus on algorithm design.
• Functional test and verification can be done early in the design process before introducing architecture details.
• System simulation times can be neglected.
• Moving down the abstraction pyramid following the platform approach minimises the risk of missing design constraints.
• Virtual prototypes are used as golden model for hardware and software implementation which reduces system integration risks. The integration of gate-level HW design, assembly layer, OS and application was done for the Java Card system first-time-right.
• Abstract design models can be used for derivative development, which guarantees functional equivalence over a set of implementations with different design constraints.
Conclusion and future work
This paper has presented a design methodology for the design of resource limited embedded devices and demonstrated it especially for smart cards. The design methodology takes advantage of system abstraction and attached vertical and horizontal co-design flows for the design of
• a system consisting of a standard processor and an ASIC
• a system consisting of software and the underlying processor.
The proposed design methodology was evaluated for a Java Card™ virtual machine implementation. The evaluation of the proposed methodology will be continued in further projects to gain experience with purely industrial projects, like the future e-passport (ICAO Standard for ePassports, http://www.icao.int/mrtd/).
The key challenge of the follow-up project "Power Profile Embossed Software Optimisation for Mobile Devices and Smart Card Systems" is the optimisation of the software with the goal of minimising the consumed power and the current peaks, so that the current consumption is better adapted to the source of energy. This can only be done by knowing the current profile. To avoid extensive real-time measurements, the profile has to be obtained by performing an energy characterisation of the processor used and simulating the execution of the software. The obtained current profile has to be abstracted and analysed to extract the desired properties. Based on these properties, optimisation algorithms based on software transformations at source code level and at instruction level can be implemented to achieve the goals.
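The idea of obtaining a current profile from an energy characterisation and a simulated execution can be sketched as follows. The Python example assumes a per-instruction current table, which stands in for the characterisation, and an executed instruction trace, which stands in for the simulation. The table values, the mnemonics, and the function names are hypothetical.

```python
def current_profile(trace, current_per_instruction):
    """Map each executed instruction to its characterised current value."""
    return [current_per_instruction[op] for op in trace]


def peak_and_mean(profile):
    """Two simple properties extracted from a current profile."""
    return max(profile), sum(profile) / len(profile)


def moving_average(profile, window):
    """Abstract the profile by averaging over a sliding window."""
    return [
        sum(profile[i:i + window]) / window
        for i in range(len(profile) - window + 1)
    ]


if __name__ == "__main__":
    # Hypothetical characterisation: current per instruction in arbitrary units.
    table = {"load": 3.0, "mul": 5.0, "add": 1.0}
    profile = current_profile(["load", "mul", "add", "add"], table)
    print(profile, peak_and_mean(profile), moving_average(profile, 2))
```

A software transformation that reorders or replaces instructions can then be judged by how it changes the peak and the averaged profile before any measurement on real hardware is made.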
