
    Synthesis of application specific processor architectures for ultra-low energy consumption

    In this paper we suggest that further energy savings can be achieved by a new approach to the synthesis of embedded processor cores, where the architecture is tailored to the algorithms that the core executes. In the context of embedded processor synthesis, both single-core and many-core, the types of algorithms and the demands on execution efficiency are usually known at chip design time. This knowledge can be utilised at the design stage to synthesise architectures optimised for energy consumption. Firstly, we present an overview of both traditional energy saving techniques and new developments in architectural approaches to energy-efficient processing. Secondly, we propose the picoMIPS architecture, which serves as an architectural template for energy-efficient synthesis. As a case study, we show how the picoMIPS architecture can be tailored for energy-efficient execution of the DCT algorithm.
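    For illustration only (not code from the paper), the 8-point one-dimensional DCT-II below shows the kind of small, fixed, multiply-accumulate-dominated kernel that such a tailored datapath could be synthesised around; the coefficients are known at design time, so the hardware can be specialised to them.

        #include <math.h>

        /* Illustrative 8-point 1-D DCT-II in floating point.  The paper's
         * case study would use a specialised fixed-point datapath; this
         * sketch only shows the regular multiply-accumulate structure that
         * such a datapath can be tailored to. */
        void dct8(const double x[8], double X[8])
        {
            const double PI = 3.14159265358979323846;
            for (int k = 0; k < 8; k++) {
                double acc = 0.0;
                for (int n = 0; n < 8; n++)
                    acc += x[n] * cos((2 * n + 1) * k * PI / 16.0);
                X[k] = (k == 0 ? sqrt(1.0 / 8.0) : sqrt(2.0 / 8.0)) * acc;
            }
        }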

    Design and implementation of robust embedded processor for cryptographic applications

    Practical implementations of cryptographic algorithms are vulnerable to side-channel analysis and fault attacks. Thus, some masking and fault detection algorithms must be incorporated into these implementations. These additions further increase the complexity of cryptographic devices, which already need to perform computationally intensive operations. Therefore, general-purpose processors are usually supported by coprocessors/hardware accelerators to protect as well as to accelerate cryptographic applications. Using a configurable processor is another solution. This work designs and implements robust execution units, as an extension to a configurable processor, that detect data faults (adversarial or otherwise) while performing arithmetic operations. Assuming a capable adversary who can inject faults into the cryptographic computation with high precision, a nonlinear error detection code with high error detection capability is used. The designed units are tightly integrated into the datapath of the configurable processor using its tool chain. For different configurations, we report the increase in the space and time complexities of the configurable processor. We also present performance evaluations of the software implementations using the robust execution units. Implementation results show that it is feasible to implement robust arithmetic units with relatively low overhead in an embedded processor.
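    As a simplified sketch (an assumption for illustration, not the nonlinear code used in this work), a residue-checked addition shows the general structure of such a robust execution unit: the operation is computed in the main datapath and predicted in a small check domain, and a mismatch flags a fault.

        #include <stdbool.h>
        #include <stdint.h>

        #define CHECK_MOD 251u   /* check modulus chosen for the sketch */

        /* Residue-checked addition: the main result and a check-domain
         * prediction are computed independently and compared.  This is a
         * simple linear residue check; the work described above uses a
         * nonlinear code with stronger detection guarantees against a
         * precise fault-injecting adversary. */
        bool checked_add(uint32_t a, uint32_t b, uint64_t *sum)
        {
            *sum = (uint64_t)a + (uint64_t)b;                       /* main datapath */
            uint32_t pred = (a % CHECK_MOD + b % CHECK_MOD) % CHECK_MOD;
            return (uint32_t)(*sum % CHECK_MOD) == pred;            /* true = no fault detected */
        }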

    Viterbi Accelerator for Embedded Processor Datapaths

    We present a novel architecture for a lightweight Viterbi accelerator that can be tightly integrated inside an embedded processor. We investigate the accelerator’s impact on processor performance by using the EEMBC Viterbi benchmark and the in-house Viterbi Branch Metric kernel. Our evaluation based on the EEMBC benchmark shows that an accelerated 65-nm 2.7-ns processor datapath is 20% larger but 90% more cycle-efficient than a datapath lacking the Viterbi accelerator, leading to an 87% overall energy reduction and a data throughput of 3.52 Mbit/s.
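    As a rough illustration of the work such an accelerator offloads (not the EEMBC or in-house kernel itself), the branch-metric step of a soft-decision Viterbi decoder for a rate-1/2 code reduces to a short subtract/absolute-value/add sequence evaluated for every trellis branch.

        #include <stdint.h>
        #include <stdlib.h>

        /* Soft-decision branch metric for one trellis branch of a rate-1/2
         * convolutional code: distance between the two received soft symbols
         * and the two symbols expected on that branch.  A Viterbi accelerator
         * evaluates this small kernel for all branches of every trellis step. */
        static inline uint32_t branch_metric(int16_t rx0, int16_t rx1,
                                             int16_t exp0, int16_t exp1)
        {
            return (uint32_t)abs(rx0 - exp0) + (uint32_t)abs(rx1 - exp1);
        }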

    Thermoelectric energy harvester with a cold start of 0.6 °C

    This paper presents the electrical and thermal design of a thermoelectric energy harvester power system and its characterisation. The energy harvester is powered by a single Thermoelectric Generator (TEG) of 449 couples connected via a power conditioning circuit to an embedded processor. The aim of the work presented in this paper is to experimentally confirm the lowest ΔT measured across the TEG (ΔT_TEG) at which the embedded processor operates to allow for wireless communication. The results show that when a temperature difference ΔT_TEG of 0.6 °C is applied across the thermoelectric module, an input voltage of 23 mV is generated, which is sufficient to activate the energy harvester in approximately 3 minutes. An experimental setup able to accurately maintain and measure very low temperature differences is described, along with the electrical power generated by the TEG at these temperatures. It was found that the energy harvester power system can deliver up to 30 mA of current at 2.2 V in 3 ms pulses for over a second. This is sufficient for wireless broadcast, communication and powering of other sensor devices. The successful operation of the wireless harvester at such low temperature gradients offers many new application areas for the system, including those powered by environmental sources and body heat.
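    Taking the figures above at face value and assuming a roughly rectangular pulse, the energy available in each output pulse works out to about 0.2 mJ, which indicates the energy budget available per wireless transmission:

        E_{\text{pulse}} \approx V I t = 2.2\,\mathrm{V} \times 30\,\mathrm{mA} \times 3\,\mathrm{ms} \approx 0.2\,\mathrm{mJ}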

    Storage constraint satisfaction for embedded processor compilers

    Increasing interest in the high-volume, high-performance embedded processor market motivates the stand-alone processor world to consider issues like design flexibility (synthesizable processor core), energy consumption, and silicon efficiency. Implications for embedded processor architectures and compilers are the exploitation of hardware acceleration, instruction-level parallelism (ILP), and distributed storage files. In that scope, VLIW architectures have been acclaimed for their parallelism in the architecture while orthogonality of the associated instruction sets is maintained. Code generation methods for such processors will be pressured towards an efficient use of scarce resources while satisfying tight real-time constraints imposed by DSP and multimedia applications. Limited storage availability (e.g. registers) poses a problem for traditional methods that perform code generation in separate stages, e.g. operation scheduling followed by register allocation. This is because the objectives of scheduling and register allocation cause conflicts in code generation in several ways. Firstly, register reuse can create dependencies that did not exist in the original code, but it can also avoid spilling values to memory. Secondly, while a particular ordering of instructions may increase the potential for ILP, the reordering due to instruction scheduling may also extend the lifetime of certain values, which can increase the register requirement. Furthermore, the instruction scheduler requires an adequate number of local registers to avoid register reuse (since reuse limits the opportunity for ILP), while the register allocator would prefer sufficient global registers in order to avoid spills. Finally, an effective scheduler can lose its achieved degree of instruction-level parallelism when spill code is inserted afterwards. Without any communication of information and cooperation between the scheduling and storage allocation phases, the compiler writer faces the problem of determining which of these phases should run first to generate the most efficient final code. The lack of communication and cooperation between instruction scheduling and storage allocation can result in code that contains an excess of register spills and/or a lower degree of ILP than is actually achievable. This problem, called phase coupling, cannot be ignored when constraints are tight and efficient solutions are desired. Traditional methods that perform code generation in separate stages are often not able to find an efficient or even a feasible solution. Therefore, those methods need an increasing amount of help from the programmer (or designer) to arrive at a feasible solution. Because this requires an excessive amount of design time and extensive knowledge of the processor architecture, there is a need for automated techniques that can cope with the different kinds of constraints during scheduling. This thesis proposes an approach for instruction scheduling and storage allocation that makes extensive use of timing, resource, and storage constraints to prune the search space for scheduling. The method in this approach supports VLIW architectures with (distributed) storage files containing random-access registers, rotating registers to exploit the available ILP in loops, and stacks or FIFOs to exploit larger storage capacities at lower addressing cost. Potential access conflicts between values are analyzed before and during scheduling, according to the type of storage they are assigned to. 
    Using constraint analysis techniques and properties of colored conflict graphs, essential information is obtained to identify the bottlenecks for satisfying the storage file constraints. To reduce the identified bottlenecks, this method performs partial scheduling by ordering value accesses so as to allow better reuse of storage. Without enforcing any specific storage assignment of values, the method continues until it can guarantee that any completion of the partial schedule will also result in a feasible storage allocation. Therefore, the scheduling freedom is exploited for satisfaction of storage, resource, and timing constraints in one phase.
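    A minimal example (not taken from the thesis) of the phase-coupling conflict described above: two independent additions can issue in parallel only if they are allocated distinct registers, while reusing one register for both values serialises them, and keeping both values live instead raises register pressure and may force a spill.

        /* Minimal phase-coupling illustration: t1 and t2 are independent, so
         * a scheduler can overlap the two additions -- but only if the
         * register allocator gives them distinct registers.  Reusing one
         * register for both introduces a write-after-read dependence and
         * serialises the additions; keeping both live raises register
         * pressure and may cause a spill. */
        int phase_coupling_example(int a, int b, int c, int d)
        {
            int t1 = a + b;   /* independent of t2 */
            int t2 = c + d;   /* independent of t1 */
            return t1 * t2;   /* both values must stay live until here */
        }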

    A Virtual Testbed for Embedded Systems

    Hardware-In-the-Loop (HIL) simulation is a simulation approach in which a hardware embedded processor is connected to a simulation computer that simulates the electrical/mechanical devices controlled by the embedded processor. By using a real-time simulation computer and special-purpose hardware for connecting to the embedded processor, this method of simulation can be very precise, but it is costly. We propose an alternative method, HIL simulation with a network link, in which the device under test (the embedded processor) communicates with the simulation computer over a network connection (in our case a serial line) instead of through special-purpose hardware. We present an abstraction layer that facilitates the simulation of external devices. An earlier prototype had been developed for a 16-bit TMS320LF2407A DSP from Texas Instruments. We generalized the approach to the more advanced 32-bit TMS320F28335 DSP. We made changes to the DSP abstraction layer to enable more features and provide more flexibility to the programmer. For example, we introduced a shadow interrupt vector to make the simulation layer more general. We developed various scenarios to measure the performance of the system. In particular, we measure round-trip time and throughput for the communication between the simulator and the DSP. We also rewrote the serial-line drivers on the DSP to incorporate different working scenarios and to use the DSP's timers for measuring execution time. Our work helps to judge the performance of the system and to identify the application domains for this approach.
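    For reference, a host-side round-trip measurement of the kind reported above can be sketched as follows (a POSIX sketch for illustration only; the device path is an assumption, port configuration is omitted, and the testbed's own host code is not shown here): send a byte to the DSP, wait for the echo, and time the exchange.

        #include <fcntl.h>
        #include <stdio.h>
        #include <time.h>
        #include <unistd.h>

        /* Illustrative host-side round-trip timing over a serial line:
         * write one byte, wait for the echoed byte, report the delay.
         * Baud rate and other termios settings are omitted for brevity. */
        int main(void)
        {
            int fd = open("/dev/ttyS0", O_RDWR | O_NOCTTY);   /* device path is an assumption */
            if (fd < 0) { perror("open"); return 1; }

            unsigned char out = 0x55, in = 0;
            struct timespec t0, t1;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            write(fd, &out, 1);                    /* request to the DSP */
            while (read(fd, &in, 1) != 1)          /* wait for the echo  */
                ;
            clock_gettime(CLOCK_MONOTONIC, &t1);

            printf("round-trip time: %.1f us\n",
                   (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3);
            close(fd);
            return 0;
        }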

    An architecture for adaptive real time communication with embedded devices

    The virtual testbed is designed to be a cost-effective rapid development environment as well as a teaching tool for embedded systems. Teaching and development of embedded systems otherwise requires dedicated real-time operating systems and costly infrastructure for hardware simulation. Writing control software for embedded systems with such a setup takes prolonged development cycles. Moreover, actual hardware may get damaged while writing the control software. In contrast, in a virtual testbed environment, a simulator running on the host machine is used instead of the actual hardware, which then interacts with an embedded processor through serial communication. This hardware-in-the-loop setup reduces development time drastically but is reliable only if it behaves as close to real time as possible. Use of a non-real-time operating system like Windows NT on the host machine, together with the Win32 API, causes overhead in the serial communication that slows down the simulator. The problem is that the simulator is unable to cope with the communication speeds offered by the embedded processor. We propose the development of a kernel-mode device driver that overcomes inefficiencies in the Win32 API. The result is faster communication between the simulator and the embedded processor. Another problem that arises with an increase in the simulator's communication capabilities is whether the operating system can support such a dynamic and high-speed interaction. To solve this problem we propose efficient process and thread management, use of Windows NT's support for real-time execution, and intelligent buffer and interrupt handling to process the high-frequency requests coming from the embedded processor to the host machine. Another hurdle is the diverse nature of the hardware that is being simulated: from simple features with low data volume to fairly complex features with high data volume, and with data rates ranging from very low to very high. Hence, we propose to make the simulator and the kernel-mode device driver adaptive. All these strategies culminate in an architecture for adaptive real-time communication with the embedded processor, giving the virtual testbed an edge over other design methodologies for embedded systems.
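    As a hypothetical sketch of the adaptive policy described above (the function and its thresholds are assumptions, not the thesis' driver code), the driver can decide how much data to accumulate before waking the simulator based on the observed data rate, trading latency against per-transfer overhead.

        #include <stddef.h>

        /* Illustrative adaptive batching policy: low-rate devices are
         * delivered byte by byte for low latency, higher-rate devices are
         * batched to reduce per-transfer and interrupt-handling overhead.
         * Thresholds are placeholder values for the sketch. */
        size_t choose_batch_size(double bytes_per_second)
        {
            if (bytes_per_second < 1e3) return 1;    /* simple, low-volume features   */
            if (bytes_per_second < 5e4) return 64;   /* moderate data rates           */
            return 512;                              /* complex, high-volume features */
        }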