38 research outputs found

    Face Detection on Embedded Systems

    Get PDF
    Over recent years automated face detection and recognition (FDR) have gained significant attention from the commercial and research sectors. This paper presents an embedded face detection solution aimed at addressing the real-time image processing requirements within a wide range of applications. As face detection is a computationally intensive task, an embedded solution would give rise to opportunities for discrete economical devices that could be applied and integrated into a vast majority of applications. This work focuses on the use of FPGAs as the embedded prototyping technology where the thread of execution is carried out on an embedded soft-core processor. Custom instructions have been utilized as a means of applying software/hardware partitioning through which the computational bottlenecks are moved to hardware. A speedup by a factor of 110 was achieved from employing custom instructions and software optimizations

    Towards extending the SWITCH platform for time-critical, cloud-based CUDA applications: Job scheduling parameters influencing performance

    Get PDF
    SWITCH (Software Workbench for Interactive, Time Critical and Highly self-adaptive cloud applications) allows for the development and deployment of real-time applications in the cloud, but it does not yet support instances backed by Graphics Processing Units (GPUs). Wanting to explore how SWITCH might support CUDA (a GPU architecture) in the future, we have undertaken a review of time-critical CUDA applications, discovering that run-time requirements (which we call ‘wall time’) are in many cases regarded as the most important. We have performed experiments to investigate which parameters have the greatest impact on wall time when running multiple Amazon Web Services GPU-backed instances. Although a maximum of 8 single-GPU instances can be launched in a single Amazon Region, launching just 2 instances rather than 1 gives a 42% decrease in wall time. Also, instances are often wasted doing nothing, and there is a moderately-strong relationship between how problems are distributed across instances and wall time. These findings can be used to enhance the SWITCH provision for specifying Non-Functional Requirements (NFRs); in the future, GPU-backed instances could be supported. These findings can also be used more generally, to optimise the balance between the computational resources needed and the resulting wall time to obtain results

    Compiler-Driven Power Optimizations in the Register File of Processor-Based Systems

    Get PDF
    The complexity of the register file is currently one of the main factors on determining the cycle time of high performance wide-issue microprocessors due to its access time and size. Both parameters are directly related to the number of read and write ports of the register file and can be managed from a code compilation-level. Therefore, it is a priority goal to reduce this complexity in order to allow the efficient implementation of complex superscalar machines. This work presents a modified register assignment and a banked architecture which efficiently reduce the number of required ports. Also, the effect of the loop unrollling optimization performed by the compiler is analyzed and several power-efficient modifications to this mechanism are proposed. Both register assignment and loop unrolling mechanisms are modified to improve the energy savings while avoiding a hard performance impact
    corecore