Face Detection on Embedded Systems
Over recent years, automated face detection and recognition (FDR) have gained significant attention from the commercial and research sectors. This paper presents an embedded face detection solution aimed at addressing the real-time image processing requirements of a wide range of applications. As face detection is a computationally intensive task, an embedded solution opens up opportunities for discrete, economical devices that could be integrated into a vast range of applications. This work focuses on the use of FPGAs as the embedded prototyping technology, where the thread of execution is carried out on an embedded soft-core processor. Custom instructions have been utilized as a means of software/hardware partitioning, through which the computational bottlenecks are moved to hardware. A speedup factor of 110 was achieved by employing custom instructions and software optimizations.
Towards extending the SWITCH platform for time-critical, cloud-based CUDA applications: Job scheduling parameters influencing performance
SWITCH (Software Workbench for Interactive, Time Critical and Highly self-adaptive cloud applications) allows for the development and deployment of real-time applications in the cloud, but it does not yet support instances backed by Graphics Processing Units (GPUs). Wanting to explore how SWITCH might support CUDA (a GPU architecture) in the future, we have undertaken a review of time-critical CUDA applications, discovering that run-time requirements (which we call 'wall time') are in many cases regarded as the most important. We have performed experiments to investigate which parameters have the greatest impact on wall time when running multiple Amazon Web Services GPU-backed instances. Although a maximum of 8 single-GPU instances can be launched in a single Amazon Region, launching just 2 instances rather than 1 gives a 42% decrease in wall time. Also, instances are often wasted doing nothing, and there is a moderately strong relationship between how problems are distributed across instances and wall time. These findings can be used to enhance the SWITCH provision for specifying Non-Functional Requirements (NFRs); in the future, GPU-backed instances could be supported. These findings can also be used more generally, to optimise the balance between the computational resources needed and the resulting wall time to obtain results.
Compiler-Driven Power Optimizations in the Register File of Processor-Based Systems
The complexity of the register file is currently one of the main factors in determining the cycle time of high-performance wide-issue microprocessors, due to its access time and size. Both parameters are directly related to the number of read and write ports of the register file and can be managed at the compilation level. It is therefore a priority to reduce this complexity in order to allow the efficient implementation of complex superscalar machines. This work presents a modified register assignment and a banked architecture which efficiently reduce the number of required ports. The effect of the loop unrolling optimization performed by the compiler is also analyzed, and several power-efficient modifications to this mechanism are proposed. Both the register assignment and loop unrolling mechanisms are modified to improve energy savings while avoiding a severe performance impact.