28 research outputs found

    Extending OpenVX for Model-based Design of Embedded Vision Applications

    Developing computer vision applications for low-power heterogeneous systems is gaining increasing interest in the embedded systems community. Even more interesting is the tuning of such embedded software for the target architecture when this is driven by multiple constraints (e.g., performance, peak power, energy consumption). Indeed, developers frequently run into system-level inefficiencies and bottlenecks that cannot be quickly addressed by traditional methods. In this context, OpenVX has been proposed as the standard platform to develop portable, optimized, and power-efficient applications for vision algorithms targeting embedded systems. Nevertheless, adopting OpenVX for rapid prototyping, early algorithm parametrization, and validation of complex embedded applications is a very challenging task. This paper presents a methodology to integrate a model-based design environment with OpenVX. The methodology allows applying Matlab/Simulink for the model-based design, parametrization, and validation of computer vision applications. Then, it allows for the automatic synthesis of the application model into an OpenVX description for the hardware and constraints-aware application tuning. Experiments have been conducted with an application for digital image stabilization developed through Simulink and then automatically synthesized into OpenVX-VisionWorks code for an NVIDIA Jetson TX1 board.

    A model-based design flow for embedded vision applications on heterogeneous architectures

    The ability to gather information from images is straightforward for humans and is one of the principal inputs for understanding the external world. Computer vision (CV) is the process of extracting such knowledge from the visual domain in an algorithmic fashion. The computational power required to process this information is very high. Until recently, the only feasible way to meet non-functional requirements like performance was to develop custom hardware, which is costly, time-consuming, and cannot be reused for general-purpose computing. The recent introduction of low-power and low-cost heterogeneous embedded boards, in which CPUs are combined with heterogeneous accelerators like GPUs, DSPs, and FPGAs, can combine the hardware efficiency needed for non-functional requirements with the flexibility of software development. Embedded vision is the term used to identify the application of the aforementioned CV algorithms in the embedded field, which usually must satisfy not only functional requirements but also non-functional requirements such as real-time performance, power, and energy efficiency. Rapid prototyping, early algorithm parametrization, testing, and validation of complex embedded video applications for such heterogeneous architectures is a very challenging task. This thesis presents a comprehensive framework that: 1) Is based on a model-based paradigm. Unlike the standard state-of-the-art approaches, which require designers to manually model the algorithm in a programming language, the proposed approach allows for rapid prototyping, algorithm validation, and parametrization in a model-based design environment (i.e., Matlab/Simulink). The framework relies on a multi-level design and verification flow by which the high-level model is semi-automatically refined towards the final automatic synthesis into the target hardware device. 2) Relies on a polyglot parallel programming model. 
The proposed model combines different programming languages and environments such as C/C++, OpenMP, PThreads, OpenVX, OpenCV, and CUDA to best exploit different levels of parallelism while guaranteeing a semi-automatic customization. 3) Optimizes the application performance and energy efficiency through a novel algorithm for the mapping and scheduling of the application tasks on the heterogeneous computing elements of the device. Such an algorithm, called exclusive earliest finish time (XEFT), takes into consideration the possible multiple implementations of tasks for different computing elements (e.g., a task primitive for CPU and an equivalent parallel implementation for GPU). It introduces and takes advantage of the notion of exclusive overlap between primitives to improve the load balancing. This thesis is the result of three years of research activity, during which all the incremental steps made to compose the framework have been tested on real case studies.

    Enhancing Performance of Computer Vision Applications on Low-Power Embedded Systems Through Heterogeneous Parallel Programming

    Enabling computer vision applications on low-power embedded systems gives rise to new challenges for embedded SW developers. Such applications implement different functionalities, like image recognition based on deep learning and simultaneous localization and mapping tasks. They are characterized by stringent performance constraints to guarantee real-time behaviors and, at the same time, energy constraints to save battery on the mobile platform. Even though heterogeneous embedded boards are becoming pervasive for their high computational power at low power costs, they need a time-consuming customization of the whole application (i.e., mapping of application blocks to CPU-GPU processing elements and their synchronization) to efficiently exploit their potential. Different languages and environments have been proposed for such an embedded SW customization. Nevertheless, they often run into limitations on complex real cases, as their application is mutually exclusive. This paper presents a comprehensive framework that relies on a heterogeneous parallel programming model, which combines OpenMP, PThreads, OpenVX, OpenCV, and CUDA to best exploit different levels of parallelism while guaranteeing a semi-automatic customization. The paper shows how such languages and API platforms have been interfaced, synchronized, and applied to customize an ORB-SLAM application for an NVIDIA Jetson TX2 board.

    Integrating Simulink, OpenVX, and ROS for Model-Based Design of Embedded Vision Applications

    OpenVX is increasingly gaining consensus as the standard platform to develop portable, optimized, and power-efficient embedded vision applications. Nevertheless, adopting OpenVX for rapid prototyping, early algorithm parametrization, and validation of complex embedded applications is a very challenging task. This paper presents a comprehensive framework that integrates Simulink, OpenVX, and ROS for model-based design of embedded vision applications. The framework allows applying Matlab/Simulink for the model-based design, parametrization, and validation of computer vision applications. Then, it allows for the automatic synthesis of the application model into an OpenVX description for the hardware and constraints-aware application tuning. Finally, the methodology allows integrating the OpenVX application with Robot Operating System (ROS), which is the de-facto reference standard for developing robotic software applications. The OpenVX-ROS interface allows co-simulating and parametrizing the application by considering the actual robotic environment, and reusing the application in any ROS-compliant system. Experiments have been conducted with two real case studies: an application for digital image stabilization and the ORB descriptor for simultaneous localization and mapping (SLAM), which have been developed through Simulink and then automatically synthesized into OpenVX-VisionWorks code for an NVIDIA Jetson TX2 board.

    Rapid Prototyping of Embedded Vision Systems: Embedding Computer Vision Applications into Low-Power Heterogeneous Architectures

    Embedded vision is a disruptive new technology in the vision industry. It is a revolutionary concept with far-reaching implications, and it is opening up new applications and shaping the future of entire industries. It is applied in self-driving cars, autonomous vehicles in agriculture, and digital dermascopes that help specialists make more accurate diagnoses, among many other unique and cutting-edge applications. The design of such systems gives rise to new challenges for embedded software developers. Embedded vision applications are characterized by stringent performance constraints to guarantee real-time behaviours and, at the same time, energy constraints to save battery on the mobile platforms. In this paper, we address such challenges by proposing an overall view of the problem and by analysing current solutions. We present our latest results on embedded vision design automation over two main aspects: the adoption of the model-based paradigm for the rapid prototyping of embedded vision, and the application of heterogeneous programming languages to improve the system performance. The paper presents our recent results on the design of a localization and mapping application combined with image recognition based on deep learning, optimized for an NVIDIA Jetson TX2.

    On the Task Mapping and Scheduling for DAG-based Embedded Vision Applications on Heterogeneous Multi/Many-core Architectures

    Embedded vision applications have stringent performance constraints that must be satisfied when they are run on low-power embedded systems. OpenVX has emerged as the de-facto reference standard to develop such applications. Starting with a DAG representation of the application and by relying on a primitive-based programming model, it allows for automatic system-level optimizations and synthesis of an implementation onto the target heterogeneous multi-core architecture. However, the state-of-the-art algorithm for task mapping and scheduling in OpenVX does not provide the performance necessary for such applications when deployed on embedded multi-/many-core architectures. Our work addresses this challenge by making the following three contributions. First, we implemented a static task scheduling and mapping approach for OpenVX using the heterogeneous earliest finish time (HEFT) heuristic. We show that HEFT allows us to improve the system performance by up to 70% on one of the most widespread embedded vision systems (i.e., NVIDIA VisionWorks on NVIDIA Jetson TX2). Second, we show that HEFT, in the context of an embedded vision application where some primitives may have multiple implementations (e.g., for CPU and for GPU), can lead to a load imbalance amongst heterogeneous computing elements (CEs), thereby suffering from degraded performance. Third, we propose an algorithm called exclusive earliest finish time (XEFT) that introduces the notion of exclusive overlap between single-implementation primitives to improve the load balancing. We show that XEFT can further improve the system performance by up to 33% over HEFT, and 82% over OpenVX. We present the results on different benchmarks, including a real-world localization and mapping application (ORB-SLAM) combined with the NVIDIA image recognition application based on deep learning.
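
The HEFT heuristic named in this abstract can be illustrated with a minimal sketch. This is not the authors' OpenVX implementation and it is not XEFT. It is a simplified, non-insertion variant of the textbook heuristic, and the function name `heft`, its parameters, and the uniform `comm` transfer cost are assumptions made for illustration. A primitive with a single implementation is modelled as a task that lists only one computing element.

```python
from collections import defaultdict

def heft(succ, cost, comm=0.0):
    """Simplified HEFT list scheduling (hypothetical helper, not from the paper).

    succ: task -> list of successor tasks (must form a DAG)
    cost: task -> {computing_element: execution_time}, all times > 0
    comm: uniform transfer cost charged when producer and consumer
          run on different computing elements (an assumption)
    Returns task -> (computing_element, start, finish).
    """
    tasks = list(cost)
    pred = defaultdict(list)
    for t in tasks:
        for s in succ.get(t, []):
            pred[s].append(t)

    rank = {}

    def upward_rank(t):
        # Average execution time plus the longest remaining path to an exit task.
        if t not in rank:
            avg = sum(cost[t].values()) / len(cost[t])
            tail = max((comm + upward_rank(s) for s in succ.get(t, [])), default=0.0)
            rank[t] = avg + tail
        return rank[t]

    # With positive costs, descending upward rank is a valid topological order.
    order = sorted(tasks, key=upward_rank, reverse=True)

    ce_free = defaultdict(float)  # time at which each computing element is idle
    placed = {}
    for t in order:
        best = None
        for ce, w in cost[t].items():
            # Inputs are ready once every predecessor has finished (plus transfer).
            ready = max(
                (placed[p][2] + (comm if placed[p][0] != ce else 0.0) for p in pred[t]),
                default=0.0,
            )
            start = max(ready, ce_free[ce])
            finish = start + w
            if best is None or finish < best[2]:
                best = (ce, start, finish)
        placed[t] = best
        ce_free[best[0]] = best[2]
    return placed
```

Each task goes to the computing element that gives it the earliest finish time, considering one task at a time. That greedy choice is the source of the load imbalance the abstract attributes to HEFT when some primitives run on only one computing element.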

    Sub-pJ per operation scalable computing: The PULP experience

    Ultra-low power operation and extreme energy efficiency are strong requirements for a number of high-growth Internet-of-Things (IoT) applications requiring near-sensor processing. A promising approach to achieve major energy efficiency improvements is near-threshold computing. However, frequency degradation due to aggressive voltage scaling may not be acceptable for performance-constrained applications. The PULP platform leverages multi-core parallelism with explicitly-managed shared L1 memory to overcome performance degradation at low voltage, while maintaining the flexibility and programmability typical of instruction processors. PULP supports OpenMP, OpenCL, and OpenVX parallel programming with hardware support for energy-efficient synchronization. Multiple silicon implementations of PULP have been taped out and achieve hundreds of GOPS/W on video, audio, inertial sensor data processing and classification, within power envelopes of a few milliwatts. PULP hardware and software are open-source, with the goal of supporting and boosting an innovation ecosystem focusing on ULP computing for the IoT. Rossi, Davide

    Optimization Techniques for Parallel Programming of Embedded Many-Core Computing Platforms

    Nowadays, many-core computing platforms are widely adopted as a viable solution to accelerate compute-intensive workloads at different scales, from low-cost devices to HPC nodes. It is well established that heterogeneous platforms including a general-purpose host processor and a parallel programmable accelerator have the potential to dramatically increase the peak performance/Watt of computing architectures. However, the adoption of these platforms further complicates application development, whereas it is widely acknowledged that software development is a critical activity for the platform design. The introduction of parallel architectures raises the need for programming paradigms capable of effectively leveraging an increasing number of processors, from two to thousands. In this scenario, the study of optimization techniques to program parallel accelerators is paramount for two main objectives: first, improving performance and energy efficiency of the platform, which are key metrics for both embedded and HPC systems; second, enforcing software engineering practices with the aim of guaranteeing code quality and reducing software costs. This thesis presents a set of techniques that have been studied and designed to achieve these objectives, advancing beyond the current state of the art. As a first contribution, we discuss the use of OpenMP tasking as a general-purpose programming model to support the execution of diverse workloads, and we introduce a set of runtime-level techniques to support fine-grain tasks on high-end many-core accelerators (devices with a power consumption greater than 10W). Then we focus our attention on embedded computer vision (CV), with the aim of showing how to achieve the best performance by exploiting the characteristics of a specific application domain. 
To further reduce the power consumption of parallel accelerators beyond the current technological limits, we describe an approach based on the principles of approximate computing, which implies modification to the program semantics and proper hardware support at the architectural level.

    ENABLING REAL-TIME CERTIFICATION OF AUTONOMOUS DRIVING APPLICATIONS

    The push towards fielding advanced driver-assist systems (ADASs) is happening at breakneck speed. Semi-autonomous features are becoming increasingly common, including adaptive cruise control and automatic lane keeping. Today, graphics processing units (GPUs) are seen as a key technology in this push towards greater autonomy. However, realizing full autonomy in mass-production vehicles will necessitate the use of stringent certification processes. Unfortunately, currently available GPUs tend to be closed-source “black boxes” that have features that are not publicly disclosed; these features must be documented for certification to be tenable. Furthermore, existing real-time task models have not evolved to handle historical-result requirements common in computer-vision (CV) applications, which introduce cycles in processing graphs; existing models must be extended to account for such dependencies. Additionally, due to size, weight, power, and cost constraints, multiple CV applications may need to share a single hardware platform; if the platform contains accelerators such as non-preemptive GPUs, such sharing must be managed in a way that ensures applications are isolated from one another. For ADAS certification to be possible, these challenges must be addressed. This dissertation addresses each of these three challenges. First, scheduling details of NVIDIA GPUs are presented, as derived through extensive micro-benchmarking experiments. These details provide the foundation for identifying and automatically detecting key issues when using NVIDIA GPUs in real-time safety-critical applications. Second, a generalization of a real-time task model is introduced, enabling the computation of response-time bounds for processing graphs that contain cycles. This model exposes a trade-off between the age of historical data, the resulting response-time bounds, and the accuracy of the CV application; this trade-off is explored in detail. 
Finally, a time-partitioning framework for multicore+accelerator platforms is introduced. When applied alongside existing methods for alleviating spatial interference, this framework can help enable component-wise ADAS certification on multicore+accelerator platforms. Doctor of Philosophy