4,402 research outputs found

    Enabling preemptive multiprogramming on GPUs

    Get PDF
    GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems are usually running multiple applications, from one or several users. However GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to provide key multiprogrammed workload requirements, such as responsiveness, fairness or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend the NVIDIA GK110 (Kepler) like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve execution time of high-priority processes by 15.6x, the average application turnaround time between 1.5x to 2x, and system fairness up to 3.4x.We would like to thank the anonymous reviewers, Alexan- der Veidenbaum, Carlos Villavieja, Lluis Vilanova, Lluc Al- varez, and Marc Jorda on their comments and help improving our work and this paper. This work is supported by Euro- pean Commission through TERAFLUX (FP7-249013), Mont- Blanc (FP7-288777), and RoMoL (GA-321253) projects, NVIDIA through the CUDA Center of Excellence program, Spanish Government through Programa Severo Ochoa (SEV-2011-0067) and Spanish Ministry of Science and Technology through TIN2007-60625 and TIN2012-34557 projects.Peer ReviewedPostprint (author’s final draft

    Qduino: a cyber-physical programming platform for multicore Systems-on-Chip

    Full text link
    Emerging multicore Systems-on-Chip are enabling new cyber-physical applications such as autonomous drones, driverless cars and smart manufacturing using web-connected 3D printers. Common to those applications is a communicating task pipeline, to acquire and process sensor data and produce outputs that control actuators. As a result, these applications usually have timing requirements for both individual tasks and task pipelines formed for sensor data processing and actuation. Current cyber-physical programming platforms, such as Arduino and embedded Linux with the POSIX interface do not allow application developers to specify those timing requirements. Moreover, none of them provide the programming interface to schedule tasks and map them to processor cores, while managing I/O in a predictable manner, on multicore hardware platforms. Hence, this thesis presents the Qduino programming platform. Qduino adopts the simplicity of the Arduino API, with additional support for real-time multithreaded sketches on multicore architectures. Qduino allows application developers to specify timing properties of individual tasks as well as task pipelines at the design stage. To this end, we propose a mathematical framework to derive each task’s budget and period from the specified end-to-end timing requirements. The second part of the thesis is motivated by the observation that at the center of these pipelines are tasks that typically require complex software support, such as sensor data fusion or image processing algorithms. These features are usually developed by many man-year engineering efforts and thus commonly seen on General-Purpose Operating Systems (GPOS). Therefore, in order to support modern, intelligent cyber-physical applications, we enhance the Qduino platform’s extensibility by taking advantage of the Quest-V virtualized partitioning kernel. The platform’s usability is demonstrated by building a novel web-connected 3D printer and a prototypical autonomous drone framework in Qduino

    An Optimization Based Design for Integrated Dependable Real-Time Embedded Systems

    Get PDF
    Moving from the traditional federated design paradigm, integration of mixedcriticality software components onto common computing platforms is increasingly being adopted by automotive, avionics and the control industry. This method faces new challenges such as the integration of varied functionalities (dependability, responsiveness, power consumption, etc.) under platform resource constraints and the prevention of error propagation. Based on model driven architecture and platform based design’s principles, we present a systematic mapping process for such integration adhering a transformation based design methodology. Our aim is to convert/transform initial platform independent application specifications into post integration platform specific models. In this paper, a heuristic based resource allocation approach is depicted for the consolidated mapping of safety critical and non-safety critical applications onto a common computing platform meeting particularly dependability/fault-tolerance and real-time requirements. We develop a supporting tool suite for the proposed framework, where VIATRA (VIsual Automated model TRAnsformations) is used as a transformation tool at different design steps. We validate the process and provide experimental results to show the effectiveness, performance and robustness of the approach

    From Dataflow Specification to Multiprocessor Partitioned Time-triggered Real-time Implementation *

    Get PDF
    International audienceOur objective is to facilitate the development of complex time-triggered systems by automating the allocation and scheduling steps. We show that full automation is possible while taking into account the elements of complexity needed by a complex embedded control system. More precisely, we consider deterministic functional specifications provided (as often in an industrial setting) by means of synchronous data-flow models with multiple modes and multiple relative periods. We first extend this functional model with an original real-time characterization that takes advantage of our time-triggered framework to provide a simpler representation of complex end-to-end flow requirements. We also extend our specifications with additional non-functional properties specifying partitioning, allocation , and preemptability constraints. Then, weprovide novel algorithms for the off-line scheduling of these extended specifications onto partitioned time-triggered architectures Ă  la ARINC 653. The main originality of our work is that it takes into account at the same time multiple complexity elements: various types of non-functional properties (real-time, partitioning, allocation, preemptability) and functional specifications with conditional execution and multiple modes. Allocation of time slots/windows to partitions can be fullyor partially provided, or synthesized by our tool. Our algorithms allow the automatic allocation and scheduling onto multi-processor (distributed) sys-tems with a global time base, taking into account communication costs. We demonstrate our technique on a model of space flight software systemwith strong real-time determinism requirements

    A Dynamically Reconfigurable Parallel Processing Framework with Application to High-Performance Video Processing

    Get PDF
    Digital video processing demands have and will continue to grow at unprecedented rates. Growth comes from ever increasing volume of data, demand for higher resolution, higher frame rates, and the need for high capacity communications. Moreover, economic realities force continued reductions in size, weight and power requirements. The ever-changing needs and complexities associated with effective video processing systems leads to the consideration of dynamically reconfigurable systems. The goal of this dissertation research was to develop and demonstrate the viability of integrated parallel processing system that effectively and efficiently apply pre-optimized hardware cores for processing video streamed data. Digital video is decomposed into packets which are then distributed over a group of parallel video processing cores. Real time processing requires an effective task scheduler that distributes video packets efficiently to any of the reconfigurable distributed processing nodes across the framework, with the nodes running on FPGA reconfigurable logic in an inherently Virtual\u27 mode. The developed framework, coupled with the use of hardware techniques for dynamic processing optimization achieves an optimal cost/power/performance realization for video processing applications. The system is evaluated by testing processor utilization relative to I/O bandwidth and algorithm latency using a separable 2-D FIR filtering system, and a dynamic pixel processor. For these applications, the system can achieve performance of hundreds of 640x480 video frames per second across an eight lane Gen I PCIe bus. Overall, optimal performance is achieved in the sense that video data is processed at the maximum possible rate that can be streamed through the processing cores. This performance, coupled with inherent ability to dynamically add new algorithms to the described dynamically reconfigurable distributed processing framework, creates new opportunities for realizable and economic hardware virtualization.\u2

    Solving Challenging Real-World Scheduling Problems

    Get PDF
    This work contains a series of studies on the optimization of three real-world scheduling problems, school timetabling, sports scheduling and staff scheduling. These challenging problems are solved to customer satisfaction using the proposed PEAST algorithm. The customer satisfaction refers to the fact that implementations of the algorithm are in industry use. The PEAST algorithm is a product of long-term research and development. The first version of it was introduced in 1998. This thesis is a result of a five-year development of the algorithm. One of the most valuable characteristics of the algorithm has proven to be the ability to solve a wide range of scheduling problems. It is likely that it can be tuned to tackle also a range of other combinatorial problems. The algorithm uses features from numerous different metaheuristics which is the main reason for its success. In addition, the implementation of the algorithm is fast enough for real-world use.Siirretty Doriast

    C-MOS array design techniques: SUMC multiprocessor system study

    Get PDF
    The current capabilities of LSI techniques for speed and reliability, plus the possibilities of assembling large configurations of LSI logic and storage elements, have demanded the study of multiprocessors and multiprocessing techniques, problems, and potentialities. Evaluated are three previous systems studies for a space ultrareliable modular computer multiprocessing system, and a new multiprocessing system is proposed that is flexibly configured with up to four central processors, four 1/0 processors, and 16 main memory units, plus auxiliary memory and peripheral devices. This multiprocessor system features a multilevel interrupt, qualified S/360 compatibility for ground-based generation of programs, virtual memory management of a storage hierarchy through 1/0 processors, and multiport access to multiple and shared memory units
    • …
    corecore