    A time-predictable many-core processor design for critical real-time embedded systems

    Critical Real-Time Embedded Systems (CRTES) are in charge of controlling fundamental parts of embedded systems, e.g., the energy-harvesting solar panels of satellites, steering and braking in cars, or flight management systems in airplanes. To do so, CRTES require strong evidence of correct functional and timing behavior. The former guarantees that the system operates correctly in response to its inputs; the latter ensures that its operations are performed within a predefined time budget. CRTES aim at increasing the number and complexity of their functions. Examples include the incorporation of "smarter" Advanced Driver Assistance System (ADAS) functionality in modern cars or advanced collision avoidance systems in Unmanned Aerial Vehicles (UAVs). All these new features, implemented in software, lead to exponential growth in both performance requirements and software development complexity. Furthermore, there is a strong need to integrate multiple functions into the same computing platform to reduce the number of processing units, mass and space requirements, etc. Overall, there is a clear need to increase the computing power of current CRTES in order to support new sophisticated and complex functionality and to integrate multiple systems into a single platform. The CRTES industry increasingly sees multi- and many-core processor architectures as the solution to cope with the performance demands and cost constraints of future CRTES. Many-cores supply higher performance by exploiting application parallelism while providing better performance per watt, since each core is kept simpler than a complex single-core processor. Moreover, their parallel capabilities allow multiple functions to be scheduled onto the same processor, maximizing hardware utilization.
However, the use of multi- and many-cores in CRTES also brings a number of challenges related to providing evidence of the correct operation of the system, especially in the timing domain. Hence, despite the advantages of many-cores and the fact that they are already a reality in the embedded domain (e.g., Kalray MPPA, Freescale NXP P4080, TI Keystone II), their use in CRTES still requires efficient ways of providing reliable evidence of the correct operation of the system. This thesis investigates the use of many-core processors in CRTES as a means to satisfy the performance demands of future complex applications while providing the necessary timing guarantees. To do so, this thesis advances the state of the art in exploiting the parallel capabilities of many-cores in CRTES, with contributions in two computing domains. In the hardware domain, it proposes new many-core designs that enable deriving reliable and tight timing guarantees. In the software domain, it presents efficient scheduling and timing analysis techniques that exploit the parallelization capabilities of many-core architectures and derive tight and trustworthy Worst-Case Execution Time (WCET) estimates for CRTES.
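The abstract ties timing correctness to operations completing within a time budget, with WCET estimates as the key input. As a hedged illustration of how such estimates feed a timing guarantee (a classical textbook test, not the thesis's own technique), the sketch below applies fixed-priority response-time analysis on a single core, iterating R = C_i + sum over higher-priority tasks j of ceil(R / T_j) * C_j until it reaches a fixed point or exceeds the deadline. It assumes independent periodic tasks with deadlines no longer than their periods, and WCETs that already account for any inter-core interference; the task names and numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    wcet: int      # C_i: WCET estimate, in an arbitrary time unit
    period: int    # T_i: minimum inter-arrival time
    deadline: int  # D_i: relative deadline, assumed <= T_i


def response_times(tasks):
    """Fixed-priority response-time analysis on one core.

    `tasks` is ordered from highest to lowest priority. Returns a dict
    mapping each task name to its worst-case response time, or None if
    the iteration exceeds the task's deadline (not schedulable).
    """
    results = {}
    for i, task in enumerate(tasks):
        higher = tasks[:i]  # all tasks with higher priority
        r = task.wcet
        while True:
            # Preemptions by each higher-priority task released within r.
            interference = sum(math.ceil(r / hp.period) * hp.wcet for hp in higher)
            r_next = task.wcet + interference
            if r_next > task.deadline:
                results[task.name] = None
                break
            if r_next == r:  # fixed point reached
                results[task.name] = r
                break
            r = r_next
    return results


if __name__ == "__main__":
    tasks = [Task("A", 1, 4, 4), Task("B", 2, 6, 6), Task("C", 3, 12, 12)]
    print(response_times(tasks))  # {'A': 1, 'B': 3, 'C': 10}
```

Raising task C's WCET from 3 to 6 in this example pushes its response time past its deadline of 12, so the analysis reports it as not schedulable: tighter WCET estimates directly translate into more task sets that can be certified.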

    parMERASA Multi-Core Execution of Parallelised Hard Real-Time Applications Supporting Analysability

    Engineers who design hard real-time embedded systems express a need for several times the performance available today while keeping safety as a major criterion. A breakthrough in performance is expected from parallelizing hard real-time applications and running them on an embedded multi-core processor, which makes it possible to combine high performance with timing-predictable execution. parMERASA will provide a timing-analyzable system of parallel hard real-time applications running on a scalable multi-core processor. parMERASA goes one step beyond mixed-criticality demands: it targets future complex control algorithms by parallelizing hard real-time programs to run on predictable multi-/many-core processors. We aim to achieve a breakthrough in techniques for the parallelization of industrial hard real-time programs, and to provide hard real-time support in system software, WCET analysis and verification tools for multi-cores, and techniques for predictable multi-core designs with up to 64 cores.
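parMERASA's premise is that parallelized hard real-time programs can still be timing-analyzed. As a simplified, hypothetical illustration (not the parMERASA toolchain), suppose a parallel program is modeled as a fork-join graph of code segments, each with its own WCET bound, and every concurrently running segment has a dedicated core with interference already folded into its bound. Under those assumptions the program's WCET bound is the length of the graph's critical path. The function name, segment names, and cycle counts below are invented for illustration.

```python
from graphlib import TopologicalSorter


def dag_wcet_bound(wcet, preds):
    """Critical-path WCET bound of a fork-join task graph.

    wcet:  dict mapping each segment to its WCET bound (e.g. in cycles).
    preds: dict mapping a segment to the segments that must finish first.
    Assumes every segment appears in `wcet`, each concurrently running
    segment has its own core, and interference is already in `wcet`.
    """
    graph = {node: tuple(preds.get(node, ())) for node in wcet}
    finish = {}
    # Visit segments so that all predecessors are processed first.
    for node in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[node]), default=0)
        finish[node] = start + wcet[node]
    return max(finish.values())


if __name__ == "__main__":
    wcet = {"init": 10, "a": 40, "b": 25, "c": 30, "join": 5}
    preds = {"a": ["init"], "b": ["init"], "c": ["init"], "join": ["a", "b", "c"]}
    print(dag_wcet_bound(wcet, preds))  # 55
    print(sum(wcet.values()))           # 110, the sequential bound
```

Here the parallel bound (55) is half the sequential sum (110): the three middle segments overlap, so only the longest of them (40) lies on the critical path. In practice the hard part, which this sketch assumes away, is bounding the interference those overlapping segments cause each other on shared buses and memories.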

    Graphics Processing Unit-Based Computer-Aided Design Algorithms for Electronic Design Automation

    Electronic design automation (EDA) tools are software that plays an important role in modern integrated circuit (IC) design, automating the successive stages of the IC design process. This research focuses on two important EDA tools: floorplanning and global routing. Specifically, the goal of this study is to parallelize these two tools so that their execution time can be significantly shortened on modern multi-core and graphics processing unit (GPU) architectures. The GPU is a massively parallel architecture that can execute thousands of independent threads concurrently. Although a small set of EDA tools can already be accelerated on the GPU, most algorithms in this field are designed with the single-core paradigm in mind. Floorplanning and global routing belong to the latter group and, due to their inherently sequential nature, are difficult to speed up on the GPU. This work parallelizes the floorplanning and global routing algorithms through a novel approach, yielding significant speedups for both tools on GPU hardware. Specifically, through a complete overhaul of the solution-space and design-space exploration, the GPU-based floorplanning algorithm achieves a 4x to 166x speedup while producing similar or better solutions than the sequential algorithm. The GPU-based global routing algorithm achieves significant speedups over existing state-of-the-art routers while delivering competitive solution quality. Importantly, this parallel global routing model produces a stable solution that is independent of the level of parallelism. In summary, this research shows that, through a redesigned paradigm, sequential algorithms can also benefit from massively parallel architectures. The findings of this study improve the efficiency and design quality of the modern EDA design flow.

    Towards Efficient Computation in Real-Time Systems

    Graph algorithms have gained popularity and are used in both high-performance and mobile computing. Input dependence, that is, changes in the input graph, leads to performance variations in such algorithms. The impact of input dependence on graph algorithms has not been well studied in the context of approximate computing. This thesis conducts such an analysis by applying loop perforation, a general approximation mechanism that transforms program loops to drop a subset of their iterations. The analysis identifies the need to adapt the inner and outer loop perforation rates as a function of input graph characteristics, such as the density or size of the graph. A predictive model is proposed that learns near-optimal loop perforation rates using synthetic input graphs. When the input-aware loop perforation model is applied to real-world graphs, the evaluated graph algorithms systematically trade accuracy for performance and power benefits. Results show average improvements of ~30% in performance and ~19% in power utilization at a program accuracy loss threshold of 10% on an NVIDIA GPU. The analysis is also conducted on two contemporary Intel CPU architectures, an 8-core Xeon and a 61-core Xeon Phi machine.
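Loop perforation, as described above, simply skips a subset of a loop's iterations. The following minimal sketch assumes a fixed-stride skipping policy and uses a toy graph kernel (mean neighbor degree) chosen for illustration, with separate hypothetical `outer_rate` and `inner_rate` parameters for the vertex loop and the neighbor loop. The thesis's predictive model, which selects these rates from graph density and size, is not reproduced here.

```python
def kept_iterations(n, rate):
    """Indices of an n-iteration loop kept when a fraction `rate` is dropped.

    Uses a fixed stride (keep one iteration out of every `stride`), so the
    achieved drop rate only approximates `rate`.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    stride = max(1, round(1 / (1 - rate)))
    return range(0, n, stride)


def mean_neighbor_degree(adj, outer_rate=0.0, inner_rate=0.0):
    """Mean over vertices of the average degree of their neighbors.

    adj: dict vertex -> list of neighbors (undirected graph).
    outer_rate / inner_rate: perforation rates for the vertex loop and
    the neighbor loop (hypothetical parameter names).
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    vertices = list(adj)
    total, count = 0.0, 0
    for i in kept_iterations(len(vertices), outer_rate):  # outer loop
        nbrs = adj[vertices[i]]
        kept = [nbrs[j] for j in kept_iterations(len(nbrs), inner_rate)]  # inner loop
        if kept:
            total += sum(degree[u] for u in kept) / len(kept)
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
    print(mean_neighbor_degree(adj))                  # exact
    print(mean_neighbor_degree(adj, outer_rate=0.5))  # half the vertices
    print(mean_neighbor_degree(adj, inner_rate=0.5))  # half of each neighbor list
```

On this four-vertex graph the exact result is 29/12 (about 2.42); perforating half of the outer loop gives 25/12 (about 2.08), and perforating half of each inner loop gives 2.625. Which loop to perforate, and how aggressively, changes the error, which is why the rates are worth adapting to the input graph.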

    Multicore and FPGA implementations of emotional-based agent architectures

    The final publication is available at Springer via http://dx.doi.org/10.1007/s11227-014-1307-6.
    Control architectures based on emotions are becoming promising solutions for implementing future robotic agents. The basic controllers of the architecture are the emotional processes, which decide which robot behaviors must be activated to fulfill the objectives. The number of emotional processes grows with the complexity of the application (hundreds of millions per second), reducing the capacity of the main processor to solve complex problems (millions of decisions at a given instant). However, the potential parallelism of the emotional processes allows them to be executed in parallel on FPGAs or multicores, freeing computing capacity in the main processor to tackle more complex dynamic problems. In this paper, an emotional architecture for mobile robotic agents is presented. The workload of the emotional processes is evaluated. Then, the main processor is extended with FPGA co-processors through an Ethernet link; the FPGAs are in charge of executing the emotional processes in parallel. Different Stratix FPGAs are compared to analyze their suitability for the proposed mobile robotic agent applications. The applications are set up taking into account different environmental conditions, robot dynamics and emotional states. Moreover, the applications are also run on multicore processors to compare their performance with that of the FPGAs. Experimental results show that the Stratix IV FPGA increases performance by about one order of magnitude over the main processor and solves all the considered problems. The Quad-Core increases performance by 3.64 times, allowing it to tackle about 89% of the considered problems. The Quad-Core costs less than a Stratix IV, so it is the more adequate solution, except for the most complex application.
Stratix III could be applied to problems with roughly twice the requirements that the main processor can support. Finally, a Dual-Core provides slightly better performance than the Stratix III and is relatively cheaper.
This work was supported in part under Spanish Grant PAID/2012/325 of "Programa de Apoyo a la Investigación y Desarrollo. Proyectos multidisciplinares", Universitat Politècnica de València, Spain.
Domínguez Montagud, CP.; Hassan Mohamed, H.; Crespo, A.; Albaladejo Meroño, J. (2015). Multicore and FPGA implementations of emotional-based agent architectures. Journal of Supercomputing. 71(2):479-507. https://doi.org/10.1007/s11227-014-1307-6

    Online Modeling and Tuning of Parallel Stream Processing Systems

    Writing performant computer programs is hard. Code for high-performance applications is profiled, tweaked, and refactored for months specifically for the hardware on which it is to run. Consumer application code does not get the benefit of the endless massaging that high-performance code receives, even though heterogeneous processor environments are beginning to resemble those in more performance-oriented arenas. This thesis offers a path to performant, parallel code (through stream processing) that is tuned online and automatically adapts to the environment it is given. This approach has the potential to reduce the tuning costs associated with high-performance code and brings the benefit of performance tuning to consumer applications, where it would otherwise be cost-prohibitive. This thesis introduces a stream processing library and multiple techniques to enable its online modeling and tuning. Stream processing (also termed data-flow programming) is a compute paradigm that views an application as a set of logical kernels connected via communication links, or streams. Stream processing is increasingly used by computational-x and x-informatics fields (e.g., biology, astrophysics), where the focus is on safe and fast parallelization of specific big-data applications. A major advantage of stream processing is that it enables parallelization without requiring the end user to manually manage the non-deterministic behavior often characteristic of more traditional parallel processing methods. Many big-data and high-performance applications involve high-throughput processing, necessitating the use of many parallel compute kernels on several compute cores. Optimizing the orchestration of kernels has been the focus of much theoretical and empirical modeling work. Purely theoretical parallel programming models can fail when the assumptions implicit in the model are mismatched with reality (i.e., the model is incorrectly applied).
Often it is unclear whether the assumptions are actually being met, even when verified under controlled conditions. Full empirical optimization solves this problem by extensively searching the range of likely configurations under native operating conditions. This, however, is expensive in both time and energy. For large, massively parallel systems, even deciding which modeling paradigm to use is often prohibitively expensive and, unfortunately, transient (it changes with workload and hardware). Ideally, a parallel run-time would re-optimize an application continuously to match its environment, with little additional overhead. This work presents methods aimed at doing just that through low-overhead instrumentation, modeling, and optimization. Online optimization provides a good trade-off between static optimization and online heuristics. To enable online optimization, modeling decisions must be fast and relatively accurate. Online modeling and optimization of a stream processing system first requires a stream processing framework that is amenable to the intended type of dynamic manipulation. To fill this void, we developed the RaftLib C++ template library, which enables use of the stream processing paradigm in C++ applications (it is the run-time underlying almost all the work in this dissertation). The user specifies the application topology; almost everything else is optimizable by the run-time. RaftLib builds on knowledge gained during the design of several prior streaming languages (notably Auto-Pipe). The resulting framework enables online migration of tasks, auto-parallelization, online buffer reallocation, and other useful dynamic behaviors that were not available in many previous stream processing systems. Several benchmark applications have been designed to assess the performance gains of our approaches and to compare performance with other leading stream processing frameworks.
Information is essential to any modeling task; to that end, a low-overhead instrumentation framework has been developed that is both dynamic and adaptive. Finding a fast and relatively optimal configuration for a stream processing application often requires solving for buffer sizes within a finite-capacity queueing network. We show that a generalized gain/loss network flow model can bootstrap this process under certain conditions. Any modeling effort requires that a model be selected, which is often a highly manual task involving many expensive operations. This dissertation demonstrates that machine learning methods (such as a support vector machine) can successfully select models at run-time for a streaming application. The full set of approaches is incorporated into the open-source RaftLib framework.
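Stream processing, as described above, treats an application as kernels connected by streams. The following is a minimal Python sketch under stated assumptions, not RaftLib's C++ API: each kernel runs in its own thread, each stream is a bounded `queue.Queue`, and the hypothetical `capacity` parameter stands in for the buffer sizes that RaftLib tunes online. Because of CPython's global interpreter lock, it illustrates the topology and back-pressure, not a parallel speedup.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a stream


def kernel(fn, inbox, outbox):
    """One stream kernel: apply fn to each item from inbox, emit to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate end-of-stream downstream
            return
        outbox.put(fn(item))


def run_pipeline(source, stages, capacity=4):
    """Run a linear pipeline of kernels joined by bounded FIFO streams.

    `capacity` (hypothetical parameter) is the per-stream buffer size; a full
    stream blocks its producer, giving back-pressure.
    """
    streams = [queue.Queue(maxsize=capacity) for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=kernel, args=(fn, streams[i], streams[i + 1]))
        for i, fn in enumerate(stages)
    ]

    def produce():  # source kernel
        for item in source:
            streams[0].put(item)
        streams[0].put(_DONE)

    producer = threading.Thread(target=produce)
    for t in [producer, *workers]:
        t.start()

    results = []
    while (item := streams[-1].get()) is not _DONE:  # sink kernel
        results.append(item)

    for t in [producer, *workers]:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(range(5), [lambda x: x * x, lambda x: x + 1]))
    # [1, 2, 5, 10, 17]
```

The user only fixes the topology (the list of stages); the buffer size is left as a knob, which mirrors the split described above between what the user specifies and what a run-time could tune.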