4,867 research outputs found

    Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems

    Full text link
    In modern Commercial Off-The-Shelf (COTS) multicore systems, each core can generate many parallel memory requests at a time. The processing of these parallel requests in the DRAM controller greatly affects the memory interference delay experienced by running tasks on the platform. In this paper, we model a modern COTS multicore system which has a nonblocking last-level cache (LLC) and a DRAM controller that prioritizes reads over writes. To minimize interference, we focus on LLC and DRAM bank partitioned systems. Based on the model, we propose an analysis that computes a safe upper bound for the worst-case memory interference delay. We validated our analysis on a real COTS multicore platform with a set of carefully designed synthetic benchmarks as well as SPEC2006 benchmarks. Evaluation results show that our analysis is more accurately capture the worst-case memory interference delay and provides safer upper bounds compared to a recently proposed analysis which significantly under-estimate the delay.Comment: Technical Repor

    HyperLoom: A platform for defining and executing scientific pipelines in distributed environments

    Get PDF
    Real-world scientific applications often encompass end-to-end data processing pipelines composed of a large number of interconnected computational tasks of various granularity. We introduce HyperLoom, an open source platform for defining and executing such pipelines in distributed environments and providing a Python interface for defining tasks. HyperLoom is a self-contained system that does not use an external scheduler for the actual execution of the task. We have successfully employed HyperLoom for executing chemogenomics pipelines used in pharmaceutic industry for novel drug discovery.6

    DeepPicar: A Low-cost Deep Neural Network-based Autonomous Car

    Full text link
    We present DeepPicar, a low-cost deep neural network based autonomous car platform. DeepPicar is a small scale replication of a real self-driving car called DAVE-2 by NVIDIA. DAVE-2 uses a deep convolutional neural network (CNN), which takes images from a front-facing camera as input and produces car steering angles as output. DeepPicar uses the same network architecture---9 layers, 27 million connections and 250K parameters---and can drive itself in real-time using a web camera and a Raspberry Pi 3 quad-core platform. Using DeepPicar, we analyze the Pi 3's computing capabilities to support end-to-end deep learning based real-time control of autonomous vehicles. We also systematically compare other contemporary embedded computing platforms using the DeepPicar's CNN-based real-time control workload. We find that all tested platforms, including the Pi 3, are capable of supporting the CNN-based real-time control, from 20 Hz up to 100 Hz, depending on hardware platform. However, we find that shared resource contention remains an important issue that must be considered in applying CNN models on shared memory based embedded computing platforms; we observe up to 11.6X execution time increase in the CNN based control loop due to shared resource contention. To protect the CNN workload, we also evaluate state-of-the-art cache partitioning and memory bandwidth throttling techniques on the Pi 3. We find that cache partitioning is ineffective, while memory bandwidth throttling is an effective solution.Comment: To be published as a conference paper at RTCSA 201

    MARACAS: a real-time multicore VCPU scheduling framework

    Full text link
    This paper describes a multicore scheduling and load-balancing framework called MARACAS, to address shared cache and memory bus contention. It builds upon prior work centered around the concept of virtual CPU (VCPU) scheduling. Threads are associated with VCPUs that have periodically replenished time budgets. VCPUs are guaranteed to receive their periodic budgets even if they are migrated between cores. A load balancing algorithm ensures VCPUs are mapped to cores to fairly distribute surplus CPU cycles, after ensuring VCPU timing guarantees. MARACAS uses surplus cycles to throttle the execution of threads running on specific cores when memory contention exceeds a certain threshold. This enables threads on other cores to make better progress without interference from co-runners. Our scheduling framework features a novel memory-aware scheduling approach that uses performance counters to derive an average memory request latency. We show that latency-based memory throttling is more effective than rate-based memory access control in reducing bus contention. MARACAS also supports cache-aware scheduling and migration using page recoloring to improve performance isolation amongst VCPUs. Experiments show how MARACAS reduces multicore resource contention, leading to improved task progress.http://www.cs.bu.edu/fac/richwest/papers/rtss_2016.pdfAccepted manuscrip

    Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes

    Get PDF
    The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical manycore architectures. In this paper, we study the benefits and limits of replacing the highly specialized internal scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and StarPU. The tasks graph of the factorization step is made available to the two runtimes, providing them the opportunity to process and optimize its traversal in order to maximize the algorithm efficiency for the targeted hardware platform. A comparative study of the performance of the PaStiX solver on top of its native internal scheduler, PaRSEC, and StarPU frameworks, on different execution environments, is performed. The analysis highlights that these generic task-based runtimes achieve comparable results to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver on heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer.Comment: Heterogeneity in Computing Workshop (2014

    Power-aware scheduling with effective task migration for real-time multicore embedded systems

    Full text link
    A major design issue in embedded systems is reducing the power consumption because batteries have a limited energy budget. For this purpose, several techniques such as dynamic voltage and frequency scaling (DVFS) or task migration are being used. DVFS allows reducing power by selecting the optimal voltage supply, whereas task migration achieves this effect by balancing the workload among cores. This paper focuses on power-aware scheduling allowing task migration to reduce energy consumption in multicore embedded systems implementing DVFS capabilities. To address energy savings, the devised schedulers follow two main rules: migrations are allowed at specific points of time and only one task is allowed to migrate each time. Two algorithms have been proposed working under real-time constraints. The simpler algorithm, namely, single option migration (SOM) only checks just one target core before performing a migration. In contrast, the multiple option migration (MOM) searches the optimal target core. In general, the MOM algorithm achieves better energy savings than the SOM algorithm, although differences are wider for a reduced number of cores and frequency/voltage levels. Moreover, the MOM algorithm reduces energy consumption as much as 40% over the worst fit algorithm.This work was supported by the Spanish MICINN, Consolider Programme and Plan E funds, as well as European Commission FEDER funds, under Grants CSD2006-00046 and TIN2009-14475-C04-01.March Cabrelles, JL.; Sahuquillo Borrás, J.; Petit Martí, SV.; Hassan Mohamed, H.; Duato Marín, JF. (2013). Power-aware scheduling with effective task migration for real-time multicore embedded systems. Concurrency and Computation: Practice and Experience. 25(14):1987-2001. doi:10.1002/cpe.2899S198720012514Euiseong Seo, Jinkyu Jeong, Seonyeong Park, & Joonwon Lee. (2008). Energy Efficient Scheduling of Real-Time Tasks on Multicore Processors. IEEE Transactions on Parallel and Distributed Systems, 19(11), 1540-1552. doi:10.1109/tpds.2008.104March, J. L., Sahuquillo, J., Hassan, H., Petit, S., & Duato, J. (2011). A New Energy-Aware Dynamic Task Set Partitioning Algorithm for Soft and Hard Embedded Real-Time Systems. The Computer Journal, 54(8), 1282-1294. doi:10.1093/comjnl/bxr008AlEnawy, T. A., & Aydin, H. (s. f.). Energy-Aware Task Allocation for Rate Monotonic Scheduling. 11th IEEE Real Time and Embedded Technology and Applications Symposium. doi:10.1109/rtas.2005.20Intel atom processor microarchitecture www.intel.com/Marvell ARMADA TM 628 Marvell Semiconductor, Inc. Santa Clara, CA, USA http://www.marvell.com/company/press_kit/assets/Marvell_ARMADA_628_Release_FINAL3.pdfMcNairy, C., & Bhatia, R. (2005). Montecito: A Dual-Core, Dual-Thread Itanium Processor. IEEE Micro, 25(2), 10-20. doi:10.1109/mm.2005.34Kalla, R., Sinharoy, B., & Tendler, J. M. (2004). IBM power5 chip: a dual-core multithreaded processor. IEEE Micro, 24(2), 40-47. doi:10.1109/mm.2004.1289290Shah A Arm plans to add multithreading to chip design 2010 http://www.itworld.com/hardware/122383/arm-plans-add-multithreading-chip-designSchranzhofer, A., Chen, J.-J., & Thiele, L. (2010). Dynamic Power-Aware Mapping of Applications onto Heterogeneous MPSoC Platforms. IEEE Transactions on Industrial Informatics, 6(4), 692-707. doi:10.1109/tii.2010.2062192Cazorla, F. J., Knijnenburg, P. M. W., Sakellariou, R., Fernandez, E., Ramirez, A., & Valero, M. (2006). Predictable performance in SMT processors: synergy between the OS and SMTs. IEEE Transactions on Computers, 55(7), 785-799. doi:10.1109/tc.2006.108Fisher, N., & Baruah, S. (2008). The feasibility of general task systems with precedence constraints on multiprocessor platforms. Real-Time Systems, 41(1), 1-26. doi:10.1007/s11241-008-9054-5Buttazzo, G., Bini, E., & Yifan Wu. (2011). Partitioning Real-Time Applications Over Multicore Reservations. IEEE Transactions on Industrial Informatics, 7(2), 302-315. doi:10.1109/tii.2011.2123902Intel Pentium M processor datasheet INTEL Corp. Santa Clara, CA, USA 2004 http://download.intel.com/support/processors/mobile/pm/sb/25261203.pdfChaparro, P., Gonzáles, J., Magklis, G., Cai, Q., & González, A. (2007). Understanding the Thermal Implications of Multi-Core Architectures. IEEE Transactions on Parallel and Distributed Systems, 18(8), 1055-1065. doi:10.1109/tpds.2007.1092WCET analysis project. WCET benchmark programs 2006 http://www.mrtc.mdh.se/projects/wcet

    Timing Analysis for DAG-based and GFP Scheduled Tasks

    Full text link
    Modern embedded systems have made the transition from single-core to multi-core architectures, providing performance improvement via parallelism rather than higher clock frequencies. DAGs are considered among the most generic task models in the real-time domain and are well suited to exploit this parallelism. In this paper we provide a schedulability test using response-time analysis exploiting exploring and bounding the self interference of a DAG task. Additionally we bound the interference a high priority task has on lower priority ones
    corecore