Hybrid static/dynamic scheduling for already optimized dense matrix factorization
We present the use of a hybrid static/dynamic scheduling strategy for the task
dependency graph of direct methods used in dense numerical linear algebra. This
strategy provides a balance of data locality, load balancing, and low dequeue
overhead. We show that using this scheduling strategy in communication-avoiding
dense factorization leads to significant performance gains. On a 48-core AMD
Opteron NUMA machine, our experiments show that we can achieve up to 64%
improvement over a version of CALU that uses fully dynamic scheduling, and up
to 30% improvement over the version of CALU that uses fully static scheduling.
On a 16-core Intel Xeon machine, our hybrid static/dynamic scheduling approach
is up to 8% faster than the versions of CALU that use fully static or fully
dynamic scheduling. Our algorithm leads to speedups over the corresponding LU
factorization routines in well-known libraries. On the 48-core AMD NUMA
machine, our best implementation is up to 110% faster than MKL, while on the
16-core Intel Xeon machine, it is up to 82% faster than MKL. Our approach also
shows significant speedups compared with PLASMA on both of these systems.
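As a rough illustration of the idea (not the paper's CALU implementation), the following Python sketch splits a flat task list into a statically assigned part, executed per thread for data locality, and a dynamically scheduled remainder, pulled from a shared queue for load balance. Task dependencies, the dependency graph itself, and tuning of the static fraction are omitted; all names are illustrative.

import queue
import threading

# Sketch of hybrid static/dynamic scheduling (assumed structure, not the
# paper's code): each worker first runs its statically assigned tasks, then
# pulls leftover tasks from a shared queue. Dependencies are ignored here.

def worker(static_tasks, shared_queue, run_task):
    for task in static_tasks:          # static part: locality, no dequeue cost
        run_task(task)
    while True:                        # dynamic part: run-time load balancing
        try:
            task = shared_queue.get_nowait()
        except queue.Empty:
            return
        run_task(task)

def hybrid_schedule(tasks, num_threads, static_fraction=0.8, run_task=print):
    split = int(len(tasks) * static_fraction)
    static, dynamic = tasks[:split], tasks[split:]
    shared = queue.Queue()
    for t in dynamic:
        shared.put(t)
    # Round-robin the static portion across threads.
    assignments = [static[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(a, shared, run_task))
               for a in assignments]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

if __name__ == "__main__":
    hybrid_schedule([f"task{i}" for i in range(20)], num_threads=4)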
Modulo scheduling for a fully-distributed clustered VLIW architecture
Clustering is an approach that many recent microprocessors adopt to mitigate the increasing penalties of wire delays. We propose a novel clustered VLIW architecture in which all resources are partitioned among clusters, including the cache memory. We also propose a modulo scheduling scheme for this architecture. The algorithm takes into account both register and memory inter-cluster communications, so that the final schedule results in a cluster assignment that favors cluster locality in cache references and register accesses. It has been evaluated for both 2- and 4-cluster configurations and for differing numbers and latencies of inter-cluster buses. The proposed algorithm produces schedules with very low communication requirements and outperforms previous cluster-oriented schedulers.
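The toy sketch below illustrates only the cluster-assignment intuition the abstract mentions, weighing register and memory inter-cluster communication when placing each operation; it is not the proposed modulo scheduler and ignores initiation intervals, resource constraints, and register pressure. All names and costs are invented for the example.

# Greedy cluster assignment sketch (illustrative only): penalize register
# transfers from predecessors on other clusters and remote cache accesses, so
# operations tend to land where their operands and cached data already live.

def assign_clusters(ops, num_clusters, reg_cost=1, mem_cost=2):
    """ops: list of (name, predecessor_indices, memory_home_cluster or None)."""
    assignment = []
    for name, preds, mem_home in ops:
        best_cluster, best_cost = 0, float("inf")
        for c in range(num_clusters):
            # Register communication: one transfer per predecessor elsewhere.
            cost = sum(reg_cost for p in preds if assignment[p] != c)
            # Memory communication: remote access if the data's home differs.
            if mem_home is not None and mem_home != c:
                cost += mem_cost
            if cost < best_cost:
                best_cluster, best_cost = c, cost
        assignment.append(best_cluster)
    return assignment

if __name__ == "__main__":
    # op: (name, predecessors, home cluster of the memory it touches)
    loop_body = [("load_a", [], 0), ("load_b", [], 1),
                 ("mul", [0, 1], None), ("store", [2], 0)]
    print(assign_clusters(loop_body, num_clusters=2))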
DeepPicar: A Low-cost Deep Neural Network-based Autonomous Car
We present DeepPicar, a low-cost deep neural network-based autonomous car
platform. DeepPicar is a small-scale replication of a real self-driving car
called DAVE-2 by NVIDIA. DAVE-2 uses a deep convolutional neural network (CNN),
which takes images from a front-facing camera as input and produces car
steering angles as output. DeepPicar uses the same network architecture---9
layers, 27 million connections and 250K parameters---and can drive itself in
real time using a web camera and a Raspberry Pi 3 quad-core platform. Using
DeepPicar, we analyze the Pi 3's computing capabilities to support end-to-end
deep learning based real-time control of autonomous vehicles. We also
systematically compare other contemporary embedded computing platforms using
DeepPicar's CNN-based real-time control workload. We find that all tested
platforms, including the Pi 3, are capable of supporting CNN-based real-time
control, from 20 Hz up to 100 Hz, depending on the hardware platform. However,
we find that shared resource contention remains an important issue that must be
considered when applying CNN models on shared-memory-based embedded computing
platforms; we observe up to an 11.6x increase in the execution time of the
CNN-based control loop due to shared resource contention. To protect the CNN
workload, we also evaluate state-of-the-art cache partitioning and memory
bandwidth throttling techniques on the Pi 3. We find that cache partitioning is
ineffective, while memory bandwidth throttling is an effective solution.
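A minimal sketch of the kind of end-to-end control loop the abstract analyzes. The functions capture_frame, cnn_infer_angle, and set_steering are hypothetical placeholders rather than DeepPicar's actual API; the point is only to show where contention-induced slowdowns would surface as missed deadlines in a fixed-rate loop.

import time

CONTROL_HZ = 20                  # the paper reports control rates of 20-100 Hz
PERIOD = 1.0 / CONTROL_HZ

def capture_frame():
    return None                  # placeholder: grab a front-facing camera frame

def cnn_infer_angle(frame):
    return 0.0                   # placeholder: CNN maps the frame to a steering angle

def set_steering(angle):
    pass                         # placeholder: send the angle to the steering servo

def control_loop(iterations=20):
    for _ in range(iterations):
        start = time.monotonic()
        set_steering(cnn_infer_angle(capture_frame()))
        # Contention that inflates inference time eats into this slack and,
        # eventually, causes the loop to miss its period.
        slack = PERIOD - (time.monotonic() - start)
        if slack > 0:
            time.sleep(slack)

if __name__ == "__main__":
    control_loop()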
Computing Execution Times with eXecution Decision Diagrams in the Presence of Out-Of-Order Resources
Worst-Case Execution Time (WCET) is a key component for the verification of
critical real-time applications. Yet, even the simplest microprocessors
implement pipelines with concurrently-accessed resources, such as the memory
bus shared by the fetch and memory stages. Although their in-order pipelines
are, by nature, very deterministic, the bus can cause out-of-order accesses to
memory and, therefore, timing anomalies: local timing effects that can have
global effects but that cannot be easily composed to estimate the global WCET.
To cope with this situation, WCET analyses either have to introduce significant
over-estimation in order to preserve the safety of the computed times, or have
to explicitly track all possible executions. In the latter case, the presence
of out-of-order behavior leads to a combinatorial blowup of the number of
pipeline states, for which efficient state abstractions are difficult to
design. This paper instead proposes a compact and exact representation of the
timings in the pipeline, using eXecution Decision Diagrams (XDD) [1]. We show
how XDDs can be used to model pipeline states along the execution paths by
leveraging the algebraic properties of XDDs. This computational model makes it
possible to compute the exact temporal behavior at the control flow graph level
and is amenable to efficiently and precisely supporting WCET calculation in the
presence of out-of-order bus accesses. The model is evaluated on the TACLe
benchmark suite, and we observe good performance, making this approach
appropriate for industrial applications.
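As a toy illustration of condition-dependent timing (not the XDD data structure itself), one can keep a table mapping abstract bus-availability contexts to finish times and compose instructions exactly per context; the contexts, latencies, and function below are invented for the example.

# Toy per-context timing composition: instead of a single WCET number per
# block, keep exact finish times for each abstract "bus becomes free at ..."
# context and propagate them instruction by instruction.

def compose(timings, latency, uses_bus, bus_busy_until):
    """timings: {context_label: finish_cycle} for the preceding instructions."""
    out = {}
    for ctx, finish in timings.items():
        start = finish
        if uses_bus:
            # An out-of-order bus grant can delay the start of this access.
            start = max(start, bus_busy_until.get(ctx, 0))
        out[ctx] = start + latency
    return out

if __name__ == "__main__":
    # Two contexts: the shared fetch/memory bus frees early or late.
    t = {"bus_free_early": 4, "bus_free_late": 4}
    busy = {"bus_free_early": 2, "bus_free_late": 9}
    t = compose(t, latency=3, uses_bus=True, bus_busy_until=busy)   # memory access
    t = compose(t, latency=1, uses_bus=False, bus_busy_until=busy)  # ALU op
    print(t)  # exact per-context finish times; their max is a tight WCET bound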
Modeling and visualizing networked multi-core embedded software energy consumption
In this report we present a network-level multi-core energy model and a
software development process workflow that allows software developers to
estimate the energy consumption of multi-core embedded programs. This work
focuses on a high-performance, cache-less and timing-predictable embedded
processor architecture, XS1. Prior modelling work is improved to increase
accuracy, then extended to be parametric with respect to voltage and frequency
scaling (VFS), and then integrated into a larger-scale model of a network of
interconnected cores. The modelling is supported by enhancements to an
open-source instruction set simulator that provide the first
network-timing-aware simulations of the target architecture. Simulation-based
modelling techniques are combined with methods of results presentation to
demonstrate how such work can be integrated into a software developer's
workflow, enabling the developer to make informed, energy-aware coding
decisions. A set of single-threaded, multi-threaded and multi-core benchmarks
is used to exercise and evaluate the models and to provide use-case examples
for how results can be presented and interpreted. The models all yield accuracy
within an average ±5% error margin.
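For readers unfamiliar with VFS-parametric energy models, here is a generic sketch of the model shape such a workflow builds on; the constants, the f·V² dynamic-power form, and the per-message network cost are illustrative assumptions, not the calibrated XS1 model from the report.

# Generic VFS-parametric energy model sketch: dynamic power scales roughly
# with f * V^2, static power is constant, and each inter-core message adds a
# fixed energy cost.

def core_energy(active_seconds, freq_hz, volt, c_eff=1e-9, p_static=0.05):
    p_dynamic = c_eff * freq_hz * volt ** 2     # switched-capacitance model
    return (p_dynamic + p_static) * active_seconds

def network_energy(messages, energy_per_message=2e-9):
    return messages * energy_per_message

def system_energy(cores, links):
    """cores: list of (active_s, f, V); links: list of message counts."""
    return (sum(core_energy(t, f, v) for t, f, v in cores)
            + sum(network_energy(m) for m in links))

if __name__ == "__main__":
    # Two cores at different operating points exchanging 10k messages.
    print(system_energy(cores=[(0.5, 500e6, 1.0), (0.5, 400e6, 0.9)],
                        links=[10_000]))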
WCET-aware Software Based Cache Partitioning for Multi-Task Real-Time Systems
Caches are a source of unpredictability since it is very difficult to predict whether a memory access results in a cache hit or a miss. In systems running multiple tasks steered by a preempting scheduler, it is even impossible to determine the cache behavior, since interrupt-driven schedulers lead to unknown points in time for context switches. Partitioned caches are already used in multi-task environments to increase the cache hit ratio by avoiding mutual eviction of tasks from the cache.
For real-time systems, the upper bound of the execution time is one of the most important metrics, called the Worst-Case Execution Time (WCET). In this paper, we use partitioning of instruction caches as a technique to achieve tighter WCET estimations, since tasks cannot be evicted from their partition by other tasks. We propose a novel WCET-aware cache partitioning algorithm, which determines the optimal partition size for each task, with a focus on decreasing the system's WCET for a given set of possible partition sizes. Employing this algorithm, we are able to decrease the WCET by up to 34%, depending on the number of tasks in a set. On average, reductions between 12% and 19% can be achieved.
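To make the optimization concrete, here is a minimal brute-force sketch of the partition-size selection problem the abstract describes; the per-task WCET tables, the additive system-WCET objective, and the exhaustive enumeration are illustrative assumptions, not the paper's algorithm.

from itertools import product

# Toy exhaustive search: pick one partition size per task, respect total
# cache capacity, and minimize a simple system WCET taken here as the sum of
# per-task WCETs.

def best_partitioning(wcet_by_size, cache_size):
    """wcet_by_size: per task, a dict {partition_size_kb: wcet_cycles}."""
    best_assign, best_wcet = None, float("inf")
    size_choices = [sorted(t.keys()) for t in wcet_by_size]
    for assign in product(*size_choices):
        if sum(assign) > cache_size:
            continue
        wcet = sum(t[s] for t, s in zip(wcet_by_size, assign))
        if wcet < best_wcet:
            best_assign, best_wcet = assign, wcet
    return best_assign, best_wcet

if __name__ == "__main__":
    tasks = [{1: 900, 2: 700, 4: 650},     # WCET (cycles) per partition size (KB)
             {1: 500, 2: 480, 4: 470},
             {1: 1200, 2: 800, 4: 600}]
    print(best_partitioning(tasks, cache_size=8))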
Pac-Sim: Simulation of Multi-threaded Workloads using Intelligent, Live Sampling
High-performance, multi-core processors are the key to accelerating workloads
in several application domains. To continue to scale performance at the limit
of Moore's Law and Dennard scaling, software and hardware designers have turned
to dynamic solutions that adapt to the needs of applications in a transparent,
automatic way. For example, modern hardware improves its performance and power
efficiency by changing the hardware configuration, like the frequency and
voltage of cores, according to a number of parameters such as the technology
used, the workload running, etc. With this level of dynamism, it is essential
to simulate next-generation multi-core processors in a way that can both
respond to system changes and accurately determine system performance metrics.
Currently, no sampled simulation platform can achieve these goals of dynamic,
fast, and accurate simulation of multi-threaded workloads.
In this work, we propose a solution that allows for fast, accurate simulation
in the presence of both hardware and software dynamism. To accomplish this
goal, we present Pac-Sim, a novel sampled simulation methodology that enables
fast, accurate simulation and requires no upfront analysis of the workload.
With our proposed methodology, it is now possible to simulate long-running
dynamically scheduled multi-threaded programs with significant simulation
speedups even in the presence of dynamic hardware events. We evaluate Pac-Sim
using the multi-threaded SPEC CPU2017, NPB, and PARSEC benchmarks with both
static and dynamic thread scheduling. The experimental results show that
Pac-Sim achieves a very low sampling error of 1.63% and 3.81% on average for
statically and dynamically scheduled benchmarks, respectively. Pac-Sim also
demonstrates significant simulation speedups, as high as 523.5x (210.3x on
average) for the train input set of SPEC CPU2017.
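A schematic sketch of the live-sampling idea: alternate short detailed intervals with cheap fast-forwarding and extrapolate from the most recent detailed sample, so no upfront workload analysis is needed. The interval lengths and the random stand-in for detailed simulation are assumptions for illustration, not Pac-Sim's actual methodology.

import random

def sampled_simulation(total_insns, detailed_len=1_000, ff_len=99_000):
    simulated, estimated_cycles = 0, 0.0
    while simulated < total_insns:
        # Detailed interval: "measure" CPI (random stand-in for a real model).
        cpi_estimate = random.uniform(0.8, 2.0)
        estimated_cycles += detailed_len * cpi_estimate
        simulated += detailed_len
        # Fast-forward interval: reuse the live CPI estimate instead of
        # simulating every instruction in detail, so the estimate can track
        # dynamic hardware/software changes as new samples arrive.
        ff = min(ff_len, total_insns - simulated)
        estimated_cycles += ff * cpi_estimate
        simulated += ff
    return estimated_cycles

if __name__ == "__main__":
    print(f"estimated cycles: {sampled_simulation(10_000_000):,.0f}")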