D5.1: Accelerator Deployment Models
In this deliverable, we explore this question by studying accelerator deployment models. By accelerator we mean, for example, application-specific GPUs or specially programmed FPGAs. A deployment specifies the types, number, and connectivity of accelerators in a datacenter. With these definitions in mind, we created a theoretical model of the datacenter, its components, its expected workloads, and finally its possible deployments.
We have developed VineSim, a software simulator of a datacenter, based on the aforementioned theoretical model. VineSim takes as input a workload and a deployment description and outputs performance metrics of interest, such as job latency and resource utilization. In VineSim, one can configure several parameters, including how tasks are allocated to nodes and estimates of how fast they execute on different accelerators. VineSim can be used to explore how different deployments respond to different kinds of workloads, thus allowing one to determine how to best compose a datacenter based on particular workload, performance, or budgeting requirements.
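To make the input/output contract described above concrete, here is a minimal toy sketch of a deployment simulator in the same spirit. The deliverable does not publish VineSim's API, so every name and the greedy allocation policy below are hypothetical illustrations, not VineSim's actual interface.

```python
# Toy illustration of the VineSim idea: a deployment plus a workload in,
# latency and utilization metrics out. All names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Deployment:
    accelerators: dict  # accelerator type -> number of units, e.g. {"gpu": 2}

@dataclass
class Job:
    name: str
    runtimes: dict      # accelerator type -> estimated runtime in seconds

def simulate(deployment: Deployment, workload: list[Job]) -> dict:
    """Greedy allocation: each job goes to the unit that finishes it soonest."""
    # Next-free time per accelerator unit, keyed by (type, unit index).
    free_at = {(kind, i): 0.0
               for kind, count in deployment.accelerators.items()
               for i in range(count)}
    latencies, busy = [], 0.0
    for job in workload:   # all jobs assumed to arrive at t = 0
        unit = min((u for u in free_at if u[0] in job.runtimes),
                   key=lambda u: free_at[u] + job.runtimes[u[0]])
        runtime = job.runtimes[unit[0]]
        free_at[unit] += runtime
        latencies.append(free_at[unit])
        busy += runtime
    makespan = max(free_at.values())
    return {"mean_latency": sum(latencies) / len(latencies),
            "utilization": busy / (makespan * len(free_at))}

metrics = simulate(
    Deployment({"gpu": 2, "fpga": 1}),
    [Job("inference", {"gpu": 1.0, "fpga": 3.0}),
     Job("encryption", {"fpga": 0.5, "gpu": 2.0}),
     Job("training", {"gpu": 4.0})])
print(metrics)
```

Swapping the deployment dictionary or the per-accelerator runtime estimates and re-running is exactly the kind of what-if exploration the abstract attributes to VineSim.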
Commodity single board computer clusters and their applications
Current commodity Single Board Computers (SBCs) are sufficiently powerful to run mainstream operating systems and workloads. Many of these boards may be linked together to create small, low-cost clusters that replicate some features of large data center clusters. The Raspberry Pi Foundation produces a series of SBCs with a price/performance ratio that makes SBC clusters viable, perhaps even expendable. These clusters are an enabler for Edge/Fog Compute, where processing is pushed out towards data sources, reducing bandwidth requirements and decentralizing the architecture. In this paper we investigate the use cases driving the growth of SBC clusters, examine trends in future hardware developments, and discuss the potential of SBC clusters as a disruptive technology. Compared to traditional clusters, SBC clusters have a reduced footprint, are low-cost, and have low power requirements. This enables different models of deployment, particularly outside traditional data center environments. We discuss the applicability of existing software and management infrastructure to support exotic deployment scenarios and anticipate the next generation of SBC. We conclude that the SBC cluster is a new and distinct computational deployment paradigm, applicable to a wider range of scenarios than current clusters. It facilitates Internet of Things and Smart City systems and is potentially a game changer in pushing application logic out towards the network edge.
HPC Cloud for Scientific and Business Applications: Taxonomy, Vision, and Research Challenges
High Performance Computing (HPC) clouds are becoming an alternative to on-premise clusters for executing scientific applications and business analytics services. Most research efforts in HPC cloud aim to understand the cost-benefit of moving resource-intensive applications from on-premise environments to public cloud platforms. Industry trends show hybrid environments are the natural path to get the best of on-premise and cloud resources: steady (and sensitive) workloads can run on on-premise resources, and peak demand can leverage remote resources in a pay-as-you-go manner. Nevertheless, there are plenty of questions to be answered in HPC cloud, which range from how to extract the best performance from an unknown underlying platform to what services are essential to make its usage easier. Moreover, the discussion on the right pricing and contractual models to fit small and large users is relevant for the sustainability of HPC clouds. This paper brings a survey and taxonomy of efforts in HPC cloud and a vision of what we believe is ahead of us, including a set of research challenges that, once tackled, can help advance businesses and scientific discoveries. This becomes particularly relevant due to the fast-increasing wave of new HPC applications coming from big data and artificial intelligence. Published in ACM Computing Surveys (CSUR).
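The hybrid pattern the survey describes reduces, at its simplest, to a per-job placement decision. The sketch below is purely illustrative of that decision; the prices, budget, and rule are invented assumptions, not taken from any system in the survey.

```python
# Toy placement policy for the hybrid HPC pattern: steady or sensitive work
# stays on-premise, overflow bursts to pay-as-you-go cloud resources.
# Prices, budget, and the decision rule are invented for illustration.
from dataclasses import dataclass

@dataclass
class Job:
    core_hours: float
    sensitive: bool   # e.g. regulated data that must not leave the site

def place(job: Job, free_onprem_cores: int,
          cloud_price_per_core_hour: float = 0.04,
          remaining_cloud_budget: float = 100.0) -> str:
    """Decide where a single job should run."""
    if job.sensitive or free_onprem_cores > 0:
        return "on-premise"
    burst_cost = job.core_hours * cloud_price_per_core_hour
    if burst_cost <= remaining_cloud_budget:
        return "cloud"          # peak demand leverages remote resources
    return "queue on-premise"   # too expensive to burst; wait for local cores

print(place(Job(core_hours=500, sensitive=False), free_onprem_cores=0))
# -> "cloud": 500 * $0.04 = $20 fits the example budget
```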
Improving the Performance of Cloud-based Scientific Services
Cloud computing provides access to a large-scale set of readily available computing resources at the click of a button. The cloud paradigm has commoditised computing capacity and is often touted as a low-cost model for executing and scaling applications. However, there are significant technical challenges associated with selecting, acquiring, configuring, and managing cloud resources which can restrict the efficient utilisation of cloud capabilities.
Scientific computing is increasingly hosted on cloud infrastructure, in which scientific capabilities are delivered to the broad scientific community via Internet-accessible services. This migration from on-premise to on-demand cloud infrastructure is motivated by the sporadic usage patterns of scientific workloads and the associated potential cost savings, without the need to purchase, operate, and manage compute infrastructure, a task that few scientific users are trained to perform. However, cloud platforms are not an automatic solution. Their flexibility is derived from an enormous number of services and configuration options, which in turn result in significant complexity for the user. In fact, naïve cloud usage can result in poor performance and excessive costs, which are then directly passed on to researchers.
This thesis presents methods for developing efficient cloud-based scientific services. Three real-world scientific services are analysed and a set of common requirements is derived. To address these requirements, this thesis explores automated and scalable methods for inferring network performance, considers various trade-offs (e.g., cost and performance) when provisioning instances, and profiles application performance, all in heterogeneous and dynamic cloud environments. Specifically, network tomography provides the mechanisms to infer network performance in dynamic and opaque cloud networks; cost-aware automated provisioning approaches enable services to consider, in real time, various trade-offs such as cost, performance, and reliability; and automated application profiling allows a huge search space of applications, instance types, and configurations to be analysed to determine resource requirements and application performance. Finally, these contributions are integrated into an extensible and modular cloud provisioning and resource management service called SCRIMP. Cloud-based scientific applications and services can subscribe to SCRIMP to outsource their provisioning, usage, and management of cloud infrastructures. Collectively, the approaches presented in this thesis are shown to provide order-of-magnitude cost savings and significant performance improvements when employed by production scientific services.
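The cost/performance trade-off in instance provisioning can be pictured with a small scoring sketch. SCRIMP's real interface is not given in the abstract, so the instance catalogue, prices, and scalarization below are invented for illustration only.

```python
# Hypothetical sketch of cost-aware instance selection: score each instance
# type that meets the deadline by a weighted blend of dollar cost and runtime.
# Catalogue entries and prices are invented; real systems would normalize the
# units before blending rather than mixing dollars and hours directly.
CATALOGUE = [
    # (instance type, price per hour in $, measured job runtime in hours)
    ("m5.large",   0.096, 4.0),
    ("c5.2xlarge", 0.340, 1.1),
    ("c5.4xlarge", 0.680, 0.6),
]

def pick_instance(deadline_hours: float, weight_cost: float = 0.5):
    feasible = [e for e in CATALOGUE if e[2] <= deadline_hours]
    if not feasible:
        return None  # nothing meets the deadline; relax it or scale out

    def score(entry):
        _, price, runtime = entry
        return weight_cost * (price * runtime) + (1 - weight_cost) * runtime

    return min(feasible, key=score)[0]

print(pick_instance(deadline_hours=2.0))
# -> "c5.4xlarge" with these numbers: its lower runtime outweighs the cost
```

Shifting weight_cost toward 1.0 favours the cheapest feasible instance instead, which is the kind of policy knob a provisioning service exposes to its users.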
Optimization of deep learning algorithms for an autonomous RC vehicle
Master's dissertation in Informatics Engineering. This dissertation aims to evaluate and improve the performance of deep learning (DL) algorithms to autonomously drive a vehicle, using a Remo Car (an RC vehicle) as testbed. The RC vehicle was built from a 1:10-scale remote-controlled car fitted with an embedded system and a video camera to capture and process real-time image data. Two different embedded systems were comparatively evaluated: a homogeneous system, a Raspberry Pi 4, and a heterogeneous system, an NVidia Jetson Nano. The Raspberry Pi 4, with an advanced 4-core ARM device, supports multiprocessing, while the Jetson Nano, also with a 4-core ARM device, has an integrated accelerator, a 128 CUDA-core NVidia GPU.
The captured video is processed with convolutional neural networks (CNNs), which interpret image data of the vehicle's surroundings and predict critical data, such as lane view and steering angle, providing the mechanisms for the vehicle to drive on its own, following a predefined path. To improve the driving performance of the RC vehicle, this work analysed the programmed DL algorithms, namely different computer vision approaches for object detection and image classification, aiming to explore DL techniques and improve their performance at the inference phase. The work also analysed the computational efficiency of the control software while running intense and complex deep learning tasks on the embedded devices, and fully explored the advanced characteristics and instructions provided by the two embedded systems in the vehicle.
Different machine learning (ML) libraries and frameworks were analysed and evaluated: TensorFlow, TensorFlow Lite, Arm NN, PyArmNN and TensorRT. They play a key role in deploying the relevant algorithms and fully engaging the hardware capabilities. The original algorithm was successfully optimized, and both embedded systems could easily handle this workload. To understand the computational limits of both devices, an additional, heavier DL algorithm was developed to detect traffic signs. The homogeneous system, the Raspberry Pi 4, could not deliver feasibly low latency, so traffic signs could not be detected in real time. However, a great performance improvement was achieved with the heterogeneous system, the Jetson Nano, by enabling its CUDA cores to process the additional workload.
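To illustrate the kind of inference loop such a vehicle runs on device, here is a minimal TensorFlow Lite sketch. The dissertation's own code is not reproduced here; the model file name, input shape, and camera pipeline are assumptions for the example.

```python
# Minimal TensorFlow Lite steering-angle inference loop. Illustrative only:
# "steering.tflite" and the camera setup are assumptions, not artifacts
# from the dissertation.
import cv2                                           # pip install opencv-python
import numpy as np
from tflite_runtime.interpreter import Interpreter   # pip install tflite-runtime

interpreter = Interpreter(model_path="steering.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
_, height, width, _ = inp["shape"]                   # e.g. (1, 66, 200, 3)

camera = cv2.VideoCapture(0)                         # the car's onboard camera
while True:
    ok, frame = camera.read()
    if not ok:
        break
    # Resize and normalize the frame to match the CNN's training input.
    x = cv2.resize(frame, (width, height)).astype(np.float32) / 255.0
    interpreter.set_tensor(inp["index"], x[np.newaxis, ...])
    interpreter.invoke()
    steering_angle = float(interpreter.get_tensor(out["index"])[0][0])
    # In the vehicle, steering_angle would be forwarded to the servo controller.
    print(f"predicted steering angle: {steering_angle:.2f}")
```

The per-frame latency of interpreter.invoke() is exactly the quantity that separated the Raspberry Pi 4 from the Jetson Nano in the experiments summarized above.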
EDEN: A high-performance, general-purpose, NeuroML-based neural simulator
Modern neuroscience employs in silico experimentation on ever-increasing and more detailed neural networks. The high modelling detail goes hand in hand with the need for high model reproducibility, reusability and transparency. Besides, the size of the models and the long timescales under study mandate the use of a simulation system with high computational performance, so as to provide an acceptable time to result. In this work, we present EDEN (Extensible Dynamics Engine for Networks), a new general-purpose, NeuroML-based neural simulator that achieves both high model flexibility and high computational performance, through an innovative model-analysis and code-generation technique. The simulator runs NeuroML v2 models directly, eliminating the need for users to learn yet another simulator-specific model-specification language. EDEN's functional correctness and computational performance were assessed through NeuroML models available on the NeuroML-DB and Open Source Brain model repositories. In qualitative experiments, the results produced by EDEN were verified against the established NEURON simulator, for a wide range of models. At the same time, computational-performance benchmarks reveal that EDEN runs up to two orders of magnitude faster than NEURON on a typical desktop computer, and does so without additional effort from the user. Finally, and without added user effort, EDEN has been built from scratch to scale seamlessly over multiple CPUs and across computer clusters, when available.
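Since EDEN runs NeuroML v2/LEMS models directly, invoking it is a one-call affair. The sketch below follows the pattern of the project's Python examples, but treat the package name and the exact signature and return shape of runEden as assumptions rather than documented API; the LEMS file name is a placeholder.

```python
# Rough sketch of driving EDEN from Python, per the project's examples.
# "LEMS_MyNetwork.xml" is a placeholder for the user's own simulation file,
# which references the NeuroML v2 model and declares what to record.
from eden_simulator import runEden   # pip install eden-simulator (assumed)

results = runEden("LEMS_MyNetwork.xml")

# Assumed return shape: recorded trajectories keyed by output column,
# with a time vector among them.
for column, trace in results.items():
    print(column, trace[:5])
```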
Architecting Data Centers for High Efficiency and Low Latency
Modern data centers, housing remarkably powerful computational capacity, are built at massive scale and consume a huge amount of energy. The energy consumption of data centers has mushroomed from virtually nothing to about three percent of the global electricity supply over the last decade, and will continue to grow. Unfortunately, a significant fraction of this energy is wasted due to the inefficiency of current data center architectures, and one of the key reasons behind this inefficiency is the stringent response latency requirements of the user-facing services hosted in these data centers, such as web search and social networks. To deliver such low response latency, data center operators often have to overprovision resources to handle high peaks in user load and unexpected load spikes, resulting in low efficiency.
This dissertation investigates data center architecture designs that reconcile high system efficiency and low response latency. To increase efficiency, we propose techniques that understand both microarchitectural-level resource sharing and system-level resource usage dynamics to enable highly efficient co-location of latency-critical services and low-priority batch workloads. We investigate resource sharing on real-system simultaneous multithreading (SMT) processors to enable SMT co-locations by precisely predicting performance interference. We then leverage historical resource usage patterns to further optimize the task scheduling algorithm and data placement policy to improve the efficiency of workload co-locations. Moreover, we introduce methodologies to better manage response latency by automatically attributing the sources of tail latency to low-level architectural and system configurations, in both offline load-testing environments and online production environments. We design and develop a response latency evaluation framework with microsecond-level precision for data center applications, with which we construct statistical inference procedures to attribute the sources of tail latency. Finally, we present an approach that proactively enacts carefully designed causal-inference micro-experiments to diagnose the root causes of response latency anomalies and automatically correct them to reduce response latency. PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies.
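Tail latency here means the slowest requests, typically the 99th percentile, which drives overprovisioning far more than the mean does. A small illustration with invented numbers:

```python
# Why operators provision for the tail: the mean hides the slow requests that
# user-facing services must still answer quickly. All numbers are invented.
import random

random.seed(42)
# Simulate 10,000 request latencies in ms: mostly fast, occasionally slow.
latencies = [random.expovariate(1 / 20) for _ in range(10_000)]

def percentile(samples, p):
    """Nearest-rank percentile, p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

mean = sum(latencies) / len(latencies)
p99 = percentile(latencies, 99)
print(f"mean = {mean:.1f} ms, p99 = {p99:.1f} ms")
# For this distribution the p99 is several times the mean, so capacity sized
# for the mean would routinely violate a tail-latency objective.
```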
Recent Advances in Embedded Computing, Intelligence and Applications
The latest proliferation of Internet of Things deployments and edge computing, combined with artificial intelligence, has led to new exciting application scenarios where embedded digital devices are essential enablers. Moreover, new powerful and efficient devices are appearing to cope with workloads formerly reserved for the cloud, such as deep learning. These devices allow processing close to where data are generated, avoiding bottlenecks due to communication limitations. The efficient integration of hardware, software and artificial intelligence capabilities deployed in real sensing contexts empowers the edge intelligence paradigm, which will ultimately foster the offloading of processing functionalities to the edge. In this Special Issue, researchers have contributed nine peer-reviewed papers covering a wide range of topics in the area of edge intelligence. Among them are hardware-accelerated implementations of deep neural networks, IoT platforms for extreme edge computing, neuro-evolvable and neuromorphic machine learning, and embedded recommender systems.