3,818 research outputs found
A hybrid queueing model for fast broadband networking simulation
This research investigates a fast simulation method for broadband telecommunication networks, such as ATM networks and IP networks. As a result of this research, a hybrid simulation model is proposed, which combines analytical modelling with event-driven simulation modelling to speed up the overall simulation.
The major contribution reported in this thesis is the division of traffic into foreground and background components, and the different treatment of each to reduce simulation time. Background traffic is retained so that realistic buffering behaviour is captured during the simulation experiments, but, unlike traditional simulation techniques, only the foreground traffic of interest is actually simulated.
To avoid the extra events on the event list, and the processing overhead, associated with the background traffic, the novel technique investigated in this research removes the background traffic completely and adjusts the service time of the queues to compensate for it (in most cases, the service time experienced by the foreground traffic increases). By removing the background traffic from the event-driven simulator, the number of cell-processing events is reduced drastically.
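As a rough illustration of the compensation idea, the sketch below folds the background load into an effective service time for the foreground traffic, assuming a simple M/M/1 queue; the thesis does not prescribe this exact formula, so the function and its parameters are illustrative only.

```python
# Hedged sketch: fold background load into the service time seen by foreground
# cells, assuming an M/M/1 queue (an assumption, not the thesis's exact model).
def effective_service_time(base_service_time: float,
                           background_arrival_rate: float) -> float:
    """base_service_time in seconds per cell; background_arrival_rate in cells/s."""
    background_load = background_arrival_rate * base_service_time  # utilisation
    if background_load >= 1.0:
        raise ValueError("background traffic alone saturates the queue")
    # Server capacity left for foreground traffic shrinks by the background
    # utilisation, so each foreground cell effectively takes longer to serve.
    return base_service_time / (1.0 - background_load)

# Example: 10 us per cell with 60,000 background cells/s -> ~25 us effective.
print(effective_service_time(10e-6, 60_000))
```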
Validation of this approach shows that, overall, the method works well, although simulations using it show some differences from experimental results obtained on a testbed. These differences arise mainly from the simplifying assumptions that make the analytical model tractable.
Hence, the analytical model needs to be adjusted. This is done by training a neural network to learn the relationship between the input traffic parameters and the difference between the output of the proposed model and that of the testbed. Once trained, the neural network's output is used during simulation to adjust the analytical model for the particular traffic conditions.
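A minimal sketch of this correction step is given below, framing it as a regression problem; the parameter set, network size, and scikit-learn usage are assumptions for illustration, not details taken from the thesis.

```python
# Hedged sketch: learn the model-vs-testbed error from traffic parameters.
import numpy as np
from sklearn.neural_network import MLPRegressor

# Each row: illustrative traffic parameters (e.g. load, burstiness, buffer size).
traffic_params = np.array([[0.3, 2.0, 64],
                           [0.5, 4.0, 64],
                           [0.7, 8.0, 128],
                           [0.9, 4.0, 128]])
# Observed difference between the hybrid model's output and the testbed.
model_vs_testbed_error = np.array([0.01, 0.03, 0.08, 0.15])

corrector = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
corrector.fit(traffic_params, model_vs_testbed_error)

# At simulation time, the predicted error adjusts the analytical model's output.
predicted_error = corrector.predict([[0.6, 5.0, 128]])[0]
print(predicted_error)
```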
The approach is applied to cell-scale and burst-scale queueing to simulate an ATM switch, and it is also used to simulate an IP router. In all these applications, the method delivers fast simulation as well as accurate results.
Leveraging RISC-V to build an open-source (hardware) OS framework for reconfigurable IoT devices
With the growing interest in RISC-V systems and the endless possibilities of creating customized hardware architectures, we introduce the first proof of concept (PoC) implementation of ChamelIoT, the first open-source agnostic hardware operating system (OS) framework for reconfigurable Internet of Things (IoT) low-end devices. At
this stage, ChamelIoT, leveraging the Rocket Custom Co-Processor
Interface (RoCC), provides hardware acceleration support for thread
management and scheduling of three different OSes: RIOT, Zephyr,
and FreeRTOS. This paper overviews the overall ChamelIoT architecture and describes the implementation details of the current PoC
deployment. Our first experiments were carried out on a Xilinx
Arty-35T FPGA Evaluation kit and the preliminary results are very
promising, showing that the desired agnosticism and flexibility can
be achieved with determinism and performance advantages at a
reasonable cost of hardware resources.
Supporting Preemptive Task Executions and Memory Copies in GPGPUs
GPGPUs (General Purpose Graphic Processing Units) provide massive computational power. However, applying GPGPU technology to real-time computing is challenging due to the non-preemptive nature of GPGPUs. In particular, a job running in a GPGPU or a data copy between a GPGPU and the CPU is non-preemptive. As a result, a high-priority job arriving in the middle of a low-priority job execution or memory copy suffers from priority inversion. To address the problem, we present a new lightweight approach to supporting preemptive memory copies and job executions in GPGPUs. Moreover, in our approach, a GPGPU job and a memory copy between the GPGPU and the hosting CPU run concurrently to enhance responsiveness. To show the feasibility of our approach, we have implemented a prototype system for preemptive job executions and data copies in a GPGPU. The experimental results show that our approach can bound response times in a reliable manner. In addition, the response time of our approach is significantly shorter than that of either the unmodified GPGPU runtime system, which supports no preemption, or an advanced GPGPU model designed to support prioritization and performance isolation via preemptive data copies.
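One common way to make an otherwise non-preemptive transfer preemptible is to split it into chunks and re-check priorities between chunks; the sketch below illustrates that general idea in plain Python, with hypothetical names such as copy_chunk and an arbitrary chunk size, and is not the paper's actual implementation.

```python
# Hedged sketch: chunked copies create preemption points between chunks.
import heapq
from dataclasses import dataclass, field

CHUNK_BYTES = 1 << 20  # copy 1 MiB at a time so higher-priority work can cut in

@dataclass(order=True)
class CopyRequest:
    priority: int                          # lower value = higher priority
    size: int = field(compare=False, default=0)
    offset: int = field(compare=False, default=0)

def copy_chunk(req: CopyRequest, nbytes: int) -> None:
    # Placeholder for a real host<->device transfer of `nbytes` bytes.
    req.offset += nbytes

def run_copies(pending: list) -> None:
    heapq.heapify(pending)
    while pending:
        req = heapq.heappop(pending)       # always resume the highest-priority copy
        nbytes = min(CHUNK_BYTES, req.size - req.offset)
        copy_chunk(req, nbytes)
        if req.offset < req.size:          # not finished: requeue after one chunk,
            heapq.heappush(pending, req)   # which creates a preemption point

run_copies([CopyRequest(priority=1, size=8 << 20),
            CopyRequest(priority=0, size=2 << 20)])
```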
Parallel architectures and runtime systems co-design for task-based programming models
The increasing parallelism levels in modern computing systems have heightened the need for a holistic vision when designing multiprocessor architectures, one that takes into account the needs of the programming models and applications. Nowadays, system design consists of several layers stacked on top of each other, from the architecture up to the application software. Although this layering provides a separation of concerns, where each layer can be changed independently thanks to well-known interfaces between them, it hampers future systems design as Moore's Law reaches its end. Current performance improvements in computer architecture are driven by the shrinkage of the transistor channel width, allowing faster and more power-efficient chips to be made. However, technology is reaching physical limitations where the transistor size can no longer be reduced, which requires a change of paradigm in systems design.
This thesis proposes to break this layered design and advocates for a system where the architecture and the programming model's runtime system exchange information towards a common goal: improving performance and reducing power consumption. By making the architecture aware of runtime information, such as the Task Dependency Graph (TDG) in the case of dataflow task-based programming models, it is possible to reduce power consumption by exploiting the critical path of the graph. Moreover, the architecture can provide hardware support to create such a graph, reducing runtime overheads and making the execution of fine-grained tasks possible, which increases the available parallelism. Finally, the current status of inter-node communication primitives can be exposed to the runtime system to perform more efficient communication scheduling, which also creates new opportunities for computation-communication overlap that were not possible before. An evaluation of the proposals introduced in this thesis is provided, and a methodology to simulate and characterize application behavior is also presented.
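As a small illustration of how a runtime might use a TDG, the sketch below computes the critical path of a toy graph, which a criticality-aware system could then use to decide which cores to boost; the graph, costs, and function names are illustrative only.

```python
# Hedged sketch: find the critical path of a toy Task Dependency Graph (TDG).
from functools import lru_cache

# task -> (cost, successor tasks)
TDG = {
    "A": (4, ["B", "C"]),
    "B": (2, ["D"]),
    "C": (6, ["D"]),
    "D": (3, []),
}

@lru_cache(maxsize=None)
def longest_path_from(task: str) -> int:
    cost, successors = TDG[task]
    return cost + max((longest_path_from(s) for s in successors), default=0)

# The critical path is the longest chain of dependent tasks.
head = max(TDG, key=longest_path_from)
critical_path, task = [head], head
while TDG[task][1]:
    task = max(TDG[task][1], key=longest_path_from)
    critical_path.append(task)

print(critical_path, longest_path_from(head))   # ['A', 'C', 'D'] 13
```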
Gunrock: A High-Performance Graph Processing Library on the GPU
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs have been two
significant challenges for developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We evaluate Gunrock on five key graph
primitives and show that Gunrock has on average at least an order of magnitude
speedup over Boost and PowerGraph, comparable performance to the fastest GPU
hardwired primitives, and better performance than any other GPU high-level
graph library.
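The frontier-centric, bulk-synchronous style of computation described above can be sketched in a few lines of plain Python; this is a conceptual illustration of the advance/filter pattern on a vertex frontier, not Gunrock's actual API.

```python
# Hedged sketch: level-synchronous BFS expressed as advance/filter on a frontier.
def bfs_frontier(adjacency, source):
    depth = {source: 0}
    frontier = [source]                        # current vertex frontier
    level = 0
    while frontier:
        level += 1
        # "Advance": expand every frontier vertex to its neighbours.
        expanded = [nbr for v in frontier for nbr in adjacency.get(v, [])]
        # "Filter": keep only vertices not visited yet.
        frontier = []
        for nbr in expanded:
            if nbr not in depth:
                depth[nbr] = level
                frontier.append(nbr)           # one bulk-synchronous step per level
    return depth

print(bfs_frontier({0: [1, 2], 1: [3], 2: [3], 3: []}, source=0))
```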
CATA: Criticality aware task acceleration for multicore processors
Managing criticality in task-based programming models opens a wide range of performance and power optimization opportunities in future manycore systems. Criticality-aware task schedulers can benefit from these opportunities by scheduling tasks to the most appropriate cores. However, these schedulers may suffer from priority inversion and static binding problems that limit their expected improvements. Based on the observation that task criticality information can be exploited to drive hardware reconfigurations, we propose a Criticality Aware Task Acceleration (CATA) mechanism that dynamically adapts the computational power of a task depending on its criticality. As a result, CATA achieves significant improvements over a baseline static scheduler, reaching average improvements of up to 18.4% in execution time and 30.1% in Energy-Delay Product (EDP) on a simulated 32-core system. The cost of reconfiguring hardware by means of a software-only solution rises with the number of cores due to lock contention and reconfiguration overhead. Therefore, novel architectural support is proposed to eliminate these overheads on future manycore systems. This architectural support minimally extends hardware structures already present in current processors, which allows further improvements in performance with negligible overhead. As a consequence, average improvements of up to 20.4% in execution time and 34.0% in EDP are obtained, outperforming state-of-the-art acceleration proposals not aware of task criticality.
This work has been supported by the Spanish Government (grant SEV2015-0493, SEV-2011-00067 of the Severo Ochoa
Program), by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316, TIN2012-34557, TIN2013-46957-C2-2-P), by Generalitat de Catalunya (contracts 2014-SGR-
1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project receives funding from the
EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 610402 and from the EU's H2020 Framework
Programme (H2020/2014-2020) under grant agreement no 671697. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de
la Cierva postdoctoral fellowship number JCI-2012-15047.
M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the
Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243). E. Castillo has been partially supported by the Spanish Ministry of Education, Culture and Sports under grant FPU2012/2254.
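To make the criticality-driven reconfiguration idea concrete, the sketch below picks a frequency level per task from its criticality and slack; the frequency levels and decision rule are illustrative assumptions, not CATA's actual mechanism.

```python
# Hedged sketch: boost critical tasks, slow down non-critical tasks with slack.
FREQ_LEVELS_MHZ = [1000, 1500, 2000]   # illustrative DVFS operating points

def pick_frequency(is_critical: bool, slack_cycles: int) -> int:
    """Choose an operating frequency for a task before it starts executing."""
    if is_critical:
        return FREQ_LEVELS_MHZ[-1]      # run critical-path work at full speed
    # Non-critical work with slack can run slower to save energy.
    return FREQ_LEVELS_MHZ[0] if slack_cycles > 0 else FREQ_LEVELS_MHZ[1]

print(pick_frequency(True, 0), pick_frequency(False, 5_000))
```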
GeNN: a code generation framework for accelerated brain simulations
Large-scale numerical simulations of detailed brain circuit models are important for identifying hypotheses on brain functions and testing their consistency and plausibility. An ongoing challenge for simulating realistic models, however, is computational speed. In this paper, we present the GeNN (GPU-enhanced Neuronal Networks) framework, which aims to facilitate the use of graphics accelerators for computational models of large-scale neuronal networks to address this challenge. GeNN is an open-source library that generates code to accelerate the execution of network simulations on NVIDIA GPUs, through a flexible and extensible interface that does not require in-depth technical knowledge from the users. We present performance benchmarks showing that a 200-fold speedup over a single CPU core can be achieved for a network of one million conductance-based Hodgkin-Huxley neurons, while for other models the speedup differs.
GeNN is available for Linux, Mac OS X and Windows platforms. The source code, user manual, tutorials,
Wiki, in-depth example projects and all other related information can be found on the project website http://genn-team.github.io/genn/
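To give a flavour of the code-generation approach in general, the sketch below turns a tiny neuron-model description into CUDA kernel source from a template; the template, parameter names, and update rule are illustrative and do not correspond to GeNN's actual model-definition interface.

```python
# Hedged sketch of code generation: emit GPU kernel source from a model spec.
KERNEL_TEMPLATE = """
__global__ void update_{name}(float *v, int n) {{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {{
        v[i] += {dt}f * ({update_rule});
    }}
}}
"""

def generate_kernel(name: str, update_rule: str, dt: float = 0.1) -> str:
    """Fill the template with a per-neuron state-update rule."""
    return KERNEL_TEMPLATE.format(name=name, update_rule=update_rule, dt=dt)

# Example: a leaky integrate-and-fire style voltage update (illustrative only).
print(generate_kernel("lif", "-(v[i] - (-65.0f)) / 10.0f"))
```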
Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models
A Retrieval-Augmented Language Model (RALM) augments a generative language
model by retrieving context-specific knowledge from an external database. This
strategy facilitates impressive text generation quality even with smaller models, thus reducing computational demands by orders of magnitude. However,
RALMs introduce unique system design challenges due to (a) the diverse workload
characteristics between LM inference and retrieval and (b) the various system
requirements and bottlenecks for different RALM configurations such as model
sizes, database sizes, and retrieval frequencies. We propose Chameleon, a
heterogeneous accelerator system that integrates both LM and retrieval
accelerators in a disaggregated architecture. The heterogeneity ensures
efficient acceleration of both LM inference and retrieval, while the
accelerator disaggregation enables the system to independently scale both types
of accelerators to fulfill diverse RALM requirements. Our Chameleon prototype
implements retrieval accelerators on FPGAs and assigns LM inference to GPUs,
with a CPU server orchestrating these accelerators over the network. Compared
to CPU-based and CPU-GPU vector search systems, Chameleon achieves up to 23.72x speedup and up to 26.2x better energy efficiency. Evaluated on various RALMs, Chameleon
exhibits up to 2.16x reduction in latency and 3.18x speedup in throughput
compared to the hybrid CPU-GPU architecture. These promising results pave the
way for bringing accelerator heterogeneity and disaggregation into future RALM
systems.
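A short sketch of the RALM serving loop that such a disaggregated system orchestrates is given below: a retrieval service and an LM service behind separate backends, with the host stitching them together. The function names and prompt format are hypothetical placeholders, not Chameleon's interface.

```python
# Hedged sketch: retrieval-augmented generation with disaggregated backends.
from typing import List

def retrieve(query: str, top_k: int = 3) -> List[str]:
    # Placeholder for a vector-search call to the retrieval accelerators (FPGAs).
    return [f"passage about {query} #{i}" for i in range(top_k)]

def generate(prompt: str, max_tokens: int = 64) -> str:
    # Placeholder for an inference call to the LM accelerators (GPUs).
    return prompt[:32] + " ... [generated continuation]"

def ralm_answer(query: str) -> str:
    passages = retrieve(query)                       # knowledge from the database
    prompt = "\n".join(passages) + "\nQuestion: " + query + "\nAnswer:"
    return generate(prompt)                          # context-augmented generation

print(ralm_answer("disaggregated accelerators"))
```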
- …