3,818 research outputs found

    A hybrid queueing model for fast broadband networking simulation

    Get PDF
    PhDThis research focuses on the investigation of a fast simulation method for broadband telecommunication networks, such as ATM networks and IP networks. As a result of this research, a hybrid simulation model is proposed, which combines the analytical modelling and event-driven simulation modelling to speeding up the overall simulation. The division between foreground and background traffic and the way of dealing with these different types of traffic to achieve improvement in simulation time is the major contribution reported in this thesis. Background traffic is present to ensure that proper buffering behaviour is included during the course of the simulation experiments, but only the foreground traffic of interest is simulated, unlike traditional simulation techniques. Foreground and background traffic are dealt with in a different way. To avoid the need for extra events on the event list, and the processing overhead, associated with the background traffic, the novel technique investigated in this research is to remove the background traffic completely, adjusting the service time of the queues for the background traffic to compensate (in most cases, the service time for the foreground traffic will increase). By removing the background traffic from the event-driven simulator the number of cell processing events dealt with is reduced drastically. Validation of this approach shows that, overall, the method works well, but the simulation using this method does have some differences compared with experimental results on a testbed. The reason for this is mainly because of the assumptions behind the analytical model that make the modelling tractable. Hence, the analytical model needs to be adjusted. This is done by having a neural network trained to learn the relationship between the input traffic parameters and the output difference between the proposed model and the testbed. Following this training, simulations can be run using the output of the neural network to adjust the analytical model for those particular traffic conditions. The approach is applied to cell scale and burst scale queueing to simulate an ATM switch, and it is also used to simulate an IP router. In all the applications, the method ensures a fast simulation as well as an accurate result

    Leveraging RISC-V to build an open-source (hardware) OS framework for reconfigurable IoT devices

    Get PDF
    With the growing interest in RISC-V systems and the endless possi bilities of creating customized hardware architectures, we introduce the first proof of concept (PoC) implementation of ChamelIoT, the first open-source agnostic hardware operating system (OS) frame work for reconfigurable Internet of Things (IoT) low-end devices. At this stage, ChamelIoT, leveraging the Rocket Custom Co-Processor Interface (RoCC), provides hardware acceleration support for thread management and scheduling of three different OSes: RIOT, Zephyr, and FreeRTOS. This paper overviews the overall ChamelIoT archi tecture and describes the implementation details of the current PoC deployment. Our first experiments were carried out on a Xilinx Arty-35T FPGA Evaluation kit and the preliminary results are very promising, showing that the desired agnosticism and flexibility can be achieved with determinism and performance advantages at a reasonable cost of hardware resources

    Supporting Preemptive Task Executions and Memory Copies in GPGPUs

    Get PDF
    GPGPUs (General Purpose Graphic Processing Units) provide massive computational power. However, applying GPGPU technology to real-time computing is challenging due to the non-preemptive nature of GPGPUs. Especially, a job running in a GPGPU or a data copy between a GPGPU and CPU is non-preemptive. As a result, a high priority job arriving in the middle of a low priority job execution or memory copy suffers from priority inversion. To address the problem, we present a new lightweight approach to supporting preemptive memory copies and job executions in GPGPUs. Moreover, in our approach, a GPGPU job and memory copy between a GPGPU and the hosting CPU are run concurrently to enhance the responsiveness. To show the feasibility of our approach, we have implemented a prototype system for preemptive job executions and data copies in a GPGPU. The experimental results show that our approach can bound the response times in a reliable manner. In addition, the response time of our approach is significantly shorter than those of the unmodified GPGPU runtime system that supports no preemption and an advanced GPGPU model designed to support prioritization and performance isolation via preemptive data copies

    Parallel architectures and runtime systems co-design for task-based programming models

    Get PDF
    The increasing parallelism levels in modern computing systems has extolled the need for a holistic vision when designing multiprocessor architectures taking in account the needs of the programming models and applications. Nowadays, system design consists of several layers on top of each other from the architecture up to the application software. Although this design allows to do a separation of concerns where it is possible to independently change layers due to a well-known interface between them, it is hampering future systems design as the Law of Moore reaches to an end. Current performance improvements on computer architecture are driven by the shrinkage of the transistor channel width, allowing faster and more power efficient chips to be made. However, technology is reaching physical limitations were the transistor size will not be able to be reduced furthermore and requires a change of paradigm in systems design. This thesis proposes to break this layered design, and advocates for a system where the architecture and the programming model runtime system are able to exchange information towards a common goal, improve performance and reduce power consumption. By making the architecture aware of runtime information such as a Task Dependency Graph (TDG) in the case of dataflow task-based programming models, it is possible to improve power consumption by exploiting the critical path of the graph. Moreover, the architecture can provide hardware support to create such a graph in order to reduce the runtime overheads and making possible the execution of fine-grained tasks to increase the available parallelism. Finally, the current status of inter-node communication primitives can be exposed to the runtime system in order to perform a more efficient communication scheduling, and also creates new opportunities of computation and communication overlap that were not possible before. An evaluation of the proposals introduced in this thesis is provided and a methodology to simulate and characterize the application behavior is also presented.El aumento del paralelismo proporcionado por los sistemas de cĂłmputo modernos ha provocado la necesidad de una visiĂłn holĂ­stica en el diseño de arquitecturas multiprocesador que tome en cuenta las necesidades de los modelos de programaciĂłn y las aplicaciones. Hoy en dĂ­a el diseño de los computadores consiste en diferentes capas de abstracciĂłn con una interfaz bien definida entre ellas. Las limitaciones de esta aproximaciĂłn junto con el fin de la ley de Moore limitan el potencial de los futuros computadores. La mayorĂ­a de las mejoras actuales en el diseño de los computadores provienen fundamentalmente de la reducciĂłn del tamaño del canal del transistor, lo cual permite chips mĂĄs rĂĄpidos y con un consumo eficiente sin apenas cambios fundamentales en el diseño de la arquitectura. Sin embargo, la tecnologĂ­a actual estĂĄ alcanzando limitaciones fĂ­sicas donde no serĂĄ posible reducir el tamaño de los transistores motivando asĂ­ un cambio de paradigma en la construcciĂłn de los computadores. Esta tesis propone romper este diseño en capas y abogar por un sistema donde la arquitectura y el sistema de tiempo de ejecuciĂłn del modelo de programaciĂłn sean capaces de intercambiar informaciĂłn para alcanzar una meta comĂșn: La mejora del rendimiento y la reducciĂłn del consumo energĂ©tico. Haciendo que la arquitectura sea consciente de la informaciĂłn disponible en el modelo de programaciĂłn, como puede ser el grafo de dependencias entre tareas en los modelos de programaciĂłn dataflow, es posible reducir el consumo energĂ©tico explotando el camino critico del grafo. AdemĂĄs, la arquitectura puede proveer de soporte hardware para crear este grafo con el objetivo de reducir el overhead de construir este grado cuando la granularidad de las tareas es demasiado fina. Finalmente, el estado de las comunicaciones entre nodos puede ser expuesto al sistema de tiempo de ejecuciĂłn para realizar una mejor planificaciĂłn de las comunicaciones y creando nuevas oportunidades de solapamiento entre cĂłmputo y comunicaciĂłn que no eran posibles anteriormente. Esta tesis aporta una evaluaciĂłn de todas estas propuestas, asĂ­ como una metodologĂ­a para simular y caracterizar el comportamiento de las aplicacionesPostprint (published version

    Gunrock: A High-Performance Graph Processing Library on the GPU

    Full text link
    For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs have been two significant challenges for developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We evaluate Gunrock on five key graph primitives and show that Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives, and better performance than any other GPU high-level graph library.Comment: 14 pages, accepted by PPoPP'16 (removed the text repetition in the previous version v5

    CATA: Criticality aware task acceleration for multicore processors

    Get PDF
    Managing criticality in task-based programming models opens a wide range of performance and power optimization opportunities in future manycore systems. Criticality aware task schedulers can benefit from these opportunities by scheduling tasks to the most appropriate cores. However, these schedulers may suffer from priority inversion and static binding problems that limit their expected improvements. Based on the observation that task criticality information can be exploited to drive hardware reconfigurations, we propose a Criticality Aware Task Acceleration (CATA) mechanism that dynamically adapts the computational power of a task depending on its criticality. As a result, CATA achieves significant improvements over a baseline static scheduler, reaching average improvements up to 18.4% in execution time and 30.1% in Energy-Delay Product (EDP) on a simulated 32-core system. The cost of reconfiguring hardware by means of a software-only solution rises with the number of cores due to lock contention and reconfiguration overhead. Therefore, novel architectural support is proposed to eliminate these overheads on future manycore systems. This architectural support minimally extends hardware structures already present in current processors, which allows further improvements in performance with negligible overhead. As a consequence, average improvements of up to 20.4% in execution time and 34.0% in EDP are obtained, outperforming state-of-the-art acceleration proposals not aware of task criticality.This work has been supported by the Spanish Government (grant SEV2015-0493, SEV-2011-00067 of the Severo Ochoa Program), by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316, TIN2012-34557, TIN2013-46957-C2-2-P), by Generalitat de Catalunya (contracts 2014-SGR- 1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project receives funding from the EU’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no 610402 and from the EU’s H2020 Framework Programme (H2020/2014-2020) under grant agreement no 671697. M. Moret®o has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243). E. Castillo has been partially supported by the Spanish Ministry of Education, Culture and Sports under grant FPU2012/2254.Peer ReviewedPostprint (author's final draft

    GeNN: a code generation framework for accelerated brain simulations

    Get PDF
    Large-scale numerical simulations of detailed brain circuit models are important for identifying hypotheses on brain functions and testing their consistency and plausibility. An ongoing challenge for simulating realistic models is, however, computational speed. In this paper, we present the GeNN (GPU-enhanced Neuronal Networks) framework, which aims to facilitate the use of graphics accelerators for computational models of large-scale neuronal networks to address this challenge. GeNN is an open source library that generates code to accelerate the execution of network simulations on NVIDIA GPUs, through a flexible and extensible interface, which does not require in-depth technical knowledge from the users. We present performance benchmarks showing that 200-fold speedup compared to a single core of a CPU can be achieved for a network of one million conductance based Hodgkin-Huxley neurons but that for other models the speedup can differ. GeNN is available for Linux, Mac OS X and Windows platforms. The source code, user manual, tutorials, Wiki, in-depth example projects and all other related information can be found on the project website http://genn-team.github.io/genn/

    Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models

    Full text link
    A Retrieval-Augmented Language Model (RALM) augments a generative language model by retrieving context-specific knowledge from an external database. This strategy facilitates impressive text generation quality even with smaller models, thus reducing orders of magnitude of computational demands. However, RALMs introduce unique system design challenges due to (a) the diverse workload characteristics between LM inference and retrieval and (b) the various system requirements and bottlenecks for different RALM configurations such as model sizes, database sizes, and retrieval frequencies. We propose Chameleon, a heterogeneous accelerator system that integrates both LM and retrieval accelerators in a disaggregated architecture. The heterogeneity ensures efficient acceleration of both LM inference and retrieval, while the accelerator disaggregation enables the system to independently scale both types of accelerators to fulfill diverse RALM requirements. Our Chameleon prototype implements retrieval accelerators on FPGAs and assigns LM inference to GPUs, with a CPU server orchestrating these accelerators over the network. Compared to CPU-based and CPU-GPU vector search systems, Chameleon achieves up to 23.72x speedup and 26.2x energy efficiency. Evaluated on various RALMs, Chameleon exhibits up to 2.16x reduction in latency and 3.18x speedup in throughput compared to the hybrid CPU-GPU architecture. These promising results pave the way for bringing accelerator heterogeneity and disaggregation into future RALM systems
    • 

    corecore