48 research outputs found

    Accelerating Network Functions using Reconfigurable Hardware. Design and Validation of High Throughput and Low Latency Network Functions at the Access Edge

    Get PDF
    Providing Internet access to billions of people worldwide is one of the main technical challenges in the current decade. The Internet access edge connects each residential and mobile subscriber to this network and ensures a certain Quality of Service (QoS). However, the implementation of access edge functionality challenges Internet service providers: First, a good QoS must be provided to the subscribers, for example, high throughput and low latency. Second, the quick rollout of new technologies and functionality demands flexible configuration and programming possibilities of the network components; for example, the support of novel, use-case-specific network protocols. The functionality scope of an Internet access edge requires the use of programming concepts, such as Network Functions Virtualization (NFV). The drawback of NFV-based network functions is a significantly lowered resource efficiency due to the execution as software, commonly resulting in a lowered QoS compared to rigid hardware solutions. The usage of programmable hardware accelerators, named NFV offloading, helps to improve the QoS and flexibility of network function implementations. In this thesis, we design network functions on programmable hardware to improve the QoS and flexibility. First, we introduce the host bypassing concept for improved integration of hardware accelerators in computer systems, for example, in 5G radio access networks. This novel concept bypasses the system’s main memory and enables direct connectivity between the accelerator and network interface card. Our evaluations show an improved throughput and significantly lowered latency jitter for the presented approach. Second, we analyze different programmable hardware technologies for hardware-accelerated Internet subscriber handling, including three P4-programmable platforms and FPGAs. Our results demonstrate that all approaches have excellent performance and are suitable for Internet access creation. We present a fully-fledged User Plane Function (UPF) designed upon these concepts and test it in an end-to-end 5G standalone network as part of this contribution. Third, we analyze and demonstrate the usability of Active Queue Management (AQM) algorithms on programmable hardware as an expansion to the access edge. We show the feasibility of the CoDel AQM algorithm and discuss the challenges and constraints to be considered when limited hardware is used. The results show significant improvements in the QoS when the AQM algorithm is deployed on hardware. Last, we focus on network function benchmarking, which is crucial for understanding the behavior of implementations and their optimization, e.g., Internet access creation. For this, we introduce the load generation and measurement framework P4STA, benefiting from flexible software-based load generation and hardware-assisted measuring. Utilizing programmable network switches, we achieve a nanosecond time accuracy while generating test loads up to the available Ethernet link speed

    NFComms: A synchronous communication framework for the CPU-NFP heterogeneous system

    Get PDF
    This work explores the viability of using a Network Flow Processor (NFP), developed by Netronome, as a coprocessor for the construction of a CPU-NFP heterogeneous platform in the domain of general processing. When considering heterogeneous platforms involving architectures like the NFP, the communication framework provided is typically represented as virtual network interfaces and is thus not suitable for generic communication. To enable a CPU-NFP heterogeneous platform for use in the domain of general computing, a suitable generic communication framework is required. A feasibility study for a suitable communication medium between the two candidate architectures showed that a generic framework that conforms to the mechanisms dictated by Communicating Sequential Processes is achievable. The resulting NFComms framework, which facilitates inter- and intra-architecture communication through the use of synchronous message passing, supports up to 16 unidirectional channels and includes queuing mechanisms for transparently supporting concurrent streams exceeding the channel count. The framework has a minimum latency of between 15.5 μs and 18 μs per synchronous transaction and can sustain a peak throughput of up to 30 Gbit/s. The framework also supports a runtime for interacting with the Go programming language, allowing user-space processes to subscribe channels to the framework for interacting with processes executing on the NFP. The viability of utilising a heterogeneous CPU-NFP system for use in the domain of general and network computing was explored by introducing a set of problems or applications spanning general computing, and network processing. These were implemented on the heterogeneous architecture and benchmarked against equivalent CPU-only and CPU/GPU solutions. The results recorded were used to form an opinion on the viability of using an NFP for general processing. It is the author’s opinion that, beyond very specific use cases, it appears that the NFP-400 is not currently a viable solution as a coprocessor in the field of general computing. This does not mean that the proposed framework or the concept of a heterogeneous CPU-NFP system should be discarded as such a system does have acceptable use in the fields of network and stream processing. Additionally, when comparing the recorded limitations to those seen during the early stages of general purpose GPU development, it is clear that general processing on the NFP is currently in a similar state

    Novel Architectures for Offloading and Accelerating Computations in Artificial Intelligence and Big Data

    Get PDF
    Due to the end of Moore's Law and Dennard Scaling, performance gains in general-purpose architectures have significantly slowed in recent years. While raising the number of cores has been a viable approach for further performance increases, Amdahl's Law and its implications on parallelization also limit further performance gains. Consequently, research has shifted towards different approaches, including domain-specific custom architectures tailored to specific workloads. This has led to a new golden age for computer architecture, as noted in the Turing Award Lecture by Hennessy and Patterson, which has spawned several new architectures and architectural advances specifically targeted at highly current workloads, including Machine Learning. This thesis introduces a hierarchy of architectural improvements ranging from minor incremental changes, such as High-Bandwidth Memory, to more complex architectural extensions that offload workloads from the general-purpose CPU towards more specialized accelerators. Finally, we introduce novel architectural paradigms, namely Near-Data or In-Network Processing, as the most complex architectural improvements. This cumulative dissertation then investigates several architectural improvements to accelerate Sum-Product Networks, a novel Machine Learning approach from the class of Probabilistic Graphical Models. Furthermore, we use these improvements as case studies to discuss the impact of novel architectures, showing that minor and major architectural changes can significantly increase performance in Machine Learning applications. In addition, this thesis presents recent works on Near-Data Processing, which introduces Smart Storage Devices as a novel architectural paradigm that is especially interesting in the context of Big Data. We discuss how Near-Data Processing can be applied to improve performance in different database settings by offloading database operations to smart storage devices. Offloading data-reductive operations, such as selections, reduces the amount of data transferred, thus improving performance and alleviating bandwidth-related bottlenecks. Using Near-Data Processing as a use-case, we also discuss how Machine Learning approaches, like Sum-Product Networks, can improve novel architectures. Specifically, we introduce an approach for offloading Cardinality Estimation using Sum-Product Networks that could enable more intelligent decision-making in smart storage devices. Overall, we show that Machine Learning can benefit from developing novel architectures while also showing that Machine Learning can be applied to improve the applications of novel architectures

    Development of a parallel trigger framework for rare decay searches

    Get PDF
    L'esperimento NA62 rappresenta il programma attuale di test del Modello Standard mediante lo studio del mesone K al CERN di Ginevra e offre un approccio complementare rispetto agli esperimenti alla frontiera delle alte energie al Large Hadron Collider. L'obiettivo dell'esperimento NA62 è misurare il rapporto di decadimento per il processo K + → π+ ν ν con una precisione del ∼10%. Essendo il valore previsto dal Modello Standard determinato con elevata precisione, la misura di questa quantità risulta essere un ottimo modo per investigare l'esistenza di nuova Fisica. In modo complementare a questo programma principale, la semplicità dei decadimenti del mesone K + (pochi canali di decadimento e bassa molteplicità nello stato finale), offre la possibilità di raggiungere ottime sensibilità nelle ricerche di decadimenti che violano la conservazione del sapore leptonico. Le caratteristiche sperimentali di decadimenti come K + → π- μ+μ+ sono molto chiare e permettono una efficace reiezione del fondo. Tuttavia, per misurare eventi di questo tipo è necessario produrre un numero considerevole di decadimenti del mesone K + . La banda in scrittura su disco rigido o nastro magnetico disponibile attualmente non consente la memorizzazione di tutti gli eventi prodotti e risulta necessaria una selezione a più stadi degli eventi potenzialmente interessanti (trigger). In NA62, una prima selezione viene effettuata in tempo reale (tempi di risposta inferiori ad 1ms) dal cosiddetto trigger di livello 0, basato su logica programmabile (FPGA), che non permette la stessa flessibilità dei processori utilizzati per i calcolatori programmabili utilizzando software. Le prestazioni delle architetture parallele come le CPU multi-core e le GPU (Graphics Processing Unit) presenti sulle schede grafiche dei calcolatori, sono promettenti per un eventuale utilizzo di queste piattaforme per il riconoscimento di patterns più elaborati come ad esempio la ricostruzione di circonferenze, dovute a luce Čerenkov, all'interno del rivelatore RICH di NA62. Nella prima parte della mia tesi ho effettuato uno studio di fattibilità sulla possibilità di utilizzare le GPU in un contesto di alta banda di eventi e bassa latenza, quale quello del trigger in tempo reale. A NA62 questo studio ha richiesto lo sviluppo di algoritmi paralleli diversi e sempre più complessi, per determinare le prestazioni e trovare i possibili colli di bottiglia di un sistema di questo tipo. Descrivo poi lo sviluppo di un framework software ad alte prestazioni, che utilizza tecniche di programmazione multithreaded e drivers di rete veloci per il trasporto delle primitive di trigger dall'elettronica di front end alla memoria della GPU per l'elaborazione e la selezione degli eventi. Infine, è descritto l'utilizzo del sistema sviluppato per la selezione di decadimenti K + → π- μ+μ+ tramite l'impiego di un algoritmo per il riconoscimento di più anelli nel rivelatore RICH. Al fine di determinare l'efficienza di selezione del decadimento ho studiato l'efficienza di reiezione del fondo e l'accettanza per gli eventi di segnale al variare di alcuni parametri di selezione, determinando i vantaggi di questo approccio innovativo

    Study and Benchmarking of modern computing architectures

    Get PDF
    Modern computing architectures change rapidly and exhibit high levels of complexity and heterogeneity. For example, the Barcelona Supercomputing Center (BSC) will have one general purpose cluster and three smaller clusters with Emerging Technologies like Many Core architecture from Intel, Power from IBM and ARMv8 from Fujitsu. These are technologies currently being developed to accelerate the arrival of the new generation of pre-exascale supercomputers. We have analysed these architectures in order to gain knowledge and experience to get the best performance possible from them running HPC applications that exploit the strengths of each architecture. Additionally, we have run a benchmark suite on each available computer architecture with the purpose of characterize them. In order to complement the benchmark results, we have decided to analyse the execution of each benchmark with a performance analysis tool. There already exist several performance analysis tools which cover different performance areas like hardware events counting, performance simulation, traces, etc. However, these tools are dependent of being ported to modern computing architectures and they normally have different usages for different architectures. We propose a solution based on the perf_event interface, which is included on the Linux kernel, that allows us to analyse the benchmark execution on the different modern computing architectures and it allows users easily to get performance information of their applications. The information that we provide with our solution are a sampling trace of hardware events, a profiling report and monitoring information of CPU and memory. Our solution enhances the performance results of other tools with extra features that help a better understanding of the bottlenecks of applications and/or architectures, and so, benchmarking. In addition, our proposal can be used in any system with Linux

    Harnessing low-level tuning in modern architectures for high-performance network monitoring in physical and virtual platforms

    Full text link
    Tesis doctoral inédita leída en la Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Fecha de lectura: 02-07-201

    Proceedings of the First International Workshop on HyperTransport Research and Applications (WHTRA2009)(revised 08/2009)

    Get PDF
    Proceedings of the First International Workshop on HyperTransport Research and Applications (WHTRA2009) which was held Feb. 12th 2009 in Mannheim, Germany. The 1st International Workshop for Research on HyperTransport is an international high quality forum for scientists, researches and developers working in the area of HyperTransport. This includes not only developments and research in HyperTransport itself, but also work which is based on or enabled by HyperTransport. HyperTransport (HT) is an interconnection technology which is typically used as system interconnect in modern computer systems, connecting the CPUs among each other and with the I/O bridges. Primarily designed as interconnect between high performance CPUs it provides an extremely low latency, high bandwidth and excellent scalability. The definition of the HTX connector allows the use of HT even for add-in cards. In opposition to other peripheral interconnect technologies like PCI-Express no protocol conversion or intermediate bridging is necessary. HT is a direct connection between device and CPU with minimal latency. Another advantage is the possibility of cache coherent devices. Because of these properties HT is of high interest for high performance I/O like networking and storage, but also for co-processing and acceleration based on ASIC or FPGA technologies. In particular acceleration sees a resurgence of interest today. One reason is the possibility to reduce power consumption by the use of accelerators. In the area of parallel computing the low latency communication allows for fine grain communication schemes and is perfectly suited for scalable systems. Summing up, HT technology offers key advantages and great performance to any research aspect related to or based on interconnects. For more information please consult the workshop website (http://whtra.uni-hd.de)

    Proceedings of the First International Workshop on HyperTransport Research and Applications (WHTRA2009)

    Get PDF
    Proceedings of the First International Workshop on HyperTransport Research and Applications (WHTRA2009) which was held Feb. 12th 2009 in Mannheim, Germany. The 1st International Workshop for Research on HyperTransport is an international high quality forum for scientists, researches and developers working in the area of HyperTransport. This includes not only developments and research in HyperTransport itself, but also work which is based on or enabled by HyperTransport. HyperTransport (HT) is an interconnection technology which is typically used as system interconnect in modern computer systems, connecting the CPUs among each other and with the I/O bridges. Primarily designed as interconnect between high performance CPUs it provides an extremely low latency, high bandwidth and excellent scalability. The definition of the HTX connector allows the use of HT even for add-in cards. In opposition to other peripheral interconnect technologies like PCI-Express no protocol conversion or intermediate bridging is necessary. HT is a direct connection between device and CPU with minimal latency. Another advantage is the possibility of cache coherent devices. Because of these properties HT is of high interest for high performance I/O like networking and storage, but also for co-processing and acceleration based on ASIC or FPGA technologies. In particular acceleration sees a resurgence of interest today. One reason is the possibility to reduce power consumption by the use of accelerators. In the area of parallel computing the low latency communication allows for fine grain communication schemes and is perfectly suited for scalable systems. Summing up, HT technology offers key advantages and great performance to any research aspect related to or based on interconnects. For more information please consult the workshop website (http://whtra.uni-hd.de)

    Artificial Intelligence Technology

    Get PDF
    This open access book aims to give our readers a basic outline of today’s research and technology developments on artificial intelligence (AI), help them to have a general understanding of this trend, and familiarize them with the current research hotspots, as well as part of the fundamental and common theories and methodologies that are widely accepted in AI research and application. This book is written in comprehensible and plain language, featuring clearly explained theories and concepts and extensive analysis and examples. Some of the traditional findings are skipped in narration on the premise of a relatively comprehensive introduction to the evolution of artificial intelligence technology. The book provides a detailed elaboration of the basic concepts of AI, machine learning, as well as other relevant topics, including deep learning, deep learning framework, Huawei MindSpore AI development framework, Huawei Atlas computing platform, Huawei AI open platform for smart terminals, and Huawei CLOUD Enterprise Intelligence application platform. As the world’s leading provider of ICT (information and communication technology) infrastructure and smart terminals, Huawei’s products range from digital data communication, cyber security, wireless technology, data storage, cloud computing, and smart computing to artificial intelligence
    corecore