
    Performance comparison between the Click Modular Router and the NetFPGA

    Get PDF
    It is possible to forward minimum-sized packets at rates of hundreds of Mbps using commodity hardware and Linux. We chose the Click Modular Router platform for its flexibility and because, when used with its polling drivers, it claims performance equal to or higher than native forwarding. The NetFPGA, in turn, is an open networking platform accelerator that enables researchers and instructors to build working prototypes of high-speed, hardware-accelerated networking systems. The NetFPGA reference designs include an IPv4 router, an Ethernet switch, a four-port NIC, and SCONE (Software Component of NetFPGA), and researchers have used the platform to build advanced network flow processing systems. We followed RFC 1242 (Benchmarking Terminology for Network Interconnection Devices) and RFC 2544 (Benchmarking Methodology for Network Interconnection Devices) to define the set of tests used to characterize the performance of the two routers. We also compared the NetFPGA and the Click router on a file transfer using the FTP and HTTP protocols. Overall, the NetFPGA router outperforms the Click router.
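    For readers unfamiliar with the RFC 2544 methodology cited above, the sketch below shows the usual binary-search throughput test: the throughput is the highest offered rate at which no frames are lost. The offer_load hook, its parameters and the chosen resolution are hypothetical placeholders, not part of the original study.

```python
# Sketch of an RFC 2544-style throughput search (assumptions noted above).

def offer_load(rate_mbps, frame_size, duration_s):
    """Send traffic at rate_mbps for duration_s; return (sent, received).
    Hypothetical hook: connect it to whatever traffic generator is in use."""
    raise NotImplementedError

def rfc2544_throughput(frame_size, line_rate_mbps, duration_s=60,
                       resolution_mbps=1.0):
    lo, hi = 0.0, line_rate_mbps
    best = 0.0
    while hi - lo > resolution_mbps:
        rate = (lo + hi) / 2
        sent, received = offer_load(rate, frame_size, duration_s)
        if received == sent:          # zero loss: try a higher rate
            best, lo = rate, rate
        else:                         # loss observed: back off
            hi = rate
    return best
```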

    Application Specific Customization and Scalability of Soft Multiprocessors

    Full text link

    Heracles: Fully Synthesizable Parameterized MIPS-Based Multicore System

    Get PDF
    Heracles is an open-source, complete multicore system written in Verilog. It is fully parameterized and can be reconfigured and synthesized into different topologies and sizes. Each processing node has a 7-stage, fully bypassed pipelined microprocessor running the MIPS-III ISA, a 4-stage, input-buffered, virtual-channel router, and a local, variable-size shared memory. Our design is highly modular, with clear interfaces between the core, the memory hierarchy, and the on-chip network. In the baseline design, the microprocessor is attached to two caches, one instruction cache and one data cache, which are oblivious to the global memory organization. The memory system in Heracles can be configured as a single global shared memory (SM), as distributed shared memory (DSM), or as any combination thereof. Each core is connected to the rest of the network of processors by a parameterized, realistic, wormhole router. We show different topology configurations of the system and their synthesis results on the Xilinx Virtex-5 LX330T FPGA board. We also provide a small MIPS cross-compiler toolchain to assist in developing software for Heracles.
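    As a concrete illustration of the dimension-order routing commonly used with wormhole routers in 2D-mesh configurations like those Heracles can generate, the sketch below computes the output port for a flit. The coordinate scheme and port names are assumptions for illustration, not Heracles' actual Verilog interface.

```python
# Illustrative XY (dimension-order) routing on a 2D mesh; an assumed policy,
# not necessarily the one implemented by the Heracles router.

def xy_route(cur, dst):
    """Return the output port for a flit at node `cur` headed to `dst`.
    Nodes are (x, y) mesh coordinates; route X first, then Y."""
    (cx, cy), (dx, dy) = cur, dst
    if cx < dx:
        return "EAST"
    if cx > dx:
        return "WEST"
    if cy < dy:
        return "NORTH"
    if cy > dy:
        return "SOUTH"
    return "LOCAL"   # arrived: eject to the processing node

print(xy_route((0, 0), (2, 1)))   # -> "EAST"
```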

    Performance Analysis of a Reconfigurable Shared Memory Multiprocessor System for Embedded Applications

    Get PDF
    This paper presents a method to predict the performance of multiple processor cores in a reconfigurable system for embedded applications. A multiprocessor framework is developed, with reconfigurable processors in a shared-memory system optimized for stream-oriented data and signal processing applications. The framework features a discrete-time Markov-based stochastic tool, which is used to analyze memory contention in the shared-memory architecture and to predict the increase in execution speed as the number of processors is varied. Performance predictions for variations of other system parameters, such as different task allocations and the number of pipeline stages, are possible as well. The predictions of the tool were verified against experimental results of a green-screen application developed and run on a Xilinx Virtex-II Pro FPGA with MicroBlaze soft processors.
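    To make the contention effect concrete, the following toy discrete-time simulation (not the paper's analytic Markov tool) models processors that stall on a single-ported shared memory. The request probability p_req, the service discipline, and the cycle count are illustrative assumptions.

```python
# Toy discrete-time model of shared-memory contention (assumptions above).
import random

def simulate(n_proc, p_req=0.3, cycles=100_000, seed=0):
    rng = random.Random(seed)
    waiting = [False] * n_proc          # True while stalled on a memory request
    useful = 0
    for _ in range(cycles):
        # processors not currently waiting do useful work and may issue a request
        for i in range(n_proc):
            if not waiting[i]:
                useful += 1
                if rng.random() < p_req:
                    waiting[i] = True
        # the single-ported shared memory serves one pending request per cycle
        pending = [i for i in range(n_proc) if waiting[i]]
        if pending:
            waiting[rng.choice(pending)] = False
    return useful / cycles              # useful work completed per cycle

def speedup(n_proc, p_req=0.3):
    return simulate(n_proc, p_req) / simulate(1, p_req)

print(speedup(4))   # sub-linear due to memory contention
```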

    Classification and Architectural Features of Programmable Multiprocessor Systems-on-Chip

    Get PDF
    General information on embedded FPGA-based multiprocessor systems-on-chip (FPGA-MPSoC) is provided. A comprehensive analysis of their architectural features is carried out and a broad classification of FPGA-MPSoC is proposed. An overview of recent research on FPGA-MPSoC development is given, and a wide range of such systems is surveyed in order to identify architectural trends and the problems they solve.

    Design and Evaluation of High-Performance Software Packet Classification Systems

    Get PDF
    Packet classification consists of matching packet headers against a set of pre-defined rules and performing the action(s) associated with the matched rule(s). As a key technology in the data plane of network devices, packet classification is widely deployed in many network applications and services, such as firewalling, load balancing, and VPNs, and it has been studied extensively over the past two decades. Traditional packet classification methods are usually based on dedicated hardware; with the development of data-center networking, software-defined networking, and application-aware networking, packet classification on multi-/many-core processor platforms has become a new research interest. This dissertation studies packet classification along three main axes: the algorithm design framework, rule-set feature analysis, and algorithm implementation and optimization. We review multiple proposed algorithms and present a decision-tree-based algorithm design framework. The framework decomposes existing packet classification algorithms into combinations of different types of "meta-methods", revealing the connections between the algorithms. Based on this framework, we combine meta-methods from different algorithms and propose two new algorithms, HyperSplit-op and HiCuts-op. Experimental results show that HiCuts-op requires 2~20x less memory and 10% fewer memory accesses than HiCuts, while HyperSplit-op requires 2~200x less memory and 10%~30% fewer memory accesses than HyperSplit. We also explore the connection between rule-set features and algorithm performance. We find that the "coverage uniformity" of the rule-set has a significant impact on classification speed, and that the size of the "orthogonal structure" in the rule-set usually determines an algorithm's memory footprint. Based on these two observations, we propose a memory consumption model and a quantitative measure of coverage uniformity. Using these two tools, we propose a new multi-decision-tree algorithm, SmartSplit, and an algorithm selection framework, AutoPC. Compared to the EffiCuts algorithm, SmartSplit achieves around a 2.9x speedup and up to a 10x reduction in memory size. For a given rule-set, AutoPC can automatically recommend a suitable algorithm; compared to using a single algorithm on all rule-sets, AutoPC is on average 3.8 times faster. We also analyze the connection between prefix length and update overhead for IP lookup algorithms. We observe that long prefixes always result in more memory accesses with the Tree Bitmap algorithm, while short prefixes always result in large update overhead with DIR-24-8. By combining the two algorithms, we propose a hybrid algorithm, SplitLookup, that reduces the update overhead. Experimental results show that the hybrid algorithm needs two orders of magnitude fewer memory accesses when updating short prefixes, while its lookup speed remains close to that of DIR-24-8. Finally, we implement and optimize multiple algorithms on multi-/many-core platforms. For IP lookup, we implement two typical algorithms, DIR-24-8 and Tree Bitmap, and present several optimizations for both. For multi-dimensional packet classification, we implement HyperCuts/HiCuts and their variants, such as Adaptive Binary Cuttings, EffiCuts, HiCuts-op, and HyperSplit-op. The SplitLookup algorithm achieves up to 40 Gbps throughput on the TILEPro64 many-core processor, and HiCuts-op and HyperSplit-op achieve 10 to 20 Gbps throughput on a single core of an Intel processor. Overall, this study reveals the connections between algorithmic techniques and rule-set features; the results provide insight for new algorithm design and guidelines for efficient implementation.
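    As a minimal illustration of the decision-tree "meta-method" family that HiCuts, HyperSplit and their -op variants build on, the sketch below splits one field's range at a time and searches a short rule list at each leaf. The field-selection and split-point heuristics are deliberately simplistic placeholders, not the optimized heuristics of the dissertation.

```python
# A toy decision-tree packet classifier in the spirit of HyperSplit.
from dataclasses import dataclass

@dataclass
class Rule:
    ranges: tuple      # one (lo, hi) inclusive range per field
    priority: int
    action: str = "permit"

def overlaps(rule, space):
    return all(r_lo <= s_hi and r_hi >= s_lo
               for (r_lo, r_hi), (s_lo, s_hi) in zip(rule.ranges, space))

def build(rules, space, leaf_size=4):
    if len(rules) <= leaf_size:
        return ("leaf", rules)
    # toy heuristic: split the field whose rule ranges are most diverse
    dim = max(range(len(space)), key=lambda d: len({r.ranges[d] for r in rules}))
    lo, hi = space[dim]
    if lo == hi:
        return ("leaf", rules)
    mid = (lo + hi) // 2
    left_space, right_space = list(space), list(space)
    left_space[dim], right_space[dim] = (lo, mid), (mid + 1, hi)
    left = [r for r in rules if overlaps(r, left_space)]
    right = [r for r in rules if overlaps(r, right_space)]
    if len(left) == len(rules) and len(right) == len(rules):
        return ("leaf", rules)          # split does not separate any rules
    return ("node", dim, mid,
            build(left, left_space, leaf_size),
            build(right, right_space, leaf_size))

def classify(node, packet):
    while node[0] == "node":
        _, dim, mid, left, right = node
        node = left if packet[dim] <= mid else right
    matches = [r for r in node[1]
               if all(lo <= v <= hi for v, (lo, hi) in zip(packet, r.ranges))]
    return min(matches, key=lambda r: r.priority) if matches else None

rules = [Rule(((0, 1023), (80, 80)), priority=1),
         Rule(((0, 65535), (0, 65535)), priority=9, action="deny")]
tree = build(rules, space=[(0, 65535), (0, 65535)])
print(classify(tree, (53, 80)))        # highest-priority matching rule
```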

    Branch Prediction For Network Processors

    Get PDF
    Originally designed to favour flexibility over packet processing performance, the programmable network processor now faces the challenge of meeting increasing line rates while providing additional processing capabilities. To meet these requirements, trends within networking research have tended to focus on techniques such as offloading computation-intensive tasks to dedicated hardware logic or increasing parallelism. While parallelism retains flexibility, challenges such as load balancing limit its scope; hardware offloading, on the other hand, allows complex algorithms to be implemented at high speed but sacrifices flexibility. To this end, the work in this thesis focuses on a more fundamental aspect of a network processor: the data-plane processing engine. Performing both system modelling and analysis of packet processing functions, the goal of this thesis is to identify and extract salient information regarding the performance of multi-processor workloads. Following a traditional software-based analysis of programme workloads, we develop a method of modelling and analysing hardware accelerators applied to network processors. Using this quantitative information, this thesis proposes an architecture which allows deeply pipelined micro-architectures to be implemented on the data-plane while reducing the branch penalty associated with these architectures.
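    For context on the mechanism the thesis targets, the sketch below shows a generic bimodal (2-bit saturating counter) branch predictor. It is a textbook illustration, not the data-plane architecture proposed in the work, and the table size and trace are illustrative assumptions.

```python
# Generic bimodal branch predictor: one 2-bit saturating counter per table entry.
class BimodalPredictor:
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)     # start weakly not-taken

    def predict(self, pc):
        return self.counters[pc & self.mask] >= 2   # True = predict taken

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bp = BimodalPredictor()
mispredictions = 0
for pc, taken in [(0x400, True), (0x400, True), (0x408, False), (0x400, True)]:
    mispredictions += bp.predict(pc) != taken
    bp.update(pc, taken)
print(mispredictions)
```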

    Parallel and Distributed Computing

    Get PDF
    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. In particular, the topics addressed include programmable and reconfigurable devices and systems, dependability of GPUs (Graphics Processing Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peer-to-peer networks, large-scale network simulation, and parallel routines and algorithms. The articles included in this book therefore constitute an excellent reference for engineers and researchers with a particular interest in any of these topics in parallel and distributed computing.

    DPI over commodity hardware: implementation of a scalable framework using FastFlow

    Get PDF
    In recent years, the number of applications running on top of IP networks has increased substantially. Consequently, there is a growing need for efficient monitoring solutions that can handle these high data rates and classify the type of traffic travelling over the network. For example, as far as network security is concerned, recent years have seen a shift from so-called "network-level" attacks, which target the network they are transported on (e.g. denial of service), to content-based threats which exploit application vulnerabilities and require sophisticated levels of intelligence to be detected. For some of these threats it is no longer sufficient to have a software solution only on the client side; some controls must also run on the network itself. To handle such scenarios, payload inspection is often required in order to correctly identify the application protocol and to process the data carried over it. This is why, in recent years, Deep Packet Inspection (DPI) technology has emerged. This kind of processing is in many cases implemented, at least in part, in dedicated hardware. However, pure software solutions are often more appealing because they are typically more economical and can, in general, react faster to protocol evolution and changes. Moreover, software solutions running on general-purpose hardware often do not exploit the underlying multiprocessor architecture and only process incoming packets sequentially. Furthermore, many DPI research works in the literature that do exploit multicore architectures suffer from poor scalability, due to synchronization overhead and load imbalance among the cores used. In this thesis, we describe the design and implementation of a DPI framework capable of sustaining current network rates on commodity multicore hardware. Our framework makes it possible to identify the protocol, to specify which data to extract once the protocol has been identified, and to define how these data are to be processed. Differently from existing works, the framework has been designed according to structured parallel programming theory, completely hiding from the user the complexity of efficiently exploiting the underlying architecture. These concepts have then been applied using FastFlow, a library for structured parallel programming targeting both shared-memory and distributed-memory architectures.
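    As a conceptual illustration of the structured-parallel (farm) pattern the framework relies on, the sketch below dispatches packets to workers by flow hash so that each flow is inspected in order by a single worker. It uses plain Python multiprocessing rather than the FastFlow C++ API used in the thesis, and the packet fields and the trivial protocol check are illustrative assumptions.

```python
# Conceptual farm pattern with flow-consistent dispatch (not FastFlow).
import multiprocessing as mp

def worker(inbox):
    # each worker independently inspects the packets routed to it
    for pkt in iter(inbox.get, None):          # None marks end of stream
        proto = "HTTP" if pkt["payload"].startswith(b"GET ") else "unknown"
        print(pkt["sport"], "->", pkt["dport"], proto)

def run_farm(packets, n_workers=4):
    queues = [mp.Queue() for _ in range(n_workers)]
    procs = [mp.Process(target=worker, args=(q,)) for q in queues]
    for p in procs:
        p.start()
    for pkt in packets:                        # emitter: hash the flow key
        flow = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"])
        queues[hash(flow) % n_workers].put(pkt)
    for q in queues:
        q.put(None)                            # signal end of stream
    for p in procs:
        p.join()

if __name__ == "__main__":
    run_farm([{"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 1234,
               "dport": 80, "payload": b"GET / HTTP/1.1\r\n"}])
```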

    Run-time management for future MPSoC platforms

    Get PDF
    In recent years, we are witnessing the dawning of the Multi-Processor System-on-Chip (MPSoC) era. In essence, this era is triggered by the need to handle more complex applications while reducing the overall cost of embedded (handheld) devices. This cost is mainly determined by the cost of the hardware platform and the cost of designing applications for that platform. The cost of a hardware platform partly depends on its production volume; in turn, this means that flexible, (easily) programmable multi-purpose platforms will exhibit a lower cost. A multi-purpose platform not only requires flexibility, but should also combine high performance with low power consumption. To this end, MPSoC devices integrate computer architectural properties from various computing domains. Just like large-scale parallel and distributed systems, they contain multiple heterogeneous processing elements interconnected by a scalable, network-like structure, which helps in achieving scalable high performance. As in most mobile or portable embedded systems, there is a need for low-power operation and real-time behavior. The cost of designing applications is equally important. Indeed, the actual value of future MPSoC devices is not contained within the embedded multiprocessor IC, but in their capability to provide the user of the device with services or experiences. So from an application viewpoint, MPSoCs are designed to efficiently process multimedia content in applications like video players, video conferencing, 3D gaming, augmented reality, etc. Such applications typically require a lot of processing power and a significant amount of memory. To keep up with ever-evolving user needs and with new application standards appearing at a fast pace, MPSoC platforms need to be easily programmable. Application scalability, i.e. the ability to use just enough platform resources according to the user requirements and the device capabilities, is also an important factor. Hence scalability, flexibility, real-time behavior, high performance, low power consumption and, finally, programmability are key components in realizing the success of MPSoC platforms. The run-time manager is logically located between the application layer and the platform layer. It has a crucial role in realizing these MPSoC requirements. As it abstracts the platform hardware, it improves platform programmability. By deciding on resource assignment at run-time, based on the performance requirements of the user, the needs of the application and the capabilities of the platform, it contributes to flexibility, scalability and low-power operation. As it arbitrates between different applications, it enables real-time behavior. This thesis details the key components of such an MPSoC run-time manager and provides a proof-of-concept implementation. These key components include application quality management algorithms linked to MPSoC resource management mechanisms and policies, adapted to the provided MPSoC platform services. First, we describe the role, the responsibilities and the boundary conditions of an MPSoC run-time manager in a generic way. This includes a definition of the multiprocessor run-time management design space, a description of the run-time manager design trade-offs and a brief discussion of how these trade-offs affect the key MPSoC requirements.
    This design space definition and the trade-offs are illustrated based on ongoing research and on existing commercial and academic multiprocessor run-time management solutions. Subsequently, we introduce a fast and efficient resource allocation heuristic that considers FPGA fabric properties such as fragmentation. In addition, this thesis introduces a novel task assignment algorithm for handling soft IP cores, denoted hierarchical configuration. Hierarchical configuration managed by the run-time manager enables easier application design and increases the run-time spatial mapping freedom, which in turn improves the performance of the resource assignment algorithm. Furthermore, we introduce run-time task migration components. We detail a new run-time task migration policy closely coupled to the run-time resource assignment algorithm. In addition to detailing a design-environment-supported mechanism that enables moving tasks between an ISP and fine-grained reconfigurable hardware, we also propose two novel task migration mechanisms tailored to the Network-on-Chip environment. Finally, we propose a novel mechanism for task migration initiation, based on reusing debug registers in modern embedded microprocessors. We also propose a reactive on-chip communication management mechanism and show that, by exploiting an injection-rate control mechanism, it is possible to provide a communication management system capable of providing soft (reactive) QoS in a NoC. We introduce a novel, platform-independent run-time algorithm to perform quality management, i.e. to select an application quality operating point at run-time based on the user requirements and the available platform resources, as reported by the resource manager. This contribution also proposes a novel way to manage the interaction between the quality manager and the resource manager. In order to have a realistic, reproducible and flexible run-time manager testbench with respect to applications with multiple quality levels and implementation trade-offs, we have created an input data generation tool denoted Pareto Surfaces For Free (PSFF). The PSFF tool is, to the best of our knowledge, the first tool that generates multiple realistic application operating points, either based on profiling information of a real-life application or based on a designer-controlled random generator. Finally, we provide a proof-of-concept demonstrator that combines these concepts and shows how these mechanisms and policies can operate in real-life situations. In addition, we show that the proposed solutions can be integrated into existing platform operating systems.
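    As a minimal illustration of the quality-management step described above, the sketch below picks one (quality, cost) operating point per application so that the total cost fits a resource budget while total quality is maximized. The brute-force search and the example numbers are illustrative assumptions, not the thesis's run-time algorithm.

```python
# Toy quality manager: exhaustive selection of per-application operating points.
from itertools import product

def select_operating_points(apps, budget):
    """apps: {name: [(quality, cost), ...]}; returns the best {name: point}."""
    names = list(apps)
    best, best_quality = None, -1.0
    for combo in product(*(apps[n] for n in names)):
        cost = sum(c for _, c in combo)
        quality = sum(q for q, _ in combo)
        if cost <= budget and quality > best_quality:
            best, best_quality = dict(zip(names, combo)), quality
    return best

# Example: two applications sharing a budget of 10 resource "units".
points = {"video": [(1.0, 8), (0.7, 5), (0.4, 3)],
          "game":  [(1.0, 7), (0.6, 4)]}
print(select_operating_points(points, budget=10))   # picks a feasible mix
```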