
    Cost effective RISC core supporting the large sending offload

    Ethernet speeds for sending and receiving frames have increased to 40 and 100 Gbps since IEEE P802.3ba was released, and industry and academia have focused on scaling up TCP/IP protocol processing to 40-100 Gbps. Large Send Offload (LSO) is a de facto standard function that is offloaded to the network interface for sending packets at rates of up to 10 Gbps; it is not clear whether a network interface can support this function at the new 40-100 Gbps rates. The wide use of hardware-based NICs, such as network interfaces built from fully customized logic, can be attributed to two concerns: it is still not clear whether a General Purpose Processor (GPP) can provide the processing required for line rates beyond 10 Gbps, and the GPP's clock rate limits its ability to support network interface processing. However, using a RISC core engine to offload the LSO function can bring important qualities to network interface design, such as simplicity, scalability, and a shorter development cycle. In this paper, we investigate the use of a specialized RISC core to process the LSO functions for TCP/IP and UDP/IP at communication rates of up to 100 Gbps. To achieve this, we enhance the LSO algorithm to scale to 100 Gbps, and a fast DMA is used to transfer data within the network interface. The LSO processing methodology on the network interface is presented. In addition, the RISC core's performance and the data movements at communication rates of up to 100 Gbps have been measured. A 148 MHz RISC core can support the sending-side processing for transmission speeds of up to 100 Gbps for the TCP/IP and UDP/IP protocols when the standard MTU of 1500 bytes is used. A DMA clocked at 3759 MHz is required to eliminate idle cycles while transferring data over the 64-bit local bus.
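    To make the segmentation step concrete, here is a minimal sketch in C of what LSO does conceptually: an oversized send buffer is cut into MSS-sized segments, and a fresh header with an advanced sequence number is generated for each one. The structure fields and constants are simplified assumptions for illustration; this is not the paper's RISC implementation.

```c
#include <stdint.h>
#include <stdio.h>

#define MTU     1500                       /* standard Ethernet MTU (bytes)   */
#define IP_HDR  20                         /* IPv4 header without options     */
#define TCP_HDR 20                         /* TCP header without options      */
#define MSS     (MTU - IP_HDR - TCP_HDR)   /* 1460 payload bytes per segment  */

/* Simplified per-segment header state; a real LSO engine would also
 * recompute checksums and copy the remaining TCP/IP header fields.     */
struct seg_hdr {
    uint32_t seq;                          /* sequence number of first byte   */
    uint16_t payload_len;                  /* bytes carried by this segment   */
    uint8_t  more_segments;                /* 1 while further segments follow */
};

/* Split one oversized send buffer into MSS-sized segments, generating a
 * fresh header per segment, which is the core task that LSO moves onto
 * the network interface.                                                */
static int lso_segment(uint32_t start_seq, size_t len,
                       struct seg_hdr *out, size_t max_segs)
{
    size_t n = 0, off = 0;
    while (off < len && n < max_segs) {
        size_t chunk = (len - off > MSS) ? MSS : len - off;
        out[n].seq           = start_seq + (uint32_t)off;
        out[n].payload_len   = (uint16_t)chunk;
        out[n].more_segments = (off + chunk < len);
        off += chunk;
        n++;
    }
    return (int)n;
}

int main(void)
{
    struct seg_hdr segs[64];
    int n = lso_segment(1000, 64 * 1024, segs, 64);    /* a 64 KiB send */
    printf("%d segments, last carries %u bytes\n", n, segs[n - 1].payload_len);
    return 0;
}
```

    For scale, moving 100 Gbps across a 64-bit local bus needs at least 100e9 / 64, roughly 1.56 billion bus transfers per second; the 3759 MHz DMA clock quoted in the abstract sits well above that floor, consistent with the claim that this clock is needed to eliminate idle cycles.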

    Design of a scalable network interface to support enhanced TCP and UDP processing for high speed networks

    Communication networks have advanced rapidly in providing additional services, with improvements made to their bandwidth and the integration of advanced technology. As the speed of networks exceeds 10 Gbps, the time frame for completing the processing of TCP and UDP packets has become extremely short. The design and implementation of high-performance Network Interfaces (NIs) that can support offloaded protocol functions for current and next-generation networks is challenging. In this thesis, two software approaches are presented to enhance protocol processing of TCP and UDP in the network interface. A novel software Large Receive Offload (LRO) approach for enhancing the receiving side is proposed; the LRO works by aggregating incoming TCP and UDP packets into larger packets inside the NI's buffer, and the receiving-side software has been improved to support out-of-order packets. The second proposed software solution applies to the Large Send Offload (LSO). The proposed LSO processing is implemented by segmenting TCP and UDP messages that are larger than the Maximum Transmission Unit into Maximum Segment Size packets, with new packet headers generated for each outgoing packet. A scalable programmable NI based on a 32-bit RISC core is presented that can support 100 Gbps network speeds. The processing at the NI has been accelerated to prevent hazards (such as data hazards and control hazards) during the execution of the LRO and LSO functions. An R2000/3000 RISC was used to test the LRO and LSO functions and to identify the most suitable instruction set. Following this, the NI was implemented in VHDL with three pipelined RISC cores, a simple DMA controller and a Content Addressable Memory. An evaluation of the RISC clock rate required to process TCP and UDP streams at 100 Gbps was conducted. It was determined that a RISC core running at 752 MHz with a DMA clock of 3753 MHz was able to process packets of 512 bytes or larger fast enough to support 100 Gbps network speeds.
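    As a rough illustration of the receive-side aggregation the thesis describes (not its actual NI firmware), the C sketch below coalesces in-order TCP payloads into a larger packet held in the NI buffer and rejects segments that arrive out of order or would overflow the aggregate; all names and limits are assumptions.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LRO_MAX 65535                    /* cap on an aggregated super-packet */

struct lro_flow {
    uint32_t next_seq;                   /* sequence number we expect next    */
    uint16_t agg_len;                    /* bytes aggregated so far           */
    uint8_t  buf[LRO_MAX];               /* NI-side aggregation buffer        */
};

/* Try to merge an incoming segment into the flow's aggregate.
 * Returns 1 if merged, 0 if the segment must be handled separately
 * (out of order, or the aggregate would overflow).                   */
static int lro_merge(struct lro_flow *f, uint32_t seq,
                     const uint8_t *payload, uint16_t len)
{
    if (seq != f->next_seq || f->agg_len + len > LRO_MAX)
        return 0;
    memcpy(f->buf + f->agg_len, payload, len);
    f->agg_len  += len;
    f->next_seq += len;
    return 1;
}

int main(void)
{
    static struct lro_flow f = { .next_seq = 5000 };
    uint8_t seg[1460] = {0};
    lro_merge(&f, 5000, seg, sizeof seg);              /* in order: merged        */
    int ok = lro_merge(&f, 9000, seg, sizeof seg);     /* gap: rejected (ok == 0) */
    printf("aggregated %u bytes, gap merged: %d\n", f.agg_len, ok);
    return 0;
}
```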

    IXIAM: ISA EXtension for Integrated Accelerator Management

    During the last few years, hardware accelerators have been gaining popularity thanks to their ability to achieve higher performance and efficiency than classic general-purpose solutions. They are fundamentally shaping the current generations of Systems-on-Chip (SoCs), which are becoming increasingly heterogeneous. However, despite their widespread use, a standard, general solution to manage them while providing speed and consistency has not yet been found. Common methodologies rely on OS mediation and a mix of user-space and kernel-space drivers, which can be inefficient, especially for fine-grained tasks. This paper addresses these sources of inefficiency by proposing an ISA eXtension for Integrated Accelerator Management (IXIAM), a cost-effective HW-SW framework to control a wide variety of accelerators in a standard way, directly from the cores. The proposed instructions cover reservation, work offloading, data transfer, and synchronization, and can be wrapped in a high-level software API or even integrated into a compiler. IXIAM also features a user-space interrupt mechanism to signal events directly to the user process. We implement it as a RISC-V extension in the gem5 simulator and demonstrate detailed support for complex accelerators, as well as the ability to specify sequences of memory transfers and computations directly from the ISA with significantly lower overhead than driver-based schemes. IXIAM provides a performance advantage that is most evident for small and medium workloads, reaching around 90x in the best case. This way, we enlarge the set of workloads that would benefit from hardware acceleration.
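    The abstract names four operation classes (reservation, work offloading, data transfer, synchronization) that can be wrapped in a high-level API. The sketch below is a hypothetical C wrapper showing how such a sequence might look from user code; the function names and handle type are invented for illustration, and the stub bodies only simulate what the real IXIAM instructions would do directly on the core without a kernel driver round trip.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef int32_t acc_handle_t;               /* hypothetical accelerator handle */
static uint8_t scratch[256];                /* stand-in for accelerator memory */

/* Each wrapper would expand to one (or a few) IXIAM instructions. */
static acc_handle_t acc_reserve(uint32_t accel_id) { return (int32_t)accel_id; }

static void acc_copy_in(acc_handle_t h, const void *src, size_t n)
{
    (void)h; memcpy(scratch, src, n);        /* data transfer, host to accel  */
}

static void acc_offload(acc_handle_t h, uint32_t kernel_id)
{
    (void)h; (void)kernel_id;
    for (size_t i = 0; i < sizeof scratch; i++)
        scratch[i] ^= 0xFF;                  /* pretend the accelerator ran   */
}

static void acc_wait(acc_handle_t h)
{
    (void)h;                                 /* sync: poll or user-space IRQ  */
}

static void acc_copy_out(acc_handle_t h, void *dst, size_t n)
{
    (void)h; memcpy(dst, scratch, n);        /* data transfer, accel to host  */
}

int main(void)
{
    uint8_t in[16] = {0}, out[16];
    acc_handle_t h = acc_reserve(0);         /* reservation      */
    acc_copy_in(h, in, sizeof in);           /* data transfer    */
    acc_offload(h, 42);                      /* work offloading  */
    acc_wait(h);                             /* synchronization  */
    acc_copy_out(h, out, sizeof out);
    printf("first result byte: 0x%02x\n", out[0]);
    return 0;
}
```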

    Rapid SoC Design: On Architectures, Methodologies and Frameworks

    Modern applications like machine learning, autonomous vehicles, and 5G networking require an order-of-magnitude boost in processing capability. For several decades, chip designers have relied on Moore's Law, the doubling of transistor count every two years, to deliver improved performance, higher energy efficiency, and an increase in transistor density. With the end of Dennard scaling and a slowdown in Moore's Law, system architects have developed several techniques to deliver the traditional performance and power improvements we have come to expect. More recently, chip designers have turned towards heterogeneous systems comprised of more specialized processing units to buttress the traditional processing units. These specialized units improve the overall performance, power, and area (PPA) metrics across a wide variety of workloads and applications. While the GPU serves as the classical example, accelerators for machine learning, approximate computing, graph processing, and database applications have become commonplace. This has led to an exponential growth in the variety (and count) of these compute units found in modern embedded and high-performance computing platforms. The various techniques adopted to combat the slowing of Moore's Law translate directly into an increase in complexity for modern systems-on-chip (SoCs). This increase in complexity in turn leads to an increase in design effort and validation time for the hardware and the accompanying software stacks, and is further aggravated by fabrication challenges (photolithography, tooling, and yield) faced at advanced technology nodes (below 28 nm). The inherent complexity in modern SoCs translates into increased costs and time-to-market delays. This holds true across the spectrum, from mobile and handheld processors to high-performance data-center appliances. This dissertation presents several techniques to address the challenges of rapidly bringing up complex SoCs. The first part focuses on foundations and architectures that aid rapid SoC design, presenting a variety of architectural techniques that were developed and leveraged to rapidly construct complex SoCs at advanced process nodes. The next part focuses on the gap between a completed design model (in RTL form) and its physical manifestation (a GDS file that will be sent to the foundry for fabrication); it presents methodologies and a workflow for rapidly walking a design through to completion at arbitrary technology nodes, along with progress on a tool flow built entirely on open-source tools. The last part presents a framework that not only speeds up the integration of a hardware accelerator into an SoC ecosystem, but also emphasizes software adoption and usability. PhD dissertation, Electrical and Computer Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/168119/1/ajayi_1.pd

    On the co-design of scientific applications and long vector architectures

    The landscape of High Performance Computing (HPC) system architectures keeps expanding with new technologies and increased complexity. To improve the efficiency of next-generation compute devices, architects are looking for solutions beyond the commodity CPU approach. As of 2021, the five most powerful supercomputers in the world use either GP-GPU (general-purpose computing on graphics processing units) accelerators or customized CPUs specially designed to target HPC applications. This trend is only expected to grow in the coming years, motivated by the compute demands of science and industry. As architectures evolve, the ecosystem of tools and applications must follow. Choices such as the number of cores per socket, the floating-point units per core and the bandwidth through the memory hierarchy, among others, have a large impact on the power consumption and compute capabilities of the devices. To balance CPUs and accelerators, designers require accurate tools for analyzing and predicting the impact of new architectural features on the performance of complex scientific applications at scale. In such a large design space, capturing and modeling with simulators the complex interactions between the system software and hardware components is a daunting challenge. Moreover, applications must be able to exploit those designs with aggressive compute capabilities and memory bandwidth configurations. Algorithms and data structures will need to be redesigned accordingly to expose a high degree of data-level parallelism, allowing them to scale on large systems. Therefore, next-generation computing devices will be the result of a co-design effort in hardware and applications, supported by advanced simulation tools. In this thesis, we focus our work on the co-design of scientific applications and long vector architectures. We significantly extend a multi-scale simulation toolchain, enabling accurate performance and power estimations of large-scale HPC systems. Through simulation, we explore the large design space of current HPC trends over a wide range of applications. We extract speedup and energy consumption figures, analyzing the trade-offs and optimal configurations for each of the applications. We describe in detail the optimization process of two challenging applications on real vector accelerators, achieving outstanding operation performance and full memory bandwidth utilization. Overall, we provide evidence-based architectural and programming recommendations that will serve as hardware and software co-design guidelines for the next generation of specialized compute devices.
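    One concrete example of the kind of data-structure redesign mentioned above (illustrative only, not taken from the two evaluated applications) is switching from an array-of-structures layout to a structure-of-arrays layout so that a long-vector unit sees unit-stride memory accesses:

```c
#include <stddef.h>

#define N 1024

/* Array of structures: x, y, z interleaved, so loading all x values
 * needs a stride-3 access pattern, which wastes long-vector bandwidth. */
struct particle_aos { double x, y, z; };

/* Structure of arrays: each field is contiguous, so the compiler can
 * emit unit-stride loads/stores that fill a long vector register.      */
struct particles_soa { double x[N], y[N], z[N]; };

void scale_aos(struct particle_aos *p, double s)
{
    for (size_t i = 0; i < N; i++) p[i].x *= s;     /* strided                   */
}

void scale_soa(struct particles_soa *p, double s)
{
    for (size_t i = 0; i < N; i++) p->x[i] *= s;    /* unit-stride, vectorizable */
}

int main(void)
{
    static struct particle_aos  aos[N];
    static struct particles_soa soa;
    scale_aos(aos, 2.0);
    scale_soa(&soa, 2.0);
    return 0;
}
```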

    Offloading cryptographic services to the SIM card in smartphones

    Smartphones have achieved a ubiquitous presence in people's everyday life as communication, entertainment and work tools. Touch screens and a variety of sensors offer a rich experience and make applications increasingly diverse, complex and resource demanding. Despite their continuous evolution and enhancements, mobile devices are still limited in terms of battery life, processing power, storage capacity and network bandwidth. Computation offloading stands out among the efforts to extend device capabilities and face the growing gap between demand and availability of resources. Like most popular technologies, mobile devices are attractive targets for malicious attackers. They usually store sensitive private data of their owners and are increasingly used for security-sensitive activities such as online banking or mobile payments. While computation offloading introduces new challenges to the protection of those assets, it is very uncommon to take security and privacy into account as the main optimization objectives of this technique. Mobile OS security relies heavily on cryptography. Available hardware and software cryptographic providers are usually designed to resist software attacks; this kind of protection is not enough when physical control over the device is lost. Secure elements, on the other hand, include a set of protections that make them physically tamper-resistant devices. This work proposes a computation offloading technique that prioritizes enhancing security capabilities in mobile phones by offloading cryptographic operations to the SIM card, the only universally present secure element in those devices. Our contributions include an architecture for this technique, a proof-of-concept prototype developed under Android OS, and the results of a performance evaluation conducted to study its execution times and battery consumption. Despite some limitations, our approach proves to be a valid alternative for enhancing security on any smartphone.
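    The abstract does not spell out the command set used to reach the SIM applet, but communication with a SIM or other secure element normally uses ISO 7816-4 APDUs. The sketch below shows, in C, how a short-form command APDU might be assembled before being handed to whatever SIM access layer the platform exposes; the CLA/INS values are hypothetical placeholders, not the prototype's actual commands.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* ISO 7816-4 command APDU layout (short form). */
struct apdu {
    uint8_t cla, ins, p1, p2;   /* header                                 */
    uint8_t lc;                 /* length of command data                 */
    uint8_t data[255];          /* command data (e.g. bytes to be signed) */
    uint8_t le;                 /* expected response length (0 = max)     */
};

/* Serialize the APDU into wire format: header, Lc, data, Le. */
static size_t apdu_serialize(const struct apdu *a, uint8_t *out)
{
    size_t n = 0;
    out[n++] = a->cla; out[n++] = a->ins; out[n++] = a->p1; out[n++] = a->p2;
    out[n++] = a->lc;
    memcpy(out + n, a->data, a->lc); n += a->lc;
    out[n++] = a->le;
    return n;
}

int main(void)
{
    struct apdu cmd = { .cla = 0x80, .ins = 0x2A,    /* hypothetical "sign" command */
                        .p1 = 0x00, .p2 = 0x00, .lc = 4, .le = 0x00 };
    memcpy(cmd.data, "\x01\x02\x03\x04", 4);
    uint8_t wire[261];
    size_t len = apdu_serialize(&cmd, wire);
    printf("APDU length: %zu bytes\n", len);         /* 5 + 4 + 1 = 10 */
    return 0;
}
```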

    System Abstractions for Scalable Application Development at the Edge

    Recent years have witnessed an explosive growth of Internet of Things (IoT) devices, which collect or generate huge amounts of data. Given diverse device capabilities and application requirements, data processing takes place across a range of settings, from on-device to a nearby edge server/cloud and the remote cloud. Consequently, edge-cloud coordination has been studied extensively from the perspectives of job placement, scheduling and joint optimization. Typical approaches focus on performance optimization for individual applications; this often requires domain knowledge of the applications and leads to application-specific solutions, so application development and deployment over diverse scenarios incur repetitive manual effort. There are two overarching challenges in providing system-level support for application development at the edge. First, there is inherent heterogeneity at the device hardware level: execution settings range from a small cluster acting as an edge cloud to on-device inference on embedded devices, differing in hardware capability and programming environment. Further, application performance requirements vary significantly, making it even more difficult to map different applications onto already heterogeneous hardware. Second, there are trends towards incorporating edge and cloud and multi-modal data. Together, these add further dimensions to the design space and increase the complexity significantly. In this thesis, we propose a novel framework to simplify application development and deployment over a continuum from edge to cloud. Our framework provides key connections between different dimensions of design considerations, corresponding to the application abstraction, data abstraction and resource management abstraction respectively. First, our framework masks hardware heterogeneity with abstract resource types through containerization, and abstracts the application processing pipelines into generic flow graphs; it further supports a notion of degradable computing for application scenarios at the edge that are driven by multimodal sensory input. Next, as video analytics is the killer application of edge computing, we include a generic data management service between video query systems and a video store to organize video data at the edge. We propose a video data unit abstraction based on a notion of distance between objects in the video, quantifying the semantic similarity among video data. Last, considering concurrent application execution, our framework supports multi-application offloading with device-centric control, using a user-space scheduler service that wraps over the operating system scheduler.
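    As a purely illustrative reading of the "generic flow graph" application abstraction (the thesis framework's actual API is not given in the abstract), the C sketch below chains processing stages behind function pointers so a pipeline can be described independently of where each stage eventually runs:

```c
#include <stdio.h>

/* A pipeline stage: takes a frame identifier, returns a status code.
 * Real stages (decode, detect, aggregate, ...) would run in containers
 * on device, edge or cloud; here they are plain functions.             */
typedef int (*stage_fn)(int frame_id);

struct flow_node {
    const char       *name;
    stage_fn          run;
    struct flow_node *next;   /* linear pipeline; a real graph may fan out */
};

static int decode(int f) { printf("decode frame %d\n", f); return 0; }
static int detect(int f) { printf("detect frame %d\n", f); return 0; }

int main(void)
{
    struct flow_node s2 = { "detect", detect, NULL };
    struct flow_node s1 = { "decode", decode, &s2 };
    for (struct flow_node *n = &s1; n; n = n->next)
        if (n->run(7) != 0)                 /* stop the pipeline on error */
            break;
    return 0;
}
```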

    Techniques for Processing TCP/IP Flow Content in Network Switches at Gigabit Line Rates

    The growth of the Internet has enabled it to become a critical component used by businesses, governments and individuals. While most of the traffic on the Internet is legitimate, a proportion of it consists of worms, computer viruses, network intrusions, computer espionage, security breaches and illegal behavior. This rogue traffic causes computer and network outages, reduces network throughput, and costs governments and companies billions of dollars each year. This dissertation investigates the problems associated with TCP stream processing in high-speed networks. It describes an architecture that simplifies the processing of TCP data streams in these environments and presents a hardware circuit capable of TCP stream processing on multi-gigabit networks for millions of simultaneous network connections. Live Internet traffic is analyzed using this new TCP processing circuit.
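    A central piece of bookkeeping for any TCP stream processor is per-flow state indexed by the connection 5-tuple. The C sketch below shows that idea in software form, with a hashed flow table holding the next expected sequence number; the hash function and table size are illustrative assumptions, and a hardware circuit like the one described would typically use CAMs or dedicated SRAM instead.

```c
#include <stdint.h>
#include <stdio.h>

#define FLOW_TABLE_SIZE (1u << 16)       /* 64K buckets here; hardware scales to millions */

/* Minimal per-flow state needed to process a TCP stream in order. */
struct flow_state {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint32_t next_seq;                   /* next in-order sequence number expected */
    uint8_t  in_use;
};

static struct flow_state table[FLOW_TABLE_SIZE];

/* Simple multiplicative hash of the 5-tuple (protocol folded in). */
static uint32_t flow_hash(uint32_t s, uint32_t d,
                          uint16_t sp, uint16_t dp, uint8_t proto)
{
    uint32_t h = s * 2654435761u ^ d * 2246822519u
               ^ ((uint32_t)sp << 16 | dp) ^ proto;
    return h & (FLOW_TABLE_SIZE - 1);
}

int main(void)
{
    uint32_t idx = flow_hash(0x0A000001, 0x0A000002, 49152, 80, 6);
    struct flow_state *f = &table[idx];
    f->in_use   = 1;
    f->next_seq = 1;                     /* relative sequence number after SYN */
    printf("flow bucket %u, expecting seq %u\n", idx, f->next_seq);
    return 0;
}
```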

    High Availability and Scalability of Mainframe Environments using System z and z/OS as example

    Mainframe computers are the backbone of industrial and commercial computing, hosting the most relevant and critical data of businesses. One of the most important mainframe environments is IBM System z with the operating system z/OS. This book introduces the mainframe technology of System z and z/OS with respect to high availability and scalability, and highlights how these properties are provided at different levels of the hardware and software stack to satisfy the needs of large IT organizations.