106 research outputs found

    Management, Optimization and Evolution of the LHCb Online Network

    Get PDF
    The LHCb experiment is one of the four large particle detectors running at the Large Hadron Collider (LHC) at CERN. It is a forward single-arm spectrometer dedicated to test the Standard Model through precision measurements of Charge-Parity (CP) violation and rare decays in the b quark sector. The LHCb experiment will operate at a luminosity of 2x10^32cm-2s-1, the proton-proton bunch crossings rate will be approximately 10 MHz. To select the interesting events, a two-level trigger scheme is applied: the rst level trigger (L0) and the high level trigger (HLT). The L0 trigger is implemented in custom hardware, while HLT is implemented in software runs on the CPUs of the Event Filter Farm (EFF). The L0 trigger rate is dened at about 1 MHz, and the event size for each event is about 35 kByte. It is a serious challenge to handle the resulting data rate (35 GByte/s). The Online system is a key part of the LHCb experiment, providing all the IT services. It consists of three major components: the Data Acquisition (DAQ) system, the Timing and Fast Control (TFC) system and the Experiment Control System (ECS). To provide the services, two large dedicated networks based on Gigabit Ethernet are deployed: one for DAQ and another one for ECS, which are referred to Online network in general. A large network needs sophisticated monitoring for its successful operation. Commercial network management systems are quite expensive and dicult to integrate into the LHCb ECS. A custom network monitoring system has been implemented based on a Supervisory Control And Data Acquisition (SCADA) system called PVSS which is used by LHCb ECS. It is a homogeneous part of the LHCb ECS. In this thesis, it is demonstrated how a large scale network can be monitored and managed using tools originally made for industrial supervisory control. The thesis is organized as the follows: Chapter 1 gives a brief introduction to LHC and the B physics on LHC, then describes all sub-detectors and the trigger and DAQ system of LHCb from structure to performance. Chapter 2 first introduces the LHCb Online system and the dataflow, then focuses on the Online network design and its optimization. In Chapter 3, the SCADA system PVSS is introduced briefly, then the architecture and implementation of the network monitoring system are described in detail, including the front-end processes, the data communication and the supervisory layer. Chapter 4 first discusses the packet sampling theory and one of the packet sampling mechanisms: sFlow, then demonstrates the applications of sFlow for the network trouble-shooting, the traffic monitoring and the anomaly detection. In Chapter 5, the upgrade of LHC and LHCb is introduced, the possible architecture of DAQ is discussed, and two candidate internetworking technologies (high speed Ethernet and InfniBand) are compared in different aspects for DAQ. Three schemes based on 10 Gigabit Ethernet are presented and studied. Chapter 6 is a general summary of the thesis

    Venice: Exploring Server Architectures for Effective Resource Sharing

    Get PDF
    Consolidated server racks are quickly becoming the backbone of IT infrastructure for science, engineering, and business, alike. These servers are still largely built and organized as when they were distributed, individual entities. Given that many fields increasingly rely on analytics of huge datasets, it makes sense to support flexible resource utilization across servers to improve cost-effectiveness and performance. We introduce Venice, a family of data-center server architectures that builds a strong communication substrate as a first-class resource for server chips. Venice provides a diverse set of resource-joining mechanisms that enables user programs to efficiently leverage non-local resources. To better understand the implications of design decisions about system support for resource sharing we have constructed a hardware prototype that allows us to more accurately measure end-to-end performance of at-scale applications and to explore tradeoffs among performance, power, and resource-sharing transparency. We present results from our initial studies analyzing these tradeoffs when sharing memory, accelerators, or NICs. We find that it is particularly important to reduce or hide latency, that data-sharing access patterns should match the features of the communication channels employed, and that inter-channel collaboration can be exploited for better performance

    Effects of Communication Protocol Stack Offload on Parallel Performance in Clusters

    Get PDF
    The primary research objective of this dissertation is to demonstrate that the effects of communication protocol stack offload (CPSO) on application execution time can be attributed to the following two complementary sources. First, the application-specific computation may be executed concurrently with the asynchronous communication performed by the communication protocol stack offload engine. Second, the protocol stack processing can be accelerated or decelerated by the offload engine. These two types of performance effects can be quantified with the use of the degree of overlapping Do and degree of acceleration Daccs. The composite communication speedup metrics S_comm(Do, Daccs) can be used in order to quantify the combined effects of the protocol stack offload. This dissertation thesis is validated empirically. The degree of overlapping Do, the degree of acceleration Daccs, and the communication speedup Scomm characteristic of the system configurations under test are derived in the course of experiments performed for the system configurations of interest. It is shown that the proposed metrics adequately describe the effects of the protocol stack offload on the application execution time. Additionally, a set of analytical models of the networking subsystem of a PC-based cluster node is developed. As a result of the modeling, the metrics Do, Daccs, and Scomm are obtained. The models are evaluated as to their complexity and precision by comparing the modeling results with the measured values of Do, Daccs, and Scomm. The primary contributions of this dissertation research are as follows. First, the metric Daccs and Scomm are introduced in order to complement the Do metric in its use for evaluation of the effects of optimizations in the networking subsystem on parallel performance in clusters. The metrics are shown to adequately describe CPSO performance effects. Second, a method for assessing performance effects of CPSO scenarios on application performance is developed and presented. Third, a set of analytical models of cluster node networking subsystems with CPSO capability is developed and characterised as to their complexity and precision of the prediction of the Do and Daccs metrics

    Composable architecture for rack scale big data computing

    No full text
    The rapid growth of cloud computing, both in terms of the spectrum and volume of cloud workloads, necessitate re-visiting the traditional rack-mountable servers based datacenter design. Next generation datacenters need to offer enhanced support for: (i) fast changing system configuration requirements due to workload constraints, (ii) timely adoption of emerging hardware technologies, and (iii) maximal sharing of systems and subsystems in order to lower costs. Disaggregated datacenters, constructed as a collection of individual resources such as CPU, memory, disks etc., and composed into workload execution units on demand, are an interesting new trend that can address the above challenges. In this paper, we demonstrated the feasibility of composable systems through building a rack scale composable system prototype using PCIe switch. Through empirical approaches, we develop assessment of the opportunities and challenges for leveraging the composable architecture for rack scale cloud datacenters with a focus on big data and NoSQL workloads. In particular, we compare and contrast the programming models that can be used to access the composable resources, and developed the implications for the network and resource provisioning and management for rack scale architecture

    Infrastructure Plan for ASC Petascale Environments

    Full text link

    Improving the Scalability of High Performance Computer Systems

    Full text link
    Improving the performance of future computing systems will be based upon the ability of increasing the scalability of current technology. New paths need to be explored, as operating principles that were applied up to now are becoming irrelevant for upcoming computer architectures. It appears that scaling the number of cores, processors and nodes within an system represents the only feasible alternative to achieve Exascale performance. To accomplish this goal, we propose three novel techniques addressing different layers of computer systems. The Tightly Coupled Cluster technique significantly improves the communication for inter node communication within compute clusters. By improving the latency by an order of magnitude over existing solutions the cost of communication is considerably reduced. This enables to exploit fine grain parallelism within applications, thereby, extending the scalability considerably. The mechanism virtually moves the network interconnect into the processor, bypassing the latency of the I/O interface and rendering protocol conversions unnecessary. The technique is implemented entirely through firmware and kernel layer software utilizing off-the-shelf AMD processors. We present a proof-of-concept implementation and real world benchmarks to demonstrate the superior performance of our technique. In particular, our approach achieves a software-to-software communication latency of 240 ns between two remote compute nodes. The second part of the dissertation introduces a new framework for scalable Networks-on-Chip. A novel rapid prototyping methodology is proposed, that accelerates the design and implementation substantially. Due to its flexibility and modularity a large application space is covered ranging from Systems-on-chip, to high performance many-core processors. The Network-on-Chip compiler enables to generate complex networks in the form of synthesizable register transfer level code from an abstract design description. Our engine supports different target technologies including Field Programmable Gate Arrays and Application Specific Integrated Circuits. The framework enables to build large designs while minimizing development and verification efforts. Many topologies and routing algorithms are supported by partitioning the tasks into several layers and by the introduction of a protocol agnostic architecture. We provide a thorough evaluation of the design that shows excellent results regarding performance and scalability. The third part of the dissertation addresses the Processor-Memory Interface within computer architectures. The increasing compute power of many-core processors, leads to an equally growing demand for more memory bandwidth and capacity. Current processor designs exhibit physical limitations that restrict the scalability of main memory. To address this issue we propose a memory extension technique that attaches large amounts of DRAM memory to the processor via a low pin count interface using high speed serial transceivers. Our technique transparently integrates the extension memory into the system architecture by providing full cache coherency. Therefore, applications can utilize the memory extension by applying regular shared memory programming techniques. By supporting daisy chained memory extension devices and by introducing the asymmetric probing approach, the proposed mechanism ensures high scalability. We furthermore propose a DMA offloading technique to improve the performance of the processor memory interface. The design has been implemented in a Field Programmable Gate Array based prototype. Driver software and firmware modifications have been developed to bring up the prototype in a Linux based system. We show microbenchmarks that prove the feasibility of our design
    corecore