91 research outputs found

    Scalable system software for high performance large-scale applications

    Get PDF
    In the last decades, high-performance large-scale systems have been a fundamental tool for scientific discovery and engineering advances. The sustained growth of supercomputing performance and the concurrent reduction in cost have made this technology available for a large number of scientists and engineers working on many different problems. The design of next-generation supercomputers will include traditional HPC requirements as well as the new requirements to handle data-intensive computations. Data intensive applications will hence play an important role in a variety of fields, and are the current focus of several research trends in HPC. Due to the challenges of scalability and power efficiency, next-generation of supercomputers needs a redesign of the whole software stack. Being at the bottom of the software stack, system software is expected to change drastically to support the upcoming hardware and to meet new application requirements. This PhD thesis addresses the scalability of system software. The thesis start at the Operating System level: first studying general-purpose OS (ex. Linux) and then studying lightweight kernels (ex. CNK). Then, we focus on the runtime system: we implement a runtime system for distributed memory systems that includes many of the system services required by next-generation applications. Finally we focus on hardware features that can be exploited at user-level to improve applications performance, and potentially included into our advanced runtime system. The thesis contributions are the following: Operating System Scalability: We provide an accurate study of the scalability problems of modern Operating Systems for HPC. We design and implement a methodology whereby detailed quantitative information may be obtained for each OS noise event. We validate our approach by comparing it to other well-known standard techniques to analyze OS noise, such FTQ (Fixed Time Quantum). Evaluation of the address translation management for a lightweight kernel: we provide a performance evaluation of different TLB management approaches ¿ dynamic memory mapping, static memory mapping with replaceable TLB entries, and static memory mapping with fixed TLB entries (no TLB misses) on a IBM BlueGene/P system. Runtime System Scalability: We show that a runtime system can efficiently incorporate system services and improve scalability for a specific class of applications. We design and implement a full-featured runtime system and programming model to execute irregular appli- cations on a commodity cluster. The runtime library is called Global Memory and Threading library (GMT) and integrates a locality-aware Partitioned Global Address Space communication model with a fork/join program structure. It supports massive lightweight multi-threading, overlapping of communication and computation and small messages aggregation to tolerate network latencies. We compare GMT to other PGAS models, hand-optimized MPI code and custom architectures (Cray XMT) on a set of large scale irregular applications: breadth first search, random walk and concurrent hash map access. Our runtime system shows performance orders of magnitude higher than other solutions on commodity clusters and competitive with custom architectures. User-level Scalability Exploiting Hardware Features: We show the high complexity of low-level hardware optimizations for single applications, as a motivation to incorporate this logic into an adaptive runtime system. We evaluate the effects of controllable hardware-thread priority mechanism that controls the rate at which each hardware-thread decodes instruction on IBM POWER5 and POWER6 processors. Finally, we show how to effectively exploits cache locality and network-on-chip on the Tilera many-core architecture to improve intra-core scalability

    Decoupled Thermal Simulation

    Get PDF
    Small transistors and high clock frequency have resulted in high power density, which makes temperature a strong constraint in today's microprocessor design. For maximizing performance, the thermal design power must be set according to average, instead of worst case, conditions. Consequently, current processors feature temperature sensors and throt-tling mechanisms to keep the chip temperature at a safe level. To study future thermally-constrained processors and systems, researchers and engineers use cycle-accurate performance simulators modeling power consumption and temperature. Cycle-accurate simulators are relatively slow and make it difficult to study long-term thermal behaviors that may require to simulate several minutes or even hours of processor execution. Sampling or phase analysis cannot be applied directly in this case because temperature depends on all past energy events. We propose a partial solution to this problem, which consists in decoupling cycle-accurate simulations and thermal ones. Temperature-unaware cycle-accurate simulation is used to generate an energy trace representing the complete execution of an application. Phase analysis can be used to decrease the trace generation time and make compact traces. Temperature and thermal-throttling are simulated in a separate thermal simulator that reads energy traces. The thermal simulator is faster than the cycle-accurate one and can be used to explore, with the same energy trace, parameters that are not modeled in cycle-accurate simulation

    Performance Projections of HPC Applications on Chip Multiprocessor (CMP) Based Systems

    Get PDF
    Performance projections of High Performance Computing (HPC) applications onto various hardware platforms are important for hardware vendors and HPC users. The projections aid hardware vendors in the design of future systems and help HPC users with system procurement and application refinements. In this dissertation, we present an efficient method to project the performance of HPC applications onto Chip Multiprocessor (CMP) based systems using widely available standard benchmark data. The main advantage of this method is the use of published data about the target machine; the target machine need not be available. With the current trend in HPC platforms shifting towards cluster systems with chip multiprocessors (CMPs), efficient and accurate performance projection becomes a challenging task. Typically, CMP-based systems are configured hierarchically, which significantly impacts the performance of HPC applications. The goal of this research is to develop an efficient method to project the performance of HPC applications onto systems that utilize CMPs. To provide for efficiency, our projection methodology is automated (projections are done using a tool) and fast (with small overhead). Our method, called the surrogate-based workload application projection method, utilizes surrogate benchmarks to project an HPC application performance on target systems where computation component of an HPC application is projected separately from the communication component. Our methodology was validated on a variety of systems utilizing different processor and interconnect architectures with high accuracy and efficiency. The average projection error on three target systems was 11.22 percent with standard deviation of 1.18 percent for twelve HPC workloads

    Structure of the Human Cytomegalovirus Protease Catalytic Domain Reveals a Novel Serine Protease Fold and Catalytic Triad

    Get PDF
    AbstractProteolytic processing of capsid assembly protein precursors by herpesvirus proteases is essential for virion maturation. A 2.5 Ã… crystal structure of the human cytomegalovirus protease catalytic domain has been determined by X-ray diffraction. The structure defines a new class of serine protease with respect to global-fold topology and has a catalytic triad consisting of Ser-132, His-63, and His-157 in contrast with the Ser-His-Asp triads found in other serine proteases. However, catalytic machinery for activating the serine nucleophile and stabilizing a tetrahedral transition state is oriented similarly to that for members of the trypsin-like and subtilisin-like serine protease families. Formation of the active dimer is mediated primarily by burying a helix of one protomer into a deep cleft in the protein surface of the other

    E-QED: Electrical Bug Localization During Post-Silicon Validation Enabled by Quick Error Detection and Formal Methods

    Full text link
    During post-silicon validation, manufactured integrated circuits are extensively tested in actual system environments to detect design bugs. Bug localization involves identification of a bug trace (a sequence of inputs that activates and detects the bug) and a hardware design block where the bug is located. Existing bug localization practices during post-silicon validation are mostly manual and ad hoc, and, hence, extremely expensive and time consuming. This is particularly true for subtle electrical bugs caused by unexpected interactions between a design and its electrical state. We present E-QED, a new approach that automatically localizes electrical bugs during post-silicon validation. Our results on the OpenSPARC T2, an open-source 500-million-transistor multicore chip design, demonstrate the effectiveness and practicality of E-QED: starting with a failed post-silicon test, in a few hours (9 hours on average) we can automatically narrow the location of the bug to (the fan-in logic cone of) a handful of candidate flip-flops (18 flip-flops on average for a design with ~ 1 Million flip-flops) and also obtain the corresponding bug trace. The area impact of E-QED is ~2.5%. In contrast, deter-mining this same information might take weeks (or even months) of mostly manual work using traditional approaches

    New Trends in Amplifiers and Sources via Chalcogenide Photonic Crystal Fibers

    Get PDF
    Rare-earth-doped chalcogenide glass fiber lasers and amplifiers have great applicative potential in many fields since they are key elements in the near and medium-infrared (mid-IR) wavelength range. In this paper, a review, even if not exhaustive, on amplification and lasing obtained by employing rare-earth-doped chalcogenide photonic crystal fibers is reported. Materials, devices, and feasible applications in the mid-IR are briefly mentioned
    • …
    corecore