95 research outputs found

    Improving the efficiency of multicore systems through software and hardware cooperation

    Get PDF
    Increasing processors' clock frequency has traditionally been one of the largest drivers of performance improvements for computing systems. In the first half of the 2000s, however, it became clear that continuing to increase frequency was not a viable solution anymore. Power consumption and power density became prohibitively costly, and processor manufacturers moved to multicore designs. This new paradigm introduced multiple challenges not present in single-threaded processors. Applications running on multicore systems share different resources such as the cache hierarchy and the memory bus. Resource sharing occurs at much finer degree when cores support multithreading as well. In this case, applications share the processor¿s pipeline too. Running multiple applications on the same processor allows for better utilization of its resources¿which otherwise may just lie idle if an application does not use them. But sharing resources may create interferences between applications running on the system. While the degree of these interferences depends on the nature of the applications, it is typically desirable to reduce them in order to improve efficiency. Most currently available processors expose a set of sensors and actuators that software can use to monitor and control resource sharing among the applications running on a system. But it is typically up to end users to analyze their workloads of interest and to manually use the actuators provided by the processor. Because of this, in many cases the different mechanisms for controlling resource sharing are simply left unused. In this thesis we present different techniques that rely on software/hardware interaction to monitor and improve application interference¿and thus improve system efficiency. First we conduct a quantitative study showing the benefits of hardware/software cooperation on system efficiency. Then we narrow our focus on a given hardware knob: data prefetching. Specifically we develop and evaluate several adaptive solutions for improving the efficiency of hardware data prefetching on multicore systems. The impact of the solutions presented in this thesis, however, goes beyond the particular case of data prefetching. They serve as illustrative examples for developing software/hardware cooperation schemes that enable the efficient sharing of resources in multicore systems.L'increment de la freqüència dels processadors ha estat tradicionalment un dels majors responsables de la millora de rendiment dels sistemes de computació. Tanmateix, a la primera meitat del segle XXI es va fer evident que continuar incrementant la freqüència ja no era una solució viable. El consum de potència i la densitat de potència van esdevenir massa costosos, i els dissenyadors de processadors van adoptar dissenys "multicore". Aquest nou paradigma va introduir molts reptes que no eren presents als processadors "single-threaded". Les aplicacions que s'executen a processadors multicore comparteixen diferent recursos tal i com la jerarquia de "cache" i el bus de memòria. En processadors que suporten "multi-threading" encara comparteixen més recursos: en aquest cas les aplicacions també comparteixen els recursos del "pipeline". Executar diverses aplicacions en un processador permet una millor utilització dels seus recursos, que d'altra forma podrien no tenir cap utilitat si l'aplicació en execució no els utilitzés. Compartir recursos, però, pot crear interferències entre les aplicacions executant-se al sistema. Encara que el nivell d'aquestes interferències depèn de les aplicacions que s'executen conjuntament, normalment és desitjable reduir-les per tal de millorar la eficiència. Molts dels processadors actuals exposen un conjunt sensors i actuadors que el software pot utilitzar per supervisar i controlar la compartició de recursos entre les diferents aplicacions executant-se al sistema. En general és responsabilitat dels usuaris analitzar les aplicacions del seu interès i després configurar els actuadors de forma manual. Això suposa una dificultat afegida i per aquest motiu, en molts casos els diferents mecanismes per controlar com es comparteixen els recursos senzillament no es fan servir. En aquesta tesi, presentem diferents tècniques basades en la interacció del software i el hardware per supervisar i reduir la interferència entre aplicacions, i d'aquesta forma millorar la eficiència del sistema. Primer es presenta un estudi quantitatiu que mostra els beneficis de la cooperació entre software i hardware en la eficiència del sistema. Després el focus es centra en un actuador en concret: "data prefetching". En concret, desenvolupem i avaluem diferents solucions adaptatives per millorar la eficiència de hardware data prefetching a sistemes multicore. L'impacte de les solucions presentades a aquesta tesi, però, no es limiten a aquest cas concret. Al contrari, serveixen com exemples il·lustratius per desenvolupar tècniques de cooperació software i hardware que permetin compartir els recursos en sistemes multicore de forma eficient. La compartició de recursos en un processador és un factor crucial que afecta significativament a la seva eficiència. Però, altres nivells d'un sistema de computació també comparteixen recursos. En grans instal·lacions de computació com els "datacenters", les aplicacions també poden compartir altres recursos com la xarxa o l'emmagatzemament. Com a cas d'estudi considerem el disseny d'un sistema d'un sistema de comptabilitat d'energia basat en la cooperació entre el software i el hardware per a grans instal·lacions de computació. En aquest context, explorem diverses alternatives per als sensors i actuadors que es requereixen, així com també analitzem els diferents aspectes claus en el disseny d'un sistema d'aquestes característiques

    Balancing soft error coverage with lifetime reliability in redundantly multithreaded processors

    Get PDF
    Silicon reliability is a key challenge facing the microprocessor industry. Processors need to be designed such that they are resilient against both soft errors and lifetime reliability phenomena. However, techniques developed to address one class of reliability problems may impact other aspects of silicon reliability. In this paper, we show that Redundant Multi-Threading (RMT), which provides soft error protection, exacerbates lifetime reliability. We then explore two different architectural approaches to tackle this problem, namely, Dynamic Voltage Scaling (DVS) and partial RMT. We show that each approach has certain strengths and weaknesses with respect to performance, soft error coverage, and lifetime reliability. We then propose and evaluate a hybrid approach that combines DVS and partial RMT. We show that this approach provides better improvement in lifetime reliability than DVS or partial RMT alone, buys back a significant amount of performance that is lost due to DVS, and provides nearly complete soft error coverage. I

    Scalable system software for high performance large-scale applications

    Get PDF
    In the last decades, high-performance large-scale systems have been a fundamental tool for scientific discovery and engineering advances. The sustained growth of supercomputing performance and the concurrent reduction in cost have made this technology available for a large number of scientists and engineers working on many different problems. The design of next-generation supercomputers will include traditional HPC requirements as well as the new requirements to handle data-intensive computations. Data intensive applications will hence play an important role in a variety of fields, and are the current focus of several research trends in HPC. Due to the challenges of scalability and power efficiency, next-generation of supercomputers needs a redesign of the whole software stack. Being at the bottom of the software stack, system software is expected to change drastically to support the upcoming hardware and to meet new application requirements. This PhD thesis addresses the scalability of system software. The thesis start at the Operating System level: first studying general-purpose OS (ex. Linux) and then studying lightweight kernels (ex. CNK). Then, we focus on the runtime system: we implement a runtime system for distributed memory systems that includes many of the system services required by next-generation applications. Finally we focus on hardware features that can be exploited at user-level to improve applications performance, and potentially included into our advanced runtime system. The thesis contributions are the following: Operating System Scalability: We provide an accurate study of the scalability problems of modern Operating Systems for HPC. We design and implement a methodology whereby detailed quantitative information may be obtained for each OS noise event. We validate our approach by comparing it to other well-known standard techniques to analyze OS noise, such FTQ (Fixed Time Quantum). Evaluation of the address translation management for a lightweight kernel: we provide a performance evaluation of different TLB management approaches ¿ dynamic memory mapping, static memory mapping with replaceable TLB entries, and static memory mapping with fixed TLB entries (no TLB misses) on a IBM BlueGene/P system. Runtime System Scalability: We show that a runtime system can efficiently incorporate system services and improve scalability for a specific class of applications. We design and implement a full-featured runtime system and programming model to execute irregular appli- cations on a commodity cluster. The runtime library is called Global Memory and Threading library (GMT) and integrates a locality-aware Partitioned Global Address Space communication model with a fork/join program structure. It supports massive lightweight multi-threading, overlapping of communication and computation and small messages aggregation to tolerate network latencies. We compare GMT to other PGAS models, hand-optimized MPI code and custom architectures (Cray XMT) on a set of large scale irregular applications: breadth first search, random walk and concurrent hash map access. Our runtime system shows performance orders of magnitude higher than other solutions on commodity clusters and competitive with custom architectures. User-level Scalability Exploiting Hardware Features: We show the high complexity of low-level hardware optimizations for single applications, as a motivation to incorporate this logic into an adaptive runtime system. We evaluate the effects of controllable hardware-thread priority mechanism that controls the rate at which each hardware-thread decodes instruction on IBM POWER5 and POWER6 processors. Finally, we show how to effectively exploits cache locality and network-on-chip on the Tilera many-core architecture to improve intra-core scalability

    Critical Casimir forces and adsorption profiles in the presence of a chemically structured substrate

    Full text link
    Motivated by recent experiments with confined binary liquid mixtures near demixing, we study the universal critical properties of a system, which belongs to the Ising universality class, in the film geometry. We employ periodic boundary conditions in the two lateral directions and fixed boundary conditions on the two confining surfaces, such that one of them has a spatially homogeneous adsorption preference while the other one exhibits a laterally alternating adsorption preference, resembling locally a single chemical step. By means of Monte Carlo simulations of an improved Hamiltonian, so that the leading scaling corrections are suppressed, numerical integration, and finite-size scaling analysis we determine the critical Casimir force and its universal scaling function for various values of the aspect ratio of the film. In the limit of a vanishing aspect ratio the critical Casimir force of this system reduces to the mean value of the critical Casimir force for laterally homogeneous ++ and +- boundary conditions, corresponding to the surface spins on the two surfaces being fixed to equal and opposite values, respectively. We show that the universal scaling function of the critical Casimir force for small but finite aspect ratios displays a linear dependence on the aspect ratio which is solely due to the presence of the lateral inhomogeneity. We also analyze the order-parameter profiles at criticality and their universal scaling function which allows us to probe theoretical predictions and to compare with experimental data.Comment: revised version, section 5.2 expanded; 53 pages, 12 figures, iopart clas

    Worst-case energy consumption: A new challenge for battery-powered critical devices

    Get PDF
    The number of devices connected to the IoT is on the rise, reaching hundreds of billions in the next years. Many devices will implement some type of critical functionality, for instance in the medical market. Energy awareness is mandatory in the design of IoT devices because of their huge impact on worldwide energy consumption and the fact that many of them are battery powered. Critical IoT devices further require addressing new energy-related challenges. On the one hand, factoring in the impact of energy-solutions on device's performance, providing evidence of adherence to domain-specific safety standards. On the other hand, deriving safe worst-case energy consumption (WCEC) estimates is a fundamental building block to ensure the system can continuously operate under a pre-established set of power/energy caps, safely delivering its critical functionality. We analyze for the first time the impact that different hardware physical parameters have on both model-based and measurement-based WCEC modeling, for which we also show the main challenges they face compared to chip manufacturers' current practice for energy modeling and validation. Under the set of constraints that emanate from how certain physical parameters can be actually modeled, we show that measurement-based WCEC is a promising way forward for WCEC estimation.This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant TIN2015- 65316-P and the HiPEAC Network of Excellence. Jaume Abella has been partially supported by the MINECO under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717. Carles Hernndez is jointly funded by the MINECO and FEDER funds through grant TIN2014-60404-JIN.Peer ReviewedPostprint (author's final draft

    On the particle spectrum and the conformal window

    Get PDF
    We study the SU(3) gauge theory with twelve flavours of fermions in the fundamental representation as a prototype of non-Abelian gauge theories inside the conformal window. Guided by the pattern of underlying symmetries, chiral and conformal, we analyze the two-point functions theoretically and on the lattice, and determine the finite size scaling and the infinite volume fermion mass dependence of the would-be hadron masses. We show that the spectrum in the Coulomb phase of the system can be described in the context of a universal scaling analysis and we provide the nonperturbative determination of the fermion mass anomalous dimension gamma*=0.235(46) at the infrared fixed point. We comment on the agreement with the four-loop perturbative prediction for this quantity and we provide a unified description of all existing lattice results for the spectrum of this system, them being in the Coulomb phase or the asymptotically free phase. Our results corroborate the view that the fixed point we are studying is not associated to a physical singularity along the bare coupling line and estimates of physical observables can be attempted on either side of the fixed point. Finally, we observe the restoration of the U(1) axial symmetry in the two-point functions.Comment: 40 pages, 22 figure
    • …
    corecore