173 research outputs found

    Efficient processing of ATLAS events analysis in homogeneous and heterogeneous platforms with accelerator devices

    Get PDF
    Dissertação de mestrado em Engenharia InformáticaMost event data analysis tasks in the ATLAS project require both intensive data access and processing, where some tasks are typically I/O bound while others are compute bound. This dissertation work mainly focus improving the code efficiency of the compute bound stages of the ATLAS detector data analysis, complementing a parallel dissertation work that addresses the I/O bound issues. The main goal of the work was to design, implement, validate and evaluate an improved and more robust data analysis task, originally developed by the LIP research group at the University of Minho. This involved tuning the performance of both Top Quark and Higgs Boson reconstruction of events, within the ATLAS framework, to run on homogeneous systems with multiple CPUs and on heterogeneous computing platforms. The latter are based on multicore CPU devices coupled to PCI-E boards with many-core devices, such as the Intel Xeon Phi or the NVidia Fermi GPU devices. Once the critical areas of the event analysis were identified and restructured, two parallelization approaches for homogeneous systems and two for heterogeneous systems were developed and evaluated to identify their limitations and the restrictions imposed by the LipMiniAnalysis library, an integral part of every application developed at LIP. To efficiently use multiple CPU resources, an application scheduler was also developed to extract parallelism from simultaneously execution of both sequential and parallel applications when processing large sets of input data files. A key achieved outcome of this work is a set of guidelines for LIP researchers to efficiently use the available computing resources in current and future complex parallel environments, taking advantage of the acquired expertise during this dissertation work. Further improvements on LIP libraries can be achieved by developing a tool to automatically extract parallelism of LIP applications, complemented by the application scheduler and additional suggested approaches.A maior parte das tarefas de análise de dados de eventos no projeto ATLAS requerem grandes capacidades de acesso a dados e processamento, em que a performance de algumas das tarefas são limitadas pela capacidade de I/O e outras pela capacidade de computação. Esta dissertação irá focar-se principalmente em melhorar a eficiência do código nos problemas limitados computacionalmente nas últimas fases de análise de dados do detector do ATLAS, complementando uma dissertação paralela que irá lidar com as tarefas limitadas pelo I/O. O principal objectivo deste trabalho será desenhar, implementar, validar e avaliar uma tarefa de análise mais robusta e melhorada, desenvolvida pelo grupo de investigação do LIP na Universidade do Minho. Isto envolve aperfeiçoar a performance das reconstruções do Top Quark e bosão de Higgs de eventos dentro da framework do ATLAS, a ser executada em plataformas homogéneas com vários CPUs e em plataformas de computação heterogénea. A última é baseada em CPUs multicore acoplados a placas PCI-E com dispositivos many-core, tais como o Intel Xeon Phi ou os dispositivos GPU NVidia Fermi. Depois de identificar e restructurar as regiões críticas da análise de eventos, duas abordagens de paralelização para plataformas homogéneas e duas para plataformas heterogéneas foram desenvolvidas e avaliadas, com o objectivo de identificar as suas limitações e as restrições impostas pela biblioteca LipMiniAnalysis, uma parte integrante de todas as aplicações desenvolvidas no LIP. Um escalonador de aplicações foi desenvolvido para usar eficientemente os recursos de múltiplos CPUs, através da extracção de paralelismo de execução em simultâneo de tanto aplicações sequenciais como paralelas para processamento de grandes conjuntos de ficheiros de dados. Um resultado obtido neste trabalho foi um conjunto de directivas para os investigadores do LIP para o uso eficiente de recursos em ambientes paralelos complexos. É possível melhorar as bibliotecas do LIP através do desenvolvimento de uma ferramenta para extrair automaticamente paralelismo das aplicações do LIP, complementado pelo escalonador de aplicações e outras alternativas sugeridas

    DReAM: An approach to estimate per-Task DRAM energy in multicore systems

    Get PDF
    Accurate per-task energy estimation in multicore systems would allow performing per-task energy-aware task scheduling and energy-aware billing in data centers, among other applications. Per-task energy estimation is challenged by the interaction between tasks in shared resources, which impacts tasks’ energy consumption in uncontrolled ways. Some accurate mechanisms have been devised recently to estimate per-task energy consumed on-chip in multicores, but there is a lack of such mechanisms for DRAM memories. This article makes the case for accurate per-task DRAM energy metering in multicores, which opens new paths to energy/performance optimizations. In particular, the contributions of this article are (i) an ideal per-task energy metering model for DRAM memories; (ii) DReAM, an accurate yet low cost implementation of the ideal model (less than 5% accuracy error when 16 tasks share memory); and (iii) a comparison with standard methods (even distribution and access-count based) proving that DReAM is much more accurate than these other methods.Peer ReviewedPostprint (author's final draft

    Architecting Data Centers for High Efficiency and Low Latency

    Full text link
    Modern data centers, housing remarkably powerful computational capacity, are built in massive scales and consume a huge amount of energy. The energy consumption of data centers has mushroomed from virtually nothing to about three percent of the global electricity supply in the last decade, and will continuously grow. Unfortunately, a significant fraction of this energy consumption is wasted due to the inefficiency of current data center architectures, and one of the key reasons behind this inefficiency is the stringent response latency requirements of the user-facing services hosted in these data centers such as web search and social networks. To deliver such low response latency, data center operators often have to overprovision resources to handle high peaks in user load and unexpected load spikes, resulting in low efficiency. This dissertation investigates data center architecture designs that reconcile high system efficiency and low response latency. To increase the efficiency, we propose techniques that understand both microarchitectural-level resource sharing and system-level resource usage dynamics to enable highly efficient co-locations of latency-critical services and low-priority batch workloads. We investigate the resource sharing on real-system simultaneous multithreading (SMT) processors to enable SMT co-locations by precisely predicting the performance interference. We then leverage historical resource usage patterns to further optimize the task scheduling algorithm and data placement policy to improve the efficiency of workload co-locations. Moreover, we introduce methodologies to better manage the response latency by automatically attributing the source of tail latency to low-level architectural and system configurations in both offline load testing environment and online production environment. We design and develop a response latency evaluation framework at microsecond-level precision for data center applications, with which we construct statistical inference procedures to attribute the source of tail latency. Finally, we present an approach that proactively enacts carefully designed causal inference micro-experiments to diagnose the root causes of response latency anomalies, and automatically correct them to reduce the response latency.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/144144/1/yunqi_1.pd

    Doctor of Philosophy

    Get PDF
    dissertationWith the explosion of chip transistor counts, the semiconductor industry has struggled with ways to continue scaling computing performance in line with historical trends. In recent years, the de facto solution to utilize excess transistors has been to increase the size of the on-chip data cache, allowing fast access to an increased portion of main memory. These large caches allowed the continued scaling of single thread performance, which had not yet reached the limit of instruction level parallelism (ILP). As we approach the potential limits of parallelism within a single threaded application, new approaches such as chip multiprocessors (CMP) have become popular for scaling performance utilizing thread level parallelism (TLP). This dissertation identifies the operating system as a ubiquitous area where single threaded performance and multithreaded performance have often been ignored by computer architects. We propose that novel hardware and OS co-design has the potential to significantly improve current chip multiprocessor designs, enabling increased performance and improved power efficiency. We show that the operating system contributes a nontrivial overhead to even the most computationally intense workloads and that this OS contribution grows to a significant fraction of total instructions when executing several common applications found in the datacenter. We demonstrate that architectural improvements have had little to no effect on the performance of the OS over the last 15 years, leaving ample room for improvements. We specifically consider three potential solutions to improve OS execution on modern processors. First, we consider the potential of a separate operating system processor (OSP) operating concurrently with general purpose processors (GPP) in a chip multiprocessor organization, with several specialized structures acting as efficient conduits between these processors. Second, we consider the potential of segregating existing caching structures to decrease cache interference between the OS and application. Third, we propose that there are components within the OS itself that should be refactored to be both multithreaded and cache topology aware, which in turn, improves the performance and scalability of many-threaded applications

    A Comparative study and evaluation of parallel programming models for shared-memory parallel architectures

    Get PDF
    Nowadays, shared-memory parallel architectures have evolved and new programming frameworks have appeared that exploit these architectures: OpenMP, TBB, Cilk Plus, ArBB and OpenCL. This article focuses on the most extended of these frameworks in commercial and scientific areas. This paper shows a comparative study of these frameworks and an evaluation. The study covers several capacities, such as task deployment, scheduling techniques, or programming language abstractions. The evaluation measures three dimensions: code development complexity, performance and efficiency, measure as speedup per watt. For this evaluation, several parallel benchmarks have been implemented with each framework. These benchmarks are created to cover certain scenarios, like regular memory access or irregular computation. The conclusions show some highlights, like the fact that some frameworks (OpenMP, Cilk Plus) are better for transforming quickly a sequential code, others (TBB) have a small footprint which is ideal for small problems, and others (OpenCL) are suited for heterogeneous architectures but they require a very complex development process. The conclusions also show that the vectorization support is more critical than multitasking to achieve efficiency for those problems where this approach fits.This work has been partially funded by the project “Input/Output Scalable Techniques for distributed and high-performance computing environments” of MINISTERIO DE CIENCIA E INNOVACIÓN, TIN2010-16497. The work of J. Daniel García has been funded by "FUNDACIÓN CAJAMADRID" through a grant for Mobility of Madrid Public Universities Professors

    A Design Approach for Soft Errors Protection in Real-Time Systems

    Get PDF
    This paper proposes the use of metrics to refine system design for soft errors protection in system on chip architectures. Specifically this research shows the use of metrics in design space exploration that highlight where in the structure of the model and at what point in the behaviour, protection is needed against soft errors. As these metrics improve the ability of the system to provide functionality, they are referred to here as reliability metrics. Previous approaches to prevent soft errors focused on recovery after detection. Almost no research has been directed towards preventive measures. But in real-time systems, deadlines are performance requirements that absolutely must be met and a missed deadline constitutes an erroneous action and a possible system failure. This paper focuses on a preventive approach as a solution rather than recovery after detection. The intention of this research is to prevent serious loss of system functionality or system failure though it may not be able to eliminate the impact of soft errors completely

    Multi-core devices for safety-critical systems: a survey

    Get PDF
    Multi-core devices are envisioned to support the development of next-generation safety-critical systems, enabling the on-chip integration of functions of different criticality. This integration provides multiple system-level potential benefits such as cost, size, power, and weight reduction. However, safety certification becomes a challenge and several fundamental safety technical requirements must be addressed, such as temporal and spatial independence, reliability, and diagnostic coverage. This survey provides a categorization and overview at different device abstraction levels (nanoscale, component, and device) of selected key research contributions that support the compliance with these fundamental safety requirements.This work has been partially supported by the Spanish Ministry of Economy and Competitiveness under grant TIN2015-65316-P, Basque Government under grant KK-2019-00035 and the HiPEAC Network of Excellence. The Spanish Ministry of Economy and Competitiveness has also partially supported Jaume Abella under Ramon y Cajal postdoctoral fellowship (RYC-2013-14717).Peer ReviewedPostprint (author's final draft

    Microarchitectural Floorplanning for Thermal Management: A Technical Report

    Get PDF
    • …
    corecore