
    Hardware-Aware Algorithm Designs for Efficient Parallel and Distributed Processing

    The introduction and widespread adoption of the Internet of Things, together with emerging new industrial applications, bring new requirements in data processing. Specifically, the need for timely processing of data that arrives at high rates creates a challenge for the traditional cloud computing paradigm, where data collected at various sources is sent to the cloud for processing. As an approach to this challenge, processing algorithms and infrastructure are distributed from the cloud to multiple tiers of computing, closer to the sources of data. This creates a wide range of devices for algorithms to be deployed on and software designs to adapt to.

    In this thesis, we investigate how hardware-aware algorithm designs on a variety of platforms lead to algorithm implementations that efficiently utilize the underlying resources. We design, implement and evaluate new techniques for representative applications that involve the whole spectrum of devices, from resource-constrained sensors in the field to highly parallel servers. At each tier of processing capability, we identify key architectural features that are relevant for applications and propose designs that make use of these features to achieve high-rate, timely and energy-efficient processing.

    In the first part of the thesis, we focus on high-end servers and utilize two main approaches to achieve high-throughput processing: vectorization and thread parallelism. We employ vectorization for pattern matching algorithms used in security applications. We show that re-thinking the design of algorithms to better utilize the resources available in the platforms they are deployed on, such as vector processing units, can bring significant speedups in processing throughput. We then show how thread-aware data distribution and proper inter-thread synchronization enable scalability, especially for the problem of high-rate network traffic monitoring. We design a parallelization scheme for sketch-based algorithms that summarize traffic information, which allows them to handle incoming data at high rates and to answer queries on that data efficiently, without overheads.

    In the second part of the thesis, we target the intermediate tier of computing devices and focus on typical examples of the hardware found there. We show how single-board computers with embedded accelerators can be used to handle the computationally heavy part of applications, and showcase this specifically for pattern matching in security-related processing. We further identify key hardware features that affect the performance of pattern matching algorithms on such devices, present a co-evaluation framework to compare algorithms, and design a new algorithm that efficiently utilizes the hardware features.

    In the last part of the thesis, we shift the focus to the low-power, resource-constrained tier of processing devices. We target wireless sensor networks and study distributed data processing algorithms where the processing happens on the same devices that generate the data. Specifically, we focus on a continuous monitoring algorithm (geometric monitoring) that aims to minimize communication between nodes. By deploying that algorithm in realistic environments, we demonstrate that the interplay between the network protocol and the application plays an important role in this tier of devices. Based on that observation, we co-design a continuous monitoring application with a modern network stack and augment it further with an in-network aggregation technique. In this way, we show that awareness of the underlying network stack is important to realize the full potential of the continuous monitoring algorithm.

    The techniques and solutions presented in this thesis contribute to better utilization of hardware characteristics across a wide spectrum of platforms. We apply these techniques to problems that are representative of current and upcoming applications, and contribute an outlook on emerging possibilities that can build on the results of the thesis.
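    As one concrete illustration of the thread-aware sketch parallelization mentioned in the first part above (an illustration only, not the scheme designed in the thesis; all names and parameters are hypothetical), the following C++ sketch gives each thread a private count-min sub-sketch so that the update path needs no locking, and answers point queries by summing the per-thread estimates.

        // Illustrative per-thread count-min sketch; not the thesis's actual design.
        #include <algorithm>
        #include <atomic>
        #include <cstddef>
        #include <cstdint>
        #include <functional>
        #include <vector>

        class PerThreadCountMin {
        public:
            PerThreadCountMin(std::size_t threads, std::size_t rows, std::size_t cols)
                : rows_(rows), cols_(cols), counters_(threads * rows * cols) {}

            // Called only by thread `tid`, so no two threads write the same counter.
            void update(std::size_t tid, std::uint64_t key, std::uint64_t count) {
                for (std::size_t r = 0; r < rows_; ++r)
                    cell(tid, r, index(key, r)).fetch_add(count, std::memory_order_relaxed);
            }

            // Point query: take each sub-sketch's count-min estimate and add them up.
            std::uint64_t estimate(std::size_t threads, std::uint64_t key) const {
                std::uint64_t total = 0;
                for (std::size_t t = 0; t < threads; ++t) {
                    std::uint64_t best = UINT64_MAX;
                    for (std::size_t r = 0; r < rows_; ++r)
                        best = std::min(best, cell(t, r, index(key, r)).load(std::memory_order_relaxed));
                    total += best;
                }
                return total;
            }

        private:
            std::size_t index(std::uint64_t key, std::size_t row) const {
                return std::hash<std::uint64_t>{}(key ^ (row * 0x9e3779b97f4a7c15ULL)) % cols_;
            }
            std::atomic<std::uint64_t>& cell(std::size_t t, std::size_t r, std::size_t c) {
                return counters_[(t * rows_ + r) * cols_ + c];
            }
            const std::atomic<std::uint64_t>& cell(std::size_t t, std::size_t r, std::size_t c) const {
                return counters_[(t * rows_ + r) * cols_ + c];
            }

            std::size_t rows_, cols_;
            std::vector<std::atomic<std::uint64_t>> counters_;
        };

    Keeping every update thread-local is one way to avoid inter-thread synchronization on the hot path; the thesis's actual scheme for sketch-based traffic summaries may distribute data and serve queries differently.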

    Using static program analysis to compile fast cache simulators

    This thesis presents a generic approach to compiling fast execution-driven simulators and applies it to cache simulation of programs. The resulting cache simulation method reduces the time needed for cache performance evaluations without losing the accuracy of the results. Fast cache simulators are needed in the performance analysis of software systems. To properly understand the cache behavior caused by a program, simulations must be performed with a sufficient number of inputs. Traditional simulation of the memory operations of a program can be orders of magnitude slower than the execution of the program, leading to simulation times that are often infeasible in software development. The approach of this thesis is based on using static cache analysis to guide partial evaluation and slicing of simulators. Because of redundancy in the memory access patterns of typical programs, an execution-driven cache simulator program can be partially evaluated during its compilation, and program slicing can be used to remove the computations that have no effect on the simulation result. The static cache analysis presented in this thesis is generic and is designed especially for programs that use dynamic addressing. The thesis assumes an address analysis that gives the cache analysis static information about cache aliases and cache conflicts between accessed memory lines. To determine the memory references that always cause cache hits or cache misses, the thesis describes both must and may analyses of cache states. The cache state analysis is built using abstract interpretation, and its soundness is proved on that basis. The potential performance of the method was evaluated experimentally. The thesis describes both a tool set implementing the cache analysis method and experiments done with that tool set. The experiments indicate that a simple implementation is capable of significantly speeding up the simulations.
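    As a concrete illustration of the must analysis described above (a textbook abstract-interpretation formulation, not the thesis's exact implementation; the names are hypothetical), the following C++ sketch keeps, for one fully associative LRU set, an upper bound on each memory line's age; a reference whose line is bounded below the associativity is an always-hit and can be partially evaluated away in the compiled simulator.

        // Illustrative "must" abstract cache state for a fully associative LRU set.
        #include <algorithm>
        #include <cstdint>
        #include <unordered_map>

        struct MustState {
            unsigned associativity;
            std::unordered_map<std::uint64_t, unsigned> age;  // line -> max possible age (0 = MRU)

            // A line whose age bound is below the associativity is guaranteed cached.
            bool always_hit(std::uint64_t line) const {
                auto it = age.find(line);
                return it != age.end() && it->second < associativity;
            }

            // Transfer function for an access: the line becomes MRU and every line
            // that may have been younger ages by one; untrackable lines are dropped.
            void access(std::uint64_t line) {
                auto hit = age.find(line);
                unsigned old_age = (hit != age.end()) ? hit->second : associativity;
                for (auto& kv : age)
                    if (kv.second < old_age) ++kv.second;
                age[line] = 0;
                for (auto it = age.begin(); it != age.end();) {
                    if (it->second >= associativity) it = age.erase(it);
                    else ++it;
                }
            }

            // Join at control-flow merges: keep lines present on both paths with the
            // more pessimistic (larger) age bound.
            static MustState join(const MustState& a, const MustState& b) {
                MustState out{a.associativity, {}};
                for (const auto& [line, age_a] : a.age) {
                    auto it = b.age.find(line);
                    if (it != b.age.end())
                        out.age[line] = std::max(age_a, it->second);
                }
                return out;
            }
        };

    References that neither the must nor the may analysis can classify would remain in the compiled simulator as ordinary simulated accesses.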

    A Survey of Techniques for Architecting TLBs

    A translation lookaside buffer (TLB) caches virtual-to-physical address translation information and is used in systems ranging from embedded devices to high-end servers. Since the TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of the TLB is important for improving the performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects and system engineers.
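    As a small illustration of the structure being surveyed (hypothetical parameters and interface, not taken from any particular design in the paper), the following C++ sketch models a set-associative TLB with per-set LRU replacement: a hit returns the translated physical address, while a miss would force a costly page-table walk, which is why careful TLB management pays off.

        // Illustrative set-associative TLB model with per-set LRU replacement.
        #include <cstddef>
        #include <cstdint>
        #include <list>
        #include <optional>
        #include <vector>

        class Tlb {
        public:
            Tlb(std::size_t sets, std::size_t ways, unsigned page_bits)
                : sets_(sets), ways_(ways), page_bits_(page_bits), table_(sets) {}

            // Returns the physical address on a hit; std::nullopt signals a TLB miss.
            std::optional<std::uint64_t> translate(std::uint64_t vaddr) {
                const std::uint64_t vpn = vaddr >> page_bits_;
                const std::uint64_t offset = vaddr & ((std::uint64_t{1} << page_bits_) - 1);
                auto& set = table_[vpn % sets_];
                for (auto it = set.begin(); it != set.end(); ++it) {
                    if (it->vpn == vpn) {
                        set.splice(set.begin(), set, it);   // move entry to MRU position
                        return (set.front().pfn << page_bits_) | offset;
                    }
                }
                return std::nullopt;
            }

            // Install a translation after a page-table walk, evicting the LRU way if full.
            void insert(std::uint64_t vpn, std::uint64_t pfn) {
                auto& set = table_[vpn % sets_];
                if (set.size() == ways_)
                    set.pop_back();
                set.push_front(Entry{vpn, pfn});
            }

        private:
            struct Entry { std::uint64_t vpn, pfn; };
            std::size_t sets_, ways_;
            unsigned page_bits_;
            std::vector<std::list<Entry>> table_;           // one LRU-ordered list per set
        };

    Real TLBs also track permission bits, address-space identifiers and multiple page sizes, all of which this simple model omits.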

    Wrong-Path-Aware Entangling Instruction Prefetcher

    © 2023 IEEE. This document is made available under the CC-BY 4.0 license (http://creativecommons.org/licenses/by/4.0/). It is the accepted version of a work published in IEEE Transactions on Computers; the final edited and published version is available at DOI 10.1109/TC.2023.3337308.

    Instruction prefetching is instrumental for guaranteeing a high flow of instructions through the processor front end for applications whose working set does not fit in the lower-level caches. Examples of such applications are server workloads, whose instruction footprints are constantly growing. There are two main techniques to mitigate this problem: fetch-directed prefetching (or a decoupled front end) and instruction cache (L1I) prefetching. This work extends the state-of-the-art Entangling prefetcher to avoid training during wrong-path execution. Our Entangling wrong-path-aware prefetcher is equipped with microarchitectural techniques that eliminate more than 99% of wrong-path pollution, thus reaching 98.9% of the performance of an ideal wrong-path-aware solution. Next, we propose two microarchitectural optimizations that further increase the performance benefits by 1.8%, on average. All this is achieved with just 304 bytes of storage. Finally, we study the interplay between the L1I prefetcher and a decoupled front end. Our analysis shows that, due to pollution caused by wrong-path instructions, the degree of decoupling cannot be increased without limit before it harms the energy-delay product (EDP). Furthermore, the closer the L1I prefetcher is to ideal, the less decoupling is required. For example, our Entangling prefetcher reaches an optimal EDP with a decoupling degree of 64 instructions.
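    To illustrate the general idea of keeping wrong-path instructions out of a prefetcher's training tables (a deliberately simplified sketch, not the mechanism proposed in the paper; all interfaces are hypothetical), the following C++ filter buffers L1I training events at fetch time and only forwards them to the prefetcher once the corresponding instructions are known to be on the correct path.

        // Illustrative wrong-path training filter; not the paper's microarchitecture.
        #include <cstdint>
        #include <deque>
        #include <functional>
        #include <utility>

        class WrongPathTrainingFilter {
        public:
            explicit WrongPathTrainingFilter(std::function<void(std::uint64_t)> train)
                : train_(std::move(train)) {}

            // Fetch stage: remember the L1I access, tagged with a global sequence number.
            void on_fetch(std::uint64_t seq, std::uint64_t block_addr) {
                pending_.push_back(Event{seq, block_addr});
            }

            // A branch resolved as mispredicted: everything fetched after it was on the
            // wrong path and must never reach the prefetcher's tables.
            void on_squash(std::uint64_t branch_seq) {
                while (!pending_.empty() && pending_.back().seq > branch_seq)
                    pending_.pop_back();
            }

            // Retire stage: instructions up to `seq` are correct-path, so their buffered
            // training events can now safely update the prefetcher.
            void on_retire(std::uint64_t seq) {
                while (!pending_.empty() && pending_.front().seq <= seq) {
                    train_(pending_.front().addr);
                    pending_.pop_front();
                }
            }

        private:
            struct Event { std::uint64_t seq, addr; };
            std::deque<Event> pending_;
            std::function<void(std::uint64_t)> train_;
        };

    The actual design in the paper balances hardware cost (the 304 bytes quoted above) against how much wrong-path pollution is removed; a naive buffer like this one ignores that constraint entirely.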

    Assessing the security of hardware-assisted isolation techniques


    Improving the efficiency of the Energy-Split tool to compute the energy of very large molecular systems

    Master's dissertation in Informatics Engineering (Mestrado Integrado em Engenharia Informática).

    The Energy-Split tool receives as input pieces of a very large molecular system and computes all intra- and inter-molecular energies, separately calculating the energies of each fragment and then the total energy of the molecule. It takes into account the connectivity information among atoms in a molecule to compute (i) the energy of all terms involving covalently bonded atoms, namely bonds, angles, dihedral angles, and improper angles, and (ii) the Coulomb and Van der Waals energies, which are independent of the atoms' connections and have to be computed for every atom in the system. For each atom, Energy-Split computes the interaction energy with all other atoms in the system, considering the partitioning of the molecule into fragments performed in the open-source program Visual Molecular Dynamics. The operations required to obtain the total energy of a large molecule are computationally intensive, which calls for an efficient high-performance computing approach to obtain results in an acceptable time.

    The original Energy-Split Tcl code was thoroughly analyzed and ported to a parallel and more efficient C++ version. New data structures were defined with data locality in mind, to take advantage of the advanced features present in current laptop and server systems: the vector extensions of the scalar processors, an efficient on-chip memory hierarchy, and the inherent parallelism of multicore devices. A parallel version was then developed on top of the improved sequential variant, using auxiliary libraries. Both implementations were tested on different multicore devices and optimized to take full advantage of high-performance computing features. Significant results were obtained by applying professional performance engineering approaches, namely (i) identifying data values that can be represented as Boolean variables (such as variables used in auxiliary data structures of the traversal algorithm that computes the Euclidean distance between atoms), leading to significant performance improvements due to the reduced memory bottleneck (over 10 times faster), and (ii) using an adequate compressed format (CSR) to represent and operate on sparse matrices (namely the matrices of Euclidean distances between atom pairs, since all distances beyond the user-defined cut-off distance are treated as zero, and these are the majority of the values). After the first code optimizations, the performance of the sequential version improved by around 100 times compared to the original version on a dual-socket server. The parallel version improved by up to 24 times, depending on the molecules tested, on the same server. Overall, the Energy-Split code is highly scalable, obtaining better results with larger molecule files, although the arrangement of the atoms also influences the algorithm's performance.

    This work was supported by FCT (Fundação para a Ciência e Tecnologia) within project RDB-TS: Uma base de dados de reações químicas baseadas em informação de estados de transição derivados de cálculos quânticos (Refª BI2-2019_NORTE-01-0145-FEDER-031689_UMINHO), co-funded by the North Portugal Regional Operational Programme through the European Regional Development Fund.
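    A minimal sketch of optimization (ii) above, assuming hypothetical names and a plain O(n²) pair loop: only atom pairs closer than the user-defined cut-off are stored, in CSR form, since all other distances are treated as zero and make up the majority of the matrix.

        // Illustrative cut-off filtering of pairwise distances into CSR storage.
        #include <cmath>
        #include <cstddef>
        #include <vector>

        struct Atom { double x, y, z; };

        struct CsrDistances {
            std::vector<std::size_t> row_ptr;   // row i spans [row_ptr[i], row_ptr[i+1])
            std::vector<std::size_t> col_idx;   // neighbour atom indices
            std::vector<double>      dist;      // distances within the cut-off
        };

        CsrDistances build_csr(const std::vector<Atom>& atoms, double cutoff) {
            CsrDistances csr;
            csr.row_ptr.push_back(0);
            const double cutoff2 = cutoff * cutoff;
            for (std::size_t i = 0; i < atoms.size(); ++i) {
                for (std::size_t j = 0; j < atoms.size(); ++j) {
                    if (i == j) continue;
                    const double dx = atoms[i].x - atoms[j].x;
                    const double dy = atoms[i].y - atoms[j].y;
                    const double dz = atoms[i].z - atoms[j].z;
                    const double d2 = dx * dx + dy * dy + dz * dz;
                    if (d2 < cutoff2) {          // keep only pairs inside the cut-off sphere
                        csr.col_idx.push_back(j);
                        csr.dist.push_back(std::sqrt(d2));
                    }
                }
                csr.row_ptr.push_back(csr.col_idx.size());
            }
            return csr;
        }

    The non-bonded energy terms then iterate only over the stored neighbours of each atom instead of a dense n-by-n distance matrix.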

    Regular Expression Synthesis for BLAST Two-Hit Filtering

    Genomic databases are exhibiting a growth rate that is outpacing Moore's Law, which has made database search algorithms a popular application for use on emerging processor technologies. NCBI BLAST is the standard tool for performing searches against these databases; it operates by transforming each database query into a filter that is subsequently applied to the database. This requires a database scan for every query, fundamentally limiting its performance by I/O bandwidth. In this dissertation we present a functionally equivalent variation on the NCBI BLAST algorithm that maps more suitably to an FPGA implementation. This variation of the algorithm attempts to reduce the I/O requirement by leveraging FPGA-specific capabilities, such as high pattern matching throughput and explicit on-chip memory structure and allocation. Our algorithm transforms the database, not the query, into a filter that is stored as a hierarchical arrangement of three tables, the first two of which are stored on-chip and the third off-chip. Our results show that it is possible to achieve speedups of up to 8x based on the relative reduction in I/O of our approach versus that of NCBI BLAST, with a minimal impact on sensitivity. More importantly, the performance relative to NCBI BLAST improves with larger databases and query workload sizes.
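    For reference, the two-hit criterion at the heart of this style of filtering (shown here as a plain software sketch with hypothetical names, not the dissertation's FPGA table hierarchy) reports a candidate for extension only when two word hits land on the same diagonal within a bounded window; overlap and word-length handling are omitted for brevity.

        // Illustrative BLAST-style two-hit filter over word hits sorted by database position.
        #include <cstdint>
        #include <unordered_map>
        #include <vector>

        struct Hit { std::int64_t query_pos, db_pos; };

        std::vector<Hit> two_hit_filter(const std::vector<Hit>& hits, std::int64_t max_gap) {
            std::unordered_map<std::int64_t, std::int64_t> last_on_diag;  // diagonal -> last db_pos
            std::vector<Hit> candidates;
            for (const Hit& h : hits) {
                const std::int64_t diag = h.db_pos - h.query_pos;
                auto it = last_on_diag.find(diag);
                if (it != last_on_diag.end()) {
                    const std::int64_t gap = h.db_pos - it->second;
                    if (gap > 0 && gap <= max_gap)
                        candidates.push_back(h);          // second hit on this diagonal: extend here
                }
                last_on_diag[diag] = h.db_pos;
            }
            return candidates;
        }

    The dissertation's contribution lies in how equivalent filtering state is precomputed from the database and laid out across on-chip and off-chip tables, rather than in the criterion itself.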