7 research outputs found

    HIPE: HMC Instruction Predication Extension Applied on Database Processing

    The recent Hybrid Memory Cube (HMC) is a smart memory that includes functional units inside the logic layer of its 3D-stacked memory design. To execute instructions inside the HMC, the processor sends them to be executed near the data, while most of the pipeline complexity stays inside the processor. Control-flow and data-flow dependencies are thus all managed inside the processor, in such a way that the HMC supports only update instructions. To resolve data-flow dependencies inside the memory, previous work proposed the HMC Instruction Vector Extensions (HIVE), which embed a large number of functional units with an interlock register bank. In this work we propose the HMC Instruction Predication Extension (HIPE), which supports predicated execution inside the memory in order to transform control-flow dependencies into data-flow dependencies. Our mechanism focuses on removing the high-latency interaction between the processor and the smart memory during the execution of branches that depend on data processed inside the memory. In this paper we first evaluate a balanced design of HIVE, comparing it to x86 and HMC executions. We then show the results of the HIPE mechanism when executing a database workload, which is a strong candidate for smart memories. We show interesting performance trade-offs when comparing our mechanism to previous work.

    Low level conditional move optimization

    Although high-level optimizations are becoming more and more sophisticated, the importance of low-level optimizations should not be underestimated. As the internal architecture of modern processors changes, some optimization techniques become more or less effective. Existing techniques need to be reconsidered from time to time, and new techniques targeting these modern architectures may emerge. Because the instruction pipelines of modern processors keep growing, recovering from branch mispredictions is becoming more expensive, so avoiding them is becoming more critical. In this paper we introduce a novel approach to branch elimination using conditional move operations, namely the CMOVcc instruction group. Inappropriate use of these instructions may cause a noticeable performance regression, but in many cases they outperform the sequence of a conditional jump and an unconditional move instruction. Our goal is to analyze the usage of CMOVcc in different contexts on modern processors and, based on the results, to propose a technique that automatically decides whether a conditional move or the sequence of a conditional jump and an unconditional move should be used in a given situation.
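The contrast between a conditional jump followed by a move and a conditional move can be sketched at the source level. The example below is only an illustration under stated assumptions: Python does not emit CMOVcc, the function names `select_branchy` and `select_branchless` are hypothetical, and the mask arithmetic merely models the data-flow semantics of a conditional move. It is not the technique proposed in the paper.

```python
def select_branchy(cond: bool, a: int, b: int) -> int:
    """Models a conditional jump plus an unconditional move."""
    result = b
    if cond:  # control-flow dependency: the move happens only on one path
        result = a
    return result


def select_branchless(cond: bool, a: int, b: int) -> int:
    """Models a conditional move: both operands are read, a mask picks one."""
    mask = -int(cond)  # True -> -1 (all bits set), False -> 0
    # Data-flow dependency only: no path depends on the condition.
    return (a & mask) | (b & ~mask)


if __name__ == "__main__":
    for cond in (True, False):
        print(cond, select_branchy(cond, 3, 7), select_branchless(cond, 3, 7))
```

Both functions return the same value for every input; the difference is that the second version always evaluates both operands, which is why a conditional move can lose to a well-predicted branch.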

    A near-data select scan operator for database systems

    Advisor: Eduardo Cunha de Almeida. Co-advisor: Marco Antonio Zanata Alves. Dissertation (master's), Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 21/12/2017. Includes references: p. 61-64.
    Abstract: A large burden of processing read-mostly databases consists of moving data around the memory hierarchy rather than processing data in the processor. Data movement is penalized by the performance gap between the processor and the memory, the well-known memory wall problem. The emergence of smart memories, such as the new Hybrid Memory Cube (HMC), helps mitigate the memory wall by executing instructions in logic chips integrated into a stack of DRAMs. These memories can not only store in-memory databases but also have the potential to compute database operations in memory. In this dissertation, we focus on near-data query processing to reduce data movement through the memory and cache hierarchy. We focus on the select scan database operator because scanning the columns to be filtered moves large amounts of data before other operations such as joins (i.e., push-down optimization). Initially, we evaluate the execution of the select scan using the HMC as an ordinary DRAM. Then, we introduce HMC-Scan, an extension to the HMC architecture and Instruction Set Architecture (ISA) that executes our near-data select scan operator inside the HMC logic chip. In particular, HMC-Scan resolves instruction dependencies internally. However, we observe that HMC-Scan requires a lot of interaction between the CPU and the memory to evaluate query filters. Therefore, as a second contribution, we propose another HMC ISA extension, HIPE-Scan, which reduces this interaction through predication. Predication supports the evaluation of predicates directly in memory without decisions from the CPU, giving a branch-less select scan that transforms control-flow dependencies into data-flow dependencies (i.e., predicated execution). We implemented the near-data select scan in the row-, column-, and vector-wise query execution strategies for the x86 architecture and for the two HMC extensions, HMC-Scan and HIPE-Scan. Our simulations show performance improvements of up to 3.7× for HMC-Scan and 5.6× for HIPE-Scan when executing Query 6 of the 1 GB TPC-H benchmark with the column-wise execution strategy.
    Keywords: In-Memory DBMS, Hybrid Memory Cube, Processing-in-Memory.
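The idea of a branch-less select scan can be illustrated in software. The sketch below is an analogy under stated assumptions, not the HIPE-Scan ISA extension: the function names, the range filter, and the sample column are hypothetical, and the Python loop itself still contains a loop branch. It only shows how the filter decision moves from control flow (an `if`) into data flow (an index increment).

```python
from typing import List


def select_scan_branchy(column: List[int], low: int, high: int) -> List[int]:
    """Select scan with a branch: row ids are emitted only on the taken path."""
    out = []
    for i, value in enumerate(column):
        if low <= value < high:  # control-flow dependency on the data
            out.append(i)
    return out


def select_scan_predicated(column: List[int], low: int, high: int) -> List[int]:
    """Select scan without a filter branch: always write, conditionally advance."""
    out = [0] * len(column)  # preallocated output buffer
    k = 0
    for i, value in enumerate(column):
        out[k] = i  # unconditional write; overwritten if the row fails
        k += (low <= value) & (value < high)  # predicate added as 0 or 1
    return out[:k]


if __name__ == "__main__":
    column = [5, 12, 7, 30, 9]
    print(select_scan_branchy(column, 6, 13))
    print(select_scan_predicated(column, 6, 13))
```

The predicated version does the same amount of work for every row regardless of the filter outcome, which trades extra writes for the absence of data-dependent branches.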

    Mitigating the Effect of Misspeculations in Superscalar Processors

    Modern superscalar processors rely heavily on speculative execution, which executes instructions speculatively and then verifies the results. If a prediction differs from the execution result, a misspeculation recovery is performed. Misspeculation recovery penalties still account for a substantial share of performance loss. This work focuses on techniques that mitigate the effect of recovery penalties and proposes practical mechanisms, which are thoroughly implemented and analyzed. In general, the misspeculation penalty can be divided into four parts: misspeculation detection delay, stale instruction elimination delay, state restoration delay, and pipeline fill delay. This dissertation does not consider the detection delay; instead, we design four innovative mechanisms. Some of these mechanisms target a specific recovery delay, whereas others target multiple types of delay in a unified algorithm. Mower addresses the stale instruction elimination delay and the state restoration delay by using a special walker. When a misprediction is detected, the walker scans and repairs the instructions that are younger than the mispredicted instruction. During the walk, the correct state is restored and the stale instructions are eliminated. Based on Mower, we further simplify the design and develop a Two-Phase recovery mechanism. This mechanism uses only a basic recovery mechanism, except when the retire stage is stalled by a long-latency instruction. When the retire stage is stalled, the second phase is launched and the instructions in the pipeline are re-fetched. The Two-Phase mechanism recovers from an earlier point in the program and overlaps the recovery penalty with the long-latency penalty. In reality, some of the instructions on the wrong path can be reused during recovery. However, such reuse of misprediction results is not easy and usually involves significant complexity. We design Passing Loop to reduce the pipeline fill delay. We apply this mechanism only to short forward branches, which eliminates a substantial amount of complexity. For memory dependence speculation and the delays caused by memory ordering violations, we develop a mechanism that optimizes store-queue-free architectures. A store-queue-free architecture experiences more memory dependence mispredictions because of its aggressive approach to speculation. A common solution is to delay the execution of an instruction that is likely to be mispredicted. We propose a mechanism, called Dynamic Memory Dependence Predication (DMDP), that dynamically inserts predicates comparing the addresses of memory instructions. This mechanism boosts instruction execution to its earliest point and reduces the number of mispredictions.
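The notion of a predicate that compares the addresses of memory instructions can be pictured with a small software analogy. This sketch rests on assumptions not stated in the abstract: the function name `load_with_predicate`, the list-based memory model, and the single in-flight store are all hypothetical, and the real mechanism is a microarchitectural one. It only shows how an address comparison can select between forwarded store data and the value read from memory, instead of guessing whether the load depends on the store.

```python
from typing import List


def load_with_predicate(memory: List[int], store_addr: int,
                        store_data: int, load_addr: int) -> int:
    """Value seen by a load that follows a not-yet-committed store.

    Rather than predicting whether the load depends on the store, both
    candidate values are obtained and an address-compare predicate picks one.
    """
    same_addr = -int(store_addr == load_addr)  # -1 if addresses match, else 0
    from_memory = memory[load_addr]            # value if there is no dependence
    # Predicated select: forwarded store data on a match, memory otherwise.
    return (store_data & same_addr) | (from_memory & ~same_addr)


if __name__ == "__main__":
    memory = [10, 20, 30]
    print(load_with_predicate(memory, 1, 99, 1))  # addresses match
    print(load_with_predicate(memory, 1, 99, 2))  # addresses differ
```

Because the outcome is a data-flow select rather than a prediction, there is nothing to mispredict in this model; the cost is that both candidate values must be available before the select.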

    Acta Cybernetica : Volume 21. Number 1.


    Dynamic Hammock Predication for Non-predicated Instruction Set Architectures

    Conventional speculative architectures use branch prediction to evaluate the most likely execution path during program execution. However, certain branches are difficult to predict. One solution to this problem is to evaluate both paths following such a conditional branch. Predicated execution can be used to implement this form of multi-path execution. Predicated architectures fetch and issue instructions that have associated predicates. These predicates indicate if the instruction should commit its result. Predicating a branch reduces the number of branches executed, eliminating the chance of branch misprediction at the cost of executing additional instructions. In this paper, we propose a restricted form of multi-path execution called Dynamic Predication for architectures with little or no support for predicated instructions in their instruction set. Dynamic predication dynamically predicates instruction sequences in the form of a branch hammock, concurrently executing both paths of..