x86 instruction reordering for code compression
Runtime executable code compression is a method that uses standard data compression methods and binary machine-code transformations to achieve a smaller file size while maintaining the ability to execute the compressed file as a regular executable. With a disassembler, an almost perfect instruction- and function-level disassembly can be generated. Using the structural information of the compiled machine code, each function can be split into so-called basic blocks. In this work we show that reordering instructions within basic blocks under data-flow constraints can improve code compression without changing the behavior of the code. We use two kinds of data access (read, write) and 20 data types, including registers: 8 basic x86 registers, 11 eflags, and memory data. Due to the complexity of the reordering, some simplification is required. Our solution is to search for a local optimum of the compression at the function level and then combine the results to obtain a suboptimal global result. With the reordering method, better results can be achieved: the compression size gain can be as high as 1.24% for gzip and 0.68% for lzma on the tested executables.
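As a rough sketch of the idea, dependency-constrained reordering can be modeled as follows. The instruction model, the byte encodings, and the brute-force search over permutations are illustrative assumptions, not the paper's implementation (which searches per-function local optima rather than enumerating all orders):

```python
import itertools
import zlib

class Insn:
    """Hypothetical instruction model: machine-code bytes plus the
    register/flag/memory locations the instruction reads and writes."""
    def __init__(self, code, reads, writes):
        self.code, self.reads, self.writes = code, set(reads), set(writes)

def conflict(a, b):
    """True if the two instructions touch a common location with at least
    one write (read-after-write, write-after-read, or write-after-write)."""
    return bool(a.writes & b.reads or a.reads & b.writes or a.writes & b.writes)

def valid(order, block):
    """A reordering is valid if every conflicting pair keeps its original
    relative order, so the data-flow behavior of the block is unchanged."""
    pos = {id(x): k for k, x in enumerate(order)}
    return all(
        pos[id(block[i])] < pos[id(block[j])]
        for i in range(len(block))
        for j in range(i + 1, len(block))
        if conflict(block[i], block[j])
    )

def best_order(block):
    """Brute-force search for the valid order whose bytes compress best."""
    candidates = [p for p in itertools.permutations(block) if valid(p, block)]
    return min(candidates,
               key=lambda p: len(zlib.compress(b"".join(i.code for i in p))))
```

A real tool would prune this search heavily, since the number of valid orders grows combinatorially with block size; the sketch only shows the constraint being optimized.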
Code optimizations for narrow bitwidth architectures
This thesis takes a HW/SW collaborative approach to tackle the problem of computational inefficiency in a holistic manner. The hardware is redesigned by restricting the datapath to merely 16-bit width (integer datapath only) to provide an extremely simple, low-cost, low-complexity execution core that is best at executing the most common case efficiently. This redesign, referred to as the Narrow Bitwidth Architecture, is unique in that although the datapath is squeezed to 16 bits, it continues to offer the higher memory addressability of contemporary wider-datapath architectures. Its interface to the outside (software) world is termed the Narrow ISA. The software is responsible for efficiently mapping the current stack of 64-bit applications onto the 16-bit hardware. However, this HW/SW approach introduces a non-negligible penalty in both dynamic code size and performance, even with a reasonably smart code translator that maps the 64-bit applications onto the 16-bit processor.
The goal of this thesis is to design a software layer that harnesses the power of compiler optimizations to mitigate the performance penalty of the Narrow ISA. More specifically, this thesis focuses on compiler optimizations targeting the problem of how to compile a 64-bit program to a 16-bit datapath machine from the perspective of Minimum Required Computations (MRC). Given a program, the notion of MRC aims to infer how much computation is really required to generate the same (correct) output as the original program.
Approaching perfect MRC is an intrinsically ambitious goal, as it requires oracle predictions of program behavior. Towards this end, the thesis proposes three heuristic-based optimizations to closely infer the MRC. The perspective of MRC unfolds into a definition of productiveness: if a computation does not alter the storage location, it is non-productive and hence need not be performed. In this research, the definition of productiveness has been applied at different granularities of the data flow as well as the control flow of the programs.
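The productiveness test described above can be illustrated with a toy classifier over a store trace. The (location, value) trace format is a hypothetical stand-in for the thesis's profiled data flow, used only to make the definition concrete:

```python
def run_and_classify(writes, memory):
    """Execute a sequence of (location, value) stores, counting how many are
    productive (they change the location) versus non-productive (the same
    value is already stored, so the write could be pruned)."""
    productive, silent = 0, 0
    for loc, value in writes:
        if memory.get(loc) == value:
            silent += 1          # storage location unchanged: non-productive
        else:
            memory[loc] = value
            productive += 1
    return productive, silent
```

For example, in the trace `[("x", 1), ("x", 1), ("x", 2)]` the second store is non-productive because `x` already holds 1.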
Three profile-based code optimization techniques have been proposed:
1. Global Productiveness Propagation (GPP), which applies the concept of productiveness at the granularity of a function.
2. Local Productiveness Pruning (LPP), which applies the same concept but at the much finer granularity of a single instruction.
3. Minimal Branch Computation (MBC), a profile-based code-reordering optimization technique that applies the principles of MRC to conditional branches.
The primary aim of all these techniques is to reduce the dynamic code footprint of the Narrow ISA. The first two optimizations (GPP and LPP) perform the task of speculatively pruning the non-productive (useless) computations using profiles. Further, these two techniques perform a backward traversal of the optimization regions to embed checks into the non-speculative slices, making them self-sufficient to detect mis-speculation dynamically.
The MBC optimization is a use case of the broader concept of a lazy computation model. The idea behind MBC is to reorder the backward slices containing narrow computations such that the minimal computations necessary to generate the same (correct) output are performed in the most frequent case; the rest of the computations are performed only when necessary.
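The lazy-evaluation idea behind MBC can be sketched with a toy branch. The conditions, the call counter, and the "cheap test decides the frequent case" split are invented for illustration and are not the thesis's actual transformation:

```python
calls = {"slice": 0}

def expensive_slice(x):
    """Stands in for the full backward slice feeding the branch condition."""
    calls["slice"] += 1
    return (x * x) % 7 == 0

def branch_eager(x):
    """Original shape: the whole slice is computed before the branch."""
    full = expensive_slice(x)
    return x > 0 and full

def branch_lazy(x):
    """MBC-style shape: a narrow test decides the branch in the common case;
    the expensive slice is evaluated only when that test is inconclusive."""
    if x <= 0:   # frequent case in this toy profile: resolved cheaply
        return False
    return expensive_slice(x)
```

Over a profile where `x <= 0` dominates, the lazy version produces the same branch outcomes while evaluating the expensive slice far less often.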
With the proposed optimizations, it can be concluded that there are indeed ways to smartly compile a 64-bit application to a 16-bit ISA such that the overheads are considerably reduced.
Second-generation PLINK: rising to the challenge of larger and richer datasets
PLINK 1 is a widely used open-source C/C++ toolset for genome-wide
association studies (GWAS) and research in population genetics. However, the
steady accumulation of data from imputation and whole-genome sequencing studies
has exposed a strong need for even faster and more scalable implementations of
key functions. In addition, GWAS and population-genetic data now frequently
contain probabilistic calls, phase information, and/or multiallelic variants,
none of which can be represented by PLINK 1's primary data format.
To address these issues, we are developing a second-generation codebase for
PLINK. The first major release from this codebase, PLINK 1.9, introduces
extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space
Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic
improvements. In combination, these changes accelerate most operations by 1-4
orders of magnitude, and allow the program to handle datasets too large to fit
in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data
format capable of efficiently representing probabilities, phase, and
multiallelic variants, and (b) extensions of many functions to account for the
new types of information.
The second-generation versions of PLINK will offer dramatic improvements in
performance and compatibility. For the first time, users without access to
high-end computing resources can perform several essential analyses of the
feature-rich and very large genetic datasets coming into use.
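The kind of bit-level parallelism described above can be illustrated with packed 2-bit genotypes. The specific encoding below (treating `0b11` as the genotype class being counted) is an assumption for illustration, not PLINK's actual .bed layout:

```python
def count_code3(packed, n_genotypes):
    """Count the 2-bit fields equal to 0b11 in a packed genotype word using a
    few bitwise operations plus a single popcount, instead of a loop over
    individual genotypes. Applied over whole machine words, word-wide tricks
    like this are one ingredient of the reported speedups."""
    mask = int("01" * n_genotypes, 2)   # low bit of every 2-bit field
    low = packed & mask                 # low bits of each field
    high = (packed >> 1) & mask         # high bits of each field
    both = low & high                   # 1 exactly where a field is 0b11
    return bin(both).count("1")         # popcount

def pack(genotypes):
    """Pack a list of 2-bit genotype codes into one integer, field i at bit 2*i."""
    word = 0
    for i, g in enumerate(genotypes):
        word |= (g & 0b11) << (2 * i)
    return word
```

For instance, `count_code3(pack([3, 0, 2, 3, 1]), 5)` finds the two fields holding `0b11` without ever unpacking the word.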
Boosting Multi-Core Reachability Performance with Shared Hash Tables
This paper focuses on data structures for multi-core reachability, which is a
key component in model checking algorithms and other verification methods. A
cornerstone of an efficient solution is the storage of visited states. In
related work, static partitioning of the state space was combined with
thread-local storage and resulted in reasonable speedups, but left open whether
improvements are possible. In this paper, we present a scaling solution for
shared state storage which is based on a lockless hash table implementation.
The solution is specifically designed for the cache architecture of modern
CPUs. Because model checking algorithms impose loose requirements on the hash
table operations, their design can be streamlined substantially compared to
related work on lockless hash tables. Still, an implementation of the hash
table presented here has dozens of sensitive performance parameters (bucket
size, cache line size, data layout, probing sequence, etc.). We analyzed their
impact and compared the resulting speedups with related tools. Our
implementation outperforms two state-of-the-art multi-core model checkers (SPIN
and DiVinE) by a substantial margin, while placing fewer constraints on the
load balancing and search algorithms.
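A sequential sketch can convey the probing design of such a shared visited-state store. The paper's table is written in C and replaces the plain bucket write below with a compare-and-swap (probing within a cache line before rehashing), so this Python version is illustrative only and is not thread-safe; state hashes are assumed nonzero so 0 can mark an empty bucket:

```python
EMPTY = 0

class VisitedSet:
    """Open-addressing store supporting the one operation reachability needs:
    find-or-put. In the lockless version each bucket write would be a CAS."""

    def __init__(self, size=1 << 10):
        self.size = size
        self.buckets = [EMPTY] * size

    def find_or_put(self, state_hash):
        """Return True if the state was already present; otherwise insert it
        and return False."""
        i = state_hash % self.size
        for _ in range(self.size):
            if self.buckets[i] == EMPTY:
                self.buckets[i] = state_hash  # real version: CAS, retry on failure
                return False
            if self.buckets[i] == state_hash:
                return True
            i = (i + 1) % self.size           # linear probe to the next slot
        raise MemoryError("hash table full")
```

Collapsing lookup and insertion into one find-or-put is what lets the requirements on the table be so loose: no deletion, no resizing, and no ordering guarantees beyond "each state is stored exactly once."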