76 research outputs found
Application-Specific Memory Subsystems
The disparity in performance between processors and main memories has
led computer architects to incorporate large cache hierarchies in
modern computers. These cache hierarchies are designed to be
general-purpose in that they strive to provide the best possible
performance across a wide range of applications. However, such a memory
subsystem does not necessarily provide the best possible performance for
a particular application.
Although general-purpose memory subsystems are desirable when the
work-load is unknown and the memory subsystem must remain fixed,
when this is not the case a custom memory subsystem may be beneficial.
For example, in an application-specific integrated circuit (ASIC) or
a field-programmable gate array (FPGA) designed to run a particular
application, a custom memory subsystem optimized for that application
would be desirable. In addition, when there are tunable
parameters in the memory subsystem, it may make sense to change these
parameters depending on the application being run. Such a situation
arises today with FPGAs and, to a lesser extent, GPUs, and it is
plausible that general-purpose computers will begin to support
greater flexibility in the memory subsystem in the future.
In this dissertation, we first show that it is possible to create
application-specific memory subsystems that provide much better
performance than a general-purpose memory subsystem. In addition,
we show a way to discover such memory subsystems automatically using
a superoptimization technique on memory address traces gathered
from applications. This allows one to generate a custom memory subsystem
with little effort.
We next show that our memory subsystem superoptimization technique can
be used to optimize for objectives other than performance. As an example,
we show that it is possible to reduce the number of writes to the main
memory, which can be useful for main memories with limited write
durability, such as flash or Phase-Change Memory (PCM).
Finally, we show how to superoptimize memory subsystems for streaming
applications, which are a class of parallel applications. In particular, we
show that, through the use of ScalaPipe, we can author and deploy streaming
applications targeting FPGAs with superoptimized memory subsystems.
ScalaPipe is a domain-specific language (DSL) embedded in the Scala
programming language for generating streaming applications that can be
implemented on CPUs and FPGAs. Using the ScalaPipe implementation, we
are able to demonstrate actual performance improvements using the
superoptimized memory subsystem with applications implemented in hardware
Unbounded Superoptimization
Our aim is to enable software to take full advantage of the capabilities of emerging microprocessor designs without modifying the compiler. Towards this end, we propose a new approach to code generation and optimization. Our approach uses an SMT solver in a novel way to generate efficient code for modern architectures and guarantee that the generated code correctly implements the source code. The distinguishing characteristic of our approach is that the size of the constraints does not depend on the candidate sequence of instructions. To study the feasibility of our approach, we implemented a preliminary prototype, which takes as input LLVM IR code and uses Z3 SMT solver to generate ARMv7-A assembly. The prototype handles arbitrary loop-free code (not only basic blocks) as input and output. We applied it to small but tricky examples used as standard benchmarks for other superoptimization and synthesis tools. We are encouraged to see that Z3 successfully solved complex constraints that arise from our approach. This work paves the way to employing recent advances in SMT solvers and has a potential to advance SMT solvers further by providing a new category of challenging benchmarks that come from an industrial application domain
Maximally robust controllers for multivariable systems
Published versio
Minotaur: A SIMD-Oriented Synthesizing Superoptimizer
Minotaur is a superoptimizer for LLVM's intermediate representation that
focuses on integer SIMD instructions, both portable and specific to x86-64. We
created it to attack problems in finding missing peephole optimizations for
SIMD instructions-this is challenging because there are many such instructions
and they can be semantically complex. Minotaur runs a hybrid synthesis
algorithm where instructions are enumerated concretely, but literal constants
are generated by the solver. We use Alive2 as a verification engine; to do this
we modified it to support synthesis and also to support a large subset of
Intel's vector instruction sets (SSE, AVX, AVX2, and AVX-512). Minotaur finds
many profitable optimizations that are missing from LLVM. It achieves limited
speedups on the integer parts of SPEC CPU2017, around 1.3%, and it speeds up
the test suite for the libYUV library by 2.2%, on average, and by 1.64x
maximum, when targeting an Intel Cascade Lake processor
- …