Type-driven automated program transformations and cost modelling for optimising streaming programs on FPGAs
In this paper we present a novel approach to program optimisation that combines compiler-based, type-driven program transformations with a fast and accurate cost/performance model for the target architecture. We target streaming programs for the problem domain of scientific computing, such as numerical weather prediction. We present our theoretical framework for type-driven program transformation, our target high-level language and intermediate representation languages, and the cost model, and we demonstrate the effectiveness of our approach by comparison with a commercial toolchain.
On the efficiency of reductions in µ-SIMD media extensions
Many important multimedia applications contain a significant fraction of reduction operations. Although multimedia applications are generally characterized by high amounts of Data Level Parallelism, reductions and accumulations are difficult to parallelize and show poor tolerance to increases in instruction latency. This is especially significant for µ-SIMD extensions such as MMX or AltiVec. To overcome the problem of reductions in µ-SIMD ISAs, designers tend to include increasingly complex instructions that handle the most common forms of reductions in multimedia. As the number of processor pipeline stages grows, the number of cycles needed to execute these multimedia instructions increases with every processor generation, severely compromising performance. The paper presents an in-depth discussion of how reductions/accumulations are performed in current µ-SIMD architectures and evaluates the performance trade-offs for near-future highly aggressive superscalar processors with three different styles of µ-SIMD extensions. We compare an MMX-like alternative to an MDMX-like extension that has packed accumulators to attack the reduction problem, and we also compare it to MOM, a matrix register ISA. We show that while packed accumulators present several advantages, they introduce artificial recurrences that severely degrade performance for processors with a high number of registers and long-latency operations. On the other hand, the paper demonstrates that longer SIMD media extensions such as MOM can take great advantage of accumulators by exploiting the associative parallelism implicit in reductions. Peer reviewed. Postprint (published version).
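The dependency structure the abstract describes can be illustrated with a minimal sketch. It is not taken from the paper: the function names and the `lanes` parameter are hypothetical, and plain Python executes everything sequentially, so the code only shows how a single accumulator forms a recurrence while several independent partial results can be combined afterwards when the operator is associative and commutative.

```python
from operator import add


def serial_reduce(values, op=add):
    """Single accumulator: every step depends on the previous one (a recurrence)."""
    if not values:
        raise ValueError("cannot reduce an empty sequence")
    acc = values[0]
    for v in values[1:]:
        acc = op(acc, v)
    return acc


def multi_accumulator_reduce(values, lanes=4, op=add):
    """Keep up to `lanes` independent partial results, then combine them.

    Hypothetical illustration: the partial results do not depend on each other,
    so hardware could update them concurrently. Elements are reordered, so the
    result matches serial_reduce only if `op` is associative and commutative.
    """
    if not values:
        raise ValueError("cannot reduce an empty sequence")
    partial = list(values[:lanes])
    for i in range(lanes, len(values)):
        lane = i % lanes
        partial[lane] = op(partial[lane], values[i])
    # Combine the partial results pairwise, as in a reduction tree.
    while len(partial) > 1:
        combined = [op(partial[j], partial[j + 1])
                    for j in range(0, len(partial) - 1, 2)]
        if len(partial) % 2:
            combined.append(partial[-1])
        partial = combined
    return partial[0]
```

With integer addition both functions return the same value; with floating-point addition the results may differ slightly, because rounding makes the operation only approximately associative.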
Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks
Predicting the number of clock cycles a processor takes to execute a block of
assembly instructions in steady state (the throughput) is important for both
compiler designers and performance engineers. Building an analytical model to
do so is especially complicated in modern x86-64 Complex Instruction Set
Computer (CISC) machines with sophisticated processor microarchitectures in
that it is tedious, error prone, and must be performed from scratch for each
processor generation. In this paper we present Ithemal, the first tool which
learns to predict the throughput of a set of instructions. Ithemal uses a
hierarchical LSTM-based approach to predict throughput based on the opcodes
and operands of instructions in a basic block. We show that Ithemal is more
accurate than state-of-the-art hand-written tools currently used in compiler
backends and static machine code analyzers. In particular, our model has less
than half the error of state-of-the-art analytical models (LLVM's llvm-mca and
Intel's IACA). Ithemal is also able to predict these throughput values just as
fast as the aforementioned tools, and is easily ported across a variety of
processor microarchitectures with minimal developer effort. Comment: Published at 36th International Conference on Machine Learning (ICML)
201
Low-Complexity Sub-band Digital Predistortion for Spurious Emission Suppression in Noncontiguous Spectrum Access
Noncontiguous transmission schemes combined with high power-efficiency
requirements pose big challenges for radio transmitter and power amplifier (PA)
design and implementation. Due to the nonlinear nature of the PA, severe
unwanted emissions can occur, which can potentially interfere with neighboring
channel signals or even desensitize the device's own receiver in frequency division
duplexing (FDD) transceivers. In this article, to suppress such unwanted
emissions, a low-complexity sub-band digital predistortion (DPD) solution, specifically tailored for
spectrally noncontiguous transmission schemes in low-cost devices, is proposed.
The proposed technique aims at mitigating only the selected spurious
intermodulation distortion components at the PA output, hence allowing for
substantially reduced processing complexity compared to classical linearization
solutions. Furthermore, novel decorrelation-based parameter learning solutions
are also proposed and formulated, which offer reduced computing complexity in
parameter estimation as well as the ability to track time-varying features
adaptively. Comprehensive simulation and RF measurement results are provided,
using a commercial LTE-Advanced mobile PA, to evaluate and validate the
effectiveness of the proposed solution in real-world scenarios. The obtained
results demonstrate that highly efficient spurious component suppression can be
obtained using the proposed solutions.
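As a rough illustration of what decorrelation-based learning for a single spurious component might look like, the toy sketch below adapts one complex coefficient until the observed third-order intermodulation (IM3) residual is uncorrelated with its basis function. Everything here is an assumption made for illustration and not the authors' algorithm: the memoryless scalar PA model, the names `c_im3`, `gain`, and `mu`, the basis `x1**2 * conj(x2)`, and the normalized update step.

```python
import random


def learn_im3_coefficient(c_im3, gain, n_samples=2000, mu=0.1, seed=0):
    """Toy decorrelation-style update for one IM3 sub-band coefficient.

    Assumed model (hypothetical): the PA produces c_im3 * u at the IM3
    sub-band, and a DPD injection alpha * u passes through a linear gain,
    so the observed IM3 residual is (c_im3 + gain * alpha) * u.
    """
    rng = random.Random(seed)
    alpha = 0j
    eps = 1e-9  # avoids division by zero when the basis sample is tiny
    for _ in range(n_samples):
        # Two component carriers, modelled as complex Gaussian samples.
        x1 = complex(rng.gauss(0, 1), rng.gauss(0, 1))
        x2 = complex(rng.gauss(0, 1), rng.gauss(0, 1))
        u = x1 * x1 * x2.conjugate()        # IM3 basis at 2*f1 - f2
        e = (c_im3 + gain * alpha) * u      # observed IM3 residual
        # Step against the correlation between the residual and the basis.
        alpha -= mu * e * u.conjugate() / (abs(u) ** 2 + eps)
    return alpha
```

In this noise-free toy model the coefficient settles at `-c_im3 / gain`, the value for which the residual `(c_im3 + gain * alpha) * u` vanishes. A real transmitter would add noise, memory effects, and feedback-loop delay, none of which are modelled here.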