Instruction fetch architectures and code layout optimizations
The design of higher performance processors has been following two major trends: increasing the pipeline depth to allow faster clock rates, and widening the pipeline to allow parallel execution of more instructions. Designing a higher performance processor implies balancing all the pipeline stages to ensure that overall performance is not dominated by any of them. This means that a faster execution engine also requires a faster fetch engine, to ensure that it is possible to read and decode enough instructions to keep the pipeline full and the functional units busy. This paper explores the challenges faced by the instruction fetch stage for a variety of processor designs, from early pipelined processors to the more aggressive wide-issue superscalars. We describe the different fetch engines proposed in the literature, the performance issues involved, and some of the proposed improvements. We also show how compiler techniques that optimize the layout of the code in memory can be used to improve the fetch performance of the different engines described. Overall, we show how instruction fetch has evolved from fetching one instruction every few cycles, to fetching one instruction per cycle, to fetching a full basic block per cycle, to several basic blocks per cycle: the evolution of the mechanism surrounding the instruction cache, and the different compiler optimizations used to better employ these mechanisms.
Peer Reviewed. Postprint (published version)
Asymptotic Analysis of Plausible Tree Hash Modes for SHA-3
Discussions about the choice of a tree hash mode of operation for
standardization have recently been undertaken. It appears that a single tree
mode cannot address adequately all possible uses and specifications of a
system. In this paper, we review the tree modes which have been proposed, we
discuss their problems and propose remedies. We make the reasonable assumption
that communicating systems have different specifications and that software
applications are of different types (securing stored content or live-streamed
content). Finally, we propose new modes of operation that address the resource
usage problem for the three most representative categories of devices and we
analyse their asymptotic behavior.
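To make the idea of a tree hash mode concrete, here is a minimal sketch built on SHA3-256 from Python's standard `hashlib`. It is not one of the modes proposed in the paper: the function name `tree_hash`, the parameters `leaf_size` and `arity`, and the one-byte prefixes used to separate leaf, inner, and final nodes are all illustrative assumptions, and no security claim is made for this construction.

```python
import hashlib


def tree_hash(data: bytes, leaf_size: int = 1024, arity: int = 2) -> bytes:
    """Hash `data` with a simple, hypothetical tree mode built on SHA3-256."""
    # Split the message into fixed-size leaves; an empty message still gets one leaf.
    chunks = [data[i:i + leaf_size] for i in range(0, len(data), leaf_size)] or [b""]
    # Leaf nodes: the 0x00 prefix separates leaf inputs from inner-node inputs.
    level = [hashlib.sha3_256(b"\x00" + c).digest() for c in chunks]
    # Inner nodes: hash groups of `arity` children until one digest remains.
    # Each level could be computed in parallel, one node per processor.
    while len(level) > 1:
        level = [
            hashlib.sha3_256(b"\x01" + b"".join(level[i:i + arity])).digest()
            for i in range(0, len(level), arity)
        ]
    # Final node: bind the total message length to the root digest.
    length = len(data).to_bytes(8, "big")
    return hashlib.sha3_256(b"\x02" + length + level[0]).digest()


digest = tree_hash(b"a" * 5000)
print(digest.hex())
```

Larger values of `arity` give a shallower tree with fewer levels but more work per inner node; that trade-off between running time and resource usage is the kind of question the abstract refers to.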
Software trace cache
We explore the use of compiler optimizations that optimize the layout of instructions in memory. The target is to enable the code to make better use of the underlying hardware resources regardless of the specific details of the processor/architecture in order to increase fetch performance. The Software Trace Cache (STC) is a code layout algorithm with a broader target than previous layout optimizations. We target not only an improvement in the instruction cache hit rate, but also an increase in the effective fetch width of the fetch engine. The STC algorithm organizes basic blocks into chains trying to make sequentially executed basic blocks reside in consecutive memory positions, then maps the basic block chains in memory to minimize conflict misses in the important sections of the program. We evaluate and analyze in detail the impact of the STC, and code layout optimizations in general, on the three main aspects of fetch performance: the instruction cache hit rate, the effective fetch width, and the branch prediction accuracy. Our results show that layout-optimized codes have some special characteristics that make them more amenable to high-performance instruction fetch. They have a very high rate of not-taken branches and execute long chains of sequential instructions; also, they make very effective use of instruction cache lines, mapping only useful instructions which will execute close in time, increasing both spatial and temporal locality.
Peer Reviewed. Postprint (published version)
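The chain-building step can be illustrated with a simplified greedy sketch. This is not the authors' STC algorithm, only an illustration of the general idea of placing sequentially executed basic blocks next to each other. The function name `build_chains` and the profile inputs `block_counts` and `edge_counts` are hypothetical, and the later step of mapping chains to memory to avoid cache conflicts is not modeled.

```python
def build_chains(block_counts, edge_counts):
    """Group basic blocks into chains by following the hottest unplaced successor.

    block_counts: {block: execution count}         (hypothetical profile data)
    edge_counts:  {(src, dst): transition count}   (hypothetical profile data)
    """
    # successors[b] holds (count, successor) pairs for edges leaving block b.
    successors = {}
    for (src, dst), count in edge_counts.items():
        successors.setdefault(src, []).append((count, dst))

    visited = set()
    chains = []
    # Seed chains from the most frequently executed blocks first.
    for seed in sorted(block_counts, key=block_counts.get, reverse=True):
        block = seed
        chain = []
        while block is not None and block not in visited:
            visited.add(block)
            chain.append(block)
            # Follow the most frequently taken edge to a block not yet placed.
            candidates = [(c, d) for c, d in successors.get(block, []) if d not in visited]
            block = max(candidates)[1] if candidates else None
        if chain:
            chains.append(chain)
    return chains


block_counts = {"A": 120, "B": 90, "C": 10, "D": 100}
edge_counts = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
print(build_chains(block_counts, edge_counts))  # [['A', 'B', 'D'], ['C']]
```

Laying out the chain `A, B, D` contiguously turns the hot path into fall-through (not-taken) branches, which is consistent with the high rate of not-taken branches the abstract reports for layout-optimized code.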
Achieving Marton's Region for Broadcast Channels Using Polar Codes
This paper presents polar coding schemes for the 2-user discrete memoryless
broadcast channel (DM-BC) which achieve Marton's region with both common and
private messages. This is the best achievable rate region known to date, and it
is tight for all classes of 2-user DM-BCs whose capacity regions are known. To
accomplish this task, we first construct polar codes for both the superposition
as well as the binning strategy. By combining these two schemes, we obtain
Marton's region with private messages only. Finally, we show how to handle the
case of common information. The proposed coding schemes possess the usual
advantages of polar codes, i.e., they have low encoding and decoding complexity
and a super-polynomial decay rate of the error probability.
We follow the lead of Goela, Abbe, and Gastpar, who recently introduced polar
codes emulating the superposition and binning schemes. In order to align the
polar indices, for both schemes, their solution involves some degradedness
constraints that are assumed to hold between the auxiliary random variables and
the channel outputs. To remove these constraints, we consider the transmission
of blocks and employ a chaining construction that guarantees the proper
alignment of the polarized indices. The techniques described in this work are
quite general, and they can be adapted to many other multi-terminal scenarios
whenever polar indices need to be aligned.
Comment: 26 pages, 11 figures, accepted to IEEE Trans. Inform. Theory and
presented in part at ISIT'1
Exact genome alignment
The increase in the volume of genomic data due to the decrease in the cost of whole genome sequencing techniques has opened up new avenues of research in the field of Bioinformatics, such as comparative genomics and evolutionary dynamics. The fundamental task in these studies is to align the genome sequences accurately. Sequence alignment helps to identify regions of similarity between the sequences to establish their functional, evolutionary and structural relationship. The thesis investigates the performance of two sequence alignment programs on whole genome sequences from primates and mammals: LASTZ, a faster hash-table-based method, and SSEARCH, a slower but more rigorous Smith-Waterman-based approach. An exact genome alignment technique is used by breaking the entire genome into fragments and aligning these fragments with the reference genome using the Smith-Waterman-based method. A comparison of the two methods reveals that the second approach performs better for genomes from closely related species.
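For readers unfamiliar with Smith-Waterman, the following is a minimal sketch of the local alignment score computation. It assumes a linear gap penalty and simple match/mismatch scores, all of which are illustrative choices; the function name `smith_waterman_score` is hypothetical, and the sketch does not reflect the scoring scheme or implementation used by SSEARCH or by the thesis.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    # prev holds the previous row of the dynamic-programming matrix.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets a local alignment restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 6: the shared "ACG"
```

The computation takes time proportional to the product of the two sequence lengths, which is why a whole-genome approach based on it would work on fragments rather than on entire genomes at once.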
Significant speedup of database searches with HMMs by search space reduction with PSSM family models
Motivation: Profile hidden Markov models (pHMMs) are currently the most popular modeling concept for protein families. They provide sensitive family descriptors, and sequence database searching with pHMMs has become a standard task in today's genome annotation pipelines. On the downside, searching with pHMMs is computationally expensive.
Optimization of Tree Modes for Parallel Hash Functions: A Case Study
This paper focuses on parallel hash functions based on tree modes of
operation for an inner Variable-Input-Length function. This inner function can
be either a single-block-length (SBL) and prefix-free MD hash function, or a
sponge-based hash function. We discuss the various forms of optimality that can
be obtained when designing parallel hash functions based on trees where all
leaves have the same depth. The first result is a scheme which optimizes the
tree topology in order to decrease the running time. Then, without affecting
the optimal running time we show that we can slightly change the corresponding
tree topology so as to minimize the number of required processors as well.
Consequently, the resulting scheme decreases first the running time and
second the number of required processors.
Comment: Preprint version. Added citations, IEEE Transactions on Computers,
201