Cost-effective compiler directed memory prefetching and bypassing
Ever-increasing memory latencies and deeper pipelines push memory farther from the processor. Prefetching techniques aim to bridge these two gaps by fetching data in advance into both the L1 cache and the register file. Our main contribution in this paper is a hybrid approach to the prefetching problem that combines software and hardware prefetching in a cost-effective way, needing very little hardware support and minimally affecting the design of the processor pipeline. The prefetcher is built on top of static memory instruction bypassing, which is in charge of bringing prefetched values into the register file. We also present a thorough analysis of the limits of both prefetching and memory instruction bypassing, and we compare our prefetching technique with a prior speculative proposal that attacked the same problem, showing that, at much lower cost, our hybrid solution outperforms a realistic implementation of speculative prefetching and bypassing. On average, our hybrid implementation achieves a 13% speed-up over a version with software prefetching in a subset of numerical applications, and an average of 43% over a version with no software prefetching (up to 102% for specific benchmarks).
Low-complexity distributed issue queue
As technology evolves, power density increases significantly and cooling systems become more complex and expensive. The issue logic is one of the processor hotspots and, at the same time, its latency is crucial for processor performance. We present a low-complexity FP issue logic (MB_distr) that achieves high performance with small energy requirements. The MB_distr scheme is based on classifying instructions and dispatching them into a set of queues depending on their data dependences. These instructions are selected for issue based on an estimate of when their operands will be available, so the conventional wakeup activity is not required. Additionally, the functional units are distributed across the different queues. The energy required by the proposed scheme is substantially lower than that required by a conventional issue design, even if the latter is able to wake up only unready operands. The MB_distr scheme reduces the energy-delay product by 35% and the energy-delay² product by 18% with respect to a state-of-the-art approach.
Instruction history management for high-performance microprocessors
History-driven dynamic optimization is an important factor in improving
instruction throughput in future high-performance microprocessors. History-based
techniques have the ability to improve instruction-level parallelism by
breaking program dependencies, eliminating long-latency microarchitecture
operations, and improving prioritization within the microarchitecture. However,
a combination of factors, such as wider issue widths, smaller transistors,
larger die area, and increasing clock frequency, has led to microprocessors that
are sensitive to both wire delays and energy consumption. In this environment,
the global structures and long-distance communications that characterize current
history data management are limiting instruction throughput.
This dissertation proposes the ScatterFlow Framework for Instruction
History Management. Execution history management tasks, such as history
data storage, access, distribution, collection, and modification, are partitioned
and dispersed throughout the instruction execution pipeline. History data
packets are then associated with active instructions and flow with the instructions
as they execute, encountering the history management tasks along the
way. Between dynamic instances of the instructions, the history data packets
reside in trace-based history storage that is synchronized with the instruction
trace cache. Compared to traditional history data management, this ScatterFlow
method improves instruction coverage, increases history data access
bandwidth, shortens communication distances, improves history data accuracy
in many cases, and decreases the effective history data access time.
A comparison of general history management effectiveness between the
ScatterFlow Framework and traditional hardware tables shows that the ScatterFlow
Framework provides superior history maturity and instruction coverage.
The unique properties that arise due to trace-based history storage and
partitioned history management are analyzed, and novel design enhancements
are presented to increase the usefulness of instruction history data within the
ScatterFlow Framework.
To demonstrate the potential of the proposed framework, specific dynamic
optimization techniques are implemented using the ScatterFlow Framework.
These illustrative examples combine the history capture advantages
with the access latency improvements while exhibiting desirable dynamic energy
consumption properties. Compared to a traditional table-based predictor,
performing ScatterFlow value prediction improves execution time and reduces
dynamic energy consumption. In other detailed examples, ScatterFlow-enabled
cluster assignment demonstrates improved execution time over previous
cluster assignment schemes, and ScatterFlow instruction-level profiling
detects more useful execution traits than traditional fixed-size and infinite-size
hardware tables.
Future value based single assignment program representations and optimizations
An optimizing compiler's internal representation fundamentally affects the clarity, efficiency, and feasibility of the optimization algorithms the compiler employs. Static Single Assignment (SSA), the state-of-the-art program representation, has great advantages but can still be improved. This dissertation explores the domain of single assignment beyond SSA and presents two novel program representations: Future Gated Single Assignment (FGSA) and Recursive Future Predicated Form (RFPF). Both FGSA and RFPF embed control flow and data flow information, enabling efficient traversal of program information and thus leading to better and simpler optimizations. We introduce the future value concept, the design basis of both FGSA and RFPF, which permits a consumer instruction to be encountered before the producer of its source operand(s) in a control flow setting. We show that FGSA is efficiently computable using a series of T1/T2/TR transformations, yielding an expected linear-time algorithm that combines the construction of the pruned single assignment form with liveness analysis for both reducible and irreducible graphs. The approach yields an average reduction of 7.7%, with a maximum of 67%, in the number of gating functions compared to the pruned SSA form on the SPEC2000 benchmark suite. We present a solid and near-optimal framework for performing the inverse transformation from single assignment programs. We demonstrate the importance of unrestricted code motion and present RFPF. We develop algorithms that enable instruction movement in acyclic as well as cyclic regions, and show how easily optimizations such as Partial Redundancy Elimination can be performed on RFPF.
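The single assignment property that SSA, FGSA, and RFPF share can be illustrated on straight-line code. The following is a minimal sketch, not the dissertation's algorithm: the statement format, the `to_single_assignment` helper, and the `name_N` versioning convention are all assumptions made for illustration, and the sketch ignores control flow, which is exactly where gating functions become necessary.

```python
from typing import Dict, List, Tuple

# Hypothetical three-address statement: (target, operator, operands).
Stmt = Tuple[str, str, List[str]]

def to_single_assignment(stmts: List[Stmt]) -> List[Stmt]:
    """Rename straight-line code so that every name is assigned exactly once."""
    version: Dict[str, int] = {}
    renamed: List[Stmt] = []
    for target, op, operands in stmts:
        # A use refers to the latest version of a name; names that were
        # never assigned are treated as program inputs and left unchanged.
        new_operands = [
            f"{name}_{version[name]}" if name in version else name
            for name in operands
        ]
        # Each definition creates a fresh version of its target.
        version[target] = version.get(target, 0) + 1
        renamed.append((f"{target}_{version[target]}", op, new_operands))
    return renamed

program = [
    ("x", "+", ["a", "b"]),
    ("x", "*", ["x", "c"]),
    ("y", "-", ["x", "a"]),
]
for stmt in to_single_assignment(program):
    print(stmt)
```

After renaming, the second statement defines `x_2` from `x_1`, so each use names exactly one definition. At a control-flow merge this simple scheme no longer suffices, because a use may be reached by several definitions; that is the case gating functions address.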
Assessing Code Obfuscation of Metamorphic JavaScript
Metamorphic malware is one of the biggest and most ubiquitous threats in the digital world. Metamorphism can be used to morph the structure of the target code without changing its underlying functionality, making the malware very difficult to detect using signature-based detection and heuristic analysis. The focus of this project is to analyze metamorphic JavaScript malware and the techniques that can be used to mutate JavaScript code. To assess the capabilities of the metamorphic engine, we performed experiments to visualize the degree of code morphing. Further, this project discusses methods that have been used to detect metamorphic malware and their potential limitations. Based on the experiments performed, a support vector machine (SVM) shows promise for detecting and classifying metamorphic code: an accuracy of 86% is observed when classifying benign, malware, and metamorphic files.
Robust Watermarking using Hidden Markov Models
Software piracy is the unauthorized copying or distribution of software. It is a growing problem that results in annual losses in the billions of dollars. Prevention is difficult, since digital documents are easy to copy and distribute. Watermarking is a possible defense against software piracy. A software watermark consists of information embedded in the software that allows it to be identified. A watermark can act as a deterrent to unauthorized copying, since it can be used to provide evidence for legal action against those responsible for piracy. In this project, we present a novel software watermarking scheme inspired by the success of previous research on detecting metamorphic viruses. We use a trained hidden Markov model (HMM) to detect a specific copy of software. We give experimental results that show our scheme is robust: we can identify the original software even after it has been extensively modified, as might occur as part of an attack on the watermarking scheme.
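Detection with a trained HMM typically comes down to scoring an observation sequence against the model and comparing the score with a threshold. The sketch below shows only that scoring step, using the standard scaled forward algorithm. The two-state model, its probabilities, and the integer-coded observation symbols are invented for illustration; the abstract does not state which features or threshold the scheme uses.

```python
import math
from typing import List

def log_likelihood(obs: List[int], start: List[float],
                   trans: List[List[float]], emit: List[List[float]]) -> float:
    """Scaled forward algorithm: returns log P(obs | model) for non-empty obs."""
    n = len(start)
    # Initialisation: probability of starting in each state and emitting obs[0].
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    scale = sum(alpha)
    alpha = [a / scale for a in alpha]
    total = math.log(scale)
    for symbol in obs[1:]:
        # Induction: move one step along the chain, then emit the next symbol.
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
            for j in range(n)
        ]
        scale = sum(alpha)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        total += math.log(scale)
    return total

# Toy model (assumed values): the hidden state tends to alternate, and each
# state strongly prefers one of the two observation symbols.
start = [0.5, 0.5]
trans = [[0.1, 0.9], [0.9, 0.1]]
emit = [[0.9, 0.1], [0.1, 0.9]]

matching = [0, 1, 0, 1, 0, 1]   # follows the pattern the model encodes
modified = [0, 0, 0, 0, 0, 0]   # does not follow it
print(log_likelihood(matching, start, trans, emit))
print(log_likelihood(modified, start, trans, emit))
```

The sequence that follows the model's pattern receives the higher log-likelihood. In practice scores are usually normalised by sequence length so that sequences of different lengths can be compared against one threshold.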
An energy efficient dependence driven scalable dispatch scheme
This thesis proposes two schemes that target superscalar microarchitectures. The first scheme aims to reduce the complexity of the dispatch logic by reducing the number of ports to the renamer. The second scheme targets L-1 data cache energy reduction by proposing a cache architecture that incorporates IPC-aware dynamic associativity management. The dispatch stage constitutes a key critical path in the scalability of current-generation superscalar processors. The rename map table (RMT) access and the dependence check logic (DCL) delays scale unfavorably with the dispatch width (DW) of a superscalar processor. It is a well-known program property that the results of most instructions are consumed within a window of the following 4-6 instructions. This program behavior can be exploited to reduce the rename delay by reducing the number of read/write ports in the RMT to significantly below the current 3×DW. We propose an algorithm to dynamically allocate a reduced number of RMT ports to instructions in the current dispatch window, matching dispatch resources to average needs rather than peak needs. This results in shorter RMT access delays as well as lower energy in the dispatch stage. The IPC reduction due to RMT read/write port contention in the proposed scheme stays within 2-4%. The cycle time saved can also be leveraged to support wider dispatch in the same cycle time in order to offset this degradation. Data caches are designed with higher associativities to support data sets corresponding to peak load/store bandwidths. The second part of this thesis explores a more general design space for dynamic associativity (for a 4-way associative cache, consider 1-way, 2-way, and 4-way associative accesses). We use the actual instruction-level parallelism exhibited by the instructions surrounding a given load to classify it as belonging to a particular IPC packet. The lookup schedule is fixed in advance for each IPC classifier.
The energy savings over SPEC2000 CPU benchmarks average 28.6% for a 32KB, 4-way, L-1 data cache. The resulting performance (IPC) degradation from the dynamic way schedule is restricted to less than 2.25%, mainly because IPC-based placement ends up being an excellent classifier.
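The observation behind the reduced-port renamer can be made concrete by counting how many source operands in a dispatch group actually need a rename map table lookup. This is a rough sketch under stated assumptions, not the thesis's allocation algorithm: the instruction format and the `rmt_read_ports_needed` helper are hypothetical, each instruction is assumed to have at most two sources, and only read ports are counted.

```python
from typing import List, Optional, Tuple

# Hypothetical instruction format: (destination register, [source registers]).
Instr = Tuple[Optional[str], List[str]]

def rmt_read_ports_needed(group: List[Instr]) -> int:
    """Count source operands that must be looked up in the rename map table.

    A source produced by an earlier instruction in the same dispatch group
    is resolved by the dependence check logic, so it needs no RMT read port.
    """
    produced_in_group = set()
    reads = 0
    for dest, sources in group:
        for src in sources:
            if src not in produced_in_group:
                reads += 1
        if dest is not None:
            produced_in_group.add(dest)
    return reads

group = [
    ("r1", ["r2", "r3"]),   # both sources come from outside the group
    ("r4", ["r1", "r5"]),   # r1 is produced by the previous instruction
    ("r6", ["r4", "r1"]),   # both sources are produced inside the group
    (None, ["r6", "r7"]),   # a store: no destination register
]
peak_read_ports = 2 * len(group)   # provisioning for the worst case
needed = rmt_read_ports_needed(group)
print(needed, peak_read_ports)     # prints "4 8"
```

For this group only four of the eight provisioned read ports are used, which is the kind of gap between peak and average demand that a dynamic port allocation scheme can exploit.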
Applying Deep Learning Techniques to the Analysis of Android APKs
Malware targeting mobile devices is a pervasive problem in modern life, and tools to detect and classify malware are therefore of great value. This paper seeks to demonstrate the effectiveness of deep learning techniques, specifically convolutional neural networks, in detecting and classifying malware targeting the Android operating system. Unlike many current detection techniques, which rely on relatively rigid features, deep neural networks can automatically learn flexible features that may be more resilient to obfuscation. We present a parsing method for extracting sequences of API calls that describe a hypothetical execution of a given application. We then show how to use these sequences of API calls to classify Android malware with a convolutional neural network.
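Before a sequence of API calls can be fed to a convolutional network, it is usually mapped to fixed-length integer ids. The sketch below illustrates one common way to do that; it is an assumption about preprocessing, not the paper's pipeline. The `build_vocab` and `encode` helpers, the reserved ids, and the example call strings are all hypothetical.

```python
from typing import Dict, Iterable, List

PAD, UNKNOWN = 0, 1  # reserved ids (an assumption of this sketch)

def build_vocab(sequences: Iterable[List[str]]) -> Dict[str, int]:
    """Assign an integer id to each distinct API call, in order of first use."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for call in seq:
            vocab.setdefault(call, len(vocab) + 2)  # ids 0 and 1 are reserved
    return vocab

def encode(seq: List[str], vocab: Dict[str, int], length: int) -> List[int]:
    """Truncate or pad a call sequence to a fixed length of integer ids."""
    ids = [vocab.get(call, UNKNOWN) for call in seq[:length]]
    return ids + [PAD] * (length - len(ids))

# Illustrative call strings only; real sequences would come from the parser.
training_sequences = [
    ["TelephonyManager.getDeviceId", "URL.openConnection"],
    ["SmsManager.sendTextMessage"],
]
vocab = build_vocab(training_sequences)
sample = ["SmsManager.sendTextMessage", "Runtime.exec", "URL.openConnection"]
print(encode(sample, vocab, 5))  # prints [4, 1, 3, 0, 0]
```

Calls never seen during vocabulary construction map to the `UNKNOWN` id, and short sequences are padded so that every input to the network has the same length.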