5,647 research outputs found

    Yet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache

    Get PDF
    Cache memories play a critical role in bridging the latency, bandwidth, and energy gaps between cores and off-chip memory. However, caches frequently consume a significant fraction of a multicore chip’s area, and thus account for a significant fraction of its cost. Compression has the potential to improve the effective capacity of a cache, providing the performance and energy benefits of a larger cache while using less area. The design of a compressed cache must address two important issues: i) a low-latency, low-overhead compression algorithm that can represent a fixed-size cache block using fewer bits and ii) a cache organization that can efficiently store the resulting variable-size compressed blocks. This paper focuses on the latter issue. In this paper, we propose YACC (Yet Another Compressed Cache), a new compressed cache design that uses super-blocks to reduce tag overheads and variable-size blocks to reduce internal fragmentation, but eliminates two major sources of complexity in previous work—decoupled tag-data mapping and address skewing. YACC’s cache layout is similar to conventional caches, eliminating the back-pointers used to maintain a decoupled tag-data mapping and the extra decoders used to implement skewed associativity. An additional advantage of YACC is that it enables modern replacement mechanisms, such as RRIP. For our benchmark set, YACC performs comparably to the recently-proposed Skewed Compressed Cache (SCC) ‎[Sardashti et al. 2014], but with a simpler, more area efficient design without the complexity and overheads of skewing. Compared to a conventional uncompressed 8MB LLC, YACC improves performance by on average 8% and up to 26%, and reduces total energy by on average 6% and up to 20%. An 8MB YACC achieves approximately the same performance and energy improvements as a 16MB conventional cache at a much smaller silicon footprint, with 1.6% higher area than an 8MB conventional cach

    Leveraging Coding Techniques for Speeding up Distributed Computing

    Get PDF
    Large scale clusters leveraging distributed computing frameworks such as MapReduce routinely process data that are on the orders of petabytes or more. The sheer size of the data precludes the processing of the data on a single computer. The philosophy in these methods is to partition the overall job into smaller tasks that are executed on different servers; this is called the map phase. This is followed by a data shuffling phase where appropriate data is exchanged between the servers. The final so-called reduce phase, completes the computation. One potential approach, explored in prior work for reducing the overall execution time is to operate on a natural tradeoff between computation and communication. Specifically, the idea is to run redundant copies of map tasks that are placed on judiciously chosen servers. The shuffle phase exploits the location of the nodes and utilizes coded transmission. The main drawback of this approach is that it requires the original job to be split into a number of map tasks that grows exponentially in the system parameters. This is problematic, as we demonstrate that splitting jobs too finely can in fact adversely affect the overall execution time. In this work we show that one can simultaneously obtain low communication loads while ensuring that jobs do not need to be split too finely. Our approach uncovers a deep relationship between this problem and a class of combinatorial structures called resolvable designs. Appropriate interpretation of resolvable designs can allow for the development of coded distributed computing schemes where the splitting levels are exponentially lower than prior work. We present experimental results obtained on Amazon EC2 clusters for a widely known distributed algorithm, namely TeraSort. We obtain over 4.69×\times improvement in speedup over the baseline approach and more than 2.6×\times over current state of the art

    Multi-engine packet classification hardware accelerator

    Get PDF
    As line rates increase, the task of designing high performance architectures with reduced power consumption for the processing of router traffic remains important. In this paper, we present a multi-engine packet classification hardware accelerator, which gives increased performance and reduced power consumption. It follows the basic idea of decision-tree based packet classification algorithms, such as HiCuts and HyperCuts, in which the hyperspace represented by the ruleset is recursively divided into smaller subspaces according to some heuristics. Each classification engine consists of a Trie Traverser which is responsible for finding the leaf node corresponding to the incoming packet, and a Leaf Node Searcher that reports the matching rule in the leaf node. The packet classification engine utilizes the possibility of ultra-wide memory word provided by FPGA block RAM to store the decision tree data structure, in an attempt to reduce the number of memory accesses needed for the classification. Since the clock rate of an individual engine cannot catch up to that of the internal memory, multiple classification engines are used to increase the throughput. The implementations in two different FPGAs show that this architecture can reach a searching speed of 169 million packets per second (mpps) with synthesized ACL, FW and IPC rulesets. Further analysis reveals that compared to state of the art TCAM solutions, a power savings of up to 72% and an increase in throughput of up to 27% can be achieved

    Processing Analytical Queries over Encrypted Data

    Get PDF
    MONOMI is a system for securely executing analytical workloads over sensitive data on an untrusted database server. MONOMI works by encrypting the entire database and running queries over the encrypted data. MONOMI introduces split client/server query execution, which can execute arbitrarily complex queries over encrypted data, as well as several techniques that improve performance for such workloads, including per-row precomputation, space-efficient encryption, grouped homomorphic addition, and pre-filtering. Since these optimizations are good for some queries but not others, MONOMI introduces a designer for choosing an efficient physical design at the server for a given workload, and a planner to choose an efficient execution plan for a given query at runtime. A prototype of MONOMI running on top of Postgres can execute most of the queries from the TPC-H benchmark with a median overhead of only 1.24× (ranging from 1.03×to 2.33×) compared to an un-encrypted Postgres database where a compromised server would reveal all data.National Science Foundation (U.S.) (Award IIS-1065219)Google (Firm

    Exploiting CAFS-ISP

    Get PDF
    In the summer of 1982, the ICLCUA CAFS Special Interest Group defined three subject areas for working party activity. These were: 1) interfaces with compilers and databases, 2) end-user language facilities and display methods, and 3) text-handling and office automation. The CAFS SIG convened one working party to address the first subject with the following terms of reference: 1) review facilities and map requirements onto them, 2) "Database or CAFS" or "Database on CAFS", 3) training needs for users to bridge to new techniques, and 4) repair specifications to cover gaps in software. The working party interpreted the topic broadly as the data processing professional's, rather than the end-user's, view of and relationship with CAFS. This report is the result of the working party's activities. The report content for good reasons exceeds the terms of reference in their strictest sense. For example, we examine QUERYMASTER, which is deemed to be an end-user tool by ICL, from both the DP and end-user perspectives. First, this is the only interface to CAFS in the current SV201. Secondly, it is necessary for the DP department to understand the end-user's interface to CAFS. Thirdly, the other subjects have not yet been addressed by other active working parties

    A Matrix PRNG with S-Box Output Filtering

    Get PDF
    We describe a modification to a previously published pseudorandom number generator improving security while maintaining high performance. The proposed generator is based on the powers of a word-packed block upper triangular matrix and it is designed to be fast and easy to implement in software since it mainly involves bitwise operations between machine registers and, in our tests, it presents excellent security and statistical characteristics. The modifications include a new, key-derived s-box based nonlinear output filter and improved seeding and extraction mechanisms. This output filter can also be applied to other generators.Research partially supported by the Spanish MINECO under Project TIN2011-25452
    • 

    corecore