71 research outputs found
Transactions Everywhere
Arguably, one of the biggest deterrents for software developers who might otherwise choose to write parallel code is that parallelism makes their lives more complicated. Perhaps the most basic problem inherent in the coordination of concurrent tasks is the enforcing of atomicity so that the partial results of one task do not inadvertently corrupt another task. Atomicity is typically enforced through locking protocols, but these protocols can introduce other complications, such as deadlock, unless restrictive methodologies in their use are adopted. We have recently begun a research project focusing on transactional memory [18] as an alternative mechanism for enforcing atomicity, since it allows the user to avoid many of the complications inherent in locking protocols. Rather than viewing transactions as infrequent occurrences in a program, as has generally been done in the past, we have adopted the point of view that all user code should execute in the context of some transaction. To make this viewpoint viable requires the development of two key technologies: effective hardware support for scalable transactional memory, and linguistic and compiler support. This paper describes our preliminary research results on making “transactions everywhere” a practical reality. Singapore-MIT Alliance (SMA)
Final Report: Efficient Databases for MPC Microdata
The purpose of this grant was to develop the theory and practice of high-performance databases for massive streamed datasets. Over the last three years, we have developed fast indexing technology, that is, technology for rapidly ingesting data and storing that data so that it can be efficiently queried and analyzed. During this project we developed the technology so that high-bandwidth data streams can be indexed and queried efficiently. Our technology has been proven to work on data sets composed of tens of billions of rows when the data stream arrives at over 40,000 rows per second. We achieved these numbers even on a single disk driven by two cores. Our work comprised (1) new write-optimized data structures with better asymptotic complexity than traditional structures, (2) implementation, and (3) benchmarking. We furthermore developed a prototype of TokuFS, a middleware layer that can handle microdata I/O packaged up in an MPI-IO abstraction.
Hardware Transactional Memory
This work shows how hardware transactional memory (HTM) can be implemented to support transactions of arbitrarily large size, while ensuring that small transactions run efficiently. Our implementation handles small transactions similar to Herlihy and Moss's scheme in that it holds tentative updates in a cache. Unlike their scheme, which uses a special fully associative cache, ours augments the ordinary processor cache and provides a mechanism to handle cache spills of uncommitted transactional data. Consequently, our scheme runs faster for small transactions while correctly handling transactions of arbitrarily large size.
Although transactions are small in the common case, we argue that HTM should not restrict the size of transactions, because it complicates the programmer/compiler model and precludes some important programs from exploiting transactional memory. We show that the Linux 2.4.19 kernel can be automatically and efficiently “transactified” if boundless transactions can be supported. Our experimental results show that the largest transaction touches over 7000 64-byte cache lines, whereas 99.94% of the transactions touch fewer than 64 cache lines. We further show that synchronized methods in Java can be easily compiled to our HTM scheme, thereby providing the advantages of nonblocking atomicity (including absence of deadlock) in a straightforward fashion.
Our HTM scheme for boundless transactions uses an efficiently implementable hardware snapshot and the ordinary set-associative L2 cache extended with less than two bits per cache line. One of the bits tells whether the cached item is part of a transaction (as in the Herlihy-Moss scheme), and all the lines in an associative set share another bit telling whether a line has overflowed from the cache and is now stored in a special overflow area of main memory. We provide empirical results to show that our scheme does not adversely affect the processor pipeline or hinder speculative execution. Singapore-MIT Alliance (SMA)
Adversarial Analyses of Window Backoff Strategies for Simple Multiple-Access Channels
Backoff strategies have typically been analyzed by making statistical assumptions on the distribution of problem inputs. Although these analyses have provided valuable insights into the efficacy of various backoff strategies, they leave open the question as to which backoff algorithms perform best in the worst case or on inputs, such as bursty inputs, that are not covered by the statistical models. This paper analyzes randomized backoff strategies using worst-case assumptions on the inputs.
Specifically, we analyze algorithms for simple multiple-access channels, where the only feedback from each attempt to send a packet is a single bit indicating whether the transmission succeeded or the packet collided with another packet. We analyze a class of strategies, called window strategies, where each packet partitions time into a sequence (W₁, W₂,...) of windows. Within each window, the packet makes an access attempt during a single randomly selected slot. If its transmission is unsuccessful, it waits for its slot in the next window before retrying.
We use delay-sequence arguments to show that for the batch problem, in which n packets all arrive at time 0, if every window has size W = Θ(n), then with high probability, all packets successfully transmit with makespan n lg lg n ± O(n). We use this result to analyze window backoff strategies with varying window sizes. Specifically, we show that the familiar binary exponential backoff algorithm, where W_k = Θ(2^k), has makespan Θ(n lg n), and that more generally, for any constant r > 1, the r-exponential backoff algorithm in which W_k = Θ(r^k) has makespan Θ(n log_{lg r} n). We also show that for any constant r > 1, the r-polynomial backoff algorithm, in which W_k = Θ(k^r), has makespan Θ((n/lg n)^{1+1/r}).
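The window strategy described above can be illustrated with a small simulation of the batch problem. This is only a sketch under stated assumptions: the function name `simulate_batch`, the `window_size` callback, and the `max_windows` guard are hypothetical names introduced here, all packets are assumed to share aligned windows because they arrive together at time 0, and a slot is counted as a success only when exactly one packet picks it.

```python
import random


def simulate_batch(n, window_size, seed=0, max_windows=10_000):
    """Simulate n packets that all arrive at time 0 under a window strategy.

    window_size(k) gives the size of the k-th window (k = 1, 2, ...).
    Returns the makespan: the number of slots elapsed when the last
    packet transmits successfully.
    """
    rng = random.Random(seed)
    pending = n      # packets that have not yet succeeded
    elapsed = 0      # slots consumed by completed windows
    for k in range(1, max_windows + 1):
        if pending == 0:
            return elapsed
        w = window_size(k)
        # Each pending packet picks one slot uniformly at random in the window.
        picks = {}
        for _ in range(pending):
            slot = rng.randrange(w)
            picks[slot] = picks.get(slot, 0) + 1
        # A slot succeeds only if exactly one packet chose it.
        successes = [slot for slot, count in picks.items() if count == 1]
        pending -= len(successes)
        if pending == 0:
            # The last success in this window finishes the batch.
            return elapsed + max(successes) + 1
        elapsed += w
    raise RuntimeError("packets still pending after max_windows windows")
```

For example, `simulate_batch(1000, lambda k: 2 ** k)` runs binary exponential backoff, and `simulate_batch(1000, lambda k: 1000)` runs fixed windows of size n. The measured makespans are single random samples, so they only loosely indicate the asymptotic bounds stated in the abstract.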
All of these batch strategies are monotonic, in the sense that the window size monotonically increases over time. We exhibit a monotonic backoff algorithm that achieves makespan Θ(n lg lg n/lg lg lg n). We prove that this algorithm, whose backoff is superpolynomial and subexponential, is optimal over all monotonic backoff schemes. In addition, we exhibit a simple backoff/backon algorithm, having window sizes that vary nonmonotonically according to a "sawtooth" pattern, that achieves the optimal makespan of Θ(n).
We study the online setting using an adversarial queueing model. We define a (λ,T)-stream to be an input stream of packets in which at most n = λT packets arrive during any time interval of size T. In this model, to evaluate a given backoff algorithm (which does not know λ or T), we analyze the worst-case behavior of the algorithm over the class of (λ,T)-streams.
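The (λ,T)-stream definition can be checked mechanically for a finite arrival sequence. The sketch below assumes integer slot times and treats an "interval of size T" as T consecutive slots; the function name `is_lambda_T_stream` is a hypothetical name chosen for illustration.

```python
def is_lambda_T_stream(arrivals, lam, T):
    """Return True if at most lam * T packets arrive in any T consecutive slots.

    arrivals is an iterable of integer arrival slots (repeats allowed).
    """
    times = sorted(arrivals)
    limit = lam * T
    lo = 0
    for hi, t in enumerate(times):
        # Advance lo until times[lo..hi] fit within T consecutive slots.
        while t - times[lo] >= T:
            lo += 1
        if hi - lo + 1 > limit:
            return False
    return True
```

It suffices to examine windows that end at an arrival, since any window containing the most arrivals can be shifted so that it ends at its last arrival without losing any packets.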
Our results for the online setting focus on exponential backoff. We show that for any arrival rate λ, there exists a sufficiently large interval size T such that the throughput goes to 0 for some (λ,T)-stream. Moreover, there exists a sufficiently large constant c such that for any interval size T, if λ ≥ c lg lg n/lg n, the system is unstable in the sense that the arrival rate exceeds the throughput in the worst case. If, on the other hand, we have λ ≤ c/lg n for a sufficiently small constant c, then the system is stable. Surprisingly, the algorithms that guarantee smaller makespans in the batch setting require lower arrival rates to achieve stability than does exponential backoff, but when they are stable, they have better response times. Singapore-MIT Alliance (SMA)
Don't Thrash: How to Cache Your Hash on Flash
This paper presents new alternatives to the well-known Bloom filter data
structure. The Bloom filter, a compact data structure supporting set insertion
and membership queries, has found wide application in databases, storage
systems, and networks. Because the Bloom filter performs frequent random reads
and writes, it is used almost exclusively in RAM, limiting the size of the sets
it can represent. This paper first describes the quotient filter, which
supports the basic operations of the Bloom filter, achieving roughly comparable
performance in terms of space and time, but with better data locality.
Operations on the quotient filter require only a small number of contiguous
accesses. The quotient filter has other advantages over the Bloom filter: it
supports deletions, it can be dynamically resized, and two quotient filters can
be efficiently merged. The paper then gives two data structures, the buffered
quotient filter and the cascade filter, which exploit the quotient filter
advantages and thus serve as SSD-optimized alternatives to the Bloom filter.
The cascade filter has better asymptotic I/O performance than the buffered
quotient filter, but the buffered quotient filter outperforms the cascade
filter on small to medium data sets. Both data structures significantly
outperform recently-proposed SSD-optimized Bloom filter variants, such as the
elevator Bloom filter, buffered Bloom filter, and forest-structured Bloom
filter. In experiments, the cascade filter and buffered quotient filter
performed insertions 8.6-11 times faster than the fastest Bloom filter variant
and performed lookups 0.94-2.56 times faster. Comment: VLDB201
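For readers unfamiliar with the baseline, the following is a minimal sketch of a classic Bloom filter, not of the quotient filter or its buffered and cascade variants. The class name, the use of SHA-256, and the double-hashing scheme for deriving bit positions are assumptions made for illustration. It shows why each operation touches several unrelated bit positions, which is the random-access pattern the abstract identifies as the obstacle to using Bloom filters outside RAM.

```python
import hashlib


class BloomFilter:
    """A basic Bloom filter over strings: insertions and membership queries."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        # Each insertion sets bits scattered across the whole array.
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

Note that this structure offers no deletion, resizing, or merging, which are the operations the abstract credits to the quotient filter.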
Cache-Oblivious Streaming B-Trees
A streaming B-tree is a dictionary that efficiently implements insertions and range queries. We present two cache-oblivious streaming B-trees, the shuttle tree, and the cache-oblivious lookahead array (COLA). For block-transfer size B and on N elements, the shuttle tree implements searches in optimal transfers, range queries of L successive elements in optimal transfers, and insertions in transfers, which is an asymptotic speedup over traditional B-trees if for any constant c > 1. A COLA implements searches in O(log N) transfers, range queries in O(log N + L/B) transfers, and insertions in amortized O((log N)/B) transfers, matching the bounds for a (cache-aware) buffered repository tree. A partially deamortized COLA matches these bounds but reduces the worst-case insertion cost to O(log N) if memory size . We also present a cache-aware version of the COLA, the lookahead array, which achieves the same bounds as Brodal and Fagerberg's (cache-aware) B^ε-tree. We compare our COLA implementation to a traditional B-tree. Our COLA implementation runs 790 times faster for random insertions, 3.1 times slower for insertions of sorted data, and 3.5 times slower for searches. Engineering and Applied Science
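The abstract does not describe the COLA's layout, so the following is a simplified sketch based on the general idea of keeping sorted arrays of doubling size and merging them on insertion. The class name `SimpleCOLA` is hypothetical, and the sketch omits the lookahead pointers entirely; without them, a search does a separate binary search per level and does not achieve the O(log N) transfer bound quoted above.

```python
import bisect
from heapq import merge


class SimpleCOLA:
    """Sorted levels of doubling size; level k is empty or holds 2**k keys."""

    def __init__(self):
        self.levels = []

    def insert(self, key):
        # Carry a sorted run upward, merging with each full level it meets,
        # much like incrementing a binary counter.
        carry = [key]
        k = 0
        while True:
            if k == len(self.levels):
                self.levels.append([])
            if not self.levels[k]:
                self.levels[k] = carry
                return
            carry = list(merge(carry, self.levels[k]))
            self.levels[k] = []
            k += 1

    def contains(self, key):
        # Binary-search every level independently (no lookahead pointers).
        for level in self.levels:
            i = bisect.bisect_left(level, key)
            if i < len(level) and level[i] == key:
                return True
        return False
```

Because each merge is a sequential scan over sorted runs, insertions in this style favor streaming access, which is consistent with the fast random insertions and slower searches reported in the abstract.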