848 research outputs found
The End of Slow Networks: It's Time for a Redesign
Next generation high-performance RDMA-capable networks will require a
fundamental rethinking of the design and architecture of modern distributed
DBMSs. These systems are commonly designed and optimized under the assumption
that the network is the bottleneck: the network is slow and "thin", and thus
needs to be avoided as much as possible. Yet this assumption no longer holds
true. With InfiniBand FDR 4x, the bandwidth available to transfer data across
network is in the same ballpark as the bandwidth of one memory channel, and it
increases even further with the most recent EDR standard. Moreover, with the
increasing advances of RDMA, the latency improves similarly fast. In this
paper, we first argue that the "old" distributed database design is not capable
of taking full advantage of the network. Second, we propose architectural
redesigns for OLTP, OLAP and advanced analytical frameworks to take better
advantage of the improved bandwidth, latency and RDMA capabilities. Finally,
for each of the workload categories, we show that remarkable performance
improvements can be achieved
The End of a Myth: Distributed Transactions Can Scale
The common wisdom is that distributed transactions do not scale. But what if
distributed transactions could be made scalable using the next generation of
networks and a redesign of distributed databases? There would be no need for
developers anymore to worry about co-partitioning schemes to achieve decent
performance. Application development would become easier as data placement
would no longer determine how scalable an application is. Hardware provisioning
would be simplified as the system administrator can expect a linear scale-out
when adding more machines rather than some complex sub-linear function, which
is highly application specific.
In this paper, we present the design of our novel scalable database system
NAM-DB and show that distributed transactions with the very common Snapshot
Isolation guarantee can indeed scale using the next generation of RDMA-enabled
network technology without any inherent bottlenecks. Our experiments with the
TPC-C benchmark show that our system scales linearly to over 6.5 million
new-order (14.5 million total) distributed transactions per second on 56
machines.Comment: 12 page
OLTP on Hardware Islands
Modern hardware is abundantly parallel and increasingly heterogeneous. The
numerous processing cores have non-uniform access latencies to the main memory
and to the processor caches, which causes variability in the communication
costs. Unfortunately, database systems mostly assume that all processing cores
are the same and that microarchitecture differences are not significant enough
to appear in critical database execution paths. As we demonstrate in this
paper, however, hardware heterogeneity does appear in the critical path and
conventional database architectures achieve suboptimal and even worse,
unpredictable performance. We perform a detailed performance analysis of OLTP
deployments in servers with multiple cores per CPU (multicore) and multiple
CPUs per server (multisocket). We compare different database deployment
strategies where we vary the number and size of independent database instances
running on a single server, from a single shared-everything instance to
fine-grained shared-nothing configurations. We quantify the impact of
non-uniform hardware on various deployments by (a) examining how efficiently
each deployment uses the available hardware resources and (b) measuring the
impact of distributed transactions and skewed requests on different workloads.
Finally, we argue in favor of shared-nothing deployments that are topology- and
workload-aware and take advantage of fast on-chip communication between islands
of cores on the same socket.Comment: VLDB201
Micro-architectural analysis of in-memory OLTP: Revisited
Micro-architectural behavior of traditional disk-based online transaction processing (OLTP) systems has been investigated extensively over the past couple of decades. Results show that traditional OLTP systems mostly under-utilize the available micro-architectural resources. In-memory OLTP systems, on the other hand, process all the data in main-memory and, therefore, can omit the buffer pool. Furthermore, they usually adopt more lightweight concurrency control mechanisms, cache-conscious data structures, and cleaner codebases since they are usually designed from scratch. Hence, we expect significant differences in micro-architectural behavior when running OLTP on platforms optimized for in-memory processing as opposed to disk-based database systems. In particular, we expect that in-memory systems exploit micro-architectural features such as instruction and data caches significantly better than disk-based systems. This paper sheds light on the micro-architectural behavior of in-memory database systems by analyzing and contrasting it to the behavior of disk-based systems when running OLTP workloads. The results show that, despite all the design changes, in-memory OLTP exhibits very similar micro-architectural behavior to disk-based OLTP: more than half of the execution time goes to memory stalls where instruction cache misses or the long-latency data misses from the last-level cache (LLC) are the dominant factors in the overall execution time. Even though ground-up designed in-memory systems can eliminate the instruction cache misses, the reduction in instruction stalls amplifies the impact of LLC data misses. As a result, only 30% of the CPU cycles are used to retire instructions, and 70% of the CPU cycles are wasted to stalls for both traditional disk-based and new generation in-memory OLTP
Instant restore after a media failure
Media failures usually leave database systems unavailable for several hours
until recovery is complete, especially in applications with large devices and
high transaction volume. Previous work introduced a technique called
single-pass restore, which increases restore bandwidth and thus substantially
decreases time to repair. Instant restore goes further as it permits read/write
access to any data on a device undergoing restore--even data not yet
restored--by restoring individual data segments on demand. Thus, the restore
process is guided primarily by the needs of applications, and the observed mean
time to repair is effectively reduced from several hours to a few seconds.
This paper presents an implementation and evaluation of instant restore. The
technique is incrementally implemented on a system starting with the
traditional ARIES design for logging and recovery. Experiments show that the
transaction latency perceived after a media failure can be cut down to less
than a second and that the overhead imposed by the technique on normal
processing is minimal. The net effect is that a few "nines" of availability are
added to the system using simple and low-overhead software techniques
- âŠ