Designing scalable database management systems on modern hardware has been a challenge for almost a decade. Hardware trends oblige software to overcome three major challenges to system scalability: (1) exploiting the abundant thread-level parallelism provided by multicores, (2) achieving predictably efficient execution despite the variability in communication latencies among cores on multisocket multicores, and (3) taking advantage of aggressive micro-architectural features.
INTRODUCTION
Length: 3 hours
Target Audience: Researchers and developers in the field of data management systems who are non-experts on modern hardware and on the challenges that emerging hardware poses for high-performance transaction and query processing, and PhD students who are interested in learning more about the underlying hardware and are seeking a challenging, high-impact research topic in data management systems.
Related Previous Tutorials: The first part of this tutorial, scaling up on multicores, was presented as part of the VLDB 2013 tutorial titled Toward Scalable Transaction Processing - Evolution of Shore-MT [1]. This tutorial, however, has a broader scope and includes a range of data management systems and hardware platforms. More specifically, it surveys the concept of scalability for data management systems not just on multicores with uniform access latencies but also on multisockets with non-uniform memory accesses (NUMA) and at the micro-architectural level. In addition, it includes examples from a broader range of storage managers, not just from Shore-MT.
SIGMOD '14, June 22-27, 2014.
THREAD-LEVEL PARALLELISM
Since 2005, in step with Moore's Law, hardware has offered more and more opportunities for parallelism rather than faster processors. Exploiting parallelism is crucial for utilizing the available architectural resources and enabling faster software. However, designing scalable systems that can take advantage of the underlying parallelism remains a challenge. In traditional high-performance transaction processing, the inherent communication leads to scalability bottlenecks on today's multicore and multisocket hardware. Even systems that scale very well on one generation of multicores might fail to scale up on the next generation. On the other hand, in traditional online analytical processing, the database operators that were designed for unicore processors fail to exploit the abundant parallelism offered by modern hardware.
In this first part of the tutorial, we first teach a methodology for scaling up transaction processing systems on multicore hardware. More specifically, we identify three types of communication in a typical transaction processing system: unbounded, fixed, and cooperative [17]. We demonstrate that the key to achieving scalability on modern hardware, especially for transaction processing systems but also for any system that has similar communication patterns, is avoiding the unbounded communication points or downgrading them into fixed or cooperative ones. We show how effective our methodology is in practice by surveying related proposals from recent work (e.g., [10, 18, 21, 27, 28, 30]).
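As a rough illustration of downgrading an unbounded communication point into a fixed one, the sketch below contrasts a single counter guarded by one lock, which every worker thread must acquire, with per-worker counters that are merged once at the end. The function names and the counter workload are hypothetical and are not taken from any of the surveyed systems; reading "unbounded" as contention that grows with the number of threads is an assumption made for this example.

```python
import threading


def count_shared(num_threads, increments):
    """Unbounded pattern (assumed reading): all threads contend on one lock."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(increments):
            with lock:  # every thread serializes on this single lock
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def count_partitioned(num_threads, increments):
    """Fixed pattern (assumed reading): each thread owns a private slot."""
    slots = [0] * num_threads

    def worker(i):
        local = 0
        for _ in range(increments):
            local += 1  # hot loop touches no shared state
        slots[i] = local  # one write to a slot no other thread uses

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(slots)  # single merge step after all workers have finished
```

Both functions return the same total; the difference lies in how many threads meet at each communication point. The sketch shows the structure only and makes no claim about measured performance, particularly in Python, where the interpreter itself serializes threads.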
Traditional online analytical processing, however, consists of read-only queries. Therefore, it does not suffer from unbounded communication as transaction processing does. On the other hand, database operators such as joins and scans are mainly optimized for single-threaded execution. Therefore, they fail to exploit intra-query parallelism and cannot utilize several cores naïvely. In this tutorial, we also survey the recent techniques that aim at parallelizing traditional database operations and exploring work and data sharing opportunities among concurrent queries (e.g., [6, 13, 15, 24]).
NON-UNIFORM MEMORY ACCESSES
Data management applications traditionally run on the highest performing servers of the day. Until recently, such servers had uniform core-to-core communication latencies: multisocket uniprocessors communicate slowly with each other, and cores on a multicore communicate fast. Now, with multisocket multicores, for the first time we have Islands, i.e., groups of cores that communicate fast among themselves and slower with other groups. Currently, an Island is represented by a processor socket but soon, with dozens of cores on the same socket, we expect that Islands will form within a chip. In this setting, memory access times vary greatly depending on several factors, including the latency to access remote memory and contention for the memory hierarchy, such as the shared last-level caches, the memory controllers, and the interconnect bandwidth.
In the context of transaction processing, it can be appealing to regard a multisocket server as a distributed system and deploy multiple nodes in a shared-nothing configuration [18, 27]. While this approach works well for perfectly partitionable workloads, it is very sensitive to distributed transactions and workload skew. At the same time, hardware-oblivious shared-everything systems suffer from non-uniform latencies that amplify bottlenecks in the critical path [23]. First, we present a set of best practices for choosing a good configuration based on different properties of the workload and the hardware topology. Then, we present a system that achieves scalability on multisockets by utilizing hardware topology-aware data structures and dynamically adapting to the workload and hardware [22].
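To make the sensitivity to distributed transactions concrete, the following minimal sketch classifies transactions as local or distributed under a shared-nothing deployment. The modulo partitioning scheme, the integer keys, and all function names are assumptions chosen for illustration; real deployments typically partition on workload-specific attributes.

```python
def partition_of(key, num_instances):
    """Map an integer key to an instance (hypothetical modulo partitioning)."""
    return key % num_instances


def is_distributed(txn_keys, num_instances):
    """A transaction is distributed if its keys live on more than one instance."""
    return len({partition_of(k, num_instances) for k in txn_keys}) > 1


def distributed_fraction(transactions, num_instances):
    """Fraction of transactions that need cross-instance coordination."""
    if not transactions:
        return 0.0
    distributed = sum(is_distributed(t, num_instances) for t in transactions)
    return distributed / len(transactions)
```

For example, with four instances the key sets `[0, 4]` and `[1, 5]` each stay on one instance, while `[2, 3]` and `[6, 7]` span two, so half of these transactions are distributed. With a single shared-everything instance, none are.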
On the other hand, analytical workloads consist of ad-hoc, long-running, and scan-heavy queries over relatively static data. In order to optimize performance, the execution engine needs to become NUMA-aware by tackling two main challenges: (a) employing a scheduling strategy for assigning multiple concurrent threads to cores in order to minimize remote memory accesses while avoiding contention on the memory hierarchy, and (b) dynamically deciding on the data placement in order to minimize the total memory access time of the workload. The two problems are not orthogonal, as data placement can affect scheduling decisions, while scheduling strategies need to take into account data placement. We review the requirements and recent techniques for highly concurrent NUMA-aware analytics that take into consideration data locality, parallelism, and resource allocation (e.g., [2, 5, 9, 20, 25]).
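The interplay between scheduling and data placement can be sketched with a toy cost model. The code below counts remote accesses for a given task-to-socket assignment and uses a simple greedy heuristic that places each task on the socket holding most of its data, subject to a per-socket slot limit that stands in for contention avoidance. The data layout, the greedy policy, and all names are hypothetical; this is not the algorithm of any surveyed system.

```python
def remote_accesses(task_data, placement, assignment):
    """Count accesses that cross sockets.

    task_data:  {task: {partition: number_of_accesses}}
    placement:  {partition: socket whose memory holds the partition}
    assignment: {task: socket the task runs on}
    """
    return sum(
        n
        for task, parts in task_data.items()
        for part, n in parts.items()
        if placement[part] != assignment[task]
    )


def schedule(task_data, placement, num_sockets, slots_per_socket):
    """Greedy heuristic: heaviest tasks first, each on the socket with
    free slots that holds the largest share of the task's data."""
    free = [slots_per_socket] * num_sockets
    assignment = {}
    order = sorted(task_data, key=lambda t: -sum(task_data[t].values()))
    for task in order:
        local = [0] * num_sockets
        for part, n in task_data[task].items():
            local[placement[part]] += n
        candidates = [s for s in range(num_sockets) if free[s] > 0]
        if not candidates:
            raise ValueError("more tasks than available slots")
        best = max(candidates, key=lambda s: local[s])
        assignment[task] = best
        free[best] -= 1  # cap tasks per socket to limit contention
    return assignment
```

With partitions `A` on socket 0 and `B` on socket 1, and tasks `t1` (100 accesses to `A`), `t2` (80 to `A`, 20 to `B`), and `t3` (50 to `B`), the greedy assignment leaves 20 remote accesses, whereas a placement-oblivious round-robin assignment leaves 130. Moving a partition changes the best assignment, which is why the two decisions cannot be made independently.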
MICRO-ARCHITECTURAL BEHAVIOR
Recent studies analyzing the micro-architectural behavior of OLTP workloads on modern hardware emphasize that OLTP exploits modern micro-architectural resources very poorly. More than half of the execution time goes to memory stalls [11]; as a result, on processors that can execute four instructions per cycle, the most common configuration on modern commodity hardware, OLTP achieves around one instruction per cycle (IPC) [29]. Such underutilization of micro-architectural features is a great waste of hardware resources.
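The figures above can be related by a back-of-the-envelope calculation. Let $W$ be the issue width, $f$ the fraction of execution time spent in memory stalls, and CPI the cycles per instruction (the inverse of IPC). The decomposition below assumes that stall cycles do not overlap with useful work, which is a simplification rather than a claim made by the cited studies.

```latex
\[
U = \frac{\mathrm{IPC}}{W} \approx \frac{1}{4} = 25\%
\]
\[
\mathrm{CPI} = \mathrm{CPI}_{\text{exec}} + \mathrm{CPI}_{\text{stall}},
\qquad
\mathrm{CPI}_{\text{stall}} = f \cdot \mathrm{CPI}
\]
\[
\mathrm{IPC}_{\text{no stall}} = \frac{1}{(1 - f)\,\mathrm{CPI}}
\;\ge\; \frac{1}{0.5 \cdot 1} = 2
\quad \text{for } f \ge 0.5,\ \mathrm{CPI} \approx 1
\]
```

Under these assumptions, OLTP uses about a quarter of the available issue bandwidth, and removing memory stalls alone would at least double the IPC, still short of the peak of $W = 4$.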
Several proposals have been made to reduce memory stalls by improving instruction and data locality to increase cache hit rates. For data, these range from cache-conscious data structures and algorithms [8] to sophisticated data partitioning and thread scheduling [22]. For instructions, they range from compilation optimizations [26] and advanced prefetching [12] to computation spreading [3, 7] and transaction batching [4, 14]. We illustrate the strengths and weaknesses of each technique with examples from recent work and present the key insights behind each of them.
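As one hedged illustration of the idea behind transaction batching for instructions, the sketch below reorders a stream of transactions so that transactions of the same type run back to back, and counts type switches as a crude proxy for instruction cache turnover. The transaction type names and the switch-count metric are assumptions made for this example, not the mechanism of any cited proposal.

```python
from itertools import groupby


def type_switches(txn_types):
    """Count how often two consecutive transactions differ in type."""
    runs = sum(1 for _ in groupby(txn_types))
    return max(runs - 1, 0)


def batch_by_type(txn_types):
    """Group same-type transactions together, keeping the order in which
    each type first appears in the stream."""
    batches = {}
    for t in txn_types:
        batches.setdefault(t, []).append(t)
    return [t for batch in batches.values() for t in batch]
```

For the arrival order `["NewOrder", "Payment", "NewOrder", "Payment", "NewOrder"]` there are four type switches; after batching there is one. The fewer the switches, the more often a transaction is likely to find the instructions it needs already cached by its predecessor.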
In addition, several recent proposals opt for hardware specialization for some database operations [16, 19, 31]. We briefly go over these techniques and emphasize their impact on emerging hardware technologies.
TUTORIAL OUTLINE
• INTRODUCTION AND OVERVIEW (15 minutes)
• Tutorial overview: goal, audience, and schedule
• Hardware trends
• Problem statement:
• three dimensions of scalability
• challenges traditional data management systems face on modern hardware
• EXPLOITING THREAD-LEVEL PARALLELISM (45 minutes)
• Scaling up OLTP
• Communication types in transaction processing
• Recent work on scaling up OLTP on modern hardware
• Mapping state-of-the-art design principles to the communication types they eliminate
• Intra- & Inter-Query Parallelism
• Revisiting database operators on multicores
• Exploiting sharing opportunities among concurrent queries
• NUMA-AWARE OLTP (30 minutes)
• Assumptions that modern NUMA server hardware changes for data management systems
• Quantifying the impact of non-uniform communication on OLTP performance using various design options and workloads
• Dynamically adjusting to the hardware topology and workload characteristics while designing transaction processing systems that can scale across sockets
• NUMA-AWARE OLAP (30 minutes)
• Memory access bottlenecks in multisocket multicore architectures
• NUMA-aware analytical algorithms
• Outline of the requirements of a NUMA-aware execution engine for highly concurrent analytical workloads
• MICRO-ARCHITECTURAL UTILIZATION (50 minutes)
• Results from recent workload characterization studies
• Techniques to improve data cache locality
• Techniques to improve instruction cache locality
• Toward specialized hardware
• CONCLUSIONS AND FUTURE DIRECTIONS (10 minutes)
BIOGRAPHY
Iraklis Psaroudakis is a third-year PhD student at École polytechnique fédérale de Lausanne (EPFL), working under the supervision of Prof. Anastasia Ailamaki in the Data-Intensive Applications and Systems (DIAS) Laboratory. His research focuses on scheduling highly concurrent analytical workloads, and he also cooperates with the SAP HANA database team. He received his diploma from the School of Electrical and Computer Engineering of the National Technical University of Athens.

