20 research outputs found
KISS-Tree: Smart Latch-Free In-Memory Indexing on Modern Architectures
Growing main memory capacities and an increasing number of hardware threads in modern server systems led to fundamental changes in database architectures. Most importantly, query processing is nowadays performed on data that is often completely stored in main memory. Despite of a high main memory scan performance, index structures are still important components, but they have to be designed from scratch to cope with the specific characteristics of main memory and to exploit the high degree of parallelism. Current research mainly focused on adapting block-optimized B+-Trees, but these data structures were designed for secondary memory and involve comprehensive structural maintenance for updates.
In this paper, we present the KISS-Tree, a latch-free inmemory index that is optimized for a minimum number of memory accesses and a high number of concurrent updates. More specifically, we aim for the same performance as modern hash-based algorithms but keeping the order-preserving nature of trees. We achieve this by using a prefix tree that incorporates virtual memory management functionality and compression schemes. In our experiments, we evaluate the KISS-Tree on different workloads and hardware platforms and compare the results to existing in-memory indexes. The KISS-Tree offers the highest reported read performance on current architectures, a balanced read/write performance, and has a low memory footprint
Recommended from our members
Architectural support for message queue task parallelism
The scaling of threads is an attractive way to exploit task-level parallelism and boost performance. From the perspective of software programming, many applications (e.g., network package processing, SQL queries) could be composite of a set of small tasks. Those tasks are arranged in a data flow graph and each task is undertaken by some threads. Message queues are often used to coordinate the tasks among the threads. On the other side, thread scaling is in favor of the hardware advancing trend that there are more Processing Elements (PE) in modern Chip Multiprocessors (CMP) than ever before. This is because single PE cannot simply run faster due to power and thermal limitations; instead architects have to use more transistors for increasing number of PEs, in order to improve the overall computing power of a processor. Unfortunately, this paradigm using message queues to drive parallel tasks sometime leads to diminishing performance returns due to issues lying in the architecture and system design. Particularly, the conventional coherent shared-memory architectures let task-parallel workloads suffer from unnecessary synchronization overhead and load-to-use latency. For instance, when passing messages through queues, multiple threads could contend for the exclusivity of the cacheline where the shared queue data structure stays. The more threads, the more severe the contention is, because every transition upgrading a cacheline from shared to exclusive state needs to invalidate more copies in the private caches of other cores, and waits for the acknowledgements from more cores. Such a overhead hurts the scalability of threads synchronizing via message queues. Adding to the coherence overhead, the load-to-use latency (from a consumer requesting data until the data being moved to the consumer to use) is often on the critical path, slowing down the computation. This is because the cache hierarchy in modern processors creates some layers of local storage to buffer data separately for different cores. Therefore, serving message queue data in an ondemand manner incurs longer load-to-use latency. It is also challenging to schedule message-driven tasks to use cores efficiently when arrival rate and service rate mismatch. It wastes CPU cycles if a runtime system leaves tasks blocked on full/empty message queues, while switching tasks has additional scheduling overheads. Diverse system topologies further complicate the problem, as the scheduling also needs to take data locality into consideration. This dissertation explores architectural supports for enhancing the scalability of message queue task parallelism, reducing the load-to-use latency, as well as avoiding blocking. Specifically, this dissertation designs and evaluates a message queue architecture that lowers the overhead of synchronization on shared queue states, a speculation technique to hide the load-to-use latency, as well as a locality-aware message queue runtime system with low overhead on scheduling and buffer resizing. The first contribution of the dissertation is Virtual-Link scalable message queue architecture (VL). Instead of having threads access the shared queue state variables (i.e., head, tail, or lock) atomically, VL provides configurable hardware support, providing both data transfer and synchronization. Unlike other hardware queue architectures with dedicated network, VL reuses the existing cache coherence network and delivers a virtualized channel as if there were a direct link (or route) between two arbitrary PEs. VL facilitates efficient synchronized data movement between M:N producers and consumers with several benefits: (i) the number of sharers on synchronization primitives is reduced to zero, eliminating a primary bottleneck of traditional lock-free queues, (ii) memory spills, snoops, and invalidations are reduced, (iii) data stays on the fast path (inside the interconnect) a majority of the time. Another contribution of the dissertation is SPAMeR speculation mechanism. SPAMeR has the capability to speculatively push messages in anticipation of consumer message requests. With the speculation, the latency of moving data from the source to the consumer that needs the data could be partially or fully overlapped with the message processing time. Unlike pre-fetch approaches which predict what addresses to fetch next, with a queue we know exactly what data is needed next but not when it is needed; SPAMeR proposes algorithms to learn from queue operation history in order to predict this. Finally the dissertation contributes ARMQ locality-aware runtime. ARMQ collects a set of approaches that avoids message queue blocking, ranging from the most general yielding, to dynamically resizing the buffer, and to spawning helper tasks. On one hand, ARMQ minimizes the overheads (e.g., wasteful polling, context switch, memory allocation and copying etc.) with a few techniques (e.g., userspace threading, chunk-based ringbuffer etc.) On the other hand, ARMQ schedules the message-driven tasks precisely and opportunely, in order to maximize the data locality preserved (in favor of cache) and balance the resource allocation.Electrical and Computer Engineerin
Cache-Aware Lock-Free Concurrent Hash Tries
This report describes an implementation of a non-blocking concurrent shared-memory hash trie based on single-word compare-and-swap instructions. Insert, lookup and remove operations modifying different parts of the hash trie can be run independent of each other and do not contend. Remove operations ensure that the unneeded memory is freed and that the trie is kept compact. A pseudocode for these operations is presented and a proof of correctness is given -- we show that the implementation is linearizable and lock-free. Finally, benchmarks are presented which compare concurrent hash trie operations against the corresponding operations on other concurrent data structures, showing their performance and scalability
Achieving Efficient Work-Stealing for Data-Parallel Collections
In modern programming high-level data-structures are an important foundation for most applications. With the rise of the multi-core era, there is a growing trend of supporting data-parallel collection operations in general purpose programming languages and platforms. To facilitate object-oriented reuse these operations are highly parametric, incurring abstraction performance penalties. Furthermore, data-parallel operations must scale when used in problems with irregular workloads. Work-stealing is a proven load-balancing technique when it comes to irregular workloads, but general purpose work-stealing also suffers from abstraction penalties. In this paper we present a generic design of a data-parallel collections framework based on work-stealing for shared-memory architectures. We show how abstraction penalties can be overcome through callsite specialization of data-parallel operations instances. Moreover, we show how to make work-stealing fine-grained and efficient when specialized for particular data-structures. We experimentally validate the performance of different data-structures and data-parallel operations, achieving up to 60X better performance with abstraction penalties eliminated and 3X higher speedups by specializing work-stealing compared to existing approaches
Scaling In-Memory databases on multicores
Current computer systems have evolved from featuring only a single processing unit and limited RAM, in the order of kilobytes or few megabytes, to include several multicore processors, o↵ering in the order of several tens of concurrent execution contexts, and have main memory in the order of several tens to hundreds of gigabytes. This allows to keep all data of many applications in the main memory, leading to the development of inmemory databases. Compared to disk-backed databases, in-memory databases (IMDBs) are expected to provide better performance by incurring in less I/O overhead.
In this dissertation, we present a scalability study of two general purpose IMDBs on
multicore systems. The results show that current general purpose IMDBs do not scale
on multicores, due to contention among threads running concurrent transactions. In
this work, we explore di↵erent direction to overcome the scalability issues of IMDBs in
multicores, while enforcing strong isolation semantics.
First, we present a solution that requires no modification to either database systems
or to the applications, called MacroDB. MacroDB replicates the database among several engines, using a master-slave replication scheme, where update transactions execute on the master, while read-only transactions execute on slaves. This reduces contention, allowing MacroDB to o↵er scalable performance under read-only workloads, while updateintensive workloads su↵er from performance loss, when compared to the standalone engine.
Second, we delve into the database engine and identify the concurrency control mechanism used by the storage sub-component as a scalability bottleneck. We then propose a new locking scheme that allows the removal of such mechanisms from the storage sub-component. This modification o↵ers performance improvement under all workloads, when compared to the standalone engine, while scalability is limited to read-only workloads.
Next we addressed the scalability limitations for update-intensive workloads, and
propose the reduction of locking granularity from the table level to the attribute level.
This further improved performance for intensive and moderate update workloads, at a
slight cost for read-only workloads. Scalability is limited to intensive-read and read-only
workloads.
Finally, we investigate the impact applications have on the performance of database
systems, by studying how operation order inside transactions influences the database performance. We then propose a Read before Write (RbW) interaction pattern, under
which transaction perform all read operations before executing write operations. The
RbW pattern allowed TPC-C to achieve scalable performance on our modified engine for all workloads. Additionally, the RbW pattern allowed our modified engine to achieve scalable performance on multicores, almost up to the total number of cores, while enforcing strong isolation
A Framework for Verifying Scalability and Performance of Cloud Based Web Applications
Antud magistritöö uurib võimalusi, kuidas kasutada veebirakendust MediaWiki, mida kasutatakse Wikipedia rakendamiseks, ja kuidas kasutada antud teenust mitme serveri peal nii, et see oleks kõige optimaalsem ja samas kõik veebikülastajad saaks teenusele ligi mõistliku ajaga. Amazon küsib raha pilves toimivate masinate ajalise kasutamise eest, ümardades pooleldi kasutatud tunnid täistundideks. Antud töö sisaldab vahendeid kuidas mõõta pilves olevate serverite jõudlust ning võimekust ja skaleerida antud veebirakendust.
Amazon EC2 pilvesüsteemis on võimalik kasutajatel koostada virtuaalseid tõmmiseid operatsiooni süsteemidest, mida saab pilves rakendada XEN virtualiseerimise keskkonnas, kui eraldiseisvat serverit. Antud virtuaalse tõmmise peale sai paigaldatud tööks vaja minev keskkond, et koguda andmeid serverite kasutuse kohta ja võimaldada platvormi, mis lubab dünaamiliselt ajas lisada servereid ja eemaldada neid.
Magistritöö uurib Amazon EC2 pilvesüsteemi kasutusvõimalusi, mille hulka kuulub Auto Scale, mis aitab skaleerida pilves kasutatavaid rakendusi horisontaalselt. Amazon pilve kasutatakse antud töös MediaWiki seadistamiseks ja suuremahuliste eksperimentide rakendamiseks. Vajalik on teha palju optimiseerimisi ja seadistamisi, et suurendada teenuse läbilaske võimsust. Antud töö raames loodud raamistik aitab mõõta serverite kasutust, kogudes andmeid protsessori, mälu ja võrgu kasutamise kohta. See aitab leida süsteemis olevaid kitsaskohti, mis võivad põhjustada süsteemi olulist aeglustumist.
Antud töö raames sai tehtud erinevaid teste, et selgitada välja parim võimalik paigutus ja seadistus. Saavutatud seadistust kontrolliti hiljem 2 suuremahulise eksperimentiga, mis kestis üks päev ja mille käigus tekitati 22 miljonit päringut, leidmaks kuidas raamistik võimaldab teenust pilves skaleerida ülesse päringute arvu tõusmisel ja vähendada servereid, kui päringute arv väheneb. Ühes eksperimendis kasutati optimaalset heuristikat, et selgitada välja optimaalne serverite arv, mida on vaja pilves rakendada. Teine eksperimentidest kasutas Amazon Auto Scale teenust, mis kasutas serverite keskmist protsessori kasutamist, et selgitada välja, kas pilves on vaja servereid lisada või eemaldada. Antud eksperimendid näitavad selgelt, et kasutades dünaamilist arvu servereid, olenevalt päringute arvust, on võimalik teenuse üleval hoidmiseks säästa raha.Network usage and bandwidth speeds have increased massively and vast majority of people are using Internet on daily bases. This has increased CPU utilization on servers meaning that sites with large visits are using hundreds of computers to accommodate increasing traffic rates to the services. Making plans for hardware ordering to upgrade old servers or to add new servers is not a straightforward process and has to be carefully considered. There is a need to predict traffic rate for future usage. Buying too many servers can mean revenue loss and buying too few servers can result in losing clients. To overcome this problem, it is wise to consider moving services into virtual cloud and make server provisioning as an automatic step. One of the popular cloud service providers, Amazon is giving possibility to use large amounts of computing power for running servers in virtual environment with single click. They are providing services to provision as many servers as needed to run, depending how loaded the servers are and whatever we need to do, to add new servers or to remove existing ones. This will eliminate problems associated with ordering new hardware. Adding new servers is an automatic process and will follow the demand, like adding more servers for peak hours and removing unnecessary servers at night or when the traffic is low. Customer pays only for the used resources on the cloud. This thesis focuses on setting up a testbed for the cloud that will run web application, which will be scaled horizontally (by replicating already running servers) and will use the benchmark tool for stressing out the web application, by simulating huge number of concurrent requests and proper load-balancing mechanisms. This study gives us a proper picture how servers in the cloud are scaled and whole process remains transparent for the end user, as it sees the web application as one server. In conclusion, the framework is helpful in analyzing the performance of cloud based applications, in several of our research activities