    Fast and Lean Immutable Multi-Maps on the JVM based on Heterogeneous Hash-Array Mapped Tries

    An immutable multi-map is a many-to-many thread-friendly map data structure with expected fast insert and lookup operations. This data structure is used for applications processing graphs or many-to-many relations as applied in static analysis of object-oriented systems. When processing such big data sets, the memory overhead of the data structure encoding itself becomes a memory usage bottleneck. Motivated by reuse and type-safety, libraries for Java, Scala and Clojure typically implement immutable multi-maps by nesting sets as the values with the keys of a trie map. With this design, based on our measurements, the expected overhead for a sparse multi-map adds up to around 65B per stored entry, which renders it infeasible to compute with effectively on the JVM. In this paper we propose a general framework for Hash-Array Mapped Tries on the JVM which can store type-heterogeneous keys and values: a Heterogeneous Hash-Array Mapped Trie (HHAMT). Among other applications, this allows for a highly efficient multi-map encoding by (a) not reserving space for empty value sets and (b) inlining the values of singleton sets while maintaining a (c) type-safe API. We detail the necessary encoding and optimizations to mitigate the overhead of storing and retrieving heterogeneous data in a hash-trie. Furthermore, we evaluate HHAMT specifically for the application to multi-maps, comparing it to state-of-the-art encodings of multi-maps in Java, Scala and Clojure. We isolate key differences using microbenchmarks and validate the resulting conclusions on a real-world case in static analysis. The new encoding brings the per key-value storage overhead down to 30B: a 2x improvement. With additional inlining of primitive values it reaches a 4x improvement.
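
    The core trick described above, inlining singleton values instead of always nesting a set, can be illustrated with a minimal Java sketch (hypothetical names, and a plain HashSet stands in for the paper's compact bitmap-based node layout):

        // Sketch of the singleton-inlining idea behind an HHAMT-style
        // multi-map slot; illustrative only, not the paper's actual encoding.
        import java.util.HashSet;
        import java.util.Set;

        final class MultiMapSlot<V> {
            // A slot holds either one inlined value or a nested set of values;
            // empty value sets are never materialized at all.
            private final Object payload; // either a V or a Set<V>

            private MultiMapSlot(Object payload) { this.payload = payload; }

            static <V> MultiMapSlot<V> singleton(V value) {
                return new MultiMapSlot<>(value);
            }

            @SuppressWarnings("unchecked")
            MultiMapSlot<V> insert(V value) {
                if (payload instanceof Set) {             // already a multi-mapping
                    Set<V> copy = new HashSet<>((Set<V>) payload);
                    copy.add(value);
                    return new MultiMapSlot<>(copy);
                }
                if (payload.equals(value)) return this;   // duplicate entry
                Set<V> promoted = new HashSet<>();        // promote singleton to set
                promoted.add((V) payload);
                promoted.add(value);
                return new MultiMapSlot<>(promoted);
            }
        }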

    Towards a software product line of trie-based collections

    Collection data structures in standard libraries of programming languages are designed to excel for the average case by carefully balancing memory footprint and runtime performance. These implicit design decisions and hard-coded trade-offs constrain users from choosing an optimal variant for a given problem. Although a wide range of specialized collections is available for the Java Virtual Machine (JVM), they introduce yet another dependency and complicate user adoption by requiring specific Application Programming Interfaces (APIs) incompatible with the standard library. A product line for collection data structures would relieve library designers from optimizing for the general case. Furthermore, a product line allows the potentially large code base of a collection family to evolve efficiently. The challenge is to find a small core framework for collection data structures which covers all variations without exhaustively listing them, while supporting good performance at the same time. We claim that the concept of Array Mapped Tries (AMTs) embodies a high degree of commonality in the sub-domain of immutable collection data structures. AMTs are flexible enough to cover most of the variability, while minimizing code bloat in the generator and the generated code. We implemented a Data Structure Code Generator (DSCG) that emits immutable collections based on an AMT skeleton foundation. The generated data structures outperform competitive hand-optimized implementations, and the generator still allows for customization towards specific workloads.
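
    The commonality that makes such a product line feasible is the bitmap-plus-compressed-array node shared by all AMT variants. A minimal sketch of that core, assuming a 32-way trie (names are illustrative, not DSCG's generated code):

        // Core of an Array Mapped Trie node: a bitmap records which of the 32
        // logical slots are present, and only those are stored in the array.
        final class AmtNode {
            final int bitmap;      // bit i set <=> logical slot i is populated
            final Object[] slots;  // compressed: one entry per set bit

            AmtNode(int bitmap, Object[] slots) {
                this.bitmap = bitmap;
                this.slots = slots;
            }

            Object child(int index) {  // index is a 5-bit chunk of the hash
                int bit = 1 << index;
                if ((bitmap & bit) == 0) return null;  // slot absent
                // rank of the bit = physical position in the compressed array
                return slots[Integer.bitCount(bitmap & (bit - 1))];
            }
        }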

    To-many or to-one? All-in-one! Efficient purely functional multi-maps with type-heterogeneous hash-tries

    An immutable multi-map is a many-to-many map data structure with expected fast insert and lookup operations. This data structure is used for applications processing graphs or many-to-many relations as applied in compilers, runtimes of programming languages, or in static analysis of object-oriented systems. Collection data structures are expected to carefully balance the execution time of operations with memory consumption characteristics, and need to scale gracefully from a few elements to multiple gigabytes at least. When processing larger in-memory data sets, the overhead of the data structure encoding itself becomes a memory usage bottleneck, dominating the overall performance. In this paper we propose AXIOM, a novel hash-trie data structure that allows for a highly efficient and type-safe multi-map encoding by distinguishing inlined values of singleton sets from nested sets of multi-mappings.
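
    On the lookup side, that distinction has to be undone before values are handed back to the caller. A hedged sketch (AXIOM's internals discriminate slot kinds via bitmaps; the instanceof check here is only for illustration):

        import java.util.Collections;
        import java.util.Set;

        final class SlotView {
            // Present a slot uniformly as a set, whatever its physical encoding.
            @SuppressWarnings("unchecked")
            static <V> Set<V> valuesFor(Object slot) {
                if (slot == null) return Collections.emptySet();  // no entry stored
                if (slot instanceof Set) return (Set<V>) slot;    // nested multi-mapping
                return Collections.singleton((V) slot);           // inlined singleton
            }
        }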

    Persistent Data Structures for Incremental Join Indices

    Join indices are used in relational databases to make join operations faster. Join indices essentially materialise the results of join operations and therefore accrue maintenance costs, which makes them more suitable for use cases where modifications are rare and joins are performed frequently. To lower the maintenance cost, incrementally updating existing indices is preferable. This thesis explores the use of persistent data structures for join indices. The motivation for this research was the ability of persistent data structures to construct multiple, partially different versions of the same data structure memory-efficiently. This is useful because different versions of join indices can exist simultaneously due to the use of multi-version concurrency control (MVCC) in a database. The techniques used in the Relaxed Radix Balanced Tree (RRB-Tree) persistent data structure were found promising, but none of the popular implementations were found directly suitable for the use case. This exploration was done in the context of a particular proprietary embedded in-memory columnar multidimensional database called FastormDB developed by RELEX Solutions. This focused the research on Java Virtual Machine (JVM) based data structures, as the implementation of FastormDB is in Java. Multiple persistent data structures implemented for the thesis, together with ones from Scala, Clojure and Paguro, were evaluated with Java Microbenchmark Harness (JMH) and Java Object Layout (JOL) based benchmarks, and their results analysed via visualisations.
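
    The property that motivated this choice, cheap coexisting versions through structural sharing, can be shown with a toy example (a cons list rather than an RRB-Tree, and hypothetical names; the thesis' structures are far more elaborate):

        // Each insert returns a NEW version; older versions stay valid and
        // share all of their structure with the new one (nothing is copied).
        final class IndexVersion {
            final int key;
            final long rowId;
            final IndexVersion next; // shared tail

            IndexVersion(int key, long rowId, IndexVersion next) {
                this.key = key; this.rowId = rowId; this.next = next;
            }

            IndexVersion insert(int key, long rowId) {
                return new IndexVersion(key, rowId, this);
            }
        }

        // Usage: two MVCC transactions can hold v1 and v2 at the same time.
        //   IndexVersion v1 = new IndexVersion(1, 100L, null);
        //   IndexVersion v2 = v1.insert(2, 200L); // v1 is untouched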

    Benchmark-driven Software Performance Optimization

    Software systems are an integral part of modern society. As we continue to harness software automation in all aspects of our daily lives, the runtime performance of these systems becomes increasingly important. When everything seems just a click away, performance issues that compromise the responsiveness of a system can lead to severe financial and reputational losses. Designing efficient code is critical for ensuring good and consistent performance of software systems. It requires performance expertise, and encompasses a set of difficult design decisions that need to be continuously revisited throughout the evolution of the software. Developers must test the performance of their core implementations, select efficient data structures and algorithms, and explore parallel processing when it provides performance benefits, among many other aspects. Furthermore, the constant pressure for high productivity placed on developers, combined with the increasing complexity of modern software, makes designing efficient code an even more challenging endeavor. This thesis presents a series of novel approaches, based on empirical insights, that attempt to support developers in the task of designing efficient code. We present contributions in three aspects. First, we investigate the prevalence and impact of bad practices in performance benchmarks of Java-based open-source software. We show that not only do these bad practices occur frequently, they often distort the benchmark results substantially. Moreover, we devise a tool that developers can use to identify bad practices automatically during benchmark creation. Second, we design an application-level framework that identifies suboptimal implementations and selects optimized variants at runtime, effectively optimizing the execution time and memory usage of the target application. Furthermore, we investigate the performance of data structures from several popular collection libraries. Our findings show that alternative variants can be selected for substantial performance improvements under specific usage scenarios. Third, we investigate the parallelization of object processing via Java streams. We propose a decision-support framework that leverages machine-learning models, trained through a series of benchmarks, to identify and report stream pipelines that should be processed in parallel for better performance.
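
    One classic benchmark bad practice of the kind studied here, letting the JIT compiler eliminate an unused result, is easy to reproduce with JMH (a generic illustration, not a benchmark taken from the thesis):

        import org.openjdk.jmh.annotations.Benchmark;
        import org.openjdk.jmh.annotations.Scope;
        import org.openjdk.jmh.annotations.State;
        import org.openjdk.jmh.infra.Blackhole;

        @State(Scope.Thread)
        public class SumBenchmark {
            int[] data = java.util.stream.IntStream.range(0, 1_000).toArray();

            @Benchmark
            public void bad() {
                int sum = 0;
                for (int x : data) sum += x;
                // BAD: 'sum' is dead code, so the JIT may remove the loop
                // entirely and the benchmark measures nothing.
            }

            @Benchmark
            public void good(Blackhole bh) {
                int sum = 0;
                for (int x : data) sum += x;
                bh.consume(sum); // keeps the computation observable
            }
        }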

    Processing of Persistent Data Structures

    This master's thesis examines theoretical and practical methods of transforming data structures into persistent ones, along with methods for processing them. The goal of the work is to develop software for processing fully persistent data structures, namely a stack, a queue and a segment tree, using efficient algorithms for their construction. To achieve this goal, software was developed in the form of a dynamic-link library that allows the use of persistent data structures with optimal execution time and memory for software development tasks.
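
    For the segment tree, full persistence follows from path copying: an update rebuilds only the O(log n) nodes between the root and the changed leaf. A minimal sketch (illustrative names, not the thesis' library code):

        // Persistent segment tree over sums; 'set' returns a new root while
        // every previously obtained root remains a consistent, queryable version.
        final class SegNode {
            final long sum;
            final SegNode left, right;

            SegNode(long sum, SegNode left, SegNode right) {
                this.sum = sum; this.left = left; this.right = right;
            }

            static SegNode build(int lo, int hi) {
                if (lo == hi) return new SegNode(0, null, null);
                int mid = (lo + hi) / 2;
                SegNode l = build(lo, mid), r = build(mid + 1, hi);
                return new SegNode(l.sum + r.sum, l, r);
            }

            static SegNode set(SegNode n, int lo, int hi, int idx, long value) {
                if (lo == hi) return new SegNode(value, null, null);
                int mid = (lo + hi) / 2;
                SegNode l = n.left, r = n.right;
                if (idx <= mid) l = set(n.left, lo, mid, idx, value);
                else            r = set(n.right, mid + 1, hi, idx, value);
                return new SegNode(l.sum + r.sum, l, r); // copy only the path
            }
        }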

    Graph databases and their application to the Italian Business Register for efficient search of relationships among companies

    We studied and tested three of the major graph databases and compared them with a relational database. We worked on a dataset representing equity participations among companies, and we found that the strong points of graph databases are their purpose-built storage techniques and their query languages. The main performance gains were obtained when querying heavily connected graph structures; for simpler situations and queries, a relational database performs equally well.