Fast and Lean Immutable Multi-Maps on the JVM based on Heterogeneous Hash-Array Mapped Tries
An immutable multi-map is a many-to-many, thread-friendly map data structure
with expected fast insert and lookup operations. This data structure is used
by applications that process graphs or many-to-many relations, as in the
static analysis of object-oriented systems. When processing such big data sets,
the encoding of the data structure itself becomes a memory bottleneck.
Motivated by reuse and type-safety, libraries for Java, Scala and Clojure
typically implement immutable multi-maps by nesting sets as the values
associated with the keys of a trie map. With this design, our measurements show
that the expected overhead for a sparse multi-map adds up to around 65B per
stored entry, which makes it infeasible to compute with effectively on the JVM.
In this paper we propose a general framework for Hash-Array Mapped Tries on
the JVM which can store type-heterogeneous keys and values: a Heterogeneous
Hash-Array Mapped Trie (HHAMT). Among other applications, this allows a
highly efficient multi-map encoding by (a) not reserving space for empty value
sets and (b) inlining the values of singleton sets, while (c) maintaining a
type-safe API.
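To illustrate points (a)-(c), the sketch below shows the singleton-inlining idea only; it is not the paper's HHAMT implementation. A key maps either directly to a single value or, once a second value arrives, to a nested set, while callers see a type-safe API. The class and method names are hypothetical, and a mutable HashMap stands in for the immutable trie purely to keep the example short.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of "inline singleton values": a key maps either to a single V
// (no Set allocated) or, from the second value onwards, to a Set<V>. A real
// HHAMT tracks this heterogeneity inside trie nodes via bitmaps.
final class SketchMultiMap<K, V> {
    // Each slot holds either a V (singleton) or a Set<V> (two or more values).
    private final Map<K, Object> slots = new HashMap<>();

    @SuppressWarnings("unchecked")
    void put(K key, V value) {
        Object current = slots.get(key);
        if (current == null) {
            slots.put(key, value);          // (a)/(b): no Set for zero or one value
        } else if (current instanceof Set) {
            ((Set<V>) current).add(value);  // already promoted to a set
        } else if (!current.equals(value)) {
            Set<V> set = new HashSet<>();   // promote singleton to a set
            set.add((V) current);
            set.add(value);
            slots.put(key, set);
        }
    }

    @SuppressWarnings("unchecked")
    boolean contains(K key, V value) {
        Object current = slots.get(key);
        if (current instanceof Set) return ((Set<V>) current).contains(value);
        return value.equals(current);       // (c): callers never see the raw slot
    }
}
```

In a real HHAMT the value-versus-set distinction is encoded heterogeneously inside the trie nodes, so no wrapper objects or empty sets are allocated.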
We detail the necessary encoding and optimizations to mitigate the overhead
of storing and retrieving heterogeneous data in a hash-trie. Furthermore, we
evaluate HHAMT specifically for multi-maps, comparing it against
state-of-the-art multi-map encodings in Java, Scala and Clojure. We
isolate key differences using microbenchmarks and validate the resulting
conclusions on a real-world case in static analysis. The new encoding brings
the per key-value storage overhead down to 30B: a 2x improvement. With
additional inlining of primitive values it reaches a 4x improvement.
Code Specialization for Memory Efficient Hash Tries
The hash trie data structure is a common component of the standard collection libraries of JVM programming languages such as Clojure and Scala. It enables fast immutable implementations of maps, sets, and vectors, but it requires considerably more memory than an equivalent array-based data structure. This hinders the scalability of functional programs and the further adoption of this otherwise attractive style of programming.
In this paper we present a product family of hash tries. We generate Java source code that specializes them using knowledge of the JVM object memory layout. The number of possible specializations is exponential, so the optimization challenge is to find a minimal set of variants that yields a maximal reduction in memory footprint on any given data. In a set of experiments we measured the distribution of internal tree node sizes in hash tries, and we used the results as guidance to decide which variants of the family to generate and which to leave to the generic implementation.
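The sketch below gives a flavor of what such specializations can look like; it is an illustrative assumption, not the generated code from the paper. A generic node stores its contents in a heap-allocated Object[], while a specialized arity-2 variant inlines its slots as fields, eliminating the array header and one level of indirection.

```java
// Sketch of node specialization (not the paper's generated sources): a code
// generator can emit fixed-arity classes like Node2 for frequent node sizes
// and fall back to the array-based GenericNode for the rest.
interface TrieNode {
    Object slot(int index);
    int arity();
}

final class GenericNode implements TrieNode {
    private final Object[] slots;                 // extra array object per node
    GenericNode(Object... slots) { this.slots = slots; }
    public Object slot(int index) { return slots[index]; }
    public int arity() { return slots.length; }
}

final class Node2 implements TrieNode {           // specialized: no array at all
    private final Object slot0, slot1;
    Node2(Object slot0, Object slot1) { this.slot0 = slot0; this.slot1 = slot1; }
    public Object slot(int index) { return index == 0 ? slot0 : slot1; }
    public int arity() { return 2; }
}
```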
A preliminary validating experiment on the implementation of sets and maps shows that this technique leads to a median decrease of 55% in memory footprint for maps (and 78% for sets), while maintaining comparable performance. Our combination of data analysis and code specialization proved to be effective.
Stateful Testing: Finding More Errors in Code and Contracts
Automated random testing has been shown to be an effective approach to finding
faults, but it still faces a major unsolved issue: how to generate test inputs
diverse enough to find many faults and to find them quickly. Stateful testing, the
automated testing technique introduced in this article, generates new test
cases that improve an existing test suite. The generated test cases are
designed to violate the dynamically inferred contracts (invariants)
characterizing the existing test suite. As a consequence, they are in a good
position to detect new errors, and also to improve the accuracy of the inferred
contracts by discovering those that are unsound. Experiments on 13 data
structure classes totalling over 28,000 lines of code demonstrate the
effectiveness of stateful testing in improving over the results of long
sessions of random testing: stateful testing found 68.4% new errors and
improved the accuracy of automatically inferred contracts to over 99%, with
just a 7% time overhead.
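As a hypothetical illustration of the idea (the class, inferred contract, and test below are invented for this sketch and do not come from the paper): suppose the existing test suite never filled a bounded stack, so dynamic inference concluded that push always increases size. A stateful test reuses the object state reached by an existing test and extends it with calls aimed at violating that inferred contract, exposing it as unsound.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: extend the state reached by an existing test with calls
// that try to violate a dynamically inferred contract ("push increases size"),
// which only held because prior tests never filled the stack.
final class BoundedStack {
    private final Deque<Integer> items = new ArrayDeque<>();
    private final int capacity;
    BoundedStack(int capacity) { this.capacity = capacity; }
    void push(int x) { if (items.size() < capacity) items.push(x); } // silently drops when full
    int size() { return items.size(); }
}

final class StatefulTestSketch {
    public static void main(String[] args) {
        BoundedStack stack = new BoundedStack(3);
        stack.push(1);
        stack.push(2);                       // state reached by the original test suite

        // Generated continuation: keep pushing and check the inferred contract.
        for (int i = 0; i < 5; i++) {
            int before = stack.size();
            stack.push(42);
            if (stack.size() != before + 1) {
                System.out.println("inferred contract 'size increases on push' is unsound");
                return;
            }
        }
    }
}
```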
Towards Zero-Overhead Disambiguation of Deep Priority Conflicts
**Context** Context-free grammars are widely used for language prototyping
and implementation. They allow formalizing the syntax of domain-specific or
general-purpose programming languages concisely and declaratively. However, the
natural and concise way of writing a context-free grammar is often ambiguous.
Therefore, grammar formalisms support extensions in the form of *declarative
disambiguation rules* to specify operator precedence and associativity, solving
ambiguities that are caused by the subset of the grammar that corresponds to
expressions.
**Inquiry** Implementing support for declarative disambiguation within a
parser typically comes with one or more of the following limitations in
practice: reduced parsing performance, or a lack of modularity (i.e., the
inability to compose grammar fragments of potentially different
languages). The latter concern is generally addressed by scannerless
generalized parsers. We aim to equip scannerless generalized parsers with novel
disambiguation methods that are inherently performant, without compromising
modularity and language composition.
**Approach** In this paper, we present a novel low-overhead implementation
technique for disambiguating deep associativity and priority conflicts in
scannerless generalized parsers with lightweight data-dependency.
**Knowledge** Ambiguities with respect to operator precedence and
associativity arise from combining the various operators of a language. While
*shallow conflicts* can be resolved efficiently by one-level tree patterns,
*deep conflicts* require more elaborate techniques, because they can occur
arbitrarily nested in a tree. Current state-of-the-art approaches to solving
deep priority conflicts come with a severe performance overhead.
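To make the shallow/deep distinction concrete, consider a hypothetical expression grammar (not taken from the paper's corpus) in which a prefix if-then construct binds weaker than infix +, so in `a + if b then c + d` the if-branch must extend over `c + d`. In the forbidden parse, the If node sits somewhere on the rightmost spine of the left operand of a +, at arbitrary depth, which is why a one-level parent/child pattern cannot rule it out. The sketch below merely checks an AST for that pattern to show the conflict's shape; it is unrelated to the paper's data-dependent parsing technique.

```java
// Sketch of the deep-conflict shape: Add(Add(a, If(b, c)), d) for
// "a + if b then c + d" is forbidden because an If ends up on the rightmost
// spine of the LEFT operand of a "+". The If can sit arbitrarily deep on that
// spine, so the check walks the spine instead of inspecting one parent/child pair.
sealed interface Exp permits Add, If, Id {}
record Add(Exp left, Exp right) implements Exp {}
record If(Exp cond, Exp then) implements Exp {}
record Id(String name) implements Exp {}

final class DeepConflict {
    static boolean hasDeepConflict(Exp e) {
        if (e instanceof Add add) {
            return endsInIf(add.left()) || hasDeepConflict(add.left()) || hasDeepConflict(add.right());
        }
        if (e instanceof If anIf) {
            return hasDeepConflict(anIf.cond()) || hasDeepConflict(anIf.then());
        }
        return false;
    }

    private static boolean endsInIf(Exp e) {       // walk the rightmost spine
        if (e instanceof If) return true;
        if (e instanceof Add add) return endsInIf(add.right());
        return false;
    }

    public static void main(String[] args) {
        Exp ok  = new Add(new Id("a"), new If(new Id("b"), new Add(new Id("c"), new Id("d"))));
        Exp bad = new Add(new Add(new Id("a"), new If(new Id("b"), new Id("c"))), new Id("d"));
        System.out.println(hasDeepConflict(ok));   // false
        System.out.println(hasDeepConflict(bad));  // true
    }
}
```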
**Grounding** We evaluated our new approach against state-of-the-art
declarative disambiguation mechanisms. By parsing a corpus of popular
open-source repositories written in Java and OCaml, we found that our approach
yields speedups of up to 1.73x over a grammar-rewriting technique when parsing
programs with deep priority conflicts, at a modest overhead of 1-2% when
parsing programs without deep conflicts.
**Importance** A recent empirical study shows that deep priority conflicts
are indeed widespread in real-world programs: in a corpus of popular OCaml
projects on GitHub, up to 17% of the source files contain deep priority
conflicts. However, no solution in the literature addresses efficient
disambiguation of deep priority conflicts while supporting modular and
composable syntax definitions.