8 research outputs found
Fast and Lean Immutable Multi-Maps on the JVM based on Heterogeneous Hash-Array Mapped Tries
An immutable multi-map is a thread-friendly many-to-many map data structure
with expected fast insert and lookup operations. This data structure is used
for applications processing graphs or many-to-many relations as applied in
static analysis of object-oriented systems. When processing such large data
sets, the memory overhead of the data structure encoding itself becomes a
bottleneck. Motivated by reuse and type safety, libraries for Java, Scala, and
Clojure typically implement immutable multi-maps by nesting sets as the
values under the keys of a trie map. With this design, our measurements show
that the expected byte overhead per stored entry of a sparse multi-map adds up
to around 65B, which makes it infeasible to compute with such data sets
effectively on the JVM.
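To make the baseline concrete, the following minimal Java sketch (not any
particular library's implementation) shows the conventional nested encoding,
where even a key with a single value pays for a full set object:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Conventional nested encoding: every key maps to a full Set object, even
// when it holds a single value; the set's own header, hash table, and entry
// objects account for much of the measured per-entry overhead.
public class NestedMultimap {
    public static void main(String[] args) {
        Map<String, Set<Integer>> multimap = new HashMap<>();
        multimap.computeIfAbsent("key", k -> new HashSet<>()).add(1);
        System.out.println(multimap); // {key=[1]}
    }
}
```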
In this paper we propose a general framework for Hash-Array Mapped Tries on
the JVM which can store type-heterogeneous keys and values: a Heterogeneous
Hash-Array Mapped Trie (HHAMT). Among other applications, this allows for a
highly efficient multi-map encoding by (a) not reserving space for empty
value sets and (b) inlining the values of singleton sets, while (c)
maintaining a type-safe API.
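As a rough illustration of (a) and (b), the hypothetical wrapper below inlines
a singleton value directly and promotes it to a set only on a second
insertion; the paper's HHAMT achieves this inside the trie nodes themselves
rather than through a wrapper, so treat this purely as a sketch of the
encoding idea:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: heterogeneous payloads behind a type-safe API. Each key maps to
// either a single V (inlined) or a Set<V>; assumes V is itself never a Set.
public final class InlinedMultimap<K, V> {
    private final Map<K, Object> store = new HashMap<>();

    @SuppressWarnings("unchecked")
    public void put(K key, V value) {
        store.merge(key, value, (old, v) -> {
            if (old instanceof Set<?>) {       // already promoted
                ((Set<V>) old).add((V) v);
                return old;
            }
            Set<V> set = new HashSet<>();      // promote singleton to a set
            set.add((V) old);
            set.add((V) v);
            return set;
        });
    }

    @SuppressWarnings("unchecked")
    public Set<V> get(K key) {
        Object payload = store.get(key);
        if (payload == null) return Set.of();        // (a) no empty set stored
        if (payload instanceof Set<?>) return (Set<V>) payload;
        return Set.of((V) payload);                  // (b) singleton was inlined
    }

    public static void main(String[] args) {
        InlinedMultimap<String, Integer> m = new InlinedMultimap<>();
        m.put("a", 1);
        System.out.println(m.get("a")); // [1] -- no set object allocated yet
        m.put("a", 2);
        System.out.println(m.get("a")); // [1, 2] -- promoted on second insert
    }
}
```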
We detail the necessary encoding and optimizations to mitigate the overhead
of storing and retrieving heterogeneous data in a hash-trie. Furthermore, we
evaluate HHAMT specifically for its application to multi-maps, comparing it
to state-of-the-art encodings of multi-maps in Java, Scala, and Clojure. We
isolate key differences using microbenchmarks and validate the resulting
conclusions on a real-world case in static analysis. The new encoding brings
the per key-value storage overhead down to 30B: a 2x improvement. With
additional inlining of primitive values, it reaches a 4x improvement.
Towards Zero-Overhead Disambiguation of Deep Priority Conflicts
**Context** Context-free grammars are widely used for language prototyping
and implementation. They allow formalizing the syntax of domain-specific or
general-purpose programming languages concisely and declaratively. However, the
natural and concise way of writing a context-free grammar is often ambiguous.
Therefore, grammar formalisms support extensions in the form of *declarative
disambiguation rules* to specify operator precedence and associativity, solving
ambiguities that are caused by the subset of the grammar that corresponds to
expressions.
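To see what such rules resolve, consider `1 - 2 - 3`, which a natural
expression grammar derives both as `(1 - 2) - 3` and as `1 - (2 - 3)`. The
Java sketch below uses plain precedence climbing (a different technique from
the scannerless generalized parsing discussed in this paper, with illustrative
names only) to show how a declared precedence/associativity table picks
exactly one of those trees:

```java
import java.util.Map;

// Precedence climbing: a declarative table of (precedence, associativity)
// determines which parse tree an ambiguous operator expression receives.
public class PrecedenceClimbing {
    record Op(int prec, boolean leftAssoc) {}

    static final Map<Character, Op> OPS = Map.of(
            '+', new Op(1, true), '-', new Op(1, true),
            '*', new Op(2, true), '^', new Op(3, false));

    static String src;
    static int pos;

    static String parseExpr(int minPrec) {
        String lhs = String.valueOf(src.charAt(pos++)); // single-digit atoms
        while (pos < src.length() && OPS.containsKey(src.charAt(pos))) {
            Op op = OPS.get(src.charAt(pos));
            if (op.prec() < minPrec) break;
            char symbol = src.charAt(pos++);
            // Left-associative: the right operand must bind strictly tighter;
            // right-associative: equal precedence may recur on the right.
            String rhs = parseExpr(op.leftAssoc() ? op.prec() + 1 : op.prec());
            lhs = "(" + lhs + " " + symbol + " " + rhs + ")";
        }
        return lhs;
    }

    public static void main(String[] args) {
        src = "1-2-3"; pos = 0;
        System.out.println(parseExpr(0)); // ((1 - 2) - 3), '-' is left-assoc
        src = "2^3^2"; pos = 0;
        System.out.println(parseExpr(0)); // (2 ^ (3 ^ 2)), '^' is right-assoc
    }
}
```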
**Inquiry** Implementing support for declarative disambiguation within a
parser typically comes with one or more of the following limitations in
practice: a loss of parsing performance, or a lack of modularity (i.e.,
disallowing the composition of grammar fragments of potentially different
languages). The latter concern is generally addressed by scannerless
generalized parsers. We aim to equip scannerless generalized parsers with novel
disambiguation methods that are inherently performant, without compromising the
concerns of modularity and language composition.
**Approach** In this paper, we present a novel low-overhead implementation
technique for disambiguating deep associativity and priority conflicts in
scannerless generalized parsers with lightweight data-dependency.
**Knowledge** Ambiguities with respect to operator precedence and
associativity arise from combining the various operators of a language. While
*shallow conflicts* can be resolved efficiently by one-level tree patterns,
*deep conflicts* require more elaborate techniques, because they can occur
arbitrarily nested in a tree. Current state-of-the-art approaches to solving
deep priority conflicts come with a severe performance overhead.
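A minimal sketch of the one-level idea, with illustrative names rather than
any parser's actual API: a filter over parent-child pairs rejects `1 + 2 * 3`
mis-parsed as `(1 + 2) * 3`, but by construction it cannot match an offending
node that sits arbitrarily deep below the conflicting argument position:

```java
import java.util.List;
import java.util.Map;

// One-level tree-pattern filter: reject a node whose direct child binds
// weaker than the node itself. In a real parse tree, explicit parentheses
// form their own node and stop the check.
public class PriorityFilter {
    record Node(String op, List<Node> children) {}

    // Higher number binds tighter: '*' over '+'.
    static final Map<String, Integer> PRIORITY = Map.of("+", 1, "*", 2);

    static boolean shallowConflict(Node n) {
        Integer parent = PRIORITY.get(n.op());
        for (Node child : n.children()) {
            Integer c = PRIORITY.get(child.op());
            if (parent != null && c != null && c < parent) return true;
            if (shallowConflict(child)) return true; // one level at a time
        }
        return false;
    }

    public static void main(String[] args) {
        Node one = new Node("1", List.of());
        // "1 + 2 * 3" mis-parsed as "(1 + 2) * 3": '+' directly under '*'.
        Node bad = new Node("*",
                List.of(new Node("+", List.of(one, one)), one));
        System.out.println(shallowConflict(bad)); // true -> tree filtered out
        // A deep conflict has no such direct parent-child witness, which is
        // why fixed one-level patterns cannot resolve it.
    }
}
```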
**Grounding** We evaluated our new approach against state-of-the-art
declarative disambiguation mechanisms. By parsing a corpus of popular
open-source repositories written in Java and OCaml, we found that our approach
yields speedups of up to 1.73x over a grammar-rewriting technique when
parsing programs with deep priority conflicts, with a modest overhead of 1-2%
when parsing programs without deep conflicts.
**Importance** A recent empirical study shows that deep priority conflicts
are indeed widespread in real-world programs: in a corpus of popular OCaml
projects on GitHub, up to 17% of the source files contain deep priority
conflicts. However, there is no solution in the literature that addresses
efficient disambiguation of deep priority conflicts with support for modular
and composable syntax definitions.
Towards a Feature Model of Trie-Based Collections
This archive contains a snapshot of a continuously evolving feature model of
the domain of trie-based collection data structures. The feature model is
expressed in FDL, a Feature Description Language [1]. We use an extension of
FDL that adds an integer data type (int) to the model.

For convenient viewing, the Rascal programming language
(http://www.rascal-mpl.org) with its Eclipse (https://www.eclipse.org)
environment supports syntax highlighting and experimental visualization of
FDL diagrams.

[1] van Deursen, A., & Klint, P. (2002). Domain-Specific Language Design
Requires Feature Descriptions. Journal of Computing and Information
Technology, 10(1), 1–17. http://doi.org/10.2498/cit.2002.01.01
Code Specialization for Memory Efficient Hash Tries (Short Paper)
The hash trie data structure is a common part of the standard collection
libraries of JVM programming languages such as Clojure and Scala. It enables
fast immutable implementations of maps, sets, and vectors, but it requires
considerably more memory than an equivalent array-based data structure. This
hinders the scalability of functional programs and the further adoption of
this otherwise attractive style of programming.

In this paper we present a product family of hash tries. We generate Java
source code to specialize them using knowledge of JVM object memory layout.
The number of possible specializations is exponential. The optimization
challenge is thus to find a minimal set of variants which leads to a maximal
reduction in memory footprint on any given data. Using a set of experiments
we measured the distribution of internal tree node sizes in hash tries. We
used the results as guidance to decide which variants of the family to
generate and which should be left to the generic implementation.

A preliminary validating experiment on the implementation of sets and maps
shows that this technique leads to a median decrease of 55% in memory
footprint for maps (and 78% for sets), while still maintaining comparable
performance. Our combination of data analysis and code specialization proved
to be effective.
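The core idea can be sketched as follows, with hypothetical names and without
the paper's generator: a generic trie node keeps its content in an Object[]
and pays for that array object's header and length field, while a variant
specialized for a fixed arity stores its content in flat fields:

```java
// Sketch of code specialization for memory efficiency: Node2 avoids the
// separate array object (header + length word + indirection) that
// GenericNode allocates, at the cost of one class per specialized arity.
public class SpecializedNodes {
    interface TrieNode { Object slot(int index); int arity(); }

    static final class GenericNode implements TrieNode {
        private final Object[] slots;               // extra array object
        GenericNode(Object... slots) { this.slots = slots; }
        public Object slot(int i) { return slots[i]; }
        public int arity() { return slots.length; }
    }

    static final class Node2 implements TrieNode {  // arity-2 specialization
        private final Object slot0, slot1;          // flat fields, no array
        Node2(Object s0, Object s1) { slot0 = s0; slot1 = s1; }
        public Object slot(int i) { return i == 0 ? slot0 : slot1; }
        public int arity() { return 2; }
    }

    public static void main(String[] args) {
        TrieNode generic = new GenericNode("a", "b");
        TrieNode compact = new Node2("a", "b");
        System.out.println(generic.slot(1).equals(compact.slot(1))); // true
    }
}
```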
usethesource/rascal-eclipse: 0.23.0
Eclipse IMP-based IDE for the Rascal meta-programming language. See the rascal project for Wiki, Issues, and such