FastAMI -- a Monte Carlo Approach to the Adjustment for Chance in Clustering Comparison Metrics
Clustering is at the very core of machine learning, and its applications
proliferate with the increasing availability of data. However, as datasets
grow, comparing clusterings with an adjustment for chance becomes
computationally difficult, preventing unbiased ground-truth comparisons and
solution selection. We propose FastAMI, a Monte Carlo-based method to
efficiently approximate the Adjusted Mutual Information (AMI) and extend it to
the Standardized Mutual Information (SMI). The approach is compared with the
exact calculation and a recently developed variant of the AMI based on pairwise
permutations, using both synthetic and real data. In contrast to the exact
calculation, our method is fast enough to enable these adjusted
information-theoretic comparisons for large datasets while delivering
considerably more accurate results than the pairwise approach. Comment: Accepted at AAAI 202
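To illustrate the adjustment-for-chance idea behind the AMI, here is a minimal sketch that estimates the expected mutual information under random label permutations by Monte Carlo sampling, rather than by the exact hypergeometric-model sum. The function names (`mutual_info`, `mc_adjusted_mi`) are illustrative, and this simple label-shuffling estimator is an assumption for exposition, not the FastAMI algorithm itself:

```python
import numpy as np
from collections import Counter
from math import log

def mutual_info(a, b):
    # Empirical mutual information (in nats) between two labelings.
    n = len(a)
    ca, cb = Counter(a), Counter(b)
    cab = Counter(zip(a, b))
    return sum((nxy / n) * log(n * nxy / (ca[x] * cb[y]))
               for (x, y), nxy in cab.items())

def mc_adjusted_mi(a, b, n_samples=200, seed=0):
    # Adjusted MI, with the expected MI under random permutations
    # estimated by Monte Carlo instead of an exact summation:
    #   AMI = (MI - E[MI]) / (mean entropy - E[MI])
    rng = np.random.default_rng(seed)
    a, b = list(a), list(b)
    mi = mutual_info(a, b)
    emi = np.mean([mutual_info(a, rng.permutation(b).tolist())
                   for _ in range(n_samples)])
    n = len(a)
    h = lambda c: -sum((k / n) * log(k / n) for k in Counter(c).values())
    denom = 0.5 * (h(a) + h(b)) - emi
    return (mi - emi) / denom if denom else 0.0
```

Identical labelings score exactly 1 regardless of the sampled expectation, while unrelated labelings land near 0, which is the point of the adjustment.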
Compressing dictionaries of strings
The aim of this work is to develop a data structure capable of storing a set of strings in compressed form while providing the facility to access, and search by prefix, any string in the set. The notion of a string will be formally defined in this work, but for now it is enough to think of a string as a stream of characters or a variable-length datum. We will prove that the data structure devised in this work is able to search prefixes of the stored strings very efficiently, hence giving a performant solution to one of the most discussed problems of our age.
In the discussion of our data structure, particular emphasis will be given to both space and time efficiency, and a tradeoff between the two will be sought throughout.
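The operations discussed above can be sketched with a minimal prefix dictionary: because every string sharing a given prefix occupies a contiguous range in a sorted array, prefix search reduces to two binary searches. The class name `PrefixDictionary` is hypothetical, and the sketch stores strings verbatim rather than compressed; it only illustrates the access pattern, not the thesis's actual structure:

```python
from bisect import bisect_left, bisect_right

class PrefixDictionary:
    """Sorted-array string dictionary with O(log n) prefix search.

    A compressed variant would store the sorted strings front-coded
    or trie-packed; this sketch keeps them verbatim for clarity.
    """

    def __init__(self, strings):
        self._strings = sorted(set(strings))

    def with_prefix(self, prefix):
        # All matches form a contiguous range in the sorted array.
        lo = bisect_left(self._strings, prefix)
        # '\uffff' acts as a sentinel just past every possible match.
        hi = bisect_right(self._strings, prefix + "\uffff")
        return self._strings[lo:hi]
```

The two binary searches cost O(log n) string comparisons each, independent of how many strings match, which is the time/space balance the text alludes to.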
To understand how important string-based data structures are, consider modern search engines and social networks: they must continuously store and process immense streams of data, which are mainly strings, while
the output of this processing must be available within a few milliseconds so as not to try the patience of the user.
Space efficiency is one of the main concerns in this kind of problem. To satisfy real-time latency bounds, the largest possible amount of data must be stored in the highest levels of the memory hierarchy.
Moreover, data compression saves money because it reduces the amount of physical memory needed to store the data, and this is particularly important since storage is the main source of expenditure in modern systems.
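As one concrete illustration of how a sorted string set compresses, here is a sketch of front coding, a classic dictionary-compression technique in which each string is stored as the length of its common prefix with the previous string plus the remaining suffix. Whether the work adopts this particular scheme is an assumption; it is shown only to make the space saving tangible:

```python
def front_code(sorted_strings):
    # Encode each string as (lcp_with_previous, remaining_suffix).
    # Sorted input maximizes shared prefixes between neighbors.
    out, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(s), len(prev)) and s[lcp] == prev[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))
        prev = s
    return out

def front_decode(coded):
    # Rebuild each string from the previous one plus the stored suffix.
    out, prev = [], ""
    for lcp, suffix in coded:
        s = prev[:lcp] + suffix
        out.append(s)
        prev = s
    return out
```

On a sorted list like ["car", "cart", "cat", "dog"], "cart" is stored as just (3, "t"): the more prefixes neighbors share, the fewer characters reach physical memory.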