7,812 research outputs found
Streaming Similarity Self-Join
We introduce and study the problem of computing the similarity self-join in a
streaming context (SSSJ), where the input is an unbounded stream of items
arriving continuously. The goal is to find all pairs of items in the stream
whose similarity is greater than a given threshold. The simplest formulation of
the problem requires unbounded memory, and thus, it is intractable. To make the
problem feasible, we introduce the notion of time-dependent similarity: the
similarity of two items decreases with the difference in their arrival time. By
leveraging the properties of this time-dependent similarity function, we design
two algorithmic frameworks to solve the sssj problem. The first one, MiniBatch
(MB), uses existing index-based filtering techniques for the static version of
the problem, and combines them in a pipeline. The second framework, Streaming
(STR), adds time filtering to the existing indexes, and integrates new
time-based bounds deeply in the working of the algorithms. We also introduce a
new indexing technique (L2), which is based on an existing state-of-the-art
indexing technique (L2AP), but is optimized for the streaming case. Extensive
experiments show that the STR algorithm, when instantiated with the L2 index,
is the most scalable option across a wide array of datasets and parameters
Combination of molecular similarity measures using data fusion
Many different measures of structural similarity have been suggested for matching chemical structures, each such measure focusing upon some particular type of molecular characteristic. The multi-faceted nature of biological activity suggests that an appropriate similarity measure should encompass many different types of characteristic, and this article discusses the use of data fusion methods to combine the results of searches based on multiple similarity measures. Experiments with several different types of dataset and activity suggest that data fusion provides a simple, but effective, approach to the combination of individual similarity measures. The best results were generally obtained with a fusion rule that sums the rank positions achieved by each molecule in searches using individual measures
Information Outlook, February 1999
Volume 3, Issue 2https://scholarworks.sjsu.edu/sla_io_1999/1001/thumbnail.jp
- …