19,677 research outputs found

Detecting Data Lineage Using Variable Block Deduplication

Author: Silverberg Sam
Publication venue: Technical Disclosure Commons
Publication date: 18/11/2021

With burgeoning data volumes, accurate data cataloging, lineage, and governance are gaining importance. Traditional automation for data context discovery relies on manual tagging or on scanning the data with a parser to understand its content, which can require knowledge of the content type and can be error-prone, time-consuming, and expensive. Further, traditional data context discovery does not track lineage between disparate data types and cannot index binary data. This disclosure describes efficient techniques to index and map similarity across multiple datasets to determine data lineage. The techniques neither need nor use the context of the underlying data or the concepts therein. A rolling hash is computed over each of the datasets whose lineage is sought. The resulting hash streams, serving as indexes for their data, are compared using, e.g., a search engine. Similarity between hash streams is used to establish lineage.
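The rolling-hash idea in this abstract can be illustrated with a minimal Python sketch. It is not the disclosure's implementation: the window size, boundary mask, polynomial hash, SHA-1 chunk digests, and the function names chunk_hashes and similarity are all assumptions chosen for illustration, and a plain Jaccard overlap stands in for the search-engine comparison the abstract mentions.

```python
import hashlib

WINDOW = 16            # bytes covered by the rolling hash (assumed value)
MASK = (1 << 6) - 1    # cut when the low 6 bits are zero, ~64-byte chunks on average (assumed)
BASE = 257             # polynomial base (assumed)
MOD = (1 << 31) - 1    # modulus for the polynomial hash (assumed)


def chunk_hashes(data: bytes) -> list[str]:
    """Split data at content-defined boundaries and return one digest per chunk."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window
    digests = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Remove the contribution of the byte that slides out of the window.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        # A boundary depends only on the last WINDOW bytes, so it survives
        # insertions or deletions elsewhere in the stream.
        if i + 1 >= WINDOW and (h & MASK) == 0:
            digests.append(hashlib.sha1(data[start:i + 1]).hexdigest())
            start = i + 1
    if start < len(data):
        digests.append(hashlib.sha1(data[start:]).hexdigest())
    return digests


def similarity(a: bytes, b: bytes) -> float:
    """Jaccard overlap of the two chunk-digest sets (stand-in for a search engine)."""
    set_a, set_b = set(chunk_hashes(a)), set(chunk_hashes(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```

Because chunk boundaries are chosen from the content itself rather than from fixed offsets, a dataset derived from another by prepending or editing a few bytes should still share most chunk digests with its parent, while unrelated data should share almost none. A high overlap is then read as a hint of lineage, without parsing the data or knowing its type.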

Technical Disclosure Commons

Application of kernel functions for accurate similarity search in large chemical databases

Authors: A Smalter, Aaron Smalter, C Austin, C Dobson, D Shasha, D Williams, Gerald H Lushington, H Cheng, H He, J Cheng, JP Vert, Jun Huan, L Jacob, MM Cone, N Tolliday, PJ Ballester, R Giugno, R Jorissen, Raymond ea J, T Girke, T Liu, TS Rush, X Yan, XH Wang, Xiaohong Wang, Y Cao
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Similarity search in chemical structure databases is an important problem with many applications in chemical genomics, drug design, and efficient chemical probe screening, among others. It is widely believed that structure-based methods provide an efficient way to perform such queries. Recently, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models, graph kernel functions cannot be applied to large chemical compound databases because of their high computational complexity and the difficulty of indexing similarity search for large databases. Results: To bridge graph kernel functions and similarity search in chemical databases, we applied a novel kernel-based similarity measurement, developed in our team, to measure the similarity of graph-represented chemicals. In our method, we utilize a hash table to support a new graph kernel function definition, efficient storage, and fast search. We have applied our method, named G-hash, to large chemical databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Moreover, the similarity measurement and the index structure are scalable to large chemical databases, with a smaller index size and faster query processing time than state-of-the-art indexing methods such as Daylight fingerprints, C-tree, and GraphGrep. Conclusions: Efficient similarity query processing for large chemical databases is challenging because running time efficiency must be balanced against similarity search accuracy. Our previous similarity search method, G-hash, provides a new way to perform similarity search in chemical databases. An experimental study validates the utility of G-hash in chemical databases.
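The general idea of backing a graph similarity measure with a hash table can be sketched in Python as follows. This is a simplified stand-in and not the G-hash definition from the paper: the node descriptor (atom label plus sorted neighbor labels), the cosine-normalized count kernel, and the names node_keys and GraphIndex are assumptions made for illustration, and hydrogens are omitted from the toy molecules.

```python
import math
from collections import Counter, defaultdict


def node_keys(labels, edges):
    """Count hashable node descriptors: (atom label, sorted neighbor labels)."""
    neighbours = defaultdict(list)
    for u, v in edges:
        neighbours[u].append(labels[v])
        neighbours[v].append(labels[u])
    return Counter(
        (labels[n], tuple(sorted(neighbours[n]))) for n in range(len(labels))
    )


class GraphIndex:
    """Hash table from node descriptor to the graphs that contain it."""

    def __init__(self):
        self.table = defaultdict(dict)  # descriptor -> {graph_id: count}
        self.norms = {}                 # graph_id -> Euclidean norm of its counts

    def add(self, graph_id, labels, edges):
        keys = node_keys(labels, edges)
        for key, count in keys.items():
            self.table[key][graph_id] = count
        self.norms[graph_id] = math.sqrt(sum(c * c for c in keys.values()))

    def query(self, labels, edges, k=1):
        """Return the k most similar indexed graphs as (graph_id, score) pairs."""
        keys = node_keys(labels, edges)
        q_norm = math.sqrt(sum(c * c for c in keys.values()))
        scores = defaultdict(float)
        # Only graphs sharing at least one descriptor with the query are touched.
        for key, q_count in keys.items():
            for graph_id, g_count in self.table.get(key, {}).items():
                scores[graph_id] += q_count * g_count
        ranked = sorted(
            ((s / (q_norm * self.norms[g]), g) for g, s in scores.items()),
            reverse=True,
        )
        return [(g, s) for s, g in ranked[:k]]


# Toy database: heavy atoms only, node ids are list positions.
index = GraphIndex()
index.add("ethanol", ["C", "C", "O"], [(0, 1), (1, 2)])
index.add("methanol", ["C", "O"], [(0, 1)])
index.add("propane", ["C", "C", "C"], [(0, 1), (1, 2)])
top = index.query(["C", "C", "O"], [(0, 1), (1, 2)], k=2)
```

In this sketch the hash table plays two roles at once: it stores the features that define the similarity score, and it acts as an inverted index, so a k-NN query only visits graphs that share a descriptor with the query instead of scanning the whole database.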

Crossref

Springer - Publisher Connector

KU ScholarWorks

PubMed Central