    Pattern Masking for Dictionary Matching: Theory and Practice

    Data masking is a common technique for sanitizing sensitive data maintained in database systems, and it is becoming increasingly important in various application areas, such as record linkage of personal data. This work formalizes the Pattern Masking for Dictionary Matching (PMDM) problem: given a dictionary D of d strings, each of length ℓ, a query string q of length ℓ, and a positive integer z, we are asked to compute a smallest set K⊆{1, …, ℓ}, so that if q[i] is replaced by a wildcard for all i∈K, then q matches at least z strings from D. Unlike existing approaches, solving PMDM allows data utility guarantees to be provided. We first show, through a reduction from the well-known k-Clique problem, that a decision version of the PMDM problem is NP-complete, even for binary strings. We thus approach the problem from a more practical perspective. We show a combinatorial O((dℓ)^(|K|/3) + dℓ)-time and O(dℓ)-space algorithm for PMDM for |K|=O(1). In fact, we show that we cannot hope for a faster combinatorial algorithm, unless the combinatorial k-Clique hypothesis fails (Abboud et al. in SIAM J Comput 47:2527–2555, 2018; Lincoln et al., in: 29th ACM-SIAM Symposium on Discrete Algorithms (SODA), 2018). Our combinatorial algorithm, executed with small |K|, is the backbone of a greedy heuristic that we propose. Our experiments on real-world and synthetic datasets show that our heuristic finds nearly optimal solutions in practice and is also very efficient. We also generalize this algorithm to the problem of masking multiple query strings simultaneously so that every string has at least z matches in D. PMDM can be viewed as a generalization of the decision version of the dictionary matching with mismatches problem: by querying a PMDM data structure with string q and z=1, one obtains the minimal number of mismatches of q with any string from D.
The query time or space of all known data structures for the more restricted problem of dictionary matching with at most k mismatches incurs some exponential factor with respect to k. A simple exact algorithm for PMDM runs in time O(2^ℓ d). We present a data structure for PMDM that answers queries over D in time O(2^(ℓ/2) (2^(ℓ/2) + τ) ℓ) and requires space O(2^ℓ d^2/τ^2 + 2^(ℓ/2) d), for any parameter τ∈[1, d]. We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamtáč et al., ACM-SIAM Symposium on Discrete Algorithms (SODA) 2017]. This gives a polynomial-time O(d^(1/4+ϵ))-approximation algorithm for PMDM, which is tight under a plausible complexity conjecture. This is an extended version of a paper that was presented at the International Symposium on Algorithms and Computation (ISAAC) 2021.
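
The PMDM definition above can be made concrete with a minimal brute-force sketch. It tries candidate sets K in order of increasing size, in the spirit of the simple exact algorithm mentioned above, but it does not attempt to match the stated O(2^ℓ d) bound. The function name `pmdm_brute_force` and the sample dictionary are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def pmdm_brute_force(dictionary, q, z):
    """Return a smallest set K of 1-based positions such that masking q at K
    makes q match at least z strings of the dictionary, or None if impossible.

    Illustrative sketch only: assumes every string has the same length as q.
    """
    ell = len(q)
    # For each dictionary string, the positions where it disagrees with q.
    # A string matches the masked q exactly when these positions are all in K.
    mismatch_sets = [
        frozenset(i + 1 for i in range(ell) if s[i] != q[i]) for s in dictionary
    ]
    for size in range(ell + 1):  # smallest candidate sets first
        for positions in combinations(range(1, ell + 1), size):
            masked = set(positions)
            matches = sum(1 for m in mismatch_sets if m <= masked)
            if matches >= z:
                return masked
    return None


# Hypothetical toy dictionary with d = 4 strings of length 3.
D = ["abc", "abd", "xbc", "aaa"]
K = pmdm_brute_force(D, "abc", 3)
```

With z=1, the size of the returned set equals the minimal number of mismatches between q and any string of D, which is the connection to dictionary matching with mismatches noted above.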

    New approaches to weighted frequent pattern mining

    Researchers have proposed frequent pattern mining algorithms that are more efficient than previous algorithms and generate fewer but more important patterns. Many techniques such as depth first/breadth first search, use of tree/other data structures, top down/bottom up traversal and vertical/horizontal formats for frequent pattern mining have been developed. Most frequent pattern mining algorithms use a support measure to prune the combinatorial search space. However, support-based pruning is not enough when taking into consideration the characteristics of real datasets. Additionally, after mining datasets to obtain the frequent patterns, there is no way to adjust the number of frequent patterns through user feedback, except for changing the minimum support. Alternative measures for mining frequent patterns have been suggested to address these issues. One of the main limitations of the traditional approach for mining frequent patterns is that all items are treated uniformly when, in reality, items have different importance. For this reason, weighted frequent pattern mining algorithms have been suggested that give different weights to items according to their significance. The main focus in weighted frequent pattern mining concerns satisfying the downward closure property. In this research, frequent pattern mining approaches with weight constraints are suggested. Our main approach is to push weight constraints into the pattern growth algorithm while maintaining the downward closure property. 
We develop WFIM (Weighted Frequent Itemset Mining with a weight range and a minimum weight), WLPMiner (Weighted frequent Pattern Mining with length decreasing constraints), WIP (Weighted Interesting Pattern mining with a strong weight and/or support affinity), WSpan (Weighted Sequential pattern mining with a weight range and a minimum weight) and WIS (Weighted Interesting Sequential pattern mining with a similar level of support and/or weight affinity). The extensive performance analysis shows that the suggested approaches are efficient and scalable in weighted frequent pattern mining.
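
To illustrate why the downward closure property is the central difficulty, the sketch below uses one common definition of weighted support (average item weight times support) and shows that an infrequent itemset can have a frequent superset. It then shows an anti-monotone upper bound based on the maximum item weight. Both the weighted-support definition and the bound are assumptions chosen for illustration; the abstract does not state the exact measures used by WFIM or the other algorithms.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def weighted_support(itemset, transactions, weights):
    """Assumed measure: average item weight multiplied by support."""
    avg_weight = sum(weights[i] for i in itemset) / len(itemset)
    return avg_weight * support(itemset, transactions)


def upper_bound(itemset, transactions, weights):
    """Anti-monotone bound: the maximum weight over all items times support.

    Support never grows when items are added and the maximum weight is a
    constant, so this bound is safe to use for pruning.
    """
    return max(weights.values()) * support(itemset, transactions)


# Hypothetical data: item "a" is unimportant, item "b" is important.
weights = {"a": 0.1, "b": 0.9}
transactions = [frozenset("ab")] * 3
threshold = 1.0

a_is_frequent = weighted_support(frozenset("a"), transactions, weights) >= threshold
ab_is_frequent = weighted_support(frozenset("ab"), transactions, weights) >= threshold
a_survives_bound = upper_bound(frozenset("a"), transactions, weights) >= threshold
```

Here {a} falls below the threshold while its superset {a, b} does not, so pruning on weighted support alone would wrongly discard {a, b}. Pruning on the upper bound keeps {a} as a candidate and avoids this.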

    Discovery Agent: An Interactive Approach for the Discovery of Inclusion Dependencies

    The information integration problem is a hard yet important problem in the field of databases. The goal of information integration is to provide unified views on diverse data among several resources. This subject has been studied for a long time. The integration can be performed in several ways. Schema integration using inclusion dependency constraints is one of them. The problem of discovering inclusion dependencies among input relations is NP-complete in terms of the number of attributes. Two significant algorithms address this problem: FIND2 by Andreas Koeller and Zigzag by Fabien De Marchi. Both algorithms discover inclusion dependencies among input relations on small-scale databases having relatively few attributes. Because of the data discrepancy, they do not scale well with higher numbers of attributes. We propose an approach that incorporates human intelligence into the algorithmic discovery of inclusion dependencies. To use human intelligence, we design an agent, called the discovery agent, to provide a communication bridge between an algorithm and a user. The discovery agent shows the progress of the discovery process and provides sufficient user controls to steer the discovery process in the right direction. In this thesis, we present a prototype of the discovery agent based upon the FIND2 algorithm, which utilizes most of the phase-wise behavior of the algorithm, and demonstrate how a human observer and the algorithm work together to achieve higher performance and better output accuracy. The goal of the discovery agent is to make the discovery process truly interactive between system and user as well as to produce the desired and accurate result. The discovery agent can deliver an applicable and feasible approximation of an NP-complete problem with the help of a suitable algorithm and appropriate human expertise.
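
As a reminder of what is being discovered, an inclusion dependency R[A1, …, An] ⊆ S[B1, …, Bn] holds when every tuple of R projected onto A1, …, An also appears in S projected onto B1, …, Bn. The sketch below only checks a single candidate dependency. It is not FIND2 or Zigzag, and the function name `holds_ind` and the sample relations are hypothetical.

```python
def holds_ind(r_rows, r_attrs, s_rows, s_attrs):
    """Check whether R[r_attrs] is included in S[s_attrs].

    Relations are given as lists of dicts mapping attribute names to values.
    Illustrative sketch only: checks one candidate, performs no discovery.
    """
    if len(r_attrs) != len(s_attrs):
        raise ValueError("both attribute lists must have the same length")
    # Project S onto the chosen attributes once, then probe with R's tuples.
    s_projection = {tuple(row[b] for b in s_attrs) for row in s_rows}
    return all(tuple(row[a] for a in r_attrs) in s_projection for row in r_rows)


# Hypothetical relations: every order refers to an existing customer.
orders = [{"cust": 1, "item": "pen"}, {"cust": 2, "item": "ink"}]
customers = [{"id": 1}, {"id": 2}, {"id": 3}]

forward = holds_ind(orders, ["cust"], customers, ["id"])
backward = holds_ind(customers, ["id"], orders, ["cust"])
```

The difficulty addressed above comes from the number of candidate attribute combinations that would have to be checked this way, which is where user guidance can narrow the search.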


    Ollivier-Ricci Curvature for Hypergraphs: A Unified Framework

    Bridging geometry and topology, curvature is a powerful and expressive invariant. While the utility of curvature has been theoretically and empirically confirmed in the context of manifolds and graphs, its generalization to the emerging domain of hypergraphs has remained largely unexplored. On graphs, Ollivier-Ricci curvature measures differences between random walks via Wasserstein distances, thus grounding a geometric concept in ideas from probability and optimal transport. We develop ORCHID, a flexible framework generalizing Ollivier-Ricci curvature to hypergraphs, and prove that the resulting curvatures have favorable theoretical properties. Through extensive experiments on synthetic and real-world hypergraphs from different domains, we demonstrate that ORCHID curvatures are both scalable and useful to perform a variety of hypergraph tasks in practice.
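
For reference, the graph case described above is usually written as follows. The symbols are the standard ones and are not taken from the abstract, and the hypergraph generalization developed in ORCHID is not reproduced here.

```latex
\kappa(x, y) \;=\; 1 - \frac{W_1(\mu_x, \mu_y)}{d(x, y)}
```

Here \mu_x denotes the probability measure of a one-step random walk started at node x, W_1 is the Wasserstein-1 distance between two such measures, and d(x, y) is the shortest-path distance between x and y.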

    A Lightweight Data Preprocessing Strategy with Fast Contradiction Analysis for Incremental Classifier Learning

    A prime objective in constructing data stream mining models is to achieve good accuracy, fast learning, and robustness to noise. Although many techniques have been proposed in the past, efforts to improve the accuracy of classification models have been somewhat disparate. These techniques include, but are not limited to, feature selection, dimensionality reduction, and the removal of noise from training data. One limitation common to all of these techniques is the assumption that the full training dataset must be applied. Although this has been effective for traditional batch training, it may not be practical for incremental classifier learning, also known as data stream mining, where only a single pass of the data stream is seen at a time. Because data streams are potentially unbounded, as in the so-called big data phenomenon, the data preprocessing time must be kept to a minimum. This paper introduces a new data preprocessing strategy suitable for the progressive purging of noisy data from the training dataset without the need to process the whole dataset at one time. This strategy is shown via a computer simulation to provide the significant benefit of allowing for the dynamic removal of bad records from the incremental classifier learning process.

    Recent advances in clustering methods for protein interaction networks

    The increasing availability of large-scale protein-protein interaction data has made it possible to understand the basic components and organization of cell machinery at the network level. The arising challenge is how to analyze such complex interaction data to reveal the principles of cellular organization, processes and functions. Many studies have shown that clustering protein interaction networks is an effective approach for identifying protein complexes or functional modules, which has become a major research topic in systems biology. In this review, recent advances in clustering methods for protein interaction networks will be presented in detail. The predictions of protein functions and interactions based on modules will be covered. Finally, the performance of different clustering methods will be compared and the directions for future research will be discussed.