22 research outputs found

    Tackling scalability issues in mining path patterns from knowledge graphs: a preliminary study

    Features mined from knowledge graphs are widely used within multiple knowledge discovery tasks such as classification or fact-checking. Here, we consider a given set of vertices, called seed vertices, and focus on mining their associated neighboring vertices, paths, and, more generally, path patterns that involve classes of ontologies linked with knowledge graphs. Due to the combinatorial nature and the increasing size of real-world knowledge graphs, the task of mining these patterns immediately entails scalability issues. In this paper, we address these issues by proposing a pattern mining approach that relies on a set of constraints (e.g., support or degree thresholds) and the monotonicity property. As our motivation comes from the mining of real-world knowledge graphs, we illustrate our approach with PGxLOD, a biomedical knowledge graph.
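The support constraint and the monotonicity property mentioned in this abstract can be illustrated with a minimal level-wise sketch. It is not the paper's algorithm: it assumes a pattern is just a sequence of predicates (no ontology classes, no degree threshold), and it defines support as the number of seed vertices from which the pattern can be followed. The function name `mine_path_patterns` and its parameters are hypothetical.

```python
from collections import defaultdict

def mine_path_patterns(triples, seeds, min_support, max_length):
    """Level-wise mining of predicate-sequence patterns rooted at seed vertices.

    Assumption: support(pattern) = number of seeds with at least one path
    matching the pattern. Extending a pattern can never increase its support,
    so infrequent patterns are dropped and never extended.
    """
    out_edges = defaultdict(list)
    for subj, pred, obj in triples:
        out_edges[subj].append((pred, obj))

    # frontier maps a pattern to {seed: vertices reachable via that pattern}
    frontier = {(): {seed: {seed} for seed in seeds}}
    frequent = {}
    for _ in range(max_length):
        candidates = defaultdict(lambda: defaultdict(set))
        for pattern, ends_by_seed in frontier.items():
            for seed, ends in ends_by_seed.items():
                for vertex in ends:
                    for pred, obj in out_edges.get(vertex, ()):
                        candidates[pattern + (pred,)][seed].add(obj)
        # Keep only patterns meeting the support threshold; the rest are pruned.
        frontier = {
            pattern: dict(by_seed)
            for pattern, by_seed in candidates.items()
            if len(by_seed) >= min_support
        }
        for pattern, by_seed in frontier.items():
            frequent[pattern] = len(by_seed)
    return frequent
```

Because only frequent patterns are kept in the frontier, each level explores extensions of surviving patterns only, which is where the pruning pays off on large graphs.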

    FlashProfile: A Framework for Synthesizing Data Profiles

    We address the problem of learning a syntactic profile for a collection of strings, i.e. a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios. Prior techniques are restricted to a small set of pre-defined patterns (e.g. digits, letters, words, etc.), and provide no control over granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns, that also allows for interactive refinement by requesting a desired number of clusters. Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153 tasks over 75 large real datasets, we observe a median profiling time of only $\sim 0.7$ s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e. using fewer examples, in programming-by-example (PBE) workflows such as FlashFill. Comment: 28 pages, SPLASH (OOPSLA) 201
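To make the idea of a syntactic profile concrete, here is a deliberately naive sketch: it groups strings by a fixed character-class pattern and reports one regex per group. This is the kind of pre-defined-pattern approach the abstract contrasts itself with, not FlashProfile's synthesis technique, and it offers no control over the number of clusters. The helper names `fragment`, `pattern_of`, and `profile` are hypothetical.

```python
import re
import string
from collections import defaultdict

def fragment(ch):
    """Map one character to a regex fragment: a coarse class or an escaped literal."""
    if ch in string.digits:
        return "[0-9]"
    if ch in string.ascii_uppercase:
        return "[A-Z]"
    if ch in string.ascii_lowercase:
        return "[a-z]"
    return re.escape(ch)

def pattern_of(s):
    """Collapse runs of the same fragment and emit a 'one or more' regex."""
    fragments = []
    for ch in s:
        f = fragment(ch)
        if not fragments or fragments[-1] != f:
            fragments.append(f)
    return "".join(f + "+" for f in fragments)

def profile(strings):
    """Cluster strings that share the same pattern; return {regex: members}."""
    clusters = defaultdict(list)
    for s in strings:
        clusters[pattern_of(s)].append(s)
    return dict(clusters)
```

For example, `profile(["2019-01-02", "1999-12-31", "Jan 5", "Mar 12"])` yields two clusters, one for the dash-separated dates and one for the month-day strings, and each member fully matches its cluster's regex.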

    Approximate Set Union Via Approximate Randomization

    We develop a randomized approximation algorithm for the size of set union problem $|A_1\cup A_2\cup\dots\cup A_m|$, which is given a list of sets $A_1,\dots,A_m$ with approximate set size $m_i$ for $A_i$ with $m_i\in \left((1-\beta_L)|A_i|, (1+\beta_R)|A_i|\right)$, and biased random generators with $\mathrm{Prob}(x=\mathrm{RandomElm}(A_i))\in \left[\frac{1-\alpha_L}{|A_i|},\frac{1+\alpha_R}{|A_i|}\right]$ for each input set $A_i$ and element $x\in A_i$, where $i=1, 2, \dots, m$. The approximation ratio for $|A_1\cup A_2\cup\dots\cup A_m|$ is in the range $[(1-\epsilon)(1-\alpha_L)(1-\beta_L), (1+\epsilon)(1+\alpha_R)(1+\beta_R)]$ for any $\epsilon\in (0,1)$, where $\alpha_L, \alpha_R, \beta_L, \beta_R\in (0,1)$. The complexity of the algorithm is measured by both time complexity and round complexity. The algorithm is allowed to make multiple membership queries and get random elements from the input sets in one round. Our algorithm makes adaptive accesses to input sets with multiple rounds. Our algorithm gives an approximation scheme with $O(m\cdot(\log m)^{O(1)})$ running time and $O(\log m)$ rounds, where $m$ is the number of sets. Our algorithm can handle input sets that can generate random elements with bias, and its approximation ratio depends on the bias. Our algorithm gives a flexible tradeoff with time complexity $O\left(m^{1+\xi}\right)$ and round complexity $O\left(\frac{1}{\xi}\right)$ for any $\xi\in(0,1)$.
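For orientation, the access model in this abstract (set sizes, random elements, membership queries) can be illustrated with a classical Karp-Luby-style Monte Carlo estimator. This sketch is not the paper's algorithm: it assumes exact sizes and unbiased generators (all of $\alpha_L, \alpha_R, \beta_L, \beta_R$ equal to zero), it is non-adaptive, and each sample costs $m$ membership queries. The function name `estimate_union_size` and its parameters are hypothetical.

```python
import random

def estimate_union_size(sets, samples=20000, seed=0):
    """Estimate |A_1 ∪ ... ∪ A_m| from sizes, uniform samples, and membership tests.

    Pick a set with probability proportional to its size, pick a uniform
    element x from it, and average 1 / (number of sets containing x).
    The expectation of that average is |union| / sum(|A_i|).
    """
    rng = random.Random(seed)
    members = [set(s) for s in sets if s]   # used for membership queries
    pools = [list(s) for s in members]      # used for uniform sampling
    sizes = [len(p) for p in pools]
    total = sum(sizes)
    if total == 0:
        return 0.0

    picks = rng.choices(range(len(pools)), weights=sizes, k=samples)
    acc = 0.0
    for i in picks:
        x = rng.choice(pools[i])
        cover = sum(1 for s in members if x in s)  # always >= 1
        acc += 1.0 / cover
    return total * acc / samples
```

With disjoint inputs every sampled element is covered exactly once, so the estimate equals the sum of the sizes; overlap is what the averaging step corrects for.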