22 research outputs found

Tackling scalability issues in mining path patterns from knowledge graphs: a preliminary study

Author: Bresso Emmanuel
Couceiro Miguel
Coulet Adrien
Monnin Pierre
Napoli Amedeo
Smaïl-Tabbone Malika
Publication date: 07/08/2020

Features mined from knowledge graphs are widely used within multiple knowledge discovery tasks such as classification or fact-checking. Here, we consider a given set of vertices, called seed vertices, and focus on mining their associated neighboring vertices, paths, and, more generally, path patterns that involve classes of ontologies linked with knowledge graphs. Due to the combinatorial nature and the increasing size of real-world knowledge graphs, the task of mining these patterns immediately entails scalability issues. In this paper, we address these issues by proposing a pattern mining approach that relies on a set of constraints (e.g., support or degree thresholds) and the monotonicity property. As our motivation comes from the mining of real-world knowledge graphs, we illustrate our approach with PGxLOD, a biomedical knowledge graph.
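To make the role of the support threshold and the monotonicity property concrete, here is a minimal sketch of level-wise path-pattern mining from seed vertices. It is not the authors' implementation: the function name `mine_path_patterns`, the adjacency-list graph encoding, and the definition of support as "number of seeds with at least one matching path" are all assumptions made for illustration, and ontology classes are not modeled.

```python
from collections import defaultdict

def mine_path_patterns(graph, seeds, min_support, max_length, max_degree=None):
    """Level-wise mining of predicate-sequence patterns that start at seeds.

    graph: dict mapping vertex -> list of (predicate, neighbor) pairs
           (hypothetical encoding, chosen for this sketch).
    Returns a dict mapping each frequent pattern (tuple of predicates) to its
    support, i.e. the number of seeds with at least one matching path.
    """
    # frontier[pattern][seed] = vertices reachable from seed via pattern
    frontier = {(): {s: {s} for s in seeds}}
    frequent = {}
    for _ in range(max_length):
        candidates = defaultdict(lambda: defaultdict(set))
        for pattern, reach in frontier.items():
            for seed, ends in reach.items():
                for v in ends:
                    edges = graph.get(v, [])
                    if max_degree is not None and len(edges) > max_degree:
                        continue  # degree threshold: do not expand hub vertices
                    for pred, nbr in edges:
                        candidates[pattern + (pred,)][seed].add(nbr)
        # Extending a pattern can only keep or shrink its set of supporting
        # seeds, so infrequent patterns are dropped and never extended.
        frontier = {p: dict(r) for p, r in candidates.items()
                    if len(r) >= min_support}
        for p, r in frontier.items():
            frequent[p] = len(r)
        if not frontier:
            break
    return frequent

# Toy graph (invented data, not taken from PGxLOD).
toy_graph = {
    "d1": [("treats", "x"), ("causes", "y")],
    "d2": [("treats", "z")],
    "d3": [("causes", "y")],
    "x": [("type", "Disease")],
    "z": [("type", "Disease")],
    "y": [("type", "Effect")],
}
toy_result = mine_path_patterns(toy_graph, ["d1", "d2", "d3"],
                                min_support=2, max_length=2)
```

Because support never grows when a pattern is extended, raising `min_support` prunes whole branches of the search space early, which is the scalability lever the abstract refers to.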

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

FlashProfile: A Framework for Synthesizing Data Profiles

Author: Gulwani Sumit
Jain Prateek
Millstein Todd
Padhi Saswat
Perelman Daniel
Polozov Oleksandr
Publication venue: Association for Computing Machinery (ACM)
Publication date: 24/10/2018

We address the problem of learning a syntactic profile for a collection of strings, i.e. a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios. Prior techniques are restricted to a small set of pre-defined patterns (e.g. digits, letters, words, etc.), and provide no control over granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns, that also allows for interactive refinement by requesting a desired number of clusters. Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153 tasks over 75 large real datasets, we observe a median profiling time of only ~0.7 s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e. using fewer examples, in programming-by-example (PBE) workflows such as FlashFill.

Comment: 28 pages, SPLASH (OOPSLA) 201
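The idea of grouping strings by syntactic format and describing each group with a regex-like pattern can be illustrated with a deliberately simple baseline. The sketch below is not FlashProfile and does not use PROSE: it relies on a fixed set of character-class atoms, which is exactly the kind of pre-defined pattern set the abstract describes as a limitation of prior techniques. The names `pattern_of` and `profile` and the atom set are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Fixed atom set (an assumption of this sketch): each maximal run of
# characters from one class is generalized to a single regex atom.
ATOMS = [
    (re.compile(r"[0-9]+"), r"\d+"),
    (re.compile(r"[A-Z]+"), "[A-Z]+"),
    (re.compile(r"[a-z]+"), "[a-z]+"),
]
# Tokenizer: runs of digits, uppercase, lowercase, or any single other char.
TOKEN = re.compile(r"[0-9]+|[A-Z]+|[a-z]+|.", re.DOTALL)

def pattern_of(s):
    """Generalize one string into a regex built from the fixed atoms."""
    parts = []
    for tok in TOKEN.findall(s):
        for rx, atom in ATOMS:
            if rx.fullmatch(tok):
                parts.append(atom)
                break
        else:
            # Punctuation and other characters are kept as literals.
            parts.append(re.escape(tok))
    return "".join(parts)

def profile(strings):
    """Cluster strings that share the same generalized pattern."""
    clusters = defaultdict(list)
    for s in strings:
        clusters[pattern_of(s)].append(s)
    return dict(clusters)

# Invented sample data mixing two date formats and a missing-value marker.
sample_profile = profile(["2018-10-24", "2020-08-07", "24/10/2018", "N/A"])
```

Each key of the returned dictionary is a pattern that matches every string in its cluster. Unlike the approach in the abstract, this baseline offers no control over the number of clusters or the granularity of the patterns.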

arXiv.org e-Print Archive

eScholarship - University of California

Approximate Set Union Via Approximate Randomization

Author: Fu Bin
Gu Pengfei
Zhao Yuming
Publication date: 14/06/2018

We develop a randomized approximation algorithm for the size of set union problem $|A_1\cup A_2\cup\dots\cup A_m|$, which is given a list of sets $A_1,\dots,A_m$ with approximate set size $m_i$ for $A_i$ with $m_i\in \left((1-\beta_L)|A_i|, (1+\beta_R)|A_i|\right)$, and biased random generators with $\mathrm{Prob}(x=\mathrm{RandomElement}(A_i))\in \left[\frac{1-\alpha_L}{|A_i|},\frac{1+\alpha_R}{|A_i|}\right]$ for each input set $A_i$ and element $x\in A_i$, where $i=1,2,\dots,m$. The approximation ratio for $|A_1\cup A_2\cup\dots\cup A_m|$ is in the range $[(1-\epsilon)(1-\alpha_L)(1-\beta_L), (1+\epsilon)(1+\alpha_R)(1+\beta_R)]$ for any $\epsilon\in (0,1)$, where $\alpha_L, \alpha_R, \beta_L, \beta_R\in (0,1)$. The complexity of the algorithm is measured by both time complexity and round complexity. The algorithm is allowed to make multiple membership queries and get random elements from the input sets in one round. Our algorithm makes adaptive accesses to input sets with multiple rounds. Our algorithm gives an approximation scheme with $O(m\cdot(\log m)^{O(1)})$ running time and $O(\log m)$ rounds, where $m$ is the number of sets. Our algorithm can handle input sets that can generate random elements with bias, and its approximation ratio depends on the bias. Our algorithm gives a flexible tradeoff with time complexity $O\left(m^{1+\xi}\right)$ and round complexity $O\left(\frac{1}{\xi}\right)$ for any $\xi\in(0,1)$.
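For orientation, the access model in this abstract (set sizes, random elements, membership queries) is the same one used by the classical Karp-Luby style sampling estimator for union size. The sketch below implements that classical estimator, not the algorithm of the paper: it assumes exact set sizes and unbiased samplers (the case $\alpha_L=\alpha_R=\beta_L=\beta_R=0$), ignores round complexity, and the function name `estimate_union_size` is hypothetical.

```python
import random

def estimate_union_size(sets, num_samples, rng=None):
    """Estimate |A_1 ∪ ... ∪ A_m| by sampling (classical Karp-Luby style).

    Assumes exact sizes and uniform samplers, unlike the biased setting
    described in the abstract.
    """
    rng = rng or random.Random()
    pools = [list(s) for s in sets if s]   # non-empty sets, as sampleable lists
    sizes = [len(p) for p in pools]
    total = sum(sizes)                     # sum of |A_i|, counts overlaps repeatedly
    if total == 0:
        return 0.0
    acc = 0.0
    for _ in range(num_samples):
        # Pick a set with probability proportional to its size,
        # then a uniform element of that set.
        i = rng.choices(range(len(pools)), weights=sizes)[0]
        x = rng.choice(pools[i])
        # Membership queries: how many input sets contain x?
        cover = sum(1 for s in sets if x in s)
        acc += 1.0 / cover
    # E[1/cover] = |union| / total, so rescaling gives an unbiased estimate.
    return total * acc / num_samples
```

Each sample costs one membership query per set, so this sketch runs in time proportional to `num_samples * m`. The bias-tolerant guarantees and the time/round tradeoff stated in the abstract require the paper's own algorithm.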

arXiv.org e-Print Archive

Scholarworks@UTRGV Univ. of Texas RioGrande Valley