22 research outputs found
Tackling scalability issues in mining path patterns from knowledge graphs: a preliminary study
Features mined from knowledge graphs are widely used within multiple
knowledge discovery tasks such as classification or fact-checking. Here, we
consider a given set of vertices, called seed vertices, and focus on mining
their associated neighboring vertices, paths, and, more generally, path
patterns that involve classes of ontologies linked with knowledge graphs. Due
to the combinatorial nature and the increasing size of real-world knowledge
graphs, the task of mining these patterns immediately entails scalability
issues. In this paper, we address these issues by proposing a pattern mining
approach that relies on a set of constraints (e.g., support or degree
thresholds) and the monotonicity property. As our motivation comes from the
mining of real-world knowledge graphs, we illustrate our approach with PGxLOD,
a biomedical knowledge graph.
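As a rough illustration of how a support threshold and monotonicity can prune the search, here is a minimal sketch, not the authors' implementation. It assumes the graph is a list of (subject, predicate, object) triples and that a pattern is simply a sequence of predicates starting at a seed vertex; the ontology classes mentioned above are left out. The function name, parameters, and toy triples are all hypothetical.

```python
from collections import defaultdict


def mine_path_patterns(edges, seeds, min_support, max_length):
    """Level-wise mining of predicate-sequence patterns rooted at seed vertices.

    The support of a pattern is the number of seeds with at least one matching
    path. Support can only shrink when a pattern is extended, so a pattern
    below the threshold is never extended further.
    """
    # adjacency: vertex -> list of (predicate, neighbor)
    adjacency = defaultdict(list)
    for subject, predicate, obj in edges:
        adjacency[subject].append((predicate, obj))

    # frontier: pattern (tuple of predicates) -> {seed: set of end vertices}
    frontier = {(): {seed: {seed} for seed in seeds}}
    frequent = {}
    for _ in range(max_length):
        candidates = defaultdict(lambda: defaultdict(set))
        for pattern, ends_by_seed in frontier.items():
            for seed, ends in ends_by_seed.items():
                for vertex in ends:
                    for predicate, neighbor in adjacency.get(vertex, ()):
                        candidates[pattern + (predicate,)][seed].add(neighbor)
        # Keep only patterns that reach the support threshold (pruning step).
        frontier = {
            pattern: dict(ends_by_seed)
            for pattern, ends_by_seed in candidates.items()
            if len(ends_by_seed) >= min_support
        }
        for pattern, ends_by_seed in frontier.items():
            frequent[pattern] = len(ends_by_seed)
        if not frontier:
            break
    return frequent


# Toy triples, invented for illustration only.
toy_edges = [
    ("g1", "causes", "d1"),
    ("g2", "causes", "d2"),
    ("d1", "treatedBy", "c1"),
    ("d2", "treatedBy", "c1"),
    ("g1", "locatedIn", "chr1"),
]
toy_result = mine_path_patterns(toy_edges, ["g1", "g2"], min_support=2, max_length=2)
```

In this toy run the pattern ("locatedIn",) is matched by only one seed, so it is dropped at length 1 and none of its extensions are ever generated.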
FlashProfile: A Framework for Synthesizing Data Profiles
We address the problem of learning a syntactic profile for a collection of
strings, i.e. a set of regex-like patterns that succinctly describe the
syntactic variations in the strings. Real-world datasets, typically curated
from multiple sources, often contain data in various syntactic formats. Thus,
any data processing task is preceded by the critical step of data format
identification. However, manual inspection of data to identify the different
formats is infeasible in standard big-data scenarios.
Prior techniques are restricted to a small set of pre-defined patterns (e.g.
digits, letters, words, etc.), and provide no control over granularity of
profiles. We define syntactic profiling as a problem of clustering strings
based on syntactic similarity, followed by identifying patterns that succinctly
describe each cluster. We present a technique for synthesizing such profiles
over a given language of patterns, that also allows for interactive refinement
by requesting a desired number of clusters.
Using a state-of-the-art inductive synthesis framework, PROSE, we have
implemented our technique as FlashProfile. Across tasks over large
real datasets, we observe a median profiling time of only s.
Furthermore, we show that access to syntactic profiles may allow for more
accurate synthesis of programs, i.e. using fewer examples, in
programming-by-example (PBE) workflows such as FlashFill.
Comment: 28 pages, SPLASH (OOPSLA) 201
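To make the notion of a syntactic profile concrete, the sketch below groups strings by a fixed character-class signature. This is the simple kind of baseline with pre-defined patterns that the abstract contrasts against, not FlashProfile's synthesis technique, and it does not use PROSE. The function names and sample strings are hypothetical.

```python
import re
from collections import defaultdict

# One alternative per token kind; the name of the matched group tells us which.
TOKEN = re.compile(r"(?P<digits>[0-9]+)|(?P<letters>[A-Za-z]+)|(?P<other>.)", re.DOTALL)


def signature(text):
    """Return a regex that abstracts digit runs and letter runs in `text`."""
    parts = []
    for match in TOKEN.finditer(text):
        if match.lastgroup == "digits":
            parts.append("[0-9]+")
        elif match.lastgroup == "letters":
            parts.append("[A-Za-z]+")
        else:
            # Any other character is kept as an escaped literal.
            parts.append(re.escape(match.group()))
    return "".join(parts)


def profile(strings):
    """Cluster strings that share a signature: pattern -> list of members."""
    clusters = defaultdict(list)
    for text in strings:
        clusters[signature(text)].append(text)
    return dict(clusters)


# Sample values, invented for illustration only.
sample = ["2017-09-18", "1999-01-02", "N/A", "42"]
sample_profile = profile(sample)
```

Here the two dates land in one cluster, while "N/A" and "42" each get their own. The granularity is fixed by the three token kinds, which is exactly the limitation the abstract says it addresses by letting the user request a desired number of clusters.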
Approximate Set Union Via Approximate Randomization
We develop a randomized approximation algorithm for the size of set union
problem $|A_1\cup A_2\cup\dots\cup A_m|$, which is given a list
of sets $A_1,\dots,A_m$ with approximate set sizes, and biased random generators
with $\mathrm{Prob}(x=\mathrm{randomElm}(A_i))\in \left[\frac{1-\alpha_L}{|A_i|},\frac{1+\alpha_R}{|A_i|}\right]$
for each input set $A_i$ and element $x\in A_i$. The approximation ratio for
$|A_1\cup A_2\cup\dots\cup A_m|$ is in the range for any , where
. The complexity of the algorithm
is measured by both time complexity, and round complexity. The algorithm is
allowed to make multiple membership queries and get random elements from the
input sets in one round. Our algorithm makes adaptive accesses to input sets
with multiple rounds. Our algorithm gives an approximation scheme with
$O(m\cdot(\log m)^{O(1)})$ running time and rounds,
where $m$ is the number of sets. Our algorithm can handle input sets that can
generate random elements with bias, and its approximation ratio depends on the
bias. Our algorithm gives a flexible tradeoff with time complexity
$O\left(m^{1+\xi}\right)$ and round complexity for any