46 research outputs found
On the Complexity of Mining Itemsets from the Crowd Using Taxonomies
We study the problem of frequent itemset mining in domains where data is not
recorded in a conventional database but only exists in human knowledge. We
provide examples of such scenarios, and present a crowdsourcing model for them.
The model uses the crowd as an oracle to find out whether an itemset is
frequent or not, and relies on a known taxonomy of the item domain to guide the
search for frequent itemsets. In the spirit of data mining with oracles, we
analyze the complexity of this problem in terms of (i) crowd complexity, that
measures the number of crowd questions required to identify the frequent
itemsets; and (ii) computational complexity, that measures the computational
effort required to choose the questions. We provide lower and upper complexity
bounds in terms of the size and structure of the input taxonomy, as well as the
size of a concise description of the output itemsets. We also provide
constructive algorithms that achieve the upper bounds, and consider more
efficient variants for practical situations.
Comment: 18 pages, 2 figures. To be published to ICDT'13. Added missing
acknowledgemen
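The oracle-based search described in this abstract can be illustrated with a small sketch. It is not the paper's algorithm: the toy taxonomy, the simulated crowd oracle, and the names `PARENT`, `mine`, and `make_oracle` are all assumptions made for illustration. The sketch only shows how a taxonomy lets the search skip questions, since an itemset cannot be frequent if one of its generalizations is already known to be infrequent.

```python
from collections import deque

# Hypothetical toy taxonomy: child -> parent (None marks a root).
PARENT = {"sport": None, "ball_game": "sport", "tennis": "ball_game",
          "food": None, "fruit": "food", "apple": "fruit"}

def ancestors_or_self(item):
    """Return the item together with all of its ancestors in the taxonomy."""
    out = set()
    while item is not None:
        out.add(item)
        item = PARENT[item]
    return out

def generalizes(general, specific):
    """True if every item of `general` is an ancestor-or-self of some item of `specific`."""
    return all(any(g in ancestors_or_self(s) for s in specific) for g in general)

def specializations(itemset):
    """Yield itemsets that are one step more specific than `itemset`."""
    for item in itemset:                      # replace an item by one of its children
        for child, parent in PARENT.items():
            if parent == item:
                yield (itemset - {item}) | {child}
    covered = set().union(*(ancestors_or_self(i) for i in itemset))
    for root, parent in PARENT.items():       # or add a root from an uncovered branch
        if parent is None and root not in covered:
            yield itemset | {root}

def mine(oracle):
    """Breadth-first search from the roots; returns (frequent itemsets, questions asked)."""
    queue = deque(frozenset({i}) for i, p in PARENT.items() if p is None)
    seen, frequent, infrequent, questions = set(), [], [], 0
    while queue:
        s = queue.popleft()
        if s in seen:
            continue
        seen.add(s)
        # Monotonicity: skip the question if a known-infrequent itemset generalizes s.
        if any(generalizes(g, s) for g in infrequent):
            continue
        questions += 1
        if oracle(s):
            frequent.append(s)
            queue.extend(specializations(s))
        else:
            infrequent.append(s)
    return frequent, questions

def make_oracle(transactions, min_support):
    """Simulated crowd: an itemset is frequent if enough hidden transactions match it."""
    def oracle(itemset):
        return sum(generalizes(itemset, tx) for tx in transactions) >= min_support
    return oracle

frequent, questions = mine(make_oracle([{"tennis", "apple"}, {"tennis"}, {"tennis"}], 2))
print(sorted(map(sorted, frequent)), questions)
```

In this toy run only four questions reach the oracle, because every itemset involving the `food` branch is pruned once `food` itself is reported infrequent.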
Structurally Tractable Uncertain Data
Many data management applications must deal with data which is uncertain,
incomplete, or noisy. However, on existing uncertain data representations, we
cannot tractably perform the important query evaluation tasks of determining
query possibility, certainty, or probability: these problems are hard on
arbitrary uncertain input instances. We thus ask whether we could restrict the
structure of uncertain data so as to guarantee the tractability of exact query
evaluation. We present our tractability results for tree and tree-like
uncertain data, and a vision for probabilistic rule reasoning. We also study
uncertainty about order, proposing a suitable representation, and study
uncertain data conditioned by additional observations.
Comment: 11 pages, 1 figure, 1 table. To appear in SIGMOD/PODS PhD Symposium
201
Uncertainty in Crowd Data Sourcing under Structural Constraints
Applications extracting data from crowdsourcing platforms must deal with the
uncertainty of crowd answers in two different ways: first, by deriving
estimates of the correct value from the answers; second, by choosing crowd
questions whose answers are expected to minimize this uncertainty relative to
the overall data collection goal. Such problems are already challenging when we
assume that questions are unrelated and answers are independent, but they are
even more complicated when we assume that the unknown values follow hard
structural constraints (such as monotonicity).
In this vision paper, we examine how to formally address this issue with an
approach inspired by [Amsterdamer et al., 2013]. We describe a generalized
setting where we model constraints as linear inequalities, and use them to
guide the choice of crowd questions and the processing of answers. We present
the main challenges arising in this setting, and propose directions to solve
them.
Comment: 8 pages, vision paper. To appear at UnCrowd 201
Privacy Preservation by Disassociation
In this work, we focus on protection against identity disclosure in the
publication of sparse multidimensional data. Existing multidimensional
anonymization techniques (a) protect the privacy of users either by altering the
set of quasi-identifiers of the original data (e.g., by generalization or
suppression) or by adding noise (e.g., using differential privacy) and/or (b)
assume a clear distinction between sensitive and non-sensitive information and
sever the possible linkage. In many real-world applications the above
techniques are not applicable. For instance, consider web search query logs.
Anonymization methods based on suppression or generalization would remove the most
valuable information in the dataset: the original query terms. Additionally,
web search query logs contain millions of query terms which cannot be
categorized as sensitive or non-sensitive since a term may be sensitive for a
user and non-sensitive for another. Motivated by this observation, we propose
an anonymization technique termed disassociation that preserves the original
terms but hides the fact that two or more different terms appear in the same
record. We protect the users' privacy by disassociating record terms that
participate in identifying combinations. This way, the adversary cannot, with
high probability, associate a record with a rare combination of terms. To
the best of our knowledge, our proposal is the first to employ such a technique
to provide protection against identity disclosure. We propose an anonymization
algorithm based on our approach and evaluate its performance on real and
synthetic datasets, comparing it against other state-of-the-art methods based
on generalization and differential privacy.
Comment: VLDB201
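To make the notion of an "identifying combination" concrete, here is a minimal sketch that only detects rare term combinations; it does not perform disassociation itself. The definition of "rare" used here (a combination of at most `m` terms that occurs in fewer than `k` records) and the function name `rare_combinations` are assumptions for illustration, not details taken from the abstract.

```python
from collections import Counter
from itertools import combinations

def rare_combinations(records, k, m):
    """Return term combinations of size <= m that occur in fewer than k records."""
    counts = Counter()
    for record in records:
        terms = sorted(set(record))           # canonical order, each term counted once
        for size in range(1, m + 1):
            counts.update(combinations(terms, size))
    return {combo for combo, n in counts.items() if n < k}

# Hypothetical query-log records, each a set of query terms.
records = [{"flu", "hotel"}, {"flu", "hotel"}, {"flu", "taxes"}]
rare = rare_combinations(records, k=2, m=2)
print(rare)
```

Under these assumptions, the terms flagged here are the ones a disassociation-style method would need to separate from the rest of their record.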
Automating Software Customization via Crowdsourcing using Association Rule Mining and Markov Decision Processes
As systems grow in size and complexity, so do their configuration possibilities. Users of modern systems are easily confused and overwhelmed by the number of choices they need to make in order to fit their systems to their exact needs. In this thesis, we propose a technique to select what information to elicit from the user so that the system can recommend the maximum number of personalized configuration items. Our method is based on constructing configuration elicitation dialogs using crowd wisdom.
A set of configuration preferences in the form of association rules is first mined from a crowd configuration data set. Possible configuration elicitation dialogs are then modeled as Markov Decision Processes (MDPs). Within the model, association rules are used to automatically infer configuration decisions from knowledge elicited earlier in the dialog. This way, an MDP solver can search for elicitation strategies that maximize the expected number of automated decisions, thereby reducing elicitation effort and increasing user confidence in the result. We conclude by reporting results of a case study in which this method is applied to the privacy configuration of Facebook.
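The rule-based inference step mentioned in this abstract can be sketched as simple forward chaining. This is an illustrative assumption, not the thesis's implementation: the rule format, the option names, and the function `infer` are hypothetical, and the MDP part of the method is not modeled.

```python
def infer(known, rules):
    """Apply association rules until no further configuration option can be set.

    known: dict mapping option -> value already elicited from the user.
    rules: list of (antecedent dict, (option, value)) pairs.
    """
    known = dict(known)                       # do not mutate the caller's answers
    changed = True
    while changed:
        changed = False
        for antecedent, (option, value) in rules:
            matches = all(known.get(k) == v for k, v in antecedent.items())
            if matches and option not in known:
                known[option] = value         # decision inferred, no question needed
                changed = True
    return known

# Hypothetical rules mined from crowd configurations.
rules = [
    ({"profile_visible": "friends"}, ("photos_visible", "friends")),
    ({"photos_visible": "friends"}, ("tagging", "off")),
]
config = infer({"profile_visible": "friends"}, rules)
print(config)
```

In this example a single elicited answer yields two further decisions, which is the quantity an elicitation strategy would try to maximize.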
Requirements and Use Cases ; Report I on the sub-project Smart Content Enrichment
In this technical report, we present the results of the first milestone phase
of the Corporate Smart Content sub-project "Smart Content Enrichment". We
present analyses of the state of the art in the fields concerning the three
work packages defined in the sub-project, which are aspect-oriented
ontology development, complex entity recognition, and semantic event pattern
mining. We compare the research approaches related to our three research
subjects and briefly outline our future work plan.
Sequential Pattern Mining using FCA and Pattern Structures for Analyzing Visitor Trajectories in a Museum
This paper presents our work on mining visitor trajectories in the Hecht Museum (Haifa, Israel), within the framework of the CrossCult European Project about cultural heritage. We present theoretical and practical research on the characterization of visitor trajectories and the mining of these trajectories as sequences. The mining process is based on two approaches in the framework of FCA, namely the mining of subsequences without any constraint and the mining of frequent contiguous subsequences. Both approaches are based on pattern structures. In parallel, a similarity measure allows us to build a hierarchical classification, which is used for interpretation and characterization of the trajectories w.r.t. four well-known visiting styles.
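As a rough illustration of what "frequent contiguous subsequences" means for trajectories, the sketch below counts contiguous runs of visited items by plain enumeration. It deliberately does not use FCA or pattern structures, which are the paper's actual framework; the function name, the `max_len` cap, and the sample trajectories are assumptions for illustration.

```python
from collections import Counter

def frequent_contiguous(sequences, min_support, max_len=3):
    """Return contiguous subsequences (as tuples) found in at least min_support sequences."""
    support = Counter()
    for seq in sequences:
        # Use a set so each subsequence is counted at most once per trajectory.
        grams = {tuple(seq[i:i + n])
                 for n in range(1, max_len + 1)
                 for i in range(len(seq) - n + 1)}
        support.update(grams)
    return {gram: count for gram, count in support.items() if count >= min_support}

# Hypothetical trajectories: each is the ordered list of exhibits a visitor stopped at.
trajectories = [["A", "B", "C"], ["A", "B", "D"], ["B", "C"]]
patterns = frequent_contiguous(trajectories, min_support=2)
print(patterns)
```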