
    Closing the gap: Sequence mining at scale

    Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this article, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called “gap constraints”, which can be used to limit the output to a controlled set of frequent sequences. Both positional and temporal gap constraints, as well as appropriate maximality and closedness constraints, are supported. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of ω-equivalency, which is a generalization of the notion of a “projected database” used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the contexts of text mining and session analysis suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.
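    The abstract leaves the gap-constrained matching semantics implicit. As a toy illustration of the underlying mining problem (not of MG-FSM itself, which distributes this work over MapReduce), the sketch below mines frequent sequences under a positional gap constraint: a pattern matches a sequence if its items appear in order with at most `gamma` other items between consecutive matches. The level-wise candidate generation and the parameter names `sigma` (minimum support), `gamma` (maximum gap), and `max_len` are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def occurs_with_gap(seq, pattern, gamma):
    """True iff `pattern` occurs in `seq` as a subsequence with at most
    `gamma` items between consecutive matched positions."""
    def search(start, pi, last):
        if pi == len(pattern):
            return True
        for j in range(start, len(seq)):
            if last is not None and j - last - 1 > gamma:
                return False  # all later positions violate the gap too
            if seq[j] == pattern[pi] and search(j + 1, pi + 1, j):
                return True
        return False
    return search(0, 0, None)

def frequent_sequences(db, sigma, gamma, max_len):
    """Naive level-wise miner: grow candidates one item at a time and
    keep those occurring in at least `sigma` input sequences."""
    items = sorted({x for s in db for x in s})
    result, level = [], [(x,) for x in items]
    for _ in range(max_len):
        support = Counter(p for p in level
                          for s in db if occurs_with_gap(s, p, gamma))
        survivors = [p for p in level if support[p] >= sigma]
        result.extend(survivors)
        level = [p + (x,) for p in survivors for x in items]
    return result

# Tiny example: with sigma=2 and gamma=1, ('b', 'a') is frequent because
# it occurs with a small enough gap in the first and third sequences.
db = [list("abca"), list("acb"), list("bac")]
print(frequent_sequences(db, sigma=2, gamma=1, max_len=2))
```

    This brute-force version scans the whole database for every candidate; the point of MG-FSM is precisely to avoid that by mining small, independent partitions in parallel.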

    A Temporal-Probabilistic Database Model for Information Extraction

    Temporal annotations of facts are a key component both for building a high-accuracy knowledge base and for answering queries over the resulting temporal knowledge base with high precision and recall. In this paper, we present a temporal-probabilistic database model for cleaning uncertain temporal facts obtained from information extraction methods. Specifically, we consider a combination of temporal deduction rules, temporal consistency constraints and probabilistic inference based on the common possible-worlds semantics with data lineage, and we study the theoretical properties of this data model. We further develop a query engine that is capable of scaling to very large temporal knowledge bases, with millions of uncertain facts and hundreds of thousands of grounded rules. Our experiments over two real-world datasets demonstrate the increased robustness of our approach compared to related techniques based on constraint solving via Integer Linear Programming (ILP) and probabilistic inference via Markov Logic Networks (MLNs). We also show that our runtime performance is more than competitive with ILP solvers and with the fastest available probabilistic, but non-temporal, database engines.
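    As background for the possible-worlds semantics with data lineage mentioned above, here is a minimal sketch of how a derived fact's marginal probability follows from a Boolean lineage formula over independent base tuples. The exhaustive enumeration of worlds is for illustration only (it is exponential, whereas the paper's engine scales to millions of facts), and the extraction events `e1`, `e2`, `e3` are hypothetical.

```python
from itertools import product

def marginal(lineage, probs):
    """Exact marginal probability of a Boolean lineage formula over
    independent base tuples, by enumerating all possible worlds."""
    tuples = sorted(probs)
    total = 0.0
    for world in product([False, True], repeat=len(tuples)):
        assignment = dict(zip(tuples, world))
        weight = 1.0
        for t, present in assignment.items():
            weight *= probs[t] if present else 1.0 - probs[t]
        if lineage(assignment):
            total += weight  # this world makes the fact true
    return total

# Hypothetical fact: it holds if extraction e1 fired, or both e2 and e3 did.
probs = {"e1": 0.7, "e2": 0.6, "e3": 0.5}
lineage = lambda w: w["e1"] or (w["e2"] and w["e3"])
print(marginal(lineage, probs))  # 1 - 0.3 * (1 - 0.6 * 0.5) = 0.79
```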

    Top-k Query Processing in Probabilistic Databases with Non-Materialized Views

    We investigate a novel approach to computing confidence bounds for top-k ranking queries in probabilistic databases with non-materialized views. Unlike related approaches, we present an exact pruning algorithm for finding the top-ranked query answers according to their marginal probabilities without the need to first materialize all answer candidates via the views. Specifically, we consider conjunctive queries over multiple levels of select-project-join views, the latter of which are cast into Datalog rules that we ground in a top-down fashion directly at query processing time. To our knowledge, this work is the first to address integrated data and confidence computations for intensional query evaluation in the context of probabilistic databases by considering confidence bounds over first-order lineage formulas. We extend our query processing techniques with a tool suite of scheduling strategies based on selectivity estimation and the expected impact on confidence bounds. Further extensions include improved top-k bounds in the case when sorted relations are available as input, as well as support for recursive rules. Experiments with large datasets demonstrate significant runtime improvements of our approach compared to both exact and sampling-based top-k methods over probabilistic data.
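    A minimal sketch of the bound-based pruning idea, under assumptions not taken from the paper: each answer candidate carries a (lower, upper) interval on its marginal probability, the intervals are tightened incrementally, and evaluation stops as soon as the k-th best lower bound dominates every remaining upper bound, so the losing candidates are never evaluated exactly. The `refine` oracle and the widest-interval-first scheduling heuristic stand in for the paper's lineage-based bound computations and scheduling strategies.

```python
def topk_with_bounds(candidates, k, refine):
    """Return the top-k candidates by marginal probability, tightening
    (lo, hi) bounds until the top-k set is provably determined."""
    bounds = {c: (0.0, 1.0) for c in candidates}
    while True:
        best = sorted(bounds, key=lambda c: bounds[c][0], reverse=True)[:k]
        kth_lo = min(bounds[c][0] for c in best)
        rest_hi = max((bounds[c][1] for c in bounds if c not in best),
                      default=0.0)
        if kth_lo >= rest_hi:
            return best  # no other candidate can still enter the top-k
        # Simple scheduling heuristic: tighten the widest interval next.
        widest = max(bounds, key=lambda c: bounds[c][1] - bounds[c][0])
        bounds[widest] = refine(widest, bounds[widest])

# Stub refinement that shrinks an interval around a (normally unknown)
# true probability; in a real engine this would evaluate more lineage.
truth = {"a": 0.9, "b": 0.4, "c": 0.3}
def refine(c, interval):
    lo, hi = interval
    quarter = (hi - lo) / 4
    return (max(lo, truth[c] - quarter), min(hi, truth[c] + quarter))

print(topk_with_bounds(list(truth), k=1, refine=refine))  # ['a']
```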

    Mind the Gap: Large-Scale Frequent Sequence Mining

    Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this paper, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called “gap constraints”, which can be used to limit the output to a controlled set of frequent sequences. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of ω-equivalency, which is a generalization of the notion of a “projected database” used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the context of text mining suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.
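    To make the partition-then-mine idea concrete, here is a toy sketch in the spirit of MG-FSM, omitting the ω-equivalency rewrites that keep real partitions small: every sequence is routed to one partition per distinct item it contains, each partition is mined independently (reusing the `frequent_sequences` toy miner sketched earlier), and a pattern is reported only by the partition of its pivot item, taken here to be its largest item, so each frequent sequence is emitted exactly once. The pivot convention and the function names are illustrative assumptions.

```python
from collections import defaultdict

def partition_and_mine(db, sigma, gamma, max_len, miner):
    """Mine each item partition independently: a pattern's global
    support equals its support in the partition of its largest item,
    since every sequence containing the pattern contains that item."""
    partitions = defaultdict(list)
    for seq in db:                 # "map" phase: replicate per item
        for item in set(seq):
            partitions[item].append(seq)
    result = []
    for pivot, part in partitions.items():   # "reduce" phase
        for pattern in miner(part, sigma, gamma, max_len):
            if max(pattern) == pivot:        # emit each pattern once
                result.append(pattern)
    return result

db = [list("abca"), list("acb"), list("bac")]
print(partition_and_mine(db, 2, 1, 2, frequent_sequences))
```

    Without rewriting, each partition here is nearly as large as the input; MG-FSM's optimizations replace a partition with a much smaller ω-equivalent one that yields the same pivot patterns.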

    10 years of probabilistic querying - what next?

    Over the past decade, the two research areas of probabilistic databases and probabilistic programming have intensively studied the problem of making structured probabilistic inference scalable, but so far both areas have developed almost independently of one another. While probabilistic databases have focused on describing tractable query classes based on the structure of query plans and data lineage, probabilistic programming has contributed sophisticated inference techniques based on knowledge compilation and lifted (first-order) inference. Both fields have developed their own exact and approximate top-k algorithms for query evaluation, and both investigate query optimization techniques known from SQL, Datalog, and Prolog; all of this calls for a more intensive study of the commonalities and integration of the two fields. Moreover, we believe that natural-language processing and information extraction will remain a driving factor, and in fact a longstanding challenge, for developing expressive representation models that can be combined with structured probabilistic inference for decades to come.

    Semantic Grid resource discovery in Atlas

    We study the problem of resource discovery in the Semantic Grid. We show how to solve this problem by utilizing Atlas, a P2P system for the distributed storage and retrieval of RDF(S) data. Atlas is currently under development in the project OntoGrid, funded by FP6. Atlas is built on top of the distributed hash table Bamboo and supports pull and push querying scenarios. It inherits all the desirable features of Bamboo (openness, scalability, fault tolerance, resistance to high churn rates) and extends Bamboo’s protocols for storing and querying RDF(S) data. Atlas is currently being used to realize the metadata service of S-OGSA in a fully distributed and scalable way. In this paper, we concentrate on the main features of Atlas and demonstrate its use for Semantic Grid resource discovery in an OntoGrid use case scenario.
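    As a rough illustration of DHT-based triple indexing, a common scheme in such systems and not necessarily Atlas's exact protocol over Bamboo: each RDF triple is stored under the hash of its subject, its predicate, and its object, so a lookup with any bound position can be routed to a single node and filtered locally. `NUM_NODES`, `node_for`, and the in-memory `ring` dictionary are stand-ins for the real overlay network.

```python
import hashlib

NUM_NODES = 16  # hypothetical size of the identifier ring

def node_for(key):
    """Map a key onto the ring (stand-in for DHT routing)."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % NUM_NODES

def store(ring, triple):
    """Index the triple under each of its three components."""
    for key in triple:
        ring.setdefault(node_for(key), set()).add(triple)

def lookup(ring, s=None, p=None, o=None):
    """Route on the first bound component, then filter locally."""
    key = s if s is not None else p if p is not None else o
    return [t for t in ring.get(node_for(key), set())
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

ring = {}
store(ring, ("grid:node42", "rdf:type", "grid:ComputeResource"))
store(ring, ("grid:node42", "grid:hasCPUs", "8"))
print(lookup(ring, p="rdf:type"))  # finds grid:node42 by its type
```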