Search CORE

50,729 research outputs found

On SAT Models Enumeration in Itemset Mining

Author: Jabbour Said
Sais Lakhdar
Salhi Yakoub
Publication venue
Publication date: 08/06/2015
Field of study

Frequent itemset mining is an essential part of data analysis and data mining. Recent works propose interesting SAT-based encodings for the problem of discovering frequent itemsets. Our aim in this work is to define strategies for adapting SAT solvers to such encodings in order to improve model enumeration. In this context, we study in depth the effects of restarts, branching heuristics and clause learning. We then conduct an experimental evaluation on SAT-based itemset mining instances to show how SAT solvers can be adapted to obtain an efficient SAT model enumerator.

arXiv.org e-Print Archive

On Solving the Maximum $k$-club Problem

Author: Wotzlaw Andreas
Publication date: 03/04/2014

Given a simple undirected graph $G$, the maximum $k$-club problem is to find a maximum-cardinality subset of nodes inducing a subgraph of diameter at most $k$ in $G$. This NP-hard generalization of clique, originally introduced to model low diameter clusters in social networks, is of interest in network-based data mining and clustering applications. We give two MAX-SAT formulations of the problem and show that two exact methods resulting from our encodings outperform significantly the state-of-the-art exact methods when evaluated both on sparse and dense random graphs as well as on diverse real-life graphs from the literature.
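The definition in this abstract can be made concrete with a small sketch. The code below is an illustration only and is not the MAX-SAT method of the paper: it checks the $k$-club property by breadth-first search inside the induced subgraph and finds a maximum $k$-club by brute force, which is feasible only for tiny graphs. The function names and the adjacency-dictionary representation are assumptions made for this example.

```python
from collections import deque
from itertools import combinations


def induced_diameter(adj, nodes):
    """Diameter of the subgraph induced by `nodes` (inf if disconnected)."""
    nodes = set(nodes)
    worst = 0
    for source in nodes:
        # BFS that is only allowed to walk through vertices of `nodes`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in nodes and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < len(nodes):
            return float("inf")  # some vertex is unreachable inside the subgraph
        worst = max(worst, max(dist.values()))
    return worst


def is_k_club(adj, nodes, k):
    """True if `nodes` induces a subgraph of diameter at most k."""
    return induced_diameter(adj, nodes) <= k


def max_k_club_bruteforce(adj, k):
    """Exhaustive search from largest to smallest subsets (exponential time)."""
    vertices = sorted(adj)
    for size in range(len(vertices), 0, -1):
        for subset in combinations(vertices, size):
            if is_k_club(adj, subset, k):
                return set(subset)
    return set()


# Hypothetical example: the path 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
best = max_k_club_bruteforce(path, 2)
```

Distances are measured inside the induced subgraph, so on this path the set {0, 2} is not a 2-club even though its two nodes are at distance 2 in the whole graph.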

arXiv.org e-Print Archive

Flexible constrained sampling with guarantees for pattern mining

Author: A Giacometti
A Zimmermann
C Bucilă
CP Gomes
F Bonchi
Luc De Raedt
M Berlingerio
M Boley
MA Hasan
Matthijs van Leeuwen
S Ermon
S Nijssen
T Calders
T Guns
Vladimir Dzyuba
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Pattern sampling has been proposed as a potential solution to the infamous pattern explosion. Instead of enumerating all patterns that satisfy the constraints, individual patterns are sampled proportional to a given quality measure. Several sampling algorithms have been proposed, but each of them has its limitations when it comes to 1) flexibility in terms of quality measures and constraints that can be used, and/or 2) guarantees with respect to sampling accuracy. We therefore present Flexics, the first flexible pattern sampler that supports a broad class of quality measures and constraints, while providing strong guarantees regarding sampling accuracy. To achieve this, we leverage the perspective on pattern mining as a constraint satisfaction problem and build upon the latest advances in sampling solutions in SAT as well as existing pattern mining algorithms. Furthermore, the proposed algorithm is applicable to a variety of pattern languages, which allows us to introduce and tackle the novel task of sampling sets of patterns. We introduce and empirically evaluate two variants of Flexics: 1) a generic variant that addresses the well-known itemset sampling task and the novel pattern set sampling task as well as a wide range of expressive constraints within these tasks, and 2) a specialized variant that exploits existing frequent itemset techniques to achieve substantial speed-ups. Experiments show that Flexics is both accurate and efficient, making it a useful tool for pattern-based data exploration. Comment: Accepted for publication in Data Mining & Knowledge Discovery journal (ECML/PKDD 2017 journal track).

arXiv.org e-Print Archive

Crossref

Leiden University Scholarly Publications

A SAT model to mine flexible sequences in transactional datasets

Author: Coletta Rémi
Negrevergne Benjamin
Publication date: 01/04/2016

Traditional pattern mining algorithms generally suffer from a lack of flexibility. In this paper, we propose a SAT formulation of the problem to successfully mine frequent flexible sequences occurring in transactional datasets. Our SAT-based approach can easily be extended with extra constraints to address a broad range of pattern mining applications. To demonstrate this claim, we formulate and add several constraints, such as gap and span constraints, to our model in order to extract more specific patterns. We also use interactive solving to perform important derived tasks, such as closed pattern mining or maximal pattern mining. Finally, we prove the practical feasibility of our SAT model by running experiments on two real datasets.

arXiv.org e-Print Archive

On When and How to use SAT to Mine Frequent Itemsets

Author: Henriques Rui
Lynce Inês
Manquinho Vasco
Publication date: 26/07/2012

A new stream of research was born in the last decade with the goal of mining itemsets of interest using Constraint Programming (CP). This has promoted a natural way to combine complex constraints in a highly flexible manner. Although state-of-the-art CP solutions formulate the task using Boolean variables, the few attempts to adopt propositional Satisfiability (SAT) yielded unsatisfactory performance. This work deepens the study on when and how to use SAT for the frequent itemset mining (FIM) problem by defining different encodings with multiple task-driven enumeration options and search strategies. Although for the majority of the scenarios SAT-based solutions appear to be non-competitive with CP peers, results show a variety of interesting cases where SAT encodings are the best option.

arXiv.org e-Print Archive

Practical Algorithms for Finding Extremal Sets

Author: Gregg David
Marinov Martin
Nash Nicholas
Publication date: 07/08/2015

The minimal sets within a collection of sets are defined as the ones which do not have a proper subset within the collection, and the maximal sets are the ones which do not have a proper superset within the collection. Identifying extremal sets is a fundamental problem with a wide range of applications in SAT solvers, data mining and social network analysis. In this paper, we present two novel improvements of the high-quality extremal set identification algorithm, AMS-Lex, described by Bayardo and Panda. The first technique uses memoization to improve the execution time of the single-threaded variant of AMS-Lex, whilst our second improvement uses parallel programming methods. In a subset of the presented experiments our memoized algorithm executes more than 400 times faster than the highly efficient publicly available implementation of AMS-Lex. Moreover, we show that our modified algorithm's speedup is not bounded above by a constant and that it increases as the length of the common prefixes in successive input itemsets increases. We provide experimental results using both real-world and synthetic data sets, and show our multi-threaded variant algorithm out-performing AMS-Lex by 3 to 6 times. We find that on synthetic input datasets, when executed using 16 CPU cores of a 32-core machine, our multi-threaded program executes about as fast as the state-of-the-art parallel GPU-based program using an NVIDIA GTX 580 graphics processing unit.
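The definition of minimal and maximal sets given in this abstract translates directly into a naive quadratic baseline. The sketch below is that baseline only; it is not AMS-Lex and not the memoized or parallel variants the paper describes. The function name `extremal_sets` is an assumption made for this example.

```python
def extremal_sets(collection):
    """Return (minimal, maximal) sets of a collection by pairwise comparison.

    A set is minimal if no other set in the collection is a proper subset
    of it, and maximal if no other set is a proper superset of it.
    """
    # frozenset makes the sets hashable; duplicates collapse in `unique`.
    unique = {frozenset(s) for s in collection}
    # `t < s` is the proper-subset test, `t > s` the proper-superset test.
    minimal = {s for s in unique if not any(t < s for t in unique)}
    maximal = {s for s in unique if not any(t > s for t in unique)}
    return minimal, maximal


# Hypothetical input collection of itemsets.
itemsets = [{1}, {1, 2}, {2, 3}, {1, 2, 3}]
minimal, maximal = extremal_sets(itemsets)
```

For a collection of n distinct sets this performs on the order of n² subset tests, which is the cost that the specialised algorithms named in the abstract aim to avoid.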

arXiv.org e-Print Archive

Mining to Compact CNF Propositional Formulae

Author: Jabbour Said
Sais Lakhdar
Salhi Yakoub
Publication date: 16/04/2013

In this paper, we propose a first application of data mining techniques to propositional satisfiability. Our proposed Mining4SAT approach aims to discover and to exploit hidden structural knowledge for reducing the size of propositional formulae in conjunctive normal form (CNF). Mining4SAT combines both frequent itemset mining techniques and Tseitin's encoding for a compact representation of CNF formulae. Experiments with our Mining4SAT approach show interesting reductions in the sizes of many application instances taken from the last SAT competitions.

arXiv.org e-Print Archive

NetSimile: A Scalable Approach to Size-Independent Network Similarity

Author: Berlingerio Michele
Eliassi-Rad Tina
Faloutsos Christos
Koutra Danai
Publication date: 12/09/2012

Given a set of k networks, possibly with different sizes and no overlaps in nodes or edges, how can we quickly assess similarity between them, without solving the node-correspondence problem? Analogously, how can we extract a small number of descriptive, numerical features from each graph that effectively serve as the graph's "signature"? Having such features will enable a wealth of graph mining tasks, including clustering, outlier detection, visualization, etc. We propose NetSimile -- a novel, effective, and scalable method for solving the aforementioned problem. NetSimile has the following desirable properties: (a) It gives similarity scores that are size-invariant. (b) It is scalable, being linear on the number of edges for "signature" vector extraction. (c) It does not need to solve the node-correspondence problem. We present extensive experiments on numerous synthetic and real graphs from disparate domains, and show NetSimile's superiority over baseline competitors. We also show how NetSimile enables several mining tasks such as clustering, visualization, discontinuity detection, network transfer learning, and re-identification across networks. Comment: 12 pages, 10 figures

arXiv.org e-Print Archive

Improving efficiency of information measurement system of coal mine air gas protection

Author: Laktionov I
Vovna O
Zori A
Publication venue: 'National Academy of Sciences of Ukraine (Co. LTD Ukrinformnauka)'
Publication date: 01/01/2017

Purpose. Development of scientific approaches to the creation of high-precision and high-speed optoelectronic measurement systems within the complex of air gas safety of coal mines, by means of the developed and implemented methods and means of improving measurement system efficiency, taking into account compensation of the effect of destabilizing factors. Methods. Experimental studies have been carried out in mine production conditions and in laboratories on physical models of information measurement systems using metrologically certified measuring instruments. Findings. It has been proposed to determine the efficiency of the developed information and measurement systems on the basis of the arithmetic mean over n groups of the geometric mean of the information data rates of the m meters measuring mine atmosphere parameters in coal mines, computed for each group separately. It has been found that the use of the developed information system measuring methane and dust concentration within the UTSSC increases the data rate of the mine air gas protection system by 16.5 bits/s. Originality. For the first time, a logical design of an information and measurement system of methane and dust concentration has been proposed and implemented which, in contrast to the existing ones, is based on increasing the accuracy and speed of response of the measuring channels to methane and dust concentration. This made it possible to increase the probability of detecting explosive situations from 0.90 to 0.98 and to enhance mine air gas protection. Practical implications. The developed methods and techniques made it possible to implement a number of projects for the mining industry: a high-speed measurement system evaluating methane concentration for the mine complex of monitoring telephone communication and notification “SAT” (private company “Deyta Express”, Ukraine); and a measurement system of polydisperse dust concentration for unified telecommunication systems of supervisory control and automated management of mining machines and technological complexes “UTSSC” (State Enterprise “Petrovsky Plant of Mining Machinery”, Ukraine).
This work would be impossible without the financial support of the Ministry of Education and Science of Ukraine during the execution of the project No 0115U002655 “Research and development of an experimental sample of optical meter of methane concentration for coal mines”. Additional financial support was provided during the implementation of the Inter-Regional Programme of the European Neighbourhood and Partnership Instrument Tempus VI on the project 544010 – TEMPUS – 1 – 2013 – 1 – DE – TEMPUS – JPHES “TATU: Trainings in Automation Technologies for Ukraine”. The authors express gratitude to the employees of the State Enterprise “Petrovsky Plant of Mining Machinery” and the private company “Deyta Express” for participating in the creation of research sample meters of methane and dust concentration for coal mine conditions, as well as for support in conducting research in industrial conditions.

Наукова електронна бібліотека періодичних видань НАН України (Vernadsky National Library of Ukraine)

eLibrary National Mining University

On the Complexity of Exact Pattern Matching in Graphs: Binary Strings and Bounded Degree

Author: Equi Massimo
Grossi Roberto
Mäkinen Veli
Publication date: 08/07/2019

Exact pattern matching in labeled graphs is the problem of searching paths of a graph $G=(V,E)$ that spell the same string as the pattern $P[1..m]$. This basic problem can be found at the heart of more complex operations on variation graphs in computational biology, of query operations in graph databases, and of analysis operations in heterogeneous networks, where the nodes of some paths must match a sequence of labels or types. We describe a simple conditional lower bound that, for any constant $\epsilon>0$, an $O(|E|^{1 - \epsilon} \, m)$-time or an $O(|E| \, m^{1 - \epsilon})$-time algorithm for exact pattern matching on graphs, with node labels and patterns drawn from a binary alphabet, cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is false. The result holds even if restricted to undirected graphs of maximum degree three or directed acyclic graphs of maximum sum of indegree and outdegree three. Although a conditional lower bound of this kind can be somehow derived from previous results (Backurs and Indyk, FOCS'16), we give a direct reduction from SETH for dissemination purposes, as the result might interest researchers from several areas, such as computational biology, graph database, and graph mining, as mentioned before. Indeed, as approximate pattern matching on graphs can be solved in $O(|E|\,m)$ time, exact and approximate matching are thus equally hard (quadratic time) on graphs under the SETH assumption. In comparison, the same problems restricted to strings have linear time vs quadratic time solutions, respectively, where the latter ones have a matching SETH lower bound on computing the edit distance of two strings (Backurs and Indyk, STOC'15). Comment: Using Lemma 12 and Lemma 13 might be enough to prove Lemma 14. However, the proof of Lemma 14 is correct if you assume that the graph used in the reduction is a DAG. Hence, since the problem is already quadratic for a DAG and a binary alphabet, it has to be quadratic also for a general graph and a binary alphabet
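To illustrate the quadratic upper bound that the lower bound in this abstract is matched against, here is a minimal sketch of a level-by-level dynamic program for exact matching on a node-labeled directed graph. It is a standard-style illustration, not the reduction from the paper. It assumes that a matching path may revisit nodes, and the function name and the input representation (a label dictionary plus a list of directed edges) are hypothetical.

```python
def has_matching_walk(labels, edges, pattern):
    """Decide whether some walk in the graph spells `pattern`.

    labels:  dict mapping each node to its one-character label
    edges:   list of directed edges (u, v)
    pattern: string P[1..m]

    Each pattern position triggers one scan over all edges, so the
    running time is O(|V| + |E| * m).
    """
    if not pattern:
        return True
    # Level 1: nodes whose label equals the first pattern symbol.
    current = {v for v, lab in labels.items() if lab == pattern[0]}
    for symbol in pattern[1:]:
        # Next level: heads of edges leaving the current level whose
        # label equals the next pattern symbol.
        nxt = set()
        for u, v in edges:
            if u in current and labels[v] == symbol:
                nxt.add(v)
        current = nxt
        if not current:
            return False
    return bool(current)


# Hypothetical binary-labeled directed cycle 0 -> 1 -> 2 -> 0.
labels = {0: "0", 1: "1", 2: "0"}
edges = [(0, 1), (1, 2), (2, 0)]
found = has_matching_walk(labels, edges, "010")
missing = has_matching_walk(labels, edges, "011")
```

An undirected graph can be handled under the same assumptions by listing each edge in both directions.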

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server