On SAT Models Enumeration in Itemset Mining
Frequent itemset mining is an essential part of data analysis and data
mining. Recent works propose interesting SAT-based encodings for the problem of
discovering frequent itemsets. Our aim in this work is to define strategies for
adapting SAT solvers to such encodings in order to improve model enumeration.
In this context, we study in depth the effects of restarts, branching heuristics
and clause learning. We then conduct an experimental evaluation on SAT-based
itemset mining instances to show how SAT solvers can be adapted to obtain an
efficient SAT model enumerator.
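To make the enumeration target concrete, here is a minimal brute-force sketch of frequent itemset mining. It is not the SAT encoding studied in the paper; it only lists the itemsets that, in such encodings, the enumerated models would be expected to correspond to. The function name, the toy transactions and the threshold `min_support` are illustrative assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset covered by at least
    min_support transactions (exhaustive search, exponential in the items)."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            candidate_set = frozenset(candidate)
            # Support = number of transactions containing the whole candidate.
            support = sum(1 for t in transactions if candidate_set <= t)
            if support >= min_support:
                result[candidate_set] = support
    return result


# Hypothetical toy database of four transactions.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
found = frequent_itemsets(transactions, min_support=2)
for itemset, support in sorted(found.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), support)
```

A SAT-based enumerator would aim to produce the same set of itemsets, one per model, without visiting every candidate explicitly.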
On Solving the Maximum -club Problem
Given a simple undirected graph , the maximum -club problem is to find
a maximum-cardinality subset of nodes inducing a subgraph of diameter at most
in . This NP-hard generalization of clique, originally introduced to
model low diameter clusters in social networks, is of interest in network-based
data mining and clustering applications. We give two MAX-SAT formulations of
the problem and show that two exact methods resulting from our encodings
significantly outperform the state-of-the-art exact methods when evaluated both
on sparse and dense random graphs as well as on diverse real-life graphs from
the literature.
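The definition above can be checked directly on small graphs. The sketch below tests whether a node subset induces a subgraph whose diameter stays within a bound; it is a feasibility check only, not either of the MAX-SAT formulations. The parameter name `max_diameter` stands in for the diameter bound whose symbol is missing from the abstract text, and the star graph is a made-up example.

```python
from collections import deque


def induced_diameter(adjacency, nodes):
    """Diameter of the subgraph induced by `nodes`; infinity if disconnected."""
    nodes = set(nodes)
    worst = 0
    for source in nodes:
        # Breadth-first search restricted to the induced subgraph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v in nodes and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < len(nodes):
            return float("inf")
        worst = max(worst, max(dist.values()))
    return worst


def is_club(adjacency, nodes, max_diameter):
    """True if `nodes` induces a subgraph of diameter at most max_diameter."""
    return induced_diameter(adjacency, nodes) <= max_diameter


# Hypothetical star graph: centre 0 joined to leaves 1, 2 and 3.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(is_club(star, {0, 1, 2, 3}, 2))
print(is_club(star, {1, 2, 3}, 2))
```

The second call shows why the problem is about induced subgraphs: the leaves are close to each other in the whole graph, but once the centre is left out they are disconnected.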
Flexible constrained sampling with guarantees for pattern mining
Pattern sampling has been proposed as a potential solution to the infamous
pattern explosion. Instead of enumerating all patterns that satisfy the
constraints, individual patterns are sampled proportional to a given quality
measure. Several sampling algorithms have been proposed, but each of them has
its limitations when it comes to 1) flexibility in terms of quality measures
and constraints that can be used, and/or 2) guarantees with respect to sampling
accuracy. We therefore present Flexics, the first flexible pattern sampler that
supports a broad class of quality measures and constraints, while providing
strong guarantees regarding sampling accuracy. To achieve this, we leverage the
perspective on pattern mining as a constraint satisfaction problem and build
upon the latest advances in sampling solutions in SAT as well as existing
pattern mining algorithms. Furthermore, the proposed algorithm is applicable to
a variety of pattern languages, which allows us to introduce and tackle the
novel task of sampling sets of patterns. We introduce and empirically evaluate
two variants of Flexics: 1) a generic variant that addresses the well-known
itemset sampling task and the novel pattern set sampling task as well as a wide
range of expressive constraints within these tasks, and 2) a specialized
variant that exploits existing frequent itemset techniques to achieve
substantial speed-ups. Experiments show that Flexics is both accurate and
efficient, making it a useful tool for pattern-based data exploration.
Comment: Accepted for publication in Data Mining & Knowledge Discovery journal
(ECML/PKDD 2017 journal track)
A SAT model to mine flexible sequences in transactional datasets
Traditional pattern mining algorithms generally suffer from a lack of
flexibility. In this paper, we propose a SAT formulation of the problem to
successfully mine frequent flexible sequences occurring in transactional
datasets. Our SAT-based approach can easily be extended with extra constraints
to address a broad range of pattern mining applications. To demonstrate this
claim, we formulate and add several constraints, such as gap and span
constraints, to our model in order to extract more specific patterns. We also
use interactive solving to perform important derived tasks, such as closed
pattern mining or maximal pattern mining. Finally, we prove the practical
feasibility of our SAT model by running experiments on two real datasets.
On When and How to use SAT to Mine Frequent Itemsets
A new stream of research was born in the last decade with the goal of mining
itemsets of interest using Constraint Programming (CP). This has promoted a
natural way to combine complex constraints in a highly flexible manner.
Although CP state-of-the-art solutions formulate the task using Boolean
variables, the few attempts to adopt propositional Satisfiability (SAT)
provided an unsatisfactory performance. This work deepens the study on when and
how to use SAT for the frequent itemset mining (FIM) problem by defining
different encodings with multiple task-driven enumeration options and search
strategies. Although for the majority of the scenarios SAT-based solutions
appear to be non-competitive with CP peers, results show a variety of
interesting cases where SAT encodings are the best option.
Practical Algorithms for Finding Extremal Sets
The minimal sets within a collection of sets are defined as the ones which do
not have a proper subset within the collection, and the maximal sets are the
ones which do not have a proper superset within the collection. Identifying
extremal sets is a fundamental problem with a wide range of applications in SAT
solvers, data mining and social network analysis. In this paper, we present two
novel improvements of the high-quality extremal set identification algorithm,
AMS-Lex, described by Bayardo and Panda. The first technique uses
memoization to improve the execution time of the single-threaded variant of
AMS-Lex, whilst our second improvement uses parallel programming methods. In a
subset of the presented experiments our memoized algorithm executes more than
times faster than the highly efficient publicly available implementation
of AMS-Lex. Moreover, we show that our modified algorithm's speedup is not
bounded above by a constant and that it increases as the length of the common
prefixes in successive input itemsets increases. We provide
experimental results using both real-world and synthetic data sets, and show
our multi-threaded variant algorithm out-performing AMS-Lex by to
times. We find that on synthetic input datasets when executed using CPU
cores of a -core machine, our multi-threaded program executes about as fast
as the state of the art parallel GPU-based program using an NVIDIA GTX 580
graphics processing unit.
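The definitions of minimal and maximal sets given at the start of this abstract translate into a few lines of code. The sketch below is a naive quadratic baseline, not AMS-Lex or either of the improvements described; the function name and the sample collection are illustrative assumptions.

```python
def extremal_sets(collection):
    """Return (minimal, maximal) sets of a collection by pairwise comparison.

    A set is minimal if no other set in the collection is a proper subset
    of it, and maximal if no other set is a proper superset of it.
    """
    sets = [frozenset(s) for s in collection]
    minimal = [s for s in sets if not any(t < s for t in sets)]
    maximal = [s for s in sets if not any(t > s for t in sets)]
    return minimal, maximal


# Hypothetical input collection.
collection = [{1}, {1, 2}, {2, 3}, {1, 2, 3}]
minimal, maximal = extremal_sets(collection)
print([sorted(s) for s in minimal])
print([sorted(s) for s in maximal])
```

Every set is compared with every other set, so this baseline scales quadratically in the number of sets; it is useful mainly as a correctness reference for faster algorithms.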
Mining to Compact CNF Propositional Formulae
In this paper, we propose a first application of data mining techniques to
propositional satisfiability. Our proposed Mining4SAT approach aims to discover
and to exploit hidden structural knowledge for reducing the size of
propositional formulae in conjunctive normal form (CNF). Mining4SAT combines
both frequent itemset mining techniques and Tseitin's encoding for a compact
representation of CNF formulae. Experiments with our Mining4SAT approach show
interesting reductions in the sizes of many application instances taken from
the last SAT competitions.
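As a rough illustration of how a frequently occurring sub-clause might be factored out with an auxiliary variable, consider the sketch below. It is an assumption about the general idea, not the Mining4SAT algorithm itself: the sub-clause is supplied by hand rather than mined, and only the one-directional definition needed for equisatisfiability is added. Literals are signed integers, and `factor_subclause` and `fresh_var` are hypothetical names.

```python
def factor_subclause(clauses, sub, fresh_var):
    """Replace every occurrence of the literal set `sub` by `fresh_var`.

    Adds the clause (-fresh_var OR sub), i.e. fresh_var implies sub,
    so the result is equisatisfiable with the input.
    """
    sub = frozenset(sub)
    rewritten = []
    for clause in clauses:
        clause = frozenset(clause)
        if sub <= clause:
            rewritten.append((clause - sub) | {fresh_var})
        else:
            rewritten.append(clause)
    rewritten.append(frozenset({-fresh_var}) | sub)
    return rewritten


def literal_count(clauses):
    """Formula size measured as the total number of literal occurrences."""
    return sum(len(c) for c in clauses)


# Hypothetical formula: three clauses share the sub-clause {1, 2, 3}.
original = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {7, 8}]
compact = factor_subclause(original, {1, 2, 3}, fresh_var=9)
print(literal_count(original), literal_count(compact))
```

The saving grows with the length of the shared sub-clause and with the number of clauses that contain it, which is where frequent itemset mining over clauses would come in.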
NetSimile: A Scalable Approach to Size-Independent Network Similarity
Given a set of k networks, possibly with different sizes and no overlaps in
nodes or edges, how can we quickly assess similarity between them, without
solving the node-correspondence problem? Analogously, how can we extract a
small number of descriptive, numerical features from each graph that
effectively serve as the graph's "signature"? Having such features will enable
a wealth of graph mining tasks, including clustering, outlier detection,
visualization, etc.
We propose NetSimile -- a novel, effective, and scalable method for solving
the aforementioned problem. NetSimile has the following desirable properties:
(a) It gives similarity scores that are size-invariant. (b) It is scalable,
being linear on the number of edges for "signature" vector extraction. (c) It
does not need to solve the node-correspondence problem. We present extensive
experiments on numerous synthetic and real graphs from disparate domains, and
show NetSimile's superiority over baseline competitors. We also show how
NetSimile enables several mining tasks such as clustering, visualization,
discontinuity detection, network transfer learning, and re-identification
across networks.
Comment: 12 pages, 10 figures
Improving efficiency of information measurement system of coal mine air gas protection
Purpose. To develop scientific approaches to building high-precision, high-speed optoelectronic measurement systems as part of the air and gas safety complex of coal mines, using the proposed and implemented methods and means of improving measurement system efficiency, which account for and compensate the effect of destabilizing factors.
Methods. Experimental studies were carried out under production conditions in mines and in laboratories, on physical models of information measurement systems, using metrologically certified measuring instruments.
Findings. It is proposed to determine the efficiency of the developed information measurement system as the arithmetic mean, over n groups, of the geometric mean of the information data rates of the m meters of mine atmosphere parameters in each group. Using the developed information measurement system for methane and dust concentration within the UTSSC was found to increase the data rate of the mine air and gas protection system by 16.5 bits/s.
Originality. For the first time, a logical design of an information measurement system for methane and dust concentration has been proposed and implemented that, unlike existing ones, is based on increasing the accuracy and response speed of the measuring channels. This made it possible to raise the probability of detecting explosive situations from 0.90 to 0.98 and to improve mine air and gas protection.
Practical implications. The developed methods and techniques made it possible to implement a number of projects for the mining industry: a high-speed methane concentration measurement system for the mine dispatcher telephone communication and notification complex “SAT” (private company “Deyta Express”, Ukraine); and a polydisperse dust concentration measurement system for the unified telecommunication system of supervisory control and automated management of mining machines and technological complexes “UTSSC” (State Enterprise “Petrovsky Plant of Mining Machinery”, Ukraine).
This work would have been impossible without the financial support of the Ministry of Education and Science of Ukraine during the execution of project No 0115U002655 “Research and development of an experimental sample of optical meter of methane concentration for coal mines”. Additional financial support was provided under the Inter-Regional Programme of the European Neighbourhood and Partnership Instrument Tempus VI, project 544010 – TEMPUS – 1 – 2013 – 1 – DE – TEMPUS – JPHES “TATU: Trainings in Automation Technologies for Ukraine”. The authors thank the employees of the State Enterprise “Petrovsky Plant of Mining Machinery” and the private company “Deyta Express” for participating in the creation of research samples of methane and dust concentration meters for coal mine conditions, and for their support in conducting research under industrial conditions.
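The efficiency measure described in the Findings paragraph (an arithmetic mean over groups of the per-group geometric mean of meter data rates) can be written out as a short computation. This is a sketch of that formula only; the function name and the data rates in bits/s are invented for illustration and are not taken from the study.

```python
from statistics import geometric_mean, mean


def system_efficiency(groups):
    """Arithmetic mean over groups of the geometric mean of the data rates
    of the meters within each group."""
    return mean([geometric_mean(rates) for rates in groups])


# Hypothetical data rates (bits/s): two groups of two meters each.
groups = [[4.0, 9.0], [2.0, 8.0]]
efficiency = system_efficiency(groups)
print(round(efficiency, 6))
```

Using the geometric mean within a group means that a single slow meter pulls the group score down more than it would under an arithmetic mean.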
On the Complexity of Exact Pattern Matching in Graphs: Binary Strings and Bounded Degree
Exact pattern matching in labeled graphs is the problem of searching paths of
a graph that spell the same string as the pattern . This
basic problem can be found at the heart of more complex operations on variation
graphs in computational biology, of query operations in graph databases, and of
analysis operations in heterogeneous networks, where the nodes of some paths
must match a sequence of labels or types. We describe a simple conditional
lower bound that, for any constant , an -time or an -time algorithm for exact pattern
matching on graphs, with node labels and patterns drawn from a binary alphabet,
cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is
false. The result holds even if restricted to undirected graphs of maximum
degree three or directed acyclic graphs of maximum sum of indegree and
outdegree three. Although a conditional lower bound of this kind can be somehow
derived from previous results (Backurs and Indyk, FOCS'16), we give a direct
reduction from SETH for dissemination purposes, as the result might interest
researchers from several areas, such as computational biology, graph databases,
and graph mining, as mentioned before. Indeed, as approximate pattern matching
on graphs can be solved in time, exact and approximate matching are
thus equally hard (quadratic time) on graphs under the SETH assumption. In
comparison, the same problems restricted to strings have linear time vs
quadratic time solutions, respectively, where the latter ones have a matching
SETH lower bound on computing the edit distance of two strings (Backurs and
Indyk, STOC'15).
Comment: Using Lemma 12 and Lemma 13 might be enough to prove Lemma 14.
However, the proof of Lemma 14 is correct if you assume that the graph used
in the reduction is a DAG. Hence, since the problem is already quadratic for
a DAG and a binary alphabet, it has to be quadratic also for a general graph
and a binary alphabet.
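For reference, the straightforward quadratic-time approach to exact matching in a node-labeled graph can be sketched as follows. It keeps the set of nodes at which some path spelling the current pattern prefix can end, and extends that set one symbol at a time, so the work is roughly the number of edges times the pattern length. This is a textbook-style sketch, not the reduction from the paper; it assumes a directed graph given as adjacency lists and allows a path to revisit nodes, and all names are illustrative.

```python
def has_matching_path(labels, edges, pattern):
    """True if some path in the graph spells `pattern` through its node labels.

    labels: dict mapping node -> single-character label
    edges:  dict mapping node -> list of successor nodes
    """
    if not pattern:
        return True
    # Nodes where a path spelling pattern[:1] can end.
    current = {v for v, label in labels.items() if label == pattern[0]}
    for symbol in pattern[1:]:
        # Follow every outgoing edge and keep successors with the right label.
        current = {
            v
            for u in current
            for v in edges.get(u, ())
            if labels[v] == symbol
        }
        if not current:
            return False
    return bool(current)


# Hypothetical graph over a binary alphabet: 0 -> 1 -> 2.
labels = {0: "0", 1: "1", 2: "0"}
edges = {0: [1], 1: [2], 2: []}
print(has_matching_path(labels, edges, "010"))
print(has_matching_path(labels, edges, "011"))
```

The conditional lower bound discussed above says that, under SETH, this kind of quadratic behaviour cannot be improved substantially even for binary labels and bounded degree.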