36,963 research outputs found
Solving Large Scale Instances of the Distribution Design Problem Using Data Mining
In this paper we approach the solution of large instances of the distribution design problem. The traditional approaches do not consider that the instance size can significantly reduce the efficiency of the solution process. We propose a new approach that includes compression methods to transform the original instance into a new one using data mining techniques. The goal of the transformation is to condense the operation access pattern of the original instance to reduce the amount of resources needed to solve the original instance, without significantly reducing the quality of its solution. In order to validate the approach, we tested it proposing two instance compression methods on a new model of the replicated version of the distribution design problem that incorporates generalized database objects. The experimental results show that our approach permits to reduce the computational resources needed for solving large instances by at least 65%, without significantly reducing the quality of its solution. Given the encouraging results, at the moment we are working on the design and implementation of efficient instance compression methods using other data mining techniques
An efficient model for secure data publishing
Data Mining is the field of extracting and analyzing the data from large datasets. Exchange of databases is very important to get financial benefits now a day. To review business strategies and to get maximum benefit data analytics is needed. Data stored at distributed sites are integrated and published by data publisher. Data Publishing is the technique in which the data is released to others for the use. In addition to privacy preserving data size and security is also a challenge while publishing and transmitting the database. So there is a requirement of a technique which can reduce the size of database efficiently and transfer it in a secure manner. This paper proposes an efficient model for secure data publishing by using both compression and color encryption thus introducing a new approach. A new algorithm is designed and a tool is developed for implementing the proposed work
A Mining-Based Compression Approach for Constraint Satisfaction Problems
In this paper, we propose an extension of our Mining for SAT framework to
Constraint satisfaction Problem (CSP). We consider n-ary extensional
constraints (table constraints). Our approach aims to reduce the size of the
CSP by exploiting the structure of the constraints graph and of its associated
microstructure. More precisely, we apply itemset mining techniques to search
for closed frequent itemsets on these two representation. Using Tseitin
extension, we rewrite the whole CSP to another compressed CSP equivalent with
respect to satisfiability. Our approach contrast with previous proposed
approach by Katsirelos and Walsh, as we do not change the structure of the
constraints.Comment: arXiv admin note: substantial text overlap with arXiv:1304.441
Compressive Mining: Fast and Optimal Data Mining in the Compressed Domain
Real-world data typically contain repeated and periodic patterns. This
suggests that they can be effectively represented and compressed using only a
few coefficients of an appropriate basis (e.g., Fourier, Wavelets, etc.).
However, distance estimation when the data are represented using different sets
of coefficients is still a largely unexplored area. This work studies the
optimization problems related to obtaining the \emph{tightest} lower/upper
bound on Euclidean distances when each data object is potentially compressed
using a different set of orthonormal coefficients. Our technique leads to
tighter distance estimates, which translates into more accurate search,
learning and mining operations \textit{directly} in the compressed domain.
We formulate the problem of estimating lower/upper distance bounds as an
optimization problem. We establish the properties of optimal solutions, and
leverage the theoretical analysis to develop a fast algorithm to obtain an
\emph{exact} solution to the problem. The suggested solution provides the
tightest estimation of the -norm or the correlation. We show that typical
data-analysis operations, such as k-NN search or k-Means clustering, can
operate more accurately using the proposed compression and distance
reconstruction technique. We compare it with many other prevalent compression
and reconstruction techniques, including random projections and PCA-based
techniques. We highlight a surprising result, namely that when the data are
highly sparse in some basis, our technique may even outperform PCA-based
compression.
The contributions of this work are generic as our methodology is applicable
to any sequential or high-dimensional data as well as to any orthogonal data
transformation used for the underlying data compression scheme.Comment: 25 pages, 20 figures, accepted in VLD
Bolt: Accelerated Data Mining with Fast Vector Compression
Vectors of data are at the heart of machine learning and data mining.
Recently, vector quantization methods have shown great promise in reducing both
the time and space costs of operating on vectors. We introduce a vector
quantization algorithm that can compress vectors over 12x faster than existing
techniques while also accelerating approximate vector operations such as
distance and dot product computations by up to 10x. Because it can encode over
2GB of vectors per second, it makes vector quantization cheap enough to employ
in many more circumstances. For example, using our technique to compute
approximate dot products in a nested loop can multiply matrices faster than a
state-of-the-art BLAS implementation, even when our algorithm must first
compress the matrices.
In addition to showing the above speedups, we demonstrate that our approach
can accelerate nearest neighbor search and maximum inner product search by over
100x compared to floating point operations and up to 10x compared to other
vector quantization methods. Our approximate Euclidean distance and dot product
computations are not only faster than those of related algorithms with slower
encodings, but also faster than Hamming distance computations, which have
direct hardware support on the tested platforms. We also assess the errors of
our algorithm's approximate distances and dot products, and find that it is
competitive with existing, slower vector quantization algorithms.Comment: Research track paper at KDD 201
Rate-Distortion Classification for Self-Tuning IoT Networks
Many future wireless sensor networks and the Internet of Things are expected
to follow a software defined paradigm, where protocol parameters and behaviors
will be dynamically tuned as a function of the signal statistics. New protocols
will be then injected as a software as certain events occur. For instance, new
data compressors could be (re)programmed on-the-fly as the monitored signal
type or its statistical properties change. We consider a lossy compression
scenario, where the application tolerates some distortion of the gathered
signal in return for improved energy efficiency. To reap the full benefits of
this paradigm, we discuss an automatic sensor profiling approach where the
signal class, and in particular the corresponding rate-distortion curve, is
automatically assessed using machine learning tools (namely, support vector
machines and neural networks). We show that this curve can be reliably
estimated on-the-fly through the computation of a small number (from ten to
twenty) of statistical features on time windows of a few hundreds samples
- …