Non-uniform Feature Sampling for Decision Tree Ensembles
We study the effectiveness of non-uniform randomized feature selection in
decision tree classification. We experimentally evaluate two feature selection
methodologies, based on information extracted from the provided dataset:
\emph{leverage scores-based} and \emph{norm-based} feature selection.
Experimental evaluation of the proposed techniques indicates that such
approaches may be more effective than naive uniform feature selection while
offering performance comparable to the random forest algorithm [3].
Comment: 7 pages, 7 figures, 1 table
Fixed-rank Rayleigh Quotient Maximization by an $M$-PSK Sequence
Certain optimization problems in communication systems, such as
limited-feedback constant-envelope beamforming or noncoherent $M$-ary
phase-shift keying ($M$-PSK) sequence detection, result in the maximization of
a fixed-rank positive semidefinite quadratic form over the $M$-PSK alphabet.
This form is a special case of the Rayleigh quotient of a matrix and, in
general, its maximization by an $M$-PSK sequence is NP-hard. However, if the
rank of the matrix is not a function of its size, then the optimal solution can
be computed with polynomial complexity in the matrix size. In this work, we
develop a new technique to efficiently solve this problem by utilizing
auxiliary continuous-valued angles and partitioning the resulting continuous
space of solutions into a polynomial-size set of regions, each of which
corresponds to a distinct PSK sequence. The sequence that maximizes the
Rayleigh quotient is shown to belong to this polynomial-size set of sequences,
thus efficiently reducing the size of the feasible set from exponential to
polynomial. Based on this analysis, we also develop an algorithm that
constructs this set in polynomial time and show that it is fully
parallelizable, memory efficient, and rank scalable. The proposed algorithm
compares favorably with other solvers for this problem that have appeared
recently in the literature.
Comment: 15 pages, 12 figures, to appear in IEEE Transactions on
Communications
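For intuition on the angle-partitioning idea, here is a sketch of the rank-1 case, where the quadratic form reduces to $|\mathbf{v}^H \mathbf{s}|^2$: sweeping a single auxiliary angle and quantizing each element to the nearest $M$-PSK symbol yields only $N$ distinct candidate sequences, one per boundary angle. The paper's general fixed-rank algorithm uses several auxiliary angles; this rank-1 version is only an illustration, not the published method.

```python
import numpy as np

def rank1_mpsk_argmax(v, M):
    """Maximize |v^H s|^2 over length-N sequences s of M-PSK symbols.

    For a fixed auxiliary angle phi, the best sequence quantizes
    angle(v_n) + phi to the nearest multiple of 2*pi/M.  As phi sweeps
    one quantization period, the quantized sequence changes only at N
    boundary angles, so N candidates cover every region of the sweep.
    """
    step = 2 * np.pi / M
    phase = np.angle(v)
    # Angles (mod one period) at which some element's nearest symbol flips.
    boundaries = np.mod(step / 2 - phase, step)
    best_val, best_s = -np.inf, None
    for phi in boundaries:
        # Nudge into the interior of the region just right of the boundary.
        k = np.floor((phase + phi + 1e-9) / step + 0.5)
        s = np.exp(1j * step * k)
        val = abs(np.vdot(v, s)) ** 2   # np.vdot conjugates v: this is v^H s
        if val > best_val:
            best_val, best_s = val, s
    return best_s, best_val
```

For small N and M, exhaustive search over all M**N sequences confirms that an optimum always lies in this N-element candidate set, which is the exponential-to-polynomial reduction the abstract describes, specialized to rank one.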
Compressive Mining: Fast and Optimal Data Mining in the Compressed Domain
Real-world data typically contain repeated and periodic patterns. This
suggests that they can be effectively represented and compressed using only a
few coefficients of an appropriate basis (e.g., Fourier, Wavelets, etc.).
However, distance estimation when the data are represented using different sets
of coefficients is still a largely unexplored area. This work studies the
optimization problems related to obtaining the \emph{tightest} lower/upper
bound on Euclidean distances when each data object is potentially compressed
using a different set of orthonormal coefficients. Our technique leads to
tighter distance estimates, which translates into more accurate search,
learning and mining operations \textit{directly} in the compressed domain.
We formulate the problem of estimating lower/upper distance bounds as an
optimization problem. We establish the properties of optimal solutions, and
leverage the theoretical analysis to develop a fast algorithm to obtain an
\emph{exact} solution to the problem. The suggested solution provides the
tightest estimation of the $\ell_2$-norm or the correlation. We show that typical
data-analysis operations, such as k-NN search or k-Means clustering, can
operate more accurately using the proposed compression and distance
reconstruction technique. We compare it with many other prevalent compression
and reconstruction techniques, including random projections and PCA-based
techniques. We highlight a surprising result, namely that when the data are
highly sparse in some basis, our technique may even outperform PCA-based
compression.
The contributions of this work are generic, as our methodology is applicable
to any sequential or high-dimensional data as well as to any orthogonal
transformation used for the underlying data compression scheme.
Comment: 25 pages, 20 figures, accepted in VLDB
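To ground the setting, here is a simple baseline in the same spirit, assuming each object keeps its top-k coefficients of an orthonormal FFT plus the scalar energy of whatever it discarded. The triangle-inequality bounds below are valid but deliberately loose; computing the tightest possible bounds is exactly the optimization the paper solves. All names and the bounding scheme itself are illustrative assumptions.

```python
import numpy as np

def compress(x, k):
    """Keep the k largest-magnitude coefficients of an orthonormal FFT
    (norm='ortho' makes it unitary, so Parseval holds exactly) plus the
    total energy of the discarded coefficients."""
    X = np.fft.fft(x, norm="ortho")
    keep = np.argsort(np.abs(X))[-k:]
    coeffs = {int(i): X[i] for i in keep}
    discarded = float(np.sum(np.abs(X) ** 2)
                      - sum(abs(c) ** 2 for c in coeffs.values()))
    return coeffs, discarded

def distance_bounds(cx, ex, cy, ey):
    """Lower/upper bounds on ||x - y||_2 when x and y kept different
    coefficient sets.  Positions known to both contribute exactly; the
    rest is bounded via the triangle inequality on the unknown tails,
    whose energies ex and ey were stored at compression time."""
    both = cx.keys() & cy.keys()
    exact = sum(abs(cx[i] - cy[i]) ** 2 for i in both)
    only_x = sum(abs(cx[i]) ** 2 for i in cx.keys() - both)
    only_y = sum(abs(cy[i]) ** 2 for i in cy.keys() - both)
    known = np.sqrt(only_x + only_y)     # disjoint supports, so energies add
    tails = np.sqrt(ex) + np.sqrt(ey)    # unknown discarded parts
    lo = np.sqrt(exact + max(0.0, known - tails) ** 2)
    hi = np.sqrt(exact + (known + tails) ** 2)
    return lo, hi
```

For any two equal-length signals, distance_bounds(*compress(x, k), *compress(y, k)) brackets np.linalg.norm(x - y); the gap between lo and hi is what the paper's water-filling-style optimization shrinks to its minimum.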
Randomized Low-Memory Singular Value Projection
Affine rank minimization algorithms typically rely on calculating the
gradient of a data error followed by a singular value decomposition at every
iteration. Because these two steps are expensive, heuristic approximations are
often used to reduce computational burden. To this end, we propose a recovery
scheme that merges the two steps with randomized approximations, and as a
result, operates on space proportional to the degrees of freedom in the
problem. We theoretically establish the estimation guarantees of the algorithm
as a function of approximation tolerance. While the theoretical approximation
requirements are overly pessimistic, we demonstrate that in practice the
algorithm performs well on the quantum tomography recovery problem.
Comment: 13 pages. This version has a revised theorem and new numerical
experiments
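The paper's algorithm is more elaborate, but the two merged ingredients can be sketched: a singular value projection (SVP) iteration for matrix completion whose rank-r projection uses a Halko-style randomized SVD driven only by mat-vec products, so the iterate is held purely in its rank-r factors. The step size, oversampling amount, and the dense residual below are simplifications assumed for brevity, not the paper's tuned choices.

```python
import numpy as np

def randomized_svd(matvec, rmatvec, m, n, rank, oversample=5, rng=None):
    """Halko-style randomized SVD from mat-vec products only, so the
    matrix being decomposed is never formed explicitly."""
    rng = np.random.default_rng() if rng is None else rng
    k = rank + oversample
    Q, _ = np.linalg.qr(matvec(rng.standard_normal((n, k))))  # range sketch
    B = rmatvec(Q).T                                          # B = Q^T A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank, :]

def svp_completion(M_obs, mask, rank, mu=1.0, iters=100, seed=0):
    """SVP iteration X <- H_rank(X - mu * grad) for matrix completion,
    keeping X only as factors (U, s, Vt).  A truly low-memory version
    would store the residual sparsely; it is dense here for brevity."""
    rng = np.random.default_rng(seed)
    m, n = M_obs.shape
    U, s, Vt = np.zeros((m, rank)), np.zeros(rank), np.zeros((rank, n))
    for _ in range(iters):
        # Gradient of the data error, supported on observed entries only.
        R = mask * ((U * s) @ Vt - M_obs)
        # Mat-vec products with the never-formed iterate A = U diag(s) Vt - mu*R.
        mv = lambda Z: (U * s) @ (Vt @ Z) - mu * (R @ Z)
        rmv = lambda Z: Vt.T @ (s[:, None] * (U.T @ Z)) - mu * (R.T @ Z)
        U, s, Vt = randomized_svd(mv, rmv, m, n, rank, rng=rng)
    return U, s, Vt
```

Storing the residual sparsely (only the observed entries) would bring the footprint down to O((m + n) * rank) plus the number of observations, i.e., proportional to the degrees of freedom, as the abstract describes.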