89 research outputs found
Non-uniform Feature Sampling for Decision Tree Ensembles
We study the effectiveness of non-uniform randomized feature selection in
decision tree classification. We experimentally evaluate two feature selection
methodologies, based on information extracted from the provided dataset:
\emph{leverage scores-based} and \emph{norm-based} feature selection.
Experimental evaluation of the proposed feature selection techniques indicates
that such approaches might be more effective than naive uniform feature
selection, while performing comparably to the random forest
algorithm [3].
Comment: 7 pages, 7 figures, 1 tabl
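The two sampling schemes named in the abstract can be illustrated with a minimal sketch. The abstract does not give exact definitions, so the following are assumptions: norm-based probabilities are taken proportional to squared column norms, leverage scores are computed from the top-k right singular vectors, and features are drawn without replacement. The function names and the rank parameter `k` are hypothetical.

```python
import numpy as np

def norm_probabilities(X):
    """Sampling probabilities proportional to squared column (feature) norms."""
    sq = np.sum(X ** 2, axis=0)
    return sq / sq.sum()

def leverage_probabilities(X, k):
    """Sampling probabilities proportional to rank-k column leverage scores.

    Assumption: the leverage score of feature j is the squared norm of
    column j of the top-k right singular vectors of X.
    """
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    scores = np.sum(Vt[:k] ** 2, axis=0)
    return scores / scores.sum()

def sample_features(probs, m, rng):
    """Draw m distinct feature indices according to probs."""
    return rng.choice(len(probs), size=m, replace=False, p=probs)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))           # 50 samples, 10 features
p_norm = norm_probabilities(X)
p_lev = leverage_probabilities(X, k=3)
chosen = sample_features(p_lev, m=4, rng=rng)
```

In a tree ensemble, each tree (or each split) would then be trained on the columns in `chosen` instead of a uniformly drawn subset.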
Approximate Matrix Multiplication with Application to Linear Embeddings
In this paper, we study the problem of approximately computing the product of
two real matrices. In particular, we analyze a dimensionality-reduction-based
approximation algorithm due to Sarlos [1], introducing the notion of nuclear
rank as the ratio of the nuclear norm over the spectral norm. The presented
bound has improved dependence with respect to the approximation error (as
compared to previous approaches), whereas the subspace -- on which we project
the input matrices -- has dimensions proportional to the maximum of their
nuclear rank and it is independent of the input dimensions. In addition, we
provide an application of this result to linear low-dimensional embeddings.
Namely, we show that any Euclidean point-set with bounded nuclear rank is
amenable to projection onto a number of dimensions that is independent of the
input dimensionality, while achieving additive error guarantees.
Comment: 8 pages, International Symposium on Information Theor
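As a rough illustration of the quantities in this abstract, the sketch below computes the nuclear rank (nuclear norm over spectral norm, as defined above) and approximates a product A^T B by projecting both matrices with a shared random matrix. The scaled Gaussian projection is an assumption made for illustration; the abstract does not state which projection is analyzed, and the function names are hypothetical.

```python
import numpy as np

def nuclear_rank(A):
    """Ratio of the nuclear norm to the spectral norm of a nonzero matrix A."""
    s = np.linalg.svd(A, compute_uv=False)
    return s.sum() / s.max()

def sketched_product(A, B, t, rng):
    """Approximate A.T @ B by (S A).T @ (S B) with a t x n Gaussian S.

    S has i.i.d. N(0, 1/t) entries, so S.T @ S equals the identity
    in expectation.
    """
    n = A.shape[0]
    S = rng.standard_normal((t, n)) / np.sqrt(t)
    return (S @ A).T @ (S @ B)

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
B = rng.standard_normal((100, 5))
approx = sketched_product(A, B, t=2000, rng=rng)
err = np.linalg.norm(approx - A.T @ B, 2)
scale = np.linalg.norm(A, 2) * np.linalg.norm(B, 2)
```

Comparing `err` against `scale` gives the additive, spectral-norm-relative error that bounds of this type control. Here `t` is chosen larger than the input dimension only to keep the error visibly small; in practice the point is to take `t` much smaller than `n`.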
Fixed-rank Rayleigh Quotient Maximization by an MPSK Sequence
Certain optimization problems in communication systems, such as
limited-feedback constant-envelope beamforming or noncoherent M-ary
phase-shift keying (PSK) sequence detection, result in the maximization of a
fixed-rank positive semidefinite quadratic form over the PSK alphabet. This
form is a special case of the Rayleigh quotient of a matrix and, in general,
its maximization by an MPSK sequence is NP-hard. However, if the
rank of the matrix is not a function of its size, then the optimal solution can
be computed with polynomial complexity in the matrix size. In this work, we
develop a new technique to efficiently solve this problem by utilizing
auxiliary continuous-valued angles and partitioning the resulting continuous
space of solutions into a polynomial-size set of regions, each of which
corresponds to a distinct PSK sequence. The sequence that maximizes the
Rayleigh quotient is shown to belong to this polynomial-size set of sequences,
thus efficiently reducing the size of the feasible set from exponential to
polynomial. Based on this analysis, we also develop an algorithm that
constructs this set in polynomial time and show that it is fully
parallelizable, memory efficient, and rank scalable. The proposed algorithm
compares favorably with other solvers for this problem that have appeared
recently in the literature.
Comment: 15 pages, 12 figures, To appear in IEEE Transactions on
Communication
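To make the problem statement concrete, here is an exhaustive-search baseline, not the polynomial-time algorithm the abstract describes. It assumes the fixed-rank matrix is given in factored form Q = V V^H, with V of size N x D, and that the alphabet consists of M equally spaced points on the unit circle; both the factored input and the function names are assumptions for illustration. Since every candidate sequence has the same norm, maximizing the quadratic form is equivalent to maximizing the Rayleigh quotient.

```python
import itertools
import numpy as np

def mpsk_alphabet(M):
    """The M unit-modulus points exp(2*pi*j*m/M), m = 0..M-1."""
    return np.exp(2j * np.pi * np.arange(M) / M)

def exhaustive_max(V, M):
    """Maximize s^H (V V^H) s = ||V^H s||^2 over all M**N sequences s."""
    N = V.shape[0]
    alphabet = mpsk_alphabet(M)
    best_val, best_seq = -np.inf, None
    for idx in itertools.product(range(M), repeat=N):
        s = alphabet[list(idx)]
        val = np.linalg.norm(V.conj().T @ s) ** 2
        if val > best_val:
            best_val, best_seq = val, s
    return best_val, best_seq

# Rank-1 example: with V = all-ones, any constant sequence is optimal.
V = np.ones((3, 1), dtype=complex)
val, seq = exhaustive_max(V, M=4)
```

The loop visits M**N candidates, which is the exponential feasible set that the approach in the abstract reduces to a polynomial-size one.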
Randomized Low-Memory Singular Value Projection
Affine rank minimization algorithms typically rely on calculating the
gradient of a data error followed by a singular value decomposition at every
iteration. Because these two steps are expensive, heuristic approximations are
often used to reduce computational burden. To this end, we propose a recovery
scheme that merges the two steps with randomized approximations, and as a
result, operates on space proportional to the degrees of freedom in the
problem. We theoretically establish the estimation guarantees of the algorithm
as a function of approximation tolerance. While the theoretical approximation
requirements are overly pessimistic, we demonstrate that in practice the
algorithm performs well on the quantum tomography recovery problem.
Comment: 13 pages. This version has a revised theorem and new numerical
experiment
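The idea of replacing an exact singular value decomposition with a randomized approximation can be sketched as follows. This is a generic randomized range-finder projection onto rank-r matrices, not the paper's scheme, which additionally merges the projection with the gradient step. The function name and the `oversample` parameter are assumptions.

```python
import numpy as np

def randomized_rank_r(X, r, oversample=5, rng=None):
    """Approximate the best rank-r factors of X via a random range finder."""
    rng = np.random.default_rng() if rng is None else rng
    # Sample the range of X with a Gaussian test matrix.
    Omega = rng.standard_normal((X.shape[1], r + oversample))
    Q, _ = np.linalg.qr(X @ Omega)
    # Solve a small SVD in the sampled subspace, then lift back.
    B = Q.T @ X
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :r], s[:r], Vt[:r]

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 30))  # exact rank 3
U, s, Vt = randomized_rank_r(X, r=3, rng=rng)
X_hat = (U * s) @ Vt
```

Only the factors `U`, `s`, and `Vt` need to be stored between iterations, which is what keeps memory proportional to the degrees of freedom of a rank-r matrix rather than to its full size.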
On Quantifying Qualitative Geospatial Data: A Probabilistic Approach
Living in the era of data deluge, we have witnessed a web content explosion,
largely due to the massive availability of User-Generated Content (UGC). In
this work, we specifically consider the problem of geospatial information
extraction and representation, where one can exploit diverse sources of
information (such as image and audio data, text data, etc.), going beyond
traditional volunteered geographic information. Our ambition is to include
available narrative information in an effort to better explain geospatial
relationships: with spatial reasoning being a basic form of human cognition,
narratives expressing such experiences typically contain qualitative spatial
data, i.e., spatial objects and spatial relationships.
To this end, we formulate a quantitative approach for the representation of
qualitative spatial relations extracted from UGC in the form of texts. The
proposed method quantifies such relations based on multiple text observations.
Such observations provide distance and orientation features which are utilized
by a greedy Expectation-Maximization (EM)-based algorithm to infer a
probability distribution over predefined spatial relationships; the latter
represent the quantified relationships under user-defined probabilistic
assumptions. We evaluate the applicability and quality of the proposed approach
using real UGC data originating from an actual travel blog text corpus. To
verify the quality of the result, we generate grid-based maps visualizing the
spatial extent of the various relations
- …