Delete or merge regressors for linear model selection
We consider a problem of linear model selection in the presence of both
continuous and categorical predictors. Feasible models consist of subsets of
numerical variables and partitions of levels of factors. A new algorithm called
delete or merge regressors (DMR) is presented: a stepwise backward procedure
that ranks the predictors according to squared t-statistics and chooses the
final model by minimizing BIC. In the article we prove
consistency of DMR when the number of predictors tends to infinity with the
sample size and describe a simulation study using an accompanying R package. The
results indicate significant advantage in time complexity and selection
accuracy of our algorithm over Lasso-based methods described in the literature.
Moreover, a version of DMR for generalized linear models is proposed.
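The "delete" step of such a procedure can be sketched in a few lines. This is our simplified illustration of backward elimination by squared t-statistics with BIC-based model choice, for continuous predictors only; it omits the merging of factor levels that DMR also performs, and all function names are ours:

```python
import numpy as np

def bic(y, X):
    """BIC for an OLS fit of y on X (Gaussian likelihood)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + p * np.log(n)

def t_squared(y, X):
    """Squared t-statistics of the OLS coefficients."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta ** 2 / np.diag(cov)

def backward_select(y, X):
    """Rank predictors by squared t-statistic, delete the weakest one
    at a time, and return the index set of the visited model that
    minimizes BIC."""
    active = list(range(X.shape[1]))
    best = (bic(y, X), tuple(active))
    while len(active) > 1:
        t2 = t_squared(y, X[:, active])
        active.pop(int(np.argmin(t2)))  # delete the weakest regressor
        score = bic(y, X[:, active])
        if score < best[0]:
            best = (score, tuple(active))
    return best[1]
```

Because each step removes only the currently weakest predictor, the procedure visits a nested sequence of models, which is what makes ranking by t-statistics cheap compared with searching all subsets.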
Fast Hierarchical Clustering and Other Applications of Dynamic Closest Pairs
We develop data structures for dynamic closest pair problems with arbitrary
distance functions that do not necessarily come from any geometric structure
on the objects. Based on a technique previously used by the author for
Euclidean closest pairs, we show how to insert and delete objects from an
n-object set, maintaining the closest pair, in O(n log^2 n) time per update and
O(n) space. With quadratic space, we can instead use a quadtree-like structure
to achieve an optimal time bound, O(n) per update. We apply these data
structures to hierarchical clustering, greedy matching, and TSP heuristics, and
discuss other potential applications in machine learning, Groebner bases, and
local improvement algorithms for partition and placement problems. Experiments
show our new methods to be faster in practice than previously used heuristics.

Comment: 20 pages, 9 figures. A preliminary version of this paper appeared at
the 9th ACM-SIAM Symp. on Discrete Algorithms, San Francisco, 1998, pp.
619-628. For source code and experimental results, see
http://www.ics.uci.edu/~eppstein/projects/pairs
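The nearest-neighbor bookkeeping behind such structures can be illustrated with a deliberately naive baseline. This is our sketch, not the paper's O(n log^2 n) or quadratic-space structure: each object stores its current nearest neighbor under an arbitrary distance function, an insert costs O(n) distance evaluations, and a delete rescans every object that had the deleted object as its neighbor:

```python
import math

class DynamicClosestPair:
    """Naive dynamic closest-pair maintenance for an arbitrary distance
    function: each object caches its nearest neighbor. Only a baseline
    for illustration; the paper's structures achieve far better
    amortized update bounds."""

    def __init__(self, dist):
        self.dist = dist
        self.nn = {}  # object -> (distance to neighbor, neighbor)

    def insert(self, p):
        best = (math.inf, None)
        for q in list(self.nn):
            d = self.dist(p, q)
            if d < best[0]:
                best = (d, q)
            if d < self.nn[q][0]:
                self.nn[q] = (d, p)  # p becomes q's nearest neighbor
        self.nn[p] = best

    def delete(self, p):
        del self.nn[p]
        for q, (d, r) in self.nn.items():
            if r == p:  # q lost its cached neighbor: rescan everyone
                self.nn[q] = min(
                    ((self.dist(q, s), s) for s in self.nn if s != q),
                    default=(math.inf, None))

    def closest_pair(self):
        q = min(self.nn, key=lambda o: self.nn[o][0])
        return q, self.nn[q][1], self.nn[q][0]
```

The design point the paper improves on is exactly the expensive delete: rescanning after a neighbor disappears is what the neighbor-heap and quadtree-like structures avoid.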
Minimum-Cost Coverage of Point Sets by Disks
We consider a class of geometric facility location problems in which the goal
is to determine a set X of disks given by their centers (t_j) and radii (r_j)
that cover a given set of demand points Y in the plane at the smallest possible
cost. We consider cost functions of the form sum_j f(r_j), where f(r)=r^alpha
is the cost of transmission to radius r. Special cases arise for alpha=1 (sum
of radii) and alpha=2 (total area); power consumption models in wireless
network design often use an exponent alpha>2. Different scenarios arise
according to possible restrictions on the transmission centers t_j, which may
be constrained to belong to a given discrete set or to lie on a line, etc. We
obtain several new results, including (a) exact and approximation algorithms
for selecting transmission points t_j on a given line in order to cover demand
points Y in the plane; (b) approximation algorithms (and an algebraic
intractability result) for selecting an optimal line on which to place
transmission points to cover Y; (c) a proof of NP-hardness for a discrete set
of transmission points in the plane and any fixed alpha>1; and (d) a
polynomial-time approximation scheme for the problem of computing a minimum
cost covering tour (MCCT), in which the total cost is a linear combination of
the transmission cost for the set of disks and the length of a tour/path that
connects the centers of the disks.

Comment: 10 pages, 4 figures, Latex, to appear in ACM Symposium on
Computational Geometry 200
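For intuition about the objective sum_j f(r_j), the discrete-centers variant (case (c), shown NP-hard above) admits a simple exponential-time brute force. The sketch below is our illustration for tiny instances only; it uses the observation that some optimal disk passes through a demand point, so center-to-point distances suffice as candidate radii:

```python
import itertools
import math

def min_cost_cover(centers, points, alpha=2.0):
    """Brute-force minimum-cost disk cover with discrete candidate
    centers (exponential time; tiny instances only). Candidate radii
    for each center are its distances to the demand points, and the
    cost of a disk of radius r is r ** alpha."""
    disks = [(c, math.dist(c, p)) for c in centers for p in points]
    best_cost, best = math.inf, None
    # An optimal cover uses at most one disk per center.
    for k in range(1, len(centers) + 1):
        for combo in itertools.combinations(disks, k):
            covered = all(
                any(math.dist(p, c) <= r + 1e-9 for c, r in combo)
                for p in points)
            if covered:
                cost = sum(r ** alpha for _, r in combo)
                if cost < best_cost:
                    best_cost, best = cost, combo
    return best_cost, best
```

With alpha = 2 (total area) two small disks often beat one large one, since cost grows quadratically in the radius, which is exactly the trade-off the exponent alpha controls.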
Superclustering by finding statistically significant separable groups of optimal gaussian clusters
The paper presents an algorithm for clustering a dataset by grouping the
optimal (in the sense of the BIC criterion) number of Gaussian clusters into
superclusters that are optimal in the sense of their statistical separability.
The algorithm consists of three stages: representing the dataset as a mixture
of Gaussian distributions (clusters), whose number is chosen by minimizing the
BIC criterion; estimating the distances between clusters and the cluster sizes
using the Mahalanobis distance; and combining the resulting clusters into
superclusters with the DBSCAN method, choosing its hyperparameter (the maximum
distance) to maximize an introduced matrix quality criterion at the maximum
number of superclusters. The matrix quality criterion corresponds to the
proportion of statistically significantly separated superclusters among all
superclusters found.
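The Mahalanobis-distance stage can be illustrated with a minimal numpy sketch. The symmetrization below is our assumption for illustration and not necessarily the paper's exact inter-cluster formula:

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of point x from a Gaussian N(mean, cov)."""
    d = x - mean
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def cluster_distance(m1, c1, m2, c2):
    """A symmetrized inter-cluster distance (our choice): average of
    each cluster center's Mahalanobis distance measured under the
    other cluster's covariance."""
    return 0.5 * (mahalanobis(m1, m2, c2) + mahalanobis(m2, m1, c1))
```

Unlike Euclidean distance between centers, this accounts for each cluster's shape: elongated clusters look "closer" along their long axis, which is what makes the subsequent DBSCAN grouping sensitive to statistical rather than purely geometric separation.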
The algorithm has only one hyperparameter, the statistical significance level,
and automatically detects the optimal number and shape of superclusters based
on a statistical hypothesis testing approach. The algorithm demonstrates good
results on test datasets in both noisy and noiseless settings. An essential
advantage of the algorithm is its ability to predict the correct supercluster
for new data using an already trained clusterer and to perform soft (fuzzy)
clustering. The disadvantages of the algorithm are its low speed and the
stochastic nature of the final clustering. It requires a sufficiently large
dataset for clustering, which is typical of many statistical methods.

Comment: 32 pages, 7 figures, 1 table