
    Delete or merge regressors for linear model selection

    We consider the problem of linear model selection in the presence of both continuous and categorical predictors. Feasible models consist of subsets of numerical variables and partitions of levels of factors. A new algorithm called delete or merge regressors (DMR) is presented: a stepwise backward procedure that ranks the predictors according to squared t-statistics and chooses the final model by minimizing BIC. In the article we prove consistency of DMR when the number of predictors tends to infinity with the sample size, and we describe a simulation study using an accompanying R package. The results indicate a significant advantage of our algorithm over Lasso-based methods from the literature in both time complexity and selection accuracy. Moreover, a version of DMR for generalized linear models is proposed.
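    For intuition, the following is a minimal Python sketch of the "delete" half of such a backward path: predictors are ranked by squared t-statistics, the weakest one is dropped at each step, and the model on the path with minimal BIC is returned. The "merge" step for factor levels, which gives DMR its name, is omitted, and all names are illustrative rather than taken from the R package.

```python
# Sketch of a backward-deletion path ranked by squared t-statistics,
# returning the model on the path that minimizes BIC. Illustrative only;
# the actual DMR algorithm also merges factor levels.
import numpy as np
import statsmodels.api as sm

def backward_bic_path(X, y):
    """X: 2D array of numerical predictors, y: response vector."""
    cols = list(range(X.shape[1]))
    best_bic, best_cols = np.inf, list(cols)
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
        if fit.bic < best_bic:
            best_bic, best_cols = fit.bic, list(cols)
        t2 = np.asarray(fit.tvalues)[1:] ** 2  # skip the intercept
        cols.pop(int(np.argmin(t2)))           # delete the weakest regressor
    return best_cols, best_bic

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(size=200)  # only columns 0 and 3 matter
print(backward_bic_path(X, y))                    # expect roughly [0, 3]
```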

    Fast Hierarchical Clustering and Other Applications of Dynamic Closest Pairs

    We develop data structures for dynamic closest pair problems with arbitrary distance functions that do not necessarily come from any geometric structure on the objects. Based on a technique previously used by the author for Euclidean closest pairs, we show how to insert and delete objects from an n-object set, maintaining the closest pair, in O(n log^2 n) time per update and O(n) space. With quadratic space, we can instead use a quadtree-like structure to achieve an optimal time bound, O(n) per update. We apply these data structures to hierarchical clustering, greedy matching, and TSP heuristics, and discuss other potential applications in machine learning, Groebner bases, and local improvement algorithms for partition and placement problems. Experiments show our new methods to be faster in practice than previously used heuristics.
    Comment: 20 pages, 9 figures. A preliminary version of this paper appeared at the 9th ACM-SIAM Symp. on Discrete Algorithms, San Francisco, 1998, pp. 619-628. For source code and experimental results, see http://www.ics.uci.edu/~eppstein/projects/pairs
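    The paper's structures are intricate; the sketch below illustrates only the dynamic closest-pair interface, using a far simpler neighbor-list heuristic in which each object caches a candidate nearest neighbor that is rebuilt on deletions. Its worst case is quadratic per update, unlike the paper's bounds, but it works with an arbitrary distance function.

```python
import math

class DynamicClosestPair:
    """Maintain the closest pair under insertions and deletions.

    A simple neighbor-list heuristic, not the paper's data structure:
    worst-case quadratic per update, arbitrary distance functions.
    """

    def __init__(self, dist):
        self.dist = dist   # arbitrary symmetric distance function
        self.nn = {}       # object -> (distance, candidate nearest neighbor)

    def _nearest(self, p):
        # Linear scan; the paper's structures avoid exactly this cost.
        return min(((self.dist(p, q), q) for q in self.nn if q != p),
                   default=(math.inf, None))

    def insert(self, p):
        self.nn[p] = (math.inf, None)
        self.nn[p] = self._nearest(p)
        for q in self.nn:              # existing objects may prefer p
            if q != p:
                d = self.dist(p, q)
                if d < self.nn[q][0]:
                    self.nn[q] = (d, p)

    def delete(self, p):
        del self.nn[p]
        for q, (_, r) in list(self.nn.items()):
            if r == p:                 # q lost its candidate; rescan
                self.nn[q] = self._nearest(q)

    def closest_pair(self):
        p = min(self.nn, key=lambda q: self.nn[q][0])
        d, q = self.nn[p]
        return p, q, d

dcp = DynamicClosestPair(lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1]))
for pt in [(0, 0), (5, 5), (1, 1), (9, 0)]:
    dcp.insert(pt)
print(dcp.closest_pair())   # ((0, 0), (1, 1), 2)
dcp.delete((1, 1))
print(dcp.closest_pair())
```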

    Minimum-Cost Coverage of Point Sets by Disks

    We consider a class of geometric facility location problems in which the goal is to determine a set X of disks given by their centers (t_j) and radii (r_j) that cover a given set of demand points Y in the plane at the smallest possible cost. We consider cost functions of the form sum_j f(r_j), where f(r)=r^alpha is the cost of transmission to radius r. Special cases arise for alpha=1 (sum of radii) and alpha=2 (total area); power consumption models in wireless network design often use an exponent alpha>2. Different scenarios arise according to possible restrictions on the transmission centers t_j, which may be constrained to belong to a given discrete set or to lie on a line, etc. We obtain several new results, including (a) exact and approximation algorithms for selecting transmission points t_j on a given line in order to cover demand points Y in the plane; (b) approximation algorithms (and an algebraic intractability result) for selecting an optimal line on which to place transmission points to cover Y; (c) a proof of NP-hardness for a discrete set of transmission points in the plane and any fixed alpha>1; and (d) a polynomial-time approximation scheme for the problem of computing a minimum cost covering tour (MCCT), in which the total cost is a linear combination of the transmission cost for the set of disks and the length of a tour/path that connects the centers of the disks.
    Comment: 10 pages, 4 figures, LaTeX; to appear in the ACM Symposium on Computational Geometry 2006
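    As a sanity check on the cost model, the brute-force Python sketch below assigns every demand point to one candidate center, gives each used center the radius of its farthest assigned point, and minimizes sum_j r_j^alpha over all assignments. The enumeration is exponential and is meant only to illustrate the objective on tiny instances; it is not one of the paper's algorithms, and all names are illustrative.

```python
# Exhaustive search over assignments of demand points to candidate centers,
# minimizing sum_j r_j**alpha. Feasible only for tiny instances.
import itertools
import math

def min_cover_cost(centers, demands, alpha):
    best = (math.inf, None)
    for assign in itertools.product(range(len(centers)), repeat=len(demands)):
        radii = [0.0] * len(centers)
        for y, j in zip(demands, assign):
            radii[j] = max(radii[j], math.dist(centers[j], y))
        cost = sum(r ** alpha for r in radii)
        best = min(best, (cost, assign))
    return best

centers = [(0.0, 0.0), (4.0, 0.0)]              # candidate transmission points t_j
demands = [(0.0, 1.0), (1.0, 0.0), (4.0, 2.0)]  # demand points Y
print(min_cover_cost(centers, demands, alpha=2))  # total-area-style objective
```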

    Superclustering by finding statistically significant separable groups of optimal gaussian clusters

    The paper presents an algorithm for clustering a dataset by grouping the optimal (in the sense of the BIC criterion) number of Gaussian clusters into the optimal (in the sense of their statistical separability) set of superclusters. The algorithm consists of three stages: representing the dataset as a mixture of Gaussian distributions (clusters), whose number is determined by minimizing the BIC criterion; estimating the distances between clusters and the cluster sizes using the Mahalanobis distance; and combining the resulting clusters into superclusters with the DBSCAN method, choosing its hyperparameter (the maximum distance) so as to maximize an introduced matrix quality criterion at the maximum number of superclusters. The matrix quality criterion corresponds to the proportion of statistically significantly separated superclusters among all superclusters found. The algorithm has only one hyperparameter, the statistical significance level, and it automatically detects the optimal number and shape of the superclusters based on a statistical hypothesis-testing approach. The algorithm demonstrates good results on test datasets in both noisy and noiseless settings. An essential advantage of the algorithm is its ability to predict the correct supercluster for new data from an already trained clusterer and to perform soft (fuzzy) clustering. Its disadvantages are its low speed and the stochastic nature of the final clustering; it also requires a sufficiently large dataset, which is typical of many statistical methods.
    Comment: 32 pages, 7 figures, 1 table
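    A condensed Python sketch of this three-stage pipeline using scikit-learn is given below. Plain symmetrized Mahalanobis distances between cluster centers stand in for the paper's statistical separability test, and the DBSCAN eps is fixed rather than tuned to maximize the matrix quality criterion, so this is an assumption-laden simplification rather than the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import DBSCAN

def supercluster(X, max_k=10, eps=3.0, seed=0):
    # Stage 1: BIC-optimal number of Gaussian clusters.
    gmms = [GaussianMixture(k, random_state=seed).fit(X)
            for k in range(1, max_k + 1)]
    gmm = min(gmms, key=lambda g: g.bic(X))
    mu, cov = gmm.means_, gmm.covariances_
    k = len(mu)
    # Stage 2: symmetrized Mahalanobis distances between cluster centers
    # (a simplification of the paper's separability statistics).
    D = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            d = mu[i] - mu[j]
            m_ij = np.sqrt(d @ np.linalg.solve(cov[i], d))
            m_ji = np.sqrt(d @ np.linalg.solve(cov[j], d))
            D[i, j] = 0.5 * (m_ij + m_ji)
    # Stage 3: DBSCAN on the precomputed distance matrix groups the
    # Gaussian clusters into superclusters; each data point then
    # inherits the supercluster label of its cluster.
    cluster_labels = DBSCAN(eps=eps, min_samples=1,
                            metric="precomputed").fit(D).labels_
    return cluster_labels[gmm.predict(X)]
```

    With min_samples=1 every Gaussian cluster becomes a DBSCAN core point, so no cluster is left unassigned; the soft (fuzzy) clustering mentioned above could be recovered from the mixture's posterior probabilities (gmm.predict_proba).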