68,979 research outputs found

    On linear balancing sets

    Full text link
    Let n be an even positive integer and F be the field \GF(2). A word in F^n is called balanced if its Hamming weight is n/2. A subset C \subseteq F^n$ is called a balancing set if for every word y \in F^n there is a word x \in C such that y + x is balanced. It is shown that most linear subspaces of F^n of dimension slightly larger than 3/2\log_2(n) are balancing sets. A generalization of this result to linear subspaces that are "almost balancing" is also presented. On the other hand, it is shown that the problem of deciding whether a given set of vectors in F^n spans a balancing set, is NP-hard. An application of linear balancing sets is presented for designing efficient error-correcting coding schemes in which the codewords are balanced.Comment: The abstract of this paper appeared in the proc. of 2009 International Symposium on Information Theor

    PSO-based method for svm classification on skewed data-sets

    Get PDF
    Support Vector Machines (SVM) have shown excellent generalization power in classification problems. However, on skewed data-sets, SVM learns a biased model that affects the classifier performance, which is severely damaged when the unbalanced ratio is very large. In this paper, a new external balancing method for applying SVM on skewed data sets is developed. In the first phase of the method, the separating hyperplane is computed. Support vectors are then used to generate the initial population of PSO algorithm, which is used to improve the population of artificial instances and to eliminate noise instances. Experimental results demonstrate the ability of the proposed method to improve the performance of SVM on imbalanced data-sets.Proyecto UAEM 3771/2014/CI

    Dynamic load balancing in parallel KD-tree k-means

    Get PDF
    One among the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques allow to reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested. Two approaches are based on a static partitioning of the data set and a third solution incorporates a dynamic load balancing policy

    Tradeoffs for nearest neighbors on the sphere

    Get PDF
    We consider tradeoffs between the query and update complexities for the (approximate) nearest neighbor problem on the sphere, extending the recent spherical filters to sparse regimes and generalizing the scheme and analysis to account for different tradeoffs. In a nutshell, for the sparse regime the tradeoff between the query complexity nρqn^{\rho_q} and update complexity nρun^{\rho_u} for data sets of size nn is given by the following equation in terms of the approximation factor cc and the exponents ρq\rho_q and ρu\rho_u: c2ρq+(c21)ρu=2c21.c^2\sqrt{\rho_q}+(c^2-1)\sqrt{\rho_u}=\sqrt{2c^2-1}. For small c=1+ϵc=1+\epsilon, minimizing the time for updates leads to a linear space complexity at the cost of a query time complexity n14ϵ2n^{1-4\epsilon^2}. Balancing the query and update costs leads to optimal complexities n1/(2c21)n^{1/(2c^2-1)}, matching bounds from [Andoni-Razenshteyn, 2015] and [Dubiner, IEEE-TIT'10] and matching the asymptotic complexities of [Andoni-Razenshteyn, STOC'15] and [Andoni-Indyk-Laarhoven-Razenshteyn-Schmidt, NIPS'15]. A subpolynomial query time complexity no(1)n^{o(1)} can be achieved at the cost of a space complexity of the order n1/(4ϵ2)n^{1/(4\epsilon^2)}, matching the bound nΩ(1/ϵ2)n^{\Omega(1/\epsilon^2)} of [Andoni-Indyk-Patrascu, FOCS'06] and [Panigrahy-Talwar-Wieder, FOCS'10] and improving upon results of [Indyk-Motwani, STOC'98] and [Kushilevitz-Ostrovsky-Rabani, STOC'98]. For large cc, minimizing the update complexity results in a query complexity of n2/c2+O(1/c4)n^{2/c^2+O(1/c^4)}, improving upon the related exponent for large cc of [Kapralov, PODS'15] by a factor 22, and matching the bound nΩ(1/c2)n^{\Omega(1/c^2)} of [Panigrahy-Talwar-Wieder, FOCS'08]. Balancing the costs leads to optimal complexities n1/(2c21)n^{1/(2c^2-1)}, while a minimum query time complexity can be achieved with update complexity n2/c2+O(1/c4)n^{2/c^2+O(1/c^4)}, improving upon the previous best exponents of Kapralov by a factor 22.Comment: 16 pages, 1 table, 2 figures. Mostly subsumed by arXiv:1608.03580 [cs.DS] (along with arXiv:1605.02701 [cs.DS]
    corecore