
    Active Sampling of Pairs and Points for Large-scale Linear Bipartite Ranking

    Bipartite ranking is a fundamental ranking problem that learns to order relevant instances ahead of irrelevant ones. The pair-wise approach to bipartite ranking constructs a quadratic number of pairs to solve the problem, which is infeasible for large-scale data sets. The point-wise approach, albeit more efficient, often results in inferior performance. That is, it is difficult to conduct bipartite ranking accurately and efficiently at the same time. In this paper, we develop a novel active sampling scheme within the pair-wise approach to conduct bipartite ranking efficiently. The scheme is inspired by active learning and reaches competitive ranking performance while focusing on only a small subset of the many pairs during training. Moreover, we propose a general Combined Ranking and Classification (CRC) framework to conduct bipartite ranking accurately. The framework unifies the point-wise and pair-wise approaches and is based simply on the idea of treating each instance point as a pseudo-pair. Experiments on 14 real-world large-scale data sets demonstrate that the proposed algorithm of Active Sampling within CRC, when coupled with a linear Support Vector Machine, usually outperforms state-of-the-art point-wise and pair-wise ranking approaches in terms of both accuracy and efficiency.
    Comment: a shorter version was presented in ACML 201
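To see why the pair-wise approach scales quadratically, and how sampling a subset of pairs sidesteps that cost, here is a minimal sketch. The helper name and the sampling scheme (plain uniform sampling, not the paper's active sampling) are illustrative assumptions:

```python
import random

def make_pairs(labels, sample_size=None, seed=0):
    """Enumerate (positive, negative) index pairs for pair-wise ranking.

    With n_pos positives and n_neg negatives the full set has
    n_pos * n_neg pairs -- quadratic in the data size. Passing
    sample_size keeps only a uniform random subset, a crude stand-in
    for training on a small fraction of the pairs.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    pairs = [(p, n) for p in pos for n in neg]
    if sample_size is not None and sample_size < len(pairs):
        pairs = random.Random(seed).sample(pairs, sample_size)
    return pairs

labels = [1] * 100 + [0] * 900           # 100 relevant, 900 irrelevant
full = make_pairs(labels)                # 100 * 900 = 90,000 pairs
small = make_pairs(labels, sample_size=1000)
print(len(full), len(small))             # 90000 1000
```

The paper's contribution is choosing *which* pairs to keep (actively, during training) rather than sampling them uniformly as above.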

    Minimizing Finite Sums with the Stochastic Average Gradient

    We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values, the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from O(1/k^{1/2}) to O(1/k) in general, and when the sum is strongly convex the convergence rate is improved from the sub-linear O(1/k) to a linear convergence rate of the form O(p^k) for p < 1. Further, in many cases the convergence rate of the new method is also faster than that of black-box deterministic gradient methods, in terms of the number of gradient evaluations. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of non-uniform sampling strategies.
    Comment: Revision of the January 2015 submission. Major changes: updated literature review and discussion of subsequent work, an additional lemma showing the validity of one of the formulas, a somewhat simplified presentation of the Lyapunov bound, the code needed for checking the proofs (rather than the polynomials generated by the code), and error regions added to the numerical experiments.
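The core SAG update, keeping the most recent gradient of each term in memory and stepping along their average, can be sketched on a toy problem. The objective, step size, and iteration count below are illustrative assumptions, not the paper's experimental setup:

```python
import random

def sag(a, steps=2000, lr=0.5, seed=0):
    """SAG on f(x) = (1/n) * sum_i 0.5 * (x - a[i])**2, minimized at mean(a).

    One stored gradient per term: each iteration refreshes a single
    random term's gradient, then steps along the average of all stored
    gradients -- so the per-iteration cost is independent of n.
    """
    rng = random.Random(seed)
    n = len(a)
    x = 0.0
    g = [0.0] * n        # memory of the last gradient seen for each term
    g_sum = 0.0          # running sum of the stored gradients
    for _ in range(steps):
        i = rng.randrange(n)
        g_new = x - a[i]          # gradient of the i-th term at the current x
        g_sum += g_new - g[i]     # swap the stale gradient out of the sum
        g[i] = g_new
        x -= lr * g_sum / n       # step along the average stored gradient
    return x

a = [1.0, 2.0, 3.0, 10.0]
print(round(sag(a), 3))   # close to mean(a) = 4.0
```

Unlike plain SG, the step uses an average over *all* terms (with most gradients slightly stale), which is what buys the faster convergence rate described in the abstract.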

    Imbalanced dataset for benchmarking

    The different algorithms of the `imbalanced-learn` toolbox are evaluated on a set of common datasets with varying degrees of imbalance. This benchmark was proposed in [1]. The following section presents its main characteristics.

    Characteristics
    ---------------

    | ID | Name           | Repository & Target           | Ratio | # samples | # features |
    |:--:|:---------------|:------------------------------|:-----:|:---------:|:----------:|
    | 1  | Ecoli          | UCI, target: imU              | 8.6:1 | 336       | 7          |
    | 2  | Optical Digits | UCI, target: 8                | 9.1:1 | 5,620     | 64         |
    | 3  | SatImage       | UCI, target: 4                | 9.3:1 | 6,435     | 36         |
    | 4  | Pen Digits     | UCI, target: 5                | 9.4:1 | 10,992    | 16         |
    | 5  | Abalone        | UCI, target: 7                | 9.7:1 | 4,177     | 8          |
    | 6  | Sick Euthyroid | UCI, target: sick euthyroid   | 9.8:1 | 3,163     | 25         |
    | 7  | Spectrometer   | UCI, target: >=44             | 11:1  | 531       | 93         |
    | 8  | Car_Eval_34    | UCI, target: good, v good     | 12:1  | 1,728     | 6          |
    | 9  | ISOLET         | UCI, target: A, B             | 12:1  | 7,797     | 617        |
    | 10 | US Crime       | UCI, target: >0.65            | 12:1  | 1,994     | 122        |
    | 11 | Yeast_ML8      | LIBSVM, target: 8             | 13:1  | 2,417     | 103        |
    | 12 | Scene          | LIBSVM, target: >one label    | 13:1  | 2,407     | 294        |
    | 13 | Libras Move    | UCI, target: 1                | 14:1  | 360       | 90         |
    | 14 | Thyroid Sick   | UCI, target: sick             | 15:1  | 3,772     | 28         |
    | 15 | Coil_2000      | KDD, CoIL, target: minority   | 16:1  | 9,822     | 85         |
    | 16 | Arrhythmia     | UCI, target: 06               | 17:1  | 452       | 279        |
    | 17 | Solar Flare M0 | UCI, target: M->0             | 19:1  | 1,389     | 10         |
    | 18 | OIL            | UCI, target: minority         | 22:1  | 937       | 49         |
    | 19 | Car_Eval_4     | UCI, target: vgood            | 26:1  | 1,728     | 6          |
    | 20 | Wine Quality   | UCI, wine, target: <=4        | 26:1  | 4,898     | 11         |
    | 21 | Letter Img     | UCI, target: Z                | 26:1  | 20,000    | 16         |
    | 22 | Yeast_ME2      | UCI, target: ME2              | 28:1  | 1,484     | 8          |
    | 23 | Webpage        | LIBSVM, w7a, target: minority | 33:1  | 49,749    | 300        |
    | 24 | Ozone Level    | UCI, ozone, data              | 34:1  | 2,536     | 72         |
    | 25 | Mammography    | UCI, target: minority         | 42:1  | 11,183    | 6          |
    | 26 | Protein homo.  | KDD CUP 2004, minority        | 111:1 | 145,751   | 74         |
    | 27 | Abalone_19     | UCI, target: 19               | 130:1 | 4,177     | 8          |

    References
    ----------

    [1] Ding, Zejin. "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University (2011).
    [2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of Machine Learning Databases." (1998).
    [3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: A Library for Support Vector Machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.
    [4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: Results and Analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.
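The Ratio column is the majority-to-minority class ratio. A minimal pure-Python sketch of how it is computed (the toy label vector is illustrative, sized to match Ecoli's 336 samples at 8.6:1):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the majority-class count to the minority-class count,
    as reported in the Ratio column (e.g. 8.6:1 for Ecoli)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy label vector: 301 majority samples, 35 minority samples.
labels = [0] * 301 + [1] * 35
print(f"{imbalance_ratio(labels):.1f}:1")  # 8.6:1
```

Note that such a single ratio summarizes only the binary majority/minority split; for targets defined over several classes (e.g. Car_Eval_34's "good, v good") the listed ratio treats all target classes together as the minority.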