
    Combination of Multiple Bipartite Ranking for Web Content Quality Evaluation

    Web content quality estimation is crucial to various web content processing applications. Our previous work applied Bagging + C4.5, a combination of many point-wise ranking models, to achieve the best results in the ECML/PKDD Discovery Challenge 2010. In this paper, we combine multiple pair-wise bipartite ranking learners to solve the multi-partite ranking problem of web quality estimation. In the encoding stage, we present a ternary encoding and a binary coding that extend each rank value to L−1 labels (where L is the number of distinct rank values). For the decoding, we discuss combining the ranking results of the multiple bipartite ranking models with either predefined weighting or adaptive weighting. Experiments on the ECML/PKDD 2010 Discovery Challenge datasets show that binary coding + predefined weighting yields the highest performance among the four combinations and, furthermore, surpasses the best results reported in the ECML/PKDD 2010 Discovery Challenge competition. (Comment: 17 pages, 8 figures, 2 tables)
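    A minimal sketch of the binary-coding reduction described above, under stated assumptions: each rank value r in {1, ..., L} is expanded into L−1 binary labels indicating whether r exceeds each threshold, one two-class learner is trained per threshold, and the L−1 scores are decoded with fixed weights. Logistic regression stands in here for the paper's pair-wise bipartite ranking learners, and uniform weights are just one simple "predefined" choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def binary_coding(ranks, L):
    """Expand each rank r in {1..L} into L-1 binary labels:
    label k is 1 iff r > k (k = 1..L-1)."""
    return np.stack([(ranks > k).astype(int) for k in range(1, L)], axis=1)

def train_bipartite_ensemble(X, ranks, L):
    """Train one two-class model per threshold (stand-in base learner)."""
    Y = binary_coding(ranks, L)
    return [LogisticRegression(max_iter=1000).fit(X, Y[:, k])
            for k in range(L - 1)]

def combined_score(models, X, weights=None):
    """Decode: weighted sum of the L-1 bipartite scores; weights=None
    gives uniform 'predefined' weighting."""
    S = np.stack([m.predict_proba(X)[:, 1] for m in models], axis=1)
    if weights is None:
        weights = np.full(S.shape[1], 1.0 / S.shape[1])
    return S @ weights

# Toy usage: 5 rank levels, random features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
ranks = rng.integers(1, 6, size=200)
models = train_bipartite_ensemble(X, ranks, L=5)
print(combined_score(models, X[:3]))
```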

    Score Fusion by Maximizing the Area under the ROC Curve

    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-02172-5_61

    Information fusion is currently a very active research topic aimed at improving the performance of biometric systems. This paper proposes a novel method for optimizing the parameters of a score fusion model based on maximizing an index related to the Area Under the ROC Curve. This approach has the convenience that the fusion parameters are learned without having to specify the client and impostor priors or the costs for the different errors. Empirical results on several datasets show the effectiveness of the proposed approach.

    Work supported by the Spanish projects DPI2006-15542-C04 and TIN2008-04571 and the Generalitat Valenciana - Consellería d’Educació under an FPI scholarship.

    Villegas Santamaría, M.; Paredes Palacios, R. (2009). Score Fusion by Maximizing the Area under the ROC Curve. In: Pattern Recognition and Image Analysis, 4th Iberian Conference, IbPRIA 2009, Póvoa de Varzim, Portugal, June 10-12, 2009, Proceedings. Springer, Heidelberg, pp. 473-480. https://doi.org/10.1007/978-3-642-02172-5_61
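    A hedged sketch of the underlying idea: the AUC equals the Wilcoxon-Mann-Whitney statistic, i.e. the fraction of client/impostor score pairs ranked correctly, and replacing the pairwise 0/1 indicator with a sigmoid gives a differentiable index that linear fusion weights can be fitted against by gradient ascent. The exact index and optimizer used in the paper may differ; everything below is an illustration, not the authors' code.

```python
import numpy as np

def smooth_auc(w, S_pos, S_neg, beta=10.0):
    """Sigmoid (Wilcoxon-Mann-Whitney) approximation of the AUC of the
    fused score f(s) = w @ s, averaged over all client/impostor pairs."""
    d = S_pos @ w[:, None] - (S_neg @ w)[None, :]     # pairwise score differences
    return np.mean(1.0 / (1.0 + np.exp(-beta * d)))   # smooth version of [d > 0]

def fit_fusion_weights(S_pos, S_neg, beta=10.0, lr=0.5, iters=200):
    """Gradient ascent on the smoothed AUC w.r.t. linear fusion weights w."""
    w = np.ones(S_pos.shape[1])
    for _ in range(iters):
        d = S_pos @ w[:, None] - (S_neg @ w)[None, :]
        sig = 1.0 / (1.0 + np.exp(-beta * d))
        g = beta * sig * (1.0 - sig)                  # derivative of the sigmoid
        # d(d)/dw = s_pos - s_neg for each (client, impostor) pair
        grad = (g.sum(axis=1) @ S_pos - g.sum(axis=0) @ S_neg) / g.size
        w += lr * grad
    return w

# Toy usage: two matchers, 100 client and 100 impostor score vectors.
rng = np.random.default_rng(0)
S_pos = rng.normal(1.0, 1.0, size=(100, 2))   # client (genuine) scores
S_neg = rng.normal(0.0, 1.0, size=(100, 2))   # impostor scores
w = fit_fusion_weights(S_pos, S_neg)
print(w, smooth_auc(w, S_pos, S_neg))
```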

    Ranking instances by maximizing the area under ROC curve

    In recent years, the problem of learning a real-valued function that induces a ranking over an instance space has gained importance in the machine learning literature. Here, we propose a supervised algorithm that learns a ranking function, called ranking instances by maximizing the area under the ROC curve (RIMARC). Since the area under the ROC curve (AUC) is a widely accepted performance measure for evaluating the quality of a ranking, the algorithm aims to maximize the AUC value directly. For a single categorical feature, we show the necessary and sufficient condition that any ranking function must satisfy to achieve the maximum AUC. We also sketch a method to discretize a continuous feature so as to reach the maximum AUC. RIMARC uses a heuristic to extend this maximization to all features of a data set. The ranking function learned by the RIMARC algorithm is in a human-readable form; therefore, it provides valuable information to domain experts for decision making. The performance of RIMARC is evaluated on many real-life data sets against different state-of-the-art algorithms. Evaluations on the AUC metric show that RIMARC achieves significantly better performance than comparable methods.
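    The categorical-feature condition can be made concrete: for a single categorical feature, a ranking function attains the maximum AUC exactly when it orders the categories by their fraction of positive instances. A minimal sketch of such a scorer (an illustration, not the authors' published code):

```python
from collections import defaultdict

def categorical_auc_scorer(values, labels):
    """Score each category by its positive rate P(y=1 | category);
    ranking instances by this score maximizes AUC for this feature."""
    pos, tot = defaultdict(int), defaultdict(int)
    for v, y in zip(values, labels):
        pos[v] += y
        tot[v] += 1
    return {v: pos[v] / tot[v] for v in tot}

print(categorical_auc_scorer(["a", "a", "b", "b", "c"], [1, 1, 0, 1, 0]))
# {'a': 1.0, 'b': 0.5, 'c': 0.0}
```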

    Large-scale Optimization of Partial AUC in a Range of False Positive Rates

    The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning. However, it summarizes the true positive rates (TPRs) over all false positive rates (FPRs) in the ROC space, which may include FPRs with no practical relevance in some applications. The partial AUC, as a generalization of the AUC, summarizes only the TPRs over a specific range of the FPRs and is thus a more suitable performance measure in many real-world situations. Although partial AUC optimization in a range of FPRs has been studied, existing algorithms are not scalable to big data and not applicable to deep learning. To address this challenge, we cast the problem into a non-smooth difference-of-convex (DC) program for any smooth predictive functions (e.g., deep neural networks), which allows us to develop an efficient approximated gradient descent method based on the Moreau envelope smoothing technique, inspired by recent advances in non-smooth DC optimization. To increase the efficiency of large data processing, we use an efficient stochastic block coordinate update in our algorithm. Our proposed algorithm can also be used to minimize the sum of ranked range loss, which likewise lacks efficient solvers. We establish a complexity of Õ(1/ε^6) for finding a nearly ε-critical solution. Finally, we numerically demonstrate the effectiveness of our proposed algorithms for both partial AUC maximization and sum of ranked range loss minimization.
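    For reference, the quantity being optimized can be evaluated directly. Below is a minimal, non-scalable sketch that computes the partial AUC over an FPR range by interpolating the empirical ROC curve; the paper's contribution is a scalable surrogate for optimizing this quantity, which this sketch does not implement.

```python
import numpy as np

def partial_auc(y_true, y_score, fpr_lo=0.0, fpr_hi=0.3):
    """Area under the ROC curve restricted to FPR in [fpr_lo, fpr_hi],
    normalized by the width of the range."""
    order = np.argsort(-np.asarray(y_score))
    y = np.asarray(y_true)[order]
    P, N = y.sum(), len(y) - y.sum()
    tpr = np.concatenate([[0.0], np.cumsum(y) / P])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / N])
    # Interpolate TPR on a fine FPR grid inside the range, then integrate.
    grid = np.linspace(fpr_lo, fpr_hi, 200)
    return np.trapz(np.interp(grid, fpr, tpr), grid) / (fpr_hi - fpr_lo)

# Toy usage: well-separated positive and negative score distributions.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1, 1, 500), rng.normal(0, 1, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(partial_auc(labels, scores, 0.05, 0.5))
```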

    Online hashing for fast similarity search

    In this thesis, the problem of online adaptive hashing for fast similarity search is studied. Similarity search is a central problem in many computer vision applications. The ever-growing size of available data collections and the increasing use of high-dimensional representations for describing data have increased the computational cost of performing similarity search, requiring search strategies that can explore such collections efficiently and effectively. One promising family of approaches is based on hashing, in which the goal is to map the data into the Hamming space, where fast search mechanisms exist, while preserving the original neighborhood structure of the data. We first present a novel online hashing algorithm in which the hash mapping is updated in an iterative manner with streaming data. Being online, our method adapts to variations in the data. Moreover, our formulation is orders of magnitude faster to train than state-of-the-art hashing solutions. Secondly, we propose an online supervised hashing framework in which the goal is to map data associated with similar labels to nearby binary representations. For this purpose, we utilize Error Correcting Output Codes (ECOCs) and consider an online boosting formulation for learning the hash mapping. Our formulation does not require any prior assumptions on the label space and is well suited for expanding datasets that acquire new labels. We also introduce a flexible framework that allows us to reduce hash table entry updates. This is critical, especially when frequent updates may occur as the hash table grows. Thirdly, we propose a novel mutual information measure to efficiently infer the quality of a hash mapping and its retrieval performance. This measure has lower complexity than standard retrieval metrics. With this measure, we first address a key challenge in online hashing that has often been ignored: the binary representations of the data must be recomputed to keep pace with updates to the hash mapping. Based on our novel mutual information measure, we propose an efficient quality measure for hash functions and use it to determine when to update the hash table. Next, we show that this mutual information criterion can be used as an objective for learning hash functions via gradient-based optimization. Experiments on image retrieval benchmarks confirm the effectiveness of our formulation, both in reducing hash table recomputations and in learning high-quality hash functions.
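    A rough sketch of the kind of mutual-information quality measure described above (a reconstruction under simple assumptions, not the thesis code): estimate the mutual information between the "is a true neighbor" indicator and the Hamming distance induced by the hash mapping, via histograms over the b+1 possible distances.

```python
import numpy as np

def hash_quality_mi(codes_q, codes_db, neighbor_mask):
    """Mutual information between neighbor membership and Hamming
    distance, estimated from histograms over distances 0..b."""
    b = codes_q.shape[1]
    # Hamming distances between every query and every database item.
    dist = (codes_q[:, None, :] != codes_db[None, :, :]).sum(axis=2)

    def hist(mask):
        h = np.bincount(dist[mask], minlength=b + 1).astype(float)
        return h / h.sum()

    def entropy(p):
        nz = p[p > 0]
        return -(nz * np.log2(nz)).sum()

    p_pos, p_neg = hist(neighbor_mask), hist(~neighbor_mask)
    w = neighbor_mask.mean()                      # fraction of neighbor pairs
    p_mix = w * p_pos + (1 - w) * p_neg
    # I(D; neighbor) = H(D) - H(D | neighbor)
    return entropy(p_mix) - (w * entropy(p_pos) + (1 - w) * entropy(p_neg))

# Toy usage: random 16-bit codes and a random 10%-dense neighbor relation.
rng = np.random.default_rng(0)
codes_q = rng.integers(0, 2, size=(50, 16))
codes_db = rng.integers(0, 2, size=(200, 16))
neighbors = rng.random((50, 200)) < 0.1
print(hash_quality_mi(codes_q, codes_db, neighbors))
```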

    CASP-DM: Context Aware Standard Process for Data Mining

    We propose an extension of the Cross Industry Standard Process for Data Mining (CRISP-DM) that addresses specific challenges of machine learning and data mining with respect to context handling and model reuse. This new, general context-aware process model is mapped onto the CRISP-DM reference model, proposing new or enhanced outputs.

    ROC curves for regression

    NOTICE: this is the author’s version of a work that was accepted for publication in Pattern Recognition. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Pattern Recognition, Volume 46, Issue 12, December 2013, Pages 3395-3411, DOI: 10.1016/j.patcog.2013.06.014

    Receiver Operating Characteristic (ROC) analysis is one of the most popular tools for the visual assessment and understanding of classifier performance. In this paper we present a new representation of regression models in the so-called regression ROC (RROC) space. The basic idea is to represent over-estimation against under-estimation. The curves are drawn by adjusting a shift, a constant that is added to (or subtracted from) the predictions and plays a role similar to a threshold in classification. From here, we develop the notions of optimal operating condition, convexity and dominance, and explore several evaluation metrics that can be shown graphically, such as the area over the RROC curve (AOC). In particular, we show a novel and significant result: the AOC is equivalent to the error variance. We illustrate the application of RROC curves to resource estimation, namely the estimation of software project effort.

    I would like to thank Peter Flach and Nicolas Lachiche for some very useful comments and corrections on earlier versions of this paper, especially the suggestion of drawing normalised curves (dividing the x-axis and y-axis by n). This work was supported by the MEC/MINECO projects CONSOLIDER-INGENIO CSD2007-00022 and TIN 2010-21062-C02-02, GVA project Prometeo/2008/051, the COST - European Cooperation in the field of Scientific and Technical Research IC0801 AT, and the REFRAME project granted by the European Coordinated Research on Long-term Challenges in Information and Communication Sciences & Technologies ERA-Net (CHIST-ERA), and funded by the respective national research councils and ministries.

    Hernández-Orallo, J. (2013). ROC curves for regression. Pattern Recognition 46(12), 3395-3411. https://doi.org/10.1016/j.patcog.2013.06.014
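    A small sketch of how an RROC curve can be traced, following the description above (a reconstruction, not the paper's code): for each shift s added to the predictions, the total over-estimation OVER(s) gives the x-coordinate and the total under-estimation UNDER(s) the y-coordinate.

```python
import numpy as np

def rroc_points(y_true, y_pred, shifts):
    """For each shift s, total over-estimation (x-axis) and total
    under-estimation (y-axis) of the shifted predictions y_pred + s."""
    pts = []
    for s in shifts:
        e = y_pred + s - y_true            # shifted residuals
        over = e[e > 0].sum()              # OVER(s): positive residual mass
        under = e[e < 0].sum()             # UNDER(s): negative residual mass
        pts.append((over, under))
    return np.array(pts)

# Toy usage: noisy predictions of a synthetic target.
rng = np.random.default_rng(0)
y_true = rng.normal(10, 3, size=100)
y_pred = y_true + rng.normal(0, 2, size=100)
print(rroc_points(y_true, y_pred, shifts=np.linspace(-5, 5, 11))[:3])
```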

    Multi-feature approach for writer-independent offline signature verification

    Some of the fundamental problems facing handwritten signature verification are the large number of users, the large number of features, the limited number of reference signatures for training, the high intra-personal variability of the signatures and the unavailability of forgeries as counterexamples. This research first presents a survey of offline signature verification techniques, focusing on the feature extraction and verification strategies. The goal is to present the most important advances, as well as the current challenges, in this field. Of particular interest are the techniques that allow for designing a signature verification system based on a limited amount of data. Next, a novel offline signature verification system is presented, based on multiple feature extraction techniques, dichotomy transformation and boosting feature selection. Using multiple feature extraction techniques increases the diversity of information extracted from the signature, thereby producing features that mitigate intra-personal variability, while dichotomy transformation ensures writer-independent classification, thus relieving the verification system from the burden of a large number of users. Finally, using boosting feature selection allows for a low-cost writer-independent verification system that selects features while learning. As such, the proposed system provides a practical framework to explore and learn from problems with numerous potential features. Comparison of simulation results with systems found in the literature confirms the viability of the proposed system, even when only a single reference signature is available. The proposed system provides an efficient solution to a wide range of problems (e.g., biometric authentication) with limited training samples, new training samples emerging during operations, numerous classes, and few or no counterexamples.
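    The dichotomy transformation mentioned above can be sketched simply (an illustration under common assumptions for writer-independent verification, not the author's implementation): pairs of feature vectors are mapped into a distance space where within-writer pairs form the "genuine" class and between-writer pairs the "forgery" class, so a single two-class classifier serves all writers.

```python
import numpy as np

def dichotomy_transform(x_query, x_reference):
    """Map a pair of signature feature vectors into distance space:
    the element-wise absolute difference."""
    return np.abs(np.asarray(x_query) - np.asarray(x_reference))

def make_pairs(features_by_writer):
    """Build within-writer (label 1, genuine) and between-writer
    (label 0, random forgery) training pairs in distance space."""
    X, y = [], []
    writers = list(features_by_writer)
    for w in writers:
        f = features_by_writer[w]
        for i in range(len(f)):
            for j in range(i + 1, len(f)):
                X.append(dichotomy_transform(f[i], f[j]))
                y.append(1)
        for v in writers:
            if v != w:
                X.append(dichotomy_transform(f[0], features_by_writer[v][0]))
                y.append(0)
    return np.array(X), np.array(y)

# Toy usage: 3 writers, 3 signatures each, 8-dimensional features.
rng = np.random.default_rng(0)
data = {w: [rng.normal(size=8) for _ in range(3)] for w in ["w1", "w2", "w3"]}
X, y = make_pairs(data)
print(X.shape, int(y.sum()), int(len(y) - y.sum()))
```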