39 research outputs found

    Combination of Multiple Bipartite Ranking for Web Content Quality Evaluation

    Web content quality estimation is crucial to various web content processing applications. Our previous work applied Bagging + C4.5, a combination of many point-wise ranking models, to achieve the best results in the ECML/PKDD Discovery Challenge 2010. In this paper, we combine multiple pair-wise bipartite ranking learners to solve the multi-partite ranking problem for web quality estimation. In the encoding stage, we present ternary encoding and binary coding, extending each rank value to $L - 1$ (where $L$ is the number of distinct ranking values). For decoding, we discuss how to combine the results of multiple bipartite ranking models using predefined weighting and adaptive weighting. Experiments on the ECML/PKDD 2010 Discovery Challenge datasets show that binary coding + predefined weighting yields the highest performance of all four combinations and, furthermore, outperforms the best results reported in the ECML/PKDD 2010 Discovery Challenge competition. Comment: 17 pages, 8 figures, 2 tables
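
The abstract does not spell out the coding scheme, so the following is only a sketch of one common way to reduce an $L$-level ranking label to $L - 1$ binary labels: label $k$ answers the question "is the rank above level $k$?". The function names `binary_encode` and `combine_scores`, the threshold convention, and the uniform weights in the usage note are all assumptions made for illustration, not the paper's method.

```python
from typing import List, Sequence


def binary_encode(rank: int, num_ranks: int) -> List[int]:
    """Encode a rank in 1..L as L-1 binary labels.

    Label k (for k = 1..L-1) is 1 when rank > k, else 0.
    This threshold convention is an assumption for illustration.
    """
    if not 1 <= rank <= num_ranks:
        raise ValueError("rank must lie in 1..num_ranks")
    return [1 if rank > k else 0 for k in range(1, num_ranks)]


def combine_scores(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Decode by a weighted sum of the L-1 bipartite ranker scores for one item."""
    if len(scores) != len(weights):
        raise ValueError("need exactly one weight per bipartite ranker")
    return sum(w * s for w, s in zip(weights, scores))
```

For example, with four rank levels, `binary_encode(3, 4)` produces three labels, one per bipartite sub-problem, and `combine_scores([0.5, 0.25], [1.0, 1.0])` stands in for a predefined (here uniform) weighting of two rankers' outputs.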

    Improving classifications for cardiac autonomic neuropathy using multi-level ensemble classifiers and feature selection based on random forest

    This paper is devoted to an empirical investigation of novel multi-level ensemble meta classifiers for the detection and monitoring of the progression of cardiac autonomic neuropathy (CAN) in diabetes patients. Our experiments relied on an extensive database and concentrated on ensembles of ensembles, or multi-level meta classifiers, for the classification of cardiac autonomic neuropathy progression. First, we carried out a thorough investigation comparing the performance of various base classifiers on several known sets of the most essential features in this database, and determined that Random Forest significantly and consistently outperforms all other base classifiers in this new application. Second, we used the feature selection and ranking implemented in Random Forest. It identified a new set of features, which turned out to be better than all other sets previously considered for this large and well-known database. Random Forest remained the best classifier for the new set of features too. Third, we investigated meta classifiers and new multi-level meta classifiers based on Random Forest, which improved its performance. The results show that the novel multi-level meta classifiers achieved further improvement and obtained outcomes that are significantly better than those previously published in the literature for cardiac autonomic neuropathy.

    Active Sampling of Pairs and Points for Large-scale Linear Bipartite Ranking

    Bipartite ranking is a fundamental ranking problem that learns to order relevant instances ahead of irrelevant ones. The pair-wise approach to bipartite ranking constructs a quadratic number of pairs to solve the problem, which is infeasible for large-scale data sets. The point-wise approach, albeit more efficient, often results in inferior performance. That is, it is difficult to conduct bipartite ranking accurately and efficiently at the same time. In this paper, we develop a novel active sampling scheme within the pair-wise approach to conduct bipartite ranking efficiently. The scheme is inspired by active learning and can reach competitive ranking performance while focusing only on a small subset of the many pairs during training. Moreover, we propose a general Combined Ranking and Classification (CRC) framework to conduct bipartite ranking accurately. The framework unifies the point-wise and pair-wise approaches and is simply based on the idea of treating each instance point as a pseudo-pair. Experiments on 14 real-world large-scale data sets demonstrate that the proposed algorithm of Active Sampling within CRC, when coupled with a linear Support Vector Machine, usually outperforms state-of-the-art point-wise and pair-wise ranking approaches in terms of both accuracy and efficiency. Comment: a shorter version was presented in ACML 201
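
To make the pair-wise versus point-wise contrast concrete, here is a minimal sketch of how the two kinds of training examples might be built for a linear model. The abstract says only that each instance point is treated as a pseudo-pair; representing a pair as a feature difference and a pseudo-pair as the point itself carrying its own label are assumptions made here, and `build_pairs` and `build_pseudo_pairs` are hypothetical names.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Tuple[float, ...], int]


def build_pairs(X: Sequence[Sequence[float]], y: Sequence[int]) -> List[Example]:
    """Pair-wise view: one difference vector per (positive, negative) pair, labeled +1.

    The number of examples is (#positives * #negatives), i.e. quadratic growth.
    """
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    return [(tuple(a - b for a, b in zip(p, n)), 1) for p in pos for n in neg]


def build_pseudo_pairs(X: Sequence[Sequence[float]], y: Sequence[int]) -> List[Example]:
    """Point-wise view: each point kept as is with its own label (assumed form of a pseudo-pair)."""
    return [(tuple(x), label) for x, label in zip(X, y)]
```

Both functions return examples of the same shape, so a single linear classifier could in principle be trained on their union; how the two parts are weighted or sampled is left open here.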

    An Empirical Comparison of Learning Algorithms for Nonparametric Scoring

    The TreeRank algorithm was recently proposed as a scoring method based on recursive partitioning of the input space. This tree induction algorithm builds orderings by recursively optimizing the Receiver Operating Characteristic (ROC) curve through a one-step optimization procedure called LeafRank. One of the aims of this paper is an in-depth analysis of the empirical performance of the variants of the TreeRank/LeafRank method. Numerical experiments based on both artificial and real data sets are provided. Further experiments using resampling and randomization, in the spirit of bagging and random forests, are developed, and we show how they increase both stability and accuracy in bipartite ranking. Moreover, an empirical comparison with other efficient scoring algorithms such as RankBoost and RankSVM is presented on UCI benchmark data sets.
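
Since the abstract centers on optimizing the ROC curve, a small worked example of the usual summary statistic may help: the area under the ROC curve (AUC) equals the fraction of positive-negative pairs that a scoring function orders correctly, with ties counted as half. The sketch below computes it by direct pair counting; it is not part of TreeRank or LeafRank, and the function name `auc` is chosen here for illustration.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 1 for positive and 0 for negative; ties count as half.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return correct / (len(pos) * len(neg))
```

With scores `[0.9, 0.8, 0.3, 0.1]` and labels `[1, 0, 1, 0]`, three of the four positive-negative pairs are ordered correctly, so the function returns 0.75.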

    Genome-wide Protein-chemical Interaction Prediction

    The analysis of protein-chemical reactions on a large scale is critical to understanding the complex interrelated mechanisms that govern biological life at the cellular level. Chemical proteomics is a new research area aimed at genome-wide screening of such chemical-protein interactions. Traditional approaches to such screening involve in vivo or in vitro experimentation, which, while becoming faster with the application of high-throughput screening technologies, remains costly and time-consuming compared to in silico methods. Early in silico methods depend on knowing 3D protein structures (docking) or knowing binding information for many chemicals (ligand-based approaches). Typical machine learning approaches follow a global classification approach in which a single predictive model is trained for an entire data set, but such an approach is unlikely to generalize well to the protein-chemical interaction space, considering its diversity and heterogeneous distribution. In response to the global approach, work on local models has recently emerged to improve generalization across the interaction space by training a series of independent models, each localized to predict a single interaction. This work examines current approaches to genome-wide protein-chemical interaction prediction and explores new computational methods based on modifications to the boosting framework for ensemble learning. The methods are described and compared to several competing classification methods. Genome-wide chemical-protein interaction data sets are acquired from publicly available resources, and a series of experimental studies are performed to compare the performance of each method under a variety of conditions.

    Integer optimization methods for machine learning

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Sloan School of Management, Operations Research Center, 2012. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (p. 129-137). In this thesis, we propose new mixed integer optimization (MIO) methods to address problems in machine learning. The first part develops methods for supervised bipartite ranking, which arises in prioritization tasks in diverse domains such as information retrieval, recommender systems, natural language processing, bioinformatics, and preventative maintenance. The primary advantage of using MIO for ranking is that it allows for direct optimization of ranking quality measures, as opposed to current state-of-the-art algorithms that use heuristic loss functions. We demonstrate using a number of datasets that our approach can outperform other ranking methods. The second part of the thesis focuses on reverse-engineering ranking models. This is an application of a more general ranking problem than the bipartite case. Quality rankings affect business for many organizations, and knowing the ranking models would allow these organizations to better understand the standards by which their products are judged and help them to create higher quality products. We introduce an MIO method for reverse-engineering such models and demonstrate its performance in a case study with real data from a major ratings company. We also devise an approach to find the most cost-effective way to increase the rank of a certain product. In the final part of the thesis, we develop MIO methods to first generate association rules and then use the rules to build an interpretable classifier in the form of a decision list, which is an ordered list of rules. These are both combinatorially challenging problems because even a small dataset may yield a large number of rules and a small set of rules may correspond to many different orderings. We show how to use MIO to mine useful rules, as well as to construct a classifier from them. We present results in terms of both classification accuracy and interpretability for a variety of datasets. By Allison An Chang. Ph.D.

    Il modello Bradley-Terry per l’analisi delle partite della Serie A italiana di calcio [The Bradley-Terry model for the analysis of Italian Serie A soccer games]

    We live in the era of so-called Big Data, in which, thanks to interconnectivity, a large flow of information can be obtained from every activity. Soccer is no exception: for the past couple of years, soccer clubs have relied on analysis systems to devise game tactics and to scout emerging players. In modern soccer, therefore, numerous statistics, such as ball possession and the number of shots taken by a team, are collected during a game. This leads to the question: since we have a large amount of data on team performances in their games, can we identify which statistics significantly influence the sporting success or failure of individual teams? This is the starting point of the thesis. The objective is to provide an analysis that answers this question using Data Mining techniques, specifically a paired-comparison model for soccer games that takes the collected statistics into account. The model chosen for the analysis is the Bradley-Terry model with its extensions. The Bradley-Terry models are then used to predict the outcome of the games and compared with the predictions of the main bookmakers and of the following Machine Learning algorithms: K-Nearest-Neighbors (K-NN), Support Vector Machine (SVM), Decision Tree, Random Forest, and AdaBoost. Finally, Decision Tree and Random Forest are studied further to determine which statistics are important. The study considers data on the Italian Serie A games of the 2021/2022 season.
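
For readers unfamiliar with the Bradley-Terry model, the basic form assigns each team $i$ a positive strength $p_i$ and sets the probability that $i$ beats $j$ to $p_i / (p_i + p_j)$. The sketch below shows that formula together with a standard minorization-maximization (MM) fitting loop. It covers only the basic model, with no draws and no match statistics as covariates, so it does not reproduce the extensions the thesis uses; the function names and the toy win counts in the usage note are assumptions made for illustration.

```python
from typing import List, Sequence


def bt_win_probability(strength_i: float, strength_j: float) -> float:
    """Basic Bradley-Terry probability that team i beats team j."""
    return strength_i / (strength_i + strength_j)


def fit_bradley_terry(wins: Sequence[Sequence[int]], iterations: int = 200) -> List[float]:
    """Estimate strengths by MM updates; wins[i][j] is how often i beat j.

    Assumes every team has at least one win and one loss overall, so the
    estimates stay finite and positive.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iterations):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        if scale == 0:
            break  # no wins recorded at all; keep the previous estimates
        # Rescale so the strengths average to 1 (the model is scale-invariant).
        p = [value * n / scale for value in new_p]
    return p
```

As a usage example, `fit_bradley_terry([[0, 3, 4], [1, 0, 3], [0, 1, 0]])` fits three hypothetical teams, and passing two of the fitted strengths to `bt_win_probability` gives a predicted win probability for that pairing.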