Combination of Multiple Bipartite Ranking for Web Content Quality Evaluation
Web content quality estimation is crucial to various web content processing
applications. Our previous work applied Bagging + C4.5, a combination of many
point-wise ranking models, to achieve the best results on the ECML/PKDD
Discovery Challenge 2010. In this paper, we combine multiple pair-wise
bipartite ranking learners to solve the multi-partite ranking problem for
web quality estimation. In the encoding stage, we present the ternary encoding and
the binary coding, extending each rank value to (L is the number of
different ranking values). For the decoding, we discuss the combination of
multiple ranking results from multiple bipartite ranking models with
predefined weighting and adaptive weighting. The experiments on the ECML/PKDD
2010 Discovery Challenge datasets show that binary coding +
predefined weighting yields the highest performance among all four
combinations and, furthermore, that it is better than the best results reported in
the ECML/PKDD 2010 Discovery Challenge competition. Comment: 17 pages, 8 figures, 2 tables
Improving classifications for cardiac autonomic neuropathy using multi-level ensemble classifiers and feature selection based on random forest
This paper is devoted to an empirical investigation of novel multi-level ensemble meta classifiers for the detection and monitoring of the progression of cardiac autonomic neuropathy (CAN) in diabetes patients. Our experiments relied on an extensive database and concentrated on ensembles of ensembles, or multi-level meta classifiers, for the classification of cardiac autonomic neuropathy progression. First, we carried out a thorough investigation comparing the performance of various base classifiers for several known sets of the most essential features in this database and determined that Random Forest significantly and consistently outperforms all other base classifiers in this new application. Second, we used the feature selection and ranking implemented in Random Forest. It identified a new set of features, which turned out to be better than all other sets previously considered for this large and well-known database. Random Forest remained the best classifier for the new set of features too. Third, we investigated meta classifiers and new multi-level meta classifiers based on Random Forest, which improved its performance. The results show that the novel multi-level meta classifiers achieved further improvement and obtained outcomes that are significantly better than those previously published in the literature for cardiac autonomic neuropathy.
Active Sampling of Pairs and Points for Large-scale Linear Bipartite Ranking
Bipartite ranking is a fundamental ranking problem that learns to order
relevant instances ahead of irrelevant ones. The pair-wise approach for
bipartite ranking constructs a quadratic number of pairs to solve the problem,
which is infeasible for large-scale data sets. The point-wise approach, albeit
more efficient, often results in inferior performance. That is, it is difficult
to conduct bipartite ranking accurately and efficiently at the same time. In
this paper, we develop a novel active sampling scheme within the pair-wise
approach to conduct bipartite ranking efficiently. The scheme is inspired by
active learning and can reach a competitive ranking performance while focusing
only on a small subset of the many pairs during training. Moreover, we propose
a general Combined Ranking and Classification (CRC) framework to accurately
conduct bipartite ranking. The framework unifies point-wise and pair-wise
approaches and is simply based on the idea of treating each instance point as a
pseudo-pair. Experiments on 14 real-world large-scale data sets demonstrate that
the proposed algorithm of Active Sampling within CRC, when coupled with a
linear Support Vector Machine, usually outperforms state-of-the-art point-wise
and pair-wise ranking approaches in terms of both accuracy and efficiency. Comment: a shorter version was presented in ACML 201
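To make the pair-wise cost concrete, the following is a minimal sketch, not the authors' algorithm. It counts the positive-negative pairs a pair-wise method would have to consider and computes the fraction of correctly ordered pairs (the AUC) by brute force. The function names `count_pairs` and `pairwise_auc` are hypothetical, and labels are assumed to be 1 for relevant and 0 for irrelevant instances.

```python
def count_pairs(labels):
    """Number of (relevant, irrelevant) pairs; grows as n_pos * n_neg."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return n_pos * n_neg


def pairwise_auc(scores, labels):
    """Fraction of (relevant, irrelevant) pairs ranked in the right order.

    Ties count as half-correct. This brute-force loop visits every pair,
    which is exactly the quadratic cost that sampling schemes try to avoid.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one relevant and one irrelevant instance")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [1, 1, 0, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.3, 0.1]
    print(count_pairs(labels))           # 2 relevant x 3 irrelevant pairs
    print(pairwise_auc(scores, labels))  # share of correctly ordered pairs
```

With a million relevant and a million irrelevant instances, `count_pairs` would already report 10^12 pairs, which illustrates why training on only a small subset of pairs is attractive.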
An Empirical Comparison of Learning Algorithms for Nonparametric Scoring
The TreeRank algorithm was recently proposed as a scoring method based on recursive partitioning of the input space. This tree induction algorithm builds orderings by recursively optimizing the Receiver Operating Characteristic (ROC) curve through a one-step optimization procedure called LeafRank. One of the aims of this paper is an in-depth analysis of the empirical performance of the variants of the TreeRank/LeafRank method. Numerical experiments based on both artificial and real data sets are provided. Further experiments using resampling and randomization, in the spirit of bagging and random forests, are developed, and we show how they increase both stability and accuracy in bipartite ranking. Moreover, an empirical comparison with other efficient scoring algorithms such as RankBoost and RankSVM is presented on UCI benchmark data sets.
Genome-wide Protein-chemical Interaction Prediction
The analysis of protein-chemical reactions on a large scale is critical to understanding the complex interrelated mechanisms that govern biological life at the cellular level. Chemical proteomics is a new research area aimed at genome-wide screening of such chemical-protein interactions. Traditional approaches to such screening involve in vivo or in vitro experimentation, which, while becoming faster with the application of high-throughput screening technologies, remains costly and time-consuming compared to in silico methods. Early in silico methods are dependent on knowing 3D protein structures (docking) or knowing binding information for many chemicals (ligand-based approaches). Typical machine learning approaches follow a global classification approach where a single predictive model is trained for an entire data set, but such an approach is unlikely to generalize well to the protein-chemical interaction space considering its diversity and heterogeneous distribution. In response to the global approach, work on local models has recently emerged to improve generalization across the interaction space by training a series of independent models, each localized to predict a single interaction. This work examines current approaches to genome-wide protein-chemical interaction prediction and explores new computational methods based on modifications to the boosting framework for ensemble learning. The methods are described and compared to several competing classification methods. Genome-wide chemical-protein interaction data sets are acquired from publicly available resources, and a series of experimental studies are performed in order to compare the performance of each method under a variety of conditions.
Accurate Prediction Methods on Biomolecular Data
With the recent advancements in sequencing technologies, molecular biologists are producing ever-increasing amounts of biomolecular data. Extracting useful information from these massive data sets requires efficient and effective data mining and machine learning methods. In this dissertation, we explore the use of supervised machine learning (ML) to solve some challenging classification problems in molecular biology. First, we devise an ML model for classifying cancer types from very sparse somatic point mutation data. Accumulation of mutations and epigenetic modifications in somatic cells results in various cancers. For this purpose, we propose a method called mClass for efficient feature (gene) ranking that uses clustering, normalized mutual information and logistic regression. We show that somatic mutation data has sufficient discriminative power for cancer type classification. Next, we address the problem of gene essentiality prediction in microbes. Essential genes are significant to identify since their function is vital for the survival of the organism. Our proposed deep learning architecture, called DeeplyEssential, exclusively uses features extracted from the primary sequence of genes and their corresponding proteins, to maximize the utility and practicality of the tool. DeeplyEssential achieves state-of-the-art performance over previously proposed methods, and we expose and study a hidden performance bias that affected previous models. Finally, we consider the problem of predicting the enhancer regions in the human genome from chromatin data. Enhancers contribute to the transcription of target genes. We propose a convolutional neural network framework named Epi2En that takes advantage of epigenetic ChIP-seq data. Epi2En's classification performance is very strong not only in cross-validation experiments, but also when testing across different cell lines.
Integer optimization methods for machine learning
Thesis (Ph. D.)--Massachusetts Institute of Technology, Sloan School of Management, Operations Research Center, 2012. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (p. 129-137). In this thesis, we propose new mixed integer optimization (MIO) methods to address problems in machine learning. The first part develops methods for supervised bipartite ranking, which arises in prioritization tasks in diverse domains such as information retrieval, recommender systems, natural language processing, bioinformatics, and preventative maintenance. The primary advantage of using MIO for ranking is that it allows for direct optimization of ranking quality measures, as opposed to current state-of-the-art algorithms that use heuristic loss functions. We demonstrate using a number of datasets that our approach can outperform other ranking methods. The second part of the thesis focuses on reverse-engineering ranking models. This is an application of a more general ranking problem than the bipartite case. Quality rankings affect business for many organizations, and knowing the ranking models would allow these organizations to better understand the standards by which their products are judged and help them to create higher quality products. We introduce an MIO method for reverse-engineering such models and demonstrate its performance in a case study with real data from a major ratings company. We also devise an approach to find the most cost-effective way to increase the rank of a certain product. In the final part of the thesis, we develop MIO methods to first generate association rules and then use the rules to build an interpretable classifier in the form of a decision list, which is an ordered list of rules.
These are both combinatorially challenging problems because even a small dataset may yield a large number of rules and a small set of rules may correspond to many different orderings. We show how to use MIO to mine useful rules, as well as to construct a classifier from them. We present results in terms of both classification accuracy and interpretability for a variety of datasets. By Allison An Chang. Ph.D.
The Bradley-Terry Model for the Analysis of Italian Serie A Soccer Matches
We live in the era of so-called Big Data where, thanks to interconnectivity, a large flow of information can be obtained from every activity. This also applies to soccer, where for the past couple of years clubs have relied on analysis systems to devise playing tactics and to scout emerging players. In modern soccer, therefore, numerous statistics, such as ball possession and the number of shots taken by a team, are collected during a game.
This leads to the question: since we have a large amount of data on team performances in their games, can we identify which statistics significantly influence the success or failure of individual teams in sports?
This is where the thesis comes in. The objective is to provide an analysis that answers this question using Data Mining techniques, specifically a paired comparison model for soccer games that takes the collected statistics into account. The model chosen for the analysis is the Bradley-Terry model with its extensions.
Subsequently, the Bradley-Terry models will be used to predict the outcome of the games and compared with the predictions of the main bookmakers and the Machine Learning algorithms: K-Nearest-Neighbors (K-NN), Support Vector Machine (SVM), Decision Tree, Random Forest, and AdaBoost. Finally, Decision Tree and Random Forest will be further studied to determine which statistics are important.
The study will consider data relating to the Italian Serie A games of the 2021/2022 season.
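As a rough illustration of the paired comparison idea, the basic Bradley-Terry model assigns each team i a positive strength p_i and sets the probability that i beats j to p_i / (p_i + p_j). The sketch below fits these strengths with the standard minorization-maximization update. It is not the thesis's model: it assumes plain win/loss outcomes, ignores draws and match statistics (which the extensions mentioned above would be needed for), and the names `bt_prob` and `fit_bradley_terry` are hypothetical.

```python
def bt_prob(p_i, p_j):
    """Bradley-Terry probability that a team of strength p_i beats one of strength p_j."""
    return p_i / (p_i + p_j)


def fit_bradley_terry(wins, n_iter=200):
    """Estimate strengths from a win matrix, where wins[i][j] = times i beat j.

    Uses the classic MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j),
    followed by normalization so the strengths sum to 1.
    Assumption: every team has at least one win; otherwise its strength
    collapses to zero and the estimate is degenerate.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i and wins[i][j] + wins[j][i] > 0  # skip pairs that never met
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p


if __name__ == "__main__":
    # Toy data (made up for illustration): team 0 beat team 1 three times, lost once.
    strengths = fit_bradley_terry([[0, 3], [1, 0]])
    print(strengths)
    print(bt_prob(strengths[0], strengths[1]))
```

In the two-team toy case the fitted win probability simply reproduces the observed win rate; with more teams, the strengths balance results across all pairings, which is what makes the model usable for predicting matches between teams.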