A Comparative Study on Feature Selection for a Risk Prediction Model for Colorectal Cancer
Background and objective: Risk prediction models aim at identifying people at higher risk of developing
a target disease. Feature selection is particularly important to improve the prediction model performance,
avoid overfitting, and identify the leading cancer risk (and protective) factors. Assessing the stability of feature selection/ranking algorithms becomes an important issue when the aim is to analyze the
features with the most predictive power.
Methods: This work is focused on colorectal cancer, assessing several feature ranking algorithms in terms
of performance for a set of risk prediction models (Neural Networks, Support Vector Machines (SVM),
Logistic Regression, k-Nearest Neighbors and Boosted Trees). Additionally, their robustness is evaluated
following a conventional approach with scalar stability metrics and a visual approach proposed in this
work to study both similarity among feature ranking techniques as well as their individual stability. A
comparative analysis is carried out between the most relevant features found in this study and features provided by the experts according to the state-of-the-art knowledge.
Results: The two best performance results in terms of Area Under the ROC Curve (AUC) are achieved with
an SVM classifier using the top-41 features selected by the SVM wrapper approach (AUC=0.693) and Logistic Regression with the top-40 features selected by the Pearson correlation coefficient (AUC=0.689). Experiments showed that
performing feature selection contributes to classification performance with a 3.9% and 1.9% improvement
in AUC for the SVM and Logistic Regression classifier, respectively, with respect to the results using the
full feature set. The visual approach proposed in this work shows that the Neural Network-based
wrapper ranking is the most unstable while the Random Forest is the most stable.
Conclusions: This study demonstrates that stability and model performance should be studied jointly:
Random Forest turned out to be the most stable algorithm but was outperformed by others in terms of
model performance, whereas the SVM wrapper and the Pearson correlation coefficient are moderately stable
while achieving good model performance.
© 2019 Elsevier B.V. All rights reserved.
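The abstract above mentions scalar stability metrics for feature ranking algorithms without naming them. As an illustrative sketch only, one common choice is the mean pairwise Jaccard similarity between the top-k feature sets obtained on different resamples of the data. The function names and the toy rankings below are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two collections of feature names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def topk_stability(rankings, k):
    """Mean pairwise Jaccard similarity of the top-k features across rankings.

    `rankings` is a list of feature-name lists, each ordered from most to
    least relevant (for example, one ranking per bootstrap resample).
    """
    tops = [ranking[:k] for ranking in rankings]
    pairs = list(combinations(tops, 2))
    if not pairs:
        raise ValueError("need at least two rankings to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical rankings produced by one algorithm on three resamples.
rankings = [
    ["age", "bmi", "smoking", "alcohol"],
    ["age", "smoking", "bmi", "fiber"],
    ["bmi", "age", "alcohol", "smoking"],
]
stability = topk_stability(rankings, k=2)
print(round(stability, 4))
```

A value of 1.0 means every resample selects the same top-k set; values near 0 indicate that the selected features change almost entirely between resamples.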
Are screening methods useful in feature selection? An empirical study
Filter or screening methods are often used as a preprocessing step for
reducing the number of variables used by a learning algorithm in obtaining a
classification or regression model. While there are many such filter methods,
there is a need for an objective evaluation of these methods. Such an
evaluation is needed to compare them with each other and also to answer whether
they are at all useful, or a learning algorithm could do a better job without
them. For this purpose, many popular screening methods are partnered in this
paper with three regression learners and five classification learners and
evaluated on ten real datasets to obtain accuracy criteria such as R-square and
area under the ROC curve (AUC). The obtained results are compared through curve
plots and comparison tables in order to find out whether screening methods help
improve the performance of learning algorithms and how they fare with each
other. Our findings revealed that the screening methods were useful in
improving the prediction of the best learner on two regression and two
classification datasets out of the ten datasets evaluated.
Comment: 29 pages, 4 figures, 21 table
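To make the idea of a filter (screening) step concrete, here is a minimal sketch that scores each feature by its absolute Pearson correlation with the target and keeps the top k before any learner is trained. This is just one possible screening criterion, chosen for illustration; the feature names, data, and the `screen` helper are hypothetical and do not come from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def screen(features, target, k):
    """Return the names of the k features most correlated with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: three candidate features and a binary target.
features = {
    "f_signal": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f_inverse": [5.0, 4.0, 3.0, 2.5, 1.0],
    "f_noise": [2.0, 2.0, 3.0, 2.0, 2.0],
}
target = [0, 0, 1, 1, 1]
selected = screen(features, target, k=2)
print(selected)
```

The learner would then be trained on the `selected` columns only, and its accuracy compared against the same learner trained on all features.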
Entity Personalized Talent Search Models with Tree Interaction Features
Talent Search systems aim to recommend potential candidates who are a good
match to the hiring needs of a recruiter expressed in terms of the recruiter's
search query or job posting. Past work in this domain has focused on linear and
nonlinear models that lack user-level preference personalization due to
being trained only with globally collected recruiter activity data. In this
paper, we propose an entity-personalized Talent Search model which utilizes a
combination of generalized linear mixed (GLMix) models and gradient boosted
decision tree (GBDT) models, and provides personalized talent recommendations
using nonlinear tree interaction features generated by the GBDT. We also
present the offline and online system architecture for the productionization of
this hybrid model approach in our Talent Search systems. Finally, we provide
offline and online experiment results benchmarking our entity-personalized
model with tree interaction features, which demonstrate significant
improvements in our precision metrics compared to globally trained
non-personalized models.
Comment: This paper has been accepted for publication at ACM WWW 201
Ranking News-Quality Multimedia
News editors need to find the photos that best illustrate a news piece and
fulfill news-media quality standards, while being pressed to also find the most
recent photos of live events. Recently, it became common to use social-media
content in the context of news media for its unique value in terms of immediacy
and quality. Consequently, the number of images to be considered and filtered
through is now too large to be handled by a person. To aid the news editor in
this process, we propose a framework designed to deliver high-quality,
news-press type photos to the user. The framework, composed of two parts, is
based on a ranking algorithm tuned to rank professional media highly and a
visual SPAM detection module designed to filter out low-quality media. The core
ranking algorithm leverages aesthetic, social and deep-learning semantic
features. Evaluation showed that the proposed framework is effective at finding
high-quality photos (true-positive rate) achieving a retrieval MAP of 64.5% and
a classification precision of 70%.
Comment: To appear in ICMR'1
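For readers unfamiliar with the retrieval MAP figure quoted above, the following sketch shows how mean average precision is typically computed from ranked result lists. It assumes binary relevance labels and that every relevant item appears somewhere in each ranked list; the example lists are hypothetical and unrelated to the paper's data.

```python
def average_precision(relevance):
    """Average precision for one ranked list of binary relevance labels.

    Assumes all relevant items are present in the list, so the sum of
    precision-at-hit values is divided by the number of hits.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs):
    """Mean of the per-query average precision values."""
    return sum(average_precision(run) for run in runs) / len(runs)


# Hypothetical ranked results for two queries (1 = relevant, 0 = not).
runs = [
    [1, 0, 1, 0],
    [0, 1, 0, 0],
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```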
Factorizing LambdaMART for cold start recommendations
Recommendation systems often rely on point-wise loss metrics such as the mean
squared error. However, in real recommendation settings only a few items are
presented to a user. This observation has recently encouraged the use of
rank-based metrics. LambdaMART is the state-of-the-art algorithm in learning to
rank which relies on such a metric. Despite its success, it does not have a
principled regularization mechanism and relies instead on empirical approaches to control
model complexity, which leaves it prone to overfitting.
Motivated by the fact that very often the users' and items' descriptions as
well as the preference behavior can be well summarized by a small number of
hidden factors, we propose a novel algorithm, LambdaMART Matrix Factorization
(LambdaMART-MF), that learns a low rank latent representation of users and
items using gradient boosted trees. The algorithm factorizes lambdaMART by
defining relevance scores as the inner product of the learned representations
of the users and items. The low rank is essentially a model complexity
controller; on top of it we propose additional regularizers to constrain the
learned latent representations that reflect the user and item manifolds as
these are defined by their original feature based descriptors and the
preference behavior. Finally we also propose to use a weighted variant of NDCG
to reduce the penalty for similar items with large rating discrepancy.
We experiment on two very different recommendation datasets, meta-mining and
movies-users, and evaluate the performance of LambdaMART-MF, with and without
regularization, in the cold start setting as well as in the simpler matrix
completion setting. In both cases it significantly outperforms
current state-of-the-art algorithms.
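The two ingredients named in this abstract, relevance scores defined as inner products of latent user and item vectors and an NDCG-style ranking metric, can be sketched as follows. This is only an illustration: it uses the standard NDCG with exponential gains rather than the weighted variant the authors propose, and the latent vectors are fixed by hand instead of being learned with gradient boosted trees. All names and values are hypothetical.

```python
import math


def relevance_score(user_vec, item_vec):
    """Relevance as the inner product of user and item latent vectors."""
    return sum(u * v for u, v in zip(user_vec, item_vec))


def dcg(gains):
    """Discounted cumulative gain with exponential gain (2^g - 1)."""
    return sum((2 ** g - 1) / math.log2(pos + 2) for pos, g in enumerate(gains))


def ndcg(ratings_in_ranked_order):
    """Standard NDCG: DCG of the ranking divided by the ideal DCG."""
    ideal = dcg(sorted(ratings_in_ranked_order, reverse=True))
    return dcg(ratings_in_ranked_order) / ideal if ideal > 0 else 0.0


# Hypothetical 2-dimensional latent representations and true ratings.
user = [0.5, 1.0]
items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
true_ratings = {"a": 1, "b": 3, "c": 2}

# Rank items by predicted relevance, then score the ranking with NDCG.
ranked = sorted(items, key=lambda name: relevance_score(user, items[name]), reverse=True)
quality = ndcg([true_ratings[name] for name in ranked])
print(ranked, round(quality, 4))
```

Because the latent dimension (2 here) is much smaller than the number of users and items in a real dataset, the low rank itself limits model complexity, which is the role the abstract assigns to it.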
Scalable Privacy-Compliant Virality Prediction on Twitter
The digital town hall of Twitter has become a preferred medium of communication
for individuals and organizations across the globe. Some of them reach
audiences of millions, while others struggle to get noticed. Given the impact
of social media, the question remains more relevant than ever: how to model the
dynamics of attention on Twitter. Researchers around the world turn to machine
learning to predict the most influential tweets and authors, navigating the
volume, velocity, and variety of social big data, with many compromises. In
this paper, we revisit content popularity prediction on Twitter. We argue that
strict alignment of data acquisition, storage and analysis algorithms is
necessary to avoid the common trade-offs between scalability, accuracy and
privacy compliance. We propose a new framework for the rapid acquisition of
large-scale datasets, high accuracy supervisory signal and multilanguage
sentiment prediction while respecting every privacy request applicable. We then
apply a novel gradient boosting framework to achieve state-of-the-art results
in virality ranking, even before including a tweet's visual or propagation
features. Our Gradient Boosted Regression Tree is the first to offer
explainable, strong ranking performance on benchmark datasets. Since the
analysis focused on features available early, the model is immediately
applicable to incoming tweets in 18 languages.
Comment: AffCon@AAAI-19 Best Paper Award; Presented at AAAI-19 W1: Affective
Content Analysi