    An Empirical Analysis on Point-wise Machine Learning Techniques using Regression Trees for Web-search Ranking

    Learning how to rank a set of objects relative to a user-defined query has received much interest in the machine learning community during the past decade. In fact, two recent competitions hosted by internationally prominent search companies have encouraged research on ranking website documents. Recent literature on learning to rank has focused on three approaches: point-wise, pair-wise, and list-wise. Many different kinds of models, including boosted decision trees, neural networks, and SVMs, have proven successful in the field. This thesis surveys traditional point-wise techniques that use regression trees for web-search ranking. It contains empirical studies of Random Forests and Gradient Boosted Decision Trees, with novel augmentations to them, on real-world data sets. We also analyze how these point-wise techniques perform in two newer areas of web-search ranking research: transfer learning and feature-cost-aware models.
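
    As a rough illustration of the point-wise idea described above, the sketch below scores each document independently with an averaged ensemble of regression stumps and then sorts by predicted relevance. The stumps are hand-written stand-ins for trained regression trees, and every name in it (`make_stump`, `forest_score`, `rank_pointwise`) is hypothetical; it is not the thesis's implementation.

```python
from typing import Callable, List, Sequence, Tuple

Features = Sequence[float]
Stump = Callable[[Features], float]


def make_stump(feature_index: int, threshold: float, low: float, high: float) -> Stump:
    """Build a depth-1 regression tree: one threshold test on one feature."""
    def stump(x: Features) -> float:
        return high if x[feature_index] > threshold else low
    return stump


def forest_score(stumps: List[Stump], x: Features) -> float:
    """Average the tree outputs, as a random-forest-style regressor would."""
    return sum(s(x) for s in stumps) / len(stumps)


def rank_pointwise(stumps: List[Stump], docs: List[Tuple[str, Features]]) -> List[str]:
    """Score every document on its own, then sort by predicted relevance."""
    scored = [(forest_score(stumps, features), doc_id) for doc_id, features in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


# Toy data (assumed, not from the thesis): two features per document.
stumps = [make_stump(0, 0.5, 0.0, 2.0), make_stump(1, 0.3, 1.0, 3.0)]
docs = [("a", [0.9, 0.1]), ("b", [0.2, 0.1]), ("c", [0.7, 0.6])]
ranking = rank_pointwise(stumps, docs)
```

    The ranking step never compares two documents directly, which is what separates a point-wise method from the pair-wise and list-wise approaches.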

    Optimized data collection and analysis process for studying solar-thermal desalination by machine learning

    An effective interdisciplinary study between machine learning and solar-thermal desalination requires sufficiently large and well-analyzed experimental datasets. This study develops a modified dataset collection and analysis process for studying solar-thermal desalination by machine learning. Based on the optimized water condensation and collection process, the proposed experimental method collects over one thousand datasets, which is ten times more than the average number of datasets in previous works, by accelerating data collection and reducing the time required by 83.3%. The effects of dataset features are then investigated using three algorithms: artificial neural networks, multiple linear regression, and random forests. The investigation focuses on the effects of dataset size and range on prediction accuracy, factor importance ranking, and the model's generalization ability. The results demonstrate that a larger dataset can significantly improve prediction accuracy when using artificial neural networks and random forests. Additionally, the study highlights the significant impact of dataset size and range on ranking the importance of influence factors. Furthermore, the study reveals that the extrapolation data range significantly affects the extrapolation accuracy of artificial neural networks. Based on these results, massive dataset collection and analysis of dataset feature effects are important steps in an effective and consistent machine learning process flow for solar-thermal desalination, and they can promote machine learning as a more general tool in this field.

    Interaction Forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects

    Although interaction effects can be exploited to improve predictions and allow for valuable insights into covariate interplay, they are given little attention in analysis. We introduce interaction forests, which are a variant of random forests for categorical, continuous, and survival outcomes, explicitly considering quantitative and qualitative interaction effects in bivariable splits performed by the trees constituting the forests. The new effect importance measure (EIM) associated with interaction forests allows ranking of the covariate pairs with respect to their interaction effects' importance for prediction. Using EIM, separate importance value lists for univariable effects, quantitative interaction effects, and qualitative interaction effects are obtained. In the spirit of interpretable machine learning, the bivariable split types of interaction forests target readily interpretable interaction effects that are easy to communicate. To learn about the nature of the interplay between identified interacting covariate pairs, it is convenient to visualise their estimated bivariable influence. We provide functions that perform this task in the R package diversityForest, which implements interaction forests. In a large-scale empirical study using 220 data sets, interaction forests tended to deliver better predictions than conventional random forests and competing random forest variants that use multivariable splitting. In a simulation study, EIM delivered considerably better rankings for the relevant quantitative and qualitative interaction effects than competing approaches. These results indicate that interaction forests are suitable tools for the challenging task of identifying and making use of readily interpretable interaction effects in predictive modelling.

    Pairwise meta-rules for better meta-learning-based algorithm ranking

    In this paper, we present a novel meta-feature generation method in the context of meta-learning, which is based on rules that compare the performance of individual base learners in a one-against-one manner. In addition to these new meta-features, we also introduce a new meta-learner called Approximate Ranking Tree Forests (ART Forests) that performs very competitively when compared with several state-of-the-art meta-learners. Our experimental results are based on a large collection of datasets and show that the proposed techniques can significantly improve the overall performance of meta-learning for algorithm ranking. A key point in our approach is that each performance figure of any base learner for any specific dataset is generated by optimising the parameters of the base learner separately for each dataset.

    Random Forests: some methodological insights

    This paper examines, from an experimental perspective, random forests, the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001. It first aims to confirm known but sparse advice for using random forests and to propose some complementary remarks for both standard problems and high-dimensional ones, for which the number of variables hugely exceeds the sample size. The main contribution of this paper is twofold: to provide some insights into the behavior of the variable importance index based on random forests and, in addition, to investigate two classical issues of variable selection. The first is to find important variables for interpretation; the second is more restrictive and tries to design a good prediction model. The strategy involves ranking the explanatory variables using the random forests importance score and then applying a stepwise ascending variable introduction strategy.
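
    The two-step strategy described above (rank variables by importance, then introduce them one at a time) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the importance scores are taken as given, `evaluate_error` is a hypothetical callback standing in for something like a forest's out-of-bag error on a variable subset, and the `min_gain` acceptance threshold is an assumed stopping rule.

```python
from typing import Callable, Dict, List


def stepwise_ascending_selection(
    importance: Dict[str, float],
    evaluate_error: Callable[[List[str]], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Add variables in decreasing order of importance, keeping each one
    only if it lowers the error by more than `min_gain` (assumed rule)."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    selected: List[str] = []
    best_error = float("inf")
    for name in ranked:
        candidate = selected + [name]
        error = evaluate_error(candidate)
        if best_error - error > min_gain:
            selected = candidate
            best_error = error
    return selected


# Toy setup (assumed): two informative variables and one noise variable.
toy_importance = {"x1": 0.5, "x2": 0.3, "noise": 0.01}
toy_usefulness = {"x1": 0.4, "x2": 0.2}


def toy_error(variables: List[str]) -> float:
    """Stand-in for a model's error: useful variables reduce it,
    and every added variable carries a small complexity penalty."""
    gain = sum(toy_usefulness.get(v, 0.0) for v in variables)
    return 1.0 - gain + 0.01 * len(variables)


chosen = stepwise_ascending_selection(toy_importance, toy_error)
```

    In this toy run the noise variable is rejected because adding it raises the error, so only the two informative variables are kept.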

    User Engagement as Evaluation: a Ranking or a Regression Problem?

    1. Introduction; 2. RecSys Challenge 2014: Data and Protocol (2.1 Data Characteristics and Statistics; 2.2 About User Engagement as Evaluation; 2.3 Input Features for the Model); 3. Method (3.1 LambdaMART Model; 3.2 Random Forests; 3.3 Description of the Approach); 4. Experiments (4.1 Experimental Results; 4.2 Relevant Features); 5. Discussions; 6. Conclusions; 7. Acknowledgments; 8. References

    In this paper, we describe the winning approach used in the RecSys Challenge 2014, which focuses on employing user engagement as an evaluation of recommendations. On the one hand, we regard the challenge as a ranking problem and apply the LambdaMART algorithm, a listwise model specialized for learning to rank. On the other hand, after noticing some specific characteristics of this challenge, we also consider it as a regression problem and use pointwise regression models such as Random Forests. We compare how these different methods can be modified or combined to improve the accuracy and robustness of our model, and we outline the advantages and disadvantages of each approach.