A Comparison of Multi-instance Learning Algorithms
Motivated by various challenging real-world applications, such as drug activity prediction and image retrieval, multi-instance (MI) learning has attracted considerable interest in recent years. Compared with standard supervised learning, the MI learning task is more difficult as the label information of each training example is incomplete. Many MI algorithms have been proposed. Some of them are specifically designed for MI problems whereas others have been upgraded or adapted from standard single-instance learning algorithms. Most algorithms have been evaluated on only one or two benchmark datasets, and there is a lack of systematic comparisons of MI learning algorithms.
This thesis presents a comprehensive study of MI learning algorithms that aims to compare their performance and to find suitable ways of addressing different MI problems. First, it briefly reviews the history of research on MI learning. Then it discusses five general classes of MI approaches that cover a total of 16 MI algorithms. After that, it presents empirical results for these algorithms, obtained on 15 datasets from five different real-world application domains. Finally, some conclusions are drawn from these results: (1) applying suitable standard single-instance learners to MI problems can often generate the best result on the datasets that were tested, (2) algorithms exploiting the standard asymmetric MI assumption do not show significant advantages over approaches using the so-called collective assumption, and (3) different MI approaches are suitable for different application domains, and no MI algorithm works best on all MI problems.
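Conclusion (2) contrasts two ways of turning instance-level evidence into a bag label. The following is a minimal sketch of that contrast, not the thesis's implementation: the function names, the 0.5 threshold, and the use of a plain average to stand in for the collective assumption are all assumptions made here for illustration.

```python
from statistics import mean

def bag_label_standard(instance_scores, threshold=0.5):
    """Standard (asymmetric) MI assumption: a bag is positive
    if at least one of its instances is positive."""
    return int(any(score >= threshold for score in instance_scores))

def bag_label_collective(instance_scores, threshold=0.5):
    """Collective assumption (illustrative variant): every instance
    contributes equally, here by averaging the instance scores."""
    return int(mean(instance_scores) >= threshold)

# Hypothetical bag: three instances with made-up positive-class scores.
bag = [0.1, 0.2, 0.9]
standard_label = bag_label_standard(bag)      # one instance exceeds 0.5
collective_label = bag_label_collective(bag)  # the average is 0.4
```

On this made-up bag the two assumptions disagree: a single strongly positive instance is enough under the standard assumption, while the averaged evidence stays below the threshold.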
A survey of cost-sensitive decision tree induction algorithms
The past decade has seen significant interest in the problem of inducing decision trees that take account of both the costs of misclassification and the costs of acquiring the features used for decision making. This survey identifies over 50 algorithms, including approaches that are direct adaptations of accuracy-based methods and approaches that use genetic algorithms, anytime methods, boosting and bagging. The survey brings together these different studies and novel approaches to cost-sensitive decision tree learning, and provides a useful taxonomy, a historical timeline of how the field has developed, and a reference point for future research in this field.
Boosting Applied to Word Sense Disambiguation
In this paper Schapire and Singer's AdaBoost.MH boosting algorithm is applied
to the Word Sense Disambiguation (WSD) problem. Initial experiments on a set of
15 selected polysemous words show that the boosting approach surpasses Naive
Bayes and Exemplar-based approaches, which represent state-of-the-art accuracy
on supervised WSD. In order to make boosting practical for a real learning
domain of thousands of words, several ways of accelerating the algorithm by
reducing the feature space are studied. The best variant, which we call
LazyBoosting, is tested on the largest sense-tagged corpus available containing
192,800 examples of the 191 most frequent and ambiguous English words. Again,
boosting compares favourably to the other benchmark algorithms.
Comment: 12 pages
Metalearning: a survey of trends and technologies
Metalearning has attracted considerable interest in the machine learning community in recent years. Yet some disagreement remains on what does or does not constitute a metalearning problem and in which contexts the term is used. This survey aims to give an all-encompassing overview of the research directions pursued under the umbrella of metalearning, reconciling the different definitions given in the scientific literature, listing the choices involved in designing a metalearning system, and identifying some of the future research challenges in this domain. © 2013 The Author(s)
Sentiment Classification of Customer Reviews about Automobiles in Roman Urdu
Text mining is a broad field, and sentiment mining is an important part of it:
we try to deduce people's attitudes towards a specific item, such as
merchandise, politics, sports, social media comments or review sites. Among the
many issues in sentiment mining, analysis and classification, a major one is
that reviews and comments can be in different languages, such as English,
Arabic and Urdu, and handling each language according to its rules is a
difficult task. A lot of research on sentiment analysis and classification has
been done for English, but only limited work is being carried out on other
regional languages such as Arabic, Urdu and Hindi. In this paper, the Waikato
Environment for Knowledge Analysis (WEKA) is used as a platform to run
different classification models on Roman Urdu text. The review dataset was
scraped from different automobile sites. The extracted Roman Urdu reviews,
1000 positive and 1000 negative, are saved in the WEKA attribute-relation file
format (ARFF) as labeled examples. Training is done on 80% of this data and the
rest is used for testing. Different models are tested and the results are
analyzed in each case. The results show that Multinomial Naive Bayes
outperformed Bagging, Deep Neural Network, Decision Tree, Random Forest,
AdaBoost, k-NN and SVM classifiers in terms of accuracy, precision, recall and
F-measure.
Comment: This is a pre-print of a contribution published in Advances in
Intelligent Systems and Computing (editors: Kohei Arai, Supriya Kapoor and
Rahul Bhatia) published by Springer, Cham. The final authenticated version is
available online at: https://doi.org/10.1007/978-3-030-03405-4_4
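The evaluation protocol described above (an 80/20 holdout split, then precision, recall and F-measure) can be sketched in a few lines. This is an illustrative sketch only: the paper uses WEKA, whereas the helper names `holdout_split` and `precision_recall_f1`, the fixed seed, and the placeholder data below are assumptions introduced here.

```python
import random

def holdout_split(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples and cut it into train and test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F-measure for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Placeholder corpus with the same class sizes as in the abstract
# (1000 positive, 1000 negative); the review texts are dummies.
reviews = [("review %d" % i, 1 if i < 1000 else 0) for i in range(2000)]
train_set, test_set = holdout_split(reviews)

# Made-up labels and predictions, only to exercise the metric function.
precision, recall, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

With 2000 reviews the split yields 1600 training and 400 test examples. A stratified split would keep the class balance identical in both parts; the plain shuffle here does not guarantee that.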
MetaBags: Bagged Meta-Decision Trees for Regression
Ensembles are popular methods for solving practical supervised learning
problems. They reduce the risk of having underperforming models in
production-grade software. Although critical, methods for learning
heterogeneous regression ensembles have not been proposed at large scale,
whereas in classical ML literature, stacking, cascading and voting are mostly
restricted to classification problems. Regression poses distinct learning
challenges that may result in poor performance, even when using well
established homogeneous ensemble schemas such as bagging or boosting.
In this paper, we introduce MetaBags, a novel, practically useful stacking
framework for regression. MetaBags is a meta-learning algorithm that learns a
set of meta-decision trees designed to select one base model (i.e. expert) for
each query, and focuses on inductive bias reduction. A set of meta-decision
trees is learned using different types of meta-features, specially created for
this purpose, and then bagged at the meta-level. This procedure is designed to
learn a model with a fair bias-variance trade-off, and its improvement over
base model performance is correlated with the prediction diversity of different
experts on specific input space subregions. The proposed method and
meta-features are designed in such a way that they enable good predictive
performance even in subregions of space which are not adequately represented in
the available training data.
Exhaustive empirical testing of the method was performed, evaluating both
generalization error and scalability of the approach on synthetic, open and
real-world application datasets. The obtained results show that our method
significantly outperforms existing state-of-the-art approaches.
Machine learning approach for credit score analysis : a case study of predicting mortgage loan defaults
Dissertation submitted in partial fulfilment of the requirements for the degree of Statistics and Information Management specialized in Risk Analysis and Management.
To manage credit score analysis effectively, financial institutions have developed techniques and models designed mainly to improve the assessment of creditworthiness during the credit evaluation process. The foremost objective is to separate their clients (borrowers) into a non-defaulter group, which is more likely to meet its financial obligations, and a defaulter group, which has a higher probability of failing to pay its debts. In this study, we use machine learning models to predict mortgage defaults. The study employs several single classification methods, including Logistic Regression, Classification and Regression Trees, Random Forest, K-Nearest Neighbors, and Support Vector Machine. To further improve predictive power, a meta-algorithm ensemble approach, stacking, is introduced to combine the outputs (probabilities) of the aforementioned methods. The sample for this study is based solely on the dataset made publicly available by Freddie Mac. With this approach, we achieve an improvement in predictive performance. We then compare the performance of each model, and of the meta-learner, by plotting the ROC curve and computing the AUC. This study extends various preceding studies that used different techniques to enhance model predictive power. Finally, our results are compared with work from different authors.
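The AUC used to compare the models has a simple interpretation: it is the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter, with ties counted as one half. A minimal sketch of that computation follows; the function name `roc_auc` and all labels and scores are made up for illustration and are not results from the dissertation.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs that are ranked
    correctly by the scores; a tied pair counts as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical default labels (1 = defaulter) and predicted probabilities
# from two imaginary models; the numbers are illustrative only.
labels = [0, 0, 1, 1]
scores_a = [0.1, 0.4, 0.35, 0.8]
scores_b = [0.1, 0.3, 0.6, 0.8]

auc_a = roc_auc(labels, scores_a)
auc_b = roc_auc(labels, scores_b)
```

The pairwise form is quadratic in the number of examples, which is fine for a sketch; on a dataset the size of a mortgage portfolio a rank-based implementation would be the usual choice.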
Stacked Generalizations in Imbalanced Fraud Data Sets using Resampling Methods
This study uses stacked generalization, a two-step process for combining
machine learning methods by means of meta or super learners, to improve the
performance of algorithms. In step one, the error rate of each individual
algorithm is minimized to reduce its bias in the learning set. In step two,
the results are input into the meta learner, whose stacked, blended output
demonstrates improved performance, with the weakest algorithms learning
better. The method is essentially an enhanced cross-validation
strategy. Although the process uses great computational resources, the
resulting performance metrics on resampled fraud data show that increased
system cost can be justified. A fundamental key to fraud data is that it is
inherently not systematic and, as of yet, the optimal resampling methodology
has not been identified. Building a test harness that accounts for all
permutations of algorithm sample set pairs demonstrates that the complex,
intrinsic data structures are all thoroughly tested. Using a comparative
analysis on fraud data that applies stacked generalizations provides useful
insight needed to find the optimal mathematical formula to be used for
imbalanced fraud data sets.
Comment: 19 pages, 3 figures, 8 tables
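The link between stacked generalization and cross-validation can be made concrete with a small sketch. It is a toy illustration under stated assumptions, not the study's test harness: the base learner (nearest class mean on one feature), the lookup-table meta learner, every function name, and the eight-row dataset are hypothetical and chosen only to keep the example self-contained.

```python
from collections import Counter, defaultdict

def centroid_trainer(feature):
    """Toy base learner: nearest class mean on a single feature."""
    def train(X, y):
        means = {}
        for label in (0, 1):
            values = [row[feature] for row, t in zip(X, y) if t == label]
            means[label] = sum(values) / len(values)
        return lambda row: int(
            abs(row[feature] - means[1]) < abs(row[feature] - means[0])
        )
    return train

def out_of_fold(trainers, X, y, k):
    """Step one: each row's meta-features come from base models
    that were trained without that row (k-fold cross-validation)."""
    meta = [None] * len(X)
    for fold in range(k):
        train_idx = [i for i in range(len(X)) if i % k != fold]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        models = [trainer(X_tr, y_tr) for trainer in trainers]
        for i in range(fold, len(X), k):
            meta[i] = [model(X[i]) for model in models]
    return meta

def train_meta(meta_rows, y):
    """Step two: toy meta learner, a majority-vote lookup table
    keyed by the tuple of base-model outputs."""
    votes = defaultdict(Counter)
    for row, t in zip(meta_rows, y):
        votes[tuple(row)][t] += 1
    default = Counter(y).most_common(1)[0][0]
    table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
    return lambda row: table.get(tuple(row), default)

def fit_stack(trainers, X, y, k=4):
    """Train the meta learner on out-of-fold outputs, then refit
    the base learners on all data for use at prediction time."""
    meta_model = train_meta(out_of_fold(trainers, X, y, k), y)
    base_models = [trainer(X, y) for trainer in trainers]
    return lambda row: meta_model([m(row) for m in base_models])

# Made-up, cleanly separable data: class 0 has a low first feature
# and a high second feature; class 1 is the reverse.
X = [[0.0, 5.0], [1.0, 0.2], [0.1, 4.0], [0.9, 0.1],
     [0.2, 6.0], [1.1, 0.3], [0.0, 5.5], [1.0, 0.0]]
y = [0, 1, 0, 1, 0, 1, 0, 1]

trainers = [centroid_trainer(0), centroid_trainer(1)]
oof = out_of_fold(trainers, X, y, k=4)
stacked = fit_stack(trainers, X, y, k=4)
```

Training the meta learner on out-of-fold outputs rather than on in-sample outputs is what keeps it from simply rewarding the base learner that overfits most. It is also where the computational cost comes from, since every base learner is trained k + 1 times.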