437 research outputs found
An urn model for majority voting in classification ensembles
In this work we analyze the class prediction of parallel randomized ensembles by
majority voting as an urn model. For a given test instance, the ensemble can be
viewed as an urn of marbles of different colors. A marble represents an individual
classifier. Its color represents the class label prediction of the corresponding
classifier. The sequential querying of classifiers in the ensemble can be seen
as draws without replacement from the urn. An analysis of this classical urn
model based on the hypergeometric distribution makes it possible to estimate
the confidence in the outcome of majority voting when only a fraction of the
individual predictions is known. These estimates can be used to speed up the
prediction by the ensemble. Specifically, the aggregation of votes can be halted
when the confidence in the final prediction is sufficiently high. If one assumes
a uniform prior for the distribution of possible votes, the analysis is shown to be
equivalent to a previous one based on Dirichlet distributions. The advantage of
the current approach is that prior knowledge on the possible vote outcomes can be
readily incorporated in a Bayesian framework. We show how incorporating this
type of problem-specific knowledge into the statistical analysis of majority voting
leads to faster classification by the ensemble and allows us to estimate the expected
average speed-up beforehand.
The authors acknowledge financial support from the Comunidad de Madrid (project CASI-CAMCM
S2013/ICE-2845), and from the Spanish Ministerio de Economía y Competitividad (projects
TIN2013-42351-P and TIN2015-70308-REDT).
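The early-stopping idea in this abstract can be illustrated with a minimal sketch. It is restricted to two classes and is not the authors' implementation: the function names, the `threshold` parameter, and the default uniform prior over the total vote count are assumptions made for illustration. The likelihood of the votes seen so far is hypergeometric (draws without replacement), and a different `prior` list can be passed to encode problem-specific knowledge.

```python
from math import comb

def majority_confidence(T, t, k, prior=None):
    """Posterior probability that class 1 wins the full vote of T classifiers,
    given k votes for class 1 among the first t classifiers queried.
    Binary case only; T is assumed odd so that ties cannot occur."""
    if prior is None:
        prior = [1.0] * (T + 1)  # assumption: uniform prior over K = total class-1 votes
    # Unnormalized posterior over K: prior times the hypergeometric likelihood
    # of seeing k class-1 marbles in t draws without replacement.
    # The common factor 1 / comb(T, t) cancels in the normalization below.
    weights = [prior[K] * comb(K, k) * comb(T - K, t - k) for K in range(T + 1)]
    total = sum(weights)
    win = sum(w for K, w in enumerate(weights) if 2 * K > T)
    return win / total

def predict_with_early_stop(votes, threshold=0.99, prior=None):
    """Query the votes (0 or 1) one by one and stop once the posterior
    confidence in the final majority reaches `threshold`.
    Returns (predicted_label, number_of_classifiers_queried)."""
    T = len(votes)
    k = 0
    for t, v in enumerate(votes, start=1):
        k += v
        p1 = majority_confidence(T, t, k, prior)
        if max(p1, 1.0 - p1) >= threshold:
            return int(p1 >= 0.5), t
    return int(2 * k > T), T
```

For an ensemble whose members agree on a given instance, the loop stops after only a few queries; instances near the decision boundary need more votes before the threshold is reached.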
Boosting Randomized Smoothing with Variance Reduced Classifiers
Randomized Smoothing (RS) is a promising method for obtaining robustness
certificates by evaluating a base model under noise. In this work, we: (i)
theoretically motivate why ensembles are a particularly suitable choice as base
models for RS, and (ii) empirically confirm this choice, obtaining
state-of-the-art results in multiple settings. The key insight of our work is
that the reduced variance of ensembles over the perturbations introduced in RS
leads to significantly more consistent classifications for a given input. This,
in turn, leads to substantially increased certifiable radii for samples close
to the decision boundary. Additionally, we introduce key optimizations which
enable an up to 55-fold decrease in sample complexity of RS, thus drastically
reducing its computational overhead. Experimentally, we show that ensembles of
only 3 to 10 classifiers consistently improve on their strongest constituting
model with respect to their average certified radius (ACR) by 5% to 21% on both
CIFAR10 and ImageNet, achieving a new state-of-the-art ACR of 0.86 and 1.11,
respectively. We release all code and models required to reproduce our results
upon publication.
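The link between more consistent classifications under noise and a larger certified radius can be sketched as follows. This is an illustrative sketch, not the paper's code: it assumes the standard L2 certificate for Gaussian smoothing, R = sigma * Phi^{-1}(p_lower), and it substitutes a simple one-sided Hoeffding lower bound for the exact binomial confidence bound usually used in practice. The function name and the `alpha` default are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

def certified_radius(n_top, n_total, sigma, alpha=0.001):
    """Certified L2 radius from Monte Carlo counts under Gaussian noise.

    n_top   : number of noisy samples classified as the majority class
    n_total : total number of noisy samples
    sigma   : standard deviation of the smoothing noise
    alpha   : allowed failure probability of the lower bound

    Returns None (abstain) if the majority class cannot be certified."""
    p_hat = n_top / n_total
    # Assumption: one-sided Hoeffding bound, a looser stand-in for an exact
    # binomial (Clopper-Pearson) lower confidence bound.
    p_lower = p_hat - sqrt(log(1.0 / alpha) / (2.0 * n_total))
    if p_lower <= 0.5:
        return None
    return sigma * NormalDist().inv_cdf(p_lower)
```

Under these assumptions, a base model that agrees with itself on 990 of 1000 noisy samples gets a larger radius than one that agrees on 900, which is the effect the abstract attributes to the reduced variance of ensembles.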
Cloudy with a Chance of Poaching: Adversary Behavior Modeling and Forecasting with Real-World Poaching Data
Wildlife conservation organizations task rangers to deter and capture wildlife poachers. Since rangers are responsible for patrolling vast areas, adversary behavior modeling can help more effectively direct future patrols. In this innovative application track paper, we present an adversary behavior modeling system, INTERCEPT (INTERpretable Classification Ensemble to Protect Threatened species), and provide the most extensive evaluation in the AI literature of one of the largest poaching datasets from Queen Elizabeth National Park (QENP) in Uganda, comparing INTERCEPT with its competitors; we also present results from a month-long test of INTERCEPT in the field. We present three major contributions. First, we present a paradigm shift in modeling and forecasting wildlife poacher behavior. Some of the latest work in the AI literature (and in Conservation) has relied on models similar to the Quantal Response model from Behavioral Game Theory for poacher behavior prediction. In contrast, INTERCEPT presents a behavior model based on an ensemble of decision trees (i) that more effectively predicts poacher attacks and (ii) that is more effectively interpretable and verifiable. We augment this model to account for spatial correlations and construct an ensemble of the best models, significantly improving performance. Second, we conduct an extensive evaluation on the QENP dataset, comparing 41 models in prediction performance over two years. Third, we present the results of deploying INTERCEPT for a one-month field test in QENP - a first for adversary behavior modeling applications in this domain. This field test has led to finding a poached elephant and more than a dozen snares (including a roll of elephant snares) before they were deployed, potentially saving the lives of multiple animals - including elephants.
Engineering and Applied Science
Can BERT Dig It? -- Named Entity Recognition for Information Retrieval in the Archaeology Domain
The amount of archaeological literature is growing rapidly. Until recently,
these data were only accessible through metadata search. We implemented a text
retrieval engine for a large archaeological text collection ( Million
words). In archaeological IR, domain-specific entities such as locations, time
periods, and artefacts, play a central role. This motivated the development of
a named entity recognition (NER) model to annotate the full collection with
archaeological named entities. In this paper, we present ArcheoBERTje, a BERT
model pre-trained on Dutch archaeological texts. We compare the model's quality
and output on a Named Entity Recognition task to a generic multilingual model
and a generic Dutch model. We also investigate ensemble methods for combining
multiple BERT models, and combining the best BERT model with a domain thesaurus
using Conditional Random Fields (CRF). We find that ArcheoBERTje outperforms
both the multilingual and Dutch model significantly with a smaller standard
deviation between runs, reaching an average F1 score of 0.735. The model also
outperforms ensemble methods combining the three models. Combining ArcheoBERTje
predictions and explicit domain knowledge from the thesaurus did not increase
the F1 score. We quantitatively and qualitatively analyse the differences
between the vocabulary and output of the BERT models on the full collection and
provide some valuable insights in the effect of fine-tuning for specific
domains. Our results indicate that for a highly specific text domain such as
archaeology, further pre-training on domain-specific data increases the model's
quality on NER by a much larger margin than shown for other domains in the
literature, and that domain-specific pre-training makes the addition of domain
knowledge from a thesaurus unnecessary.
Prediction based on averages over automatically induced learners: ensemble methods and Bayesian techniques
Unpublished doctoral thesis. Universidad Autónoma de Madrid, Escuela Politécnica Superior, November 200
Technical and Fundamental Features Analysis for Stock Market Prediction with Data Mining Methods
Predicting stock prices is an essential objective in the financial world, and forecasting stock returns and their risk is one of the most critical concerns of market decision makers. This thesis investigates stock price forecasting with three data mining approaches and shows how different elements of the stock price can help improve prediction accuracy. The first and second approaches take many fundamental indicators of the stocks as explanatory variables for stock price classification and forecasting. The third approach extracts technical features from the candlestick representation of share prices and uses them to improve forecasting accuracy. In each approach, different data mining and machine learning tools and techniques are employed to justify why the forecasting works.
Furthermore, since the aim is to evaluate the potential of features for stock trend forecasting, we diversify our experiments by using both technical and fundamental features. In the first approach, a three-stage methodology is developed. In the first stage, a comprehensive investigation identifies all features that may affect stock risk and return. In the next stage, risk and return are predicted by applying data mining techniques to the given features. Finally, we develop a hybrid algorithm, based on filters and function-based clustering, and re-predict the risk and return of stocks.
In the second approach, instead of using single classifiers, a fusion model is proposed based on multiple diverse base classifiers that operate on a common input and a meta-classifier that learns from the base classifiers’ outputs to obtain more precise stock return and risk predictions. A set of diversity methods, including Bagging, Boosting, and AdaBoost, is applied to create diversity in classifier combinations. Moreover, the number of base classifiers and the procedure for selecting them for fusion schemes are determined using a methodology based on dataset clustering and candidate classifiers’ accuracy.
Finally, in the third approach, a novel forecasting model for stock markets is presented, based on the wrapper ANFIS (Adaptive Neural Fuzzy Inference System) – ICA (Imperialist Competitive Algorithm) and technical analysis of Japanese candlesticks. Two approaches, Raw-based and Signal-based, are devised to extract the model’s input variables, and buy and sell signals are considered as output variables.
To illustrate the methodologies, for the first and second approaches, Tehran Stock Exchange (TSE) data for the period from 2002 to 2012 are applied, while for the third approach, we used General Motors and Dow Jones indexes.