13 research outputs found

    Using Bad Learners to find Good Configurations

    Finding the best-performing configuration of a software system for a given setting is often challenging. Recent approaches address this challenge by learning performance models from a sample set of configurations. However, building an accurate performance model can be very expensive (and is often infeasible in practice). The central insight of this paper is that exact performance values (e.g., the response time of a software system) are not required to rank configurations and identify the optimal one. As shown by our experiments, models that are cheap to learn but inaccurate (with respect to the difference between actual and predicted performance) can still be used to rank configurations and hence find the optimal configuration. This novel rank-based approach allows us to significantly reduce the cost (in terms of the number of measurements of sample configurations) as well as the time required to build models. We evaluate our approach with 21 scenarios based on 9 software systems and demonstrate that it is beneficial in 16 scenarios; for the remaining 5 scenarios, an accurate model can be built from very few samples anyway, without the need for a rank-based approach. Comment: 11 pages, 11 figures
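The central claim above, that a model can be badly wrong about absolute performance and still rank configurations correctly, can be illustrated with a small sketch. The response times and the "inaccurate model" below are hypothetical values chosen for illustration; they are not taken from the paper, and the paper's own learners and error measures may differ.

```python
def rank_order(values):
    """Return configuration indices sorted from best (lowest value) to worst."""
    return sorted(range(len(values)), key=lambda i: values[i])


def mean_relative_error(actual, predicted):
    """Average of |predicted - actual| / actual over all configurations."""
    return sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical response times (lower is better) for five configurations.
actual = [12.0, 30.0, 18.0, 45.0, 25.0]

# A deliberately inaccurate model: every prediction is far too high,
# but the distortion is monotone, so the ordering is preserved.
predicted = [2.0 * a + 5.0 for a in actual]

error = mean_relative_error(actual, predicted)          # well above 100%
same_ranking = rank_order(actual) == rank_order(predicted)
best_actual = rank_order(actual)[0]
best_predicted = rank_order(predicted)[0]
```

Here the model's mean relative error exceeds 100%, yet it selects the same best configuration as the true measurements, which is all that a ranking task needs.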

    Automatic Figure Ranking and User Interfacing for Intelligent Figure Search

    Figures are important experimental results that are typically reported in full-text bioscience articles. Bioscience researchers need to access figures to validate research facts and to formulate or test novel research hypotheses. However, the sheer volume of bioscience literature has made it difficult to access figures. We are therefore developing an intelligent figure search engine (http://figuresearch.askhermes.org). Existing research in figure search treats each figure equally, but we introduce the novel concept of "figure ranking": figures appearing in a full-text biomedical article can be ranked by their contribution to knowledge discovery. We empirically validated the hypothesis of figure ranking with over 100 bioscience researchers, and then developed unsupervised natural language processing (NLP) approaches to rank figures automatically. Evaluated on a collection of 202 full-text articles in which authors have ranked the figures by importance, our best system achieved a weighted error rate of 0.2, which is significantly better than several baseline systems we explored. We further explored a user interfacing application in which we built novel user interfaces (UIs) incorporating figure ranking, allowing bioscience researchers to access important figures efficiently. Our evaluation results show that 92% of the bioscience researchers chose, as their top two preferences, the user interfaces in which the most important figures are enlarged. Bioscience researchers preferred the UIs in which the most important figures were predicted by our automatic figure ranking NLP system over the UIs in which the most important figures were randomly assigned. In addition, our results show no statistically significant difference in bioscience researchers' preference between the UIs generated by automatic figure ranking and the UIs generated by human ranking annotation. The evaluation results indicate that automatic figure ranking and user interfacing as reported in this study can be fully implemented in online publishing. The novel user interface integrated with the automatic figure ranking system provides a more efficient and robust way to access scientific information in the biomedical domain, and will further enhance our existing figure search engine to better facilitate access to figures of interest for bioscientists

    Receiver operating characteristic (ROC) movies, universal ROC (UROC) curves, and coefficient of predictive ability (CPA)

    Throughout science and technology, receiver operating characteristic (ROC) curves and associated area under the curve (AUC) measures constitute powerful tools for assessing the predictive abilities of features, markers and tests in binary classification problems. Despite its immense popularity, ROC analysis has been subject to a fundamental restriction, in that it applies to dichotomous (yes or no) outcomes only. Here we introduce ROC movies and universal ROC (UROC) curves that apply to any linearly ordered outcome, along with an associated coefficient of predictive ability (CPA) measure. CPA equals the area under the UROC curve, and admits appealing interpretations in terms of probabilities and rank-based covariances. For binary outcomes CPA equals AUC, and for pairwise distinct outcomes CPA relates linearly to Spearman’s coefficient, in the same way that the C index relates linearly to Kendall’s coefficient. ROC movies, UROC curves, and CPA nest and generalize the tools of classical ROC analysis, and are bound to supersede them in a wealth of applications. Their usage is illustrated in data examples from biomedicine and meteorology, where rank-based measures yield new insights in the WeatherBench comparison of the predictive performance of convolutional neural networks and physical-numerical models for weather prediction
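The abstract does not give the formula for CPA, so the sketch below does not implement it. It only illustrates the two classical quantities the abstract compares against: AUC as the fraction of correctly ordered (positive, negative) pairs, and the C index, which for tie-free data satisfies C = (tau + 1) / 2 with Kendall's tau. The function names and example data are assumptions made for illustration.

```python
from itertools import combinations


def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def c_index(x, y):
    """Share of pairs with different outcomes y that the feature x orders the same way."""
    total, pairs = 0.0, 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        if yi == yj:
            continue  # pairs with equal outcomes carry no ordering information
        pairs += 1
        product = (xi - xj) * (yi - yj)
        total += 1.0 if product > 0 else 0.5 if xi == xj else 0.0
    return total / pairs


def kendall_tau(x, y):
    """Kendall's tau for data without ties in x or y."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        if (xi - xj) * (yi - yj) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Binary outcome: the C index reduces to AUC.
scores, labels = [0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]
binary_auc = auc(scores, labels)
binary_c = c_index(scores, labels)

# Pairwise distinct outcomes: C index and Kendall's tau are linearly related.
feature, outcome = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
c_value = c_index(feature, outcome)
tau_value = kendall_tau(feature, outcome)
```

On the binary example both functions return 0.75; on the tie-free example the C index is 0.8 and tau is 0.6, consistent with the linear relation.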

    Non-Parametric Classification of Time Series Using Permutation Ordinal Statistics

    The present thesis explores approaches to classifying time series without prior statistical information, using the concept of permutation entropy. Motivated by the results of a previously published and relevant work that established similarity relationships between EEG time series, we attempted to reproduce the proposed approach, with negative results. The failure to reproduce those results led to the conclusion that building statistics from permutation patterns has to be complemented with another metric in order to be used for classification purposes. The concept of Total Variation Distance (TVD) was then used to develop three algorithms that classify time series in a non-parametric way. The developed algorithms were first tested on EEG time series. Although their results were better than the previous results, they were not as satisfactory as desired. Because of the inherent complexity of brain measurements, we then switched to self-generated data to test the algorithms. Classification was performed on time series drawn from different sets of filtered versions of Gaussian white noise. For comparison purposes, a parametric classification approach using Maximum Likelihood (ML) estimation was also used. Results showed that when each set of data came from the same filtering equation, classification with the developed algorithms was optimal, reaching a 100% success rate in many cases and matching the ML approach. On the other hand, when each set of data came from a mixture of different filter equations (reflecting the complex situations we faced when processing EEG data), results were fairly successful but varied relative to the ML approach, which was outperformed in some cases but not in others. The results indicate that permutation entropy analysis is a step in the right direction for efficiently classifying time series; however, more research is needed to find the right metric and obtain better results
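The thesis's three classification algorithms are not described in this abstract, so they are not reproduced here. As a minimal sketch of the building blocks it names, the code below computes the distribution of ordinal (permutation) patterns of a series, its permutation entropy, and the total variation distance between two such distributions. The pattern order of 3 and the toy series are assumptions chosen for illustration.

```python
import math
from collections import Counter


def ordinal_distribution(series, order=3):
    """Relative frequency of each ordinal pattern over sliding windows of length `order`."""
    counts = Counter(
        # The pattern is the argsort of the window: indices listed by increasing value.
        tuple(sorted(range(order), key=lambda k: series[i + k]))
        for i in range(len(series) - order + 1)
    )
    total = sum(counts.values())
    return {pattern: c / total for pattern, c in counts.items()}


def permutation_entropy(dist):
    """Shannon entropy (natural log) of an ordinal pattern distribution."""
    return -sum(p * math.log(p) for p in dist.values())


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions over ordinal patterns."""
    patterns = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in patterns)


rising = ordinal_distribution([1, 2, 3, 4, 5, 6])
falling = ordinal_distribution([6, 5, 4, 3, 2, 1])
zigzag = ordinal_distribution([1, 3, 2, 4, 3, 5])

distance = total_variation_distance(rising, falling)
```

A nearest-distribution rule (assign a series to the class whose reference distribution has the smallest TVD) would be one simple way to build a classifier on top of these pieces, but whether the thesis does this is not stated in the abstract.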

    Probabilistic reframing for cost-sensitive regression

    © ACM, 2014. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Knowledge Discovery from Data (TKDD), Vol. 8, Iss. 4 (October 2014), http://doi.acm.org/10.1145/2641758
    Common-day applications of predictive models usually involve the full use of the available contextual information. When the operating context changes, one may fine-tune the by-default (incontextual) prediction or may even abstain from predicting a value (a reject). Global reframing solutions, where the same function is applied to adapt the estimated outputs to a new cost context, are possible solutions here. An alternative approach, which has not been studied in a comprehensive way for regression in the knowledge discovery and data mining literature, is the use of a local (e.g., probabilistic) reframing approach, where decisions are made according to the estimated output and a reliability, confidence, or probability estimation. In this article, we advocate for a simple two-parameter (mean and variance) approach, working with a normal conditional probability density. Given the conditional mean produced by any regression technique, we develop lightweight “enrichment” methods that produce good estimates of the conditional variance, which are used by the probabilistic (local) reframing methods. We apply these methods to some very common families of cost-sensitive problems, such as optimal predictions in (auction) bids, asymmetric loss scenarios, and rejection rules.
    This work was supported by the MEC/MINECO projects CONSOLIDER-INGENIO CSD2007-00022 and TIN 2010-21062-C02-02, and TIN 2013-45732-C4-1-P and GVA projects PROMETEO/2008/051 and PROMETEO2011/052.
Finally, part of this work was motivated by the REFRAME project (http://www.reframe-d2k.org) granted by the European Coordinated Research on Long-term Challenges in Information and Communication Sciences & Technologies ERA-Net (CHIST-ERA) and funded by Ministerio de Economia y Competitividad in Spain (PCIN-2013-037).
    Hernández Orallo, J. (2014). Probabilistic reframing for cost-sensitive regression. ACM Transactions on Knowledge Discovery from Data. 8(4):1-55. https://doi.org/10.1145/2641758
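To make the two-parameter idea concrete, the following sketch shows one way a normal conditional density (mean and standard deviation) can shift a prediction under an asymmetric loss. It assumes an asymmetric linear loss, where each unit of under-prediction costs `cost_under` and each unit of over-prediction costs `cost_over`; under that assumption the expected loss is minimised at the quantile `cost_under / (cost_under + cost_over)` of the conditional distribution. The loss families used in the article itself are not spelled out in this abstract, and the function name and numbers are hypothetical.

```python
from statistics import NormalDist


def reframed_prediction(mean, std, cost_under, cost_over):
    """Prediction minimising expected asymmetric linear loss under N(mean, std^2).

    Assumption: under-predicting by d costs cost_under * d and
    over-predicting by d costs cost_over * d.
    """
    quantile = cost_under / (cost_under + cost_over)
    return NormalDist(mu=mean, sigma=std).inv_cdf(quantile)


# Hypothetical regression output: conditional mean 10 with standard deviation 2.
symmetric = reframed_prediction(10.0, 2.0, cost_under=1.0, cost_over=1.0)
cautious = reframed_prediction(10.0, 2.0, cost_under=3.0, cost_over=1.0)
```

With symmetric costs the prediction stays at the conditional mean; when under-prediction is three times as costly, the prediction moves up to the 0.75 quantile, and the size of that shift scales with the estimated variance.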

    Stochastic Tools for Network Security: Anonymity Protocol Analysis and Network Intrusion Detection

    With the rapid development of the Internet and the sharp increase in network crime, network security has become very important and has received a lot of attention. In this dissertation, we model security issues as stochastic systems. This allows us to find weaknesses in existing security systems and propose new solutions. Exploring the vulnerabilities of existing security tools can prevent cyber-attacks from taking advantage of system weaknesses. We consider The Onion Router (Tor), one of the most popular anonymity systems in use today, and show how to detect a protocol tunnelled through Tor. A hidden Markov model (HMM) is used to represent the protocol. Hidden Markov models are statistical models of sequential data such as network traffic, and are an effective tool for pattern analysis. New, flexible and adaptive security schemes are needed to cope with emerging security threats. We propose a hybrid network security scheme that includes intrusion detection systems (IDSs) and honeypots scattered throughout the network, combining the advantages of the two security technologies. A honeypot is an activity-based network security system, which can be the logical supplement to the passive detection policies used by IDSs. This integration forces us to balance security performance against cost by scheduling device activities for the proposed system. By formulating the scheduling problem as a decentralized partially observable Markov decision process (DEC-POMDP), decisions are made in a distributed manner at each device without requiring centralized control. When using an HMM, it is important to ensure that it accurately represents both the data used to train the model and the underlying process. Current methods assume that the observations used to construct an HMM completely represent the underlying process. It is often the case that the training data set is not large enough to adequately capture all statistical dependencies in the system. It is therefore important to know the statistical significance level at which the constructed model represents the underlying process, not only the training set. We present a method to determine whether the observation data and the constructed model fully express the underlying process at a given level of statistical significance. We apply this approach to detecting the existence of protocols tunnelled through Tor. While HMMs are a powerful tool for representing patterns while allowing for uncertainty, they cannot be used for system control. The partially observable Markov decision process (POMDP) is a useful choice for controlling stochastic systems. As a combination of two Markov models, POMDPs combine the strength of HMMs (capturing dynamics that depend on unobserved states) with that of Markov decision processes (MDPs) (taking the decision aspect into account). Decision making under uncertainty is used in many parts of business and science; here we use it for security tools. We propose three approximation methods for discrete-time infinite-horizon POMDPs. One of the main contributions of our work is a high-quality approximate solution for finite-space POMDPs with the average cost criterion, and its extension to DEC-POMDPs. The solution of the first algorithm is built out of the observable portion when the underlying MDP operates optimally. The other two methods presented here can be classified as policy-based approximation schemes, in which we formulate POMDP planning as a quadratically constrained linear program (QCLP), which defines an optimal controller of a desired size. This representation allows a wide range of powerful nonlinear programming (NLP) algorithms to be used to solve POMDPs. Simulation results for a set of benchmark problems illustrate the effectiveness of the proposed method. We show how this tool could be used to design a network security framework
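As a minimal illustration of how an HMM scores sequential data, the sketch below implements the standard forward algorithm, which computes the likelihood of an observation sequence under a given model. The dissertation's actual detection procedure, model structure, and parameters are not given in this abstract; the two-state model and all probabilities here are hypothetical.

```python
from itertools import product


def forward_likelihood(observations, start, transition, emission):
    """Likelihood of an observation sequence under an HMM (forward algorithm).

    start[s]         : probability of starting in state s
    transition[r][s] : probability of moving from state r to state s
    emission[s][o]   : probability of emitting symbol o in state s
    """
    states = range(len(start))
    # Initialisation: joint probability of the first symbol and each state.
    alpha = [start[s] * emission[s][observations[0]] for s in states]
    # Recursion: propagate through the transitions, then weight by the emission.
    for obs in observations[1:]:
        alpha = [
            sum(alpha[r] * transition[r][s] for r in states) * emission[s][obs]
            for s in states
        ]
    return sum(alpha)


# Hypothetical two-state model over two observation symbols (0 and 1).
start = [0.6, 0.4]
transition = [[0.7, 0.3], [0.2, 0.8]]
emission = [[0.9, 0.1], [0.3, 0.7]]

likelihood = forward_likelihood([0, 0, 1], start, transition, emission)

# Sanity check: likelihoods of all length-3 sequences sum to 1.
total = sum(
    forward_likelihood(list(seq), start, transition, emission)
    for seq in product([0, 1], repeat=3)
)
```

Comparing such likelihoods across candidate models is one common way to decide which model best explains an observed traffic sequence, though the significance test described in the abstract goes beyond this.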

    The Influence of mtry on Random Forests (Der Einfluss von mtry auf Random Forests)

    Between Tradition and Modernity (Zwischen Tradition und Moderne)
