    Using Bad Learners to find Good Configurations

    Finding the optimally performing configuration of a software system for a given setting is often challenging. Recent approaches address this challenge by learning performance models based on a sample set of configurations. However, building an accurate performance model can be very expensive (and is often infeasible in practice). The central insight of this paper is that exact performance values (e.g., the response time of a software system) are not required to rank configurations and to identify the optimal one. As shown by our experiments, models that are cheap to learn but inaccurate (with respect to the difference between actual and predicted performance) can still be used to rank configurations and hence find the optimal configuration. This novel rank-based approach allows us to significantly reduce the cost (in terms of the number of measurements of sample configurations) as well as the time required to build models. We evaluate our approach with 21 scenarios based on 9 software systems and demonstrate that our approach is beneficial in 16 scenarios; for the remaining 5 scenarios, an accurate model can be built by using very few samples anyway, without the need for a rank-based approach. Comment: 11 pages, 11 figures
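
The central insight can be illustrated with a small sketch. The numbers below are invented for illustration and do not come from the paper's experiments: a hypothetical model whose predictions are off by more than 100% on average still orders the configurations exactly as the measurements do, and so still identifies the best one.

```python
# Hypothetical data (not from the paper): measured response times in seconds
# for five configurations, and predictions from a cheap, inaccurate model.
actual = [12.0, 35.0, 20.0, 50.0, 27.0]
predicted = [30.0, 71.0, 44.0, 95.0, 60.0]  # systematically too high


def mean_relative_error(actual, predicted):
    """Average of |actual - predicted| / actual over all configurations."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


def ranking(values):
    """Configuration indices ordered from best (lowest value) to worst."""
    return sorted(range(len(values)), key=lambda i: values[i])


mre = mean_relative_error(actual, predicted)
same_order = ranking(actual) == ranking(predicted)
best_predicted = ranking(predicted)[0]

print(f"mean relative error: {mre:.2f}")
print(f"rankings agree: {same_order}, predicted best configuration: {best_predicted}")
```

The error measure and the comparison of full rankings are assumptions chosen for brevity; the paper may quantify accuracy and rank quality differently.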

    Automatic Figure Ranking and User Interfacing for Intelligent Figure Search

    Figures are important experimental results that are typically reported in full-text bioscience articles. Bioscience researchers need to access figures to validate research facts and to formulate or test novel research hypotheses. However, the sheer volume of bioscience literature has made it difficult to access figures. Therefore, we are developing an intelligent figure search engine (http://figuresearch.askhermes.org). Existing research in figure search treats each figure equally, but we introduce a novel concept of "figure ranking": figures appearing in a full-text biomedical article can be ranked by their contribution to knowledge discovery. We empirically validated the hypothesis of figure ranking with over 100 bioscience researchers, and then developed unsupervised natural language processing (NLP) approaches to automatically rank figures. Evaluating on a collection of 202 full-text articles in which authors have ranked the figures based on importance, our best system achieved a weighted error rate of 0.2, which is significantly better than several other baseline systems we explored. We further explored a user interfacing application in which we built novel user interfaces (UIs) incorporating figure ranking, allowing bioscience researchers to efficiently access important figures. Our evaluation results show that 92% of the bioscience researchers ranked among their top two choices the user interfaces in which the most important figures are enlarged. With our automatic figure ranking NLP system, bioscience researchers preferred the UIs in which the most important figures were predicted by our NLP system over the UIs in which the most important figures were randomly assigned.
In addition, our results show that there was no statistical difference in bioscience researchers' preference between the UIs generated by automatic figure ranking and the UIs generated by human ranking annotation. The evaluation results indicate that automatic figure ranking and user interfacing as reported in this study can be fully implemented in online publishing. The novel user interface integrated with the automatic figure ranking system provides a more efficient and robust way to access scientific information in the biomedical domain, which will further enhance our existing figure search engine to better facilitate access to figures of interest for bioscientists.

    Receiver operating characteristic (ROC) movies, universal ROC (UROC) curves, and coefficient of predictive ability (CPA)

    Throughout science and technology, receiver operating characteristic (ROC) curves and associated area under the curve (AUC) measures constitute powerful tools for assessing the predictive abilities of features, markers and tests in binary classification problems. Despite its immense popularity, ROC analysis has been subject to a fundamental restriction, in that it applies to dichotomous (yes or no) outcomes only. Here we introduce ROC movies and universal ROC (UROC) curves that apply to any linearly ordered outcome, along with an associated coefficient of predictive ability (CPA) measure. CPA equals the area under the UROC curve, and admits appealing interpretations in terms of probabilities and rank-based covariances. For binary outcomes CPA equals AUC, and for pairwise distinct outcomes CPA relates linearly to Spearman’s coefficient, in the same way that the C index relates linearly to Kendall’s coefficient. ROC movies, UROC curves, and CPA nest and generalize the tools of classical ROC analysis, and are bound to supersede them in a wealth of applications. Their usage is illustrated in data examples from biomedicine and meteorology, where rank-based measures yield new insights in the WeatherBench comparison of the predictive performance of convolutional neural networks and physical-numerical models for weather prediction.
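
The relationships named above can be checked numerically on toy data. The sketch below is not the authors' implementation: it computes a pairwise concordance probability, which is the AUC for binary outcomes and the C index for pairwise distinct outcomes, and verifies C = (τ + 1)/2 for Kendall's τ. The form CPA = (ρ + 1)/2 for Spearman's ρ is an assumption, taken from reading "in the same way" literally, and all data are invented.

```python
from itertools import combinations


def concordance(x, y):
    """Share of pairs with different outcomes that the feature orders the
    same way as the outcome; ties in the feature count one half."""
    score, pairs = 0.0, 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        if yi == yj:
            continue  # pairs with equal outcomes carry no ordering information
        pairs += 1
        if xi == xj:
            score += 0.5
        elif (xi < xj) == (yi < yj):
            score += 1.0
    return score / pairs


def ranks(values):
    """Ranks 1..n, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman's rank correlation for data without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall(x, y):
    """Kendall's tau for data without ties."""
    pairs = list(combinations(zip(x, y), 2))
    s = sum(1 if (xi < xj) == (yi < yj) else -1 for (xi, yi), (xj, yj) in pairs)
    return s / len(pairs)


# Binary outcome: the concordance probability is the AUC.
feature = [0.1, 0.4, 0.35, 0.8]
label = [0, 0, 1, 1]
auc = concordance(feature, label)

# Pairwise distinct outcome: C index versus Kendall, CPA versus Spearman.
x = [1.0, 3.0, 2.0, 5.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]
c_index = concordance(x, y)
cpa_assumed = (spearman(x, y) + 1) / 2  # assumed linear form, see lead-in

print(auc, c_index, (kendall(x, y) + 1) / 2, cpa_assumed)
```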

    Non-Parametric Classification of Time Series Using Permutation Ordinal Statistics

    The present thesis explores approaches to classifying time series without prior statistical information, using the concept of permutation entropy. Motivated by the results of a previously published, relevant work that established similarity relationships between EEG time series, we attempted to reproduce the proposed approach, with negative results. The failure to reproduce those results led to the conclusion that building statistics from permutation patterns has to be complemented with another metric in order to be used for classification purposes. The concept of Total Variation Distance (TVD) was then used to develop three algorithms that classify time series in a non-parametric way. At first, the developed algorithms were tested using EEG time series. Even though the results using the developed algorithms were better than previous results, they were not as satisfactory as desired. Given the inherent complexity of brain measurements, we switched to self-generated data to test the algorithms. Classification was then performed on time series drawn from different sets of filtered versions of Gaussian white noise. For comparison purposes, a parametric classification approach using maximum likelihood (ML) estimation was also used. Results showed that when each set of data came from the same filtering equation, classification using the developed algorithms was optimal, reaching a 100% success rate in many cases and being as good as the ML approach. On the other hand, when each set of data came from a mixture of different filter equations (reflecting the complex situations we faced when processing EEG data), results were fairly successful but varied relative to the ML approach, which was outperformed in some cases but not in others.
The results indicate that permutation entropy analysis is a step in the right direction for efficiently classifying time series; however, more research is needed to find the right metric and obtain better results.
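
The two ingredients named above, ordinal (permutation) pattern statistics and the Total Variation Distance, can be sketched as follows. The thesis's three algorithms are not described in this abstract, so the nearest-reference rule in `classify` is only an assumed stand-in, and the series, the pattern length, and all function names are hypothetical.

```python
from collections import Counter
from itertools import permutations
import math


def ordinal_distribution(series, order=3):
    """Relative frequency of every ordinal pattern of length `order`.
    A pattern is the tuple of window positions sorted by their values."""
    counts = Counter(
        tuple(sorted(range(order), key=lambda k: series[i + k]))
        for i in range(len(series) - order + 1)
    )
    total = sum(counts.values())
    return {p: counts.get(p, 0) / total for p in permutations(range(order))}


def permutation_entropy(dist):
    """Shannon entropy (natural log) of an ordinal pattern distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions on the same patterns."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)


def classify(series, references, order=3):
    """Assumed rule: pick the class whose reference distribution is
    closest in TVD to the series' own pattern distribution."""
    dist = ordinal_distribution(series, order)
    return min(references, key=lambda c: total_variation_distance(dist, references[c]))


# Hypothetical reference classes and a slightly perturbed test series.
references = {
    "rising": ordinal_distribution(list(range(10))),
    "falling": ordinal_distribution(list(range(10, 0, -1))),
}
noisy = [0, 1, 2, 3, 2.5, 4, 5, 6]

print(classify(noisy, references))
print(permutation_entropy(ordinal_distribution(noisy)))
```

No statistical model of the series is fitted at any point, which is what makes the comparison non-parametric.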

    Probabilistic reframing for cost-sensitive regression

    © ACM, 2014. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Knowledge Discovery from Data (TKDD), Vol. 8, Iss. 4 (October 2014), http://doi.acm.org/10.1145/2641758

    Common-day applications of predictive models usually involve the full use of the available contextual information. When the operating context changes, one may fine-tune the by-default (incontextual) prediction or may even abstain from predicting a value (a reject). Global reframing solutions, where the same function is applied to adapt the estimated outputs to a new cost context, are possible solutions here. An alternative approach, which has not been studied in a comprehensive way for regression in the knowledge discovery and data mining literature, is the use of a local (e.g., probabilistic) reframing approach, where decisions are made according to the estimated output and a reliability, confidence, or probability estimation. In this article, we advocate for a simple two-parameter (mean and variance) approach, working with a normal conditional probability density. Given the conditional mean produced by any regression technique, we develop lightweight “enrichment” methods that produce good estimates of the conditional variance, which are used by the probabilistic (local) reframing methods. We apply these methods to some very common families of cost-sensitive problems, such as optimal predictions in (auction) bids, asymmetric loss scenarios, and rejection rules.

    This work was supported by the MEC/MINECO projects CONSOLIDER-INGENIO CSD2007-00022, TIN 2010-21062-C02-02, and TIN 2013-45732-C4-1-P, and by GVA projects PROMETEO/2008/051 and PROMETEO2011/052. Finally, part of this work was motivated by the REFRAME project (http://www.reframe-d2k.org) granted by the European Coordinated Research on Long-term Challenges in Information and Communication Sciences & Technologies ERA-Net (CHIST-ERA) and funded by Ministerio de Economia y Competitividad in Spain (PCIN-2013-037).

    Hernández Orallo, J. (2014). Probabilistic reframing for cost-sensitive regression. ACM Transactions on Knowledge Discovery from Data. 8(4):1-55. https://doi.org/10.1145/2641758

    Stochastic Tools for Network Security: Anonymity Protocol Analysis and Network Intrusion Detection

    With the rapid development of the Internet and the sharp increase in network crime, network security has become very important and has received a lot of attention. In this dissertation, we model security issues as stochastic systems. This allows us to find weaknesses in existing security systems and propose new solutions. Exploring the vulnerabilities of existing security tools can prevent cyber-attacks from taking advantage of system weaknesses. We consider The Onion Router (Tor), which is one of the most popular anonymity systems in use today, and show how to detect a protocol tunnelled through Tor. A hidden Markov model (HMM) is used to represent the protocol. Hidden Markov models are statistical models of sequential data such as network traffic, and are an effective tool for pattern analysis. New, flexible and adaptive security schemes are needed to cope with emerging security threats. We propose a hybrid network security scheme including intrusion detection systems (IDSs) and honeypots scattered throughout the network. This combines the advantages of two security technologies. A honeypot is an activity-based network security system, which could be the logical supplement of the passive detection policies used by IDSs. This integration forces us to balance security performance versus cost by scheduling device activities for the proposed system. By formulating the scheduling problem as a decentralized partially observable Markov decision process (DEC-POMDP), decisions are made in a distributed manner at each device without requiring centralized control. When using an HMM, it is important to ensure that it accurately represents both the data used to train the model and the underlying process. Current methods assume that the observations used to construct an HMM completely represent the underlying process. It is often the case that the training data size is not large enough to adequately capture all statistical dependencies in the system.
It is therefore important to know the level of statistical significance with which the constructed model represents the underlying process, not only the training set. We present a method to determine whether the observation data and constructed model fully express the underlying process with a given level of statistical significance. We apply this approach to detecting the existence of protocols tunnelled through Tor. While HMMs are a powerful tool for representing patterns while allowing for uncertainties, they cannot be used for system control. The partially observable Markov decision process (POMDP) is a useful choice for controlling stochastic systems. As a combination of two Markov models, POMDPs combine the strength of the HMM (capturing dynamics that depend on unobserved states) and that of the Markov decision process (MDP) (taking the decision aspect into account). Decision making under uncertainty is used in many parts of business and science. We use it here for security tools. We propose three approximation methods for discrete-time infinite-horizon POMDPs. One of the main contributions of our work is a high-quality approximate solution for finite-space POMDPs with the average cost criterion, and its extension to DEC-POMDPs. The solution of the first algorithm is built out of the observable portion when the underlying MDP operates optimally. The other two methods presented here can be classified as policy-based approximation schemes, in which we formulate POMDP planning as a quadratically constrained linear program (QCLP), which defines an optimal controller of a desired size. This representation allows a wide range of powerful nonlinear programming (NLP) algorithms to be used to solve POMDPs. Simulation results for a set of benchmark problems illustrate the effectiveness of the proposed method. We show how this tool could be used to design a network security framework.
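
How a POMDP combines HMM-style state estimation with MDP-style actions can be seen in its standard belief update, sketched below. This is textbook Bayesian filtering, not one of the dissertation's three approximation methods, and the two-state security model, its actions, and all probabilities are invented for illustration.

```python
def belief_update(belief, action, observation, T, O):
    """Standard POMDP belief update:
    b'(s') is proportional to O[a][s'][o] * sum_s T[a][s][s'] * b(s)."""
    n = len(belief)
    # Prediction step (the MDP part): push the belief through the transitions.
    predicted = [sum(T[action][s][s2] * belief[s] for s in range(n)) for s2 in range(n)]
    # Correction step (the HMM part): weight by the observation likelihood.
    unnormalized = [O[action][s2][observation] * predicted[s2] for s2 in range(n)]
    evidence = sum(unnormalized)
    if evidence == 0:
        raise ValueError("observation has zero probability under the model")
    return [u / evidence for u in unnormalized]


# Hypothetical model. States: 0 = benign, 1 = attack.
# Actions: 0 = device idle, 1 = device active. Observations: 0 = quiet, 1 = alert.
transitions = [[0.9, 0.1], [0.2, 0.8]]
T = [transitions, transitions]     # the action does not change the dynamics here
O = [
    [[0.5, 0.5], [0.5, 0.5]],      # idle device: observations are uninformative
    [[0.9, 0.1], [0.2, 0.8]],      # active device: alerts are informative
]

prior = [0.5, 0.5]
after_idle = belief_update(prior, 0, 1, T, O)
after_active = belief_update(prior, 1, 1, T, O)

print("belief in attack after an alert, idle device:  ", round(after_idle[1], 4))
print("belief in attack after an alert, active device:", round(after_active[1], 4))
```

In this toy model an alert from an active device shifts the belief toward the attack state far more than one from an idle device, which is the kind of information gain a scheduler would weigh against the cost of keeping devices active.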

