128 research outputs found

    Inferring Causal Direction from Observational Data: A Complexity Approach

    At the heart of causal structure learning from observational data lies a deceptively simple question: given two statistically dependent random variables, which one has a causal effect on the other? This question cannot be answered by statistical dependence testing alone; it requires additional assumptions. We propose several fast and simple criteria for distinguishing cause and effect in pairs of discrete or continuous random variables. The intuition behind them is that predicting the effect variable from the cause variable should be ‘simpler’ than the reverse, with different notions of ‘simplicity’ giving rise to different criteria. We demonstrate the accuracy of the criteria on synthetic data generated under a broad family of causal mechanisms and types of noise.

    Statistical Hypothesis Testing in Positive Unlabelled Data

    We propose a set of novel methodologies that enable valid statistical hypothesis testing when only positive and unlabelled (PU) examples are available. This type of problem, a special case of semi-supervised data, is common in text mining, bioinformatics, and computer vision. Focusing on a generalised likelihood ratio test, we make three key contributions: (1) a proof that assuming all unlabelled examples are negative cases is sufficient for independence testing, but not for power analysis; (2) a new methodology that compensates for this and enables power analysis, allowing sample size determination for observing an effect with a desired power; and (3) a new capability, supervision determination, which can determine a priori the number of labelled examples the user must collect before being able to observe a desired statistical effect. Beyond general hypothesis testing, we suggest the tools will also be useful for information theoretic feature selection and Bayesian Network structure learning.
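
    As an illustration of contribution (1), the following is a minimal sketch of a G-test (a generalised likelihood ratio test of independence) computed on a surrogate label that treats every unlabelled example as negative. The function name `g_statistic` and the toy data are assumptions made for this example and are not taken from the paper; the power-analysis correction of contribution (2) is not shown.

```python
import math
from collections import Counter


def g_statistic(xs, ss):
    """G-test statistic, 2 * sum(O * ln(O / E)), for two discrete variables.

    xs: feature values; ss: surrogate labels, where 1 means "labelled
    positive" and 0 means "unlabelled" (treated here as negative).
    """
    n = len(xs)
    joint = Counter(zip(xs, ss))
    count_x = Counter(xs)
    count_s = Counter(ss)
    g = 0.0
    for (x, s), observed in joint.items():
        # Expected count under independence of x and s.
        expected = count_x[x] * count_s[s] / n
        g += observed * math.log(observed / expected)
    return 2.0 * g


# Hypothetical toy data: 25 labelled positives followed by 75 unlabelled examples.
feature = [1] * 20 + [0] * 5 + [1] * 25 + [0] * 50
surrogate = [1] * 25 + [0] * 75

g = g_statistic(feature, surrogate)
# 3.841 is the chi-square critical value for 1 degree of freedom at alpha = 0.05.
reject_independence = g > 3.841
```

    For a binary feature and a binary surrogate label the statistic is compared against a chi-square distribution with one degree of freedom, which is why 3.841 is used as the threshold in this sketch.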

    Information theoretic feature selection in multi-label data through composite likelihood

    In this paper we present a framework to unify information theoretic feature selection criteria for multi-label data. Our framework combines two ideas: expressing multi-label decomposition methods as composite likelihoods, and then showing how feature selection criteria can be derived by maximizing these likelihood expressions. Many existing criteria, until now proposed as heuristics, can be reproduced from a single basis under the proposed framework. Furthermore, we can derive new problem-specific criteria by making different independence assumptions over the feature and label spaces. One such derived criterion is shown experimentally to outperform other approaches proposed in the literature on real-world datasets.

    Simultaneous prediction of four ATP-binding cassette transporters' substrates using multi-label QSAR

    Efflux by the ATP-binding cassette (ABC) transporters affects the pharmacokinetic profile of drugs and has been implicated in drug-drug interactions, in addition to playing a major role in multi-drug resistance in cancer. It is therefore important for the pharmaceutical industry to understand which phenomena govern ABC substrate recognition. Given the high degree of substrate overlap between members of the ABC transporter family, it is advantageous to employ a multi-label classification approach in which predictions made for one transporter can be used for modeling the other ABC transporters. Here, we present decision tree-based QSAR classification models able to simultaneously predict substrates and non-substrates for BCRP1, P-gp/MDR1, MRP1 and MRP2, using a dataset of 1493 compounds. To this end, two multi-label classification QSAR modelling approaches were adopted: Binary Relevance (BR) and Classifier Chain (CC). Even though both multi-label models yielded similar predictive performance in terms of overall accuracy (close to 70%), the CC model overcame the problem of performance skewed towards identifying substrates rather than non-substrates, which is a common problem in the literature. The models were thoroughly validated using external testing, applicability domain and activity cliff characterization. In conclusion, a multi-label classification approach is an appropriate alternative for the prediction of ABC efflux. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
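
    The difference between Binary Relevance and Classifier Chain can be sketched in a few lines. This is an illustrative sketch only: the paper uses decision tree models, whereas a 1-nearest-neighbour base learner is swapped in here to keep the example self-contained, and all function names and the toy data are hypothetical.

```python
def nearest_label(train_X, train_y, x):
    """1-nearest-neighbour prediction using squared Euclidean distance."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]


def predict_binary_relevance(X, Y, x):
    """Binary Relevance: one independent classifier per label."""
    n_labels = len(Y[0])
    return [nearest_label(X, [row[j] for row in Y], x) for j in range(n_labels)]


def predict_classifier_chain(X, Y, x):
    """Classifier Chain: classifier j also sees labels 0..j-1 as features."""
    n_labels = len(Y[0])
    predictions = []
    for j in range(n_labels):
        # Training inputs are augmented with the true values of earlier labels.
        augmented_X = [list(xi) + list(yi[:j]) for xi, yi in zip(X, Y)]
        # The query is augmented with the labels predicted so far.
        augmented_x = list(x) + predictions
        predictions.append(
            nearest_label(augmented_X, [row[j] for row in Y], augmented_x)
        )
    return predictions


# Hypothetical toy data: two descriptors per compound, two labels per compound.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [[0, 0], [0, 1], [1, 0], [1, 1]]

br = predict_binary_relevance(X, Y, [0.9, 0.1])
cc = predict_classifier_chain(X, Y, [0.9, 0.1])
```

    In the chain, the prediction for an earlier label becomes an input for later ones, which is how information about one transporter can inform the model for another.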

    Insights into distributed feature ranking

    This version of the article: Bolón-Canedo, V., Sechidis, K., Sánchez-Maroño, N., Alonso-Betanzos, A., & Brown, G. (2019). ‘Insights into distributed feature ranking’ has been accepted for publication in: Information Sciences, 496, 378–398. The Version of Record is available online at https://doi.org/10.1016/j.ins.2018.09.045. [Abstract]: In an era in which the volume and complexity of datasets are continuously growing, feature selection techniques have become indispensable for extracting useful information from huge amounts of data. However, existing algorithms may not scale well when dealing with huge datasets, and a possible solution is to distribute the data across several nodes. In this work we explore the different ways of distributing the data (by features and by samples) and we evaluate to what extent it is possible to obtain results similar to those obtained with the whole dataset. To address the challenge of distributing the feature ranking process, we have performed experiments with different aggregation methods and feature rankers, and have also evaluated the effect of distributing the feature ranking process on the subsequent classification performance. This research has been economically supported in part by the Spanish Ministerio de Economía y Competitividad and FEDER funds of the European Union through the research project TIN2015-65069-C2-1-R; and by the Consellería de Industria of the Xunta de Galicia through the research project GRC2014/035. Financial support from the Xunta de Galicia (Centro singular de investigación de Galicia accreditation 2016-2019) and the European Union (European Regional Development Fund - ERDF) is gratefully acknowledged (research project ED431G/01). V. Bolón-Canedo acknowledges support of the Xunta de Galicia under postdoctoral Grant code ED481B 2014/164-0. Xunta de Galicia; GRC2014/035. Xunta de Galicia; ED431G/01. Xunta de Galicia; ED481B 2014/164-

    Feature selection with limited bit depth mutual information for portable embedded systems

    This version of the article: Morán-Fernández, L., Sechidis, K., Bolón-Canedo, V., Alonso-Betanzos, A., & Brown, G. (2020). ‘Feature selection with limited bit depth mutual information for portable embedded systems’ has been accepted for publication in: Knowledge-Based Systems, 197, 105885. The Version of Record is available online at https://doi.org/10.1016/j.knosys.2020.105885. [Abstract]: Since wearable computing systems have grown in importance in recent years, there is an increased interest in implementing machine learning algorithms with reduced precision parameters/computations. Not only learning but also feature selection, which is most of the time a mandatory preprocessing step in machine learning, is often constrained by the available computational resources. This work considers mutual information – one of the most common measures of dependence used in feature selection algorithms – with a limited number of bits. To test the proposed procedure, we have implemented it in several well-known feature selection algorithms. Experimental results over several synthetic and real datasets demonstrate that low bit representations are sufficient to achieve performance close to that of double precision parameters, and thus open the door to the use of feature selection in embedded platforms that minimize energy consumption and carbon emissions. This research has been financially supported in part by the Spanish Ministerio de Economía y Competitividad (research project TIN2015-65069-C2-1-R), by European Union FEDER funds and by the Consellería de Industria of the Xunta de Galicia (research project GRC2014/035). Financial support from the Xunta de Galicia (Centro singular de investigación de Galicia accreditation 2016-2019) and the European Union (European Regional Development Fund - ERDF) is gratefully acknowledged (research project ED431G/01). Project supported by a 2018 Leonardo Grant for Researchers and Cultural Creators, BBVA Foundation. Laura Morán-Fernández acknowledges predoctoral stay grant by INDITEX-UDC 2015. Xunta de Galicia; ED431G/01. Xunta de Galicia; GRC2014 /03
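
    To give a feel for what computing mutual information with a limited number of bits might look like, here is a minimal sketch. The abstract does not specify the number format, so the fixed-point rounding used here (`quantize`, with `bits` fractional bits) is an assumption made for illustration and should not be read as the authors' procedure; the toy data are hypothetical.

```python
import math
from collections import Counter


def quantize(value, bits):
    """Round value to the nearest multiple of 2**-bits (fixed-point grid)."""
    scale = 1 << bits
    return round(value * scale) / scale


def mutual_information(xs, ys, bits=None):
    """Mutual information in bits; if `bits` is set, use reduced precision."""
    q = (lambda v: quantize(v, bits)) if bits is not None else (lambda v: v)
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # Probabilities and the log-ratio are each rounded to the grid.
        p_xy, p_x, p_y = q(count / n), q(px[x] / n), q(py[y] / n)
        if p_xy == 0 or p_x == 0 or p_y == 0:
            continue  # a probability rounded to zero at this bit depth
        mi += p_xy * q(math.log2(p_xy / (p_x * p_y)))
    return mi


# Hypothetical toy data: a binary feature and a binary class label.
xs = [0] * 40 + [1] * 60
ys = [0] * 30 + [1] * 10 + [0] * 20 + [1] * 40

mi_full = mutual_information(xs, ys)            # double precision
mi_16bit = mutual_information(xs, ys, bits=16)  # 16 fractional bits
```

    On this toy example the 16-bit estimate stays very close to the double precision value, which is consistent in spirit with the claim that low bit representations can be sufficient, although a single example is not evidence for it.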

    WATCH: A Workflow to Assess Treatment Effect Heterogeneity in Drug Development for Clinical Trial Sponsors

    This paper proposes a Workflow for Assessing Treatment effeCt Heterogeneity (WATCH) in clinical drug development targeted at clinical trial sponsors. The workflow is designed to address the challenges of investigating treatment effect heterogeneity (TEH) in randomized clinical trials, where sample size and multiplicity limit the reliability of findings. The proposed workflow includes four steps: Analysis Planning, Initial Data Analysis and Analysis Dataset Creation, TEH Exploration, and Multidisciplinary Assessment. The workflow aims to provide a systematic approach to explore treatment effect heterogeneity in the exploratory setting, taking into account external evidence and best scientific understanding

    E-Learning & Environmental Policy: The case of a politico-administrative GIS

    Is effective knowledge exchange and cooperation between the academic community and practitioners possible? Implementing e-learning in specialized policy fields is among the most challenging priorities of ICTs and software engineering. In multidisciplinary academic areas that combine environmental policy studies with positivist subjects (such as environmental issues, forest policy, rural development, and landscape architecture), the use of e-learning systems for analyzing policy issues is steadily gaining importance as a method that connects the academic community and researchers with practitioners and field experts. Such initiatives incorporate a number of politometrics-relevant algorithms embedded in a context of political geography (i.e., visualized hierarchies in different region-related policy issues). This is the case addressed in this paper. The GIS learning management system introduced here is based on certain criteria concerning organizational models and region-specific politico-administrative hierarchies. Scenarios of politico-administrative metadata achieving optimal power synergy are extracted through a sequencing technique that combines vector-algebra software and statistics, and they can be used for both teaching and research purposes.

    Automated Selection and Configuration of Multi-Label Classification Algorithms with Grammar-Based Genetic Programming

    This paper proposes Auto-MEKAGGP, an Automated Machine Learning (Auto-ML) method for Multi-Label Classification (MLC) based on the MEKA tool, which offers a number of MLC algorithms. In MLC, each example can be associated with one or more class labels, making MLC problems harder than conventional (single-label) classification problems. Hence, it is essential to select an MLC algorithm and a configuration tailored to (optimized for) the input dataset. Auto-MEKAGGP addresses this problem with two key ideas. First, a large number of choices of MLC algorithms and configurations from MEKA are encoded in a grammar. Second, our proposed Grammar-based Genetic Programming (GGP) method uses that grammar to search for the best MLC algorithm and configuration for the input dataset. Auto-MEKAGGP was tested on 10 datasets and compared with two well-known MLC methods, namely Binary Relevance and Classifier Chain, and with GA-Auto-MLC, a genetic algorithm we recently proposed for the same task. Two versions of Auto-MEKAGGP were tested: a full version with the proposed grammar, and a simplified version in which the grammar includes only the algorithmic components used by GA-Auto-MLC. Overall, the full version of Auto-MEKAGGP achieved the best predictive accuracy among all five evaluated methods, winning on six of the 10 datasets.