
    On the power of conditional independence testing under model-X

    For testing conditional independence (CI) of a response Y and a predictor X given covariates Z, the recently introduced model-X (MX) framework has been the subject of active methodological research, especially in the context of MX knockoffs and their successful application to genome-wide association studies. In this paper, we study the power of MX CI tests, yielding quantitative explanations for empirically observed phenomena and novel insights to guide the design of MX methodology. We show that any valid MX CI test must also be valid conditionally on Y and Z; this conditioning allows us to reformulate the problem as testing a point null hypothesis involving the conditional distribution of X. The Neyman-Pearson lemma then implies that the conditional randomization test (CRT) based on a likelihood statistic is the most powerful MX CI test against a point alternative. We also obtain a related optimality result for MX knockoffs. Switching to an asymptotic framework with arbitrarily growing covariate dimension, we derive an expression for the limiting power of the CRT against local semiparametric alternatives in terms of the prediction error of the machine learning algorithm on which its test statistic is based. Finally, we exhibit a resampling-free test with uniform asymptotic Type-I error control under the assumption that only the first two moments of X given Z are known, a significant relaxation of the MX assumption.
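To make the CRT mentioned in this abstract concrete, here is a minimal sketch of the generic resampling procedure. It assumes the conditional distribution of X given Z is known (the model-X assumption) and uses a simple product statistic rather than the likelihood statistic the paper analyzes. The names `crt_pvalue`, `sample_x_given_z`, and `statistic`, and the simulated Gaussian data, are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def crt_pvalue(x, y, z, sample_x_given_z, statistic, n_resamples=500, rng=None):
    """Conditional randomization test p-value for H0: X independent of Y given Z.

    sample_x_given_z(z, rng) must draw a fresh X from the known law of X | Z.
    """
    rng = np.random.default_rng(rng)
    t_obs = statistic(x, y, z)
    # Resample X from its conditional law while holding Y and Z fixed.
    t_null = np.array(
        [statistic(sample_x_given_z(z, rng), y, z) for _ in range(n_resamples)]
    )
    # The +1 terms count the observed statistic among the resamples.
    return (1 + np.sum(t_null >= t_obs)) / (n_resamples + 1)

# Hypothetical simulated data: X | Z ~ N(0.5 * Z, 1), and Y depends on X.
rng = np.random.default_rng(0)
n = 500
z = rng.standard_normal(n)
x = 0.5 * z + rng.standard_normal(n)
y = x + z + rng.standard_normal(n)

def sample_x_given_z(z, rng):
    # Assumed-known conditional law of X given Z.
    return 0.5 * z + rng.standard_normal(z.shape)

def statistic(x, y, z):
    # Illustrative statistic: |mean of (X - E[X | Z]) * Y|.
    return abs(np.mean((x - 0.5 * z) * y))

p_value = crt_pvalue(x, y, z, sample_x_given_z, statistic, rng=1)
```

Because Y and Z are held fixed across resamples, the procedure is valid conditionally on them, which matches the conditioning argument described in the abstract.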

    Transient protein-protein interface prediction: datasets, features, algorithms, and the RAD-T predictor

    BACKGROUND: Transient protein-protein interactions (PPIs), which underlie most biological processes, are a prime target for therapeutic development. Immense progress has been made towards computational prediction of PPIs using methods such as protein docking and sequence analysis. However, docking generally requires high-resolution structures of both binding partners, and sequence analysis requires that a significant number of recurrent patterns exist for the identification of a potential binding site. Researchers have turned to machine learning to overcome some of the other methods’ restrictions by generalising interface sites with sets of descriptive features. Best practices for dataset generation, features, and learning algorithms have not yet been identified or agreed upon, and an analysis of the overall efficacy of machine-learning-based PPI predictors is due, in order to highlight potential areas for improvement. RESULTS: The presence of unknown interaction sites in the testing set, a result of limited knowledge about protein interactions, dramatically reduces prediction accuracy. Greater accuracy in labelling the data by enforcing higher interface site rates per domain resulted in an average 44% improvement across multiple machine learning algorithms. A set of 10 biologically unrelated proteins that were consistently predicted with high accuracy emerged through our analysis. We identify seven features with the most predictive power over multiple datasets and machine learning algorithms. Through our analysis, we created a new predictor, RAD-T, that outperforms existing non-structurally specializing machine learning protein interface predictors, with an average 59% increase in MCC score on a dataset with a high number of interactions. CONCLUSION: Current methods of evaluating machine-learning-based PPI predictors tend to undervalue their performance, which may be artificially decreased by the presence of unidentified interaction sites. Changes to predictors’ training sets will be integral to the future progress of interface prediction by machine learning methods. We reveal the need for a larger test set of well-studied proteins or domain-specific scoring algorithms to compensate for poor interaction site identification on proteins in general.
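The MCC score cited in this abstract is the Matthews correlation coefficient, computed from the confusion-matrix counts. The sketch below uses the standard definition. The counts in the example are hypothetical and only illustrate how unlabelled interface sites, scored as false positives, can depress the measured score. They are not data from the paper.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when any marginal count is zero
    return (tp * tn - fp * fn) / denom

# Hypothetical counts for 200 residues: 20 correctly predicted sites are
# missing from the reference labels and are therefore scored as false positives.
score_incomplete_labels = mcc(tp=20, tn=150, fp=25, fn=5)

# The same predictions scored against complete labels.
score_complete_labels = mcc(tp=40, tn=150, fp=5, fn=5)
```

With identical predictions, the score against complete labels is higher, which is the effect the abstract describes as evaluation methods undervaluing predictor performance.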

    Prediction of cascading failures and simultaneous learning of functional connectivity in power system

    The prediction of power system cascading failures is a challenging task, especially with increasing uncertainty and complexity in power system dynamics due to the integration of renewable energy sources (RES). Given the spatio-temporal and combinatorial nature of the problem, physics-based approaches for characterizing cascading failures are often limited by their scope and/or speed, thereby prompting the use of a spatio-temporal learning technique. This paper proposes prediction of cascading failures using a machine learning (ML) framework based on a spatio-temporal Graph Convolution Network (GCN). The model also learns an importance matrix to reveal the power system interconnections (graph nodes/edges) that are crucial to the prediction. The elements of the learnt importance matrix are further projected as power system functional connectivities. Using these connectivities, insights on vulnerable power system interconnections may be derived for enhanced situational awareness. The proposed method has been tested on a modified IEEE 10-machine 39-bus test system, with RES and the action of protection devices.
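The abstract does not give the form of the graph convolution, so the following is only a rough sketch of one propagation step. It assumes the widely used symmetric-normalized rule ReLU(D^-1/2 (A + I) D^-1/2 H W) and treats the importance matrix as a nonnegative elementwise mask on the adjacency matrix. Both choices, along with the name `gcn_layer` and the toy three-bus graph, are assumptions for illustration and not the paper's architecture.

```python
import numpy as np

def gcn_layer(adj, features, weights, importance=None):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    if importance is not None:
        # Hypothetical use of a nonnegative importance matrix as edge weights.
        adj = adj * importance
    a_hat = adj + np.eye(adj.shape[0])  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weights, 0.0)

# Toy three-bus line graph: bus 0 - bus 1 - bus 2.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
features = np.eye(3)          # one-hot node features
weights = np.ones((3, 2))     # placeholder weights, not trained
importance = np.ones((3, 3))  # uniform importance leaves the graph unchanged
out = gcn_layer(adj, features, weights, importance)
```

In a trained model the importance entries would be learned jointly with the weights. Here they are fixed only to show where such a matrix could enter the computation.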

    Fast characterization of input-output behavior of non-charge-based logic devices by machine learning

    Non-charge-based logic devices are promising candidates for the replacement of conventional complementary metal-oxide-semiconductor (CMOS) devices. These devices utilize magnetic properties to store or process information, making them power efficient. Traditionally, fully characterizing the input-output behavior of these devices requires a large number of micromagnetic simulations, which makes the process computationally expensive. Machine learning techniques have been shown to dramatically decrease the computational requirements of many complex problems. We use state-of-the-art data-efficient machine learning techniques to expedite the characterization of their behavior. Several intelligent sampling strategies are combined with machine learning (binary and multi-class) classification models. These techniques are applied to a magnetic logic device that utilizes direct exchange interaction between two distinct regions containing a bistable canted magnetization configuration. Three classifiers were developed with various adaptive sampling techniques in order to capture the input-output behavior of this device. By adopting an adaptive sampling strategy, it is shown that prediction accuracy can approach that of full grid sampling while using only a small training set of micromagnetic simulations. In a comparison with a grid-based approach on two separate cases, the best-performing machine learning model correctly predicts 99.92% of the dense test grid while using only 2.36% of the training data.

    Developing machine learning models to predict methane and nitrogen oxide engine-out emissions from a heavy-duty natural-gas engine

    Heavy-duty engine manufacturers must comply with challenging and increasingly stringent emission and greenhouse gas (GHG) regulations. Predicting engine emission behavior in system-level models with reasonable accuracy is advantageous for engine and powertrain development. Machine learning (ML) models are promising alongside 3D physics-based and one-dimensional models. In this study, five different ML models are trained using experimental engine data for emission prediction of methane (CH4) and nitrogen oxides (NOx). The models are compared with an existing phenomenological engine model (GT-Power). The ML models include linear regression, Ridge regression, Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGBoost). The results show that the RF model outperforms the other ML models and a one-dimensional model regarding NOx emission prediction. The results of RF NOx and CH4 emission prediction in the test set fit with 80% accuracy (±20 error margin). Also, 95% of test data points have less than 10% error compared to real experimental data. Paper presented at the international congress held jointly by the Canadian Society for Mechanical Engineering (CSME) and the Computational Fluid Dynamics Society of Canada (CFD Canada), at the Université de Sherbrooke (Québec), May 28 to 31, 2023.

    On predictability of rare events leveraging social media: a machine learning perspective

    Information extracted from social media streams has been leveraged to forecast the outcome of a large number of real-world events, from political elections to stock market fluctuations. An increasing number of studies demonstrate how the analysis of social media conversations provides cheap access to the wisdom of the crowd. However, the extent and contexts in which such forecasting power can be effectively leveraged are still unverified, at least in a systematic way. It is also unclear how social-media-based predictions compare to those based on alternative information sources. To address these issues, here we develop a machine learning framework that leverages social media streams to automatically identify and predict the outcomes of soccer matches. We focus in particular on matches in which at least one of the possible outcomes is deemed highly unlikely by professional bookmakers. We argue that sport events offer a systematic approach for testing the predictive power of social media, and allow such power to be compared against the rigorous baselines set by external sources. Despite such strict baselines, our framework yields above 8% marginal profit when used to inform simple betting strategies. The system is based on real-time sentiment analysis and exploits data collected immediately before the games, allowing for informed bets. We discuss the rationale behind our approach and describe the learning framework, its prediction performance, and the return it provides as compared to a set of betting strategies. To test our framework we use both historical Twitter data from the 2014 FIFA World Cup games, and real-time Twitter data collected by monitoring the conversations about all soccer matches of four major European tournaments (FA Premier League, Serie A, La Liga, and Bundesliga), and the 2014 UEFA Champions League, during the period between Oct. 25th 2014 and Nov. 26th 2014. Comment: 10 pages, 10 tables, 8 figures