On the power of conditional independence testing under model-X
For testing conditional independence (CI) of a response Y and a predictor X
given covariates Z, the recently introduced model-X (MX) framework has been the
subject of active methodological research, especially in the context of MX
knockoffs and their successful application to genome-wide association studies.
In this paper, we study the power of MX CI tests, yielding quantitative
explanations for empirically observed phenomena and novel insights to guide the
design of MX methodology. We show that any valid MX CI test must also be valid
conditionally on Y and Z; this conditioning allows us to reformulate the
problem as testing a point null hypothesis involving the conditional
distribution of X. The Neyman-Pearson lemma then implies that the conditional
randomization test (CRT) based on a likelihood statistic is the most powerful
MX CI test against a point alternative. We also obtain a related optimality
result for MX knockoffs. Switching to an asymptotic framework with arbitrarily
growing covariate dimension, we derive an expression for the limiting power of
the CRT against local semiparametric alternatives in terms of the prediction
error of the machine learning algorithm on which its test statistic is based.
Finally, we exhibit a resampling-free test with uniform asymptotic Type-I error
control under the assumption that only the first two moments of X given Z are
known, a significant relaxation of the MX assumption.
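The resampling idea behind the CRT described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a toy Gaussian model in which X given Z has a known mean and standard deviation, and it uses a simple covariance-type statistic rather than the likelihood statistic the abstract identifies as most powerful. The helper names (`crt_p_value`, `make_covariance_statistic`, `simulate`) and all numeric settings are hypothetical.

```python
import random


def crt_p_value(x, y, z, cond_mean, cond_sd, statistic, n_resamples=500, seed=0):
    """Monte Carlo p-value of a conditional randomization test (CRT).

    Assumes X | Z = z is Gaussian with known mean cond_mean(z) and known
    standard deviation cond_sd; this is the model-X assumption in toy form.
    """
    rng = random.Random(seed)
    t_obs = statistic(x, y, z)
    exceed = 0
    for _ in range(n_resamples):
        # Redraw X from its known conditional law, keeping Y and Z fixed.
        x_tilde = [rng.gauss(cond_mean(zi), cond_sd) for zi in z]
        if statistic(x_tilde, y, z) >= t_obs:
            exceed += 1
    # The +1 terms count the observed statistic among the resamples.
    return (1 + exceed) / (n_resamples + 1)


def make_covariance_statistic(cond_mean):
    """Illustrative statistic: |sum_i (x_i - E[X | z_i]) * y_i|."""
    def statistic(x, y, z):
        return abs(sum((xi - cond_mean(zi)) * yi for xi, yi, zi in zip(x, y, z)))
    return statistic


def simulate(n, effect, seed):
    """Hypothetical data: X = 0.5 Z + noise, Y = effect * X + Z + noise."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [0.5 * zi + rng.gauss(0, 1) for zi in z]
    y = [effect * xi + zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    return x, y, z


cond_mean = lambda zi: 0.5 * zi
stat = make_covariance_statistic(cond_mean)
x, y, z = simulate(n=200, effect=2.0, seed=1)
p_value = crt_p_value(x, y, z, cond_mean, 1.0, stat)
print(p_value)
```

Because Y and Z are held fixed while X is redrawn, the resampled statistics form a reference distribution conditional on Y and Z, which matches the conditioning argument in the abstract.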
Transient protein-protein interface prediction: datasets, features, algorithms, and the RAD-T predictor
BACKGROUND: Transient protein-protein interactions (PPIs), which underlie most biological processes, are a prime target for therapeutic development. Immense progress has been made towards computational prediction of PPIs using methods such as protein docking and sequence analysis. However, docking generally requires high-resolution structures of both binding partners, and sequence analysis requires that a significant number of recurrent patterns exist for the identification of a potential binding site. Researchers have turned to machine learning to overcome some of the other methods’ restrictions by generalising interface sites with sets of descriptive features. Best practices for dataset generation, features, and learning algorithms have not yet been identified or agreed upon, and an analysis of the overall efficacy of machine-learning-based PPI predictors is due, in order to highlight potential areas for improvement. RESULTS: The presence of unknown interaction sites in the testing set, a result of limited knowledge about protein interactions, dramatically reduces prediction accuracy. Greater accuracy in labelling the data by enforcing higher interface site rates per domain resulted in an average 44% improvement across multiple machine learning algorithms. A set of 10 biologically unrelated proteins that were consistently predicted with high accuracy emerged through our analysis. We identify seven features with the most predictive power over multiple datasets and machine learning algorithms. Through our analysis, we created a new predictor, RAD-T, that outperforms existing non-structurally specializing machine learning protein interface predictors, with an average 59% increase in MCC score on a dataset with a high number of interactions. CONCLUSION: Current methods of evaluating machine-learning-based PPI predictors tend to undervalue their performance, which may be artificially decreased by the presence of unidentified interaction sites. Changes to predictors’ training sets will be integral to the future progress of interface prediction by machine learning methods. We reveal the need for a larger test set of well-studied proteins or domain-specific scoring algorithms to compensate for poor interaction site identification on proteins in general.
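To make the MCC claim concrete, the sketch below computes the Matthews correlation coefficient from confusion-matrix counts and shows how unidentified interface sites can depress it. The counts are invented for illustration and are not taken from the paper; the function name `mcc` is likewise hypothetical.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when any marginal total is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: 20 predicted interface residues are scored as false
# positives because the test set has no annotated interaction there.
scored = mcc(tp=30, fp=20, tn=140, fn=10)

# Suppose 10 of those 20 are real but unannotated interface sites. Relabelling
# them moves them from false positives to true positives.
relabelled = mcc(tp=40, fp=10, tn=140, fn=10)

print(scored, relabelled)
```

The predictor's output is identical in both cases; only the labels change, which is the sense in which incomplete annotation can undervalue a predictor.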
Prediction of cascading failures and simultaneous learning of functional connectivity in power system
The prediction of power system cascading failures is a challenging task, especially with increasing uncertainty and complexity in power system dynamics due to the integration of renewable energy sources (RES). Given the spatio-temporal and combinatorial nature of the problem, physics-based approaches for characterizing cascading failures are often limited in scope and/or speed, thereby prompting the use of a spatio-temporal learning technique. This paper proposes prediction of cascading failures using a spatio-temporal Graph Convolution Network (GCN) based machine learning (ML) framework. The model also learns an importance matrix to reveal the power system interconnections (graph nodes/edges) that are crucial to the prediction. The elements of the learnt importance matrix are further projected as power system functional connectivities. Using these connectivities, insights on vulnerable power system interconnections may be derived for enhanced situational awareness. The proposed method has been tested on a modified IEEE 10-machine 39-bus test system, with RES and the action of protection devices.
Fast characterization of input-output behavior of non-charge-based logic devices by machine learning
Non-charge-based logic devices are promising candidates for the replacement of conventional complementary metal-oxide-semiconductor (CMOS) devices. These devices utilize magnetic properties to store or process information, making them power efficient. Traditionally, fully characterizing the input-output behavior of these devices requires a large number of micromagnetic simulations, which makes the process computationally expensive. Machine learning techniques have been shown to dramatically decrease the computational requirements of many complex problems. We use state-of-the-art data-efficient machine learning techniques to expedite the characterization of their behavior. Several intelligent sampling strategies are combined with machine learning (binary and multi-class) classification models. These techniques are applied to a magnetic logic device that utilizes direct exchange interaction between two distinct regions containing a bistable canted magnetization configuration. Three classifiers were developed with various adaptive sampling techniques in order to capture the input-output behavior of this device. By adopting an adaptive sampling strategy, it is shown that prediction accuracy can approach that of full grid sampling while using only a small training set of micromagnetic simulations. Comparing model predictions to a grid-based approach on two separate cases, the best-performing machine learning model accurately predicts 99.92% of the dense test grid while utilizing only 2.36% of the training data.
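The general idea of adaptive sampling, spending simulations where the class boundary lies instead of on a uniform grid, can be illustrated with a one-dimensional toy. This sketch is not the authors' method: the stand-in "device" is a hypothetical threshold function, the sampler simply bisects the widest interval whose endpoints disagree, and prediction is by nearest labelled sample.

```python
def adaptive_boundary_sampling(run_simulation, lo, hi, n_initial=5, n_adaptive=10):
    """Label a coarse grid, then refine only where neighbouring labels differ."""
    xs = [lo + i * (hi - lo) / (n_initial - 1) for i in range(n_initial)]
    samples = {x: run_simulation(x) for x in xs}
    for _ in range(n_adaptive):
        pts = sorted(samples)
        # Adjacent points with different labels bracket a class boundary.
        gaps = [(b - a, a, b) for a, b in zip(pts, pts[1:]) if samples[a] != samples[b]]
        if not gaps:
            break
        _, a, b = max(gaps)
        mid = (a + b) / 2
        samples[mid] = run_simulation(mid)
    return samples


def predict(samples, x):
    """Nearest-neighbour prediction from the labelled samples."""
    nearest = min(samples, key=lambda p: abs(p - x))
    return samples[nearest]


# Hypothetical stand-in for an expensive simulation: output flips at x = 0.37.
toy_device = lambda x: int(x > 0.37)

samples = adaptive_boundary_sampling(toy_device, 0.0, 1.0)
grid = [i / 1000 for i in range(1001)]
accuracy = sum(predict(samples, g) == toy_device(g) for g in grid) / len(grid)
print(len(samples), accuracy)
```

Here the dense grid is used only for evaluation; the sampler itself calls the toy simulation at most 15 times.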
Developing machine learning models to predict methane and nitrogen oxide engine-out emissions from a heavy-duty natural-gas engine
Heavy-duty engine manufacturers must comply with challenging and increasingly stringent emission and greenhouse gas (GHG) regulations. Predicting engine emission behavior in system-level models with reasonable accuracy is advantageous for engine and powertrain development. Machine learning (ML) models are promising alongside 3D physics-based and one-dimensional models. In this study, five different ML models are trained using experimental engine data for emission prediction of methane (CH4) and nitrogen oxides (NOx). The models are compared with an existing phenomenological engine model (GT-Power). The ML models include linear regression, Ridge regression, Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGBoost). The results show that the RF model outperforms other models and a one-dimensional model regarding NOx emission prediction. The results of RF NOx and CH4 emission prediction in the test set fit with 80% accuracy (±20 error margin). Also, 95% of test data points have less than 10% error compared to real experimental data. Paper presented at the international congress held jointly by the Canadian Society for Mechanical Engineering (CSME) and the Computational Fluid Dynamics Society of Canada (CFD Canada), at the Université de Sherbrooke (Québec), May 28 to 31, 2023.
On predictability of rare events leveraging social media: a machine learning perspective
Information extracted from social media streams has been leveraged to
forecast the outcome of a large number of real-world events, from political
elections to stock market fluctuations. An increasing number of studies
demonstrate how the analysis of social media conversations provides cheap
access to the wisdom of the crowd. However, the extent and contexts in which such
forecasting power can be effectively leveraged are still unverified, at least in
a systematic way. It is also unclear how social-media-based predictions compare
to those based on alternative information sources. To address these issues,
here we develop a machine learning framework that leverages social media
streams to automatically identify and predict the outcomes of soccer matches.
We focus in particular on matches in which at least one of the possible
outcomes is deemed highly unlikely by professional bookmakers. We argue that
sport events offer a systematic approach for testing the predictive power of
social media, and allow such power to be compared against the rigorous baselines
set by external sources. Despite such strict baselines, our framework yields
above 8% marginal profit when used to inform simple betting strategies. The
system is based on real-time sentiment analysis and exploits data collected
immediately before the games, allowing for informed bets. We discuss the
rationale behind our approach, describe the learning framework, its prediction
performance and the return it provides as compared to a set of betting
strategies. To test our framework we use both historical Twitter data from the
2014 FIFA World Cup games, and real-time Twitter data collected by monitoring
the conversations about all soccer matches of four major European tournaments
(FA Premier League, Serie A, La Liga, and Bundesliga), and the 2014 UEFA
Champions League, during the period between Oct. 25th 2014 and Nov. 26th 2014.
Comment: 10 pages, 10 tables, 8 figures