    A pipeline and comparative study of 12 machine learning models for text classification

    Text-based communication is a highly favoured communication method, especially in business environments. As a result, it is often abused through malicious messages, e.g., spam emails, designed to deceive users into relaying personal information, including online account credentials or banking details. For this reason, many machine learning methods for text classification have been proposed and incorporated into the services of most email providers. However, optimising text classification algorithms and finding the right trade-off in their aggressiveness is still a major research problem. We present an updated survey of 12 machine learning text classifiers applied to a public spam corpus. A new pipeline is proposed to optimise hyperparameter selection and improve the models' performance by applying specific methods (based on natural language processing) in the preprocessing stage. Our study aims to provide a new methodology for investigating and optimising the effect of different feature sizes and hyperparameters in machine learning classifiers that are widely used in text classification problems. The classifiers are tested and evaluated on different metrics, including F-score (accuracy), precision, recall, and run time. By analysing all these aspects, we show how the proposed pipeline can be used to achieve good accuracy in spam filtering on the Enron dataset, a widely used public email corpus. Statistical tests and explainability techniques are applied to provide a robust analysis of the proposed pipeline and to interpret the classification outcomes of the 12 machine learning models, also identifying the words that drive the classification results. Our analysis shows that it is possible to identify an effective machine learning model that classifies the Enron dataset with an F-score of 94%. Comment: This article has been accepted for publication in Expert Systems with Applications, April 2022. Published by Elsevier. All data, models, and code used in this work are available on GitHub at https://github.com/Angione-Lab/12-machine-learning-models-for-text-classificatio
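    The kind of pipeline described above can be outlined in a few lines. The sketch below is a minimal illustration, assuming scikit-learn and placeholder text/label arrays (none of these names come from the paper's repository): a TF-IDF pipeline whose feature size and classifier hyperparameters are tuned jointly by cross-validated grid search and scored with the F-score, mirroring the optimisation the abstract describes with a single classifier standing in for the 12 models surveyed.

```python
# Minimal sketch, not the authors' code: data loading and parameter grids are assumptions.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

def tune_spam_classifier(texts, labels):
    """Grid-search feature size and regularisation, report the held-out F-score."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=0)

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english", lowercase=True)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    grid = GridSearchCV(
        pipe,
        param_grid={
            "tfidf__max_features": [1000, 5000, 10000],  # feature-size sweep
            "clf__C": [0.1, 1.0, 10.0],                  # classifier hyperparameter
        },
        scoring="f1", cv=5, n_jobs=-1)
    grid.fit(X_train, y_train)
    return grid.best_params_, grid.score(X_test, y_test)
```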

    Surrogate models for seismic and pushover response prediction of steel special moment resisting frames

    For structural engineers, existing surrogate models of buildings present challenges due to inadequate datasets, the exclusion of significant input variables affecting nonlinear building response, and the failure to consider uncertainties associated with input parameters. Moreover, there are no surrogate models that predict both pushover and nonlinear time history analysis (NLTHA) outputs. To overcome these challenges, the present study proposes a novel framework for surrogate modelling of steel structures that considers the crucial structural factors affecting engineering demand parameters (EDPs). The first phase develops a process by which 30,000 random steel special moment resisting frames (SMRFs) for low- to high-rise buildings are generated, accounting for the material and geometrical uncertainties embedded in the design of structures. In the second phase, a surrogate model is developed to predict the seismic EDPs of SMRFs exposed to various earthquake levels, leveraging the results obtained from phase one. Moreover, separate surrogate models are developed to predict the SMRFs' essential pushover parameters. Various machine learning (ML) methods are examined, and the outcomes are presented as user-friendly GUI tools. The findings highlight the substantial influence of pushover parameters, as well as the plastic hinge properties of beams and columns, on the prediction of NLTHA, factors that have been overlooked in prior studies. Moreover, CatBoost was identified as the best-performing ML technique for predicting both pushover and NLTHA parameters across all buildings. This framework enables engineers to estimate building responses without conducting NLTHA, pushover, or even modal analysis, which are computationally intensive.
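    As a rough illustration of the second-phase surrogate, the sketch below assumes CatBoost and a pre-assembled feature matrix (design variables plus pushover parameters) with an EDP target obtained from NLTHA; the feature layout and hyperparameters are assumptions, not the study's published configuration.

```python
# Minimal sketch, assuming CatBoost: fit a surrogate mapping frame features to a
# seismic EDP (e.g. peak inter-storey drift). Data shapes and settings are illustrative.
from catboost import CatBoostRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def train_edp_surrogate(X, y):
    """X: design + pushover features per frame; y: EDP obtained from NLTHA."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = CatBoostRegressor(iterations=2000, depth=8, learning_rate=0.05,
                              loss_function="RMSE", verbose=False)
    model.fit(X_train, y_train, eval_set=(X_test, y_test), early_stopping_rounds=100)
    print("held-out R^2:", r2_score(y_test, model.predict(X_test)))
    return model
```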

    A Practical Guide to Integrating Multimodal Machine Learning and Metabolic Modeling

    Complex, distributed, and dynamic sets of clinical biomedical data are collectively referred to as multimodal clinical data. Machine learning is a useful tool for accommodating the volume and heterogeneity of such diverse data types, and for aiding their interpretation when they are combined with a multi-scale predictive model, as it can be wielded to deconstruct biological complexity and extract relevant outputs. Additionally, genome-scale metabolic models (GSMMs) are one of the main frameworks striving to bridge the gap between genotype and phenotype by incorporating prior biological knowledge into mechanistic models. Consequently, using GSMMs as a foundation for the integration of multi-omic data originating from different domains is a valuable pursuit towards refining predictions. In this chapter, we show how cancer multi-omic data can be analyzed via multimodal machine learning and metabolic modeling. Firstly, we focus on the merits of adopting an integrative, systems-biology-led approach to biomedical data mining. Following this, we propose how constraint-based metabolic models can provide a stable yet adaptable foundation for the integration of multimodal data with machine learning. Finally, we provide a step-by-step tutorial for combining machine learning and GSMMs, which includes: (i) tissue-specific constraint-based modeling; (ii) survival analysis using time-to-event prediction for cancer; and (iii) classification and regression approaches for multimodal machine learning. The code associated with the tutorial can be found at https://github.com/Angione-Lab/Tutorials_Combining_ML_and_GSMM
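    The constraint-based and regression steps of such a tutorial can be outlined briefly. The sketch below assumes COBRApy and scikit-learn; the SBML file, reaction bounds, and feature matrices are placeholders rather than the tutorial's actual inputs. It constrains a GSMM with expression-derived bounds, extracts flux features via flux balance analysis, and concatenates them with omic features for a standard regressor.

```python
# Minimal sketch, not the tutorial's code: FBA-derived flux features combined with
# omic features in a multimodal regression. All inputs are hypothetical placeholders.
import numpy as np
import cobra
from sklearn.ensemble import RandomForestRegressor

def flux_features(sbml_path, expression_bounds):
    """Apply expression-derived bounds to a GSMM, run FBA, return the flux vector."""
    model = cobra.io.read_sbml_model(sbml_path)
    for rxn_id, (lb, ub) in expression_bounds.items():
        rxn = model.reactions.get_by_id(rxn_id)
        rxn.lower_bound, rxn.upper_bound = lb, ub
    return model.optimize().fluxes.values

def fit_multimodal_regressor(flux_matrix, omics_matrix, phenotype):
    """Concatenate flux and omic features (one row per sample) and fit a regressor."""
    X = np.hstack([flux_matrix, omics_matrix])
    return RandomForestRegressor(n_estimators=500, random_state=0).fit(X, phenotype)
```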

    Cancer Markers Selection Using Network-Based Cox Regression: A Methodological and Computational Practice.

    International initiatives such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) are collecting multiple datasets at different genome scales with the aim of identifying novel cancer biomarkers and predicting patient survival. To analyze such data, several statistical methods have been applied, among them Cox regression models. Although these models provide a good statistical framework for analyzing omic data, there is still a lack of studies illustrating the advantages and drawbacks of integrating biological information and selecting groups of biomarkers. In fact, classical Cox regression algorithms focus on the selection of single biomarkers, without taking into account the strong correlation between genes. Even though network-based Cox regression algorithms overcome such drawbacks, these approaches are less widely used within the life science community. In this article, we aim to provide a clear methodological framework on the use of such approaches in order to turn cancer research results into clinical applications. Therefore, we first discuss the rationale and practical usage of three recently proposed network-based Cox regression algorithms (i.e., Net-Cox, AdaLnet, and fastcox). Then, we show how to combine existing biological knowledge and available data with such algorithms to identify networks of cancer biomarkers and to estimate the survival of patients. Finally, we describe in detail a new permutation-based approach to better validate the significance of the selection in terms of cancer gene signatures and pathway/network identification. We illustrate the proposed methodology by means of both simulations and real case studies. Overall, the aim of our work is twofold: firstly, to show how network-based Cox regression models can be used to integrate biological knowledge (e.g., multi-omics data) for the analysis of survival data; and secondly, to provide a clear methodological and computational approach for investigating cancer regulatory networks.
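    The permutation-based validation step can be sketched generically. The example below assumes lifelines and a data frame with "time" and "status" columns plus gene-expression covariates (all names are placeholders); it fits a penalised Cox model and compares its partial log-likelihood against models fitted to permuted survival labels, in the spirit of the permutation approach described above. It is not an implementation of Net-Cox, AdaLnet, or fastcox.

```python
# Minimal sketch, not the article's implementation: a penalised Cox fit plus a
# permutation null for the fitted signature. Column names and settings are assumptions.
import numpy as np
from lifelines import CoxPHFitter

def permutation_signature_test(df, duration_col="time", event_col="status",
                               n_perm=200, penalizer=0.1, seed=0):
    """Return the observed partial log-likelihood and a permutation p-value."""
    rng = np.random.default_rng(seed)
    observed = CoxPHFitter(penalizer=penalizer).fit(
        df, duration_col=duration_col, event_col=event_col).log_likelihood_

    null_scores = []
    for _ in range(n_perm):
        permuted = df.copy()
        idx = rng.permutation(len(df))
        # Break the gene-survival association while keeping the covariate structure.
        permuted[[duration_col, event_col]] = df[[duration_col, event_col]].values[idx]
        null_fit = CoxPHFitter(penalizer=penalizer).fit(
            permuted, duration_col=duration_col, event_col=event_col)
        null_scores.append(null_fit.log_likelihood_)

    p_value = (np.sum(np.array(null_scores) >= observed) + 1) / (n_perm + 1)
    return observed, p_value
```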