106 research outputs found

    Systems for AutoML Research

    Get PDF

    Improving generalisation of AutoML systems with dynamic fitness evaluations

    Full text link
    A common problem machine learning developers are faced with is overfitting, that is, fitting a pipeline too closely to the training data that the performance degrades for unseen data. Automated machine learning aims to free (or at least ease) the developer from the burden of pipeline creation, but this overfitting problem can persist. In fact, this can become more of a problem as we look to iteratively optimise the performance of an internal cross-validation (most often \textit{k}-fold). While this internal cross-validation hopes to reduce this overfitting, we show we can still risk overfitting to the particular folds used. In this work, we aim to remedy this problem by introducing dynamic fitness evaluations which approximate repeated \textit{k}-fold cross-validation, at little extra cost over single \textit{k}-fold, and far lower cost than typical repeated \textit{k}-fold. The results show that when time equated, the proposed fitness function results in significant improvement over the current state-of-the-art baseline method which uses an internal single \textit{k}-fold. Furthermore, the proposed extension is very simple to implement on top of existing evolutionary computation methods, and can provide essentially a free boost in generalisation/testing performance.Comment: 19 pages, 4 figure

    Natural Language Processing in Electronic Health Records in Relation to Healthcare Decision-making: A Systematic Review

    Full text link
    Background: Natural Language Processing (NLP) is widely used to extract clinical insights from Electronic Health Records (EHRs). However, the lack of annotated data, automated tools, and other challenges hinder the full utilisation of NLP for EHRs. Various Machine Learning (ML), Deep Learning (DL) and NLP techniques are studied and compared to understand the limitations and opportunities in this space comprehensively. Methodology: After screening 261 articles from 11 databases, we included 127 papers for full-text review covering seven categories of articles: 1) medical note classification, 2) clinical entity recognition, 3) text summarisation, 4) deep learning (DL) and transfer learning architecture, 5) information extraction, 6) Medical language translation and 7) other NLP applications. This study follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Result and Discussion: EHR was the most commonly used data type among the selected articles, and the datasets were primarily unstructured. Various ML and DL methods were used, with prediction or classification being the most common application of ML or DL. The most common use cases were: the International Classification of Diseases, Ninth Revision (ICD-9) classification, clinical note analysis, and named entity recognition (NER) for clinical descriptions and research on psychiatric disorders. Conclusion: We find that the adopted ML models were not adequately assessed. In addition, the data imbalance problem is quite important, yet we must find techniques to address this underlining problem. Future studies should address key limitations in studies, primarily identifying Lupus Nephritis, Suicide Attempts, perinatal self-harmed and ICD-9 classification

    Enhancing extremist data classification through textual analysis

    Get PDF
    The high volume of extremist materials on the Internet has created the need for intelligence gathering via the Web and real-time monitoring of potential websites for evidence of extremist activities. However, the manual classification for such contents is practically difficult and time-consuming. In response to this challenge, the work reported here developed several classification frameworks. Each framework provides a basis of text representation before being fed into machine learning algorithm. The basis of text representation are Sentiment-rule, Posit-textual analysis with word-level features, and an extension of Posit analysis, known as Extended-Posit, which adopts character-level as well as word-level data. Identifying some gaps in the aforementioned techniques created avenues for further improvements, most especially in handling larger datasets with better classification accuracy. Consequently, a novel basis of text representation known as the Composite-based method was developed. This is a computational framework that explores the combination of both sentiment and syntactic features of textual contents of a Web page. Subsequently, these techniques are applied on a dataset that had been subjected to a manual classification process, thereafter fed into machine learning algorithm. This is to generate a measure of how well each page can be classified into their appropriate classes. The classifiers considered are both Neural Network (RNN and MLP) and Machine Learning classifiers (such as J48, Random Forest and KNN). In addition, features selection and model optimisation were evaluated to know the cost when creating machine learning model. However, considering all the result obtained from each of the framework, the results indicated that composite features are preferable to solely syntactic or sentiment features which offer improved classification accuracy when used with machine learning algorithms. Furthermore, the extension of Posit analysis to include both word and character-level data out-performed word-level feature alone when applied on the assembled textual data. Moreover, Random Forest classifier outperformed other classifiers explored. Taking cost into account, feature selection improves classification accuracy and save time better than hyperparameter turning (model optimisation).The high volume of extremist materials on the Internet has created the need for intelligence gathering via the Web and real-time monitoring of potential websites for evidence of extremist activities. However, the manual classification for such contents is practically difficult and time-consuming. In response to this challenge, the work reported here developed several classification frameworks. Each framework provides a basis of text representation before being fed into machine learning algorithm. The basis of text representation are Sentiment-rule, Posit-textual analysis with word-level features, and an extension of Posit analysis, known as Extended-Posit, which adopts character-level as well as word-level data. Identifying some gaps in the aforementioned techniques created avenues for further improvements, most especially in handling larger datasets with better classification accuracy. Consequently, a novel basis of text representation known as the Composite-based method was developed. This is a computational framework that explores the combination of both sentiment and syntactic features of textual contents of a Web page. Subsequently, these techniques are applied on a dataset that had been subjected to a manual classification process, thereafter fed into machine learning algorithm. This is to generate a measure of how well each page can be classified into their appropriate classes. The classifiers considered are both Neural Network (RNN and MLP) and Machine Learning classifiers (such as J48, Random Forest and KNN). In addition, features selection and model optimisation were evaluated to know the cost when creating machine learning model. However, considering all the result obtained from each of the framework, the results indicated that composite features are preferable to solely syntactic or sentiment features which offer improved classification accuracy when used with machine learning algorithms. Furthermore, the extension of Posit analysis to include both word and character-level data out-performed word-level feature alone when applied on the assembled textual data. Moreover, Random Forest classifier outperformed other classifiers explored. Taking cost into account, feature selection improves classification accuracy and save time better than hyperparameter turning (model optimisation)

    Automatic Machine Learning for Insurance: H2O Experiment

    Full text link
    Treballs Finals del Màster de Ciències Actuarials i Financeres, Facultat d'Economia i Empresa, Universitat de Barcelona, Curs: 2020-2021, Tutor: Dr. Salvador Torra PorrasThis thesis provides an introduction of machine learning (ML), shows the implication that ML has on the insurance sector and takes a special consideration to the H2O ensemble modelling approach for the insurance claim fraud detection binary classification. The aim of this thesis is to study the H2O Automatic ML potential and compare the results generated with traditional algorithms such as lineal perceptron, Logistic Regression, multilayer perceptron, support vector machine and decision tree. Using H2O web interface or R programming, not only the most efficient ML algorithms are obtained with no effort but also provide better modelling metrics than traditional methods

    Beamforming analysis using Random Forest classifier

    Get PDF
    Abstract. Wireless communication has a long history that has changed shape throughout the centuries, from smoke signals to electromagnetic radiation. Data transmission evolution has made worldwide communication possible and has contributed to globalisation. Today, information can be shared in real-time—for example, to the other side of the world. Wireless communication has evolved to the point that real-time plays a vital role, and data loss should not occur. For efficient wireless data transmission, a beamforming technique has been developed. This is a signal processing technique used in antennas for directional signal transmission or reception. Beamforming includes numerous variations, making the analysis of beamforming challenging. Due to its complex nature, beamforming is attempted to be understood more simply at a higher level, and for that reason, elements are listed that enable the analysis to check whether beamforming succeeded on the radio. Machine learning is a new trend in different aspects of technology. Problems are aimed to be solved and predicted more efficiently by using suitable machine learning methods. Machine learning enables more precise analysis and error tracking, which are utilised in combination to minimise errors. Furthermore, machine learning has been integrated into various automation systems. This thesis concentrates on analysing the success of beamforming at a high level and aims to automate testing and provide feedback to radio architects who utilise beamforming. For a high-level analysis, a few criteria define the success of beamforming on the radio. In this thesis, a machine learning pipeline is presented from prepossessing to the final model, and we demonstrate the promising results we have been able to achieve using the random forest classifier. Such promising results make it possible to continue with the beamforming classification and serve as motivation to improve and gather detailed feedback for the end-user

    An automated machine learning approach to predict brain age from cortical anatomical measures

    Get PDF
    The use of machine learning (ML) algorithms has significantly increased in neuroscience. However, from the vast extent of possible ML algorithms, which one is the optimal model to predict the target variable? What are the hyperparameters for such a model? Given the plethora of possible answers to these questions, in the last years, automated ML (autoML) has been gaining attention. Here, we apply an autoML library called Tree-based Pipeline Optimisation Tool (TPOT) which uses a tree-based representation of ML pipelines and conducts a genetic programming-based approach to find the model and its hyperparameters that more closely predicts the subject's true age. To explore autoML and evaluate its efficacy within neuroimaging data sets, we chose a problem that has been the focus of previous extensive study: brain age prediction. Without any prior knowledge, TPOT was able to scan through the model space and create pipelines that outperformed the state-of-the-art accuracy for Freesurfer-based models using only thickness and volume information for anatomical structure. In particular, we compared the performance of TPOT (mean absolute error [MAE]: 4.612 ± .124 years) and a relevance vector regression (MAE 5.474 ± .140 years). TPOT also suggested interesting combinations of models that do not match the current most used models for brain prediction but generalise well to unseen data. AutoML showed promising results as a data-driven approach to find optimal models for neuroimaging applications

    MEG: Multi-objective Ensemble Generation for Software Defect Prediction

    Get PDF
    Background: Defect Prediction research aims at assisting software engineers in the early identification of software defect during the development process. A variety of automated approaches, ranging from traditional classification models to more sophisticated learning approaches, have been explored to this end. Among these, recent studies have proposed the use of ensemble prediction models (i.e., aggregation of multiple base classifiers) to build more robust defect prediction models. / Aims: In this paper, we introduce a novel approach based on multi-objective evolutionary search to automatically generate defect prediction ensembles. Our proposal is not only novel with respect to the more general area of evolutionary generation of ensembles, but it also advances the state-of-the-art in the use of ensemble in defect prediction. / Method: We assess the effectiveness of our approach, dubbed as Multi-objective Ensemble Generation (MEG), by empirically benchmarking it with respect to the most related proposals we found in the literature on defect prediction ensembles and on multi-objective evolutionary ensembles (which, to the best of our knowledge, had never been previously applied to tackle defect prediction). / Result: Our results show that MEG is able to generate ensembles which produce similar or more accurate predictions than those achieved by all the other approaches considered in 73% of the cases (with favourable large effect sizes in 80% of them). / Conclusions: MEG is not only able to generate ensembles that yield more accurate defect predictions with respect to the benchmarks considered, but it also does it automatically, thus relieving the engineers from the burden of manual design and experimentation
    • …
    corecore