106 research outputs found
Improving generalisation of AutoML systems with dynamic fitness evaluations
A common problem machine learning developers face is overfitting, that is,
fitting a pipeline so closely to the training data that performance degrades
on unseen data. Automated machine learning aims to free (or at least ease) the
developer from the burden of pipeline creation, but the overfitting problem can
persist. In fact, it can become more of a problem as we iteratively optimise
the performance of an internal cross-validation (most often k-fold). While this
internal cross-validation is intended to reduce overfitting, we show that we
still risk overfitting to the particular folds used. In this work, we aim to
remedy this problem by introducing dynamic fitness evaluations which
approximate repeated k-fold cross-validation, at little extra cost over single
k-fold, and far lower cost than typical repeated k-fold. The results show that,
when time equated, the proposed fitness function results in significant
improvement over the current state-of-the-art baseline method, which uses an
internal single k-fold. Furthermore, the proposed extension is very simple to
implement on top of existing evolutionary computation methods, and can provide
essentially a free boost in generalisation/testing performance.
Comment: 19 pages, 4 figures
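The abstract does not spell out how the dynamic fitness evaluation works, so the following is only a minimal sketch of one plausible reading: each generation of the evolutionary search scores candidates with a single k-fold split, but the shuffle behind that split changes with the generation number, so a candidate that survives several generations has been judged on several different splits. The names `kfold_indices`, `cv_score`, `dynamic_fitness`, `fit_mean` and `neg_mae` are hypothetical and not taken from the paper.

```python
import random
from statistics import mean


def kfold_indices(n_samples, k, seed):
    """Split range(n_samples) into k shuffled folds; the seed fixes the shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cv_score(fit, score, X, y, folds):
    """Mean validation score of one candidate over the given folds."""
    fold_scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        fold_scores.append(
            score(model, [X[i] for i in held_out], [y[i] for i in held_out])
        )
    return mean(fold_scores)


def dynamic_fitness(fit, score, X, y, generation, k=5):
    """Assumed scheme: single k-fold CV whose shuffle is seeded by the generation."""
    folds = kfold_indices(len(X), k, seed=generation)
    return cv_score(fit, score, X, y, folds)


# Toy stand-ins for a real pipeline (illustrative only).
def fit_mean(X_train, y_train):
    """'Train' a model that always predicts the mean training target."""
    return mean(y_train)


def neg_mae(model, X_val, y_val):
    """Negative mean absolute error, so that higher is better."""
    return -mean(abs(target - model) for target in y_val)


if __name__ == "__main__":
    X = list(range(20))
    y = [2.0 * x + 1.0 for x in X]
    for generation in range(3):
        print(generation, dynamic_fitness(fit_mean, neg_mae, X, y, generation))
```

Under this assumption the per-generation cost stays that of a single k-fold evaluation, which is consistent with the abstract's claim of little extra cost; whether the paper seeds the folds this way is not confirmed by the text.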
Natural Language Processing in Electronic Health Records in Relation to Healthcare Decision-making: A Systematic Review
Background: Natural Language Processing (NLP) is widely used to extract
clinical insights from Electronic Health Records (EHRs). However, the lack of
annotated data, automated tools, and other challenges hinder the full
utilisation of NLP for EHRs. Various Machine Learning (ML), Deep Learning (DL)
and NLP techniques are studied and compared to understand the limitations and
opportunities in this space comprehensively.
Methodology: After screening 261 articles from 11 databases, we included 127
papers for full-text review covering seven categories of articles: 1) medical
note classification, 2) clinical entity recognition, 3) text summarisation, 4)
deep learning (DL) and transfer learning architecture, 5) information
extraction, 6) medical language translation and 7) other NLP applications. This
study follows the Preferred Reporting Items for Systematic Reviews and
Meta-Analyses (PRISMA) guidelines.
Result and Discussion: EHR was the most commonly used data type among the
selected articles, and the datasets were primarily unstructured. Various ML and
DL methods were used, with prediction or classification being the most common
application of ML or DL. The most common use cases were: the International
Classification of Diseases, Ninth Revision (ICD-9) classification, clinical
note analysis, and named entity recognition (NER) for clinical descriptions and
research on psychiatric disorders.
Conclusion: We find that the adopted ML models were not adequately assessed.
In addition, the data imbalance problem is quite important, yet techniques to
address this underlying problem still need to be found. Future studies should
address key limitations in studies, primarily identifying Lupus Nephritis,
Suicide Attempts, perinatal self-harm and ICD-9 classification.
Enhancing extremist data classification through textual analysis
The high volume of extremist materials on the Internet has created the need for intelligence gathering via the Web and real-time monitoring of potential websites for evidence of extremist activities. However, manual classification of such content is difficult and time-consuming in practice. In response to this challenge, the work reported here developed several classification frameworks. Each framework provides a basis of text representation before the text is fed into a machine learning algorithm. The bases of text representation are Sentiment-rule, Posit-textual analysis with word-level features, and an extension of Posit analysis, known as Extended-Posit, which adopts character-level as well as word-level data. Identifying some gaps in the aforementioned techniques created avenues for further improvements, most especially in handling larger datasets with better classification accuracy.
Consequently, a novel basis of text representation known as the Composite-based method was developed. This is a computational framework that explores the combination of both sentiment and syntactic features of the textual contents of a Web page. Subsequently, these techniques are applied to a manually classified dataset, which is then fed into a machine learning algorithm. This generates a measure of how well each page can be classified into its appropriate class. The classifiers considered are both Neural Network (RNN and MLP) and Machine Learning classifiers (such as J48, Random Forest and KNN). In addition, feature selection and model optimisation were evaluated to gauge the cost of creating a machine learning model.
Considering the results obtained from each framework, composite features are preferable to solely syntactic or sentiment features, offering improved classification accuracy when used with machine learning algorithms. Furthermore, the extension of Posit analysis to include both word-level and character-level data outperformed word-level features alone when applied to the assembled textual data. Moreover, the Random Forest classifier outperformed the other classifiers explored. Taking cost into account, feature selection improves classification accuracy and saves more time than hyperparameter tuning (model optimisation).
Automatic Machine Learning for Insurance: H2O Experiment
Master's Final Projects in Actuarial and Financial Sciences, Faculty of Economics and Business, Universitat de Barcelona, Academic year: 2020-2021, Supervisor: Dr. Salvador Torra Porras
This thesis provides an introduction to machine learning (ML), shows the implications that ML has for the insurance sector and gives special consideration to the H2O ensemble modelling approach for binary classification in insurance claim fraud detection. The aim of this thesis is to study the potential of H2O Automatic ML and compare its results with those of traditional algorithms such as the linear perceptron, logistic regression, multilayer perceptron, support vector machine and decision tree. Using the H2O web interface or R programming, the most efficient ML algorithms are not only obtained with no effort but also provide better modelling metrics than traditional methods.
Beamforming analysis using Random Forest classifier
Abstract. Wireless communication has a long history that has changed shape throughout the centuries, from smoke signals to electromagnetic radiation. Data transmission evolution has made worldwide communication possible and has contributed to globalisation. Today, information can be shared in real-time—for example, to the other side of the world. Wireless communication has evolved to the point that real-time plays a vital role, and data loss should not occur.
For efficient wireless data transmission, a beamforming technique has been developed. This is a signal processing technique used in antennas for directional signal transmission or reception. Beamforming includes numerous variations, making its analysis challenging. Because of this complexity, this work approaches beamforming more simply, at a higher level, and lists the elements that enable the analysis to check whether beamforming succeeded on the radio.
Machine learning is a new trend in different aspects of technology. Suitable machine learning methods aim to solve and predict problems more efficiently. Machine learning enables more precise analysis and error tracking, which are used in combination to minimise errors. Furthermore, machine learning has been integrated into various automation systems. This thesis concentrates on analysing the success of beamforming at a high level and aims to automate testing and provide feedback to radio architects who utilise beamforming. For a high-level analysis, a few criteria define the success of beamforming on the radio.
In this thesis, a machine learning pipeline is presented from preprocessing to the final model, and we demonstrate the promising results we have been able to achieve using the random forest classifier. Such promising results make it possible to continue with the beamforming classification and serve as motivation to improve and gather detailed feedback for the end-user.
An automated machine learning approach to predict brain age from cortical anatomical measures
The use of machine learning (ML) algorithms has significantly increased in neuroscience. However, from the vast extent of possible ML algorithms, which one is the optimal model to predict the target variable? What are the hyperparameters for such a model? Given the plethora of possible answers to these questions, in recent years, automated ML (autoML) has been gaining attention. Here, we apply an autoML library called Tree-based Pipeline Optimisation Tool (TPOT), which uses a tree-based representation of ML pipelines and conducts a genetic programming-based search to find the model and hyperparameters that most closely predict the subject's true age. To explore autoML and evaluate its efficacy within neuroimaging data sets, we chose a problem that has been the focus of previous extensive study: brain age prediction. Without any prior knowledge, TPOT was able to scan through the model space and create pipelines that outperformed the state-of-the-art accuracy for Freesurfer-based models using only thickness and volume information for anatomical structure. In particular, we compared the performance of TPOT (mean absolute error [MAE]: 4.612 ± .124 years) and a relevance vector regression (MAE 5.474 ± .140 years). TPOT also suggested interesting combinations of models that do not match the current most used models for brain age prediction but generalise well to unseen data. AutoML showed promising results as a data-driven approach to find optimal models for neuroimaging applications.
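For readers unfamiliar with the metric quoted above, the sketch below shows how a mean absolute error in years can be computed and then summarised as "mean ± spread" over several evaluation runs. It assumes the ± value is a standard deviation across repeated runs, which the abstract does not state; the ages in the demo are invented for illustration and are not data from the study.

```python
from statistics import mean, stdev


def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute gap, in years, between true and predicted ages."""
    if len(true_ages) != len(predicted_ages):
        raise ValueError("true_ages and predicted_ages must have equal length")
    return mean(abs(t - p) for t, p in zip(true_ages, predicted_ages))


def summarise_runs(mae_per_run):
    """Return (mean, sample standard deviation) of MAE over repeated runs."""
    return mean(mae_per_run), stdev(mae_per_run)


if __name__ == "__main__":
    # Invented toy values: three evaluation runs on the same four subjects.
    true_ages = [23.0, 41.0, 58.0, 70.0]
    runs = [
        [25.0, 38.0, 61.0, 66.0],
        [21.0, 45.0, 55.0, 74.0],
        [24.0, 40.0, 63.0, 69.0],
    ]
    maes = [mean_absolute_error(true_ages, predicted) for predicted in runs]
    average, spread = summarise_runs(maes)
    print(f"MAE: {average:.3f} ± {spread:.3f} years")
```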
MEG: Multi-objective Ensemble Generation for Software Defect Prediction
Background: Defect Prediction research aims at assisting software engineers in the early identification of software defects during the development process. A variety of automated approaches, ranging from traditional classification models to more sophisticated learning approaches, have been explored to this end. Among these, recent studies have proposed the use of ensemble prediction models (i.e., aggregation of multiple base classifiers) to build more robust defect prediction models.
Aims: In this paper, we introduce a novel approach based on multi-objective evolutionary search to automatically generate defect prediction ensembles. Our proposal is not only novel with respect to the more general area of evolutionary generation of ensembles, but it also advances the state-of-the-art in the use of ensembles in defect prediction.
Method: We assess the effectiveness of our approach, dubbed Multi-objective Ensemble Generation (MEG), by empirically benchmarking it with respect to the most related proposals we found in the literature on defect prediction ensembles and on multi-objective evolutionary ensembles (which, to the best of our knowledge, had never been previously applied to tackle defect prediction).
Result: Our results show that MEG is able to generate ensembles which produce similar or more accurate predictions than those achieved by all the other approaches considered in 73% of the cases (with favourable large effect sizes in 80% of them).
Conclusions: MEG is not only able to generate ensembles that yield more accurate defect predictions with respect to the benchmarks considered, but it also does so automatically, thus relieving the engineers from the burden of manual design and experimentation.
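A multi-objective search such as the one described above typically keeps the candidates that no other candidate beats on every objective at once (the Pareto front). The sketch below illustrates only that selection step. The abstract does not name MEG's objectives, so the two used in the demo (a predictive score and an ensemble diversity score, both maximised) are purely hypothetical, as are the names `Candidate`, `dominates` and `pareto_front`.

```python
from typing import NamedTuple, Tuple


class Candidate(NamedTuple):
    """A candidate ensemble with objective values that are all to be maximised."""
    name: str
    objectives: Tuple[float, ...]


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [
        c
        for c in candidates
        if not any(dominates(o.objectives, c.objectives) for o in candidates)
    ]


if __name__ == "__main__":
    # Hypothetical objective values: (predictive score, ensemble diversity).
    population = [
        Candidate("ensemble_a", (0.8, 0.2)),
        Candidate("ensemble_b", (0.7, 0.5)),
        Candidate("ensemble_c", (0.6, 0.1)),
    ]
    for candidate in pareto_front(population):
        print(candidate.name, candidate.objectives)
```

In a full evolutionary loop this filter would be applied after each round of variation and evaluation; that loop is omitted here because the abstract gives no detail on it.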