How complex is the microarray dataset? A novel data complexity metric for biological high-dimensional microarray data
Data complexity analysis quantifies the hardness of constructing a predictive
model on a given dataset. However, the effectiveness of existing data
complexity measures can be challenged by the existence of irrelevant features
and feature interactions in biological micro-array data. We propose a novel
data complexity measure, depth, that leverages an evolutionary inspired feature
selection algorithm to quantify the complexity of micro-array data. By
examining feature subsets of varying sizes, the approach offers a novel
perspective on data complexity analysis. Unlike traditional metrics, depth is
robust to irrelevant features and effectively captures complexity stemming from
feature interactions. On synthetic micro-array data, depth outperforms existing
methods both in robustness to irrelevant features and in identifying complexity
from feature interactions. Applied to case-control genotype and gene-expression
micro-array datasets, the results reveal that a single feature of
gene-expression data can account for over 90% of the performance of a
multi-feature model, confirming the adequacy of the commonly used
differentially expressed gene (DEG) feature selection method for gene
expression data. Our study also demonstrates that constructing predictive
models is harder for genotype data than for gene expression data. The results in
this paper provide evidence for the use of interpretable machine learning
algorithms on microarray data.
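The claim that one feature can account for most of a multi-feature model's performance can be illustrated with a small comparison. The sketch below is not the paper's depth measure: it assumes a one-feature threshold rule and a nearest-centroid rule as stand-in models, scores both on the training data (an optimistic estimate), and uses a tiny made-up dataset in which only the first feature is informative.

```python
def threshold_accuracy(values, labels):
    """Best accuracy of a one-feature threshold rule, in either direction."""
    best = 0.0
    for t in sorted(set(values)):
        pred = [1 if v >= t else 0 for v in values]
        acc = sum(p == y for p, y in zip(pred, labels)) / len(labels)
        # 1 - acc is the accuracy of the same threshold with labels flipped.
        best = max(best, acc, 1.0 - acc)
    return best


def nearest_centroid_accuracy(rows, labels):
    """Training-set accuracy of a nearest-centroid rule using all features."""
    centroids = {}
    for c in set(labels):
        members = [r for r, y in zip(rows, labels) if y == c]
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    def predict(row):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
        )

    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


# Hypothetical data: 4 controls (0) and 4 cases (1), 3 features each.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
rows = [
    [1.0, 5.0, 0.2], [1.2, 4.0, 0.9], [0.8, 6.0, 0.4], [2.1, 5.5, 0.7],
    [2.0, 5.2, 0.3], [2.5, 4.8, 0.8], [2.8, 5.9, 0.5], [3.0, 4.1, 0.6],
]

single = [threshold_accuracy([r[j] for r in rows], labels) for j in range(3)]
best_idx = max(range(len(single)), key=single.__getitem__)
multi = nearest_centroid_accuracy(rows, labels)
ratio = single[best_idx] / multi
print(f"best single feature: {best_idx}, share of multi-feature accuracy: {ratio:.0%}")
```

A ratio close to 1 suggests that a single well-chosen feature carries most of the predictive signal, which is the situation the abstract reports for gene-expression data.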
A Random Forest model for predicting allosteric and functional sites on proteins
We thank the Scottish Universities Life Sciences Alliance (SULSA) for funding to JBOM and for PB’s PhD studentship under NJW’s supervision. We created a computational method to identify allosteric sites using a machine learning method trained and tested on protein structures containing bound ligand molecules. The Random Forest machine learning approach was adopted to build our three-way predictive model. Based on descriptors collated for each ligand and binding site, the classification model allows us to assign protein cavities as allosteric, regular or orthosteric, and hence to identify allosteric sites. 43 structural descriptors per complex were derived and were used to characterize individual protein-ligand binding sites belonging to the three classes, allosteric, regular and orthosteric. We carried out a separate validation on a further unseen set of protein structures containing the ligand 2-(N-cyclohexylamino) ethane sulfonic acid (CHES). Postprint. Peer reviewed.
Using machine learning to predict smartphone usage
Abstract. This thesis shows the process of creating and analyzing a machine-learning model. It goes over prevalent classification algorithms and their advantages and disadvantages. Furthermore, techniques and metrics used to evaluate the performance of the model are introduced. In the latter part of the thesis, a Random Forest model is implemented. The objective was to predict the participants’ smartphone usage, more specifically the category of the application they had opened. This starts with a pre-processing phase, where relevant information is extracted from the raw data. Multiple variations of the model are built, and the best-performing one achieved 63.37% accuracy. Additionally, the features are scored to provide more insight into the model. The thesis ends with a brief discussion section, which considers the reasons behind the results, some of the model’s deficiencies, and how it could be improved.
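A multi-class accuracy figure such as the one reported above is easier to interpret next to per-class recall and a majority-class baseline. The following is a minimal sketch of those evaluation metrics only; the category names and predictions are invented for illustration and do not come from the thesis, and no Random Forest is trained here.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count of each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))


def recall_per_class(y_true, y_pred):
    """For each class, the fraction of its samples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


def majority_baseline(y_true):
    """Accuracy of always predicting the most frequent class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Hypothetical app categories and model outputs.
y_true = ["social", "social", "games", "news", "games", "social", "news", "games"]
y_pred = ["social", "games", "games", "news", "social", "social", "social", "games"]

acc = accuracy(y_true, y_pred)
recall = recall_per_class(y_true, y_pred)
confusion = confusion_counts(y_true, y_pred)
baseline = majority_baseline(y_true)
print(f"accuracy={acc:.3f} baseline={baseline:.3f} recall={recall}")
```

Comparing accuracy against the baseline shows how much the model adds over guessing the most common category, and the confusion counts show which categories are mistaken for one another.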
Reaction Prediction: The Case of Tweets from Luxury Fashion Brands
Social media platforms represent an essential tool for both consumers and marketers. Meanwhile,
luxury fashion brands play a key role in fashion, one of the most important industries of the
world economy. Despite assumptions to the contrary, social media platforms and luxury fashion
brands do mix, especially in recent times. Consequently, it is worth asking whether it is
possible to predict the reaction a post will generate in the audience of luxury fashion brands.
This new question is the one this thesis intends to answer. To do so, the concept of reaction is
defined through a novel composite index that is created and named Tweet reaction overall score
(TROS), which is one of the solid and relevant contributions this thesis makes. Then, several
predictive models are implemented, based on a wide range of different learning algorithms. The
results show that it is indeed possible to predict the TROS that a post on Twitter will obtain in
the audience of luxury fashion brands the day it is posted.
- …