    Augmented nomogram with dependent feature pairs

    Master's thesis (Master of Science)

    How complex is the microarray dataset? A novel data complexity metric for biological high-dimensional microarray data

    Data complexity analysis quantifies how hard it is to construct a predictive model on a given dataset. However, existing data complexity measures can be undermined by the irrelevant features and feature interactions found in biological microarray data. We propose a novel data complexity measure, depth, that leverages an evolutionarily inspired feature selection algorithm to quantify the complexity of microarray data. By examining feature subsets of varying sizes, the approach offers a novel perspective on data complexity analysis. Unlike traditional metrics, depth is robust to irrelevant features and effectively captures complexity stemming from feature interactions. On synthetic microarray data, depth outperforms existing methods both in robustness to irrelevant features and in identifying complexity that arises from feature interactions. Applied to case-control genotype and gene-expression microarray datasets, the results reveal that a single feature of gene-expression data can account for over 90% of the performance of a multi-feature model, confirming the adequacy of the commonly used differentially expressed gene (DEG) feature selection method for gene-expression data. Our study also demonstrates that constructing predictive models is harder for genotype data than for gene-expression data. The results in this paper provide evidence for the use of interpretable machine learning algorithms on microarray data.
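
    The idea of comparing feature subsets of different sizes can be illustrated with a small sketch. This is not the depth measure from the abstract, whose definition is not given here. It assumes an exhaustive search over subsets, a nearest-centroid classifier, training accuracy as the score, and a made-up toy dataset; the function names `centroid_accuracy` and `best_by_subset_size` are hypothetical.

```python
from itertools import combinations


def centroid_accuracy(rows, labels, features):
    """Training accuracy of a nearest-centroid classifier using only `features`."""
    classes = sorted(set(labels))
    centroids = {}
    for c in classes:
        members = [r for r, y in zip(rows, labels) if y == c]
        centroids[c] = [sum(r[f] for r in members) / len(members) for f in features]

    correct = 0
    for r, y in zip(rows, labels):
        def sq_dist(c):
            return sum((r[f] - m) ** 2 for f, m in zip(features, centroids[c]))
        correct += min(classes, key=sq_dist) == y
    return correct / len(rows)


def best_by_subset_size(rows, labels, max_size):
    """Best score over all feature subsets of each size 1..max_size (exhaustive)."""
    n_features = len(rows[0])
    return {
        k: max(centroid_accuracy(rows, labels, s)
               for s in combinations(range(n_features), k))
        for k in range(1, max_size + 1)
    }


# Toy data (invented): feature 0 separates the classes, feature 1 is noise.
rows = [[0.1, 0.50], [0.2, 0.10], [0.3, 0.40],
        [0.9, 0.15], [0.8, 0.45], [1.0, 0.30]]
labels = [0, 0, 0, 1, 1, 1]

best = best_by_subset_size(rows, labels, max_size=2)
# Share of the two-feature score already reached by the best single feature.
single_feature_share = best[1] / best[2]
print(best, single_feature_share)
```

    On this toy data the best single feature already matches the two-feature score, so the share is 1.0. Exhaustive search is only feasible for a handful of features, which is presumably why the abstract relies on a feature selection algorithm instead.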

    A Random Forest model for predicting allosteric and functional sites on proteins

    We thank the Scottish Universities Life Sciences Alliance (SULSA) for funding to JBOM and for PB’s PhD studentship under NJW’s supervision. We created a computational method to identify allosteric sites using a machine learning method trained and tested on protein structures containing bound ligand molecules. We adopted the Random Forest machine learning approach to build a three-way predictive model. Based on descriptors collated for each ligand and binding site, the classification model assigns protein cavities as allosteric, regular or orthosteric, and hence identifies allosteric sites. We derived 43 structural descriptors per complex and used them to characterize individual protein-ligand binding sites belonging to the three classes. We carried out a separate validation on a further unseen set of protein structures containing the ligand 2-(N-cyclohexylamino) ethane sulfonic acid (CHES).

    Using machine learning to predict smartphone usage

    This thesis shows the process of creating and analyzing a machine-learning model. It reviews prevalent classification algorithms and their advantages and disadvantages, and introduces the techniques and metrics used to evaluate model performance. In the latter part of the thesis, a Random Forest model is implemented. The objective is to predict the participants’ smartphone usage, more specifically the category of the application they opened. The work starts with a pre-processing phase, in which relevant information is extracted from the raw data. Multiple variations of the model are built, and the best-performing model achieves 63.37% accuracy. Additionally, the features are scored to provide more insight into the model. The thesis ends with a brief discussion of the reasons behind the results, some of the model’s deficiencies, and how it could be improved.
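
    As a reminder of what an accuracy figure such as the one above measures, here is a minimal sketch of accuracy and a confusion count for a multi-class prediction task. The category names and predictions are invented for illustration and are not taken from the thesis; `accuracy` and `confusion_counts` are hypothetical helper names.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count of each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))


# Invented example: app categories opened vs. categories predicted.
y_true = ["social", "games", "social", "tools", "games"]
y_pred = ["social", "social", "social", "tools", "games"]

acc = accuracy(y_true, y_pred)            # 4 of 5 correct
conf = confusion_counts(y_true, y_pred)
print(f"accuracy = {acc:.2%}")
print(conf[("games", "social")], "games session(s) predicted as social")
```

    The confusion counts show which categories are mistaken for which, something a single accuracy number hides when some categories are much more frequent than others.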

    Reaction Prediction: The Case of Tweets from Luxury Fashion Brands

    Social media platforms are an essential tool for both consumers and marketers. Meanwhile, luxury fashion brands play a key role in fashion, one of the most important industries in the world economy. Despite assumptions to the contrary, social media platforms and luxury fashion brands do mix, especially in recent times. Consequently, it is worth asking whether it is possible to predict the reaction a post will generate in the audience of luxury fashion brands. This thesis sets out to answer that new question. To do so, it defines the concept of reaction through a novel composite index, the Tweet reaction overall score (TROS), which is one of the contributions of this thesis. Several predictive models are then implemented, based on a wide range of learning algorithms. The results show that it is indeed possible to predict the TROS that a post on Twitter will obtain in the audience of luxury fashion brands on the day it is posted.