8 research outputs found

    Variable importance for sustaining macrophyte presence via random forests: data imputation and model settings

    Data sets plagued by missing data and performance-affecting model parameters are recurrent issues in data mining. Using random forests, the influence of data reduction, outlier removal, correlated-variable removal and missing-data imputation technique on the performance of habitat suitability models for three macrophytes (Lemna minor, Spirodela polyrhiza and Nuphar lutea) was assessed. Higher performances (Cohen’s kappa values around 0.2–0.3) were obtained with a high degree of data reduction, without outlier or correlated-variable removal, and with imputation of the median value. Moreover, the influence of model parameter settings on the performance of random forests trained on this data set was investigated along a range of numbers of individual trees (ntree), while the number of variables considered at each split (mtry) was fixed at two. Altering the number of individual trees did not have a uniform effect on model performance, but clearly changed the required computation time. Combining both criteria yielded an ntree value of 100, with the overall effect of ntree on performance being relatively limited. Temperature, pH and conductivity were retained as variables and were shown to affect the likelihood of L. minor, S. polyrhiza and N. lutea being present. Generally, high likelihood values were obtained when temperature was high (>20 °C), conductivity was moderately low (50–200 mS m⁻¹) or pH was intermediate (6.9–8), which also highlights that a multivariate management approach for supporting macrophyte presence remains recommended. Yet, as our conclusions are based on a single freshwater data set, they should be further tested on other data sets.
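    As an illustration of the workflow described above, the following minimal sketch applies median imputation of missing values, trains a random forest with ntree = 100 and mtry = 2, and scores it with Cohen's kappa. The file name, column names and presence label are assumptions made for the sake of a runnable example, not the study's actual data.

        # Minimal sketch: median imputation + random forest (ntree = 100, mtry = 2),
        # evaluated with Cohen's kappa. Data layout is hypothetical.
        import pandas as pd
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.impute import SimpleImputer
        from sklearn.metrics import cohen_kappa_score
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import Pipeline

        df = pd.read_csv("macrophyte_survey.csv")           # assumed file with field measurements
        X = df[["temperature", "pH", "conductivity"]]        # assumed abiotic variables
        y = df["lemna_minor_present"]                        # assumed 0/1 presence label

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

        model = Pipeline([
            ("impute", SimpleImputer(strategy="median")),    # median imputation of missing data
            ("rf", RandomForestClassifier(n_estimators=100,  # ntree = 100
                                          max_features=2,    # mtry = 2
                                          random_state=42)),
        ])
        model.fit(X_train, y_train)
        print("Cohen's kappa:", cohen_kappa_score(y_test, model.predict(X_test)))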

    Improving official statistics in emerging markets using machine learning and mobile phone data

    Mobile phones are one of the fastest growing technologies in the developing world, with global penetration rates reaching 90%. Mobile phone data, also called call detail records (CDR), are generated every time phones are used and are recorded by carriers at scale. CDR have generated groundbreaking insights in public health, official statistics, and logistics. However, the fact that most phones in developing countries are prepaid means that the data lack key information about the user, including gender and other demographic variables. This precludes numerous uses of these data in social science and development economics research. It also severely limits the development of humanitarian applications, such as using mobile phone data to target aid towards the most vulnerable groups during crises. We developed a framework to extract more than 1,400 features from standard mobile phone data and used them to predict useful individual characteristics and group estimates. Here we present a systematic cross-country study of the applicability of machine learning for dataset augmentation at low cost. We validate our framework by showing how it can be used to reliably predict gender and other information for more than half a million people in two countries. We show that standard machine learning algorithms trained on only 10,000 users are sufficient to predict an individual’s gender with an accuracy ranging from 74.3 to 88.4% in a developed country and from 74.5 to 79.7% in a developing country using only metadata. This is significantly higher than previous approaches and, once calibrated, gives highly accurate estimates of gender balance in groups. Performance suffers only marginally if we reduce the training size to 5,000, but decreases significantly with smaller training sets. Finally, we show, using factor analysis, that our indicators capture a large range of behavioral traits and that the framework can be used to predict other indicators of vulnerability such as age or socio-economic status. Mobile phone data have great potential for good, and our framework allows these data to be augmented with vulnerability and other information at a fraction of the cost.
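    The two-step pipeline described above can be sketched as follows: per-user behavioural features are aggregated from raw CDR, then a standard classifier is trained on a small labelled subset to predict gender. Column and file names are assumptions, and the handful of features below stands in for the more than 1,400 used in the study.

        # Hypothetical CDR layout: user_id, timestamp, duration, peer_id.
        import pandas as pd
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.metrics import accuracy_score
        from sklearn.model_selection import train_test_split

        cdr = pd.read_csv("cdr.csv")                                       # assumed raw call records
        cdr["hour"] = pd.to_datetime(cdr["timestamp"]).dt.hour

        features = cdr.groupby("user_id").agg(
            n_calls=("peer_id", "size"),                                   # call volume
            n_contacts=("peer_id", "nunique"),                             # size of contact network
            mean_duration=("duration", "mean"),                            # average call length
            night_share=("hour", lambda h: ((h < 6) | (h >= 22)).mean()),  # nocturnal activity
        )

        labels = pd.read_csv("survey_gender.csv", index_col="user_id")["gender"]  # small labelled subset
        data = features.join(labels, how="inner")

        X_train, X_test, y_train, y_test = train_test_split(
            data.drop(columns="gender"), data["gender"], train_size=10_000, random_state=0)

        clf = GradientBoostingClassifier().fit(X_train, y_train)
        print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))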

    Towards Adversarial Malware Detection: Lessons Learned from PDF-based Attacks

    Malware still constitutes a major threat in the cybersecurity landscape, partly due to the widespread use of infection vectors such as documents. These infection vectors hide embedded malicious code from victim users, facilitating the use of social engineering techniques to infect their machines. Research has shown that machine-learning algorithms provide effective detection mechanisms against such threats, but an arms race in adversarial settings has recently challenged such systems. In this work, we focus on malware embedded in PDF files as a representative case of this arms race. We start by providing a comprehensive taxonomy of the different approaches used to generate PDF malware and of the corresponding learning-based detection systems. We then categorize threats specifically targeted against learning-based PDF malware detectors, using a well-established framework in the field of adversarial machine learning. This framework allows us to categorize known vulnerabilities of learning-based PDF malware detectors and to identify novel attacks that may threaten such systems, along with the potential defense mechanisms that can mitigate the impact of such threats. We conclude the paper by discussing how these findings highlight promising research directions towards tackling the more general challenge of designing robust malware detectors in adversarial settings.
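    To make the threat model concrete, the hypothetical sketch below trains a detector on binary PDF structural features and then simulates a feature-addition attack, in which an attacker can only add objects to a malicious file and therefore only flip absent features to present. The feature representation and data files are assumptions, and a real attack would choose the added features adversarially (for example, guided by the classifier) rather than at random.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        # Assumed dataset: rows = PDFs, columns = binary structural features, label 1 = malicious.
        X, y = np.load("pdf_features.npy"), np.load("pdf_labels.npy")
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

        malicious = X_test[y_test == 1]
        print("detection rate (clean):  ", clf.predict(malicious).mean())

        # Feature-addition attack: flip a random subset of absent features to present.
        rng = np.random.default_rng(0)
        evasive = malicious.copy()
        flip = (evasive == 0) & (rng.random(evasive.shape) < 0.3)
        evasive[flip] = 1
        print("detection rate (evasive):", clf.predict(evasive).mean())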

    Forecasting monthly airline passenger numbers with small datasets using feature engineering and a modified principal component analysis

    In this study, a machine learning approach based on time series models and different feature engineering, feature extraction, and feature derivation techniques is proposed to improve air passenger forecasting. Different types of datasets were created to extract new features from the core data. An experiment was undertaken with artificial neural networks to test the performance of neurons in the hidden layer, to optimise the dimensions of all layers and to obtain an optimal choice of connection weights, so that the nonlinear optimisation problem could be solved directly. A method of tuning deep learning models using H2O (a feature-rich, open-source machine learning platform known for its R and Spark integration and its ease of use) is also proposed, where the trained network model is built from samples of selected features from the dataset in order to ensure diversity of the samples and to improve training. A successful application of deep learning requires setting numerous parameters in order to achieve greater model accuracy, and the number of hidden layers and the number of neurons in each layer are key parameters of such a network. Grid search and random hyper-parameter search approaches aid in setting these important parameters. Moreover, a new ensemble strategy is suggested that shows potential to optimise parameter settings and hence save computational resources throughout the model tuning process. The main objective, besides improving the performance metric, is to obtain a distribution on some hold-out datasets that resembles the original distribution of the training data. Particular attention is given to creating a modified version of Principal Component Analysis (PCA) that uses a different correlation matrix, obtained from a correlation coefficient based on kinetic energy, to derive new features. The data were collected from several airline datasets to build a deep prediction model for forecasting airline passenger numbers. Preliminary experiments show that fine-tuning provides an efficient approach for selecting the final number of hidden layers and the number of neurons in each layer compared with the grid search method. Similarly, the results show that the modified version of PCA is more effective in terms of dimensionality reduction, class separability, and classification accuracy than traditional PCA.
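    A minimal sketch of the modified-PCA idea is given below: PCA is performed on a correlation matrix built from a user-supplied correlation coefficient rather than the default Pearson matrix. The kinetic-energy-based coefficient from the study is not reproduced here; a Spearman matrix is used purely as a placeholder so the sketch runs end to end, and the file name is an assumption.

        import numpy as np
        import pandas as pd

        def pca_from_correlation(df: pd.DataFrame, corr: np.ndarray, n_components: int):
            """Project standardised data onto the leading eigenvectors of `corr`."""
            z = (df - df.mean()) / df.std(ddof=0)              # standardise features
            eigvals, eigvecs = np.linalg.eigh(corr)            # eigendecomposition of the correlation matrix
            order = np.argsort(eigvals)[::-1][:n_components]   # keep the largest eigenvalues
            return z.values @ eigvecs[:, order], eigvals[order]

        df = pd.read_csv("airline_features.csv")               # assumed table of engineered features
        custom_corr = df.corr(method="spearman").values        # placeholder for the modified correlation matrix
        scores, variances = pca_from_correlation(df, custom_corr, n_components=5)
        print(scores.shape, variances)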

    Machine learning methods for MicroRNA target prediction

    MicroRNAs are small non-coding RNA molecules that form a post-transcriptional layer of gene regulation. MicroRNAs bind to messenger RNA to repress translation and accelerate its degradation, ultimately downregulating the expression of genes. The mechanics of these bindings in animals are complex and depend on a myriad of contextual factors that influence the specificity and efficacy of potential interactions. This thesis describes the development of miRsight, a novel target prediction tool utilising advanced machine learning techniques. miRsight is trained using 44 target-recognition features compiled through testing on published microRNA-transfected RNA sequencing data, an experimental procedure in which microRNA molecules are introduced into a sample to quantify their impact on gene expression. In addition to the tool itself, a database of pre-computed predictions is hosted at https://mirsight.info, which also provides search, filter, and export functionality for user convenience. The results of this study indicate that miRsight predicts and ranks microRNA targets more effectively than popular target prediction tools. This is validated by examining the downregulation of gene expression from predicted targets using microRNA transfection. In the 12 samples reserved for testing, miRsight is shown to identify true targets among the top 100, 300 and 500 predictions by rank more consistently than TargetScan, MirTarget and DIANA-microT. Additionally, miRsight is capable of producing several thousand total predictions for each microRNA while maintaining this high rate of prediction accuracy.
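    The supervised set-up can be sketched as follows: a classifier is trained on per-site target-recognition features, with labels indicating whether the gene was downregulated after microRNA transfection, and candidate targets are then ranked by predicted probability. Column names, the file, and the model choice are illustrative assumptions, not miRsight's actual pipeline.

        import pandas as pd
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        sites = pd.read_csv("candidate_sites.csv")                          # one row per (microRNA, gene) candidate
        feature_cols = [c for c in sites.columns if c.startswith("f_")]     # stand-in for the 44 recognition features
        X, y = sites[feature_cols], sites["downregulated"]                  # 1 = expression fell after transfection

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
        model = GradientBoostingClassifier().fit(X_tr, y_tr)
        print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

        # Rank candidate targets of a single microRNA by predicted probability.
        mir = sites[sites["mirna"] == "hsa-miR-21-5p"].copy()               # assumed identifier column and value
        mir["score"] = model.predict_proba(mir[feature_cols])[:, 1]
        print(mir.sort_values("score", ascending=False)[["gene", "score"]].head(100))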

    Data-driven models and trait-oriented experiments of aquatic macrophytes to support freshwater management
