
    Stochastic Local Search Heuristics for Efficient Feature Selection: An Experimental Study

    Feature engineering, including feature selection, plays a key role in data science, knowledge discovery, machine learning, and statistics. Recently, much progress has been made in increasing the accuracy of machine learning for complex problems, due in part to improvements in feature engineering, for example by means of deep learning or feature selection. This progress has, to a large extent, come at the cost of dramatic and perhaps unsustainable increases in the computational resources used. Consequently, there is now a need to emphasize not only accuracy but also computational cost in research on, and applications of, machine learning, including feature selection. With a focus on both the accuracy and the computational cost of feature selection, we study stochastic local search (SLS) methods applied to feature selection. With an eye to containing computational cost, we consider an SLS method for efficient feature selection, SLS4FS. SLS4FS is an amalgamation of several heuristics, including filter and wrapper methods, controlled by hyperparameters. While SLS4FS admits, for certain hyperparameter settings, analysis by means of homogeneous Markov chains, our focus in this paper is on experiments with several real-world datasets. Our experimental study suggests that SLS4FS is competitive with several existing methods and is useful in settings where one wants to control the computational cost.
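As a rough illustration of the kind of stochastic local search the abstract describes, the sketch below hill-climbs over feature-subset bit vectors with occasional random restarts. It is not the actual SLS4FS algorithm; the function name, the acceptance rule, and the hyperparameters are assumptions for illustration only.

```python
import random

def sls_feature_select(n_features, score, max_evals=200, p_restart=0.05, seed=0):
    """Illustrative SLS over feature subsets encoded as bit vectors.

    score(subset) -> float to maximize. Each step flips one feature
    bit and keeps the flip if it does not hurt; a random restart
    occurs with probability p_restart to escape local optima.
    """
    rng = random.Random(seed)
    current = [rng.random() < 0.5 for _ in range(n_features)]
    best, best_score = current[:], score(current)
    for _ in range(max_evals):
        if rng.random() < p_restart:
            # random restart: re-draw the whole subset
            current = [rng.random() < 0.5 for _ in range(n_features)]
        cand = current[:]
        cand[rng.randrange(n_features)] = not cand[rng.randrange(0, 1) or 0] if False else not cand[0]
        current = cand if score(cand) >= score(current) else current
        if score(current) > best_score:
            best, best_score = current[:], score(current)
    return best, best_score
```

A wrapper method would implement `score` as cross-validated accuracy of a classifier restricted to the selected features; a filter method would score the subset from data statistics alone.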

    Variational Autoencoder Based Estimation Of Distribution Algorithms And Applications To Individual Based Ecosystem Modeling Using EcoSim

    Individual-based modeling provides a bottom-up approach wherein low-level interactions give rise to high-level phenomena in patterns equivalent to those found in nature. This method generates an immense amount of data through artificial simulation, which can be made tractable by machine learning, where multidimensional data is optimized and transformed. Using an individual-based modeling platform known as EcoSim, we modeled the effects of elitist sexual selection and the communication of fear. Data received from these experiments was reduced in dimension through the use of a novel algorithm proposed by us: Variational Autoencoder based Estimation of Distribution Algorithms with Population Queue and Adaptive Variance Scaling (VAE-EDA-Q AVS). We constructed a novel Estimation of Distribution Algorithm (EDA) by extending generative models known as variational autoencoders (VAEs). VAE-EDA-Q smooths the data generation process using an iteratively updated queue (Q) of populations. Adaptive Variance Scaling (AVS) dynamically updates the variance at which models are sampled based on fitness. The combination of VAE-EDA-Q with AVS demonstrates high computational efficiency and requires few fitness evaluations. We extended VAE-EDA-Q AVS to act as a feature-reducing wrapper method in conjunction with C4.5 decision trees to reduce the dimensionality of data. The relationship between sexual selection, random selection, and speciation is a contested topic. Supporting evidence suggests that sexual selection drives speciation; opposing evidence contends that the correlation is either negative or absent. We utilized EcoSim to model elitist and random mate selection. Our results demonstrated a significantly lower speciation rate, a significantly lower extinction rate, and a significantly higher turnover rate for sexual selection groups. Species diversification displayed no significant difference. The relationship between communication and foraging behavior similarly features opposing hypotheses, claiming both increases and decreases in foraging behavior in response to alarm communication. Through modeling with EcoSim, we found alarm communication to decrease foraging activity in most cases, yet gradually increase foraging activity in some others. Furthermore, we found that both outcomes of alarm communication increased fitness compared to non-communication.
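The population-queue and adaptive-variance-scaling ideas can be sketched with a much simpler model than a VAE: below, a per-dimension Gaussian stands in for the generative model, elites from the last few generations form the queue Q, and the sampling variance is widened on stagnation and narrowed on progress. All names, constants, and the Gaussian substitution are illustrative assumptions, not the authors' implementation.

```python
import random
import statistics
from collections import deque

def eda_q_avs(fitness, dim, pop_size=30, elite_frac=0.3, queue_len=3,
              generations=40, seed=0):
    """Toy EDA with a population queue (Q) and adaptive variance scaling."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    queue = deque(maxlen=queue_len)          # population queue Q
    scale, last_best = 1.0, float("-inf")    # AVS state
    n_elite = max(2, int(elite_frac * pop_size))
    best = max(pop, key=fitness)
    for _ in range(generations):
        elites = sorted(pop, key=fitness, reverse=True)[:n_elite]
        queue.append(elites)
        pool = [ind for gen in queue for ind in gen]
        # fit a per-dimension Gaussian to the queued elites
        mu = [statistics.mean(ind[d] for ind in pool) for d in range(dim)]
        sd = [statistics.stdev(ind[d] for ind in pool) + 1e-9 for d in range(dim)]
        pop = [[rng.gauss(mu[d], scale * sd[d]) for d in range(dim)]
               for _ in range(pop_size)]
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) > fitness(best):
            best = gen_best
        # AVS: widen the search on stagnation, narrow it on progress
        if fitness(gen_best) <= last_best:
            scale = min(scale * 1.1, 3.0)
        else:
            scale = max(scale * 0.9, 0.3)
        last_best = fitness(gen_best)
    return best, fitness(best)
```

In the full method, the Gaussian is replaced by sampling from a VAE trained on the queued elite populations.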

    Random Subset Feature Selection for Ecological Niche Modeling of Wildfire Activity and the Monarch Butterfly

    Correlative ecological niche models (ENMs) are essential for investigating the distributions of species and natural phenomena via environmental correlates across broad fields, including the entomology and pyrogeography featured in this study. Feature (variable) selection is critical for producing more robust ENMs with greater transferability across space and time, but few studies evaluate formal feature selection algorithms (FSAs) for producing higher-performance ENMs. The variability of ENMs arising from different feature subsets is also seldom represented. A novel FSA is developed and evaluated here: the random subset feature selection algorithm (RSFSA). The RSFSA generates an ensemble of higher-accuracy ENMs from different feature subsets, producing a feature subset ensemble (FSE). The RSFSA-selected FSEs are used in a novel way to represent ENM variability. Wildfire activity presence/absence databases for the western US prove ideal for evaluating RSFSA-selected MaxEnt ENMs. The RSFSA was effective in identifying FSEs of 15 of 90 variables with higher accuracy and information content than random FSEs. Selected FSEs were used to identify severe contemporary wildfire deficits and significant future increases in wildfire activity for many ecoregions. Migratory roosting localities of declining eastern North American monarch butterflies (Danaus plexippus) were used to spatially model migratory pathways, comparing RSFSA-selected MaxEnt ENMs and kernel density estimate models (KDEMs). The higher-information-content ENMs best correlated migratory pathways with nectar resources in grasslands. The higher-accuracy KDEMs best revealed migratory pathways through less suitable desert environments. Monarch butterfly roadkill data were surveyed for Texas within the main Oklahoma-to-Mexico Central Funnel migratory pathway. A random FSE of MaxEnt roadkill ENMs was used to estimate a 2-3% loss of migrants to roadkill. Hotspots of roadkill in west Texas and Mexico were recommended for assessing roadkill mitigation to assist in monarch population recovery. The RSFSA effectively produces higher-performance ENM FSEs for estimating optimal feature subset sizes and for comparing ENM algorithms, parameters, and environmental scenarios. The RSFSA also performed comparably to expert variable selection, confirming its value in the absence of expert information. The RSFSA should be compared with other FSAs for developing ENMs and in data mining applications across other disciplines, such as image classification and molecular bioinformatics.
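The core of a random-subset feature selection ensemble can be sketched in a few lines: draw many random feature subsets of a fixed size, score a model on each, and keep the top scorers as the feature subset ensemble (FSE). This is an illustrative sketch, not the published RSFSA; `eval_model` and all parameter names are assumptions.

```python
import random
from statistics import mean

def random_subset_fse(X, y, n_features, subset_size, eval_model,
                      n_subsets=50, keep_top=5, seed=0):
    """Score random feature subsets and keep the best as an ensemble.

    eval_model(X, y, subset) -> accuracy-like score to maximize.
    Returns the top (score, subset) pairs and their mean score.
    """
    rng = random.Random(seed)
    scored = []
    for _ in range(n_subsets):
        subset = tuple(sorted(rng.sample(range(n_features), subset_size)))
        scored.append((eval_model(X, y, subset), subset))
    scored.sort(reverse=True)            # best-scoring subsets first
    ensemble = scored[:keep_top]         # the feature subset ensemble
    return ensemble, mean(score for score, _ in ensemble)
```

In an ENM setting, `eval_model` would train a MaxEnt model on the candidate subset and return an accuracy or information-content metric (e.g., AUC or AICc-based), and the spread of predictions across the ensemble members represents model variability.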

    BlogForever: D2.5 Weblog Spam Filtering Report and Associated Methodology

    This report is written as a first attempt to define the BlogForever spam detection strategy. It comprises a survey of weblog spam technology and approaches to its detection. While the report was written to help identify possible approaches to spam detection as a component within the BlogForever software, the discussion has been extended to include observations related to the historical, social, and practical value of spam, and proposals of other ways of dealing with spam within the repository without necessarily removing it. It contains a general overview of spam types, ready-made anti-spam APIs available for weblogs, possible methods that have been suggested for preventing the introduction of spam into a blog, and research related to spam, focusing on spam that appears in the weblog context, concluding with a proposal for a spam detection workflow that might form the basis of the spam detection component of the BlogForever software.

    Addressing low dimensionality feature subset selection: reliefF(-k) or extended correlation-based feature selection (eCFS)?

    This paper tackles problems where attribute selection not only chooses very few features but also yields low classification performance, in terms of accuracy, compared to the full attribute set. Correlation-based feature selection (CFS) has been set as the baseline attribute subset selector due to its popularity and high performance. Around a hundred data sets have been collected and submitted to CFS; the problems simultaneously fulfilling two conditions, a) fewer than six selected attributes and b) fewer than forty per cent of the attributes selected, have then been tested in two directions. Firstly, in the scope of data selection at the feature level, some options proposed in a prior work as well as an advanced contemporary approach have been applied. Secondly, the pre-processed and initial problems have been tested with some robust classifiers. Moreover, this work introduces a new taxonomy of feature selection according to the solution type and the way it is computed. The test bed comprises seven problems: three of them report a single selected attribute, another one two extracted features, and the three remaining data sets four or five retained attributes, all selected by CFS. Additionally, the feature set size ranges between six and twenty-nine, and the complexity of the problems, in terms of classes, fluctuates between two and twenty-one, yielding averages of sixteen and around five for these two properties, respectively. The contribution concludes that the advanced procedure is suitable for problems where only one or two attributes are selected by CFS; for data sets with more than two selected features the baseline method is preferable to the advanced one, although the considered feature ranking method achieved intermediate results.
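For reference, CFS scores a candidate subset with the standard merit heuristic, merit = k·r_cf / sqrt(k + k(k-1)·r_ff): it rewards features correlated with the class (r_cf, averaged over the k features) and penalizes redundancy among them (r_ff, averaged over feature pairs). A direct transcription, with the correlation tables as assumed inputs:

```python
from math import sqrt

def cfs_merit(subset, feat_class_corr, feat_feat_corr):
    """CFS merit of a feature subset.

    feat_class_corr[f]     -> correlation of feature f with the class
    feat_feat_corr[a][b]   -> correlation between features a and b (a < b)
    """
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(feat_class_corr[f]) for f in subset) / k
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(feat_feat_corr[a][b]) for a, b in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)
```

CFS then searches subsets (typically greedy forward selection) to maximize this merit, which is why it tends to return small, weakly redundant subsets, the very behavior the paper investigates.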

    Machine Learning Approaches for Improving Prediction Performance of Structure-Activity Relationship Models

    In silico bioactivity prediction studies are designed to complement in vivo and in vitro efforts to assess the activity and properties of small molecules. In silico methods such as Quantitative Structure-Activity/Property Relationship (QSAR) modeling are used to correlate the structure of a molecule with its biological properties in drug design and toxicological studies. In this body of work, I started with two in-depth reviews into the application of machine learning based approaches and feature reduction methods to QSAR, and then investigated solutions to three common challenges faced in machine learning based QSAR studies. First, to improve the prediction accuracy of learning from imbalanced data, the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, combined with bagging as an ensemble strategy, were evaluated. The Friedman aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that this method significantly outperformed other conventional methods, although SMOTEENN with bagging became less effective when the imbalance ratio (IR) exceeded a certain threshold (e.g., >40). Second, because the ability to separate the few active compounds from the vast number of inactive ones is of great importance in computational toxicology, deep neural networks (DNN) and random forest (RF), representing deep and shallow learning algorithms, respectively, were chosen to carry out structure-activity relationship-based chemical toxicity prediction. Results suggest that DNN significantly outperformed RF (p < 0.001, ANOVA) by 22-27% for four metrics (precision, recall, F-measure, and AUPRC) and by 11% for another (AUROC). Lastly, current features used for QSAR-based machine learning are often very sparse and limited by the logic and mathematical processes used to compute them. Transformer embedding features (TEF) were developed as new continuous vector descriptors/features using the latent space embedding from a multi-head self-attention model. The significance of TEF as new descriptors was evaluated by applying them to tasks such as predictive modeling, clustering, and similarity search. An accuracy of 84% on the Ames mutagenicity test indicates that these new features have a correlation to biological activity. Overall, the findings in this study can be applied to improve the performance of machine learning based QSAR efforts for enhanced drug discovery and toxicology assessments.
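The SMOTE step described above can be sketched in a few lines: each synthetic minority sample is interpolated between a real minority sample and one of its k nearest minority-class neighbors. This is a minimal illustrative stand-in, not the imbalanced-learn implementation typically used in such studies; an ENN cleaning pass would then remove samples misclassified by their neighbors.

```python
import numpy as np

def smote_oversample(X_min, n_synthetic, k=5, seed=0):
    """Generate synthetic minority-class samples by interpolation.

    X_min: (n, d) array of minority-class samples.
    Each synthetic point lies on the segment between a random minority
    sample and one of its k nearest minority neighbors.
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # pairwise distances among minority samples (self excluded)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                     # pick a minority sample
        j = neighbors[i, rng.integers(k)]       # one of its k neighbors
        gap = rng.random()                      # interpolation factor
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Bagging then trains each ensemble member on a bootstrap of the rebalanced data, which is the combination the abstract evaluates.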

    Land-Surface Parameters for Spatial Predictive Mapping and Modeling

    Land-surface parameters derived from digital land surface models (DLSMs) (for example, slope, surface curvature, topographic position, topographic roughness, aspect, heat load index, and topographic moisture index) can serve as key predictor variables in a wide variety of mapping and modeling tasks relating to geomorphic processes, landform delineation, ecological and habitat characterization, and geohazard, soil, wetland, and general thematic mapping and modeling. However, selecting features from the large number of potential derivatives that may be predictive for a specific feature or process can be complicated, and existing literature may offer contradictory or incomplete guidance. The availability of multiple data sources and the need to define moving window shapes, sizes, and cell weightings further complicate selecting and optimizing the feature space. This review focuses on the calculation and use of DLSM parameters for empirical spatial predictive modeling applications, which rely on training data and explanatory variables to make predictions of landscape features and processes over a defined geographic extent. The target audience for this review is researchers and analysts undertaking predictive modeling tasks that make use of the most widely used terrain variables. To outline best practices and highlight future research needs, we review a range of land-surface parameters relating to steepness, local relief, rugosity, slope orientation, solar insolation, and moisture and characterize their relationship to geomorphic processes. We then discuss important considerations when selecting such parameters for predictive mapping and modeling tasks to assist analysts in answering two critical questions: What landscape conditions or processes does a given measure characterize? How might a particular metric relate to the phenomenon or features being mapped, modeled, or studied? 
We recommend the use of landscape- and problem-specific pilot studies to answer, to the extent possible, these questions for potential features of interest in a mapping or modeling task. We describe existing techniques to reduce the size of the feature space using feature selection and feature reduction methods, assess the importance or contribution of specific metrics, and parameterize moving windows or characterize the landscape at varying scales using alternative methods while highlighting strengths, drawbacks, and knowledge gaps for specific techniques. Recent developments, such as explainable machine learning and convolutional neural network (CNN)-based deep learning, may guide and/or minimize the need for feature space engineering and ease the use of DLSMs in predictive modeling tasks.
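Two of the land-surface parameters discussed, slope and aspect, can be derived from an elevation grid with finite differences. The sketch below uses one common convention (slope as the arctangent of the gradient magnitude; aspect measured clockwise from the grid's up-row direction); conventions and edge handling vary between GIS packages, so treat this as illustrative.

```python
import numpy as np

def slope_aspect(dem, cell_size=1.0):
    """Slope (degrees) and aspect (degrees, clockwise from grid north)
    from a 2-D elevation array, via central finite differences."""
    dz_dy, dz_dx = np.gradient(dem, cell_size)   # rows ~ y, cols ~ x
    slope = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
    # aspect convention is illustrative; depends on the grid's row direction
    aspect = np.degrees(np.arctan2(-dz_dx, dz_dy)) % 360.0
    return slope, aspect
```

Curvature, topographic position, and roughness metrics follow the same pattern, differing mainly in the derivative order and the moving-window size over which the surface is summarized.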