    Massively-Parallel Feature Selection for Big Data

    We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data settings (high dimensionality and/or sample size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix both in terms of rows (samples, training examples) and columns (features). By employing p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs. It then employs powerful and safe (asymptotically sound) heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, and Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Our empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores, while dominating other competitive algorithms in its class.
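
    To make the local-computation idea concrete, the sketch below shows one forward iteration in the spirit of PFBP, under assumptions the abstract does not fix: continuous data, a partial-correlation conditional independence test, and Fisher's method for the meta-analytic combination of per-partition p-values. The function names are hypothetical; this is not the authors' implementation.

```python
# A minimal sketch of PFBP's core idea, not the paper's code.
import numpy as np
from scipy import stats

def partial_corr_pvalue(x, y, Z):
    """p-value for H0: x is independent of y given the columns of Z."""
    if Z.shape[1] > 0:
        # Residualize x and y on the conditioning set via least squares.
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r = np.corrcoef(x, y)[0, 1]
    # Fisher z-transform of the (partial) correlation; ~N(0,1) under H0.
    z = np.sqrt(max(len(x) - Z.shape[1] - 3, 1)) * np.arctanh(np.clip(r, -0.9999, 0.9999))
    return 2 * stats.norm.sf(abs(z))

def pfbp_style_step(partitions, candidates, selected, alpha=0.05):
    """One forward iteration: combine per-partition p-values with Fisher's
    method, Early-Drop clear independences, and return the winner."""
    combined = {}
    for j in candidates:
        local = [partial_corr_pvalue(X[:, j], y, X[:, selected])
                 for X, y in partitions]      # computed locally per partition
        combined[j] = stats.combine_pvalues(local, method="fisher")[1]
    # Early Dropping: features judged independent of the target given
    # `selected` are removed from all subsequent iterations.
    remaining = [j for j in candidates if combined[j] <= alpha]
    winner = min(remaining, key=combined.get, default=None)
    return winner, remaining
```

    Repeating this step until no candidates remain yields a forward phase; PFBP adds a backward phase plus the Early Stopping and Early Return heuristics on top of this skeleton.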

    Using Feature Selection Methods to Discover Common Users’ Preferences for Online Recommender Systems

    Recommender systems have taken over users' task of choosing the items/services they want from online markets, where large volumes of merchandise are traded. Collaborative filtering-based recommender systems use user opinions and preferences. Determining the common attributes that influence preferences, for prediction and subsequent recommendation of unknown or new items, is a significant objective when developing recommender engines. In conventional systems, a study of user behavior to learn their likes and dislikes over items would be carried out. This paper presents feature selection methods to mine such preferences by selecting the item attributes with the highest influence. In machine learning, feature selection is used as a data pre-processing method, but this work extends its use to achieve two objectives: removing redundant, uninformative features, and selecting formative, relevant features based on the response variable. The latter objective identifies the frequent, shared features most preferred by online marketplace users as they express their preferences. A synthetic dataset was used for experimentation, and the experiments were run in Jupyter Notebook using Python. Results showed that, among the formative features, some selected features had a high influence on the response variable. Different feature selection methods yielded different feature scores, and the intrinsic method had the best overall results with 85% model accuracy. The selected features were taken as the frequently preferred attributes that influence users' preferences.
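
    As a hedged illustration of the kind of pipeline described (the paper's exact methods and dataset are not reproduced here), the sketch below scores item attributes against a preference response with a filter method (mutual information) and an intrinsic method (random-forest importances), on stand-in synthetic data:

```python
# Filter vs. intrinsic feature scoring on stand-in synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

# Stand-in data: 500 users x 12 item attributes, binary "liked" response.
X, y = make_classification(n_samples=500, n_features=12, n_informative=4,
                           random_state=0)

# Filter method: rank attributes by mutual information with the response.
filter_scores = mutual_info_classif(X, y, random_state=0)

# Intrinsic method: importances learned while fitting the model itself.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
intrinsic_scores = forest.feature_importances_

# Different methods produce different rankings, as the abstract reports.
print("filter ranking:   ", np.argsort(filter_scores)[::-1][:4])
print("intrinsic ranking:", np.argsort(intrinsic_scores)[::-1][:4])
```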

    Supervised Methods for Biomarker Detection from Microarray Experiments

    Biomarkers are valuable indicators of the state of a biological system. Microarray technology has been extensively used to identify biomarkers and build computational predictive models for disease prognosis, drug sensitivity and toxicity evaluations. Activation biomarkers can be used to understand the underlying signaling cascades, mechanisms of action and biological cross talk. Biomarker detection from microarray data requires several considerations both from the biological and computational points of view. In this chapter, we describe the main methodology used in biomarker discovery and predictive modeling, and we address some of the related challenges. Moreover, we discuss biomarker validation and give some insights into multiomics strategies for biomarker detection.
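
    As one generic example of the supervised screening step such methodology typically includes (a textbook procedure, not the chapter's specific pipeline), the sketch below runs a per-gene two-sample t-test on simulated microarray-style data and controls the false discovery rate with Benjamini-Hochberg:

```python
# Per-gene differential-expression screening with FDR control; the data
# here are simulated, not from a real microarray experiment.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_genes, n_per_group = 2000, 20
# Expression matrix: genes x samples, two conditions; spike 50 true markers.
control = rng.normal(0.0, 1.0, size=(n_genes, n_per_group))
disease = rng.normal(0.0, 1.0, size=(n_genes, n_per_group))
disease[:50] += 1.5  # differentially expressed genes

# Per-gene Welch t-test between the two groups.
_, pvals = stats.ttest_ind(control, disease, axis=1, equal_var=False)

# Control the false discovery rate across all genes.
rejected, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"candidate biomarkers at 5% FDR: {rejected.sum()}")
```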

    A model-free feature selection technique of feature screening and random forest based recursive feature elimination

    In this paper, we propose a model-free feature selection method for ultra-high dimensional data with massive numbers of features. It is a two-phase procedure that combines the fused Kolmogorov filter with random-forest-based RFE to remove model limitations and reduce computational complexity. The method is fully nonparametric and can work with various types of datasets. It has several appealing characteristics, i.e., accuracy, model-freeness, and computational efficiency, and can be widely used in practical problems such as multiclass classification, nonparametric regression, and Poisson regression, among others. We show that the proposed method is selection consistent and L_2 consistent under weak regularity conditions. We further demonstrate the superior performance of the proposed method over other existing methods through simulations and real data examples.
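
    The sketch below illustrates the two-phase idea under simplifying assumptions: a Kolmogorov-Smirnov screening statistic written here for a binary response only (the paper's fused Kolmogorov filter is more general), followed by random-forest-based RFE on the surviving features:

```python
# Phase 1: nonparametric screening. Phase 2: RF-based RFE refinement.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=300, n_features=1000, n_informative=8,
                           random_state=0)

# Phase 1 (screening): KS distance between the conditional distributions
# of each feature given y; keep the top d features.
d = 50
ks = np.array([ks_2samp(X[y == 0, j], X[y == 1, j]).statistic
               for j in range(X.shape[1])])
keep = np.argsort(ks)[::-1][:d]

# Phase 2 (refinement): recursive feature elimination with a random forest.
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=8).fit(X[:, keep], y)
selected = keep[rfe.support_]
print("selected features:", np.sort(selected))
```

    Screening first on a cheap, model-free statistic keeps the expensive RFE loop away from the full 1000-feature space, which is where the computational savings for ultra-high dimensional data come from.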

    A framework for feature selection through boosting

    As the dimensionality of datasets in predictive modelling continues to grow, feature selection becomes increasingly important. Datasets with complex feature interactions and high levels of redundancy still present a challenge to existing feature selection methods. We propose a novel framework for feature selection that relies on boosting, or sample re-weighting, to select sets of informative features in classification problems. The method uses as its basis the feature rankings derived from fast and scalable tree-boosting models such as XGBoost. We compare the proposed method to standard feature selection algorithms on 9 benchmark datasets. We show that the proposed approach reaches higher accuracies with fewer features on most of the tested datasets, and that the selected features have lower redundancy.
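
    The abstract specifies the framework only at a high level, so the following is a hedged sketch of one plausible reading: repeatedly fit a tree-boosting model, harvest its top-ranked features, and up-weight misclassified samples so later rounds surface features informative for the hard cases. sklearn's GradientBoostingClassifier stands in for XGBoost, and the round count and per-round top-k are illustrative parameters.

```python
# Boosting-style feature selection via iterative sample re-weighting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=400, n_features=60, n_informative=10,
                           random_state=0)

weights = np.full(len(y), 1.0 / len(y))
selected = set()
for _ in range(3):  # illustrative number of rounds
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X, y, sample_weight=weights)
    # Harvest this round's top-ranked features.
    selected |= set(np.argsort(model.feature_importances_)[::-1][:5])
    # Re-weight: emphasize samples the current model gets wrong, so the
    # next round ranks features that help on those cases.
    wrong = model.predict(X) != y
    weights[wrong] *= 2.0
    weights /= weights.sum()

print("selected feature set:", sorted(selected))
```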

    The FEDHC Bayesian network learning algorithm

    The paper proposes a new hybrid Bayesian network learning algorithm, termed Forward Early Dropping Hill Climbing (FEDHC), devised to work with either continuous or categorical variables. Specifically for the case of continuous data, a robust-to-outliers version of FEDHC, which can be adopted by other BN learning algorithms, is proposed. Further, the paper shows that the only implementation of MMHC in the statistical software R is prohibitively expensive, and a new implementation is offered. FEDHC is tested via Monte Carlo simulations that clearly show it is computationally efficient and produces Bayesian networks of similar or higher accuracy than MMHC and PCHC. Finally, an application of the FEDHC, PCHC and MMHC algorithms to real data from the field of economics is demonstrated using the statistical software R.
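
    As a minimal sketch of the hill-climbing half of such hybrid algorithms (the forward early-dropping phase that restricts the search space is omitted, and this is not the paper's code), the following performs additions-only, BIC-scored hill climbing for Gaussian Bayesian networks:

```python
# Score-based hill climbing over DAGs with per-node BIC scoring.
import numpy as np
from itertools import permutations

def node_bic(X, child, parents):
    """BIC-style score of regressing `child` on `parents` (higher is better)."""
    n = X.shape[0]
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in parents])
    resid = X[:, child] - Z @ np.linalg.lstsq(Z, X[:, child], rcond=None)[0]
    sigma2 = max(resid @ resid / n, 1e-12)
    return -0.5 * n * np.log(sigma2) - 0.5 * np.log(n) * (len(parents) + 1)

def is_ancestor(parents, a, b):
    """True if `a` is an ancestor of `b` under the current parent sets."""
    stack, seen = list(parents[b]), set()
    while stack:
        node = stack.pop()
        if node == a:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def hill_climb(X, max_parents=3):
    """Greedily add the single best-scoring edge until no addition helps."""
    p = X.shape[1]
    parents = {j: set() for j in range(p)}
    score = {j: node_bic(X, j, parents[j]) for j in range(p)}
    while True:
        best_gain, best_edge = 0.0, None
        for u, v in permutations(range(p), 2):     # candidate edge u -> v
            if u in parents[v] or len(parents[v]) >= max_parents:
                continue
            if is_ancestor(parents, v, u):         # u -> v would close a cycle
                continue
            gain = node_bic(X, v, parents[v] | {u}) - score[v]
            if gain > best_gain:
                best_gain, best_edge = gain, (u, v)
        if best_edge is None:
            return parents                         # local optimum reached
        u, v = best_edge
        parents[v].add(u)
        score[v] = node_bic(X, v, parents[v])
```

    Hybrid learners such as FEDHC first run a skeleton-identification phase (forward selection with Early Dropping) so that the hill climb only considers edges the skeleton permits, which is where the computational savings come from.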