145,232 research outputs found

    Adaptive grid based localized learning for multidimensional data

    Get PDF
    Rapid advances in data-rich domains of science, technology, and business has amplified the computational challenges of Big Data synthesis necessary to slow the widening gap between the rate at which the data is being collected and analyzed for knowledge. This has led to the renewed need for efficient and accurate algorithms, framework, and algorithmic mechanisms essential for knowledge discovery, especially in the domains of clustering, classification, dimensionality reduction, feature ranking, and feature selection. However, data mining algorithms are frequently challenged by the sparseness due to the high dimensionality of the datasets in such domains which is particularly detrimental to the performance of unsupervised learning algorithms. The motivation for the research presented in this dissertation is to develop novel data mining algorithms to address the challenges of high dimensionality, sparseness and large volumes of datasets by using a unique grid-based localized learning paradigm for data movement clustering and classification schema. The grid-based learning is recognized in data mining as these algorithms are inherently efficient since they reduce the search space by partitioning the feature space into effective partitions. However, these approaches have not been successfully devised for supervised learning algorithms or sparseness reduction algorithm as they require careful estimation of grid sizes, partitions and data movement error calculations. Grid-based localized learning algorithms can scale well with an increase in dimensionality and the size of the datasets. To fulfill the goal of designing and developing learning algorithms that can handle data sparseness, high data dimensionality, and large size of data, in a concurrent manner to avoid the feature selection biases, a set of novel data mining algorithms using grid-based localized learning principles are developed and presented. The first algorithm is a unique computational framework for feature ranking that employs adaptive grid-based data shrinking for feature ranking. This method addresses the limitations of existing feature ranking methods by using a scoring function that discovers and exploits dependencies from all the features in the data. Data shrinking principles are established and metricized to capture and exploit dependencies between features. The second core algorithmic contribution is a novel supervised learning algorithm that utilizes grid-based localized learning to build a nonparametric classification model. In this classification model, feature space is divided using uniform/non-uniform partitions and data space subdivision is performed using a grid structure which is then used to build a classification model using grid-based nearest-neighbor learning. The third algorithm is an unsupervised clustering algorithm that is augmented with data shrinking to enhance the clustering performance of the algorithm. This algorithm addresses the limitations of the existing grid-based data shrinking and clustering algorithms by using an adaptive grid-based learning. Multiple experiments on a diversified set of datasets evaluate and discuss the effectiveness of dimensionality reduction, feature selection, unsupervised and supervised learning, and the scalability of the proposed methods compared to the established methods in the literature

    From Ordinal Ranking to Binary Classification

    Get PDF
    We study the ordinal ranking problem in machine learning. The problem can be viewed as a classification problem with additional ordinal information or as a regression problem without actual numerical information. From the classification perspective, we formalize the concept of ordinal information by a cost-sensitive setup, and propose some novel cost-sensitive classification algorithms. The algorithms are derived from a systematic cost-transformation technique, which carries a strong theoretical guarantee. Experimental results show that the novel algorithms perform well both in a general cost-sensitive setup and in the specific ordinal ranking setup. From the regression perspective, we propose the threshold ensemble model for ordinal ranking, which allows the machines to estimate a real-valued score (like regression) before quantizing it to an ordinal rank. We study the generalization ability of threshold ensembles and derive novel large-margin bounds on its expected test performance. In addition, we improve an existing algorithm and propose a novel algorithm for constructing large-margin threshold ensembles. Our proposed algorithms are efficient in training and achieve decent out-of-sample performance when compared with the state-of-the-art algorithm on benchmark data sets. We then study how ordinal ranking can be reduced to weighted binary classification. The reduction framework is simpler than the cost-sensitive classification approach and includes the threshold ensemble model as a special case. The framework allows us to derive strong theoretical results that tightly connect ordinal ranking with binary classification. We demonstrate the algorithmic and theoretical use of the reduction framework by extending SVM and AdaBoost, two of the most popular binary classification algorithms, to the area of ordinal ranking. Coupling SVM with the reduction framework results in a novel and faster algorithm for ordinal ranking with superior performance on real-world data sets, as well as a new bound on the expected test performance for generalized linear ordinal rankers. Coupling AdaBoost with the reduction framework leads to a novel algorithm that boosts the training accuracy of any cost-sensitive ordinal ranking algorithms theoretically, and in turn improves their test performance empirically. From the studies above, the key to improve ordinal ranking is to improve binary classification. In the final part of the thesis, we include two projects that aim at understanding binary classification better in the context of ensemble learning. First, we discuss how AdaBoost is restricted to combining only a finite number of hypotheses and remove the restriction by formulating a framework of infinite ensemble learning based on SVM. The framework can output an infinite ensemble through embedding infinitely many hypotheses into an~SVM kernel. Using the framework, we show that binary classification (and hence ordinal ranking) can be improved by going from a finite ensemble to an infinite one. Second, we discuss how AdaBoost carries the property of being resistant to overfitting. Then, we propose the SeedBoost algorithm, which uses the property as a machinery to prevent other learning algorithms from overfitting. Empirical results demonstrate that SeedBoost can indeed improve an overfitting algorithm on some data sets.</p

    Automatic Selection of Molecular Descriptors using Random Forest: Application to Drug Discovery

    Get PDF
    The optimal selection of chemical features (molecular descriptors) is an essential pre-processing step for the efficient application of computational intelligence techniques in virtual screening for identification of bioactive molecules in drug discovery. The selection of molecular descriptors has key influence in the accuracy of affinity prediction. In order to improve this prediction, we examined a Random Forest (RF)-based approach to automatically select molecular descriptors of training data for ligands of kinases, nuclear hormone receptors, and other enzymes. The reduction of features to use during prediction dramatically reduces the computing time over existing approaches and consequently permits the exploration of much larger sets of experimental data. To test the validity of the method, we compared the results of our approach with the ones obtained using manual feature selection in our previous study (Perez-Sanchez et al., 2014). The main novelty of this work in the field of drug discovery is the use of RF in two different ways: feature ranking and dimensionality reduction, and classification using the automatically selected feature subset. Our RF-based method out-performs classification results provided by Support Vector Machine (SVM) and Neural Networks (NN) approaches

    Band Ranking via Extended Coefficient of Variation for Hyperspectral Band Selection

    Get PDF
    Hundreds of narrow bands over a continuous spectral range make hyperspectral imagery rich in information about objects, while at the same time causing the neighboring bands to be highly correlated. Band selection is a technique that provides clear physical-meaning results for hyperspectral dimensional reduction, alleviating the difficulty for transferring and processing hyperspectral images caused by a property of hyperspectral images: large data volumes. In this study, a simple and efficient band ranking via extended coefficient of variation (BRECV) is proposed for unsupervised hyperspectral band selection. The naive idea of the BRECV algorithm is to select bands with relatively smaller means and lager standard deviations compared to their adjacent bands. To make this simple idea into an algorithm, and inspired by coefficient of variation (CV), we constructed an extended CV matrix for every three adjacent bands to study the changes of means and standard deviations, and accordingly propose a criterion to allocate values to each band for ranking. A derived unsupervised band selection based on the same idea while using entropy is also presented. Though the underlying idea is quite simple, and both cluster and optimization methods are not used, the BRECV method acquires qualitatively the same level of classification accuracy, compared with some state-of-the-art band selection methodsPeer reviewe

    Active Sampling of Pairs and Points for Large-scale Linear Bipartite Ranking

    Full text link
    Bipartite ranking is a fundamental ranking problem that learns to order relevant instances ahead of irrelevant ones. The pair-wise approach for bi-partite ranking construct a quadratic number of pairs to solve the problem, which is infeasible for large-scale data sets. The point-wise approach, albeit more efficient, often results in inferior performance. That is, it is difficult to conduct bipartite ranking accurately and efficiently at the same time. In this paper, we develop a novel active sampling scheme within the pair-wise approach to conduct bipartite ranking efficiently. The scheme is inspired from active learning and can reach a competitive ranking performance while focusing only on a small subset of the many pairs during training. Moreover, we propose a general Combined Ranking and Classification (CRC) framework to accurately conduct bipartite ranking. The framework unifies point-wise and pair-wise approaches and is simply based on the idea of treating each instance point as a pseudo-pair. Experiments on 14 real-word large-scale data sets demonstrate that the proposed algorithm of Active Sampling within CRC, when coupled with a linear Support Vector Machine, usually outperforms state-of-the-art point-wise and pair-wise ranking approaches in terms of both accuracy and efficiency.Comment: a shorter version was presented in ACML 201

    Sequential Design for Ranking Response Surfaces

    Full text link
    We propose and analyze sequential design methods for the problem of ranking several response surfaces. Namely, given L2L \ge 2 response surfaces over a continuous input space X\cal X, the aim is to efficiently find the index of the minimal response across the entire X\cal X. The response surfaces are not known and have to be noisily sampled one-at-a-time. This setting is motivated by stochastic control applications and requires joint experimental design both in space and response-index dimensions. To generate sequential design heuristics we investigate stepwise uncertainty reduction approaches, as well as sampling based on posterior classification complexity. We also make connections between our continuous-input formulation and the discrete framework of pure regret in multi-armed bandits. To model the response surfaces we utilize kriging surrogates. Several numerical examples using both synthetic data and an epidemics control problem are provided to illustrate our approach and the efficacy of respective adaptive designs.Comment: 26 pages, 7 figures (updated several sections and figures

    Feature subset selection and ranking for data dimensionality reduction

    Get PDF
    A new unsupervised forward orthogonal search (FOS) algorithm is introduced for feature selection and ranking. In the new algorithm, features are selected in a stepwise way, one at a time, by estimating the capability of each specified candidate feature subset to represent the overall features in the measurement space. A squared correlation function is employed as the criterion to measure the dependency between features and this makes the new algorithm easy to implement. The forward orthogonalization strategy, which combines good effectiveness with high efficiency, enables the new algorithm to produce efficient feature subsets with a clear physical interpretation

    Feature subset selection and ranking for data dimensionality reduction

    Get PDF
    A new unsupervised forward orthogonal search (FOS) algorithm is introduced for feature selection and ranking. In the new algorithm, features are selected in a stepwise way, one at a time, by estimating the capability of each specified candidate feature subset to represent the overall features in the measurement space. A squared correlation function is employed as the criterion to measure the dependency between features and this makes the new algorithm easy to implement. The forward orthogonalization strategy, which combines good effectiveness with high efficiency, enables the new algorithm to produce efficient feature subsets with a clear physical interpretation
    corecore