
    A scalable implementation of information theoretic feature selection for high dimensional data

    With the growth of high-dimensional data, feature selection is a vital component of machine learning as well as an important stand-alone data analytics tool. Without it, the computational cost of big data analytics can become unmanageable, and spurious correlations and noise can reduce the accuracy of any results. Feature selection removes irrelevant and redundant information, leading to faster, more reliable data analysis. Feature selection techniques based on information theory are among the fastest known, and the Manchester AnalyticS Toolkit (MAST) provides an efficient, parallel and scalable implementation of these methods. This paper considers a number of data structures for storing the frequency counters that underpin MAST. We show that preprocessing the data to reduce the number of zero-valued counters in an array structure results in an order-of-magnitude reduction in both memory usage and execution time compared to state-of-the-art structures that use explicit mappings to avoid zero-valued counters. We also describe a number of parallel processing techniques that enable MAST to scale linearly with the number of processors, even on NUMA architectures. MAST targets scale-up servers rather than scale-out clusters, and we show that it performs orders of magnitude faster than existing tools. Moreover, we show that MAST is 3.5 times faster than a scale-out solution built for Spark running on the same server. As an example of MAST's performance, we were able to process a dataset of 100 million examples and 100,000 features in under 10 minutes on a four-socket server, with each socket containing an 8-core Intel Xeon E5-4620 processor.
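    To make the information-theoretic scoring concrete, below is a minimal single-threaded Python sketch of mutual-information-based feature ranking over discrete data. It is not MAST's implementation: the hash-based Counter, which stores only cells with non-zero counts, only loosely illustrates the idea of avoiding zero-valued counters, and the function names are hypothetical.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Estimate I(X; Y) in bits from paired discrete samples.

    The Counter over (x, y) pairs stores only cells with non-zero
    counts, loosely mirroring the zero-avoiding counter structures
    discussed in the abstract above (MAST's actual structures differ).
    """
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    # I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum((c / n) * np.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def select_top_k(X, y, k):
    """Rank features by mutual information with the label; keep the top k."""
    scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
    return list(np.argsort(scores)[::-1][:k])
```

    A parallel implementation such as MAST would partition this counting work across cores and sockets; the sketch above shows only the scoring criterion itself.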

    Feature selection in high-dimensional dataset using MapReduce

    This paper describes a distributed MapReduce implementation of the minimum Redundancy Maximum Relevance (mRMR) algorithm, a popular feature selection method in bioinformatics and network inference problems. The proposed approach handles both tall/narrow and wide/short datasets. We further provide an open-source implementation based on Hadoop/Spark and illustrate its scalability on datasets involving millions of observations or features.
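    As an illustration of the selection criterion itself, rather than the paper's Hadoop/Spark implementation, here is a single-machine Python sketch of the greedy mRMR loop over discrete features; the helper and function names are hypothetical.

```python
import numpy as np
from collections import Counter

def mi(x, y):
    """Mutual information (bits) between two discrete sequences."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * np.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def mrmr(X, y, k):
    """Greedy mRMR: repeatedly add the feature maximizing relevance
    to the label minus mean redundancy with features chosen so far."""
    relevance = np.array([mi(X[:, j], y) for j in range(X.shape[1])])
    selected = [int(np.argmax(relevance))]          # start with most relevant
    candidates = set(range(X.shape[1])) - set(selected)
    while len(selected) < k and candidates:
        def score(j):
            redundancy = np.mean([mi(X[:, j], X[:, s]) for s in selected])
            return relevance[j] - redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

    In a MapReduce setting, the mutual-information counts for each candidate feature can be computed independently and in parallel, which is what makes the criterion amenable to the distributed formulation the paper proposes.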

    Active Semi-Supervised Learning Using Sampling Theory for Graph Signals

    We consider the problem of offline, pool-based active semi-supervised learning on graphs. This problem is important when labeled data is scarce and expensive, whereas unlabeled data is easily available. The data points are represented by the vertices of an undirected graph, with the similarity between them captured by the edge weights. Given a target number of nodes to label, the goal is to choose the nodes that are most informative and then predict the unknown labels. We propose a novel framework for this problem based on our recent results on sampling theory for graph signals. A graph signal is a real-valued function defined on the nodes of the graph. A notion of frequency for such signals can be defined using the spectrum of the graph Laplacian matrix. The sampling theory for graph signals aims to extend the traditional Nyquist-Shannon sampling theory by allowing us to identify the class of graph signals that can be reconstructed from their values on a subset of vertices. This approach allows us to define a criterion for active learning based on sampling set selection, which aims at maximizing the frequency of the signals that can be reconstructed from their samples on the set. Experiments show the effectiveness of our method.
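    The following Python sketch gives one simplified reading of this pipeline: compute the Laplacian spectrum, treat the leading eigenvectors as the low-frequency basis, greedily pick sample nodes that keep the sampled basis well conditioned, and reconstruct by least squares. The paper's actual criterion is more sophisticated (it avoids full eigendecomposition), so the Gram-determinant selection rule and the function names here are assumptions made for illustration.

```python
import numpy as np

def greedy_sampling_set(W, k):
    """Greedily choose k nodes to label, assuming signals are bandlimited
    to the k lowest graph frequencies (bandwidth = label budget).

    Simplified, dense stand-in for the paper's criterion: at each step,
    add the node maximizing the volume (Gram determinant) of the sampled
    rows of the low-frequency eigenvector basis.
    """
    L = np.diag(W.sum(axis=1)) - W       # combinatorial graph Laplacian
    _, U = np.linalg.eigh(L)             # eigenvectors, low frequency first
    Uk = U[:, :k]                        # bandlimited basis
    S = []
    for _ in range(k):
        def volume(v):
            A = Uk[S + [v], :]
            return np.linalg.det(A @ A.T)
        S.append(max((v for v in range(W.shape[0]) if v not in S),
                     key=volume))
    return S

def reconstruct(W, S, f_S, k):
    """Least-squares recovery of a k-bandlimited signal from samples on S."""
    L = np.diag(W.sum(axis=1)) - W
    _, U = np.linalg.eigh(L)
    Uk = U[:, :k]
    coeffs, *_ = np.linalg.lstsq(Uk[S, :], np.asarray(f_S), rcond=None)
    return Uk @ coeffs
```

    For semi-supervised classification, one would reconstruct a label-indicator signal per class from the queried labels and assign each unlabeled node to the class with the largest reconstructed value.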