
    VizRank: Data Visualization Guided by Machine Learning

    Data visualization plays a crucial role in identifying interesting patterns in exploratory data analysis. Its use is, however, made difficult by the large number of possible data projections showing different attribute subsets that must be evaluated by the data analyst. In this paper, we introduce a method called VizRank, which is applied to classified data to automatically select the most useful data projections. VizRank can be used with any visualization method that maps attribute values to points in a two-dimensional visualization space. It assesses possible data projections and ranks them by their ability to visually discriminate between classes. The quality of class separation is estimated by computing the predictive accuracy of a k-nearest-neighbor classifier on the data set consisting of the x and y positions of the projected data points and their class information. The paper introduces the method and presents experimental results showing that VizRank's ranking of projections agrees closely with subjective rankings by data analysts. The practical use of VizRank is also demonstrated by an application in the field of functional genomics.
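    The projection-scoring idea can be sketched compactly. The snippet below is an illustrative reconstruction, not the authors' code: it scores a candidate 2D projection by the leave-one-out accuracy of a k-nearest-neighbor classifier on the projected (x, y) positions, then ranks attribute pairs by that score. Function names and the choice of k are assumptions.

    ```python
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    def score_projection(xy, labels, k=5):
        """Score a 2D projection by leave-one-out k-NN accuracy.

        xy     : (n_samples, 2) array of projected point positions
        labels : (n_samples,) array of class labels
        """
        knn = KNeighborsClassifier(n_neighbors=k)
        return cross_val_score(knn, xy, labels, cv=LeaveOneOut()).mean()

    def rank_attribute_pairs(X, labels, k=5):
        """Rank every attribute pair of data matrix X by projection score."""
        scored = []
        for i in range(X.shape[1]):
            for j in range(i + 1, X.shape[1]):
                scored.append(((i, j), score_projection(X[:, [i, j]], labels, k)))
        return sorted(scored, key=lambda t: t[1], reverse=True)
    ```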

    Physics-inspired Replica Approaches to Computer Science Problems

    We study machine learning classification problems and combinatorial optimization problems using physics-inspired replica approaches. In the current work, we focus on the traveling salesman problem, one of the most famous problems in the entire field of combinatorial optimization. Our approach is specifically motivated by the desire to avoid trapping in metastable local minima, a common occurrence in hard problems with multiple extrema. Our method involves (i) coupling otherwise independent simulations of a system ("replicas") via geometrical distances, as well as (ii) probabilistic inference applied to the solutions found by individual replicas. In particular, we apply our method to the well-known "k-opt" algorithm and examine two particular cases, k = 2 and k = 3. With the aid of geometrical coupling alone, we are able to determine the optimum tour length on systems of up to 280 cities (an order of magnitude larger than the largest systems typically solved by bare 3-opt). The probabilistic replica-based inference approach improves k-opt even further and determines the optimal solution of a problem with 318 cities. In this work, we also formulate a supervised machine learning algorithm for classification problems called the "Stochastic Replica Voting Machine" (SRVM). The method is based on representing known data via multiple linear expansions in terms of various stochastic functions. The algorithm is developed, implemented, and applied to binary and 3-class classification problems in materials science. Here, we employ SRVM to predict candidate compounds capable of forming a cubic perovskite structure and further classify binary (AB) solids. We demonstrate that our SRVM method exceeds the well-known Support Vector Machine (SVM) in accuracy when predicting the cubic perovskite structure. The algorithm has also been tested on 8 diverse training data sets of various types and feature-space dimensions from the UCI Machine Learning Repository, and has been shown to consistently match or exceed the accuracy of existing algorithms while avoiding many of their pitfalls.
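    Since the replica coupling builds on the k-opt move set, a plain single-replica 2-opt is a useful reference point. The sketch below is the standard textbook local search, not the paper's coupled-replica implementation; names and the random test instance are illustrative.

    ```python
    import numpy as np

    def tour_length(tour, dist):
        """Total length of a closed tour under distance matrix dist."""
        return sum(dist[tour[i], tour[(i + 1) % len(tour)]]
                   for i in range(len(tour)))

    def two_opt(tour, dist):
        """Reverse tour segments until no reversal shortens the tour."""
        tour, n = list(tour), len(tour)
        improved = True
        while improved:
            improved = False
            for i in range(1, n - 1):
                for j in range(i + 1, n):
                    # Reversing tour[i:j] swaps edges (a,b),(c,d) for (a,c),(b,d).
                    a, b = tour[i - 1], tour[i]
                    c, d = tour[j - 1], tour[j % n]
                    if dist[a, c] + dist[b, d] < dist[a, b] + dist[c, d]:
                        tour[i:j] = tour[i:j][::-1]
                        improved = True
        return tour

    # Example on 50 random cities in the unit square.
    rng = np.random.default_rng(0)
    pts = rng.random((50, 2))
    dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    print(tour_length(two_opt(range(50), dist), dist))
    ```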

    Machine Learning Model Selection for Predicting Global Bathymetry

    This work is concerned with the viability of Machine Learning (ML) for training models that predict global bathymetry, and with whether there is a best-fit model for predicting that bathymetry. The desired result is an investigation of ML's suitability for future prediction models, experimenting with multiple trained models to determine an optimum selection. Ocean features were aggregated from a set of external studies and placed into two-minute spatial grids representing the Earth's oceans. A set of regression models, classification models, and a novel classification model were then fit to this data and analyzed. The novel classification model is optimized by selecting the best-performing model in each geospatial area; this optimization increases prediction accuracy on the test data by approximately 3%. These models were trained using bathymetry data from the ETOPO2v2 dataset. Analysis and validation for each model also used bathymetry from the ETOPO dataset, and the resulting metrics were produced and reported. Results demonstrate that ocean features can potentially be used to build a prediction model for bathymetry given accurate data and intelligent model selection. Based on the results in this work, the evidence supports the conclusion that no single model will best predict all global bathymetry.
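    The per-region model selection described above can be sketched as follows. This is a hedged reconstruction under assumed names and candidate models, not the thesis code: a few regressors are fit globally, and each geospatial cell keeps whichever one predicts its held-out bathymetry best.

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    def select_model_per_cell(X_train, y_train, X_val, y_val, cell_ids):
        """cell_ids: geospatial-cell id of each validation sample."""
        candidates = {
            "linear": LinearRegression(),
            "forest": RandomForestRegressor(n_estimators=100, random_state=0),
        }
        preds = {}
        for name, model in candidates.items():
            model.fit(X_train, y_train)
            preds[name] = model.predict(X_val)
        # For each cell, keep the candidate with the lowest validation MSE.
        best = {}
        for cell in np.unique(cell_ids):
            mask = cell_ids == cell
            best[cell] = min(preds, key=lambda n: mean_squared_error(
                y_val[mask], preds[n][mask]))
        return candidates, best
    ```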

    Data Mining and Machine Learning in Astronomy

    We review the current state of data mining and machine learning in astronomy. 'Data mining' can have a somewhat mixed connotation from the point of view of a researcher in this field. If used correctly, it can be a powerful approach, holding the potential to fully exploit the exponentially increasing amount of available data and promising great scientific advances. However, if misused, it can be little more than the black-box application of complex computing algorithms that give little physical insight and provide questionable results. Here, we give an overview of the entire data mining process, from data collection through to the interpretation of results. We cover common machine learning algorithms, such as artificial neural networks and support vector machines; applications from a broad range of astronomy, emphasizing those where data mining techniques directly resulted in improved science; and important current and future directions, including probability density functions, parallel algorithms, petascale computing, and the time domain. We conclude that, so long as one carefully selects an appropriate algorithm and is guided by the astronomical problem at hand, data mining can be very much the powerful tool, and not the questionable black box.

    Path Recognition with DTW in a Distributed Environment

    The Internet of Things is a concept in which various devices are connected in a network and exchange data. With the help of Internet of Things applications, it is possible to access sensors remotely and collect data from the physical world. The collected data contains potential knowledge, which can be revealed by applying machine learning techniques. Due to the rapid development of Internet of Things applications, the amount of collected data is increasing enormously. In order to perform computations on large datasets, distributed computing technologies are used. Recognizing people's movements is a popular topic in the context of the Internet of Things. Movement patterns are usually sequential and continuous, and can therefore be encoded as time series. Since Dynamic Time Warping (DTW) is an established algorithm for processing time series data, it is chosen as the similarity measure for different movement patterns, and movements are classified based on the DTW results. In this thesis, we provide an implementation for the recognition of movement patterns. The prototype is built on Apache Spark and Apache Hadoop and uses their distributed computation capabilities. In an experiment, data from test subjects is collected and evaluated, and the algorithm's performance and accuracy are measured.
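    For reference, the core DTW computation and the nearest-neighbor classification built on it fit in a few lines. This is a minimal single-machine version, shown only to make the approach concrete; the thesis prototype distributes such comparisons across Spark/Hadoop, and the function names here are illustrative.

    ```python
    import numpy as np

    def dtw_distance(a, b):
        """Dynamic Time Warping distance between two 1-D sequences."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j],      # insertion
                                     D[i, j - 1],      # deletion
                                     D[i - 1, j - 1])  # match
        return D[n, m]

    def classify_movement(query, templates):
        """templates: list of (sequence, label); return the nearest label."""
        return min(templates, key=lambda t: dtw_distance(query, t[0]))[1]
    ```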

    Using the information embedded in the testing sample to break the limits caused by the small sample size in microarray-based classification

    Background: Microarray-based tumor classification is characterized by a very large number of features (genes) and a small number of samples. In such cases, statistical techniques cannot determine which genes are correlated with each tumor type. A popular solution is the use of a subset of pre-specified genes. However, molecular variations are generally correlated with a large number of genes, and a gene that is not correlated with a disease on its own may, in combination with other genes, become informative.
    Results: In this paper, we propose a new classification strategy that can reduce the effect of over-fitting without the need to pre-select a small subset of genes. Our solution works by taking advantage of the information embedded in the testing samples. We note that a well-defined classification algorithm works best when the data is properly labeled; hence, our classification algorithm discriminates all samples best when the testing sample is assumed to belong to the correct class. We compare our solution with several well-known alternatives for tumor classification on a variety of publicly available data sets. Our approach consistently leads to better classification results.
    Conclusion: Studies indicate that thousands of samples may be required to extract useful statistical information from microarray data. Herein, it is shown that this problem can be circumvented by using the information embedded in the testing samples.
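    The labeling trick can be made concrete with a small transductive sketch. This is one possible reading of the strategy, under assumed details (the classifier and the separability measure are placeholders, not the authors' exact choices): the test sample is tentatively assigned each candidate class in turn, and the label retained is the one under which the combined data is discriminated best.

    ```python
    import numpy as np
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.svm import LinearSVC

    def transductive_label(X_train, y_train, x_test):
        """Pick the test label that makes the combined data most separable."""
        best_label, best_score = None, -np.inf
        for c in np.unique(y_train):
            X = np.vstack([X_train, x_test[None, :]])
            y = np.append(y_train, c)
            # How well does a classifier discriminate under this labeling?
            score = cross_val_score(LinearSVC(), X, y, cv=LeaveOneOut()).mean()
            if score > best_score:
                best_label, best_score = c, score
        return best_label
    ```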

    Machine Learning and Neutron Sensing in Mobile Nuclear Threat Detection

    A proof-of-concept (PoC) neutron/gamma-ray mobile threat detection system was constructed at Oak Ridge National Laboratory. This device, the Dual Detection Localization and Identification (DDLI) system, was designed to detect threat sources at standoff distance using neutron and gamma-ray coded aperture imaging. A major research goal of the project was to understand the benefit of neutron sensing in the mobile threat search scenario. To this end, a series of mobile measurements were conducted with the completed DDLI PoC. These measurements indicated that high detection rates would be possible using neutron counting alone in a fully instrumented system: for a Cf-252 source emitting 280,000 neutrons per second placed 15.9 meters away, a 4σ detection rate of 99.3% was expected at 5 m/s. These results support the conclusion that neutron sensing enhances the detection capabilities of systems like the DDLI compared to gamma-only platforms. Advanced algorithms were also investigated to fuse neutron and gamma coded aperture images and suppress background. In a simulated 1-D coded aperture imaging study, machine learning algorithms using both neutron and gamma-ray data outperformed gamma-only threshold methods for alarming on weapons-grade plutonium. In a separate study, a Random Forest classifier was trained on a source-injection dataset from the Large Area Imager, a mobile gamma-ray coded aperture system. Geant4 simulations of weapons-grade plutonium (WGPu) were combined with background data measured by the Large Area Imager to create nearly 4000 coded aperture images. At 30-meter standoff and 10 m/s, the Random Forest classifier was able to detect WGPu with error rates as low as 0.65% without spectroscopic information, and a background-subtracting filter further reduced this error rate to 0.2%. Finally, a background subtraction method based on principal component analysis was shown to improve detection by over 150% in figure of merit.
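    The PCA-based background suppression mentioned last can be sketched as follows. This is an illustrative reconstruction, not the dissertation's code: source-free coded aperture images define the dominant background modes, and each new image's projection onto those modes is removed so that residual source structure stands out.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA

    def fit_background_modes(bg_images, n_modes=10):
        """bg_images: (n_images, n_pixels) flattened source-free images."""
        pca = PCA(n_components=n_modes)
        pca.fit(bg_images)
        return pca

    def suppress_background(pca, image):
        """Subtract the image's reconstruction from the background modes."""
        recon = pca.inverse_transform(pca.transform(image[None, :]))[0]
        return image - recon
    ```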