1,613 research outputs found

    Impact of Biases in Big Data

    The underlying paradigm of big data-driven machine learning reflects the desire to derive better conclusions simply by analyzing more data, without the need to look at theory and models. Is simply having more data always helpful? In 1936, The Literary Digest collected 2.3M filled-in questionnaires to predict the outcome of that year's US presidential election. The outcome of this big data prediction proved to be entirely wrong, whereas George Gallup needed only 3K handpicked people to make an accurate prediction. Generally, biases occur in machine learning whenever the distributions of the training set and the test set differ. In this work, we provide a review of different sorts of biases in (big) data sets in machine learning. We provide definitions and discussions of the most commonly appearing biases in machine learning: class imbalance and covariate shift. We also show how these biases can be quantified and corrected. This work is an introductory text for both researchers and practitioners to become more aware of this topic and thus to derive more reliable models for their learning problems.
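    As a rough illustration of correcting covariate shift, the sketch below reweights training samples by the ratio of test to training frequencies of a discrete covariate. This is a generic importance-weighting scheme, not necessarily the procedure used in the paper; the function names and the toy data are hypothetical.

```python
from collections import Counter


def importance_weights(train_groups, test_groups):
    """Return w(g) = p_test(g) / p_train(g) for every group seen in training."""
    train_freq = Counter(train_groups)
    test_freq = Counter(test_groups)
    n_train, n_test = len(train_groups), len(test_groups)
    return {
        g: (test_freq[g] / n_test) / (train_freq[g] / n_train)
        for g in train_freq
    }


def weighted_mean(values, groups, weights):
    """Average of values where each sample counts with the weight of its group."""
    total = sum(weights[g] * v for v, g in zip(values, groups))
    norm = sum(weights[g] for g in groups)
    return total / norm


# Toy, made-up data: the training sample over-represents group "a".
train_groups = ["a"] * 8 + ["b"] * 2
train_votes = [1] * 8 + [0] * 2          # every "a" votes 1, every "b" votes 0
test_groups = ["a"] * 5 + ["b"] * 5      # the target population is balanced

weights = importance_weights(train_groups, test_groups)
naive_estimate = sum(train_votes) / len(train_votes)
corrected_estimate = weighted_mean(train_votes, train_groups, weights)
print(naive_estimate, corrected_estimate)
```

    In this toy setting the unweighted estimate follows the over-represented group, while the reweighted estimate reflects the balanced target population. The correction only works for groups that actually appear in the training sample.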

    Star–galaxy classification in the Dark Energy Survey Y1 data set

    We perform a comparison of different approaches to star–galaxy classification using the broadband photometric data from Year 1 of the Dark Energy Survey. This is done by performing a wide range of tests with and without external ‘truth’ information, which can be ported to other similar data sets. We make a broad evaluation of the performance of the classifiers in two science cases with DES data that are most affected by this systematic effect: large-scale structure and Milky Way studies. In general, even though the default morphological classifiers used for DES Y1 cosmology studies are sufficient to maintain a low level of systematic contamination from stellar misclassification, contamination can be reduced to the O(1 per cent) level by using multi-epoch and infrared information from external data sets. For Milky Way studies, the stellar sample can be augmented by ~20 per cent for a given flux limit.

    Star-galaxy classification in the dark energy survey Y1 data set

    We perform a comparison of different approaches to star-galaxy classification using the broadband photometric data from Year 1 of the Dark Energy Survey. This is done by performing a wide range of tests with and without external 'truth' information, which can be ported to other similar data sets. We make a broad evaluation of the performance of the classifiers in two science cases with DES data that are most affected by this systematic effect: large-scale structure and Milky Way studies. In general, even though the default morphological classifiers used for DES Y1 cosmology studies are sufficient to maintain a low level of systematic contamination from stellar misclassification, contamination can be reduced to the O(1 per cent) level by using multi-epoch and infrared information from external data sets. For Milky Way studies, the stellar sample can be augmented by ~20 per cent for a given flux limit. Funding: FINEP - FINANCIADORA DE ESTUDOS E PROJETOS; FAPERJ - FUNDAÇÃO DE AMPARO À PESQUISA DO ESTADO DO RIO DE JANEIRO; CNPQ - CONSELHO NACIONAL DE DESENVOLVIMENTO CIENTÍFICO E TECNOLÓGICO; MCTIC - MINISTÉRIO DA CIÊNCIA, TECNOLOGIA, INOVAÇÕES E COMUNICAÇÕES. Foreign funding agencies also supported this research.

    Machine Learning Classification of SDSS Transient Survey Images

    We show that multiple machine learning algorithms can match human performance in classifying transient imaging data from the Sloan Digital Sky Survey (SDSS) supernova survey into real objects and artefacts. This is a first step in any transient science pipeline and is currently still done by humans, but future surveys such as the Large Synoptic Survey Telescope (LSST) will necessitate fully machine-enabled solutions. Using features trained from eigenimage analysis (principal component analysis, PCA) of single-epoch g, r and i-difference images, we can reach a completeness (recall) of 96 per cent, while only incorrectly classifying at most 18 per cent of artefacts as real objects, corresponding to a precision (purity) of 84 per cent. In general, random forests performed best, followed by the k-nearest neighbour and the SkyNet artificial neural net algorithms, compared to other methods such as naïve Bayes and kernel support vector machine. Our results show that PCA-based machine learning can match human success levels and can naturally be extended by including multiple epochs of data, transient colours and host galaxy information, which should allow for significant further improvements, especially at low signal-to-noise. Comment: 14 pages, 8 figures. In this version extremely minor adjustments to the paper were made, e.g. Figure 5 is now easier to view in greyscale.
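    To make the quoted metrics concrete, the sketch below computes completeness, purity and the false positive rate from confusion-matrix counts. The counts are hypothetical, chosen only so that the resulting rates resemble those in the abstract; the actual sample sizes are not given in this listing.

```python
def completeness(tp, fn):
    """Recall: fraction of real objects that are classified as real."""
    return tp / (tp + fn)


def purity(tp, fp):
    """Precision: fraction of objects classified as real that are real."""
    return tp / (tp + fp)


def false_positive_rate(fp, tn):
    """Fraction of artefacts that are wrongly classified as real."""
    return fp / (fp + tn)


# Hypothetical counts for 100 real objects and 100 artefacts.
tp, fn = 96, 4     # real objects: found / missed
fp, tn = 18, 82    # artefacts: wrongly accepted / correctly rejected

recall_value = completeness(tp, fn)
purity_value = purity(tp, fp)
fpr_value = false_positive_rate(fp, tn)
print(f"completeness={recall_value:.2f} purity={purity_value:.2f} fpr={fpr_value:.2f}")
```

    Note that purity depends on the ratio of real objects to artefacts in the sample, whereas completeness and the false positive rate do not, so the same classifier yields a different purity on a differently balanced data set.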

    Application of Synthetic Informative Minority Over-Sampling (SIMO) Algorithm Leveraging Support Vector Machine (SVM) On Small Datasets with Class Imbalance

    Developing predictive models for classification problems on imbalanced datasets is one of the basic difficulties in data mining and decision analytics. A classifier’s performance will decline dramatically when applied to an imbalanced dataset. Standard classifiers such as logistic regression and Support Vector Machine (SVM) are appropriate for balanced training sets but provide suboptimal classification results when used on unbalanced datasets. Using prediction accuracy as the performance metric encourages a bias towards the majority class, so the rare instances remain undetected even though the model reports a high overall precision. Minority instances may also be treated as noise, and vice versa (Haixiang et al., 2017). A wide range of class-imbalanced learning techniques have been introduced to overcome these problems, although each has its advantages and shortcomings. This paper provides details on the behavior of a novel imbalanced learning technique, the Synthetic Informative Minority Over-Sampling (SIMO) algorithm leveraging Support Vector Machine (SVM), on small datasets of fewer than 200 records. The base classifiers, logistic regression and SVM, are used to validate the impact of SIMO on classifier performance in terms of the metrics G-mean and Area Under Curve. SIMO is compared with the other algorithms SMOTE, Borderline-SMOTE and ADASYN to evaluate its performance relative to them.
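    The two metrics named above can be sketched in plain Python as follows. G-mean is taken here as the geometric mean of sensitivity and specificity, and the area under the ROC curve is computed through its pairwise-ranking interpretation; the labels and scores are made-up toy data, not results from the paper.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 2 minority (1) and 8 majority (0) samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.05]

accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
gm = g_mean(y_true, always_majority)
area = auc(y_true, scores)
print(accuracy, gm, area)
```

    A classifier that always predicts the majority class scores well on accuracy in this toy example but gets a G-mean of zero, which is why G-mean is a more informative metric on imbalanced data.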

    A NEW METHODOLOGY FOR IDENTIFYING INTERFACE RESIDUES INVOLVED IN BINDING PROTEIN COMPLEXES

    Genome-sequencing projects with advanced technologies have rapidly increased the number of known protein sequences, and the demand for identifying protein interaction sites has increased significantly because of its impact on understanding cellular processes, biochemical events and drug design studies. However, the capacity of current wet laboratory techniques is not enough to handle the exponentially growing protein sequence data; therefore, sequence-based predictive methods for identifying protein interaction sites have drawn increasing interest. In this article, we propose a new predictive model that can be valuable as a first approach for guiding experimental methods that investigate protein-protein interactions and localize the specific interface residues. The proposed method extracts a wide range of features from protein sequences. The random forests framework is redesigned to use these features effectively and to handle the imbalanced data classification problem commonly encountered in binding site prediction. The method is evaluated with 2,829 interface residues and 24,616 non-interface residues extracted from 99 polypeptide chains in the Protein Data Bank. The experimental results show that the proposed method performs significantly better than two other conventional predictive methods and can reliably predict residues involved in protein interaction sites. As blind tests, the proposed method predicts interaction sites and constructs three protein complexes: the DnaK molecular chaperone system, 1YUW and 1DKG, which provide new insight into the sequence-function relationship. Finally, the robustness of the proposed method is assessed by evaluating the performance obtained from four different ensemble methods.
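    The class counts quoted above give a sense of how imbalanced the data set is. The sketch below computes the imbalance ratio and a set of inverse-frequency class weights using the common heuristic n_samples / (n_classes * class_count). This weighting is an assumption for illustration only; the abstract does not say how the redesigned random forest handles the imbalance.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weights: n_samples / (n_classes * count) per class."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Counts taken from the abstract; the label names are hypothetical.
class_counts = {"interface": 2829, "non_interface": 24616}

imbalance_ratio = class_counts["non_interface"] / class_counts["interface"]
class_weights = balanced_class_weights(class_counts)

print(f"imbalance ratio: {imbalance_ratio:.2f}")
for label, weight in class_weights.items():
    print(f"{label}: weight {weight:.3f}")
```

    With these weights each class contributes the same total weight to a training loss, so the roughly 8.7 to 1 majority of non-interface residues no longer dominates.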