Impact of Biases in Big Data

Authors: Glauner Patrick; State Radu; Valtchev Petko
Publication date: 01/01/2018

The underlying paradigm of big data-driven machine learning reflects the desire to derive better conclusions from simply analyzing more data, without the necessity of looking at theory and models. Is simply having more data always helpful? In 1936, The Literary Digest collected 2.3M filled-in questionnaires to predict the outcome of that year's US presidential election. The outcome of this big data prediction proved to be entirely wrong, whereas George Gallup needed only 3K handpicked people to make an accurate prediction. Generally, biases occur in machine learning whenever the distributions of the training set and the test set are different. In this work, we provide a review of different sorts of biases in (big) data sets in machine learning. We provide definitions and discussions of the most commonly appearing biases in machine learning: class imbalance and covariate shift. We also show how these biases can be quantified and corrected. This work is an introductory text for both researchers and practitioners to become more aware of this topic and thus to derive more reliable models for their learning problems.

arXiv.org e-Print Archive

Open Repository and Bibliography - Luxembourg
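The abstract above on biases in big data mentions that covariate shift can be quantified and corrected. As a rough illustration only (not the authors' method), the sketch below applies importance weighting, w(g) = p_target(g) / p_train(g), to a toy poll in which one group is over-represented. The group labels, vote values, and population shares are all hypothetical.

```python
from collections import Counter


def importance_weights(train_groups, target_shares):
    """Return w(g) = p_target(g) / p_train(g) for each group seen in the sample."""
    counts = Counter(train_groups)
    n = len(train_groups)
    return {g: target_shares[g] / (counts[g] / n) for g in counts}


def weighted_mean(values, groups, weights):
    """Mean of `values`, with each sample weighted by the weight of its group."""
    total = sum(weights[g] * v for v, g in zip(values, groups))
    norm = sum(weights[g] for g in groups)
    return total / norm


# Hypothetical toy poll: group "a" makes up 80% of the sample
# but is assumed to be only 40% of the real population.
groups = ["a"] * 8 + ["b"] * 2
votes = [1] * 6 + [0] * 2 + [0] * 2   # 1 = supports the candidate
target = {"a": 0.4, "b": 0.6}         # assumed population shares

w = importance_weights(groups, target)
naive = sum(votes) / len(votes)                # unweighted sample mean
corrected = weighted_mean(votes, groups, w)    # reweighted to the population
```

In this toy setting the unweighted estimate overstates support because the over-sampled group favours the candidate, and the reweighted estimate is lower. The correction only works if the target shares are known and every target group appears in the sample.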

Star–galaxy classification in the Dark Energy Survey Y1 data set

Authors: DES Collaboration; Santiago Basilio Xavier; Sevilla Noarbe Ignacio
Publication date: 01/01/2018

We perform a comparison of different approaches to star–galaxy classification using the broadband photometric data from Year 1 of the Dark Energy Survey. This is done by performing a wide range of tests with and without external ‘truth’ information, which can be ported to other similar data sets. We make a broad evaluation of the performance of the classifiers in two science cases with DES data that are most affected by this systematic effect: large-scale structure and Milky Way studies. In general, even though the default morphological classifiers used for DES Y1 cosmology studies are sufficient to maintain a low level of systematic contamination from stellar misclassification, contamination can be reduced to the O(1 per cent) level by using multi-epoch and infrared information from external data sets. For Milky Way studies, the stellar sample can be augmented by ~20 per cent for a given flux limit.

Lume 5.8

Star-galaxy classification in the Dark Energy Survey Y1 data set

Authors: Abbott T.; Abdalla F.; Aleksic J.; Allam S.; Avestruz C.; Balbinot E.; Banerji M.; Bechtol K.; Bertin E.; Bonnett C.; Brooks D.; Brunner R.; Carnero-Rosell A.; Carrasco-Kind M.; Carretero J.; Choi A.; Cunha C.; da Costa L.; Davis C.; de Vicente J.; Desai S.; Doel P.; Drlica-Wagner A.; Fernandez E.; Flaugher B.; Frieman J.; Garcia-Bellido J.; Gaztanaga E.; Giannantonio T.; Gruen D.; Gruendl R.; Gschwend J.; Gutierrez G.; Hollowood D. L.; Honscheid K.; Hoyle B.; James D.; Jeltema T.; Kim E.; Kirk D.; Krause E.; Kuehn K.; Lahav O.; Li T. S.; Lima M.; Maia M. A. G.; March M.; Marcha M. J.; McMahon R. G.; Menanteau F.; Miquel R.; Moraes B.; Nord B.; Ogando R. L. C.; Plazas A. A.; Ross A. J.; Rykoff E. S.; Sanchez E.; Santiago B.; Scarpine V.; Schindler R.; Schubnell M.; Sevilla-Noarbe I.; Sheldon E.; Smith M.; Smith R. C.; Soares-Santos M.; Sobreira F.; Soumagnac M. T.; Suchyta E.; Swanson M. E. C.; Tarle G.; Thomas D.; Tucker D. L.; Walker A. R.; Wei K.; Wester W.; Yanny B.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 30/03/2020

We perform a comparison of different approaches to star-galaxy classification using the broadband photometric data from Year 1 of the Dark Energy Survey. This is done by performing a wide range of tests with and without external 'truth' information, which can be ported to other similar data sets. We make a broad evaluation of the performance of the classifiers in two science cases with DES data that are most affected by this systematic effect: large-scale structure and Milky Way studies. In general, even though the default morphological classifiers used for DES Y1 cosmology studies are sufficient to maintain a low level of systematic contamination from stellar misclassification, contamination can be reduced to the O(1 per cent) level by using multi-epoch and infrared information from external data sets. For Milky Way studies, the stellar sample can be augmented by ~20 per cent for a given flux limit.

Volume 481, issue 4, pages 5451-5469.

Funding: FINEP (Financiadora de Estudos e Projetos); FAPERJ (Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro); CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico); MCTIC (Ministério da Ciência, Tecnologia, Inovações e Comunicações). Foreign funding agencies supported this research; for more information, see the article.

Repositorio da Producao Cientifica e Intelectual da Unicamp

Machine Learning Classification of SDSS Transient Survey Images

Authors: Bassett B. A.; Buisson L. du; Sivanandam N.; Smith M.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 20/11/2015

We show that multiple machine learning algorithms can match human performance in classifying transient imaging data from the Sloan Digital Sky Survey (SDSS) supernova survey into real objects and artefacts. This is a first step in any transient science pipeline and is currently still done by humans, but future surveys such as the Large Synoptic Survey Telescope (LSST) will necessitate fully machine-enabled solutions. Using features trained from eigenimage analysis (principal component analysis, PCA) of single-epoch g, r and i-difference images, we can reach a completeness (recall) of 96 per cent, while only incorrectly classifying at most 18 per cent of artefacts as real objects, corresponding to a precision (purity) of 84 per cent. In general, random forests performed best, followed by the k-nearest neighbour and the SkyNet artificial neural net algorithms, compared to other methods such as naïve Bayes and kernel support vector machine. Our results show that PCA-based machine learning can match human success levels and can naturally be extended by including multiple epochs of data, transient colours and host galaxy information which should allow for significant further improvements, especially at low signal-to-noise.

Comment: 14 pages, 8 figures. In this version, extremely minor adjustments were made to the paper, e.g. Figure 5 is now easier to view in greyscale.

arXiv.org e-Print Archive

CiteSeerX
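The SDSS abstract above relates recall (96 per cent), the fraction of artefacts misclassified as real (18 per cent), and precision (84 per cent). Precision also depends on the relative number of real objects and artefacts, which the abstract does not state. The sketch below works through the arithmetic under the assumption of equal class sizes; the sample counts are hypothetical, and the function name is illustrative.

```python
def precision_from_rates(recall, false_positive_rate, n_real, n_artefact):
    """Precision = TP / (TP + FP), built from per-class rates and class sizes."""
    true_positives = recall * n_real
    false_positives = false_positive_rate * n_artefact
    return true_positives / (true_positives + false_positives)


# Assumption: equal numbers of real objects and artefacts (hypothetical counts).
balanced = precision_from_rates(0.96, 0.18, n_real=1000, n_artefact=1000)

# Same rates, but with five artefacts per real object (also hypothetical).
skewed = precision_from_rates(0.96, 0.18, n_real=1000, n_artefact=5000)
```

With equal class sizes the result is about 0.84, which is consistent with the quoted precision. With the same rates and many more artefacts than real objects, precision drops sharply, so a quoted purity only makes sense alongside the class ratio.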

Application of Synthetic Informative Minority Over-Sampling (SIMO) Algorithm Leveraging Support Vector Machine (SVM) On Small Datasets with Class Imbalance

Author: Fakkeriah Kallappanamatt Akshatha
Publication venue: Dublin Institute of Technology
Publication date: 01/01/2018

Developing predictive models for classification problems on imbalanced datasets is one of the basic difficulties in data mining and decision analytics. A classifier's performance will decline dramatically when applied to an imbalanced dataset. Standard classifiers such as logistic regression and Support Vector Machine (SVM) are appropriate for balanced training sets but provide suboptimal classification results when used on unbalanced datasets. Using prediction accuracy as the performance metric encourages a bias towards the majority class, while the rare instances remain undetected even though the model reports high overall precision. Minority instances might also be treated as noise, and vice versa (Haixiang et al., 2017). A wide range of class-imbalance learning techniques has been introduced to overcome these problems, each with its own advantages and shortcomings. This paper provides details on the behavior of a novel imbalanced learning technique, the Synthetic Informative Minority Over-Sampling (SIMO) algorithm leveraging Support Vector Machine (SVM), on small datasets of fewer than 200 records. The base classifiers, logistic regression and SVM, are used to validate the impact of SIMO on classifier performance in terms of the metrics G-mean and Area Under Curve. SIMO is compared with the other algorithms SMOTE, Borderline-SMOTE and ADASYN to evaluate its performance relative to theirs.

Arrow@TUDublin
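The SIMO abstract above evaluates classifiers with G-mean rather than plain accuracy. As a minimal sketch, taking the usual definition of G-mean as the geometric mean of sensitivity and specificity, the code below shows why accuracy can mislead on an imbalanced set. The confusion-matrix counts are hypothetical and do not come from the paper.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (minority recall) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


def accuracy(tp, fn, tn, fp):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical small dataset: 10 minority and 180 majority records.
# Classifier A labels every record as majority.
always_majority = dict(tp=0, fn=10, tn=180, fp=0)
# Classifier B finds 8 of the 10 minority records, with 18 false alarms.
mixed = dict(tp=8, fn=2, tn=162, fp=18)

acc_a, gm_a = accuracy(**always_majority), g_mean(**always_majority)
acc_b, gm_b = accuracy(**mixed), g_mean(**mixed)
```

Classifier A has the higher accuracy but a G-mean of zero, because it never detects the minority class. Classifier B gives up some accuracy and achieves a much higher G-mean.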

A NEW METHODOLOGY FOR IDENTIFYING INTERFACE RESIDUES INVOLVED IN BINDING PROTEIN COMPLEXES

Author: Jeong Jong Cheol
Publication venue: 'Paleontological Institute at The University of Kansas'
Publication date: 01/01/2011

Genome-sequencing projects with advanced technologies have rapidly increased the number of known protein sequences, and demand for identifying protein interaction sites has grown significantly because of its impact on understanding cellular processes, biochemical events and drug design studies. However, the capacity of current wet laboratory techniques is not enough to handle the exponentially growing protein sequence data; therefore, sequence-based predictive methods for identifying protein interaction sites have drawn increasing interest. In this article, a new predictive model is proposed that can be valuable as a first approach for guiding experimental methods that investigate protein-protein interactions and localize the specific interface residues. The proposed method extracts a wide range of features from protein sequences. The random forests framework is redesigned to use these features effectively and to handle the problems of imbalanced data classification commonly encountered in binding site predictions. The method is evaluated with 2,829 interface residues and 24,616 non-interface residues extracted from 99 polypeptide chains in the Protein Data Bank. The experimental results show that the proposed method performs significantly better than two other conventional predictive methods and can reliably predict residues involved in protein interaction sites. As blind tests, the proposed method predicts interaction sites and constructs three protein complexes: the DnaK molecular chaperone system, 1YUW and 1DKG, which provide new insight into the sequence-function relationship. Finally, the robustness of the proposed method is assessed by evaluating the performances obtained from four different ensemble methods.

KU ScholarWorks
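The interface-residue abstract above reports 2,829 interface and 24,616 non-interface residues. The sketch below quantifies that imbalance and derives inverse-frequency class weights, a generic heuristic for imbalanced classification. This is not the redesigned random forests framework the abstract describes, and the class labels and function name are illustrative.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * count)
            for label, count in class_counts.items()}


# Residue counts quoted in the abstract.
counts = {"interface": 2829, "non_interface": 24616}

# Majority samples per minority sample.
imbalance_ratio = counts["non_interface"] / counts["interface"]

# Weights that give each class the same total weight in a weighted loss.
weights = balanced_class_weights(counts)
```

With these counts there are roughly 8.7 non-interface residues per interface residue, so the interface class gets the larger weight. Each class then contributes the same total weight, which is one simple way to keep a learner from ignoring the minority class.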