
    Instance and feature weighted k-nearest-neighbors algorithm

    We present a novel method that aims to provide a more stable selection of feature subsets when variations in the training process occur. This is accomplished by an instance-weighting process (assigning different degrees of importance to instances) as a preprocessing step to a feature weighting method that is independent of the learner, and then using both sets of computed weights in a standard nearest-neighbours classifier. We report extensive experimentation on well-known benchmark datasets as well as on some challenging microarray gene expression problems. Our results show increases in stability for most subset sizes and most problems, without compromising prediction accuracy.
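    To make the idea concrete, here is a minimal sketch (not the authors' code) of a nearest-neighbours classifier that uses both kinds of weights: features are weighted inside the distance, and each neighbour's vote is scaled by its instance weight. The weight vectors, the squared-Euclidean distance form, and all names are illustrative assumptions.

        import numpy as np

        def weighted_knn_predict(X_train, y_train, x, feature_w, instance_w, k=5):
            """Classify x with a k-NN rule under a feature-weighted distance,
            scaling each neighbour's vote by its instance weight (assumed given)."""
            # Feature-weighted squared Euclidean distance to every training instance.
            d = ((X_train - x) ** 2 * feature_w).sum(axis=1)
            nearest = np.argsort(d)[:k]
            votes = {}
            for i in nearest:
                # An instance's vote counts in proportion to its precomputed weight.
                votes[y_train[i]] = votes.get(y_train[i], 0.0) + instance_w[i]
            return max(votes, key=votes.get)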

    Prediction of infectious disease epidemics via weighted density ensembles

    Accurate and reliable predictions of infectious disease dynamics can be valuable to public health organizations that plan interventions to decrease or prevent disease transmission. A great variety of models have been developed for this task, using different model structures, covariates, and targets for prediction. Experience has shown that the performance of these models varies; some tend to do better or worse in different seasons or at different points within a season. Ensemble methods combine multiple models to obtain a single prediction that leverages the strengths of each model. We considered a range of ensemble methods that each form a predictive density for a target of interest as a weighted sum of the predictive densities from component models. In the simplest case, equal weight is assigned to each component model; in the most complex case, the weights vary with the region, prediction target, week of the season when the predictions are made, a measure of component model uncertainty, and recent observations of disease incidence. We applied these methods to predict measures of influenza season timing and severity in the United States, both at the national and regional levels, using three component models. We trained the models on retrospective predictions from 14 seasons (1997/1998 - 2010/2011) and evaluated each model's prospective, out-of-sample performance in the five subsequent influenza seasons. In this test phase, the ensemble methods showed overall performance that was similar to the best of the component models, but offered more consistent performance across seasons than the component models. Ensemble methods offer the potential to deliver more reliable predictions to public health decision makers.
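    The core construction is simple to state: the ensemble predictive density is a convex combination of the component models' predictive densities. A minimal sketch follows, with hypothetical component densities and weights (in the paper, the weights may further depend on region, target, week, model uncertainty, and recent incidence).

        import numpy as np
        from scipy.stats import norm

        def ensemble_density(component_pdfs, weights, y):
            """Evaluate the ensemble predictive density at target value y.
            weights: nonnegative, summing to one (equal in the simplest case)."""
            weights = np.asarray(weights, dtype=float)
            assert np.all(weights >= 0) and abs(weights.sum() - 1.0) < 1e-9
            return sum(w * pdf(y) for w, pdf in zip(weights, component_pdfs))

        # Hypothetical component models predicting, say, the peak week of a season.
        components = [norm(24.0, 2.0).pdf, norm(26.0, 1.5).pdf, norm(25.0, 3.0).pdf]
        print(ensemble_density(components, [1/3, 1/3, 1/3], y=25.0))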

    Sparsity Oriented Importance Learning for High-dimensional Linear Regression

    With model selection uncertainty now well recognized as non-negligible, data analysts should no longer be satisfied with the output of a single final model from a model selection process, regardless of its sophistication. To improve reliability and reproducibility in model choice, one constructive approach is to make good use of a sound variable importance measure. Although interesting importance measures are available and increasingly used in data analysis, little theoretical justification has been provided for them. In this paper, we propose a new variable importance measure, sparsity oriented importance learning (SOIL), for high-dimensional regression from a sparse linear modeling perspective, taking into account variable selection uncertainty via a sensible model weighting. The SOIL method is theoretically shown to have the inclusion/exclusion property: when the model weights are properly concentrated around the true model, the SOIL importance can well separate the variables in the true model from the rest. In particular, even if the signal is weak, SOIL rarely gives variables not in the true model importance values significantly higher than those of variables in the true model. Extensive simulations in several illustrative settings and real data examples with guided simulations show the desirable properties of the SOIL importance in contrast to other importance measures.
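    The shape of such a measure can be illustrated with a small sketch: given a set of candidate models and a weight for each (the paper derives the weights from a sensible model-weighting scheme; the weights below are placeholders), a variable's SOIL-style importance is the total weight of the candidate models that include it.

        import numpy as np

        def soil_style_importance(candidate_models, weights, n_vars):
            """candidate_models: list of sets of variable indices.
            weights: one nonnegative weight per model, summing to one.
            Returns an importance in [0, 1] for each of the n_vars variables."""
            importance = np.zeros(n_vars)
            for model, w in zip(candidate_models, weights):
                importance[list(model)] += w
            return importance

        # Hypothetical candidate models over 4 variables.
        models = [{0, 1}, {0, 2}, {0}]
        print(soil_style_importance(models, [0.5, 0.3, 0.2], n_vars=4))
        # Variable 0 appears in every weighted model, so its importance is 1.0.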

    Transforming Graph Representations for Statistical Relational Learning

    Relational data representations have become an increasingly important topic due to the recent proliferation of network datasets (e.g., social, biological, information networks) and a corresponding increase in the application of statistical relational learning (SRL) algorithms to these domains. In this article, we examine a range of representation issues for graph-based relational data. Since the choice of relational data representation for the nodes, links, and features can dramatically affect the capabilities of SRL algorithms, we survey approaches and opportunities for relational representation transformation designed to improve the performance of these algorithms. This leads us to introduce an intuitive taxonomy for data representation transformations in relational domains that incorporates link transformation and node transformation as symmetric representation tasks. In particular, the transformation tasks for both nodes and links include (i) predicting their existence, (ii) predicting their label or type, (iii) estimating their weight or importance, and (iv) systematically constructing their relevant features. We motivate our taxonomy through detailed examples and use it to survey and compare competing approaches for each of these tasks. We also discuss general conditions for transforming links, nodes, and features. Finally, we highlight challenges that remain to be addressed.
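    As a concrete instance of one task in this taxonomy (predicting link existence), here is a minimal sketch using a common-neighbours score; the adjacency structure and scoring rule are standard illustrative choices, not ones prescribed by the article.

        def common_neighbour_scores(adj, candidate_pairs):
            """adj: dict mapping each node to its set of neighbours.
            Returns {(u, v): score}; higher scores suggest a likelier missing link."""
            return {(u, v): len(adj[u] & adj[v]) for u, v in candidate_pairs}

        # Hypothetical graph with edges a-b, a-c, b-c, c-d.
        adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
        print(common_neighbour_scores(adj, [("a", "d"), ("b", "d")]))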

    Towards more reliable feature evaluations for classification

    In this thesis we study feature subset selection and feature weighting algorithms. Our aim is to make their output more stable and more useful when used to train a classifier. We begin by defining the concept of stability and selecting a measure to assess the output of the feature selection process. We then study different sources of instability and propose modifications of classic algorithms that improve their stability. We propose a modification of wrapper algorithms that takes otherwise unused information into account to overcome an intrinsic source of instability for these algorithms: the feature assessment being a random variable that depends on the particular training subsample. Our version accumulates the evaluation results of each feature at each iteration to average out the effect of the randomness. Another novel proposal is to make wrappers evaluate the remaining set of features at each step, to overcome another source of instability: the randomness of the algorithms themselves. In this case, by evaluating the non-selected set of features, the initial choice of variables is better informed. These modifications add little computational overhead and deliver better results, both in terms of stability and of predictive power. We finally tackle another source of instability: the differential contribution of the instances to the feature assessment. We present a framework for combining almost any instance weighting algorithm with any feature weighting algorithm. Our combinations of algorithms deliver more stable results for the various feature weighting algorithms we have tested. Finally, we present a deeper integration of instance weighting with feature weighting by modifying the Simba algorithm, which delivers even better results in terms of stability.
    The focus of this thesis is to measure, study, and improve the stability of feature subset selection (FSS) and feature weighting (FW) algorithms in a supervised learning context. The general purpose of FSS in a classification context is to improve prediction accuracy; we argue that there is another major challenge in FSS and FW: the stability of the results. Having chosen a stability measure from among those studied, we propose improvements to a very popular algorithm: Relief. We analyze several distance measures besides the original one and study their effect on accuracy, redundancy detection, and stability. We also test different ways of using the weights computed at each step to influence the distance calculation, in a manner similar to another FW algorithm: Simba. We further improve its stability by increasing the contribution of the feature weights to the distance calculation as time advances, to minimize the impact of the random selection of the first instances. As for wrapper algorithms, we modify them to take previously ignored information into account in order to overcome an intrinsic source of instability: the fact that the feature evaluation is a random variable that depends on the data subset used. Our version accumulates the results of each iteration to compensate for the random effect, whereas the original versions discard all the information gathered about each feature in a given iteration and start over in the next, leading to less stable results. Another proposal is to have these wrappers evaluate the non-selected subset of features at each iteration to avoid another source of instability. These modifications do not entail a large increase in computational cost, and their results are more stable and more useful to a classifier. Finally, we propose weighting the contribution of each instance to the feature evaluation. There may be atypical observations that should not be taken into account as much as the others; if we are trying to predict cancer using data from genetic analyses, we should give less credibility to data obtained from people exposed to high levels of radiation, even when we have no information about that exposure. Instance weighting (IW) methods aim to identify such cases and assign them lower weights. Several authors have worked on IW schemes to improve FSS, but there is no previous work on combining IW with FW. We present a framework for combining IW algorithms with FW ones. In addition, we propose a new IW algorithm based on the decision-margin concept used by some FW algorithms. Within this framework, we have tested the modifications against the original versions using several datasets from the UCI repository, DNA microarray datasets, and the datasets used in the NIPS-2003 feature selection challenge. Our combinations of instance and feature weighting algorithms yield more stable results for the various feature weighting algorithms we have studied. Finally, we present a deeper integration of instance weighting with the Simba feature selection algorithm, consisting in using the instance weights to weight the distance calculation, which yields even better results in terms of stability. The main contributions of this thesis are: (i) a framework for combining IW with FW, (ii) a review of FSS stability measures, (iii) several modifications of FSS and FW algorithms that improve their stability and the predictive power of the selected feature subset without a significant increase in computational cost, (iv) a theoretical definition of feature importance, and (v) a study of the relationship between FSS stability and feature redundancy.
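    A minimal sketch of the combination described above (under assumed interfaces, not the thesis code): precomputed instance weights scale each sampled instance's contribution to a Relief-style feature-weight update.

        import numpy as np

        def relief_with_instance_weights(X, y, instance_w, n_iter=100, seed=0):
            """Relief-style feature weighting in which instance_w (assumed to come
            from any instance weighting method) scales each update.
            Assumes every class has at least two instances."""
            rng = np.random.default_rng(seed)
            n, n_feat = X.shape
            w = np.zeros(n_feat)
            for _ in range(n_iter):
                i = rng.integers(n)
                hits = [j for j in range(n) if j != i and y[j] == y[i]]
                misses = [j for j in range(n) if y[j] != y[i]]
                hit = min(hits, key=lambda j: np.abs(X[j] - X[i]).sum())
                miss = min(misses, key=lambda j: np.abs(X[j] - X[i]).sum())
                # The sampled instance's weight scales its contribution.
                w += instance_w[i] * (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit]))
            return w / n_iter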

    Bias Reduction via End-to-End Shift Learning: Application to Citizen Science

    Citizen science projects are successful at gathering rich datasets for various applications. However, the data collected by citizen scientists are often biased; in particular, they align more with the citizens' preferences than with scientific objectives. We propose the Shift Compensation Network (SCN), an end-to-end learning scheme which learns the shift from the scientific objectives to the biased data while compensating for the shift by re-weighting the training data. Applied to bird observational data from the citizen science project eBird, we demonstrate how SCN quantifies the data distribution shift and outperforms supervised learning models that do not address the data bias. Compared with competing models in the context of covariate shift, we further demonstrate the advantage of SCN in both its effectiveness and its capability of handling massive high-dimensional data.
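    The re-weighting step can be sketched as follows (an assumed-interface illustration, not the SCN implementation): once per-example shift weights have been learned, they scale each example's contribution to the supervised loss.

        import torch
        import torch.nn.functional as F

        def shift_compensated_loss(logits, labels, shift_weights):
            """Cross-entropy in which each training example is re-weighted to
            compensate for the gap between the biased collection process and
            the scientific target distribution (weights assumed given)."""
            per_example = F.cross_entropy(logits, labels, reduction="none")
            return (shift_weights * per_example).mean()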