5 research outputs found

    Generalizing resemblance coefficients to accommodate incomplete data

    Large ecological data matrices may be incomplete for various reasons, preventing the use of standard multidimensional scaling (ordination) and cluster analysis packages. Although a few resemblance functions allow missing scores, there is no theoretical background and software support for most distance and similarity coefficients potentially applied in multivariate data analysis. We provide a general framework for a precise mathematical redefinition of a large set of resemblance functions originally developed for complete data sets with presence-absence (binary) or ratio-scale variables. Included are coefficients which consider double absences in abundance data. Potential problems with the use of these functions are discussed, with the conclusion that incompleteness of data would rarely, if ever, greatly influence the interpretability of ordinations and classifications. An R function described in the Appendix makes the framework available in R. We also provide a stand-alone WINDOWS application for users of other computer programs. The new software will allow users of standard data analysis packages to perform multivariate analysis using a wide variety of resemblance coefficients even if the data are incomplete for whatever reason.
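
    The common thread in such redefinitions is to evaluate a coefficient only over the variables scored in both objects and then rescale for the missing ones. Below is a minimal Python sketch of that idea for Euclidean distance; the function name and the Gower-style rescaling are assumptions for illustration, not the authors' exact definitions, which cover many coefficient families.

    import numpy as np

    def partial_euclidean(x, y):
        # Use only variables scored in both objects, then rescale the
        # squared distance to the full number of variables (assumed
        # Gower-style correction, not the paper's exact formula).
        mask = ~(np.isnan(x) | np.isnan(y))
        if not mask.any():
            return np.nan                      # no shared information at all
        d2 = np.sum((x[mask] - y[mask]) ** 2)  # squared distance on shared part
        return np.sqrt(d2 * x.size / mask.sum())

    x = np.array([1.0, np.nan, 3.0, 4.0])
    y = np.array([2.0, 0.5, np.nan, 1.0])
    print(partial_euclidean(x, y))  # uses variables 0 and 3 only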

    Collection of public expenditure contracts and segmentation of expense profiles at the municipal level

    Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence. Given the need to analyze how public funds are invested in Portuguese municipalities across the various types of contracts for the acquisition of goods and services, it is essential to create tools that make these investments understandable. It is also desirable to understand how these investments vary with the size of the population. In this project, the objective is to collect contract data available on the web and to create a segmentation of the various types of public expenditure, making it possible to detect anomalous deviations in the relationship between municipal public expenditure and population size. For this purpose, a web crawler was developed in the Python programming language to automatically extract public contracts from the site http://www.base.gov.pt/. Analysis of the collected data revealed a log-log relationship between population and public expenditure. A segmentation analysis based on the residuals of this relationship was then performed using data mining techniques. Several clustering algorithms were applied, in particular K-Medoids, which produced two distinct groups of expense types.
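
    The pipeline described (fit a log-log model, then cluster the residuals with K-Medoids) can be sketched compactly in Python. The data below are a synthetic stand-in, since the contracts scraped from http://www.base.gov.pt/ are not reproduced here, and the small PAM-style k-medoids is a simplified illustration rather than the exact implementation used in the dissertation.

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic stand-in for the scraped contracts: one row per municipality.
    population = rng.uniform(1e3, 5e5, size=200)
    expenditure = 50 * population ** 0.9 * rng.lognormal(0.0, 0.3, size=200)

    # Log-log relationship: log(expenditure) = a + b * log(population).
    logp, loge = np.log(population), np.log(expenditure)
    b, a = np.polyfit(logp, loge, 1)
    residuals = loge - (a + b * logp)   # deviations from the scaling law

    def k_medoids(x, k, iters=50, seed=0):
        # PAM-style alternation on a precomputed distance matrix (1-D case).
        rng = np.random.default_rng(seed)
        D = np.abs(x[:, None] - x[None, :])
        medoids = rng.choice(len(x), size=k, replace=False)
        for _ in range(iters):
            labels = np.argmin(D[:, medoids], axis=1)
            new_medoids = medoids.copy()
            for j in range(k):
                members = np.flatnonzero(labels == j)
                if members.size:
                    costs = D[np.ix_(members, members)].sum(axis=0)
                    new_medoids[j] = members[np.argmin(costs)]
            if np.array_equal(new_medoids, medoids):
                break
            medoids = new_medoids
        return np.argmin(D[:, medoids], axis=1)

    labels = k_medoids(residuals, k=2)   # two groups, as in the study
    print(f"fitted slope b = {b:.2f}, cluster sizes = {np.bincount(labels)}")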

    Predictability of Missing Data Theory to Improve U.S. Estimator’s Unreliable Data Problem

    Since the topic of improving data quality has not been addressed for the U.S. defense cost estimating discipline beyond changes in public policy, the goal of the study was to close this gap and provide empirical evidence that supports expanding options to improve software cost estimation data matrices for U.S. defense cost estimators. The purpose of this quantitative study was to test and measure the level of predictive accuracy of missing data theory techniques that were referenced as traditional approaches in the literature, compare each theory's results to a complete data matrix used in support of the U.S. defense cost estimation discipline, and determine which theories rendered incomplete and missing data sets in a single data matrix most reliable and complete under eight missing value percentages. A quantitative pre-experimental research design, a one-group pretest-posttest design with no control group, empirically tested and measured the predictive accuracy of traditional missing data theory techniques typically used in non-cost-estimating disciplines. The pre-experiments were run on a representative U.S. defense software cost estimation data matrix, a nonproprietary set of historical software effort, size, and schedule data used at Defense Acquisition University. The results revealed that single and multiple imputation were two viable options to improve data quality, since imputed values fell within 20% of the original data values in 16.4% and 18.6% of cases, respectively. This study supports positive social change by investigating how cost estimators, engineering economists, and engineering managers could improve the reliability of their estimate forecasts, provide better estimate predictions, and ultimately reduce the taxpayer funds spent to cover defense acquisition cost overruns.
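
    The within-20% accuracy criterion the study reports can be reproduced in outline. The Python sketch below uses scikit-learn's SimpleImputer for single (mean) imputation and IterativeImputer as a rough stand-in for one chain of multiple imputation, on synthetic stand-in data; the DAU matrix itself and the study's exact techniques are not reproduced here.

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import SimpleImputer, IterativeImputer

    rng = np.random.default_rng(1)
    # Synthetic stand-in for the effort/size/schedule matrix (values assumed).
    complete = rng.lognormal(mean=3.0, sigma=0.5, size=(100, 3))

    def mask_values(data, pct, rng):
        masked = data.copy()
        holes = rng.random(masked.shape) < pct   # missing completely at random
        masked[holes] = np.nan
        return masked, holes

    def pct_within_20(imputed, original, holes):
        # Fraction of imputed values within 20% of the true value.
        rel_err = np.abs(imputed[holes] - original[holes]) / original[holes]
        return (rel_err <= 0.20).mean()

    for pct in (0.05, 0.10, 0.20, 0.40):         # a few of the eight levels
        masked, holes = mask_values(complete, pct, rng)
        single = SimpleImputer(strategy="mean").fit_transform(masked)
        multiple = IterativeImputer(random_state=0).fit_transform(masked)
        print(pct, pct_within_20(single, complete, holes),
              pct_within_20(multiple, complete, holes))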

    Distance Estimation in Numerical Data Sets with Missing Values

    The possibility of missing or incomplete data is often ignored when describing statistical or machine learning methods, but as it is a common problem in practice, it is relevant to consider. A popular strategy is to fill in the missing values by imputation as a pre-processing step, but for many methods this is not necessary and can yield sub-optimal results. Instead, appropriately estimating pairwise distances in a data set directly enables the use of any machine learning method using nearest neighbours or otherwise based on distances between samples. In this paper, it is shown how directly estimating distances tends to give more accurate results than calculating distances from an imputed data set, and an algorithm to calculate the estimated distances is presented. The theoretical framework operates under the assumption of a multivariate normal distribution, but the algorithm is shown to be robust to violations of this assumption. The focus is on numerical data with a considerable proportion of missing values, and simulated experiments are provided to show accurate performance on several data sets.
    Keywords: missing data, distance estimation, imputation, nearest neighbour
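
    The core trick is that, for jointly Gaussian data, the expected squared distance between two incomplete samples decomposes into squared differences of conditional means plus the conditional variances of the missing entries. A minimal Python sketch of that idea follows, assuming the mean vector mu and covariance S have already been estimated elsewhere; it illustrates the decomposition rather than reproducing the paper's exact algorithm.

    import numpy as np

    def conditional_moments(x, mu, S):
        # Mean and variance of the missing entries of x given the observed
        # ones, under a multivariate normal model with mean mu, covariance S.
        m = np.isnan(x)
        if not m.any():
            return x.copy(), np.zeros_like(x)
        o = ~m
        Soo_inv = np.linalg.inv(S[np.ix_(o, o)])
        mean = x.copy()
        mean[m] = mu[m] + S[np.ix_(m, o)] @ Soo_inv @ (x[o] - mu[o])
        var = np.zeros_like(x)
        var[m] = np.diag(S[np.ix_(m, m)]
                         - S[np.ix_(m, o)] @ Soo_inv @ S[np.ix_(o, m)])
        return mean, var

    def expected_sq_distance(x, y, mu, S):
        # E||x - y||^2 = squared mean differences plus both conditional
        # variances (treating the two samples' uncertainties as independent).
        mx, vx = conditional_moments(x, mu, S)
        my, vy = conditional_moments(y, mu, S)
        return np.sum((mx - my) ** 2 + vx + vy)

    mu, S = np.zeros(3), np.eye(3)
    x = np.array([0.5, np.nan, 1.0])
    y = np.array([np.nan, 0.2, -0.3])
    print(expected_sq_distance(x, y, mu, S))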