    Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

    Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after applying the k-NN missing data imputation technique, to see whether it is better to tolerate missing data or to impute the missing values before applying the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, whereas the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly once the missing data percentage exceeds 40%.
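
    The pipeline described above can be sketched in a few lines of Python. This is only an illustration: the data file ("effort.csv") and the target column ("effort") are hypothetical placeholders for the project databases used in the study, and scikit-learn's DecisionTreeRegressor implements CART rather than C4.5, so it merely stands in for the tree learner.

    # Sketch: k-NN imputation of missing feature values, then a decision tree
    # trained on the completed data (all features assumed numeric).
    import pandas as pd
    from sklearn.impute import KNNImputer
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("effort.csv")                 # project data with missing values
    X, y = df.drop(columns=["effort"]), df["effort"]

    # Each missing entry is replaced by the mean of that feature over the
    # k most similar complete cases.
    X_imputed = KNNImputer(n_neighbors=5).fit_transform(X)

    tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
    scores = cross_val_score(tree, X_imputed, y, cv=5,
                             scoring="neg_mean_absolute_error")
    print("MAE after k-NN imputation:", -scores.mean())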

    Application of Multiple imputation in Analysis of missing data in a study of Health-related quality of life

    When a new treatment has efficacy similar to standard therapy in medical or social studies, health-related quality of life (HRQL) becomes the main concern of health care professionals and can be the basis for decisions in patient management. The National Surgical Adjuvant Breast and Bowel Protocol (NSABP) C-06 clinical trial compared two therapies for the treatment of colon cancer: intravenous (IV) fluorouracil (FU) plus Leucovorin (LV), and oral uracil/ftorafur (UFT) plus LV. However, there was a high proportion of missing values among the HRQL measurements: only 481 (59.8%) UFT patients and 421 (52.4%) FU patients submitted the forms at all time points. Ignoring the missing data issue often leads to inefficient and sometimes biased estimates. The primary objective of this thesis is to evaluate the impact of missing data on the estimated treatment effect. In this thesis, we analyzed the HRQL data with missing values by multiple imputation. Both model-based and nearest-neighbor hot-deck imputation methods were applied. Confidence intervals for the estimated treatment effect were generated based on the pooled imputation analysis. The results based on multiple imputation indicated that missing data did not introduce major bias in the earlier analyses. However, multiple imputation was worthwhile, since most estimates from the imputed datasets were more efficient than those from the incomplete data. These findings have public health importance: they have implications for the development of health policies and the planning of interventions to improve the health-related quality of life of patients with colon cancer.
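
    The pooled analysis described above can be illustrated with a minimal multiple-imputation sketch in Python. The data file ("hrql.csv"), the column names (treatment, hrql_score) and the number of imputations are hypothetical placeholders, and scikit-learn's IterativeImputer merely stands in for the model-based and hot-deck methods used in the thesis; the pooling step follows Rubin's rules.

    # Sketch: multiple imputation with pooling by Rubin's rules
    # (all columns assumed numeric).
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    df = pd.read_csv("hrql.csv")       # columns: treatment, hrql_score, covariates
    m = 20                             # number of imputed data sets
    estimates, variances = [], []

    for i in range(m):
        imputer = IterativeImputer(sample_posterior=True, random_state=i)
        completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
        # Estimate the treatment effect on this completed data set.
        fit = sm.OLS(completed["hrql_score"],
                     sm.add_constant(completed[["treatment"]])).fit()
        estimates.append(fit.params["treatment"])
        variances.append(fit.bse["treatment"] ** 2)

    # Rubin's rules: pooled estimate, within- and between-imputation variance.
    q_bar = np.mean(estimates)
    w = np.mean(variances)
    b = np.var(estimates, ddof=1)
    total_var = w + (1 + 1 / m) * b
    half_width = 1.96 * np.sqrt(total_var)          # normal approximation
    print(f"pooled effect {q_bar:.3f}, 95% CI "
          f"({q_bar - half_width:.3f}, {q_bar + half_width:.3f})")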

    An empirical study of imputation techniques for software data sets

    Software project effort/cost/time estimation has been one of the hot topics of research in the current software engineering industry, and solutions for effort/cost/time estimation are in great demand. Knowledge of accurate effort/cost/time estimates early in the software project life cycle enables project managers to manage and exploit resources efficiently, and the constraints of cost and time can also be met. To this day, most companies rely on their historical databases of past project data sets to predict estimates for future projects. Like other data sets, software project data sets suffer from numerous problems, the most important of which is that they contain missing/incomplete data. Significant amounts of missing or incomplete data are frequently found in the data sets used to build effort/cost/time prediction models in the current software industry; the reasons are numerous and the missingness is inevitable. The traditional approaches used by companies ignore all the missing data and provide estimates based on the remaining complete information, so the resulting estimates are prone to bias. In this thesis, we investigate the application of a few well-known data imputation techniques (listwise deletion, mean imputation, 10 variants of hot-deck imputation, and the full information maximum likelihood approach) to six real-time software project data sets. Using the imputed data sets, we build effort prediction models to evaluate their performance. We study the inherent characteristics of software project data sets, such as data set size, missingness mechanism, and pattern of missingness, and provide a generic classification schema for software project data sets based on these characteristics. We further implement a hybrid methodology for the same purpose. We perform experimental analyses and compare the impact of these methods on prediction accuracy. We also highlight the conditions to be considered and the measures to be taken when using an imputation technique, and note the ideal and worst conditions for each method. Finally, we discuss the findings and the appropriateness of each method for imputing software project data sets.
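
    A comparison of this kind might be set up as in the Python sketch below. The file name ("projects.csv"), the target column ("effort"), the simple random hot-deck and the linear model are hypothetical simplifications; the ten hot-deck variants and the full information maximum likelihood approach evaluated in the thesis are not reproduced here.

    # Sketch: apply listwise deletion, mean imputation and a simple hot-deck to
    # the same data set and compare the resulting effort prediction error
    # (all columns assumed numeric).
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    def mean_impute(df):
        return df.fillna(df.mean())

    def random_hot_deck(df, seed=0):
        # Replace each missing entry with a value drawn at random from the
        # observed entries of the same column (an unconditioned hot-deck).
        rng = np.random.default_rng(seed)
        out = df.copy()
        for col in out.columns:
            missing = out[col].isna()
            donors = out.loc[~missing, col].to_numpy()
            out.loc[missing, col] = rng.choice(donors, size=missing.sum())
        return out

    df = pd.read_csv("projects.csv")       # software project data with gaps

    strategies = {
        "listwise deletion": df.dropna(),
        "mean imputation": mean_impute(df),
        "hot-deck": random_hot_deck(df),
    }
    for name, data in strategies.items():
        X, y = data.drop(columns=["effort"]), data["effort"]
        mae = -cross_val_score(LinearRegression(), X, y, cv=5,
                               scoring="neg_mean_absolute_error").mean()
        print(f"{name:>18}: MAE = {mae:.1f}")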

    Imputation of missing values in survey data (Version 1.0)

    Survey data often include missing values. One approach to dealing with missing values is imputation, which aims to produce a complete dataset. However, the process of imputation requires researchers to make several decisions: the imputation method to be applied, the number of values to be imputed for each missing value, the selection of predictor variables, the treatment of multivariate nonresponse, and how variance estimation is to be conducted. This survey guideline provides an overview of imputation procedures for missing values and aims to support the reader with respect to the aforementioned decisions when imputing missing values in survey data.
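
    As a rough illustration of the decisions listed above (imputation method, number of imputations, predictor selection and variance estimation), the Python sketch below uses the chained-equations (MICE) implementation in statsmodels; the file name and the variable names y, x1 and x2 are hypothetical placeholders.

    # Sketch: chained-equations imputation with a pooled analysis
    # (all columns assumed numeric).
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.imputation import mice

    df = pd.read_csv("survey.csv")            # survey data with item nonresponse

    imp = mice.MICEData(df)                   # chained-equations imputation engine
    imp.set_imputer("x1", formula="x2 + y")   # predictors used to impute x1

    # Fit the analysis model on several imputed data sets and pool the results,
    # which also yields the combined variance estimates.
    mi = mice.MICE("y ~ x1 + x2", sm.OLS, imp)
    results = mi.fit(n_burnin=10, n_imputations=20)
    print(results.summary())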

    Data cycle in atmospheric physics: from a measured millivolt to understanding the atmosphere

    In this thesis the concept of a data cycle is introduced. The concept itself is general and only acquires concrete content when the field of application is defined. Applied to atmospheric physics, the data cycle includes measurements, data acquisition, processing, analysis, and interpretation. The atmosphere is a complex system in which everything is in a constantly moving equilibrium. The scientific community agrees unanimously that it is human activity that is accelerating climate change; nevertheless, a complete understanding of the process is still lacking. The biggest uncertainty in our understanding is connected to the role of nano- to micro-scale atmospheric aerosol particles, which are emitted to the atmosphere directly or formed from precursor gases. The latter process has only been discovered recently in the long history of science and links nature's own processes to human activities. The incomplete understanding of atmospheric aerosol formation and the intricacy of the process have motivated scientists to develop novel ways to acquire data, new methods to explore already acquired data, and unprecedented ways to extract information from the examined complex systems; in other words, to complete a full data cycle. Until recently it has been impossible to directly measure the chemical composition of the precursor gases and clusters that participate in atmospheric particle formation. However, with the arrival of the so-called atmospheric pressure interface time-of-flight mass spectrometer (APiTOF), we are now able to detect atmospheric ions that take part in particle formation. The amount of data generated from on-line analysis of atmospheric particle formation with this instrument is vast and requires efficient processing; for this purpose, dedicated software was developed and tested in this thesis. When processed data from multiple instruments are combined, the information content increases, which requires special tools to extract useful information. Source apportionment and data mining techniques were explored and utilized to investigate the origin of atmospheric aerosol in urban environments (two case studies: Krakow and Helsinki) and to uncover indirect variables influencing the atmospheric formation of new particles.

    Microdata Imputations and Macrodata Implications: Evidence from the Ifo Business Survey

    Survey-based indicators are a widespread tool for now- and forecasting macro-level economic parameters such as GDP growth rates, since they contain early information in contrast to official data. But surveys are commonly affected by nonresponding units, which can produce biases if the missing values cannot be regarded as missing at random. While many papers have examined the effect of nonresponse in individual or household surveys, much less is known for business surveys, so the literature leaves a gap on this issue. For this reason, we analyse and impute the missing observations in the Ifo Business Survey, a large business survey in Germany. The most prominent result of this survey is the Ifo Business Climate Index, a leading indicator for the German business cycle. To reflect the underlying latent data-generating process, we compare different imputation approaches for longitudinal data. The microdata are then aggregated, and the results are compared with the original indicators to evaluate their implications on the macro level. Finally, we show that the bias is minimal and ignorable.
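
    The micro-to-macro comparison can be sketched in Python as follows. The panel layout (columns firm, month and an answer coded -1/0/+1) and the use of last-observation-carried-forward are hypothetical simplifications standing in for the Ifo microdata and the longitudinal imputation approaches compared in the paper; the balance statistic mimics how survey answers are aggregated into an indicator.

    # Sketch: impute a longitudinal business survey, aggregate to a monthly
    # balance indicator and compare it with the complete-case aggregate.
    import pandas as pd

    panel = pd.read_csv("business_panel.csv")   # columns: firm, month, answer

    def balance(df):
        # Share of positive minus share of negative answers per month.
        return df.groupby("month")["answer"].apply(
            lambda a: (a == 1).mean() - (a == -1).mean())

    # Complete-case aggregate: the missing responses are simply ignored.
    complete_case = balance(panel.dropna(subset=["answer"]))

    # Longitudinal imputation: within each firm, carry the last observed
    # answer forward in time.
    imputed = panel.sort_values(["firm", "month"]).copy()
    imputed["answer"] = imputed.groupby("firm")["answer"].ffill()
    imputed_balance = balance(imputed.dropna(subset=["answer"]))

    # Size of the difference the imputation makes at the aggregate level.
    print((imputed_balance - complete_case).describe())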
    • 

    corecore