    Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

    Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after applying the k-NN missing data imputation technique, to see whether it is better to tolerate missing data or to impute the missing values before applying the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, whereas the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly once the missing data percentage exceeds 40%.
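
    The pipeline described above can be sketched in a few lines of Python. This is only an illustration: the data file ("effort.csv") and the target column ("effort") are hypothetical placeholders for the project databases used in the study, and scikit-learn's DecisionTreeRegressor implements CART rather than C4.5, so it merely stands in for the tree learner.

    # Sketch: k-NN imputation of missing feature values, then a decision tree
    # trained on the completed data (all features assumed numeric).
    import pandas as pd
    from sklearn.impute import KNNImputer
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("effort.csv")                 # project data with missing values
    X, y = df.drop(columns=["effort"]), df["effort"]

    # Each missing entry is replaced by the mean of that feature over the
    # k most similar complete cases.
    X_imputed = KNNImputer(n_neighbors=5).fit_transform(X)

    tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
    scores = cross_val_score(tree, X_imputed, y, cv=5,
                             scoring="neg_mean_absolute_error")
    print("MAE after k-NN imputation:", -scores.mean())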

    Application of Multiple imputation in Analysis of missing data in a study of Health-related quality of life

    When a new treatment has efficacy similar to standard therapy in medical or social studies, health-related quality of life (HRQL) becomes the main concern of health care professionals and can be the basis for decisions in patient management. The National Surgical Adjuvant Breast and Bowel Protocol (NSABP) C-06 clinical trial compared two therapies for the treatment of colon cancer: intravenous (IV) fluorouracil (FU) plus Leucovorin (LV), and oral uracil/ftorafur (UFT) plus LV. However, there was a high proportion of missing values among the HRQL measurements: only 481 (59.8%) UFT patients and 421 (52.4%) FU patients submitted the forms at all time points. Ignoring the missing data issue often leads to inefficient and sometimes biased estimates. The primary objective of this thesis is to evaluate the impact of missing data on the estimated treatment effect. In this thesis, we analyzed the HRQL data with missing values by multiple imputation. Both model-based and nearest-neighbor hot-deck imputation methods were applied. Confidence intervals for the estimated treatment effect were generated based on the pooled imputation analysis. The results based on multiple imputation indicated that missing data did not introduce major bias in the earlier analyses. However, multiple imputation was worthwhile, since most estimates from the imputed datasets were more efficient than those from the incomplete data. These findings have public health importance: they have implications for the development of health policies and the planning of interventions to improve the health-related quality of life of patients with colon cancer.
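
    The pooled analysis described above can be illustrated with a minimal multiple-imputation sketch in Python. The data file ("hrql.csv"), the column names (treatment, hrql_score) and the number of imputations are hypothetical placeholders, and scikit-learn's IterativeImputer merely stands in for the model-based and hot-deck methods used in the thesis; the pooling step follows Rubin's rules.

    # Sketch: multiple imputation with pooling by Rubin's rules
    # (all columns assumed numeric).
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    df = pd.read_csv("hrql.csv")       # columns: treatment, hrql_score, covariates
    m = 20                             # number of imputed data sets
    estimates, variances = [], []

    for i in range(m):
        imputer = IterativeImputer(sample_posterior=True, random_state=i)
        completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
        # Estimate the treatment effect on this completed data set.
        fit = sm.OLS(completed["hrql_score"],
                     sm.add_constant(completed[["treatment"]])).fit()
        estimates.append(fit.params["treatment"])
        variances.append(fit.bse["treatment"] ** 2)

    # Rubin's rules: pooled estimate, within- and between-imputation variance.
    q_bar = np.mean(estimates)
    w = np.mean(variances)
    b = np.var(estimates, ddof=1)
    total_var = w + (1 + 1 / m) * b
    half_width = 1.96 * np.sqrt(total_var)          # normal approximation
    print(f"pooled effect {q_bar:.3f}, 95% CI "
          f"({q_bar - half_width:.3f}, {q_bar + half_width:.3f})")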

    An empirical study of imputation techniques for software data sets

    Software project effort/cost/time estimation has been one of the hot topics of research in the current software engineering industry, and solutions for effort/cost/time estimation are in great demand. Knowledge of accurate effort/cost/time estimates early in the software project life cycle enables project managers to manage and exploit resources efficiently, and the constraints of cost and time can also be met. To this day, most companies rely on their historical databases of past project data sets to predict estimates for future projects. Like other data sets, software project data sets suffer from numerous problems, the most important of which is that they contain missing/incomplete data. Significant amounts of missing or incomplete data are frequently found in the data sets used to build effort/cost/time prediction models in the current software industry; the reasons are numerous and the missingness is inevitable. The traditional approaches used by companies ignore all the missing data and provide estimates based on the remaining complete information, so the resulting estimates are prone to bias. In this thesis, we investigate the application of a few well-known data imputation techniques (listwise deletion, mean imputation, 10 variants of hot-deck imputation, and the full information maximum likelihood approach) to six real-time software project data sets. Using the imputed data sets, we build effort prediction models to evaluate their performance. We study the inherent characteristics of software project data sets, such as data set size, missingness mechanism, and pattern of missingness, and provide a generic classification schema for software project data sets based on these characteristics. We further implement a hybrid methodology for the same purpose. We perform experimental analyses and compare the impact of these methods on prediction accuracy. We also highlight the conditions to be considered and the measures to be taken when using an imputation technique, and note the ideal and worst conditions for each method. Finally, we discuss the findings and the appropriateness of each method for imputing software project data sets.
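
    A comparison of this kind might be set up as in the Python sketch below. The file name ("projects.csv"), the target column ("effort"), the simple random hot-deck and the linear model are hypothetical simplifications; the ten hot-deck variants and the full information maximum likelihood approach evaluated in the thesis are not reproduced here.

    # Sketch: apply listwise deletion, mean imputation and a simple hot-deck to
    # the same data set and compare the resulting effort prediction error
    # (all columns assumed numeric).
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    def mean_impute(df):
        return df.fillna(df.mean())

    def random_hot_deck(df, seed=0):
        # Replace each missing entry with a value drawn at random from the
        # observed entries of the same column (an unconditioned hot-deck).
        rng = np.random.default_rng(seed)
        out = df.copy()
        for col in out.columns:
            missing = out[col].isna()
            donors = out.loc[~missing, col].to_numpy()
            out.loc[missing, col] = rng.choice(donors, size=missing.sum())
        return out

    df = pd.read_csv("projects.csv")       # software project data with gaps

    strategies = {
        "listwise deletion": df.dropna(),
        "mean imputation": mean_impute(df),
        "hot-deck": random_hot_deck(df),
    }
    for name, data in strategies.items():
        X, y = data.drop(columns=["effort"]), data["effort"]
        mae = -cross_val_score(LinearRegression(), X, y, cv=5,
                               scoring="neg_mean_absolute_error").mean()
        print(f"{name:>18}: MAE = {mae:.1f}")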

    Imputation of missing values in survey data (Version 1.0)

    Survey data often include missing values. One approach to dealing with missing values is imputation, which aims to produce a complete dataset. However, the process of imputation requires researchers to make several decisions: the imputation method to be applied, the number of values to be imputed for each missing value, the selection of predictor variables, the treatment of multivariate nonresponse, and how variance estimation is to be conducted. This survey guideline provides an overview of imputation procedures for missing values and aims to support the reader with respect to the aforementioned decisions when imputing missing values in survey data.
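
    As a rough illustration of the decisions listed above (imputation method, number of imputations, predictor selection and variance estimation), the Python sketch below uses the chained-equations (MICE) implementation in statsmodels; the file name and the variable names y, x1 and x2 are hypothetical placeholders.

    # Sketch: chained-equations imputation with a pooled analysis
    # (all columns assumed numeric).
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.imputation import mice

    df = pd.read_csv("survey.csv")            # survey data with item nonresponse

    imp = mice.MICEData(df)                   # chained-equations imputation engine
    imp.set_imputer("x1", formula="x2 + y")   # predictors used to impute x1

    # Fit the analysis model on several imputed data sets and pool the results,
    # which also yields the combined variance estimates.
    mi = mice.MICE("y ~ x1 + x2", sm.OLS, imp)
    results = mi.fit(n_burnin=10, n_imputations=20)
    print(results.summary())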

    Data cycle in atmospheric physics: from a measured millivolt to understanding the atmosphere

    In this thesis the concept of a data cycle is introduced. The concept itself is general and only acquires concrete content when the field of application is defined. Applied to atmospheric physics, the data cycle includes measurements, data acquisition, processing, analysis, and interpretation. The atmosphere is a complex system in which everything is in a constantly moving equilibrium. The scientific community agrees unanimously that it is human activity that is accelerating climate change; nevertheless, a complete understanding of the process is still lacking. The biggest uncertainty in our understanding is connected to the role of nano- to micro-scale atmospheric aerosol particles, which are emitted to the atmosphere directly or formed from precursor gases. The latter process has only been discovered recently in the long history of science and links nature's own processes to human activities. The incomplete understanding of atmospheric aerosol formation and the intricacy of the process have motivated scientists to develop novel ways to acquire data, new methods to explore already acquired data, and unprecedented ways to extract information from the examined complex systems; in other words, to complete a full data cycle. Until recently it has been impossible to directly measure the chemical composition of the precursor gases and clusters that participate in atmospheric particle formation. However, with the arrival of the so-called atmospheric pressure interface time-of-flight mass spectrometer (APiTOF), we are now able to detect atmospheric ions that take part in particle formation. The amount of data generated from on-line analysis of atmospheric particle formation with this instrument is vast and requires efficient processing; for this purpose, dedicated software was developed and tested in this thesis. When processed data from multiple instruments are combined, the information content increases, which requires special tools to extract useful information. Source apportionment and data mining techniques were explored and utilized to investigate the origin of atmospheric aerosol in urban environments (two case studies: Krakow and Helsinki) and to uncover indirect variables influencing the atmospheric formation of new particles.

    Microdata Imputations and Macrodata Implications: Evidence from the Ifo Business Survey

    Survey-based indicators are a widespread tool for now- and forecasting macro-level economic parameters such as GDP growth rates, since they contain early information in contrast to official data. But surveys are commonly affected by nonresponding units, which can produce biases if the missing values cannot be regarded as missing at random. While many papers have examined the effect of nonresponse in individual or household surveys, much less is known for business surveys, so the literature leaves a gap on this issue. For this reason, we analyse and impute the missing observations in the Ifo Business Survey, a large business survey in Germany. The most prominent result of this survey is the Ifo Business Climate Index, a leading indicator for the German business cycle. To reflect the underlying latent data-generating process, we compare different imputation approaches for longitudinal data. The microdata are then aggregated, and the results are compared with the original indicators to evaluate their implications on the macro level. Finally, we show that the bias is minimal and ignorable.
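
    The micro-to-macro comparison can be sketched in Python as follows. The panel layout (columns firm, month and an answer coded -1/0/+1) and the use of last-observation-carried-forward are hypothetical simplifications standing in for the Ifo microdata and the longitudinal imputation approaches compared in the paper; the balance statistic mimics how survey answers are aggregated into an indicator.

    # Sketch: impute a longitudinal business survey, aggregate to a monthly
    # balance indicator and compare it with the complete-case aggregate.
    import pandas as pd

    panel = pd.read_csv("business_panel.csv")   # columns: firm, month, answer

    def balance(df):
        # Share of positive minus share of negative answers per month.
        return df.groupby("month")["answer"].apply(
            lambda a: (a == 1).mean() - (a == -1).mean())

    # Complete-case aggregate: the missing responses are simply ignored.
    complete_case = balance(panel.dropna(subset=["answer"]))

    # Longitudinal imputation: within each firm, carry the last observed
    # answer forward in time.
    imputed = panel.sort_values(["firm", "month"]).copy()
    imputed["answer"] = imputed.groupby("firm")["answer"].ffill()
    imputed_balance = balance(imputed.dropna(subset=["answer"]))

    # Size of the difference the imputation makes at the aggregate level.
    print((imputed_balance - complete_case).describe())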
    • 

    corecore