
    Missing-Values Adjustment for Mixed-Type Data

    We propose a new method of single imputation, reconstruction, and estimation of non-reported, incorrect, implausible, or excluded values in more than one field of the record. In particular, we are concerned with data sets involving a mixture of numeric, ordinal, binary, and categorical variables. Our technique is a variation of the popular nearest neighbor hot deck imputation (NNHDI), where “nearest” is defined in terms of a global distance obtained as a convex combination of the distance matrices computed for the various types of variables. We address the problem of properly weighting the partial distance matrices in order to reflect their significance, reliability, and statistical adequacy. The performance of several weighting schemes is compared under a variety of settings, in combination with imputing the least power mean of the Box-Cox transformation applied to the values of the donors. Through analysis of simulated and actual data sets, we show that this approach is appropriate. Our main contribution is to demonstrate that mixed data may optimally be combined to allow accurate reconstruction of missing values in the target variable even when some data are absent from the other fields of the record.
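    A minimal sketch of the core idea, assuming Gower-style partial distances for the numeric and categorical blocks; the function names, the fixed weight w, and the plain donor mean (standing in for the paper's Box-Cox power mean) are illustrative assumptions, not the authors' implementation:

        import numpy as np

        def partial_distances(X_num, X_cat):
            # Range-normalised absolute differences for the numeric block
            # and simple mismatch rates for the categorical block.
            n = X_num.shape[0]
            rng = np.nanmax(X_num, axis=0) - np.nanmin(X_num, axis=0)
            rng[rng == 0] = 1.0
            D_num = np.zeros((n, n))
            D_cat = np.zeros((n, n))
            for i in range(n):
                D_num[i] = np.nanmean(np.abs(X_num - X_num[i]) / rng, axis=1)
                D_cat[i] = np.mean(X_cat != X_cat[i], axis=1)
            return D_num, D_cat

        def nnhdi_impute(y, D_num, D_cat, w=0.5, k=5):
            # Global distance as a convex combination of the partial matrices.
            D = w * D_num + (1.0 - w) * D_cat
            y = y.astype(float).copy()
            donors = ~np.isnan(y)
            for i in np.where(np.isnan(y))[0]:
                d = np.where(donors, D[i], np.inf)
                d[i] = np.inf
                nearest = np.argsort(d)[:k]
                # Plain donor mean for brevity; the paper imputes the least
                # power mean of Box-Cox-transformed donor values.
                y[i] = y[nearest].mean()
            return y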

    Methodologies for model-free data interpretation of civil engineering structures

    Structural health monitoring (SHM) has the potential to provide quantitative and reliable data on the real condition of structures, observe the evolution of their behaviour, and detect degradation. This paper presents two methodologies for model-free data interpretation to identify and localize anomalous behaviour in civil engineering structures. Two statistical methods, based on (i) moving principal component analysis and (ii) robust regression analysis, are demonstrated to be useful for damage detection during continuous static monitoring of civil structures. The methodologies are tested on numerically simulated elements with sensors for a range of measurement noise levels. A comparative study with other statistical analyses demonstrates the superior performance of these methods for damage detection. Approaches for accommodating outliers and missing data, which are commonly encountered in structural health monitoring of civil structures, are also proposed. To ensure that the methodologies are scalable to complex structures with many sensors, a clustering algorithm groups sensors that have strong correlations between their measurements. The methodologies are then validated on two full-scale structures; the results show their ability to identify abrupt permanent changes in behaviour.
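    A rough sketch of a moving-principal-component-analysis indicator of the kind described above, assuming a sliding window over a (time x sensors) measurement matrix and the angle of the first principal direction as the damage-sensitive feature; the function name, windowing scheme, and angle criterion are assumptions, not the paper's code:

        import numpy as np

        def moving_pca_indicator(X, window):
            # Angle between the current and initial principal direction of
            # the sensor data; an abrupt jump suggests anomalous behaviour.
            ref_block = X[:window] - X[:window].mean(axis=0)
            _, _, V0 = np.linalg.svd(ref_block, full_matrices=False)
            ref = V0[0]
            angles = []
            for t in range(window, X.shape[0]):
                W = X[t - window:t]
                W = W - W.mean(axis=0)
                _, _, V = np.linalg.svd(W, full_matrices=False)
                cosang = abs(np.dot(ref, V[0]))
                angles.append(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))
            return np.array(angles)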

    Cross-validation based K nearest neighbor imputation for software quality datasets: An empirical study

    Being able to predict software quality is essential, but it also poses significant challenges in software engineering. Historical software project datasets are often utilized together with various machine learning algorithms for fault-proneness classification. Unfortunately, missing values in these datasets have a negative impact on estimation accuracy and can therefore lead to inconsistent results. As a method of handling missing data, K nearest neighbor (KNN) imputation has gradually gained acceptance in empirical studies for its exemplary performance and simplicity. To date, researchers still call for optimized parameter settings for KNN imputation to further improve its performance. In this work, we develop a novel incomplete-instance-based KNN imputation technique, which utilizes a cross-validation scheme to optimize the parameters for each missing value. An experimental assessment is conducted on eight quality datasets under various missingness scenarios. The study also compares the proposed imputation approach with mean imputation and three other KNN imputation approaches. The results show that our proposed approach is superior to the others in general. The relatively optimal fixed parameter settings for KNN imputation on software quality data are also determined. We observe that classification accuracy is improved, or at least maintained, by using our approach for missing data imputation.
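    A simplified sketch of the cross-validation idea, using scikit-learn's KNNImputer: hide observed entries to score candidate k values, then impute the real gaps with the winner. The paper tunes parameters per missing value on incomplete instances; this sketch picks one global k, and all names and the grid are illustrative:

        import numpy as np
        from sklearn.impute import KNNImputer

        def cv_tuned_knn_impute(X, k_grid=(1, 3, 5, 10), n_masks=50, seed=0):
            # Score each candidate k by hiding known entries and measuring
            # the squared error of their KNN reconstruction.
            rng = np.random.default_rng(seed)
            obs = np.argwhere(~np.isnan(X))
            picks = obs[rng.choice(len(obs), size=n_masks, replace=False)]
            best_k, best_err = k_grid[0], np.inf
            for k in k_grid:
                errs = []
                for r, c in picks:
                    X_masked = X.copy()
                    X_masked[r, c] = np.nan
                    X_hat = KNNImputer(n_neighbors=k).fit_transform(X_masked)
                    errs.append((X_hat[r, c] - X[r, c]) ** 2)
                if np.mean(errs) < best_err:
                    best_k, best_err = k, np.mean(errs)
            # Impute the real gaps with the best-scoring k.
            return KNNImputer(n_neighbors=best_k).fit_transform(X)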

    Imputation of missing household income data in the ENAHO using the k-nearest neighbor method

    Universidad Nacional Agraria La Molina. Facultad de Economía y Planificación. Departamento Académico de Estadística e Informática
    The National Household Survey (ENAHO) is the instrument used by the National Institute of Statistics and Informatics (INEI) to collect national data on households' economic, educational, and health conditions, among others. These data are used to generate indicators that measure the status and evolution of poverty, well-being, and living conditions of Peruvian households, as well as to carry out diagnoses and measure the reach of social programs (food and non-food) in improving the living conditions of the Peruvian population. However, a problem the ENAHO must face is total or partial non-response, either in the sampling units (unit non-response) or in a specific question (item non-response), especially for questions about household income. For the treatment of missing data, a variety of methods have been proposed, ranging from the simplest, which eliminates observations with a missing value in any variable, to more robust methods based on imputing the missing data from the complete data. The objective of this research is to present and apply mean and median imputation, the hot-deck method, and the k-nearest neighbor method to estimate the missing household income data in the ENAHO 2017, quarter 3. The results indicate that the missing income data follow an MCAR mechanism. The 95% confidence intervals for the mean of the imputed income had widths of 131.41 for mean imputation (the smallest) and 139.4 for the k-nearest neighbor method. The estimated standard deviation of income was lowest for mean imputation, 92.97, versus 100.99 for the k-nearest neighbor method. The imputation methods were compared by using the complete data to generate a random sample of artificial missing data and then computing the mean squared error (MSE) and the correlations between the observed and imputed data for each method. The k-nearest neighbor method had the lowest MSE values, 1412.6 and 444.4, for the mean and median; the other methods' values were 1504.5 for the mean, 1619.9 for the median, and 1963.7 for hot-deck. The correlation coefficients were very similar: 0.968 for the k-nearest neighbor method with the mean and 0.964 with the median.
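    The comparison protocol lends itself to a short sketch: delete a random fraction of a fully observed income vector, impute, and score the held-out entries by MSE and correlation. The helper names and the mean-imputation baseline are illustrative, not the study's code:

        import numpy as np

        def evaluate_imputation(y_complete, impute_fn, frac=0.1, seed=1):
            # Delete a random fraction of the observed incomes, impute,
            # and score on the held-out entries.
            rng = np.random.default_rng(seed)
            y = y_complete.astype(float).copy()
            holdout = rng.choice(len(y), size=int(frac * len(y)), replace=False)
            y[holdout] = np.nan
            y_imp = impute_fn(y)
            mse = np.mean((y_imp[holdout] - y_complete[holdout]) ** 2)
            corr = np.corrcoef(y_imp[holdout], y_complete[holdout])[0, 1]
            return mse, corr

        # Simplest baseline: replace every gap with the observed mean.
        mean_impute = lambda y: np.where(np.isnan(y), np.nanmean(y), y)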

    A modeling platform to predict cancer survival and therapy outcomes using tumor tissue derived metabolomics data.

    Cancer is a complex and broad disease that is challenging to treat, partially due to the vast molecular heterogeneity among patients even within the same subtype. Currently, no reliable method exists to determine which potential first-line therapy would be most effective for a specific patient, as randomized clinical trials have concluded that no single regimen may be significantly more effective than others. One ongoing challenge in the field of oncology is the search for personalization of cancer treatment based on patient data. With an interdisciplinary approach, we show that tumor-tissue-derived metabolomics data, analyzed with machine-learning techniques, can predict clinical response to systemic therapy, classified as disease control vs. progressive disease, and pathological stage, classified as stage I/II/III vs. stage IV (AUROC = 0.970 and AUROC = 0.902, respectively). Patient survival was also analyzed via statistical methods and machine learning, both of which show that tumor-tissue-derived metabolomics data can risk-stratify patients in terms of long vs. short survival (test-set OS AUROC = 0.940; test-set PFS AUROC = 0.875). A set of key metabolites as potential biomarkers and associated metabolic pathways were also found for each outcome, which may lead to insight into biological mechanisms. Additionally, we developed a methodology to calibrate tumor-growth-related parameters in a well-established mathematical model of cancer to help predict the potential nuances of chemotherapeutic response. The proposed methodology shows results consistent with clinical observations in predicting individual patient response to systemic therapy and helps lay the foundation for further investigation into the calibration of mathematical models of cancer with patient-tissue-derived molecular data. Chapters 6 and 8 were published in the Annals of Biomedical Engineering. Chapters 2, 3, and 7 were published in Metabolomics, Lung Cancer, and Pharmaceutical Research, respectively. Chapter 4 has been accepted for publication in the journal Metabolomics (in press), and Chapter 5 is in review at the journal Metabolomics. Chapter 9 is currently in preparation for submission.
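    The classification side of such a pipeline follows a standard pattern: fit a classifier on tissue-derived features and report AUROC on a held-out test split. A generic sketch with placeholder data and a logistic-regression stand-in; the thesis's actual learners, features, and validation scheme are not reproduced here:

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        # Placeholder data: rows are patients, columns are metabolite
        # abundances; y encodes short (1) vs long (0) survival.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(120, 50))
        y = rng.integers(0, 2, size=120)

        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=0)
        clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        clf.fit(X_tr, y_tr)
        print("test AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))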

    Use of artificial intelligence for medical decision support: comparison of Dynamic Ensemble Selection with classical ML algorithms in cancer prediction

    In a world of technological advances that continue to affect various industries, machine learning (ML) has proven to be a ground-breaking technology that can revolutionise many sectors. The health sector, which faces critical and complex challenges, stands to benefit greatly from ML. Machine learning is a branch of computer science that uses algorithms and statistical models to improve computer performance based on feedback and experience from previous data, and it is a fundamental building block in the development of artificial intelligence. This master's thesis addresses the problem of predicting outcomes for cancer patients using different ML algorithms. In this context, several ML techniques are used to perform prediction tests and increase understanding of the algorithms. Furthermore, an MCDA analysis is used to compare the results with the current decision process, which is based on clinical and ethical guidelines as well as patients' needs and interests. The aim is to investigate whether Dynamic Ensemble Selection (DES) gives better results for predicting cancer outcomes than established models such as random forest and logistic regression. The study provides insight into the extent to which DES and the classical algorithms can help improve current support for medical decision-making in cancer treatment. The datasets used to train the predictive models consisted of clinical information on 192 patients treated for colorectal cancer between 2013 and 2017 and 197 patients treated for head and neck cancer between 2007 and 2013 at Oslo University Hospital (OUS). Eight classification algorithms were trained on these datasets with clinical endpoints of overall survival (OS), progression-free survival (PFS), and disease-free survival (DFS). The results were validated by measuring accuracy, F1-score for the positive and negative classes, Matthews correlation coefficient (MCC), and ROC AUC. Furthermore, an external dataset of 99 patients treated at the MAASTRO clinic in the Netherlands was used to test the head and neck cancer models. The findings indicate that there are several opportunities to benefit from applying different ML algorithms. The classical algorithms generally outperform DES in terms of accuracy, prediction performance, and number of misclassifications. According to the MCDA analysis, the classical algorithms are also seen as the best solution in combination with the existing decision process. The new solution should not be a replacement but should be seen as a possible decision-support tool. It is important to note that different algorithms and techniques respond differently to different types of data and problems, so this recommendation applies to the datasets and algorithms on which this thesis is based. A limitation of the datasets used in this thesis is that they were small and contained little information. For further research, a larger and more up-to-date dataset should be collected, which could help optimize the prognosis and survival rate of cancer patients and provide more precise and reliable predictions about which treatment will give the best result for the individual patient. The results from this thesis can form the basis for developing models that identify the optimal cancer treatment for a patient and can be used as a decision-support tool by healthcare professionals when treating new cancer patients.
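    A schematic of such a comparison, assuming the DESlib package for the DES side; the bagging pool, the KNORA-E variant, and the placeholder data are illustrative assumptions rather than the thesis's exact setup:

        import numpy as np
        from deslib.des import KNORAE  # dynamic ensemble selection
        from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import matthews_corrcoef
        from sklearn.model_selection import train_test_split

        # Placeholder data standing in for the clinical features and an
        # OS/PFS/DFS endpoint.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(192, 20))
        y = rng.integers(0, 2, size=192)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                                  random_state=0)
        # DES needs a separate dynamic-selection set alongside the data
        # used to train the classifier pool.
        X_tr, X_dsel, y_tr, y_dsel = train_test_split(
            X_tr, y_tr, stratify=y_tr, random_state=0)
        pool = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

        models = {
            "KNORA-E (DES)": KNORAE(pool_classifiers=pool).fit(X_dsel, y_dsel),
            "random forest": RandomForestClassifier(random_state=0).fit(X_tr, y_tr),
            "logistic regression": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
        }
        for name, model in models.items():
            mcc = matthews_corrcoef(y_te, model.predict(X_te))
            print(name, "MCC:", round(mcc, 3))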