
    Towards a Hybrid Imputation Approach Using Web Tables

    Data completeness is one of the most important data quality dimensions and an essential premise in data analytics. With emerging Big Data trends such as the data lake concept, which provides a low-cost data preparation repository instead of moving curated data into a data warehouse, the problem of data completeness is further reinforced. While the process of filling in missing values is traditionally addressed by the data imputation community using statistical techniques, we complement these approaches by using external data sources from the data lake or even the Web to look up missing values. In this paper we propose a novel hybrid data imputation strategy that takes into account the characteristics of an incomplete dataset and, based on those characteristics, chooses the best imputation approach: a statistical approach such as regression analysis, a Web-based lookup, or a combination of both. We formalize and implement both imputation approaches, including a Web table retrieval and matching system, and evaluate them extensively using a corpus of 125M Web tables. We show that applying statistical techniques in conjunction with external data sources leads to an imputation system that is robust, accurate, and has high coverage at the same time.
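
The combination of lookup and statistical imputation described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' system: the function names (`fit_line`, `hybrid_impute`), the in-memory `lookup` dictionary standing in for a Web table match, and the simple "lookup first, regression as fallback" rule are all assumptions made for illustration. The abstract does not specify how the strategy selects an approach.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # All predictor values are identical: fall back to the mean of y.
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def hybrid_impute(rows, key, predictor, target, lookup):
    """Fill missing `target` values (None) in a list of dict rows.

    Hypothetical selection rule (an assumption, not taken from the paper):
    use the external `lookup` value when the row's key is present there,
    otherwise predict the value from `predictor` by linear regression.
    """
    complete = [r for r in rows if r[target] is not None]
    if not complete:
        raise ValueError("need at least one complete row to fit the regression")
    a, b = fit_line([r[predictor] for r in complete],
                    [r[target] for r in complete])

    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are not modified
        if r[target] is None:
            if r[key] in lookup:
                r[target] = lookup[r[key]]
                r["source"] = "lookup"
            else:
                r[target] = a + b * r[predictor]
                r["source"] = "regression"
        filled.append(r)
    return filled


# Toy, made-up data purely for illustration.
rows = [
    {"city": "A", "area": 1.0, "pop": 10.0},
    {"city": "B", "area": 2.0, "pop": 20.0},
    {"city": "C", "area": 3.0, "pop": None},
    {"city": "D", "area": 4.0, "pop": None},
]
external = {"C": 31.0}  # stands in for a value found in an external table
result = hybrid_impute(rows, "city", "area", "pop", external)
```

Recording the `source` of each filled value keeps looked-up and predicted entries distinguishable, which would matter when evaluating accuracy and coverage separately.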

    The Effect of Using Data Pre-Processing by Imputations in Handling Missing Values

    The evolution of big data analytics through machine learning and artificial intelligence techniques has caused organizations in a wide range of sectors, including health, manufacturing, e-commerce, governance, and social welfare, to realize the value of the massive volumes of data accumulating daily in web-based repositories. This has led to the adoption of data-driven decision models; for example, sentiment analysis in marketing, where producers leverage customer feedback and reviews to develop customer-oriented products. However, the data generated in real-world activities is subject to errors resulting from inaccurate measurements or faulty input devices, which may result in the loss of some values. Missing attribute/variable values make data unsuitable for decision analytics because of noise and inconsistencies that create bias. The objective of this paper was to explore the problem of missing data and to develop an advanced imputation model based on machine learning, implemented with the K-Nearest Neighbor (KNN) algorithm in the R programming language, as an approach to handling missing values. The methodology relied on applying advanced machine learning algorithms, which offer high accuracy in pattern detection and predictive analytics, on top of existing imputation techniques, which handle missing values by random replacement or deletion. According to the results, the advanced imputation technique based on machine learning models replaced missing values in a dataset with 89.5% accuracy. The experimental results showed that pre-processing by imputation delivers high performance in handling missing data values. These findings are consistent with the key idea of the paper, which is to explore alternative imputation techniques for handling missing values to improve the accuracy and reliability of decision insights extracted from datasets.
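
The general idea of KNN-based imputation mentioned in this abstract can be sketched in a few lines. The paper's implementation is in R and is not reproduced here; the Python function below (`knn_impute`, a hypothetical name) is only an assumed, simplified variant that replaces each missing numeric entry with the mean of that column over the `k` nearest rows, with distance computed on the columns both rows have observed.

```python
import math


def knn_impute(rows, k=2):
    """Return a copy of `rows` with None entries filled by KNN averaging.

    Simplified sketch (an assumption, not the paper's model): the distance
    between two rows is the root mean squared difference over the columns
    observed in both; a missing entry is filled with the mean of that
    column over the k nearest rows that have the column observed.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf  # no common observed columns: not comparable
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows with column j observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # leave the entry as None if no donor exists
                result[i][j] = sum(v for _, v in nearest) / len(nearest)
    return result


# Toy, made-up data purely for illustration.
data = [
    [1.0, 2.0],
    [1.1, 2.2],
    [5.0, 10.0],
    [1.05, None],
]
completed = knn_impute(data, k=2)
```

In this toy input the last row is closest to the first two rows, so its missing value is filled from them rather than from the distant third row.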

    Trends and inequalities in laryngeal cancer survival in men and women: England and Wales 1991-2006.

    Laryngeal cancer in men is a relatively common malignancy, with a marked socioeconomic gradient in survival between affluent and deprived patients. Cancer of the larynx in women is rare. Survival tends to be lower than for men, and little is known about the association between deprivation and survival in women with laryngeal cancer. This paper explores the trends and socioeconomic inequalities in laryngeal cancer survival in women, with comparison to men. We examined relative survival among men and women diagnosed with laryngeal cancer in England and Wales during 1991-2006, followed up to 31 December 2007. We estimated the difference in survival between the most deprived and most affluent groups (the 'deprivation gap') at one and five years after diagnosis, for each sex, anatomical subsite and calendar period. Five-year survival for all laryngeal cancers combined was up to 8% lower in women than in men. This difference is only partially explained by the differential distribution of anatomical subsites in men and women. Disparities in survival between men and women were also present within specific subsites. In contrast to men, there was little evidence of a consistent deprivation gap in survival for women at any of the anatomical subsites. The stark socioeconomic inequalities in laryngeal cancer survival in men do not appear to be replicated in women. The origins of the socioeconomic inequalities in survival among men, and of the disparities in survival between men and women at specific tumour subsites, remain unclear.