A Rule Based Taxonomy of Dirty Data

Abstract

There is a growing awareness that high quality of datais a key to today’s business success and that dirty data existingwithin data sources is one of the causes of poor data quality. Toensure high quality data, enterprises need to have a process,methodologies and resources to monitor, analyze and maintainthe quality of data. Nevertheless, research shows that manyenterprises do not pay adequate attention to the existence of dirtydata and have not applied useful methodologies to ensure highquality data for their applications. One of the reasons is a lack ofappreciation of the types and extent of dirty data. In practice,detecting and cleaning all the dirty data that exists in all datasources is quite expensive and unrealistic. The cost of cleaningdirty data needs to be considered for most of enterprises. Thisproblem has not attracted enough attention from researchers. Inthis paper, a rule-based taxonomy of dirty data is developed. Theproposed taxonomy not only provides a mechanism to deal withthis problem but also includes more dirty data types than any ofexisting such taxonomies

    Similar works