7 research outputs found

    The costs of poor data quality

    Purpose: Technological developments mean that companies store ever more data. Data quality maintenance work, however, is often neglected, and poor quality business data constitute a significant cost factor for many companies. This paper argues that perfect data quality should not be the goal; instead, data quality should be improved only to a certain level. The paper focuses on how to identify this optimal data quality level. Design/methodology/approach: The paper starts with a review of the data quality literature. On this basis, it proposes a definition of the optimal data maintenance effort and a classification of the costs inflicted by poor quality data. These propositions are investigated through a case study. Findings: The paper proposes (1) a definition of the optimal data maintenance effort and (2) a classification of the costs inflicted by poor quality data. A case study illustrates the usefulness of these propositions. Research limitations/implications: The paper provides definitions relating to the costs of poor quality data and the data quality maintenance effort. Future research may build on these definitions; further studies are needed to develop the paper's contributions. Practical implications: As the case study illustrates, the definitions provided by this paper can be used to determine the right data maintenance effort and the costs inflicted by poor quality data. In many companies, such insights may lead to significant savings. Originality/value: The paper clarifies what the costs of poor quality data are and defines their relation to the data quality maintenance effort, an original contribution of value to future research and practice.
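    To make the trade-off behind the "optimal data quality level" concrete, the sketch below minimizes total cost, i.e. data maintenance cost plus the cost inflicted by poor quality data, over candidate quality levels. It is a minimal illustration only: the cost functions, their parameters, and the 0..1 quality scale are hypothetical assumptions, not taken from the paper.

    def maintenance_cost(q: float) -> float:
        """Hypothetical curve: maintenance effort rises steeply as quality q (0..1) approaches perfection."""
        return 100 * q / (1.001 - q)

    def poor_data_cost(q: float) -> float:
        """Hypothetical curve: losses caused by bad data shrink as quality improves."""
        return 5000 * (1 - q) ** 2

    def optimal_quality_level(steps: int = 1000) -> tuple[float, float]:
        """Sweep candidate quality levels and return (best_q, lowest_total_cost)."""
        candidates = (i / steps for i in range(steps + 1))
        return min(
            ((q, maintenance_cost(q) + poor_data_cost(q)) for q in candidates),
            key=lambda pair: pair[1],
        )

    q, cost = optimal_quality_level()
    print(f"Optimal quality level ~{q:.2f} at total cost ~{cost:.0f}")

    Under these assumed curves the optimum lands well below perfect quality, which mirrors the paper's central point: beyond some level, additional maintenance effort costs more than the errors it prevents.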

    The importance of finding the right data: the case of healthcare operations improvement projects (Oikean datan löytämisen tärkeys: Case terveydenhuollon operaatioiden kehitysprojektit)

    The utilization of data in healthcare improvement projects is currently a highly topical subject. Several public and private organizations have shown the value of using data to improve operational efficiency. Not all datasets are equally useful, however, so an understanding of data quality is required to ensure correct decision-making. Currently, two streams of literature guide improvement teams: the literature on operational improvement, e.g. through methods such as Total Quality Management, Lean, and Six Sigma, and the literature on data quality. From the point of view of an improvement project team, a linkage between these two streams is missing. This thesis aims to bridge that gap by helping healthcare improvement teams assess whether data quality is sufficient to support decision-making. The academic framework illustrates how the view of data quality has shifted from an intrinsic focus in the 1970s, to fitness for use in the 1990s, and finally to the specifics of newer trends, such as big data and unstructured data, from the 2010s onwards. Using the case study method, the findings were grounded in an improvement project observed in a private Finnish healthcare company. Together with the project team, I went through an iterative process of five steps, each guided by a distinct, new set of data. Ultimately, the actual improvement was achieved by gathering data manually: a dataset that was highly relevant for the end users, but likely to be intrinsically less robust than the previous datasets. In conclusion, the current data quality literature offers only modest guidance to improvement teams on choosing the right dataset. A new model for data quality in healthcare operational improvement was therefore created. The model suggests that teams should first consider whether a dataset is relevant to the goal of the improvement project, and then whether the dataset can add value toward reaching that goal. After these two steps, the other key data quality attributes come into play, linked to four dimensions: accessibility, intrinsic, representational, and contextual quality.
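    The proposed assessment order lends itself to a simple checklist. The sketch below encodes it under stated assumptions: the DatasetAssessment fields, the should_use_dataset helper, and the 1-5 scoring of the four dimensions are illustrative choices, not part of the thesis's model.

    from dataclasses import dataclass

    @dataclass
    class DatasetAssessment:
        relevant_to_goal: bool   # step 1: does the dataset fit the project goal?
        adds_value: bool         # step 2: can it move the project toward that goal?
        accessibility: int       # remaining dimensions, scored 1 (poor) to 5 (good)
        intrinsic: int
        representational: int
        contextual: int

    def should_use_dataset(a: DatasetAssessment, min_avg_score: float = 3.0) -> bool:
        """Apply the proposed order: relevance first, then added value, then the four quality dimensions."""
        if not a.relevant_to_goal or not a.adds_value:
            return False
        dims = [a.accessibility, a.intrinsic, a.representational, a.contextual]
        return sum(dims) / len(dims) >= min_avg_score

    # Example mirroring the case: a manually collected dataset that is highly relevant
    # but intrinsically weaker than the earlier datasets.
    manual = DatasetAssessment(True, True, accessibility=4, intrinsic=2,
                               representational=3, contextual=5)
    print(should_use_dataset(manual))  # True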

    Data quality and data cleaning in database applications

    Today, data plays an important role in people's daily activities. With the help of database applications such as decision support systems and customer relationship management (CRM) systems, useful information or knowledge can be derived from large quantities of data. However, investigations show that many such applications fail to work successfully. There are many possible reasons for failure, such as poor system infrastructure design or query performance, but nothing is more certain to yield failure than a lack of concern for data quality. High quality data is a key to today's business success. The quality of any large real-world data set depends on a number of factors, among which the source of the data is often the crucial one. It is now recognized that an inordinate proportion of data in most data sources is dirty. Obviously, a database application with a high proportion of dirty data is not reliable for data mining or deriving business intelligence, and the quality of decisions made on the basis of such business intelligence is equally unreliable. To ensure high quality data, enterprises need a process, methodologies and resources for monitoring and analyzing data quality, and methodologies for preventing and/or detecting and repairing dirty data. This thesis focuses on improving data quality in database applications with the help of current data cleaning methods. It provides a systematic and comparative description of the research issues related to improving data quality, and addresses a number of research issues related to data cleaning. In the first part of the thesis, the related literature on data cleaning and data quality is reviewed and discussed. Building on this research, a rule-based taxonomy of dirty data is proposed in the second part. The proposed taxonomy not only summarizes the most common dirty data types but also forms the basis of the proposed method for solving the Dirty Data Selection (DDS) problem during the data cleaning process. This supports the design of the DDS process in the data cleaning framework described in the third part of the thesis. The framework retains the most appealing characteristics of existing data cleaning approaches while improving the efficiency, effectiveness and degree of automation of the data cleaning process. Finally, a set of approximate string matching algorithms is studied and experimental work undertaken. Approximate string matching is an important part of many data cleaning approaches and has been well studied for many years. The experimental work in the thesis confirms that there is no clear best technique: the characteristics of the data, such as the size of a dataset, its error rate, the type of strings it contains and even the type of typo in a string, have a significant effect on the performance of the selected techniques. The characteristics of the data also affect the selection of suitable threshold values for the chosen matching algorithms. These experimental results underpin the design of the 'algorithm selection mechanism' in the data cleaning framework, which enhances the performance of data cleaning systems in database applications.
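    The thesis compares several approximate string matching techniques and finds no clear winner; performance and suitable thresholds depend on the data. As a generic illustration of the kind of matcher involved (not the thesis's own implementation), the sketch below uses plain Levenshtein edit distance, normalizes it into a similarity score, and flags probable duplicates against a threshold that, per the experiments, would need tuning for each dataset.

    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance between two strings."""
        if len(a) < len(b):
            a, b = b, a
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            current = [i]
            for j, cb in enumerate(b, start=1):
                current.append(min(
                    previous[j] + 1,                # deletion
                    current[j - 1] + 1,             # insertion
                    previous[j - 1] + (ca != cb),   # substitution
                ))
            previous = current
        return previous[-1]

    def is_probable_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
        """Normalize the edit distance to a 0..1 similarity and compare it to a threshold."""
        longest = max(len(a), len(b)) or 1
        similarity = 1 - levenshtein(a, b) / longest
        return similarity >= threshold

    print(is_probable_duplicate("Jonathan Smith", "Jonathon Smith"))  # True, one typo
    print(is_probable_duplicate("London", "Glasgow"))                 # False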
