9 research outputs found

    Discovering the most important data quality dimensions in health big data using latent semantic analysis

    Get PDF
    Big Data quality is a field which is emerging. Many authors nowadays agree that data quality is still very relevant, even for Big Data uses. However, there is a lack of frameworks or guidelines focusing on how to carry out big data quality initiatives. The starting point of any data quality work is to determine the properties of data quality, termed ‘data quality dimensions’ (DQDs). Even these dimensions lack precise rigour in terms of definition in existing literature. This current research aims to contribute towards identifying the most important DQDs for big data in the health industry. It is a continuation of previous work, which, using relevant literature, identified five DQDs (accuracy, completeness, consistency, reliability and timeliness) as being the most important DQDs in health datasets. The previous work used a human judgement based research method known as an inner hermeneutic cycle (IHC). To remove the potential bias coming from the human judgement aspect, this research study used the same set of literature but applied a statistical research method (used to extract knowledge from a set of documents) known as latent semantic analysis (LSA). Use of LSA concluded that accuracy and completeness were the only similar DQDs classed as the most important in health Big Data for both IHC and LSA

    Discovering the most important data quality dimensions in health big data using latent semantic analysis

    Get PDF
    Big Data quality is a field which is emerging. Many authors nowadays agree that data quality is still very relevant, even for Big Data uses. However, there is a lack of frameworks or guidelines focusing on how to carry out big data quality initiatives. The starting point of any data quality work is to determine the properties of data quality, termed ‘data quality dimensions’ (DQDs). Even these dimensions lack precise rigour in terms of definition in existing literature. This current research aims to contribute towards identifying the most important DQDs for big data in the health industry. It is a continuation of previous work, which, using relevant literature, identified five DQDs (accuracy, completeness, consistency, reliability and timeliness) as being the most important DQDs in health datasets. The previous work used a human judgement based research method known as an inner hermeneutic cycle (IHC). To remove the potential bias coming from the human judgement aspect, this research study used the same set of literature but applied a statistical research method (used to extract knowledge from a set of documents) known as latent semantic analysis (LSA). Use of LSA concluded that accuracy and completeness were the only similar DQDs classed as the most important in health Big Data for both IHC and LSA

    Replacing missing values using trustworthy data values from web data sources

    Get PDF
    In practice, collected data usually are incomplete and contains missing value. Existing approaches in managing missing values overlook the importance of trustworthy data values in replacing missing values. In view that trusted completed data is very important in data analysis, we proposed a framework of missing value replacement using trustworthy data values from web data sources. The proposed framework adopted ontology to map data values from web data sources to the incomplete dataset. As data from web is conflicting with each other, we proposed a trust score measurement based on data accuracy and data reliability. Trust score is then used to select trustworthy data values from web data sources for missing values replacement. We successfully implemented the proposed framework using financial dataset and presented the findings in this paper. From our experiment, we manage to show that replacing missing values with trustworthy data values is important especially in a case of conflicting data to solve missing values problem

    Investigating the attainment of optimum data quality for EHR Big Data: proposing a new methodological approach

    Get PDF
    The value derivable from the use of data is continuously increasing since some years. Both commercial and non-commercial organisations have realised the immense benefits that might be derived if all data at their disposal could be analysed and form the basis of decision taking. The technological tools required to produce, capture, store, transmit and analyse huge amounts of data form the background to the development of the phenomenon of Big Data. With Big Data, the aim is to be able to generate value from huge amounts of data, often in non-structured format and produced extremely frequently. However, the potential value derivable depends on general level of governance of data, more precisely on the quality of the data. The field of data quality is well researched for traditional data uses but is still in its infancy for the Big Data context. This dissertation focused on investigating effective methods to enhance data quality for Big Data. The principal deliverable of this research is in the form of a methodological approach which can be used to optimize the level of data quality in the Big Data context. Since data quality is contextual, (that is a non-generalizable field), this research study focuses on applying the methodological approach in one use case, in terms of the Electronic Health Records (EHR). The first main contribution to knowledge of this study systematically investigates which data quality dimensions (DQDs) are most important for EHR Big Data. The two most important dimensions ascertained by the research methods applied in this study are accuracy and completeness. These are two well-known dimensions, and this study confirms that they are also very important for EHR Big Data. The second important contribution to knowledge is an investigation into whether Artificial Intelligence with a special focus upon machine learning could be used in improving the detection of dirty data, focusing on the two data quality dimensions of accuracy and completeness. Regression and clustering algorithms proved to be more adequate for accuracy and completeness related issues respectively, based on the experiments carried out. However, the limits of implementing and using machine learning algorithms for detecting data quality issues for Big Data were also revealed and discussed in this research study. It can safely be deduced from the knowledge derived from this part of the research study that use of machine learning for enhancing data quality issues detection is a promising area but not yet a panacea which automates this entire process. The third important contribution is a proposed guideline to undertake data repairs most efficiently for Big Data; this involved surveying and comparing existing data cleansing algorithms against a prototype developed for data reparation. Weaknesses of existing algorithms are highlighted and are considered as areas of practice which efficient data reparation algorithms must focus upon. Those three important contributions form the nucleus for a new data quality methodological approach which could be used to optimize Big Data quality, as applied in the context of EHR. Some of the activities and techniques discussed through the proposed methodological approach can be transposed to other industries and use cases to a large extent. The proposed data quality methodological approach can be used by practitioners of Big Data Quality who follow a data-driven strategy. As opposed to existing Big Data quality frameworks, the proposed data quality methodological approach has the advantage of being more precise and specific. It gives clear and proven methods to undertake the main identified stages of a Big Data quality lifecycle and therefore can be applied by practitioners in the area. This research study provides some promising results and deliverables. It also paves the way for further research in the area. Technical and technological changes in Big Data is rapidly evolving and future research should be focusing on new representations of Big Data, the real-time streaming aspect, and replicating same research methods used in this current research study but on new technologies to validate current results

    Improving the Data Quality of Drug Databases using Conditional Dependencies and Ontologies

    Get PDF
    International audienceMany health care systems and services exploit drug related information stored in databases. The poor data quality of these databases, e.g. inaccuracy of drug contraindications, can lead to catastrophic consequences for the health condition of patients. Hence it is important to ensure their quality in terms of data completeness and soundness. In the database domain, standard Functional Dependencies (FDs) and INclusion Dependencies (INDs), have been proposed to prevent the insertion of incorrect data. But they are generally not expressive enough to represent a domain-specific set of constraints. To this end, conditional dependencies, i.e. standard dependencies extended with tableau patterns containing constant values, have been introduced and several methods have been proposed for their discovery and representation. The quality of drug databases can be considerably improved by their usage. Moreover, pharmacology information is inherently hierarchical and many standards propose graph structures to represent them, e.g. the Anatomical Therapeutic Chemical classification (ATC) or OpenGalen's terminology. In this article, we emphasize that the technologies of the Semantic Web are adapted to represent these hierarchical structures, i.e. in RDFS and OWL. We also present a solution for representing conditional dependencies using a query language defined for these graph oriented structures, namely SPARQL. The benefits of this approach are interoperability with applications and ontologies of the Semantic Web as well as a reasoning-based query execution solution to clean underlying databases
    corecore