7 research outputs found

    On exploring data lakes by finding compact, isolated clusters

    Data engineers are very interested in data lake technologies due to the incredible abundance of datasets. They typically use clustering to understand the structure of the datasets before applying other methods to infer knowledge from them. This article presents the first proposal that explores how to use a meta-heuristic to address the problem of multi-way single-subspace automatic clustering, which is very appropriate in the context of data lakes. It was confronted with five strong competitors that combine the state-of-the-art attribute selection proposal with three classical single-way clustering proposals, a recent quantum-inspired one, and a recent deep-learning one. The evaluation focused on exploring their ability to find compact and isolated clusterings, as well as the extent to which such clusterings can be considered good classifications. The statistical analyses conducted on the experimental results prove that it ranks first regarding effectiveness using six standard coefficients and that it is very efficient in terms of CPU time, not to mention that it did not result in any degraded clusterings or timeouts. Summing up: this proposal contributes to the array of techniques that data engineers can use to explore their data lakes. Funding: Ministerio de Economía y Competitividad TIN2016-75394-R; Ministerio de Ciencia e Innovación PID2020-112540RB-C44; Junta de Andalucía P18-RT-1060; Junta de Andalucía US-138137.
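    The standard coefficients the abstract alludes to measure how compact and isolated clusters are; one common such measure is the silhouette coefficient. As a purely illustrative sketch (the toy data and labels below are invented and unrelated to the article's experiments, and this is not the article's meta-heuristic), it can be computed in plain Python:

    ```python
    from math import dist  # Euclidean distance, Python 3.8+

    def silhouette(points, labels):
        """Mean silhouette score: values near +1 indicate compact, isolated clusters."""
        clusters = {l: [p for p, m in zip(points, labels) if m == l] for l in set(labels)}
        scores = []
        for p, l in zip(points, labels):
            own = clusters[l]
            # a: mean distance from p to the other members of its own cluster (compactness)
            a = sum(dist(p, q) for q in own if q != p) / max(len(own) - 1, 1)
            # b: mean distance to the nearest other cluster (isolation)
            b = min(sum(dist(p, q) for q in c) / len(c)
                    for m, c in clusters.items() if m != l)
            scores.append((b - a) / max(a, b))
        return sum(scores) / len(scores)

    # Two visibly compact, well-separated toy clusters
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    lbl = [0, 0, 0, 1, 1, 1]
    print(round(silhouette(pts, lbl), 2))  # close to 1: compact and isolated
    ```

    A poor clustering (e.g. labels assigned at random to the same points) would drive the score toward 0 or below, which is what "degraded clusterings" would look like under this coefficient.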

    Framework of Big Data Analytics in Real Time for Healthcare Enterprise Performance Measurements

    Healthcare organizations (HCOs) currently have many information records about their patients. Yet, in many cases they cannot draw proper, faster, and more thoughtful conclusions from their information. Much of the information is structured data such as medical records, historical data, and non-clinical information. This data is stored in a central repository called the Data Warehouse (DW). The DW provides querying and reporting to different groups within the healthcare organization to support their future strategic initiatives. The generated reports create metrics to measure the organization's performance for post-action plans, not for real-time decisions. Additionally, healthcare organizations seek to benefit from semi-structured and unstructured data by adopting emerging technology such as big data to aggregate all data collected from different sources, obtained from Electronic Medical Record (EMR), scheduling, registration, billing systems, and wearable devices, into one volume for better data analytics. For data completeness, big data is an essential element in improving healthcare systems. It is expected to revamp the outlook of the healthcare industry by reducing costs and improving quality. In this research, a framework is developed to utilize big data that interconnects all aspects of healthcare for real-time analytics and performance measurements. It is a comprehensive framework that integrates 41 components in 6 layers: Organization, People, Process, Data, Technology, and Outcomes to ensure successful implementation. Each component in the framework and its linkage with other components are explained to show the coherency. Moreover, the research highlights how data completeness leads to better healthcare quality outcomes and is essential for healthcare organization survival.
Additionally, the framework offers guidelines for selecting the appropriate technology with the flexibility of implementing the solution on a small or large scale, considering the benefits vs. investment. A case study has been used to validate the framework, and interviews with Subject Matter Experts (SMEs) have been conducted to provide another valuable perspective for a complete picture. The findings revealed that focusing only on big data technology could cause implementation to fail without accomplishing the desired value of the data analytics outcomes: technology alone addresses only one dimension, not the enterprise level. In addition, the framework proposes another 40 components that need to be considered for a successful implementation. Healthcare organizations can design the future of healthcare utilizing big data and analytics toward the fourth revolution in healthcare, known as Healthcare 4.0 (H 4.0). This research is a contribution to this effort and a response to those needs.

    Extensible metadata management framework for personal data lake

    Common Internet users today are inundated with a deluge of diverse data being generated and siloed in a variety of digital services, applications, and a growing body of personal computing devices as we enter the era of the Internet of Things. Alongside potential privacy compromises, users are facing increasing difficulties in managing their data and are losing control over it. There appears to be a de facto agreement in business and scientific fields that there is critical new value and interesting insight that users can attain by analysing their own data, if only it can be freed from its silos and combined with other data in meaningful ways. This thesis takes the point of view that users should have an easy-to-use modern personal data management solution that enables them to centralise and efficiently manage their data by themselves, under their full control, for their best interests, with minimal time and effort. In that direction, we describe the basic architecture of a management solution that is designed based on solid theoretical foundations and state-of-the-art big data technologies. This solution (called Personal Data Lake - PDL) collects the data of a user from a plurality of heterogeneous personal data sources and stores it in a highly scalable, schema-less storage repository. To simplify the user experience of PDL, we propose a novel extensible metadata management framework (MMF) that: (i) annotates heterogeneous data with rich lineage and semantic metadata, (ii) exploits the garnered metadata for automating data management workflows in PDL – with extensive focus on data integration, and (iii) facilitates the use and reuse of the stored data for various purposes by querying it on the metadata level, either directly by the user or through third-party personal analytics services. We first show how the proposed MMF is positioned in the PDL architecture, and then describe its principal components.
Specifically, we introduce a simple yet effective lineage manager for tracking the provenance of personal data in PDL. We then introduce an ontology-based data integration component called SemLinker which comprises two new algorithms; the first concerns generating graph-based representations to express the native schemas of (semi-)structured personal data, and the second metamodels the extracted representations to a common extensible ontology. SemLinker outputs are utilised by MMF to generate user-tailored unified views that are optimised for querying heterogeneous personal data through low-level SPARQL or high-level SQL-like queries. Next, we introduce an unsupervised automatic keyphrase extraction algorithm called SemCluster that specialises in extracting thematically important keyphrases from unstructured data, and associating each keyphrase with ontological information drawn from an extensible WordNet-based ontology. SemCluster outputs serve as semantic metadata and are utilised by MMF to annotate unstructured contents in PDL, thus enabling various management functionalities such as relationship discovery and semantic search. Finally, we describe how MMF can be utilised to perform holistic integration of personal data and to query it jointly in its native representations.
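The idea of querying on the metadata level rather than over raw contents can be illustrated with a toy triple catalogue. This is only a minimal sketch: the file names, predicates, and annotations below are invented for illustration, and the real MMF operates over an extensible ontology with SPARQL rather than this ad-hoc pattern matcher.

```python
# Toy metadata catalogue: (subject, predicate, object) triples, in the spirit
# of MMF's lineage and semantic annotations. All names here are hypothetical.
triples = {
    ("photo_001.jpg", "source",    "phone_camera"),
    ("photo_001.jpg", "keyphrase", "beach holiday"),
    ("note_17.txt",   "source",    "notes_app"),
    ("note_17.txt",   "keyphrase", "beach holiday"),
    ("invoice_3.pdf", "source",    "email"),
}

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# Metadata-level query: find items annotated with the same keyphrase,
# regardless of their native format (image, plain text, ...).
hits = sorted(s for s, _, _ in match(p="keyphrase", o="beach holiday"))
print(hits)  # ['note_17.txt', 'photo_001.jpg']
```

The point of the sketch is that heterogeneous items become jointly queryable once they share a common annotation vocabulary, which is the role the WordNet-based ontology plays for SemCluster's keyphrases.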