    On-Demand Big Data Integration: A Hybrid ETL Approach for Reproducible Scientific Research

    Scientific research requires access to, analysis of, and sharing of data that is distributed across heterogeneous data sources at the scale of the Internet. An eager ETL process constructs an integrated data repository as its first step, integrating and loading data in its entirety from the data sources. Bootstrapping such a process is inefficient for scientific research that needs data from very large and typically numerous distributed sources. A lazy ETL process loads only the metadata, but still does so eagerly. Lazy ETL bootstraps faster; however, queries on the integrated data repository of eager ETL perform faster, because the entire data is already available. In this paper, we propose a novel ETL approach for scientific data integration, a hybrid of the eager and lazy ETL approaches applied to both data and metadata. Hybrid ETL thus supports incremental integration and loading of metadata and data from the data sources. We enhance the hybrid ETL with a human-in-the-loop approach, with selective data integration driven by user queries and sharing of integrated data between users. We implement our hybrid ETL approach in a prototype platform, Obidos, and evaluate it in the context of data sharing for medical research. Obidos outperforms both the eager and lazy ETL approaches for scientific research data integration and sharing, through its selective loading of data and metadata, while storing the integrated data in a scalable integrated data repository.

    Comment: Pre-print submitted to the DMAH Special Issue of the Springer DAPD Journal
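    The eager/lazy distinction is easiest to see in code. Below is a minimal Python sketch of the hybrid idea: metadata is fetched per source only when a query first touches it, and data is pulled into the integrated repository only for records the query actually selects. The `Source` and `HybridETL` classes are illustrative stand-ins, not the Obidos API.

```python
from dataclasses import dataclass, field

@dataclass
class Source:
    name: str
    records: dict                      # record id -> payload held at the remote source

    def fetch_metadata(self):
        # Lazy step: transfer only identifiers, not the payloads.
        return list(self.records.keys())

    def fetch_data(self, record_id):
        # Eager step, but performed per record and only on demand.
        return self.records[record_id]

@dataclass
class HybridETL:
    sources: list
    repository: dict = field(default_factory=dict)   # integrated data repository
    metadata: dict = field(default_factory=dict)     # incrementally loaded metadata

    def query(self, predicate):
        """Query-driven, selective integration of metadata and data."""
        results = []
        for src in self.sources:
            if src.name not in self.metadata:        # load metadata lazily, once
                self.metadata[src.name] = src.fetch_metadata()
            for rid in filter(predicate, self.metadata[src.name]):
                if rid not in self.repository:       # load data only when selected
                    self.repository[rid] = src.fetch_data(rid)
                results.append(self.repository[rid])
        return results

# Only records matching the user's query are ever transferred and stored.
etl = HybridETL(sources=[Source("clinic-a", {"scan-1": "dicom-bytes", "scan-2": "dicom-bytes"})])
print(etl.query(lambda rid: rid == "scan-1"))
```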

    Recommender System Using Collaborative Filtering Algorithm

    With the vast amount of data available today, institutions are looking for ever more accurate ways of using it. Companies like Amazon use their huge amounts of data to give recommendations to users. Based on similarities among items, systems can predict the rating of a new item. Recommender systems use user, item, and rating information to predict how other users will like a particular item. Recommender systems are now pervasive and seek to make profit from customers or successfully meet their needs. To reach this goal, however, systems need to parse a lot of data, collect information, sometimes from different resources, and predict how the user will like the product or item. The computation power needed is considerable. Companies also try to avoid flooding customer mailboxes with hundreds of products each morning, so they look for the one email or text that will make the customer look and act. The motivation for this project comes from my eagerness to learn website design and gain a deep understanding of recommender systems. Applying machine learning dynamically is one of the goals I set for myself, and I wanted to go beyond that and verify my results. Thus, I had to use a large dataset to test the algorithm and compare each technique in terms of error rate. My experience with applying collaborative filtering taught me that finding a solution is not enough; one should strive for a fast and optimal one. In my case, testing the algorithm on a large dataset required me to refine its coding strategy many times to speed up the process. In this project, I have designed a website that uses different recommendation techniques: the user-based, item-based, and model-based approaches of collaborative filtering. Each technique has its own way of predicting a user's rating for a new item from existing users' data. To evaluate each method, I used MovieLens, an external dataset of users, items, and ratings, and calculated the error rate using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). Finally, each method has strengths and weaknesses that relate to the domain in which it is applied
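    As a concrete illustration of the item-based variant and the two error metrics, the sketch below predicts held-out ratings from item-item cosine similarity and reports MAE and RMSE on a toy ratings matrix. It assumes a small in-memory matrix rather than the MovieLens files, and the function names are illustrative, not the project's.

```python
import numpy as np

def item_similarity(ratings):
    """Cosine similarity between item columns; 0 marks a missing rating."""
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0                    # avoid division by zero
    normalized = ratings / norms
    return normalized.T @ normalized

def predict(ratings, user, item, sim):
    """Weighted average of the user's own ratings on items similar to `item`."""
    rated = ratings[user] > 0
    weights = sim[item, rated]
    if weights.sum() == 0:
        return ratings[ratings > 0].mean()     # fall back to the global mean
    return float(weights @ ratings[user, rated] / weights.sum())

# Toy users x items ratings matrix; 0 means "not rated".
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
S = item_similarity(R)

held_out = [(0, 2, 2.0), (1, 1, 3.0)]          # (user, item, true rating)
errors = np.array([predict(R, u, i, S) - true for u, i, true in held_out])
mae, rmse = np.abs(errors).mean(), np.sqrt((errors ** 2).mean())
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}")
```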

    Reporting an Experience on Design and Implementation of e-Health Systems on Azure Cloud

    Electronic Health (e-Health) technology has brought the world a significant transformation from traditional paper-based medical practice to Information and Communication Technologies (ICT)-based systems for automatic management (storage, processing, and archiving) of information. Traditionally, e-Health systems have been designed to operate within stovepipes on dedicated networks, physical computers, and locally managed software platforms, which makes them susceptible to serious limitations including: 1) lack of on-demand scalability during critical situations; 2) high administrative overheads and costs; and 3) inefficient resource utilization and energy consumption due to lack of automation. In this paper, we present an approach to migrate ICT systems in the e-Health sector from a traditional in-house Client/Server (C/S) architecture to a virtualised cloud computing environment. To this end, we developed two cloud-based e-Health applications (a Medical Practice Management System and a Telemedicine Practice System) that demonstrate how cloud services can be leveraged for developing and deploying such applications. The Windows Azure cloud computing platform is selected as an example public cloud platform for our study. We conducted several performance evaluation experiments to understand the Quality of Service (QoS) tradeoffs of our applications under variable workload on Azure.

    Comment: Submitted to the third IEEE International Conference on Cloud and Green Computing (CGC 2013)
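    A rough sketch of the kind of variable-workload QoS measurement described above is shown below: it steps up the number of concurrent clients against a deployed endpoint and reports latency percentiles. The endpoint URL is a placeholder and the harness is an assumption, not the one used in the paper.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Placeholder endpoint: substitute the URL of the deployed application.
ENDPOINT = "https://example-ehealth.azurewebsites.net/api/patients"

def timed_request(_):
    start = time.perf_counter()
    with urlopen(ENDPOINT, timeout=30) as resp:   # one synchronous request
        resp.read()
    return time.perf_counter() - start

def run_workload(concurrent_users, requests_per_user=20):
    """Issue requests from `concurrent_users` parallel clients and summarise latency."""
    total = concurrent_users * requests_per_user
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        latencies = sorted(pool.map(timed_request, range(total)))
    return {"users": concurrent_users,
            "median_s": round(statistics.median(latencies), 3),
            "p95_s": round(latencies[int(0.95 * total) - 1], 3)}

# Step the workload up to observe how response time degrades under load.
for users in (1, 10, 50):
    print(run_workload(users))
```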

    Hybrid Data Storage Framework for the Biometrics Domain

    Biometric-based authentication is one of the most popular techniques adopted in large-scale identity matching systems due to its robustness in access control. In recent years, the number of enrolments has increased significantly, posing serious issues for the performance and scalability of these systems. In addition, the use of multiple modalities (such as face, iris, and fingerprint) further increases the scalability issues. This research work focuses on the development of a new Hybrid Data Storage Framework (HDSF) that improves the scalability and performance of biometric authentication systems (BAS). In this framework, the scalability issue is addressed by integrating a relational database and a NoSQL data store, combining the strengths of both. The proposed framework improves the performance of BAS in three areas: (i) a new biographic match-score-based key filtering process to identify duplicate records in the storage (de-duplication search); (ii) a multi-modal biometric-index-based key filtering process for identification and de-duplication search operations; and (iii) a parallel biometric matching approach for identification, enrolment, and verification processes. The efficacy of the proposed framework is compared with that of a traditional BAS at several values of the False Rejection Rate (FRR). Using our dataset and algorithms, we observe that, compared to the traditional BAS, the HDSF shows an overall efficiency improvement of more than 54% at zero FRR and above 60% for FRR values between 1% and 3.5% during identification search operations
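    The sketch below illustrates the key-filtering idea behind such a hybrid store: biographic attributes (relational side) narrow the candidate set before biometric templates (NoSQL side) are matched in parallel. The data layout, scoring function, and threshold are illustrative assumptions, not the HDSF implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Relational side (illustrative): biographic records keyed by enrolment id.
biographic = {
    "id-1": {"name": "a. kumar", "yob": 1984},
    "id-2": {"name": "a. kumaar", "yob": 1984},
    "id-3": {"name": "b. singh", "yob": 1990},
}
# NoSQL side (illustrative): large biometric templates keyed by the same ids.
templates = {"id-1": [0.12, 0.80], "id-2": [0.11, 0.79], "id-3": [0.90, 0.05]}

def biographic_score(query, record):
    """Crude match score: shared name characters plus year-of-birth agreement."""
    shared = len(set(query["name"]) & set(record["name"]))
    return shared + (5 if query["yob"] == record["yob"] else 0)

def biometric_distance(probe, template):
    return sum((p - t) ** 2 for p, t in zip(probe, template)) ** 0.5

def deduplication_search(query, probe, score_threshold=10):
    # 1. Key filtering on the biographic side keeps only plausible duplicates.
    candidates = [k for k, rec in biographic.items()
                  if biographic_score(query, rec) >= score_threshold]
    # 2. Parallel biometric matching runs only over the filtered keys.
    with ThreadPoolExecutor() as pool:
        distances = list(pool.map(
            lambda k: (k, biometric_distance(probe, templates[k])), candidates))
    return sorted(distances, key=lambda kv: kv[1])

print(deduplication_search({"name": "a. kumar", "yob": 1984}, [0.12, 0.81]))
```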

    Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

    Big data systems development is full of challenges in view of the variety of application areas and domains that this technology promises to serve. Typically, fundamental design decisions in big data systems design include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies into an optimized solution to a specific real-world problem, big data systems are no exception. As far as the storage aspect of any big data system is concerned, the primary facet is the storage infrastructure, and NoSQL seems to be the right technology to fulfil its requirements. However, every big data application has different data characteristics, and thus its data fits a different data model. This paper presents a feature and use-case analysis and comparison of the four main data models, namely document-oriented, key-value, graph, and wide-column. Moreover, a feature analysis of 80 NoSQL solutions is provided, elaborating on the criteria and points that a developer must consider while making a choice. Typically, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings the second facet of big data storage, big data file formats, into the picture. The second half of the paper compares the advantages, shortcomings, and possible use cases of the big data file formats available for Hadoop, which is the foundation of most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage, and their challenges and future prospects are also discussed
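    To make the comparison of the four data models concrete, the sketch below shows how one and the same entity might be laid out under each of them; plain Python structures stand in for actual stores, and all field names are invented for illustration.

```python
# Document-oriented: a self-describing, nested document per entity.
document = {
    "_id": "sensor-42",
    "site": {"name": "plant-3", "region": "eu"},
    "readings": [{"t": 1, "temp": 21.5}, {"t": 2, "temp": 21.9}],
}

# Key-value: opaque values addressed only by key; interpreting the value
# is the client's responsibility.
key_value = {"sensor-42": b'{"site": "plant-3", "region": "eu"}'}

# Wide-column: rows grouped into column families; columns may differ per row.
wide_column = {
    "sensor-42": {
        "meta": {"site": "plant-3", "region": "eu"},     # column family "meta"
        "readings": {"t:1": 21.5, "t:2": 21.9},          # column family "readings"
    }
}

# Graph: entities and relationships are first-class (adjacency list here).
graph = {
    "nodes": {"sensor-42": {"type": "sensor"}, "plant-3": {"type": "site"}},
    "edges": [("sensor-42", "LOCATED_AT", "plant-3")],
}

print(document["_id"], list(wide_column["sensor-42"]), graph["edges"][0][1])
```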