On-Demand Big Data Integration: A Hybrid ETL Approach for Reproducible Scientific Research
Scientific research requires access, analysis, and sharing of data that is
distributed across various heterogeneous data sources at the scale of the
Internet. An eager ETL process constructs an integrated data repository as its
first step, integrating and loading data in its entirety from the data sources.
The bootstrapping of this process is not efficient for scientific research that
requires access to data from very large and typically numerous distributed data
sources. A lazy ETL process loads only the metadata, but still eagerly. Lazy
ETL is faster in bootstrapping. However, queries on the integrated data
repository of eager ETL perform faster, due to the availability of the entire
data beforehand.
In this paper, we propose a novel ETL approach for scientific data
integration, as a hybrid of eager and lazy ETL approaches, applied to both
data and metadata. This way, Hybrid ETL supports incremental integration
and loading of metadata and data from the data sources. We incorporate a
human-in-the-loop approach, to enhance the hybrid ETL, with selective data
integration driven by the user queries and sharing of integrated data between
users. We implement our hybrid ETL approach in a prototype platform, Obidos,
and evaluate it in the context of data sharing for medical research. Obidos
outperforms both the eager ETL and lazy ETL approaches, for scientific research
data integration and sharing, through its selective loading of data and
metadata, while storing the integrated data in a scalable integrated data
repository.
Comment: Pre-print submitted to the DMAH Special Issue of the Springer DAPD
Journal
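The query-driven, incremental loading described in this abstract can be pictured with a minimal sketch. It is not the Obidos implementation: the class `IncrementalLoader`, its methods, and the in-memory `sources` dictionary standing in for remote data sources are all hypothetical names chosen for illustration.

```python
from typing import Dict, List


class IncrementalLoader:
    """Toy model: metadata and data are both loaded only when a query needs them."""

    def __init__(self, sources: Dict[str, List[dict]]):
        # `sources` is a stand-in for remote data sources (assumption).
        self._sources = sources
        self.metadata: Dict[str, dict] = {}
        # Shared integrated repository, filled incrementally.
        self.repository: Dict[str, List[dict]] = {}
        self.data_loads = 0

    def describe(self, source: str) -> dict:
        # Load metadata for one source on first use, then reuse it.
        if source not in self.metadata:
            self.metadata[source] = {"records": len(self._sources[source])}
        return self.metadata[source]

    def query(self, source: str) -> List[dict]:
        # Load the data for one source on first use; later queries,
        # including those from other users, hit the shared repository.
        if source not in self.repository:
            self.repository[source] = list(self._sources[source])
            self.data_loads += 1
        return self.repository[source]
```

Under these assumptions nothing is loaded at start-up, and a source that no query touches is never copied into the repository.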
Recommender System Using Collaborative Filtering Algorithm
With the vast amount of data that the world has nowadays, institutions are looking for more and more accurate ways of using this data. Companies like Amazon use their huge amounts of data to give recommendations for users. Based on similarities among items, systems can give predictions for a new item’s rating. Recommender systems use the user, item, and ratings information to predict how other users will like a particular item.
Recommender systems are now pervasive and seek to make profit out of customers or successfully meet their needs. However, to reach this goal, systems need to parse a lot of data and collect information, sometimes from different resources, and predict how the user will like the product or item. The computation power needed is considerable. Also, companies try to avoid flooding customer mailboxes with hundreds of products each morning, thus they are looking for one email or text that will make the customer look and act.
The motivation for the project comes from my eagerness to learn website design and gain a deep understanding of recommender systems. Applying machine learning dynamically is one of the goals that I set for myself, and I wanted to go beyond that and verify my results. Thus, I had to use a large dataset to test the algorithm and compare each technique in terms of error rate. My experience with applying collaborative filtering helped me understand that finding a solution is not enough; one must also strive for a fast one. In my case, testing my algorithm on a large dataset required me to refine the coding strategy of the algorithm many times to speed up the process.
In this project, I have designed a website that uses different techniques for recommendations: the user-based, item-based, and model-based approaches to collaborative filtering. Every technique has its own way of predicting the user rating for a new item based on existing users’ data. To evaluate each method, I used MovieLens, an external dataset of users, items, and ratings, and calculated the error using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). Finally, each method has its strengths and weaknesses that relate to the domain in which I am applying these methods.
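For reference, the two error measures named in this abstract can be computed as in the short sketch below. The function names and the sample ratings are illustrative assumptions, not taken from the project.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Error: average of |actual - predicted|."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Squared Error: square root of the mean squared difference."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical ratings on a 1-5 scale.
true_ratings = [4.0, 3.0, 5.0]
predicted_ratings = [3.0, 3.0, 4.0]
print(mae(true_ratings, predicted_ratings))
print(rmse(true_ratings, predicted_ratings))
```

RMSE penalizes large individual errors more heavily than MAE, which is one reason the two are often reported together.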
Reporting an Experience on Design and Implementation of e-Health Systems on Azure Cloud
Electronic Health (e-Health) technology has brought about a
significant transformation from traditional paper-based medical practice to
Information and Communication Technologies (ICT)-based systems for automatic
management (storage, processing, and archiving) of information. Traditionally,
e-Health systems have been designed to operate within stovepipes on dedicated
networks, physical computers, and locally managed software platforms that make
them susceptible to many serious limitations including: 1) lack of on-demand
scalability during critical situations; 2) high administrative overheads and
costs; and 3) inefficient resource utilization and energy consumption due to
lack of automation. In this paper, we present an approach to migrate the ICT
systems in the e-Health sector from traditional in-house Client/Server (C/S)
architecture to the virtualised cloud computing environment. To this end, we
developed two cloud-based e-Health applications (Medical Practice Management
System and Telemedicine Practice System) for demonstrating how cloud services
can be leveraged for developing and deploying such applications. The Windows
Azure cloud computing platform is selected as an example public cloud platform
for our study. We conducted several performance evaluation experiments to
understand the Quality of Service (QoS) tradeoffs of our applications under
variable workload on Azure.
Comment: Submitted to third IEEE International Conference on Cloud and Green
Computing (CGC 2013)
Hybrid Data Storage Framework for the Biometrics Domain
Biometric-based authentication is one of the most popular techniques adopted in large-scale identity matching systems due to its robustness in access control. In recent years, the number of enrolments has increased significantly, posing serious issues for the performance and scalability of these systems. In addition, the use of multiple modalities (such as face, iris and fingerprint) further increases the scalability issues. This research work focuses on the development of a new Hybrid Data Storage Framework (HDSF) that would improve the scalability and performance of biometric authentication systems (BAS). In this framework, the scalability issue is addressed by integrating a relational database and a NoSQL data store, which combines the strengths of both. The proposed framework improves the performance of BAS in three areas: (i) by proposing a new biographic match score based key filtering process, to identify any duplicate records in the storage (de-duplication search); (ii) by proposing a multi-modal biometric index based key filtering process for identification and de-duplication search operations; (iii) by adopting a parallel biometric matching approach for identification, enrolment and verification processes. The efficacy of the proposed framework is compared with that of the traditional BAS on several values of False Rejection Rate (FRR). Using our dataset and algorithms, it is observed that, when compared to traditional BAS, the HDSF is able to show an overall efficiency improvement of more than 54% for zero FRR and above 60% for FRR values between 1-3.5% during identification search operations.
Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges in view of the variety of
application areas and domains that this technology promises to serve.
Typically, fundamental design decisions involved in big data systems design
include choosing appropriate storage and computing infrastructures. In this age
of heterogeneous systems that integrate different technologies for an optimized
solution to a specific real-world problem, big data systems are no exception.
As far as the storage aspect of any big data system is
concerned, the primary facet in this regard is a storage infrastructure and
NoSQL seems to be the right technology that fulfills its requirements. However,
every big data application has variable data characteristics and thus, the
corresponding data fits into a different data model. This paper presents
feature and use case analysis and comparison of the four main data models,
namely document-oriented, key-value, graph and wide-column. Moreover, a feature
analysis of 80 NoSQL solutions has been provided, elaborating on the criteria
and points that a developer must consider while making a possible choice.
Typically, big data storage needs to communicate with the execution engine and
other processing and visualization technologies to create a comprehensive
solution. This brings the second facet of big data storage, big data file
formats, into the picture. The second half of the research paper compares the
advantages, shortcomings and possible use cases of available big data file
formats for Hadoop, which is the foundation for most big data computing
technologies. Decentralized storage and blockchain are seen as the next
generation of big data storage, and their challenges and future prospects have
also been discussed.