Performance Evaluation of Structured and Unstructured Data in PIG/HADOOP and MONGO-DB Environments
The exponential growth of data initially posed difficulties for prominent organizations such as Google, Yahoo, Amazon, Microsoft, Facebook, and Twitter. The volume of information that cloud applications must handle is growing significantly faster than storage capacity, and this growth demands new systems for managing and analyzing data. The term Big Data refers to large volumes of structured, semi-structured, and unstructured data generated by applications, messages, weblogs, and social networks.
Big Data is data whose size, variety, and uncertainty require new models, procedures, algorithms, and research to manage it and to extract value and hidden knowledge from it. To process large volumes of information efficiently, the analysis is parallelized, and NoSQL databases have been introduced to handle unstructured and semi-structured data. Hadoop serves Big Data analysis requirements well: it is designed to scale from a single server up to a large cluster of machines, with a high degree of fault tolerance.
Many businesses and research institutions, such as Facebook, Yahoo, and Google, have a growing need to import, store, and analyze dynamic semi-structured data and its metadata. The rapid growth of semi-structured data inside large web-based organizations has prompted the creation of NoSQL data stores for flexible storage and of MapReduce for scalable parallel analysis. These institutions have assessed, used, and extended Hadoop, the most popular open-source implementation of MapReduce, to address the needs of various real analytics problems; they also use MongoDB, a document-oriented NoSQL store. However, the performance trade-offs of these two technologies are not well understood. This paper evaluates the performance, scalability, and fault tolerance of MongoDB and Hadoop, with the goal of identifying the right software environment for scientific data analytics and research. In recent years, a growing number of organizations have developed diverse kinds of non-relational databases (such as MongoDB, Cassandra, Hypertable, HBase/Hadoop, and CouchDB), generally referred to as NoSQL databases. The enormous amount of information generated requires an effective system to analyze the data in various scenarios and under various constraints. The objective of this paper is to find the break-even point of Hadoop/Pig and MongoDB and to develop a robust environment for data analytics.
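The grouped-aggregation workload on which tools like Pig and MongoDB are typically compared can be sketched in plain Python as a map phase followed by a reduce phase. The records, field names, and thread-pool size below are invented for illustration; a real deployment would run the map tasks on cluster workers rather than local threads.

```python
from collections import defaultdict
from multiprocessing.dummy import Pool  # thread pool as a stand-in for cluster workers

# Hypothetical semi-structured records (stand-ins for real log data).
records = [
    {"user": "a", "bytes": 120},
    {"user": "b", "bytes": 300},
    {"user": "a", "bytes": 80},
]

def map_phase(record):
    # Emit a (key, value) pair, as a Pig GROUP BY or MongoDB $group stage would.
    return (record["user"], record["bytes"])

def reduce_phase(pairs):
    # Sum the values for each key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

with Pool(4) as pool:
    pairs = pool.map(map_phase, records)  # map tasks run in parallel
totals = reduce_phase(pairs)
print(totals)  # {'a': 200, 'b': 300}
```

The map phase is embarrassingly parallel, which is why both systems scale it across machines; the break-even point the paper looks for depends on how each system shuffles the intermediate pairs to the reducers.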
Big data analytics in healthcare: promise and potential
Objective To describe the promise and potential of big data analytics in healthcare.
Methods The paper describes the nascent field of big data analytics in healthcare, discusses the benefits, outlines an architectural framework and methodology, describes examples reported in the literature, briefly discusses the challenges, and offers conclusions.
Results The paper provides a broad overview of big data analytics for healthcare researchers and practitioners.
Conclusions Big data analytics in healthcare is evolving into a promising field for providing insight from very large data sets and improving outcomes while reducing costs. Its potential is great; however, there remain challenges to overcome.
Experimental evaluation of big data querying tools
In recent years, the term Big Data has become a widely debated topic in several
business areas. One of the main challenges related to this concept is how to
handle the enormous volume and variety of data efficiently. Given the notorious
complexity and volume of data associated with Big Data, efficient query
mechanisms are needed for data analysis. Motivated by the rapid development of
Big Data tools and frameworks, there is much discussion about query tools and,
more specifically, about which are most appropriate for specific analytical
needs. This dissertation describes and compares the main features and
architectures of the following well-known Big Data analytics tools: Drill,
HAWQ, Hive, Impala, Presto, and Spark. To test the performance of these tools,
we also describe the process of preparing, configuring, and administering a
Hadoop cluster so that the tools can be installed and used in an environment
capable of evaluating their performance and of identifying the scenarios best
suited to each. For this evaluation we used the TPC-H and TPC-DS benchmarks,
whose results showed that in-memory processing tools such as HAWQ, Impala, and
Presto achieve better results and performance on small and medium-sized
datasets. However, the tools with slower execution times, especially Hive,
appear to catch up with the better-performing tools as the benchmark datasets
grow.
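The core of such a benchmark is a timing harness that runs each query several times against each engine and keeps the best time. The sketch below uses an in-memory SQLite table as a stand-in; the table name, row count, and query are illustrative assumptions, not the TPC-H/TPC-DS workloads the dissertation actually runs.

```python
import sqlite3
import time

def time_query(conn, sql, runs=3):
    """Run a query several times and return the best wall-clock time."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # force full result materialization
        best = min(best, time.perf_counter() - start)
    return best

# Toy stand-in table (name borrowed from TPC-H's lineitem, data invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineitem (qty INTEGER, price REAL)")
conn.executemany("INSERT INTO lineitem VALUES (?, ?)",
                 [(i % 50, i * 0.1) for i in range(10_000)])

elapsed = time_query(conn, "SELECT qty, SUM(price) FROM lineitem GROUP BY qty")
print(f"best of 3: {elapsed:.4f}s")
```

Taking the best (or median) of repeated runs, as here, reduces noise from caching and JIT warm-up, which matters when comparing in-memory engines such as Impala or Presto against disk-oriented ones such as Hive.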
Collaborative Cloud Computing Framework for Health Data with Open Source Technologies
The proliferation of sensor technologies and advancements in data collection
methods have enabled the accumulation of very large amounts of data.
Increasingly, these datasets are considered for scientific research. However,
designing a system architecture that achieves high performance in terms of
parallelization, query processing time, and aggregation of heterogeneous data
types (e.g., time series, images, and structured data), while keeping the
research reproducible, remains a major challenge. This is especially true for
health sciences research, where the systems must be i) easy to use
with the flexibility to manipulate data at the most granular level, ii)
agnostic of programming language kernel, iii) scalable, and iv) compliant with
the HIPAA privacy law. In this paper, we review the existing literature for
such big data systems for scientific research in health sciences and identify
the gaps of the current system landscape. We propose a novel architecture for
software-hardware-data ecosystem using open source technologies such as Apache
Hadoop, Kubernetes and JupyterHub in a distributed environment. We also
evaluate the system using a large clinical data set of 69M patients. Comment: this paper is accepted in ACM-BCB 202
Blending big data analytics : review on challenges and a recent study
With massive amounts of data collected every day, big data analytics has emerged as an important trend for many organizations. The collected data can contain information that is key to solving wide-ranging problems in areas such as cyber security, marketing, healthcare, and fraud detection. To analyze their large volumes of data for business analyses and decisions, large companies such as Facebook and Google adopt analytics, and such analyses and decisions shape existing and future technology. In this paper, we explore how big data analytics is used to solve problems of complex and unstructured data with technologies such as Hadoop, Spark, and MapReduce. We also discuss the data challenges introduced by big data according to the literature, including its six V's. Moreover, we investigate case studies of big data analytics across its main techniques, namely text, voice, video, and network analytics. We conclude that big data analytics can bring positive changes to many fields, such as education, military, healthcare, politics, business, agriculture, banking, and marketing. © 2013 IEEE
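Of the four case-study techniques named above, text analytics is the easiest to sketch: tokenize a corpus and count term frequencies. The two documents below are invented placeholders, not data from the paper.

```python
import re
from collections import Counter

# Hypothetical mini-corpus (placeholders, not the paper's case-study data).
docs = [
    "big data analytics in healthcare",
    "big data for fraud detection",
]

# Tokenize: lowercase and keep alphabetic runs only.
tokens = [t for doc in docs for t in re.findall(r"[a-z]+", doc.lower())]
freq = Counter(tokens)
print(freq["big"], freq["data"], freq["healthcare"])  # 2 2 1
```

Real text analytics adds stop-word removal, weighting such as TF-IDF, and distributed counting (the classic MapReduce word count), but the term-frequency step above is the common core.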
BIG DATA ANALYTICS - AN OVERVIEW
Big Data Analytics has been gaining more attention recently, as researchers in business and academia try to mine and use all possible knowledge from the vast amounts of data generated and collected. Traditional data analysis methods stumble over such large amounts of data arriving in a short period of time, demanding a paradigm shift in how Big Data is stored, processed, and analyzed. Because of its importance, many U.S. agencies, including government agencies, have in recent years released large funds for research in Big Data and related fields. This paper gives a concise summary of research progress in various areas related to Big Data processing and analysis and concludes with a discussion of research directions in those areas.
Designing a Modern Software Engineering Training Program with Cloud Computing
The software engineering industry is trending towards cloud computing. For our project, we assessed the various tools and practices used in modern software development. The main goals of this project were to create a reference model for developing cloud-based applications, to program a functional cloud-based prototype, and to develop an accompanying training manual. These materials will be incorporated into the software engineering courses at WPI, namely CS 3733 and CS 509
Evaluation of Storage Systems for Big Data Analytics
abstract: Recent trends in big data storage systems show a shift from disk-centric models to memory-centric models. The primary challenges faced by these systems are speed, scalability, and fault tolerance, and it is interesting to investigate the performance of the two models with respect to big data applications. This thesis studies the performance of Ceph (a disk-centric model) and Alluxio (a memory-centric model) and evaluates whether a hybrid model provides any performance benefits for big data applications. To this end, an application, TechTalk, is created that uses Ceph to store data and Alluxio to perform data analytics. The functionalities of the application include offline lecture storage, live recording of classes, content analysis, and reference generation. The knowledge base of videos is constructed by analyzing the offline data using machine learning techniques; this training dataset provides the knowledge to construct the index of an online stream, and the indexed metadata enables students to search, view, and access the relevant content. The performance of the application is benchmarked in different use cases to demonstrate the benefits of the hybrid model. Masters Thesis, Computer Science, 201
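The disk-centric/memory-centric hybrid can be pictured as a read-through cache: a small memory tier (the Alluxio role) in front of a durable disk tier (the Ceph role). The classes, capacity, and eviction policy below are simplifying assumptions for illustration, not how either system is actually implemented.

```python
class DiskTier:
    """Stand-in for a durable, disk-centric store (the Ceph role); dict-backed."""
    def __init__(self):
        self._blocks = {}
    def put(self, key, data):
        self._blocks[key] = data
    def get(self, key):
        return self._blocks[key]

class MemoryTier:
    """Stand-in for a memory-centric layer (the Alluxio role): a read-through
    cache with a tiny capacity and oldest-first eviction."""
    def __init__(self, backing, capacity=2):
        self.backing = backing
        self.capacity = capacity
        self.cache = {}  # insertion-ordered; evict the oldest entry
        self.hits = self.misses = 0
    def get(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        data = self.backing.get(key)       # fall through to the disk tier
        if len(self.cache) >= self.capacity:
            self.cache.pop(next(iter(self.cache)))
        self.cache[key] = data
        return data

disk = DiskTier()
for k in ("lec1", "lec2", "lec3"):
    disk.put(k, f"video:{k}")              # durable copies on the disk tier

mem = MemoryTier(disk)
for k in ("lec1", "lec1", "lec2", "lec1"):
    mem.get(k)                             # repeated reads served from memory
print(mem.hits, mem.misses)  # 2 2
```

The benefit the thesis measures comes from exactly this asymmetry: repeated analytics reads hit the memory tier, while durability and capacity stay with the disk tier.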