
    Performance Evaluation of Structured and Unstructured Data in PIG/HADOOP and MONGO-DB Environments

    The exponential growth of data initially posed challenges for prominent organizations such as Google, Yahoo, Amazon, Microsoft, Facebook, and Twitter. The size of the information that cloud applications must handle is growing significantly faster than storage capacity, and this growth demands new systems for managing and analyzing data. The term Big Data refers to the large volumes of unstructured (or semi-structured) and structured data created by applications, messages, weblogs, and social networks. Big Data is data whose size, variety, and uncertainty require new models, procedures, algorithms, and research to manage it and to extract value and hidden knowledge from it. To process information efficiently, analysis is parallelized, and NoSQL databases have been introduced to handle unstructured and semi-structured data. Hadoop serves Big Data analysis requirements well: it is designed to scale from a single server up to a large cluster of machines with a high degree of fault tolerance. Many businesses and research institutes, such as Facebook, Yahoo, and Google, have had a growing need to import, store, and analyze dynamic semi-structured data and its metadata. The significant growth of semi-structured data inside large web-based organizations has likewise prompted the creation of NoSQL data stores for flexible storage and of MapReduce for scalable parallel analysis. These institutes have assessed, used, and modified Hadoop, the most popular open-source implementation of MapReduce, to address the needs of various real analytics problems, and they also use MongoDB, a document-oriented NoSQL store. Nevertheless, there is limited understanding of the performance trade-offs of using these two technologies. This paper evaluates the performance, scalability, and fault tolerance of MongoDB and Hadoop, with the goal of identifying the right software environment for scientific data analytics and research. Lately, a growing number of organizations have adopted diverse kinds of non-relational databases (such as MongoDB, Cassandra, Hypertable, HBase/Hadoop, CouchDB, and so on), commonly referred to as NoSQL databases. The enormous amount of information generated requires an effective system to analyze the data in various scenarios and under various limits. The objective of this paper is to find the break-even point of Hadoop/Pig and MongoDB and to develop a robust environment for data analytics.
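
    To make the comparison concrete, here is a minimal sketch (not the paper's code) of the kind of head-to-head aggregation such an evaluation times: the same group-and-sum query expressed for MongoDB via pymongo, with a roughly equivalent Pig Latin script in the trailing comment. The local MongoDB instance and the "weblogs" collection with its fields are assumptions.

```python
# Sketch: time one MongoDB aggregation of the sort a Pig-vs-MongoDB
# benchmark would run. The "weblogs" collection and its "url"/"bytes"
# fields are hypothetical.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local mongod
logs = client["bench"]["weblogs"]

start = time.perf_counter()
# Group weblog records by URL and sum the bytes transferred per URL.
result = list(logs.aggregate([
    {"$group": {"_id": "$url", "total_bytes": {"$sum": "$bytes"}}},
    {"$sort": {"total_bytes": -1}},
]))
print(f"MongoDB aggregation: {len(result)} groups "
      f"in {time.perf_counter() - start:.2f}s")

# Roughly equivalent Pig Latin for the Hadoop/Pig side of the comparison:
#   logs    = LOAD 'weblogs' USING PigStorage(',')
#             AS (url:chararray, bytes:long);
#   grouped = GROUP logs BY url;
#   totals  = FOREACH grouped GENERATE group, SUM(logs.bytes) AS total_bytes;
#   sorted  = ORDER totals BY total_bytes DESC;
#   DUMP sorted;
```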

    Big data analytics in healthcare: promise and potential

    Objective To describe the promise and potential of big data analytics in healthcare. Methods The paper describes the nascent field of big data analytics in healthcare, discusses the benefits, outlines an architectural framework and methodology, describes examples reported in the literature, briefly discusses the challenges, and offers conclusions. Results The paper provides a broad overview of big data analytics for healthcare researchers and practitioners. Conclusions Big data analytics in healthcare is evolving into a promising field for providing insight from very large data sets and improving outcomes while reducing costs. Its potential is great; however, there remain challenges to overcome.

    Experimental evaluation of big data querying tools

    In recent years, the term Big Data has become a widely debated topic across many business areas. One of the main challenges related to this concept is how to handle the enormous volume and variety of data efficiently. Given the notable complexity and volume of data associated with Big Data, efficient querying mechanisms are needed for data analysis. Motivated by the rapid development of Big Data tools and frameworks, there is much discussion about querying tools and, more specifically, about which ones are most appropriate for specific analytical needs. This dissertation describes and compares the main features and architectures of the following well-known Big Data analytics tools: Drill, HAWQ, Hive, Impala, Presto, and Spark. To test the performance of these tools, we also describe the process of preparing, configuring, and administering a Hadoop cluster on which to install and run them, providing an environment capable of evaluating their performance and identifying the scenarios best suited to their use. To carry out this evaluation, we used the TPC-H and TPC-DS benchmarks. The results showed that in-memory processing tools such as HAWQ, Impala, and Presto achieve better results and performance on small and medium-sized datasets; however, the tools with slower execution times, especially Hive, appear to catch up with the better-performing tools as the benchmark datasets grow.
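
    As an illustration of the benchmark methodology, here is a minimal sketch that issues the same TPC-H-style query to Hive and Presto and compares wall-clock times. It assumes the PyHive package, a hypothetical cluster host "hadoop-master", and the TPC-H "lineitem" table already loaded.

```python
# Sketch: run one TPC-H-style query on Hive and on Presto and compare
# elapsed times. Hosts, ports, and the pre-loaded table are assumptions.
import time
from pyhive import hive, presto

QUERY = """
SELECT l_returnflag, l_linestatus,
       SUM(l_quantity)      AS sum_qty,
       AVG(l_extendedprice) AS avg_price
FROM lineitem
GROUP BY l_returnflag, l_linestatus
"""

def timed_run(cursor, label):
    # Execute the query, fetch all rows, and report wall-clock time.
    start = time.perf_counter()
    cursor.execute(QUERY)
    rows = cursor.fetchall()
    print(f"{label}: {len(rows)} rows in {time.perf_counter() - start:.1f}s")

timed_run(hive.connect(host="hadoop-master", port=10000).cursor(), "Hive")
timed_run(presto.connect(host="hadoop-master", port=8080).cursor(), "Presto")
```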

    Collaborative Cloud Computing Framework for Health Data with Open Source Technologies

    The proliferation of sensor technologies and advancements in data collection methods have enabled the accumulation of very large amounts of data. Increasingly, these datasets are considered for scientific research. However, designing a system architecture that achieves high performance in terms of parallelization, query processing time, and aggregation of heterogeneous data types (e.g., time series, images, and structured data, among others), while keeping scientific research reproducible, remains a major challenge. This is especially true for health sciences research, where the systems must be i) easy to use, with the flexibility to manipulate data at the most granular level, ii) agnostic of the programming-language kernel, iii) scalable, and iv) compliant with the HIPAA privacy law. In this paper, we review the existing literature on such big data systems for scientific research in the health sciences and identify the gaps in the current system landscape. We propose a novel architecture for a software-hardware-data ecosystem using open source technologies such as Apache Hadoop, Kubernetes, and JupyterHub in a distributed environment. We also evaluate the system using a large clinical data set of 69M patients. Comment: This paper is accepted in ACM-BCB 202
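
    As a hedged illustration of such an ecosystem, the snippet below sketches a JupyterHub configuration (jupyterhub_config.py is plain Python) that spawns per-user notebook pods on Kubernetes via KubeSpawner and points them at a Hadoop cluster. The image name, hosts, and resource limits are placeholders, not values from the paper.

```python
# jupyterhub_config.py -- minimal sketch of the implied configuration.
c = get_config()  # noqa: F821 -- injected by JupyterHub at startup

# Spawn each user's notebook server as a Kubernetes pod.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# A notebook image bundling Python/R/Julia kernels keeps the system
# agnostic of the programming-language kernel (placeholder image name).
c.KubeSpawner.image = "example.org/datasci-notebook:latest"

# Point notebook kernels at the Hadoop cluster holding the clinical data
# (hypothetical namenode address).
c.KubeSpawner.environment = {
    "HADOOP_CONF_DIR": "/etc/hadoop/conf",
    "HDFS_NAMENODE": "hdfs://namenode.example.org:8020",
}

# Per-user resource limits help the cluster scale to many researchers.
c.KubeSpawner.cpu_limit = 2
c.KubeSpawner.mem_limit = "8G"
```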

    Blending big data analytics : review on challenges and a recent study

    With the collection of massive amounts of data every day, big data analytics has emerged as an important trend for many organizations. These collected data can contain important information that may be key to solving wide-ranging problems, such as cyber security, marketing, healthcare, and fraud. To analyze their large volumes of data for business analyses and decisions, large companies, such as Facebook and Google, adopt analytics, and such analyses and decisions impact existing and future technology. In this paper, we explore how big data analytics is used to solve problems of complex and unstructured data with technologies such as Hadoop, Spark, and MapReduce; a small PySpark illustration of this pattern follows below. We also discuss the data challenges introduced by big data according to the literature, including its six V's. Moreover, we investigate case studies covering several analytics techniques, namely text, voice, video, and network analytics. We conclude that big data analytics can bring positive changes in many fields, such as education, military, healthcare, politics, business, agriculture, banking, and marketing, in the future. © 2013 IEEE
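
    As a small illustration of the MapReduce pattern these technologies share, the sketch below expresses the canonical text-analytics example, a word count over an unstructured corpus, in PySpark. The HDFS input path is a placeholder.

```python
# Sketch: classic MapReduce word count on Spark over unstructured text.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("text-analytics-sketch").getOrCreate()
lines = spark.sparkContext.textFile("hdfs:///corpus/*.txt")  # placeholder path

counts = (
    lines.flatMap(lambda line: line.lower().split())  # map: line -> words
         .map(lambda word: (word, 1))                 # map: word -> (word, 1)
         .reduceByKey(lambda a, b: a + b)             # reduce: sum per word
)

# Print the ten most frequent words.
for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, n)

spark.stop()
```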

    BIG DATA ANALYTICS - AN OVERVIEW

    Big Data analytics has been gaining more attention recently, as researchers in business and academia try to successfully mine and use all possible knowledge from the vast amounts of data generated and collected. Traditional data analysis methods stumble over such large amounts of data arriving in a short period of time, demanding a paradigm shift in the storage, processing, and analysis of Big Data. Because of its importance, many U.S. agencies, including the government, have in recent years released large funds for research in Big Data and related fields. This paper gives a concise summary of research progress in various areas related to Big Data processing and analysis and concludes with a discussion of research directions in those areas.

    Designing a Modern Software Engineering Training Program with Cloud Computing

    The software engineering industry is trending towards cloud computing. For our project, we assessed the various tools and practices used in modern software development. The main goals of this project were to create a reference model for developing cloud-based applications, to program a functional cloud-based prototype, and to develop an accompanying training manual. These materials will be incorporated into the software engineering courses at WPI, namely CS 3733 and CS 509.

    Evaluation of Storage Systems for Big Data Analytics

    Recent trends in big data storage systems show a shift from disk-centric models to memory-centric models. The primary challenges faced by these systems are speed, scalability, and fault tolerance, and it is interesting to investigate the performance of the two models with respect to some big data applications. This thesis studies the performance of Ceph (a disk-centric model) and Alluxio (a memory-centric model) and evaluates whether a hybrid model provides any performance benefits for big data applications. To this end, an application, TechTalk, is created that uses Ceph to store data and Alluxio to perform data analytics. The functionalities of the application include offline lecture storage, live recording of classes, content analysis, and reference generation. The knowledge base of videos is constructed by analyzing the offline data using machine learning techniques; this training dataset provides the knowledge to construct the index of an online stream, and the indexed metadata enables students to search, view, and access the relevant content. The performance of the application is benchmarked in different use cases to demonstrate the benefits of the hybrid model. Masters Thesis, Computer Science, 201
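
    As a hedged sketch of that hybrid model (not the thesis code), the snippet below shows Spark reading durable data from Ceph through its S3-compatible RADOS Gateway while serving the hot working set from Alluxio's memory tier. Endpoints, bucket, and paths are hypothetical; the s3a:// and alluxio:// schemes require the hadoop-aws and Alluxio client jars on Spark's classpath.

```python
# Sketch: hybrid storage access -- Ceph as the durable disk-centric tier,
# Alluxio as the memory-centric tier for analytics.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hybrid-storage-sketch")
    # Ceph's RADOS Gateway speaks the S3 protocol, so s3a:// reaches it
    # (hypothetical endpoint).
    .config("spark.hadoop.fs.s3a.endpoint", "http://ceph-rgw.example.org:7480")
    .getOrCreate()
)

# Cold path: lecture metadata persisted durably in Ceph.
archive = spark.read.json("s3a://lectures/metadata/")

# Hot path: the working set cached in Alluxio's memory tier for fast analytics
# (hypothetical Alluxio master address).
working = spark.read.json("alluxio://alluxio-master:19998/lectures/recent/")

print(archive.count(), working.count())
spark.stop()
```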