18 research outputs found

    BIGhybrid: A Simulator for MapReduce Applications in Hybrid Distributed Infrastructures Validated with the Grid5000 Experimental Platform

    Cloud computing has increasingly been used as a platform for running large business and data processing applications. Conversely, Desktop Grids have been successfully employed in a wide range of projects, because they are able to take advantage of a large number of resources provided free of charge by volunteers. A hybrid infrastructure created from the combination of Cloud and Desktop Grid infrastructures can provide a low-cost and scalable solution for Big Data analysis. Although frameworks like MapReduce have been designed to exploit commodity hardware, their ability to take advantage of a hybrid infrastructure poses significant challenges due to its large resource heterogeneity and high churn rate. This paper proposes BIGhybrid, a simulator for two existing classes of MapReduce runtime environments: BitDew-MapReduce, designed for Desktop Grids, and BlobSeer-Hadoop, designed for Cloud computing. The goal is to carry out accurate simulations of MapReduce executions in a hybrid infrastructure composed of Cloud computing and Desktop Grid resources. This work describes the principles of the simulator and its validation with the Grid5000 experimental platform. With BIGhybrid, developers can investigate and evaluate new algorithms to enable MapReduce to be executed in hybrid infrastructures, including topics such as resource allocation and data splitting. Published in Concurrency and Computation: Practice and Experience.
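
    BIGhybrid's internals are not reproduced here, but the core idea the abstract describes, estimating MapReduce behaviour on a mix of reliable cloud nodes and volatile volunteer desktops, can be sketched as a toy simulation. Everything below (the Worker class, the wave-based scheduling, the speed and churn parameters) is an illustrative assumption, not BIGhybrid's actual design or API.

```python
import random

# Toy model of a MapReduce map phase over heterogeneous, churn-prone workers.
# All names and parameters below are illustrative assumptions for this sketch,
# not BIGhybrid's actual design or API.

class Worker:
    def __init__(self, name, speed, churn_rate):
        self.name = name                # e.g. "cloud0" or "dg3"
        self.speed = speed              # work units processed per time unit
        self.churn_rate = churn_rate    # probability of leaving mid-task

def simulate_map_phase(workers, task_sizes, seed=42):
    """Greedily run map tasks in waves; tasks lost to churn are re-queued."""
    rng = random.Random(seed)
    pending = list(task_sizes)
    ranked = sorted(workers, key=lambda w: -w.speed)  # fastest first
    clock, retries = 0.0, 0
    while pending:
        wave, pending = pending[:len(ranked)], pending[len(ranked):]
        wave_time = 0.0
        for size, w in zip(wave, ranked):
            wave_time = max(wave_time, size / w.speed)
            if rng.random() < w.churn_rate:  # volunteer departs: work is lost
                pending.append(size)         # the task must be re-executed
                retries += 1
        clock += wave_time
    return clock, retries

# A hybrid platform: a few reliable cloud VMs plus many volatile desktops.
cloud = [Worker(f"cloud{i}", speed=2.0, churn_rate=0.01) for i in range(4)]
desktops = [Worker(f"dg{i}", speed=1.0, churn_rate=0.2) for i in range(8)]
makespan, retries = simulate_map_phase(cloud + desktops, [10.0] * 40)
print(f"makespan={makespan:.1f} time units, re-executions={retries}")
```

    Raising the churn rate of the desktop pool makes the simulated makespan grow through re-executions, which is the kind of trade-off such a simulator lets developers explore before touching real infrastructure.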

    Task Scheduling in Big Data Platforms: A Systematic Literature Review

    Context: Hadoop, Spark, Storm, and Mesos are well-known frameworks, in both the research and industrial communities, for expressing and processing distributed computations on massive amounts of data. Multiple scheduling algorithms have been proposed to ensure that short interactive jobs, large batch jobs, and guaranteed-capacity production jobs running on these frameworks can deliver results quickly while maintaining high throughput. However, only a few works have examined the effectiveness of these algorithms. Objective: The Evidence-Based Software Engineering (EBSE) paradigm and its core tool, the Systematic Literature Review (SLR), were introduced to the Software Engineering community in 2004 to help researchers systematically and objectively gather and aggregate research evidence on different topics. In this paper, we conduct an SLR of task scheduling algorithms that have been proposed for big data platforms. Method: We analyse the design decisions of different scheduling models proposed in the literature for Hadoop, Spark, Storm, and Mesos over the period 2005 to 2016. We provide a research taxonomy for a succinct classification of these scheduling models. We also compare the algorithms in terms of performance, resource utilization, and failure recovery mechanisms. Results: Our searches identified 586 studies from the highest-quality journals, conferences, and workshops in this field. This SLR reports on the different types of scheduling models (dynamic, constrained, and adaptive) and the main motivations behind them (including data locality, workload balancing, resource utilization, and energy efficiency). A discussion of open issues and future challenges pertaining to improving the current studies is provided.
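
    One recurring design decision in the surveyed scheduling models is the trade-off between data locality and waiting time. The sketch below shows a minimal delay-scheduling-style policy in that spirit; the Task shape, node naming, and the max_skips threshold are generic assumptions for illustration, not code from any of the surveyed frameworks.

```python
from collections import deque

# Minimal sketch of a delay-scheduling-style, data-locality-aware policy.
# Task/node shapes and the threshold are illustrative, not a framework's API.

class Task:
    def __init__(self, task_id, input_nodes):
        self.task_id = task_id
        self.input_nodes = set(input_nodes)  # nodes holding the input block
        self.skipped = 0                     # rounds spent waiting for locality

def pick_task(queue: deque, free_node: str, max_skips: int = 3):
    """Return a task for free_node, preferring tasks with local input data.

    A task waits up to max_skips scheduling rounds for a local slot; after
    that it accepts a non-local one (the delay-scheduling idea)."""
    for task in list(queue):
        if free_node in task.input_nodes:    # data-local: run immediately
            queue.remove(task)
            return task
    # No local candidate: hand the slot to the first task that has already
    # exhausted its patience, otherwise keep everyone waiting one more round.
    for task in list(queue):
        task.skipped += 1
        if task.skipped > max_skips:
            queue.remove(task)
            return task
    return None

q = deque([Task("t1", {"nodeA"}), Task("t2", {"nodeB"})])
print(pick_task(q, "nodeB").task_id)   # -> t2, the data-local choice
```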

    Big Data and Large-scale Data Analytics: Efficiency of Sustainable Scalability and Security of Centralized Clouds and Edge Deployment Architectures

    One of the significant shifts of the next-generation computing technologies will certainly be in the development of Big Data (BD) deployment architectures. Apache Hadoop, the BD landmark, has evolved into a widely deployed BD operating system. Its new features include a federation structure and many associated frameworks, which give Hadoop 3.x the maturity to serve different markets. This dissertation addresses two leading issues involved in exploiting BD and the large-scale data analytics realm using the Hadoop platform: (i) scalability, which directly affects system performance and overall throughput, addressed using portable Docker containers; and (ii) security, which spreads the adoption of data protection practices among practitioners, addressed using access controls. An Enhanced MapReduce Environment (EME), an OPportunistic and Elastic Resource Allocation (OPERA) scheduler, a BD Federation Access Broker (BDFAB), and a Secure Intelligent Transportation System (SITS) with a multi-tier architecture for data streaming to cloud computing are the main contributions of this thesis.
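
    The broker and scheduler are only named in this abstract, so as a rough illustration of the access-control side, here is a minimal role-based check of the kind a federation access broker might enforce. The roles, actions, and policy table are hypothetical, not the dissertation's actual BDFAB design.

```python
# Hypothetical role-based access check for a federated dataset broker.
# Roles, actions, and the policy table are illustrative assumptions only,
# not the dissertation's actual BDFAB design.

POLICY = {
    "analyst":  {"read"},
    "engineer": {"read", "write"},
    "admin":    {"read", "write", "grant"},
}

def authorize(role: str, action: str, dataset: str) -> bool:
    """Allow an action on a federated dataset only if the role permits it."""
    allowed = POLICY.get(role, set())
    if action not in allowed:
        print(f"denied: role={role!r} may not {action!r} {dataset!r}")
        return False
    return True

authorize("analyst", "write", "hdfs://federation/sales")  # -> denied
authorize("engineer", "write", "hdfs://federation/sales") # -> allowed
```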

    Data Processing Model to Perform Big Data Analytics in Hybrid Infrastructures

    N/A

    Programming Languages for Data-Intensive HPC Applications: a Systematic Mapping Study

    A major challenge in modelling and simulation is the need to combine expertise in both software technologies and a given scientific domain. When High-Performance Computing (HPC) is required to solve a scientific problem, software development becomes a problematic issue. Considering the complexity of software for HPC, it is useful to identify programming languages that can be used to alleviate this issue. Because the existing literature on the topic of HPC is very dispersed, we performed a Systematic Mapping Study (SMS) in the context of the European COST Action cHiPSet. This literature study maps characteristics of various programming languages for data-intensive HPC applications, including category, typical user profiles, effectiveness, and type of articles. We organised the SMS in two phases. In the first phase, relevant articles were identified through an automated keyword-based search in eight digital libraries. This led to an initial sample of 420 papers, which was then narrowed down in a second phase, by human inspection of article abstracts, titles, and keywords, to 152 relevant articles published in the period 2006–2018. The analysis of these articles enabled us to identify 26 programming languages referred to in 33 of the relevant articles. We compared the outcome of the mapping study with the results of our questionnaire-based survey, which involved 57 HPC experts. The mapping study and the survey revealed that the desired features of programming languages for data-intensive HPC applications are portability, performance, and usability. Furthermore, we observed that the majority of the programming languages used in the context of data-intensive HPC applications are text-based general-purpose programming languages. Typically, these have a steep learning curve, which makes them difficult to adopt. We believe that the outcome of this study will inspire future research and development in programming languages for data-intensive HPC applications. Additional co-authors: Sabri Pllana, Ana Respício, José Simão, Luís Veiga, Ari Vis
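
    The finding that most identified languages are text-based general-purpose languages with a steep learning curve can be made concrete with a small contrast: the same word count written in an explicit map/shuffle/reduce style versus with high-level built-ins. Both variants are generic illustrations of the abstraction gap, not code from any surveyed paper.

```python
from collections import Counter, defaultdict

# The same data-parallel word count, written twice: once in an explicit
# map/shuffle/reduce style, once with a high-level built-in. Both variants
# are generic illustrations of the abstraction gap, not surveyed code.

lines = ["big data needs big tools", "big ideas need small code"]

# Low-level, MapReduce-like phrasing: explicit map, shuffle, and reduce.
pairs = [(w, 1) for line in lines for w in line.split()]     # map
groups = defaultdict(list)
for word, count in pairs:                                    # shuffle
    groups[word].append(count)
counts_low = {word: sum(cs) for word, cs in groups.items()}  # reduce

# High-level phrasing: one library call hides the whole pipeline.
counts_high = Counter(w for line in lines for w in line.split())

assert counts_low == counts_high
print(counts_low["big"])  # -> 3
```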

    T-Hoarder: a framework to process Twitter data streams

    With the eruption of online social networks like Twitter and Facebook, a series of new APIs have appeared that allow access to the data these new sources of information accumulate. One of the most popular online social networks is the micro-blogging site Twitter. Its APIs allow many machines to simultaneously access the torrent of Twitter data, listening to tweets and accessing other useful information such as user profiles. A number of tools have appeared for processing Twitter data with different algorithms and for different purposes. In this paper, T-Hoarder is described: a framework that enables tweet crawling and data filtering, and that can also display summarized and analytical information about Twitter activity with respect to a certain topic or event on a web page. This information is updated on a daily basis. The tool has been validated with real use cases that allow a series of analyses of the performance one may expect from this type of infrastructure. This work has been partially supported by HERMES SMARTDRIVER (TIN2013-46801-C4-2-R) and AUDACity (TIN2016-77158-C4-1-R).
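
    T-Hoarder's own crawler is not reproduced here; the sketch below only illustrates the kind of keyword filtering and daily aggregation the abstract describes, applied to newline-delimited JSON tweets from any source. The field names (text, and created_at in ISO form) follow common Twitter payloads but are assumptions in this sketch.

```python
import json
from collections import Counter

# Illustrative sketch of T-Hoarder-style filtering and daily summaries over
# a stream of newline-delimited JSON tweets. The field names ("text",
# "created_at") follow common Twitter payloads but are assumptions here.

def filter_stream(lines, topic):
    """Yield parsed tweets whose text mentions the topic (case-insensitive)."""
    for line in lines:
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue                     # skip malformed records in the feed
        if topic.lower() in tweet.get("text", "").lower():
            yield tweet

def daily_summary(tweets):
    """Count matching tweets per day, keyed by the date part of created_at."""
    return Counter(t.get("created_at", "")[:10] for t in tweets)

raw = [
    '{"created_at": "2016-05-01T09:00:00", "text": "Traffic jam downtown"}',
    '{"created_at": "2016-05-01T10:30:00", "text": "Election debate tonight"}',
    '{"created_at": "2016-05-02T08:15:00", "text": "More traffic chaos"}',
]
print(daily_summary(filter_stream(raw, "traffic")))
# Counter({'2016-05-01': 1, '2016-05-02': 1})
```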

    Languages for High-Performance Computing used in Big Data Processing: A Systematic Mapping Study

    Big Data are high-volume, high-velocity, and/or high-variety information assets that demand innovative and cost-effective forms of processing, enabling better insight, decision making, and process automation. Since 2002, the rate of performance improvement in single processors has dropped sharply. To increase processor power, multiple cores came to be used in parallel on a single chip. To benefit from this kind of architecture, sequential programs must be rewritten. The goal of High-Performance Computing (HPC) is to study the methodologies and techniques that allow such architectures to be exploited. The challenge is the need to combine software development for HPC with the management and analysis of Big Data. When parallel and distributed computing is mandatory, coding becomes harder, so it is necessary to know which languages can ease that task. Because the existing literature on HPC is very dispersed, a Systematic Mapping Study (SMS) was conducted, aggregating characteristics of the different languages found in Big Data processing (category; nature; typical user profiles; effectiveness; types of articles published in the area), to help students, researchers, and other professionals who need an introduction to, or a panoramic view of, this topic. Articles were retrieved through an automated keyword-based search in the databases of 8 selected digital libraries. This process yielded an initial sample of 420 articles, which was reduced to 152 articles published between January 2006 and March 2018. Manual analysis of these articles allowed us to identify 26 languages across the 33 included publications. I summarized the information and compared it with the opinions of practitioners. The results indicated that most of these languages are General-Purpose Languages (GPLs) rather than Domain-Specific Languages (DSLs), which leads us to conclude that there is an opportunity for applied research into languages that make coding easier for domain experts.

    Usability analysis of contending electronic health record systems

    In this paper, we report the measured usability of two leading EHR systems during procurement. A total of 18 users participated in paired usability testing of three scenarios: ordering and managing medications by an outpatient physician, medicine administration by an inpatient nurse, and scheduling of appointments by nursing staff. Data on audio, screen capture, satisfaction ratings, task success, and errors made were collected during testing. We found a clear difference between the systems in the percentage of successfully completed tasks, two different satisfaction measures, and perceived learnability when looking at the results across all scenarios. We conclude that usability should be evaluated during procurement, and that the difference in usability between systems could be revealed even with fewer measures than were used in our study. © 2019 American Psychological Association Inc. All rights reserved. Peer reviewed.

    Centralized learning and planning: for cognitive robots operating in human domains
