4 research outputs found
An efficient strategy for the collection and storage of large volumes of data for computation
In recent years, an increasing amount of data has been produced and stored, a phenomenon known as Big Data. Social networks, the Internet of Things, scientific experiments and commercial services play a significant role in generating vast amounts of data. Three main factors characterise Big Data: Volume, Velocity and Variety. All three must be considered when designing a platform to support Big Data. The Large Hadron Collider (LHC) particle accelerator at CERN hosts a number of data-intensive experiments, which are estimated to produce about 30 PB of data annually, and these data are propagated at extremely high velocity. Traditional methods of collecting, storing and analysing data have become insufficient for managing such rapidly growing volumes, so an efficient strategy for capturing these data as they are produced is essential. In this paper, a number of models are explored to identify the best approach for collecting and storing Big Data for analytics. The performance of full execution cycles of these approaches is evaluated on the monitoring of the Worldwide LHC Computing Grid (WLCG) infrastructure for collecting, storing and analysing data. Moreover, the models discussed are applied to a community-driven software solution, Apache Flume, to show how they can be integrated seamlessly.
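The core strategy in this abstract, keeping raw data untouched at collection time and deferring transformation to the analytics layer, can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's implementation; `RawEventStore` and its methods are invented names, and the in-memory list stands in for a real sink such as HDFS.

```python
import json

# Hypothetical sketch of the strategy described above: raw events are
# stored exactly as they arrive, and any transformation is applied later,
# at the analytics layer, only when the data are actually read.

class RawEventStore:
    """Append-only store that keeps events untampered for fault tolerance."""

    def __init__(self):
        self._log = []  # stand-in for a durable sink (e.g. HDFS)

    def collect(self, raw_event: str) -> None:
        # Collection path does no parsing and no transformation.
        self._log.append(raw_event)

    def read_transformed(self, transform):
        # Transformation happens on demand, at analysis time.
        return [transform(json.loads(e)) for e in self._log]


store = RawEventStore()
store.collect('{"job": "cms-123", "cpu_s": 42}')
store.collect('{"job": "atlas-9", "cpu_s": 17}')

# Analytics-side transformation, defined independently of collection.
cpu_seconds = store.read_transformed(lambda ev: ev["cpu_s"])
# cpu_seconds is [42, 17]; the raw JSON strings remain intact in the store.
```

Because the raw log is never rewritten, a new or corrected transformation can always be re-run over the original data.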
Big Data ecosystem on a Raspberry Pi cluster
This research presents a step-by-step guide to installing and configuring Hadoop on a cluster of Raspberry Pis, describing and explaining everything from the fundamentals of Big Data to the full Apache ecosystem and the purpose of each technology. It also compiles information from some of the most relevant publications related to Big Data.
A visualization pipeline for location-based data
The master's thesis focused on how to turn a plethora of location-based data into a visualization pipeline. The goal was to find a way to bring the most value to the users of the visualization pipeline, in this case a taxi company, by defining a set of audience-oriented approaches. These approaches were tested through a case study with the taxi company as the test subject. The case study tested the functionality and design of the visualization pipeline, as well as the audience-oriented approaches in practice. The success of the visualization pipeline was assessed using a Business Intelligence assessment model, which served as a benchmark for how much value the implemented pipeline provided.
We were able to define three types of audiences, and correspondingly three approaches for these audience types. The three audience types were activists, analysts and organizational decision-makers, and their approaches were defined correspondingly as "lightweight", "technical", and "tailored". The case study was carried out by defining an audience group for the customer.
The case study defined the case customer as an organizational decision-maker, making the "tailored" approach the best fit. This approach provided the most value to the case customer both in terms of technical requirements and in terms of data-analytical needs. The case study showed promise for the utility of an audience-oriented approach in visualization pipeline design.
Intelligent big data architecture optimisation
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London. Monitoring data-intensive scientific infrastructures in real time, covering jobs, data transfers, and hardware failures, is vital for efficient operation. Due to the high volume and velocity of the events produced, traditional methods are no longer optimal. Several techniques, as well as enabling architectures, are available to address the Big Data problem. In this respect, this thesis complements existing survey work by contributing an extensive literature review of both traditional and emerging Big Data architectures. Scalability, low latency, fault tolerance, and intelligence are key challenges for traditional architectures, whereas Big Data technologies and approaches have become increasingly popular for use cases that demand scalable, data-intensive (parallel) processing, fault tolerance (data replication), and support for low-latency computations. In the context of a scalable data store and analytics platform for monitoring data-intensive scientific infrastructure, the Lambda Architecture was adapted and evaluated on the Worldwide LHC Computing Grid and proven effective, especially for computationally and data-intensive use cases. In this thesis, an efficient strategy for the collection and storage of large volumes of data for computation is presented. By moving the transformation logic out of the data pipeline and into the analytics layers, the architecture and overall process are simplified: processing time is reduced, untampered raw data are kept at the storage level for fault tolerance, and the required transformation can be performed when needed. An optimised Lambda Architecture (OLA) is presented, which models an efficient way of joining the batch and streaming layers with minimal code duplication in order to support scalability, low latency, and fault tolerance.
Several models were evaluated: a pure streaming layer, a pure batch layer, and a combination of both. Experimental results demonstrate that the OLA performed better than the traditional architecture as well as the Lambda Architecture. The OLA was further enhanced by adding an intelligence layer for predicting data access patterns. The intelligence layer actively adapts and updates the model built by the batch layer, which eliminates re-training time while providing a high level of accuracy using Deep Learning techniques. The fundamental contribution to knowledge is a scalable, low-latency, fault-tolerant, intelligent, and heterogeneous architecture for monitoring a data-intensive scientific infrastructure that can benefit from Big Data technologies and approaches.
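The "minimum code duplication" idea behind joining the batch and streaming layers can be sketched as follows. This is a hypothetical toy, not the thesis's OLA implementation: a single `transform` function is shared by both layers, and a serving step merges the batch view with the real-time view. All function and field names here are invented for illustration.

```python
# Hypothetical sketch: one transformation shared by the batch and
# streaming layers (avoiding duplicated logic), with a serving step
# that merges the batch view and the real-time view.

def transform(event):
    # The SINGLE transformation used by both layers.
    return event["site"], event["bytes"]

def aggregate(events):
    # Same aggregation logic, reusable over historical or streamed events.
    view = {}
    for site, n in map(transform, events):
        view[site] = view.get(site, 0) + n
    return view

def serve(batch_view, realtime_view):
    # Merge: batch totals plus the increment seen since the last batch run.
    merged = dict(batch_view)
    for site, n in realtime_view.items():
        merged[site] = merged.get(site, 0) + n
    return merged


historical = [{"site": "CERN", "bytes": 100}, {"site": "FNAL", "bytes": 50}]
recent = [{"site": "CERN", "bytes": 25}]  # events since the last batch run

totals = serve(aggregate(historical), aggregate(recent))
# totals: {"CERN": 125, "FNAL": 50}
```

Because both layers call the same `transform` and `aggregate` functions, a change to the transformation logic propagates to batch and streaming paths at once, which is the duplication problem the OLA is described as addressing.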