
    Using SQL-based Scripting Languages in Hadoop Ecosystem for Data Analytics

    The goal of this thesis is to compare different SQL-based scripting languages in the Hadoop ecosystem by implementing data analytics algorithms. The thesis compares the efficiency of the frameworks and the ease of implementing algorithms for a user with no previous experience in distributed computing. To fulfill this goal, three algorithms were implemented: Pearson's correlation, simple linear regression, and the naive Bayes classifier. The algorithms were implemented in two SQL-based frameworks of the Hadoop ecosystem, Spark SQL and HiveQL, and the same algorithms were also taken from Spark MLlib. SQLContext and HiveContext were also compared in Spark SQL. The algorithms were tested in a cluster with different dataset sizes and different numbers of executors, and the scaling of the Spark SQL and Spark MLlib algorithms was measured. The results show that for Pearson's correlation HiveQL is slightly faster than the other two frameworks. For linear regression, Spark SQL and Spark MLlib have similar run times, both about 30% faster than HiveQL. The Spark SQL and Spark MLlib implementations scaled well with these two algorithms. For the naive Bayes classifier, Spark SQL did not scale well but was still faster than HiveQL; the Spark MLlib results for multinomial naive Bayes proved inconclusive. For correlation and regression, no difference between SQLContext and HiveContext was found. The thesis found the SQL-based frameworks easy to use: HiveQL was the easiest, while Spark SQL required some additional study of distributed computing. Implementing algorithms from Spark MLlib was more difficult, as it required understanding the internal workings of the algorithms as well as knowledge of distributed computing.
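
    To illustrate the kind of SQL-level implementation the thesis compares, the sketch below computes Pearson's correlation with plain aggregate functions through the PySpark SQL interface. The table name `samples`, the columns `x` and `y`, and the toy rows are placeholders, not the thesis's actual data or code.

    # Minimal sketch: Pearson's correlation as a single SQL aggregation run
    # through Spark SQL. Table and column names are illustrative placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pearson-sql-sketch").getOrCreate()

    # Toy dataset standing in for the thesis's input files.
    spark.createDataFrame(
        [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)], ["x", "y"]
    ).createOrReplaceTempView("samples")

    # r = (n*sum(xy) - sum(x)*sum(y)) /
    #     sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))
    r = spark.sql("""
        SELECT (COUNT(*) * SUM(x * y) - SUM(x) * SUM(y)) /
               SQRT((COUNT(*) * SUM(x * x) - SUM(x) * SUM(x)) *
                    (COUNT(*) * SUM(y * y) - SUM(y) * SUM(y))) AS pearson_r
        FROM samples
    """).first()["pearson_r"]

    print(r)  # close to 1.0 for this near-linear toy data

    Essentially the same SELECT statement is valid HiveQL, which is what makes a query-for-query comparison between the two SQL dialects straightforward.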

    Node.js based Document Store for Web Crawling

    WARC files are central to internet preservation projects. They contain the raw resources of web-crawled data and can be used to create windows into the past of web pages at the time they were accessed. Yet there are few tools that manipulate WARC files beyond basic parsing. Our tool WARC-KIT gives users in the Node.js JavaScript environment a toolkit to interact with and manipulate WARC files. Included with WARC-KIT is a WARC parsing tool known as WARCFilter that can be used as a standalone tool to parse, filter, and create new WARC files. WARCFilter can also create CDX index files for WARC files, parse existing CDX files, and even generate webgraph datasets for graph analysis algorithms. Aside from WARCFilter, WARC-KIT includes a custom on-disk database system implemented with an underlying linear hash table data structure. The database system is the first of its kind as a JavaScript-only on-disk document store. The main application of WARC-KIT is to let users create custom indices over collections of WARC files. After creating an index on a WARC collection, users can then query their collection using the GraphQL query language to retrieve the desired WARC records. Experiments with WARCFilter on a dataset of 238,000 WARC records demonstrate that using CDX index files makes WARC record filtering around ten to twenty times faster than raw WARC parsing. Database timing tests with the JavaScript linear hash table database showed insertion and retrieval operations twice as fast as those of a comparable Rust-implemented linear hash table database. Experiments with the overall WARC-KIT application on the same 238,000-record dataset exhibited consistent query times across different complex queries.
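
    The document store described above is built on a linear hash table. The sketch below (in Python rather than JavaScript, purely for illustration, and in memory rather than on disk) shows the core idea of linear hashing: buckets are split one at a time as the table grows, so no insertion ever triggers a full rehash. It is not the WARC-KIT implementation.

    # Illustrative linear hashing, the bucket-splitting scheme behind many
    # on-disk hash stores. Simplified and in-memory; a real store persists buckets.
    class LinearHashTable:
        def __init__(self, initial_buckets=4, max_load=0.75):
            self.n0 = initial_buckets        # buckets at level 0
            self.level = 0                   # current split round
            self.split = 0                   # next bucket to split
            self.buckets = [[] for _ in range(initial_buckets)]
            self.size = 0
            self.max_load = max_load

        def _bucket_index(self, key):
            b = hash(key) % (self.n0 * 2 ** self.level)
            if b < self.split:               # bucket already split this round
                b = hash(key) % (self.n0 * 2 ** (self.level + 1))
            return b

        def put(self, key, value):
            bucket = self.buckets[self._bucket_index(key)]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)
                    return
            bucket.append((key, value))
            self.size += 1
            if self.size / len(self.buckets) > self.max_load:
                self._split_one()

        def get(self, key, default=None):
            for k, v in self.buckets[self._bucket_index(key)]:
                if k == key:
                    return v
            return default

        def _split_one(self):
            # Split exactly one bucket per overflow: growth stays incremental.
            old = self.buckets[self.split]
            self.buckets[self.split] = []
            self.buckets.append([])
            new_modulus = self.n0 * 2 ** (self.level + 1)
            for k, v in old:
                self.buckets[hash(k) % new_modulus].append((k, v))
            self.split += 1
            if self.split == self.n0 * 2 ** self.level:
                self.split = 0
                self.level += 1

    if __name__ == "__main__":
        t = LinearHashTable()
        for i in range(100):
            t.put(f"key-{i}", i)
        assert t.get("key-42") == 42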

    Development and Evaluation of a Big Data Framework for Performance Management in Mobile Networks

    In telecommunications, Performance Management (PM) data are collected from network elements into a centralized system, the Network Management System (NMS), which acts as a business intelligence tool specialized in monitoring and reporting network performance. Performance Management files contain the metrics and named counters used to quantify the performance of the network. Current NMS implementations have limitations in scalability and in support for the volume, variety, and velocity of the collected PM data, especially for 5G and 6G mobile network technologies. To overcome these limitations, we proposed a Big Data framework based on an analysis of the following components: software architecture, ingestion, data lake, processing, reporting, and deployment. Our work analyzed the PM file formats on a real data set from four different vendors and from 2G, 3G, 4G, and 5G technologies. We then experimentally assessed the feasibility of the proposed framework through a case study involving 5G PM files. Test results for the ingestion and reporting components are presented, identifying the hardware and software required to support up to one billion counters per hour. This proposal gives telecommunications operators a reference Big Data framework for facing current and future challenges in the NMS, for instance the support of data analytics in addition to the well-known services. This work was supported by the Unidad de Gestión de Investigación y Proyección Social of the Escuela Politécnica Nacional.
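
    As a rough, hypothetical illustration of the reporting side of such a framework, the PySpark sketch below aggregates counter values per network element and hour. The flat schema (`element_id`, `counter_name`, `timestamp`, `value`) and the counter name are assumptions made for the example; real PM files are vendor-specific formats that would first pass through the ingestion component.

    # Hypothetical sketch of an hourly PM reporting aggregation in PySpark.
    # The schema and counter names below are assumptions, not the paper's format.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("pm-reporting-sketch").getOrCreate()

    # Toy rows standing in for normalized PM counter data from the data lake.
    pm = spark.createDataFrame(
        [
            ("cell-001", "rrc_conn_attempts", "2024-05-01 10:05:00", 120.0),
            ("cell-001", "rrc_conn_attempts", "2024-05-01 10:20:00", 135.0),
            ("cell-002", "rrc_conn_attempts", "2024-05-01 10:10:00", 98.0),
        ],
        ["element_id", "counter_name", "timestamp", "value"],
    )

    hourly = (
        pm.withColumn("hour", F.date_trunc("hour", F.to_timestamp("timestamp")))
          .groupBy("element_id", "counter_name", "hour")
          .agg(F.sum("value").alias("sum_value"), F.count("*").alias("samples"))
    )

    # In a real deployment the result would be written partitioned by hour so
    # per-hour reports stay cheap even near a billion counters per hour.
    hourly.show()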

    The SPARQLGX System for Distributed Evaluation of SPARQL Queries

    SPARQL is the W3C standard query language for querying data expressed in the Resource Description Framework (RDF). The increasing amounts of data available in the RDF format raise a major need and research interest in building efficient and scalable distributed SPARQL query evaluators. In this context, we propose SPARQLGX: an implementation of a distributed RDF datastore based on Apache Spark. SPARQLGX is designed to leverage existing Hadoop infrastructures for evaluating SPARQL queries efficiently. SPARQLGX relies on an automated translation of SPARQL queries into optimized executable Spark code. We show that SPARQLGX makes it possible to evaluate SPARQL queries on billions of triples distributed across multiple nodes, while providing attractive performance figures. We report on experiments which show how SPARQLGX compares to state-of-the-art implementations, and we show that our approach scales better than other systems in terms of supported dataset size. With its simple design, SPARQLGX represents an interesting alternative in several scenarios.
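
    To give a feel for the kind of translation described above, the PySpark sketch below evaluates a two-pattern SPARQL basic graph pattern as filters and a join over a `(subject, predicate, object)` DataFrame: each triple pattern becomes a filter, and shared variables become join keys. This only mirrors the general idea; SPARQLGX itself generates optimized Spark (Scala) code driven by its own storage layout and data statistics.

    # Minimal sketch of mapping a SPARQL basic graph pattern to Spark operations.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("sparql-to-spark-sketch").getOrCreate()

    triples = spark.createDataFrame(
        [
            ("alice", "knows", "bob"),
            ("bob",   "knows", "carol"),
            ("alice", "age",   "42"),
        ],
        ["s", "p", "o"],
    )

    # SPARQL:  SELECT ?x ?z WHERE { ?x knows ?y . ?y knows ?z }
    p1 = triples.filter(F.col("p") == "knows").select(F.col("s").alias("x"), F.col("o").alias("y"))
    p2 = triples.filter(F.col("p") == "knows").select(F.col("s").alias("y"), F.col("o").alias("z"))

    result = p1.join(p2, on="y").select("x", "z")
    result.show()  # alice knows someone who knows carol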

    SAP HANA Data Volume Management

    Information technology today is a data-driven environment. The role of data is to empower business leaders to make decisions based on facts, trends, and statistical numbers, and SAP is no exception. Many companies now run business suites such as SAP S/4HANA, SAP ERP, or SAP Business Warehouse, together with non-SAP applications, on HANA databases for faster processing. While HANA is an extremely powerful in-memory database, growing business data has an impact on the overall performance and budget of the organization. This paper presents best practices for reducing the overall data footprint of HANA databases for three use cases: SAP Business Suite on HANA, SAP Business Warehouse, and native HANA databases.

    SPARQLGX: Efficient Distributed Evaluation of SPARQL with Apache Spark

    SPARQL is the W3C standard query language for querying data expressed in the Resource Description Framework (RDF). The increasing amounts of RDF data available raise a major need and research interest in building efficient and scalable distributed SPARQL query evaluators. In this context, we propose SPARQLGX: our implementation of a distributed RDF datastore based on Apache Spark. SPARQLGX is designed to leverage existing Hadoop infrastructures for evaluating SPARQL queries. SPARQLGX relies on a translation of SPARQL queries into executable Spark code that adopts evaluation strategies according to (1) the storage method used and (2) statistics on the data. We show that SPARQLGX makes it possible to evaluate SPARQL queries on billions of triples distributed across multiple nodes, while providing attractive performance figures. We report on experiments which show how SPARQLGX compares to related state-of-the-art implementations, and we show that our approach scales better than these systems in terms of supported dataset size. With its simple design, SPARQLGX represents an interesting alternative in several scenarios.

    Efficient processing of large-scale spatio-temporal data

    Millions of location-aware devices, such as mobile phones, cars, and environmental sensors, constantly report their positions, often together with a timestamp and further payload, to a server for different kinds of analyses. While the location information of the devices and reported events is represented as points and polygons, raster data is another type of spatial data, produced for example by cameras and sensors. This big spatio-temporal data needs to be processed on scalable platforms such as Hadoop and Apache Spark, which, however, are unaware of, e.g., spatial neighborhood, which makes them practically impossible to use for this kind of data. The repeated executions of the programs during development and by different users result in long execution times and potentially high costs in rented clusters, which can be reduced by reusing commonly computed intermediate results. Within this thesis, we tackle these two challenges. First, we present the STARK framework for processing spatio-temporal vector and raster data on the Apache Spark stack. For its operators, we identify several possible algorithms and study how they can benefit from the properties of the underlying platform. We further investigate how indexes can be realized in the distributed and parallel architecture of Big Data processing engines, and we compare methods for data partitioning, which perform differently well with respect to data skew and dataset size. Furthermore, we present an approach to reduce the amount of data to process at the operator level. To reduce execution times, we introduce an approach to transparently recycle intermediate results of dataflow programs based on operator costs; to compute these costs, we instrument the programs with profiling code that gathers the execution time and result size of the operators. In the evaluation, we first compare the various implementation and configuration possibilities in STARK and identify scenarios when and how partitioning and indexing should be applied. We further compare STARK to related systems and show that we achieve significantly better execution times, not only when exploiting existing partitioning information. In the second part of the evaluation, we show that the transparent cost-based materialization and recycling of intermediate results can reduce program execution times significantly.
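
    A simple way to picture the partitioning question studied here is the generic grid approach sketched below: each point is keyed by the cell of a fixed grid, so spatially close points end up in the same partition and a spatial predicate can prune whole partitions. This is an illustration only, not STARK's actual partitioners or indexes, and the cell size is an arbitrary choice for the example.

    # Generic sketch of grid-based spatial partitioning on Spark.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("grid-partition-sketch").getOrCreate()

    points = spark.createDataFrame(
        [(1, 10.31, 52.12), (2, 10.33, 52.11), (3, 13.40, 52.52)],
        ["id", "lon", "lat"],
    )

    cell = 0.5  # grid cell size in degrees (illustrative choice)
    gridded = points.withColumn(
        "cell_id",
        F.concat_ws("_", F.floor(F.col("lon") / cell), F.floor(F.col("lat") / cell)),
    )

    # Co-locate points of the same cell; a range or join query can then skip
    # partitions whose cells cannot intersect the query region.
    partitioned = gridded.repartition("cell_id")
    partitioned.show()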

    A Multi-Criteria Experimental Classification of Distributed SPARQL Evaluators

    SPARQL is the standard language for querying data in the RDF format. There is a wide variety of SPARQL evaluators implementing different architectures, both for distributing the data and for organizing the computation. These differences, combined with optimizations specific to each evaluator, make a purely theoretical comparison of these systems impossible. We propose a new angle for comparing distributed SPARQL evaluators, based on a multi-criteria ranking. We suggest using a set of five features to obtain a finer-grained description of the behavior of distributed evaluators, rather than relying only on the more traditional analysis of running times. To illustrate this method, we ran experiments pitting ten existing systems against each other and then ranked them using a reading grid that helps visualize the advantages and limitations of the techniques in the field of distributed SPARQL query evaluation.