52 research outputs found

    Preparing for Data-driven Infrastructure


    PaaS Cloud Service for Cost-Effective Harvesting, Processing and Linking of Unstructured Open Government Data

    The aim of this project is to develop a cloud platform service that transforms unstructured Open Government Data into Linked Open Government Data, in the form of a machine-readable Resource Description Framework (RDF) graph of the organizations, place names and personal names found in documents on the web. The service receives as input a log file, created by a web crawler, with more than 3 million lines, each holding the URL of an open document. It opens each document, reads its content and uses Estnltk ("Open source tools for Estonian natural language processing") to find the names of locations, organizations and people. Using the Python library RDFlib, these names are added to the RDF graph and linked to the URLs of the documents that contain them. To archive the state of each document at the time it was read, the service downloads all processed documents. The service also includes a monthly re-check of the already processed documents in order to detect changes and, where necessary, update the RDF graphs. The generated RDF graphs are publicly available, and the service includes a SPARQL endpoint for users (graphical user interface) and machines (web service) for cost-effective querying of linked entities.
An important challenge is processing speed, because the input file is large: the test log file has over 3 million lines, and the URL on each line may point to a large document. To address this, the service runs processes in parallel wherever possible, using several virtual machines and all CPUs in each virtual machine. This is tested on Google Compute Engine.
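The linking step described in this abstract (names found in a document become RDF statements that point back to the document's URL) can be sketched as follows. This is a minimal illustration, not the project's code: the abstract names RDFlib as the library used, whereas this sketch deliberately relies only on the Python standard library and emits plain N-Triples text. The namespace `http://example.org/entity/`, the predicate `mentionedIn` and both function names are hypothetical.

```python
from typing import List
from urllib.parse import quote

# Hypothetical namespace and predicate, chosen only for this illustration.
ENTITY_NS = "http://example.org/entity/"
MENTIONED_IN = "http://example.org/vocab/mentionedIn"


def entity_iri(name: str) -> str:
    """Build an IRI for a named entity by percent-encoding its name."""
    return ENTITY_NS + quote(name, safe="")


def triples_for_document(url: str, names: List[str]) -> List[str]:
    """Return one N-Triples line per distinct name, linking it to the document URL."""
    lines = []
    for name in sorted(set(names)):  # de-duplicate and keep output deterministic
        lines.append(f"<{entity_iri(name)}> <{MENTIONED_IN}> <{url}> .")
    return lines


if __name__ == "__main__":
    found = ["Tartu", "Tallinn", "Tartu"]  # stand-in for named-entity output
    for line in triples_for_document("http://example.org/doc1.pdf", found):
        print(line)
```

In a real pipeline the list of names would come from the named-entity recognizer, and the resulting statements would be added to a graph store that a SPARQL endpoint can query.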

    Investigating elastic cloud based RDF processing

    The Semantic Web was proposed as an extension of the traditional Web to give Web data context and meaning by using the Resource Description Framework (RDF) data model. The recent growth in the adoption of RDF, together with the massive growth of RDF data, has led numerous efforts to focus on the challenges of processing this data. To this end, many approaches have focused on vertical scalability by utilising powerful hardware, or on horizontal scalability by utilising always-on physical computer clusters or peer-to-peer networks. However, these approaches rely on fixed, high-specification computer clusters that require considerable upfront and ongoing investment to deal with the data growth. In recent years cloud computing has seen wide adoption due to its elasticity and utility billing features. This thesis addresses some of the issues related to the processing of large RDF datasets by utilising cloud computing. Initially, the thesis reviews the background literature of related distributed RDF processing work and issues, in particular distributed rule-based reasoning and dictionary encoding, followed by a review of the cloud computing paradigm and related literature. Then, in order to fully utilise features that are specific to cloud computing such as elasticity, the thesis designs and fully implements a Cloud-based Task Execution framework (CloudEx), a generic framework for efficiently distributing and executing tasks on cloud environments. Subsequently, some of the large-scale RDF processing issues are addressed by using the CloudEx framework to develop algorithms for processing RDF using cloud computing. These algorithms perform efficient dictionary encoding and forward reasoning using cloud-based columnar databases. The algorithms are collectively implemented as an Elastic Cost Aware Reasoning Framework (ECARF), a cloud-based RDF triple store.
This thesis presents original results and findings that advance the state of the art of performing distributed cloud-based RDF processing and forward reasoning.
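Dictionary encoding, mentioned in this abstract, replaces long RDF terms with compact integer identifiers so that triples are cheaper to store and compare. The sketch below shows the general idea on a single machine; it is not the ECARF algorithm, which the abstract describes as distributed and cloud-based. The function names and the sequential ID scheme are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
EncodedTriple = Tuple[int, int, int]


def encode(triples: List[Triple]) -> Tuple[Dict[str, int], List[EncodedTriple]]:
    """Assign each distinct term a sequential integer ID and rewrite the triples."""
    dictionary: Dict[str, int] = {}
    encoded: List[EncodedTriple] = []
    for s, p, o in triples:
        ids = []
        for term in (s, p, o):
            if term not in dictionary:
                dictionary[term] = len(dictionary)  # next unused ID
            ids.append(dictionary[term])
        encoded.append((ids[0], ids[1], ids[2]))
    return dictionary, encoded


def decode(dictionary: Dict[str, int], encoded: List[EncodedTriple]) -> List[Triple]:
    """Invert the dictionary and restore the original terms."""
    reverse = {term_id: term for term, term_id in dictionary.items()}
    return [(reverse[s], reverse[p], reverse[o]) for s, p, o in encoded]


if __name__ == "__main__":
    data = [("ex:a", "rdf:type", "ex:C"), ("ex:b", "rdf:type", "ex:C")]
    mapping, compact = encode(data)
    print(mapping)
    print(compact)
    print(decode(mapping, compact) == data)
```

Repeated terms such as `rdf:type` are stored once in the dictionary, which is where the space saving comes from on large datasets.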

    Big data analytics in public sector university libraries in Pakistan

    This study examines librarians' perceptions, capabilities, and understanding of Big Data analytics in public sector university libraries in Karachi, Pakistan. To acquire the desired results, a survey was conducted using a quantitative approach. The study's target audience was library administrators at public sector university libraries in Karachi, all of which are recognized by Pakistan's Higher Education Commission and chartered by the Sindh Government. All respondents were sent an e-mail inviting them to participate in the survey in their own time. This study is important because it fills a large gap in the literature about the perspectives of Karachi's public sector university librarians on Big Data analytics. The results show that most of the academic librarians are familiar with the concept of big data, think that they need to develop their skills in using big data analytics tools, and think that the government should provide a sufficient budget for the professional development of library staff.

    Extensible metadata management framework for personal data lake

    Common Internet users today are inundated with a deluge of diverse data being generated and siloed in a variety of digital services, applications, and a growing body of personal computing devices as we enter the era of the Internet of Things. Alongside potential privacy compromises, users are facing increasing difficulties in managing their data and are losing control over it. There appears to be a de facto agreement in business and scientific fields that there is critical new value and interesting insight that can be attained by users from analysing their own data, if only it can be freed from its silos and combined with other data in meaningful ways. This thesis takes the point of view that users should have an easy-to-use modern personal data management solution that enables them to centralise and efficiently manage their data by themselves, under their full control, for their best interests, with minimal time and effort. In that direction, we describe the basic architecture of a management solution that is designed based on solid theoretical foundations and state of the art big data technologies. This solution (called Personal Data Lake - PDL) collects the data of a user from a plurality of heterogeneous personal data sources and stores it in a highly scalable schema-less storage repository. To simplify the user experience of PDL, we propose a novel extensible metadata management framework (MMF) that: (i) annotates heterogeneous data with rich lineage and semantic metadata, (ii) exploits the garnered metadata for automating data management workflows in PDL, with extensive focus on data integration, and (iii) facilitates the use and reuse of the stored data for various purposes by querying it on the metadata level either directly by the user or through third party personal analytics services. We first show how the proposed MMF is positioned in the PDL architecture, and then describe its principal components.
Specifically, we introduce a simple yet effective lineage manager for tracking the provenance of personal data in PDL. We then introduce an ontology-based data integration component called SemLinker, which comprises two new algorithms: the first generates graph-based representations that express the native schemas of (semi-)structured personal data, and the second metamodels the extracted representations to a common extensible ontology. SemLinker outputs are utilised by MMF to generate user-tailored unified views that are optimised for querying heterogeneous personal data through low-level SPARQL or high-level SQL-like queries. Next, we introduce an unsupervised automatic keyphrase extraction algorithm called SemCluster that specialises in extracting thematically important keyphrases from unstructured data, and associating each keyphrase with ontological information drawn from an extensible WordNet-based ontology. SemCluster outputs serve as semantic metadata and are utilised by MMF to annotate unstructured contents in PDL, thus enabling various management functionalities such as relationship discovery and semantic search. Finally, we describe how MMF can be utilised to perform holistic integration of personal data and to query it jointly in its native representations.

    Personalized large scale classification of public tenders on hadoop

    This project was completed as part of an innovation partnership between Fujitsu Canada and Université Laval. The needs and objectives of the project were centered on a business problem defined jointly with Fujitsu. The project aimed to classify a corpus of electronic public tenders using a big data approach based on Hadoop. The objective was to identify, with very high recall, the public tenders relevant to the IT services business of Fujitsu Canada. After a series of small-scale experiments, a prototype based on the BNS (Bi-Normal Separation) algorithm was empirically shown to classify the public tender corpus with high recall (93%). The prototype was then re-implemented on a full-scale Hadoop cluster, using Apache Pig for the data preparation pipeline and Apache Mahout for classification.
Our experiments show that the large-scale system not only maintains high recall (91%) on the classification task, but can also readily take advantage of the scalability gains made possible by Hadoop's distributed architecture.
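Bi-Normal Separation (BNS), the feature scoring method named in this abstract, rates a term by how far apart its true-positive rate and false-positive rate lie on the standard normal inverse CDF: BNS = |F⁻¹(tpr) − F⁻¹(fpr)|. The sketch below computes that score for a single term in plain Python. It is an illustration of the metric only, not the project's Pig or Mahout pipeline. The function name, argument names and the clipping bound of 0.0005 (used to keep the inverse CDF finite) are assumptions.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def bns_score(tp: int, fp: int, pos: int, neg: int, eps: float = 0.0005) -> float:
    """Bi-Normal Separation score of one term.

    tp:  positive documents containing the term
    fp:  negative documents containing the term
    pos: total positive documents
    neg: total negative documents
    eps: clipping bound (assumed value) that keeps rates away from 0 and 1
    """
    tpr = min(max(tp / pos, eps), 1.0 - eps)
    fpr = min(max(fp / neg, eps), 1.0 - eps)
    return abs(_STD_NORMAL.inv_cdf(tpr) - _STD_NORMAL.inv_cdf(fpr))


if __name__ == "__main__":
    # A term that occurs in 90 of 100 relevant tenders and 10 of 100 irrelevant ones.
    print(round(bns_score(90, 10, 100, 100), 4))
    # A term that occurs equally often in both classes carries no signal.
    print(bns_score(50, 50, 100, 100))
```

Terms with a high score separate the two classes well and are the ones a classifier would weight or keep; terms that occur at the same rate in both classes score zero.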

    Arquiteturas federadas para integração de dados biomédicos

    Doctorate in Computer Science. The last decades have been characterized by a continuous adoption of information and communication technologies in the healthcare sector, which has increased the diversity and quality of the services provided but has also resulted in the proliferation of tremendous amounts of data over heterogeneous systems, whose scientific value is still to be explored. Distinct data types are currently generated, manipulated, and stored in the several institutions where patients are treated. Data sharing and integrated access to this information will allow the extraction of relevant knowledge that can lead to better diagnostics and treatments. This thesis proposes new integration models for gathering information and extracting knowledge from multiple and heterogeneous biomedical sources. The complexity of the scenario led us to split the integration problem according to the data type and to the usage specificity. The first contribution is a cloud-based architecture for exchanging medical imaging services. It offers a simplified registration mechanism for providers and services, promotes remote data access, and facilitates the integration of distributed data sources. Moreover, it is compliant with international standards, ensuring the platform's interoperability with current medical imaging devices.
The second proposal is a sensor-based architecture for the integration of electronic health records. It follows a federated integration model and aims to provide a scalable solution to search and retrieve data from multiple information systems. The last contribution is an open architecture for gathering patient-level data from dispersed and heterogeneous databases and making it available in a European context. All the proposed solutions were deployed and validated in real-world use cases.

    Efficient processing of large-scale spatio-temporal data

    Millions of location-aware devices, such as mobile phones, cars, and environmental sensors, constantly report their positions, often in combination with a timestamp and further payload data, to a server for different kinds of analyses. While the location information of the devices and reported events is represented as points and polygons, raster data is another type of spatial data, produced for example by cameras and sensors. This big spatio-temporal data needs to be processed on scalable platforms such as Hadoop and Apache Spark, which, however, are unaware of, e.g., spatial neighborhood, which makes certain queries practically impossible to execute. The repeated executions of the programs during development and by different users result in long execution times and potentially high costs in rented clusters, which can be reduced by reusing commonly computed intermediate results. Within this thesis, we tackle the two challenges described above. First, we present the STARK framework for processing spatio-temporal vector and raster data on the Apache Spark stack. For operators, we identify several possible algorithms and study how they can benefit from the underlying platform's properties. We further investigate how indexes can be realized in the distributed and parallel architecture of Big Data processing engines and compare methods for data partitioning, which cope differently well with data skew and data set size. Furthermore, an approach to reduce the amount of data to process at operator level is presented.
In order to reduce the execution times, we introduce an approach to transparently recycle intermediate results of dataflow programs, using a decision model based on actual operator costs. To compute the costs, we instrument the programs with profiling code to gather the execution time and result size of the operators. In the evaluation, we first compare the various implementation and configuration possibilities in STARK and identify scenarios in which partitioning and indexing should be applied, and how.
We further compare STARK to related systems and show that we can achieve significantly better execution times, not only when exploiting existing partitioning information. In the second part of the evaluation, we show that with the transparent cost-based materialization and recycling of intermediate results, the execution times of programs can be reduced significantly.
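Spatial partitioning, one of the techniques this abstract compares, groups points that are close together into the same partition so that neighborhood queries touch fewer partitions. The sketch below shows a fixed-grid partitioner in plain Python. It is a generic illustration of the idea and not STARK's API, which runs on Apache Spark; the function name, its parameters and the row-major cell numbering are assumptions. A fixed grid is also the variant that copes worst with skewed data, which is one reason partitioning methods are worth comparing.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def grid_partition(
    points: List[Point],
    min_x: float,
    min_y: float,
    max_x: float,
    max_y: float,
    cells_per_dim: int,
) -> Dict[int, List[Point]]:
    """Assign each point to a cell of a cells_per_dim x cells_per_dim grid.

    Cells are numbered row by row: cell_id = row * cells_per_dim + column.
    Points on the upper boundary are placed in the last cell.
    """
    cell_w = (max_x - min_x) / cells_per_dim
    cell_h = (max_y - min_y) / cells_per_dim
    partitions: Dict[int, List[Point]] = defaultdict(list)
    for x, y in points:
        col = min(int((x - min_x) / cell_w), cells_per_dim - 1)
        row = min(int((y - min_y) / cell_h), cells_per_dim - 1)
        partitions[row * cells_per_dim + col].append((x, y))
    return dict(partitions)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (9.9, 9.9), (10.0, 10.0), (5.0, 5.0), (7.0, 1.0)]
    for cell_id, members in sorted(grid_partition(pts, 0, 0, 10, 10, 2).items()):
        print(cell_id, members)
```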