33 research outputs found

    Large Scale Feature Extraction from Linked Web Data

    Data available on the web is evolving, and the way it is represented is changing as well. Linked data has made information on the web understandable to machines. In this thesis we develop a proof-of-concept pipeline that extracts linked data from web crawls and performs feature extraction on it. The end goal of this pipeline is to provide input to machine learning models that are used for credit scoring. The use case focuses on extracting product linked data and connecting it with the company that offers it. The solution attempts to detect whether two products from different websites are the same, so that one representation can be used for both. Information about companies and products is represented as a graph on which network metrics are calculated. Network metrics from multiple web crawls are stored as time series that show how the graph changes over time. We then calculate derivatives of the values in the time series. The pipeline is designed to handle terabytes of data and is built with scalability in mind. We use Apache Spark to process large amounts of data quickly and to be ready if the input data grows 100-fold.
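
The last step described in this abstract (network metrics per crawl, stored as a time series, then differentiated) can be sketched roughly as follows. The snapshot format, the choice of "distinct products per company" as the metric, and all function names are assumptions made for illustration; the thesis itself works on Apache Spark and does not specify these details here.

```python
from collections import defaultdict

def products_per_company(edges):
    """Count distinct products linked to each company in one crawl snapshot.

    `edges` is an iterable of (company, product) pairs (hypothetical format).
    """
    products = defaultdict(set)
    for company, product in edges:
        products[company].add(product)
    return {company: len(items) for company, items in products.items()}

def metric_time_series(snapshots, company):
    """Return [(timestamp, metric)] for one company, ordered by timestamp."""
    series = []
    for timestamp in sorted(snapshots):
        counts = products_per_company(snapshots[timestamp])
        series.append((timestamp, counts.get(company, 0)))
    return series

def discrete_derivative(series):
    """First-order finite differences: change in metric per unit of time."""
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(series, series[1:])
    ]

# Hypothetical crawl snapshots keyed by a month index.
snapshots = {
    1: [("acme", "p1"), ("acme", "p2")],
    2: [("acme", "p1"), ("acme", "p2"), ("acme", "p3"), ("acme", "p4")],
    4: [("acme", "p1")],
}
series = metric_time_series(snapshots, "acme")
rates = discrete_derivative(series)
```

Dividing by the time gap rather than taking plain differences keeps the rates comparable when crawls are unevenly spaced, as in the hypothetical data above.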

    Improving Academic Natural Language Processing Infrastructures Utilizing Cluster Computation

    In light of widespread digitization endeavors and ever-growing textual data generation, developing efficient academic Natural Language Processing (NLP) infrastructures, which can deal with large amounts of data, is of particular importance. Novel computation technologies allow tools that support big data and heavy computation while performing timely and cost-effective data processing. This development has led researchers to demand that knowledge be extracted from ever-increasing textual data before it is outdated. Cluster computation is a modern technology for handling big data efficiently. It provides distribution of computing and data over a number of machines in a cluster, as well as efficient use of resources, which are key requirements to process big data in a timely manner. It also assures applications’ high availability and fault tolerance, which are fundamental concerns when dealing with vast amounts of data. In addition, it provides load balancing of data during the execution of tasks, which results in optimal use of resources and enhances efficiency. Data-oriented parallelization is an effective solution to enable the currently available academic NLP infrastructures to process big data. This approach offers a solution to parallelize the NLP tools which comprise identical non-complicated tasks without the expense of changing NLP algorithms. This thesis presents the adaption of cluster computation technology to academic NLP infrastructures to address the notable features that are essential to process vast quantities of text materials efficiently, in terms of both resources and time. Apache Spark on top of Apache Hadoop and its ecosystem have been utilized to develop a set of NLP tools that provide a distributed environment to execute the NLP tasks. Many experiments were conducted to assess the functionality of the designated strategy. 
This thesis shows that cluster computation technology combined with data-oriented parallelization enables academic NLP infrastructures to process large amounts of textual data in a timely manner while improving the performance of the NLP tools. Moreover, the experiments yield a more realistic and transparent estimate of each workflow’s cost (required hardware resources) and execution time, along with the fastest, optimum, or feasible resource configuration for each individual workflow. Users can employ this knowledge to trade off run-time, data size, and hardware, and to design a strategy for data storage, duration of data retention, and delivery time. This has the potential to enhance researchers’ satisfaction when using academic NLP infrastructures. The thesis also shows that a cluster computation approach makes it possible to adapt NLP services to JIT delivery systems. The proposed strategy assures the reliability and predictability of the services, which are the main characteristics of services in JIT delivery systems. Defining the relevant parameters, recording the behavior of the services, and analyzing the generated data produced knowledge that can be used to create a service catalog (a fundamental requirement for services in JIT delivery systems) for each service offered. This knowledge also helps to generate performance profiles for each item in the service catalog and to update them continuously to cover new experiments and improve service quality.
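
To make the idea of data-oriented parallelization concrete, the sketch below splits a document collection into partitions and runs the same, unmodified tool on each partition. It is a single-machine stand-in under stated assumptions: the whitespace tokenizer, the round-robin partitioning, and every function name are hypothetical, and a thread pool replaces the Spark/Hadoop cluster the abstract describes (so it illustrates the structure, not the speed-up).

```python
from concurrent.futures import ThreadPoolExecutor

def tokenize(document):
    """Stand-in for an unmodified NLP tool: lowercase whitespace tokenizer."""
    return document.lower().split()

def partition(items, num_partitions):
    """Split items into round-robin partitions of near-equal size."""
    return [items[i::num_partitions] for i in range(num_partitions)]

def run_partition(documents):
    """Apply the same tool to every document of one partition."""
    return [tokenize(doc) for doc in documents]

def parallel_tokenize(documents, num_workers=2):
    """Run the identical task on each partition concurrently."""
    parts = partition(documents, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(run_partition, parts))
    # Undo the round-robin split so output order matches input order.
    merged = [None] * len(documents)
    for offset, part in enumerate(results):
        merged[offset::num_workers] = part
    return merged

docs = ["Hello World", "Big Data", "NLP tools"]
tokens = parallel_tokenize(docs, num_workers=2)
```

The tool itself (`tokenize`) is never changed; only the data is divided, which is the property the abstract highlights.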

    Microdata Deduplication with Spark

    The web is transforming from the traditional web into a web of data, where information is presented so that it is readable by machines as well as humans. As part of this transformation, more and more websites embed structured data (e.g. products, persons, organizations, places) in their HTML pages. Different encoding formats, such as RDFa, microdata, and microformats, are used to embed the structured data. Microdata is the most recent addition to these embedding standards, yet it has gained more popularity than the other formats in less time. Similarly, progress has been made in extracting structured data from web pages, which has resulted in open-source tools such as Apache Any23 and the non-profit Common Crawl project. Any23 allows microdata to be extracted from web pages with little effort, whereas Common Crawl extracts data from websites and makes it publicly available for download. However, microdata extraction tools only take care of the parsing and data transformation steps of data cleansing. Although these state-of-the-art tools make microdata easy to extract, duplicates must be removed and the data unified before the extracted data can be used in applications. Since microdata originates from arbitrary web resources, its quality is arbitrary as well, and it should be treated accordingly. The main purpose of this thesis is to develop an effective mechanism for deduplicating microdata at web scale. Although deduplication algorithms have reached relative maturity, they still need to be fine-tuned on specific datasets. In particular, the most suitable length of the sorting key in the sort-based deduplication approach needs to be identified. The present work identifies the optimum length of the sorting key in the context of deduplicating extracted product microdata. Due to the large volumes of data to be processed continuously, Apache Spark is used to implement the necessary procedures.
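
Why the sorting-key length matters can be seen in a small sort-based deduplication sketch in the spirit of the sorted neighborhood method. Whether the thesis uses exactly this variant is not stated in the abstract; the record layout, the similarity threshold, the window size, and all function names below are assumptions for illustration only.

```python
from difflib import SequenceMatcher

def sorting_key(record, key_length):
    """Normalized prefix of the product name used to order records."""
    normalized = "".join(ch for ch in record["name"].lower() if ch.isalnum())
    return normalized[:key_length]

def similar(a, b, threshold=0.85):
    """Approximate string match on product names (hypothetical threshold)."""
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold

def find_duplicates(records, key_length, window=3):
    """Sort by key, then compare each record only with its window neighbors."""
    ordered = sorted(records, key=lambda r: sorting_key(r, key_length))
    pairs = set()
    for i, record in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            if similar(record, other):
                pairs.add(tuple(sorted((record["id"], other["id"]))))
    return pairs

# Hypothetical product records from different websites.
records = [
    {"id": 1, "name": "Apple iPhone 6 16GB"},
    {"id": 2, "name": "Acer Aspire E15"},
    {"id": 3, "name": "apple iphone 6 16 gb"},
]
short_key = find_duplicates(records, key_length=1, window=2)
long_key = find_duplicates(records, key_length=5, window=2)
```

With a one-character key all three records share the key "a", so the two iPhone entries are not adjacent and the narrow window misses them; a five-character key sorts them next to each other. A key that is too long has the opposite risk, since small spelling differences then push true duplicates apart.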

    Personalized large scale classification of public tenders on hadoop

    This project was completed as part of an innovation partnership between Fujitsu Canada and Université Laval. The needs and objectives of the project were centered on a business problem defined jointly with Fujitsu. The project aimed to classify a corpus of electronic public tenders using state-of-the-art Hadoop big data technology. The objective was to identify, with high recall, public tenders relevant to the IT services business of Fujitsu Canada. A small-scale prototype based on the BNS (Bi-Normal Separation) algorithm was shown empirically to classify the public tender corpus with high recall (93%). The prototype was then re-implemented on a full-scale Hadoop cluster, using Apache Pig for the data preparation pipeline and Apache Mahout for classification. Our experiments show that the large-scale system not only maintains high recall (91%) on the classification task, but can also readily take advantage of the scalability gains made possible by Hadoop’s distributed architecture.
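
For readers unfamiliar with Bi-Normal Separation, the sketch below computes the BNS score of a single term as the absolute difference between the inverse normal CDF of its true-positive rate and of its false-positive rate. The function name, the example counts, and the clipping value are assumptions for illustration; how the project computed or used the scores on Hadoop is not described in the abstract.

```python
from statistics import NormalDist

def bns_score(tp, fp, pos, neg, clip=0.0005):
    """Bi-Normal Separation: |F^-1(tpr) - F^-1(fpr)|, F^-1 = inverse normal CDF.

    tp / fp: number of positive / negative documents containing the term.
    pos / neg: total number of positive / negative documents.
    """
    inv = NormalDist().inv_cdf
    # Clip rates away from 0 and 1, where the inverse CDF is unbounded.
    tpr = min(max(tp / pos, clip), 1 - clip)
    fpr = min(max(fp / neg, clip), 1 - clip)
    return abs(inv(tpr) - inv(fpr))

# Hypothetical counts: one term concentrated in relevant tenders,
# one term spread evenly across both classes.
discriminative = bns_score(tp=90, fp=10, pos=100, neg=100)
uninformative = bns_score(tp=50, fp=50, pos=100, neg=100)
```

A term that appears equally often in both classes scores zero, while a term concentrated in one class scores high, which is what makes the measure useful for weighting or selecting features.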

    Evaluation of Storage Systems for Big Data Analytics

    Recent trends in big data storage systems show a shift from disk-centric models to memory-centric models. The primary challenges faced by these systems are speed, scalability, and fault tolerance. It is interesting to investigate the performance of these two models with respect to some big data applications. This thesis studies the performance of Ceph (a disk-centric model) and Alluxio (a memory-centric model) and evaluates whether a hybrid model provides any performance benefits with respect to big data applications. To this end, an application, TechTalk, is created that uses Ceph to store data and Alluxio to perform data analytics. The functionalities of the application include offline lecture storage, live recording of classes, content analysis, and reference generation. The knowledge base of videos is constructed by analyzing the offline data using machine learning techniques. This training dataset provides knowledge to construct the index of an online stream. The indexed metadata enables students to search, view, and access the relevant content. The performance of the application is benchmarked in different use cases to demonstrate the benefits of the hybrid model. Dissertation/Thesis, Masters Thesis, Computer Science 201

    Predicting parking space availability based on heterogeneous data using Machine Learning techniques

    These days, smart cities are focused on improving their services and bringing quality to everyday life by leveraging modern ICT technologies. For this reason, data from connected IoT devices, environmental sensors, economic platforms, social networking sites, governance systems, and others can be gathered to achieve such goals. The rapid increase in the number of vehicles in major cities of the world has made mobility in urban areas difficult, due to traffic congestion and parking availability issues. Finding a suitable parking space is often influenced by various factors such as weather conditions, traffic flows, and geographical information (markets, hospitals, parks, and others). In this study, a predictive analysis has been performed to estimate the availability of parking spaces using heterogeneous data from Cork County, Ireland. However, accumulating, processing, and analysing the data produced by heterogeneous sources is itself a challenge, due to their diverse nature and different acquisition frequencies. Therefore, a data lake has been proposed in this study to collect, process, analyse, and visualize data from disparate sources. In addition, the proposed platform is used for predicting the available parking spaces using the data collected from heterogeneous sources. The study includes the proposed design and implementation details of the data lake as well as the parking space availability prediction model developed using machine learning techniques.

    Efficient Resource Management for Cloud Computing Environments

    Cloud computing has recently gained popularity as a cost-effective model for hosting and delivering services over the Internet. In a cloud computing environment, a cloud provider packages its physical resources in data centers into virtual resources and offers them to service providers using a pay-as-you-go pricing model. Meanwhile, a service provider uses the rented virtual resources to host its services. This large-scale multi-tenant architecture of cloud computing systems raises key challenges regarding how data center resources should be controlled and managed by both service and cloud providers. This thesis addresses several key challenges pertaining to resource management in cloud environments. From the perspective of service providers, we address the problem of selecting appropriate data centers for service hosting, taking into account resource price, service quality, and dynamic reconfiguration costs. From the perspective of cloud providers, it has been reported that the workload in real data centers can typically be divided into server-based applications and MapReduce applications, which have different performance and scheduling criteria; we therefore provide separate resource management solutions for each type of workload. For server-based applications, we provide a dynamic capacity provisioning scheme that adjusts the number of active servers to achieve the best trade-off between energy savings and scheduling delay, while considering the heterogeneous resource characteristics of both workload and physical machines. For MapReduce applications, we first analyzed the run-time resource consumption of tasks from a large variety of MapReduce jobs and discovered that it can vary significantly over time, depending on the phase the task is currently executing. We then present a novel scheduling algorithm that controls task execution at the level of phases, with the aim of improving both job running time and resource utilization. Through detailed simulations and experiments using real cloud clusters, we have found that our proposed solutions achieve substantial gains compared to current state-of-the-art resource management solutions, and therefore have strong implications for the design of real cloud resource management systems in practice.

    Content sensitivity based access control model for big data

    Big data technologies have seen tremendous growth in recent years. They are being widely used in both industry and academia. In spite of such exponential growth, these technologies lack adequate measures to protect data from misuse or abuse. Corporations that collect data from multiple sources are at risk of liabilities due to exposure of sensitive information. In the current implementation of Hadoop, only file-level access control is feasible. Giving users the ability to access data based on attributes in a dataset or based on their role is complicated by the sheer volume and multiple formats (structured, unstructured, and semi-structured) of the data. In this dissertation, an access control framework is proposed that enforces access control policies dynamically based on the sensitivity of the data. The framework enforces access control policies by harnessing the data context, usage patterns, and information sensitivity. Information sensitivity changes over time with the addition and removal of datasets, which can lead to modifications in access control decisions, and the proposed framework accommodates these changes. The proposed framework is automated to a large extent and requires minimal user intervention. The experimental results show that the proposed framework is capable of enforcing access control policies on non-multimedia datasets with minimal overhead.
