33 research outputs found

    Large Scale Feature Extraction from Linked Web Data

    Data available on the web is evolving, and the way it is represented is changing as well. Linked data has made information on the web understandable to machines. In this thesis we develop a proof-of-concept pipeline that extracts linked data from web crawls and performs feature extraction on it. The end goal of this pipeline is to provide input to machine learning models that are used for credit scoring of companies. The use case focuses on extracting product linked data and connecting it with the company that offers the product. The built solution attempts to detect whether two products from different websites are the same, in order to use one representation for both. Information about companies and products is represented as a graph on which network metrics are calculated. Network metrics from multiple web crawls are stored as time series that show how the graph changes over time, and we then calculate derivatives on the values in these time series. The developed pipeline is designed to handle terabytes of data and is built with scalability in mind. We use Apache Spark to process huge amounts of data and to be ready should the input data grow a hundredfold.
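
    A minimal PySpark sketch of the kind of per-crawl graph metric and time-series feature described above: node degree per company for each crawl snapshot, followed by the change between crawls. The input paths, the column names (company, product), and the degree metric are illustrative assumptions, not the thesis implementation.

```python
# Illustrative sketch (assumed inputs): compute a simple graph metric
# (company node degree) per crawl snapshot and turn it into a time series.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("linked-data-features").getOrCreate()

# Assumed input: one (company, product) edge list per crawl snapshot.
crawl_paths = {"2020-01": "/data/crawl_2020_01.parquet",
               "2020-02": "/data/crawl_2020_02.parquet"}

snapshots = []
for crawl_id, path in crawl_paths.items():
    edges = spark.read.parquet(path)
    degree = (edges.groupBy("company")
                   .agg(F.countDistinct("product").alias("degree"))
                   .withColumn("crawl", F.lit(crawl_id)))
    snapshots.append(degree)

# One long table (company, crawl, degree) spanning all crawls.
series = snapshots[0]
for snapshot in snapshots[1:]:
    series = series.unionByName(snapshot)

# Example feature over the time series: change in degree between crawls.
w = Window.partitionBy("company").orderBy("crawl")
features = series.withColumn("degree_delta",
                             F.col("degree") - F.lag("degree").over(w))
features.show()
```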

    Improving Academic Natural Language Processing Infrastructures Utilizing Cluster Computation

    In light of widespread digitization endeavors and ever-growing textual data generation, developing efficient academic Natural Language Processing (NLP) infrastructures that can deal with large amounts of data is of particular importance. Novel computation technologies enable tools that support big data and heavy computation to perform timely and cost-effective data processing. This development has led researchers to demand that knowledge be extracted from ever-increasing textual data before it becomes outdated. Cluster computation is a modern technology for handling big data efficiently. It distributes computation and data over a number of machines in a cluster and makes efficient use of resources, which are key requirements for processing big data in a timely manner. It also assures applications’ high availability and fault tolerance, which are fundamental concerns when dealing with vast amounts of data. In addition, it provides load balancing of data during the execution of tasks, which results in optimal use of resources and enhances efficiency. Data-oriented parallelization is an effective way to enable the currently available academic NLP infrastructures to process big data. This approach makes it possible to parallelize NLP tools that comprise identical, uncomplicated tasks without the expense of changing the NLP algorithms. This thesis presents the adaptation of cluster computation technology to academic NLP infrastructures to address the features that are essential for processing vast quantities of text material efficiently, in terms of both resources and time. Apache Spark, on top of Apache Hadoop and its ecosystem, has been utilized to develop a set of NLP tools that provide a distributed environment for executing NLP tasks. Many experiments were conducted to assess the functionality of the designed strategy. This thesis shows that using cluster computation technology and data-oriented parallelization enables academic NLP infrastructures to process large amounts of textual data in a timely manner while improving the performance of the NLP tools. Moreover, these experiments provide information that yields a more realistic and transparent estimation of workflows’ costs (required hardware resources) and execution times, along with the fastest, optimal, or feasible resource configuration for each individual workflow. This knowledge can be employed by users to trade off between run time, data size, and hardware, and it enables them to design a strategy for data storage, duration of data retention, and delivery time. This has the potential to enhance researchers’ satisfaction when using academic NLP infrastructures. The thesis also shows that a cluster computation approach provides the capacity to adapt NLP services to just-in-time (JIT) delivery systems. The proposed strategy assures the reliability and predictability of the services, which are the main characteristics of services in JIT delivery systems. Defining the relevant parameters, recording the behavior of the services, and analyzing the generated data provide knowledge that can be utilized to create a service catalog (a fundamental requirement for services in JIT delivery systems) for each service offered. This knowledge also helps to generate performance profiles for each item in the service catalog and to update them continuously to cover new experiments and improve service quality.
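
    A minimal sketch of the data-oriented parallelization idea described above, assuming a simple per-document NLP task (here, whitespace tokenization and token counting); the HDFS path and the placeholder tokenizer are assumptions, not the thesis's actual NLP tools.

```python
# Hypothetical sketch of data-oriented parallelization: the same simple NLP
# task is applied independently to each document of a large text collection,
# distributed over the cluster, without changing the NLP algorithm itself.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nlp-data-parallel").getOrCreate()
sc = spark.sparkContext

# Assumed input: one document per line in a large text file on HDFS.
docs = sc.textFile("hdfs:///corpora/large_collection.txt")

def tokenize(doc):
    # Placeholder for a real NLP tool; any per-document function fits here.
    return doc.lower().split()

token_counts = (docs.flatMap(tokenize)
                    .map(lambda tok: (tok, 1))
                    .reduceByKey(lambda a, b: a + b))

print(token_counts.take(10))
```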

    Microdata Deduplication with Spark

    The web is transforming from the traditional web into a web of data, where information is presented in such a way that it is readable by machines as well as humans. As a part of this transformation, more and more websites embed structured data, e.g. descriptions of products, persons, organizations, and places, into their HTML pages. Different encoding vocabularies, such as RDFa, microdata, and microformats, are used to embed the structured data. Microdata is the most recent addition to these standards, yet it has gained more popularity than the other formats in less time. Similarly, progress has been made in extracting structured data from web pages, which has resulted in open source tools such as Apache Any23 and the non-profit Common Crawl project. Any23 allows microdata to be extracted from web pages with little effort, whereas Common Crawl crawls websites and makes the data publicly available for download. The microdata extraction tools, however, only take care of the parsing and data transformation steps of data cleansing. Although microdata can be easily extracted with these state-of-the-art tools, duplicates should be removed and the data unified before the extracted data is used in potential applications. Since microdata originates from arbitrary web resources, it is of arbitrary quality as well and should be treated accordingly. The main purpose of this thesis is to develop an effective mechanism for the deduplication of microdata at web scale. Although deduplication algorithms have reached relative maturity, they still need to be fine-tuned for specific datasets. In particular, the most suitable length of the sorting key in the sorting-based deduplication approach needs to be identified. The present work identifies the optimum length of the sorting key in the context of deduplicating extracted product microdata. Because large volumes of data have to be processed continuously, Apache Spark is used to implement the necessary procedures.
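
    A minimal PySpark sketch of the sorting-key idea discussed above: records are grouped by a key prefix of a chosen length and only records sharing the key are compared. The columns (id, name), the key construction, and the match rule are illustrative assumptions; a sorted-neighborhood implementation would slide a window over the sorted records rather than joining key blocks.

```python
# Illustrative sketch of sorting-key based deduplication in PySpark: records
# are blocked by a key prefix of length KEY_LENGTH and only records sharing
# the key are compared. Columns, key construction, and match rule are
# placeholder assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("microdata-dedup").getOrCreate()

# Assumed input: product microdata records with "id" and "name" columns.
products = spark.read.parquet("/data/product_microdata.parquet")

KEY_LENGTH = 6  # the tuning parameter investigated in the thesis

keyed = products.withColumn(
    "sort_key",
    F.substring(F.lower(F.regexp_replace("name", r"\s+", "")), 1, KEY_LENGTH))

# Candidate pairs: records that fall into the same key block.
pairs = (keyed.alias("a")
              .join(keyed.alias("b"),
                    F.col("a.sort_key") == F.col("b.sort_key"))
              .where(F.col("a.id") < F.col("b.id")))

# Placeholder match rule: identical normalized names count as duplicates.
duplicates = pairs.where(F.lower(F.col("a.name")) == F.lower(F.col("b.name")))
duplicates.select("a.id", "b.id").show()
```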

    Personalized large scale classification of public tenders on hadoop

    This project was completed as part of an innovation partnership between Fujitsu Canada and UniversitĂ© Laval. The needs and objectives of the project were centered on a business problem defined jointly with Fujitsu. Our project aimed to classify a corpus of electronic public tenders based on state-of-the-art Hadoop big data technology. The objective was to identify, with high recall, public tenders relevant to the IT services business of Fujitsu Canada. A small-scale prototype based on the BNS (Bi-Normal Separation) algorithm was empirically shown to classify the public tender corpus with high recall (93%). The prototype was then re-implemented on a full-scale Hadoop cluster, using Apache Pig for the data preparation pipeline and Apache Mahout for classification. Our experiments show that the large-scale system not only maintains high recall (91%) on the classification task, but can readily take advantage of the massive scalability gains made possible by Hadoop’s distributed architecture.
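
    The abstract names the Bi-Normal Separation feature-scoring algorithm; as a reference, here is a short sketch of the standard BNS score, BNS(term) = |F^-1(tpr) - F^-1(fpr)|, where F^-1 is the inverse standard normal CDF and tpr/fpr are the term's occurrence rates in positive and negative documents. The clipping constant and the example counts are illustrative, not figures from this project.

```python
# Sketch of the standard Bi-Normal Separation (BNS) feature score:
# BNS(term) = |inv_norm_cdf(tpr) - inv_norm_cdf(fpr)|. The counts used in the
# example below are made up for illustration, not data from the project.
from scipy.stats import norm

def bns(tp, fn, fp, tn, eps=0.0005):
    """Bi-Normal Separation score for one term."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Clip rates away from 0 and 1 so the inverse normal CDF stays finite.
    tpr = min(max(tpr, eps), 1 - eps)
    fpr = min(max(fpr, eps), 1 - eps)
    return abs(norm.ppf(tpr) - norm.ppf(fpr))

# Example: a term appearing in 80 of 100 relevant tenders and in 50 of 900
# irrelevant ones receives a high BNS weight.
print(bns(tp=80, fn=20, fp=50, tn=850))
```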

    Evaluation of Storage Systems for Big Data Analytics

    Recent trends in big data storage systems show a shift from disk-centric models to memory-centric models. The primary challenges faced by these systems are speed, scalability, and fault tolerance. It is interesting to investigate the performance of these two models with respect to some big data applications. This thesis studies the performance of Ceph (a disk-centric model) and Alluxio (a memory-centric model) and evaluates whether a hybrid model provides any performance benefits for big data applications. To this end, an application, TechTalk, is created that uses Ceph to store data and Alluxio to perform data analytics. The functionalities of the application include offline lecture storage, live recording of classes, content analysis, and reference generation. The knowledge base of videos is constructed by analyzing the offline data using machine learning techniques. This training dataset provides the knowledge to construct the index of an online stream. The indexed metadata enables students to search, view, and access the relevant content. The performance of the application is benchmarked in different use cases to demonstrate the benefits of the hybrid model.
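
    A hedged sketch of the hybrid-model idea, assuming the Ceph-backed store has been mounted into the Alluxio namespace and that Spark has the Alluxio client on its classpath: analytics jobs read the data through Alluxio's memory tier via its Hadoop-compatible alluxio:// URI. Host, port, paths, and column names are placeholders, not details of the TechTalk application.

```python
# Hypothetical sketch: data persisted in the Ceph-backed tier is read through
# Alluxio's memory-centric layer via its Hadoop-compatible alluxio:// URI.
# Requires the Alluxio client on the Spark classpath; host, port (19998 is
# Alluxio's default master RPC port), and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("techtalk-analytics").getOrCreate()

# Read lecture metadata that Alluxio serves (and caches) from the Ceph tier.
lectures = spark.read.json("alluxio://alluxio-master:19998/lectures/metadata")

# A simple analytics step over the cached data.
lectures.groupBy("course").count().show()
```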

    Predicting parking space availability based on heterogeneous data using Machine Learning techniques

    These days, smart cities are focused on improving their services and bringing quality to everyday life by leveraging modern ICT technologies. For this reason, data from connected IoT devices, environmental sensors, economic platforms, social networking sites, governance systems, and other sources can be gathered to achieve such goals. The rapid increase in the number of vehicles in major cities of the world has made mobility in urban areas difficult, due to traffic congestion and parking availability issues. Finding a suitable parking space is often influenced by various factors such as weather conditions, traffic flows, and geographical information (markets, hospitals, parks, and others). In this study, a predictive analysis has been performed to estimate the availability of parking spaces using heterogeneous data from Cork County, Ireland. However, accumulating, processing, and analysing the data produced by heterogeneous sources is itself a challenge, due to their diverse nature and different acquisition frequencies. Therefore, a data lake is proposed in this study to collect, process, analyse, and visualize data from disparate sources. In addition, the proposed platform is used for predicting available parking spaces using the collected data. The study includes the proposed design and implementation details of the data lake, as well as the parking space availability prediction model developed using machine learning techniques.
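
    A minimal sketch of the kind of prediction model the study describes, assuming a merged table of weather, traffic, and time features per car park; the column names, the CSV input, and the random-forest regressor are illustrative choices, not the study's actual pipeline.

```python
# Hypothetical sketch: predicting available parking spaces from merged
# heterogeneous features. Columns, the CSV input, and the model choice are
# illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Assumed merged dataset: one row per car park and timestamp.
df = pd.read_csv("parking_weather_traffic.csv")
features = ["hour_of_day", "day_of_week", "temperature",
            "rainfall_mm", "traffic_flow"]
X, y = df[features], df["available_spaces"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```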

    Efficient Resource Management for Cloud Computing Environments

    Cloud computing has recently gained popularity as a cost-effective model for hosting and delivering services over the Internet. In a cloud computing environment, a cloud provider packages its physical resources in data centers into virtual resources and offers them to service providers using a pay-as-you-go pricing model. Meanwhile, a service provider uses the rented virtual resources to host its services. This large-scale multi-tenant architecture of cloud computing systems raises key challenges regarding how data center resources should be controlled and managed by both service and cloud providers. This thesis addresses several key challenges pertaining to resource management in cloud environments. From the perspective of service providers, we address the problem of selecting appropriate data centers for service hosting with consideration of resource price, service quality, and dynamic reconfiguration costs. From the perspective of cloud providers, as it has been reported that workload in real data centers can typically be divided into server-based applications and MapReduce applications with different performance and scheduling criteria, we provide separate resource management solutions for each type of workload. For server-based applications, we provide a dynamic capacity provisioning scheme that dynamically adjusts the number of active servers to achieve the best trade-off between energy savings and scheduling delay, while considering the heterogeneous resource characteristics of both workloads and physical machines. For MapReduce applications, we first analyze the run-time resource consumption of a large variety of MapReduce jobs and find that it can vary significantly over time, depending on the phase a task is currently executing. We then present a novel scheduling algorithm that controls task execution at the level of phases, with the aim of improving both job running time and resource utilization. Through detailed simulations and experiments using real cloud clusters, we find that our proposed solutions achieve substantial gains compared to current state-of-the-art resource management solutions, and therefore have strong implications for the design of real cloud resource management systems in practice.
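
    A toy sketch of the dynamic capacity provisioning trade-off for server-based applications described above: servers are woken up when the observed scheduling delay exceeds a target and powered down when utilization is low. All thresholds and inputs are illustrative assumptions, not the thesis's controller.

```python
# Toy sketch of dynamic capacity provisioning: adjust the number of active
# servers to balance energy savings against scheduling delay. Thresholds and
# the delay/utilization inputs are illustrative assumptions.
def provision(active_servers, avg_delay_s, utilization,
              delay_target_s=2.0, util_low=0.3,
              min_servers=1, max_servers=100):
    """Return the number of active servers for the next control period."""
    if avg_delay_s > delay_target_s and active_servers < max_servers:
        # Queueing delay too high: wake up an additional server.
        return active_servers + 1
    if utilization < util_low and active_servers > min_servers:
        # Fleet mostly idle: power one server down to save energy.
        return active_servers - 1
    return active_servers

# Example control step: high delay triggers a scale-up from 10 to 11 servers.
print(provision(active_servers=10, avg_delay_s=3.5, utilization=0.8))
```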

    Content sensitivity based access control model for big data

    Big data technologies have seen tremendous growth in recent years. They are being widely used in both industry and academia. In spite of such exponential growth, these technologies lack adequate measures to protect the data from misuse or abuse. Corporations that collect data from multiple sources are at risk of liabilities due to exposure of sensitive information. In the current implementation of Hadoop, only file-level access control is feasible. Giving users the ability to access data based on attributes in a dataset, or based on their role, is complicated by the sheer volume and the multiple formats (structured, unstructured, and semi-structured) of the data. In this dissertation, an access control framework that enforces access control policies dynamically based on the sensitivity of the data is proposed. This framework enforces access control policies by harnessing the data context, usage patterns, and information sensitivity. Information sensitivity changes over time with the addition and removal of datasets, which can lead to modifications in access control decisions, and the proposed framework accommodates these changes. The proposed framework is automated to a large extent and requires minimal user intervention. The experimental results show that the proposed framework is capable of enforcing access control policies on non-multimedia datasets with minimal overhead.
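
    A minimal sketch of the content-sensitivity idea, assuming attribute-level sensitivity labels and role clearances; the labels, roles, and filtering rule are illustrative placeholders, not the dissertation's framework, which additionally draws on data context and usage patterns.

```python
# Hypothetical sketch of content-sensitivity-based access control: the
# decision depends on sensitivity labels attached to record attributes,
# not on whole-file permissions. Labels, roles, and the policy table are
# illustrative assumptions.
SENSITIVITY = {"name": "low", "email": "medium", "ssn": "high"}
CLEARANCE = {"analyst": "low", "manager": "medium", "auditor": "high"}
LEVELS = ["low", "medium", "high"]

def visible_fields(role, record):
    """Return only the attributes the role's clearance allows it to see."""
    allowed = LEVELS.index(CLEARANCE[role])
    return {field: value for field, value in record.items()
            if LEVELS.index(SENSITIVITY.get(field, "high")) <= allowed}

record = {"name": "Jane Doe", "email": "jane@example.com", "ssn": "123-45-6789"}
print(visible_fields("analyst", record))   # only low-sensitivity attributes
print(visible_fields("auditor", record))   # full record
```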
