3,866 research outputs found

    Scalable data analytics using Spark

    A printed copy of the thesis is held at the İstanbul Şehir University Library.

    This thesis presents our experience in designing a scalable data analytics platform on top of Apache Spark (primarily) and Apache Hadoop (to a lesser extent). We worked on three representative applications: (1) sentiment analysis, (2) collaborative filtering, and (3) topic modeling. We demonstrated how to scale these applications on a cluster of 8 workers, each contributing 4 cores, 8 GB of RAM, and 100 GB of disk space to the compute pool. Our conclusion is that Apache Spark is mature enough to be deployed comfortably in production.

    Contents: Abstract; Öz; Acknowledgments; List of Figures; List of Tables
    1 Introduction
    2 Sentiment Analytics on Spark
        2.1 Related Work
        2.2 Methodology (2.2.1 Preprocessing the Data; 2.2.2 Naive Bayes Classifier)
        2.3 Experimental Setup (2.3.1 Resilient Distributed Datasets (RDD); 2.3.2 Broadcast Variables; 2.3.3 The Movie Reviews Dataset; 2.3.4 Cluster Configuration; 2.3.5 Model Building)
        2.4 Experimental Evaluation (2.4.1 Apache Hadoop; 2.4.2 Apache Mahout; 2.4.3 Empirical Results: Broadcasting vs. Not Broadcasting, Time Required for Training, Time Required for Testing)
    3 Collaborative Filtering on Spark (3.1 MLBase; 3.2 System Architecture; 3.3 Online Recommendation System)
    4 Topic Modeling on Hadoop (4.1 Introduction; 4.2 Related Work; 4.3 LDA in MapReduce; 4.4 Experimental Results: 4.4.1 The Dataset; 4.4.2 Cluster Configuration; 4.4.3 Test Results)
    5 Conclusions
    Bibliography
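    The most concrete technique named in this abstract and its contents is the use of broadcast variables: shipping the trained Naive Bayes model to each executor once rather than inside every task closure. Below is a minimal sketch of that pattern using Spark's RDD-based MLlib API; the HDFS paths, feature dimension, and whitespace tokenizer are illustrative assumptions, not details from the thesis.

        # A minimal sketch (not the thesis's actual code) of sentiment classification
        # with Naive Bayes on Spark RDDs, using the RDD-based MLlib API of that era.
        # The HDFS paths, feature dimension, and whitespace tokenizer are assumptions.
        from pyspark import SparkContext
        from pyspark.mllib.classification import NaiveBayes
        from pyspark.mllib.feature import HashingTF
        from pyspark.mllib.regression import LabeledPoint

        sc = SparkContext(appName="sentiment-naive-bayes")
        tf = HashingTF(numFeatures=2 ** 18)  # hashed bag-of-words features

        def to_point(line, label):
            # Tokenize on whitespace and hash tokens into a sparse feature vector.
            return LabeledPoint(label, tf.transform(line.lower().split()))

        # One review per line, split by polarity (hypothetical paths).
        pos = sc.textFile("hdfs:///reviews/pos").map(lambda l: to_point(l, 1.0))
        neg = sc.textFile("hdfs:///reviews/neg").map(lambda l: to_point(l, 0.0))
        train, test = pos.union(neg).randomSplit([0.8, 0.2], seed=42)

        model = NaiveBayes.train(train, lambda_=1.0)  # Laplace smoothing

        # Broadcast the trained model once per executor instead of shipping it
        # inside every task closure; this is the trade-off the thesis benchmarks.
        bmodel = sc.broadcast(model)
        correct = (test.map(lambda p: (bmodel.value.predict(p.features), p.label))
                       .filter(lambda pair: pair[0] == pair[1])
                       .count())
        print("test accuracy: %.3f" % (correct / float(test.count())))

    On a small pool like the 8-worker cluster described above, broadcasting keeps the cost of distributing the model constant per executor as the number of partitions grows, which is consistent with the broadcasting-versus-not-broadcasting comparison listed in the contents.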

    A Context Centric Model for building a Knowledge advantage Machine Based on Personal Ontology Patterns

    Throughout the industrial era, societal advancement could be attributed in large part to the introduction of a plethora of electromechanical machines, all of which exploited a key concept known as mechanical advantage. In the post-industrial era, the exploitation of knowledge is emerging as the key enabler of societal advancement. With the advent of the Internet and the Web, there is no dearth of knowledge; what is lacking is an efficient and practical mechanism for organizing knowledge and presenting it in a comprehensible form appropriate for every context. This is the fundamental problem addressed by my dissertation.

    We begin by proposing a novel architecture for creating a Knowledge Advantage Machine (KaM), one which enables a knowledge worker to bring to bear a larger amount of knowledge to solve a problem in a shorter time. This is analogous to an electromechanical machine that enables an industrial worker to bring to bear a large amount of power to perform a task, thus improving worker productivity. This work is based on the premise that while a universal KaM is beyond the realm of possibility, a KaM specific to a particular type of knowledge worker is realizable because of the limited scope of his/her personal ontology used to organize all relevant knowledge objects.

    The proposed architecture is based on a society of intelligent agents which collaboratively discover, mark up, and organize relevant knowledge objects into a semantic knowledge network on a continuing basis. This network is in turn exploited by another agent, known as the Context Agent, which determines the current context of the knowledge worker and makes the relevant portion of the semantic network available in a suitable form. In this dissertation we demonstrate the viability and extensibility of this architecture by building a prototype KaM for one type of knowledge worker, such as a professor.
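    The abstract describes the agent architecture only at a high level. The toy sketch below, with entirely hypothetical class and concept names not taken from the dissertation, illustrates the core Context Agent idea: knowledge objects carry ontology markup, and the agent surfaces the slice of the semantic network that best matches the worker's current context.

        # A toy illustration (hypothetical names throughout, not from the dissertation)
        # of the Context Agent idea: knowledge objects carry ontology markup, and the
        # agent surfaces the slice of the network matching the current context.
        from dataclasses import dataclass, field

        @dataclass
        class KnowledgeObject:
            title: str
            concepts: set  # personal-ontology concepts this object is marked up with

        @dataclass
        class ContextAgent:
            network: list = field(default_factory=list)  # semantic knowledge network

            def relevant(self, context: set, threshold: float = 0.5):
                # Rank objects by overlap between their markup and the context.
                scored = [(len(o.concepts & context) / len(o.concepts), o)
                          for o in self.network]
                scored.sort(key=lambda pair: pair[0], reverse=True)
                return [o for score, o in scored if score >= threshold]

        # A professor's KaM switching into a teaching context:
        agent = ContextAgent(network=[
            KnowledgeObject("CS101 syllabus", {"teaching", "curriculum"}),
            KnowledgeObject("Grant proposal draft", {"research", "funding"}),
            KnowledgeObject("Lecture slides: graph algorithms", {"teaching", "graphs"}),
        ])
        for obj in agent.relevant({"teaching", "graphs"}):
            print(obj.title)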

    BlogForever: D2.5 Weblog Spam Filtering Report and Associated Methodology

    This report is written as a first attempt to define the BlogForever spam detection strategy. It comprises a survey of weblog spam technology and approaches to its detection. While the report was written to help identify possible approaches to spam detection as a component within the BlogForever software, the discussion has been extended to include observations related to the historical, social, and practical value of spam, and proposals for other ways of dealing with spam within the repository without necessarily removing it. The report contains a general overview of spam types, ready-made anti-spam APIs available for weblogs, possible methods that have been suggested for preventing the introduction of spam into a blog, and research related to spam, focusing on spam that appears in the weblog context. It concludes with a proposal for a spam detection workflow that might form the basis of the spam detection component of the BlogForever software.
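    The report surveys candidate techniques rather than prescribing one. A common baseline for weblog comment spam, and one compatible with the report's suggestion of handling spam without necessarily deleting it, is a bag-of-words classifier; the sketch below uses scikit-learn with invented placeholder samples.

        # A baseline sketch, not the report's prescribed method: a bag-of-words
        # spam classifier for blog comments using scikit-learn. The training
        # samples below are invented placeholders.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        comments = [
            "Great post, thanks for sharing your methodology!",
            "Buy cheap meds online no prescription click here",
            "I disagree with section 3, the sample size seems small.",
            "Make $$$ fast!!! visit my site for free gift cards",
        ]
        labels = [0, 1, 0, 1]  # 0 = ham, 1 = spam

        filter_model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                                     MultinomialNB())
        filter_model.fit(comments, labels)

        # Quarantine rather than delete, in line with the report's observation
        # that spam can have historical and social value in an archive.
        new_comment = "Free followers!!! click this link now"
        action = "quarantine" if filter_model.predict([new_comment])[0] else "publish"
        print(action)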

    Sentiment analysis in context: Investigating the use of BERT and other techniques for ChatBot improvement

    In an increasingly digitized world, where large amounts of data are generated daily, the need for efficient analysis has become more and more pressing. Natural Language Processing (NLP) offers a solution by exploiting the power of artificial intelligence to process texts, understand their content, and perform specific tasks. The thesis is based on an internship at Pat Srl, a company devoted to creating solutions that support digital innovation, process automation, and service quality, with the ultimate goal of improving leadership and customer satisfaction. The primary objective of this thesis is to develop a sentiment analysis model in order to improve the customer experience for clients using the ChatBot system created by the company itself. This task has gained significant attention in recent years, as it can be applied to many fields, including social media monitoring, market research, brand monitoring, and customer experience and feedback analysis. Following a careful analysis of the available data, a comprehensive evaluation of various models was conducted. Notably, BERT, a large language model that has produced promising results in several NLP tasks, stood out among them. Different approaches to using BERT models were explored, such as the fine-tuning procedure and the architectural structure. Moreover, some data preprocessing steps were emphasized and studied, owing to the particular nature of the sentiment analysis task.
    During the course of the internship, the dataset underwent revisions aimed at mitigating the problem of inaccurate predictions. Additionally, techniques for data balancing were tested and evaluated, enhancing the overall quality of the analysis. Another important aspect of this project involved the deployment of the model. In a business environment, it is essential to carefully consider and balance resources before transitioning to production. The model was distributed using specific tools, such as Docker and Kubernetes; these technologies played a pivotal role in ensuring efficient and seamless deployment.
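    The abstract names the ingredients of the approach (a pre-trained BERT checkpoint fine-tuned on labeled chat data) without showing them. Below is a minimal fine-tuning sketch with the Hugging Face transformers API; the checkpoint, three-class label scheme, toy examples, and hyperparameters are all assumptions, since the thesis's actual data and configuration are internal to the company.

        # A minimal fine-tuning sketch with the Hugging Face transformers API.
        # The checkpoint, three-class label scheme, toy examples, and
        # hyperparameters are assumptions; the thesis's data is internal to Pat Srl.
        import torch
        from torch.utils.data import Dataset
        from transformers import (AutoModelForSequenceClassification,
                                  AutoTokenizer, Trainer, TrainingArguments)

        checkpoint = "bert-base-uncased"
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForSequenceClassification.from_pretrained(
            checkpoint, num_labels=3)  # negative / neutral / positive

        texts = ["The bot solved my issue instantly",
                 "Useless answers, very frustrating",
                 "It works, I guess"]
        labels = [2, 0, 1]

        class ChatDataset(Dataset):
            # Tokenize once up front; Trainer consumes dicts of tensors.
            def __init__(self, texts, labels):
                self.enc = tokenizer(texts, truncation=True, padding=True,
                                     return_tensors="pt")
                self.labels = labels
            def __len__(self):
                return len(self.labels)
            def __getitem__(self, i):
                item = {k: v[i] for k, v in self.enc.items()}
                item["labels"] = torch.tensor(self.labels[i])
                return item

        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir="sentiment-bert",
                                   num_train_epochs=1,
                                   per_device_train_batch_size=2),
            train_dataset=ChatDataset(texts, labels),
        )
        trainer.train()

    In production, a fine-tuned model like this is typically wrapped in a small inference service, packaged as a Docker image, and scheduled on Kubernetes, which matches the deployment tooling the abstract names.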

    Automatic text filtering using limited supervision learning for epidemic intelligence

    [no abstract]
    • 

    corecore