3,866 research outputs found

    Scalable data analytics using Spark

    A printed copy of the thesis is held at the İstanbul Şehir University Library.

    This thesis presents our experience in designing a scalable data analytics platform on top of Apache Spark (primarily) and Apache Hadoop (to a lesser extent). We worked on three representative applications: (1) sentiment analysis, (2) collaborative filtering, and (3) topic modeling. We demonstrated how to scale these applications on a cluster of 8 workers, each contributing 4 cores, 8 GB of RAM, and 100 GB of disk space to the compute pool. Our conclusion is that Apache Spark is mature enough to be deployed comfortably in production.

    Contents: Abstract; Öz; Acknowledgments; List of Figures; List of Tables
    1 Introduction
    2 Sentiment Analytics on Spark
        2.1 Related Work
        2.2 Methodology (2.2.1 Preprocessing the Data; 2.2.2 Naive Bayes Classifier)
        2.3 Experimental Setup (2.3.1 Resilient Distributed Datasets (RDD); 2.3.2 Broadcast Variables; 2.3.3 The Movie Reviews Dataset; 2.3.4 Cluster Configuration; 2.3.5 Model Building)
        2.4 Experimental Evaluation (2.4.1 Apache Hadoop; 2.4.2 Apache Mahout; 2.4.3 Empirical Results: Broadcasting vs. Not Broadcasting, Time Required for Training, Time Required for Testing)
    3 Collaborative Filtering on Spark (3.1 MLBase; 3.2 System Architecture; 3.3 Online Recommendation System)
    4 Topic Modeling on Hadoop (4.1 Introduction; 4.2 Related Work; 4.3 LDA in MapReduce; 4.4 Experimental Results: 4.4.1 The Dataset; 4.4.2 Cluster Configuration; 4.4.3 Test Results)
    5 Conclusions
    Bibliography
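    The most concrete technique named in this abstract and its contents is the use of broadcast variables: shipping the trained Naive Bayes model to each executor once rather than inside every task closure. Below is a minimal sketch of that pattern using Spark's RDD-based MLlib API; the HDFS paths, feature dimension, and whitespace tokenizer are illustrative assumptions, not details from the thesis.

        # A minimal sketch (not the thesis's actual code) of sentiment classification
        # with Naive Bayes on Spark RDDs, using the RDD-based MLlib API of that era.
        # The HDFS paths, feature dimension, and whitespace tokenizer are assumptions.
        from pyspark import SparkContext
        from pyspark.mllib.classification import NaiveBayes
        from pyspark.mllib.feature import HashingTF
        from pyspark.mllib.regression import LabeledPoint

        sc = SparkContext(appName="sentiment-naive-bayes")
        tf = HashingTF(numFeatures=2 ** 18)  # hashed bag-of-words features

        def to_point(line, label):
            # Tokenize on whitespace and hash tokens into a sparse feature vector.
            return LabeledPoint(label, tf.transform(line.lower().split()))

        # One review per line, split by polarity (hypothetical paths).
        pos = sc.textFile("hdfs:///reviews/pos").map(lambda l: to_point(l, 1.0))
        neg = sc.textFile("hdfs:///reviews/neg").map(lambda l: to_point(l, 0.0))
        train, test = pos.union(neg).randomSplit([0.8, 0.2], seed=42)

        model = NaiveBayes.train(train, lambda_=1.0)  # Laplace smoothing

        # Broadcast the trained model once per executor instead of shipping it
        # inside every task closure; this is the trade-off the thesis benchmarks.
        bmodel = sc.broadcast(model)
        correct = (test.map(lambda p: (bmodel.value.predict(p.features), p.label))
                       .filter(lambda pair: pair[0] == pair[1])
                       .count())
        print("test accuracy: %.3f" % (correct / float(test.count())))

    On a small pool like the 8-worker cluster described above, broadcasting keeps the cost of distributing the model constant per executor as the number of partitions grows, which is consistent with the broadcasting-versus-not-broadcasting comparison listed in the contents.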

    A Context Centric Model for building a Knowledge advantage Machine Based on Personal Ontology Patterns

    Throughout the industrial era, societal advancement could be attributed in large part to the introduction of a plethora of electromechanical machines, all of which exploited a key concept known as mechanical advantage. In the post-industrial era, the exploitation of knowledge is emerging as the key enabler of societal advancement. With the advent of the Internet and the Web, there is no dearth of knowledge; what is lacking is an efficient and practical mechanism for organizing knowledge and presenting it in a comprehensible form appropriate for every context. This is the fundamental problem addressed by my dissertation.

    We begin by proposing a novel architecture for creating a Knowledge Advantage Machine (KaM), one which enables a knowledge worker to bring to bear a larger amount of knowledge to solve a problem in a shorter time. This is analogous to an electromechanical machine that enables an industrial worker to bring to bear a large amount of power to perform a task, thus improving worker productivity. This work is based on the premise that while a universal KaM is beyond the realm of possibility, a KaM specific to a particular type of knowledge worker is realizable because of the limited scope of his/her personal ontology used to organize all relevant knowledge objects.

    The proposed architecture is based on a society of intelligent agents which collaboratively discover, mark up, and organize relevant knowledge objects into a semantic knowledge network on a continuing basis. This network is in turn exploited by another agent, known as the Context Agent, which determines the current context of the knowledge worker and makes the relevant portion of the semantic network available in a suitable form. In this dissertation we demonstrate the viability and extensibility of this architecture by building a prototype KaM for one type of knowledge worker, such as a professor.
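    The abstract describes the agent architecture only at a high level. The toy sketch below, with entirely hypothetical class and concept names not taken from the dissertation, illustrates the core Context Agent idea: knowledge objects carry ontology markup, and the agent surfaces the slice of the semantic network that best matches the worker's current context.

        # A toy illustration (hypothetical names throughout, not from the dissertation)
        # of the Context Agent idea: knowledge objects carry ontology markup, and the
        # agent surfaces the slice of the network matching the current context.
        from dataclasses import dataclass, field

        @dataclass
        class KnowledgeObject:
            title: str
            concepts: set  # personal-ontology concepts this object is marked up with

        @dataclass
        class ContextAgent:
            network: list = field(default_factory=list)  # semantic knowledge network

            def relevant(self, context: set, threshold: float = 0.5):
                # Rank objects by overlap between their markup and the context.
                scored = [(len(o.concepts & context) / len(o.concepts), o)
                          for o in self.network]
                scored.sort(key=lambda pair: pair[0], reverse=True)
                return [o for score, o in scored if score >= threshold]

        # A professor's KaM switching into a teaching context:
        agent = ContextAgent(network=[
            KnowledgeObject("CS101 syllabus", {"teaching", "curriculum"}),
            KnowledgeObject("Grant proposal draft", {"research", "funding"}),
            KnowledgeObject("Lecture slides: graph algorithms", {"teaching", "graphs"}),
        ])
        for obj in agent.relevant({"teaching", "graphs"}):
            print(obj.title)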

    BlogForever: D2.5 Weblog Spam Filtering Report and Associated Methodology

    This report is written as a first attempt to define the BlogForever spam detection strategy. It comprises a survey of weblog spam technology and approaches to its detection. While the report was written to help identify possible approaches to spam detection as a component within the BlogForever software, the discussion has been extended to include observations related to the historical, social, and practical value of spam, and proposals for other ways of dealing with spam within the repository without necessarily removing it. The report contains a general overview of spam types, ready-made anti-spam APIs available for weblogs, possible methods that have been suggested for preventing the introduction of spam into a blog, and research related to spam, focusing on spam that appears in the weblog context. It concludes with a proposal for a spam detection workflow that might form the basis of the spam detection component of the BlogForever software.
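    The report surveys candidate techniques rather than prescribing one. A common baseline for weblog comment spam, and one compatible with the report's suggestion of handling spam without necessarily deleting it, is a bag-of-words classifier; the sketch below uses scikit-learn with invented placeholder samples.

        # A baseline sketch, not the report's prescribed method: a bag-of-words
        # spam classifier for blog comments using scikit-learn. The training
        # samples below are invented placeholders.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        comments = [
            "Great post, thanks for sharing your methodology!",
            "Buy cheap meds online no prescription click here",
            "I disagree with section 3, the sample size seems small.",
            "Make $$$ fast!!! visit my site for free gift cards",
        ]
        labels = [0, 1, 0, 1]  # 0 = ham, 1 = spam

        filter_model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                                     MultinomialNB())
        filter_model.fit(comments, labels)

        # Quarantine rather than delete, in line with the report's observation
        # that spam can have historical and social value in an archive.
        new_comment = "Free followers!!! click this link now"
        action = "quarantine" if filter_model.predict([new_comment])[0] else "publish"
        print(action)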

    Sentiment analysis in context: Investigating the use of BERT and other techniques for ChatBot improvement

    In an increasingly digitized world, where large amounts of data are generated daily, the need for efficient analysis has become more and more pressing. Natural Language Processing (NLP) offers a solution by exploiting the power of artificial intelligence to process texts, understand their content, and perform specific tasks. The thesis is based on an internship at Pat Srl, a company devoted to creating solutions that support digital innovation, process automation, and service quality, with the ultimate goal of improving leadership and customer satisfaction. The primary objective of this thesis is to develop a sentiment analysis model in order to improve the customer experience for clients using the ChatBot system created by the company itself. This task has gained significant attention in recent years, as it can be applied to many fields, including social media monitoring, market research, brand monitoring, and customer experience and feedback analysis. Following a careful analysis of the available data, a comprehensive evaluation of various models was conducted. Notably, BERT, a large language model that has produced promising results in several NLP tasks, stood out among them. Different approaches to using BERT models were explored, such as the fine-tuning procedure and the architectural structure. Moreover, some data preprocessing steps were emphasized and studied, owing to the particular nature of the sentiment analysis task.
    During the course of the internship, the dataset underwent revisions aimed at mitigating the problem of inaccurate predictions. Additionally, techniques for data balancing were tested and evaluated, enhancing the overall quality of the analysis. Another important aspect of this project involved the deployment of the model. In a business environment, it is essential to carefully consider and balance resources before transitioning to production. The model was distributed using specific tools, such as Docker and Kubernetes; these technologies played a pivotal role in ensuring efficient and seamless deployment.
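    The abstract names the ingredients of the approach (a pre-trained BERT checkpoint fine-tuned on labeled chat data) without showing them. Below is a minimal fine-tuning sketch with the Hugging Face transformers API; the checkpoint, three-class label scheme, toy examples, and hyperparameters are all assumptions, since the thesis's actual data and configuration are internal to the company.

        # A minimal fine-tuning sketch with the Hugging Face transformers API.
        # The checkpoint, three-class label scheme, toy examples, and
        # hyperparameters are assumptions; the thesis's data is internal to Pat Srl.
        import torch
        from torch.utils.data import Dataset
        from transformers import (AutoModelForSequenceClassification,
                                  AutoTokenizer, Trainer, TrainingArguments)

        checkpoint = "bert-base-uncased"
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForSequenceClassification.from_pretrained(
            checkpoint, num_labels=3)  # negative / neutral / positive

        texts = ["The bot solved my issue instantly",
                 "Useless answers, very frustrating",
                 "It works, I guess"]
        labels = [2, 0, 1]

        class ChatDataset(Dataset):
            # Tokenize once up front; Trainer consumes dicts of tensors.
            def __init__(self, texts, labels):
                self.enc = tokenizer(texts, truncation=True, padding=True,
                                     return_tensors="pt")
                self.labels = labels
            def __len__(self):
                return len(self.labels)
            def __getitem__(self, i):
                item = {k: v[i] for k, v in self.enc.items()}
                item["labels"] = torch.tensor(self.labels[i])
                return item

        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir="sentiment-bert",
                                   num_train_epochs=1,
                                   per_device_train_batch_size=2),
            train_dataset=ChatDataset(texts, labels),
        )
        trainer.train()

    In production, a fine-tuned model like this is typically wrapped in a small inference service, packaged as a Docker image, and scheduled on Kubernetes, which matches the deployment tooling the abstract names.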

    Automatic text filtering using limited supervision learning for epidemic intelligence

    [no abstract]
    • 

    corecore