
    A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures

    Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, co-placement and scheduling of data with compute resources, and storing and transferring large volumes of data. We analyze the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-performance computing and the Apache-Hadoop paradigm. We propose a common basis, terminology, and set of functional factors with which to analyze the two paradigms. We discuss the concept of "Big Data Ogres" and their facets as a means of understanding and characterizing the most common application workloads found across the two paradigms. We then discuss the salient features of the two paradigms, and compare and contrast the two approaches. Specifically, we examine common implementations of these paradigms, shed light upon the reasons for their current architectures, and discuss some typical workloads that utilize them. In spite of the significant software distinctions, we believe there is architectural similarity. We discuss the potential integration of different implementations, across the different levels and components. Our comparison progresses from a fully qualitative examination of the two paradigms to a semi-quantitative methodology: we take a simple and broadly used Ogre (K-means clustering) and characterize its performance on a range of representative platforms, covering several implementations from both paradigms. Our experiments provide insight into the relative strengths of the two paradigms. We propose that the set of Ogres will serve as a benchmark to evaluate the two paradigms along different dimensions. (8 pages, 2 figures)
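
    As a concrete illustration of the K-means Ogre used as the benchmark, here is a minimal PySpark sketch; the dataset path and the parameters are illustrative assumptions, not those of the paper.

        # Minimal K-means "Ogre" on Spark; the path, k, and maxIter are illustrative.
        from pyspark.sql import SparkSession
        from pyspark.ml.clustering import KMeans
        from pyspark.ml.feature import VectorAssembler

        spark = SparkSession.builder.appName("kmeans-ogre").getOrCreate()
        df = spark.read.csv("hdfs:///data/points.csv", inferSchema=True)  # hypothetical path
        points = VectorAssembler(inputCols=df.columns, outputCol="features").transform(df)

        model = KMeans(k=10, maxIter=20, featuresCol="features").fit(points)
        print(model.clusterCenters())  # one centroid per cluster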

    A Scalable Machine Learning Online Service for Big Data Real-Time Analysis

    Proceedings of: IEEE Symposium Series on Computational Intelligence (SSCI 2014), Orlando, FL, USA, December 09-12, 2014. This work describes a proposal for developing and testing a scalable machine learning architecture able to provide real-time predictions or analytics as a service over domain-independent big data, working on top of the Hadoop ecosystem and exposing real-time analytics through a RESTful API. Systems implementing this architecture could provide companies with on-demand tools facilitating the tasks of storing, analyzing, understanding, and reacting to their data, either in batch or stream fashion; they could become a valuable asset for improving business performance and a key market differentiator in this fast-paced environment. In order to validate the proposed architecture, two systems are developed, each providing classical machine-learning services in a different domain: the first involves a recommender system for web advertising, while the second is a prediction system which learns from gamers' behavior and tries to predict future events such as purchases or churning. An evaluation is carried out on these systems, and the results show that both services are able to provide fast responses even when a number of concurrent requests are made; in the particular case of the second system, the results clearly show that the computed predictions significantly outperform random guessing. This research work is part of the Memento Data Analysis project, co-funded by the Spanish Ministry of Industry, Energy and Tourism with identifier TSI-020601-2012-99.
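
    A minimal sketch of the "predictions as a service" idea, with a REST endpoint wrapping a pre-trained model; Flask, the model file, and the route are illustrative assumptions, not the paper's actual Hadoop-based stack.

        # Illustrative RESTful prediction endpoint (not the paper's implementation).
        from flask import Flask, request, jsonify
        import pickle

        app = Flask(__name__)
        with open("model.pkl", "rb") as f:  # hypothetical pre-trained model
            model = pickle.load(f)

        @app.route("/predict", methods=["POST"])
        def predict():
            features = request.get_json()["features"]
            return jsonify({"prediction": model.predict([features]).tolist()})

        if __name__ == "__main__":
            app.run(threaded=True)  # serve concurrent requests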

    Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark

    Collaborative filtering based recommender systems use information about a user's preferences to make personalized predictions about content, such as topics, people, or products, that they might find relevant. As the volume of accessible information and the number of active users on the Internet continue to grow, it becomes increasingly difficult to compute recommendations quickly and accurately over a large dataset. In this study, we introduce an algorithmic framework built on top of Apache Spark for parallel computation of the neighborhood-based collaborative filtering problem, which allows the algorithm to scale linearly with a growing number of users. We also investigate several variants of this technique, including user- and item-based recommendation approaches, correlation- and vector-based similarity calculations, and selective down-sampling of user interactions. Finally, we provide an experimental comparison of these techniques on the MovieLens dataset consisting of 10 million movie ratings.
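
    The following is a hedged sketch of one variant such a framework might implement: user-user cosine similarity over a (user, item, rating) RDD, pairing users through co-rated items. For brevity, the norms are accumulated over co-rated items only, a common simplification of full cosine similarity, and the input is assumed to be the "::"-separated MovieLens 10M layout.

        # User-user similarity on Spark RDDs (illustrative sketch).
        from math import sqrt
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("neighborhood-cf").getOrCreate()
        ratings = spark.sparkContext.textFile("ml-10M/ratings.dat") \
            .map(lambda line: line.split("::")) \
            .map(lambda p: (int(p[1]), (int(p[0]), float(p[2]))))  # item -> (user, rating)

        # Join the ratings with themselves on the item key to pair co-rating users.
        pairs = ratings.join(ratings) \
            .filter(lambda kv: kv[1][0][0] < kv[1][1][0]) \
            .map(lambda kv: ((kv[1][0][0], kv[1][1][0]),
                             (kv[1][0][1] * kv[1][1][1],
                              kv[1][0][1] ** 2, kv[1][1][1] ** 2)))

        # Sum dot products and squared norms per user pair, then normalize.
        sims = pairs.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2])) \
            .mapValues(lambda s: s[0] / (sqrt(s[1]) * sqrt(s[2])))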

    In-memory, distributed content-based recommender system

    Burdened by their popularity, recommender systems increasingly take on larger datasets while they are expected to deliver high-quality results within reasonable time. To meet these ever-growing requirements, industrial recommender systems often turn to parallel hardware and distributed computing. While the MapReduce paradigm is generally accepted for massively parallel data processing, it often entails complex algorithm reorganization and suboptimal efficiency, because mid-computation values are typically read from and written to hard disk. This work implements an in-memory, content-based recommendation algorithm and shows how it can be parallelized and efficiently distributed across many homogeneous machines in a distributed-memory environment. By focusing on data parallelism and carefully constructing the definition of work in the context of recommender systems, we are able to partition the complete calculation process into any number of independent and equally sized jobs. An empirically validated performance model is developed to predict parallel speedup and promises high efficiencies for realistic hardware configurations. For the MovieLens 10M dataset we note efficiency values up to 71% for a configuration of 200 computing nodes (eight cores per node).
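
    To make the quoted efficiency figure concrete, a short back-of-the-envelope calculation using the standard definitions (speedup S(p) = T(1)/T(p), efficiency E(p) = S(p)/p):

        # Implied speedup for the reported configuration and efficiency.
        cores = 200 * 8            # 200 nodes, eight cores per node
        efficiency = 0.71          # value reported for MovieLens 10M
        speedup = efficiency * cores
        print(speedup)             # ~1136x over a single core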

    Personalized large-scale classification of public tenders on Hadoop

    This project was completed as part of an innovation partnership between Fujitsu Canada and Université Laval. The needs and objectives of the project were centered on a business problem defined jointly with Fujitsu. Our project aimed to classify a corpus of electronic public tenders using state-of-the-art Hadoop big data technology. The objective was to identify with high recall the public tenders relevant to the IT services business of Fujitsu Canada. A small-scale prototype based on the BNS algorithm (Bi-Normal Separation) was empirically shown to classify the public tender corpus with high recall (93%). The prototype was then re-implemented on a full-scale Hadoop cluster, using Apache Pig for the data preparation pipeline and Apache Mahout for classification. Our experiments show that the large-scale system not only maintains high recall (91%) on the classification task, but can readily take advantage of the massive scalability gains made possible by Hadoop's distributed architecture.
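
    For reference, a minimal sketch of the Bi-Normal Separation feature score (Forman, 2003), which rates a term by the separation of its true-positive and false-positive rates under the inverse normal CDF; the clamping constant and the example counts below are illustrative.

        # BNS(term) = |F^-1(tpr) - F^-1(fpr)|, F^-1 = inverse standard normal CDF.
        from scipy.stats import norm

        def bns(tp, fp, pos, neg, eps=0.0005):
            tpr = min(max(tp / pos, eps), 1 - eps)  # clamp away from 0 and 1
            fpr = min(max(fp / neg, eps), 1 - eps)
            return abs(norm.ppf(tpr) - norm.ppf(fpr))

        # e.g. a term in 300 of 1000 relevant tenders and 20 of 5000 irrelevant ones:
        print(bns(300, 20, 1000, 5000))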

    A Recommendation Engine Using Apache Spark

    The volume of structured and unstructured data has grown at exponential scale in recent years. As a result of this rapid growth, we are inundated with a plethora of choices in any product or service; it is easy to get lost in this maze of options and find it hard to make decisions. This project addresses the problem through entity recommendation. It concentrates on two main aspects: presenting more accurate entity recommendations to the user, and dealing with vast amounts of data. The project aims to present recommendation results for a user's query with efficiency and accuracy. It uses the ListNet ranking algorithm to rank the recommendation results: query-independent and query-dependent features are combined into ranking scores, and these scores decide the order in which the recommendation results are presented to the user. The project uses Apache Spark, a distributed big data processing framework; Spark handles iterative and interactive algorithms efficiently and with minimal processing time compared to the traditional MapReduce paradigm. We performed the experiments for the recommendation engine using DBpedia as the dataset and tested the results for the movie domain. We used both query-independent (PageRank) and query-dependent (click logs) features for ranking. We observed that the ListNet algorithm performs well on Apache Spark, as RDDs provide a fast way for iterative algorithms to execute, and that the results of the recommendation engine are accurate and the entities are well ranked.
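
    A hedged sketch of the ListNet top-one loss referred to above: the cross entropy between permutation probabilities derived from relevance labels and from model scores (the example scores and labels are illustrative).

        # ListNet top-one probability loss (illustrative NumPy sketch).
        import numpy as np

        def softmax(x):
            e = np.exp(x - x.max())
            return e / e.sum()

        def listnet_loss(scores, labels):
            # scores: model outputs for one query's candidates; labels: relevance
            return -np.sum(softmax(labels) * np.log(softmax(scores)))

        print(listnet_loss(np.array([2.0, 0.5, 0.1]), np.array([1.0, 0.0, 0.0])))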

    mARC: Memory by Association and Reinforcement of Contexts

    This paper introduces memory by Association and Reinforcement of Contexts (mARC). mARC is a novel data modeling technology rooted in the second-quantization formulation of quantum mechanics. It is an all-purpose, incremental, and unsupervised data storage and retrieval system which can be applied to all types of signal or data, structured or unstructured, textual or not. mARC can be applied to a wide range of information classification and retrieval problems such as e-Discovery or contextual navigation. It can also be formulated in the artificial-life framework, a.k.a. Conway's "Game of Life"; in contrast to Conway's approach, the objects evolve in a massively multidimensional space. In order to start evaluating the potential of mARC, we have built a mARC-based Internet search engine demonstrator with contextual functionality. We compare the behavior of the mARC demonstrator with Google search, both in terms of performance and relevance. In the study we find that the mARC search engine demonstrator outperforms Google search by an order of magnitude in response time while providing more relevant results for some classes of queries.

    Improving Online Education Using Big Data Technologies

    In a world in full digital transformation, where new information and communication technologies are constantly evolving, the current challenge of Computing Environments for Human Learning (CEHL) is to find the right way to integrate and harness the power of these technologies. These environments face many challenges: the increased demand for learning, the huge growth in the number of learners, the heterogeneity of available resources, and the problems related to intensive processing and real-time analysis of the data produced by e-learning systems, which go beyond the limits of traditional infrastructures and relational database management systems. This chapter presents a number of solutions dedicated to CEHL around two major paradigms, namely cloud computing and Big Data. The first part of this work presents an approach to integrate both the emerging technologies of the Big Data ecosystem and the on-demand services of the cloud in the e-learning field. It aims to enrich and enhance the quality of e-learning platforms by relying on cloud services accessible via the Internet, and introduces distributed storage and parallel computing of Big Data in order to meet the requirements of intensive processing, predictive analysis, and massive storage of learning data. To do this, a methodology describing the integration process is presented and applied. The chapter also addresses the deployment of a distributed e-learning architecture combining several recent Big Data tools, based on a strategy of data decentralization and parallelization of processing across a cluster of nodes. Finally, the chapter develops a Big Data solution for online learning platforms based on the LMS Moodle. A course recommendation system has been designed and implemented, relying on machine learning techniques, to help learners select the most relevant learning resources according to their interests through the analysis of learning traces. The system is realized using learning data collected from the ESTenLigne platform and the Spark framework deployed on a Hadoop infrastructure.
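
    As a hedged sketch of such a Spark-based course recommender over learning traces, the snippet below uses ALS collaborative filtering; the chapter does not name its exact algorithm, and the HDFS path and column names are hypothetical.

        # Illustrative ALS course recommender on Spark (assumed, not the chapter's code).
        from pyspark.sql import SparkSession
        from pyspark.ml.recommendation import ALS

        spark = SparkSession.builder.appName("moodle-recs").getOrCreate()
        traces = spark.read.parquet("hdfs:///estenligne/traces")  # learner, course, score

        model = ALS(userCol="learner", itemCol="course", ratingCol="score",
                    coldStartStrategy="drop").fit(traces)
        model.recommendForAllUsers(5).show()  # top-5 course suggestions per learner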