149 research outputs found

    Increasing Efficiency of Recommendation System using Big Data Analysis

    In today's digital space, many Internet users respond to a problem by suggesting solutions that already exist on the Internet. This lowers the originality of posts, a problem that can be addressed by applying prediction models to the data sets. It is important for a user to contribute original ideas in order to gain upvotes, which in turn reflect the quality of a post. Because of the constant influx of data, big data analytics becomes essential, and an open-source framework such as Hadoop is therefore needed to increase the effectiveness of a recommender system built on these prediction models.
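    As a rough illustration of the kind of feature extraction such a prediction model might rely on, the following Java MapReduce sketch sums upvotes per post from a hypothetical CSV of (post_id, upvotes) records. The input layout, paths, and class names are assumptions for illustration, not taken from the article.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical feature-extraction job: total upvotes per post, which a
// downstream prediction model could use as a quality signal.
public class UpvoteAggregator {

    public static class UpvoteMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Assumed input line: post_id,upvotes
            String[] fields = value.toString().split(",");
            if (fields.length == 2) {
                ctx.write(new Text(fields[0]),
                          new IntWritable(Integer.parseInt(fields[1].trim())));
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text postId, Iterable<IntWritable> votes, Context ctx)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : votes) total += v.get();
            ctx.write(postId, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "upvote-aggregation");
        job.setJarByClass(UpvoteAggregator.class);
        job.setMapperClass(UpvoteMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```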

    Elementary Concepts of Big Data and Hadoop

    This paper presents the basic concepts of Big Data and its importance to an organization from a performance point of view. The term Big Data refers to data sets whose volume, complexity, and rate of growth make them difficult to capture, manage, process, and analyze. For such data-intensive applications, the Apache Hadoop framework has recently attracted a lot of attention. Hadoop is a core platform for structuring Big Data and for making it useful for analytics. It is an open-source software project that enables the distributed processing of enormous data sets and provides a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. This paper describes the architecture of Hadoop and its various components.
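    To make the storage component of that architecture concrete, here is a minimal Java sketch using the HDFS FileSystem API. The NameNode address and paths are hypothetical, and the sketch assumes a reachable cluster; it is only an illustration of the component described in the paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS client sketch: the NameNode tracks file metadata, while
// DataNodes hold the replicated blocks that MapReduce jobs later read.
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into the distributed file system.
        fs.copyFromLocalFile(new Path("/tmp/input.csv"),
                             new Path("/data/input.csv"));

        // List the directory and show block size / replication per file.
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.printf("%s  blockSize=%d  replication=%d%n",
                    status.getPath(), status.getBlockSize(),
                    status.getReplication());
        }
        fs.close();
    }
}
```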

    Hadoop-BAM: directly manipulating next generation sequencing data in the cloud

    Summary: Hadoop-BAM is a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. It acts as an integration layer between analysis applications and BAM files that are processed using Hadoop. Hadoop-BAM solves the issues related to BAM data access by presenting a convenient API for implementing map and reduce functions that can directly operate on BAM records. It builds on top of the Picard SAM JDK, so tools that rely on the Picard API are expected to be easily convertible to support large-scale distributed processing. In this article we demonstrate the use of Hadoop-BAM by building a coverage-summarizing tool for the Chipster genome browser. Our results show that Hadoop offers good scalability and that one should avoid moving data in and out of Hadoop between analysis steps.
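    As a hedged sketch of what a map function operating directly on BAM records might look like, the following Java job counts aligned reads per reference sequence. It assumes the BAMInputFormat and SAMRecordWritable classes as exposed in later (org.seqdoop) Hadoop-BAM releases, so package names may differ from the version described in the article, and it is not the coverage tool the authors built.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;
import org.seqdoop.hadoop_bam.BAMInputFormat;      // assumed package name
import org.seqdoop.hadoop_bam.SAMRecordWritable;   // wraps an htsjdk/Picard SAMRecord

// Counts aligned reads per reference sequence directly from BAM input splits.
public class AlignedReadCounter {

    public static class ReadMapper
            extends Mapper<LongWritable, SAMRecordWritable, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable key, SAMRecordWritable value, Context ctx)
                throws IOException, InterruptedException {
            htsjdk.samtools.SAMRecord record = value.get();
            if (!record.getReadUnmappedFlag()) {
                ctx.write(new Text(record.getReferenceName()), ONE);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "aligned-read-count");
        job.setJarByClass(AlignedReadCounter.class);
        job.setInputFormatClass(BAMInputFormat.class);
        job.setMapperClass(ReadMapper.class);
        job.setReducerClass(LongSumReducer.class);   // sums the per-reference counts
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```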

    A High-Performance Data Accessing and Processing System for Campus Real-time Power Usage

    With the flourishing of Internet of Things (IoT) technology, ubiquitous power data can be linked to the Internet and analyzed to meet real-time monitoring requirements. The accumulated power data can grow to the terabyte level over time. To realize a real-time power-monitoring platform on such data, an efficient and novel implementation technique has been developed and forms the core of this thesis. Based on the integration of multiple software subsystems in a layered manner, the proposed power-monitoring platform is composed of, from bottom to top, Ubuntu (operating system), Hadoop (storage subsystem), Hive (data warehouse), and Spark MLlib (data analytics). The power data are provided by smart meters installed in the factories of an enterprise. Data collection and storage are handled by the Hadoop subsystem, and data ingestion into the Hive data warehouse is performed by Spark. For system verification, the HiveQL and Impala SQL modules were tested for query-response efficiency under single-record queries, and the same modules were also evaluated on full-table queries. The main contributions of this work are the details of building an efficient real-time power-monitoring platform and the resulting query-response measurements for reference.
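    A minimal sketch of that ingestion path, assuming the Java Spark API with Hive support enabled; the table name, column names, and CSV layout are hypothetical and not taken from the thesis.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

// Ingest smart-meter readings into a Hive table and run a single-record query.
public class PowerIngest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("power-ingest")
                .enableHiveSupport()          // lets Spark write to the Hive warehouse
                .getOrCreate();

        // Hypothetical CSV layout: meter_id, timestamp, kwh
        Dataset<Row> readings = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("hdfs:///data/smart_meters/*.csv");

        // Append the batch into the Hive data warehouse.
        readings.write().mode(SaveMode.Append).saveAsTable("power_usage");

        // Example single-record query of the kind used in the response-time tests.
        spark.sql("SELECT kwh FROM power_usage "
                + "WHERE meter_id = 'M-0001' AND timestamp = '2021-01-01 00:00:00'")
             .show();

        spark.stop();
    }
}
```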

    Performansi Response Time Query Pada Hadoop-Hive Menggunakan Metode Partition

    Hive replaces traditional RDBMS processing techniques that cannot be used on big data. However, in its default configuration Hive scans the data exhaustively when executing a query. The partition method can group the data, so experiments were carried out to determine whether grouping the data improves query response-time performance or not. In this study, a multi-node Hadoop cluster infrastructure was built using virtual machines. The dataset used is the MovieLens dataset with attribute cardinalities of 5, 50, and 100; each dataset consists of 15 million records. Based on the results, the partition method not only groups the data but also yields query response times that are 30.8% faster than the default configuration. In addition, the partition method with cardinality 100 performs better than the two smaller cardinalities, 5 and 50.
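    A minimal JDBC sketch of the partition method under test, assuming a reachable HiveServer2 instance; the host, table, and column names are hypothetical, and the partition column merely stands in for the MovieLens attribute whose cardinality is varied in the experiments.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Creates a partitioned Hive table and queries a single partition, which
// limits the scan to that partition's directory instead of the full table.
public class HivePartitionDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "", "");
             Statement stmt = conn.createStatement()) {

            stmt.execute("CREATE TABLE IF NOT EXISTS ratings_part ("
                       + " user_id INT, movie_id INT, rating DOUBLE)"
                       + " PARTITIONED BY (genre STRING)");

            // Populate one partition from an unpartitioned staging table (assumed to exist).
            stmt.execute("INSERT OVERWRITE TABLE ratings_part PARTITION (genre='Comedy')"
                       + " SELECT user_id, movie_id, rating FROM ratings_raw"
                       + " WHERE genre = 'Comedy'");

            // Query that touches only the Comedy partition.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT AVG(rating) FROM ratings_part WHERE genre = 'Comedy'")) {
                while (rs.next()) {
                    System.out.println("avg rating = " + rs.getDouble(1));
                }
            }
        }
    }
}
```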

    Query Recommender System Using Hierarchical Classification

    In data warehouses, large amounts of data are gathered, navigated, and explored for analytical purposes. Even for expert users, handling such large data is a tough task, and it is more difficult still for non-expert users or users who are not familiar with the database schema. The aim of this paper is to help this class of users by recommending SQL queries that they might use. The recommendations are selected by tracking a user's past behavior and comparing it with that of other users. First, users may not know where to start their exploration; second, they may overlook queries that would retrieve important information. The recorded queries are compared using hierarchical classification and then re-ranked according to relevance, which is derived from the user's querying behavior. Users issue a series of SQL queries through a query interface in order to analyze the data and mine it for interesting information. DOI: 10.17762/ijritcc2321-8169.15067
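    The paper's hierarchical classification is not reproduced here, but the following Java sketch illustrates the general idea of scoring candidate queries from other users' logs against the current user's session, using a simple Jaccard similarity over query fragments (tables and columns). The class names, the fragment extraction, and the choice of similarity measure are assumptions for illustration only.

```java
import java.util.*;
import java.util.stream.Collectors;

// Ranks candidate SQL queries by how much their fragments (tables, columns)
// overlap with the fragments seen in the current user's session.
public class QueryRecommender {

    // Crude fragment extraction: lower-case identifiers, ignoring common SQL keywords.
    private static final Set<String> KEYWORDS = Set.of(
            "select", "from", "where", "group", "by", "order", "and", "or", "join", "on");

    static Set<String> fragments(String sql) {
        return Arrays.stream(sql.toLowerCase().split("[^a-z0-9_]+"))
                .filter(t -> !t.isEmpty() && !KEYWORDS.contains(t))
                .collect(Collectors.toSet());
    }

    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Returns candidate queries ordered by similarity to the session, most relevant first.
    static List<String> recommend(List<String> session, List<String> candidates) {
        Set<String> profile = new HashSet<>();
        session.forEach(q -> profile.addAll(fragments(q)));
        return candidates.stream()
                .sorted(Comparator.comparingDouble(
                        (String q) -> jaccard(profile, fragments(q))).reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> session = List.of("SELECT region, SUM(sales) FROM orders GROUP BY region");
        List<String> candidates = List.of(
                "SELECT region, AVG(sales) FROM orders GROUP BY region",
                "SELECT name FROM employees");
        recommend(session, candidates).forEach(System.out::println);
    }
}
```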