Applying data mining techniques over big data

Author: Al-Hashemi Idrees Yousef
Publication venue: Boston University
Publication date: 01/01/2013

Thesis (M.S.C.S.)

PLEASE NOTE: Boston University Libraries did not receive an Authorization To Manage form for this thesis or dissertation. It is therefore not openly accessible, though it may be available by request. If you are the author or principal advisor of this work and would like to request open access for it, please contact us at [email protected]. Thank you.

The rapid development of information technology in recent decades means that data appear in a wide variety of formats: sensor data, tweets, photographs, raw data, and unstructured data. Statistics show that there were 800,000 Petabytes stored in the world in 2000. Today's internet has about 0.1 Zettabytes of data (1 ZB is about 10^21 bytes), and this number will reach 35 ZB by 2020. With such an overwhelming flood of information, present data management systems are not able to scale to this huge amount of raw, unstructured data (in today's parlance, Big Data). In the present study, we show the basic concepts and design of Big Data tools, algorithms, and techniques. We compare the classical data mining algorithms to the Big Data algorithms by using Hadoop/MapReduce as a core implementation of Big Data for scalable algorithms. We implemented the K-means algorithm and the A-priori algorithm with Hadoop/MapReduce on a 5-node Hadoop cluster. We explore NoSQL databases for semi-structured, massively large-scale data, using MongoDB as an example. Finally, we compare the performance of HDFS (Hadoop Distributed File System) and MongoDB as data storage for these two algorithms.
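The abstract mentions implementing K-means on Hadoop/MapReduce but gives no details. As a rough illustration only, the sketch below shows how one K-means iteration is commonly split into a map step (assign each point to its nearest centroid) and a reduce step (average the points per centroid). It is plain single-process Python, not Hadoop code and not the thesis's implementation; all function names and the sample points are hypothetical.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit a (centroid index, point) pair for each input point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Shuffle and reduce: group points by key, then average each group."""
    groups = defaultdict(list)
    for key, point in pairs:
        groups[key].append(point)
    # A centroid that received no points keeps its previous position.
    new_centroids = list(centroids)
    for key, members in groups.items():
        new_centroids[key] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return new_centroids


def kmeans(points, centroids, iterations=10):
    """Repeat map and reduce until the centroids stop moving."""
    for _ in range(iterations):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical sample data: two obvious clusters in 2-D.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
result = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(result)
```

On a real cluster, each iteration would typically run as a separate MapReduce job, with the current centroids distributed to every mapper; that orchestration is omitted here.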

Boston University Institutional Repository (OpenBU)

Applying Data Mining Techniques Over Big Data

Authors: Al-Hashemi Idrees, Kalathur Suresh
Publication date: 09/08/2013

With the rapid development of information technology, data flows in a wide variety of formats: sensor data, tweets, photos, raw data, and unstructured data. Statistics show that there were 800,000 Petabytes stored in the world in 2000. Today the Internet holds about 1.8 Zettabytes (1 Zettabyte is 10^21 bytes), and this number will reach 35 Zettabytes by 2020. As a result, data management systems are not able to scale to this huge amount of raw, unstructured data, which is what is called big data today. In the present study, we show the basic concepts and design of big data tools, algorithms, and techniques. We compare the classical data mining algorithms with big data algorithms, using Hadoop/MapReduce as the core implementation of big data for scalable algorithms. We implemented the K-means and A-priori algorithms using Hadoop/MapReduce on a 5-node Hadoop cluster. We also show their performance on gigabytes of data. Finally, we explore NoSQL (Not Only SQL) databases for semi-structured, massively large-scale data, using MongoDB as an example. Then, we compare the performance of HDFS (Hadoop Distributed File System) and MongoDB data stores for these two algorithms.

This research work is part of a full scholarship fund for a Master's degree through the Ministry of Higher Education and Scientific Research (MOHESR), Republic of Iraq (Fund 17004).

Boston University Institutional Repository (OpenBU)