    Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduce Framework

    While many existing formal concept analysis algorithms are efficient, they are typically unsuitable for distributed implementation. Taking the MapReduce (MR) framework as our inspiration, we introduce a distributed approach for performing formal concept mining. The novelty of our method is that it uses a light-weight MapReduce runtime called Twister, which is better suited to iterative algorithms than recent distributed approaches. First, we describe the theoretical foundations underpinning our distributed formal concept analysis approach. Second, we provide a representative exemplar of how a classic centralized algorithm can be implemented in a distributed fashion using our methodology: we modify Ganter's classic algorithm by introducing a family of MR* algorithms, namely MRGanter and MRGanter+, where the prefix denotes the algorithm's lineage. To evaluate the factors that impact distributed algorithm performance, we compare our MR* algorithms with the state of the art. Experiments conducted on real datasets demonstrate that MRGanter+ is efficient, scalable and an appealing algorithm for distributed problems. Comment: 17 pages, ICFCA 201, Formal Concept Analysis 201
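
For orientation, the following is a minimal sketch of the classic centralized algorithm the abstract refers to (Ganter's algorithm, often called NextClosure), not of MRGanter or MRGanter+. The data layout (a context as a list of attribute-index sets) and the names `closure`, `next_closure` and `all_intents` are assumptions made for illustration.

```python
from typing import FrozenSet, List, Optional, Set

# Assumed layout: one set of attribute indices per object.
Context = List[Set[int]]


def closure(attrs: Set[int], context: Context, n_attrs: int) -> Set[int]:
    """Attributes shared by every object that has all of `attrs`."""
    result = set(range(n_attrs))  # stays the full set if no object matches
    for obj_attrs in context:
        if attrs <= obj_attrs:
            result &= obj_attrs
    return result


def next_closure(current: Set[int], context: Context,
                 n_attrs: int) -> Optional[Set[int]]:
    """Return the lectically next closed set after `current`, or None."""
    for i in reversed(range(n_attrs)):
        if i in current:
            continue
        seed = {a for a in current if a < i} | {i}
        candidate = closure(seed, context, n_attrs)
        # Accept only if the closure added no attribute smaller than i.
        if all(a >= i for a in candidate - current):
            return candidate
    return None


def all_intents(context: Context, n_attrs: int) -> List[FrozenSet[int]]:
    """Enumerate every concept intent in lectic order."""
    intents: List[FrozenSet[int]] = []
    current: Optional[Set[int]] = closure(set(), context, n_attrs)
    while current is not None:
        intents.append(frozenset(current))
        current = next_closure(current, context, n_attrs)
    return intents


# Toy context: object 0 has attributes {0, 1}, object 1 has {1, 2}.
toy_context: Context = [{0, 1}, {1, 2}]
toy_intents = all_intents(toy_context, 3)
```

Each call to `next_closure` depends on the previous result, which is one plausible reason an iteration-friendly runtime matters when the computation is distributed.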

    A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures

    Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, co-placement and scheduling of data with compute resources, and storing and transferring large volumes of data. We analyze the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-performance computing and the Apache-Hadoop paradigm. We propose a basis, common terminology and functional factors upon which to analyze the two approaches of both paradigms. We discuss the concept of "Big Data Ogres" and their facets as a means of understanding and characterizing the most common application workloads found across the two paradigms. We then discuss the salient features of the two paradigms, and compare and contrast the two approaches. Specifically, we examine common implementations and approaches of these paradigms, shed light upon the reasons for their current "architecture" and discuss some typical workloads that utilize them. In spite of the significant software distinctions, we believe there is architectural similarity. We discuss the potential integration of different implementations, across the different levels and components. Our comparison progresses from a fully qualitative examination of the two paradigms to a semi-quantitative methodology. We take a simple and broadly used Ogre (K-means clustering) and characterize its performance on a range of representative platforms, covering several implementations from both paradigms. Our experiments provide an insight into the relative strengths of the two paradigms. We propose that the set of Ogres will serve as a benchmark to evaluate the two paradigms along different dimensions. Comment: 8 pages, 2 figure
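
As an illustration of the K-means Ogre mentioned above, here is a minimal single-process sketch that structures each iteration as a map step (assign each point to its nearest centroid) followed by a reduce step (average the points per centroid). It is an assumption-laden toy, not one of the implementations benchmarked in the paper; the function names and the 2-D point type are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Point = Tuple[float, float]


def nearest(point: Point, centroids: List[Point]) -> int:
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: (point[0] - centroids[k][0]) ** 2
        + (point[1] - centroids[k][1]) ** 2,
    )


def kmeans_map(points: Iterable[Point],
               centroids: List[Point]) -> Iterator[Tuple[int, Point]]:
    """Map step: emit (centroid index, point) pairs."""
    for p in points:
        yield nearest(p, centroids), p


def kmeans_reduce(pairs: Iterable[Tuple[int, Point]],
                  centroids: List[Point]) -> List[Point]:
    """Reduce step: replace each centroid by the mean of its assigned points."""
    groups: Dict[int, List[Point]] = defaultdict(list)
    for k, p in pairs:
        groups[k].append(p)
    updated = list(centroids)  # centroids with no points keep their position
    for k, members in groups.items():
        updated[k] = (
            sum(x for x, _ in members) / len(members),
            sum(y for _, y in members) / len(members),
        )
    return updated


def kmeans(points: List[Point], centroids: List[Point],
           iterations: int = 10) -> List[Point]:
    """Run a fixed number of map/reduce rounds (no convergence test)."""
    for _ in range(iterations):
        centroids = kmeans_reduce(kmeans_map(points, centroids), centroids)
    return centroids


sample_points: List[Point] = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
final_centroids = kmeans(sample_points, [(0.0, 0.0), (10.0, 0.0)])
```

Because every round feeds the reduce output back into the next map step, K-means exercises the iterative behaviour on which such platforms can be expected to differ.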

    An ontology enhanced parallel SVM for scalable spam filter training

    This is the post-print version of the final paper published in Neurocomputing. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Copyright © 2013 Elsevier B.V. Spam, under a variety of shapes and forms, continues to inflict increased damage. Varying approaches including Support Vector Machine (SVM) techniques have been proposed for spam filter training and classification. However, SVM training is a computationally intensive process. This paper presents a MapReduce based parallel SVM algorithm for scalable spam filter training. By distributing, processing and optimizing the subsets of the training data across multiple participating computer nodes, the parallel SVM reduces the training time significantly. Ontology semantics are employed to minimize the impact of accuracy degradation when distributing the training data among a number of SVM classifiers. Experimental results show that ontology based augmentation improves the accuracy level of the parallel SVM beyond the original sequential counterpart.

    Enumerating Maximal Bicliques from a Large Graph using MapReduce

    We consider the enumeration of maximal bipartite cliques (bicliques) from a large graph, a task central to many practical data mining problems in social network analysis and bioinformatics. We present novel parallel algorithms for the MapReduce platform, and an experimental evaluation using Hadoop MapReduce. Our algorithm is based on clustering the input graph into smaller subgraphs, followed by processing different subgraphs in parallel. Our algorithm uses two ideas that enable it to scale to large graphs: (1) the redundancy in work between different subgraph explorations is minimized through a careful pruning of the search space, and (2) the load on different reducers is balanced through the use of an appropriate total order among the vertices. Our evaluation shows that the algorithm scales to large graphs with millions of edges and tens of millions of maximal bicliques. To our knowledge, this is the first work on maximal biclique enumeration for graphs of this scale. Comment: A preliminary version of the paper was accepted at the Proceedings of the 3rd IEEE International Congress on Big Data 201

    Large-Scale Image Processing Using MapReduce

    Given today's technological development and the ever wider spread of cheap cameras, it is increasingly clear that images make up part of the ever-growing volume of data generated by people. Knowing that this data will probably also have to be processed, and that the power of single computers is in places already insufficient for the larger tasks, people have begun to explore the possibilities offered by various distributed computing models. One of these is MapReduce, whose basic idea is to bring computations into a general form, leaving the programmer to define only what happens to the data during the four phases of the computation: Input, Map, Reduce, and Output. Because high-quality free software implementations of this model exist, and the infrastructure needed for larger computations can be rented with little effort and at low cost, this approach to image processing has become accessible to almost anyone. The aim of this master's thesis is to investigate the usability of the MapReduce model for large-scale image processing. To that end I consider separately the case of a large data set consisting of regular images and the case of a single large image. I also divide all image processing algorithms into four classes: local, iterative local, non-local, and iterative non-local. Using these divisions, I describe in general terms the main problems and obstacles that may hinder the distributed application of a given type of algorithm to a given type of image data, and propose possible solutions. In the practical part of the thesis I describe the use of the MapReduce model with the Apache Hadoop framework on two different data sets, the first a 265 GiB image collection and the second a 6.99 gigapixel microscope photograph. In the first example the task is to extract metadata from the image collection using object and text recognition. For the second data set the task is to process the image with one particular non-iterative local algorithm. Although both applications were built only for experimentation, both show that converting existing image processing algorithms into MapReduce programs is fairly simple and does not entail large performance losses. In conclusion, I argue that for data sets of ordinary-sized images the MapReduce model is a simple way to speed up computations by distributing them, but that for large images this mostly holds only for non-iterative local algorithms.
Due to the increasing popularity of cheap digital photography equipment, personal computing devices with easy-to-use cameras, and an overall improvement of image capture technology with regard to quality, the amount of data generated by people each day is growing faster than the processing capabilities of single devices. For other tasks related to large-scale data, people have already turned to distributed computing as a way to side-step impending physical limitations of processing hardware by combining the resources of many computers and providing programmers with various interfaces to the resulting construct, relieving them from having to account for the intricacies stemming from its physical structure. An example of this is the MapReduce model, which, by casting all calculations as a chain of Input-Map-Reduce-Output operations capable of working independently, allows for easy application of distributed computing to many trivially parallelisable processes. With the aid of freely available implementations of this model and cheap computing infrastructure offered by cloud providers, access to expensive purpose-built hardware and an in-depth understanding of parallel programming are no longer required of anyone who wishes to work with large-scale image data.
In this thesis, I look at the issues of processing two kinds of such data - large data-sets of regular images and single large images - using MapReduce. By further classifying image processing algorithms as iterative/non-iterative and local/non-local, I present a general analysis of why different combinations of algorithms and data might be easier or harder to adapt for distributed processing with MapReduce. Finally, I describe the application of distributed image processing to two example cases: a 265 GiB data-set of photographs and a 6.99 gigapixel image. Both preliminary analysis and practical results indicate that the MapReduce model is well suited for distributed image processing in the first case, whereas in the second case this is true only for local non-iterative algorithms, and further work is necessary in order to reach a conclusive decision.
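
To make the Input-Map-Reduce-Output chain concrete for the single-large-image case, here is a minimal single-process sketch. It assumes a per-pixel threshold as the local non-iterative operation (the thesis abstract does not say which algorithm was used) and a grayscale image stored as a list of rows; all names are hypothetical, and no Hadoop API is involved.

```python
from typing import Iterable, Iterator, List, Tuple

Image = List[List[int]]   # grayscale pixel values, one list per row
Tile = Tuple[int, Image]  # (index of the tile's first row, its rows)


def split_rows(image: Image, rows_per_tile: int) -> Iterator[Tile]:
    """Input: cut the image into horizontal tiles that can be processed alone."""
    for start in range(0, len(image), rows_per_tile):
        yield start, image[start:start + rows_per_tile]


def map_threshold(tile: Tile, cutoff: int = 128) -> Tile:
    """Map: apply a per-pixel threshold; needs no data from other tiles."""
    start, rows = tile
    return start, [[255 if px >= cutoff else 0 for px in row] for row in rows]


def reduce_tiles(tiles: Iterable[Tile]) -> Image:
    """Reduce/Output: put the processed tiles back in their original order."""
    out: Image = []
    for _, rows in sorted(tiles, key=lambda t: t[0]):
        out.extend(rows)
    return out


sample_image: Image = [[0, 200], [130, 10], [255, 127]]
binary_image = reduce_tiles(map(map_threshold, split_rows(sample_image, 2)))
```

A purely per-pixel operation never reads across tile borders, so tiles can be mapped independently. A non-local or iterative algorithm would need neighbouring tiles or repeated passes, which is consistent with the abstract's finding that only local non-iterative algorithms suit the single-large-image case.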
