2,109 research outputs found

Parallel and Distributed Collaborative Filtering: A Survey

Author: Karydi Efthalia
Margaritis Konstantinos G.
Publication date: 09/09/2014

Collaborative filtering is among the most preferred techniques for implementing recommender systems. Recently, great interest has turned towards parallel and distributed implementations of collaborative filtering algorithms. This work surveys parallel and distributed collaborative filtering implementations, aiming not only to provide a comprehensive presentation of the field's development, but also to orient future research by highlighting the issues that need further development.

Comment: 46 pages

arXiv.org e-Print Archive

Big Data and Cross-Document Coreference Resolution: Current State and Future Opportunities

Author: Beheshti Seyed-Mehdi-Reza
Benatallah Boualem
Ryu Seung Hwan
Venugopal Srikumar
Wang Wei
Publication date: 14/11/2013

Information Extraction (IE) is the task of automatically extracting structured information from unstructured or semi-structured machine-readable documents. Among the various IE tasks, extracting actionable intelligence from an ever-increasing amount of data depends critically upon Cross-Document Coreference Resolution (CDCR), the task of identifying entity mentions across multiple documents that refer to the same underlying entity. Recently, document datasets on the order of terabytes to petabytes have raised many challenges for performing effective CDCR, such as scaling to large numbers of mentions and limited representational power. The problem of analysing such datasets is called "big data". The aim of this paper is to provide readers with an understanding of the central concepts, subtasks, and the current state of the art in the CDCR process. We provide an assessment of existing tools and techniques for CDCR subtasks and highlight the big data challenges in each of them to help readers identify important and outstanding issues for further investigation. Finally, we provide concluding remarks and discuss possible directions for future work.

arXiv.org e-Print Archive

Using Hadoop for Large Scale Analysis on Twitter: A Technical Report

Author: Nodarakis Nikolaos
Sioutas Spyros
Tsakalidis Athanasios
Tzimas Giannis
Publication date: 03/02/2016

Sentiment analysis (or opinion mining) on Twitter data has attracted much attention recently. One of the system's key features is the immediacy of communication with other users in an easy, user-friendly and fast way. Consequently, people tend to express their feelings freely, which makes Twitter an ideal source for accumulating a vast number of opinions on a wide diversity of topics. This amount of information offers huge potential and can be harnessed to determine the sentiment tendency towards these topics. However, since no one can invest an infinite amount of time reading through these tweets, an automated decision-making approach is necessary. Nevertheless, most existing solutions are limited to centralized environments. Thus, they can process at most a few thousand tweets. Given the massive number of tweets published daily, such a sample is not representative enough to define the sentiment polarity towards a topic. In this paper, we go one step further and develop a novel method for sentiment learning in the MapReduce framework. Our algorithm exploits the hashtags and emoticons inside a tweet as sentiment labels, and proceeds to classify diverse sentiment types in a parallel and distributed manner. Moreover, we utilize Bloom filters to compact the storage size of intermediate data and boost the performance of our algorithm. Through an extensive experimental evaluation, we prove that our solution is efficient, robust and scalable and confirm the quality of our sentiment identification.

Comment: 8 pages, 3 tables, 3 figures

arXiv.org e-Print Archive
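
The Twitter sentiment abstract above mentions two mechanisms: using hashtags as sentiment labels in a MapReduce-style job, and Bloom filters to compact intermediate data. The following is a minimal, single-machine Python sketch of both ideas, not the authors' implementation. The names `BloomFilter`, `hashtag_labels` and `count_labels`, the SHA-256 based hashing scheme, and the default sizes are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: never a false negative, occasionally a false positive."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting a single hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def hashtag_labels(tweet):
    """Map step (illustrative): emit (hashtag, 1) pairs, treating hashtags as labels."""
    return [
        (tok.lower(), 1)
        for tok in tweet.split()
        if tok.startswith("#") and len(tok) > 1
    ]


def count_labels(pairs):
    """Reduce step (illustrative): sum the counts emitted for each label."""
    totals = {}
    for label, n in pairs:
        totals[label] = totals.get(label, 0) + n
    return totals
```

For example, mapping `hashtag_labels` over the tweets `"Great day #happy"`, `"So tired #sad"` and `"Sun is out #Happy"` and reducing with `count_labels` gives `{"#happy": 2, "#sad": 1}`. A label added to a `BloomFilter` always tests as present afterwards, while an absent label may occasionally test as present; that false-positive rate is the price paid for the compact storage.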

BAMCloud: A Cloud Based Mobile Biometric Authentication Framework

Author: Alam Mansaf
Jabin Suraiya
Shakil Kashish Ara
Zareen Farhana Javed
Publication date: 19/05/2017

With an exponential increase in the number of users switching to mobile banking, various countries are adopting biometric solutions as security measures. The main reason biometric technologies are becoming more common in the everyday lives of consumers is the ease of capturing biometric data in real time using their mobile phones. Biometric technologies provide a potential security framework to make banking more convenient and secure than it has ever been. At the same time, the exponential growth of enrollment in biometric systems produces a massive amount of high-dimensional data that degrades the performance of mobile banking systems. Therefore, to overcome the performance issues arising from this data deluge, this paper proposes a distributed mobile biometric system based on a high-performance cluster Cloud. High availability, better time efficiency and scalability are some of the added advantages of the proposed system. In this paper, a Cloud-based mobile biometric authentication framework (BAMCloud) is proposed that uses dynamic signatures to perform authentication. It involves data capture using any handheld mobile device, followed by storage, preprocessing and training of the system in a distributed manner over the Cloud. For this purpose, we have implemented it using MapReduce on the Hadoop platform, and a Levenberg-Marquardt backpropagation neural network has been used for training. Moreover, the methodology adopted is novel, as it achieves a speedup of 8.5x and a performance of 96.23%. Furthermore, the cost-benefit analysis of the implemented system shows that its cost of implementation and execution is lower than that of existing systems. The experiments demonstrate that the proposed framework achieves better performance than the other methods used in the recent literature.

arXiv.org e-Print Archive

Security and Privacy Aspects in MapReduce on Clouds: A Survey

Author: Derbeko Philip
Dolev Shlomi
Gudes Ehud
Sharma Shantanu
Publication date: 02/05/2016

MapReduce is a programming system for distributed processing of large-scale data in an efficient and fault-tolerant manner on a private, public, or hybrid cloud. MapReduce is used extensively every day around the world as an efficient distributed computation tool for a large class of problems, e.g., search, clustering, log analysis, different types of join operations, matrix multiplication, pattern matching, and analysis of social networks. Security and privacy of data and MapReduce computations are essential concerns when a MapReduce computation is executed in public or hybrid clouds. To execute a MapReduce job in public and hybrid clouds, authentication of mappers and reducers, confidentiality of data and computations, integrity of data and computations, and correctness and freshness of the outputs are required. Satisfying these requirements shields the operation from several types of attacks on data and MapReduce computations. In this paper, we investigate and discuss security and privacy challenges and requirements, considering a variety of adversarial capabilities and characteristics in the scope of MapReduce. We also provide a review of existing security and privacy protocols for MapReduce and discuss their overhead issues.

Comment: Accepted in Elsevier Computer Science Review

arXiv.org e-Print Archive

Empirical Big Data Research: A Systematic Literature Mapping

Author: Mathisen Bjørn Magnus
Roman Dumitru
Wienhofen Leendert
Publication date: 12/10/2016

Background: Big Data is a relatively new field of research and technology, and the literature reports a wide variety of concepts labeled with Big Data. The maturity of a research field can be measured by the number of publications containing empirical results. In this paper we present the current status of empirical research in Big Data. Method: We employed a systematic mapping method with which we mapped the collected research according to the labels Variety, Volume and Velocity. In addition, we addressed the application areas of Big Data. Results: We found that 151 of the 1778 assessed contributions contain a form of empirical result and can be mapped to one or more of the 3 V's, and 59 address an application area. Conclusions: The share of publications containing empirical results is well below the average for computer science research as a whole. To mature the research on Big Data, we recommend applying empirical methods to strengthen confidence in the reported results. Based on our trend analysis, we consider Volume and Variety to be the most promising uncharted areas in Big Data.

Comment: Submitted to Springer journal Data Science and Engineering

arXiv.org e-Print Archive

HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing

Author: Wan Shixiang
Zou Quan
Publication date: 04/04/2017

Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. The extreme increase in next-generation sequencing has resulted in a shortage of efficient ultra-large biological sequence alignment approaches that can cope with different sequence types. Distributed and parallel computing represents a crucial technique for accelerating ultra-large sequence analyses. Based on HAlign and the Spark distributed computing system, we implement a highly cost-efficient and time-efficient tool, HAlign-II, to address ultra-large multiple biological sequence alignment and phylogenetic tree construction. After comparison with most available state-of-the-art methods, our experimental results indicate the following: 1) HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large biological sequences; 2) HAlign-II shows extremely high memory efficiency and scales well with increases in computing resources; 3) HAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II, with open-source code and datasets, is available at http://lab.malab.cn/soft/halign

arXiv.org e-Print Archive

Big Data Computing Using Cloud-Based Technologies, Challenges and Future Perspectives

Author: Alam Mansaf
Khan Samiya
Shakil Kashish Ara
Publication date: 24/11/2017

The excessive amounts of data generated by devices and Internet-based sources on a regular basis constitute big data. This data can be processed and analyzed to develop useful applications for specific domains. Several mathematical and data analytics techniques have found use in this sphere. This has given rise to the development of computing models and tools for big data computing. However, the storage and processing requirements are overwhelming for traditional systems and technologies. Therefore, there is a need for infrastructures that can adjust their storage and processing capability in accordance with the changing data dimensions. Cloud Computing serves as a potential solution to this problem. However, big data computing in the cloud has its own set of challenges and research issues. This chapter surveys the big data concept, discusses the mathematical and data analytics techniques that can be used for big data, and gives a taxonomy of the existing tools, frameworks and platforms available for different big data computing models. It also evaluates the viability of cloud-based big data computing, examines existing challenges and opportunities, and provides future research directions in this field.

arXiv.org e-Print Archive

Distributed rank-1 dictionary learning: Towards fast and scalable solutions for fMRI big data analytics

Author: Fazli Mojtaba Sedigh
Li Xiang
Lin Binbin
Liu Tianming
Makkie Milad
Quinn Shannon
Ye Jieping
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 08/08/2017

The use of functional brain imaging for research and diagnosis has benefitted greatly from recent advancements in neuroimaging technologies, as well as from the explosive growth in the size and availability of fMRI data. While it has been shown in the literature that using multiple, large-scale fMRI datasets can improve reproducibility and lead to new discoveries, the computational and informatics systems supporting the analysis and visualization of such fMRI big data are extremely limited and largely under-discussed. We propose to address these shortcomings in this work, building on previous success in using the dictionary learning method for functional network decomposition studies on fMRI data. We present a distributed dictionary learning framework based on rank-1 matrix decomposition with a sparseness constraint (the D-r1DL framework). The framework was implemented using the Spark distributed computing engine and deployed on three different processing units: an in-house server, in-house high-performance clusters, and the Amazon Elastic Compute Cloud (EC2) service. The whole analysis pipeline was integrated with our neuroinformatics system for data management, user input/output, and real-time visualization. The performance and accuracy of D-r1DL on both individual and group-wise fMRI data from the Human Connectome Project (HCP) show that the proposed framework is highly scalable. The resulting group-wise functional network decompositions are highly accurate, and the fast processing time confirms this claim. In addition, D-r1DL can provide real-time user feedback and results visualization, which are vital for large-scale data analysis.

Comment: One of the authors' names, Mojtaba Sedigh Fazli, was mistakenly omitted from this paper as presented at the IEEE Big Data conference. We are therefore submitting this version to correct the author list.

arXiv.org e-Print Archive
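
The D-r1DL abstract above describes a rank-1 matrix decomposition with a sparseness constraint. As a rough illustration of that general idea, the sketch below approximates a matrix `S` by an outer product `u v^T`, alternating between updating `v` (keeping only its `r` largest-magnitude entries) and updating a unit-norm `u`. This is an assumed, simplified single-machine variant in pure Python, not the published D-r1DL algorithm or its Spark implementation. The function name `rank1_sparse` and its parameters are hypothetical.

```python
import math


def rank1_sparse(S, r, iters=50):
    """Approximate S (a list of rows) by u v^T with ||u|| = 1 and at most r nonzeros in v."""
    n_rows, n_cols = len(S), len(S[0])
    u = [1.0 / math.sqrt(n_rows)] * n_rows  # start from a uniform unit vector
    v = [0.0] * n_cols
    for _ in range(iters):
        # v <- S^T u, then zero out everything except the r largest-magnitude entries.
        v = [sum(S[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        keep = set(sorted(range(n_cols), key=lambda j: abs(v[j]), reverse=True)[:r])
        v = [v[j] if j in keep else 0.0 for j in range(n_cols)]
        # u <- S v, rescaled to unit length.
        new_u = [sum(S[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        norm = math.sqrt(sum(x * x for x in new_u))
        if norm == 0.0:
            break  # degenerate case: keep the previous u
        u = [x / norm for x in new_u]
    return u, v
```

For the rank-1 matrix `[[3, 0, 0], [4, 0, 0]]` with `r = 1`, the iteration settles on `u = (0.6, 0.8)` and `v = (5, 0, 0)`, up to floating-point error. In a distributed setting the two matrix-vector products are the natural steps to parallelize across rows or columns, which is the kind of workload an engine such as Spark is suited to.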

Semi-Automatic Terminology Ontology Learning Based on Topic Modeling

Author: Dhar Amit Kumar
Rani Monika
Vyas O. P.
Publication date: 05/08/2017

Ontologies provide features such as a common vocabulary, reusability, and machine-readable content; they also allow for semantic search and facilitate agent interaction and the ordering and structuring of knowledge for Semantic Web (Web 3.0) applications. However, the challenge in ontology engineering is automatic learning, i.e., there is still no fully automatic approach for forming an ontology from a text corpus or dataset of various topics using machine learning techniques. In this paper, two topic modeling algorithms, namely LSI & SVD and Mr.LDA, are explored for learning a topic ontology. The objective is to determine the statistical relationship between documents and terms in order to build a topic ontology and ontology graph with minimal human intervention. Experimental analysis of building a topic ontology and semantically retrieving the corresponding topic ontology for a user's query demonstrates the effectiveness of the proposed approach.

arXiv.org e-Print Archive