Parallel and Distributed Collaborative Filtering: A Survey
Collaborative filtering is amongst the most preferred techniques when
implementing recommender systems. Recently, great interest has turned towards
parallel and distributed implementations of collaborative filtering algorithms.
This work is a survey of the parallel and distributed collaborative filtering
implementations, aiming not only to provide a comprehensive presentation of the
field's development, but also to offer future research orientation by
highlighting the issues that need to be further developed.
Comment: 46 pages
Big Data and Cross-Document Coreference Resolution: Current State and Future Opportunities
Information Extraction (IE) is the task of automatically extracting
structured information from unstructured/semi-structured machine-readable
documents. Among various IE tasks, extracting actionable intelligence from
an ever-increasing amount of data depends critically upon Cross-Document
Coreference Resolution (CDCR) - the task of identifying entity mentions across
multiple documents that refer to the same underlying entity. Recently, document
datasets on the order of petabytes or terabytes have raised many challenges for
performing effective CDCR such as scaling to large numbers of mentions and
limited representational power. The problem of analysing such datasets is
called "big data". The aim of this paper is to provide readers with an
understanding of the central concepts, subtasks, and the current
state of the art in the CDCR process. We provide an assessment of existing
tools/techniques for CDCR subtasks and highlight big data challenges in each of
them to help readers identify important and outstanding issues for further
investigation. Finally, we provide concluding remarks and discuss possible
directions for future work.
Using Hadoop for Large Scale Analysis on Twitter: A Technical Report
Sentiment analysis (or opinion mining) on Twitter data has attracted much
attention recently. One of Twitter's key features is the immediacy of
communication with other users in an easy, user-friendly and fast way.
Consequently, people tend to express their feelings freely, which makes Twitter
an ideal source for accumulating a vast amount of opinions towards a wide
diversity of topics. This amount of information offers huge potential and can
be harnessed to gauge the sentiment tendency towards these topics. However,
since no one can invest an infinite amount of time to read through these tweets,
an automated decision-making approach is necessary. Nevertheless, most existing
solutions are limited to centralized environments. Thus, they can process
at most a few thousand tweets. Such a sample is not representative enough to
define the sentiment polarity towards a topic, given the massive number of
tweets published daily. In this paper, we go one step further and develop a
novel method for sentiment learning in the MapReduce framework. Our algorithm
exploits the hashtags and emoticons inside a tweet, as sentiment labels, and
proceeds to a classification procedure of diverse sentiment types in a parallel
and distributed manner. Moreover, we utilize Bloom filters to compact the
storage size of intermediate data and boost the performance of our algorithm.
Through an extensive experimental evaluation, we show that our solution is
efficient, robust and scalable, and confirm the quality of our sentiment
identification.
Comment: 8 pages, 3 tables, 3 figures
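The abstract above mentions Bloom filters as a way to compact intermediate data. The following is a minimal, generic sketch of a Bloom filter and is not the authors' implementation: the class name `BloomFilter`, the default sizes, and the SHA-256-based hashing scheme are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: compact set membership with possible
    false positives but no false negatives. All parameters are assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into bytes.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Set every bit position derived from the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # The item may be present only if all of its bits are set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    seen = BloomFilter()
    for token in ["#happy", "#sad", ":)"]:  # hypothetical sentiment labels
        seen.add(token)
    print("#happy" in seen)
```

A filter like this stores only a fixed-size bit array rather than the items themselves, which is why it can shrink intermediate data at the cost of occasional false positives.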
BAMCloud: A Cloud Based Mobile Biometric Authentication Framework
With an exponential increase in the number of users switching to mobile banking,
various countries are adopting biometric solutions as security measures. The
main reason biometric technologies are becoming more common in the everyday
lives of consumers is the ease of capturing biometric data
in real time using their mobile phones. Biometric technologies provide
a potential security framework to make banking more convenient and secure
than it has ever been. At the same time, the exponential growth of enrollment
in biometric systems produces massive amounts of high-dimensional data
that degrade the performance of mobile banking systems.
Therefore, in order to overcome the performance issues arising due to this data
deluge, this paper aims to propose a distributed mobile biometric system based
on a high performance cluster Cloud. High availability, better time efficiency
and scalability are some of the added advantages of using the proposed system.
In this paper, a Cloud-based mobile biometric authentication framework
(BAMCloud) is proposed that uses dynamic signatures to perform
authentication. It covers data capture using any handheld
mobile device, followed by storage, preprocessing and training of the system in a
distributed manner over the Cloud. The framework is implemented using
MapReduce on the Hadoop platform, and a Levenberg-Marquardt
backpropagation neural network is used for training. Moreover, the adopted
methodology is novel and achieves a speedup of 8.5x and a performance of 96.23%.
Furthermore, the cost-benefit analysis of the implemented system shows that the
cost of implementing and executing the system is lower than that of existing
ones. The experiments demonstrate that the proposed framework achieves better
performance than the other methods used in the recent
literature.
Security and Privacy Aspects in MapReduce on Clouds: A Survey
MapReduce is a programming system for distributed processing of large-scale data
in an efficient and fault tolerant manner on a private, public, or hybrid
cloud. MapReduce is extensively used daily around the world as an efficient
distributed computation tool for a large class of problems, e.g., search,
clustering, log analysis, different types of join operations, matrix
multiplication, pattern matching, and analysis of social networks. Security and
privacy of data and MapReduce computations are essential concerns when a
MapReduce computation is executed in public or hybrid clouds. In order to
execute a MapReduce job in public and hybrid clouds, authentication of
mappers-reducers, confidentiality of data-computations, integrity of
data-computations, and correctness-freshness of the outputs are required.
Satisfying these requirements shields the operation from several types of
attacks on data and MapReduce computations. In this paper, we investigate and
discuss security and privacy challenges and requirements, considering a variety
of adversarial capabilities, and characteristics in the scope of MapReduce. We
also provide a review of existing security and privacy protocols for MapReduce
and discuss their overhead issues.
Comment: Accepted in Elsevier Computer Science Review
Empirical Big Data Research: A Systematic Literature Mapping
Background: Big Data is a relatively new field of research and technology,
and the literature reports a wide variety of concepts labeled with Big Data. The
maturity of a research field can be measured in the number of publications
containing empirical results. In this paper we present the current status of
empirical research in Big Data. Method: We employed a systematic mapping method
with which we mapped the collected research according to the labels Variety,
Volume and Velocity. In addition, we addressed the application areas of Big
Data. Results: We found that 151 of the assessed 1778 contributions contain a
form of empirical result and can be mapped to one or more of the 3 V's and 59
address an application area. Conclusions: The share of publications containing
empirical results is well below the average compared to computer science
research as a whole. In order to mature the research on Big Data, we recommend
applying empirical methods to strengthen the confidence in the reported
results. Based on our trend analysis we consider Volume and Variety to be the
most promising uncharted areas in Big Data.
Comment: Submitted to Springer journal Data Science and Engineering
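As a quick check on the share reported in the abstract above, the proportion of assessed contributions with empirical results can be worked out directly from the two stated counts (151 and 1778). The denominator for the 59 application-area contributions is not stated unambiguously, so it is left out of this sketch.

```latex
\[
\frac{151}{1778} \approx 0.0849 \approx 8.5\%
\]
```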
HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing
Multiple sequence alignment (MSA) plays a key role in biological sequence
analyses, especially in phylogenetic tree construction. The extreme growth of
next-generation sequencing has resulted in a shortage of efficient ultra-large
biological sequence alignment approaches for coping with different sequence
types. Distributed and parallel computing represents a crucial technique for
accelerating ultra-large sequence analyses. Based on HAlign and the Spark
distributed computing system, we implement a highly cost-efficient and
time-efficient HAlign-II tool to address ultra-large multiple biological
sequence alignment and phylogenetic tree construction. After comparing with
most available state-of-the-art methods, our experimental results indicate the
following: 1) HAlign-II can efficiently carry out MSA and construct
phylogenetic trees with ultra-large biological sequences; 2) HAlign-II shows
extremely high memory efficiency and scales well with increases in computing
resources; 3) HAlign-II provides a user-friendly web server based on our
distributed computing infrastructure. HAlign-II, with open-source code and
datasets, is available at http://lab.malab.cn/soft/halign
Big Data Computing Using Cloud-Based Technologies, Challenges and Future Perspectives
The excessive amounts of data generated by devices and Internet-based sources
on a regular basis constitute big data. This data can be processed and
analyzed to develop useful applications for specific domains. Several
mathematical and data analytics techniques have found use in this sphere. This
has given rise to the development of computing models and tools for big data
computing. However, the storage and processing requirements are overwhelming
for traditional systems and technologies. Therefore, there is a need for
infrastructures that can adjust the storage and processing capability in
accordance with the changing data dimensions. Cloud Computing serves as a
potential solution to this problem. However, big data computing in the cloud
has its own set of challenges and research issues. This chapter surveys the big
data concept, discusses the mathematical and data analytics techniques that can
be used for big data and gives a taxonomy of the existing tools, frameworks and
platforms available for different big data computing models. Besides this, it
also evaluates the viability of cloud-based big data computing, examines
existing challenges and opportunities, and provides future research directions
in this field.
Distributed rank-1 dictionary learning: Towards fast and scalable solutions for fMRI big data analytics
The use of functional brain imaging for research and diagnosis has benefitted
greatly from the recent advancements in neuroimaging technologies, as well as
the explosive growth in size and availability of fMRI data. While it has been
shown in the literature that using multiple and large-scale fMRI datasets can
improve reproducibility and lead to new discoveries, the computational and
informatics systems supporting the analysis and visualization of such fMRI big
data are extremely limited and largely under-discussed. We propose to address
these shortcomings in this work, based on previous success in using the dictionary
learning method for functional network decomposition studies on fMRI data. We
present a distributed dictionary learning framework based on rank-1 matrix
decomposition with sparseness constraint (D-r1DL framework). The framework was
implemented using the Spark distributed computing engine and deployed on three
different processing units: an in-house server, in-house high performance
clusters, and the Amazon Elastic Compute Cloud (EC2) service. The whole
analysis pipeline was integrated with our neuroinformatics system for data
management, user input/output, and real-time visualization. The performance and
accuracy of D-r1DL on both individual and group-wise fMRI Human Connectome
Project (HCP) datasets show that the proposed framework is highly scalable. The
resulting group-wise functional network decompositions are highly accurate, and
the fast processing time confirms this claim. In addition, D-r1DL can provide
real-time user feedback and results visualization, which are vital for
large-scale data analysis.
Comment: One of the authors' names, Mojtaba Sedigh Fazli, was mistakenly
omitted from this paper presented at the IEEE Big Data conference. As a result, we
are submitting this version to correct the author list
Semi-Automatic Terminology Ontology Learning Based on Topic Modeling
Ontologies provide features such as a common vocabulary, reusability, and
machine-readable content; they also allow for semantic search, facilitate agent
interaction, and support ordering and structuring of knowledge for Semantic Web (Web
3.0) applications. However, the challenge in ontology engineering is automatic
learning, i.e., there is still no fully automatic approach that takes a
text corpus or dataset of various topics and forms an ontology using machine
learning techniques. In this paper, two topic modeling algorithms are explored,
namely LSI & SVD and Mr.LDA for learning topic ontology. The objective is to
determine the statistical relationship between document and terms to build a
topic ontology and ontology graph with minimum human intervention. Experimental
analysis of building a topic ontology and semantically retrieving the corresponding
topic ontology for a user's query demonstrates the effectiveness of the
proposed approach.