18,052 research outputs found
A Survey of Parallel Sequential Pattern Mining
With the growing popularity of shared resources, large volumes of complex
data of different types are collected automatically. Traditional data mining
algorithms generally have problems and challenges including huge memory cost,
low processing speed, and inadequate hard disk space. As a fundamental task of
data mining, sequential pattern mining (SPM) is used in a wide variety of
real-life applications. However, it is more complex and challenging than other
pattern mining tasks, such as frequent itemset mining and association rule
mining, and also suffers from the above challenges when handling
large-scale data. To solve these problems, mining sequential patterns in a
parallel or distributed computing environment has emerged as an important issue
with many applications. In this paper, we provide an in-depth survey of the
current status of parallel sequential pattern mining (PSPM),
including a detailed categorization of traditional serial SPM approaches and
state-of-the-art parallel SPM. We review the related work of parallel
sequential pattern mining in detail, including partition-based algorithms for
PSPM, Apriori-based PSPM, pattern growth based PSPM, and hybrid algorithms for
PSPM, and provide a detailed description (i.e., characteristics, advantages,
disadvantages and summarization) of these parallel approaches of PSPM. Some
advanced topics for PSPM, including parallel quantitative / weighted / utility
sequential pattern mining, PSPM from uncertain data and stream data, and hardware
acceleration for PSPM, are further reviewed in detail. Besides, we review and
provide some well-known open-source software of PSPM. Finally, we summarize
some challenges and opportunities of PSPM in the big data era.
Comment: Accepted by ACM Trans. on Knowl. Discov. Data, 33 pages
Evaluation of Frequent Itemset Mining Platforms using Apriori and FP-Growth Algorithm
With the overwhelming amount of complex and heterogeneous data pouring in from
anywhere, at any time, and from any device, we are undeniably in an era of Big Data.
The emergence of Big Data as a disruptive technology for the next generation of
intelligent systems has raised many questions about how to extract and make use of
the knowledge obtained from the data in a short time, on a limited budget, and
under high rates of data generation. Companies are recognizing that big data
can be used to make more accurate predictions and to enhance the
business with the help of an appropriate association rule mining algorithm. To
help these organizations decide which software and algorithm is more appropriate
for their dataset, we compared the three most famous
MapReduce-based platforms (Hadoop, Spark, and Flink) on two widely used algorithms,
Apriori and FP-Growth, on datasets of different scales.
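For readers unfamiliar with Apriori, the level-wise search it performs can be sketched on a single machine as follows. This is a minimal illustration, not the evaluated Hadoop, Spark, or Flink implementations; the function name `apriori`, its parameters, and the absolute support threshold are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for every frequent itemset.

    min_support is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count how many transactions contain each single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: one pass over the data per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result
```

The count step is the part that distributed platforms typically parallelize, since each level requires a full pass over the transactions.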
State of the Art, Evaluation and Recommendations regarding "Document Processing and Visualization Techniques"
Several Networks of Excellence have been set up in the framework of the
European FP5 research program. Among these Networks of Excellence, the NEMIS
project focuses on the field of Text Mining.
Within this field, document processing and visualization was identified as
one of the key topics and the WG1 working group was created in the NEMIS
project, to carry out a detailed survey of techniques associated with the text
mining process and to identify the relevant research topics in related research
areas.
In this document we present the results of this comprehensive survey. The
report includes a description of the current state-of-the-art and practice, a
roadmap for follow-up research in the identified areas, and recommendations for
anticipated technological development in the domain of text mining.
Comment: 54 pages, Report of Working Group 1 for the European Network of
Excellence (NoE) in Text Mining and its Applications in Statistics (NEMIS)
Parallel and Distributed Collaborative Filtering: A Survey
Collaborative filtering is amongst the most preferred techniques when
implementing recommender systems. Recently, great interest has turned towards
parallel and distributed implementations of collaborative filtering algorithms.
This work is a survey of the parallel and distributed collaborative filtering
implementations, aiming not only to provide a comprehensive presentation of the
field's development, but also to offer future research orientation by
highlighting the issues that need to be further developed.
Comment: 46 pages
When data mining meets optimization: A case study on the quadratic assignment problem
This paper presents a hybrid approach called frequent pattern based search
that combines data mining and optimization. The proposed method uses a data
mining procedure to mine frequent patterns from a set of high-quality solutions
collected from previous search, and the mined frequent patterns are then
employed to build starting solutions that are improved by an optimization
procedure. After presenting the general approach and its composing ingredients,
we illustrate its application to solve the well-known and challenging quadratic
assignment problem. Computational results on the 21 hardest benchmark instances
show that the proposed approach competes favorably with state-of-the-art
algorithms both in terms of solution quality and computing time.
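The loop described in this abstract (collect high-quality solutions, mine patterns from them, build new starting solutions from the patterns, improve them) can be sketched for a toy quadratic assignment instance. This is a simplified illustration under stated assumptions, not the authors' method: the mining step here only keeps frequent single facility-to-location assignments rather than full frequent itemsets, the optimization procedure is a plain pairwise-swap descent, and all function names and parameters are hypothetical.

```python
import random
from collections import Counter


def qap_cost(perm, flow, dist):
    """Cost of assigning facility i to location perm[i]."""
    n = len(perm)
    return sum(flow[i][j] * dist[perm[i]][perm[j]]
               for i in range(n) for j in range(n))


def local_search(perm, flow, dist):
    """First-improvement descent over pairwise swaps (stand-in optimizer)."""
    perm = list(perm)
    best = qap_cost(perm, flow, dist)
    improved = True
    while improved:
        improved = False
        for i in range(len(perm)):
            for j in range(i + 1, len(perm)):
                perm[i], perm[j] = perm[j], perm[i]
                cost = qap_cost(perm, flow, dist)
                if cost < best:
                    best, improved = cost, True
                else:
                    perm[i], perm[j] = perm[j], perm[i]  # undo the swap
    return perm, best


def mine_frequent_assignments(elite, min_support):
    """Simplified mining: assignments shared by >= min_support elite solutions."""
    counts = Counter((f, loc) for perm in elite for f, loc in enumerate(perm))
    return [a for a, c in counts.most_common() if c >= min_support]


def build_from_pattern(pattern, n, rng):
    """Fix the mined assignments (skipping conflicts), fill the rest randomly."""
    perm, used = [None] * n, set()
    for f, loc in pattern:
        if perm[f] is None and loc not in used:
            perm[f] = loc
            used.add(loc)
    free = [loc for loc in range(n) if loc not in used]
    rng.shuffle(free)
    return [p if p is not None else free.pop() for p in perm]


def frequent_pattern_search(flow, dist, elite_size=6, rounds=5,
                            min_support=3, seed=0):
    rng = random.Random(seed)
    n = len(flow)
    # Elite set: local optima reached from random starting permutations.
    elite = [local_search(rng.sample(range(n), n), flow, dist)
             for _ in range(elite_size)]
    for _ in range(rounds):
        pattern = mine_frequent_assignments([p for p, _ in elite], min_support)
        candidate = local_search(build_from_pattern(pattern, n, rng), flow, dist)
        worst = max(range(len(elite)), key=lambda k: elite[k][1])
        if candidate[1] < elite[worst][1]:
            elite[worst] = candidate  # replace the worst elite solution
    return min(elite, key=lambda e: e[1])
```

Calling `frequent_pattern_search(flow, dist)` with two square matrices returns the best permutation found and its cost; the fixed seed only makes the sketch reproducible.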
Beneath the Tip of the Iceberg: Current Challenges and New Directions in Sentiment Analysis Research
Sentiment analysis as a field has come a long way since it was first
introduced as a task nearly 20 years ago. It has widespread commercial
applications in various domains like marketing, risk management, market
research, and politics, to name a few. Given its saturation in specific
subtasks -- such as sentiment polarity classification -- and datasets, there is
an underlying perception that this field has reached its maturity. In this
article, we discuss this perception by pointing out the shortcomings and
under-explored, yet key aspects of this field that are necessary to attain true
sentiment understanding. We analyze the significant leaps responsible for its
current relevance. Further, we attempt to chart a possible course for this
field that covers many overlooked and unanswered questions.
Comment: Published in the IEEE Transactions on Affective Computing (TAFFC)
Analytics for the Internet of Things: A Survey
The Internet of Things (IoT) envisions a world-wide, interconnected network
of smart physical entities. These physical entities generate a large amount of
data in operation and as the IoT gains momentum in terms of deployment, the
combined scale of those data seems destined to continue to grow. Increasingly,
applications for the IoT involve analytics. Data analytics is the process of
deriving knowledge from data, generating value like actionable insights from
them. This article reviews work in the IoT and big data analytics from the
perspective of their utility in creating efficient, effective and innovative
applications and services for a wide spectrum of domains. We review the broad
vision for the IoT as it is shaped in various communities, examine the
application of data analytics across IoT domains, provide a categorisation of
analytic approaches and propose a layered taxonomy from IoT data to analytics.
This taxonomy provides us with insights on the appropriateness of analytical
techniques, which in turn shapes a survey of enabling technology and
infrastructure for IoT analytics. Finally, we look at some tradeoffs for
analytics in the IoT that can shape future research.
A Framework for Fast Classification Algorithms
Today, as globalization increases the size of data sets, it is necessary to discover the
knowledge they contain. The discovered knowledge typically takes the form of association rules, classification rules,
clustering, frequent episodes, and deviation detection. Building fast and accurate classifiers for large
databases is an important task in data mining. There is growing evidence that integrating classification and
association rule mining offers an alternative to classification approaches based on heuristic, greedy search, such as decision tree induction.
Emerging associative classification algorithms have shown good promise in producing accurate classifiers. In
this paper we focus on the performance of associative classification and present a parallel model for classifier
building. Some parallel and distributed algorithms have been proposed for building decision tree
classifiers, but so far no such work has been reported for associative classification.
A Disease Diagnosis and Treatment Recommendation System Based on Big Data Mining and Cloud Computing
It is crucial to provide compatible treatment schemes for a disease according
to various symptoms at different stages. However, most classification methods
might be ineffective in accurately classifying a disease that holds the
characteristics of multiple treatment stages, various symptoms, and
multi-pathogenesis. Moreover, there are limited exchanges and cooperative
actions in disease diagnoses and treatments between different departments and
hospitals. Thus, when new diseases occur with atypical symptoms, inexperienced
doctors might have difficulty in identifying them promptly and accurately.
Therefore, to maximize the utilization of the advanced medical technology of
developed hospitals and the rich medical knowledge of experienced doctors, a
Disease Diagnosis and Treatment Recommendation System (DDTRS) is proposed in
this paper. First, to identify disease symptoms more accurately, a
Density-Peaked Clustering Analysis (DPCA) algorithm is introduced for
disease-symptom clustering. In addition, association analyses on
Disease-Diagnosis (D-D) rules and Disease-Treatment (D-T) rules are conducted
by the Apriori algorithm separately. The appropriate diagnosis and treatment
schemes are recommended for patients and inexperienced doctors, even if they
are in a limited therapeutic environment. Moreover, to reach the goals of high
performance and low latency response, we implement a parallel solution for
DDTRS using the Apache Spark cloud platform. Extensive experimental results
demonstrate that the proposed DDTRS realizes disease-symptom clustering
effectively and derives disease treatment recommendations intelligently and
accurately.
cSELENE: Privacy Preserving Query Retrieval System on Heterogeneous Cloud Data
When working in a collaborative team, the federated (huge) data sometimes
come from heterogeneous cloud vendors. The concern is not only data
privacy but also how those federated data can be queried
directly from the cloud in a fast and secure way. A previous solution offered a hybrid
cloud combining public and trusted private clouds. Another previous solution used
encryption on the MapReduce framework. The challenge, however, is that we are working on
heterogeneous clouds. In this paper, we present a novel technique for
privacy-aware querying.
Since we take execution time into account, our basic idea is to use a data
mining model, obtained by partitioning the federated databases, in order to reduce the
search and query time. Using a model of the database means that we use only the
summary, or the most characteristic patterns, of the database. Modeling is
privacy-preserving Stage I, since modeling symbolizes the data. We
implement encryption on the database as privacy-preserving Stage II. Our
system, called "cSELENE" (short for "cloud SELENE"), is designed to handle
federated data on heterogeneous clouds (AWS, Microsoft Azure, and Google Cloud
Platform) with the MapReduce technique.
In this paper we discuss the privacy-preserving system and threat model, the
format of the federated data, the parallel programming (GPU programming and
shared/memory systems), the parallel and secure algorithm for the data mining model
in a distributed cloud, the cloud infrastructure/architecture, and the UIX design
of the cSELENE system. Other issues, such as the incremental method and the secure
design of the cloud architecture (virtual machines across platforms),
are still open to discussion. Our experiments should demonstrate the validity and
practicality of the proposed high-performance computing scheme.
Comment: The First International Workshop on Learning From Limited or Noisy
Data for Information Retrieval (LND4IR), Ann Arbor, Michigan, USA, July 2018
(SIGIR 2018), 6 pages