Distributed mining of molecular fragments
In real world applications sequential algorithms of
data mining and data exploration are often unsuitable for
datasets with enormous size, high-dimensionality and complex
data structure. Grid computing promises unprecedented
opportunities for unlimited computing and storage resources. In this context there is the necessity to develop
high performance distributed data mining algorithms.
However, the computational complexity of the problem and
the large amount of data to be explored often make the design of large scale applications particularly challenging. In this paper we present the first distributed formulation of a frequent subgraph mining algorithm for discriminative fragments of molecular compounds. Two distributed approaches have been developed and compared on the well known National Cancer Institute’s HIV-screening dataset. We present experimental results on a small-scale computing environment
Distributed Online Big Data Classification Using Context Information
Distributed, online data mining systems have emerged as a result of
applications requiring analysis of large amounts of correlated and
high-dimensional data produced by multiple distributed data sources. We propose
a distributed online data classification framework where data is gathered by
distributed data sources and processed by a heterogeneous set of distributed
learners which learn online, at run-time, how to classify the different data
streams either by using their locally available classification functions or by
helping each other by classifying each other's data. Importantly, since the
data is gathered at different locations, sending the data to another learner to
process incurs additional costs such as delays, and hence this will be only
beneficial if the benefits obtained from a better classification will exceed
the costs. We model the problem of joint classification by the distributed and
heterogeneous learners from multiple data sources as a distributed contextual
bandit problem where each data is characterized by a specific context. We
develop a distributed online learning algorithm for which we can prove
sublinear regret. Compared to prior work in distributed online data mining, our
work is the first to provide analytic regret results characterizing the
performance of the proposed algorithm
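The trade-off described above (classify locally, or forward the data to another learner only when the accuracy gain outweighs the cost) can be illustrated with a small contextual-bandit sketch. This is not the authors' algorithm: it swaps in a plain UCB1 index computed separately for each cell of a uniform partition of a one-dimensional context in [0, 1). The class name `PartitionedUCB`, the arm names, and the cost values are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


class PartitionedUCB:
    """UCB1-style learner that keeps separate statistics per context cell.

    Illustrative stand-in only; not the algorithm from the abstract.
    """

    def __init__(self, arms, costs, cells=4):
        self.arms = list(arms)            # e.g. "local" classifier or a "peer" learner
        self.costs = dict(costs)          # assumed per-arm cost, e.g. forwarding delay
        self.cells = cells                # equal-width cells over the context range [0, 1)
        self.counts = defaultdict(int)    # (cell, arm) -> number of times chosen
        self.means = defaultdict(float)   # (cell, arm) -> running mean of net reward
        self.cell_pulls = defaultdict(int)

    def _cell(self, context):
        return min(int(context * self.cells), self.cells - 1)

    def choose(self, context):
        cell = self._cell(context)
        # Try every arm once in this cell before trusting the estimates.
        for arm in self.arms:
            if self.counts[(cell, arm)] == 0:
                return arm
        total = self.cell_pulls[cell]

        def index(arm):
            n = self.counts[(cell, arm)]
            return self.means[(cell, arm)] + math.sqrt(2.0 * math.log(total) / n)

        return max(self.arms, key=index)

    def update(self, context, arm, correct):
        cell = self._cell(context)
        # Net reward: 1 for a correct label, 0 otherwise, minus the arm's cost.
        reward = (1.0 if correct else 0.0) - self.costs[arm]
        key = (cell, arm)
        self.counts[key] += 1
        self.cell_pulls[cell] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]
```

A learner would call `choose(context)` for each arriving data item, act on the returned arm, and call `update` once the true label is observed. Forwarding is then selected in a context cell only if its estimated accuracy minus cost beats local classification there.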
Distributed Private Online Learning for Social Big Data Computing over Data Center Networks
With the rapid growth of Internet technologies, cloud computing and social
networks have become ubiquitous. An increasing number of people participate in
social networks and massive online social data are obtained. To
exploit knowledge from the copious data obtained and to predict the social
behavior of users, data mining in social networks is urgently needed. Almost
all online websites use cloud services to effectively process large-scale
social data, which are gathered from distributed data centers. These data
are so large-scale, high-dimensional and widely distributed that we propose a
distributed sparse online algorithm to handle them. Additionally,
privacy-protection is an important point in social networks. We should not
compromise the privacy of individuals in networks, while these social data are
being learned for data mining. Thus we also consider the privacy problem in
this article. Our simulations show that appropriate sparsity of the data
enhances the performance of our algorithm and that the privacy-preserving method does
not significantly hurt the performance of the proposed algorithm.
Comment: ICC201
Algorithms for Extracting Frequent Episodes in the Process of Temporal Data Mining
An important aspect of the data mining process is the discovery of patterns that have a great influence on the studied problem. The purpose of this paper is to study frequent episode mining through the use of parallel pattern discovery algorithms. Parallel pattern discovery algorithms offer better performance and scalability, so they are of great interest to the data mining research community. The paper highlights several parallel and distributed frequent pattern mining algorithms on various platforms and presents a comparative study of their main features. The study takes into account the new possibilities that arise with the emerging Compute Unified Device Architecture from the latest generation of graphics processing units. Based on their high performance, low cost and the increasing number of features offered, GPU processors are viable solutions for an optimal implementation of frequent pattern mining algorithms.
Keywords: Frequent Pattern Mining, Parallel Computing, Dynamic Load Balancing, Temporal Data Mining, CUDA, GPU, Fermi, Thread
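As a rough illustration of frequent episode counting and of why it parallelizes well, the sketch below measures how often a serial episode (an ordered tuple of event types) occurs inside sliding windows of an event sequence, and evaluates candidate episodes concurrently. It is not one of the algorithms surveyed in the paper: the window-based frequency definition is the common sliding-window formulation, and a CPU thread pool is used purely as a stand-in for GPU threads. The function names and the `width`, `min_freq`, and `workers` parameters are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def occurs_in_order(episode, events):
    """True if the event types in `episode` appear in `events` as a subsequence."""
    it = iter(events)
    # Each `any` consumes the iterator up to the first match, preserving order.
    return all(any(e == wanted for e in it) for wanted in episode)


def window_frequency(episode, sequence, width):
    """Fraction of sliding windows of `width` consecutive events containing the episode."""
    n_windows = len(sequence) - width + 1
    if n_windows <= 0:
        return 0.0
    hits = sum(
        occurs_in_order(episode, sequence[start:start + width])
        for start in range(n_windows)
    )
    return hits / n_windows


def frequent_episodes(candidates, sequence, width, min_freq, workers=4):
    """Score candidate episodes independently and keep those above `min_freq`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        freqs = list(
            pool.map(lambda ep: window_frequency(ep, sequence, width), candidates)
        )
    return {ep: f for ep, f in zip(candidates, freqs) if f >= min_freq}
```

Because each candidate is scored independently of the others, the work can be split across threads or devices without coordination. Balancing that work when candidates differ in cost is the kind of issue the keywords above refer to as dynamic load balancing.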
Mining Large Data Sets on Grids: Issues and Prospects
When data mining and knowledge discovery techniques must be used to analyze large amounts of data, high-performance parallel and distributed computers can help to provide better computational performance and, as a consequence, deeper and more meaningful results. Recently grids, composed of large-scale, geographically distributed platforms working together, have emerged as effective architectures for high-performance decentralized computation. It is natural to consider grids as tools for distributed data-intensive applications such as data mining, but the underlying patterns of computation and data movement in such applications are different from those of more conventional high-performance computation. These differences require a different kind of grid, or at least a grid with significantly different emphases. This paper discusses the main issues, requirements, and design approaches for the implementation of grid-based knowledge discovery systems. Furthermore, some prospects and promising research directions in datacentric and knowledge-discovery oriented grids are outlined
Distributed data mining using web services.
Increasing computational power and the decreasing cost of high-bandwidth networks have resulted in distributed systems. Distributed data mining is used to analyze and monitor data in such systems. In the past, distributed technologies such as Java RMI and CORBA were used for data mining, but the result was a more tightly coupled system. Using web services, a loosely coupled, interoperable distributed computing framework can be built. The topic of this thesis is the use of web services in distributed data mining. The thesis involves the design, development and implementation of distributed data mining using web services, as well as an in-depth look at the technical aspects and future implications of such a framework. A working framework will be created that allows a user to dynamically locate and run mining algorithms on data services or vice versa. The algorithms and data will be deployed as web services. The created web services will be registered at public registry servers. Two distributed data mining architectures will be presented: Data to Algorithm and Algorithm to Data. Finally, the performance of both architectures will be compared with varying data using different public registry servers
Grid data mining by means of learning classifier systems and distributed model induction
This paper introduces a distributed data mining approach suited to
grid computing environments based on a supervised learning
classifier system. Different methods of merging data mining
models generated at different distributed sites are explored.
Centralized Data Mining (CDM) is a conventional method of
mining distributed data. In CDM, data stored in
distributed locations has to be collected and stored in a central
repository before the data mining algorithm is executed. The CDM
method is reliable; however, it is expensive (computational,
communication and implementation costs are high).
Alternatively, the Distributed Data Mining (DDM) approach is
economical, but it has limitations in combining local models. In
DDM, the data mining algorithm has to be executed at each of
the sites to induce a local model. The induced local models are then
collected and combined to form a global data mining model. In
this work six different tactics are used for constructing the global
model in DDM: Generalized Classifier Method (GCM); Specific
Classifier Method (SCM); Weighed Classifier Method (WCM);
Majority Voting Method (MVM); Model Sampling Method
(MSM); and Centralized Training Method (CTM). Preliminary
experimental tests were conducted with two synthetic data sets
(eleven multiplexer and monks3) and a real world data set
(intensive care medicine). The initial results demonstrate that the
performance of DDM methods is competitive when compared
with the CDM methods.
Fundação para a Ciência e a Tecnologia (FCT
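The DDM flow described above (induce a local model at each site, then combine the local models into a global one) can be sketched using majority voting as the combination step. This is only a minimal illustration: the paper's local models are learning classifier systems, whereas the hypothetical `train_local_model` below induces a one-feature threshold rule and assumes that class 1 has the larger feature values. The function names and the toy data are likewise assumptions.

```python
from collections import Counter


def train_local_model(samples):
    """Induce a threshold rule from (value, label) pairs with labels 0 and 1.

    Stand-in for a site's local learner; assumes class 1 has larger values.
    """
    zeros = [v for v, y in samples if y == 0]
    ones = [v for v, y in samples if y == 1]
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0
    return lambda v: 1 if v > threshold else 0


def majority_vote(local_models, x):
    """Global prediction: the label predicted by the most local models."""
    votes = Counter(model(x) for model in local_models)
    # On a tie, most_common returns the label that was counted first.
    return votes.most_common(1)[0][0]


# Each site trains on its own data; only the models leave the site.
sites = [
    [(1, 0), (3, 1)],
    [(2, 0), (6, 1)],
    [(0, 0), (10, 1)],
]
local_models = [train_local_model(data) for data in sites]
```

Calling `majority_vote(local_models, x)` then classifies a new value `x` without any raw data having been moved to a central repository, which is the cost that the CDM approach incurs.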