11 research outputs found

    Decomposable Naive Bayes Classifier for Partitioned Data

    Get PDF
    Most learning algorithms are designed to work on a single dataset. However, with the growth of networks, data is increasingly distributed over many databases in many different geographical sites. These databases cannot be moved to other network sites due to security, size, privacy, or data ownership consideration. In this paper, we propose two decomposable versions of Naive Bayes Classifier for horizontally and vertically partitioned data. The goal of our algorithms is to achieve the learning objectives for any data distribution encountered across the network by exchanging minimum local summaries among the participating sites

    Learning k-Nearest Neighbors Classifier from Distributed Data

    Get PDF
    Most learning algorithms assume that all the relevant data are available on a single computer site. In the emerging networked environments learning tasks are encountering situations in which the relevant data exists in a number of geographically distributed databases that are connected by communication networks. These databases cannot be moved to other network sites due to security, size, privacy, or data-ownership considerations. In this paper we show how a k-nearest classifier algorithm can be adapted for distributed data situations. The objective of our algorithms is to achieve the learning objectives for any data distribution encountered across the network by exchanging local summaries among the participating nodes

    New Algorithm for Clustering Distributed Data Using k-Means

    Get PDF
    The internet era and high speed networks have ushered in the capabilities to have ready access to large amounts of geographically distributed data. Individuals, businesses, and governments recognize the value of this available resource to those who can transform the data into information. These databases, though valuable as individual entities, become significantly more valuable when they function as parts of a federated database and their data can be aggregated for collective mining or computations. This requires new algorithms to shift their focus from working with single databases to efficiently working with federated databases. In this paper, we propose a new decomposable version of the popular k-means clustering algorithm that works in this desired manner with a set of networked databases. We show that it is possible to perform global computation in a reasonably secure manner for either horizontally or vertically distributed databases. The computation is completed by only exchanging a few local summaries among the databases. An empirical and analytical validation of our results is also presented

    Nearest Neighbor Clustering over Partitioned Data

    Get PDF
    Most clustering algorithms assume that all the relevant data are available on a single node of a computer network. In the emerging distributed and networked knowledge environments, databases relevant for computations may reside on a number of nodes connected by a communication network. These data resources cannot be moved to other network sites due to privacy, security, and size considerations. The desired global computation must be decomposed into local computations to match the distribution of data across the network. The capability to decompose computations must be general enough to handle different distributions of data and different participating nodes in each instance of the global computation. In this paper, we present a methodology and algorithm for clustering distributed data in d-dimensional space, using nearest neighbor clustering, wherein each distributed data source is represented by an agent. Each such agent has the capability to decompose global computations into local parts, for itself and for agents at other sites. The global computation is then performed by the agent either exchanging some minimal summaries with other agents or traveling to all the sites and performing local tasks that can be done at each local site. The objective is to perform global tasks with a minimum of communication or travel by participating agents across the network

    Agents for Integrating Distributed Data for Complex Computations

    Get PDF
    Algorithms for many complex computations assume that all the relevant data is available on a single node of a computer network. In the emerging distributed and networked knowledge environments, databases relevant for computations may reside on a number of nodes connected by a communication network. These data resources cannot be moved to other network sites due to privacy, security, and size considerations. The desired global computation must be decomposed into local computations to match the distribution of data across the network. The capability to decompose computations must be general enough to handle different distributions of data and different participating nodes in each instance of the global computation. In this paper, we present a methodology wherein each distributed data source is represented by an agent. Each such agent has the capability to decompose global computations into local parts, for itself and for agents at other sites. The global computation is then performed by the agent either exchanging some minimal summaries with other agents or travelling to all the sites and performing local tasks that can be done at each local site. The objective is to perform global tasks with a minimum of communication or travel by participating agents across the network

    An evaluation between Bloom Filter join and PERF join in Distributed Query Processing

    Get PDF
    Nowadays, with the explosion of information and the telecommunication era\u27s coming, more and more huge applications encourage decentralization of data while accessing data from different sites [HFB00]. The process of retrieving data from different sites called Distributed Query Processing. The objective of distributed query optimization is to find the most cost-effective of executing query across the network [OV99]. Semijoin [BC81] [BG+81] is known as an effective operator to eliminate the tuples of a relation which are not contributive to a query. 2-way semijoin [KR87] is an extended version of semijoin which not only performs forward reduction like traditional semijoin does, but also provides backward reduction always in cost-effective way. Bloom Filter[B70] and PERF [LR95] are 2 filter based techniques which use a bit vector to represent of the original join attributes projection during the data transmission. Compare with generating a bit array with hash function in bloom filter, Perf join is based on the tuples scan order to avoid losing information caused by hash collision. In the thesis, we will apply both bloom filter and pert on 2-way semijoin algorithms to reduce transmission cost of distributed queries. Performance of propose algorithms will compare against each others and IFS (Initial Feasible Solution) through amount of experiments. \u27Keywords:\u27 Distributed Query Processing, Semijoin, Bloom Filter, Perf Join

    A bloom-filter strategy for response time reduction in distributed query processing.

    Get PDF
    In distributed database systems, query optimization is to find strategies attempt to minimize the amount of data transmitted over the network. Optimization algorithms have an important impact on the performance of distributed query processing. Since optimal query processing in distributed database systems has been shown to be NP-Hard [WC96], heuristics are applied to find a cost-effective and efficient (but suboptimal) processing strategy. Many query optimization strategies have been proposed to minimize either the total cost or the response time. The approaches in distributed query processing have mainly focused on the use of joins, semijoins, and filters. In this thesis, we propose a new reduction strategy based on bloom-filters to significantly reduce the response time of a distributed query. This algorithm can process general queries consisting of an arbitrary number of relations and join attributes. The performance of the algorithm with respect to response time is compared against the Initial Feasible Solution (IFS). An amount of experimental results has been used to evaluate the performance of our algorithm. Compared to the IFS, our algorithm provides a significantly improved query solution. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2003 .G36. Source: Masters Abstracts International, Volume: 43-05, page: 1749. Thesis (M.Sc.)--University of Windsor (Canada), 2003

    Reduction of collisions in Bloom filters during distributed query optimization.

    Get PDF
    The goal of distributed query optimization is to find the optimal strategy for the execution of a given query. The approaches in distributed query processing have mainly focused on the use of joins, semijoins, and filters. Semijoins have the advantage over joins in that there are no increases in data sizes. However, a semijoin needs more local processing such as projection and higher data transmission. To improve the distributed query processing, the filter-based approach is utilized. One of the limitations of this approach is collisions. We investigate how collisions affect the performance of the algorithm and how performance can be improved given those collisions. Our proposed algorithm utilizes two sets of filters to reduce the collisions, so the performance has been improved when collisions exist. Our proposed algorithm is evaluated objectively by comparison to a full reducer which is the algorithm that fully reduces all relations involved in a query by eliminating all non-participating tuples from the relations. The results of the evaluation show that: (1) With a perfect hash function, on average, our algorithm eliminates 97.41% of the unneeded data and fully reduces the relations of over 70% of the queries. (2) Using a single set of filters with specific percentages of collisions, on average, less than half of a queries are fully reduced by the algorithm. Therefore, the collisions substantially affects the performance. (3) Using two sets of filters, On average, our algorithm eliminates 95% of noncontributive tuples and achieves over 60% full reduction. In conclusion, our improved algorithm utilizes the two sets of filters to reduce the effects of collisions substantially. Therefore, we improve the performance of our algorithm under the assumption of collisions which is the major problem in using Bloom filters during distributed query optimization.Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis1999 .L53. Source: Masters Abstracts International, Volume: 39-02, page: 0528. Adviser: Joan Morrissey. Thesis (M.Sc.)--University of Windsor (Canada), 1999

    Compressed positionally encoded record filters in distributed query processing.

    Get PDF
    Different from a centralized database system, distributed query processing involves data transmission among distributed sites, which makes reducing transmission cost a major goal for distributed query optimization. A Positionally Encoded Record Filter (PERF) has attracted research attention as a cost-effective operator to reduce transmission cost. A PERF is a bit array generated by relation tuple scan order instead of hashing, so that it inherits the same compact size benefit as a Bloom filter while suffering no loss of join information caused by hash collisions. Our proposed algorithm PERF_C (Compressed PERF) further reduces the transmission cost in algorithm PERF by compressing both the join attributes and the corresponding PERF filters using arithmetic coding. We prove by time complexity analysis that compression is more efficient than sorting, which was proposed by earlier research to remove duplicates in algorithm PERF. Through the experiments on our synthetic testbed with 36 types of distributed queries, algorithm PERF_C effectively reduces the transmission cost with a cost reduction ratio of 62%--77% over IFS. And PERF_C outperforms PERF with a gain of 16%--36% in cost reduction ratio. A new metric to measure the compression speed in bits per second, compression bps , is defined as a guideline to decide when compression is beneficial. When compression overhead is considered, compression is beneficial only if compression bps is faster than data transfer speed. Tested on both randomly generated and specially designed distributed queries, number of join attributes, size of join attributes and relations, level of duplications are identified to be critical database factors affecting compression. Tested under three typical real computing platforms, compression bps is measured over a wide range of data size and falls in the range from 4M b/s to 9M b/s. Compared to the present relatively slow data transfer rate over Internet, compression is found to be an effective means of reducing transmission cost in distributed query processing. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2004 .Z565. Source: Masters Abstracts International, Volume: 43-01, page: 0249. Adviser: J. Morrissey. Thesis (M.Sc.)--University of Windsor (Canada), 2004

    Implementation of composite semijoins using a variation of Bloom filters.

    Get PDF
    Different from a centralized database system, distributed query processing involves data transmission among different sites and this communication cost is a dominant factor compared to local processing cost. So, the objective of distributed query optimization is to find strategies to minimize the amount of data transmitted over the network. Since optimal query processing in distributed database systems has been shown to be an NP-hard problem, heuristics are applied to find a near-optimal processing strategy. Previous research has mainly focused on the use of joins, semijoins, and hash semijoins (Bloom filters). The semijoin is a commonly recognized operator, which provides efficient query results. As a variation of semijoin, the composite semijoin is beneficial to do semijoins as one composite rather than as multiple single column semijoins. The Hash semijoin (which uses a Bloom filter) is used to minimize the cost of a semijoin operation. This thesis report provides a summary of each category of query processing techniques and optimization algorithms. Also in this thesis, we propose a new algorithm called Composite Semijoin Filter by combining the idea of composite semijoins, Bloom filters and PERF joins. One of the advantages of this algorithm is to avoid collisions. The algorithm is evaluated and compared with initial feasible solution (IFS) and another filter-based algorithm. It has been shown that the algorithm gives substantial reduction on relations and the total cost.Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2004 .Z58. Source: Masters Abstracts International, Volume: 43-01, page: 0249. Adviser: Joan Morrissey. Thesis (M.Sc.)--University of Windsor (Canada), 2004
    corecore