A resource aware distributed LSI algorithm for scalable information retrieval
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Latent Semantic Indexing (LSI) is a popular technique in information retrieval. Unlike traditional information retrieval techniques, LSI does not rely on simple keyword matching; it uses statistical and algebraic computation. Using Singular Value Decomposition (SVD), a high-dimensional matrix is converted to a lower-dimensional approximation, which filters out noise. LSI can also overcome the synonymy and polysemy problems of traditional techniques by examining how terms relate to documents. However, LSI suffers from a scalability problem because of the computational complexity of SVD.
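As a rough illustration of the SVD-based rank reduction described above, the following sketch runs plain, centralized LSI on a tiny made-up term-document matrix using NumPy. It is not the MR-LSI algorithm from the thesis; the function names (`lsi_fit`, `fold_in_query`, `cosine`), the toy matrix, and the choice of k = 2 are all assumptions made for this example.

```python
import numpy as np

def lsi_fit(term_doc, k):
    """Return the rank-k SVD factors (U_k, s_k, Vt_k) of a term-document matrix."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in_query(query_vec, U_k, s_k):
    """Project a term-space query vector into the k-dimensional latent space."""
    return (query_vec @ U_k) / s_k

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy data: rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 1, 0],   # car
    [0, 1, 1, 0],   # automobile
    [1, 1, 1, 0],   # engine
    [0, 0, 0, 1],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)

U_k, s_k, Vt_k = lsi_fit(A, k=2)
A_k = U_k @ np.diag(s_k) @ Vt_k      # rank-2 approximation of A
doc_vecs = Vt_k.T                    # one latent vector per document

# Query containing only the term "car".
q = np.array([1, 0, 0, 0, 0], dtype=float)
q_hat = fold_in_query(q, U_k, s_k)

for j, d in enumerate(doc_vecs):
    print(f"doc {j}: similarity to 'car' = {cosine(q_hat, d):.3f}")
```

In this toy case, document 1 never contains the word "car" but shares "engine" with documents that do, so it still scores high in the latent space, while the flower document does not. That is the synonymy effect the abstract refers to.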
This thesis presents MR-LSI, a resource-aware distributed LSI algorithm that addresses the scalability issue using the Hadoop framework, which is based on the MapReduce distributed computing model. It also addresses the overhead caused by the clustering algorithm involved. The evaluations indicate that MR-LSI performs significantly better than the other strategies when processing large document collections. One notable advantage of Hadoop is that it supports heterogeneous computing environments, which highlights the problem of unbalanced load among nodes. Therefore, a genetic-algorithm-based load balancing algorithm for static environments is proposed. The results show that it can improve the performance of a cluster according to its heterogeneity level.
For dynamic Hadoop environments, a dynamic load balancing strategy with a varying window size is proposed. The algorithm relies on data selection decisions and on modeling Hadoop parameters and working mechanisms. Using an improved genetic algorithm to obtain an optimized scheduler, the algorithm enhances the performance of a cluster with certain heterogeneity levels.
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduce Framework
While many existing formal concept analysis algorithms are efficient, they
are typically unsuitable for distributed implementation. Taking the MapReduce
(MR) framework as our inspiration we introduce a distributed approach for
performing formal concept mining. Our method has its novelty in that we use a
light-weight MapReduce runtime called Twister which is better suited to
iterative algorithms than recent distributed approaches. First, we describe the
theoretical foundations underpinning our distributed formal concept analysis
approach. Second, we provide a representative exemplar of how a classic
centralized algorithm can be implemented in a distributed fashion using our
methodology: we modify Ganter's classic algorithm by introducing a family of
MR* algorithms, namely MRGanter and MRGanter+ where the prefix denotes the
algorithm's lineage. To evaluate the factors that impact distributed algorithm
performance, we compare our MR* algorithms with the state-of-the-art.
Experiments conducted on real datasets demonstrate that MRGanter+ is efficient,
scalable and an appealing algorithm for distributed problems. Comment: 17 pages, ICFCA 201, Formal Concept Analysis 201
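For readers unfamiliar with the centralized starting point, here is a minimal sketch of Ganter's classic NextClosure procedure, which enumerates all concept intents of a formal context in lectic order. It does not reproduce MRGanter or MRGanter+; the toy context and the names `closure`, `next_closure`, and `all_intents` are assumptions made for this example.

```python
# Hypothetical formal context: object -> set of attribute indices.
# Attributes: 0 = a, 1 = b, 2 = c.
CONTEXT = {
    "g1": frozenset({0, 1}),
    "g2": frozenset({1, 2}),
    "g3": frozenset({2}),
}
N_ATTRS = 3

def closure(attrs):
    """Return attrs'': the attributes shared by every object that has all of attrs."""
    intents = [has for has in CONTEXT.values() if attrs <= has]
    if not intents:
        # No object has all of attrs, so the closure is the full attribute set.
        return frozenset(range(N_ATTRS))
    return frozenset.intersection(*intents)

def next_closure(current):
    """Return the lectically next closed set after `current`, or None if it is the last."""
    a = set(current)
    for i in reversed(range(N_ATTRS)):
        if i in a:
            a.discard(i)
        else:
            b = closure(frozenset(a | {i}))
            # Accept b only if it adds no attribute smaller than i.
            if all(j >= i for j in b - a):
                return b
    return None

def all_intents():
    """Enumerate every concept intent of CONTEXT in lectic order."""
    intents = []
    current = closure(frozenset())
    while current is not None:
        intents.append(current)
        current = next_closure(current)
    return intents

for intent in all_intents():
    print(sorted(intent))
```

Each call to `next_closure` depends on the previous intent, which makes the computation inherently iterative and helps explain why the abstract favours a runtime suited to iterative algorithms.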
Real-time detection of moving crowds using spatio-temporal data streams
Over the last decade we have seen a tremendous change in Location Based Services. From primitive reactive applications, explicitly invoked by users, they have evolved into complex proactive systems that can automatically provide information based on context and user location. This was driven by the rapid development of outdoor and indoor positioning technologies. GPS modules, now included in almost every device, together with indoor technologies based on WiFi fingerprinting or Bluetooth beacons, make it possible to determine the user's location almost anywhere and at any time. This has also led to an enormous growth of spatio-temporal data.
Although current Location Based Services are very efficient in the user-centric, single-target case, they remain quite primitive at multi-target knowledge extraction. This is rather surprising given the available data and current processing technologies. On the one hand, discovering useful information from the locations of multiple objects is limited by legal issues related to privacy and data ownership. On the other hand, mining group location data over time is not a trivial task and requires special algorithms and technologies to be effective.
Recent developments in data processing have led to a major shift from offline batch processing engines, like MapReduce, to real-time distributed streaming frameworks, like Apache Flink or Apache Spark, which can process huge amounts of data, including spatio-temporal data streams.
This thesis presents a system for detecting and analyzing crowds in a continuous spatio-temporal data stream. The aim of the system is to provide relevant knowledge for proactive LBS. The motivation comes from the constant growth of spatio-temporal data and the recent rapid development of technologies for processing such data.
Efficient Processing of k Nearest Neighbor Joins using MapReduce
k nearest neighbor join (kNN join), designed to find k nearest neighbors from
a dataset S for every object in another dataset R, is a primitive operation
widely adopted by many data mining applications. As a combination of the k
nearest neighbor query and the join operation, kNN join is an expensive
operation. Given the increasing volume of data, it is difficult to perform a
kNN join on a centralized machine efficiently. In this paper, we investigate
how to perform kNN join using MapReduce which is a well-accepted framework for
data-intensive applications over clusters of computers. In brief, the mappers
cluster objects into groups; the reducers perform the kNN join on each group of
objects separately. We design an effective mapping mechanism that exploits
pruning rules for distance filtering, and hence reduces both the shuffling and
computational costs. To reduce the shuffling cost, we propose two approximate
algorithms to minimize the number of replicas. Extensive experiments on our
in-house cluster demonstrate that our proposed methods are efficient, robust
and scalable. Comment: VLDB201
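The map/group/reduce structure outlined in this abstract can be sketched in a single process as follows. This is only an illustration under strong assumptions: R objects are grouped by their nearest pivot, and every S object is naively replicated to every group, so the result is exact but none of the paper's pruning rules or replica-minimizing algorithms are reproduced. The pivots, the data, and all function names are hypothetical.

```python
import heapq
import math
from collections import defaultdict

def nearest_pivot(point, pivots):
    """Index of the pivot closest to `point` (Euclidean distance)."""
    return min(range(len(pivots)), key=lambda i: math.dist(point, pivots[i]))

def map_phase(R, S, pivots):
    """Emit (group_id, (tag, point)) pairs, as a mapper would."""
    for r in R:
        yield nearest_pivot(r, pivots), ("R", r)
    for s in S:
        # Assumption: naive full replication of S; a pruning rule would
        # send each S object only to the groups that might need it.
        for g in range(len(pivots)):
            yield g, ("S", s)

def shuffle(pairs):
    """Collect mapper output by group id, standing in for the shuffle step."""
    groups = defaultdict(lambda: {"R": [], "S": []})
    for g, (tag, point) in pairs:
        groups[g][tag].append(point)
    return groups

def reduce_phase(group, k):
    """For each R object in the group, find its k nearest S objects."""
    for r in group["R"]:
        yield r, heapq.nsmallest(k, group["S"], key=lambda s: math.dist(r, s))

def knn_join(R, S, pivots, k):
    """Run map, shuffle, and reduce in sequence and merge the per-group results."""
    result = {}
    for group in shuffle(map_phase(R, S, pivots)).values():
        result.update(reduce_phase(group, k))
    return result

# Hypothetical toy data: 2-D points as tuples.
R = [(0, 0), (10, 10)]
S = [(1, 0), (0, 2), (9, 9), (10, 12), (5, 5)]
pivots = [(0, 0), (10, 10)]

joined = knn_join(R, S, pivots, k=2)
for r, neighbours in joined.items():
    print(r, "->", neighbours)
```

With full replication, the shuffle volume grows with the number of groups times the size of S, which is the cost the abstract says its distance-based pruning and approximate algorithms aim to cut.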