151 research outputs found
DENCAST: distributed density-based clustering for multi-target regression
Recent developments in sensor networks and mobile computing led to a huge increase in data generated that need to be processed and analyzed efficiently. In this context, many distributed data mining algorithms have recently been proposed. Following this line of research, we propose the DENCAST system, a novel distributed algorithm implemented in Apache Spark, which performs density-based clustering and exploits the identified clusters to solve both single- and multi-target regression tasks (and thus, solves complex tasks such as time series prediction). Contrary to existing distributed methods, DENCAST does not require a final merging step (usually performed on a single machine) and is able to handle large-scale, high-dimensional data by taking advantage of locality sensitive hashing. Experiments show that DENCAST performs clustering more efficiently than a state-of-the-art distributed clustering algorithm, especially when the number of objects increases significantly. The quality of the extracted clusters is confirmed by the predictive capabilities of DENCAST on several datasets: It is able to significantly outperform (p-value ) state-of-the-art distributed regression methods, in both single and multi-target settings
Doctor of Philosophy
dissertationData-driven analytics has been successfully utilized in many experience-oriented areas, such as education, business, and medicine. With the profusion of traffic-related data from Internet of Things and development of data mining techniques, data-driven analytics is becoming increasingly popular in the transportation industry. The objective of this research is to explore the application of data-driven analytics in transportation research to improve the traffic management and operations. Three problems in the respective areas of transportation planning, traffic operation, and maintenance management have been addressed in this research, including exploring the impact of dynamic ridesharing system in a multimodal network, quantifying non-recurrent congestion impact on freeway corridors, and developing infrastructure sampling method for efficient maintenance activities. First, the impact of dynamic ridesharing in a multimodal network is studied with agent-based modeling. The competing mechanism between dynamic ridesharing system and public transit is analyzed. The model simulates the interaction between travelers and the environment and emulates travelers' decision making process with the presence of competing modes. The model is applicable to networks with varying demographics. Second, a systematic approach is proposed to quantify Incident-Induced Delay on freeway corridors. There are two particular highlights in the study of non-recurrent congestion quantification: secondary incident identification and K-Nearest Neighbor pattern matching. The proposed methodology is easily transferable to any traffic operation system that has access to sensor data at a corridor level. Lastly, a high-dimensional clustering-based stratified sampling method is developed for infrastructure sampling. The stratification process consists of two components: current condition estimation and high-dimensional cluster analysis. High-dimensional cluster analysis employs Locality-Sensitive Hashing algorithm and spectral sampling. The proposed method is a potentially useful tool for agencies to effectively conduct infrastructure inspection and can be easily adopted for choosing samples containing multiple features. These three examples showcase the application of data-driven analytics in transportation research, which can potentially transform the traffic management mindset into a model of data-driven, sensing, and smart urban systems. The analytic
Service Abstractions for Scalable Deep Learning Inference at the Edge
Deep learning driven intelligent edge has already become a reality, where millions of mobile, wearable, and IoT devices analyze real-time data and transform those into actionable insights on-device. Typical approaches for optimizing deep learning inference mostly focus on accelerating the execution of individual inference tasks, without considering the contextual correlation unique to edge environments and the statistical nature of learning-based computation. Specifically, they treat inference workloads as individual black boxes and apply canonical system optimization techniques, developed over the last few decades, to handle them as yet another type of computation-intensive applications. As a result, deep learning inference on edge devices still face the ever increasing challenges of customization to edge device heterogeneity, fuzzy computation redundancy between inference tasks, and end-to-end deployment at scale. In this thesis, we propose the first framework that automates and scales the end-to-end process of deploying efficient deep learning inference from the cloud to heterogeneous edge devices. The framework consists of a series of service abstractions that handle DNN model tailoring, model indexing and query, and computation reuse for runtime inference respectively. Together, these services bridge the gap between deep learning training and inference, eliminate computation redundancy during inference execution, and further lower the barrier for deep learning algorithm and system co-optimization. To build efficient and scalable services, we take a unique algorithmic approach of harnessing the semantic correlation between the learning-based computation. Rather than viewing individual tasks as isolated black boxes, we optimize them collectively in a white box approach, proposing primitives to formulate the semantics of the deep learning workloads, algorithms to assess their hidden correlation (in terms of the input data, the neural network models, and the deployment trials) and merge common processing steps to minimize redundancy
Distributed multi-label learning on Apache Spark
This thesis proposes a series of multi-label learning algorithms for classification and feature selection implemented on the Apache Spark distributed computing model. Five approaches for determining the optimal architecture to speed up multi-label learning methods are presented. These approaches range from local parallelization using threads to distributed computing using independent or shared memory spaces. It is shown that the optimal approach performs hundreds of times faster than the baseline method. Three distributed multi-label k nearest neighbors methods built on top of the Spark architecture are proposed: an exact iterative method that computes pair-wise distances, an approximate tree-based method that indexes the instances across multiple nodes, and an approximate local sensitive hashing method that builds multiple hash tables to index the data. The results indicated that the predictions of the tree-based method are on par with those of an exact method while reducing the execution times in all the scenarios. The aforementioned method is then used to evaluate the quality of a selected feature subset. The optimal adaptation for a multi-label feature selection criterion is discussed and two distributed feature selection methods for multi-label problems are proposed: a method that selects the feature subset that maximizes the Euclidean norm of individual information measures, and a method that selects the subset of features maximizing the geometric mean. The results indicate that each method excels in different scenarios depending on type of features and the number of labels. Rigorous experimental studies and statistical analyses over many multi-label metrics and datasets confirm that the proposals achieve better performances and provide better scalability to bigger data than the methods compared in the state of the art
Quality of Service Aware Data Stream Processing for Highly Dynamic and Scalable Applications
Huge amounts of georeferenced data streams are arriving daily to data stream management systems that are deployed for serving highly scalable and dynamic applications. There are innumerable ways at which those loads can be exploited to gain deep insights in various domains. Decision makers require an interactive visualization of such data in the form of maps and dashboards for decision making and strategic planning. Data streams normally exhibit fluctuation and oscillation in arrival rates and skewness. Those are the two predominant factors that greatly impact the overall quality of service. This requires data stream management systems to be attuned to those factors in addition to the spatial shape of the data that may exaggerate the negative impact of those factors. Current systems do not natively support services with quality guarantees for dynamic scenarios, leaving the handling of those logistics to the user which is challenging and cumbersome. Three workloads are predominant for any data stream, batch processing, scalable storage and stream processing. In this thesis, we have designed a quality of service aware system, SpatialDSMS, that constitutes several subsystems that are covering those loads and any mixed load that results from intermixing them. Most importantly, we natively have incorporated quality of service optimizations for processing avalanches of geo-referenced data streams in highly dynamic application scenarios. This has been achieved transparently on top of the codebases of emerging de facto standard best-in-class representatives, thus relieving the overburdened shoulders of the users in the presentation layer from having to reason about those services. Instead, users express their queries with quality goals and our system optimizers compiles that down into query plans with an embedded quality guarantee and leaves logistic handling to the underlying layers. We have developed standard compliant prototypes for all the subsystems that constitutes SpatialDSMS
Recommended from our members
MapReduce network enabled algorithms for classification based on association rules
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.There is growing evidence that integrating classification and association rule mining can produce more efficient and accurate classifiers than traditional techniques. This thesis introduces a new MapReduce based association rule miner for extracting strong rules from large datasets. This miner is used later to develop a new large scale classifier. Also new MapReduce simulator was developed to evaluate the scalability of proposed algorithms on MapReduce clusters.
The developed associative rule miner inherits the MapReduce scalability to huge datasets and to thousands of processing nodes. For finding frequent itemsets, it uses hybrid approach between miners that uses counting methods on horizontal datasets, and miners that use set intersections on datasets of vertical formats. The new miner generates same rules that usually generated using apriori-like algorithms because it uses the same confidence and support thresholds definitions.
In the last few years, a number of associative classification algorithms have been proposed, i.e. CPAR, CMAR, MCAR, MMAC and others. This thesis also introduces a new MapReduce classifier that based MapReduce associative rule mining. This algorithm employs different approaches in rule discovery, rule ranking, rule pruning, rule prediction and rule evaluation methods. The new classifier works on multi-class datasets and is able to produce multi-label predications with probabilities for each predicted label. To evaluate the classifier 20 different datasets from the UCI data collection were used. Results show that the proposed approach is an accurate and effective classification technique, highly competitive and scalable if compared with other traditional and associative classification approaches.
Also a MapReduce simulator was developed to measure the scalability of MapReduce based applications easily and quickly, and to captures the behaviour of algorithms on cluster environments. This also allows optimizing the configurations of MapReduce clusters to get better execution times and hardware utilization
- …