
    Exploratory Cluster Analysis from Ubiquitous Data Streams using Self-Organizing Maps

    This thesis addresses the use of Self-Organizing Maps (SOM) for exploratory cluster analysis over ubiquitous data streams, where two complementary problems arise: first, to generate (local) SOM models over potentially unbounded, multi-dimensional, non-stationary data streams; second, to extend these capabilities to ubiquitous environments. Original contributions are made towards both problems, in terms of algorithms and methodologies. Two different methods are proposed for the first problem. By focusing on visual knowledge discovery, these methods fill an existing gap in the panorama of current methods for cluster analysis over data streams. Moreover, the original SOM capabilities of clustering both observations and features are transposed to data streams, making these contributions versatile compared to existing methods, which target a single clustering problem. For the second problem, additional methodologies that tackle the ubiquitous aspect of data streams are proposed, allowing distributed and collaborative learning strategies. Experimental evaluations attest to the effectiveness of the proposed methods, and real-world applications are exemplified, namely electric consumption data, air quality monitoring networks and financial data, motivating their practical use. This research is the first to clearly address the use of the SOM for ubiquitous data streams and opens several research opportunities for the future.
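    For context, the classical online SOM update rule that underlies such models can be sketched in a few lines; this is a generic illustration and does not reproduce the streaming or distributed extensions described in the thesis (map size, learning rate and neighbourhood radius below are arbitrary toy values).

```python
import numpy as np

def som_online_update(weights, x, lr, sigma):
    """One online SOM update for a single observation x.

    weights: (rows, cols, dim) codebook; x: (dim,) observation.
    lr: learning rate; sigma: neighbourhood radius in grid units.
    """
    rows, cols, _ = weights.shape
    # Best-matching unit: grid cell whose codebook vector is closest to x.
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), (rows, cols))
    # Gaussian neighbourhood on the grid, centred at the BMU.
    gy, gx = np.mgrid[0:rows, 0:cols]
    grid_d2 = (gy - bmu[0]) ** 2 + (gx - bmu[1]) ** 2
    h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
    # Pull every unit towards x, weighted by its neighbourhood coefficient.
    weights += lr * h[:, :, None] * (x - weights)
    return weights

# Toy usage: feed observations one at a time into a 10x10 map.
rng = np.random.default_rng(0)
W = rng.random((10, 10, 3))
for x in rng.random((1000, 3)):
    W = som_online_update(W, x, lr=0.1, sigma=1.5)
```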

    Doctor of Philosophy

    Interactive editing and manipulation of digital media is a fundamental component of digital content creation. One medium in particular, digital imagery, has seen a recent increase in the popularity of its large, or even massive, image formats. Unfortunately, current systems and techniques are rarely concerned with scalability or usability for these large images. Moreover, processing massive (or even large) imagery is assumed to be an off-line, automatic process, although many problems associated with these datasets require human intervention for high-quality results. This dissertation details how to design interactive image techniques that scale. In particular, massive imagery is typically constructed as a seamless mosaic of many smaller images. The focus of this work is the creation of new technologies to enable user interaction in the formation of these large mosaics. While an interactive system for all stages of the mosaic creation pipeline is a long-term research goal, this dissertation concentrates on the last phase of the pipeline: the composition of registered images into a seamless composite. The work detailed in this dissertation provides the technologies to fully realize interactive editing in mosaic composition on image collections ranging from the very small to the massive in scale.

    Data Management for Dynamic Multimedia Analytics and Retrieval

    Multimedia data in its various manifestations poses a unique challenge from a data storage and data management perspective, especially when search, analysis and analytics in large data corpora are considered. The inherently unstructured nature of the data itself and the curse of dimensionality that afflicts the representations we typically work with in its stead cause a broad range of issues that require sophisticated solutions at different levels. This has given rise to a large body of research focused on techniques that allow for effective and efficient multimedia search and exploration. Many of these contributions have led to an array of purpose-built multimedia search systems. However, recent progress in multimedia analytics and interactive multimedia retrieval has demonstrated that several of the assumptions usually made for such multimedia search workloads do not hold once a session has a human user in the loop. Firstly, many of the required query operations cannot be expressed by mere similarity search, and since the concrete requirement cannot always be anticipated, a flexible and adaptable data management and query framework is needed. Secondly, the widespread assumption that data collections are static does not hold for analytics workloads, whose purpose is to produce and store new insights and information. And finally, it is impossible even for an expert user to specify exactly how a data management system should produce and arrive at the desired outcomes of the potentially many different queries. Guided by these shortcomings, and motivated by the fact that similar questions have already been answered for structured data in classical database research, this thesis presents three contributions that seek to mitigate the aforementioned issues. We present a query model that generalises the notion of proximity-based query operations and formalises the connection between those queries and high-dimensional indexing. We complement this with a cost model that makes the often implicit trade-off between query execution speed and result quality transparent to the system and the user. And we describe a model for the transactional and durable maintenance of high-dimensional index structures. All contributions are implemented in the open-source multimedia database system Cottontail DB, on top of which we present an evaluation that demonstrates the effectiveness of the proposed models. We conclude by discussing avenues for future research in the quest to converge the fields of databases on the one hand and (interactive) multimedia retrieval and analytics on the other.
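    The proximity-based query operation that the query model generalises is, at its simplest, a k-nearest-neighbour search over feature vectors. The sketch below is generic Python, not Cottontail DB's API; the distance choices and names are illustrative only.

```python
import numpy as np

def knn_query(vectors, query, k, distance="l2"):
    """Brute-force proximity query: return indices and distances of the k
    vectors closest to `query`. This is the basic similarity-search primitive
    that richer query models build on (and that indexes approximate)."""
    if distance == "l2":
        d = np.linalg.norm(vectors - query, axis=1)
    elif distance == "cosine":
        d = 1.0 - (vectors @ query) / (
            np.linalg.norm(vectors, axis=1) * np.linalg.norm(query) + 1e-12)
    else:
        raise ValueError(f"unsupported distance: {distance}")
    idx = np.argsort(d)[:k]
    return idx, d[idx]

# Toy usage with random 128-dimensional feature vectors.
rng = np.random.default_rng(1)
corpus = rng.random((10_000, 128)).astype(np.float32)
q = rng.random(128).astype(np.float32)
print(knn_query(corpus, q, k=5))
```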

    Parallel implementation of maximum parsimony search algorithm on multicore CPUs

    Phylogenetics is the study of the evolutionary relationships among species. The term derives from the ancient Greek words phylon, meaning "race", and genetikos, meaning "relative to birth". An important methodology in phylogenetics is the cladistics methodology (parsimony) applied to the study of taxonomic classification. Modern study includes as source data aspects of molecular biology, such as the DNA sequences of homologous (orthologous) genes. The algorithms used attempt to reconstruct evolutionary relationships in the form of phylogenetic trees, based on the available morphological data, behavioral data, and usually DNA sequence data (Fitch, 1971). The topic of this thesis is the parallel implementation of an existing algorithm, Maximum Parsimony, a search for the guaranteed optimal tree(s) based on the fewest number of mutations required for tree construction. The running time grows linearly with the DNA sequence length and combinatorially with the number of organisms studied (Felsenstein, 1978), so the algorithm may take hours to complete. The limitation of current implementations such as PAUP is that they are restricted to a single CPU core, even if eight are available; this parallel implementation may use as many cores as are available. The method of research is to replicate the accuracy of existing serial software, parallelize the algorithm across many cores without losing accuracy, optimize by various methods, and then attempt to port to other hardware architectures. Some time is spent on the implementation of the algorithms on GPUs and clusters. The results are that, while this implementation matches the accuracy of the current standard and speeds up in parallel, it does not presently match the speed of PAUP, for reasons yet to be determined.
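    The scoring step inside a Maximum Parsimony search is Fitch's small-parsimony count (Fitch, 1971), which the abstract cites: for a fixed tree, it counts the minimum number of mutations a single character requires. The full search then repeats this over many candidate topologies. A minimal sketch, with illustrative function and variable names not taken from the thesis:

```python
def fitch_site_score(tree, leaf_states):
    """Fitch (1971) small-parsimony count for one character on a fixed rooted
    binary tree. `tree` maps each internal node to its two children; leaves
    appear only as children. `leaf_states` maps each leaf to a set of observed
    states, e.g. {"A"}. Returns (state set at the root, number of mutations).
    Assumes the first key of `tree` is the root."""
    mutations = 0

    def visit(node):
        nonlocal mutations
        if node in leaf_states:            # leaf: its observed state set
            return set(leaf_states[node])
        left, right = tree[node]
        s_left, s_right = visit(left), visit(right)
        common = s_left & s_right
        if common:                         # children agree: no extra mutation
            return common
        mutations += 1                     # children disagree: one mutation
        return s_left | s_right

    root = next(iter(tree))
    return visit(root), mutations

# Toy usage: the tree ((t1,t2),(t3,t4)) with states A, C, A, G needs 2 mutations.
tree = {"root": ("X", "Y"), "X": ("t1", "t2"), "Y": ("t3", "t4")}
leaves = {"t1": {"A"}, "t2": {"C"}, "t3": {"A"}, "t4": {"G"}}
print(fitch_site_score(tree, leaves))      # ({'A'}, 2)
```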

    Solar rotation speed by detecting and tracking of Coronal Bright Points

    Coronal Bright Points (CBPs) are one of many solar manifestations that provide scientists with evidence of solar activity and are usually recognized as small light dots, like scattered jewels. For many years these bright points were overlooked in favour of another element of solar activity, sunspots, which drew scientists' full attention mainly because they were easier to detect. Nevertheless, thanks to their even distribution across all latitudes, CBPs provide better tracers for studying the rotation of the solar corona. A literature review on CBP detection and tracking unveiled limitations both in detection accuracy and in the lack of automated image processing. The purpose of this dissertation was to present an alternative method for detecting CBPs using advanced image processing techniques and to provide automatic recognition software. The proposed methodology is divided into pre-processing methods, a segmentation section, post-processing, and a data evaluation approach to increase the CBP detection efficiency. As identified by the study of the available data, pre-processing transformations were needed to ensure each image met certain specifications for subsequent detection. The detection process includes a gradient-based segmentation algorithm, previously developed for retinal image analysis, which is here successfully applied to the CBP case study. The outcome is the list of CBPs obtained by the detection algorithm, which is then filtered and evaluated to remove false positives. To validate the proposed methodology, CBPs need to be tracked over time to obtain the rotation of the solar corona. Therefore, the images used in this study were taken at the 19.3 nm wavelength by the AIA 193 instrument on board the Solar Dynamics Observatory (SDO) satellite over 3 days during August 2010. These images allowed the perception of how the angular rotation velocity of CBPs depends not only on heliographic latitude, but also on other factors such as time. From the results obtained, it was clear that the proposed methodology is effective for detecting and tracking CBPs, providing a consistent method for their detection.
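    The dissertation's gradient-based segmentation, adapted from retinal image analysis, is not reproduced here; the toy sketch below only illustrates the general idea of thresholding gradient magnitude and grouping candidate bright regions. All thresholds, parameters and names are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def detect_bright_candidates(image, grad_percentile=98, min_area=4):
    """Very simplified gradient-based detection of small bright regions.

    image: 2-D intensity array (e.g. a pre-processed solar EUV frame).
    Pixels whose gradient magnitude exceeds a high percentile are grouped
    into connected components; sufficiently large components are returned
    as candidate bright-point centroids (row, col)."""
    gy, gx = np.gradient(image.astype(float))
    grad_mag = np.hypot(gx, gy)
    mask = grad_mag > np.percentile(grad_mag, grad_percentile)
    labels, n = ndimage.label(mask)
    centroids = []
    for i in range(1, n + 1):
        ys, xs = np.nonzero(labels == i)
        if ys.size >= min_area:
            centroids.append((ys.mean(), xs.mean()))
    return centroids

# Toy usage on a synthetic frame containing two small bright blobs.
yy, xx = np.mgrid[0:200, 0:200]
frame = np.exp(-((yy - 50) ** 2 + (xx - 60) ** 2) / 20.0) \
      + np.exp(-((yy - 140) ** 2 + (xx - 150) ** 2) / 20.0)
print(detect_bright_candidates(frame))
```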

    Fast Parallel Machine Learning Algorithms for Large Datasets Using Graphic Processing Unit

    This dissertation deals with developing parallel processing algorithms for the Graphics Processing Unit (GPU) in order to solve machine learning problems for large datasets. In particular, it contributes to the development of fast GPU-based algorithms for calculating distance (i.e. similarity, affinity, closeness) matrices. It also presents the algorithm and implementation of a fast parallel Support Vector Machine (SVM) using the GPU. These tools are developed using the Compute Unified Device Architecture (CUDA), a popular software framework for General Purpose Computing on GPU (GPGPU). Distance calculation is a core part of many machine learning algorithms, because the closer a query is to some samples (i.e. observations, records, entries), the more likely the query belongs to the class of those samples. K-Nearest Neighbors Search (k-NNS) is a popular and powerful distance-based tool for solving classification problems and is a prerequisite for training local-model-based classifiers. Fast distance calculation can significantly improve the speed of these classifiers, and GPUs can be very handy for accelerating them. Several GPU-based sorting algorithms are also included to sort the distance matrix and find the k nearest neighbors; their speed varies depending on the input sequences. The GPUKNN tool proposed in this dissertation utilizes the GPU-based distance computation algorithm and automatically picks the most suitable sorting algorithm according to the characteristics of the input datasets. Every machine learning tool has its own pros and cons. The advantage of the SVM is its high classification accuracy, which makes it possibly the best classification tool; however, as with many other machine learning algorithms, the SVM's training phase slows down as the size of the input dataset increases. The GPU version of parallel SVM based on parallel Sequential Minimal Optimization (SMO) implemented in this dissertation is proposed to reduce the time cost of both the training and predicting phases. This implementation of GPUSVM is original. It utilizes many parallel processing techniques to accelerate and minimize the kernel evaluations, which are the most time-consuming operations in SVM training. Although the many-core architecture of the GPU performs best with data-level parallelism, multi-task (i.e. task-level) parallelism is also integrated into the application to improve the speed of tasks such as multiclass classification and cross-validation. Furthermore, the procedure of finding the worst violators is distributed across multiple blocks in the CUDA model, which reduces the time cost of each SMO iteration during the training phase. These violators are shared among different tasks in multiclass classification and cross-validation to avoid duplicate kernel computations. The speed results show that both the training and predicting phases achieve speedups ranging from one to three orders of magnitude compared to the state-of-the-art LIBSVM software on some well-known benchmark datasets.
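    As background, distance-matrix computation is usually reformulated so that its dominant cost is a single matrix multiply, which is the part a GPU (via a BLAS/cuBLAS GEMM kernel) accelerates. The NumPy sketch below shows that standard formulation and a k-NN selection on top of it; it is not the dissertation's CUDA code, and all names are illustrative.

```python
import numpy as np

def pairwise_sq_euclidean(A, B):
    """Squared Euclidean distance matrix D[i, j] = ||A[i] - B[j]||^2, computed
    as ||a||^2 + ||b||^2 - 2 a.b so the dominant cost is one matrix multiply --
    the same formulation GPU implementations typically hand to a GEMM kernel."""
    a2 = np.sum(A * A, axis=1)[:, None]        # (n, 1)
    b2 = np.sum(B * B, axis=1)[None, :]        # (1, m)
    D = a2 + b2 - 2.0 * (A @ B.T)
    return np.maximum(D, 0.0)                  # clamp tiny negatives from rounding

def knn_from_distances(D, k):
    """Indices of the k nearest reference points for each query row (unordered)."""
    return np.argpartition(D, k, axis=1)[:, :k]

# Toy usage: 5-NN of 1,000 queries against 10,000 reference samples.
rng = np.random.default_rng(2)
X = rng.random((10_000, 64)).astype(np.float32)
Q = rng.random((1_000, 64)).astype(np.float32)
neighbours = knn_from_distances(pairwise_sq_euclidean(Q, X), k=5)
print(neighbours.shape)                        # (1000, 5)
```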

    Time Series Mining: Shapelet Discovery, Ensembling, and Applications

    Time series are a prominent class of temporal data sequences with the properties of being equally spaced in time, chronologically ordered, and highly dimensional. Time series classification is an important branch of time series mining. Existing time series classifiers operate either on the raw data in the time domain or in an alternate data space such as the shapelet or frequency domains. Combining time series classifiers is another powerful technique used to improve classification accuracy: it has been demonstrated that different classifiers can be experts at predicting different subsets of classes. The challenge lies in learning the expertise of the different base learners. In addition, the high dimensionality of time series data makes it difficult to visualize their distribution. In this thesis we develop new time series ensembling methods to improve predictive performance, and investigate the interpretability of classifiers by leveraging the power of deep learning models and adjusting them to provide visual shapelets as a by-product of the classification task. Finally, we show an application to the problem of predicting solar energetic particle events.
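    For readers unfamiliar with the shapelet domain mentioned above, the core primitive is the distance between a series and a short subsequence (shapelet), taken as the minimum over all sliding windows; these distances become the features a classifier consumes. A minimal sketch with illustrative names, not the thesis's method:

```python
import numpy as np

def shapelet_distance(series, shapelet):
    """Distance between a time series and a candidate shapelet: the minimum
    Euclidean distance over all sliding windows of the shapelet's length."""
    m = len(shapelet)
    best = np.inf
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        best = min(best, np.linalg.norm(window - shapelet))
    return best

def shapelet_transform(dataset, shapelets):
    """Map each series to a feature vector of its distances to each shapelet."""
    return np.array([[shapelet_distance(s, sh) for sh in shapelets]
                     for s in dataset])

# Toy usage: two series distinguished by whether a bump-shaped shapelet appears.
rng = np.random.default_rng(3)
bump = np.exp(-np.linspace(-2, 2, 20) ** 2)
series_a = rng.normal(0, 0.1, 100); series_a[40:60] += bump
series_b = rng.normal(0, 0.1, 100)
print(shapelet_transform([series_a, series_b], [bump]))
```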

    A Hybrid Vision-Map Method for Urban Road Detection


    K-means for massive data

    The K-means algorithm is undoubtedly one of the most popular cluster analysis techniques, due to its ease of implementation, straightforward parallelizability and competitive computational complexity when compared to more sophisticated clustering alternatives. Unfortunately, the progressive growth of the amount of data that needs to be analyzed, in a wide variety of scientific fields, represents a significant challenge for the K-means algorithm, since its time complexity is dominated by the number of distance computations, which is linear with respect to both the number of instances and the dimensionality of the problem. This hinders its scalability on such massive data sets. Another major drawback of the K-means algorithm is its high dependency on the initial conditions, which not only may affect the quality of the obtained solution but may also have a major impact on its computational load; for instance, a poor initialization can lead to an exponential running time in the worst case. In this dissertation we tackle all of these difficulties.

    Initially, we propose an approximation to the K-means problem, the Recursive Partition-based K-means algorithm (RPKM). This approach consists of recursively applying a weighted version of the K-means algorithm over a sequence of spatial partitions of the data set. From one iteration to the next, a more refined partition is constructed and the process is repeated using the optimal set of centroids obtained at the previous iteration as initialization. From a practical standpoint, such a process reduces the computational load of the K-means algorithm, since the number of representatives at each iteration is meant to be much smaller than the number of instances in the data set; moreover, both phases of the algorithm are embarrassingly parallel. From the theoretical standpoint, and regardless of the selected partition strategy, one can guarantee the non-repetition of the clusterings generated at each RPKM iteration, which ultimately implies a reduction of the total number of K-means iterations, as well as leading, in most cases, to a monotone decrease of the overall error function.

    Afterwards, we report on an RPKM-type approach, the Boundary Weighted K-means algorithm (BWKM). For this technique the data set partition is based on an adaptive mesh that adjusts the size of each grid cell to maximize the chances of the cell containing only instances of the same cluster. The goal is to focus most of the computational resources on those regions where it is harder to determine the correct cluster assignment of the original instances, which is the main source of error for our approximation. For such a construction, it can be proved that if all the cells of a spatial partition are well assigned (contain instances of the same cluster) at the end of a BWKM step, then the obtained clustering is actually a fixed point of the K-means algorithm over the entire data set, generated after using only a small number of representatives in comparison to the actual size of the data set. Furthermore, if, for a certain step of BWKM, this property can be verified at consecutive weighted Lloyd's iterations, then the error of our approximation also decreases monotonically. From the practical standpoint, BWKM was compared to the state of the art: K-means++, Forgy K-means, Markov Chain Monte Carlo K-means and Minibatch K-means. The results show that BWKM commonly converged to solutions with a relative error of under 1% with respect to the considered methods, while using a much smaller number of distance computations (up to 7 orders of magnitude fewer).

    Even though the computational cost of BWKM is linear with respect to the dimensionality, its error guarantees are mainly related to the diagonal length of the grid cells, meaning that, as the dimensionality of the problem increases, it becomes harder for BWKM to maintain such a competitive performance. Taking this into consideration, we developed a fully parallelizable feature selection technique intended for the K-means algorithm, the Bounded Dimensional Distributed K-means algorithm (BDDKM). This approach consists of applying any heuristic for the K-means problem over multiple subsets of dimensions (each of which is bounded by a predefined constant, m << d) and using the obtained clusterings to upper-bound the increase in the K-means error when deleting a given feature; we then select the m features with the largest error increase. Not only can each step of BDDKM be easily parallelized, but its computational cost is dominated by that of the selected heuristic (on m dimensions), which makes it a suitable dimensionality reduction alternative for BWKM on large data sets. Besides providing a theoretical bound for the solution obtained via BDDKM with respect to the optimal K-means clustering, we analyze its performance in comparison to well-known feature selection and feature extraction techniques. This analysis shows that BDDKM consistently obtains results with lower K-means error than all the considered feature selection techniques (Laplacian scores, maximum variance and random selection), while requiring similar or lower computational times, and, when compared to feature extraction techniques such as Random Projections, it also shows a noticeable improvement in both error and computational time.

    Finally, as a response to the high dependency of the K-means algorithm on its initialization, we introduce a cheap Split-Merge step that can be used to re-start the K-means algorithm after reaching a fixed point: Split-Merge K-means (SMKM). Under some settings, one can show that this approach reduces the error of the given fixed point without requiring any further iteration of the K-means algorithm. Moreover, experimental results show that this strategy is able to generate approximations with an associated error that is hard to reach for different multi-start methods, such as multi-start Forgy K-means, K-means++ and Hartigan K-means. In particular, SMKM consistently generated the local minima with the lowest K-means error, reducing, on average, the relative error by over 1 and 2 orders of magnitude with respect to K-means++ and Hartigan K-means, and Forgy K-means, respectively. Not only does the error of the solution obtained by SMKM tend to be much lower than that of the previously mentioned methods, but, in terms of computational resources, SMKM also requires a much smaller number of distance computations (about an order of magnitude fewer) to reach the lowest error they achieved.
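    The building block these methods share is a weighted Lloyd's step applied to partition representatives rather than to the full data set. The sketch below illustrates that idea with a simple fixed grid; the adaptive partitioning, stopping criteria and guarantees of RPKM/BWKM are not reproduced, and all names and parameters are illustrative.

```python
import numpy as np

def weighted_kmeans(reps, weights, centroids, n_iter=10):
    """Weighted Lloyd iterations: `reps` are partition representatives
    (e.g. cell means) and `weights` the number of original points each one
    stands for. RPKM-style methods repeat this over increasingly fine
    partitions, warm-starting from the previous centroids."""
    for _ in range(n_iter):
        # Assign each representative to its closest centroid.
        d = np.linalg.norm(reps[:, None, :] - centroids[None, :, :], axis=2)
        assign = np.argmin(d, axis=1)
        # Recompute each centroid as the weighted mean of its representatives.
        for j in range(centroids.shape[0]):
            mask = assign == j
            if mask.any():
                w = weights[mask]
                centroids[j] = (w[:, None] * reps[mask]).sum(0) / w.sum()
    return centroids

# Toy usage: reduce 100,000 points to grid-cell means, then run weighted K-means.
rng = np.random.default_rng(4)
X = rng.normal(size=(100_000, 2))
cells = np.floor(X)                                   # integer grid cell per point
keys, inv, counts = np.unique(cells, axis=0, return_inverse=True, return_counts=True)
inv = inv.ravel()
reps = np.zeros((len(keys), 2))
np.add.at(reps, inv, X)                               # sum of points per cell
reps /= counts[:, None]                               # cell means as representatives
C = weighted_kmeans(reps, counts.astype(float), reps[:3].copy())
print(C)
```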