363 research outputs found

    FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

    We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count-based estimations, we reduce the computational and parallelization costs of similarity search while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URL, click-through prediction, and social networks. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail on the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset using brute force ($n^2D$) would require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results.
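    The abstract names reservoir sampling as one of the techniques FLASH combines with its hashing scheme. The sketch below shows the textbook form of reservoir sampling (Algorithm R), which keeps a fixed-size uniform sample from a stream of unknown length. It is a generic illustration, not FLASH's implementation: the function name `reservoir_sample` and its parameters are assumptions made for this example, and the abstract does not say how FLASH applies the technique.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Return a uniform random sample of up to k items from an iterable.

    Textbook Algorithm R: memory stays O(k) regardless of stream length.
    The name and signature are hypothetical, chosen for this sketch.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), 5))
```

    A bounded reservoir of this kind is one way to cap the size of a hash bucket, so that memory use does not grow with the number of items hashed into it.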

    Convolution Acceleration: Query Based Filter Pruning with ALSH

    The rising ubiquity of Convolutional Neural Networks for learning tasks has led to their use on a variety of devices. CNNs can be used on small devices, such as phones or embedded systems; however, compute time is a critical enabling factor. On these devices, trading high accuracy for improved performance may be worthwhile. This has led to active research in high-level convolution optimizations. One successful class of optimizations is filter pruning, in which filters that are determined to have a small effect on the network's output are deleted. In this work, we present a self-pruning convolution that is intended to accelerate convolutions for use on small devices. We call it an ALSH Convolution because it uses Asymmetric Locality Sensitive Hashing to generate a subset of the convolution's filters that are likely to produce large outputs for a given input. Our methodology is accessible: it generalizes well to many architectures and is easy to use, essentially functioning as a regular layer. Experiments show that a network modified to use ALSH Convolutions can stay within 5% accuracy on CIFAR-10 and 10% on CIFAR-100. Further, on small devices, a network built with our implementation can be 2x faster than the same network composed of PyTorch's convolution.

    High Performance Computing using Infiniband-based clusters

    The abstract is in the attachment.

    Doctor of Philosophy

    Data-driven analytics has been successfully utilized in many experience-oriented areas, such as education, business, and medicine. With the profusion of traffic-related data from the Internet of Things and the development of data mining techniques, data-driven analytics is becoming increasingly popular in the transportation industry. The objective of this research is to explore the application of data-driven analytics in transportation research to improve traffic management and operations. Three problems in the respective areas of transportation planning, traffic operation, and maintenance management have been addressed in this research: exploring the impact of a dynamic ridesharing system in a multimodal network, quantifying the impact of non-recurrent congestion on freeway corridors, and developing an infrastructure sampling method for efficient maintenance activities. First, the impact of dynamic ridesharing in a multimodal network is studied with agent-based modeling. The competing mechanism between the dynamic ridesharing system and public transit is analyzed. The model simulates the interaction between travelers and the environment and emulates travelers' decision-making process in the presence of competing modes. The model is applicable to networks with varying demographics. Second, a systematic approach is proposed to quantify Incident-Induced Delay on freeway corridors. There are two particular highlights in the study of non-recurrent congestion quantification: secondary incident identification and K-Nearest Neighbor pattern matching. The proposed methodology is easily transferable to any traffic operation system that has access to sensor data at a corridor level. Lastly, a high-dimensional clustering-based stratified sampling method is developed for infrastructure sampling. The stratification process consists of two components: current condition estimation and high-dimensional cluster analysis. High-dimensional cluster analysis employs a Locality-Sensitive Hashing algorithm and spectral sampling. The proposed method is a potentially useful tool for agencies to effectively conduct infrastructure inspection and can be easily adopted for choosing samples containing multiple features. These three examples showcase the application of data-driven analytics in transportation research, which can potentially transform the traffic management mindset into a model of data-driven, sensing, and smart urban systems. The analytic

    SLIM : Scalable Linkage of Mobility Data

    We present a scalable solution to link entities across mobility datasets using their spatio-temporal information. This is a fundamental problem in many applications such as linking user identities for security, understanding privacy limitations of location-based services, or producing a unified dataset from multiple sources for urban planning. Such integrated datasets are also essential for service providers to optimise their services and improve business intelligence. In this paper, we first propose a mobility-based representation and similarity computation for entities. An efficient matching process is then developed to identify the final linked pairs, with an automated mechanism to decide when to stop the linkage. We scale the process with a locality-sensitive hashing (LSH) based approach that significantly reduces candidate pairs for matching. To realize the effectiveness and efficiency of our techniques in practice, we introduce an algorithm called SLIM. In the experimental evaluation, SLIM outperforms the two existing state-of-the-art approaches in terms of precision and recall. Moreover, the LSH-based approach brings two to four orders of magnitude speedup.
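    To illustrate how an LSH-based step can reduce the number of candidate pairs before matching, the sketch below applies the standard banding trick to precomputed integer signatures: two entities become a candidate pair only if they agree on every row of at least one band. This is a generic example and not SLIM's actual hashing scheme, which the abstract does not describe. The function name `candidate_pairs`, the signature format, and the `bands` and `rows` parameters are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures, bands, rows):
    """Return the set of ID pairs that share at least one identical band.

    signatures: dict mapping an entity ID to a sequence of bands * rows ints
    (a hypothetical input format for this sketch).
    """
    pairs = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            # Entities whose signature slice matches land in the same bucket.
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[band].append(key)
        for members in buckets.values():
            # Only entities that collide in a bucket are compared later.
            for first, second in combinations(sorted(members), 2):
                pairs.add((first, second))
    return pairs


if __name__ == "__main__":
    sigs = {"a": (1, 2, 3, 4), "b": (1, 2, 9, 9), "c": (7, 7, 7, 7)}
    print(candidate_pairs(sigs, bands=2, rows=2))  # {('a', 'b')}
```

    In this toy input only `a` and `b` collide (on the first band), so one pair is passed on for matching instead of all three possible pairs.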

    SUFFIX TREE, MINWISE HASHING AND STREAMING ALGORITHMS FOR BIG DATA ANALYSIS IN BIOINFORMATICS

    In this dissertation, we worked on several algorithmic problems in bioinformatics using mainly three approaches: (a) a streaming model, (b) suffix-tree based indexing, and (c) minwise hashing (minhash) and locality-sensitive hashing (LSH). The streaming models are useful for large data problems where a good approximation needs to be achieved with limited space usage. We developed an approximation algorithm (Kmer-Estimate) using the streaming approach to obtain a better estimation of the frequency of k-mer counts. A k-mer, a subsequence of length k, plays an important role in many bioinformatics analyses such as genome distance estimation. We also developed new methods that use the suffix tree, a trie data structure, for alignment-free, non-pairwise algorithms for a conserved non-coding sequence (CNS) identification problem. We provided two different algorithms: STAG-CNS to identify exact-matched CNSs and DiCE to identify CNSs with mismatches. Using our algorithms, CNSs among various grass species were identified. A different approach was employed for identification of longer CNSs ( 100 bp, mostly found in animals). In our new method (MinCNE), the minhash approach was used to estimate the Jaccard similarity. Also using LSH, k-mers extracted from genomic sequences were clustered and CNSs were identified. Another new algorithm (MinIsoClust) that also uses minhash and LSH techniques was developed for an isoform clustering problem. Isoforms are generated from the same gene but by alternative splicing. As the isoform sequences share some exons but in different combinations, regular sequencing clustering methods do not work well. Our algorithm generates clusters for isoform sequences based on their shared minhash signatures. Finally, we discuss de novo transcriptome assembly algorithms and how to improve the assembly accuracy using ensemble approaches. First, we did a comprehensive performance analysis on different transcriptome assemblers using simulated benchmark datasets. Then, we developed a new ensemble approach (Minsemble) for the de novo transcriptome assembly problem that integrates isoform clustering using the minhash technique to identify potentially correct transcripts from various de novo transcriptome assemblers. Minsemble identified more correctly assembled transcripts as well as genes compared to other de novo and ensemble methods. Adviser: Jitender S. Deogu
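    The abstract describes using minhash to estimate the Jaccard similarity of k-mer sets. The sketch below shows the basic idea: the fraction of hash functions on which two sets share the same minimum hash value approximates their Jaccard similarity. It is a minimal illustration, not the dissertation's MinCNE or MinIsoClust code. All function names, the choice of salted BLAKE2b as the hash family, and the example sequences are assumptions made for this example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(seed, item):
    # One 64-bit hash function per seed, built from salted BLAKE2b
    # (an illustrative choice of hash family).
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Minimum hash value of a non-empty set under each hash function."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two equal-length signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    a = kmers("ACGTACGTTAGCCGATAGCTAGGCTA", 4)
    b = kmers("ACGTACGTTAGCCGATTTTTAGGCTA", 4)
    estimate = estimate_jaccard(minhash_signature(a), minhash_signature(b))
    print("estimate:", estimate, "exact:", exact_jaccard(a, b))
```

    Increasing `num_hashes` tightens the estimate at the cost of longer signatures. Signatures can also be fed to an LSH step so that only likely-similar sequences are compared.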