36 research outputs found

    Faster Linear Algebra for Distance Matrices

    Full text link
    The distance matrix of a dataset XX of nn points with respect to a distance function ff represents all pairwise distances between points in XX induced by ff. Due to their wide applicability, distance matrices and related families of matrices have been the focus of many recent algorithmic works. We continue this line of research and take a broad view of algorithm design for distance matrices with the goal of designing fast algorithms, which are specifically tailored for distance matrices, for fundamental linear algebraic primitives. Our results include efficient algorithms for computing matrix-vector products for a wide class of distance matrices, such as the 1\ell_1 metric for which we get a linear runtime, as well as an Ω(n2)\Omega(n^2) lower bound for any algorithm which computes a matrix-vector product for the \ell_{\infty} case, showing a separation between the 1\ell_1 and the \ell_{\infty} metrics. Our upper bound results, in conjunction with recent works on the matrix-vector query model, have many further downstream applications, including the fastest algorithm for computing a relative error low-rank approximation for the distance matrix induced by 1\ell_1 and 22\ell_2^2 functions and the fastest algorithm for computing an additive error low-rank approximation for the 2\ell_2 metric, in addition to applications for fast matrix multiplication among others. We also give algorithms for constructing distance matrices and show that one can construct an approximate 2\ell_2 distance matrix in time faster than the bound implied by the Johnson-Lindenstrauss lemma.Comment: Selected as Oral for NeurIPS 202

    Unsupervised Structural Embedding Methods for Efficient Collective Network Mining

    Full text link
    How can we align accounts of the same user across social networks? Can we identify the professional role of an email user from their patterns of communication? Can we predict the medical effects of chemical compounds from their atomic network structure? Many problems in graph data mining, including all of the above, are defined on multiple networks. The central element to all of these problems is cross-network comparison, whether at the level of individual nodes or entities in the network or at the level of entire networks themselves. To perform this comparison meaningfully, we must describe the entities in each network expressively in terms of patterns that generalize across the networks. Moreover, because the networks in question are often very large, our techniques must be computationally efficient. In this thesis, we propose scalable unsupervised methods that embed nodes in vector space by mapping nodes with similar structural roles in their respective networks, even if they come from different networks, to similar parts of the embedding space. We perform network alignment by matching nodes across two or more networks based on the similarity of their embeddings, and refine this process by reinforcing the consistency of each node’s alignment with those of its neighbors. By characterizing the distribution of node embeddings in a graph, we develop graph-level feature vectors that are highly effective for graph classification. With principled sparsification and randomized approximation techniques, we make all our methods computationally efficient and able to scale to graphs with millions of nodes or edges. We demonstrate the effectiveness of structural node embeddings on industry-scale applications, and propose an extensive set of embedding evaluation techniques that lay the groundwork for further methodological development and application.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/162895/1/mheimann_1.pd

    Fine-Grained Complexity: Exploring Reductions and their Properties

    Get PDF
    Η σχεδίαση αλγορίθμων αποτελεί ένα απο τα κύρια θέματα ενδιαφέροντος για τον τομέα της Πληροφορικής. Παρά τα πολλά αποτελέσματα σε ορισμένους τομείς, η προσέγγιση αυτή έχει πετύχει κάποια πρακτικά αδιέξοδα που έχουν αποδειχτεί προβληματικά στην πρόοδο του τομέα. Επίσης, οι κλασικές πρακτικές Υπολογιστικής Πολυπλοκότητας δεν ήταν σε θέση να παρακάμψουν αυτά τα εμπόδια. Η κατανόηση της δυσκολίας του κάθε προβλήματος δεν είναι τετριμμένη. Η Ραφιναρισμένη Πολυπλοκότητα παρέχει νέες προ-οπτικές για τα κλασικά προβλήματα, με αποτέλεσμα σταθερούς δεσμούς μεταξύ γνωστών εικασιών στην πολυπλοκότητα και την σχεδίαση αλγορίθμων. Χρησιμεύει επίσης ως εργα-λείο για να αποδείξει τα υπο όρους κατώτατα όρια για προβλήματα πολυωνυμικής χρονικής πολυπλοκότητας, ένα πεδίο που έχει σημειώσει πολύ λίγη πρόοδο μέχρι τώρα. Οι δημοφι-λείς υποθέσεις/παραδοχές όπως το SETH, το OVH, το 3SUM, και το APSP, δίνουν πολλά φράγματα που δεν έχουν ακόμα αποδειχθεί με κλασικές τεχνικές και παρέχουν μια νέα κατανόηση της δομής και της εντροπίας των προβλημάτων γενικά. Σκοπός αυτής της εργασίας είναι να συμβάλει στην εδραίωση του πλαισίου για αναγωγές από κάθε εικασία και να διερευνήσει την διαρθρωτική διαφορά μεταξύ των προβλημάτων σε κάθε περίπτωση.Algorithmic design has been one of the main subjects of interest for Computer science. While very effective in some areas, this approach has been met with some practical dead ends that have been very problematic in the progress of the field. Classical Computational Complexity practices have also not been able to bypass these blocks. Understanding the hardness of each problem is not trivial. Fine-Grained Complexity provides new perspectives on classic problems, resulting to solid links between famous conjectures in Complexity, and Algorithmic design. It serves as a tool to prove conditional lower bounds for problems with polynomial time complexity, a field that had seen very little progress until now. Popular conjectures such as SETH, k-OV, 3SUM, and APSP, imply many bounds that have yet to be proven using classic techniques, and provide a new understanding of the structure and entropy of problems in general. The aim of this thesis is to contribute towards solidifying the framework for reductions from each conjecture, and to explore the structural difference between the problems in each cas

    RUPEE: A Big Data Approach to Indexing and Searching Protein Structures

    Get PDF
    Title from PDF of title page viewed July 7, 2021Yugyung LeeVitaIncludes bibliographical references (pages 149-158)Thesis (Ph.D.)--School of Computing and Engineering and Department of Mathematics and Statistics. University of Missouri--Kansas City, 2021Given the close relationship between protein structure and function, protein structure searches have long played an established role in bioinformatics. Despite their maturity, existing protein structure searches either compromise the quality of results to obtain faster response times or suffer from longer response times to provide better quality results. Existing protein structure searches that focus on faster response times often use sequence clustering or depend on other simplifying assumptions not based on structure alone. In the case of sequence clustering, strong structure similarities are often hidden behind cluster representatives. Existing protein structure searches that focus on better quality results often perform full pairwise protein structure alignments with the query structure against every available structure in the searched database, which can take as long as a full day to complete. The poor response times of these protein structure searches prevent the easy and efficient exploration of relationships between protein structures, which is the norm in other areas of inquiry. To address these trade-offs between faster response times and quality results, we have developed RUPEE, a fast and accurate purely geometric protein structure search combining a novel approach to encoding sequences of torsion angles with established techniques from information retrieval and big data. RUPEE can compare the query structure to every available structure in the searched database with fast response times. To accomplish this, first, we introduce a new polar plot of torsion angles to help identify separable regions of torsion angles and derive a simple encoding of torsion angles based on the identified regions. Then, we introduce a heuristic to encode sequences of torsion angles called Run Position Encoding to increase the specificity of our encoding within regular secondary structures, alpha-helices and beta-strands. Once we have a linear encoding of protein structures based on their torsion angles, we use min-hashing and locality sensitive hashing, established techniques from information retrieval and big data, to compare the query structure to every available structure in the searched database with fast response times. Moreover, because RUPEE is a purely geometric protein structure search, it does not depend on protein sequences. RUPEE also does not depend on other simplifying assumptions not based on structure alone. As such, RUPEE can be used effectively to search on protein structures with low sequence and structure similarity to known structures, such as predicted structures that results from protein structure prediction algorithms. Comparing our results to the mTM-align, SSM, CATHEDRAL, and VAST protein structure searches, RUPEE has set a new bar for protein structure searches. RUPEE produces better quality results than the best available protein structure searches and does so with the fastest response times.Introduction -- Encoding Torsion Angles -- Indexing Protein Structures -- Searching Protein Structures -- Results and Evaluation -- Using RUPEE -- Conclusion -- Appendix A. Benchmarks of Known Protein Structures -- Appendix B. Benchmarks of Protein Structure Prediction

    Similarity measures and algorithms for cartographic schematization

    Get PDF

    Efficient homology search for genomic sequence databases

    Get PDF
    Genomic search tools can provide valuable insights into the chemical structure, evolutionary origin and biochemical function of genetic material. A homology search algorithm compares a protein or nucleotide query sequence to each entry in a large sequence database and reports alignments with highly similar sequences. The exponential growth of public data banks such as GenBank has necessitated the development of fast, heuristic approaches to homology search. The versatile and popular blast algorithm, developed by researchers at the US National Center for Biotechnology Information (NCBI), uses a four-stage heuristic approach to efficiently search large collections for analogous sequences while retaining a high degree of accuracy. Despite an abundance of alternative approaches to homology search, blast remains the only method to offer fast, sensitive search of large genomic collections on modern desktop hardware. As a result, the tool has found widespread use with millions of queries posed each day. A significant investment of computing resources is required to process this large volume of genomic searches and a cluster of over 200 workstations is employed by the NCBI to handle queries posed through the organisation's website. As the growth of sequence databases continues to outpace improvements in modern hardware, blast searches are becoming slower each year and novel, faster methods for sequence comparison are required. In this thesis we propose new techniques for fast yet accurate homology search that result in significantly faster blast searches. First, we describe improvements to the final, gapped alignment stages where the query and sequences from the collection are aligned to provide a fine-grain measure of similarity. We describe three new methods for aligning sequences that roughly halve the time required to perform this computationally expensive stage. Next, we investigate improvements to the first stage of search, where short regions of similarity between a pair of sequences are identified. We propose a novel deterministic finite automaton data structure that is significantly smaller than the codeword lookup table employed by ncbi-blast, resulting in improved cache performance and faster search times. We also discuss fast methods for nucleotide sequence comparison. We describe novel approaches for processing sequences that are compressed using the byte packed format already utilised by blast, where four nucleotide bases from a strand of DNA are stored in a single byte. Rather than decompress sequences to perform pairwise comparisons, our innovations permit sequences to be processed in their compressed form, four bases at a time. Our techniques roughly halve average query evaluation times for nucleotide searches with no effect on the sensitivity of blast. Finally, we present a new scheme for managing the high degree of redundancy that is prevalent in genomic collections. Near-duplicate entries in sequence data banks are highly detrimental to retrieval performance, however existing methods for managing redundancy are both slow, requiring almost ten hours to process the GenBank database, and crude, because they simply purge highly-similar sequences to reduce the level of internal redundancy. We describe a new approach for identifying near-duplicate entries that is roughly six times faster than the most successful existing approaches, and a novel approach to managing redundancy that reduces collection size and search times but still provides accurate and comprehensive search results. Our improvements to blast have been integrated into our own version of the tool. We find that our innovations more than halve average search times for nucleotide and protein searches, and have no signifcant effect on search accuracy. Given the enormous popularity of blast, this represents a very significant advance in computational methods to aid life science research
    corecore