9 research outputs found

    When Hashing Met Matching: Efficient Spatio-Temporal Search for Ridesharing

    Full text link
    Carpooling, or sharing a ride with other passengers, holds immense potential for urban transportation. Ridesharing platforms enable such sharing of rides using real-time data. Finding ride matches in real-time at urban scale is a difficult combinatorial optimization task and mostly heuristic approaches are applied. In this work, we mathematically model the problem as that of finding near-neighbors and devise a novel efficient spatio-temporal search algorithm based on the theory of locality sensitive hashing for Maximum Inner Product Search (MIPS). The proposed algorithm can find kk near-optimal potential matches for every ride from a pool of nn rides in time O(n1+ρ(k+log⁥n)log⁥k)O(n^{1 + \rho} (k + \log n) \log k) and space O(n1+ρlog⁥k)O(n^{1 + \rho} \log k) for a small ρ<1\rho < 1. Our algorithm can be extended in several useful and interesting ways increasing its practical appeal. Experiments with large NY yellow taxi trip datasets show that our algorithm consistently outperforms state-of-the-art heuristic methods thereby proving its practical applicability

    Efficient algorithms for substring near neighbor problem

    Full text link

    Approximate nearest neighbor problem in high dimensions

    Get PDF
    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005.Includes bibliographical references (p. 47-49).We investigate the problem of finding the approximate nearest neighbor when the data set points are the substrings of a given text T. The exact version of this problem is defined as follows. Given a text T of length n, we want to build a data structure that supports the following operation: given a pattern P, find the substring of T that is the closest to P. Since the exact version of this problem is surprisingly difficult, we address the approximate version, in which we are allowed to return a substring of T that is at most c times further than the actual closest substring of T. This problem occurs, for example, in computational biology [4, 5]. In particular, we study the case where the length of the pattern P, denoted by m, is not known in advance, which is the most natural scenario. We present a data structure that uses O(n1+1/c) space and has 0 (nl/c + mn⁰(l)) query time' when the distance between two strings is the Hamming distance. These bounds essentially match the earlier bounds of [12], which assumed that the pattern length m is fixed in advance. Furthermore, our data structure can be constructed in O (n1+1/c + n1+⁰(1)M1/3) time, where M is an upper bound for m. This time essentially matches the preprocessing time of [12] as long as the term O(n1+1/c) dominates the running time, which is the case when, for example, c < 3. We also extend our results to the case where the distances are measured according to the lI distance. The query time and the space bound are essentially the same, while the preprocessing time becomes 6 (n'+/c + nl+o(l)M2/3) (We use notation f(n) = O(g(n)) to denote f(n) = g(n) logO(1) g(n)).by Alexandr Andoni.M.Eng

    Structural RNA Homology Search and Alignment Using Covariance Models

    Get PDF
    Functional RNA elements do not encode proteins, but rather function directly as RNAs. Many different types of RNAs play important roles in a wide range of cellular processes, including protein synthesis, gene regulation, protein transport, splicing, and more. Because important sequence and structural features tend to be evolutionarily conserved, one way to learn about functional RNAs is through comparative sequence analysis - by collecting and aligning examples of homologous RNAs and comparing them. Covariance models: CMs) are powerful computational tools for homology search and alignment that score both the conserved sequence and secondary structure of an RNA family. However, due to the high computational complexity of their search and alignment algorithms, searches against large databases and alignment of large RNAs like small subunit ribosomal RNA: SSU rRNA) are prohibitively slow. Large-scale alignment of SSU rRNA is of particular utility for environmental survey studies of microbial diversity which often use the rRNA as a phylogenetic marker of microorganisms. In this work, we improve CM methods by making them faster and more sensitive to remote homology. To accelerate searches, we introduce a query-dependent banding: QDB) technique that makes scoring sequences more efficient by restricting the possible lengths of structural elements based on their probability given the model. We combine QDB with a complementary filtering method that quickly prunes away database subsequences deemed unlikely to receive high CM scores based on sequence conservation alone. To increase search sensitivity, we apply two model parameterization strategies from protein homology search tools to CMs. As judged by our benchmark, these combined approaches yield about a 250-fold speedup and significant increase in search sensitivity compared with previous implementations. To accelerate alignment, we apply a method that uses a fast sequence-based alignment of a target sequence to determine constraints for the more expensive CM sequence- and structure-based alignment. This technique reduces the time required to align one SSU rRNA sequence from about 15 minutes to 1 second with a negligible effect on alignment accuracy. Collectively, these improvements make CMs more powerful and practical tools for RNA homology search and alignment

    Recherche d'éléments répétés par analyse des distributions de fréquences d'oligonucléotides

    Full text link
    Mémoire numérisé par la Division de la gestion de documents et des archives de l'Université de Montréal

    Provably sensitive Indexing strategies for biosequence similarity search

    No full text

    Discriminative, generative, and imitative learning

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2002.Includes bibliographical references (leaves 201-212).I propose a common framework that combines three different paradigms in machine learning: generative, discriminative and imitative learning. A generative probabilistic distribution is a principled way to model many machine learning and machine perception problems. Therein, one provides domain specific knowledge in terms of structure and parameter priors over the joint space of variables. Bayesian networks and Bayesian statistics provide a rich and flexible language for specifying this knowledge and subsequently refining it with data and observations. The final result is a distribution that is a good generator of novel exemplars. Conversely, discriminative algorithms adjust a possibly non-distributional model to data optimizing for a specific task, such as classification or prediction. This typically leads to superior performance yet compromises the flexibility of generative modeling. I present Maximum Entropy Discrimination (MED) as a framework to combine both discriminative estimation and generative probability densities. Calculations involve distributions over parameters, margins, and priors and are provably and uniquely solvable for the exponential family. Extensions include regression, feature selection, and transduction. SVMs are also naturally subsumed and can be augmented with, for example, feature selection, to obtain substantial improvements. To extend to mixtures of exponential families, I derive a discriminative variant of the Expectation-Maximization (EM) algorithm for latent discriminative learning (or latent MED).(cont.) While EM and Jensen lower bound log-likelihood, a dual upper bound is made possible via a novel reverse-Jensen inequality. The variational upper bound on latent log-likelihood has the same form as EM bounds, is computable efficiently and is globally guaranteed. It permits powerful discriminative learning with the wide range of contemporary probabilistic mixture models (mixtures of Gaussians, mixtures of multinomials and hidden Markov models). We provide empirical results on standardized data sets that demonstrate the viability of the hybrid discriminative-generative approaches of MED and reverse-Jensen bounds over state of the art discriminative techniques or generative approaches. Subsequently, imitative learning is presented as another variation on generative modeling which also learns from exemplars from an observed data source. However, the distinction is that the generative model is an agent that is interacting in a much more complex surrounding external world. It is not efficient to model the aggregate space in a generative setting. I demonstrate that imitative learning (under appropriate conditions) can be adequately addressed as a discriminative prediction task which outperforms the usual generative approach. This discriminative-imitative learning approach is applied with a generative perceptual system to synthesize a real-time agent that learns to engage in social interactive behavior.by Tony Jebara.Ph.D
    corecore