16 research outputs found
A Succinct Four Russians Speedup for Edit Distance Computation and One-against-many Banded Alignment
The classical Four Russians speedup for computing edit distance (a.k.a. Levenshtein distance), due to Masek and Paterson [Masek and Paterson, 1980], involves partitioning the dynamic programming table into k-by-k square blocks and generating a lookup table in O(psi^{2k} k^2 |Sigma|^{2k}) time and O(psi^{2k} k |Sigma|^{2k}) space for block size k, where psi depends on the cost function (for unit costs psi = 3) and |Sigma| is the size of the alphabet. We show that the O(psi^{2k} k^2) and O(psi^{2k} k) factors can be improved to O(k^2 lg{k}) time and O(k^2) space. Thus, we improve the time and space complexity of that aspect compared to Masek and Paterson [Masek and Paterson, 1980] and remove the dependence on psi.
We further show that for certain problems the O(|Sigma|^{2k}) factor can also be reduced. Using this technique, we show a new algorithm for the fundamental problem of one-against-many banded alignment. In particular, comparing one string of length m to n other strings of length m with maximum distance d can be performed in O(n m + m d^2 lg{d} + n d^3) time. When d is reasonably small, this approaches or meets the current best theoretic result of O(nm + n d^2) achieved by using the best known pairwise algorithm running in O(m + d^2) time [Myers, 1986][Ukkonen, 1985] while potentially being more practical. It also improves on the standard practical approach which requires O(n m d) time to iteratively run an O(md) time pairwise banded alignment algorithm.
Regarding pairwise comparison, we extend the classic result of Masek and Paterson [Masek and Paterson, 1980] which computes the edit distance between two strings in O(m^2/log{m}) time to remove the dependence on psi even when edits have arbitrary costs from a penalty matrix. Crochemore, Landau, and Ziv-Ukelson [Crochemore, 2003] achieved a similar result, also allowing for unrestricted scoring matrices, but with variable-sized blocks. In practical applications of the Four Russians speedup wherein space efficiency is important and smaller block sizes k are used (notably k < |Sigma|), Kim, Na, Park, and Sim [Kim et al., 2016] showed how to remove the dependence on the alphabet size for the unit cost version, generating a lookup table in O(3^{2k} (2k)! k^2) time and O(3^{2k} (2k)! k) space. Combining their work with our result yields an improvement to O((2k)! k^2 lg{k}) time and O((2k)! k^2) space
Algorithms to Approximate Column-Sparse Packing Problems
Column-sparse packing problems arise in several contexts in both
deterministic and stochastic discrete optimization. We present two unifying
ideas, (non-uniform) attenuation and multiple-chance algorithms, to obtain
improved approximation algorithms for some well-known families of such
problems. As three main examples, we attain the integrality gap, up to
lower-order terms, for known LP relaxations for k-column sparse packing integer
programs (Bansal et al., Theory of Computing, 2012) and stochastic k-set
packing (Bansal et al., Algorithmica, 2012), and go "half the remaining
distance" to optimal for a major integrality-gap conjecture of Furedi, Kahn and
Seymour on hypergraph matching (Combinatorica, 1993).Comment: Extended abstract appeared in SODA 2018. Full version in ACM
Transactions of Algorithm
Markets, Elections, and Microbes: Data-driven Algorithms from Theory to Practice
Many modern problems in algorithms and optimization are driven by data which often carries with it an element of uncertainty. In this work, we conduct an investigation into algorithmic foundations and applications across three main areas.
The first area is online matching algorithms for e-commerce applications such as online sales and advertising. The importance of e-commerce in modern business cannot be overstated and even minor algorithmic improvements can have huge impacts. In online matching problems, we generally have a known offline set of goods or advertisements while users arrive online and allocations must be made immediately and irrevocably when a user arrives. However, in the real world, there is also uncertainty about a user's true interests and this can be modeled by considering matching problems in a graph with stochastic edges that only have a probability of existing. These edges can represent the probability of a user purchasing a product or clicking on an ad. Thus, we optimize over data which only provides an estimate of what types of users will arrive and what they will prefer. We survey a broad landscape of problems in this area, gain a deeper understanding of the algorithmic challenges, and present algorithms with improved worst case performance
The second area is constrained clustering where we explore classical clustering problems with additional constraints on which data points should be clustered together. Utilizing these constraints is important for many clustering problems because they can be used to ensure fairness, exploit expert advice, or capture natural properties of the data. In simplest case, this can mean some pairs of points have ``must-link'' constraints requiring that that they must be clustered together. Moving into stochastic settings, we can describe more general pairwise constraints such as bounding the probability that two points are separated into different clusters. This lets us introduce a new notion of fairness for clustering and address stochastic problems such as semi-supervised learning with advice from imperfect experts. Here, we introduce new models of constrained clustering including new notions of fairness for clustering applications. Since these problems are NP-hard, we give approximation algorithms and in some cases conduct experiments to explore how the algorithms perform in practice. Finally, we look closely at the particular clustering problem of drawing election districts and show how constraining the clusters based on past voting data can interact with voter incentives.
The third area is string algorithms for bioinformatics and metagenomics specifically where the data deluge from next generation sequencing drives the necessity for new algorithms that are both fast and accurate. For metagenomic analysis, we present a tool for clustering a microbial marker gene, the 16S ribosomal RNA gene. On the more theoretical side, we present a succinct application of the Method of the Four Russians to edit distance computation as well as new algorithms and bounds for the maximum duo-preservation string mapping (MPSM) problem
Follow Your Star: New Frameworks for Online Stochastic Matching with Known and Unknown Patience
We study several generalizations of the Online Bipartite Matching problem. We
consider settings with stochastic rewards, patience constraints, and weights
(both vertex- and edge-weighted variants). We introduce a stochastic variant of
the patience-constrained problem, where the patience is chosen randomly
according to some known distribution and is not known until the point at which
patience has been exhausted. We also consider stochastic arrival settings
(i.e., online vertex arrival is determined by a known random process), which
are natural settings that are able to beat the hard worst-case bounds of more
pessimistic adversarial arrivals.
Our approach to online matching utilizes black-box algorithms for matching on
star graphs under various models of patience. In support of this, we design
algorithms which solve the star graph problem optimally for patience with a
constant hazard rate and yield a 1/2-approximation for any patience
distribution. This 1/2-approximation also improves existing guarantees for
cascade-click models in the product ranking literature, in which a user must be
shown a sequence of items with various click-through-rates and the user's
patience could run out at any time.
We then build a framework which uses these star graph algorithms as black
boxes to solve the online matching problems under different arrival settings.
We show improved (or first-known) competitive ratios for these problems.
Finally, we present negative results that include formalizing the concept of a
stochasticity gap for LP upper bounds on these problems, bounding the
worst-case performance of some popular greedy approaches, and showing the
impossibility of having an adversarial patience in the product ranking setting.Comment: 43 page
Better Greedy Sequence Clustering with Fast Banded Alignment
Comparing a string to a large set of sequences is a key subroutine in greedy heuristics for clustering genomic data. Clustering 16S rRNA gene sequences into operational taxonomic units (OTUs) is a common method used in studying microbial communities. We present a new approach to greedy clustering using a trie-like data structure and Four Russians speedup. We evaluate the running time of our method in terms of the number of comparisons it makes during clustering and show in experimental results that the number of comparisons grows linearly with the size of the dataset as opposed to the quadratic running time of other methods. We compare the clusters output by our method to the popular greedy clustering tool UCLUST. We show that the clusters we generate can be both tighter and larger
New Algorithms, Better Bounds, and a Novel Model for Online Stochastic Matching
Online matching has received significant attention over the last 15 years due to its close connection to Internet advertising. As the seminal work of Karp, Vazirani, and Vazirani has an optimal (1 - 1/epsilon) competitive ratio in the standard adversarial online model, much effort has gone into developing useful online models that incorporate some stochasticity in the arrival process. One such popular model is the "known I.I.D. model" where different customer-types arrive online from a known distribution. We develop algorithms with improved competitive ratios for some basic variants of this model with integral arrival rates, including: (a) the case of general weighted edges, where we improve the best-known ratio of 0.667 due to [Haeupler, Mirrokni and Zadimoghaddam WINE 2011] to 0.705; and (b) the vertex-weighted case, where we improve the 0.7250 ratio of [Jaillet and Lu Math. Oper. Res 2013] to 0.7299. We also consider two extensions, one is "known I.I.D." with non-integral arrival rate and stochastic rewards; the other is "known I.I.D." b-matching with non-integral arrival rate and stochastic rewards. We present a simple non-adaptive algorithm which works well simultaneously on the two extensions.
One of the key ingredients of our improvement is the following (offline) approach to bipartite-matching polytopes with additional constraints. We first add several valid constraints in order to get a good fractional solution f; however, these give us less control over the structure of f. We next remove all these additional constraints and randomly move from f to a feasible point on the matching polytope with all coordinates being from the set {0, 1/k, 2/k,..., 1} for a chosen integer k. The structure of this solution is inspired by [Jaillet and Lu Math. Oper. Res 2013] and is a tractable structure for algorithm design and analysis. The appropriate random move preserves many of the removed constraints (approximately [exactly] with high probability [in expectation]). This underlies some of our improvements, and, we hope, could be of independent interest
Probabilistic Fair Clustering
In clustering problems, a central decision-maker is given a complete metric
graph over vertices and must provide a clustering of vertices that minimizes
some objective function. In fair clustering problems, vertices are endowed with
a color (e.g., membership in a group), and the features of a valid clustering
might also include the representation of colors in that clustering. Prior work
in fair clustering assumes complete knowledge of group membership. In this
paper, we generalize prior work by assuming imperfect knowledge of group
membership through probabilistic assignments. We present clustering algorithms
in this more general setting with approximation ratio guarantees. We also
address the problem of "metric membership", where different groups have a
notion of order and distance. Experiments are conducted using our proposed
algorithms as well as baselines to validate our approach and also surface
nuanced concerns when group membership is not known deterministically
Fairness, Semi-Supervised Learning, and More: A General Framework for Clustering with Stochastic Pairwise Constraints
Metric clustering is fundamental in areas ranging from Combinatorial
Optimization and Data Mining, to Machine Learning and Operations Research.
However, in a variety of situations we may have additional requirements or
knowledge, distinct from the underlying metric, regarding which pairs of points
should be clustered together. To capture and analyze such scenarios, we
introduce a novel family of \emph{stochastic pairwise constraints}, which we
incorporate into several essential clustering objectives (radius/median/means).
Moreover, we demonstrate that these constraints can succinctly model an
intriguing collection of applications, including among others \emph{Individual
Fairness} in clustering and \emph{Must-link} constraints in semi-supervised
learning. Our main result consists of a general framework that yields
approximation algorithms with provable guarantees for important clustering
objectives, while at the same time producing solutions that respect the
stochastic pairwise constraints. Furthermore, for certain objectives we devise
improved results in the case of Must-link constraints, which are also the best
possible from a theoretical perspective. Finally, we present experimental
evidence that validates the effectiveness of our algorithms.Comment: This paper appeared in AAAI 202