Large neighborhood search for the most strings with few bad columns problem
In this work, we consider the following NP-hard combinatorial optimization problem from computational biology. Given a set of input strings of equal length, the goal is to identify a maximum-cardinality subset of strings that differ in at most a pre-defined number of positions. First, we introduce an integer linear programming model for this problem. Second, two variants of a rather simple greedy strategy are proposed. Finally, a large neighborhood search algorithm is presented. A comprehensive experimental comparison among the proposed techniques shows, first, that large neighborhood search generally outperforms both greedy strategies. Second, while large neighborhood search is competitive with the stand-alone application of CPLEX on small- and medium-sized problem instances, it outperforms CPLEX on larger instances.
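The greedy strategies themselves are not detailed in the abstract; the following is a minimal illustrative sketch of one plausible greedy for this kind of problem, assuming the objective is a maximum-cardinality subset with at most `k` "bad" (non-uniform) columns. The function names, the seeding rule, and the tie-breaking are our own illustrative choices, not the paper's.

```python
from itertools import combinations

def bad_columns(strings):
    # A column is "bad" if the chosen strings do not all agree on it.
    return sum(1 for col in zip(*strings) if len(set(col)) > 1)

def greedy_msfbc(strings, k):
    # Greedily grow a subset: seed with the pair inducing the fewest
    # bad columns, then repeatedly add the string that increases the
    # bad-column count least, while the count stays <= k.
    if len(strings) < 2:
        return list(strings)
    seed = min(combinations(strings, 2), key=bad_columns)
    if bad_columns(seed) > k:
        return [strings[0]]  # no pair fits; any single string is feasible
    chosen = list(seed)
    rest = [s for s in strings if s not in chosen]
    while rest:
        cand = min(rest, key=lambda s: bad_columns(chosen + [s]))
        if bad_columns(chosen + [cand]) > k:
            break
        chosen.append(cand)
        rest.remove(cand)
    return chosen
```

For example, `greedy_msfbc(["ACGT", "ACGA", "ACTT", "TTTT"], 1)` selects the two strings that disagree in only one column.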
On the role of metaheuristic optimization in bioinformatics
Metaheuristic algorithms are employed to solve complex and large-scale optimization problems in many different fields, from transportation and smart cities to finance. This paper discusses how metaheuristic algorithms are being applied to solve different optimization problems in the area of bioinformatics. While the text provides references to many optimization problems in the area, it focuses on those that have attracted more interest from the optimization community. Among the problems analyzed, the paper discusses in more detail the molecular docking problem, protein structure prediction, phylogenetic inference, and several string problems. In addition, references to other relevant optimization problems are also given, including those related to medical imaging or gene selection for classification. From this analysis, the paper offers insights into research opportunities for the Operations Research and Computer Science communities in the field of bioinformatics.
Text Line Segmentation of Historical Documents: a Survey
There is a huge amount of historical documents in libraries and in various
National Archives that have not been exploited electronically. Although
automatic reading of complete pages remains, in most cases, a long-term
objective, tasks such as word spotting, text/image alignment, authentication
and extraction of specific fields are in use today. For all these tasks, a
major step is document segmentation into text lines. Because of the low quality
and the complexity of these documents (background noise, artifacts due to
aging, interfering lines), automatic text line segmentation remains an open
research field. The objective of this paper is to present a survey of existing
methods, developed during the last decade, and dedicated to documents of
historical interest.
Comment: 25 pages, submitted version. To appear in International Journal on
Document Analysis and Recognition. Online version available at
http://www.springerlink.com/content/k2813176280456k3
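The survey's individual methods are not described in the abstract; as a point of reference, here is a minimal sketch of the classical projection-profile baseline for text line segmentation, assuming a binarized page where text pixels equal 1. The names and the `min_pixels` threshold are illustrative, and this simple method is exactly the kind that struggles with the skew and interfering lines the abstract mentions.

```python
import numpy as np

def segment_lines(binary_img, min_pixels=1):
    # Classical horizontal projection-profile segmentation:
    # rows whose ink count falls below `min_pixels` are gaps.
    profile = binary_img.sum(axis=1)       # ink per row
    inked = profile >= min_pixels
    lines, start = [], None
    for y, on in enumerate(inked):
        if on and start is None:
            start = y                      # a text line begins
        elif not on and start is not None:
            lines.append((start, y))       # line ends (exclusive bottom)
            start = None
    if start is not None:
        lines.append((start, len(inked)))
    return lines                           # list of (top, bottom) row ranges
```

A page with two ink bands separated by a blank row yields two `(top, bottom)` ranges.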
Cross-Sender Bit-Mixing Coding
Scheduling to avoid packet collisions is a long-standing challenge in
networking, and has become even trickier in wireless networks with multiple
senders and multiple receivers. In fact, researchers have proved that even {\em
perfect} scheduling can only achieve $R = O(1/\ln N)$. Here $N$
is the number of nodes in the network, and $R$ is the {\em medium
utilization rate}. Ideally, one would hope to achieve $R = \Theta(1)$,
while avoiding all the complexities in scheduling. To this end, this paper
proposes {\em cross-sender bit-mixing coding} ({\em BMC}), which does not rely
on scheduling. Instead, users transmit simultaneously on suitably-chosen slots,
and the amount of overlap in different users' slots is controlled via coding.
We prove that in all possible network topologies, using BMC enables us to
achieve $R = \Theta(1)$. We also prove that the space and time
complexities of BMC encoding/decoding are all low-order polynomials.
Comment: Published in the International Conference on Information Processing
in Sensor Networks (IPSN), 2019
Statistical data mining for symbol associations in genomic databases
A methodology is proposed to automatically detect significant symbol
associations in genomic databases. A new statistical test is proposed to assess
the significance of a group of symbols when found in several genesets of a
given database. Applied to symbol pairs, the thresholded p-values of the test
define a graph structure on the set of symbols. The cliques of that graph are
significant symbol associations, linked to a set of genesets where they can be
found. The method can be applied to any database, and is illustrated on the
MSigDB C2 database. Many of the symbol associations detected in C2 or in
non-specific selections corresponded to already known interactions. On more
specific selections of C2, many previously unknown symbol associations have
been detected. These associations unveil new candidates for gene or protein
interactions, which call for further investigation for biological evidence.
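The statistical test itself is not given in the abstract; assuming pairwise p-values have already been computed, the graph-and-cliques step it describes can be sketched as follows. The Bron-Kerbosch clique enumeration and all names here are illustrative choices, not the paper's code.

```python
def association_cliques(pair_pvalues, alpha):
    # Build a graph whose edges are symbol pairs with p <= alpha,
    # then enumerate its maximal cliques (Bron-Kerbosch, no pivot);
    # each clique is a candidate symbol association.
    adj = {}
    for pair, p in pair_pvalues.items():
        if p <= alpha:
            a, b = tuple(pair)
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    cliques = []
    def bron_kerbosch(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))      # r is maximal: report it
            return
        for v in list(p):
            bron_kerbosch(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)
    bron_kerbosch(set(), set(adj), set())
    return sorted(cliques)
```

With p-values keyed by `frozenset({a, b})`, a triangle of significant pairs is reported as a single three-symbol association rather than three separate pairs.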
Clustered Integer 3SUM via Additive Combinatorics
We present a collection of new results on problems related to 3SUM,
including:
1. The first truly subquadratic algorithm for
1a. computing the (min,+) convolution for monotone increasing
sequences with integer values bounded by $O(n)$,
1b. solving 3SUM for monotone sets in 2D with integer coordinates
bounded by $O(n)$, and
1c. preprocessing a binary string for histogram indexing (also
called jumbled indexing).
The running time is: $O(n^{1.859})$ with
randomization, or $O(n^{1.864})$ deterministically. This greatly improves the
previous $n^2/2^{\Omega(\sqrt{\log n})}$ time bound obtained from Williams'
recent result on all-pairs shortest paths [STOC'14], and answers an open
question raised by several researchers studying the histogram indexing problem.
2. The first algorithm for histogram indexing for any constant alphabet size
that achieves truly subquadratic preprocessing time and truly sublinear query
time.
3. A truly subquadratic algorithm for integer 3SUM in the case when the given
set can be partitioned into $n^{1-\delta}$ clusters each covered by an interval
of length $n$, for any constant $\delta > 0$.
4. An algorithm to preprocess any set of $n$ integers so that subsequently
3SUM on any given subset can be solved in $O(n^{13/7}\,\mathrm{polylog}\,n)$
time.
All these results are obtained by a surprising new technique, based on the
Balog--Szemer\'edi--Gowers Theorem from additive combinatorics.
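For context, the quadratic barrier these results break is set by the textbook hashing baseline for 3SUM, sketched below (a generic reference implementation, not the paper's algorithm; here elements may be reused, i.e. $a$, $b$, $c$ need not be distinct).

```python
def three_sum(values):
    # Classic O(n^2) baseline: for each pair (a, b), check in O(1)
    # expected time whether -(a + b) is in the set.
    vals = set(values)
    for a in values:
        for b in values:
            if -(a + b) in vals:
                return True
    return False
```

For instance, `three_sum([1, 2, -3, 7])` is `True` because `1 + 2 + (-3) == 0`; the paper's contribution is beating this quadratic running time on clustered and structured inputs.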
The Case for Learned Index Structures
Indexes are models: a B-Tree-Index can be seen as a model to map a key to the
position of a record within a sorted array, a Hash-Index as a model to map a
key to a position of a record within an unsorted array, and a BitMap-Index as a
model to indicate if a data record exists or not. In this exploratory research
paper, we start from this premise and posit that all existing index structures
can be replaced with other types of models, including deep-learning models,
which we term learned indexes. The key idea is that a model can learn the sort
order or structure of lookup keys and use this signal to effectively predict
the position or existence of records. We theoretically analyze under which
conditions learned indexes outperform traditional index structures and describe
the main challenges in designing learned index structures. Our initial results
show that, by using neural nets, we are able to outperform cache-optimized
B-Trees by up to 70% in speed while saving an order-of-magnitude in memory over
several real-world data sets. More importantly though, we believe that the idea
of replacing core components of a data management system through learned models
has far reaching implications for future systems designs and that this work
just provides a glimpse of what might be possible.
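As a toy illustration of the idea (not the paper's models), a single linear model can stand in for a learned CDF over a sorted array: the model predicts a record's position from its key, and lookups only search within the model's recorded worst-case error. All names here are illustrative.

```python
def build_learned_index(sorted_keys):
    # Fit key -> position by least squares over a non-empty sorted
    # array, and record the maximum prediction error so lookups can
    # do a bounded local search around the model's guess.
    n = len(sorted_keys)
    xs, ys = sorted_keys, range(n)
    mx, my = sum(xs) / n, sum(ys) / n
    denom = sum((x - mx) ** 2 for x in xs) or 1.0
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / denom
    intercept = my - slope * mx
    predict = lambda k: int(round(slope * k + intercept))
    max_err = max(abs(predict(k) - i) for i, k in enumerate(sorted_keys))

    def lookup(key):
        # Search only inside the error window around the prediction.
        guess = predict(key)
        for i in range(max(0, guess - max_err), min(n, guess + max_err + 1)):
            if sorted_keys[i] == key:
                return i
        return -1                      # key absent
    return lookup
```

The tighter the model fits the key distribution, the smaller `max_err` and the shorter each lookup's scan, which is the trade-off the paper analyzes with far more capable models.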