A Performance Model for GPU Architectures: Analysis and Design of Fundamental Algorithms
Ph.D. Thesis, University of Hawaiʻi at Mānoa, 2018
Improving the Compact Bit-Sliced Signature Index COBS for Large Scale Genomic Data
In this thesis we investigate the potential for improving the Compact Bit-Sliced Signature Index (COBS) [BBGI19] for large-scale genomic data. COBS, developed by Bingmann et al., is an inverted text index based on Bloom filters. It can index the k-mers of DNA samples or the q-grams of plain-text data, and it is queried using approximate pattern matching based on the k-mer (or q-gram) profile of a query. Bingmann et al. demonstrated several advantages of COBS over other state-of-the-art approximate k-mer-based indices, including extraordinarily fast query and construction times and the ability to construct and query the index even when it does not fit into main memory. These properties motivated us to look more closely at areas where COBS could be improved. Our main goal is to make COBS more scalable. Scalability is a critical factor when handling DNA-related data: the amount of sequenced data stored in publicly available archives nearly doubles every year, making it difficult to handle from a resource perspective alone. We focus on two main areas of improvement: index compression through clustering, and distribution. The thesis presents our findings and the improvements achieved in these areas.
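To make the idea concrete, the following is a minimal, illustrative sketch of a Bloom-filter k-mer index queried by k-mer profile, in the spirit of what the abstract describes. It is not the real COBS implementation (COBS stores bit-sliced signatures and is written in C++); the class name `KmerBloomIndex`, the parameters, and the hashing scheme are all assumptions made for illustration.

```python
import hashlib

def kmers(seq, k=3):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

class KmerBloomIndex:
    """Illustrative per-document Bloom-filter index over k-mers.

    Each document gets one Bloom filter (stored here as a Python int
    used as a bit array). A query is scored by the fraction of its
    k-mers that appear to be present in a document's filter.
    """

    def __init__(self, m=256, hashes=3, k=3):
        self.m, self.hashes, self.k = m, hashes, k
        self.filters = {}  # document name -> bit array (as int)

    def _positions(self, kmer):
        # Derive `hashes` bit positions from a salted SHA-256 digest.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{kmer}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add_document(self, name, seq):
        bits = 0
        for km in kmers(seq, self.k):
            for p in self._positions(km):
                bits |= 1 << p
        self.filters[name] = bits

    def query(self, seq, threshold=0.8):
        """Approximate pattern matching: rank documents by the fraction
        of the query's k-mers whose Bloom positions are all set."""
        qk = kmers(seq, self.k)
        results = []
        for name, bits in self.filters.items():
            hits = sum(
                all(bits >> p & 1 for p in self._positions(km))
                for km in qk
            )
            if qk and hits / len(qk) >= threshold:
                results.append((name, hits / len(qk)))
        return sorted(results, key=lambda t: -t[1])

idx = KmerBloomIndex()
idx.add_document("sample_a", "ACGTACGT")
idx.add_document("sample_b", "TTTTTTTT")
print(idx.query("ACGTA"))  # sample_a matches all query k-mers
```

As with any Bloom filter, membership answers can be false positives but never false negatives, which is why the query is *approximate* and thresholded rather than exact.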
Trajectory Similarity Measurement: An Efficiency Perspective
Trajectories that capture object movement have numerous applications, in
which similarity computation between trajectories often plays a key role.
Traditionally, the similarity between two trajectories is quantified by means
of heuristic measures, e.g., Hausdorff or ERP, that operate directly on the
trajectories. In contrast, recent studies exploit deep learning to map
trajectories to d-dimensional vectors, called embeddings. Then, some distance
measure, e.g., Manhattan or Euclidean, is applied to the embeddings to quantify
trajectory similarity. The resulting similarities are inaccurate: they only
approximate the similarities obtained using the heuristic measures. As distance
computation on embeddings is efficient, focus has been on achieving embeddings
yielding high accuracy.
Adopting an efficiency perspective, we analyze the time complexities of both
the heuristic and the learning-based approaches, finding that the time
complexities of the former approaches are not necessarily higher. Through
extensive experiments on open datasets, we find that, on both CPUs and GPUs,
only a few learning-based approaches deliver the promised higher efficiency,
and only when the embeddings can be pre-computed; heuristic approaches are
more efficient for one-off computations. Among the learning-based approaches,
the self-attention-based ones are the fastest to learn embeddings, and those
embeddings also yield the highest accuracy for similarity queries. These
results have implications for choosing among trajectory similarity approaches
under different application requirements.
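The two cost profiles contrasted in the abstract can be sketched as follows. This is an illustrative example, not code from the paper: a heuristic measure (here the symmetric Hausdorff distance) compares every pair of points across two raw trajectories, costing O(n*m) distance evaluations, whereas a learned approach reduces each trajectory to a d-dimensional embedding once, after which each comparison is a single O(d) vector distance. The embedding vectors below are placeholders standing in for the output of a trained model.

```python
import math

def hausdorff(traj_a, traj_b):
    """Symmetric Hausdorff distance between two point sequences.

    Cost: O(n * m) point-to-point distance computations, performed
    anew for every pair of trajectories compared.
    """
    def directed(a, b):
        return max(min(math.dist(p, q) for q in b) for p in a)
    return max(directed(traj_a, traj_b), directed(traj_b, traj_a))

def embedding_distance(emb_a, emb_b):
    """Euclidean distance between two precomputed d-dimensional
    embeddings: O(d) per comparison, independent of trajectory length."""
    return math.dist(emb_a, emb_b)

# Two short raw trajectories (sequences of 2-D points).
t1 = [(0.0, 0.0), (1.0, 0.0)]
t2 = [(0.0, 1.0), (1.0, 1.0)]
print(hausdorff(t1, t2))  # operates directly on the trajectories

# Placeholder embeddings, as if produced by a trained model.
e1, e2 = [0.0, 0.0, 0.0], [3.0, 4.0, 0.0]
print(embedding_distance(e1, e2))  # cheap, but only approximates the above
```

This is exactly the trade-off the experiments probe: the embedding distance is only worthwhile when its one-time computation can be amortized over many queries, while a one-off comparison may be served more cheaply by the heuristic measure itself.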