1,274 research outputs found
Bit-parallel and SIMD alignment algorithms for biological sequence analysis
High-throughput next-generation sequencing techniques have hugely decreased the cost and increased the speed of sequencing, resulting in an explosion of sequencing data. This motivates the development of high-efficiency sequence alignment algorithms. In this thesis, I present multiple bit-parallel and Single Instruction Multiple Data (SIMD) algorithms that greatly accelerate the processing of biological sequences.
The first chapter describes the BitPAl bit-parallel algorithms for global alignment with general integer scoring, which assigns integer weights for match, mismatch, and insertion/deletion. The bit-parallel approach represents individual cells in an alignment scoring matrix as bits in computer words and emulates the calculation of scores by a series of logic operations. Bit-parallelism has previously been applied to other pattern matching problems, producing fast algorithms. In timed tests, we show that BitPAl runs 7 - 25 times faster than a standard iterative algorithm.
The second part involves two approaches to alignment with substitution scoring, which assigns a potentially different substitution weight to every pair of alphabet characters, better representing the relative rates of different mutations. The first approach extends the existing BitPAl method. The second approach is a new SIMD algorithm that uses partial sums of adjacent score differences. I present a simple partial sum method as well as one that uses parallel scan for additional acceleration. Results demonstrate that these algorithms are significantly faster than existing SIMD dynamic programming algorithms.
Finally, I describe two extensions to the partial sums algorithm. The first adds support for affine gap penalty scoring. Affine gap scoring represents the biological likelihood that it is more likely for gaps to be continuous than to be distributed throughout a region by introducing a gap opening penalty and a gap extension penalty. The second extension is an algorithm that uses the partial sums method to calculate the tandem alignment of a pattern against a text sequence using a single pattern copy.
Next generation sequencing data provides a wealth of information to researchers. Extracting that information in a timely manner increases the utility and practicality of sequence analysis algorithms. This thesis presents a family of algorithms which provide alignment scores in less time than previous algorithms
Quantum-Inspired Machine Learning: a Survey
Quantum-inspired Machine Learning (QiML) is a burgeoning field, receiving
global attention from researchers for its potential to leverage principles of
quantum mechanics within classical computational frameworks. However, current
review literature often presents a superficial exploration of QiML, focusing
instead on the broader Quantum Machine Learning (QML) field. In response to
this gap, this survey provides an integrated and comprehensive examination of
QiML, exploring QiML's diverse research domains including tensor network
simulations, dequantized algorithms, and others, showcasing recent
advancements, practical applications, and illuminating potential future
research avenues. Further, a concrete definition of QiML is established by
analyzing various prior interpretations of the term and their inherent
ambiguities. As QiML continues to evolve, we anticipate a wealth of future
developments drawing from quantum mechanics, quantum computing, and classical
machine learning, enriching the field further. This survey serves as a guide
for researchers and practitioners alike, providing a holistic understanding of
QiML's current landscape and future directions.Comment: 56 pages, 13 figures, 8 table
Tensor Networks for Dimensionality Reduction and Large-Scale Optimizations. Part 2 Applications and Future Perspectives
Part 2 of this monograph builds on the introduction to tensor networks and
their operations presented in Part 1. It focuses on tensor network models for
super-compressed higher-order representation of data/parameters and related
cost functions, while providing an outline of their applications in machine
learning and data analytics. A particular emphasis is on the tensor train (TT)
and Hierarchical Tucker (HT) decompositions, and their physically meaningful
interpretations which reflect the scalability of the tensor network approach.
Through a graphical approach, we also elucidate how, by virtue of the
underlying low-rank tensor approximations and sophisticated contractions of
core tensors, tensor networks have the ability to perform distributed
computations on otherwise prohibitively large volumes of data/parameters,
thereby alleviating or even eliminating the curse of dimensionality. The
usefulness of this concept is illustrated over a number of applied areas,
including generalized regression and classification (support tensor machines,
canonical correlation analysis, higher order partial least squares),
generalized eigenvalue decomposition, Riemannian optimization, and in the
optimization of deep neural networks. Part 1 and Part 2 of this work can be
used either as stand-alone separate texts, or indeed as a conjoint
comprehensive review of the exciting field of low-rank tensor networks and
tensor decompositions.Comment: 232 page
Tensor Networks for Dimensionality Reduction and Large-Scale Optimizations. Part 2 Applications and Future Perspectives
Part 2 of this monograph builds on the introduction to tensor networks and
their operations presented in Part 1. It focuses on tensor network models for
super-compressed higher-order representation of data/parameters and related
cost functions, while providing an outline of their applications in machine
learning and data analytics. A particular emphasis is on the tensor train (TT)
and Hierarchical Tucker (HT) decompositions, and their physically meaningful
interpretations which reflect the scalability of the tensor network approach.
Through a graphical approach, we also elucidate how, by virtue of the
underlying low-rank tensor approximations and sophisticated contractions of
core tensors, tensor networks have the ability to perform distributed
computations on otherwise prohibitively large volumes of data/parameters,
thereby alleviating or even eliminating the curse of dimensionality. The
usefulness of this concept is illustrated over a number of applied areas,
including generalized regression and classification (support tensor machines,
canonical correlation analysis, higher order partial least squares),
generalized eigenvalue decomposition, Riemannian optimization, and in the
optimization of deep neural networks. Part 1 and Part 2 of this work can be
used either as stand-alone separate texts, or indeed as a conjoint
comprehensive review of the exciting field of low-rank tensor networks and
tensor decompositions.Comment: 232 page
Scaling Attributed Network Embedding to Massive Graphs
Given a graph G where each node is associated with a set of attributes,
attributed network embedding (ANE) maps each node vin G to a compact vector Xv,
which can be used in downstream machine learning tasks. Ideally, Xv should
capture node v's affinity to each attribute, which considers not only v's own
attribute associations, but also those of its connected nodes along edges in G.
It is challenging to obtain high-utility embeddings that enable accurate
predictions; scaling effective ANE computation to massive graphs with millions
of nodes pushes the difficulty of the problem to a whole new level. Existing
solutions largely fail on such graphs, leading to prohibitive costs,
low-quality embeddings, or both. This paper proposes PANE, an effective and
scalable approach to ANE computation for massive graphs that achieves
state-of-the-art result quality on multiple benchmark datasets, measured by the
accuracy of three common prediction tasks: attribute inference, link
prediction, and node classification. PANE obtains high scalability and
effectiveness through three main algorithmic designs. First, it formulates the
learning objective based on a novel random walk model for attributed networks.
The resulting optimization task is still challenging on large graphs. Second,
PANE includes a highly efficient solver for the above optimization problem,
whose key module is a carefully designed initialization of the embeddings,
which drastically reduces the number of iterations required to converge.
Finally, PANE utilizes multi-core CPUs through non-trivial parallelization of
the above solver, which achieves scalability while retaining the high quality
of the resulting embeddings. Extensive experiments, comparing 10 existing
approaches on 8 real datasets, demonstrate that PANE consistently outperforms
all existing methods in terms of result quality, while being orders of
magnitude faster.Comment: 16 pages. PVLDB 2021. Volume 14, Issue
Study of the Quantum Advantage in Quantum Machine Learning Applications for drug discovery
Εθνικό Μετσόβιο Πολυτεχνείο--Μεταπτυχιακή Εργασία. Διεπιστημονικό-Διατμηματικό Πρόγραμμα Μεταπτυχιακών Σπουδών (Δ.Π.Μ.Σ.) “Μαθηματική Προτυποποίηση σε Σύγχρονες Τεχνολογίες και στα Χρηματοοικονομικά
- …