Computational Analyses of Metagenomic Data
Metagenomics studies the collective microbial genomes extracted from a particular environment without requiring the culturing or isolation of individual genomes, addressing questions revolving around the composition, functionality, and dynamics of microbial communities. The intrinsic complexity of metagenomic data and the diversity of applications call for efficient and accurate computational methods in data handling. In this thesis, I present three primary projects that collectively focus on the computational analysis of metagenomic data, each addressing a distinct topic.
In the first project, I designed and implemented an algorithm named Mapbin for reference-free genomic binning of metagenomic assemblies. Binning aims to group a mixture of genomic fragments based on their genome origin. Mapbin enhances binning results by building a multilayer network that combines the initial binning, assembly graph, and read-pairing information from paired-end sequencing data. The network is further partitioned by the community-detection algorithm, Infomap, to yield a new binning result. Mapbin was tested on multiple simulated and real datasets. The results indicated an overall improvement in the common binning quality metrics.
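The multilayer idea can be sketched in a few lines. The following is a hedged illustration, not Mapbin's actual implementation: edge lists from the different evidence layers (initial-binning co-membership, assembly-graph adjacency, read-pair links) are merged into one weighted graph, and a simple deterministic label-propagation pass stands in for Infomap. All node names are hypothetical.

```python
from collections import Counter, defaultdict

def build_multilayer_graph(layers, layer_weights):
    """Merge edge lists from several evidence layers into one weighted graph.

    layers: list of edge lists [(u, v), ...]; layer_weights: weight per layer.
    """
    graph = defaultdict(Counter)
    for edges, w in zip(layers, layer_weights):
        for u, v in edges:
            graph[u][v] += w
            graph[v][u] += w
    return graph

def label_propagation(graph, max_rounds=50):
    """Deterministic label propagation: each node adopts the label carrying the
    largest incident edge weight (ties broken by the smallest label)."""
    labels = {v: v for v in graph}
    for _ in range(max_rounds):
        changed = False
        for v in sorted(graph):
            tally = Counter()
            for u, w in graph[v].items():
                tally[labels[u]] += w
            if not tally:
                continue
            top = max(tally.values())
            best = min(lab for lab, w in tally.items() if w == top)
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            break
    return labels
```

In Mapbin itself, the partitioning step is performed by Infomap rather than label propagation; the sketch only shows how heterogeneous layers combine into a single network before community detection.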
The second and third projects both derive from ImMiGeNe, a collaborative, multidisciplinary study investigating the interplay between gut microbiota, host genetics, and immunity in stem-cell transplantation (SCT) patients. In the second project, I conducted microbiome analyses of the metagenomic data. The workflow included the removal of contaminant reads and multiple taxonomic and functional profiling steps. The results revealed that the SCT recipients' samples yielded significantly fewer reads, with heavy contamination by host DNA, and that their microbiomes displayed evident signs of dysbiosis. Finally, I discussed several inherent challenges posed by extremely low levels of target DNA and high levels of contamination in the recipient samples, which cannot be rectified through bioinformatics approaches alone.
The primary goal of the third project is to design a set of primers that can be used to cover bacterial flagellin genes present in the human gut microbiota. Considering the notable diversity of flagellins, I incorporated a method to select representative bacterial flagellin gene sequences, a heuristic approach based on established primer design methods to generate a degenerate primer set, and a selection method to filter genes unlikely to occur in the human gut microbiome. As a result, I successfully curated a reduced yet representative set of primers that would be practical for experimental implementation.
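The core of degenerate primer design is collapsing each alignment column into a single IUPAC ambiguity code. The sketch below is a minimal illustration of that step, not the thesis's heuristic; the function names and toy sequences are assumptions for demonstration.

```python
# IUPAC nucleotide ambiguity codes, keyed by the set of observed bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def degenerate_primer(aligned_seqs):
    """Collapse a gap-free alignment column-by-column into one degenerate primer."""
    return "".join(IUPAC[frozenset(column)] for column in zip(*aligned_seqs))

def degeneracy(primer):
    """Number of distinct plain sequences the degenerate primer expands to."""
    sizes = {"A": 1, "C": 1, "G": 1, "T": 1, "R": 2, "Y": 2, "S": 2, "W": 2,
             "K": 2, "M": 2, "B": 3, "D": 3, "H": 3, "V": 3, "N": 4}
    d = 1
    for base in primer:
        d *= sizes[base]
    return d
```

In practice a designer keeps the total degeneracy low, which is exactly why representative-sequence selection and filtering (as in the project) precede this collapsing step.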
Algorithms and complexity for approximately counting hypergraph colourings and related problems
The past decade has witnessed advancements in designing efficient algorithms for approximating the number of solutions to constraint satisfaction problems (CSPs), especially in the local lemma regime. However, the phase transition for computational tractability is not known. This thesis is dedicated to the prototypical problem of this kind of CSP: hypergraph colouring. Parameterised by the number of colours q, the arity of each hyperedge k, and the maximum vertex degree Δ, this problem falls into the regime of the Lovász local lemma when Δ ≲ q^k. Previously, however, fast approximate counting algorithms existed only when Δ ≲ q^(k/3), and no inapproximability result was known. Our contribution is two-fold, stated as follows.
• When q, k ≥ 4 are even and Δ ≥ 5·q^(k/2), approximating the number of hypergraph colourings is NP-hard.
• When the input hypergraph is linear and Δ ≲ q^(k/2), a fast approximate counting algorithm does exist.
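To make the counting problem concrete: a q-colouring is proper here if no hyperedge is monochromatic. The brute-force counter below (exponential time, illustration only, not an algorithm from the thesis) defines exactly the quantity being approximated.

```python
from itertools import product

def count_colourings(num_vertices, hyperedges, q):
    """Count assignments of q colours to vertices such that no hyperedge
    is monochromatic (brute force over all q^n assignments)."""
    count = 0
    for colouring in product(range(q), repeat=num_vertices):
        if all(len({colouring[v] for v in e}) > 1 for e in hyperedges):
            count += 1
    return count
```

The approximate counting algorithms discussed in the thesis estimate this number in polynomial time within the stated parameter regimes.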
LIPIcs, Volume 251, ITCS 2023, Complete Volume
The success probability in Levine's hat problem, and independent sets in graphs
Lionel Levine's hat challenge has players, each with a (very large, or
infinite) stack of hats on their head, each hat independently colored at random
black or white. The players are allowed to coordinate before the random colors
are chosen, but not after. Each player sees all hats except for those on her
own head. They then simultaneously attempt to each pick a black hat
from their respective stacks. They are proclaimed successful only if they are
all correct. Levine's conjecture is that the success probability tends to zero
when the number of players grows. We prove that this success probability is
strictly decreasing in the number of players, and present some connections to
problems in graph theory: relating the size of the largest independent set in a
graph and in a random induced subgraph of it, and bounding the size of a set of
vertices intersecting every maximum-size independent set in a graph.
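A classic two-player strategy (not specific to this paper, stated here as illustrative background): each player guesses the position of the first black hat in the other player's stack. This succeeds exactly when the two first-black positions coincide, with probability Σ_k 4^(-k) = 1/3. A quick Monte Carlo check with finite stacks:

```python
import random

def simulate_two_player(trials, depth=60, seed=0):
    """Monte Carlo estimate of the 'point at the other's first black hat'
    strategy for two players, with stacks truncated to `depth` hats."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        a = [rng.random() < 0.5 for _ in range(depth)]  # True = black
        b = [rng.random() < 0.5 for _ in range(depth)]
        try:
            x, y = a.index(True), b.index(True)
        except ValueError:
            continue  # no black hat within depth (probability ~2^-depth)
        # Player A guesses position y of her own stack; B guesses position x.
        wins += a[y] and b[x]
    return wins / trials
```

If the first-black positions differ, the player with the later one necessarily points at a white hat, so success requires the positions to agree, which is why the estimate concentrates near 1/3.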
Nonlocal games and their device-independent quantum applications
Device-independence is a property of certain protocols that allows one to ensure their proper execution given only classical interaction with devices and assuming the correctness of the laws of physics. This scenario describes the most general form of cryptographic security, in which no trust is placed in the hardware involved; indeed, one may even take it to have been prepared by an adversary.
Many quantum tasks have been shown to admit device-independent protocols by augmentation with "nonlocal games". These are games in which noncommunicating parties jointly attempt to fulfil some conditions imposed by a referee. We introduce examples of such games and examine the optimal strategies of players who are allowed access to different possible shared resources, such as entangled quantum states. We then study their role in self-testing, private random number generation, and secure delegated quantum computation. Hardware imperfections are naturally incorporated in the device-independent scenario as adversarial, and we thus also perform noise robustness analysis where feasible.
We first study a generalization of the Mermin–Peres magic square game to arbitrary rectangular dimensions. After exhibiting some general properties, these "magic rectangle" games are fully characterized in terms of their optimal win probabilities for quantum strategies. We find that for m×n magic rectangle games with dimensions m,n ≥ 3, there are quantum strategies that win with certainty, while for dimensions 1×n quantum strategies do not outperform classical strategies. The final case of dimensions 2×n is richer, and we give upper and lower bounds that both outperform the classical strategies. As an initial usage scenario, we apply our findings to quantum certified randomness expansion to find noise tolerances and rates for all magic rectangle games. To do this, we use our previous results to obtain the winning probabilities of games with a distinguished input for which the devices give a deterministic outcome and follow the analysis of C. A. Miller and Y. Shi [SIAM J. Comput. 46, 1304 (2017)].
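For the 3×3 magic square game, the optimal classical win probability is the well-known value 8/9 (a standard fact, not a claim of this thesis), which quantum strategies beat by winning with certainty. The space of deterministic classical strategies is tiny enough to verify this by exhaustive search:

```python
from itertools import product

def classical_value_magic_square():
    """Maximum win probability of the 3x3 magic square game over
    deterministic classical strategies, by exhaustive search."""
    # Alice fills her given row with a +/-1 triple of product +1 (4 options).
    row_fills = [t for t in product((1, -1), repeat=3) if t[0] * t[1] * t[2] == 1]
    # Bob fills his given column with a +/-1 triple of product -1 (4 options).
    col_fills = [t for t in product((1, -1), repeat=3) if t[0] * t[1] * t[2] == -1]
    best = 0
    for alice in product(row_fills, repeat=3):      # one triple per row
        for bob in product(col_fills, repeat=3):    # one triple per column
            # They win on input (r, c) iff their entries agree at cell (r, c).
            wins = sum(alice[r][c] == bob[c][r] for r in range(3) for c in range(3))
            best = max(best, wins)
    return best / 9
```

Shared randomness cannot help beyond the best deterministic pair, so this search gives the exact classical value.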
Self-testing is a method to verify that one has a particular quantum state from purely classical statistics. For practical applications, such as device-independent delegated verifiable quantum computation, it is crucial that one self-tests multiple Bell states in parallel while keeping the quantum capabilities required of one side to a minimum. We use our 3×n magic rectangle games to obtain a self-test for n Bell states where one side needs only to measure single-qubit Pauli observables. The protocol requires small input sizes [constant for Alice and O(log n) bits for Bob] and is robust with robustness O(n^(5/2)√Δ), where Δ is the closeness of the ideal (perfect) correlations to those observed. To achieve the desired self-test, we introduce a one-side-local quantum strategy for the magic square game that wins with certainty, we generalize this strategy to the family of 3×n magic rectangle games, and we supplement these nonlocal games with extra check rounds (of single and pairs of observables).
Finally, we introduce a device-independent two-prover scheme in which a classical verifier can use a simple untrusted quantum measurement device (the client device) to securely delegate a quantum computation to an untrusted quantum server. To do this, we construct a parallel self-testing protocol to perform device-independent remote state preparation of n qubits and compose this with the unconditionally secure universal verifiable blind quantum computation (VBQC) scheme of J. F. Fitzsimons and E. Kashefi [Phys. Rev. A 96, 012303 (2017)]. Our self-test achieves a multitude of desirable properties for the application we consider, giving rise to practical and fully device-independent VBQC. It certifies parallel measurements of all cardinal and intercardinal directions in the XY-plane as well as the computational basis, uses few input questions (of size logarithmic in n for the client and a constant number communicated to the server), and requires only single-qubit measurements to be performed by the client device.
Faster Deterministic Distributed MIS and Approximate Matching
We present an
round deterministic distributed algorithm for the maximal independent set
problem. By known reductions, this round complexity extends also to maximal
matching, vertex coloring, and edge coloring. These four
problems are among the most central problems in distributed graph algorithms
and have been studied extensively for the past four decades. This improved
round complexity comes closer to the lower bound of
maximal independent set and maximal matching [Balliu et al. FOCS '19]. The
previous best known deterministic complexity for all of these problems was
. Via the shattering technique, the improvement also carries over
to the corresponding randomized complexities, e.g., the new randomized
complexity of vertex coloring is now
rounds.
Our approach is a novel combination of the previously known two methods for
developing deterministic algorithms for these problems, namely global
derandomization via network decomposition (see e.g., [Rozhon, Ghaffari STOC'20;
Ghaffari, Grunau, Rozhon SODA'21; Ghaffari et al. SODA'23]) and local rounding
of fractional solutions (see e.g., [Fischer DISC'17; Harris FOCS'19; Fischer,
Ghaffari, Kuhn FOCS'17; Ghaffari, Kuhn FOCS'21; Faour et al. SODA'23]). We
consider a relaxation of the classic network decomposition concept, where
instead of requiring the clusters in the same block to be non-adjacent, we
allow each node to have a small number of neighboring clusters. We also show a
deterministic algorithm that computes this relaxed decomposition faster than
standard decompositions. We then use this relaxed decomposition to
significantly improve the integrality of certain fractional solutions, before
handing them to the local rounding procedure that now has to do fewer rounding
steps.
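The deterministic algorithm of this paper is involved; as a point of reference for the maximal independent set problem itself, the classic randomized Luby-style procedure (not this paper's algorithm) can be sketched as: in each round, every still-active vertex draws a random priority, local minima join the MIS, and joiners plus their neighbors are removed.

```python
import random

def luby_mis(adj, seed=0):
    """Luby-style randomized MIS. adj maps each vertex to its set of neighbors."""
    rng = random.Random(seed)
    active = set(adj)
    mis = set()
    while active:
        priority = {v: rng.random() for v in active}
        # A vertex joins if it beats all of its still-active neighbors.
        joiners = {v for v in active
                   if all(priority[v] < priority[u] for u in adj[v] if u in active)}
        mis |= joiners
        removed = set(joiners)
        for v in joiners:
            removed |= adj[v] & active
        active -= removed
    return mis
```

Each round removes at least the global-minimum vertex, so the loop terminates; the output is independent (adjacent vertices cannot both be local minima) and maximal (every removed vertex has a neighbor in the MIS).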
A Theory of Link Prediction via Relational Weisfeiler-Leman on Knowledge Graphs
Graph neural networks are prominent models for representation learning over
graph-structured data. While the capabilities and limitations of these models
are well-understood for simple graphs, our understanding remains incomplete in
the context of knowledge graphs. Our goal is to provide a systematic
understanding of the landscape of graph neural networks for knowledge graphs
pertaining to the prominent task of link prediction. Our analysis entails a
unifying perspective on seemingly unrelated models and unlocks a series of
other models. The expressive power of various models is characterized via a
corresponding relational Weisfeiler-Leman algorithm. This analysis is extended
to provide a precise logical characterization of the class of functions
captured by a class of graph neural networks. The theoretical findings
presented in this paper explain the benefits of some widely employed practical
design choices, which are validated empirically. (Proceedings of the Thirty-Seventh Annual Conference on Advances in
Neural Information Processing Systems, NeurIPS 2023. Code available at:
https://github.com/HxyScotthuang/CMPN)
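The paper works with relational variants of the Weisfeiler-Leman algorithm; for intuition, here is the plain (non-relational) 1-WL colour refinement that these variants generalize. This is the standard textbook procedure, not the paper's relational algorithm.

```python
def wl_colors(adj, rounds=None):
    """1-dimensional Weisfeiler-Leman colour refinement.

    adj maps each vertex to its set of neighbors. Returns the stable colouring:
    each round, a vertex's new colour is determined by its old colour and the
    multiset of its neighbors' colours.
    """
    colors = {v: 0 for v in adj}
    for _ in range(rounds or len(adj)):
        signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                      for v in adj}
        # Relabel signatures with small integers, canonically.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new = {v: palette[signatures[v]] for v in adj}
        if new == colors:
            break
        colors = new
    return colors
```

The expressive-power results in the paper characterize which link-prediction models distinguish exactly the same pairs as a suitable relational analogue of this refinement.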
Parallel and Flow-Based High Quality Hypergraph Partitioning
Balanced hypergraph partitioning is a classic NP-hard optimization problem that is a fundamental tool in such diverse disciplines as VLSI circuit design, route planning, sharding distributed databases, optimizing communication volume in parallel computing, and accelerating the simulation of quantum circuits.
Given a hypergraph and an integer k, the task is to divide the vertices into k disjoint blocks of bounded size, while minimizing an objective function on the hyperedges that span multiple blocks.
In this dissertation we consider the most commonly used objective, the connectivity metric, where we aim to minimize the number of different blocks connected by each hyperedge.
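The connectivity metric is simple to state in code: for each hyperedge e, let λ(e) be the number of distinct blocks its pins touch, and sum λ(e) − 1 over all hyperedges. A minimal evaluator (illustrative, not part of the dissertation's framework):

```python
def connectivity_objective(hyperedges, partition):
    """Connectivity metric: sum over hyperedges e of (lambda(e) - 1),
    where lambda(e) is the number of blocks spanned by e.

    hyperedges: iterable of vertex tuples; partition: vertex -> block id.
    """
    return sum(len({partition[v] for v in e}) - 1 for e in hyperedges)
```

An edge entirely inside one block contributes 0, so minimizing this objective rewards partitions that keep hyperedges uncut.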
The most successful heuristic for balanced partitioning is the multilevel approach, which consists of three phases.
In the coarsening phase, vertex clusters are contracted to obtain a sequence of structurally similar but successively smaller hypergraphs.
Once sufficiently small, an initial partition is computed.
Lastly, the contractions are successively undone in reverse order, and an iterative improvement algorithm is employed to refine the projected partition on each level.
An important aspect in designing practical heuristics for optimization problems is the trade-off between solution quality and running time.
The appropriate trade-off depends on the specific application, the size of the data sets, and the computational resources available to solve the problem.
Existing algorithms are either slow and sequential but offer high solution quality, or simple, fast, and easy to parallelize but of low quality.
While this trade-off cannot be avoided entirely, our goal is to close the gaps as much as possible.
We achieve this by improving the state of the art in all non-trivial areas of the trade-off landscape with only a few techniques, employed in two different ways.
Furthermore, most research on parallelization has focused on distributed memory, which neglects the greater flexibility of shared-memory algorithms and the wide availability of commodity multi-core machines.
In this thesis, we therefore design and revisit fundamental techniques for each phase of the multilevel approach, and develop highly efficient shared-memory parallel implementations thereof.
We consider two iterative improvement algorithms, one based on the Fiduccia-Mattheyses (FM) heuristic, and one based on label propagation.
For these, we propose a variety of techniques to improve the accuracy of gains when moving vertices in parallel, as well as low-level algorithmic improvements.
For coarsening, we present a parallel variant of greedy agglomerative clustering with a novel method to resolve cluster join conflicts on-the-fly.
Combined with a preprocessing phase for coarsening based on community detection, a portfolio of from-scratch partitioning algorithms, as well as recursive partitioning with work-stealing, we obtain our first parallel multilevel framework.
It is the fastest partitioner known and achieves medium-high quality, beating all parallel partitioners in quality and coming close to the highest-quality sequential partitioner.
Our second contribution is a parallelization of an n-level approach, where only one vertex is contracted and uncontracted on each level.
This extreme approach aims at high solution quality via very fine-grained, localized refinement, but seems inherently sequential.
We devise an asynchronous n-level coarsening scheme based on a hierarchical decomposition of the contractions, as well as a batch-synchronous uncoarsening, and later fully asynchronous uncoarsening.
In addition, we adapt our refinement algorithms, and also use the preprocessing and portfolio.
This scheme is highly scalable, and achieves the same quality as the highest quality sequential partitioner (which is based on the same components), but is of course slower than our first framework due to fine-grained uncoarsening.
The last ingredient for high quality is an iterative improvement algorithm based on maximum flows.
In the sequential setting, we first improve an existing idea by solving incremental maximum flow problems, which leads to smaller cuts and is faster due to engineering efforts.
Subsequently, we parallelize the maximum flow algorithm and schedule refinements in parallel.
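The engineered incremental and parallel flow solvers of the dissertation are far more elaborate; for reference, the underlying max-flow primitive in its textbook Edmonds-Karp form (shortest augmenting paths via BFS) looks like this. Graph and node names are illustrative.

```python
from collections import deque

def edmonds_karp(capacity, source, sink):
    """Maximum flow via shortest augmenting paths.

    capacity: dict u -> {v: cap} of directed arc capacities.
    """
    # Build residual capacities, including zero-capacity reverse arcs.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # BFS for a shortest path with positive residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Find the bottleneck along the path, then push flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            residual[parent[v]][v] -= bottleneck
            residual[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck
```

Flow-based refinement repeatedly solves such problems on region subgraphs around the cut, turning minimum cuts into improved block boundaries.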
Beyond the pursuit of the highest quality, we present a deterministic parallel partitioning framework.
We develop deterministic versions of the preprocessing, coarsening, and label propagation refinement.
Experimentally, we demonstrate that the penalties for determinism in terms of partition quality and running time are very small.
All of our claims are validated through extensive experiments, comparing our algorithms with state-of-the-art solvers on large and diverse benchmark sets.
To foster further research, we make our contributions available in our open-source framework Mt-KaHyPar.
While it seems inevitable that, with ever-increasing problem sizes, we must transition to distributed-memory algorithms, the study of shared-memory techniques is not in vain.
With the multilevel approach, even the inherently slow techniques have a role to play in fast systems, as they can be employed to boost quality on coarse levels at little expense.
Similarly, techniques for shared-memory parallelism are important, both as soon as a coarse graph fits into memory, and as local building blocks in the distributed algorithm.
Vertex-critical graphs far from edge-criticality
Let be any positive integer. We prove that for every sufficiently large
there exists a -chromatic vertex-critical graph such that
for every set with . This partially
solves a problem posed by Erdős in 1985, who asked whether the above
statement holds for .
Dedekind's problem in the hypergrid
Consider the partially ordered set on equipped
with the natural coordinate-wise ordering. Let denote the number of
antichains of this poset. The quantity has a number of combinatorial
interpretations: it is precisely the number of -dimensional partitions
with entries from , and by a result of Moshkovitz and Shapira,
is equal to the -color Ramsey number of monotone paths of length
in 3-uniform hypergraphs. This has led to significant interest in the
growth rate of .
A number of results in the literature show that , where is the width of , and the term
goes to for fixed and tending to infinity. In the present paper, we
prove the first bound that is close to optimal in the case where is
arbitrarily large compared to , as well as improve all previous results for
sufficiently large . In particular, we prove that there is an absolute
constant such that for every , This resolves a
conjecture of Moshkovitz and Shapira. A key ingredient in our proof is the
construction of a normalized matching flow on the cover graph of the poset
in which the distribution of weights is close to uniform, a result that
may be of independent interest.
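For very small posets the antichain count can be computed by brute force, reproducing the first Dedekind numbers in the Boolean-cube case. This sketch (illustration only, exponential in the number of poset elements) defines the quantity whose growth rate the paper bounds.

```python
from itertools import combinations, product

def count_antichains(k, d):
    """Count antichains of the hypergrid poset [k]^d under the
    coordinate-wise order, by brute force over all vertex subsets."""
    points = list(product(range(k), repeat=d))

    def leq(p, q):
        return all(a <= b for a, b in zip(p, q))

    count = 0
    for mask in range(1 << len(points)):
        chosen = [p for i, p in enumerate(points) if mask >> i & 1]
        # An antichain contains no two comparable elements.
        if all(not (leq(p, q) or leq(q, p)) for p, q in combinations(chosen, 2)):
            count += 1
    return count
```

For k = 2 this counts antichains of the Boolean lattice, i.e., the Dedekind numbers 2, 3, 6, 20, ...; the paper's interest is the opposite regime, where k is arbitrarily large compared to d.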