100 research outputs found
Packing Cars into Narrow Roads: PTASs for Limited Supply Highway
In the Highway problem, we are given a path with n edges (the highway), and a set of m drivers, each one characterized by a subpath and a budget. For a given assignment of edge prices (the tolls), the highway owner collects from each driver the total price of the associated path when it does not exceed the driver's budget, and zero otherwise. The goal is to choose the prices to maximize the total profit. A PTAS is known for this (strongly NP-hard) problem [Grandoni,Rothvoss-SODA'11, SICOMP'16].
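The profit rule described above can be sketched in a few lines. This is only an illustration of how a given toll assignment is evaluated, not the PTAS itself; the function name `highway_profit` and the encoding of a driver as `(first_edge, last_edge, budget)` are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical encoding: a driver is (first_edge, last_edge, budget),
# with 0-indexed, inclusive edge positions along the path.
Driver = Tuple[int, int, float]


def highway_profit(prices: List[float], drivers: List[Driver]) -> float:
    """Total profit collected for a fixed assignment of edge prices."""
    # Prefix sums let us price any subpath in O(1).
    prefix = [0.0]
    for p in prices:
        prefix.append(prefix[-1] + p)

    total = 0.0
    for first, last, budget in drivers:
        cost = prefix[last + 1] - prefix[first]
        # A driver pays the full path price only if it fits the budget.
        if cost <= budget:
            total += cost
    return total


prices = [1.0, 2.0, 3.0]
drivers = [(0, 1, 3.0), (1, 2, 4.0), (0, 2, 6.0)]
print(highway_profit(prices, drivers))  # 9.0: the second driver cannot afford 5.0
```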
In this paper we study the limited supply generalization of Highway, that incorporates capacity constraints. Here the input also includes a capacity u_e >= 0 for each edge e; we need to select, among drivers that can afford the required price, a subset such that the number of drivers that use each edge e is at most u_e (and we get profit only from selected drivers). To the best of our knowledge, the only approximation algorithm known for this problem is a folklore O(log m) approximation based on a reduction to the related Unsplittable Flow on a Path problem (UFP). The main result of this paper is a PTAS for limited supply highway.
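The capacity constraint can be illustrated with a small feasibility check. This is a sketch under assumed names (`is_feasible`, drivers encoded as inclusive edge ranges); it only verifies that a chosen subset of drivers respects every u_e and says nothing about how the paper selects the subset.

```python
from typing import List, Tuple


def is_feasible(capacities: List[int], selected: List[Tuple[int, int]]) -> bool:
    """Check that no edge e carries more than capacities[e] selected drivers.

    Each selected driver is an assumed (first_edge, last_edge) pair,
    0-indexed and inclusive.
    """
    load = [0] * len(capacities)
    for first, last in selected:
        for e in range(first, last + 1):
            load[e] += 1
    return all(used <= cap for used, cap in zip(load, capacities))


print(is_feasible([1, 2, 1], [(0, 1), (1, 2)]))          # True
print(is_feasible([1, 2, 1], [(0, 1), (1, 2), (0, 2)]))  # False
```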
As a second contribution, we study a natural generalization of the problem where each driver i demands a different amount d_i of capacity. Using known techniques, it is not hard to derive a QPTAS for this problem. Here we present a PTAS for the case that drivers have uniform budgets. Finding a PTAS for non-uniform-demand limited supply highway is left as a challenging open problem.
Pricing on Paths: A PTAS for the Highway Problem
In the highway problem, we are given an n-edge line graph (the highway), and
a set of paths (the drivers), each one with its own budget. For a given
assignment of edge weights (the tolls), the highway owner collects from each
driver the weight of the associated path, when it does not exceed the budget of
the driver, and zero otherwise. The goal is choosing weights so as to maximize
the profit.
A lot of research has been devoted to this apparently simple problem. The
highway problem was shown to be strongly NP-hard only recently
[Elbassioni,Raman,Ray-'09]. The best-known approximation is O(\log n/\log\log
n) [Gamzu,Segev-'10], which improves on the previous-best O(\log n)
approximation [Balcan,Blum-'06].
In this paper we present a PTAS for the highway problem, hence closing the
complexity status of the problem. Our result is based on a novel randomized
dissection approach, which has some points in common with Arora's quadtree
dissection for Euclidean network design [Arora-'98]. The basic idea is
enclosing the highway in a bounding path, such that both the size of the
bounding path and the position of the highway in it are random variables. Then
we consider a recursive O(1)-ary dissection of the bounding path, in subpaths
of uniform optimal weight. Since the optimal weights are unknown, we construct
the dissection in a bottom-up fashion via dynamic programming, while computing
the approximate solution at the same time. Our algorithm can be easily
derandomized. We demonstrate the versatility of our technique by presenting
PTASs for two variants of the highway problem: the tollbooth problem with a
constant number of leaves and the maximum-feasibility subsystem problem on
interval matrices. In both cases the previous best approximation factors are
polylogarithmic [Gamzu,Segev-'10, Elbassioni,Raman,Ray,Sitters-'09].
Developing graph-based co-scheduling algorithms on multicore computers
It is common for multiple cores to reside on the same chip and share the on-chip cache. As a result, resource sharing can degrade the performance of co-running jobs. Job co-scheduling is a technique that can effectively alleviate this contention, and many co-schedulers have been reported in the literature. Most solutions, however, do not aim to find the optimal co-scheduling solution. Being able to determine the optimal solution is critical for evaluating co-scheduling systems. Moreover, most co-schedulers only consider serial jobs, whereas real-world systems often contain both parallel and serial jobs. In this paper a graph-based method is developed to find the optimal co-scheduling solution for serial jobs; the method is then extended to incorporate parallel jobs, including multi-process and multithreaded parallel jobs. A number of optimization measures are also developed to accelerate the solving process. Moreover, a flexible approximation technique is proposed to strike a balance between the solving speed and the solution quality. Extensive experiments are conducted to evaluate the effectiveness of the proposed co-scheduling algorithms. The results show that the proposed algorithms can find the optimal co-scheduling solution for both serial and parallel jobs. The proposed approximation technique is also shown to be flexible in the sense that the solving speed can be controlled by setting the requirement for the solution quality.
Assembly of long error-prone reads using de Bruijn graphs
The recent breakthroughs in assembling long error-prone reads were based on the overlap-layout-consensus (OLC) approach and did not utilize the strengths of the alternative de Bruijn graph approach to genome assembly. Moreover, these studies often assume that applications of the de Bruijn graph approach are limited to short and accurate reads and that the OLC approach is the only practical paradigm for assembling long error-prone reads. We show how to generalize de Bruijn graphs for assembling long error-prone reads and describe the ABruijn assembler, which combines the de Bruijn graph and the OLC approaches and results in accurate genome reconstructions.
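For readers unfamiliar with the data structure, the following is a minimal sketch of the classical de Bruijn graph built from error-free reads. It is not the generalized construction used by ABruijn; the function name `de_bruijn` and the adjacency-list representation are assumptions made for this illustration.

```python
from collections import defaultdict
from typing import Dict, List


def de_bruijn(reads: List[str], k: int) -> Dict[str, List[str]]:
    """Classical de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Every k-mer in a read contributes one edge from its prefix
    (first k-1 characters) to its suffix (last k-1 characters).
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


print(de_bruijn(["ACGT"], 3))  # {'AC': ['CG'], 'CG': ['GT']}
```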
Machine learning methods for decoding information in RNA interactions and DNA sequences
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2020. 2. Kim Sun.
Phenotypic differences among organisms are mainly due to differences in genetic information. As a result of changes in genetic information, an organism may evolve into a different species, and patients with the same disease may have different prognoses. This important biological information can be observed in the form of various omics data using high-throughput instrument technologies such as sequencing instruments. However, interpreting such omics data is challenging because the data have very high dimensionality but a relatively small number of samples. Typically, the number of dimensions is higher than the number of samples, which makes the interpretation of omics data one of the most challenging machine learning problems.
My doctoral study aims to develop new bioinformatics methods for decoding information in these high dimensional data by utilizing machine learning algorithms.
The first study analyzes the difference in the amount of information between different regions of the DNA sequence. To achieve this goal, a rank-based k-spectrum string kernel, the RKSS kernel, is developed for comparative and evolutionary comparison of various genomic region sequences among multiple species. The RKSS kernel extends the existing k-spectrum string kernel by utilizing rank information of k-mers and landmarks of k-mers that represent a species. By using a landmark as a reference point for comparison, the number of k-mers needed to calculate sequence similarities is dramatically reduced. In the experiments on three different genomic regions, the RKSS kernel captured more reliable distances between species according to the genetic information content of the target region. Also, the RKSS kernel was able to rearrange each region to match common biological insight.
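As background, the plain k-spectrum string kernel that RKSS extends can be sketched as an inner product of k-mer count vectors. This example does not implement the rank or landmark components of RKSS, which the abstract does not specify in detail; the function names are assumptions made for this illustration.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def k_spectrum_kernel(s: str, t: str, k: int) -> int:
    """Plain k-spectrum kernel: dot product of the two k-mer count vectors."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(cs[w] * ct[w] for w in cs if w in ct)


# "ACGAC" has 2-mers AC (x2), CG, GA; "GACG" has GA, AC, CG.
print(k_spectrum_kernel("ACGAC", "GACG", 2))  # 4
```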
The second study aims to efficiently decode complex genetic interactions using biological networks and then to classify cancer subtypes by interpreting biological functions. To achieve this goal, a pathway-based deep learning model using a graph convolutional network and a multi-attention based ensemble (GCN+MAE) is developed for cancer subtype classification. In order to efficiently reduce the relationships between genes using pathway information, GCN+MAE is designed as an explainable deep learning structure using a graph convolutional network and an attention mechanism. The extracted pathway-level information of cancer subtypes is then transferred back to the gene level by network propagation. In experiments on five cancer data sets, GCN+MAE showed better cancer subtype classification performance and captured subtype-specific pathways and their biological functions.
The third study identifies sub-networks of a biological pathway. The goal is to dissect a biological pathway into multiple sub-networks, each of which forms a single functional unit. To achieve this goal, a condition-specific sub-module detection method for biological networks, MIDAS (MIning Differentially Activated Subpaths), is developed. From the pathway, edge activities are measured using explicit gene expression and network topology. Using these activities, differentially activated subpaths are explored by a statistical approach. Also, by extending this idea to a graph convolutional network, different sub-networks are highlighted by attention mechanisms. In the experiment with breast cancer data, MIDAS and the deep learning model successfully decomposed gene-level features into sub-modules of single functions.
In summary, my doctoral study proposes new computational methods to compare genomic DNA sequences as information contents, to model pathway-based cancer subtype classifications and regulations, and to identify condition-specific sub-modules among multiple cancer subtypes.
Chapter 1 Introduction
1.1 Biological questions with genetic information
1.1.1 Biological Sequences
1.1.2 Gene expression
1.2 Formulating computational problems for the biological questions
1.2.1 Decoding biological sequences by k-mer vectors
1.2.2 Interpretation of complex relationships between genes
1.3 Three computational problems for the biological questions
1.4 Outline of the thesis
Chapter 2 Ranked k-spectrum kernel for comparative and evolutionary comparison of DNA sequences
2.1 Motivation
2.1.1 String kernel for sequence comparison
2.1.2 Approach: RKSS kernel
2.2 Methods
2.2.1 Mapping biological sequences to k-mer space: the k-spectrum string kernel
2.2.2 The ranked k-spectrum string kernel with a landmark
2.2.3 Single landmark-based reconstruction of phylogenetic tree
2.2.4 Multiple landmark-based distance comparison of exons, introns, CpG islands
2.2.5 Sequence Data for analysis
2.3 Results
2.3.1 Reconstruction of phylogenetic tree on the exons, introns, and CpG islands
2.3.2 Landmark space captures the characteristics of three genomic regions
2.3.3 Cross-evaluation of the landmark-based feature space
Chapter 3 Pathway-based cancer subtype classification and interpretation by attention mechanism and network propagation
3.1 Motivation
3.2 Methods
3.2.1 Encoding biological prior knowledge using Graph Convolutional Network
3.2.2 Re-producing comprehensive biological process by Multi-Attention based Ensemble
3.2.3 Linking pathways and transcription factors by network propagation with permutation-based normalization
3.3 Results
3.3.1 Pathway database and cancer data set
3.3.2 Evaluation of individual GCN pathway models
3.3.3 Performance of ensemble of GCN pathway models with multi-attention
3.3.4 Identification of TFs as regulator of pathways and GO term analysis of TF target genes
Chapter 4 Detecting sub-modules in biological networks with gene expression by statistical approach and graph convolutional network
4.1 Motivation
4.1.1 Pathway based analysis of transcriptome data
4.1.2 Challenges and Summary of Approach
4.2 Methods
4.2.1 Convert single KEGG pathway to directed graph
4.2.2 Calculate edge activity for each sample
4.2.3 Mining differentially activated subpath among classes
4.2.4 Prioritizing subpaths by the permutation test
4.2.5 Extension: graph convolutional network and class activation map
4.3 Results
4.3.1 Identifying 36 subtype specific subpaths in breast cancer
4.3.2 Subpath activities have a good discrimination power for cancer subtype classification
4.3.3 Subpath activities have a good prognostic power for survival outcomes
4.3.4 Comparison with an existing tool, PATHOME
4.3.5 Extension: detection of subnetwork on PPI network
Chapter 5 Conclusions
Abstract in Korean
- …