Alignment-free Genomic Analysis via a Big Data Spark Platform
Motivation: Alignment-free distance and similarity functions (AF functions,
for short) are a well-established alternative to pairwise and multiple sequence
alignments for many genomic, metagenomic and epigenomic tasks. Because these
applications are data-intensive, the computation of AF functions is a Big Data
problem, and the recent literature indicates that the development of fast and
scalable algorithms for computing AF functions is a high-priority task. Somewhat
surprisingly, despite the increasing popularity of Big Data technologies in
Computational Biology, the development of a Big Data platform for those tasks
has not been pursued, possibly due to its complexity. Results: We fill this
important gap by introducing FADE, the first extensible, efficient and scalable
Spark platform for alignment-free genomic analysis. It natively supports
eighteen of the best-performing AF functions identified in a recent hallmark
benchmarking study. FADE's development and potential impact comprise several
novel aspects of interest, namely: (a) a considerable effort in designing
distributed algorithms, the most tangible result being a much faster execution
time for reference methods such as MASH and FSWM; (b) a software design that
makes FADE user-friendly and easily extensible by Spark non-specialists; (c) its
ability to support data- and compute-intensive tasks. In this regard, we provide
a novel and much-needed analysis of how informative and robust AF functions are,
in terms of the statistical significance of their output. Our findings naturally
extend those of the highly regarded benchmarking study, since the functions
that can actually be used are reduced to a handful of the eighteen included in
FADE.
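To make the notion of an AF function concrete, the following is a minimal single-machine sketch in Python. It is not FADE code and does not use Spark; the function names and the choice of a Euclidean distance on k-mer counts and a Jaccard similarity on k-mer sets are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer of seq (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def euclidean_distance(a, b, k=3):
    """Illustrative AF distance: Euclidean distance between k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # A Counter returns 0 for k-mers that are absent from a sequence.
    return sqrt(sum((ca[w] - cb[w]) ** 2 for w in set(ca) | set(cb)))


def jaccard_similarity(a, b, k=3):
    """Illustrative AF similarity: Jaccard index of the two k-mer sets."""
    sa, sb = set(kmer_counts(a, k)), set(kmer_counts(b, k))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0
```

Both functions need only the k-mer statistics of each sequence, never an alignment, which is what makes this family of methods a natural candidate for distributed computation.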
Machine learning methods for decoding information in RNA interactions and DNA sequences
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2020. 2. Kim Sun.
Phenotypic differences among organisms are mainly due to differences in genetic information. As a result of changes in genetic information, an organism may evolve into a different species, and patients with the same disease may have different prognoses. This important biological information can be observed in the form of various omics data using high-throughput technologies such as sequencing instruments. However, interpreting such omics data is challenging, since omics data have very high dimensionality but a relatively small number of samples. Typically, the number of dimensions is higher than the number of samples, which makes the interpretation of omics data one of the most challenging machine learning problems.
My doctoral study aims to develop new bioinformatics methods for decoding information in these high-dimensional data by utilizing machine learning algorithms.
The first study analyzes differences in the amount of information between different regions of the DNA sequence. To this end, a rank-based k-spectrum string kernel, the RKSS kernel, is developed for comparative and evolutionary comparison of sequences from various genomic regions among multiple species. The RKSS kernel extends the existing k-spectrum string kernel by utilizing rank information of k-mers and landmarks of k-mers that represent a species. By using a landmark as a reference point for comparison, the number of k-mers needed to calculate sequence similarities is dramatically reduced. In experiments on three different genomic regions, the RKSS kernel captured more reliable distances between species according to the genetic information content of the target region. Also, the RKSS kernel was able to order the regions in a way that matches common biological insight.
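The two ingredients named above, the k-spectrum string kernel and a rank-based comparison restricted to a small landmark set, can be sketched in Python as follows. The k-spectrum kernel is the standard inner product of k-mer count vectors; the rank comparison uses a Spearman footrule distance as a stand-in, which is an assumption and not the RKSS formula from the thesis. The names `landmark_ranks` and `rank_distance` are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def k_spectrum_kernel(a, b, k):
    """Standard k-spectrum kernel: inner product of the k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())


def landmark_ranks(seq, landmark, k):
    """Rank the landmark k-mers by their count in seq (1 = most frequent).

    Ties are broken alphabetically so that the ranking is deterministic.
    """
    counts = kmer_counts(seq, k)
    ordered = sorted(landmark, key=lambda w: (-counts[w], w))
    return {w: rank for rank, w in enumerate(ordered, start=1)}


def rank_distance(a, b, landmark, k):
    """Assumed stand-in for a rank-based comparison: Spearman footrule
    distance between the landmark rankings of the two sequences."""
    ra = landmark_ranks(a, landmark, k)
    rb = landmark_ranks(b, landmark, k)
    return sum(abs(ra[w] - rb[w]) for w in landmark)
```

Because only the landmark k-mers are ranked, the cost of a comparison depends on the size of the landmark rather than on the 4^k possible k-mers, which is the efficiency argument made in the abstract.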
The second study aims to efficiently decode complex genetic interactions using biological networks and then to classify cancer subtypes by interpreting biological functions. To this end, a pathway-based deep learning model for cancer subtype classification is developed, using a graph convolutional network and a multi-attention based ensemble (GCN+MAE). To efficiently reduce the relationships between genes using pathway information, GCN+MAE is designed as an explainable deep learning structure built on a graph convolutional network and an attention mechanism. The extracted pathway-level information of cancer subtypes is transported back to the gene level by network propagation. In experiments on five cancer data sets, GCN+MAE showed better cancer subtype classification performance than existing classification models and captured subtype-specific pathways and their biological functions.
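Network propagation, the step that carries pathway-level scores back to the gene level, is commonly implemented as a random walk with restart. The sketch below, in plain Python, shows that generic technique only; it is an assumption that the thesis uses a comparable scheme, and the permutation-based normalization listed in its table of contents is not reproduced here. The function name, parameters and toy graph are hypothetical.

```python
def propagate(adjacency, seeds, restart=0.5, iterations=100, tol=1e-9):
    """Random walk with restart on an undirected graph.

    adjacency: dict mapping each node to the set of its neighbours
               (assumed symmetric, so every neighbour has degree >= 1).
    seeds:     dict mapping seed nodes to non-negative initial scores.
    Returns a dict of propagated scores that sum to 1.
    """
    nodes = sorted(adjacency)
    degree = {n: len(adjacency[n]) for n in nodes}
    total = sum(seeds.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("at least one seed must have a positive score")
    p0 = {n: seeds.get(n, 0.0) / total for n in nodes}
    p = dict(p0)
    for _ in range(iterations):
        nxt = {}
        for n in nodes:
            # Each neighbour m spreads its score evenly over its own edges.
            incoming = sum(p[m] / degree[m] for m in adjacency[n])
            nxt[n] = (1 - restart) * incoming + restart * p0[n]
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p
```

On a three-node path A - B - C seeded at A, for example, the scores decay with distance from the seed, so genes close to a highly scored pathway member receive more of its signal than distant ones.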
The third study identifies sub-networks of a biological pathway. The goal is to dissect a biological pathway into multiple sub-networks, each of which forms a single functional unit. To this end, MIDAS (MIning Differentially Activated Subpaths), a condition-specific sub-module detection method for biological networks, is developed. Within a pathway, edge activities are measured from gene expression and network topology. Using these activities, differentially activated subpaths are identified by a statistical approach. In addition, by extending this idea to graph convolutional networks, different sub-networks are highlighted by attention mechanisms. In an experiment with breast cancer data, MIDAS and the deep learning model successfully decomposed gene-level features into sub-modules with single functions.
In summary, my doctoral study proposes new computational methods to compare genomic DNA sequences as information contents, to model pathway-based cancer subtype classifications and regulations, and to identify condition-specific sub-modules among multiple cancer subtypes.
Chapter 1 Introduction
1.1 Biological questions with genetic information
1.1.1 Biological Sequences
1.1.2 Gene expression
1.2 Formulating computational problems for the biological questions
1.2.1 Decoding biological sequences by k-mer vectors
1.2.2 Interpretation of complex relationships between genes
1.3 Three computational problems for the biological questions
1.4 Outline of the thesis
Chapter 2 Ranked k-spectrum kernel for comparative and evolutionary comparison of DNA sequences
2.1 Motivation
2.1.1 String kernel for sequence comparison
2.1.2 Approach: RKSS kernel
2.2 Methods
2.2.1 Mapping biological sequences to k-mer space: the k-spectrum string kernel
2.2.2 The ranked k-spectrum string kernel with a landmark
2.2.3 Single landmark-based reconstruction of phylogenetic tree
2.2.4 Multiple landmark-based distance comparison of exons, introns, CpG islands
2.2.5 Sequence Data for analysis
2.3 Results
2.3.1 Reconstruction of phylogenetic tree on the exons, introns, and CpG islands
2.3.2 Landmark space captures the characteristics of three genomic regions
2.3.3 Cross-evaluation of the landmark-based feature space
Chapter 3 Pathway-based cancer subtype classification and interpretation by attention mechanism and network propagation
3.1 Motivation
3.2 Methods
3.2.1 Encoding biological prior knowledge using Graph Convolutional Network
3.2.2 Re-producing comprehensive biological process by Multi-Attention based Ensemble
3.2.3 Linking pathways and transcription factors by network propagation with permutation-based normalization
3.3 Results
3.3.1 Pathway database and cancer data set
3.3.2 Evaluation of individual GCN pathway models
3.3.3 Performance of ensemble of GCN pathway models with multi-attention
3.3.4 Identification of TFs as regulator of pathways and GO term analysis of TF target genes
Chapter 4 Detecting sub-modules in biological networks with gene expression by statistical approach and graph convolutional network
4.1 Motivation
4.1.1 Pathway based analysis of transcriptome data
4.1.2 Challenges and Summary of Approach
4.2 Methods
4.2.1 Convert single KEGG pathway to directed graph
4.2.2 Calculate edge activity for each sample
4.2.3 Mining differentially activated subpath among classes
4.2.4 Prioritizing subpaths by the permutation test
4.2.5 Extension: graph convolutional network and class activation map
4.3 Results
4.3.1 Identifying 36 subtype specific subpaths in breast cancer
4.3.2 Subpath activities have a good discrimination power for cancer subtype classification
4.3.3 Subpath activities have a good prognostic power for survival outcomes
4.3.4 Comparison with an existing tool, PATHOME
4.3.5 Extension: detection of subnetwork on PPI network
Chapter 5 Conclusions
Abstract in Korean
Docto
RasBhari: optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison
Many algorithms for sequence analysis rely on word matching or word
statistics. Often, these approaches can be improved if binary patterns
representing match and don't-care positions are used as a filter, such that
only those positions of words are considered that correspond to the match
positions of the patterns. The performance of these approaches, however,
depends on the underlying patterns. Herein, we show that the overlap complexity
of a pattern set that was introduced by Ilie and Ilie is closely related to the
variance of the number of matches between two evolutionarily related sequences
with respect to this pattern set. We propose a modified hill-climbing algorithm
to optimize pattern sets for database searching, read mapping and
alignment-free sequence comparison of nucleic-acid sequences; our
implementation of this algorithm is called rasbhari. Depending on the
application at hand, rasbhari can either minimize the overlap complexity of
pattern sets, maximize their sensitivity in database searching or minimize the
variance of the number of pattern-based matches in alignment-free sequence
comparison. We show that, for database searching, rasbhari generates pattern
sets with slightly higher sensitivity than existing approaches. In our Spaced
Words approach to alignment-free sequence comparison, pattern sets calculated
with rasbhari led to more accurate estimates of phylogenetic distances than the
randomly generated pattern sets that we previously used. Finally, we used
rasbhari to generate patterns for short read classification with CLARK-S. Here
too, the sensitivity of the results could be improved, compared to the default
patterns of the program. We integrated rasbhari into Spaced Words; the source
code of rasbhari is freely available at http://rasbhari.gobics.de
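The role of a binary pattern as a filter can be illustrated with a short Python sketch. It shows how spaced words are read off a sequence under a single pattern of match (`1`) and don't-care (`0`) positions, and how pattern-based matches between two sequences can be counted. It is not rasbhari code and performs no pattern optimization; the function names are hypothetical.

```python
from collections import Counter


def spaced_words(seq, pattern):
    """Extract the spaced word at every window of seq.

    pattern is a string of '1' (match position) and '0' (don't-care
    position); only characters at match positions are kept.
    """
    match_positions = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    return [
        "".join(seq[start + i] for i in match_positions)
        for start in range(len(seq) - span + 1)
    ]


def spaced_word_matches(a, b, pattern):
    """Count pairs of windows, one from each sequence, whose spaced words agree."""
    ca = Counter(spaced_words(a, pattern))
    cb = Counter(spaced_words(b, pattern))
    return sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
```

With the pattern `101`, the windows `ACG` and `ATG` yield the same spaced word `AG`, so a substitution at the don't-care position does not destroy the match. This tolerance is why the choice of pattern set affects sensitivity and the variance of the match count.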
Evolution of Strigamia centipedes (Chilopoda): a first molecular assessment of phylogeny and divergence times
We present a first phylogenetic and temporal framework, with biogeographical insights, for the centipedes of the genus Strigamia, which are widespread predators in the forest soils of the Northern Hemisphere and include the evo-devo model species Strigamia maritima. The phylogeny was estimated by different methods of maximum likelihood and Bayesian inference from sequences of two mitochondrial (16S, COI) and two nuclear (18S, 28S) genes, obtained from 16 species from all major areas of the global range of the genus and encompassing most of the overall morphological and ecological diversity. Divergence times were estimated after calibration against the fossil record of centipedes. We found that the major lineages of extant species of Strigamia most probably separated around 60 million years (Ma) ago. The two most diverse lineages diversified during the last 30 Ma and are today segregated geographically, one in Europe and the other in Eastern Asia. The latter region hosts a hitherto underestimated richness and anatomical diversity of species, including three previously unknown yet morphologically distinct species, which are described here as new: Strigamia inthanoni sp. n. from Thailand, Strigamia korsosi sp. n. from the Ryukyu Islands and Strigamia nana sp. n. from Taiwan. The northern European model species S. maritima is more closely related to the Eastern Asian lineage, from which it most probably separated around 35 Ma ago, before the major diversification of the latter.
HYMENOPTERAN MOLECULAR PHYLOGENETICS: FROM APOCRITA TO BRACONIDAE (ICHNEUMONOIDEA)
Two separate phylogenetic studies were performed for two different taxonomic levels within Hymenoptera. The first study examined the utility of expressed sequence tags (ESTs) for resolving relationships among hymenopteran superfamilies. Transcripts were assembled from 14,000 sequenced clones for 6 disparate hymenopteran taxa, averaging over 660 unique contigs per species. Orthology and gene determination were performed using modifications to a previously developed computerized pipeline and compared against annotated insect genomes. Sequences from additional taxa were added from public databases, giving a final dataset of 24 genes for 16 taxa.
The concatenated dataset recovered a robust and well-supported topology; however, there was extreme incongruity among individual gene trees. Analyses of sequences indicated strong compositional and transition biases, particularly in the third codon positions. The use of filtered supernetworks aided visualization of the congruent phylogenetic signal that existed across the individual gene trees. Additionally, treeness triangle plots indicated a strong residual signal in several gene trees and across codon positions in the concatenated dataset. However, most analyses of the concatenated dataset recovered expected relationships, known from other independent analyses. Thus, ESTs provide a powerful source of information for phylogenetic analysis, but results are sensitive to low taxonomic sampling and missing data.
The second study examined subfamilial relationships within the parasitoid family Braconidae, using over 4 kb of sequence data for 139 taxa. Bayesian inference on the concatenated dataset recovered a robust phylogeny, particularly for early divergences within the family. There was strong evidence supporting two independent lineages within the family: one leading to the noncyclostomes and one leading to the cyclostomes. Ancestral state reconstructions were performed to test the theory of ectoparasitism as the ancestral condition for all taxa within the family. Results indicated an endoparasitic ancestor for the family and for the noncyclostome lineage, with an early transition to ectoparasitism for the cyclostome lineage. However, reconstructions of some nodes were sensitive to outgroup coding and may change as biological knowledge increases.
Computational strategies for dissecting the high-dimensional complexity of adaptive immune repertoires
The adaptive immune system recognizes antigens via an immense array of
antigen-binding antibodies and T-cell receptors, the immune repertoire. The
interrogation of immune repertoires is of high relevance for understanding the
adaptive immune response in disease and infection (e.g., autoimmunity, cancer,
HIV). Adaptive immune receptor repertoire sequencing (AIRR-seq) has driven the
quantitative and molecular-level profiling of immune repertoires thereby
revealing the high-dimensional complexity of the immune receptor sequence
landscape. Several methods for the computational and statistical analysis of
large-scale AIRR-seq data have been developed to resolve immune repertoire
complexity in order to understand the dynamics of adaptive immunity. Here, we
review the current research on (i) diversity, (ii) clustering and network,
(iii) phylogenetic and (iv) machine learning methods applied to dissect,
quantify and compare the architecture, evolution, and specificity of immune
repertoires. We summarize outstanding questions in computational immunology and
propose future directions for systems immunology towards coupling AIRR-seq with
the computational discovery of immunotherapeutics, vaccines, and
immunodiagnostics.
Comment: 27 pages, 2 figures
ViFi: accurate detection of viral integration and mRNA fusion reveals indiscriminate and unregulated transcription in proximal genomic regions in cervical cancer.
The integration of viral sequences into the host genome is an important driver of tumorigenesis in many virally mediated cancers, notably cervical cancer and hepatocellular carcinoma. We present ViFi, a computational method that combines phylogenetic methods with reference-based read mapping to detect viral integrations. In contrast with read-based reference mapping approaches, ViFi is faster and shows high precision and sensitivity on both simulated and biological data, even when the integrated virus is a novel strain or highly mutated. We applied ViFi to matched genomic and mRNA data from 68 cervical cancer samples from TCGA and found high concordance between the two. Surprisingly, viral integration resulted in a dramatic transcriptional upregulation of all proximal elements, including LINEs and LTRs that are not normally transcribed. This upregulation is highly correlated with the presence of a viral gene fused with a downstream human element. Moreover, genomic rearrangements suggest the formation of apparent circular extrachromosomal (ecDNA) human-viral structures. Our results suggest the presence of apparent small circular fusion viral/human ecDNA, which correlates with indiscriminate and unregulated expression of proximal genomic elements, potentially contributing to the pathogenesis of HPV-associated cervical cancers. ViFi is available at https://github.com/namphuon/ViFi