692 research outputs found

    Identification of new genes in human chromosome 3 contig 7 by graphical representation technique

    The rapidly growing library of genomic-length sequences and the working draft of the human genome sequence create a concomitant need for new methods of analysing sequences to rapidly identify new genes and their functions. We have developed a graphical technique for quick determination of probable coding regions in DNA sequences. In this article we apply this technique to the new sequence data from human chromosome 3 contig 7 to test the efficacy of the proposed system in a live case, and we compare the results with those from other genomic sequences. We report a sampling of sequence segments that pass theoretical tests of likelihood of being genes and list several that have close homology with sequences from the expressed sequence tag (EST) databases. We also comment on the possible use of the graphical representation technique in the shotgun method of gene sequencing.
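
The abstract does not describe the graphical technique itself. As a rough illustration only, one common family of 2D graphical representations turns a DNA sequence into a walk on a Cartesian grid, one unit step per base. The axis assignment below (A: -x, G: +x, C: +y, T: -y) and the function name `dna_walk` are assumptions made for this sketch, not details taken from the abstract.

```python
from typing import List, Tuple

# Assumed axis assignment for illustration: one unit step per base.
STEPS = {"A": (-1, 0), "G": (1, 0), "C": (0, 1), "T": (0, -1)}


def dna_walk(seq: str) -> List[Tuple[int, int]]:
    """Return the (x, y) points visited by the walk, starting at the origin."""
    x, y = 0, 0
    path = [(x, y)]
    for base in seq.upper():
        if base not in STEPS:
            continue  # skip ambiguous symbols such as N
        dx, dy = STEPS[base]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(dna_walk("GATTACA"))
```

Plotting the returned points gives a curve whose shape reflects local base composition, which is the kind of visual cue such techniques rely on.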

    Human Promoter Prediction Using DNA Numerical Representation

    With the emergence of genomic signal processing, numerical representation techniques for the DNA alphabet set {A, G, C, T} play a key role in applying digital signal processing and machine learning techniques to the processing and analysis of DNA sequences. The choice of numerical representation affects how well the biological properties of a DNA sequence are reflected in the numerical domain for detecting and identifying the characteristics of special regions of interest within the sequence. This dissertation presents a comprehensive study of various DNA numerical and graphical representation methods and their applications in processing and analyzing long DNA sequences. Discussions of the relative merits and demerits of the various methods, experimental results and possible future developments are also included. Another focus of the research is promoter prediction in human (Homo sapiens) DNA sequences with a neural-network-based multi-classifier system using DNA numerical representation methods. In spite of the recent development of several computational methods for human promoter prediction, there is a need for performance improvement. In particular, the high false positive rate of the feature-based approaches decreases prediction reliability and leads to erroneous results in gene annotation. To improve prediction accuracy and reliability, DigiPromPred, a numerical-representation-based promoter prediction system, is proposed to characterize DNA alphabets in different regions of a DNA sequence. The DigiPromPred system is found to predict promoters with a sensitivity of 90.8% while reducing the false prediction rate for non-promoter sequences, with a specificity of 90.4%. The comparative study with state-of-the-art promoter prediction systems for human chromosome 22 shows that the proposed system maintains a good balance between prediction accuracy and reliability.
To reduce architectural and computational complexity compared to the existing system, a simple feed-forward neural network classifier known as SDigiPromPred is proposed. The SDigiPromPred system is found to predict promoters with sensitivities of 87%, 87% and 99% while reducing the false prediction rate for non-promoter sequences, with specificities of 92%, 94% and 99%, for human, Drosophila and Arabidopsis sequences respectively, and with reconfigurable capability compared to the existing system.
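
The abstract surveys numerical representations without naming a specific one. As a hedged illustration, the sketch below shows the binary indicator (Voss-style) encoding, a widely used scheme in genomic signal processing that maps a DNA string to four 0/1 sequences, one per base. Whether DigiPromPred uses this particular encoding is not stated in the abstract; the function name `indicator_sequences` is hypothetical.

```python
from typing import Dict, List

ALPHABET = "AGCT"  # order follows the abstract's alphabet set {A, G, C, T}


def indicator_sequences(seq: str) -> Dict[str, List[int]]:
    """Map a DNA string to four binary indicator sequences, one per base.

    Position i of the sequence for base b is 1 if seq[i] == b, else 0.
    """
    seq = seq.upper()
    return {base: [1 if s == base else 0 for s in seq] for base in ALPHABET}


if __name__ == "__main__":
    for base, bits in indicator_sequences("GATTACA").items():
        print(base, bits)
```

Each indicator sequence is an ordinary numeric signal, so standard tools such as Fourier transforms or neural network inputs can be applied to it directly.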

    Predicting Proteome-Early Drug Induced Cardiac Toxicity Relationships (Pro-EDICToRs) with Node Overlapping Parameters (NOPs) of a new class of Blood Mass-Spectra graphs

    The 11th International Electronic Conference on Synthetic Organic Chemistry, session Computational Chemistry. Blood Serum Proteome-Mass Spectra (SP-MS) may allow detection of Proteome-Early Drug Induced Cardiac Toxicity Relationships (called here Pro-EDICToRs). However, because the SP contains thousands of proteins, identifying general Pro-EDICToRs patterns instead of a single protein marker may represent a more realistic alternative. To this end, we first introduced a novel Cartesian 2D spectrum graph for SP-MS. Next, we introduced the graph node-overlapping parameters (nopk) to numerically characterize SP-MS, using them as inputs to seek a Quantitative Proteome-Toxicity Relationship (QPTR) classifier for Pro-EDICToRs with accuracy higher than 80%. Principal Component Analysis (PCA) on the nopk values present in the QPTR model explains 82.7% of the variance with one factor (F1). These nopk values were then used to construct, for the first time, a Pro-EDICToRs Complex Network with nodes (samples) linked by edges (similarity between two samples). We compared the topology of two sub-networks (cardiac toxicity and control samples), finding extreme relative differences for the re-linking (P) and Zagreb (M2) indices (9.5% and 54.2%, respectively) out of 11 parameters. We also compared the sub-networks with well-known ideal random networks, including the Barabasi-Albert, Kleinberg Small World, Erdos-Renyi, and Epsstein Power Law models. Finally, we proposed Partial Order (PO) schemes of the 115 samples based on LDA probabilities, F1 scores and/or network node degrees. PCA-CN and LDA-PCA based POs with Tanimoto’s coefficients equal to or higher than 0.75 are promising for the study of Pro-EDICToRs. These results show that simple QPTR models based on MS graph numerical parameters are an interesting tool for proteome research. The authors thank projects funded by the Xunta de Galicia (PXIB20304PR and BTF20302PR) and the Ministerio de Sanidad y Consumo (PI061457). González-Díaz H. acknowledges a tenure-track research position funded by the Program Isidro Parga Pondal, Xunta de Galici

    Machine learning techniques for decoding information in RNA interactions and DNA sequences

    Doctoral thesis (Ph.D.), Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, February 2020. 김선 (Kim Sun). Phenotypic differences among organisms arise mainly from differences in genetic information. As genetic information changes, an organism may evolve into a different species, and patients with the same disease may have different prognoses. This important biological information can be measured as various kinds of omics data using high-throughput technologies such as sequencing instruments. However, interpreting omics data is challenging because the data have very high dimensionality but relatively few samples. Typically, the number of dimensions exceeds the number of samples, which makes the interpretation of omics data one of the most challenging machine learning problems. My doctoral study aims to develop new bioinformatics methods for decoding information in these high-dimensional data using machine learning algorithms. The first study analyzes the difference in the amount of information between different regions of the DNA sequence. To this end, a ranked k-spectrum string kernel, the RKSS kernel, is developed for comparative and evolutionary comparison of sequences from various genomic regions among multiple species. The RKSS kernel extends the existing k-spectrum string kernel by using rank information of k-mers and landmarks of k-mers that represent a species. The number of k-mers grows rapidly with k, but a landmark consists of only a small number of k-mers, so using it as a reference point for comparison dramatically reduces the number of k-mers needed to calculate sequence similarities. In experiments on three different genomic regions, the RKSS kernel captured more reliable distances between species according to the genetic information content of the target region. The RKSS kernel was also able to order the regions by information content in a way that agrees with established biological knowledge.
The second study aims to efficiently decode complex genetic interactions using biological networks and then to classify cancer subtypes by interpreting biological functions. To this end, a pathway-based deep learning model for cancer subtype classification is developed, using a graph convolutional network and a multi-attention based ensemble (GCN+MAE). To handle the complex relationships between genes efficiently using pathway information, GCN+MAE is designed as an explainable deep learning structure built on a graph convolutional network and an attention mechanism. The extracted pathway-level information on cancer subtypes is then carried back to the gene level by network propagation. In experiments on five cancer data sets, GCN+MAE showed better cancer subtype classification performance than existing models and captured subtype-specific pathways and their biological functions. The third study identifies sub-networks of a biological pathway. The goal is to dissect a biological pathway into multiple sub-networks, each of which is a single functional unit. To this end, a condition-specific sub-module detection method for biological networks, MIDAS (MIning Differentially Activated Subpaths), is developed. Edge activities in the pathway are measured from gene expression and network topology. Using these activities, differentially activated subpaths are identified by a statistical approach. In addition, by extending this idea to a graph convolutional network, different sub-networks are highlighted by attention mechanisms. In the experiment with breast cancer data, MIDAS and the deep learning model successfully decomposed gene-level features into sub-modules of single functions.
In summary, my doctoral study proposes new computational methods to compare genomic DNA sequences as information contents, to model pathway-based cancer subtype classifications and regulations, and to identify condition-specific sub-modules among multiple cancer subtypes.
Contents:
Chapter 1 Introduction
  1.1 Biological questions with genetic information
    1.1.1 Biological Sequences
    1.1.2 Gene expression
  1.2 Formulating computational problems for the biological questions
    1.2.1 Decoding biological sequences by k-mer vectors
    1.2.2 Interpretation of complex relationships between genes
  1.3 Three computational problems for the biological questions
  1.4 Outline of the thesis
Chapter 2 Ranked k-spectrum kernel for comparative and evolutionary comparison of DNA sequences
  2.1 Motivation
    2.1.1 String kernel for sequence comparison
    2.1.2 Approach: RKSS kernel
  2.2 Methods
    2.2.1 Mapping biological sequences to k-mer space: the k-spectrum string kernel
    2.2.2 The ranked k-spectrum string kernel with a landmark
    2.2.3 Single landmark-based reconstruction of phylogenetic tree
    2.2.4 Multiple landmark-based distance comparison of exons, introns, CpG islands
    2.2.5 Sequence Data for analysis
  2.3 Results
    2.3.1 Reconstruction of phylogenetic tree on the exons, introns, and CpG islands
    2.3.2 Landmark space captures the characteristics of three genomic regions
    2.3.3 Cross-evaluation of the landmark-based feature space
Chapter 3 Pathway-based cancer subtype classification and interpretation by attention mechanism and network propagation
  3.1 Motivation
  3.2 Methods
    3.2.1 Encoding biological prior knowledge using Graph Convolutional Network
    3.2.2 Re-producing comprehensive biological process by Multi-Attention based Ensemble
    3.2.3 Linking pathways and transcription factors by network propagation with permutation-based normalization
  3.3 Results
    3.3.1 Pathway database and cancer data set
    3.3.2 Evaluation of individual GCN pathway models
    3.3.3 Performance of ensemble of GCN pathway models with multi-attention
    3.3.4 Identification of TFs as regulator of pathways and GO term analysis of TF target genes
Chapter 4 Detecting sub-modules in biological networks with gene expression by statistical approach and graph convolutional network
  4.1 Motivation
    4.1.1 Pathway based analysis of transcriptome data
    4.1.2 Challenges and Summary of Approach
  4.2 Methods
    4.2.1 Convert single KEGG pathway to directed graph
    4.2.2 Calculate edge activity for each sample
    4.2.3 Mining differentially activated subpath among classes
    4.2.4 Prioritizing subpaths by the permutation test
    4.2.5 Extension: graph convolutional network and class activation map
  4.3 Results
    4.3.1 Identifying 36 subtype specific subpaths in breast cancer
    4.3.2 Subpath activities have a good discrimination power for cancer subtype classification
    4.3.3 Subpath activities have a good prognostic power for survival outcomes
    4.3.4 Comparison with an existing tool, PATHOME
    4.3.5 Extension: detection of subnetwork on PPI network
Chapter 5 Conclusions
Abstract in Korean
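
The thesis abstract builds on the k-spectrum string kernel, which compares two sequences through their counts of length-k substrings (k-mers). The following is a minimal sketch of that baseline kernel only; the rank information and landmark mechanism of the RKSS kernel are not specified in enough detail in the abstract to reproduce, so they are not shown. The function names `kmer_counts` and `spectrum_kernel` are illustrative choices, not names from the thesis.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k (the k-spectrum of seq)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s: str, t: str, k: int) -> int:
    """Inner product of the k-mer count vectors of s and t."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # Only k-mers present in s can contribute; a Counter returns 0 for missing keys.
    return sum(n * ct[kmer] for kmer, n in cs.items())


if __name__ == "__main__":
    print(spectrum_kernel("ACGTACGT", "ACGTTTTT", 3))
```

Because there are 4^k possible DNA k-mers, the feature space grows quickly with k, which is the cost the abstract says a small landmark set is meant to reduce.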

    MECHANISM AND FUNCTION OF SPLICEOSOMAL CLEAVAGE IN FISSION YEAST

    Telomerase is the ribonucleoprotein complex that replenishes lost DNA sequences at the ends of chromosomes. At its core, telomerase consists of an RNA subunit (TERC) that provides the template and a catalytic protein component (TERT). Insufficient telomerase activity leads to various disorders such as dyskeratosis congenita, aplastic anemia and idiopathic pulmonary fibrosis. How different mutations in the same gene lead to disparate symptoms and disorders is not clear. The overall objective of my project is to understand the biogenesis of telomerase in the genetically tractable eukaryote S. pombe, whose telomere maintenance machinery closely resembles that of humans. Our laboratory has previously shown that the mature 3' end of S. pombe telomerase RNA (TER1) is generated by the first step of spliceosomal splicing. The cis- and trans-acting factors that distinguish the single-step spliceosomal cleavage in TER1 from the two-step splicing reaction that removes introns in other genes are being investigated. We now demonstrate that a strong branch site (BS), a long distance to the 3' splice site (SS) and a weak polypyrimidine (Py) tract act synergistically to attenuate the transition from the first to the second step of splicing. The observation that a strong BS antagonizes the second step of splicing in the context of TER1 suggests that the BS-U2 snRNA interactions are disrupted after the first step and thus earlier than previously thought. The slow transition from the first to the second step triggers the Prp22 DExD/H-box helicase-dependent rejection of the cleaved products and Prp43-dependent discard of the splicing intermediates. Related to this work, we have established that the spliceosome generates the 3' ends of telomerase RNA in S. cryophilus and S. octosporus, albeit via a different mechanism involving U6 snRNA hyperstabilization at the 5'ss.
Our findings explain how the spliceosome can function in 3' end processing and provide new insights into the mechanism of splicing.

    Genome wide mining of alternative splicing in metazoan model organisms

    Doctoral thesis, Biomedical Sciences (Morphological Sciences), Universidade de Lisboa, Faculdade de Medicina, 2009. Background: Mining current mRNA and EST databases for novel alternatively spliced isoforms is of paramount importance for shedding light on the way in which the maturation of RNA is used to regulate gene expression. Preliminary observations revealed a tendency for greater amounts of potentially non-protein-coding alternative transcripts in human genes than in orthologous genes from other organisms. However, many of these isoforms did not appear in recently published alternative splicing databases on account of constraints imposed in the selection of transcripts. This prompted us to develop a less constrained database with the aim of contributing to the identification of the full repertoire of splice variants in the transcriptome of different organisms. Given that mechanisms of control of gene expression involving non-protein-coding splice variants have been described in a variety of genes, this information may be crucial to deciphering more intricate layers of gene regulation in complex organisms brought about by alternative splicing. Description: An algorithm was developed to cluster mRNA and EST BLAT alignments to annotated gene regions. Consensus splice sites were the main requirement imposed on the selection of transcripts. The method was applied to thirteen model organisms. The alternative splicing information generated has been incorporated into a database with clear graphical displays representing the splicing patterns and is available from the ExonMine website (http://www.imm.fm.ul.pt/exonmine). It incorporates information on constitutive exons, poly-A signals, open reading frames and translation, expression specificity of any exon or splicing pattern relative to the biological source of the mRNA/EST, and alternative splicing events with their respective exon and junction sequences for microarray probe design.
The ExonMine interface also provides several tools to support laboratory validation of splicing patterns. Conclusions: ExonMine detects a higher percentage of spliced genes and isoforms than currently available alternative splicing databases. The analysis reveals a marked increase, in complex organisms, of splice variants with either retained introns or novel exons with no apparent protein-coding potential. About 18% of unannotated exons detected in ExonMine were found to be expressed in primary human cells using tiling arrays. Validation of some of these results for the U2AF family of splicing factors was successfully performed in collaboration with members of the lab, revealing primate-specific transcripts and an alternatively spliced transcript carrying a microRNA. The database was also successfully used for genome-wide analysis of sequence elements involved in the regulation of alternative splicing and for custom alternative splicing microarray design. Matching of ExonMine data to a commercial exon microarray platform covering the majority of human exons was also performed and will assist in large-scale analysis of alternative splicing data. The algorithm is also easy to update, taking only 48 hours to generate data for the whole human genome and far less time for less complex organisms. In conclusion, ExonMine represents a useful new resource for future research on alternative splicing and gene regulation. Funding: Muscular Dystrophy Association (MDA3662), European Commission (LSHG-CT-2005-518238, EURASNET) and Fundação para a Ciência e Tecnologia, Portugal (PTDC/SAU-GMG/69739/2006).

    Gene identification using phylogenetic metrics with conditional random fields

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. Includes bibliographical references (p. 69-72). While the complete sequence of the human genome contains all the information necessary for encoding a complete human being, its interpretation remains a major challenge of modern biology. The first step in any genomic analysis is a comprehensive and accurate annotation of all genes encoded in the genome, providing the basis for understanding human variation, gene regulation, health and disease. Traditionally, the problem of computational gene prediction has been addressed using graphical probabilistic models of genomic sequence. While such models have been successful for small genomes with relatively simple gene structure, new methods are necessary for scaling these to the complete human genome, and for leveraging information across multiple mammalian species currently being sequenced. While generative models like hidden Markov models (HMMs) face the difficulty of modeling both coding and non-coding regions across a complete genome, discriminative models such as Conditional Random Fields (CRFs) have recently emerged, which focus specifically on the discrimination problem of gene identification and can therefore be more powerful. One of the most attractive characteristics of these models is that their general framework also allows the incorporation of any number of independently derived feature functions (metrics), which can increase discriminatory power. While most of the work on CRFs for gene finding has been on model construction and training, there has not been much focus on the metrics used in such discriminatory frameworks. This is particularly important with the availability of rich comparative genome data, enabling the development of phylogenetic gene identification metrics which can maximally use alignments of a large number of genomes.
In this work I address the question of gene identification using multiple related genomes. I first present novel comparative metrics for gene classification that show considerable improvement over existing work and also scale well with an increase in the number of aligned genomes. Second, I describe a general methodology for extending pair-wise metrics to alignments of multiple genomes that incorporates the evolutionary phylogenetic relationship between informant species. Third, I evaluate various methods of combining metrics that exploit metric independence and result in superior classification. Finally, I incorporate the metrics into a Conditional Random Field gene model to perform unrestricted de novo gene prediction on 12-species alignments of the D. melanogaster genome, and demonstrate accuracy rivaling that of state-of-the-art gene prediction systems. By Ameya Nitin Deoras. S.M.

    BOOL-AN: A method for comparative sequence analysis and phylogenetic reconstruction

    A novel discrete mathematical approach is proposed as an additional tool for molecular systematics that does not require prior statistical assumptions concerning the evolutionary process. The method is based on algorithms that generate mathematical representations directly from DNA/RNA or protein sequences, followed by the output of numerical (scalar or vector) and visual (graph) characteristics. The binary-encoded sequence information is transformed into a compact analytical form, called the Iterative Canonical Form (ICF) of Boolean functions, which can then be used as a generalized molecular descriptor. The method provides raw vector data for calculating different distance matrices, which in turn can be analyzed by neighbor-joining or UPGMA to derive a phylogenetic tree, or by principal coordinates analysis to obtain an ordination scattergram. The new method and the associated software for inferring phylogenetic trees are called Boolean analysis, or BOOL-AN.
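
The step from descriptor vectors to a distance matrix can be sketched as follows. This is a minimal illustration, assuming each sequence has already been reduced to a fixed-length numeric vector; the ICF computation itself is not described in the abstract and is not reproduced here. The Euclidean metric, the function name `distance_matrix` and the example vectors are assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Sequence


def distance_matrix(vectors: Dict[str, Sequence[float]]) -> List[List[float]]:
    """Pairwise Euclidean distances between named descriptor vectors.

    Rows and columns follow the insertion order of the dictionary keys.
    """
    names = list(vectors)
    return [[math.dist(vectors[a], vectors[b]) for b in names] for a in names]


if __name__ == "__main__":
    # Hypothetical descriptor vectors for three taxa.
    example = {"taxon1": [0.0, 0.0], "taxon2": [3.0, 4.0], "taxon3": [0.0, 1.0]}
    for row in distance_matrix(example):
        print(row)
```

A symmetric matrix of this kind is the usual input to clustering methods such as UPGMA or neighbor-joining.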

    Features generated for computational splice-site prediction correspond to functional elements

    Background: Accurate selection of splice sites during the splicing of precursors to messenger RNA requires both relatively well-characterized signals at the splice sites and auxiliary signals in the adjacent exons and introns. We previously described a feature generation algorithm (FGA) that is capable of achieving high classification accuracy on human 3' splice sites. In this paper, we extend the splice-site prediction to 5' splice sites and explore the generated features for biologically meaningful splicing signals. Results: We present examples from the observed features that correspond to known signals, both core signals (including the branch site and pyrimidine tract) and auxiliary signals (including GGG triplets and exon splicing enhancers). We present evidence that features identified by FGA include splicing signals not found by other methods. Conclusion: Our generated features capture known biological signals in the expected sequence interval flanking splice sites. The method can be easily applied to other species and to similar classification problems, such as tissue-specific regulatory elements, polyadenylation sites, promoters, etc.