724 research outputs found

    Novel algorithms for protein sequence analysis

    Each protein is characterized by its unique sequential order of amino acids, the so-called protein sequence. Biology's paradigm is that this order of amino acids determines the protein's architecture and function. In this thesis, we introduce novel algorithms to analyze protein sequences. Chapter 1 begins with an introduction to amino acids, proteins and protein families. Then fundamental techniques from computer science related to the thesis are briefly described. Making a multiple sequence alignment (MSA) and constructing a phylogenetic tree are traditional means of sequence analysis. Information entropy, feature selection and sequential pattern mining, all drawn from computer science, provide alternative ways to analyze protein sequences. In Chapter 2, information entropy was used to measure the conservation at a given position of the alignment. From an alignment grouped into subfamilies, two types of information entropy values are calculated for each position in the MSA. One is the average entropy for a given position among the subfamilies; the other is the entropy for the same position in the entire multiple sequence alignment. This so-called two-entropies analysis, or TEA for short, yields a scatter plot in which all positions are represented with their two entropy values as x- and y-coordinates. The different locations of the positions (or dots) in the scatter plot are indicative of various conservation patterns and may suggest different biological functions. The globally conserved positions show up in the lower left corner of the graph, which suggests that these positions may be essential for the folding or for the main functions of the protein superfamily. In contrast, the positions conserved neither between subfamilies nor within each individual subfamily appear in the upper right corner. The positions conserved within each subfamily but divergent among subfamilies are in the upper left corner. 
They may participate in biological functions that divide subfamilies, such as recognition of an endogenous ligand in G protein-coupled receptors. The TEA method requires a definition of protein subfamilies as input. However, such a definition is a challenging problem in itself, particularly because it is crucial for the subsequent prediction of specificity positions. In Chapter 3, we automated the TEA method described in Chapter 2 by tracing the evolutionary pressure from the root to the branches of the phylogenetic tree. At each level of the tree, a TEA plot is produced to capture the signal of the evolutionary pressure. A consensus TEA-O plot is composed from the whole series of plots to provide a condensed representation. Positions related to functions that evolved early (conserved) or later (specificity) are close to the lower left or upper left corner of the TEA-O plot, respectively. This novel approach allows an unbiased, user-independent analysis of residue relevance in a protein family. We tested the TEA-O method on a synthetic dataset as well as on "real" data, i.e., LacI and GPCR datasets. The ROC plots for the real data showed that TEA-O works perfectly well on all datasets and much better than other considered methods such as evolutionary trace, SDPpred and TreeDet. While positions were treated independently of each other in Chapters 2 and 3 when predicting specificity positions, in Chapter 4 multi-RELIEF considers both sequence similarity and distance in 3D structure in the specificity scoring function. The multi-RELIEF method was developed based on RELIEF, a state-of-the-art machine-learning technique for feature weighting. It estimates the expected "local" functional specificity of residues from an alignment divided into multiple classes. Optionally, 3D structure information is exploited by increasing the weight of residues that have high-weight neighbors. 
Using ROC curves over a large body of experimental reference data, we showed that multi-RELIEF identifies specificity residues for the seven test sets used. In addition, incorporating structural information improved the prediction for specificity of interaction with small molecules. Comparison of multi-RELIEF with four other state-of-the-art algorithms indicates its robustness and best overall performance. In Chapters 2, 3 and 4, we relied heavily on multiple sequence alignment to identify conserved and specificity positions. As mentioned before, the construction of such an alignment is not self-evident. Following the principle of sequential pattern mining, in Chapter 5 we proposed a new algorithm that directly identifies frequent, biologically meaningful patterns from unaligned sequences. Six algorithms were designed and implemented to mine three different pattern types from either one or two datasets using a pattern growth approach. We compared our approach to PRATT2 and TEIRESIAS in efficiency, completeness and the diversity of pattern types. Compared to PRATT2, our approach is faster, capable of processing large datasets and able to identify the so-called type III patterns. Our approach is comparable to TEIRESIAS in the discovery of the so-called type I patterns but has additional functionality, such as mining the so-called type II and type III patterns and finding discriminating patterns between two datasets. In Chapters 2 to 5, we aimed to identify functional residues from either aligned or unaligned protein sequences. In Chapter 6, we introduce an alignment-independent procedure to cluster protein sequences, which may be used to predict protein function. Traditionally, phylogeny reconstruction is based on multiple sequence alignment. The procedure can be computationally intensive and often requires manual adjustment, which may be particularly difficult for a set of deviating sequences. 
In cheminformatics, constructing a similarity tree of ligands is usually alignment-free. Feature spaces are a routine means of converting compounds into binary fingerprints. Distances among compounds can then be obtained, and similarity trees are constructed via clustering techniques. We explored building feature spaces for phylogeny reconstruction either using the so-called k-mer method or via sequential pattern mining with additional filtering and combining operations. Both approaches produced satisfactory trees when compared with alignment-based methods. We found that when k equals 3, the phylogenetic tree built from the k-mer fingerprints is as good as that from one of the alignment-based methods, in which PAM and neighbor joining are used for computing distances and constructing the tree, respectively (NJ-PAM). As for the sequential pattern mining approach, the quality of the phylogenetic tree is better than that of one of the alignment-based methods (NJ-PAM) if we set the support value to 10% and use maximum patterns only as descriptors. Finally, in Chapter 7, general conclusions about the research described in this thesis are drawn. They are supplemented with an outlook on further lines of research. We are convinced that the described algorithms can be useful in, e.g., genomic analyses, and provide further ideas for novel algorithms in this respect. Funding: Leiden University, NWO (Horizon Breakthrough project 050-71-041) and the Dutch Top Institute Pharma (D1-105).
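
The two-entropies analysis (TEA) summarized in this abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the use of unnormalized Shannon entropy in bits, and the treatment of gap characters as ordinary symbols are all assumptions made for illustration. For each alignment column it returns the average within-subfamily entropy (x) and the entropy over the whole alignment (y), matching the corner interpretation given in the abstract.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy (bits) of the symbols in one alignment column."""
    total = len(column)
    counts = Counter(column)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def two_entropies(alignment, subfamily_of):
    """Return one (mean_subfamily_entropy, global_entropy) pair per column.

    alignment    -- dict: sequence id -> aligned sequence (all equal length)
    subfamily_of -- dict: sequence id -> subfamily label (hypothetical input)
    """
    ids = list(alignment)
    length = len(alignment[ids[0]])

    # Group sequence ids by their subfamily label.
    groups = {}
    for seq_id in ids:
        groups.setdefault(subfamily_of[seq_id], []).append(seq_id)

    points = []
    for pos in range(length):
        global_h = shannon_entropy([alignment[i][pos] for i in ids])
        sub_h = [
            shannon_entropy([alignment[i][pos] for i in members])
            for members in groups.values()
        ]
        points.append((sum(sub_h) / len(sub_h), global_h))
    return points


# Toy alignment: two subfamilies of two sequences each.
toy_alignment = {"s1": "AKL", "s2": "AKV", "s3": "ARI", "s4": "ARM"}
toy_subfamilies = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
toy_points = two_entropies(toy_alignment, toy_subfamilies)
```

In the toy alignment, the first column is globally conserved (lower left), the second is conserved within each subfamily but differs between them (upper left), and the third varies everywhere (toward the upper right).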

    Multi-Harmony: detecting functional specificity from sequence alignment

    Many protein families contain sub-families with functional specialization, such as binding different ligands or being involved in different protein–protein interactions. A small number of amino acids generally determine functional specificity. The identification of these residues can aid the understanding of protein function and help in finding targets for experimental analysis. Here, we present multi-Harmony, an interactive web server for detecting sub-type-specific sites in proteins starting from a multiple sequence alignment. Combining our Sequence Harmony (SH) and multi-Relief (mR) methods in one web server allows simultaneous analysis and comparison of specificity residues; furthermore, both methods have been significantly improved and extended. SH has been extended to cope with more than two sub-groups. mR has been changed from a sampling implementation to a deterministic one, making it more consistent and user-friendly. For both methods, Z-scores are reported. The multi-Harmony web server produces a dynamic output page, which includes interactive connections to the Jalview and Jmol applets, thereby allowing interactive analysis of the results. Multi-Harmony is available at http://www.ibi.vu.nl/programs/shmrwww

    An automated stochastic approach to the identification of the protein specificity determinants and functional subfamilies

    Background: Recent progress in sequencing and 3D structure determination techniques stimulated development of approaches aimed at more precise annotation of proteins, that is, prediction of exact specificity to a ligand or, more broadly, to a binding partner of any kind. Results: We present a method, SDPclust, for identification of protein functional subfamilies coupled with prediction of specificity-determining positions (SDPs). SDPclust predicts specificity in a phylogeny-independent stochastic manner, which allows for the correct identification of the specificity for proteins that are separated on a phylogenetic tree, but still bind the same ligand. SDPclust is implemented as a Web-server (http://bioinf.fbb.msu.ru/SDPfoxWeb/) and a stand-alone Java application available from the website. Conclusions: SDPclust performs a simultaneous identification of specificity determinants and specificity groups in a statistically robust and phylogeny-independent manner.

    Ensemble approach to predict specificity determinants: benchmarking and validation

    Background: It is extremely important and challenging to identify the sites that are responsible for functional specification or diversification in protein families. In this study, a rigorous comparative benchmarking protocol was employed to provide a reliable evaluation of methods that predict specificity-determining sites. Subsequently, the three best-performing methods were applied to identify new potential specificity-determining sites through an ensemble approach and common agreement of their prediction results. Results: It was shown that the analysis of structural characteristics of predicted specificity-determining sites might provide the means to validate their prediction accuracy. For example, we found that for smaller distances, the more reliable the prediction method is, the closer the predicted specificity-determining sites are to each other and to the ligand. Conclusion: We observed certain similarities of structural features between predicted and actual subsites, which might point to their functional relevance. We speculate that the majority of the identified potential specificity-determining sites might be indirectly involved in specific interactions and could be ideal targets for mutagenesis experiments.

    LncRNAs signature defining major subtypes of B-cell acute lymphoblastic leukemia

    Introduction: B-cell precursor acute lymphoblastic leukemia (BCP-ALL) is the most prevalent heterogeneous cancer in children and adults, with multiple subtypes. Emerging evidence suggests that long non-coding RNAs (lncRNAs) might play a key role in the development and progression of leukemia. Thus, we performed a transcriptional and DNA methylation survey to explore the lncRNA landscape of three BCP-ALL subtypes (82 samples) and demonstrated their functions and epigenetic profile. Methodology: The primary BCP-ALL samples from bone marrow material were collected at the diagnosis (ID) and relapse (REL) stages from adult (n = 21) and pediatric (n = 24) BCP-ALL patients and profiled using RNA-seq and DNA methylation array technology. The subtype-specific and relapse-specific lncRNAs were analyzed by differential expression (DE) analysis using LIMMA Voom. By analyzing the co-expression of the subtype-specific lncRNAs and protein-coding (PC) genes from all subtypes, we inferred potential functions of these lncRNAs by applying a "guilt-by-association" approach. Additionally, we validated our subtype-specific lncRNAs on an independent cohort of 47 BCP-ALL samples. The epigenetic regulation of subtype-specific lncRNAs was identified using the Bumphunter package. Correlation analysis was performed between differentially methylated (DM) and DE lncRNAs from the three subtypes to determine the epigenetically facilitated and silenced lncRNAs. Results: We present a comprehensive landscape of lncRNA signatures that classifies three molecular subtypes of BCP-ALL at the DNA methylation and RNA expression levels. Principal component analysis (PCA) of the most variable lncRNAs at the RNA and DNA methylation levels confirmed robust separation of the DUX4, Ph-like and NH-HeH BCP-ALL subtypes. Using integrative bioinformatics analysis, we determined 1564 subtype-specific and 941 relapse-specific lncRNAs from the three subtypes. 
Unsupervised hierarchical clustering on these subtype-specific lncRNAs validated their specificity in the independent validation cohort. For the first time, our study demonstrates that BCP-ALL subtype-specific as well as relapse-specific lncRNAs may contribute to the activation of key pathways, including the TGF-β, PI3K-Akt, mTOR and JAK-STAT signaling pathways, in the DUX4 and Ph-like subtypes. Finally, the significantly hyper-methylated and hypo-methylated subtype-specific lncRNAs were profiled. In addition, we identified 23 subtype-specific lncRNAs showing hypo- and hyper-methylation patterns in their promoter regions that significantly correlate with their diminished and increased expression in the respective subtypes. Conclusions: Overall, our work provides the most comprehensive analyses of lncRNAs in BCP-ALL subtypes. Our findings suggest a wide range of biological functions associated with lncRNAs and epigenetically facilitated lncRNAs in BCP-ALL and provide a foundation for functional investigations that could lead to novel therapeutic approaches.
German abstract (translated): Introduction: B-cell precursor acute lymphoblastic leukemia (BCP-ALL) is a heterogeneous cancer with several defined subgroups. New data suggest that long non-coding RNAs (lncRNAs) could play a key role in the development and progression of BCP-ALL. We therefore carried out a transcription and DNA methylation study to characterize the lncRNA landscape of three BCP-ALL subgroups (82 samples) and to analyze potential regulatory consequences. Methodology: Material was collected at initial diagnosis (ID) and at relapse (REL) from adult (n = 21) and pediatric (n = 24) BCP-ALL patients and examined using RNA-seq and DNA methylation array technologies. The subgroup-specific and relapse-specific lncRNAs were analyzed by differential expression (DE) analyses with LIMMA Voom. By analyzing the co-expression of lncRNAs with protein-coding (PC) genes from all subgroups, we inferred potential functions of the DE lncRNAs using a "guilt-by-association" approach. In addition, we validated the subgroup-specific lncRNAs on an independent dataset of 47 BCP-ALL samples. The epigenetic regulation of subgroup-specific lncRNAs was identified by a differential methylation (DM) analysis. The correlation between DM and DE lncRNAs from the three subgroups was determined in order to analyze the influence of epigenetic regulation on lncRNA expression. Results: We present a comprehensive landscape of lncRNA signatures that classifies three molecular subtypes of BCP-ALL at the DNA methylation and RNA expression levels. Principal component analysis (PCA) of the most variable lncRNAs at the RNA and DNA methylation levels confirmed a robust separation of the Ph-like, DUX4 and NH-HeH BCP-ALL subtypes. Integrative bioinformatic analysis yielded a total of 1564 subtype-specific and 941 relapse-specific lncRNAs from the three subtypes. Unsupervised hierarchical clustering on these subtype-specific lncRNAs validated their specificity in the independent validation cohort. Our study shows for the first time that BCP-ALL subtype-specific and relapse-specific lncRNAs may contribute to the activation of signaling pathways such as TGF-β, PI3K-Akt, mTOR and JAK-STAT in the DUX4 and Ph-like subtypes. Finally, the significantly DM subtype-specific lncRNAs were profiled. 
In addition, we identified 23 subtype-specific lncRNAs showing a hypo- and hypermethylation pattern in their promoter region that correlates significantly with their decreased and increased expression in the respective subtypes. Conclusions: Overall, our work provides the most comprehensive analyses of lncRNAs in BCP-ALL subtypes. Our results point to a wide range of biological functions associated with lncRNAs and epigenetically facilitated lncRNAs in BCP-ALL and provide a basis for functional investigations that could lead to new therapeutic approaches.

    TREMOR – a tool for retrieving transcriptional modules by incorporating motif covariance

    A transcriptional module (TM) is a collection of transcription factors (TFs) that, as a group, co-regulate multiple, functionally related genes. The task of identifying TMs poses an important biological challenge. Since TFs belong to evolutionarily and structurally related families, TF family members often bind to similar DNA motifs and can confound sequence-based approaches to TM identification. A previous approach to TM detection addresses this issue by pre-selecting a single representative from each TF family. One problem with this approach is that closely related transcription factors can still target sufficiently distinct genes in a biologically meaningful way, and thus pre-selecting a single family representative may in principle miss certain TMs. Here we report a method, TREMOR (Transcriptional Regulatory Module Retriever). This method uses the Mahalanobis distance to assess the validity of a TM and automatically incorporates the inter-TF binding similarity without resorting to pre-selecting family representatives. The application of TREMOR to human muscle-specific, liver-specific and cell-cycle-related genes reveals TFs and TMs that were validated from the literature and also reveals additional related genes.
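
For reference, the Mahalanobis distance mentioned in this abstract has the standard textbook form below. How TREMOR builds the vector and covariance matrix from motif data is not stated in the abstract, so the symbols here are generic assumptions: $\mathbf{x}$ is an observed score vector, $\boldsymbol{\mu}$ the mean vector of a reference (background) distribution, and $\Sigma$ its covariance matrix.

```latex
D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^{\top}\, \Sigma^{-1}\, (\mathbf{x} - \boldsymbol{\mu})}
```

When $\Sigma$ is the identity matrix this reduces to the Euclidean distance; off-diagonal entries of $\Sigma$ are what allow correlated components, such as similar binding motifs within a TF family, to be discounted rather than counted twice.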

    Machine Learning Approaches for Cancer Analysis

    In addition, we propose many machine learning models that serve as contributions to solve a biological problem. First, we present Zseq, a linear-time method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq measures the complexity of each sequence by counting its number of unique k-mers as its score, and also takes into account other factors, such as ambiguous nucleotides or a high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters those with a z-score less than the user-defined threshold. Zseq is able to provide a better mapping rate; it reduces the number of ambiguous bases significantly in comparison with other methods. The filtered reads were evaluated by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Studying the abundance of select mRNA species throughout prostate cancer progression may provide some insight into the molecular mechanisms that advance the disease. In the second contribution of this dissertation, we show that the combination of proper clustering, a distance function and index validation for clusters is suitable for identifying outlier transcripts, which trend differently from the majority of the transcripts; here, the trend of a transcript is its abundance throughout the different stages of prostate cancer. We compare this model with a standard hierarchical time-series clustering method based on Euclidean distance. 
Using time-series profile hierarchical clustering methods, we identified stage-specific mRNA species, termed outlier transcripts, that exhibit unique trending patterns compared to most other transcripts during disease progression. Unlike the hierarchical clustering method based on Euclidean distance, this method identifies those outliers rather than finding patterns among the trending transcripts. A wet-lab experiment on a biomarker (CAM2G gene) confirmed the result of the computational model. Genes related to these outlier transcripts were found to be strongly associated with cancer and, in particular, prostate cancer. Further investigation of these outlier transcripts in prostate cancer may identify them as potential stage-specific biomarkers that can predict the progression of the disease. Breast cancer, on the other hand, is a widespread type of cancer in females and accounts for many cancer cases and deaths worldwide. Identifying the subtype of breast cancer plays a crucial role in selecting the best treatment. In the third contribution, we propose an optimized hierarchical classification model that is used to predict the breast cancer subtype. Suitable filter feature selection methods and new hybrid feature selection methods are utilized to find discriminative genes. Our proposed model achieves 100% accuracy for predicting the breast cancer subtypes using the same or even fewer genes. Studying breast cancer survivability among different patients who received various treatments may help in understanding the relationship between survivability and treatment therapy based on gene expression. In the fourth contribution, we have built a classifier system that predicts whether a given breast cancer patient who underwent some form of treatment (hormone therapy, radiotherapy, or surgery) will survive beyond five years after the treatment. 
Our classifier is a tree-based hierarchical approach that partitions breast cancer patients based on survivability classes; each node in the tree is associated with a treatment therapy and finds a predictive subset of genes that can best predict whether a given patient will survive after that particular treatment. We applied our tree-based method to a gene expression dataset that consists of 347 treated breast cancer patients and identified potential biomarker subsets with prediction accuracies ranging from 80.9% to 100%. We have further investigated the roles of many biomarkers through the literature. Studying gene expression through various time intervals of breast cancer survival may provide insights into the recovery of the patients. Discovery of gene indicators can be a crucial step in predicting survivability and handling of breast cancer patients. In the fifth contribution, we propose a hierarchical clustering method to separate dissimilar groups of genes in time-series data as outliers. These isolated outliers, genes that trend differently from other genes, can serve as potential biomarkers of breast cancer survivability. In the last contribution, we introduce a method that uses machine learning techniques to identify transcripts that correlate with prostate cancer development and progression. We have isolated transcripts that have the potential to serve as prognostic indicators and may have significant value in guiding treatment decisions. Our study also supports PTGFR, NREP, scaRNA22, DOCK9, FLVCR2, IK2F3, USP13, and CLASP1 as potential biomarkers to predict prostate cancer progression, especially between stage II and subsequent stages of the disease
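
The k-mer complexity filter attributed to Zseq in this abstract can be illustrated with a simplified Python sketch. This is not the Zseq implementation: it scores each sequence only by its number of unique k-mers, ignores the ambiguous-nucleotide and GC-content adjustments mentioned in the abstract, and the function names, default `k`, and default threshold are hypothetical choices for illustration.

```python
from statistics import mean, pstdev


def unique_kmer_count(seq, k):
    """Number of distinct k-mers in seq (0 if seq is shorter than k)."""
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})


def filter_by_complexity(seqs, k=3, z_threshold=-1.0):
    """Keep sequences whose unique-k-mer z-score is at least z_threshold.

    First pass: score every sequence. Second pass: convert the scores to
    z-scores over the whole set and drop low-complexity sequences.
    """
    scores = [unique_kmer_count(s, k) for s in seqs]
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All sequences are equally complex; nothing to filter.
        return list(seqs)
    return [
        s for s, score in zip(seqs, scores)
        if (score - mu) / sigma >= z_threshold
    ]


# Toy reads: the homopolymer has a single unique 3-mer and is filtered out.
reads = ["ACGTACGTAG", "AAAAAAAAAA", "ACGGTCATGC", "TTGACCAGTA"]
kept = filter_by_complexity(reads, k=3, z_threshold=-1.0)
```

The two passes over the data (one to score, one to filter) mirror the description in the abstract; each pass is linear in the total sequence length for a fixed `k`.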

    Following the trail of cellular signatures: computational methods for the analysis of molecular high-throughput profiles

    Over the last three decades, high-throughput techniques, such as next-generation sequencing, microarrays, or mass spectrometry, have revolutionized biomedical research by enabling scientists to generate detailed molecular profiles of biological samples on a large scale. These profiles are usually complex, high-dimensional, and often prone to technical noise, which makes manual inspection practically impossible. Hence, powerful computational methods are required that enable the analysis and exploration of these data sets and thereby help researchers to gain novel insights into the underlying biology. In this thesis, we present a comprehensive collection of algorithms, tools, and databases for the integrative analysis of molecular high-throughput profiles. We developed these tools with two primary goals in mind: the detection of deregulated biological processes in complex diseases, like cancer, and the identification of driving factors within those processes. Our first contribution in this context comprises several major extensions of the GeneTrail web service that make it one of the most comprehensive toolboxes for the analysis of deregulated biological processes and signaling pathways. GeneTrail offers a collection of powerful enrichment and network analysis algorithms that can be used to examine genomic, epigenomic, transcriptomic, miRNomic, and proteomic data sets. In addition to approaches for the analysis of individual -omics types, our framework also provides functionality for the integrative analysis of multi-omics data sets, the investigation of time-resolved expression profiles, and the exploration of single-cell experiments. Besides the analysis of deregulated biological processes, we also focus on the identification of driving factors within those processes, in particular, miRNAs and transcriptional regulators. For miRNAs, we created the miRNA pathway dictionary database miRPathDB, which compiles links between miRNAs, target genes, and target pathways. 
Furthermore, it provides a variety of tools that help to study associations between them. For the analysis of transcriptional regulators, we developed REGGAE, a novel algorithm for the identification of key regulators that have a significant impact on deregulated genes, e.g., genes that show large expression differences in a comparison between disease and control samples. To analyze the influence of transcriptional regulators on deregulated biological processes, we also created the RegulatorTrail web service. In addition to REGGAE, this tool suite compiles a range of powerful algorithms that can be used to identify key regulators in transcriptomic, proteomic, and epigenomic data sets. Moreover, we evaluate the capabilities of our tool suite through several case studies that highlight the versatility and potential of our framework. In particular, we used our tools to conduct a detailed analysis of a Wilms' tumor data set. Here, we could identify a circuitry of regulatory mechanisms, including new potential biomarkers, that might contribute to the blastemal subtype's increased malignancy, which could potentially lead to new therapeutic strategies for Wilms' tumors. In summary, we present and evaluate a comprehensive framework of powerful algorithms, tools, and databases to analyze molecular high-throughput profiles. The provided methods are of broad interest to the scientific community and can help to elucidate complex pathogenic mechanisms.
German abstract (translated): Nowadays, molecular high-throughput measurement techniques, such as high-throughput sequencing, microarrays, or mass spectrometry, are routinely applied to characterize cells on a large scale and at different molecular levels. The resulting data sets are usually high-dimensional and often noisy. Powerful computational applications are therefore needed to enable their analysis. 
In this thesis, we present a range of effective algorithms, programs, and databases for the analysis of molecular high-throughput data sets. These approaches were developed to investigate deregulated biological processes and to identify important key molecules within them. In addition, a series of analyses was carried out to evaluate the different methods. For this purpose, we conducted in particular a Wilms tumor study in which we were able to identify various regulatory mechanisms and associated biomarkers that could be responsible for the increased malignancy of Wilms tumors of the blastema-rich subtype. These findings could lead to improved treatment of these tumors in the future. These results demonstrate that our approaches are able to evaluate various molecular high-throughput measurements and can help to elucidate pathogenic mechanisms associated with cancer or other complex diseases.