
    Discovering Protein Functional Regions and Protein-Protein Interaction using Co-occurring Aligned Pattern Clusters

    Bioinformatics is a rapidly expanding field of research due to multiple recent advancements: 1) the advent of machine intelligence, 2) the increase of computing power, 3) our better understanding of the underlying biomolecular mechanisms, and 4) the drastic reduction of biosequencing cost and time. Since wet-laboratory approaches to analyzing protein sequences are still labour intensive and time consuming, more cost-effective computational approaches for analyzing protein sequences and their biochemical interactions are crucial. This is especially true when we encounter a large collection of protein sequences. Aligned Pattern CLustering (APCL), an algorithm which combines machine intelligence methodologies such as pattern recognition, pattern discovery, pattern clustering and alignment, formulated by my research group and myself, is one such technique. APCL discovers, prunes, and clusters aligned statistically significant patterns to assemble a related, or specifically, a homologous group of patterns in the form of an Aligned Pattern Cluster (APC). In many of our empirical experiments, the APCs obtained are found to correspond to statistically and functionally significant association patterns, which in turn correspond to conserved regions, such as binding segments within and between protein sequences as well as between Protein Transcription Factor (TF) and DNA Transcription Factor Binding Sites (TFBS). While several known algorithms also exist to find functionally conserved segments in biosequences, they are less flexible and require more parameters than APCL. Hence, APCL is a powerful tool to analyze biosequences. Because of its effectiveness, the use of APCL is extended from assisting the discovery and analysis of functional regions of protein sequences to exploring, from the discovered APCs, the co-occurrence of patterns on the same sequences and of interacting patterns between sequences.
Two new algorithms are introduced and reported in this thesis in the exploration of 1) APCs containing patterns residing within the same biosequences and 2) APCs containing patterns residing between interacting biosequences. The first algorithm attempts to cluster APCs that share patterns on the same biosequences. It uses a co-occurrence score between APCs in a co-occurrence APC pair (two APCs containing co-occurrence patterns) to account for the proportion of biosequences of co-occurrence patterns they share against the total number of sequences containing them. Using this score as a similarity measure (or more precisely, as a co-occurrence measure), we devise a Co-occurrence APC Clustering Algorithm to cluster APCs obtained from a collection of related biosequences into a Co-Occurrence Cluster of APCs, abbreviated as cAPC. Each cAPC is then analyzed to verify whether essential biological functions are associated with the APCs within that cluster. Cytochrome c and ubiquitin families were analyzed in depth, and it was validated that members of the same cAPC do cover the functional regions that have essential cooperative biological functions. The second algorithm takes advantage of the effectiveness of APCL to create a protein-protein interaction (PPI) identification and prediction algorithm. PPI prediction is a hot research problem in bioinformatics and proteomics. A good number of algorithms exist. The state-of-the-art algorithms achieve a high success rate in prediction performance, but provide results that are difficult to interpret. The research in this thesis tries to overcome this hurdle. This second algorithm uses an APC-PPI score between two APCs to account for the proportion of patterns residing on two different protein sequences. This score measures how often patterns in both APCs co-occur in the sequence data of two known interacting proteins.
The scores are then used to construct feature vectors, first to train a learning model from the known PPI data and later to predict the possible PPI between a protein pair. The algorithm's performance was comparable to that of the state-of-the-art algorithms, but it provided results that are interpretable. The results from both algorithms, which extend APCL to find co-occurring patterns via the co-occurrence of APCs, proved to be effective and useful, since APCL finds APCs quickly and effectively. The first algorithm discovered biological insights, supported by biological literature, which typically cannot be discovered solely through the analysis of biosequences. The second algorithm succeeded in providing accurate and descriptive PPI predictions. Hence, these two algorithms are useful in the analysis and prediction of proteins. In addition, through continued research and development of the second algorithm, it could become a powerful tool for the drug industry, as it can help find new PPIs, an important step in developing new drugs for different drug targets.
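
The co-occurrence score described in this abstract can be illustrated with a minimal Python sketch. The abstract gives only a verbal description (shared sequences against the total number of sequences containing the patterns), so the Jaccard-style ratio below is an assumption, not the thesis's exact formula. The single-linkage grouping, the function names, the threshold, and the toy data are likewise hypothetical and stand in for the Co-occurrence APC Clustering Algorithm, whose details are not given here.

```python
from itertools import combinations


def cooccurrence_score(seqs_a, seqs_b):
    """Assumed Jaccard-style score: sequences containing patterns from both
    APCs, divided by sequences containing patterns from either APC."""
    a, b = set(seqs_a), set(seqs_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def cluster_apcs(apc_sequences, threshold=0.5):
    """Group APCs whose pairwise score reaches `threshold` (single linkage).
    This is an illustrative stand-in, not the thesis's clustering algorithm."""
    parent = {name: name for name in apc_sequences}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(apc_sequences, 2):
        if cooccurrence_score(apc_sequences[a], apc_sequences[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in apc_sequences:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical toy data: APC id -> ids of sequences containing one of its patterns.
apc_sequences = {
    "APC1": {"s1", "s2", "s3", "s4"},
    "APC2": {"s2", "s3", "s4", "s5"},
    "APC3": {"s6", "s7"},
}
clusters = cluster_apcs(apc_sequences, threshold=0.5)
```

In this toy input, APC1 and APC2 share three of the five sequences they cover, so they fall into one group, while APC3 shares none and stays alone.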

    Discovery and Analysis of Aligned Pattern Clusters from Protein Family Sequences

    Protein sequences are essential for encoding molecular structures and functions. Consequently, biologists invest substantial resources and time discovering functional patterns in proteins. Using high-throughput technologies, biologists are generating an increasing amount of data. Thus, the major challenge in biosequencing today is the ability to conduct data analysis in an efficient and productive manner. Conserved amino acids in proteins reveal important functional domains within protein families. Conversely, less conserved amino acid variations within these protein sequence patterns reveal areas of evolutionary and functional divergence. Exploring protein families using existing methods such as multiple sequence alignment is computationally expensive; thus, pattern search is used. However, at present, combinatorial methods of pattern search generate a large set of solutions, and probabilistic methods require richer representations. They require biological ground truth of the input sequences, such as gene name or taxonomic species, as class labels based on traditional classification practice to train a model for predicting unknown sequences. However, these algorithms are inherently biased by mislabelling and may not be able to reveal class characteristics in a detailed and succinct manner. A novel pattern representation called an Aligned Pattern Cluster (AP Cluster), as developed in this dissertation, is compact yet rich. It captures conservations and variations of amino acids, covers more sequences with lower entropy, and greatly reduces the number of patterns. AP Clusters contain statistically significant patterns with variations; their importance has been confirmed by the following biological evidence: 1) Most of the discovered AP Clusters correspond to binding segments, while their aligned columns correspond to binding sites, as verified by Pfam, PROSITE, and the three-dimensional structure.
2) By compacting strongly correlated functional information together, AP Clusters are able to reveal class characteristics for taxonomical classes, gene classes and other functional classes, or to reveal incorrect class labelling. 3) AP Clusters that co-occur on the same homologous protein sequences are spatially close in the protein's three-dimensional structure. These results demonstrate the power and usefulness of AP Clusters. They bring similar statistically significant patterns with variation together and align them to reveal protein regional functionality, class characteristics, and binding and interacting sites for the study of protein-protein and protein-drug interactions, for differentiation of cancer tumour types, and for targeted gene therapy as well as drug target discovery.
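
The "lower entropy" property mentioned in this abstract can be made concrete with a small Python sketch that computes the Shannon entropy of each aligned column in a cluster of patterns. The dissertation's own entropy measure is not given in the abstract, so per-column Shannon entropy in bits is an assumption, and the function name and the toy patterns are hypothetical.

```python
from collections import Counter
from math import log2


def column_entropies(aligned_patterns):
    """Shannon entropy (in bits) of each column of equally long aligned patterns.
    Every character, including any gap symbol, is counted as a residue."""
    width = len(aligned_patterns[0])
    if any(len(pattern) != width for pattern in aligned_patterns):
        raise ValueError("patterns must be aligned to the same width")

    entropies = []
    for column in zip(*aligned_patterns):
        total = len(column)
        counts = Counter(column)
        entropies.append(
            sum(-(n / total) * log2(n / total) for n in counts.values())
        )
    return entropies


# Hypothetical aligned patterns: columns 1 and 4 are fully conserved,
# columns 2 and 3 carry amino acid variation.
patterns = ["CAAC", "CSAC", "CAAC", "CSGC"]
entropies = column_entropies(patterns)
```

A fully conserved column has entropy zero, and a column split evenly between two residues has entropy of one bit, so low-entropy columns mark the conserved positions that the abstract associates with binding sites.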

    Discovering Patterns from Sequences with Applications to Protein-Protein and Protein-DNA Interaction

    Understanding Protein-Protein and Protein-DNA interaction is of fundamental importance in deciphering gene regulation and other biological processes in living cells. Traditionally, new interaction knowledge is discovered through biochemical experiments that are often labor intensive, expensive and time-consuming. Thus, computational approaches are preferred. Due to the abundance of sequence data available today, sequence-based interaction analysis has become one of the most readily applicable and cost-effective methods. One important problem in sequence-based analysis is to identify the functional regions from a set of sequences within the same family or demonstrating similar biological functions in experiments. The rationale is that throughout evolution the functional regions normally remain conserved (intact), allowing them to be identified as patterns from a set of sequences. However, there are also mutations such as substitution, insertion, and deletion in these functional regions. Existing methods, such as those based on position weight matrices, assume that the functional regions have a fixed width and thus cannot identify functional regions with mutations, particularly those with insertion or deletion mutations. Recently, Aligned Pattern Clustering (APCn) was introduced to identify functional regions as Aligned Pattern Clusters (APCs) by grouping and aligning patterns with variable width. Nevertheless, APCn cannot discover functional regions with substitution, insertion and/or deletion mutations, since their frequencies of occurrence are too low for them to be considered as patterns. To overcome such an impasse, this thesis proposes a new APC discovery algorithm known as Pattern-Directed Aligned Pattern Clustering (PD-APCn).
By first discovering seed patterns from the input sequence data, with their sequence positions located and recorded on an address table, PD-APCn can use the seed patterns to direct the incremental extension of functional regions with minor mutations. By grouping the aligned extended patterns, PD-APCn can recruit patterns adaptively and efficiently with variable width without relying on exhaustive optimal search. Experiments on synthetic datasets with different sizes and noise levels showed that PD-APCn can identify the implanted pattern with mutations, outperforming the popular existing motif-finding software MEME with much higher recall and F-measure and a computational speed-up of up to 665 times. When PD-APCn was applied to datasets from the Cytochrome C and Ubiquitin protein families, all key binding sites conserved in the families were captured in the APC outputs. In sequence-based interaction analysis, there is also a lack of a model for co-occurring functional regions with mutations, where co-occurring functional regions between interaction sequences are indicative of binding sites. This thesis proposes a new representation model, Co-Occurrence APCs, to capture co-occurring functional regions with mutations from interaction sequences in database transaction format. Applications to Protein-DNA and Protein-Protein interaction validated the capability of Co-Occurrence APCs. In Protein-DNA interaction, a new representation model, Protein-DNA Co-Occurrence APC, was developed for modeling Protein-DNA binding cores. The new model is more compact than the traditional one-to-one pattern associations, as it packs many-to-many associations in one model, yet it is detailed enough to allow site-specific variants. An algorithm, based on Co-Support Score, was also developed to discover Protein-DNA Co-Occurrence APCs from Protein-DNA interaction sequences. This algorithm is 1600x faster in run-time than its contemporaries.
New Protein-DNA binding cores indicated by Protein-DNA Co-Occurrence APCs were also discovered via homology modeling as a proof-of-concept. In Protein-Protein interaction, a new representation model, Protein-Protein Co-Occurrence APC, was developed for modeling the co-occurring sequence patterns in Protein-Protein Interaction between two protein sequences. A new algorithm, WeMine-P2P, was developed for sequence-based Protein-Protein Interaction machine learning prediction by constructing feature vectors leveraging Protein-Protein Co-Occurrence APCs, based on novel scores such as Match Score, MaxMatch Score and APC-PPI score. Through 40 independent experiments, it outperformed the well-known algorithm PIPE2, which also uses co-occurring functional regions but does not allow variable widths and mutations. Both applications, to Protein-Protein and to Protein-DNA interaction, have indicated the potential use of Co-Occurrence APC for exploring other types of biosequence interaction in the future.
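
The address table mentioned in this abstract, which records where each seed pattern occurs, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: fixed-length k-mers with a minimum sequence support stand in for the thesis's seed patterns, whose actual discovery criterion is not given in the abstract, and the pattern-directed extension step of PD-APCn is not shown. The function name, parameters, and toy sequences are hypothetical.

```python
from collections import defaultdict


def build_address_table(sequences, k, min_support):
    """Map each k-mer to the (sequence index, offset) pairs where it occurs,
    keeping only k-mers found in at least `min_support` distinct sequences."""
    table = defaultdict(list)
    for seq_idx, seq in enumerate(sequences):
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].append((seq_idx, pos))

    return {
        kmer: addresses
        for kmer, addresses in table.items()
        if len({seq_idx for seq_idx, _ in addresses}) >= min_support
    }


# Hypothetical toy sequences sharing the segment "CAAH" at different offsets.
sequences = ["MKCAAHC", "TTCAAHG", "CAAHPLW"]
seeds = build_address_table(sequences, k=4, min_support=3)
```

With the addresses stored, a later step can jump straight to each occurrence and compare the flanking residues, which is the kind of lookup that lets extension proceed without rescanning every sequence.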

    Co-occurrence Clusters of Aligned Pattern Clusters
