3 research outputs found

    A knowledgebase of stress responsive gene regulatory elements in Arabidopsis thaliana

    Magister Scientiae - MSc. Stress-responsive genes play a key role in shaping how plants process and respond to environmental stress. Their gene products are linked to DNA transcription and its consequent translation into a response product. However, whilst these genes play a significant role in producing responses to stressful stimuli, transcription factors coordinate access to these genes, specifically by binding a gene's promoter region, which houses transcription factor binding sites. Here transcriptional elements play a key role in mediating responses to environmental stress, where each transcription factor binding site may constitute a potential response to a stress signal. Arabidopsis thaliana, a model organism, can be used to identify how transcription factors shape a plant's survival in a stressful environment. Whilst there are numerous plant stress research groups, globally there is a shortage of publicly available stress-responsive gene databases. In addition, a number of previous databases, such as the Generation Challenge Programme's comparative plant stress-responsive gene catalogue, Stresslink and DRASTIC, have become defunct, whilst others have stagnated. There is currently a single Arabidopsis thaliana stress response database, STIFDB, which was launched in 2008 and only covers abiotic stresses as handled by major abiotic stress-responsive transcription factor families. Its data were sourced from microarray expression databases, contain numerous omissions as well as numerous erroneous entries, and have not been updated since its inception. The Dragon Arabidopsis Stress Transcription Factor database (DASTF) was developed in response to the current lack of stress response gene resources. A total of 2333 entries were downloaded from SWISSPROT, manually curated and imported into DASTF. The entries represent 424 transcription factor families.
Each entry has a corresponding SWISSPROT, ENTREZ GENBANK and TAIR accession number. The 5' untranslated regions (UTRs) of 417 families were scanned against TRANSFAC's binding site catalogue to identify binding sites. The relational database consists of two tables, a transcription factor table called DASTF_TF and a transcription factor family table called TF_Family. Using a two-tier client-server architecture, a web server was built with PHP, APACHE and MYSQL, and the data were loaded into these tables with a PYTHON script. Of the entries in DASTF, 60 correspond to biotic stress, 167 correspond to abiotic stress and 2106 respond to biotic and/or abiotic stress. Users can search the database using text, family, chromosome and stress type search options. Online tools such as HMMER, CLUSTALW, BLAST and HYDROCALCULATOR have been integrated into the DASTF database. Users can upload sequences and use HMMER to identify which transcription factor family their sequences belong to. The website can be accessed at http://apps.sanbi.ac.za/dastf/ and two updates per year are envisaged. South Afric
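The two-table layout and the search options described above can be illustrated with a minimal sketch. Only the table names DASTF_TF and TF_Family come from the abstract; every column name, the 'both' stress label, and the placeholder rows are assumptions made for illustration, and Python's built-in sqlite3 module stands in for the MYSQL server the thesis used.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory stand-in for the two DASTF tables (columns are assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE TF_Family (
            family_id   INTEGER PRIMARY KEY,
            family_name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE DASTF_TF (
            tf_id         INTEGER PRIMARY KEY,
            swissprot_acc TEXT,
            tair_acc      TEXT,
            chromosome    INTEGER,
            stress_type   TEXT CHECK (stress_type IN ('biotic', 'abiotic', 'both')),
            family_id     INTEGER REFERENCES TF_Family(family_id)
        );
    """)
    # Placeholder rows for the sketch, not real DASTF records.
    conn.executemany(
        "INSERT INTO TF_Family VALUES (?, ?)",
        [(1, "FAMILY_A"), (2, "FAMILY_B")],
    )
    conn.executemany(
        "INSERT INTO DASTF_TF VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "ACC0001", "AT1G00001", 1, "abiotic", 1),
            (2, "ACC0002", "AT2G00002", 2, "biotic", 1),
            (3, "ACC0003", "AT1G00003", 1, "both", 2),
        ],
    )
    return conn

def search(conn, family=None, chromosome=None, stress_type=None):
    """Return TAIR accessions matching whichever filters are supplied."""
    clauses, params = [], []
    if family is not None:
        clauses.append("f.family_name = ?")
        params.append(family)
    if chromosome is not None:
        clauses.append("t.chromosome = ?")
        params.append(chromosome)
    if stress_type is not None:
        clauses.append("t.stress_type = ?")
        params.append(stress_type)
    sql = ("SELECT t.tair_acc FROM DASTF_TF t "
           "JOIN TF_Family f ON f.family_id = t.family_id")
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY t.tf_id"
    return [row[0] for row in conn.execute(sql, params)]
```

Keeping families in their own table means a family name is stored once and each transcription factor row refers to it by key, which is one plausible reason for the two-table split; the parameterised query keeps the optional filters safe to combine.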

    Discovering Patterns from Sequences with Applications to Protein-Protein and Protein-DNA Interaction

    Understanding Protein-Protein and Protein-DNA interaction is of fundamental importance in deciphering gene regulation and other biological processes in living cells. Traditionally, new interaction knowledge is discovered through biochemical experiments that are often labor-intensive, expensive and time-consuming. Thus, computational approaches are preferred. Due to the abundance of sequence data available today, sequence-based interaction analysis has become one of the most readily applicable and cost-effective methods. One important problem in sequence-based analysis is to identify the functional regions from a set of sequences that belong to the same family or demonstrate similar biological functions in experiments. The rationale is that throughout evolution the functional regions normally remain conserved (intact), allowing them to be identified as patterns from a set of sequences. However, mutations such as substitution, insertion and deletion also occur in these functional regions. Existing methods, such as those based on position weight matrices, assume that the functional regions have a fixed width and thus cannot identify functional regions with mutations, particularly those with insertion or deletion mutations. Recently, Aligned Pattern Clustering (APCn) was introduced to identify functional regions as Aligned Pattern Clusters (APCs) by grouping and aligning patterns with variable width. Nevertheless, APCn cannot discover functional regions with substitution, insertion and/or deletion mutations, since their frequencies of occurrence are too low to be considered as patterns. To overcome this impasse, this thesis proposes a new APC discovery algorithm known as Pattern-Directed Aligned Pattern Clustering (PD-APCn).
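The fixed-width limitation of position weight matrices mentioned above can be seen in a small sketch. This is a generic log-odds PWM with a uniform background and a pseudocount, written for illustration only; the function names, the DNA alphabet, and the example site are assumptions and are not taken from the thesis.

```python
import math

def build_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds PWM from equal-width sites against a uniform background."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("a PWM needs sites of equal width")
    background = 1.0 / len(alphabet)
    pwm = []
    for col in range(width):
        counts = {letter: pseudocount for letter in alphabet}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        pwm.append({
            letter: math.log2(counts[letter] / total / background)
            for letter in alphabet
        })
    return pwm

def best_window(pwm, sequence):
    """Score every window of the PWM's width and return (best_score, start)."""
    width = len(pwm)
    scores = [
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scores)

# Hypothetical site used only to exercise the functions.
pwm = build_pwm(["TATAAT"] * 4)
exact_score, exact_start = best_window(pwm, "GGTATAATGG")
inserted_score, _ = best_window(pwm, "GGTATCAATGG")  # one residue inserted
```

Because every window has the same width as the matrix, a single inserted residue shifts the remaining columns out of register, so no window of the second sequence scores as well as the exact occurrence in the first.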
By first discovering seed patterns from the input sequence data, with their sequence positions located and recorded in an address table, PD-APCn can use the seed patterns to direct the incremental extension of functional regions with minor mutations. By grouping the aligned extended patterns, PD-APCn can recruit patterns adaptively and efficiently with variable width without relying on exhaustive optimal search. Experiments on synthetic datasets with different sizes and noise levels showed that PD-APCn can identify the implanted pattern with mutations, outperforming the popular motif-finding software MEME with much higher recall and F-measure and a computational speed-up of up to 665 times. When PD-APCn was applied to datasets from the Cytochrome C and Ubiquitin protein families, all key binding sites conserved in the families were captured in the APC outputs. In sequence-based interaction analysis, there is also a lack of a model for co-occurring functional regions with mutations, where co-occurring functional regions between interacting sequences are indicative of binding sites. This thesis proposes a new representation model, Co-Occurrence APCs, to capture co-occurring functional regions with mutations from interaction sequences in database transaction format. Applications to Protein-DNA and Protein-Protein interaction validated the capability of Co-Occurrence APCs. In Protein-DNA interaction, a new representation model, Protein-DNA Co-Occurrence APC, was developed for modeling Protein-DNA binding cores. The new model is more compact than the traditional one-to-one pattern associations, as it packs many-to-many associations into one model, yet it is detailed enough to allow site-specific variants. An algorithm based on the Co-Support Score was also developed to discover Protein-DNA Co-Occurrence APCs from Protein-DNA interaction sequences. This algorithm is 1600x faster in run-time than its contemporaries.
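The idea of recording seed patterns in an address table and using those addresses to look at neighbouring residues can be sketched as follows. This is a simplified illustration and not the PD-APCn algorithm itself: fixed-length k-mers stand in for seed patterns, the support threshold and function names are assumptions, and the toy sequences are invented for the example.

```python
from collections import Counter, defaultdict

def build_address_table(sequences, k, min_support):
    """Map each k-mer found in at least `min_support` sequences to its addresses.

    An address is a (sequence_index, position) pair.
    """
    table = defaultdict(list)
    for seq_idx, seq in enumerate(sequences):
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].append((seq_idx, pos))
    return {
        kmer: addresses
        for kmer, addresses in table.items()
        if len({seq_idx for seq_idx, _ in addresses}) >= min_support
    }

def residues_after_seed(sequences, addresses, k):
    """Count the residue that follows each seed occurrence, variants included."""
    following = Counter()
    for seq_idx, pos in addresses:
        nxt = pos + k
        if nxt < len(sequences[seq_idx]):
            following[sequences[seq_idx][nxt]] += 1
    return following

# Invented toy sequences, used only to exercise the functions.
toy = ["AACHKGT", "TTCHKGA", "GCHKAT"]
seeds = build_address_table(toy, k=3, min_support=3)
neighbours = residues_after_seed(toy, seeds["CHK"], k=3)
```

Since the addresses are stored once, extending a seed only requires visiting its recorded positions rather than rescanning every sequence, and a minority residue next to the seed (here one sequence differs from the other two) remains visible as a candidate variant instead of being discarded for low frequency.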
New Protein-DNA binding cores indicated by Protein-DNA Co-Occurrence APCs were also discovered via homology modeling as a proof of concept. In Protein-Protein interaction, a new representation model, Protein-Protein Co-Occurrence APC, was developed for modeling the co-occurring sequence patterns between two interacting protein sequences. A new algorithm, WeMine-P2P, was developed for sequence-based machine learning prediction of Protein-Protein Interaction; it constructs feature vectors from Protein-Protein Co-Occurrence APCs, based on novel scores such as Match Score, MaxMatch Score and APC-PPI score. In 40 independent experiments, it outperformed the well-known algorithm PIPE2, which also uses co-occurring functional regions but does not allow variable widths or mutations. The applications to both Protein-Protein and Protein-DNA interaction indicate the potential use of Co-Occurrence APCs for exploring other types of biosequence interaction in the future.