    Discovering Patterns from Sequences with Applications to Protein-Protein and Protein-DNA Interaction

    Understanding Protein-Protein and Protein-DNA interaction is of fundamental importance in deciphering gene regulation and other biological processes in living cells. Traditionally, new interaction knowledge is discovered through biochemical experiments that are often labor-intensive, expensive and time-consuming. Thus, computational approaches are preferred. Due to the abundance of sequence data available today, sequence-based interaction analysis has become one of the most readily applicable and cost-effective methods. One important problem in sequence-based analysis is to identify the functional regions from a set of sequences that belong to the same family or demonstrate similar biological functions in experiments. The rationale is that throughout evolution the functional regions normally remain conserved (intact), allowing them to be identified as patterns from a set of sequences. However, mutations such as substitution, insertion and deletion also occur in these functional regions. Existing methods, such as those based on position weight matrices, assume that the functional regions have a fixed width and thus cannot identify functional regions with mutations, particularly those with insertion or deletion mutations. Recently, Aligned Pattern Clustering (APCn) was introduced to identify functional regions as Aligned Pattern Clusters (APCs) by grouping and aligning patterns with variable width. Nevertheless, APCn cannot discover functional regions with substitution, insertion and/or deletion mutations, since their frequencies of occurrence are too low to be considered as patterns. To overcome such an impasse, this thesis proposes a new APC discovery algorithm known as Pattern-Directed Aligned Pattern Clustering (PD-APCn).
By first discovering seed patterns from the input sequence data, with their sequence positions located and recorded on an address table, PD-APCn can use the seed patterns to direct the incremental extension of functional regions with minor mutations. By grouping the aligned extended patterns, PD-APCn can recruit patterns adaptively and efficiently with variable width without relying on exhaustive optimal search. Experiments on synthetic datasets with different sizes and noise levels showed that PD-APCn can identify the implanted pattern with mutations, outperforming the popular existing motif-finding software MEME with much higher recall and F-measure and a computational speed-up of up to 665 times. When PD-APCn was applied to datasets from the Cytochrome C and Ubiquitin protein families, all key binding sites conserved in the families were captured in the APC outputs. In sequence-based interaction analysis, there is also a lack of a model for co-occurring functional regions with mutations, where co-occurring functional regions between interaction sequences are indicative of binding sites. This thesis proposes a new representation model, Co-Occurrence APCs, to capture co-occurring functional regions with mutations from interaction sequences in database transaction format. Applications to Protein-DNA and Protein-Protein interaction validated the capability of Co-Occurrence APCs. In Protein-DNA interaction, a new representation model, Protein-DNA Co-Occurrence APC, was developed for modeling Protein-DNA binding cores. The new model is more compact than the traditional one-to-one pattern associations, as it packs many-to-many associations in one model, yet it is detailed enough to allow site-specific variants. An algorithm, based on Co-Support Score, was also developed to discover Protein-DNA Co-Occurrence APCs from Protein-DNA interaction sequences. This algorithm is 1600x faster in run-time than its contemporaries.
New Protein-DNA binding cores indicated by Protein-DNA Co-Occurrence APCs were also discovered via homology modeling as a proof-of-concept. In Protein-Protein interaction, a new representation model, Protein-Protein Co-Occurrence APC, was developed for modeling the co-occurring sequence patterns in Protein-Protein Interaction between two protein sequences. A new algorithm, WeMine-P2P, was developed for sequence-based Protein-Protein Interaction machine learning prediction by constructing feature vectors leveraging Protein-Protein Co-Occurrence APCs, based on novel scores such as Match Score, MaxMatch Score and APC-PPI score. Through 40 independent experiments, it outperformed the well-known algorithm, PIPE2, which also uses co-occurring functional regions but does not allow variable widths and mutations. The applications to both Protein-Protein and Protein-DNA interaction have indicated the potential use of Co-Occurrence APC for exploring other types of biosequence interaction in the future.
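The seed-pattern idea in the abstract above (frequent patterns whose positions are recorded in an address table, then extended outward) can be illustrated with a minimal Python sketch. This is not the thesis's PD-APCn implementation: the function names `build_address_table` and `right_neighbours`, the use of fixed-length k-mers as seeds, the `min_support` threshold, and the toy sequences are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def build_address_table(sequences, k, min_support):
    """Map each k-mer to the (sequence index, offset) pairs where it occurs.

    Hypothetical helper: only k-mers found in at least `min_support`
    distinct sequences are kept as seed patterns.
    """
    table = defaultdict(list)
    for seq_idx, seq in enumerate(sequences):
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].append((seq_idx, pos))
    return {
        kmer: hits
        for kmer, hits in table.items()
        if len({seq_idx for seq_idx, _ in hits}) >= min_support
    }


def right_neighbours(sequences, hits, k):
    """Count the residues directly to the right of each seed occurrence.

    A mixed column (more than one residue) suggests a substitution at the
    position where the seed would be extended.
    """
    counts = Counter()
    for seq_idx, pos in hits:
        nxt = pos + k
        if nxt < len(sequences[seq_idx]):
            counts[sequences[seq_idx][nxt]] += 1
    return counts


# Toy sequences invented for this example (not data from the thesis).
sequences = ["MKCHACGT", "AACHACGA", "TTCHACST"]
table = build_address_table(sequences, k=3, min_support=3)
extension = right_neighbours(sequences, table["HAC"], k=3)
```

Here the seeds `CHA` and `HAC` occur in all three toy sequences, and the column to the right of `HAC` holds `G` twice and `S` once. A pattern-directed method could use the stored addresses to look at such neighbouring columns directly instead of searching every candidate pattern exhaustively.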

    Informal Social Protection and Poverty: A Case Study In Pakistan

    Many countries in the global south face financial constraints restricting their capacity to provide welfare to their populations. As a result, informal networks such as immediate and extended family and friends, NGOs, and religious organizations meet the welfare needs of a large segment of poor and vulnerable people in such countries through informal social protection. Madrassas are religious schools that have been prevalent in many Muslim countries for centuries. My research in Pakistan shows the importance of these institutions in satisfying the welfare needs of a large segment of the country's poor and vulnerable population. In this research, I compared the informal social protection provided by madrassas with formal social protection offered by the government to households to determine the usefulness of the former. Owing to the lack of data on informal welfare in Pakistan, I adopted an original multi-stage sampling methodology for data collection for this comparison. To collect the data for the study, 570 households were surveyed, and 90 were interviewed in-depth across 14 randomly selected cities of Pakistan based on the multidimensional poverty index. I found that most of the surveyed households were impoverished and facing serious insecurities and vulnerabilities. Unemployment, precarious informal sector jobs such as street vending, coal mining and agricultural tenancy, child labour, the prevalence of infectious diseases, absence of adequate insurance, and loss of lives and migration because of conflict and natural disasters were common among the surveyed families. These insecurities made them eligible to receive the benefits of formal welfare from the state. However, a sizeable majority were not receiving these benefits, although they were aware of such programmes. I found that madrassas are a significant source of welfare for the surveyed families in addition to being education providers.
The benefits received from madrassas included cash assistance to the families in times of need, health treatments, help with marriage and burial services, and, most importantly, madrassa education that makes their children employable. The large collected dataset was also explored using an efficient unsupervised machine learning algorithm, K-means clustering, triangulated with semi-structured interviews, to form clusters representing multiple welfare regimes in Pakistan at one point in time. Each cluster exhibits the features of a distinct regime, with the regimes coexisting at one point in time. A fifth regime was also found by using secondary data. I conclude that in a lower-income country such as Pakistan, a large section of the overlooked marginalized groups relies on informal welfare administered by madrassas because the coverage of formal welfare is low and the income benefits are inadequate. Meeting the eligibility criteria to receive welfare from the government appears to limit access to formal welfare programmes. The accurate identification of the poor remains an unaddressed issue. In contrast, beneficiary families consider informal welfare more valuable because it is provided in a timely manner, beneficiaries are not expected to meet bureaucratic eligibility criteria, and sufficient support is received to manage their requirements. By doing so, this study makes the following critical theoretical and empirical contributions: a) the study used existing literature to define and conceptualise informal social protection.
The conceptualisation provided a framework that was used to compare informal social protection with formal social protection; b) a unique data collection and analysis methodology is presented for identifying poor and vulnerable households for social policy interventions in a low-income country; c) this methodology also helps identify various welfare regimes present within a country at one point in time; and d) the study highlights the importance of integrating informal with formal actors in social policy-making in low-income countries.
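The K-means clustering step mentioned in the abstract above can be sketched in a few lines of Python. This is only an illustration of the standard algorithm (Lloyd's iteration), not the study's analysis pipeline: the two features (a formal-welfare score and an informal-welfare score per household), the toy values, and the fixed starting centroids are all hypothetical and are not taken from the survey data.

```python
def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(
                range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members;
        # an empty cluster keeps its previous centroid.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(old)
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical households: (formal-welfare score, informal-welfare score), 0 to 10.
households = [(9, 1), (7, 3), (1, 9), (3, 7)]
labels, centroids = kmeans(households, centroids=[households[0], households[2]])
```

With these toy values the first two households fall into one cluster and the last two into another, which is the sense in which each cluster could be read as a distinct welfare regime. In practice the number of clusters and the starting centroids are choices that need to be justified; in the study, the clusters were triangulated with semi-structured interviews.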