36 research outputs found

    Pattern Discovery and Disentanglement for Clinical Data Analysis

    In recent years, machine learning approaches have achieved important empirical successes in analysing data such as images, signals, text and speech, with applications in biomedical and clinical areas. However, from the perspective of modelling, many machine learning methods still encounter crucial problems such as the lack of transparency and interpretability. Frequent Pattern Mining or Association Mining methods intend to solve the problem of interpretability, but they also encounter serious problems such as requiring exhaustive search and producing overwhelming numbers of patterns. From the perspective of data analysis, they do not achieve high prediction accuracy, particularly for data with low volume, rare or imbalanced groups, rare cases or biases, because of subtle overlapping or entanglement of the statistical and functional associations at the data source level. Hence, Professor Andrew K.C. Wong and I have developed a novel Pattern Discovery and Disentanglement (PDD) method to discover explicit patterns and unveil knowledge from relational datasets, even those encompassing imbalanced groups, biases and anomalies. The statistically significant high-order patterns, pattern clusters and rare patterns are discovered in the disentangled Attribute Value Association (AVA) spaces. They may be embedded in a relational dataset but overlap or are entangled with each other, so that they are masked or obscured at the data level. The patterns discovered from the disentangled association source can be used for explicitly interpreting the original data, predicting the functional groups/classes and detecting anomalies and/or outliers. When class labels are not given, pattern/entity clusters can be discovered more effectively from the disentangled AVA space than from the original records.
The objective of this Master's thesis is to develop and validate the efficacy of PDD for genomic and clinical data analysis using a) protein sequence data, b) public clinical records from UCI datasets and c) a clinical dataset obtained from the School of Public Health and Health Systems at the University of Waterloo. The experimental results, which show superior performance to existing methods in both unsupervised and supervised learning, are presented in interpretable knowledge representation frameworks interlinking the AVA disentangled sources, patterns, pattern/entity clusters and individual entities. In the clinical cases, PDD reveals the symptomatic patterns of individual patients, disease complexes/groups and subtle etiological sources. Hence it will have an impact on machine learning for genomic and clinical data, with broad applications.

    Discovering Patterns from Sequences with Applications to Protein-Protein and Protein-DNA Interaction

    Understanding Protein-Protein and Protein-DNA interaction is of fundamental importance in deciphering gene regulation and other biological processes in living cells. Traditionally, new interaction knowledge is discovered through biochemical experiments that are often labor intensive, expensive and time-consuming. Thus, computational approaches are preferred. Due to the abundance of sequence data available today, sequence-based interaction analysis has become one of the most readily applicable and cost-effective methods. One important problem in sequence-based analysis is to identify the functional regions from a set of sequences within the same family or demonstrating similar biological functions in experiments. The rationale is that throughout evolution the functional regions normally remain conserved (intact), allowing them to be identified as patterns from a set of sequences. However, there are also mutations such as substitution, insertion and deletion in these functional regions. Existing methods, such as those based on position weight matrices, assume that the functional regions have a fixed width and thus cannot identify functional regions with mutations, particularly those with insertion or deletion mutations. Recently, Aligned Pattern Clustering (APCn) was introduced to identify functional regions as Aligned Pattern Clusters (APCs) by grouping and aligning patterns with variable width. Nevertheless, APCn cannot discover functional regions with substitution, insertion and/or deletion mutations, since their frequencies of occurrence are too low to be considered as patterns. To overcome such an impasse, this thesis proposes a new APC discovery algorithm known as Pattern-Directed Aligned Pattern Clustering (PD-APCn).
By first discovering seed patterns from the input sequence data, with their sequence positions located and recorded on an address table, PD-APCn can use the seed patterns to direct the incremental extension of functional regions with minor mutations. By grouping the aligned extended patterns, PD-APCn can recruit patterns adaptively and efficiently with variable width without relying on exhaustive optimal search. Experiments on synthetic datasets with different sizes and noise levels showed that PD-APCn can identify the implanted pattern with mutations, outperforming the popular motif-finding software MEME with much higher recall and F-measure and a computational speed-up of up to 665 times. When PD-APCn was applied to datasets from the Cytochrome C and Ubiquitin protein families, all key binding sites conserved in the families were captured in the APC outputs. In sequence-based interaction analysis, there is also a lack of a model for co-occurring functional regions with mutations, where co-occurring functional regions between interaction sequences are indicative of binding sites. This thesis proposes a new representation model, Co-Occurrence APCs, to capture co-occurring functional regions with mutations from interaction sequences in database transaction format. Applications to Protein-DNA and Protein-Protein interaction validated the capability of Co-Occurrence APCs. In Protein-DNA interaction, a new representation model, Protein-DNA Co-Occurrence APC, was developed for modeling Protein-DNA binding cores. The new model is more compact than the traditional one-to-one pattern associations, as it packs many-to-many associations in one model, yet it is detailed enough to allow site-specific variants. An algorithm, based on Co-Support Score, was also developed to discover Protein-DNA Co-Occurrence APCs from Protein-DNA interaction sequences. This algorithm is 1600x faster in run-time than its contemporaries.
New Protein-DNA binding cores indicated by Protein-DNA Co-Occurrence APCs were also discovered via homology modeling as a proof-of-concept. In Protein-Protein interaction, a new representation model, Protein-Protein Co-Occurrence APC, was developed for modeling the co-occurring sequence patterns in Protein-Protein Interaction between two protein sequences. A new algorithm, WeMine-P2P, was developed for sequence-based Protein-Protein Interaction machine learning prediction by constructing feature vectors leveraging Protein-Protein Co-Occurrence APCs, based on novel scores such as Match Score, MaxMatch Score and APC-PPI score. Through 40 independent experiments, it outperformed the well-known algorithm PIPE2, which also uses co-occurring functional regions but does not allow variable widths or mutations. Both applications, to Protein-Protein and Protein-DNA interaction, have indicated the potential use of Co-Occurrence APC for exploring other types of biosequence interaction in the future.
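The idea of seed patterns recorded on an address table, as described in the abstract above, can be illustrated with a minimal sketch. This is not the PD-APCn algorithm itself: it assumes fixed-width k-mers as seed candidates and a plain occurrence-count threshold in place of any statistical significance test, and the names `build_address_table`, `seed_patterns` and `min_support` are hypothetical.

```python
from collections import defaultdict


def build_address_table(sequences, k):
    """Map every k-mer to the (sequence index, start position) pairs where it occurs."""
    table = defaultdict(list)
    for seq_idx, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            table[seq[start:start + k]].append((seq_idx, start))
    return table


def seed_patterns(table, min_support):
    """Keep k-mers that occur in at least `min_support` distinct sequences (assumed criterion)."""
    seeds = {}
    for kmer, addresses in table.items():
        if len({seq_idx for seq_idx, _ in addresses}) >= min_support:
            seeds[kmer] = addresses
    return seeds


# Toy input: "CHC" is conserved in three sequences; the fourth carries a substitution ("CHA").
sequences = ["AACHCGT", "TTCHCAA", "GCHCTTA", "ACACHAG"]
table = build_address_table(sequences, k=3)
seeds = seed_patterns(table, min_support=3)
```

The recorded addresses are where an extension step could inspect the residues flanking each seed in order to recruit mutated variants; that step is not implemented here.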

    Cell polarisation in geometry

    Novel biomarkers to guide therapy in chronic inflammatory diseases

    In this thesis, we focused on the role of epigenetic modifications in Inflammatory Bowel Disease (IBD) and other immune-mediated diseases such as rheumatoid arthritis. In particular, we elaborated on aberrant DNA methylation and investigated its potential to predict therapy response to biological treatment. Furthermore, we explored other novel biomarkers of response to biological treatment in IBD, using microbial signatures and single-cell transcriptomics.

    Molecular dissection of Light Perception in Zebrafish

    HOLOBIOMICS - Use of microbiomics for the exploration of microbial communities in holobionts.

    Since the introduction of High-Throughput Sequencing techniques more than a decade ago, our opportunities to shed light on complex microbial communities have increased exponentially. This revolution opened a ‘golden era’ in the new-born field of microbiomics, avoiding the culturing step that had always been a limiting factor in the characterization of particular and fastidious groups of microorganisms. Furthermore, there is a clear advantage in retrieving all the taxonomic and functional information encoded within a microbiome directly by sequencing a sample from an environment of interest. Handling the huge amount of information produced in studies relying on NGS is a challenging task, and it has been the main driver for the emergence of the computational microbiologist: a new figure alongside the molecular microbiologist and the classic microbiologist. This researcher’s work starts when the laboratory work ends and the sequencing process is completed: the aim of a computational microbiologist is to deal with the vast amount of data generated by the sequencing process, producing biologically meaningful data. During my PhD I have focused on these latter tasks, characterizing various holobionts at different levels, ranging from wild animals to humans, and paying attention to the bacterial, fungal and viral fractions in ecosystems. In the present work I report the main achievements of my research, whose common denominator is the bioinformatic approach to microbiome data. In the cases I studied, I observed a mutualistic microbiome that may follow adaptive strategies aimed at conserving the homeostasis of the whole ecosystem. This work enriches the overall knowledge of the holobiont, also exploring some peculiar ecosystems for the first time.
The data presented here may form the basis for future developments in the field, in order to obtain a more comprehensive profiling of bacterial, viral and fungal fractions within complex ecosystems.

    Neuromodulatory effects on early visual signal processing

    Understanding how the brain processes information and generates simple to complex behavior constitutes one of the core objectives in systems neuroscience. However, when studying different neural circuits, their dynamics and interactions, researchers often assume fixed connectivity, overlooking a crucial factor - the effect of neuromodulators. Neuromodulators can modulate circuit activity depending on several aspects, such as different brain states or sensory contexts. Therefore, considering the modulatory effects of neuromodulators on the functionality of neural circuits is an indispensable step towards a more complete picture of the brain’s ability to process information. Generally, this issue affects all neural systems; hence this thesis tries to address it with an experimental and computational approach to resolve neuromodulatory effects at the cell-type level in a well-defined system, the mouse retina. In the first study, we established and applied a machine-learning-based classification algorithm to identify individual functional retinal ganglion cell types, which enabled detailed cell type-resolved analyses. We applied the classifier to newly acquired data of light-evoked retinal ganglion cell responses and successfully identified their functional types. Here, the cell type-resolved analysis revealed that a particular principle of efficient coding applies to all types in a similar way. In a second study, we focused on the issue of inter-experimental variability that can arise when datasets are pooled. As a result of this variability, further downstream analyses may be complicated by the subtle variations between the individual datasets. To tackle this, we proposed a theoretical framework based on an adversarial autoencoder with the objective of removing inter-experimental variability from the pooled dataset while preserving the underlying biological signal of interest.
In the last study of this thesis, we investigated the functional effects of the neuromodulator nitric oxide on the retinal output signal. To this end, we used our previously developed retinal ganglion cell type classifier to unravel type-specific effects and established a paired recording protocol to account for type-specific time-dependent effects. We found that certain retinal ganglion cell types showed adaptational type-specific changes and that nitric oxide distinctly modulated a particular group of retinal ganglion cells. In summary, I first present several experimental and computational methods that allow functional neuromodulatory effects on the retinal output signal to be studied in a cell type-resolved manner and, second, use these tools to demonstrate their feasibility for studying the neuromodulator nitric oxide.

    Dissecting Mla3–AVR-Rmo1 recognition and specificity

    The plant immune system heavily relies on immune receptors known as nucleotide binding leucine-rich repeat (NLR) proteins, which recognise pathogen-secreted effectors to trigger a robust immune response. In barley, resistance to powdery mildew caused by Blumeria graminis f. sp. hordei (Bgh) is conferred by the Mildew locus a (Mla), an NLR that exists as a highly expanded allelic series. Each Mla allele governs Bgh isolate-specific resistance by recognising a corresponding AVRa effector. In addition, different alleles can confer resistance against divergent fungal pathogens. This is the case of the Mla3 allele, which not only recognises AVRa3 from Bgh, but also confers resistance to the blast fungus Magnaporthe oryzae. In this thesis, I aimed to molecularly characterise M. oryzae recognition by Mla3 and elucidate the principles governing multiple pathogen recognition by this NLR. I found that PWL2, an effector known to condition pathogenicity of M. oryzae towards weeping lovegrass, is the gene underlying AVR-Rmo1, the blast effector recognised by Mla3. Evidence indicates that barley and weeping lovegrass convergently evolved to recognise PWL2 with conserved specificity. I established that the C-terminus of Mla3 defines specificity of Pwl2 recognition and protein structure predictions suggest that this region binds to Pwl2 by mimicking the binding interface of a Pwl2 host target. By assessing copy number variation and allelic diversity, I defined that Mla3 functions in a dosage-dependent manner and postulate that polymorphisms reduce the sensitivity threshold to trigger an immune response upon effector recognition, abolishing the high dosage requirement for functional resistance. The identity of AVRa3, the Bgh effector recognised by Mla3, remains unknown. However, Pwl2 belongs to the family of MAX effectors, which is absent in Bgh, hence suggesting that Mla3 recognises structurally unrelated effectors. 
Altogether, these findings lay the foundation for understanding the mechanisms that shaped multiple pathogen recognition by Mla3.