80 research outputs found

    Iterative Machine Learning of a Cis-Regulatory Grammar

    Gene regulation allows for the quantitative control of gene expression. Gene regulation is a complex process encoded through cis-regulatory sequences, short DNA sequences containing clusters of transcription factor binding sites. Each binding site can occur millions of times in multicellular genomes, and seemingly similar collections of binding sites can have very different activities. A leading model to explain these degeneracies is that cis-regulatory sequences follow a “grammar” defined by the number, identity, strength, arrangement, and/or context of the underlying binding sites. Understanding cis-regulatory grammar requires high-throughput technology, quantitative measurements, and computational modeling. This thesis describes an iterative machine learning approach to study cis-regulatory grammar using mouse photoreceptors as a model system. First, I characterized sequence features associated with enhancer and silencer activity in sequences bound by the transcription factor CRX. I showed that both enhancers and silencers are highly occupied by CRX compared to inactive sequences, and enhancers are uniquely enriched for a diverse but degenerate collection of eight motifs. I demonstrated that this information captures a majority of the available signal in genomic sequences and developed an information content metric that summarizes the effects of motif number and diversity. Second, I developed an active machine learning framework that iteratively samples informative perturbations to address the limitations of training quantitative models on genomic sequences alone. I showed that this approach, when complemented with human decision-making, effectively guides machine learning models towards a biologically relevant representation of cis-regulatory grammar. I also highlighted how perturbations selected with active learning are more informative than other perturbations generated by the same procedure. 
The final machine learning model can capture global and local context-dependencies of transcription factor binding motifs. Using this model, I found that the same motifs can produce the same activity in multiple arrangements. Thus, active machine learning is an effective way to sample perturbations that improve quantitative models of cis-regulatory grammar. Collectively, these results provide an iterative framework to design and sample perturbations that reveal the complexities of cis-regulatory grammar underlying gene regulation.

    Spatial and temporal background modelling of non-stationary visual scenes

    The prevalence of electronic imaging systems in everyday life has become increasingly apparent in recent years. Applications are to be found in medical scanning, automated manufacture, and perhaps most significantly, surveillance. Metropolitan areas, shopping malls, and road traffic management all employ and benefit from an unprecedented quantity of video cameras for monitoring purposes. But the high cost and limited effectiveness of employing humans as the final link in the monitoring chain has driven scientists to seek solutions based on machine vision techniques. Whilst the field of machine vision has enjoyed consistent rapid development in the last 20 years, some of the most fundamental issues still remain to be solved in a satisfactory manner. Central to a great many vision applications is the concept of segmentation, and in particular, most practical systems perform background subtraction as one of the first stages of video processing. This involves separation of ‘interesting foreground’ from the less informative but persistent background. But the definition of what is ‘interesting’ is somewhat subjective, and liable to be application specific. Furthermore, the background may be interpreted as including the visual appearance of normal activity of any agents present in the scene, human or otherwise. Thus a background model might be called upon to absorb lighting changes, moving trees and foliage, or normal traffic flow and pedestrian activity, in order to effect what might be termed in ‘biologically-inspired’ vision as pre-attentive selection. This challenge is one of the Holy Grails of the computer vision field, and consequently the subject has received considerable attention.
This thesis sets out to address some of the limitations of contemporary methods of background segmentation by investigating methods of inducing local mutual support amongst pixels in three starkly contrasting paradigms: (1) locality in the spatial domain, (2) locality in the short-term time domain, and (3) locality in the domain of cyclic repetition frequency. Conventional per-pixel models, such as those based on Gaussian Mixture Models, offer no spatial support between adjacent pixels at all. At the other extreme, eigenspace models impose a structure in which every image pixel bears the same relation to every other pixel. But Markov Random Fields permit definition of arbitrary local cliques by construction of a suitable graph, and are used here to facilitate a novel structure capable of exploiting probabilistic local co-occurrence of adjacent Local Binary Patterns. The result is a method exhibiting strong sensitivity to multiple learned local pattern hypotheses, whilst relying solely on monochrome image data. Many background models enforce temporal consistency constraints on a pixel in an attempt to confirm background membership before it is accepted as part of the model, and typically some control over this process is exercised by a learning rate parameter. But in busy scenes, a true background pixel may be visible for a relatively small fraction of the time and in a temporally fragmented fashion, thus hindering such background acquisition. However, support in terms of temporal locality may still be achieved by using Combinatorial Optimization to derive short-term background estimates which induce a similar consistency, but are considerably more robust to disturbance. A novel technique is presented here in which the short-term estimates act as ‘pre-filtered’ data from which a far more compact eigen-background may be constructed. Many scenes entail elements exhibiting repetitive periodic behaviour.
Some road junctions employing traffic signals are among these, yet little is to be found amongst the literature regarding the explicit modelling of such periodic processes in a scene. Previous work focussing on gait recognition has demonstrated approaches based on recurrence of self-similarity by which local periodicity may be identified. The present work harnesses and extends this method in order to characterize scenes displaying multiple distinct periodicities by building a spatio-temporal model. The model may then be used to highlight abnormality in scene activity. Furthermore, a Phase Locked Loop technique with a novel phase detector is detailed, enabling such a model to maintain correct synchronization with scene activity in spite of noise and drift of periodicity. This thesis contends that these three approaches are all manifestations of the same broad underlying concept: local support in each of the space, time and frequency domains, and furthermore, that the support can be harnessed practically, as will be demonstrated experimentally.
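
As a point of reference for the Local Binary Patterns mentioned in the abstract above, the following is a minimal sketch of the basic 3x3 LBP operator on a monochrome image. It is not the thesis's Markov Random Field model; the function names `lbp_code` and `lbp_image`, the clockwise bit ordering, and the `>=` comparison convention are assumptions chosen for illustration.

```python
def lbp_code(image, row, col):
    """Return the 8-bit Local Binary Pattern code of an interior pixel.

    `image` is a list of rows of grey-level values (hypothetical input format).
    Each of the 8 neighbours contributes one bit: 1 if it is at least as
    bright as the centre pixel, 0 otherwise.
    """
    center = image[row][col]
    # Neighbours visited clockwise, starting from the top-left pixel.
    # This ordering is a convention assumed here, not taken from the thesis.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_image(image):
    """Compute LBP codes for every interior pixel (borders are skipped)."""
    rows, cols = len(image), len(image[0])
    return [[lbp_code(image, r, c) for c in range(1, cols - 1)]
            for r in range(1, rows - 1)]
```

Because each bit depends only on the ordering of a neighbour relative to the centre, adding a constant brightness offset to the whole image leaves the codes unchanged, which is one reason LBP features are commonly used where lighting varies.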

    Spectral and deep learning approaches to Hi-C data analysis

    Hi-C matrices describe the genome-wide contact probability between chromatin loci. The comparison of Hi-C matrices is important both to assess the reproducibility in biological replicates and to find significant differences between non-replicates from different cell-types; however, this analysis faces two challenges: Hi-C matrices tend to be undersampled, and thus noisy, and they contain a variety of multi-scale interaction patterns that must be taken into account. One solution to tackle these problems is to extract information from the spectral features of Hi-C maps. In this thesis I will show, by comparing Hi-C maps to random matrices, that most of their spectrum is "aspecific", meaning that its features are the same in all Hi-C maps. On the other hand, the top eigenspaces present highly non-random features: by isolating them from the full matrix I am able to obtain sharper interaction patterns, effectively enhancing the quality at the single matrix level and improving results in classification tasks. This shows that selecting a small number of degrees of freedom is key to enhancing the signal present in Hi-C matrices. However, spectral methods are not the only way of reducing the dimensionality of Hi-C datasets: in the second part of the thesis I propose a variational autoencoder architecture as a way of compressing Hi-C data and identifying the most relevant degrees of freedom. Local interaction patterns in Hi-C maps repeat in different cell-types and chromosomes. By learning a low dimensional representation of these local patterns, the variational autoencoder can be used to compress and decompress any Hi-C map. I will show that the reconstruction quality is better than what can be obtained by linear methods, and that classification tasks improve when applied to the low dimensional representations of Hi-C maps.
Finally, I will show that the action of the autoencoder and the spectral filter described in the first part of the thesis on the spectra of Hi-C maps is similar.
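
The idea of keeping only the top eigenspaces of a symmetric contact matrix, as described in the abstract above, can be sketched as follows. This is an illustration under stated assumptions, not the thesis's actual pipeline: the function names, the use of power iteration with deflation (chosen only to keep the example dependency-free), and the number of retained components `k` are all hypothetical.

```python
import math
import random


def _matvec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def top_eigenpairs(matrix, k, iters=500, seed=0):
    """Estimate the k largest-magnitude eigenpairs of a symmetric matrix.

    Uses power iteration followed by deflation. Convergence is slow when
    eigenvalues are close in magnitude; a production analysis would use a
    numerical linear algebra library instead.
    """
    rng = random.Random(seed)
    n = len(matrix)
    work = [row[:] for row in matrix]  # copy so the input is not modified
    pairs = []
    for _ in range(k):
        vec = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            nxt = _matvec(work, vec)
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break  # nothing left in the remaining spectrum
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue estimate for this vector.
        lam = sum(a * b for a, b in zip(vec, _matvec(work, vec)))
        pairs.append((lam, vec))
        # Deflate: subtract the component just found.
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * vec[i] * vec[j]
    return pairs


def spectral_filter(matrix, k):
    """Rebuild the matrix from its k leading eigenpairs only."""
    n = len(matrix)
    filtered = [[0.0] * n for _ in range(n)]
    for lam, vec in top_eigenpairs(matrix, k):
        for i in range(n):
            for j in range(n):
                filtered[i][j] += lam * vec[i] * vec[j]
    return filtered
```

On a toy block matrix with one strong and one weak block, `spectral_filter(matrix, 1)` retains the strong block and suppresses the weak one, which mirrors the abstract's point that a small number of degrees of freedom carries most of the signal.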

    Using molecular dynamics and enhanced sampling techniques to find cryptic druggable pockets in proteins of pharmaceutical interest

    Cryptic pockets are sites on protein targets that are hidden in the unliganded form and only become apparent when drugs bind. These sites provide a promising alternative to classical substrate binding sites for drug development, especially when the latter are not druggable. In this thesis I investigate the nature and dynamical properties of cryptic sites in a number of pharmacologically relevant targets, while comparing the efficacy of various simulation-based approaches in discovering them. I found that the studied cryptic sites do not correspond to local minima in the computed conformational free-energy landscape of the unliganded proteins. They thus promptly close in all of the molecular dynamics simulations performed, irrespective of the force-field used. Temperature-based enhanced sampling approaches, such as parallel tempering, do not improve the situation, as the entropic term does not help in the opening of the sites. The use of fragment probes helps, as in long simulations it occasionally leads to the opening of, and binding to, the cryptic sites. The observed mechanism of cryptic site formation is suggestive of interplay between two classical mechanisms: induced-fit and conformational selection. Employing this insight, I developed a novel Hamiltonian replica exchange-based method, SWISH (sampling water interfaces through scaled Hamiltonians), which, combined with probes, resulted in a promising general approach for cryptic site discovery. In addition, I revisit the rather ill-defined concept of cryptic pockets in order to propose an alternative measurable interpretation. I outline how the new practical definition can be applied to the ligandable targets reported in the PDB, in order to provide a consistent data-driven view on crypticity and how it may impact drug discovery.
This thesis presents a comprehensive study of the cryptic pocket phenomenon: from understanding the nature of their formation to novel detection methodology, and towards understanding their global significance in drug discovery.

    Doctor of Philosophy

    Genotype Phenotype Association (GPA) is a means to identify candidate genes and genetic variants that may contribute to phenotypic variation. Technological advances in DNA sequencing continue to improve the efficiency and accuracy of GPA. Currently, High Throughput Sequencing (HTS) is the preferred method for GPA as it is fast and economical. HTS allows for population-level characterization of genetic variation, required for GPA studies. Despite the potential power of using HTS in GPA studies, there are technical hurdles that must be overcome. For instance, the excessive error rate in HTS data and the sheer size of population-level data can hinder GPA studies. To overcome these challenges, I have written two software programs for the purpose of HTS GPA. The first toolkit, GPAT++, is designed to detect GPA using small genetic variants. Unlike previous software, GPAT++'s association test models the inherent errors in HTS, preventing many spurious GPA. The second toolkit, Whole Genome Alignment Metrics (WHAM), was designed for GPA using large genetic variants (structural variants). By integrating both structural variant identification and association testing, WHAM can identify shared structural variants associated with a phenotype. Both GPAT++ and WHAM have been successfully applied to real-world GPA studies.

    Vibration Monitoring: Gearbox identification and faults detection

    The abstract is in the attachment.

    Statistical methods for high-throughput genomic data


    Detection of loci associated with water-soluble carbohydrate accumulation and environmental adaptation in white clover (Trifolium repens L.) : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Plant Biology at Massey University, Palmerston North, New Zealand

    White clover (Trifolium repens L.) is an economically important forage legume in New Zealand/Aotearoa (NZ). It provides quality forage and a source of bioavailable nitrogen fixed through symbiosis with soil Rhizobium bacteria. This thesis investigated the genetic basis of two traits of significant agronomic interest in white clover. These were foliar water-soluble carbohydrate (WSC) accumulation and soil moisture deficit (SMD) tolerance. Previously generated divergent WSC lines of white clover were characterised for foliar WSC and leaf size. Significant (p < 0.05) divergence in foliar WSC content was observed between five breeding pools. Little correlation was observed between WSC and leaf size, indicating that breeding for increased WSC content could be achieved in large and small leaf size classes of white clover in as few as 2 – 3 generations. Genotyping by sequencing (GBS) data were obtained for 1,113 white clover individuals (approximately 47 individuals from each of 24 populations). Population structure was assessed using discriminant analysis of principal components (DAPC) and individuals were assigned to 11 genetic clusters. Divergent selection created a structure that differentiated high and low WSC populations. Outlier detection methodologies using PCAdapt, BayeScan and KGD-FST applied to the GBS data identified 33 SNPs in diverse gene families that discriminated high and low WSC populations. One SNP associated with the starch biosynthesis gene glgC was identified in a genome-wide association study (GWAS) of 605 white clover individuals. Transcriptome and proteome analyses also provided evidence to suggest that high WSC levels in different breeding pools were achieved through sorting of allelic variants of carbohydrate metabolism pathway genes. Transcriptome and proteome analyses suggested 14 gene models from seven carbohydrate gene families (glgC, WAXY, glgA, glgB, BAM, AMY and ISA3) had responded to artificial selection.
Patterns of SNP variation in the AMY, glgC and WAXY gene families separated low and high WSC individuals. Allelic variants in these gene families represent potential targets for assisted breeding of high WSC levels. Overall, multiple lines of evidence corroborate the importance of glgC for increasing foliar WSC accumulation in white clover. Soil moisture deficit (SMD) tolerance was investigated in naturalised populations of white clover collected from 17 sites representing contrasting SMD across the South Island/Te Waipounamu of NZ. Weak genetic differentiation of populations was detected in analyses of GBS data, with three genetic clusters identified by ADMIXTURE. Outlier detection and environmental association analyses identified 64 SNPs significantly (p < 0.05) associated with environmental variation. Mapping of these SNPs to the white clover reference genome, together with gene ontology analyses, suggested some SNPs were associated with genes involved in carbohydrate metabolism and root morphology. A common set of allelic variants in a subset of the populations from high SMD environments may also identify targets for selective breeding, but this variation needs further investigation.