80 research outputs found
Iterative Machine Learning of a Cis-Regulatory Grammar
Gene regulation allows for the quantitative control of gene expression. Gene regulation is a complex process encoded through cis-regulatory sequences, short DNA sequences containing clusters of transcription factor binding sites. Each binding site can occur millions of times in multicellular genomes, and seemingly similar collections of binding sites can have very different activities. A leading model to explain these degeneracies is that cis-regulatory sequences follow a “grammar” defined by the number, identity, strength, arrangement, and/or context of the underlying binding sites. Understanding cis-regulatory grammar requires high-throughput technology, quantitative measurements, and computational modeling. This thesis describes an iterative machine learning approach to study cis-regulatory grammar using mouse photoreceptors as a model system. First, I characterized sequence features associated with enhancer and silencer activity in sequences bound by the transcription factor CRX. I showed that both enhancers and silencers are highly occupied by CRX compared to inactive sequences, and enhancers are uniquely enriched for a diverse but degenerate collection of eight motifs. I demonstrated that this information captures a majority of the available signal in genomic sequences and developed an information content metric that summarizes the effects of motif number and diversity. Second, I developed an active machine learning framework that iteratively samples informative perturbations to address the limitations of training quantitative models on genomic sequences alone. I showed that this approach, when complemented with human decision-making, effectively guides machine learning models towards a biologically relevant representation of cis-regulatory grammar. I also highlighted how perturbations selected with active learning are more informative than other perturbations generated by the same procedure. 
The final machine learning model can capture global and local context-dependencies of transcription factor binding motifs. Using this model, I found that the same motifs can produce the same activity in multiple arrangements. Thus, active machine learning is an effective way to sample perturbations that improve quantitative models of cis-regulatory grammar. Collectively, these results provide an iterative framework to design and sample perturbations that reveal the complexities of cis-regulatory grammar underlying gene regulation.
Spatial and temporal background modelling of non-stationary visual scenes
The prevalence of electronic imaging systems in everyday life has become increasingly apparent
in recent years. Applications are to be found in medical scanning, automated manufacture, and
perhaps most significantly, surveillance. Metropolitan areas, shopping malls, and road traffic
management all employ and benefit from an unprecedented quantity of video cameras for monitoring
purposes. But the high cost and limited effectiveness of employing humans as the final
link in the monitoring chain has driven scientists to seek solutions based on machine vision techniques.
Whilst the field of machine vision has enjoyed consistent rapid development in the last
20 years, some of the most fundamental issues still remain to be solved in a satisfactory manner.
Central to a great many vision applications is the concept of segmentation, and in particular,
most practical systems perform background subtraction as one of the first stages of video
processing. This involves separation of ‘interesting foreground’ from the less informative but
persistent background. But the definition of what is ‘interesting’ is somewhat subjective, and
liable to be application specific. Furthermore, the background may be interpreted as including
the visual appearance of normal activity of any agents present in the scene, human or otherwise.
Thus a background model might be called upon to absorb lighting changes, moving trees and
foliage, or normal traffic flow and pedestrian activity, in order to effect what might be termed in
‘biologically-inspired’ vision as pre-attentive selection. This challenge is one of the Holy Grails
of the computer vision field, and consequently the subject has received considerable attention.
This thesis sets out to address some of the limitations of contemporary methods of background
segmentation by investigating methods of inducing local mutual support amongst pixels
in three starkly contrasting paradigms: (1) locality in the spatial domain, (2) locality in the short-term
time domain, and (3) locality in the domain of cyclic repetition frequency.
Conventional per pixel models, such as those based on Gaussian Mixture Models, offer no
spatial support between adjacent pixels at all. At the other extreme, eigenspace models impose
a structure in which every image pixel bears the same relation to every other pixel. But Markov
Random Fields permit definition of arbitrary local cliques by construction of a suitable graph, and
are used here to facilitate a novel structure capable of exploiting probabilistic local co-occurrence
of adjacent Local Binary Patterns. The result is a method exhibiting strong sensitivity to multiple
learned local pattern hypotheses, whilst relying solely on monochrome image data.
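As a rough illustration of the Local Binary Pattern descriptor mentioned above, the following minimal sketch computes the basic 8-neighbour code for each interior pixel of a monochrome image. It is not the thesis's method: the function name `lbp_codes`, the bit ordering, and the use of `>=` for the comparison are assumptions made for this example, and the Markov Random Field co-occurrence model built on top of the codes is not shown.

```python
import numpy as np


def lbp_codes(image: np.ndarray) -> np.ndarray:
    """Return the 8-neighbour Local Binary Pattern code of each interior pixel.

    Hypothetical helper for illustration; border pixels are dropped, so the
    output is two rows and two columns smaller than the input.
    """
    img = np.asarray(image, dtype=np.int64)
    h, w = img.shape
    centre = img[1:-1, 1:-1]
    # Neighbour offsets, clockwise from top-left. The bit order is an
    # arbitrary choice made for this sketch.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(centre)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit wherever the neighbour is at least as bright as the centre.
        codes |= (neighbour >= centre).astype(np.int64) << bit
    return codes
```

Because each code depends only on the sign of intensity differences, it is unchanged by any monotonic brightness change, which is one reason such descriptors suit monochrome data.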
Many background models enforce temporal consistency constraints on a pixel in an attempt to
confirm background membership before being accepted as part of the model, and typically some
control over this process is exercised by a learning rate parameter. But in busy scenes, a true
background pixel may be visible for a relatively small fraction of the time and in a temporally
fragmented fashion, thus hindering such background acquisition. However, support in terms of
temporal locality may still be achieved by using Combinatorial Optimization to derive short-term
background estimates which induce a similar consistency, but are considerably more robust
to disturbance. A novel technique is presented here in which the short-term estimates act as
‘pre-filtered’ data from which a far more compact eigen-background may be constructed.
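For readers unfamiliar with eigen-backgrounds, the sketch below shows the generic idea under simple assumptions: a low-dimensional linear subspace is fitted to a stack of background frames, and pixels that the subspace cannot reconstruct are flagged as foreground. The function names, the fixed `threshold`, and the use of raw frames as training data are assumptions for illustration; the thesis instead builds the model from short-term estimates obtained by Combinatorial Optimization, which is not reproduced here.

```python
import numpy as np


def fit_eigenbackground(frames: np.ndarray, n_components: int):
    """Fit a PCA background subspace to frames of shape (n_frames, height, width).

    Hypothetical helper: returns the mean image (flattened) and the top
    principal directions in pixel space, one per row.
    """
    flat = frames.reshape(frames.shape[0], -1).astype(np.float64)
    mean = flat.mean(axis=0)
    # Rows of vt are orthonormal principal directions of the centred frames.
    _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
    return mean, vt[:n_components]


def foreground_mask(frame, mean, basis, threshold):
    """Flag pixels whose reconstruction error exceeds the given threshold."""
    x = frame.reshape(-1).astype(np.float64) - mean
    reconstruction = basis.T @ (basis @ x)  # projection onto the subspace
    residual = np.abs(x - reconstruction)
    return (residual > threshold).reshape(frame.shape)
```

A more compact model in this setting simply means that fewer rows of `basis` are needed to explain the training frames.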
Many scenes entail elements exhibiting repetitive periodic behaviour. Some road junctions
employing traffic signals are among these, yet little is to be found amongst the literature regarding
the explicit modelling of such periodic processes in a scene. Previous work focussing on gait
recognition has demonstrated approaches based on recurrence of self-similarity by which local
periodicity may be identified. The present work harnesses and extends this method in order
to characterize scenes displaying multiple distinct periodicities by building a spatio-temporal
model. The model may then be used to highlight abnormality in scene activity. Furthermore, a
Phase Locked Loop technique with a novel phase detector is detailed, enabling such a model to
maintain correct synchronization with scene activity in spite of noise and drift of periodicity.
This thesis contends that these three approaches are all manifestations of the same broad
underlying concept: local support in each of the space, time and frequency domains, and furthermore,
that the support can be harnessed practically, as will be demonstrated experimentally.
Spectral and deep learning approaches to Hi-C data analysis
Hi-C matrices describe the genome-wide contact probability between chromatin loci. Comparing Hi-C matrices is important both to assess reproducibility between biological replicates and to find significant differences between non-replicates from different cell types. However, this analysis faces two challenges: Hi-C matrices tend to be undersampled, and thus noisy, and they contain a variety of multi-scale interaction patterns that must be taken into account.
One way to tackle these problems is to extract information from the spectral features of Hi-C maps. In this thesis I will show, by comparing Hi-C maps to random matrices, that most of their spectrum is "aspecific", meaning that its features are the same in all Hi-C maps. The top eigenspaces, on the other hand, present highly non-random features: by isolating them from the full matrix I am able to obtain sharper interaction patterns, effectively enhancing the quality at the single-matrix level and improving results in classification tasks.
This shows that selecting a small number of degrees of freedom is key to enhancing the signal present in Hi-C matrices. However, spectral methods are not the only way of reducing the dimensionality of Hi-C datasets: in the second part of the thesis I propose a variational autoencoder architecture as a way of compressing Hi-C data and identifying the most relevant degrees of freedom.
Local interaction patterns in Hi-C maps repeat across cell types and chromosomes. By learning a low-dimensional representation of these local patterns, the variational autoencoder can be used to compress and decompress any Hi-C map. I will show that the reconstruction quality is better than what can be obtained by linear methods, and that classification tasks improve when applied to the low-dimensional representations of Hi-C maps. Finally, I will show that the autoencoder and the spectral filter described in the first part of the thesis act similarly on the spectra of Hi-C maps.
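The spectral filtering idea from the first part of the abstract, keeping only the top eigenspaces of a contact matrix, can be sketched as follows. This is a minimal illustration, not the thesis's pipeline: the function name `top_eigenspace_filter`, the symmetrisation step, and ranking eigenvalues by absolute value are all assumptions, and any normalisation of the Hi-C map before filtering is omitted.

```python
import numpy as np


def top_eigenspace_filter(matrix: np.ndarray, k: int) -> np.ndarray:
    """Rebuild a symmetric matrix from its k largest-magnitude eigenpairs.

    Hypothetical helper for illustration only.
    """
    # Contact maps are symmetric in principle; enforce it numerically.
    sym = 0.5 * (matrix + matrix.T)
    eigvals, eigvecs = np.linalg.eigh(sym)
    # Indices of the k eigenvalues with the largest absolute value.
    top = np.argsort(np.abs(eigvals))[-k:]
    # Sum of lambda_i * v_i v_i^T over the selected eigenpairs.
    return (eigvecs[:, top] * eigvals[top]) @ eigvecs[:, top].T
```

Choosing `k` is the substantive decision; the abstract suggests it should cover only the eigenspaces that differ from the random-matrix expectation.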
Using molecular dynamics and enhanced sampling techniques to find cryptic druggable pockets in proteins of pharmaceutical interest
Cryptic pockets are sites on protein targets that are hidden in the unliganded form and only become apparent when drugs bind. These sites provide a promising alternative to classical substrate binding sites for drug development, especially when the latter are not druggable. In this thesis I investigate the nature and dynamical properties of cryptic sites in a number of pharmacologically relevant targets, while comparing the efficacy of various simulation-based approaches in discovering them. I found that the studied cryptic sites do not correspond to local minima in the computed conformational free-energy landscape of the unliganded proteins. They thus promptly close in all of the molecular dynamics simulations performed, irrespective of the force field used. Temperature-based enhanced sampling approaches, such as parallel tempering, do not improve the situation, as the entropic term does not help in the opening of the sites. The use of fragment probes helps: in long simulations, it occasionally leads to the opening of, and binding to, the cryptic sites. The observed mechanism of cryptic site formation is suggestive of an interplay between two classical mechanisms: induced fit and conformational selection. Employing this insight, I developed a novel Hamiltonian replica exchange-based method, SWISH (sampling water interfaces through scaled Hamiltonians), which, combined with probes, resulted in a promising general approach for cryptic site discovery. In addition, I revisit the rather ill-defined concept of cryptic pockets in order to propose an alternative, measurable interpretation. I outline how the new practical definition can be applied to the ligandable targets reported in the PDB, in order to provide a consistent data-driven view of crypticity and how it may impact drug discovery.
This thesis presents a comprehensive study of the cryptic pocket phenomenon: from understanding the nature of their formation to novel detection methodology, and towards understanding their global significance in drug discovery.
Doctor of Philosophy
Genotype Phenotype Association (GPA) is a means to identify candidate genes and genetic variants that may contribute to phenotypic variation. Technological advances in DNA sequencing continue to improve the efficiency and accuracy of GPA. Currently, High Throughput Sequencing (HTS) is the preferred method for GPA as it is fast and economical. HTS allows for the population-level characterization of genetic variation that GPA studies require. Despite the potential power of using HTS in GPA studies, there are technical hurdles that must be overcome. For instance, the excessive error rate in HTS data and the sheer size of population-level data can hinder GPA studies. To overcome these challenges, I have written two software toolkits for HTS GPA. The first toolkit, GPAT++, is designed to detect GPA using small genetic variants. Unlike previous software, GPAT++'s association test models the inherent errors in HTS, preventing many spurious associations. The second toolkit, Whole Genome Alignment Metrics (WHAM), was designed for GPA using large genetic variants (structural variants). By integrating both structural variant identification and association testing, WHAM can identify shared structural variants associated with a phenotype. Both GPAT++ and WHAM have been successfully applied to real-world GPA studies.
Vibration Monitoring: Gearbox identification and faults detection
The abstract is in the attachment.
Detection of loci associated with water-soluble carbohydrate accumulation and environmental adaptation in white clover (Trifolium repens L.) : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Plant Biology at Massey University, Palmerston North, New Zealand
White clover (Trifolium repens L.) is an economically important forage legume in New Zealand/Aotearoa (NZ). It provides quality forage and a source of bioavailable nitrogen fixed through symbiosis with soil Rhizobium bacteria. This thesis investigated the genetic basis of two traits of significant agronomic interest in white clover. These were foliar water-soluble carbohydrate (WSC) accumulation and soil moisture deficit (SMD) tolerance. Previously generated divergent WSC lines of white clover were characterised for foliar WSC and leaf size. Significant (p < 0.05) divergence in foliar WSC content was observed between five breeding pools. Little correlation was observed between WSC and leaf size, indicating that breeding for increased WSC content could be achieved in large and small leaf size classes of white clover in as few as 2 – 3 generations. Genotyping by sequencing (GBS) data were obtained for 1,113 white clover individuals (approximately 47 individuals from each of 24 populations). Population structure was assessed using discriminant analysis of principal components (DAPC) and individuals were assigned to 11 genetic clusters. Divergent selection created a structure that differentiated high and low WSC populations. Outlier detection methodologies using PCAdapt, BayeScan and KGD-FST applied to the GBS data identified 33 SNPs in diverse gene families that discriminated high and low WSC populations. One SNP associated with the starch biosynthesis gene, glgC, was identified in a genome-wide association study (GWAS) of 605 white clover individuals. Transcriptome and proteome analyses also provided evidence to suggest that high WSC levels in different breeding pools were achieved through sorting of allelic variants of carbohydrate metabolism pathway genes. Transcriptome and proteome analyses suggested 14 gene models from seven carbohydrate gene families (glgC, WAXY, glgA, glgB, BAM, AMY and ISA3) had responded to artificial selection. Patterns of SNP variation in the AMY, glgC and WAXY gene families separated low and high WSC individuals. Allelic variants in these gene families represent potential targets for assisted breeding of high WSC levels. Overall, multiple lines of evidence corroborate the importance of glgC for increasing foliar WSC accumulation in white clover. Soil moisture deficit (SMD) tolerance was investigated in naturalised populations of white clover collected from 17 sites representing contrasting SMD across the South Island/Te Waipounamu of NZ. Weak genetic differentiation of populations was detected in analyses of GBS data, with three genetic clusters identified by ADMIXTURE. Outlier detection and environmental association analyses identified 64 SNPs significantly (p < 0.05) associated with environmental variation. Mapping of these SNPs to the white clover reference genome, together with gene ontology analyses, suggested some SNPs were associated with genes involved in carbohydrate metabolism and root morphology. A common set of allelic variants in a subset of the populations from high SMD environments may also identify targets for selective breeding, but this variation needs further investigation.