
    Dynamic transcriptome analysis (DTA)

    So far, much attention has been paid to the regulation of transcription. However, it has become clear that controlled mRNA decay is an equally important process. To understand the contributions of mRNA synthesis and mRNA degradation to gene regulation, we developed Dynamic Transcriptome Analysis (DTA). DTA monitors both contributions for all mRNAs in the cell without perturbing the cellular system. It combines non-perturbing metabolic RNA labeling, which supersedes conventional methods for mRNA turnover analysis, with dynamic kinetic modeling to derive gene-specific synthesis and decay parameters. DTA reveals that most mRNA synthesis rates amount to several transcripts per cell and cell cycle, and that mRNA half-lives cluster around a median of 11 min. DTA can monitor the cellular response to osmotic stress with higher sensitivity and temporal resolution than standard transcriptomics. In contrast to monotonically increasing total mRNA levels, DTA reveals three phases of the stress response. In the initial shock phase, mRNA synthesis and decay rates decrease globally, resulting in mRNA storage. During the subsequent induction phase, both rates increase for a subset of genes, resulting in production and rapid removal of stress-responsive mRNAs. In the following recovery phase, decay rates are largely restored, whereas synthesis rates remain altered, apparently enabling growth at high salt concentrations. Stress-induced changes in mRNA synthesis rates can be predicted from gene occupancy with RNA polymerase II. Thus, DTA realistically monitors the dynamics in mRNA metabolism that underlie gene regulatory systems. One technical obstacle of standard transcriptomics is the unknown normalization factor between samples, e.g. wild-type and mutant cells. Variations in RNA extraction efficiencies, amplification steps, and scanner calibration introduce differences in global intensity levels.
The required normalization limits the precision of DTA. We have therefore extended DTA to comparative DTA (cDTA), which eliminates this obstacle. cDTA provides absolute rates of mRNA synthesis and decay in Saccharomyces cerevisiae (Sc) cells using Schizosaccharomyces pombe (Sp) as an internal standard, and thus allows direct comparison of RNA synthesis and decay rates between samples. cDTA reveals that Sc and Sp transcripts that encode orthologous proteins have similar synthesis rates, whereas decay rates are fivefold lower in Sp, resulting in similar mRNA concentrations despite the larger Sp cell volume. cDTA of Sc mutants reveals that a eukaryote can buffer mRNA levels. Impairing transcription with a point mutation in RNA polymerase (Pol) II decreases mRNA synthesis rates as expected, but also decreases decay rates. Impairing mRNA degradation by deleting deadenylase subunits of the Ccr4–Not complex decreases decay rates as expected, but also decreases synthesis rates. In this thesis, we provide a novel tool to estimate RNA synthesis and decay rates: a quantitative dynamic model of mRNA metabolism in growing cells that complements the biochemical protocol of DTA/cDTA. It can be applied to reveal rate changes under all kinds of perturbations, e.g. in knock-out or point-mutation strains, in responses to stress stimuli, or in small-molecule interference assays such as treatments with miRNA or siRNA inhibitors. In doing so, we show that DTA is a valuable tool for miRNA target validation. The DTA/cDTA approach is in principle applicable to virtually every organism. The bioinformatic workflow of DTA/cDTA is implemented in the open-source R/Bioconductor package DTA.
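    The kinetic core of such an analysis can be summarized in a few lines. The sketch below is illustrative only: the first-order model, the function names, and the toy numbers are ours, not taken from the DTA R/Bioconductor package. After a labeling pulse of length t, the labeled fraction of a steady-state transcript is 1 − exp(−λt), which yields the decay rate λ and, from it, the half-life and the synthesis rate.

```python
import math

def decay_rate(labeled, total, t_label):
    """Decay rate lambda from the labeled fraction after a pulse of
    t_label minutes: labeled/total = 1 - exp(-lambda * t_label)."""
    return -math.log(1.0 - labeled / total) / t_label

def half_life(lam):
    """Half-life = ln(2) / lambda."""
    return math.log(2.0) / lam

def synthesis_rate(lam, steady_state_level):
    """At steady state synthesis balances decay: mu = lambda * m*."""
    return lam * steady_state_level

# Toy example: 40% of a transcript's copies labeled after a 6 min pulse.
lam = decay_rate(labeled=40.0, total=100.0, t_label=6.0)
print(round(half_life(lam), 1))  # half-life in minutes
```

Under this model, a transcript that becomes 40% labeled in 6 minutes has a half-life close to the ~11 min median reported above.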

    On the dependent recognition of some long zinc finger proteins

    The human genome encodes about 800 C2H2 zinc finger proteins (ZFPs), most of which are composed of long arrays of zinc fingers. The standard ZFP recognition model asserts that longer finger arrays should recognize longer DNA-binding sites. However, recent experimental efforts to identify in vivo ZFP binding sites contradict this assumption, with many proteins exhibiting short motifs. Here we use ZFY, CTCF, ZIM3, and ZNF343 as examples to address three closely related questions: What impedes current motif discovery methods? What are the functions of those seemingly unused fingers? And how can we improve motif discovery algorithms based on the biophysical properties of long ZFPs? Using ZFY, we employed a variety of methods and found evidence for 'dependent recognition', in which downstream fingers can recognize previously undiscovered motifs only in the presence of an intact core site. For CTCF, high-throughput measurements revealed that its upstream specificity profile depends on the strength of its core. Moreover, the binding strength of the upstream site modulates CTCF's sensitivity to different epigenetic modifications within the core, providing new insight into how the previously identified intellectual-disability-causing and cancer-related mutant R567W disrupts upstream recognition and deregulates epigenetic control by CTCF. Our results establish that, because of irregular motif structures, variable spacing, and dependent recognition between sub-motifs, the specificities of long ZFPs are significantly underestimated. We therefore developed an algorithm, ModeMap, to infer the motifs and recognition models of ZIM3 and ZNF343, which facilitates high-confidence identification of specific binding sites, including repeat-derived elements. With these revised concepts, techniques, and algorithms, we can discover the overlooked specificities and functions of those 'extra' fingers and thereby decipher their broader roles in human biology and disease.
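    To make the idea of dependent recognition concrete, here is a minimal sketch assuming toy position weight matrices (PWMs) of our own invention, not real ZFP motifs: the upstream sub-motif contributes to the binding score only when the core site is intact, which is one simple way to model why the upstream specificity is invisible to standard motif scans.

```python
import math

# Toy probability matrices; a real motif model would be fit from data.
CORE = [{'A': .85, 'C': .05, 'G': .05, 'T': .05},
        {'A': .05, 'C': .85, 'G': .05, 'T': .05},
        {'A': .05, 'C': .05, 'G': .85, 'T': .05}]
UP   = [{'A': .05, 'C': .05, 'G': .05, 'T': .85}]
BG = 0.25  # uniform background

def logodds(pwm, site):
    """Log-odds score of a site against a PWM (independent positions)."""
    return sum(math.log2(col[b] / BG) for col, b in zip(pwm, site))

def dependent_score(seq, core_thresh=3.0):
    """Score a 4-bp window: the upstream base is counted only when the
    3-bp core (positions 1-3) scores above core_thresh."""
    s = logodds(CORE, seq[1:4])
    if s >= core_thresh:
        s += logodds(UP, seq[0:1])
    return s

print(dependent_score("TACG"))  # intact core: upstream contributes
print(dependent_score("TAAG"))  # weak core: upstream is ignored
```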

    Anomaly Detection in Noisy Images

    Finding rare events in multidimensional data is an important detection problem with applications in many fields, such as risk estimation in the insurance industry, finance, flood prediction, medical diagnosis, quality assurance, security, and safety in transportation. The occurrence of such anomalies is so infrequent that there is usually not enough training data to learn an accurate statistical model of the anomaly class. In some cases, such events may never have been observed, so the only available information is a set of normal samples and an assumed pairwise similarity function. Such a metric may only be known up to a certain number of unspecified parameters, which either need to be learned from training data or fixed by a domain expert. Sometimes the anomalous condition can be formulated algebraically, such as a measure exceeding a predefined threshold, but nuisance variables may complicate the estimation of that measure. Change detection methods used in time series analysis are not easily extendable to the multidimensional case, where discontinuities are not localized to a single point. On the other hand, in higher dimensions data exhibits more complex interdependencies, and there is redundancy that can be exploited to adaptively model the normal data. In the first part of this dissertation, we review the theoretical framework for anomaly detection in images and previous anomaly detection work done in the context of crack detection and detection of anomalous components in railway tracks. In the second part, we propose new anomaly detection algorithms. The fact that curvilinear discontinuities in images are sparse with respect to the shearlet frame allows us to pose this anomaly detection problem as basis pursuit optimization.
Accordingly, we pose the problem of detecting curvilinear anomalies in noisy textured images as a blind source separation problem under sparsity constraints, and propose an iterative shrinkage algorithm to solve it. Taking advantage of the parallel nature of this algorithm, we describe how the method can be accelerated using graphics processing units (GPUs). We then propose a new method for finding defective components on railway tracks using cameras mounted on a train, describing how to extract features and use a combination of classifiers to solve this problem. Next, we scale anomaly detection to bigger datasets with complex interdependencies. We show that the anomaly detection problem fits naturally into the multitask learning framework: the first task consists of learning a compact representation of the good samples, while the second task consists of learning the anomaly detector. Using deep convolutional neural networks, we show that it is possible to train a deep model with a limited number of anomalous examples. In sequential detection problems, the presence of time-variant nuisance parameters affects detection performance. In the last part of this dissertation, we present a method for adaptively estimating the threshold of sequential detectors using Extreme Value Theory within a Bayesian framework. Finally, conclusions on the results obtained are provided, followed by a discussion of possible future work.
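    The iterative shrinkage step at the heart of this sparsity-based formulation can be sketched as plain ISTA on a generic dictionary. The dissertation uses a shearlet frame inside a blind source separation model; the random matrix, synthetic sparse signal, and parameters below are purely illustrative.

```python
import numpy as np

def soft_threshold(x, t):
    # proximal operator of the l1 norm
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, lam=0.05, n_iter=500):
    """ISTA for min_x 0.5*||Ax - y||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)           # gradient of the smooth term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Synthetic test: recover a 3-sparse signal from 50 noisy measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100)) / np.sqrt(50)
x_true = np.zeros(100)
x_true[[5, 40, 77]] = [2.0, -1.5, 1.0]
y = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat = ista(A, y)
```

The shrinkage and gradient steps decouple across coefficients, which is the property that makes a GPU implementation of this loop attractive.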

    Dynamics of HIV-infected CD4+ T cells: implications for cure strategies

    Despite effective antiretroviral therapy (ART), HIV-1 persists in all infected individuals as proviral DNA integrated within long-lived resting memory CD4+ T cells. The population of infected CD4+ T cells carrying replication-competent proviruses is the major barrier to HIV-1 cure. Several lines of evidence have demonstrated that cellular proliferation of infected cells contributes to HIV-1 persistence. This proliferative process is complicated by the fact that most infected cells carry defective proviruses and that cells harboring replication-competent HIV die quickly upon viral reactivation. To elucidate mechanisms that drive proliferation of HIV-1-infected CD4+ T cells, we followed proliferation of cells carrying replication-competent HIV-1 induced by antigen stimulation or cytokine treatment and demonstrated that latently infected cells carrying replication-competent HIV can proliferate in response to both stimuli. To study the dynamics of cells carrying replication-competent HIV-1, we sampled infectious virus from p24+ wells of the quantitative viral outgrowth assay at multiple time points spanning 2-3 years. Sequencing of replication-competent HIV-1 at these time points revealed that expanded cellular clones containing replication-competent HIV-1 are common. While some clones persist for 2-3 years, other clones wax and wane over time. A similar pattern is observed with virus clones in the residual viremia. This observation supports our hypothesis that viruses in plasma are produced by activation of latently infected cells carrying replication-competent HIV-1 rather than by ongoing cycles of virus replication. In addition, it supports our hypothesis that antigens drive proliferation of cells carrying replication-competent HIV-1 and activate some of the cells, leading to virus production.
The observed patterns with proviruses in the latent reservoir and viruses in the residual viremia do not support a continuous proliferative process related to HIV-1 integration into cancer-associated genes. These studies are also being extended to cells carrying defective proviruses. A previous study demonstrated that defective HIV-1 proviruses can be transcribed, translated, and even recognized by HIV-1-specific CD8+ T cells. We hypothesized that defective proviruses may confer a proliferative advantage that allows the cells carrying them to proliferate more than cells carrying intact proviruses. Given that intact proviruses account for only 2% of total proviruses in patients on long-term suppressive ART, infected cells carrying defective proviruses greatly outnumber the cells harboring intact proviruses in vivo. We hypothesized that in vitro, cells carrying defective proviruses would proliferate upon T cell activation, while cells harboring intact proviruses capable of viral gene expression would die upon T cell activation due to viral cytopathic effects. Therefore, cells carrying defective or intact proviruses should show different proliferation dynamics upon TCR activation in vitro. To determine whether cells infected with intact or defective proviruses proliferate to the same extent, we subjected single HIV-1-infected cells to 4 rounds of anti-CD3/CD28 stimulation in a microculture system and used a novel droplet digital PCR assay (intact proviral DNA assay; IPDA) to quantitate the number of intact and defective proviral DNA sequences in infected cells. We demonstrate that cells carrying defective proviruses were capable of enormous expansion with in vitro anti-CD3/CD28 stimulation, while cells harboring intact proviruses were rarely detected and showed little proliferative potential.
Integration site analysis of clones expanded in vitro demonstrated that HIV-1 provirus integration into cancer-associated genes is not required for proliferation of HIV-1-infected cells. Additionally, we sequenced the cell clones that proliferated the most in vitro and found that proviruses in these clones were highly defective. These microculture experiments revealed a profound proliferation defect for cells carrying intact proviruses upon anti-CD3/CD28 stimulation. To explore whether cells carrying intact and defective proviruses show similar dynamics in vivo, we examined longitudinal samples collected 2-8 years apart using the IPDA. We found that the half-life of cells carrying intact proviruses was ~44 months, consistent with previous measurements of the latent reservoir made with a quantitative viral outgrowth assay (QVOA), while cells carrying defective proviruses showed greater variability among patients. Collectively, my thesis has measured the dynamics of CD4+ T cells carrying different types of proviruses and has provided insight into mechanisms that may contribute to proliferation of HIV-infected cells in vivo.
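    The ~44-month half-life reported above implies very slow natural decay of the reservoir. A minimal sketch, assuming simple first-order decay and an illustrative starting reservoir of 10^6 infected cells (the reservoir size is an assumption for the example, not a measurement from this thesis):

```python
import math

HALF_LIFE_MONTHS = 44.0  # half-life of cells carrying intact proviruses

def fraction_remaining(months):
    """Fraction of the reservoir left after first-order decay."""
    return 0.5 ** (months / HALF_LIFE_MONTHS)

def months_to_clear(n_infected_cells):
    """Time for an unperturbed reservoir to decay below one cell."""
    return HALF_LIFE_MONTHS * math.log2(n_infected_cells)

print(round(fraction_remaining(12), 2))     # fraction left after one year
print(round(months_to_clear(1e6) / 12, 1))  # years to clear 10^6 cells
```

Even under these optimistic assumptions, natural decay alone takes decades, which is why eradication strategies target the reservoir directly.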

    Deep learning in food category recognition

    Integrating artificial intelligence with food category recognition has been a field of research interest for the past few decades, and it is potentially one of the next steps in revolutionizing human interaction with food. The modern advent of big data and the development of data-oriented fields like deep learning have driven advances in food category recognition, and with increasing computational power and ever-larger food datasets, the approach's full potential has yet to be realized. This survey provides an overview of methods that can be applied to various food category recognition tasks, including detecting type, ingredients, quality, and quantity. We survey the core components for constructing a machine learning system for food category recognition, including datasets, data augmentation, hand-crafted feature extraction, and machine learning algorithms. We place a particular focus on the field of deep learning, including the utilization of convolutional neural networks, transfer learning, and semi-supervised learning. We provide an overview of relevant studies to promote further developments in food category recognition for research and industrial applications.
    Funding: MRC (MC_PC_17171); Royal Society (RP202G0230); BHF (AA/18/3/34220); Hope Foundation for Cancer Research (RM60G0680); GCRF (P202PF11); Sino-UK Industrial Fund (RP202G0289); LIAS (P202ED10); Data Science Enhancement Fund (P202RE237); Fight for Sight (24NN201); Sino-UK Education Fund (OP202006); BBSRC (RM32G0178B8).

    An uncertainty prediction approach for active learning - application to earth observation

    Mapping land cover and land-use dynamics is crucial in remote sensing, since farmers are encouraged to either intensify or extend crop use due to the ongoing rise in the world's population. A major issue in this area is interpreting and classifying a scene captured in high-resolution satellite imagery. Several methods have been put forth, including neural networks, which generate data-dependent models (i.e. the model is biased toward the data), and static rule-based approaches with thresholds, which are limited in terms of diversity (i.e. the model lacks diversity in terms of rules). However, the problem of having a machine learning model that, given a large amount of training data, can classify multiple classes over different geographic Sentinel-2 imagery and outperform existing approaches remains open. On the other hand, supervised machine learning has evolved into an essential part of many areas due to the increasing number of labeled datasets. Examples include creating classifiers for applications that recognize images and voices, anticipate traffic, propose products, act as virtual personal assistants, and detect online fraud, among many more. Since these classifiers are highly dependent on the training datasets, without human interaction or accurate labels their performance on unseen observations is uncertain. Thus, researchers have attempted to evaluate a number of independent models using a statistical distance. However, the problem of, given a train-test split and classifiers modeled over the train set, identifying a prediction error using the relation between train and test sets remains open. Moreover, while some training data is essential for supervised machine learning, what happens if there is insufficient labeled data? After all, assigning labels to unlabeled datasets is a time-consuming process that may need significant expert human involvement.
When not enough expert manual labels are available for the vast amount of openly available data, active learning becomes crucial. However, given large training and unlabeled datasets, having an active learning model that can reduce the training cost of the classifier and at the same time assist in labeling new data points remains an open problem. From the experimental approaches and findings, the main research contributions, which concentrate on the issue of optical satellite image scene classification, include: building labeled Sentinel-2 datasets with surface reflectance values; proposal of machine learning models for pixel-based image scene classification; proposal of a statistical-distance-based Evidence Function Model (EFM) to detect ML model misclassification; and proposal of a generalized sampling approach for active learning that, together with the EFM, enables a way of determining the most informative examples. Firstly, using a manually annotated Sentinel-2 dataset, Machine Learning (ML) models for scene classification were developed and their performance was compared to Sen2Cor, the reference package from the European Space Agency: a micro-F1 value of 84% was attained by the ML model, a significant improvement over the corresponding Sen2Cor performance of 59%. Secondly, to quantify the misclassification of the ML models, the Mahalanobis-distance-based EFM was devised. This model achieved, for the labeled Sentinel-2 dataset, a micro-F1 of 67.89% for misclassification detection. Lastly, the EFM was engineered as a sampling strategy for active learning, leading to an approach that attains the same level of accuracy as a classifier trained with the full training set using only 0.02% of the total training samples.
    With the help of the above-mentioned research contributions, we were able to provide an open-source Sentinel-2 image scene classification package consisting of ready-to-use Python scripts and a ML model that classifies Sentinel-2 L1C images, generating a 20 m-resolution RGB image with the six studied classes (Cloud, Cirrus, Shadow, Snow, Water, and Other) and giving academics a straightforward method for rapidly and effectively classifying Sentinel-2 scene images. Additionally, an active learning approach that uses the prediction uncertainty given by the EFM as its sampling strategy allows labeling only the most informative points to be used as input to build classifiers.
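    The EFM scores samples by their Mahalanobis distance to per-class statistics estimated from the training set. The following is a schematic reimplementation of that idea as an active-learning sampling strategy, not the thesis code; all names and the toy data are illustrative.

```python
import numpy as np

def fit_class_stats(X, y):
    """Per-class mean and inverse covariance from labeled training data."""
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]
        cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # regularized
        stats[c] = (Xc.mean(axis=0), np.linalg.inv(cov))
    return stats

def mahalanobis(x, mu, cov_inv):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

def most_informative(X_pool, stats, k=5):
    """Rank unlabeled points by distance to the nearest class, farthest
    first: points far from every class are the most uncertain ones."""
    dists = [min(mahalanobis(x, mu, ci) for mu, ci in stats.values())
             for x in X_pool]
    return np.argsort(dists)[::-1][:k]

# Toy data: two 2-D Gaussian classes and a small unlabeled pool.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
stats = fit_class_stats(X, y)
pool = np.array([[0.2, -0.1], [5.1, 4.9], [20.0, 20.0]])
print(most_informative(pool, stats, k=1))  # the point far from both classes
```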

    Characterization and Measurement of the HIV-1 Latent Reservoir Using Single Genome Analysis and Droplet Digital PCR

    Although antiretroviral therapy (ART) suppresses viral replication to clinically undetectable levels, HIV-1 persists in CD4+ T cells in a latent form not targeted by the immune system or ART (Chun et al., 1997b; Finzi et al., 1997; Ruelas and Greene, 2013; Siliciano et al., 2003; Wong et al., 1997a). This latent reservoir is a major barrier to cure. Many individuals initiate ART during chronic infection, and in this setting, most proviruses are defective (Ho et al., 2013a). However, the dynamics of the accumulation and persistence of defective proviruses during acute HIV-1 infection are largely unknown. Here we show that defective proviruses accumulate rapidly within the first few weeks of infection to make up over 93% of all proviruses, regardless of how early ART is initiated. Using an unbiased method to amplify near full-length proviral genomes from HIV-1 infected adults treated at different stages of infection, we demonstrate that early ART initiation limits the size of the reservoir but does not profoundly impact the proviral landscape. This analysis allows us to revise our understanding of the composition of proviral populations and estimate the true reservoir size in individuals treated early vs. late in infection. Additionally, we demonstrate that common assays for measuring the reservoir significantly overestimate or underestimate the size of the latent reservoir and no assay we tested correlates with the number of intact proviruses. Using our analysis of full-genome sequences, we identify regions and features of the HIV-1 genome that, when interrogated simultaneously, specifically distinguish intact HIV from defective genomes. We describe here a novel intact proviral DNA assay (IPDA) using multiplex droplet digital PCR that allows us to accurately quantify the number of intact proviruses, which are likely the closest estimate to the true size of the latent reservoir. 
In preliminary results from matched patient samples, the IPDA strongly correlates with full-genome sequencing results. Many defective proviruses contain defects that likely preclude elimination by eradication strategies and could obscure the measurement of real changes in the rarer intact proviruses. By eliminating 90-95% of all defective proviruses and measuring primarily intact proviruses, we anticipate that the IPDA will better assess the impact of eradication strategies on the true reservoir of virus that must be eliminated to achieve an HIV-1 cure.
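    The quantitative step behind droplet digital PCR assays such as the IPDA is a Poisson correction: because target molecules partition randomly across droplets, the mean copies per droplet λ follows from the positive-droplet fraction alone. A minimal sketch; the droplet volume used here is an assumed, platform-specific constant, not a value from this work.

```python
import math

def copies_per_droplet(n_positive, n_total):
    """Poisson correction for ddPCR: with random partitioning, the
    negative-droplet fraction is exp(-lambda), so
    lambda = -ln(1 - p_positive)."""
    return -math.log(1.0 - n_positive / n_total)

def copies_per_ul(n_positive, n_total, droplet_volume_ul=0.00085):
    # 0.85 nL per droplet is an assumed, platform-specific value
    return copies_per_droplet(n_positive, n_total) / droplet_volume_ul

# Toy well: 2,000 positive droplets out of 15,000.
lam = copies_per_droplet(2000, 15000)
print(round(lam, 4))
```

The correction matters because a single droplet can contain more than one target; simply counting positive droplets would undercount at higher concentrations.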

    Analyses of All Possible Point Mutations within a Protein Reveals Relationships between Function and Experimental Fitness: A Dissertation

    The primary amino acid sequence of a protein governs its specific cellular functions. Since the cracking of the genetic code in the late 1950s, it has been possible to predict the amino acid sequence of a given protein from the DNA sequence of a gene. Nevertheless, the ability to predict a protein's function from its primary sequence remains a great challenge in biology. To address this problem, we combined recent advances in next-generation sequencing technologies with systematic mutagenesis strategies to assess the function of thousands of protein variants in a single experiment. Using this strategy, my dissertation describes the effects of most possible single point mutants of the multifunctional ubiquitin protein in yeast. The effects of these mutants on the essential activation of ubiquitin by the ubiquitin-activating enzyme (E1, Uba1p), as well as their effects on overall yeast growth, were measured. Ubiquitin mutants defective for E1 activation were found to correlate with growth defects, although in a non-linear fashion. Further examination of select point mutants indicated that E1 activation deficiencies predict downstream defects in ubiquitin function, resulting in the observed growth phenotypes. These results indicate that there may be selective pressure for the E1 enzyme to activate only ubiquitin variants that do not cause downstream functional defects. Additionally, I describe the use of similar techniques to discover drug-resistant mutants of the oncogenic protein BRAF V600E in human melanoma cell lines as an example of the widespread applicability of our strategy for addressing the relationship between protein function and biological fitness.
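    Fitness in deep mutational scanning experiments of this kind is typically summarized as a log enrichment score comparing variant and wild-type counts before and after selection. A minimal sketch with hypothetical counts and a pseudocount convention of our choosing, not the dissertation's exact scoring pipeline:

```python
import math

def fitness_score(count_sel, count_input, wt_sel, wt_input, pseudo=0.5):
    """Log2 enrichment of a variant relative to wild type between the
    input and post-selection sequencing pools; pseudocounts guard
    against zero counts."""
    variant = (count_sel + pseudo) / (count_input + pseudo)
    wildtype = (wt_sel + pseudo) / (wt_input + pseudo)
    return math.log2(variant / wildtype)

# A variant depleted tenfold during selection scores strongly negative
# (a growth defect); one that tracks wild type scores near zero.
print(round(fitness_score(100, 1000, 5000, 5000), 2))
print(round(fitness_score(5000, 5000, 5000, 5000), 2))
```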