4,547 research outputs found

    Information profiles for DNA pattern discovery

    Finite-context modeling is a powerful tool for compressing, and hence for representing, DNA sequences. We describe an algorithm that detects genomic regularities within a blind discovery strategy. The algorithm uses information profiles built from suitable combinations of finite-context models. We used the genome of the fission yeast Schizosaccharomyces pombe strain 972 h- for illustration, unveiling locations of low information content, which are usually associated with DNA regions of potential biological interest. Comment: Full version of DCC 2014 paper "Information profiles for DNA pattern discovery".
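As a rough illustration of the idea in this abstract, the sketch below computes a per-base information profile from a single adaptive order-k finite-context model. It is a minimal sketch under stated assumptions, not the authors' algorithm: the paper combines several models, whereas this uses one, and the function name `information_profile`, the additive smoothing parameter `alpha`, and the default order are all hypothetical choices made here for illustration.

```python
import math
from collections import defaultdict


def information_profile(seq, order=2, alpha=1.0, alphabet="ACGT"):
    """Return the information content (in bits) of each base in seq.

    Assumption: a single adaptive order-`order` finite-context model with
    additive smoothing `alpha`. Symbols outside `alphabet` raise KeyError.
    """
    # counts[context][symbol] = times `symbol` has followed `context` so far
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 0))
    profile = []
    for i, symbol in enumerate(seq):
        # The context is the preceding `order` bases (shorter at the start).
        context = seq[max(0, i - order):i]
        ctx_counts = counts[context]
        total = sum(ctx_counts.values())
        # Smoothed estimate of P(symbol | context) from past data only.
        p = (ctx_counts[symbol] + alpha) / (total + alpha * len(alphabet))
        profile.append(-math.log2(p))
        # Update the model after "coding" the symbol.
        ctx_counts[symbol] += 1
    return profile


# A highly repetitive toy sequence: later bases become cheap to encode.
toy_sequence = "AC" * 20
toy_profile = information_profile(toy_sequence)
```

Positions where the profile stays low are the repetitive, low-information stretches that the abstract describes as candidates of potential biological interest.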

    The Microbial Ecology of Bacterial Vaginosis: A Fine Scale Resolution Metagenomic Analysis

    The vaginal microbiota play an important protective role in maintaining the health of women. Disruption of the mutualistic relationship between bacterial communities in the vagina and their hosts can lead to bacterial vaginosis (BV), a condition in which lactic acid-producing bacteria are supplanted by a diverse array of strictly anaerobic bacteria. BV has been shown to be an independent risk factor for adverse outcomes including preterm delivery and low infant birth weight, acquisition of sexually transmitted infections and HIV, and development of pelvic inflammatory disease. National surveys indicate that the prevalence of BV among U.S. women is 29.2%, and yet, despite considerable effort, the etiology of BV remains unknown. Moreover, there are no broadly effective therapies for BV, and recurrence is common. In the proposed research we will test the overarching hypothesis that vaginal microbial community dynamics and activities are indicators of risk of BV. To do this, we propose to conduct a high-resolution prospective study in which samples collected daily from 200 reproductive-age women over two menstrual cycles are used to capture molecular events that take place before, during, and after the spontaneous remission of BV episodes. We will use modern genomic technologies to obtain the data needed to correlate shifts in vaginal microbial community composition and function, metabolomes, and epidemiological and behavioral metadata with the occurrence of BV, to better define the syndrome itself and identify patterns that are predictive of BV.
The three specific aims of the research are: (1) Evaluate the association between the dynamics of vaginal microbial communities and risk of BV by characterizing the community composition of vaginal specimens archived from a vaginal douching cessation study in which 33 women self-collected vaginal swabs twice weekly for 16 weeks; (2) Enroll 135 women in a prospective study in which self-collected vaginal swab samples and secretions are collected daily along with data on the occurrence of BV, vaginal pH, and information on time-varying habits and practices; (3) Apply model-based statistical clustering and classification approaches to associate microbial community composition and function with metadata and clinical diagnoses of BV. The large body of information generated will facilitate understanding of vaginal microbial community dynamics and the etiology of BV, and drive the development of better diagnostic tools for BV. Furthermore, the information will enable more personalized and effective treatment of BV and ultimately help prevent adverse sequelae associated with this highly prevalent disruption of the vaginal microbiome.

    Gene Expression based Survival Prediction for Cancer Patients: A Topic Modeling Approach

    Cancer is one of the leading causes of death worldwide. Many believe that genomic data will enable us to better predict the survival time of these patients, which will lead to better, more personalized treatment options and patient care. As standard survival prediction models have a hard time coping with the high dimensionality of such gene expression (GE) data, many projects use dimensionality reduction techniques to overcome this hurdle. We introduce a novel methodology, inspired by topic modeling from the natural language domain, to derive expressive features from the high-dimensional GE data. There, a document is represented as a mixture over a relatively small number of topics, where each topic corresponds to a distribution over the words; here, to accommodate the heterogeneity of a patient's cancer, we represent each patient (~document) as a mixture over cancer-topics, where each cancer-topic is a mixture over GE values (~words). This required some extensions to the standard LDA model, e.g., to accommodate the "real-valued" expression values, leading to our novel "discretized" Latent Dirichlet Allocation (dLDA) procedure. We initially focus on the METABRIC dataset, which describes breast cancer patients using r=49,576 GE values from microarrays. Our results show that our approach provides survival estimates that are more accurate than those of standard models, in terms of the standard Concordance measure. We then validate this approach by running it on the Pan-kidney (KIPAN) dataset, over r=15,529 GE values (here using the mRNAseq modality), and find that it again achieves excellent results. In both cases, we also show that the resulting model is calibrated, using the recent "D-calibrated" measure. These successes, in two different cancer types and expression modalities, demonstrate the generality and the effectiveness of this approach.
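The Concordance measure mentioned in this abstract can be illustrated with a small worked example. The sketch below implements the commonly used pairwise concordance index for right-censored survival data; it is an assumption that this matches the exact variant used in the paper, and the function name `concordance_index` and the toy data are hypothetical.

```python
def concordance_index(times, events, predicted_times):
    """Fraction of comparable patient pairs whose predictions are ordered correctly.

    times:           observed follow-up time for each patient
    events:          1 if death was observed, 0 if the patient was censored
    predicted_times: model's predicted survival time (larger = longer survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only if patient i had an observed event
            # strictly before patient j's observed (or censoring) time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if predicted_times[i] < predicted_times[j]:
                    concordant += 1.0   # predicted order matches reality
                elif predicted_times[i] == predicted_times[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: the third patient is censored at time 6.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
perfect = concordance_index(toy_times, toy_events, [1, 3, 5, 7])
reversed_order = concordance_index(toy_times, toy_events, [7, 5, 3, 1])
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and 0.0 means every pair is ranked in the wrong order.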

    Sequence-based Multiscale Model (SeqMM) for High-throughput chromosome conformation capture (Hi-C) data analysis

    In this paper, I introduce a Sequence-based Multiscale Model (SeqMM) for biomolecular data analysis. Combining it with the spectral graph method, I reveal the essential difference between global-scale and local-scale models in structure clustering, i.e., they optimize Euclidean (or spatial) distances and sequential (or genomic) distances, respectively. More specifically, clusters from global-scale models optimize Euclidean distance relations. Local-scale models, on the other hand, result in clusters that optimize genomic distance relations. For biomolecular data, Euclidean distances and sequential distances are two independent variables, which can never be optimized simultaneously in data clustering. However, the sequence scale in my SeqMM can work as a tuning parameter that balances these two variables and delivers different clusterings depending on the purpose. Further, my SeqMM is used to explore the hierarchical structures of chromosomes. I find that at the global scale, the Fiedler vector from my SeqMM closely resembles the principal vector from principal component analysis, and can be used to study genomic compartments. In TAD analysis, I find that TADs evaluated at different scales are not consistent and vary considerably. In particular, when the sequence scale is small, the calculated TAD boundaries are dramatically different. Even for regions with high contact frequencies, TAD regions show no obvious consistency. However, when the scale value increases further, although TADs are still quite different, TAD boundaries in these high contact frequency regions become more and more consistent. Finally, I find that for a fixed local scale, my method can deliver very robust TAD boundaries across different cluster numbers. Comment: 22 pages, 13 figures
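The Fiedler vector referred to in this abstract is the eigenvector of a graph Laplacian belonging to its second-smallest eigenvalue; its signs give a two-way partition of the nodes. The sketch below is a generic illustration, not the SeqMM construction itself: it assumes a symmetric, non-negative weight matrix with a zero diagonal and a connected graph, and the function name `fiedler_vector`, the shifted power iteration, and the toy matrix are hypothetical choices made here.

```python
def fiedler_vector(weights, iterations=500):
    """Approximate the Fiedler vector of the Laplacian L = D - W.

    Uses power iteration on (shift * I - L) while projecting out the
    constant vector, so the iteration converges to the eigenvector of L
    with the smallest non-zero eigenvalue (for a connected graph).
    """
    n = len(weights)
    degrees = [sum(row) for row in weights]
    # Gershgorin bound: every eigenvalue of L is at most 2 * max degree.
    shift = 2.0 * max(degrees)
    v = [float(i) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        # Remove the component along the constant (trivial) eigenvector.
        mean = sum(v) / n
        v = [x - mean for x in v]
        # w = (shift * I - D + W) v
        w = [
            (shift - degrees[i]) * v[i]
            + sum(weights[i][j] * v[j] for j in range(n))
            for i in range(n)
        ]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Toy "contact" matrix: two blocks of three loci, strong contacts inside
# each block (1.0) and weak contacts between blocks (0.1).
toy_weights = [
    [0.0 if i == j else (1.0 if (i < 3) == (j < 3) else 0.1) for j in range(6)]
    for i in range(6)
]
toy_fiedler = fiedler_vector(toy_weights)
```

In this toy case the first three entries of `toy_fiedler` share one sign and the last three share the other, which is the kind of two-compartment split the abstract relates to the principal vector from principal component analysis.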