
    Ensemble attribute profile clustering: discovering and characterizing groups of genes with similar patterns of biological features

    BACKGROUND: Ensemble attribute profile clustering is a novel, text-based strategy for analyzing a user-defined list of genes and/or proteins. The strategy exploits annotation data present in gene-centered corpora and utilizes ideas from statistical information retrieval to discover and characterize properties shared by subsets of the list. The practical utility of this method is demonstrated by employing it in a retrospective study of two non-overlapping sets of genes defined by a published investigation as markers for normal human breast luminal epithelial cells and myoepithelial cells. RESULTS: Each genetic locus was characterized using a finite set of biological properties and represented as a vector of features indicating attributes associated with the locus (a gene attribute profile). In this study, the vector space models for a pre-defined list of genes were constructed from the Gene Ontology (GO) terms and the Conserved Domain Database (CDD) protein domain terms assigned to the loci by the gene-centered corpus LocusLink. This data set of GO- and CDD-based gene attribute profiles, vectors of binary random variables, was used to estimate multiple finite mixture models, and each ensuing model was utilized to partition the profiles into clusters. The resultant partitionings were combined using a unanimous voting scheme to produce consensus clusters, sets of profiles that co-occurred consistently in the same cluster. Attributes that were important in defining the genes assigned to a consensus cluster were identified. The clusters and their attributes were inspected to ascertain the GO and CDD terms most associated with subsets of genes and, in conjunction with external knowledge such as chromosomal location, used to gain functional insights into human breast biology. The 52 luminal epithelial cell markers and 89 myoepithelial cell markers are disjoint sets of genes. Ensemble attribute profile clustering-based analysis indicated that both lists contained groups of genes with the functional properties of membrane receptor biology/signal transduction and nucleic acid binding/transcription. A subset of the luminal markers was associated with metabolic and oxidoreductase activities, whereas a subset of myoepithelial markers was associated with protein hydrolase activity. CONCLUSION: Given a set of genes and/or proteins associated with a phenomenon, process or system of interest, ensemble attribute profile clustering provides a simple method for collating and synthesizing the annotation data pertaining to them that are present in text-based, gene-centered corpora. The results provide information about properties common and unique to subsets of the list and hence insights into the biology of the problem under investigation
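
    The unanimous voting step described in this abstract can be illustrated with a minimal sketch. It assumes the partitions have already been produced (no mixture models are fitted here), and the function name `consensus_clusters`, the `min_size` parameter, and the gene labels `g1` to `g4` are hypothetical names chosen for illustration, not taken from the paper. Two profiles land in the same consensus cluster only if every partition assigns them to the same cluster.

```python
from collections import defaultdict


def consensus_clusters(partitions, min_size=2):
    """Combine several partitions by unanimous vote.

    partitions: list of dicts, each mapping item -> cluster label
                (one dict per fitted model; labels need not match across dicts).
    min_size:   smallest group reported (an assumption of this sketch;
                singletons co-occur with nothing, so they are dropped by default).
    """
    items = set(partitions[0])
    for part in partitions[1:]:
        if set(part) != items:
            raise ValueError("every partition must label the same items")

    # Items with an identical label tuple were co-clustered in every partition.
    groups = defaultdict(list)
    for item in sorted(items):
        signature = tuple(part[item] for part in partitions)
        groups[signature].append(item)

    return sorted(g for g in groups.values() if len(g) >= min_size)


# Hypothetical toy input: three partitions of four gene attribute profiles.
partitions = [
    {"g1": 0, "g2": 0, "g3": 1, "g4": 1},
    {"g1": "a", "g2": "a", "g3": "a", "g4": "b"},
    {"g1": 2, "g2": 2, "g3": 0, "g4": 1},
]
print(consensus_clusters(partitions))
```

    In this toy input only `g1` and `g2` share a cluster in all three partitions, so they form the single reported consensus cluster; `g3` and `g4` are separated by at least one partition.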

    Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span

    BACKGROUND: The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data. Here, the potential of such modeling is demonstrated by examining the 5,225 free-text items in the Caenorhabditis Genetic Center (CGC) Bibliography using techniques from statistical information retrieval. Items in the CGC biomedical text corpus were modeled using the Latent Dirichlet Allocation (LDA) model. LDA is a hierarchical Bayesian model which represents a document as a random mixture over latent topics; each topic is characterized by a distribution over words. RESULTS: An LDA model estimated from CGC items had better predictive performance than two standard models (unigram and mixture of unigrams) trained using the same data. To illustrate the practical utility of LDA models of biomedical corpora, a trained CGC LDA model was used for a retrospective study of nematode genes known to be associated with life span modification. Corpus-, document-, and word-level LDA parameters were combined with terms from the Gene Ontology to enhance the explanatory value of the CGC LDA model, and to suggest additional candidates for age-related genes. A novel, pairwise document similarity measure based on the posterior distribution on the topic simplex was formulated and used to search the CGC database for "homologs" of a "query" document discussing the life span-modifying clk-2 gene. Inspection of these document homologs enabled and facilitated the production of hypotheses about the function and role of clk-2. CONCLUSION: Like other graphical models for genetic, genomic and other types of biological data, LDA provides a method for extracting unanticipated insights and generating predictions amenable to subsequent experimental validation

    The human exosome: an autoantigenic complex of exoribonucleases in myositis and scleroderma

    The anti-PM/Scl autoantibodies are known to characterize a subset of autoimmune patients with myositis, scleroderma (Scl), and the PM/Scl overlap syndrome. The major autoantigens that are recognized by anti-PM/Scl autoantibodies are designated PM/Scl-100 and PM/Scl-75. These autoantigens have been reported to associate into a large complex consisting of 11 to 16 proteins and to play a role in ribosome synthesis. Recently, it was discovered that the PM/Scl complex is the human counterpart of the yeast (Saccharomyces cerevisiae) exosome, which is an RNA-processing complex consisting of 11 3' → 5' exoribonucleases. To date, 10 human exosome components have been identified, although only some of these were studied in more detail. In this review, we discuss some recent advances in the characterization of the PM/Scl complex

    Comparative Expression Profiling of the Chlamydia trachomatis pmp Gene Family for Clinical and Reference Strains

    Chlamydia trachomatis, an obligate intracellular pathogen, is a leading worldwide cause of ocular and urogenital diseases. Advances have been made in our understanding of the nine-member polymorphic membrane protein (Pmp) gene (pmp) family of C. trachomatis. However, there is only limited information on their biologic role, especially for biological variants (biovar) and clinical strains. We evaluated expression for pmps throughout development for reference strains E/Bour and L2/434, representing different biovars, and for clinical E and L2 strains. Immunoreactivity of patient sera to recombinant (r)Pmps was also determined. All pmps were expressed at two hours. pmpA had the lowest expression but was up-regulated at 12 h for all strains, indicating involvement in reticulate body development. For pmpD, expression peaked at 36 h. Additionally, 57.7% of sera from infected and 0% from uninfected adolescents were reactive to rPmpD (p = 0.001), suggesting a role in immunogenicity. pmpF had the highest expression levels for all clinical strains and L2/434 with differential expression of the pmpFE operon for the same strains. Sera were nonreactive to rPmpF despite immunoreactivity to rMOMP and rPmpD, suggesting that PmpF is not associated with humoral immune responses. pmpFE sequences for clinical strains were identical to those of the respective reference strains. We identified the putative pmpFE promoter, which was, surprisingly, 100% conserved for all strains. Analyses of ribosomal binding sites, RNase E, and hairpin structures suggested complex regulatory mechanism(s) for this >6 Kb operon. The dissimilar expression of the same pmp for different C. trachomatis strains may explain different strain-specific needs and phenotypic distinctions. This is further supported by the differential immunoreactivity to rPmpD and rPmpF of sera from patients infected with different strains. Furthermore, clinical E strains did not correlate with the E reference strain at the gene expression level, reinforcing the need for expansive studies of clinical strains

    An additive exponential noise channel with a transmission deadline

    We derive the maximum mutual information for an additive exponential noise (AEN) channel with a peak input constraint. We find that the optimizing input density is mixed (with singularities) similar to previous results for AEN channels with a mean input constraint. Likewise, the maximum mutual information takes a similar form, though obviously the maximum for the peak constraint is smaller than for the corresponding mean-constrained channel. This model is inspired by multiple biological phenomena and processes which can be abstracted as follows: inscribed matter is sent by an emitter, moves through a medium, and arrives eventually at its destination receptor. The inscribed matter can convey information in a variety of ways such as the number of signaling quanta - molecules, macromolecular complexes, organelles, cells and tissues - that are emitted as well as the detailed pattern of their release. However, rather than focus on a general class of emitter-receptor systems or a particular exemplar of biomedical importance, our ultimate goal is to provide bounds on the potential efficacy of timed-release signaling for any system which emits identical signaling quanta. That is, we seek to apply one of the most potent aspects of information theory to biological signaling - mechanism blindness - in the hopes of gaining insights applicable to diverse systems that span a wide range of spatiotemporal scales. © 2011 IEEE

    C. elegans clk-2, a gene that limits life span, encodes a telomere length regulator similar to yeast telomere binding protein Tel2p

    An important quest in modern biology is to identify genes involved in aging. Model organisms such as the nematode Caenorhabditis elegans are particularly useful in this regard. The C. elegans genome has been sequenced [1], and single gene mutations that extend adult life span have been identified [2]. Among these longevity-controlling loci are four apparently unrelated genes that belong to the clk family [3–5]. In mammals, telomere length and structure can influence cellular, and possibly organismal, aging [6]. Here, we show that clk-2 encodes a regulator of telomere length in C. elegans

    Robust Sparse Hyperplane Classifiers: Application to Uncertain Molecular Profiling Data

    Molecular profiling studies can generate abundance measurements for thousands of transcripts, proteins, metabolites, or other species in, for example, normal and tumor tissue samples. Treating such measurements as features and the samples as labeled data points, sparse hyperplanes provide a statistical methodology for classifying data points into one of two categories (classification and prediction) and defining a small subset of discriminatory features (relevant feature identification). However, this and other extant classification methods address only implicitly the issue of observed data being a combination of underlying signals and noise. Recently, robust optimization has emerged as a powerful framework for handling uncertain data explicitly. Here, ideas from this field are exploited to develop robust sparse hyperplanes, i.e., classification and relevant feature identification algorithms that are resilient to variation in the data. Specifically, each data point is associated with an explicit data uncertainty model in the form of an ellipsoid parameterized by a center and covariance matrix. The task of learning a robust sparse hyperplane from such data is formulated as a second order cone program (SOCP). Gaussian and distribution-free data uncertainty models are shown to yield SOCPs that are equivalent to the SOCP based on ellipsoidal uncertainty. The real-world utility of robust sparse hyperplanes is demonstrated via retrospective analysis of breast cancer related transcript profiles. Data-dependent heuristics are used to compute the parameters of each ellipsoidal data uncertainty model. The generalization performance of a specific implementation, designated "robust LIKNON," is better than its nominal counterpart. Finally, the strengths and limitations of robust sparse hyperplanes are discussed
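
    One way such a program could be written down is sketched below. This is an illustrative formulation and not necessarily the one used in the paper: the symbols $w$, $b$ (hyperplane), $\xi_i$ (slack), $C$ (trade-off constant), $\bar{x}_i$ and $\Sigma_i$ (ellipsoid center and covariance), $\gamma_i$ (ellipsoid radius), and labels $y_i \in \{-1, +1\}$ are all assumptions introduced here. Requiring the usual margin constraint to hold for every point in the ellipsoid $\{x : (x - \bar{x}_i)^\top \Sigma_i^{-1} (x - \bar{x}_i) \le \gamma_i^2\}$ and taking the worst case gives a norm term on the right-hand side:

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \|w\|_1 + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.}\quad & y_i\,(w^\top \bar{x}_i + b) \;\ge\; 1 - \xi_i + \gamma_i \,\bigl\|\Sigma_i^{1/2} w\bigr\|_2,
\qquad \xi_i \ge 0, \quad i = 1,\dots,n .
\end{aligned}
```

    Under these assumptions, the $\ell_1$ objective encourages a sparse $w$ (few discriminatory features), each constraint is a second order cone constraint, and setting every $\gamma_i = 0$ recovers a nominal sparse hyperplane problem that ignores data uncertainty.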