
    Rickettsia Phylogenomics: Unwinding the Intricacies of Obligate Intracellular Life

    BACKGROUND: Completed genome sequences are rapidly increasing for Rickettsia, obligate intracellular alpha-proteobacteria responsible for various human diseases, including epidemic typhus and Rocky Mountain spotted fever. In light of phylogeny, the establishment of orthologous groups (OGs) of open reading frames (ORFs) will distinguish the core rickettsial genes and other group-specific genes (class 1 OGs or C1OGs) from those distributed indiscriminately throughout the rickettsial tree (class 2 OGs or C2OGs).
    METHODOLOGY/PRINCIPAL FINDINGS: We present 1823 representative (no gene duplications) and 259 non-representative (at least one gene duplication) rickettsial OGs. While the highly reductive (approximately 1.2 Mb) Rickettsia genomes range in predicted ORFs from 872 to 1512, a core of 752 OGs was identified, depicting the essential Rickettsia genes. Unsurprisingly, this core lacks many metabolic genes, reflecting the dependence on host resources for growth and survival. Additionally, we bolster our recent reclassification of Rickettsia by identifying OGs that define the AG (ancestral group), TG (typhus group), TRG (transitional group), and SFG (spotted fever group) rickettsiae. OGs for insect-associated species, tick-associated species and species that harbor plasmids were also predicted. Through superimposition of all OGs over robust phylogeny estimation, we discern between C1OGs and C2OGs, the latter depicting genes either decaying from the conserved C1OGs or acquired laterally. Finally, scrutiny of non-representative OGs revealed high levels of split genes versus gene duplications, with both phenomena confounding gene orthology assignment. Interestingly, non-representative OGs, as well as OGs comprised of several gene families typically involved in microbial pathogenicity and/or the acquisition of virulence factors, fall predominantly within C2OG distributions.
    CONCLUSION/SIGNIFICANCE: Collectively, we determined the relative conservation and distribution of 14354 predicted ORFs from 10 rickettsial genomes across robust phylogeny estimation. The data, available at PATRIC (PathoSystems Resource Integration Center), provide novel information for unwinding the intricacies associated with Rickettsia pathogenesis, expanding the range of potential diagnostic, vaccine and therapeutic targets.
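
The distinction the abstract draws between core OGs, C1OGs and C2OGs can be illustrated with a small sketch. This is not the authors' pipeline: the OG identifiers, genome labels and clade list below are invented, and treating an OG as class 1 only when its genome set exactly matches a clade of the tree is a simplifying assumption made here for illustration.

```python
# Hypothetical presence/absence data: OG id -> genomes that contain a member.
# All identifiers are invented placeholders, not data from the study.
og_members = {
    "OG1": {"g1", "g2", "g3"},
    "OG2": {"g1", "g2"},
    "OG3": {"g1", "g2", "g3"},
    "OG4": {"g3"},
    "OG5": {"g2", "g3"},
}
genomes = {"g1", "g2", "g3"}

# Assumed tree ((g1, g2), g3): its clades, including single leaves and the root.
clades = {
    frozenset({"g1"}), frozenset({"g2"}), frozenset({"g3"}),
    frozenset({"g1", "g2"}),
    frozenset({"g1", "g2", "g3"}),
}

def core_ogs(og_members, genomes):
    """Return the OGs that have at least one member in every genome."""
    return sorted(og for og, present in og_members.items() if genomes <= present)

def classify(present, clades):
    """Label an OG 'C1' if its genome set matches a clade exactly, else 'C2'."""
    return "C1" if frozenset(present) in clades else "C2"

core = core_ogs(og_members, genomes)
classes = {og: classify(present, clades) for og, present in og_members.items()}
```

Under these assumptions, an OG found only in g2 and g3 is labelled C2 because those two genomes do not form a clade in the assumed tree, which mirrors the "distributed indiscriminately" pattern the abstract describes.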

    MeSH-Based Clustering of Biomedical Literature

    The number of online documents has grown tremendously in recent years, which poses challenges for information retrieval from this vast collection. Text mining, an application of machine learning, addresses these challenges by providing techniques for information extraction from large text collections. One of the major application areas of text mining is biomedicine. The rapid growth of biomedical research is giving rise to a large volume of literature published every year. It is difficult to keep pace with current and related research in an area of interest. It is also difficult and time-consuming to read all the literature retrieved by a keyword search on a topic of interest. An efficient approach to this problem is document clustering, which generates meaningful groups of concepts that provide a better description of the data in a document collection. This study investigated document clustering of biomedical literature to identify concepts represented in large document collections. Biomedical literature is indexed by a controlled vocabulary, MeSH (Medical Subject Headings), whose terms represent the major concepts discussed in a document. We compared the use of MeSH in representing the documents with that of full-text representation for document clustering.
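
The two document representations being compared can be sketched as follows. The abstract does not name a clustering algorithm or a similarity measure, so the bag-of-words vectors and cosine similarity used here are assumptions, and the three records (text and MeSH headings alike) are invented placeholders.

```python
import math
from collections import Counter

# Toy records; the text and the MeSH headings are illustrative only.
docs = [
    {"text": "tick borne rickettsia infection in humans",
     "mesh": ["Rickettsia Infections", "Ticks", "Humans"]},
    {"text": "spotted fever transmitted by ticks to humans",
     "mesh": ["Rickettsia Infections", "Ticks", "Humans"]},
    {"text": "clustering of biomedical documents with text mining",
     "mesh": ["Cluster Analysis", "Data Mining"]},
]

def full_text_vector(doc):
    """Bag-of-words counts over lowercased, whitespace-split text."""
    return Counter(doc["text"].lower().split())

def mesh_vector(doc):
    """One count per assigned MeSH heading."""
    return Counter(doc["mesh"])

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def similarity_matrix(docs, vectorize):
    """Pairwise cosine similarities under the chosen representation."""
    vectors = [vectorize(d) for d in docs]
    return [[cosine(u, v) for v in vectors] for u in vectors]

text_sim = similarity_matrix(docs, full_text_vector)
mesh_sim = similarity_matrix(docs, mesh_vector)
```

In this toy case the first two documents share few words but identical headings, so they look far more alike under the MeSH representation than under the full-text one. Either matrix could then be passed to a clustering method of choice.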