Learning to Find Relevant Biological Articles Without Negative Training Examples
Abstract. Classifiers are traditionally learned using sets of positive and negative training examples. Often, however, a classifier is required when only an incomplete set of positive examples and a set of unlabeled examples are available for training. This is the situation, for example, with the Transporter Classification Database (TCDB, www.tcdb.org), a repository of information about proteins involved in transmembrane transport. This paper presents and evaluates a method for learning to rank the likely relevance to TCDB of newly published scientific articles, using the articles currently referenced in TCDB as positive training examples. The new method has succeeded in identifying 964 new articles relevant to TCDB in fewer than six months, which is a major practical success. From a general data mining perspective, the contributions of this paper are (i) devising and evaluating two novel approaches that solve the positive-only problem effectively, (ii) applying support vector machines in a state-of-the-art way for recognizing and ranking relevance, and (iii) deploying a system to update a widely used, real-world biomedical database. Supplementary information, including all data sets, is publicly available at www.cs.ucsd.edu/users/knoto/pub/ajcai08.
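The positive-only setting described in this abstract can be illustrated with a small sketch. It is not the paper's method: the paper uses support vector machines, whereas this example swaps in a much simpler word-level log-odds scorer so that it runs with the standard library alone. The function names (`train_log_odds`, `score`, `rank`) and the toy documents are hypothetical. The one idea it does share with the abstract is that unlabeled articles stand in as provisional negatives, and candidates are ranked by relevance rather than given a hard label.

```python
import math
from collections import Counter


def train_log_odds(positive_docs, unlabeled_docs):
    """Per-word log-odds of occurring in positive vs. unlabeled text.

    Assumption: unlabeled documents are treated as provisional negatives.
    Add-one smoothing keeps every probability non-zero.
    """
    pos = Counter(w for d in positive_docs for w in d.lower().split())
    unl = Counter(w for d in unlabeled_docs for w in d.lower().split())
    vocab = set(pos) | set(unl)
    pos_total = sum(pos.values()) + len(vocab)
    unl_total = sum(unl.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((unl[w] + 1) / unl_total)
        for w in vocab
    }


def score(doc, weights):
    """Mean log-odds over the words of a document; unseen words count as 0."""
    words = doc.lower().split()
    if not words:
        return 0.0
    return sum(weights.get(w, 0.0) for w in words) / len(words)


def rank(candidates, weights):
    """Order candidate articles from most to least likely relevant."""
    return sorted(candidates, key=lambda d: score(d, weights), reverse=True)


# Toy data (hypothetical): short strings stand in for article text.
positives = ["membrane transporter protein", "ion channel transport protein"]
unlabeled = ["galaxy star formation", "membrane transport channel", "stock market prices"]
weights = train_log_odds(positives, unlabeled)
ranked = rank(["star market", "transporter protein channel"], weights)
```

Because some unlabeled articles are in fact relevant, the scores are best read as a ranking signal for a curator to review, not as calibrated probabilities.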
Supplementary Material for: Bioinformatic Analyses of Transmembrane Transport: Novel Software for Deducing Protein Phylogeny, Topology, and Evolution
<p>During the past decade, we have experienced a revolution in the
biological sciences resulting from the flux of information generated by
genome-sequencing efforts. Our understanding of living organisms, the
metabolic processes they catalyze, the genetic systems encoding cellular
protein and stable RNA constituents, and the pathological conditions
caused by some of these organisms has greatly benefited from the
availability of complete genomic sequences and the establishment of
comprehensive databases. Many research institutes around the world are
now devoting their efforts largely to genome sequencing, data collection
and data analysis. In this review, we summarize tools that are in
routine use in our laboratory for characterizing transmembrane transport
systems. Applications of these tools to specific transporter families
are presented. Many of the computational approaches described should be
applicable to virtually all classes of proteins and RNA molecules.</p>
Supplementary Material for: Analysis of 58 Families of Holins Using a Novel Program, PhyST
<p>We have designed a freely accessible program, PhyST, which allows the automated characterization of any family of homologous proteins within the Transporter Classification Database. The program performs an NCBI-PSI-BLAST search and reports (1) the average protein sequence length with standard deviations, (2) the average predicted number of transmembrane segments, (3) the total number of homologues retrieved, (4) a quantitative list of all source phyla, and (5) potential fusion proteins of sizes considerably exceeding the average size of the proteins retrieved. We have applied this program to 58 families of holins, and the results are presented. The results show that holins are very rarely fused to other protein domains, suggesting that holins form transmembrane pores as homooligomers without the participation of other proteins or protein domains.</p>
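Several of the summary statistics listed in this abstract can be sketched in a few lines. This is an illustration only, not PhyST's code: the function name `summarize_family`, the `(protein_id, sequence_length, phylum)` input format, and the `fusion_factor` cutoff are all assumptions. In particular, the abstract says only that fusion candidates have sizes "considerably exceeding" the average, so the twofold threshold below is a placeholder. The BLAST search and the transmembrane-segment prediction are left out.

```python
from collections import Counter
from statistics import mean, stdev


def summarize_family(hits, fusion_factor=2.0):
    """Summarize homologues given as (protein_id, sequence_length, phylum) tuples.

    fusion_factor is a hypothetical cutoff: proteins longer than
    fusion_factor * mean length are flagged as possible fusions.
    """
    lengths = [length for _, length, _ in hits]
    avg = mean(lengths)
    # The sample standard deviation needs at least two values.
    sd = stdev(lengths) if len(lengths) > 1 else 0.0
    return {
        "homologues": len(hits),
        "mean_length": avg,
        "sd_length": sd,
        "phyla": Counter(phylum for _, _, phylum in hits),
        "possible_fusions": [
            pid for pid, length, _ in hits if length > fusion_factor * avg
        ],
    }


# Toy input (hypothetical identifiers and lengths).
hits = [
    ("A", 100, "Proteobacteria"),
    ("B", 110, "Firmicutes"),
    ("C", 90, "Proteobacteria"),
    ("D", 400, "Firmicutes"),
]
summary = summarize_family(hits)
```

A single very long protein inflates the mean it is compared against, so a real tool might prefer a median-based cutoff. The mean is used here only because the abstract describes the comparison in terms of average size.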
Supplementary Material for: The Membrane Attack Complex/Perforin Superfamily
<p>The membrane attack complex/perforin (MACPF) superfamily consists of a
diverse group of proteins involved in bacterial pathogenesis and
sporulation as well as eukaryotic immunity, embryonic development,
neural migration and fruiting body formation. The present work shows
that the evolutionary relationships between the members of the
superfamily, previously suggested by comparison of their tertiary
structures, can also be supported by analyses of their primary
structures. The superfamily includes the MACPF family (TC 1.C.39), the
cholesterol-dependent cytolysin (CDC) family (TC 1.C.12.1 and 1.C.12.2)
and the pleurotolysin pore-forming (pleurotolysin B) family (TC
1.C.97.1), as revealed by expansion of each family by comparison against
a large protein database, and by the comparisons of their hidden Markov
models. Clustering analyses demonstrated grouping of the CDC homologues
separately from the 12 MACPF subfamilies, which also grouped separately
from the pleurotolysin B family. Members of the MACPF superfamily
revealed a remarkably diverse range of proteins spanning eukaryotic,
bacterial, and archaeal taxonomic domains, with notable variations in
protein domain architectures. Our strategy should also be helpful in
putting together other highly divergent protein families.</p>
Supplementary Material for: Comparative Analyses of Transport Proteins Encoded within the Genomes of Bdellovibrio bacteriovorus HD100 and Bdellovibrio exovorus JSS
<p><i>Bdellovibrio</i>, δ-proteobacteria, including <i>B. bacteriovorus</i> (Bba) and <i>B. exovorus</i>
(Bex), are obligate predators of other Gram-negative bacteria. While
Bba grows in the periplasm of the prey cell, Bex grows externally. We
have analyzed and compared the transport proteins of these 2 organisms
based on the current contents of the Transporter Classification Database
(TCDB; www.tcdb.org). Bba has 103 more transporters than Bex, 50% more
secondary carriers, and 3 times as many MFS carriers. Bba has far more
metabolite transporters than Bex as expected from its larger genome, but
there are 2 times more carbohydrate uptake and drug efflux systems, and
3 times more lipid transporters. Bba also has polyamine and carboxylate
transporters lacking in Bex. Bba has more than twice as many members of
the Mot-Exb family of energizers, but both may have energizers for
gliding motility. They use entirely different types of systems for iron
acquisition. Both contain unexpectedly large numbers of pseudogenes and
incomplete systems, suggesting that they are undergoing genome size
reduction. Interestingly, all 5 outer-membrane receptors in Bba are
lacking in Bex. The 2 organisms have similar numbers and types of
peptide and amino acid uptake systems as well as protein and
carbohydrate secretion systems. The differences observed correlate with
and may account, in part, for the different lifestyles of these 2
bacterial predators.</p>
Supplementary Material for: The Amino Acid-Polyamine-Organocation Superfamily
The amino acid-polyamine-organocation (APC) superfamily has been shown to include five recognized families, four of which are specific for amino acids and their derivatives. Recent high-resolution X-ray crystallographic data have shown that four additional transporter families (BCCT, TC No. 2.A.15; SSS, 2.A.21; NSS, 2.A.22; and NCS1, 2.A.39), transporting a wide range of solutes, exhibit sufficiently similar folds to suggest a common evolutionary origin. We have used established statistical methods, based on sequence similarity, to show that these families are, in fact, members of the APC superfamily. We also identify two additional families (NCS2, 2.A.40; SulP, 2.A.53) as being members of this superfamily. Repeat sequences, each having five transmembrane α-helical segments and arising via ancient intragenic duplications, are demonstrated for all of these families, further strengthening the conclusion of homology. The APC superfamily appears to be the second largest superfamily of secondary carriers, the largest being the major facilitator superfamily (MFS). Although the topology of the members of the APC superfamily differs from that of the MFS, both families appear to have arisen from a common ancestral 2 TMS hairpin structure that underwent intragenic triplication followed by loss of a TMS in the APC family, to give the repeat units that are characteristic of these two superfamilies.