144 research outputs found

    Word correlation matrices for protein sequence analysis and remote homology detection

    BACKGROUND: Classification of protein sequences is a central problem in computational biology. Currently, among computational methods, discriminative kernel-based approaches provide the most accurate results. However, kernel-based methods often lack an interpretable model for analysis of discriminative sequence features, and predictions on new sequences are usually computationally expensive. RESULTS: In this work we present a novel kernel for protein sequences based on average word similarity between two sequences. We show that this kernel gives rise to a feature space that allows analysis of discriminative features and fast classification of new sequences. We demonstrate the performance of our approach on a widely used benchmark setup for protein remote homology detection. CONCLUSION: Our word correlation approach provides highly competitive performance compared with state-of-the-art methods for protein remote homology detection. The learned model is interpretable in terms of biologically meaningful features. In particular, analysis of discriminative words allows the identification of characteristic regions in biological sequences. Because of its high computational efficiency, our method can be applied to ranking of potential homologs in large databases.

    Physicochemical property distributions for accurate and rapid pairwise protein homology detection

    BACKGROUND: The challenge of remote homology detection is that many evolutionarily related sequences have very little similarity at the amino acid level. Kernel-based discriminative methods, such as support vector machines (SVMs), that use vector representations of sequences derived from sequence properties have been shown to have superior accuracy when compared to traditional approaches for the task of remote homology detection. RESULTS: We introduce a new method for feature vector representation based on the physicochemical properties of the primary protein sequence. A distribution of physicochemical property scores is assembled from 4-mers of the sequence and normalized based on the null distribution of the property over all possible 4-mers. With this approach there is little computational cost associated with the transformation of the protein into feature space, and overall performance in terms of remote homology detection is comparable with current state-of-the-art methods. We demonstrate that the features can be used for the task of pairwise remote homology detection with improved accuracy versus sequence-based methods such as BLAST and other feature-based methods of similar computational cost. CONCLUSIONS: A protein feature method based on physicochemical properties is a viable approach for extracting features in a computationally inexpensive manner while retaining the sensitivity of SVM protein homology detection. Furthermore, identifying features that can be used for generic pairwise homology detection in lieu of family-based homology detection is important for applications such as large database searches and comparative genomics.
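The 4-mer feature construction described in the abstract above can be illustrated roughly as follows. This is a minimal sketch and not the authors' implementation. It assumes that the score of a 4-mer is the mean of a per-residue scale (Kyte-Doolittle hydropathy is used purely as an example property), that the distribution is a fixed-bin histogram, and that normalization means dividing each bin by its null frequency over all 20^4 possible 4-mers. The names `histogram`, `kmer_scores`, and `property_features`, and the bin count, are hypothetical.

```python
from itertools import product

# Kyte-Doolittle hydropathy scale, used here only as an example property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
K = 4          # word length, as stated in the abstract
N_BINS = 10    # assumed number of histogram bins
LO, HI = min(HYDROPATHY.values()), max(HYDROPATHY.values())


def histogram(scores):
    """Return the fraction of scores in each of N_BINS equal-width bins."""
    counts = [0] * N_BINS
    for s in scores:
        idx = min(int((s - LO) / (HI - LO) * N_BINS), N_BINS - 1)
        counts[idx] += 1
    total = sum(counts)
    if total == 0:  # sequence shorter than K: no words to count
        return [0.0] * N_BINS
    return [c / total for c in counts]


def kmer_scores(seq):
    """Mean property value of every overlapping K-mer in seq."""
    return [sum(HYDROPATHY[a] for a in seq[i:i + K]) / K
            for i in range(len(seq) - K + 1)]


# Null distribution: scores of all 20**4 = 160,000 possible 4-mers.
NULL = histogram(sum(HYDROPATHY[a] for a in word) / K
                 for word in product(HYDROPATHY, repeat=K))


def property_features(seq):
    """Histogram of 4-mer scores, divided bin by bin by the null distribution."""
    observed = histogram(kmer_scores(seq))
    return [o / n if n > 0 else 0.0 for o, n in zip(observed, NULL)]
```

Calling `property_features("MKVLAAGIVGLLLA")` returns one value per bin; values above 1 mark score ranges that are over-represented in the sequence relative to the null. A fixed-length vector like this could then be passed to any standard classifier.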

    Building multiclass classifiers for remote homology detection and fold recognition

    BACKGROUND: Protein remote homology detection and fold recognition are central problems in computational biology. Supervised learning algorithms based on support vector machines are currently among the most effective methods for solving these problems. These methods are primarily used to solve binary classification problems and they have not been extensively used to solve the more general multiclass remote homology prediction and fold recognition problems. RESULTS: We present a comprehensive evaluation of a number of methods for building SVM-based multiclass classification schemes in the context of the SCOP protein classification. These methods include schemes that directly build an SVM-based multiclass model, schemes that employ a second-level learning approach to combine the predictions generated by a set of binary SVM-based classifiers, and schemes that build and combine binary classifiers for various levels of the SCOP hierarchy beyond those defining the target classes. CONCLUSION: Analyzing the performance achieved by the different approaches on four different datasets, we show that most of the proposed multiclass SVM-based classification approaches are quite effective in solving the remote homology prediction and fold recognition problems and that the schemes that use predictions from binary models constructed for ancestral categories within the SCOP hierarchy tend to not only lead to lower error rates but also reduce the number of errors in which a superfamily is assigned to an entirely different fold and a fold is predicted as being from a different SCOP class. Our results also show that the limited size of the training data makes it hard to learn complex second-level models, and that models of moderate complexity lead to consistently better results.

    A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis

    BACKGROUND: Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machines (SVMs) are the most effective and accurate methods for solving these problems. A key step to improve the performance of the SVM-based methods is to find a suitable representation of protein sequences. RESULTS: In this paper, a novel building block of proteins called Top-n-grams is presented, which contains the evolutionary information extracted from the protein sequence frequency profiles. The protein sequence frequency profiles are calculated from the multiple sequence alignments output by PSI-BLAST and converted into Top-n-grams. The protein sequences are transformed into fixed-dimension feature vectors by the occurrence times of each Top-n-gram. The training vectors are evaluated by SVM to train classifiers which are then used to classify the test protein sequences. We demonstrate that the prediction performance of remote homology detection and fold recognition can be improved by combining Top-n-grams and latent semantic analysis (LSA), which is an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results compared to related methods. CONCLUSION: The method based on Top-n-grams significantly outperforms the methods based on many other building blocks including N-grams, patterns, motifs and binary profiles. Therefore, the Top-n-gram is a good building block of protein sequences and can be widely used in many tasks in computational biology, such as sequence alignment, the prediction of domain boundaries, the designation of knowledge-based potentials and the prediction of protein binding sites.

    Population screening for colorectal cancer: the implications of an ageing population

    Population screening for colorectal cancer (CRC) has recently commenced in the United Kingdom supported by the evidence of a number of randomised trials and pilot studies. Certain factors are known to influence screening cost-effectiveness (e.g. compliance), but it remains unclear whether an ageing population (i.e. demographic change) might also have an effect. The aim of this study was to simulate a population-based screening setting using a Markov model and assess the effect of increasing life expectancy on CRC screening cost-effectiveness. A Markov model was constructed that aimed, using a cohort simulation, to estimate the cost-effectiveness of CRC screening in an England and Wales population for two timescales: 2003 (early cohort) and 2033 (late cohort). Four model outcomes were calculated: screened and non-screened cohorts in 2003 and 2033. The screened cohort of men and women aged 60 years were offered biennial unhydrated faecal occult blood testing until the age of 69 years. Life expectancy was assumed to increase by 2.5 years per decade. There were 407 552 fewer people entering the model in the 2033 model due to a lower birth cohort, and population screening saw 30 345 fewer CRC-related deaths over the 50 years of the model. Screening the 2033 cohort cost £96 million with cost savings of £43 million in terms of detection and treatment and £28 million in palliative care costs. After 30 years of follow-up, the cost per life year saved was £1544. An identical screening programme in an early cohort (2003) saw a cost per life year saved of £1651. Population screening for CRC is costly but enables cost savings in certain areas and a considerable reduction in mortality from CRC. This Markov simulation suggests that the cost-effectiveness of population screening for CRC in the United Kingdom may actually be improved by rising life expectancies.
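A cohort-level Markov simulation of the kind the abstract above describes can be sketched as below. Every number here (the three states, the annual transition probabilities, the assumed effect of screening) is a hypothetical placeholder chosen for illustration and is not taken from the study. The function names `run_cohort` and `cost_per_life_year` are likewise assumptions.

```python
# Hypothetical states and annual transition probabilities (placeholders only).
STATES = ["well", "crc", "dead"]
NO_SCREEN = {
    "well": {"well": 0.970, "crc": 0.010, "dead": 0.020},
    "crc":  {"well": 0.000, "crc": 0.800, "dead": 0.200},
    "dead": {"well": 0.000, "crc": 0.000, "dead": 1.000},
}
# Assumed effect of screening: earlier detection lowers CRC mortality.
SCREEN = {
    "well": {"well": 0.970, "crc": 0.010, "dead": 0.020},
    "crc":  {"well": 0.000, "crc": 0.900, "dead": 0.100},
    "dead": {"well": 0.000, "crc": 0.000, "dead": 1.000},
}


def run_cohort(matrix, cycles, start=None):
    """Propagate a cohort distribution; return (final distribution, life years)."""
    dist = dict(start or {"well": 1.0, "crc": 0.0, "dead": 0.0})
    life_years = 0.0
    for _ in range(cycles):
        # One Markov cycle: redistribute the cohort across states.
        dist = {
            to: sum(dist[frm] * matrix[frm][to] for frm in STATES)
            for to in STATES
        }
        # Everyone still alive at the end of the cycle accrues one year.
        life_years += 1.0 - dist["dead"]
    return dist, life_years


def cost_per_life_year(net_cost, ly_screened, ly_unscreened):
    """Incremental cost divided by incremental life years gained."""
    return net_cost / (ly_screened - ly_unscreened)
```

Running `run_cohort` once with each matrix gives per-person life years for the screened and non-screened cohorts. Their difference, together with a net programme cost, feeds `cost_per_life_year`. Modelling a later cohort with higher life expectancy would amount to lowering the background death probabilities.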

    Antimalarial Therapy Selection for Quinolone Resistance among Escherichia coli in the Absence of Quinolone Exposure, in Tropical South America

    BACKGROUND: Bacterial resistance to antibiotics is thought to develop only in the presence of antibiotic pressure. Here we show evidence to suggest that fluoroquinolone resistance in Escherichia coli has developed in the absence of fluoroquinolone use. METHODS: Over 4 years, outreach clinic attendees in one moderately remote and five very remote villages in rural Guyana were surveyed for the presence of rectal carriage of ciprofloxacin-resistant gram-negative bacilli (GNB). Drinking water was tested for the presence of resistant GNB by culture, and the presence of antibacterial agents and chloroquine by HPLC. The development of ciprofloxacin resistance in E. coli was examined after serial exposure to chloroquine. Patient and laboratory isolates of E. coli resistant to ciprofloxacin were assessed by PCR-sequencing for quinolone-resistance-determining-region (QRDR) mutations. RESULTS: In the very remote villages, 4.8% of patients carried ciprofloxacin-resistant E. coli with QRDR mutations despite no local availability of quinolones. However, there had been extensive local use of chloroquine, with higher prevalence of resistance seen in the villages shortly after a Plasmodium vivax epidemic (p<0.01). Antibacterial agents were not found in the drinking water, but chloroquine was demonstrated to be present. Chloroquine was found to inhibit the growth of E. coli in vitro. Replica plating demonstrated that 2-step QRDR mutations could be induced in E. coli in response to chloroquine. CONCLUSIONS: In these remote communities, the heavy use of chloroquine to treat malaria likely selected for ciprofloxacin resistance in E. coli. This may be an important public health problem in malarious areas.

    Complexity of the Inoculum Determines the Rate of Reversion of SIV Gag CD8 T Cell Mutant Virus and Outcome of Infection

    Escape mutant (EM) virus that evades CD8+ T cell recognition is frequently observed following infection with HIV-1 or SIV. This EM virus is often less replicatively “fit” compared to wild-type (WT) virus, as demonstrated by reversion to WT upon transmission of HIV to a naïve host and the association of EM virus with lower viral load in vivo in HIV-1 infection. The rate and timing of reversion are, however, highly variable. We quantified reversion to WT of a series of SIV and SHIV viruses containing minor amounts of WT virus in pigtail macaques using a sensitive PCR assay. Infection with mixes of EM and WT virus containing ≥10% WT virus results in immediate and rapid outgrowth of WT virus at SIV Gag CD8 T cell epitopes within 7 days of infection of pigtail macaques with SHIV or SIV. In contrast, infection with biologically passaged SHIVmn229 viruses with much smaller proportions of WT sequence, or a molecular clone of pure EM SIVmac239, demonstrated a delayed or slow pattern of reversion. WT virus was not detectable until ≥8 days after inoculation and took ≥8 weeks to become the dominant quasispecies. A delayed pattern of reversion was associated with significantly lower viral loads. The diversity of the infecting inoculum determines the timing of reversion to WT virus, which in turn predicts the outcome of infection. The delay in reversion of fitness-reducing CD8 T cell escape mutations in some scenarios suggests opportunities to reduce the pathogenicity of HIV during very early infection.

    Nephrin Regulates Lamellipodia Formation by Assembling a Protein Complex That Includes Ship2, Filamin and Lamellipodin

    Actin dynamics has emerged at the forefront of podocyte biology. Slit diaphragm junctional adhesion protein Nephrin is necessary for development of the podocyte morphology and transduces phosphorylation-dependent signals that regulate cytoskeletal dynamics. The present study extends our understanding of Nephrin function by showing in cultured podocytes that Nephrin activation-induced actin dynamics is necessary for lamellipodia formation. Upon activation, Nephrin recruits and regulates a protein complex that includes Ship2 (SH2 domain containing 5′ inositol phosphatase), Filamin and Lamellipodin, proteins important in regulation of actin and focal adhesion dynamics, as well as lamellipodia formation. Using the previously described CD16-Nephrin clustering system, Nephrin ligation or activation resulted in phosphorylation of the actin crosslinking protein Filamin in a p21-activated kinase-dependent manner. Nephrin activation in cell culture results in formation of lamellipodia, a process that requires specialized actin dynamics at the leading edge of the cell along with focal adhesion turnover. In the CD16-Nephrin clustering model, Nephrin ligation resulted in abnormal morphology of actin tails in human podocytes when Ship2, Filamin or Lamellipodin were individually knocked down. We also observed decreased lamellipodia formation and cell migration in these knockdown cells. These data provide evidence that Nephrin not only initiates actin polymerization but also assembles a protein complex that is necessary to regulate the architecture of the generated actin filament network and focal adhesion dynamics.

    Crystal Structure of Legionella DotD: Insights into the Relationship between Type IVB and Type II/III Secretion Systems

    The Dot/Icm type IVB secretion system (T4BSS) is a pivotal determinant of Legionella pneumophila pathogenesis. L. pneumophila translocates more than 100 effector proteins into the host cytoplasm using the Dot/Icm T4BSS, modulating host cellular functions to establish a replicative niche within host cells. The T4BSS core complex spanning the inner and outer membranes is thought to be made up of at least five proteins: DotC, DotD, DotF, DotG and DotH. DotH is the outer membrane protein; its targeting depends on lipoproteins DotC and DotD. However, the core complex structure and assembly mechanism are still unknown. Here, we report the crystal structure of DotD at 2.0 Å resolution. The structure of DotD is distinct from that of VirB7, the outer membrane lipoprotein of the type IVA secretion system. In contrast, the C-terminal domain of DotD is remarkably similar to the N-terminal subdomain of secretins, the integral outer membrane proteins that form substrate conduits for the type II and the type III secretion systems (T2SS and T3SS). A short β-segment in the otherwise disordered N-terminal region, located on the hydrophobic cleft of the C-terminal domain, is essential for outer membrane targeting of DotH and Dot/Icm T4BSS core complex formation. These findings uncover an intriguing link between T4BSS and T2SS/T3SS.