717 research outputs found

    Exploiting physico-chemical properties in string kernels

    Background: String kernels are commonly used for the classification of biological sequences, both nucleotide and amino acid sequences. Although string kernels are already very powerful, they have a major shortcoming when applied to amino acids: they ignore the physico-chemical properties of the residues being compared, such as size, hydrophobicity, or charge. This information is very valuable, especially when training data is scarce. Only very few approaches so far have aimed at combining these two ideas. Results: We propose new string kernels that combine the benefits of physico-chemical descriptors for amino acids with those of string kernels. The benefits of the proposed kernels are assessed on two problems: MHC-peptide binding classification using position-specific kernels, and protein classification based on the substring spectrum of the sequences. Our experiments demonstrate that incorporating amino acid properties into string kernels yields improved performance compared to standard string kernels and to previously proposed non-substring kernels. Conclusions: In summary, the proposed modifications, in particular the combination with the RBF substring kernel, consistently yield improvements without affecting the computational complexity. The proposed kernels therefore appear to be the kernels of choice for any protein sequence-based inference. Availability: Data sets, code and additional information: http://www.fml.tuebingen.mpg.de/raetsch/suppl/aask; implementations of the developed kernels are available as part of the Shogun toolbox.
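
    The idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation or the Shogun API: the function names, the choice of a single descriptor (Kyte-Doolittle hydropathy), and the parameters k and sigma are all assumptions made for illustration. The standard spectrum kernel counts only exactly matching k-mers, whereas the property-aware variant compares k-mers through an RBF on their descriptor vectors, so chemically similar substrings still contribute.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy values, used here as one example descriptor.
# The paper combines several physico-chemical properties; one is assumed
# here to keep the sketch small.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def kmers(seq, k):
    """Return all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def spectrum_kernel(s, t, k=3):
    """Standard spectrum kernel: count pairs of identical k-mers."""
    cs, ct = Counter(kmers(s, k)), Counter(kmers(t, k))
    return sum(cs[u] * ct[u] for u in cs)


def rbf_substring_kernel(s, t, k=3, sigma=1.0):
    """Hypothetical property-aware variant: sum an RBF over all k-mer pairs,
    where each k-mer is mapped to its vector of hydropathy values."""
    total = 0.0
    for u in kmers(s, k):
        fu = [HYDROPATHY[a] for a in u]
        for v in kmers(t, k):
            fv = [HYDROPATHY[a] for a in v]
            d2 = sum((x - y) ** 2 for x, y in zip(fu, fv))
            total += math.exp(-d2 / (2.0 * sigma ** 2))
    return total
```

    With this sketch, "ILV" and "LLV" share no exact 3-mer, so the spectrum kernel is zero, while the RBF variant still assigns them a clearly higher similarity than "ILV" and "DLV", because isoleucine and leucine have similar hydropathy. The naive double loop is quadratic in the number of k-mers and is meant only to show the comparison, not the efficient computation the abstract refers to.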

    Word correlation matrices for protein sequence analysis and remote homology detection

    Background: Classification of protein sequences is a central problem in computational biology. Currently, among computational methods, discriminative kernel-based approaches provide the most accurate results. However, kernel-based methods often lack an interpretable model for analysis of discriminative sequence features, and predictions on new sequences are usually computationally expensive. Results: In this work we present a novel kernel for protein sequences based on average word similarity between two sequences. We show that this kernel gives rise to a feature space that allows analysis of discriminative features and fast classification of new sequences. We demonstrate the performance of our approach on a widely used benchmark setup for protein remote homology detection. Conclusion: Our word correlation approach provides highly competitive performance compared with state-of-the-art methods for protein remote homology detection. The learned model is interpretable in terms of biologically meaningful features. In particular, analysis of discriminative words allows the identification of characteristic regions in biological sequences. Because of its high computational efficiency, our method can be applied to ranking of potential homologs in large databases.

    Towards Open and Equitable Access to Research and Knowledge for Development

    Leslie Chan and colleagues discuss the value of open access not just for access to health information, but also for transforming structural inequity in current academic reward systems and for valuing scholarship from the South

    Prognostic utility of sestamibi lung uptake does not require adjustment for stress-related variables: A retrospective cohort study

    BACKGROUND: Increased (99m)Tc-sestamibi stress lung-to-heart ratio (sLHR) has been shown to predict cardiac outcomes similar to pulmonary uptake of thallium. Peak heart rate and use of pharmacologic stress affect the interpretation of lung thallium uptake. The current study was performed to determine whether (99m)Tc-sestamibi sLHR measurements are affected by stress-related variables, and whether this in turn affects prognostic utility. METHODS: sLHR was determined in 718 patients undergoing (99m)Tc-sestamibi SPECT stress imaging. sLHR was assessed in relation to demographics, hemodynamic variables and outcomes (mean follow up 5.6 ± 1.1 years). RESULTS: Mean sLHR was slightly greater in males than in females (P < 0.01) and also showed a weak negative correlation with age (P < 0.01) and systolic blood pressure (P < 0.01), but was unrelated to stress method or heart rate at the time of injection. In patients undergoing treadmill exercise, sLHR was also positively correlated with peak workload (P < 0.05) but inversely with double product (P < 0.05). The combined explanatory effect of sex, age and hemodynamic variables on sLHR was less than 10%. The risk of acute myocardial infarction (AMI) or death increased by a factor of 1.7–1.8 for each SD increase in unadjusted sLHR, and was unaffected by adjustment for sex, age and hemodynamic variables (hazard ratios 1.6–1.7). The area under the ROC curve for the unadjusted sLHR was 0.65 (95% CI 0.59–0.71, P < 0.0001) and was unchanged for the adjusted sLHR (0.65, 95% CI 0.61–0.72, P < 0.0001). CONCLUSION: Stress-related variables have only a weak effect on measured sLHR. Unadjusted and adjusted sLHR provide equivalent prognostic information for prediction of AMI or death

    Subcellular location prediction of proteins using support vector machines with alignment of block sequences utilizing amino acid composition

    Background: Subcellular location prediction of proteins is an important and well-studied problem in bioinformatics. The task is to predict which part of a cell a given protein is transported to, given the amino acid sequence of the protein as input. This problem is becoming more important because information on subcellular location is helpful for annotation of proteins and genes, and the number of complete genomes is rapidly increasing. Since existing predictors are based on various heuristics, it is important to develop a simple method with high prediction accuracy. Results: In this paper, we propose a novel and general prediction method that combines techniques for sequence alignment with feature vectors based on amino acid composition. We implemented this method with support vector machines on plant data sets extracted from the TargetP database. In fivefold cross-validation tests, the overall accuracy and average MCC obtained were 0.9096 and 0.8655, respectively. We also applied our method to other datasets, including that of WoLF PSORT. Conclusion: Although there is a predictor that uses gene ontology information and yields higher accuracy than ours, our accuracies are higher than those of existing predictors that use only sequence information. Since information such as gene ontology can be obtained only for known proteins, our predictor is considered useful for subcellular location prediction of newly discovered proteins. Furthermore, the idea of combining alignment and amino acid frequency is novel and general, so it may be applied to other problems in bioinformatics. Our method for plant proteins is also implemented as a web system, available at http://sunflower.kuicr.kyoto-u.ac.jp/~tamura/slpfa.html
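
    The amino acid composition features mentioned in the abstract above can be sketched as follows. This covers only the composition part, not the alignment of block sequences or the SVM training described in the paper; the function names and the fixed block length are assumptions made for illustration.

```python
# The 20 standard amino acids in a fixed order, so that every feature
# vector uses the same index for the same residue.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the 20-dimensional relative frequency vector of a sequence."""
    n = len(seq)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def block_compositions(seq, block_len=10):
    """Split a sequence into consecutive blocks (hypothetical fixed length)
    and compute the composition vector of each block."""
    return [
        composition(seq[i:i + block_len])
        for i in range(0, len(seq), block_len)
    ]
```

    Each block then becomes a point in a 20-dimensional space, which is the kind of representation that could be compared or aligned block by block before being passed to a classifier.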

    Osteoporosis-related fracture case definitions for population-based administrative data

    Background: Population-based administrative data have been used to study osteoporosis-related fracture risk factors and outcomes, but there has been limited research about the validity of these data for ascertaining fracture cases. The objectives of this study were to: (a) compare fracture incidence estimates from administrative data with estimates from population-based clinically-validated data, and (b) test for differences in incidence estimates from multiple administrative data case definitions. Methods: Thirty-five case definitions for incident fractures of the hip, wrist, humerus, and clinical vertebrae were constructed using diagnosis codes in hospital data and diagnosis and service codes in physician billing data from Manitoba, Canada. Clinically-validated fractures were identified from the Canadian Multicentre Osteoporosis Study (CaMos). Generalized linear models were used to test for differences in incidence estimates. Results: For hip fracture, sex-specific differences were observed in the magnitude of under- and over-ascertainment of administrative data case definitions when compared with CaMos data. The length of the fracture-free period to ascertain incident cases had a variable effect on over-ascertainment across fracture sites, as did the use of imaging, fixation, or repair service codes. Case definitions based on hospital data resulted in under-ascertainment of incident clinical vertebral fractures. There were no significant differences in trend estimates for wrist, humerus, and clinical vertebral case definitions. Conclusions: The validity of administrative data for estimating fracture incidence depends on the site and features of the case definition.

    Implications of Training in Incremental Theories of Intelligence for Undergraduate Statistics Students

    This chapter documents the effects of training in incremental theories of intelligence on students in introductory statistics courses at a liberal arts university in the US. Incremental theories of intelligence examine the beliefs individuals hold of knowledge and how it is attained. An individual with an incremental theory of intelligence believes that intelligence can be developed. The research examined differences by gender in mastery of statistics and attitudes toward statistics for students who received growth mind-set training. A pre-test, post-test design utilised the Students’ Attitudes Toward Statistics© instrument and the Comprehensive Assessment of Outcomes in a first Statistics course. An ANCOVA revealed that females gained more than males on their value of statistics (F(1, 63) = 9.40, MSE = 3.79, p = .003, partial η² = 0.134) and decreased less for effort expended to learn statistics (F(1, 63) = 4.41, MSE = 4.07, p = .040, partial η² = 0.067). Females also gained mastery of statistical concepts at a greater rate (F(1, 63) = 5.30, MSE = 0.06, p = .025, partial η² = 0.080), indicating a possible path to alleviate the under-representation of females in STEM

    The non-linear infrared-radio correlation of low-z galaxies: implications for redshift evolution, a new radio SFR recipe, and how to minimize selection bias

    The infrared-radio correlation (IRRC) underpins many commonly used radio luminosity-star formation rate (SFR) calibrations. In preparation for the new generation of radio surveys we revisit the IRRC of low-z galaxies by (a) drawing on the best currently available IR and 1.4 GHz radio photometry, plus ancillary data over the widest possible area, and (b) carefully assessing potential systematics. We compile a catalogue of ∼9,500 z < 0.2 galaxies and derive their 1.4 GHz radio ($L_{1.4}$), total IR, and monochromatic IR luminosities in up to seven bands, allowing us to parameterize the wavelength dependence of monochromatic IRRCs from 22 to 500 μm. For the first time for low-z samples, we quantify how poorly matched IR and radio survey depths bias measured median IR/radio ratios, $\overline{q}_{\mathrm{TIR}}$, and discuss the level of biasing expected for low-z IRRC studies in ASKAP/MeerKAT fields. For our subset of ∼2,000 high-confidence star-forming galaxies we find a median $\overline{q}_{\mathrm{TIR}}$ of 2.54 (scatter: 0.17 dex). We show that $\overline{q}_{\mathrm{TIR}}$ correlates with $L_{1.4}$, implying a non-linear IRRC with slope 1.11±0.01. Our new $L_{1.4}$-SFR calibration, which incorporates this non-linearity, reproduces SFRs from panchromatic SED fits substantially better than previous IRRC-based recipes. Finally, we match the evolutionary slope of recently measured $\overline{q}_{\mathrm{TIR}}$-redshift trends without having to invoke redshift evolution of the IRRC. In this framework, the redshift evolution of $\overline{q}_{\mathrm{TIR}}$ reported at GHz frequencies in the literature is the consequence of a partial, redshift-dependent sampling of a non-linear IRRC obeyed by low-z and distant galaxies

    Physicochemical property distributions for accurate and rapid pairwise protein homology detection

    Background: The challenge of remote homology detection is that many evolutionarily related sequences have very little similarity at the amino acid level. Kernel-based discriminative methods, such as support vector machines (SVMs), that use vector representations of sequences derived from sequence properties have been shown to have superior accuracy when compared to traditional approaches for the task of remote homology detection. Results: We introduce a new method for feature vector representation based on the physicochemical properties of the primary protein sequence. A distribution of physicochemical property scores is assembled from 4-mers of the sequence and normalized based on the null distribution of the property over all possible 4-mers. With this approach there is little computational cost associated with the transformation of the protein into feature space, and overall performance in terms of remote homology detection is comparable with current state-of-the-art methods. We demonstrate that the features can be used for the task of pairwise remote homology detection with improved accuracy versus sequence-based methods such as BLAST and other feature-based methods of similar computational cost. Conclusions: A protein feature method based on physicochemical properties is a viable approach for extracting features in a computationally inexpensive manner while retaining the sensitivity of SVM protein homology detection. Furthermore, identifying features that can be used for generic pairwise homology detection in lieu of family-based homology detection is important for applications such as large database searches and comparative genomics.
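
    One possible reading of the 4-mer feature construction in the abstract above is sketched below. It is an interpretation, not the authors' code: the use of a single property (Kyte-Doolittle hydropathy), the summed 4-mer score, the z-score normalization against all 20^4 possible 4-mers, and the histogram binning (bins, z_range) are all assumptions made for illustration.

```python
import itertools
import math
from functools import lru_cache

# Kyte-Doolittle hydropathy values, assumed here as the example property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def kmer_score(kmer):
    """Property score of a k-mer: the sum of its residues' values."""
    return sum(HYDROPATHY[a] for a in kmer)


@lru_cache(maxsize=None)
def null_mean_std(k=4):
    """Mean and standard deviation of the score over all possible k-mers
    (the null distribution); 20**4 = 160,000 combinations for k = 4."""
    scores = [sum(c) for c in itertools.product(HYDROPATHY.values(), repeat=k)]
    mean = math.fsum(scores) / len(scores)
    var = math.fsum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, math.sqrt(var)


def feature_vector(seq, k=4, bins=10, z_range=3.0):
    """Histogram of null-normalized k-mer scores, as relative frequencies."""
    mean, std = null_mean_std(k)
    n = len(seq) - k + 1
    if n <= 0:
        return [0.0] * bins
    hist = [0] * bins
    for i in range(n):
        z = (kmer_score(seq[i:i + k]) - mean) / std
        z = max(-z_range, min(z_range, z))  # clip outliers into the end bins
        idx = min(bins - 1, int((z + z_range) / (2.0 * z_range) * bins))
        hist[idx] += 1
    return [h / n for h in hist]
```

    Every protein, whatever its length, maps to a vector of the same size, and building it needs only one pass over the sequence, which is consistent with the low transformation cost the abstract emphasizes. Such vectors for two proteins could then be combined and passed to an SVM for pairwise homology detection.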