68 research outputs found

    Multi-label multi-instance transfer learning for simultaneous reconstruction and cross-talk modeling of multiple human signaling pathways

    The text file contains the predicted cross-talk signaling components between human signaling pathways (homolog instance). (ZIP, 36 KB)

    Multi-Label Multi-Kernel Transfer Learning for Human Protein Subcellular Localization

    Recent years have witnessed much progress in computational modelling of protein subcellular localization. However, existing sequence-based predictive models show moderate or unsatisfactory performance, and gene ontology (GO) based models risk overestimating performance for novel proteins. Furthermore, many human proteins have multiple subcellular locations, which complicates computational modelling. To date, few studies have focused on predicting the subcellular localization of human proteins that may reside in multiple cellular compartments. In this paper, we propose a multi-label multi-kernel transfer learning model for human protein subcellular localization (MLMK-TLM). MLMK-TLM introduces a multi-label confusion matrix, formally defines three multi-labelling performance measures, and adapts one-against-all multi-class probabilistic outputs to the multi-label learning scenario. On this basis, it extends our published work GO-TLM (gene ontology based transfer learning model for protein subcellular localization) and MK-TLM (multi-kernel transfer learning based on Chou's PseAAC formulation for protein submitochondria localization) to multiplex human protein subcellular localization. With proper homolog knowledge transfer, a comprehensive survey of model performance for novel proteins, and multi-labelling capability, MLMK-TLM is expected to have broader practical applicability. Experiments on a human protein benchmark dataset show that MLMK-TLM significantly outperforms the baseline model and demonstrates good multi-labelling ability for novel human proteins. Some predictions are validated against the latest Swiss-Prot database. The software can be freely downloaded at http://soft.synu.edu.cn/upload/msy.rar
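The abstract above does not say how the one-against-all probabilistic outputs are turned into multiple labels. One common way to do it, shown here purely as an illustrative assumption, is to keep every location whose probability reaches a threshold and to fall back to the single most probable location when none does. The function name, the threshold value and the example locations are all hypothetical.

```python
def multilabel_from_probabilities(probs, threshold=0.5):
    """Turn per-location probabilities into a sorted list of predicted locations.

    probs: dict mapping a location name to its one-against-all probability.
    threshold: assumed cut-off; the abstract does not specify one.
    """
    chosen = sorted(label for label, p in probs.items() if p >= threshold)
    if not chosen:
        # No location passes the cut-off: keep the single most probable one,
        # so that every protein receives at least one label.
        chosen = [max(probs, key=probs.get)]
    return chosen


if __name__ == "__main__":
    example = {"nucleus": 0.8, "cytoplasm": 0.6, "membrane": 0.1}
    print(multilabel_from_probabilities(example))
```

The fallback is a design choice in this sketch, not something taken from the paper.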

    Amino acid classification based spectrum kernel fusion for protein subnuclear localization

    Background: Predicting protein localization in subnuclear organelles is more challenging than general protein subcellular localization. To the best of our knowledge, only three computational models for protein subnuclear localization exist thus far. Two of them are based on protein primary sequence only. The first assumes a homogeneous amino acid substitution pattern across all residue sites and uses BLOSUM62 to encode the k-mers of a protein sequence; an ensemble of SVMs built on different k-mers makes the final decision, achieving 50% overall accuracy. This simplifying assumption does not exploit the protein sequence profile and ignores the heterogeneity of amino acid substitution patterns across sites. The second model derives the PsePSSM feature representation from the protein sequence by simply averaging the PSSM profile and combines it with the PseAA feature representation to construct a kNN ensemble classifier, Nuc-PLoc, achieving 67.4% overall accuracy. Both sequence-only models achieve relatively poor predictive performance. The third model requires GO annotations to be available, which restricts its applicability.
    Methods: In this paper, we use only the amino acid information of the protein sequence, without any other information, to design a widely applicable model for protein subnuclear localization. We use the K-spectrum kernel to exploit the contextual information around an amino acid and the conserved motif information. Besides expanding the window size, we adopt various amino acid classification approaches to capture diverse aspects of amino acid physicochemical properties. Each amino acid classification generates a series of spectrum kernels based on different window sizes. Thus, (I) window expansion can capture more contextual information and cover size-varying motifs; (II) various amino acid classifications can exploit multi-aspect biological information from the protein sequence. Finally, we combine all the spectrum kernels by simple addition into a single kernel, called SpectrumKernel+, for protein subnuclear localization.
    Results: We conduct performance evaluation experiments on two benchmark datasets: Lei and Nuc-PLoc. Experimental results show that SpectrumKernel+ achieves a substantial performance improvement over the previous model Nuc-PLoc, with an overall accuracy of 83.47% against 67.4%; on the Lei dataset it achieves 71.23% against 50% for Lei SVM Ensemble and 66.50% for Lei GO SVM Ensemble.
    Conclusion: SpectrumKernel+ exploits the rich amino acid information of a protein sequence by embedding the multi-aspect amino acid physicochemical properties, captured by amino acid classification approaches, into implicit size-varying motifs. The kernels derived from diverse amino acid classification approaches and different k-mer sizes are summed together for data integration. Experiments show that SpectrumKernel+ significantly outperforms the existing models for protein subnuclear localization.
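The kernel construction described in the Methods can be sketched as follows: map each residue to a class label, count k-mers over the reduced sequence, take the inner product of the count vectors (the standard k-spectrum kernel), and sum the resulting kernels over several window sizes and classifications. This is a minimal sketch rather than the authors' implementation; the `HYDROPATHY` grouping, the choice of `ks` and all function names are assumptions made for illustration.

```python
from collections import Counter

# Identity mapping: the plain 20-letter amino acid alphabet.
IDENTITY = {aa: aa for aa in "ACDEFGHIKLMNPQRSTVWY"}

# Hypothetical three-class grouping, used only to illustrate an
# "amino acid classification"; the paper's actual groupings are not given here.
HYDROPATHY = {}
for group, label in (("AILMFVWC", "h"), ("GPSTYH", "n"), ("DEKNQR", "p")):
    for aa in group:
        HYDROPATHY[aa] = label


def kmer_counts(seq, k, mapping):
    """Count k-mers of seq after mapping each residue to its class label."""
    reduced = "".join(mapping.get(aa, "x") for aa in seq)  # "x" = unknown residue
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


def spectrum_kernel(a, b, k, mapping):
    """k-spectrum kernel: inner product of the two k-mer count vectors."""
    ca, cb = kmer_counts(a, k, mapping), kmer_counts(b, k, mapping)
    return sum(n * cb[kmer] for kmer, n in ca.items())


def summed_kernel(a, b, ks=(1, 2, 3), mappings=(IDENTITY, HYDROPATHY)):
    """Sum spectrum kernels over all window sizes and classifications."""
    return sum(spectrum_kernel(a, b, k, m) for k in ks for m in mappings)


if __name__ == "__main__":
    print(summed_kernel("MKVLAAGI", "MKVLSAGL"))
```

Because each term is a valid kernel, their plain sum is also a valid kernel, which is what allows simple addition to serve as the integration step.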

    Leakage current simulations of Low Gain Avalanche Diode with improved Radiation Damage Modeling

    We report precise TCAD simulations of the IHEP-IME-v1 Low Gain Avalanche Diode (LGAD), calibrated by secondary ion mass spectroscopy (SIMS). Our setup allows us to evaluate the leakage current, capacitance, and breakdown voltage of the LGAD, which agree with measurement results before irradiation. We also propose an improved LGAD Radiation Damage Model (LRDM), which combines local acceptor removal with global deep energy levels. The LRDM is applied to the IHEP-IME-v1 LGAD and predicts the leakage current well at $-30\,^{\circ}$C after an irradiation fluence of $\Phi_{eq} = 2.5 \times 10^{15}~n_{eq}/cm^{2}$. Simulation of the charge collection efficiency (CCE) is under development.

    Gene ontology based transfer learning for protein subcellular localization

    Background: Prediction of protein subcellular localization generally involves many complex factors, and using only one or two aspects of the data may not tell the whole story. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources to exploit multi-aspect protein feature information. Gene ontology, hereinafter referred to as GO, uses a controlled vocabulary to describe biological molecules or gene products in terms of biological process, molecular function and cellular component. With the rapid expansion of annotated protein sequences, gene ontology has become a general protein feature that can be used to construct predictive models in computational biology. Existing models generally either concatenate the GO terms into a flat binary vector or apply majority-vote based ensemble learning for protein subcellular localization; neither approach can estimate the individual discriminative abilities of the three aspects of gene ontology.
    Results: In this paper, we propose a Gene Ontology Based Transfer Learning Model (GO-TLM) for large-scale protein subcellular localization. The model transfers the signature-based homologous GO terms to the target proteins, and further constructs a reliable learning system to reduce the adverse effect of potentially false GO terms that result from evolutionary divergence. We derive three GO kernels from the three aspects of gene ontology to measure the GO similarity of two proteins, and two further spectrum kernels to measure the similarity of two protein sequences. We use simple non-parametric cross validation to explicitly weigh the discriminative abilities of the five kernels, so that the time and space complexities are greatly reduced compared with the more complicated semi-definite programming and semi-infinite linear programming. The five kernels are then linearly merged into a single kernel for protein subcellular localization. We evaluate GO-TLM against three baseline models, MultiLoc, MultiLoc-GO and Euk-mPLoc, on the benchmark datasets adopted by those baseline models. 5-fold cross validation experiments show that GO-TLM achieves substantial accuracy improvements over the baseline models: 80.38% against 67.40% for Euk-mPLoc, an increase of 12.98%; 96.65% and 96.27% against 89.60% and 89.60% for MultiLoc-GO, increases of 7.05% and 6.67%, on the MultiLoc plant and MultiLoc animal datasets, respectively; and 97.14%, 95.90% and 96.85% against 83.70%, 90.10% and 85.70% for MultiLoc-GO, increases of 13.44%, 5.8% and 11.15%, on the BaCelLoc plant, BaCelLoc fungi and BaCelLoc animal datasets, respectively. On the BaCelLoc independent sets, GO-TLM achieves 81.25%, 80.45% and 79.46% on the BaCelLoc plant holdout, BaCelLoc fungi holdout and BaCelLoc animal holdout datasets, respectively, against 76%, 60.00% and 73.00% for the baseline model MultiLoc-GO, increases of 5.25%, 20.45% and 6.46%, respectively.
    Conclusions: Since direct homology-based GO term transfer may introduce noise and outliers to the target protein, we design an explicitly weighted kernel learning system (the Gene Ontology Based Transfer Learning Model, GO-TLM) to transfer known knowledge about related homologous proteins to the target protein. This reduces the risk of outliers, shares knowledge between homologous proteins, and thus achieves better predictive performance for protein subcellular localization. Cross validation and independent test results show that homology-based GO term transfer and explicit weighting of the GO kernels substantially improve prediction performance.
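The explicit kernel weighting described above can be illustrated with a short sketch. The abstract says only that cross validation is used to weigh the five kernels, which are then linearly merged; the specific rule used here (weights proportional to each kernel's cross-validation accuracy, normalized to sum to 1) is an assumption, as are the function names and the toy matrices.

```python
def kernel_weights(cv_accuracies):
    """Normalize per-kernel cross-validation accuracies into weights summing to 1.

    Assumed rule for illustration; the paper's exact weighting is not given here.
    """
    total = sum(cv_accuracies)
    if total <= 0:
        raise ValueError("at least one accuracy must be positive")
    return [acc / total for acc in cv_accuracies]


def combine_kernels(gram_matrices, weights):
    """Weighted sum of equally sized Gram matrices given as nested lists."""
    n = len(gram_matrices[0])
    combined = [[0.0] * n for _ in range(n)]
    for gram, w in zip(gram_matrices, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += w * gram[i][j]
    return combined


if __name__ == "__main__":
    # Two toy 2x2 Gram matrices standing in for a GO kernel and a spectrum kernel.
    go_kernel = [[1.0, 0.0], [0.0, 1.0]]
    seq_kernel = [[1.0, 1.0], [1.0, 1.0]]
    weights = kernel_weights([0.5, 0.5])
    print(combine_kernels([go_kernel, seq_kernel], weights))
```

Computing each weight from a separate cross-validation run avoids solving a joint optimization over all kernels, which is the cost saving the abstract refers to.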

    Probability Weighted Ensemble Transfer Learning for Predicting Interactions between HIV-1 and Human Proteins

    Reconstruction of host-pathogen protein interaction networks is of great significance for revealing the underlying microbial pathogenesis. However, current experimentally derived networks are generally small and should be augmented by computational methods for less biased biological inference. From the point of view of computational modelling, data scarcity, data unavailability and negative data sampling are the three major problems in reconstructing host-pathogen protein interaction networks. In this work, we address these three concerns and propose a probability weighted ensemble transfer learning model for HIV-human protein interaction prediction (PWEN-TLM), in which a support vector machine (SVM) is adopted as the individual classifier of the ensemble. In the model, data scarcity and data unavailability are tackled by homolog knowledge transfer. The importance of homolog knowledge is measured by the ROC-AUC metric of the individual classifiers, whose outputs are probability weighted to yield the final decision. In addition, we validate the assumption that homolog knowledge alone is sufficient to train a satisfactory model for host-pathogen protein interaction prediction. The model is thus more robust against data unavailability, with less demanding data constraints. As regards negative data construction, experiments show that exclusiveness of subcellular co-localized proteins is unbiased and more reliable than random sampling. Finally, we analyse the overlap between the predictions of our model and those of existing models, and apply the model to the recognition of novel host-pathogen PPIs for further biological research.
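A probability weighted ensemble of the kind described above can be sketched in a few lines: each individual classifier contributes its positive-class probability, weighted by its ROC-AUC. The normalization of AUC values into weights, the 0.5 decision threshold and the function names are assumptions made for illustration; the abstract does not give the exact formula.

```python
def auc_weights(aucs):
    """Normalize per-classifier ROC-AUC scores into weights that sum to 1."""
    total = sum(aucs)
    if total <= 0:
        raise ValueError("AUC scores must have a positive sum")
    return [a / total for a in aucs]


def ensemble_probability(probs, aucs):
    """AUC-weighted average of the classifiers' positive-class probabilities."""
    weights = auc_weights(aucs)
    return sum(w * p for w, p in zip(weights, probs))


def predict_interaction(probs, aucs, threshold=0.5):
    """Final decision: interaction predicted if the weighted probability passes the threshold."""
    return ensemble_probability(probs, aucs) >= threshold


if __name__ == "__main__":
    # Hypothetical outputs of two SVMs (e.g. target-based and homolog-based features).
    print(ensemble_probability([0.9, 0.2], [0.9, 0.6]))
    print(predict_interaction([0.9, 0.2], [0.9, 0.6]))
```

Weighting by AUC lets a classifier trained on more informative features dominate the vote without discarding the others.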

    Individual SVM weight distribution on the S2 dataset.

    The negative data is constructed by random sampling. The horizontal axis shows the combinations of the two sets {T, H} and {F, C, P}. T denotes the target protein and H the homolog protein; F denotes molecular function, C cellular component and P biological process.

    ROC curve on the S2 dataset.

    The negative data is constructed by random sampling. The ROC curves in red, blue and green indicate the performance for the Optimistic, Moderate and Pessimistic cases, respectively.