1,677 research outputs found

    Prediction of peptide drift time in ion mobility mass spectrometry from sequence-based features

    BACKGROUND: Ion mobility-mass spectrometry (IMMS), an analytical technique that combines the features of ion mobility spectrometry (IMS) and mass spectrometry (MS), can rapidly separate ions on a millisecond time-scale. IMMS has become a powerful tool for analyzing complex mixtures, especially peptides in proteomics. The high-throughput nature of this technique makes the identification of peptides in complex biological samples challenging. As an important parameter, peptide drift time can be used to enhance downstream data analysis in IMMS-based proteomics. RESULTS: In this paper, a model based on least squares support vector regression (LS-SVR) is presented to predict peptide ion drift time in IMMS from sequence-based features of the peptide. Four descriptors were extracted from the peptide sequence to represent each peptide ion as a 34-component vector. The parameters of LS-SVR were selected by a grid search strategy, and 10-fold cross-validation was employed for model training and testing. The proposed method was tested on three datasets with different charge states. The high prediction performance achieved demonstrates the effectiveness and efficiency of the prediction model. CONCLUSIONS: The proposed LS-SVR model can predict peptide drift time from sequence information with relatively high accuracy, as shown by a test on a dataset of 595 peptides. This work can enhance the confidence of protein identification when combined with current protein searching techniques.
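The LS-SVR workflow named in this abstract (kernel regression, grid search, k-fold cross-validation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the RBF kernel, the parameter names `gamma` and `sigma`, the candidate grids, and the toy sine data are all assumptions introduced here, and the 34-component peptide descriptors are not reproduced.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    # Pairwise squared Euclidean distances between rows of A and rows of B
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def lssvr_fit(X, y, gamma, sigma):
    # LS-SVR training reduces to one linear system:
    # [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]  # bias b, dual weights alpha

def lssvr_predict(X_train, b, alpha, X_new, sigma):
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

def cv_rmse(X, y, gamma, sigma, k=10, seed=0):
    # k-fold cross-validated root mean square error
    idx = np.random.default_rng(seed).permutation(len(y))
    sq_err = 0.0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        b, alpha = lssvr_fit(X[train], y[train], gamma, sigma)
        pred = lssvr_predict(X[train], b, alpha, X[fold], sigma)
        sq_err += ((pred - y[fold]) ** 2).sum()
    return np.sqrt(sq_err / len(y))

def grid_search(X, y, gammas, sigmas, k=10):
    # Score every (gamma, sigma) pair and return the pair with the lowest CV error
    scores = {(g, s): cv_rmse(X, y, g, s, k) for g in gammas for s in sigmas}
    return min(scores, key=scores.get), scores

# Toy data (hypothetical, standing in for peptide feature vectors and drift times)
X_demo = np.linspace(0.0, 2.0 * np.pi, 60)[:, None]
y_demo = np.sin(X_demo[:, 0])
best, scores = grid_search(X_demo, y_demo, [1.0, 10.0, 100.0], [0.3, 1.0, 3.0])
print("best (gamma, sigma):", best, "CV RMSE:", scores[best])
```

Unlike standard SVR, the least squares variant replaces the quadratic program with a single linear solve, which is why the fit function above needs nothing beyond `np.linalg.solve`.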

    POPISK: T-cell reactivity prediction using support vector machines and string kernels

    BACKGROUND: Accurate prediction of peptide immunogenicity and characterization of the relation between peptide sequences and peptide immunogenicity would greatly help vaccine design and the understanding of the immune system. In contrast to the prediction of the antigen processing and presentation pathway, the prediction of subsequent T-cell reactivity is a much harder problem. Previous studies identifying T-cell receptor (TCR) recognition positions were based on small-scale analyses using only a few peptides and reached differing conclusions about the recognition positions, such as positions 4, 6 and 8 of peptides of length 9. Large-scale analyses are necessary to better characterize the effect of peptide sequence variations on T-cell reactivity and to design predictors of a peptide's T-cell reactivity (and thus immunogenicity). The identification and characterization of important positions influencing T-cell reactivity will provide insights into the underlying mechanism of immunogenicity. RESULTS: This work establishes a large dataset by collecting immunogenicity data from three major immunology databases. To account for the effect of MHC restriction, peptides are classified by their associated MHC alleles. Subsequently, a computational method (named POPISK) using a support vector machine with a weighted degree string kernel is proposed to predict T-cell reactivity and identify important recognition positions. POPISK yields a mean 10-fold cross-validation accuracy of 68% in predicting T-cell reactivity of HLA-A2-binding peptides. POPISK is capable of predicting immunogenicity with scores that also correctly predict the change in T-cell reactivity related to point mutations in epitopes reported in previous studies using crystal structures. Thorough analyses of the prediction results identify the important positions 4, 6, 8 and 9, and yield insights into the molecular basis for TCR recognition. Finally, we relate this finding to physicochemical properties and structural features of the MHC-peptide-TCR interaction. CONCLUSIONS: A computational method, POPISK, is proposed to predict immunogenicity with scores that are useful for predicting immunogenicity changes caused by single-residue modifications. The web server of POPISK is freely available at http://iclab.life.nctu.edu.tw/POPISK
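To make the weighted degree string kernel concrete, the sketch below implements its commonly used textbook form: it counts substrings of length 1 to `degree` that match at the same position in two equal-length sequences, with shorter matches weighted more heavily. The weighting `beta_d = 2(degree - d + 1) / (degree(degree + 1))` is the standard choice and is an assumption here, since the abstract does not state POPISK's exact weights or degree. The 9-mer sequences are illustrative only and are not taken from the source dataset.

```python
def wd_kernel(x, y, degree=3):
    """Weighted degree kernel between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("weighted degree kernel needs equal-length sequences")
    length = len(x)
    total = 0.0
    for d in range(1, degree + 1):
        # Standard weights: they decrease linearly with d and sum to 1
        beta = 2.0 * (degree - d + 1) / (degree * (degree + 1))
        # Count position-aligned substrings of length d that are identical
        matches = sum(x[i:i + d] == y[i:i + d] for i in range(length - d + 1))
        total += beta * matches
    return total

# Illustrative 9-mers (hypothetical inputs): a peptide and a point mutant at position 4
wild_type = "GILGFVFTL"
mutant = "GILAFVFTL"
print(wd_kernel(wild_type, wild_type, degree=2))
print(wd_kernel(wild_type, mutant, degree=2))
```

Because matches are position-specific, a single substitution lowers the kernel value only through the substrings that overlap the mutated position, which is what lets a model built on this kernel attribute importance to individual positions.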

    Prediction of cis/trans isomerization in proteins using PSI-BLAST profiles and secondary structure information

    BACKGROUND: The majority of peptide bonds in proteins occur in the trans conformation. However, a considerable fraction of prolyl peptide bonds adopt the cis form. Proline cis/trans isomerization is known to play a critical role in protein folding, splicing, cell signaling and transmembrane active transport. Accurate prediction of proline cis/trans isomerization in proteins would have many important applications in understanding protein structure and function. RESULTS: In this paper, we propose a new approach to predict proline cis/trans isomerization in proteins using a support vector machine (SVM). Preliminary results indicated that Radial Basis Function (RBF) kernels could lead to better prediction performance than polynomial and linear kernel functions. We used single-sequence information with different local window sizes, amino acid compositions of different local sequences, multiple sequence alignments obtained from PSI-BLAST, and secondary structure information predicted by PSIPRED. We explored these different sequence encoding schemes to investigate their effects on prediction performance. Training and testing were performed on a newly enlarged dataset of 2424 non-homologous proteins determined by X-ray diffraction, using 5-fold cross-validation. A window size of 11 provided the best performance for determining proline cis/trans isomerization based on the single amino acid sequence. Using multiple sequence alignments in the form of PSI-BLAST profiles significantly improved prediction performance: accuracy increased from 62.8% with single sequence to 69.8%, and the Matthews Correlation Coefficient (MCC) improved from 0.26 with single local sequence to 0.40. Furthermore, when coupled with the secondary structure information predicted by PSIPRED, our method yielded a prediction accuracy of 71.5% and an MCC of 0.43, which are 9% and 0.17 higher, respectively, than the values achieved with single-sequence information. CONCLUSION: A new method has been developed to predict proline cis/trans isomerization in proteins based on a support vector machine, using the single amino acid sequence with different local window sizes, the amino acid compositions of local sequences flanking the central proline residue, the position-specific scoring matrices (PSSMs) extracted by PSI-BLAST, and the predicted secondary structures generated by PSIPRED. The successful application of the SVM approach in this study reinforces that SVM is a powerful tool for predicting proline cis/trans isomerization in proteins and for biological sequence analysis.
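Two ingredients of this abstract are easy to illustrate: extracting a fixed-size local window centered on each proline, and scoring predictions with the Matthews Correlation Coefficient. The sketch below is only an illustration; the padding character `X`, the function names, and the toy sequence are assumptions introduced here, and the PSI-BLAST and PSIPRED features are not reproduced.

```python
import math

def proline_windows(sequence, size=11, pad="X"):
    """Return (index, window) pairs for every proline, padding the sequence ends."""
    half = size // 2
    padded = pad * half + sequence + pad * half
    # Residue i of the original sequence sits at padded[i + half],
    # so padded[i:i + size] is the window centered on it.
    return [(i, padded[i:i + size]) for i, aa in enumerate(sequence) if aa == "P"]

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally reported as 0 when any marginal total is zero
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy sequence (hypothetical) with two prolines
for index, window in proline_windows("MKVPAELGSPT", size=11):
    print(index, window)
```

MCC ranges from -1 to 1 and stays informative when classes are imbalanced, which matters here because cis prolyl bonds are the minority class.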

    Disaster risk assessment under climate change impacts using machine learning techniques

    Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Environmental Studies, Interdisciplinary Program in Landscape Architecture, August 2022. 이동근 (Lee Dong-kun). Climate change is an urgent threat to our generation. Natural hazards have become more unpredictable, occurring more frequently and with greater force, due to climate change. Natural disasters in Korea are mostly caused by meteorological events. The total damage caused by disasters in the last ten years is attributed mainly to typhoons (49%) and heavy rain (40%). Therefore, risk management that analyzes and evaluates hazards related to heavy rainfall, such as flooding and landslides, is needed to prepare for the long term. Effective monitoring and detection of climate change impacts are also critical for predicting and managing hazard risks. The main research questions of this thesis are therefore: 1) how can future potential risks be predicted in a complex situation caused by climate change, considering various factors, and 2) what kinds of efforts to reduce such risks are sustainable?
First, to assess the future risk of multiple hazards such as coastal flooding and landslides, this study analyzed the present risk using multiple machine learning (ML) algorithms that have been widely used in recent studies as part of probabilistic approaches, and estimated future risks by considering the forecasted rainfall under different representative concentration pathway (RCP) climate change scenarios and regional climate models. Second, to evaluate the effectiveness of adaptation strategies for responding to disaster risks posed by climate change impacts, this research analyzed the effectiveness and sustainability of structural measures such as green space and seawalls, which are widely used and play an important role as countermeasures against coastal flooding, by dividing them into several adaptation pathways. The results of this study identify future at-risk areas and can support decision-making for risk management and guide disaster reduction and management measures, including land use planning.
    Contents:
    Abstract
    Chapter 1. Introduction: 1. Background. 2. Purpose.
    Chapter 2. Prediction of coastal flooding risk under climate change impacts in South Korea using machine learning algorithms: 1. Introduction. 2. Materials and Method (2.1 Study Area; 2.2 Machine learning algorithms; 2.3 Method). 3. Results (3.1 Comparison of ML algorithms; 3.2 Risk probability map; 3.3 Future risk under climate change impacts). 4. Discussion (4.1 Regional differences; 4.2 Significance factor; 4.3 Methodological implications). 5. Conclusions.
    Chapter 3. Predicting susceptibility to landslides under climate change impacts in metropolitan areas of South Korea using machine learning: 1. Introduction. 2. Materials and Method (2.1 Study Area; 2.2 Data; 2.3 Landslide factors analysis; 2.4 Machine learning algorithms and validation; 2.5 LSA using different algorithms; 2.6 Predicting landslide susceptibility). 3. Results (3.1 Multi-collinearity and influencing factor analysis; 3.2 Comparison of machine learning algorithms; 3.3 Predicting landslide susceptibility). 4. Discussion (4.1 Analysis of results from different ML algorithms; 4.2 Difference in susceptibilities based on land cover type). 5. Conclusions.
    Chapter 4. Adaptation strategies to future coastal flooding: performance evaluation of green and grey infrastructure in South Korea: 1. Introduction. 2. Materials and Method (2.1 Study area; 2.2 Data; 2.3 Comparison of machine learning (ML) techniques and coastal flooding risk analysis; 2.4 Evaluation of coastal flooding risk with ASs; 2.5 Potential coastal flooding risk depending on different adaptive pathways). 3. Results (3.1 Performances of ML algorithms; 3.2 Coastal flooding risk with ASs; 3.3 Potential coastal flooding risk according to different adaptive pathways). 4. Discussion (4.1 Effect of AS according to spatial characteristics; 4.2 Importance of nature-based solutions as ASs). 5. Conclusion.
    Chapter 5. Conclusion
    Bibliography
    Abstract in Korean

    A data-driven modeling approach for simulating algal blooms in the tidal freshwater of James River in response to riverine nutrient loading

    Algal blooms often occur in the tidal freshwater (TF) of the James River estuary, a tributary of the Chesapeake Bay. The timing of algal blooms correlates highly with a summer low-flow period when residence time is long and nutrients are available. Because of complex interactions between physical transport and algal dynamics, it is challenging to predict interannual variations of blooms correctly using a complex eutrophication model without a high-resolution model grid to resolve the complex geometry and an accurate estimate of nutrient loading to drive the model. In this study, an approach using long-term observational data (from 1990 to 2013) and the least squares support vector machine (LS-SVM) was applied to simulate algal blooms. The Empirical Orthogonal Function was used to reduce the data dimension, which enables the algal bloom dynamics for the entire TF to be modeled by one model. The results indicate that the data-driven model is capable of simulating interannual algal blooms with good predictive skill and of forecasting how algal blooms respond to changes in nutrient loading and environmental conditions. This study provides a link between a conceptual model and a dynamic model, and demonstrates that the data-driven model is a good approach for simulating algal blooms in the complex environment of the James River. The method is very efficient and can be applied to other estuaries as well.
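The dimension-reduction step mentioned in this abstract can be sketched with a standard Empirical Orthogonal Function (EOF) decomposition computed by singular value decomposition. This is a generic illustration, not the study's code: the array layout (rows are observation times, columns are stations), the function names, and the synthetic field are assumptions introduced here.

```python
import numpy as np

def eof_decompose(field, n_modes):
    """EOF analysis of a (time, station) array via SVD of the anomalies."""
    mean = field.mean(axis=0)
    anomalies = field - mean  # remove the time mean at each station
    U, s, Vt = np.linalg.svd(anomalies, full_matrices=False)
    explained = s**2 / np.sum(s**2)      # fraction of variance per mode
    eofs = Vt[:n_modes]                  # spatial patterns, one per row
    pcs = U[:, :n_modes] * s[:n_modes]   # time series of each pattern
    return mean, eofs, pcs, explained[:n_modes]

def eof_reconstruct(mean, eofs, pcs):
    # Rebuild the full field from the retained modes
    return pcs @ eofs + mean

# Synthetic field (hypothetical): one shared seasonal signal scaled per station
t = np.linspace(0.0, 4.0 * np.pi, 120)
field_demo = np.outer(np.sin(t), [1.0, 2.0, 3.0, 4.0]) + 5.0
mean_demo, eofs_demo, pcs_demo, var_demo = eof_decompose(field_demo, n_modes=1)
print("variance explained by mode 1:", var_demo[0])
```

A regression model can then be trained on the few leading principal-component time series instead of on every station, and the predicted components mapped back to all stations with `eof_reconstruct`.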

    Reconstructing Daily Discharge in a Megadelta Using Machine Learning Techniques

    In this study, six machine learning (ML) models, namely, random forest (RF), Gaussian process regression (GPR), support vector regression (SVR), decision tree (DT), least squares support vector machine (LSSVM), and multivariate adaptive regression spline (MARS) models, were employed to reconstruct the missing daily-averaged discharge in a mega-delta from 1980 to 2015 using upstream-downstream multi-station data. The performance and accuracy of each ML model were assessed and compared with the stage-discharge rating curves (RCs) using four statistical indicators, Taylor diagrams, violin plots, scatter plots, time-series plots, and heatmaps. Model input selection was performed using mutual information and correlation coefficient methods after three data pre-processing steps: normalization, Fourier series fitting, and first-order differencing. The results showed that the ML models are superior to their RC counterparts, and MARS and RF are the most reliable algorithms, although MARS achieves marginally better performance than RF. Compared to RC, MARS and RF reduced the root mean square error (RMSE) by 135% and 141% and the mean absolute error by 194% and 179%, respectively, using year-round data. However, the performance of MARS and RF developed for the climbing (wet season) and recession (dry season) limbs separately worsened slightly compared to that developed using the year-round data. Specifically, the RMSE of MARS and RF in the falling limb was 856 and 1,040 m³/s, respectively, while that obtained using the year-round data was 768 and 789 m³/s, respectively. In this study, the DT model is not recommended, while the GPR and SVR models provide acceptable results.

    In silico prediction of the granzyme B degradome

    DOI: 10.1186/1471-2164-12-S3-S11. 10th Int. Conference on Bioinformatics - 1st ISCB Asia Joint Conference 2011, InCoB 2011/ISCB-Asia 2011: Computational Biology - Proceedings from Asia Pacific Bioinformatics Network (APBioNet), vol. 12, Suppl. 3, S1

    Long-Term Forecasting of Strong Earthquakes in North America, South America, Japan, Southern China and Northern India With Machine Learning

    Strong earthquakes (magnitude ≥ 7) occur worldwide, affecting different cities and countries and causing great human, ecological and economic losses. The ability to forecast strong earthquakes on a long-term basis is essential to minimize the risks and vulnerabilities of people living in highly active seismic areas. We have studied seismic activity in North America, South America, Japan, Southern China and Northern India in search of patterns in strong earthquakes in each of these active seismic zones between 1900 and 2021, using the wavelet transform. We found that the primary seismic activity patterns for M ≥ 7 earthquakes are 55, 3.7, 7.7, and 8.6 years for the seismic zones of the southwestern United States and northern Mexico, southwestern Mexico, South America, and Southern China-Northern India, respectively. In the case of Japan, the most important seismic pattern for earthquakes with magnitude 7 ≤ M < 8 is 4.1 years, and for strong earthquakes with M ≥ 8 it is 40 years. Every seismic pattern obtained clusters the earthquakes into historical intervals/episodes with and without strong earthquakes in the individually analyzed seismic zones. We want to clarify that intervals with no strong earthquakes do not imply the total absence of seismic activity, because earthquakes of lesser magnitude can occur within the same interval. From the information and patterns obtained from the wavelet analyses, we created a probabilistic, long-term earthquake prediction model for each seismic zone using the Bayesian Machine Learning method. We propose that the periods of occurrence of earthquakes in each seismic zone analyzed could be interpreted as the period in which stress builds up on different planes of a fault, until this energy is released through rupture along faults and fractures near the plate tectonic boundaries. 
Then a series of earthquakes can occur along the fault until the stress subsides and a new cycle begins. Our machine learning models predict a new period of strong earthquakes between 2040 ± 5 and 2057 ± 5, 2024 ± 1 and 2026 ± 1, 2026 ± 2 and 2031 ± 2, 2024 ± 2 and 2029 ± 2, and 2022 ± 1 and 2028 ± 2 for the five active seismic zones of the United States, Mexico, South America, Japan, and Southern China and Northern India, respectively. In addition, our methodology can be applied in areas where moderate earthquakes occur, as in the case of the Parkfield section of the San Andreas fault (California, United States). Our methodology explains why a moderate earthquake could never occur in 1988 ± 5 as proposed and why the long-awaited Parkfield earthquake event occurred in 2004. Furthermore, our model predicts that possible seismic events may occur between 2019 and 2031, with a high probability of earthquake events at Parkfield around 2025 ± 2 years.
    Affiliation: Velasco Herrera, Victor Manuel. Universidad Nacional Autónoma de México; Mexico
    Affiliation: Rossello, Eduardo Antonio. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Geociencias Básicas, Aplicadas y Ambientales de Buenos Aires. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Geociencias Básicas, Aplicadas y Ambientales de Buenos Aires; Argentina
    Affiliation: Orgeira, Maria Julia. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Geociencias Básicas, Aplicadas y Ambientales de Buenos Aires. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Geociencias Básicas, Aplicadas y Ambientales de Buenos Aires; Argentina
    Affiliation: Arioni, Lucas. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Geociencias Básicas, Aplicadas y Ambientales de Buenos Aires. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Geociencias Básicas, Aplicadas y Ambientales de Buenos Aires; Argentina
    Affiliation: Soon, Willie. Center for Environmental Research and Earth Sciences; United States
    Affiliation: Velasco, Graciela. Universidad Nacional Autónoma de México; Mexico
    Affiliation: Rosique de la Cruz, Laura. Universidad Nacional Autónoma de México; Mexico
    Affiliation: Zúñiga, Emmanuel. Universidad Nacional Autónoma de México; Mexico
    Affiliation: Vera, Carlos. Universidad Nacional Autónoma de México; Mexico

    Freshwater Algal Bloom Prediction by Support Vector Machine in Macau Storage Reservoirs

    Understanding and predicting the dynamic change of algae populations in freshwater reservoirs is particularly important, as algae release cyanotoxins, which are carcinogens that can affect public health. However, the highly complex nonlinearity of water variables and their interactions makes it difficult to model the growth of algae species. Recently, the support vector machine (SVM) was reported to have the advantages of requiring only a small number of samples, a high degree of prediction accuracy, and a long prediction period when solving nonlinear problems. In this study, SVM-based prediction and forecast models for phytoplankton abundance in Macau Storage Reservoir (MSR) are proposed. The models include the water parameters pH, SiO₂, alkalinity, bicarbonate (HCO₃⁻), dissolved oxygen (DO), total nitrogen (TN), UV254, turbidity, conductivity, nitrate, orthophosphate (PO₄³⁻), total phosphorus (TP), suspended solids (SS) and total organic carbon (TOC), selected by correlation analysis from 23 monthly water variables, with 8 years of data (2001–2008) for training and the most recent 3 years (2009–2011) for testing. The modeling results showed that the prediction and forecast powers were estimated as approximately 0.76 and 0.86, respectively, showing that SVM is an effective new approach for monitoring algal blooms in drinking water storage reservoirs.