    Performance of Ensemble Classification for Agricultural and Biological Science Journals with Scopus Index

    The ensemble method is considered an advanced method in both prediction and classification, and it is expected to produce better output than previous classification methods. This article evaluates the performance of ensembles in classifying journal quartiles. The subject of agriculture was chosen because Indonesia is an agricultural country and researcher interest in this field is strong. The data were downloaded from the Scimago Journal and Country Rank, accumulated for 2020, and comprise 2144 instances. Labels have four classes: Q1, Q2, Q3, and Q4. The ensembles applied are Boosting and Bagging with Decision Tree (DT) and Gaussian Naïve Bayes (GNB) algorithms. The Boosting meta-ensembles used are Adaboost and XGBoost. In this study, the Bagging Decision Tree has the highest accuracy score at 71.36, followed by XGBoost Decision Tree with 69.51, XGBoost Gaussian Naïve Bayes with 68.82, Adaboost Decision Tree with 60.42, Adaboost Gaussian Naïve Bayes with 58.2, and Bagging Gaussian Naïve Bayes with 56.12. The Bagging Decision Tree is therefore the ensemble method that works best for this subject classification. Even so, the result suggests that the ensemble method can still fail to produce an ideal outcome that approaches the SJR system.
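
The bagging idea in the abstract above (train many base learners on bootstrap samples, then vote) can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the authors' pipeline: the one-feature threshold "stump" stands in for the Decision Tree base learner, and the toy data, function names, and parameter values are all hypothetical.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a one-feature threshold classifier (a stand-in for a decision tree).

    points: list of (x, label) pairs with label in {0, 1}.
    Returns (threshold, label_below, label_above) with the fewest training errors.
    """
    best = None
    for threshold in sorted({x for x, _ in points}):
        for below, above in ((0, 1), (1, 0)):
            errors = sum(
                (below if x <= threshold else above) != y for x, y in points
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    return best[1:]

def predict_stump(stump, x):
    threshold, below, above = stump
    return below if x <= threshold else above

def fit_bagging(points, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [
        fit_stump([rng.choice(points) for _ in points])
        for _ in range(n_estimators)
    ]

def predict_bagging(stumps, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, x) for stump in stumps)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: small x values belong to class 0, large ones to class 1.
data = [(0.1, 0), (0.4, 0), (0.8, 0), (1.2, 0),
        (2.9, 1), (3.3, 1), (3.7, 1), (4.0, 1)]
model = fit_bagging(data)
```

Boosting differs in that each new learner is trained with more weight on the examples the previous learners got wrong, rather than on an independent bootstrap sample.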

    IDENTIFYING MOLECULAR FUNCTIONS OF DYNEIN MOTOR PROTEINS USING EXTREME GRADIENT BOOSTING ALGORITHM WITH MACHINE LEARNING

    The majority of cytoplasmic proteins and vesicles move actively, primarily owing to dynein motor proteins, which are the cause of muscle contraction. Moreover, identifying how dyneins are used in cells will rely on structural knowledge. Cytoskeletal motor proteins have different molecular roles and structures, and they belong to three superfamilies of dynamin, actin and myosin. Loss of function of specific molecular motor proteins can be attributed to a number of human diseases, such as Charcot-Charcot-Dystrophy and kidney disease. It is crucial to create a precise model to identify dynein motor proteins in order to aid scientists in understanding their molecular role and designing therapeutic targets based on their influence on human disease. Therefore, an accurate and efficient computational methodology is highly desired, especially one using cutting-edge machine learning methods. In this article, we propose a machine learning-based method for predicting the superfamily of cytoskeletal motor proteins, built on extreme gradient boosting (XGBoost). We obtain the initial feature set, All, by extracting protein features from the sequence and from the evolutionary data of the amino acid residues, named BLOSUM62. Our XGBoost model achieves an accuracy of 0.8676, a precision of 0.8768, a sensitivity of 0.760, a specificity of 0.9752 and an MCC of 0.7536. Our method demonstrates substantial improvements on many of the evaluation parameters compared to other state-of-the-art methods. This study offers an effective model for the classification of dynein proteins and lays a foundation for further research to improve the efficiency of protein functional classification.
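
The evaluation measures named in the abstract above (accuracy, precision, sensitivity, specificity and MCC) all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the counts used in the example are hypothetical and are not taken from the paper.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes are present and
    both labels are predicted at least once).
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)      # also called recall or true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    mcc = (tp * tn - fp * fn) / mcc_denominator
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }

# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

Accuracy, precision, sensitivity and specificity lie between 0 and 1, while MCC lies between -1 and 1, which is why these scores are usually reported as fractions rather than percentages.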

    South German Credit Data Classification Using Random Forest Algorithm to Predict Bank Credit Receipts

    Normally, most of a bank's wealth is obtained from providing credit loans, so a marketing bank must be able to reduce the risk of non-performing credit loans. The risk of providing loans can be minimized by studying patterns in existing lending data. One way to solve this problem is to use data mining techniques. Data mining makes it possible to find hidden information in large data sets by way of classification. The Random Forest (RF) algorithm is a classification algorithm that can be used to deal with data imbalance problems. The purpose of this study is to discuss the use of the RF algorithm for classification of South German Credit data. This research is needed because no previous research applies the RF algorithm specifically to classify South German Credit data. Based on the tests carried out, the RF classifier performs best on South German Credit data with a split of 85% training data and 15% testing data, reaching an accuracy of 78.33%.
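
The 85%/15% partition mentioned in the abstract above can be sketched as a shuffled hold-out split. This is a minimal illustration, not the authors' procedure: the function name, the seed, and the 1000 placeholder rows are assumptions made for the example.

```python
import random

def split_train_test(rows, test_fraction=0.15, seed=42):
    """Shuffle a copy of the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]                # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical placeholder rows standing in for credit records.
rows = list(range(1000))
train, test = split_train_test(rows)
```

Comparing several split ratios, as the abstract describes, amounts to calling the function with different `test_fraction` values and evaluating the classifier on each held-out part.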

    Leveraging Machine Learning Models for Peptide-Protein Interaction Prediction

    Peptides play a pivotal role in a wide range of biological activities by participating in up to 40% of protein-protein interactions in cellular processes. They also demonstrate remarkable specificity and efficacy, making them promising candidates for drug development. However, predicting peptide-protein complexes by traditional computational approaches, such as docking and molecular dynamics simulations, remains a challenge due to the high computational cost, the flexible nature of peptides, and the limited structural information on peptide-protein complexes. In recent years, the surge of available biological data has given rise to an increasing number of machine learning models for predicting peptide-protein interactions. These models offer efficient solutions to the challenges associated with traditional computational approaches. Furthermore, they offer enhanced accuracy, robustness, and interpretability in their predictive outcomes. This review presents a comprehensive overview of machine learning and deep learning models that have emerged in recent years for the prediction of peptide-protein interactions. Comment: 46 pages, 10 figures

    Machine Learning Approaches for the Prioritisation of Cardiovascular Disease Genes Following Genome-wide Association Study

    Genome-wide association studies (GWAS) have revealed thousands of genetic loci, establishing themselves as a valuable method for unravelling the complex biology of many diseases. Although GWAS have grown in size and improved in study design to detect effects, identifying real causal signals and disentangling them from other highly correlated markers associated by linkage disequilibrium (LD) remains challenging. This has severely limited GWAS findings and brought the method’s value into question. Although thousands of disease susceptibility loci have been reported, causal variants and genes at these loci remain elusive. Post-GWAS analysis aims to dissect the heterogeneity of variant and gene signals. In recent years, machine learning (ML) models have been developed for post-GWAS prioritisation. ML models have ranged from logistic regression to more complex ensemble models such as random forests and gradient boosting, as well as deep learning models (i.e., neural networks). When combined with functional validation, these methods have yielded important translational insights, providing a strong evidence-based approach to direct post-GWAS research. However, ML approaches are in their infancy across biological applications, and as they continue to evolve, an evaluation of their robustness for GWAS prioritisation is needed. Here, I investigate the landscape of ML across selected models, input features, bias risk, and output model performance, with a focus on building a prioritisation framework that is applied to blood pressure GWAS results and tested on re-application to blood lipid traits.

    T Cell Receptor Protein Sequences and Sparse Coding: A Novel Approach to Cancer Classification

    Cancer is a complex disease characterized by uncontrolled cell growth and proliferation. T cell receptors (TCRs) are essential proteins for the adaptive immune system, and their specific recognition of antigens plays a crucial role in the immune response against diseases, including cancer. The diversity and specificity of TCRs make them ideal for targeting cancer cells, and recent advancements in sequencing technologies have enabled the comprehensive profiling of TCR repertoires. This has led to the discovery of TCRs with potent anti-cancer activity and the development of TCR-based immunotherapies. In this study, we investigate the use of sparse coding for the multi-class classification of TCR protein sequences with cancer categories as target labels. Sparse coding is a popular technique in machine learning that enables the representation of data with a set of informative features and can capture complex relationships between amino acids and identify subtle patterns in the sequence that might be missed by low-dimensional methods. We first compute the k-mers from the TCR sequences and then apply sparse coding to capture the essential features of the data. To improve the predictive performance of the final embeddings, we integrate domain knowledge regarding different types of cancer properties. We then train different machine learning (linear and non-linear) classifiers on the embeddings of TCR sequences for the purpose of supervised analysis. Our proposed embedding method on a benchmark dataset of TCR sequences significantly outperforms the baselines in terms of predictive performance, achieving an accuracy of 99.8%. Our study highlights the potential of sparse coding for the analysis of TCR protein sequences in cancer research and other related fields.
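
The first step described in the abstract above, computing k-mers from a sequence, can be sketched as a sliding window. The example covers only k-mer extraction and counting; it does not reproduce the sparse coding or the domain-knowledge features of the paper. The function names, the choice of k = 3, and the sample sequence are assumptions made for illustration.

```python
from collections import Counter

def kmers(sequence, k=3):
    """Return every overlapping substring of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def kmer_counts(sequence, k=3):
    """Count k-mers; only k-mers that occur are stored, so the result is sparse."""
    return Counter(kmers(sequence, k))

# Hypothetical amino-acid sequence used purely as an example input.
sequence = "CASSLGQAYEQYF"
counts = kmer_counts(sequence)
```

A sequence of length n yields n - k + 1 overlapping k-mers, so the 13-residue example above produces 11 three-mers.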

    MULTI-DIMENSIONAL INTERROGATION OF DNA MUTATIONS IN CANCER

    Ph.D., Doctor of Philosophy

    Diagnosing hospital bacteraemia in the framework of predictive, preventive and personalised medicine using electronic health records and machine learning classifiers

    Background: Predicting bacteraemia is relevant because sepsis is one of the most important causes of morbidity and mortality. Bacteraemia prognosis depends primarily on a rapid diagnosis. Predicting bacteraemia would shorten the diagnosis by up to 6 days and, in conjunction with individual patient variables, should be considered when starting the early administration of personalised antibiotic treatment and medical services, selecting specific diagnostic techniques and determining additional treatments, such as surgery, that would prevent subsequent complications. Machine learning techniques could help physicians make these informed decisions by predicting bacteraemia using the data already available in electronic hospital records. Objective: This study presents the application of machine learning techniques to these records to predict the blood culture’s outcome, which would reduce the lag in starting a personalised antibiotic treatment and the medical costs associated with erroneous treatments due to conservative assumptions about blood culture outcomes. Methods: Six supervised classifiers were created using three machine learning techniques, Support Vector Machine, Random Forest and K-Nearest Neighbours, on the electronic health records of hospital patients. The best approach to handle missing data was chosen and, for each machine learning technique, two classification models were created: the first uses the features known at the time of blood extraction, whereas the second uses four extra features revealed during the blood culture. Results: The six classifiers were trained and tested using a dataset of 4357 patients with 117 features per patient. In the best case, the models reach a state-of-the-art accuracy of 85.9%, a sensitivity of 87.4% and an AUC of 0.93. Conclusions: Our results provide cutting-edge metrics of interest in predictive medical models, with values that exceed the medical practice threshold and previous results in the literature obtained with classical modelling techniques in specific types of bacteraemia. Additionally, the consistency of the results is reasserted because the feature importance rankings of the three classifiers show similar features, which coincide with those that physicians use in their manual heuristics. Therefore, the efficacy of these machine learning techniques confirms their viability to assist in the aims of predictive and personalised medicine once the disease presents bacteraemia-compatible symptoms, and to assist in improving the healthcare economy.
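
The AUC reported in the abstract above is the area under the ROC curve, which equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it from that pairwise definition; the labels and scores are hypothetical and unrelated to the study's data.

```python
def roc_auc(labels, scores):
    """Compute ROC AUC as the fraction of positive/negative pairs ranked correctly.

    labels: iterable of 0/1 ground-truth values.
    scores: iterable of classifier scores, higher meaning "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0           # positive ranked above negative
            elif p == n:
                wins += 0.5           # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

# Hypothetical predictions, for illustration only.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)
```

The pairwise loop is quadratic in the number of cases, which is fine for a sketch; larger datasets are usually handled with a rank-based formula instead.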

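    Priorización de genes y búsqueda de dianas terapéuticas por medio de herramientas informáticas y técnicas de aprendizaje automatizado en cáncer de mama [Gene prioritization and search for therapeutic targets using computational tools and machine learning techniques in breast cancer]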

    Official Doctoral Programme in Information and Communication Technologies. 5032V01. Thesis by compendium of publications. [Abstract] Breast cancer (BC) is the leading cause of cancer-related death among women and the most commonly diagnosed cancer worldwide. BC is a heterogeneous disease in which genomic alterations, protein expression deregulation, signaling pathway alterations, hormone disruption, ethnicity and environmental determinants are involved. Despite the technological and scientific advances in recent years, an understanding of molecular processes, the identification of new therapeutic targets and the prediction of proteins involved in immunotherapy, metastasis, and RNA binding is essential for drug development and application of precision medicine in clinical practice. The current thesis proposes the development of a highly efficient consensus strategy for the recognition of genes and proteins associated with BC; the oncological validation of these prioritized genes and proteins using the OncoOmics strategy, which consisted of the analysis of outstanding experimental databases; the identification of oncogenic mutations and essential drugs for the development and application of precision medicine; and the prediction of BC proteins associated with immunotherapy, metastasis and RNA binding using bioinformatics tools and artificial intelligence methods. All results were published in international journals with a significant impact factor.