Performance of Ensemble Classification for Agricultural and Biological Science Journals with Scopus Index
The ensemble method is considered an advanced method for both prediction and classification, and it is expected to produce better output than earlier, single-model classification methods. This article evaluates the performance of ensembles in classifying journal quartiles. The subject of agriculture was chosen because Indonesia is an agricultural country and researchers show strong interest in this field. The data were downloaded from the Scimago Journal and Country Rank, accumulated for 2020. Labels have four classes: Q1, Q2, Q3, and Q4. The ensembles applied are Boosting and Bagging with Decision Tree (DT) and Gaussian Naïve Bayes (GNB) base algorithms, on a dataset of 2144 instances. The Boosting meta-ensembles used are AdaBoost and XGBoost. In this study, the Bagging Decision Tree has the highest accuracy score at 71.36, followed by XGBoost Decision Tree with 69.51, XGBoost Gaussian Naïve Bayes with 68.82, AdaBoost Decision Tree with 60.42, AdaBoost Gaussian Naïve Bayes with 58.2, and Bagging Gaussian Naïve Bayes with 56.12. The Bagging Decision Tree is therefore the ensemble method that works best for classification in this subject. The result also suggests that the ensemble method can still fail to produce an ideal outcome that approaches the SJR system.
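The bagging idea in this abstract (train many base learners on bootstrap resamples, then take a majority vote) can be sketched in a few lines. This is a minimal illustration and not the authors' pipeline: the depth-1 decision tree (a "stump"), the single made-up feature, the toy data and all function names are assumptions introduced here.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Fit a depth-1 decision tree: one feature, one threshold."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            left_label = Counter(left).most_common(1)[0][0]
            # If nothing falls to the right, reuse the left label.
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    _, j, t, left_label, right_label = best
    return lambda row: left_label if row[j] <= t else right_label

def fit_bagging(X, y, n_estimators=25, seed=0):
    """Train one stump per bootstrap resample of the training set."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_bagging(models, row):
    """Majority vote over all base learners."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): one numeric feature, two of the four quartile labels.
X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y = ["Q4", "Q4", "Q4", "Q1", "Q1", "Q1"]
models = fit_bagging(X, y)
print(predict_bagging(models, [0.05]), predict_bagging(models, [0.95]))
```

Boosting differs in that base learners are trained sequentially, each one focusing on the examples the previous ones misclassified, rather than independently on resamples.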
Identifying Molecular Functions of Dynein Motor Proteins Using Extreme Gradient Boosting Algorithm with Machine Learning
The active movement of most cytoplasmic proteins and vesicles is due primarily to dynein motor proteins, which are the cause of muscle contraction. Moreover, identifying how dyneins are used in cells will rely on structural knowledge. Cytoskeletal motor proteins have different molecular roles and structures, and they belong to three superfamilies: dynein, kinesin and myosin. Loss of function of specific molecular motor proteins can be linked to a number of human diseases, such as Charcot-Marie-Tooth disease and kidney disease. It is crucial to create a precise model to identify dynein motor proteins in order to aid scientists in understanding their molecular role and designing therapeutic targets based on their influence on human disease. Therefore, an accurate and efficient computational methodology is highly desired, especially one using cutting-edge machine learning methods. In this article, we propose a machine learning-based method for predicting cytoskeletal motor protein superfamilies, using extreme gradient boosting (XGBoost). We obtain the initial feature set, All, by extracting protein features from the sequence and from the evolutionary information of the amino acid residues, encoded with BLOSUM62. Our XGBoost model achieves an accuracy of 0.8676, a precision of 0.8768, a sensitivity of 0.760, a specificity of 0.9752 and an MCC of 0.7536. Our method demonstrates substantial improvements on many of the evaluation parameters compared to other state-of-the-art methods. This study offers an effective model for the classification of dynein proteins and lays a foundation for further research to improve the efficiency of protein functional classification.
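The five evaluation scores named in this abstract all derive from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name and the counts are hypothetical and are not taken from the paper.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion counts."""
    def ratio(num, den):
        # Guard against empty classes or empty predictions.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),   # recall, true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }

# Hypothetical counts, chosen only for illustration.
m = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

MCC is useful alongside accuracy because it stays informative when the two classes are very different in size.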
South German Credit Data Classification Using Random Forest Algorithm to Predict Bank Credit Receipts
Normally, most of a bank's wealth comes from providing credit loans, so the bank must be able to reduce the risk of non-performing loans. The risk of providing loans can be minimized by studying patterns in existing lending data. One way to do this is to use data mining techniques, which make it possible to find hidden information in large data sets by way of classification. The Random Forest (RF) algorithm is a classification algorithm that can be used to deal with imbalanced data. The purpose of this study is to discuss the use of the RF algorithm for classification of the South German Credit data. This research is needed because no previous research applies the RF algorithm specifically to the South German Credit data. Based on the tests carried out, the RF classifier performs best on the South German Credit data with a split of 85% training data and 15% testing data, reaching an accuracy of 78.33%.
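The 85%/15% split and the accuracy score mentioned above can be illustrated as follows. This is a sketch under stated assumptions: the helper names, the seed and the placeholder rows are hypothetical, and no Random Forest is trained here.

```python
import random

def train_test_split(rows, test_fraction=0.15, seed=0):
    """Shuffle the rows and hold out `test_fraction` of them for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Placeholder data: 200 row indices standing in for credit records.
train, test = train_test_split(range(200))
print(len(train), len(test))

# Accuracy on made-up labels, purely to show the calculation.
print(accuracy(["good", "bad", "good", "good"], ["good", "bad", "bad", "good"]))
```

Comparing several split ratios, as the study does, amounts to repeating this with different values of `test_fraction` and keeping the ratio with the highest test accuracy.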
Leveraging Machine Learning Models for Peptide-Protein Interaction Prediction
Peptides play a pivotal role in a wide range of biological activities by
participating in up to 40% of protein-protein interactions in cellular processes.
They also demonstrate remarkable specificity and efficacy, making them
promising candidates for drug development. However, predicting peptide-protein
complexes by traditional computational approaches, such as Docking and
Molecular Dynamics simulations, still remains a challenge due to high
computational cost, flexible nature of peptides, and limited structural
information of peptide-protein complexes. In recent years, the surge of
available biological data has given rise to the development of an increasing
number of machine learning models for predicting peptide-protein interactions.
These models offer efficient solutions to address the challenges associated
with traditional computational approaches. Furthermore, they offer enhanced
accuracy, robustness, and interpretability in their predictive outcomes. This
review presents a comprehensive overview of machine learning and deep learning
models that have emerged in recent years for the prediction of peptide-protein
interactions.
Comment: 46 pages, 10 figures
Machine Learning Approaches for the Prioritisation of Cardiovascular Disease Genes Following Genome-wide Association Study
Genome-wide association studies (GWAS) have revealed thousands of genetic loci, establishing themselves as a valuable method for unravelling the complex biology of many diseases. Although GWAS have grown in size and improved in study design to detect effects, identifying real causal signals and disentangling them from other highly correlated markers associated by linkage disequilibrium (LD) remains challenging. This has severely limited GWAS findings and brought the method’s value into question. Although thousands of disease susceptibility loci have been reported, causal variants and genes at these loci remain elusive. Post-GWAS analysis aims to dissect the heterogeneity of variant and gene signals. In recent years, machine learning (ML) models have been developed for post-GWAS prioritisation. ML models have ranged from logistic regression to more complex ensemble models such as random forests and gradient boosting, as well as deep learning models (i.e., neural networks). When combined with functional validation, these methods have yielded important translational insights, providing a strong evidence-based approach to direct post-GWAS research. However, ML approaches are in their infancy across biological applications, and as they continue to evolve, an evaluation of their robustness for GWAS prioritisation is needed. Here, I investigate the landscape of ML across selected models, input features, bias risk, and output model performance, with a focus on building a prioritisation framework that is applied to blood pressure GWAS results and tested on re-application to blood lipid traits.
T Cell Receptor Protein Sequences and Sparse Coding: A Novel Approach to Cancer Classification
Cancer is a complex disease characterized by uncontrolled cell growth and
proliferation. T cell receptors (TCRs) are essential proteins for the adaptive
immune system, and their specific recognition of antigens plays a crucial role
in the immune response against diseases, including cancer. The diversity and
specificity of TCRs make them ideal for targeting cancer cells, and recent
advancements in sequencing technologies have enabled the comprehensive
profiling of TCR repertoires. This has led to the discovery of TCRs with potent
anti-cancer activity and the development of TCR-based immunotherapies. In this
study, we investigate the use of sparse coding for the multi-class
classification of TCR protein sequences with cancer categories as target
labels. Sparse coding is a popular technique in machine learning that enables
the representation of data with a set of informative features and can capture
complex relationships between amino acids and identify subtle patterns in the
sequence that might be missed by low-dimensional methods. We first compute the
k-mers from the TCR sequences and then apply sparse coding to capture the
essential features of the data. To improve the predictive performance of the
final embeddings, we integrate domain knowledge regarding different types of
cancer properties. We then train different machine learning (linear and
non-linear) classifiers on the embeddings of TCR sequences for the purpose of
supervised analysis. Our proposed embedding method on a benchmark dataset of
TCR sequences significantly outperforms the baselines in terms of predictive
performance, achieving an accuracy of 99.8%. Our study highlights the
potential of sparse coding for the analysis of TCR protein sequences in cancer
research and other related fields
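
The first step described in this abstract, computing k-mers from a sequence, can be sketched as below. The function names, the choice k = 3 and the example string are assumptions made for illustration; the sparse coding step and the domain-knowledge features are not reproduced here.

```python
from collections import Counter

def kmers(sequence, k=3):
    """Return all overlapping substrings of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def kmer_counts(sequence, k=3):
    """Count how often each k-mer occurs in the sequence."""
    return Counter(kmers(sequence, k))

# Hypothetical amino-acid string, used only as an example input.
example = "CASSLGQ"
print(kmers(example, 3))
print(kmer_counts("AAAA", 2))
```

A sequence of length n yields n - k + 1 k-mers, so sequences shorter than k produce an empty list.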
Evolutionary and deep mining models for effective biomarker discovery
With the advent of high-throughput biology, large amounts of molecular data are available for purposeful analysis and evaluation. Extracting relevant knowledge from high-throughput biomedical datasets has become a common goal of current approaches to personalised cancer medicine and understanding cancer genotype and phenotype. However, the datasets are characterised by high dimensionality and relatively small sample sizes with small signal-to-noise ratios. Extracting and interpreting relevant knowledge from such complex datasets therefore remains a significant challenge for the fields of machine learning and data mining. This is evidenced by the limited success these methods have had in detecting robust and reliable biomarkers for cancers and other complicated diseases. This could also explain the lack of finding generic biomarkers among the identified published genes for identical diseases or clinical conditions.
This thesis proposes and evaluates the efficacy of two novel feature mining models, built on the evolutionary computation and deep learning paradigms, that position and solve biomarker discovery as an optimisation problem. Deep learning methods lack the transparency and interpretability found in the evolutionary paradigm. To overcome the inherent issue of poor explanatory power associated with deep learning, this research also introduces a novel deep mining model that helps to deconstruct the internal state of such deep learning models to reveal key determinants underlying their latent representations to aid feature selection. As a result, salient biomarkers for breast cancer and the positivity of the Estrogen and Progesterone receptors are discovered robustly and validated reliably across a wide range of independently generated breast cancer data samples.
Diagnosing hospital bacteraemia in the framework of predictive, preventive and personalised medicine using electronic health records and machine learning classifiers
Background
Predicting bacteraemia is relevant because sepsis is one of the most important causes of morbidity and mortality. Bacteraemia prognosis primarily depends on a rapid diagnosis. Bacteraemia prediction would shorten the diagnosis by up to 6 days and, in conjunction with individual patient variables, should be considered when deciding on the early administration of personalised antibiotic treatment and medical services, the choice of specific diagnostic techniques and the determination of additional treatments, such as surgery, that would prevent subsequent complications. Machine learning techniques could help physicians make these informed decisions by predicting bacteraemia using the data already available in electronic hospital records.
Objective
This study presents the application of machine learning techniques to these records to predict the blood culture’s outcome, which would reduce the lag in starting a personalised antibiotic treatment and the medical costs associated with erroneous treatments due to conservative assumptions about blood culture outcomes.
Methods
Six supervised classifiers were created using three machine learning techniques, Support Vector Machine, Random Forest and K-Nearest Neighbours, on the electronic health records of hospital patients. The best approach to handle missing data was chosen and, for each machine learning technique, two classification models were created: the first uses the features known at the time of blood extraction, whereas the second uses four extra features revealed during the blood culture.
Results
The six classifiers were trained and tested using a dataset of 4357 patients with 117 features per patient. In the best case, the models reach a state-of-the-art accuracy of 85.9%, a sensitivity of 87.4% and an AUC of 0.93.
Conclusions
Our results provide cutting-edge metrics of interest in predictive medical models with values that exceed the medical practice threshold and previous results in the literature using classical modelling techniques in specific types of bacteraemia. Additionally, the consistency of results is reasserted because the three classifiers’ importance ranking shows similar features that coincide with those that physicians use in their manual heuristics. Therefore, the efficacy of these machine learning techniques confirms their viability to assist in the aims of predictive and personalised medicine once the disease presents bacteraemia-compatible symptoms and to assist in improving the healthcare economy
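The AUC reported in the Results can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the function name, labels and scores are hypothetical and unrelated to the study's data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` holds 1 for positive cases and 0 for negative cases.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up classifier scores for four cases (1 = positive blood culture).
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(auc)
```

The pairwise form is quadratic in the number of cases, which is fine for illustration; rank-based formulas give the same value more efficiently on large datasets.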
Gene prioritisation and therapeutic target discovery in breast cancer using computational tools and machine learning techniques
Official Doctoral Programme in Information and Communication Technologies. 5032V01. Thesis by compendium of publications.
Breast cancer (BC) is the leading cause of cancer-related death among women and the
most commonly diagnosed cancer worldwide. BC is a heterogeneous disease where genomic
alterations, protein expression deregulation, signaling pathway alterations, hormone
disruption, ethnicity and environmental determinants are involved. Despite the technological
and scientific advances in recent years, an understanding of molecular processes, the
identification of new therapeutic targets and the prediction of proteins involved in
immunotherapy, metastasis, and RNA binding is essential for drug development and
application of precision medicine in clinical practice. The current thesis proposes the
development of a highly efficient consensus strategy in the recognition of genes and proteins
associated with BC; the oncological validation of these prioritized genes and proteins using
the OncoOmics strategy, which consisted of the analysis of outstanding experimental
databases; the identification of oncogenic mutations and essential drugs for the development
and application of precision medicine; and the prediction of BC proteins associated with
immunotherapy, metastasis and RNA-binding using bioinformatics tools and artificial
intelligence methods. All results were published in international journals with a significant
impact factor.