14 research outputs found

    ICB-UMA at BioCreative VIII @ AMIA 2023 Task 2 SYMPTEMIST (Symptom TExt Mining Shared Task)

    No full text
    These working notes summarize the contribution of the ICB research group from the University of Malaga to the BioCreative VIII Workshop @ AMIA 2023, based on our participation in Task 2 - SympTEMIST. We took part in both subtasks: the recognition of symptom, sign, and clinical finding entities (subtask 1 - SymptomNER) and their normalization to the corresponding SNOMED CT concepts (subtask 2 - SymptomNorm). For subtask 1, we analyzed the performance of several BERT-based models tailored to Spanish clinical data. These models, fine-tuned on the SymptomNER corpus, achieved a precision of 0.804, a recall of 0.699, and an F1-score of 0.748 on the test set. For the SymptomNorm subtask, we incorporated recent strategies based on bi-encoder and cross-encoder models, in particular SapBERT models combined with FAISS for similarity search. Finally, the model's predictions were further refined using a gazetteer with more than 150,000 concepts. Our strategy achieved an accuracy of 0.58 on the test set. This article is part of the Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models (https://zenodo.org/doi/10.5281/zenodo.10103190).

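    Study of the influence of physiological and network connectivity factors on the correlation of activity between pairs of integrate-and-fire neurons (Estudio de la influencia de factores fisiológicos y de conectividad de red en la correlación de actividad entre pares de neuronas de integración y disparo)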

    No full text
    To learn about the computational properties of the central nervous system, it is important to understand how neurons are connected to one another. Different strategies (physiological, anatomical, etc.) can be used for this purpose, and a computational approach is also possible. In this work, simulations of neuronal circuits are carried out, for which a new integrate-and-fire neuron model with a high degree of realism has been developed. The main objective is to understand which physiological and network connectivity factors determine the correlated activity between pairs of neurons.

    Methionine residues around phosphorylation sites are preferentially oxidized in vivo under stress conditions.

    No full text
    Protein phosphorylation is one of the most prevalent and well-understood protein modifications. Oxidation of protein-bound methionine, which has been traditionally perceived as an inevitable damage derived from oxidative stress, is now emerging as another modification capable of regulating protein activity during stress conditions. However, the mechanism coupling oxidative signals to changes in protein function remains unknown. An appealing hypothesis is that methionine oxidation might serve as a rheostat to control phosphorylation. To investigate this potential crosstalk between phosphorylation and methionine oxidation, we have addressed the co-occurrence of these two types of modifications within the human proteome. Here, we show that nearly all (98%) proteins containing oxidized methionine were also phosphoproteins. Furthermore, phosphorylation sites were much closer to oxidized methionines when compared to non-oxidized methionines. This proximity between modification sites cannot be accounted for by their co-localization within unstructured clusters because it was faithfully reproduced in a smaller sample of structured proteins. We also provide evidence that the oxidation of methionine located within phosphorylation motifs is a highly selective process among stress-related proteins, which supports the hypothesis of crosstalk between methionine oxidation and phosphorylation as part of the cellular defence against oxidative stress

    A machine learning approach for predicting methionine oxidation sites

    No full text
    Background: The oxidation of protein-bound methionine to form methionine sulfoxide has traditionally been regarded as oxidative damage. However, recent evidence supports the view of this reversible reaction as a regulatory post-translational modification. The perception that methionine sulfoxidation may provide a mechanism for the redox regulation of a wide range of cellular processes has stimulated some proteomic studies. However, these experimental approaches are expensive and time-consuming. Therefore, computational methods designed to predict methionine oxidation sites are an attractive alternative. As a first approach to this matter, we have developed models based on random forests, support vector machines and neural networks, aimed at accurate prediction of sites of methionine oxidation. Results: Starting from published proteomic data regarding oxidized methionines, we created a hand-curated dataset formed by 113 unique polypeptides of known structure, containing 975 methionyl residues, 122 of which were oxidation-prone (positive dataset) and 853 of which were oxidation-resistant (negative dataset). We used a machine learning approach to generate predictive models from these datasets. Among the multiple features used in the classification task, some contributed substantially to the performance of the predictive models. In particular, (i) the solvent-accessible area of the methionine residue, (ii) the number of residues between the analyzed methionine and the next methionine found towards the N-terminus and (iii) the spatial distance between the sulfur atom of the analyzed methionine and the closest aromatic residue were among the most relevant features. Compared to the other classifiers we evaluated, random forests provided the best performance, with accuracy, sensitivity and specificity of 0.7468±0.0567, 0.6817±0.0982 and 0.7557±0.0721, respectively (mean ± standard deviation).
Conclusions: We present the first predictive models aimed at computationally detecting methionine sites that may become oxidized in vivo in response to oxidative signals. These models provide insights into the structural context in which a methionine residue becomes either oxidation-resistant or oxidation-prone. Furthermore, these models should be useful in prioritizing methionyl residues for further studies to determine their potential as regulatory post-translational modification sites.
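
The three figures reported above (accuracy, sensitivity, specificity) are all derived from a confusion matrix. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. The function name and the counts are hypothetical; only the class sizes (122 oxidation-prone, 853 oxidation-resistant) come from the abstract.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total      # fraction of all residues classified correctly
    sensitivity = tp / (tp + fn)      # fraction of oxidation-prone residues recovered
    specificity = tn / (tn + fp)      # fraction of oxidation-resistant residues recovered
    return accuracy, sensitivity, specificity


# Hypothetical counts for illustration only (NOT results from the paper).
# They respect the class sizes in the abstract: 80 + 42 = 122 and 640 + 213 = 853.
acc, sens, spec = classification_metrics(tp=80, fn=42, tn=640, fp=213)
print(f"accuracy={acc:.4f} sensitivity={sens:.4f} specificity={spec:.4f}")
```

Reporting sensitivity and specificity next to accuracy matters with classes this unbalanced, because accuracy alone is dominated by the larger negative class.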

    Transfer learning with convolutional neural networks for cancer survival prediction using gene-expression data.

    No full text
    Precision medicine in oncology aims at obtaining data from heterogeneous sources to produce a precise estimate of a given patient's state and prognosis. With the purpose of advancing toward a personalized medicine framework, accurate diagnoses allow the prescription of more effective treatments adapted to the specificities of each individual case. In recent years, next-generation sequencing has propelled cancer research by providing physicians with an overwhelming amount of gene-expression data from RNA-seq high-throughput platforms. In this scenario, data mining and machine learning techniques have widely contributed to gene-expression data analysis by supplying computational models to support decision-making on real-world data. Nevertheless, existing public gene-expression databases are characterized by an unfavorable imbalance between the huge number of genes (on the order of tens of thousands) and the small number of samples (on the order of a few hundred) available. Although diverse feature selection and extraction strategies have traditionally been applied to overcome the resulting over-fitting issues, the efficacy of standard machine learning pipelines is far from satisfactory for the prediction of relevant clinical outcomes such as follow-up end-points or patient survival. Using the public Pan-Cancer dataset, in this study we pre-train convolutional neural network architectures for survival prediction on a subset composed of thousands of gene-expression samples from thirty-one tumor types. The resulting architectures are subsequently fine-tuned to predict lung cancer progression-free interval. The application of convolutional networks to gene-expression data has many limitations, derived from the unstructured nature of these data. In this work we propose a methodology to rearrange RNA-seq data by transforming RNA-seq samples into gene-expression images, from which convolutional networks can extract high-level features.
As an additional objective, we investigate whether leveraging the information extracted from other tumor-type samples contributes to the extraction of high-level features that improve lung cancer progression prediction, compared to other machine learning approaches.

    Transformers for Clinical Coding in Spanish

    No full text
    Automatic clinical coding is an essential task in the process of extracting relevant information from unstructured documents contained in electronic health records (EHRs). However, most research in the development of computer-based methods for clinical coding focuses on texts written in English due to the limited availability of medical linguistic resources in languages other than English. With nearly 500 million native speakers, there is a worldwide interest in processing healthcare texts in Spanish. In this study, we systematically analyzed transformer-based models for automatic clinical coding in Spanish. Using a transfer learning-based approach, the three existing transformer architectures that support the Spanish language, namely, multilingual BERT (mBERT), BETO and XLM-RoBERTa (XLM-R), were first pretrained on a corpus of real-world oncology clinical cases with the goal of adapting transformers to the particularities of Spanish medical texts. The resulting models were fine-tuned on three distinct clinical coding tasks, following a multilabel sentence classification strategy. For each analyzed transformer, the domain-specific version outperformed the original general domain model across those tasks. Moreover, the combination of the developed strategy with an ensemble approach leveraging the predictive capacities of the three distinct transformers yielded the best obtained results, with MAP scores of 0.662, 0.544 and 0.884 on CodiEsp-D, CodiEsp-P and Cantemist-Coding shared tasks, which remarkably improved the previous state-of-the-art performance by 11.6%, 10.3% and 4.4%, respectively.
We publicly release the mBERT, BETO and XLM-R transformers adapted to the Spanish clinical domain at https://github.com/guilopgar/ClinicalCodingTransformerES, providing the clinical natural language processing community with advanced deep learning methods for performing medical coding and other tasks in the Spanish clinical domain. This work was supported in part by the Ministerio de Economía y Empresa (MINECO), Plan Nacional de I+D+I, under Project TIN2017-88728-C2-1-R, in part by the Andalucía TECH, under Project UMA-CEIATECH-01, in part by the Universidad de Málaga and the Consorcio de Bibliotecas Universitarias de Andalucía (CBUA), and in part by the Plan Andaluz de Investigación, Desarrollo e Innovación (PAIDI), Junta de Andalucía.
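
The MAP scores quoted above refer to mean average precision over ranked code predictions. As a rough sketch of how such a score is commonly computed, the following uses a textbook definition; the official shared-task evaluation scripts may differ in details, and all function names, document ids and codes here are hypothetical.

```python
def average_precision(ranked_codes, gold_codes):
    """Average precision of one ranked list of predicted codes against a gold set."""
    gold = set(gold_codes)
    if not gold:
        return 0.0
    hits = 0
    precision_sum = 0.0
    seen = set()
    for rank, code in enumerate(ranked_codes, start=1):
        # Count each gold code at most once, at the first rank where it appears.
        if code in gold and code not in seen:
            hits += 1
            precision_sum += hits / rank
        seen.add(code)
    return precision_sum / len(gold)


def mean_average_precision(predictions, gold):
    """Mean of per-document average precision; documents with no prediction score 0."""
    return sum(
        average_precision(predictions.get(doc_id, []), codes)
        for doc_id, codes in gold.items()
    ) / len(gold)


# Hypothetical documents and codes, for illustration only.
gold = {"doc1": ["c1", "c2"], "doc2": ["z"]}
predictions = {"doc1": ["c1", "x", "c2"], "doc2": ["y"]}
map_score = mean_average_precision(predictions, gold)
print(round(map_score, 4))
```

Because the metric rewards placing correct codes near the top of the list, it suits multilabel coding tasks where each document can carry several codes.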

    BLASSO: integration of biological knowledge into a regularized linear model

    Background: In RNA-Seq gene expression analysis, a genetic signature or biomarker is defined as a subset of genes that is probably involved in a given complex human trait and usually provides predictive capabilities for that trait. The discovery of new genetic signatures is challenging, as it entails the analysis of complex information encoded at the gene level. Moreover, biomarker selection becomes unstable because the thousands of genes included in each sample are usually highly correlated, which results in very low overlap between the genetic signatures proposed by different authors. To address this, this paper proposes BLASSO, a simple and highly interpretable linear model with l1-regularization that incorporates prior biological knowledge into the prediction of breast cancer outcomes. Two different approaches to integrating biological knowledge in BLASSO, Gene-specific and Gene-disease, are proposed to test their predictive performance and biomarker stability on a public RNA-Seq gene expression dataset for breast cancer. The relevance of the genetic signature for the model is inspected by a functional analysis. Results: BLASSO has been compared with a baseline LASSO model. Using 10-fold cross-validation with 100 repetitions for model assessment, average AUC values of 0.7 and 0.69 were obtained for the Gene-specific and the Gene-disease approaches, respectively. These values outperform the average AUC of 0.65 obtained with the LASSO. With respect to the stability of the genetic signatures found, BLASSO outperformed the baseline model in terms of the robustness index (RI). The Gene-specific approach gave an RI of 0.15±0.03, compared to an RI of 0.09±0.03 given by LASSO, thus being about 66% more robust.
The functional analysis performed on the genetic signature obtained with the Gene-disease approach showed a significant presence of genes related to cancer, as well as one gene (IFNK) and one pseudogene (PCNAP1) which a priori had not been described as related to cancer. Conclusions: BLASSO has been shown to be a good choice both in terms of predictive efficacy and biomarker stability, when compared to other similar approaches. Further functional analyses of the genetic signatures obtained with BLASSO have revealed not only genes with important roles in cancer, but also genes that may play a previously unknown or collateral role in the studied disease.

    Named Entity Recognition for De-identifying Real-World Health Records in Spanish.

    No full text
    A growing and renewed interest has emerged in Electronic Health Records (EHRs) as a source of information for decision-making in clinical practice. In this context, the automatic de-identification of EHRs constitutes an essential task, since their dissociation from personal data is a mandatory first step before their distribution. However, the majority of previous studies on this subject have been conducted on English EHRs, due to the limited availability of annotated corpora in other languages, such as Spanish. In this study, we addressed the automatic de-identification of medical documents in Spanish. A private corpus of 599 real-world clinical cases has been annotated with 8 different protected health information categories. We have tackled the predictive problem as a named entity recognition task, developing two different deep learning-based methodologies, namely a first strategy based on recurrent neural networks (RNN) and an end-to-end approach based on transformers. Additionally, we have developed a data augmentation procedure to increase the number of texts used to train the models. The results obtained show that transformers outperform RNN on the de-identification of Spanish clinical data. In particular, the best performance was obtained by the XLM-RoBERTa large transformer, with a strict-match micro-averaged value of 0.946 for precision, 0.954 for recall and 0.95 for F1-score, when trained on the augmented version of the corpus.
The performance achieved by transformers in this study demonstrates the viability of applying these state-of-the-art models in real-world clinical scenarios. The authors acknowledge the support from the Ministerio de Economía y Empresa (MINECO) through grant TIN2017-88728-C2-1-R, from the Ministerio de Ciencia e Innovación (MICINN) under project PID2020-116898RB-I00, from the Universidad de Málaga and Junta de Andalucía through grant UMA20-FEDERJA-045, from the Malaga-Pfizer consortium for AI research in Cancer - MAPIC, from the Instituto de Investigación Biomédica de Málaga - IBIMA (all including FEDER funds) and from Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.
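
The strict-match, micro-averaged scores mentioned above count a predicted entity as correct only when its document, span boundaries and category all coincide with a gold annotation, and pool the counts over all documents before computing the ratios. A minimal sketch of that computation follows; it is not the authors' evaluation script, and the tuple layout, function name and example entities are assumptions made for illustration.

```python
def strict_micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 under strict (exact) matching.

    Each entity is a (doc_id, start, end, label) tuple; a prediction is a
    true positive only if the whole tuple is present in the gold set.
    """
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical annotations: the third prediction misses the end offset by one
# character, so strict matching treats it as an error.
gold = [("d1", 0, 5, "NAME"), ("d1", 10, 20, "DATE"), ("d2", 3, 9, "LOCATION")]
pred = [("d1", 0, 5, "NAME"), ("d1", 10, 20, "DATE"), ("d2", 3, 10, "LOCATION")]
p, r, f = strict_micro_prf(gold, pred)

# Consistency check on the figures quoted in the abstract: the harmonic mean
# of 0.946 and 0.954 rounds to the reported F1-score of 0.95.
reported_f1 = 2 * 0.946 * 0.954 / (0.946 + 0.954)
print(round(p, 3), round(r, 3), round(f, 3), round(reported_f1, 2))
```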