Linguistically inspired roadmap for building biologically reliable protein language models

Authors: Akbar Rahmad; Greiff Victor; Haug Dag Trygve Truslew; Robert Philippe A.; Sandve Geir Kjetil; Swiatczak Bartlomiej; Vu Mai Ha
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 28/04/2023
Deep neural-network-based language models (LMs) are increasingly applied to large-scale protein sequence data to predict protein function. However, being largely black-box models and thus challenging to interpret, current protein LM approaches do not contribute to a fundamental understanding of sequence-function mappings, hindering rule-based biotherapeutic drug development. We argue that guidance drawn from linguistics, a field specialized in analytical rule extraction from natural language data, can aid with building more interpretable protein LMs that are more likely to learn relevant domain-specific rules. Differences between protein sequence data and linguistic sequence data require the integration of more domain-specific knowledge in protein LMs compared to natural language LMs. Here, we provide a linguistics-based roadmap for protein LM pipeline choices with regard to training data, tokenization, token embedding, sequence embedding, and model interpretation. Incorporating linguistic ideas into protein LMs enables the development of next-generation interpretable machine-learning models with the potential of uncovering the biological mechanisms underlying sequence-function relationships.

Comment: 27 pages, 4 figures

arXiv.org e-Print Archive

Regular Inference over Recurrent Neural Networks as a Method for Black Box Explainability

Author: Mayr Ojeda Franz
Publication venue: 'Universidad ORT Uruguay'
Publication date: 01/01/2019
Includes bibliography. This thesis explores the general problem of explaining the behavior of a recurrent neural network (RNN). The goal is to build a representation that improves human understanding of RNNs as sequence classifiers, in order to provide insight into the decision process behind classifying a sequence as positive or negative and, in turn, to enable further analysis of these networks, such as automata-based formal verification. Specifically, it proposes an active machine-learning algorithm for constructing a deterministic finite automaton that is approximately correct with respect to an artificial neural network

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas
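
The abstract above speaks of a deterministic finite automaton that is "approximately correct" with respect to a black-box network, but the listing does not give the algorithm itself. As a rough illustration only, one common way to make that notion concrete is a sampling-based agreement check: draw random words, compare the automaton's verdict with the black box's, and report the first disagreement as a counterexample. The sketch below assumes this reading; the names `DFA`, `sample_size`, `random_word`, `find_counterexample`, and `has_even_as` are hypothetical, and a plain Python function stands in for the RNN.

```python
import math
import random


class DFA:
    """Deterministic finite automaton over single-character symbols."""

    def __init__(self, start, accepting, transitions):
        self.start = start
        self.accepting = set(accepting)
        self.transitions = transitions  # maps (state, symbol) -> next state

    def accepts(self, word):
        state = self.start
        for symbol in word:
            state = self.transitions[(state, symbol)]
        return state in self.accepting


def sample_size(epsilon, delta):
    # If the disagreement rate exceeds epsilon, m agreeing samples occur with
    # probability at most (1 - epsilon)^m <= exp(-epsilon * m), which is
    # <= delta once m >= ln(1 / delta) / epsilon.
    return math.ceil(math.log(1 / delta) / epsilon)


def random_word(alphabet, rng, stop_prob=0.2):
    # Word length is geometric; this sampling distribution is an assumption.
    word = []
    while rng.random() > stop_prob:
        word.append(rng.choice(alphabet))
    return "".join(word)


def find_counterexample(dfa, black_box, alphabet, epsilon=0.05, delta=0.05, seed=0):
    """Return a word on which dfa and black_box disagree, or None if none is sampled."""
    rng = random.Random(seed)
    for _ in range(sample_size(epsilon, delta)):
        word = random_word(alphabet, rng)
        if dfa.accepts(word) != black_box(word):
            return word
    return None


def has_even_as(word):
    # Stand-in for the black-box classifier (not an actual RNN).
    return word.count("a") % 2 == 0


# Candidate automaton: state 0 = even number of 'a', state 1 = odd.
even_a = DFA(
    start=0,
    accepting={0},
    transitions={(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1},
)

print(find_counterexample(even_a, has_even_as, "ab"))
```

In an active-learning loop, a returned counterexample would be fed back to refine the candidate automaton; a result of `None` only supports approximate agreement under the chosen sampling distribution, not exact equivalence.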