Towards modular, domain-agnostic and containerized development of biomedical Natural Language Processing systems

Abstract

The last century saw an exponential increase in scientific publications in the biomedical domain. Despite the potential value of this knowledge, most of these data are available only as unstructured textual literature, which limits their systematic access, use and exploitation. This limitation can be avoided, or at least mitigated, by relying on text mining techniques to automatically extract relevant data from textual documents and structure it. A significant challenge for scientific software applications, including Natural Language Processing (NLP) systems, is to provide facilities to share, distribute and run such systems in a simple and convenient way. Software containers can host their own dependencies and auxiliary programs, isolating them from the execution environment. In addition, a workflow manager can be used for the automated orchestration and execution of the text mining pipelines. Our work focuses on the study and design of new techniques and approaches to construct, develop, validate and deploy NLP components and workflows with sufficient genericity, scalability and interoperability to allow their use and instantiation across different domains. The resulting techniques will be applied in two main use cases: the detection of relevant information in preclinical toxicological reports, under the eTRANSAFE project [1]; and the indexing of biomaterials publications with relevant concepts as part of the DEBBIE project.
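The modular pipeline design described above can be sketched as a chain of interchangeable text-processing components; the component names and the toy concept list below are purely illustrative assumptions, not part of the described systems:

```python
# Minimal sketch of a modular text-mining pipeline: each component is a
# function from text to text, so components can be swapped across domains.
# The components and the concept list here are hypothetical examples.
from typing import Callable, List

Component = Callable[[str], str]

def run_pipeline(components: List[Component], document: str) -> str:
    """Pass a document through each component in order."""
    for component in components:
        document = component(document)
    return document

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def tag_concepts(text: str) -> str:
    # Crude dictionary-based concept tagger (illustrative only).
    concepts = {"toxicity", "biomaterial"}
    return " ".join(f"<{w}>" if w in concepts else w for w in text.split())

result = run_pipeline([normalize, tag_concepts], "  Biomaterial   TOXICITY study ")
# result == "<biomaterial> <toxicity> study"
```

In a containerized deployment, each such component could instead be packaged as its own container image and the chaining delegated to a workflow manager, keeping each component's dependencies isolated.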