Knowledge extraction from unstructured data
Data availability is becoming increasingly important given the current growth of web-based data. The data available on the web are represented as unstructured, semi-structured, or structured data. To make web-based data available for Natural Language Processing or Data Mining tasks, the data need to be presented in a structured, machine-readable format. Thus, techniques for capturing knowledge from unstructured data sources are needed. Research communities address this problem with knowledge extraction methods, which capture knowledge in natural language text and map it to existing knowledge represented in knowledge graphs (KGs). These methods include named-entity recognition, named-entity disambiguation, relation recognition, and relation linking. This thesis addresses the problem of extracting knowledge from unstructured data and discovering patterns in the extracted knowledge. We devise a rule-based approach for entity and relation recognition and linking. The approach effectively maps entities and relations within a text to their resources in a target KG. It overcomes the challenges of recognizing and linking entities and relations to a specific KG by employing two components: catalogs of linguistic and domain-specific rules that state the criteria for recognizing entities in a sentence of a particular language, and a deductive database that encodes knowledge from community-maintained KGs. Moreover, we define a neuro-symbolic approach for knowledge extraction in encyclopedic and domain-specific settings; it combines symbolic and sub-symbolic components to overcome the challenges of entity recognition and linking and the limited availability of training data, while maintaining the accuracy of recognizing and linking entities.
Additionally, we present a context-aware framework for unveiling semantically related posts in a corpus; it is a knowledge-driven framework that retrieves associated posts effectively. We cast the problem of unveiling semantically related posts in a corpus as the Vertex Coloring Problem. We evaluate the performance of our techniques on several knowledge extraction benchmarks from various domains. Furthermore, we apply these methods in real-world scenarios from national and international projects. The outcomes show that our techniques effectively extract knowledge encoded in unstructured data and discover patterns in the extracted knowledge, presented as machine-readable data. More importantly, the evaluation results provide evidence of the effectiveness of combining the reasoning capacity of symbolic frameworks with the pattern recognition and classification power of sub-symbolic models.
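As an illustration only, the vertex coloring idea can be sketched as follows. The abstract does not say how the graph is built, so this sketch assumes one plausible reading: posts are vertices, an edge joins two posts whose similarity falls below a threshold, and posts that receive the same color form a group of related posts. The Jaccard similarity over entity sets, the threshold value, and the greedy heuristic are all assumptions, not the thesis's actual method.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of entities annotated in posts."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def color_posts(posts, threshold=0.3):
    """Greedy vertex coloring; posts sharing a color form one related group.

    Assumption: an edge joins two posts whose similarity is BELOW the
    threshold, so dissimilar (adjacent) posts never share a color.
    """
    ids = list(posts)
    conflicts = {i: set() for i in ids}
    for u, v in combinations(ids, 2):
        if jaccard(posts[u], posts[v]) < threshold:
            conflicts[u].add(v)
            conflicts[v].add(u)

    colors = {}
    # Color high-degree vertices first, a common greedy heuristic.
    for node in sorted(ids, key=lambda i: -len(conflicts[i])):
        used = {colors[n] for n in conflicts[node] if n in colors}
        colors[node] = next(c for c in range(len(ids)) if c not in used)
    return colors


# Hypothetical posts, each reduced to the set of entities linked in it.
example_posts = {
    "p1": {"Berlin", "Germany"},
    "p2": {"Berlin", "Germany", "EU"},
    "p3": {"insulin", "diabetes"},
}
example_colors = color_posts(example_posts)
```

Because each color class is an independent set in the conflict graph, every pair of posts with the same color is at least as similar as the threshold. A greedy coloring is not guaranteed to use the minimum number of colors.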
Meaningful Information Extraction from Unstructured Clinical Documents
Extracting medical concepts and entities from unstructured medical narrative documents is a crucial step in most contemporary health systems. For this extraction, the Unified Medical Language System (UMLS) Metathesaurus is a major source of biomedical and health-related concepts. Recently, tools such as Sophia, MetaMap, and cTAKES, along with many rule-based methods and algorithms such as Quick UMLS, have been developed and play a successful role in medical concept extraction. The goal of this paper is to design a generic algorithm that identifies a package consisting of standard concepts, their semantic types, and entity types on the basis of the medical phrases and terms used in unstructured clinical documents. The proposed algorithm uses the UMLS Terminology Services (UTS), customized to extract concepts for all meaningful phrases and terms used in the narratives and to determine their semantic and entity types in order to categorize the concepts exactly. The proposed algorithm has produced a very useful set of results for labeling biomedical data, which could in turn be used for training data-driven approaches such as machine learning.
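The kind of output described here (a concept with its semantic type and entity type for each meaningful phrase) can be sketched with a toy longest-match lookup. This is not the paper's algorithm and does not call UTS: the lexicon, the concept identifiers, and the entity-type labels below are hypothetical placeholders chosen for illustration.

```python
import re

# Hypothetical toy lexicon: phrase tokens -> (concept id, semantic type, entity type).
# The identifiers are placeholders, not real UMLS CUIs.
LEXICON = {
    ("chest", "pain"): ("C-0001", "Sign or Symptom", "Problem"),
    ("pain",): ("C-0002", "Sign or Symptom", "Problem"),
    ("aspirin",): ("C-0003", "Pharmacologic Substance", "Medication"),
}
MAX_LEN = max(len(key) for key in LEXICON)


def extract_concepts(text):
    """Return (phrase, concept id, semantic type, entity type) tuples.

    At each position the longest lexicon phrase wins, so "chest pain"
    is preferred over the shorter match "pain".
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in LEXICON:
                found.append((" ".join(key), *LEXICON[key]))
                i += n
                break
        else:
            i += 1  # no phrase starts at this token
    return found


example = extract_concepts("Patient reports chest pain; given aspirin.")
```

A real system would replace the in-memory lexicon with lookups against a terminology service and would need to handle negation, abbreviations, and spelling variants, none of which this sketch attempts.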
Knowledge-based Biomedical Data Science 2019
Knowledge-based biomedical data science (KBDS) involves the design and
implementation of computer systems that act as if they knew about biomedicine.
Such systems depend on formally represented knowledge in computer systems,
often in the form of knowledge graphs. Here we survey the progress in the last
year in systems that use formally represented knowledge to address data science
problems in both clinical and biological domains, as well as on approaches for
creating knowledge graphs. Major themes include the relationships between
knowledge graphs and machine learning, the use of natural language processing,
and the expansion of knowledge-based approaches to novel domains, such as
Chinese Traditional Medicine and biodiversity.
Comment: Manuscript 43 pages with 3 tables; Supplemental material 43 pages with 3 tables
An artificial intelligence natural language processing pipeline for information extraction in neuroradiology
The use of electronic health records in medical research is difficult because
of their unstructured format. Extracting information within reports and
summarising patient presentations in a way amenable to downstream analysis
would be enormously beneficial for operational and clinical research. In this
work we present a natural language processing pipeline for information
extraction of radiological reports in neurology. Our pipeline uses a hybrid
sequence of rule-based and artificial intelligence models to accurately extract
and summarise neurological reports. We train and evaluate a custom language
model on a corpus of 150,000 MRI radiology reports from the National Hospital
for Neurology and Neurosurgery, London. We also present results for
standard NLP tasks on domain-specific neuroradiology datasets. We show our
pipeline, called 'neuroNLP', can reliably extract clinically relevant
information from these reports, enabling downstream modelling of reports and
associated imaging on a heretofore unprecedented scale.
Comment: 20 pages, 2 png image figures