Acronyms as an integral part of multi–word term recognition - A token of appreciation
Term conflation is the process of linking together different variants of the same term. In automatic term recognition approaches, all term variants should be aggregated into a single normalized term representative, which is associated with a single domain-specific concept as a latent variable. In a previous study, we described FlexiTerm, an unsupervised method for recognizing multi-word terms in a domain-specific corpus. It uses a range of methods to normalize three types of term variation: orthographic, morphological and syntactic. Acronyms, which represent a highly productive type of term variation, were not supported. In this study, we describe how the functionality of FlexiTerm has been extended to recognize acronyms and incorporate them into the term conflation process. The main contribution of this study is not acronym recognition per se, but rather its integration with other types of term variation into the term conflation process. We evaluated the effects of term conflation in the context of information retrieval, one of its most prominent applications. On average, relative recall increased by 32 percentage points, whereas the index compression factor increased by 7 percentage points. The evidence therefore suggests that integrating acronyms provides a non-trivial improvement to term conflation.
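As a rough illustration of how such conflation might work, the sketch below (a toy example, not FlexiTerm's actual algorithm; the `acronym_map` and the crude plural stemmer are hypothetical) normalizes orthographic, morphological, syntactic and acronym variants to a shared representative:

```python
import re
from collections import defaultdict

def normalize(term, acronym_map):
    """Crude normalization: lower-case and split on hyphens/spaces
    (orthographic), expand known acronyms, strip a plural 's'
    (morphological), and sort words (word-order / syntactic variation)."""
    words = re.split(r"[\s\-]+", term.lower())
    words = " ".join(acronym_map.get(w, w) for w in words).split()  # expand acronyms
    words = [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]
    return " ".join(sorted(words))

def conflate(terms, acronym_map):
    """Group term variants under a single normalized representative."""
    clusters = defaultdict(set)
    for t in terms:
        clusters[normalize(t, acronym_map)].add(t)
    return clusters

variants = ["controlled clinical trial", "controlled clinical trials", "CCT"]
clusters = conflate(variants, {"cct": "controlled clinical trial"})
# all three variants end up in one cluster under one normalized key
```

The point of the sketch is the shape of the pipeline: each variation type is handled by a cheap normalization step, and conflation falls out of grouping by the normalized key.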
Head to head: Semantic similarity of multi-word terms
Terms are linguistic signifiers of domain-specific concepts. Semantic similarity between terms refers to the corresponding distance in the conceptual space. In this study, we use lexico-syntactic information to define a vector space representation in which cosine similarity closely approximates semantic similarity between the corresponding terms. Given a multi-word term, each word is weighted in terms of its defining properties. In this context, the head noun is given the highest weight. Other words are weighted depending on their relations to the head noun. We formalized the problem as that of determining a topological ordering of a directed acyclic graph, which is based on constituency and dependency relations within a noun phrase. To counteract the errors associated with automatically inferred constituency and dependency relations, we implemented a heuristic approach to approximating the topological ordering. Different weights are assigned to different words based on their positions. Clustering experiments performed on such a vector space representation showed considerable improvement over the conventional bag-of-words representation. Specifically, it more consistently reflected semantic similarity between the terms. This was established by analyzing the differences between automatically generated dendrograms and manually constructed taxonomies. In conclusion, our method can be used to semi-automate taxonomy construction.
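The effect of head-noun weighting can be sketched in a few lines (an illustrative toy, not the paper's weighting scheme: the weights below are arbitrary, with the head noun, last in an English noun phrase, simply given the largest one):

```python
import math
from collections import Counter

def term_vector(words, weights):
    """Bag-of-words vector in which each word carries a positional weight
    (head noun highest, modifiers less)."""
    vec = Counter()
    for w, wt in zip(words, weights):
        vec[w] += wt
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors (Counters)."""
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

a = term_vector(["heart", "disease"], [1, 2])    # head: disease
b = term_vector(["cardiac", "disease"], [1, 2])  # head: disease
c = term_vector(["heart", "surgery"], [1, 2])    # head: surgery
```

Under plain bag-of-words, "heart disease" is equally similar to "cardiac disease" and "heart surgery"; with the head weighted higher, the head-sharing pair scores 0.8 against 0.2 for the modifier-sharing pair, which is the behaviour the abstract describes.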
Word sense disambiguation of acronyms in clinical narratives
Clinical narratives commonly use acronyms without explicitly defining their long forms. This makes it difficult to automatically interpret their sense, as acronyms tend to be highly ambiguous. Supervised learning approaches to their disambiguation in the clinical domain are hindered by issues associated with patient privacy and manual annotation, which limit the size and diversity of training data. In this study, we demonstrate how scientific abstracts can be utilised to overcome these issues by creating a large, automatically annotated dataset of artificially simulated global acronyms. A neural network trained on such a dataset achieved an F1-score of 95% on the disambiguation of acronym mentions in scientific abstracts. This network was integrated with multi-word term recognition to extract a sense inventory of acronyms from a corpus of clinical narratives on the fly. Acronym sense extraction achieved an F1-score of 74% on a corpus of radiology reports. In clinical practice, the suggested approach can be used to facilitate the development of institution-specific sense inventories.
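A simple way to picture on-the-fly sense inventory extraction is first-letter matching between acronyms and recognized multi-word terms. This is only an illustrative heuristic, not the integrated neural method described above:

```python
def matches(acronym, words):
    """True if the initials of `words` spell out `acronym` (case-insensitive)."""
    return [w[0].lower() for w in words] == list(acronym.lower())

def sense_inventory(acronyms, candidate_terms):
    """Map each acronym to the recognized multi-word terms whose
    initials spell it, i.e. its candidate senses in this corpus."""
    return {a: [t for t in candidate_terms if matches(a, t.split())]
            for a in acronyms}

terms = ["pulmonary embolism", "physical examination", "magnetic resonance imaging"]
inv = sense_inventory(["PE", "MRI"], terms)
# "PE" remains ambiguous between two candidate senses; "MRI" has one
```

The inventory makes the disambiguation task concrete: for each mention of "PE", a classifier must choose between the candidate senses the corpus itself supplies.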
Unsupervised multi-word term recognition in Welsh
This paper investigates an adaptation of an existing system for multi-word term recognition, originally developed for English, to Welsh. We outline the modifications required, with a special focus on an important difference between these representatives of two language families, Germanic and Celtic: the directionality of noun phrases. We successfully modelled these differences by means of lexico-syntactic patterns, which represent parameters of the system and therefore required no re-implementation of the core algorithm. The performance of the Welsh version was compared against that of the English version. For this purpose, we assembled three parallel domain-specific corpora. The results were compared in terms of precision and recall. Comparable performance was achieved across the three domains in terms of the two measures (P = 68.9%, R = 55.7%), but also in the ranking of automatically extracted terms, measured by the weighted kappa coefficient (κ = 0.7758). These early results indicate that our approach to term recognition can provide a basis for machine translation of multi-word terms.
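The directionality difference can be captured purely in pattern parameters, leaving the matcher untouched. A minimal sketch (the POS patterns below are hypothetical simplifications, not the system's actual patterns):

```python
import re

# English noun phrases are head-final (ADJ/NOUN modifiers before the head noun);
# Welsh noun phrases are head-initial (modifiers follow the head noun).
# Only this table changes per language; the matching code does not.
PATTERNS = {
    "en": r"((A|N) )+N",   # e.g. "A N", "N N N": head last
    "cy": r"N( (N|A))+",   # e.g. "N A", "N N A": head first
}

def is_term_candidate(pos_tags, lang):
    """Match a sequence of POS tags (A=adjective, N=noun) against the
    language-specific pattern; the core algorithm is language-agnostic."""
    return re.fullmatch(PATTERNS[lang], " ".join(pos_tags)) is not None
```

For example, an adjective-noun sequence is a candidate in English but not in Welsh, and vice versa for a noun-adjective sequence, which is exactly the kind of difference a pattern parameter can absorb.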
Simulation and annotation of global acronyms
Motivation: Global acronyms are used in written text without their formal definitions. This makes it difficult to automatically interpret their sense as acronyms tend to be ambiguous. Supervised machine learning approaches to sense disambiguation require large training datasets. In clinical applications, large datasets are difficult to obtain due to patient privacy. Manual data annotation creates an additional bottleneck.
Results: We proposed an approach to automatically modifying scientific abstracts to (1) simulate global acronym usage and (2) annotate their senses without the need for external sources or manual intervention. We implemented it as a web-based application, which can create large datasets that in turn can be used to train supervised approaches to word sense disambiguation of biomedical acronyms.
Availability: https://datainnovation.cardiff.ac.uk/acronyms
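A minimal sketch of what such simulation might look like (a toy regex heuristic, not the published tool's logic): find a local definition of the form "long form (ACRONYM)", delete the definition, substitute the acronym for later mentions, and keep the sense annotation:

```python
import re

def simulate_global_acronym(text):
    """Turn a locally defined acronym into a simulated global one.
    Returns the modified text and the (acronym, long form) annotation,
    or (text, None) if no definition pattern is found."""
    m = re.search(r"([A-Za-z][a-z]+(?: [a-z]+)+) \(([A-Z]{2,})\)", text)
    if not m:
        return text, None
    long_form, acronym = m.group(1), m.group(2)
    text = text.replace(f"{long_form} ({acronym})", acronym)  # drop the definition
    text = text.replace(long_form, acronym)                   # globalize later mentions
    return text, (acronym, long_form)

abstract = ("Magnetic resonance imaging (MRI) is widely used. "
            "Magnetic resonance imaging scans were reviewed.")
modified, sense = simulate_global_acronym(abstract)
```

The modified abstract now uses the acronym without a definition, while the annotation records the sense, which is exactly the supervision signal a disambiguation model needs.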
Advancements in eHealth Data Analytics through Natural Language Processing and Deep Learning
The healthcare environment is commonly described as "information-rich" but "knowledge-poor". Healthcare systems collect huge amounts of data from various sources: lab reports, medical letters, logs of medical tools or programs, medical prescriptions, etc. These massive datasets can provide knowledge and information that improve medical services, and the healthcare domain overall, for example disease prediction, by analyzing a patient's symptoms, or disease prevention, by facilitating the discovery of behavioral factors for diseases. Unfortunately, only a relatively small volume of textual eHealth data is processed and interpreted, an important factor being the difficulty of efficiently performing Big Data operations. In the medical field, detecting domain-specific multi-word terms is a crucial task, as they can define an entire concept with a few words. A term can be defined as a linguistic structure or a concept, and it is composed of one or more words with a specific meaning in a domain. All the terms of a domain make up its terminology. This chapter offers a critical study of the current best-performing solutions for analyzing unstructured (image and textual) eHealth data. It also provides a comparison of current Natural Language Processing and Deep Learning techniques in the eHealth context. Finally, we examine and discuss some of the current issues and define a set of research directions in this area.
AI-assisted patent prior art searching - feasibility study
This study seeks to understand the feasibility, technical complexities and effectiveness of using artificial intelligence (AI) solutions to improve the operational processes of registering IP rights. The Intellectual Property Office commissioned Cardiff University to undertake this research. The research was funded through the BEIS Regulators' Pioneer Fund (RPF). The RPF was set up to help address barriers to innovation in the UK economy.
Piecewise Latent Variables for Neural Variational Text Processing
Advances in neural variational inference have facilitated the learning of powerful directed graphical models with continuous latent variables, such as variational autoencoders. The hope is that such models will learn to represent rich, multi-modal latent factors in real-world data, such as natural language text. However, current models often assume simplistic priors on the latent variables, such as the uni-modal Gaussian distribution, which are incapable of representing complex latent factors efficiently. To overcome this restriction, we propose the simple, but highly flexible, piecewise constant distribution. This distribution has the capacity to represent an exponential number of modes of a latent target distribution, while remaining mathematically tractable. Our results demonstrate that incorporating this new latent distribution into different models yields substantial improvements in natural language processing tasks such as document modeling and natural language generation for dialogue.
Comment: 19 pages, 2 figures, 8 tables; EMNLP 201
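One way to see why a piecewise constant distribution stays tractable while supporting many modes: sampling reduces to picking a piece in proportion to its mass, then a uniform point within it. The sketch below is an illustrative reading of such a distribution on [0, 1], not the paper's implementation:

```python
import random

def sample_piecewise_constant(weights, rng=random):
    """Sample from a piecewise constant density on [0, 1]: the unit interval
    is split into len(weights) equal pieces, piece i carries unnormalized
    mass weights[i], and the density is flat within each piece.
    Inverse-CDF sampling: choose a piece by mass, then a uniform offset."""
    n = len(weights)
    u = rng.random() * sum(weights)
    cum = 0.0
    for i, w in enumerate(weights):
        cum += w
        if u <= cum:
            return (i + rng.random()) / n  # uniform point inside piece i
    return 1.0

# Two modes: all mass in the first and last thirds of [0, 1]
samples = [sample_piecewise_constant([1.0, 0.0, 1.0]) for _ in range(1000)]
```

With n pieces the density can place mass in any subset of them, which is the source of the "exponential number of modes" claim, yet the CDF remains piecewise linear and trivially invertible.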
Patient triage by topic modelling of referral letters: Feasibility study
Background: Musculoskeletal conditions are managed within primary care, but patients can be referred to secondary care if a specialist opinion is required. The ever-increasing demand on healthcare resources emphasizes the need to streamline care pathways, with the ultimate aim of ensuring that patients receive timely and optimal care. Information contained in referral letters underpins the referral decision-making process but is yet to be explored systematically for the purposes of treatment prioritization for musculoskeletal conditions. Objective: This study aims to explore the feasibility of using natural language processing and machine learning to automate the triage of patients with musculoskeletal conditions by analyzing information from referral letters. Specifically, we aim to determine whether referral letters can be automatically sorted into latent topics that are clinically relevant, i.e. considered relevant when prescribing treatments. Here, clinical relevance is assessed by posing two research questions. Can latent topics be used to automatically predict the treatment? Can clinicians interpret latent topics as cohorts of patients who share common characteristics or experiences, such as medical history, demographics and possible treatments? Methods: We used latent Dirichlet allocation to model each referral letter as a finite mixture over an underlying set of topics, and each topic as an infinite mixture over an underlying set of topic probabilities. The topic model was evaluated in the context of automating patient triage. Given a set of treatment outcomes, a binary classifier was trained for each outcome using previously extracted topics as the input features of the machine learning algorithm. In addition, a qualitative evaluation was performed to assess the human interpretability of the topics.
Results: The prediction accuracy of the binary classifiers outperformed the stratified random classifier by a large margin, indicating that topic modelling could be used to predict the treatment and thus effectively support patient triage. Qualitative evaluation confirmed the high clinical interpretability of the topic model. Conclusions: The results established the feasibility of using natural language processing and machine learning to automate the triage of patients with knee and/or hip pain by analyzing information from their referral letters.
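The pipeline described above, per-letter topic proportions in, binary treatment decision out, can be sketched with a toy logistic-regression classifier (an illustrative stand-in trained on made-up topic proportions, not the study's classifiers or data):

```python
import math

def train_logreg(X, y, lr=0.5, epochs=200):
    """Plain logistic regression over per-document topic proportions:
    a minimal stand-in for one binary treatment classifier."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of the log loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Hypothetical topic proportions (e.g. from LDA) for four referral letters;
# label 1 = referred for surgical opinion, 0 = conservative management.
X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)
```

One classifier per treatment outcome, each consuming the same topic features, mirrors the one-vs-rest setup the abstract describes.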