Ontology Enrichment from Free-text Clinical Documents: A Comparison of Alternative Approaches
While the biomedical informatics community widely acknowledges the utility of domain ontologies, there remain many barriers to their effective use. One important requirement of domain ontologies is that they achieve a high degree of coverage of the domain concepts and concept relationships. However, the development of these ontologies is typically a manual, time-consuming, and often error-prone process. Limited resources result in missing concepts and relationships, as well as difficulty in updating the ontology as domain knowledge changes. Methodologies developed in the fields of Natural Language Processing (NLP), Information Extraction (IE), Information Retrieval (IR), and Machine Learning (ML) provide techniques for automating the enrichment of ontologies from free-text documents. In this dissertation, I extended these methodologies to biomedical ontology development. First, I reviewed existing methodologies and systems developed in the fields of NLP, IR, and IE, and discussed how existing methods can benefit the development of biomedical ontologies. This review, the first of its kind, was published in the Journal of Biomedical Informatics. Second, I compared the effectiveness of three methods from two different approaches, the symbolic (the Hearst method) and the statistical (the Church and Lin methods), using clinical free-text documents. Third, I developed a methodological framework for Ontology Learning (OL) evaluation and comparison, which permits evaluation of both types of OL approach across the three OL methods. The significance of this work is as follows: 1) the results of the comparative study showed the potential of these methods for biomedical ontology enrichment: for the two targeted domains (NCIT and RadLex), the Hearst method achieved average new-concept acceptance rates of 21% and 11%, respectively, while the Lin method produced a 74% acceptance rate for NCIT and the Church method 53%; as a result of this study (published in Methods of Information in Medicine), many suggested candidates have been incorporated into the NCIT; 2) the evaluation framework is flexible and general enough to analyze the performance of ontology enrichment methods for many domains, thus expediting the process of automation and minimizing the likelihood that key concepts and relationships are missed as domain knowledge evolves.
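To make the symbolic approach concrete, the following is a minimal sketch of a Hearst-style pattern matcher: a single "X such as Y1, Y2, ..." rule applied to raw text. The pattern, the helper extract_hyponyms, and the example sentence are illustrative assumptions; the method evaluated in the study uses part-of-speech information and a richer pattern inventory.

```python
import re

# One Hearst-style "X such as Y1, Y2, ..." pattern. Real systems add
# part-of-speech constraints and several more patterns.
HEARST_SUCH_AS = re.compile(
    r"([A-Za-z][A-Za-z ]*?),? such as "
    r"([A-Za-z]+(?:, [A-Za-z]+)*(?:,? (?:and|or) [A-Za-z]+)?)"
)

def extract_hyponyms(sentence):
    """Return (hypernym, [hyponyms]) pairs suggested by the pattern."""
    pairs = []
    for match in HEARST_SUCH_AS.finditer(sentence):
        hypernym = match.group(1).strip()
        items = re.split(r",|\band\b|\bor\b", match.group(2))
        pairs.append((hypernym, [i.strip() for i in items if i.strip()]))
    return pairs

print(extract_hyponyms(
    "Imaging modalities such as CT, MRI, and ultrasound were reviewed."))
# -> [('Imaging modalities', ['CT', 'MRI', 'ultrasound'])]
```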
A DATA DRIVEN APPROACH TO IDENTIFY JOURNALISTIC 5WS FROM TEXT DOCUMENTS
Textual understanding is the process of automatically extracting accurate, high-quality information from text. The amount of textual data available from sources such as news, blogs, and social media is growing exponentially. These data encode significant latent information which, if extracted accurately, can be valuable in a variety of applications such as medical report analysis, news understanding, and societal studies. Natural language processing techniques are often employed to develop customized algorithms that extract such latent information from text.
Journalistic 5Ws refer to the basic information in news articles that describes an event: where, when, who, what, and why. Extracting them accurately may facilitate a better understanding of many social processes, including social unrest, human rights violations, propaganda spread, and population migration. Furthermore, the 5W information can be combined with socio-economic and demographic data to analyze the state and trajectory of these processes.
In this thesis, a data-driven pipeline is developed to extract the 5Ws from text using syntactic and semantic cues. First, a classifier is developed to identify articles specifically related to social unrest; it is trained on a dataset of over 80K news articles. NLP algorithms are then used to generate a set of candidate answers for the 5Ws. Next, a series of heuristic algorithms is developed to extract the 5Ws: the heuristics leverage specific words and parts of speech, customized for each W, and are based on the syntactic structure of the document as well as syntactic and semantic representations of individual words and sentences. The resulting scores are combined and ranked to obtain the best answers to the journalistic 5Ws. The accuracy of the algorithms is validated using a manually annotated dataset of news articles.
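As an illustration of the heuristic scoring idea, the following is a minimal sketch for the "who" question: each candidate receives a score combining its position in the article with a syntactic cue. The Candidate fields, cue weights, and example values are assumptions for illustration, not the thesis's actual features.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str            # candidate answer span
    sentence_index: int  # position of its sentence in the article
    is_subject: bool     # syntactic cue, e.g. from a dependency parse

def score_who(cand: Candidate, n_sentences: int) -> float:
    # News follows an inverted-pyramid style, so earlier sentences
    # matter more; grammatical subjects are more likely to be actors.
    position_score = 1.0 - cand.sentence_index / max(n_sentences, 1)
    subject_bonus = 0.5 if cand.is_subject else 0.0
    return position_score + subject_bonus

candidates = [
    Candidate("the protesters", 0, True),
    Candidate("the city council", 3, False),
]
best = max(candidates, key=lambda c: score_who(c, n_sentences=10))
print(best.text)  # -> "the protesters"
```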
Deep Learning Methods for Register Classification
For this project, the data used are those collected by Biber and Egbert (2018), covering articles in various languages from the internet. I use the BERT model (Bidirectional Encoder Representations from Transformers), a deep neural network, and FastText, a shallow neural network, as a baseline to perform text classification. I also use deep learning models such as XLNet to see whether classification accuracy improves.
Biber and Egbert (2018) also describe what a register is; we can think of a register as a genre. According to Biber (1988), registers are varieties defined in terms of general situational parameters. Hence, it can be inferred that there is a close relation between the language and the context of the situation in which it is used. This work attempts register classification using deep learning methods built on the attention mechanism. Throughout the work, this involved working with the models, dealing with the imbalanced datasets that arise in real-life problems, and tuning the hyperparameters for training. Proper evaluation metrics for the various kinds of data were also determined.
The background study shows how cumbersome the classical machine learning approach used to be; deep learning, on the other hand, can accomplish the task with ease. Selecting an appropriate metric for the classification task on different types of datasets (balanced vs. imbalanced) and dealing with overfitting were also addressed.
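As a minimal sketch of the BERT-based setup, the following uses the Hugging Face Transformers library for sequence classification with a weighted loss, one common way of handling class imbalance. The label set, class weights, and example sentence are placeholders; the actual registers come from the Biber and Egbert (2018) taxonomy, and the real model is fine-tuned on that corpus rather than used off the shelf.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder register labels and class weights.
labels = ["narrative", "informational", "opinion"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

# Weighted cross-entropy: rarer classes get larger weights.
class_weights = torch.tensor([1.0, 0.5, 2.0])  # assumed inverse frequencies
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)

batch = tokenizer(["The results indicate a significant effect."],
                  return_tensors="pt", truncation=True, padding=True)
logits = model(**batch).logits
loss = loss_fn(logits, torch.tensor([1]))  # gold label: "informational"
print(labels[logits.argmax(dim=-1).item()], float(loss))
```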
A Bigger Fish to Fry: Scaling up the Automatic Understanding of Idiomatic Expressions
In this thesis, we are concerned with idiomatic expressions and how to handle them within NLP. Idiomatic expressions are a type of multiword phrase whose meaning is not a direct combination of the meanings of its parts, e.g. 'at a crossroads' and 'move the goalposts'.

In Part I, we provide a general introduction to idiomatic expressions and an overview of observations regarding idioms based on corpus data. In addition, we discuss existing research on idioms from an NLP perspective, providing an overview of existing tasks, approaches, and datasets. In Part II, we focus on building a large idiom corpus, which involves developing a system for the automatic extraction of potentially idiomatic expressions (PIEs) and annotating the resulting corpus through crowdsourcing. Finally, in Part III, we improve an existing unsupervised classifier and compare it to other existing classifiers. Given the relatively poor performance of this unsupervised classifier, we also develop a supervised deep neural network-based system and find that a model involving two separate modules looking at different information sources yields the best performance, surpassing previous state-of-the-art approaches.

In conclusion, this work shows the feasibility of building a large corpus of sense-annotated potentially idiomatic expressions, and the benefits such a corpus provides for further research: it makes possible quick testing of hypotheses about the distribution and usage of idioms, it enables the training of data-hungry machine learning methods for PIE disambiguation systems, and it permits fine-grained, reliable evaluation of such systems.
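As a minimal sketch of dictionary-based PIE extraction, the following matches the lemmas of each idiom in order, allowing a few intervening words so that inflected or modified forms are still caught. The tiny suffix-stripping lemmatizer, the two-entry idiom list, and the gap limit are illustrative stand-ins for the large idiom dictionary and full NLP pipeline used in the thesis.

```python
IDIOMS = [["move", "the", "goalposts"], ["at", "a", "crossroads"]]

def naive_lemma(token: str) -> str:
    """Crude suffix stripper standing in for a real lemmatizer."""
    token = token.lower().strip(".,!?'\"")
    for suffix in ("ing", "ed", "es", "e", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def find_pies(sentence: str, max_gap: int = 2):
    """Find idiom lemmas in order, allowing up to max_gap fillers."""
    tokens = sentence.split()
    lemmas = [naive_lemma(t) for t in tokens]
    hits = []
    for idiom in IDIOMS:
        target = [naive_lemma(w) for w in idiom]
        for start, lemma in enumerate(lemmas):
            if lemma != target[0]:
                continue
            pos, ok = start, True
            for want in target[1:]:
                nxt = next((j for j in range(pos + 1,
                                             min(pos + 2 + max_gap, len(lemmas)))
                            if lemmas[j] == want), None)
                if nxt is None:
                    ok = False
                    break
                pos = nxt
            if ok:
                hits.append(" ".join(tokens[start:pos + 1]))
    return hits

print(find_pies("They keep moving the goalposts in every negotiation."))
# -> ['moving the goalposts']
```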
Domain-Specific Knowledge Exploration with Ontology Hierarchical Re-Ranking and Adaptive Learning and Extension
The goal of this research project is the realization of an artificial intelligence-driven, lightweight domain knowledge search framework that returns a domain knowledge structure upon request, with highly relevant web resources, via a set of domain-centric re-ranking algorithms and adaptive ontology learning models. The re-ranking algorithm, a mechanism needed to counteract the heterogeneity and unstructured nature of web data, uses augmented queries and a hierarchical taxonomic structure to gain further insight into the initial search results obtained from trusted generic search engines. A semantic weight scale is applied to each node in the ontology graph, which in turn generates a matrix of aggregated link relation scores used to compute the likely semantic correspondence between nodes and documents. Bootstrapped with a lightweight seed domain ontology, the theoretical platform focuses on the core back-end building blocks, employing two supervised automated learning models as well as semi-automated verification processes to progressively enhance, prune, and inspect the domain ontology, yielding a growing, up-to-date, and veritable system.

The framework provides an in-depth knowledge search platform and enhances the user's knowledge acquisition experience. With a minimal footprint, the system stores only the metadata necessary for likely domain knowledge searches, providing fast fetching and caching. In addition, the re-ranking and ontology learning processes can be run offline or in a preprocessing stage, so the system carries no significant overhead at runtime.
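As a minimal sketch of the hierarchical re-ranking idea, the following gives each ontology node a semantic weight, lets a node inherit a damped share of its ancestors' weights, and scores a document by aggregating the weights of the nodes whose labels it mentions. The toy ontology, damping factor, and substring-based matching are illustrative assumptions, not the framework's actual scoring matrix.

```python
ONTOLOGY = {                      # node -> (parent, base weight)
    "vehicle": (None, 1.0),
    "car": ("vehicle", 0.8),
    "electric car": ("car", 0.9),
}
DAMPING = 0.5  # assumed share of ancestor weight passed to a child

def node_weight(node: str) -> float:
    """Base weight plus a damped contribution from all ancestors."""
    parent, base = ONTOLOGY[node]
    return base + (DAMPING * node_weight(parent) if parent else 0.0)

def rerank(docs):
    """Order documents by the aggregated weight of mentioned nodes."""
    def score(text):
        text = text.lower()
        # Substring matching is deliberately crude: "electric car"
        # also triggers "car", so more specific hits score higher.
        return sum(node_weight(n) for n in ONTOLOGY if n in text)
    return sorted(docs, key=score, reverse=True)

docs = ["Review of electric car batteries", "Vehicle licensing rules"]
print(rerank(docs))  # the more specific document ranks first
```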
The Semantic Prosody of Natural Phenomena in the Qur’an: A Corpus-Based Study
This thesis explores the Semantic Prosody (SP) of natural phenomena in the Qur’an and five of its prominent English translations [Pickthall (1930), Yusuf Ali (1939/revised edition 1987), Arberry (1957), Saheeh International (1997), and Abdel Haleem (2004)]. SP, scarcely explored in Qur’anic research, is defined as “a form of meaning established through the proximity of a consistent series of collocates” (Louw 2000, p.50). Theoretically, it is both an evaluative prosody (i.e., lexical items collocating with semantic word classes that are positive, negative, or neutral) and a discourse prosody (i.e., having a communicative purpose).
Given the stylistic uniqueness of the Qur’an, and considering that SP can be examined empirically via corpora, the present study explores the SP of 154 words associated with nature referenced throughout the Qur’an using Corpus Linguistics techniques. Firstly, the Python-based Natural Language Toolkit was used to define nature terms via WordNet, to disambiguate their variant forms with stemmers, and to compute their frequencies. Once frequencies were found, a quantitative analysis using Evert’s (2008) five-step statistical analysis was applied to the 30 most frequent terms to investigate their collocations and SPs. Following this, a qualitative analysis was conducted according to the Extended Lexical Unit via concordance, to analyse collocations, and Lexical-Functional Grammar, to find the variation of meanings produced by lexico-grammatical patterns. Finally, the resulting datasets were aligned to evaluate their congruency with the Qur’an.
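As a minimal sketch of these corpus steps with the Natural Language Toolkit, the following stems variant forms, counts frequencies, and scores collocates with the log-likelihood ratio, one of the association measures in Evert’s (2008) framework. The two-line corpus is a placeholder for the Qur’an translations actually analysed.

```python
import nltk
from nltk.stem import PorterStemmer
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Placeholder corpus; the study uses the Qur'an and its translations.
corpus = ("the sun and the moon run their courses "
          "the sun and the stars are made subservient").split()

# Stem variant forms so that e.g. "stars"/"star" are counted together.
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in corpus]
print(nltk.FreqDist(stems).most_common(3))

# Score bigram collocations with the log-likelihood ratio.
measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(stems)
for pair, score in finder.score_ngrams(measures.likelihood_ratio)[:3]:
    print(pair, round(score, 2))
```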
The findings of this research confirm that words referring to nature in the Qur’an do have semantic prosody. For example, astronomical bodies are primed to occur in predominantly positive collocations referring to glorifying God, while weather phenomena occur in negative ones referring to the calamities of the Day of Judgment. In addition, the results show that Abdel Haleem’s translation can be considered the most congruent.
This research develops an approach to explore themes (e.g., nature) via SP analysis in texts and their translations and provides several linguistic resources that can be used for future corpus-based studies on the language of the Qur’an.
Cross-lingual question answering
Question Answering has become an intensively researched area in the last decade, seen as the next step beyond Information Retrieval in the attempt to provide more concise and better access to large volumes of available information. Question Answering builds on Information Retrieval technology for a first pass over possibly relevant data, and uses further natural language processing techniques to search for candidate answers and to look for clues that accept or invalidate the candidates as right answers to the question. Though most of the research has been carried out in monolingual settings, where the question and the answer-bearing documents share the same natural language, current approaches concentrate on cross-language scenarios, where the question and the documents are in different languages. Three methods of crossing the language barrier are known in this context, in common with Information Retrieval research: translating the question, translating the documents, or aligning both the question and the documents to a common interlingual representation. We present a cross-lingual English-to-German Question Answering system, for both factoid and definition questions, using a German monolingual system and translating the questions from English to German. Two different translation techniques are evaluated:
• direct translation of the English input question into German, and
• transfer-based translation, using an intermediate representation that captures the “meaning” of the original question and is translated into the target language.
For both translation techniques, two types of translation tools are used: bilingual dictionaries and machine translation. The intermediate representation captures the semantic meaning of the question in terms of Question Type (QType), Expected Answer Type (EAType), and Focus, information that steers the workflow of the question answering process.
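As a minimal sketch of this intermediate representation, the following models QType, EAType, and Focus as a small data structure filled in by two illustrative rules; the system's real question analysis is, of course, far richer.

```python
from dataclasses import dataclass

@dataclass
class QuestionRepresentation:
    qtype: str   # e.g. FACTOID or DEFINITION
    eatype: str  # e.g. PERSON, DATE, LOCATION
    focus: str   # the entity or concept the question is about

def analyse(question: str) -> QuestionRepresentation:
    """Two toy rules standing in for full question analysis."""
    q = question.lower().rstrip("?")
    if q.startswith("what is"):
        return QuestionRepresentation("DEFINITION", "CONCEPT",
                                      q[len("what is "):])
    if q.startswith("who"):
        return QuestionRepresentation("FACTOID", "PERSON",
                                      q.split(" ", 1)[1])
    return QuestionRepresentation("FACTOID", "UNKNOWN", q)

print(analyse("Who invented the printing press?"))
# QuestionRepresentation(qtype='FACTOID', eatype='PERSON',
#                        focus='invented the printing press')
```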
The German monolingual Question Answering system can answer both factoid and definition questions and is based on several premises:
• facts and definitions are usually expressed locally, at the level of a sentence and its surroundings;
• proximity of concepts within a sentence can be related to their semantic dependency;
• for factoid questions, redundancy of candidate answers is a good indicator of their suitability;
• definitions of concepts are expressed using fixed linguistic structures such as appositions, modifiers, and abbreviation extensions.
Extensive evaluations of the monolingual system have shown that the above-mentioned hypotheses hold true in most cases when dealing with a fairly large collection of documents, like the one used in the CLEF evaluation forum.
A modular, open-source information extraction framework for identifying clinical concepts and processes of care in clinical narratives
In this thesis, a synthesis is presented of the knowledge models required by clinical information systems that provide decision support for longitudinal processes of care. Qualitative research techniques and thematic analysis are applied in a novel way to a systematic review of the literature on the challenges in implementing such systems, leading to the development of an original conceptual framework. The thesis demonstrates how these process-oriented systems make use of a knowledge base derived from workflow models and clinical guidelines, and argues that one of the major barriers to implementation is the need to extract explicit and implicit information from diverse resources in order to construct the knowledge base. Moreover, concepts in both the knowledge base and the electronic health record (EHR) must be mapped to a common ontological model. However, the majority of clinical guideline information remains in text form, and much of the useful clinical information in the EHR resides in the free-text fields of progress notes and laboratory reports.

In this thesis, it is shown how natural language processing and information extraction techniques provide a means to identify and formalise the knowledge components required by the knowledge base. Original contributions are made in the development of lexico-syntactic patterns and the use of external domain knowledge resources to tackle a variety of information extraction tasks in the clinical domain, such as recognition of clinical concepts, events, and temporal relations, term disambiguation, and abbreviation expansion. Methods are developed for adapting existing tools and resources in the biomedical domain to the processing of clinical texts, and approaches to improving the scalability of these tools are proposed and evaluated.

These tools and techniques are then combined in the creation of a novel approach to identifying processes of care in the clinical narrative. It is demonstrated that resolution of coreferential and anaphoric relations as narratively and temporally ordered chains provides a means to extract linked narrative events and processes of care from clinical notes. Coreference performance in discharge summaries and progress notes is largely dependent on correct identification of protagonist chains (patient, clinician, family relation), pronominal resolution, and string matching that takes account of experiencer, temporal, spatial, and anatomical context, whereas for laboratory reports additional, external domain knowledge is required. The types of external knowledge and their effects on system performance are identified and evaluated. Results are compared against existing systems for solving these tasks and are found to improve on them, or to approach the performance of recently reported, state-of-the-art systems. Software artefacts developed in this research have been made available as open-source components within the General Architecture for Text Engineering framework.
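As a minimal sketch of one such lexico-syntactic pattern, the following expands an abbreviation from its parenthesised definition, in the spirit of the widely used Schwartz-Hearst algorithm (named here as a comparable technique, not necessarily the thesis's own method). The regular expression, first-letter check, and example note are illustrative; the actual patterns are implemented as components within the GATE framework.

```python
import re

# "long form (ABBREV)": up to five words before a parenthesised
# sequence of two to six capital letters.
DEFINITION = re.compile(r"((?:\w+\s+){1,5}?)\(([A-Z]{2,6})\)")

def find_abbreviations(text: str):
    """Map abbreviations to their spelled-out long forms."""
    expansions = {}
    for match in DEFINITION.finditer(text):
        words, abbrev = match.group(1).split(), match.group(2)
        long_form = words[-len(abbrev):]   # last N words for N letters
        # Keep the candidate only if first letters line up.
        if len(long_form) == len(abbrev) and all(
                w[0].lower() == c.lower()
                for w, c in zip(long_form, abbrev)):
            expansions[abbrev] = " ".join(long_form)
    return expansions

note = "Patient has chronic obstructive pulmonary disease (COPD) and CHF."
print(find_abbreviations(note))
# -> {'COPD': 'chronic obstructive pulmonary disease'}
```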