Chrono: A System for Normalizing Temporal Expressions
The Chrono System: Chrono is a hybrid rule-based and machine learning system, written in Python and built from the ground up, that identifies temporal expressions in text and normalizes them into the SCATE schema. Input text is preprocessed using Python's NLTK package and then run through each of the four primary modules highlighted here. Note that Chrono does not remove stopwords, because they carry temporal information and context, and it does not tokenize sentences. Output is an Anafora XML file with annotated SCATE entities. After minor adjustments to its parsing logic, Chrono emerged as the top-performing system for SemEval 2018 Task 6. Chrono is available on GitHub at https://github.com/AmyOlex/Chrono.
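The preprocessing choices above (keep stopwords, no sentence splitting) can be illustrated with a minimal sketch. This is not Chrono's actual code; the regex tokenizer stands in for NLTK, and the `TEMPORAL_CUES` lexicon and function names are hypothetical, far smaller than Chrono's real rule set.

```python
import re

# Hypothetical mini-lexicon of temporal cue words; Chrono's actual
# rule-based modules are far more extensive.
TEMPORAL_CUES = {"ago", "before", "after", "last", "next", "yesterday",
                 "tomorrow", "week", "month", "year", "day", "days"}

def tokenize(text):
    # Word/punctuation tokenization only: no stopword removal and no
    # sentence splitting, mirroring the design choice described above.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def flag_temporal_tokens(text):
    # Return (index, token) pairs for tokens that may carry temporal
    # information, including numerals.
    return [(i, tok) for i, tok in enumerate(tokenize(text))
            if tok in TEMPORAL_CUES or tok.isdigit()]

print(flag_temporal_tokens("The patient was admitted 3 days ago."))
```

Note how "ago", a typical stopword, is exactly what anchors the expression "3 days ago" to the document time; discarding stopwords would lose it.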
Future Work: Chrono is still under development. Future improvements include: parsing additional entities, such as "event"; evaluating the impact of sentence tokenization; implementing an ensemble ML module that uses all four ML methods for disambiguation; extracting the temporal phrase parsing algorithm into a stand-alone component and comparing it to similar systems; evaluating performance on the THYME medical corpus; and migrating to the UIMA framework with Ruta rules for portability and easier customization.
Temporal disambiguation of relative temporal expressions in clinical texts using temporally fine-tuned contextual word embeddings.
Temporal reasoning is the ability to extract and assimilate temporal information to reconstruct a series of events such that they can be reasoned over to answer questions involving time. Temporal reasoning in the clinical domain is challenging due to specialized medical terms and nomenclature, shorthand notation, fragmented text, a variety of writing styles used by different medical units, redundancy of information that has to be reconciled, and an increased number of temporal references as compared to general-domain texts. Work in the area of clinical temporal reasoning has progressed, but the current state of the art still has far to go before practical application in the clinical setting will be possible. Much of the current work in this field focuses on direct and explicit temporal expressions and on identifying temporal relations. However, there is little work focused on relative temporal expressions, which can be difficult to normalize but are vital to ordering events on a timeline. This work introduces a new temporal expression recognition and normalization tool, Chrono, that normalizes temporal expressions into both the SCATE and TimeML schemes. Chrono advances clinical timeline extraction: it identifies more vague and relative temporal expressions than the current state of the art, and it uses contextualized word embeddings from fine-tuned BERT models to disambiguate temporal types, achieving state-of-the-art performance on relative temporal expressions. In addition, this work shows that fine-tuning BERT models on temporal tasks modifies the contextualized embeddings so that they achieve improved performance in classical SVM and CNN classifiers. Finally, this work provides a new tool for linking temporal expressions to events or other entities by introducing a novel method to identify which tokens an entire temporal expression is paying the most attention to, by summarizing the attention weight matrices output by BERT models.
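The final idea, summarizing attention weights to find what a whole temporal expression attends to, can be sketched as follows. This is an illustrative simplification under my own assumptions: the function name is hypothetical, and a single square attention matrix stands in for the full multi-head, multi-layer attention that BERT actually outputs.

```python
import numpy as np

def expression_attention(attn, expr_idx):
    """Rank the tokens an expression attends to most.

    attn[i, j] is the attention weight token i pays to token j
    (one head/layer, for illustration); expr_idx lists the token
    positions of the temporal expression.
    """
    attn = np.asarray(attn, dtype=float)
    # Average the attention rows of the expression's tokens so the
    # expression is treated as one unit.
    summed = attn[expr_idx].mean(axis=0)
    # Ignore attention the expression pays to itself.
    summed[expr_idx] = 0.0
    # Token indices sorted from most to least attended.
    return np.argsort(summed)[::-1]
```

For example, if tokens 2-3 form the expression and both attend mostly to token 1, `expression_attention(attn, [2, 3])[0]` would be `1`, suggesting token 1 as the entity to link the expression to.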
Computational approaches to semantic change (Volume 6)
Semantic change, how the meanings of words change over time, has preoccupied scholars since well before modern linguistics emerged in the late 19th and early 20th centuries, ushering in a new methodological turn in the study of language change. Compared to changes in sound and grammar, semantic change is the least understood. Since then, the study of semantic change has progressed steadily, accumulating a vast store of knowledge over more than a century, encompassing many languages and language families. Historical linguists also realized the potential of computers as research tools early on, with papers at the very first international conferences in computational linguistics in the 1960s. Such computational studies still tended to be small-scale, method-oriented, and qualitative. However, recent years have witnessed a sea change in this regard. Big-data empirical quantitative investigations are now coming to the forefront, enabled by enormous advances in storage capability and processing power. Diachronic corpora have grown beyond imagination, defying exploration by traditional manual qualitative methods, and language technology has become increasingly data-driven and semantics-oriented. These developments present a golden opportunity for the empirical study of semantic change over both long and short time spans.
Metadata Matters: Adaptation Methods For Robust Document Classification
Metadata implicitly embedded in documents, such as time, demographic factors, and user interests, can cause language variation and impact the performance of document classifiers. For example, language shifts over time, and males and females express sentiment differently. However, models for document classification, the automatic categorization of documents into categories, typically ignore document metadata. In this thesis, we focus on two types of document metadata: temporality and user factors. We propose to use domain adaptation by treating each metadata attribute as a set of domains (e.g., gender domains: male vs. female), aiming to integrate temporality and user factors into document classifiers and improve classification performance.
First, we propose temporality adaptation, which explicitly incorporates time into the representation learning process via feature augmentation and diachronic word embeddings. The feature augmentation method aims to learn time-independent feature weights for document classifiers. We then develop an end-to-end time-adapted model with diachronic word embeddings under a time-driven framework. Second, we propose user factor adaptation, which models demographic attributes and user interests using multitask learning. To model demographic attributes, document classifiers jointly predict demographic factors and document categories. We further develop a multitask user embedding that jointly learns language, user behaviors, and user interests. We examine and visualize the impacts of temporality and user factors at the word, topic, semantic, and classifier levels.
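The feature augmentation idea mentioned above can be sketched in a few lines, in the spirit of Daumé III's "frustratingly easy" domain adaptation: each feature vector is copied into a shared block plus one block per domain, so a linear classifier can learn both domain-independent and domain-specific weights. The function name and layout are my own illustration, not the thesis's implementation; here a "domain" could be a time period or a demographic group.

```python
import numpy as np

def augment(x, domain, n_domains):
    """Feature augmentation for domain adaptation.

    Copies the input vector x into a shared (domain-independent) block
    and into the block for its own domain, leaving the other domain
    blocks at zero. Output length is len(x) * (n_domains + 1).
    """
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    out = np.zeros(d * (n_domains + 1))
    out[:d] = x                        # shared copy: learns time/user-independent weights
    start = d * (domain + 1)
    out[start:start + d] = x           # domain-specific copy: learns domain-specific weights
    return out
```

Training any standard linear classifier on the augmented vectors then lets the shared block absorb signal common to all domains (e.g., all time periods), which is what makes the learned feature weights "time-independent".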
The benefits of adapting to demographic attributes motivate us to examine whether domain adaptation can reduce demographic biases. We release a multilingual hate speech corpus with author-level demographic labels. We examine demographic variations in user language and demographic biases of document classifiers. Following this, to reduce demographic bias, we apply a feature augmentation method to learn demographic-independent classifiers.