3,114 research outputs found
Methods for improving entity linking and exploiting social media messages across crises
Entity Linking (EL) is the task of automatically identifying entity mentions in texts and resolving them to a corresponding entity in a reference knowledge base (KB). A large number of tools are available for different types of documents and domains; however, the entity linking literature has shown that a tool's quality varies across corpora and depends on the specific characteristics of the corpus it is applied to. Moreover, the lack of precision on particularly ambiguous mentions often spoils the usefulness of automated disambiguation results in real-world applications.
In the first part of this thesis I explore an approximation of how difficult entity mentions are to link, framing it as a supervised classification task. Classifying difficult-to-disambiguate entity mentions can help identify critical cases in a semi-automated system, while also detecting latent corpus characteristics that affect entity linking performance. Moreover, despite the large number of entity linking tools proposed over the past years, some tools work better on short mentions while others perform better when there is more contextual information. To this end, I proposed a solution that exploits the results of distinct entity linking tools on the same corpus, leveraging their individual strengths on a per-mention basis. The proposed solution proved effective and outperformed the individual entity linking systems employed in a series of experiments.
An important component in the majority of entity linking tools is the probability that a mention links to a particular entity in a reference knowledge base, and this probability is usually computed over a static snapshot of the KB. However, an entity's popularity is temporally sensitive and may change due to short-term events. These changes may then be reflected in a KB, and EL tools can produce different results for a given mention at different times. I investigated how this prior probability changes over time and measured the overall disambiguation performance using KBs from different time periods. The second part of this thesis is mainly concerned with short texts. Social media has become an integral part of modern society. Twitter, for instance, is one of the most popular social media platforms in the world, enabling people to share their opinions and post short messages about any subject on a daily basis. At first I presented one
approach to identifying informative messages during catastrophic events using deep learning techniques. Automatically detecting informative messages posted by users during major events can enable crisis management professionals to better estimate damage from the relevant information posted on social media channels, and to act immediately. Moreover, I performed an analysis of Twitter messages posted during the Covid-19 pandemic. I collected 4 million tweets posted in Portuguese since the beginning of the pandemic and provided an analysis of the debate around it, using topic modeling, sentiment analysis and hashtag recommendation techniques to provide insights into the online discussion of the Covid-19 pandemic.
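The prior probability component described above is commonly estimated from anchor-link counts in a KB snapshot. As a minimal sketch of that idea (the mention strings and counts below are invented for illustration, not taken from the thesis):

```python
from collections import Counter, defaultdict

def build_link_priors(anchor_entity_pairs):
    """Estimate P(entity | mention) from (anchor text, entity) link pairs
    observed in a static KB snapshot (e.g. Wikipedia anchor links)."""
    counts = defaultdict(Counter)
    for mention, entity in anchor_entity_pairs:
        counts[mention.lower()][entity] += 1
    return {m: {e: c / sum(ents.values()) for e, c in ents.items()}
            for m, ents in counts.items()}

# Invented counts: "paris" links to Paris twice and to Paris_Hilton once
pairs = [("Paris", "Paris"), ("paris", "Paris"), ("Paris", "Paris_Hilton")]
priors = build_link_priors(pairs)
# priors["paris"] -> {"Paris": 2/3, "Paris_Hilton": 1/3}
```

Because the priors are derived from a single snapshot, rebuilding them from snapshots of different dates is what lets one observe the temporal drift the thesis investigates.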
A methodology for the resolution of cashtag collisions on Twitter – A natural language processing & data fusion approach
Investors utilise social media such as Twitter as a means of sharing news surrounding financial stocks
listed on international stock exchanges. Company ticker symbols are used to uniquely identify companies
listed on stock exchanges and can be embedded within tweets to create clickable hyperlinks referred to
as cashtags, allowing investors to associate their tweets with specific companies. The main limitation is
that identical ticker symbols are present on exchanges all over the world, and when searching for such
cashtags on Twitter, a stream of tweets is returned that matches any company to which the cashtag
refers - we refer to this as a cashtag collision. The presence of colliding cashtags could sow confusion
for investors seeking news regarding a specific company. A resolution to this issue would benefit investors
who rely on the speediness of tweets for financial information, saving them precious time. We propose
a methodology to resolve this problem which combines Natural Language Processing and Data Fusion
to construct company-specific corpora to aid in the detection and resolution of colliding cashtags, so
that tweets can be classified as being related to a specific stock exchange or not. Supervised machine
learning classifiers are trained twice on each tweet – once on a count vectorisation of the tweet text,
and again with the assistance of features contained in the company-specific corpora. We validate the
cashtag collision methodology by carrying out an experiment involving companies listed on the London
Stock Exchange. Results show that several machine learning classifiers benefit from the use of the custom
corpora, yielding higher classification accuracy in the prediction and resolution of colliding cashtags.
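The two-pass training setup described above (a count vectorisation of the tweet text alone, then the same vectors augmented with features from a company-specific corpus) can be sketched as follows; the tweets, vocabulary handling and corpus terms are invented stand-ins, not the paper's actual data or feature set:

```python
from collections import Counter

def count_vectorise(tweets, vocab):
    """Simple count vectorisation over a fixed vocabulary."""
    return [[Counter(t.lower().split())[w] for w in vocab] for t in tweets]

tweets = ["$vod earnings beat lse forecasts",
          "$vod rallies on nasdaq today"]
vocab = sorted({w for t in tweets for w in t.lower().split()})

# Pass 1: features from the count vectorisation of the tweet text alone
X_text = count_vectorise(tweets, vocab)

# Pass 2: append a feature derived from a (hypothetical) company-specific
# corpus - here, how many tweet tokens also occur in terms mined for the
# LSE listing of the company
corpus_terms = {"lse", "london", "ftse"}
X_aug = [row + [sum(w in corpus_terms for w in t.lower().split())]
         for row, t in zip(X_text, tweets)]
```

Either feature matrix can then be fed to any supervised classifier; comparing the two runs is what reveals the benefit of the custom corpora reported in the results.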
The Pisa Audio-visual Corpus Project: A multimodal approach to ESP research and teaching
This paper presents an ongoing project sponsored by the University of Pisa Language Centre to compile an audiovisual corpus of specialized types of discourse of particular relevance to ESP learners in higher education. The first phase of the project focuses on collecting digitally available video clips that encode specialized language in a range of genres along an ‘authentic’ to ‘fictional’ continuum. The video clips will be analyzed from a multimodal perspective to determine how various semiotic resources work together to construct meaning. They will then be utilized in the ESP classroom to increase learners’ awareness of the key contribution of different modes in specialized communication. We present some exploratory multimodal analyses performed on video clips that encode instances of political discourse across two different genres on the extreme poles of the continuum: a fictional political drama film and an authentic political science lecture.
Causality Management and Analysis in Requirement Manuscript for Software Designs
For software design tasks involving natural language, the results of a causal investigation
provide valuable and robust semantic information, especially for identifying key
variables during product (software) design and product optimization. As the interest
in analytical data science shifts from correlations to a better understanding of causality,
there is an equal task focused on the accuracy of extracting causality from textual
artifacts to aid requirement engineering (RE) based decisions. This thesis focuses on
identifying, extracting, and classifying causal phrases using word and sentence labeling
based on the Bi-directional Encoder Representations from Transformers (BERT) deep
learning language model and five machine learning models. The aim is to understand
the form and degree of causality based on their impact and prevalence in RE practice.
Methodologically, our analysis is centered around RE practice, and we considered 12,438
sentences extracted from 50 requirement engineering manuscripts (REM) for training
our machine models. Our research reports that causal expressions constitute about 32%
of sentences from REM. We applied four evaluation metrics, namely recall, accuracy,
precision, and F1, to assess our machine models’ performance and accuracy to ensure
the results’ conformity with our study goal. Further, the highest model
accuracy was 85%, achieved by Naive Bayes. Finally, we noted that our causal
analytic framework is relevant to practitioners for different functionalities,
such as generating test cases for requirement engineers and software
developers, and auditing product performance for management stakeholders.
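The four evaluation metrics named above (recall, accuracy, precision and F1) reduce to simple counts over true/false positives and negatives for the binary causal/non-causal labelling task. A minimal sketch, with invented gold and predicted labels rather than the study's actual model outputs:

```python
def binary_metrics(gold, pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = causal)."""
    tp = sum(g == p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}

# Illustrative gold vs. predicted causal labels for six REM sentences
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 1, 0]
m = binary_metrics(gold, pred)
# m -> accuracy 4/6, precision 2/3, recall 2/3, f1 2/3
```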
Large Language Models to Identify Social Determinants of Health in Electronic Health Records
Social determinants of health (SDoH) have an important impact on patient
outcomes but are incompletely collected from the electronic health records
(EHR). This study researched the ability of large language models to extract
SDoH from free text in EHRs, where they are most commonly documented, and
explored the role of synthetic clinical text for improving the extraction of
these scarcely documented, yet extremely valuable, clinical data. 800 patient
notes were annotated for SDoH categories, and several transformer-based models
were evaluated. The study also experimented with synthetic data generation and
assessed for algorithmic bias. Our best-performing models were fine-tuned
Flan-T5 XL (macro-F1 0.71) for any SDoH, and Flan-T5 XXL (macro-F1 0.70). The
benefit of augmenting fine-tuning with synthetic data varied across model
architecture and size, with smaller Flan-T5 models (base and large) showing the
greatest improvements in performance (delta F1 +0.12 to +0.23). Model
performance was similar on the in-hospital system dataset but worse on the
MIMIC-III dataset. Our best-performing fine-tuned models outperformed zero- and
few-shot performance of ChatGPT-family models for both tasks. These fine-tuned
models were less likely than ChatGPT to change their prediction when
race/ethnicity and gender descriptors were added to the text, suggesting less
algorithmic bias (p<0.05). At the patient-level, our models identified 93.8% of
patients with adverse SDoH, while ICD-10 codes captured 2.0%. Our method
effectively extracted SDoH information from clinic notes, performing better
than GPT-family models in zero- and few-shot settings. These models could
enhance real-world evidence on SDoH and aid in identifying patients needing
social support.
Comment: 38 pages, 5 figures, 5 tables in main, submitted for review
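The macro-F1 scores reported above average the per-class F1 over all SDoH categories with equal weight, so rare categories count as much as common ones. A minimal sketch of that computation, using invented category labels rather than the study's data:

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores over all observed classes."""
    classes = sorted(set(gold) | set(pred))
    f1s = []
    for c in classes:
        tp = sum(g == p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Invented gold vs. predicted SDoH category labels per sentence
gold = ["housing", "employment", "none", "housing", "none", "employment"]
pred = ["housing", "none", "none", "housing", "none", "employment"]
score = macro_f1(gold, pred)
# per-class F1: employment 2/3, housing 1, none 0.8 -> macro-F1 ~0.822
```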